2026-01-12

Title: Enhancing Foundation Models in Transaction Understanding with LLM-based Sentence Embeddings

Authors: Xiran Fan, Zhimeng Jiang, Chin-Chia Michael Yeh, Yuzhong Chen, Yingtong Dou, Menghai Pan, Yan Zheng
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2601.05271
Pdf URL: https://arxiv.org/pdf/2601.05271
Copy Paste: [[2601.05271]] Enhancing Foundation Models in Transaction Understanding with LLM-based Sentence Embeddings(https://arxiv.org/abs/2601.05271)
Keywords: language model, llm
Abstract: The ubiquity of payment networks generates vast transactional data encoding rich consumer and merchant behavioral patterns. Recent foundation models for transaction analysis process tabular data sequentially but rely on index-based representations for categorical merchant fields, causing substantial semantic information loss by converting rich textual data into discrete tokens. While Large Language Models (LLMs) can address this limitation through superior semantic understanding, their computational overhead challenges real-time financial deployment. We introduce a hybrid framework that uses LLM-generated embeddings as semantic initializations for lightweight transaction models, balancing interpretability with operational efficiency. Our approach employs multi-source data fusion to enrich merchant categorical fields and a one-word constraint principle for consistent embedding generation across LLM architectures. We systematically address data quality through noise filtering and context-aware enrichment. Experiments on large-scale transaction datasets demonstrate significant performance improvements across multiple transaction understanding tasks.
摘要：无处不在的支付网络产生了大量的交易数据，编码了丰富的消费者和商家行为模式。最近的交易分析基础模型按顺序处理表格数据，但依赖于分类商家字段的基于索引的表示，通过将丰富的文本数据转换为离散标记而导致大量语义信息丢失。虽然大型语言模型 (LLM) 可以通过卓越的语义理解来解决这一限制，但它们的计算开销对实时财务部署提出了挑战。我们引入了一种混合框架，该框架使用 LLM 生成的嵌入作为轻量级事务模型的语义初始化，平衡可解释性与操作效率。我们的方法采用多源数据融合来丰富商家分类字段，并采用单字约束原则来跨 LLM 架构实现一致的嵌入生成。我们通过噪声过滤和上下文感知丰富来系统地解决数据质量问题。大规模交易数据集上的实验证明了多个交易理解任务的性能显着提高。

Title: Lost in Execution: On the Multilingual Robustness of Tool Calling in Large Language Models

Authors: Zheng Luo, T Pranav Kutralingam, Ogochukwu N Okoani, Wanpeng Xu, Hua Wei, Xiyang Hu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.05366
Pdf URL: https://arxiv.org/pdf/2601.05366
Copy Paste: [[2601.05366]] Lost in Execution: On the Multilingual Robustness of Tool Calling in Large Language Models(https://arxiv.org/abs/2601.05366)
Keywords: language model, llm, agent
Abstract: Large Language Models (LLMs) are increasingly deployed as agents that invoke external tools through structured function calls. While recent work reports strong tool-calling performance under standard English-centric evaluations, the robustness of tool calling under multilingual user interactions remains underexplored. In this work, we introduce MLCL, a diagnostic benchmark, and conduct a systematic evaluation of multilingual tool calling across Chinese, Hindi, and the low-resource language Igbo. Through fine-grained error analysis, we show that many failures occur despite correct intent understanding and tool selection. We identify parameter value language mismatch as a dominant failure mode, where models generate semantically appropriate parameter values in the user's language, violating language-invariant execution conventions. We further evaluate several inference-time system strategies and find that while these strategies substantially reduce language-induced execution errors, none of them can fully recover English-level performance.
摘要：大型语言模型 (LLM) 越来越多地部署为通过结构化函数调用来调用外部工具的代理。虽然最近的工作报告了在标准的以英语为中心的评估下强大的工具调用性能，但多语言用户交互下工具调用的稳健性仍未得到充分探索。在这项工作中，我们引入了诊断基准 MLCL，并对中文、印地语和低资源语言伊博语的多语言工具调用进行了系统评估。通过细粒度的错误分析，我们表明，尽管意图理解和工具选择正确，但仍然会发生许多故障。我们将参数值语言不匹配视为主要故障模式，其中模型以用户语言生成语义上合适的参数值，违反了语言不变的执行约定。我们进一步评估了几种推理时间系统策略，发现虽然这些策略大大减少了语言引起的执行错误，但它们都不能完全恢复英语水平的性能。
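Illustration (not from the paper): the "parameter value language mismatch" failure mode described above can be approximated with a very rough script check on a tool call's arguments. The function name `find_language_mismatches` and the example tool call are hypothetical, and the heuristic (flag any string value that contains non-ASCII letters) is an assumption made for this sketch. It would miss values written in a Latin-script language without diacritics, so it should be read as a diagnostic aid rather than the benchmark's actual metric.

```python
def find_language_mismatches(arguments):
    """Return names of string arguments that contain non-ASCII letters.

    Heuristic only: a value such as "北京" where the tool expects "Beijing"
    is flagged, but a Latin-script value without diacritics is not.
    """
    flagged = []
    for name, value in arguments.items():
        if isinstance(value, str) and any(
            ord(ch) > 127 and ch.isalpha() for ch in value
        ):
            flagged.append(name)
    return flagged


# Hypothetical tool call produced for a Chinese-language user request.
call_arguments = {"city": "北京", "unit": "celsius", "days": 3}
print(find_language_mismatches(call_arguments))
```

A check of this kind could run before execution to decide whether an argument needs normalization into the language the tool expects.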

Title: Same Claim, Different Judgment: Benchmarking Scenario-Induced Bias in Multilingual Financial Misinformation Detection

Authors: Zhiwei Liu, Yupen Cao, Yuechen Jiang, Mohsinul Kabir, Polydoros Giannouris, Chen Xu, Ziyang Xu, Tianlei Zhu, Tariquzzaman Faisal, Triantafillos Papadopoulos, Yan Wang, Lingfei Qian, Xueqing Peng, Zhuohan Xie, Ye Yuan, Saeed Almheiri, Abdulrazzaq Alnajjar, Mingbin Chen, Harry Stuart, Paul Thompson, Prayag Tiwari, Alejandro Lopez-Lira, Xue Liu, Jimin Huang, Sophia Ananiadou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.05403
Pdf URL: https://arxiv.org/pdf/2601.05403
Copy Paste: [[2601.05403]] Same Claim, Different Judgment: Benchmarking Scenario-Induced Bias in Multilingual Financial Misinformation Detection(https://arxiv.org/abs/2601.05403)
Keywords: language model, llm
Abstract: Large language models (LLMs) have been widely applied across various domains of finance. Since their training data are largely derived from human-authored corpora, LLMs may inherit a range of human biases. Behavioral biases can lead to instability and uncertainty in decision-making, particularly when processing financial information. However, existing research on LLM bias has mainly focused on direct questioning or simplified, general-purpose settings, with limited consideration of the complex real-world financial environments and high-risk, context-sensitive, multilingual financial misinformation detection (MFMD) tasks. In this work, we propose MFMD-Scen, a comprehensive benchmark for evaluating behavioral biases of LLMs in MFMD across diverse economic scenarios. In collaboration with financial experts, we construct three types of complex financial scenarios: (i) role- and personality-based, (ii) role- and region-based, and (iii) role-based scenarios incorporating ethnicity and religious beliefs. We further develop a multilingual financial misinformation dataset covering English, Chinese, Greek, and Bengali. By integrating these scenarios with misinformation claims, MFMD-Scen enables a systematic evaluation of 22 mainstream LLMs. Our findings reveal that pronounced behavioral biases persist across both commercial and open-source models. The project will be available at the URL linked from the arXiv abstract page.
摘要：大型语言模型（LLM）已广泛应用于金融的各个领域。由于他们的训练数据主要来自人类编写的语料库，法学硕士可能继承了一系列人类偏见。行为偏差可能导致决策的不稳定和不确定性，特别是在处理财务信息时。然而，现有的关于法学硕士偏差的研究主要集中在直接提问或简化的通用设置上，对复杂的现实世界金融环境和高风险、上下文敏感、多语言的金融错误信息检测任务的考虑有限（\mfmd）。在这项工作中，我们提出了 \mfmdscen，一个用于评估法学硕士在不同经济情景下的行为偏差的综合基准。我们与金融专家合作，构建了三种类型的复杂金融场景：（i）基于角色和个性的场景，（ii）基于角色和地区的场景，以及（iii）包含种族和宗教信仰的基于角色的场景。我们进一步开发了一个涵盖英语、中文、希腊语和孟加拉语的多语言金融错误信息数据集。通过将这些场景与错误信息主张相结合，\mfmdscen 可以对 22 个主流法学硕士进行系统评估。我们的研究结果表明，商业和开源模型中都存在明显的行为偏差。该项目将通过此 https URL 提供。

Title: Glitter: Visualizing Lexical Surprisal for Readability in Administrative Texts

Authors: Jan Černý, Ivana Kvapilíková, Silvie Cinková
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.05411
Pdf URL: https://arxiv.org/pdf/2601.05411
Copy Paste: [[2601.05411]] Glitter: Visualizing Lexical Surprisal for Readability in Administrative Texts(https://arxiv.org/abs/2601.05411)
Keywords: language model
Abstract: This work investigates how measuring the information entropy of a text can be used to estimate its readability. We propose a visualization framework that approximates the information entropy of text using multiple language models and visualizes the result. The end goal is to use this method to estimate and improve the readability and clarity of administrative or bureaucratic texts. Our toolset is available as libre software at the URL linked from the arXiv abstract page.
摘要：这项工作研究了如何使用测量文本的信息熵来估计其可读性。我们提出了一种可视化框架，可用于使用多种语言模型来近似文本的信息熵并可视化结果。最终目标是使用这种方法来估计和提高行政或官僚文本的可读性和清晰度。我们的工具集可在此 https URL 上作为自由软件使用。
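Illustration (not from the paper): lexical surprisal, the quantity named in the title, is conventionally defined as the negative log probability a language model assigns to a token given its context. The sketch below assumes the per-token probabilities have already been obtained from some model. The probability values and function names are hypothetical, and the paper's own toolset may compute and aggregate surprisal differently.

```python
import math


def surprisal_bits(prob):
    """Surprisal of one token in bits: -log2 of its model probability."""
    if not 0.0 < prob <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(prob)


def mean_surprisal(token_probs):
    """Average surprisal over a token sequence, a rough proxy for reading difficulty."""
    return sum(surprisal_bits(p) for p in token_probs) / len(token_probs)


# Hypothetical probabilities a language model assigned to four tokens.
probs = [0.5, 0.25, 0.5, 0.125]
print([surprisal_bits(p) for p in probs])
print(mean_surprisal(probs))
```

Tokens with high surprisal are the ones a visualization would highlight as potentially hard to read.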

Title: Large Language Models Are Bad Dice Players: LLMs Struggle to Generate Random Numbers from Statistical Distributions

Authors: Minda Zhao, Yilun Du, Mengyu Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.05414
Pdf URL: https://arxiv.org/pdf/2601.05414
Copy Paste: [[2601.05414]] Large Language Models Are Bad Dice Players: LLMs Struggle to Generate Random Numbers from Statistical Distributions(https://arxiv.org/abs/2601.05414)
Keywords: language model, llm, prompt, chat
Abstract: As large language models (LLMs) transition from chat interfaces to integral components of stochastic pipelines across domains like educational assessment and synthetic data construction, the ability to faithfully sample from specified probability distributions has become a functional requirement rather than a theoretical curiosity. We present the first large-scale, statistically powered audit of native probabilistic sampling in frontier LLMs, benchmarking 11 models across 15 distributions. To disentangle failure modes, we employ a dual-protocol design: Batch Generation, where a model produces N=1000 samples within one response, and Independent Requests, comprising N=1000 stateless calls. We observe a sharp protocol asymmetry: batch generation achieves only modest statistical validity, with a 13% median pass rate, while independent requests collapse almost entirely, with 10 of 11 models passing none of the distributions. Beyond this asymmetry, we reveal that sampling fidelity degrades monotonically with distributional complexity and worsens as the requested sampling horizon N increases. Finally, we demonstrate the propagation of these failures into downstream tasks: models fail to enforce uniform answer-position constraints in MCQ generation and systematically violate demographic targets in attribute-constrained text-to-image prompt synthesis. These findings indicate that current LLMs lack a functional internal sampler, necessitating the use of external tools for applications requiring statistical guarantees.
摘要：随着大型语言模型 (LLM) 从聊天界面过渡到教育评估和合成数据构建等领域的随机管道的组成部分，从指定概率分布中忠实采样的能力已成为一种功能需求，而不是一种理论上的好奇心。我们首次对前沿 LLM 中的本地概率抽样进行大规模、统计支持的审计，对 15 个分布的 11 个模型进行了基准测试。为了理清故障模式，我们采用了双协议设计：批量生成（模型在一次响应中生成 N=1000 个样本）和独立请求（包括 $N=1000$ 无状态调用）。我们观察到明显的协议不对称性：批量生成仅实现了适度的统计有效性，中位通过率为 13%，而独立请求几乎完全崩溃，11 个模型中有 10 个没有通过任何分布。除了这种不对称性之外，我们还发现采样保真度随着分布复杂性的增加而单调下降，并随着要求的采样范围 N 的增加而恶化。最后，我们演示了这些失败在下游任务中的传播：模型无法在 MCQ 生成中强制执行统一的答案位置约束，并且在属性约束的文本到图像提示合成中系统地违反人口统计目标。这些发现表明，当前的法学硕士缺乏功能性的内部采样器，因此需要使用外部工具来满足需要统计保证的应用。
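Illustration (not from the paper): the abstract reports pass rates but does not say which statistical test decides a pass. As one plausible choice, the sketch below applies a one-sample Kolmogorov-Smirnov check against Uniform(0, 1), using the standard large-sample critical value 1.358 / sqrt(n) for a 5% significance level. The function names and the two sample sets are hypothetical stand-ins for numbers parsed from model output.

```python
import math


def ks_statistic_uniform(samples):
    """Largest gap between the empirical CDF and the Uniform(0, 1) CDF."""
    xs = sorted(samples)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        cdf = min(max(x, 0.0), 1.0)  # Uniform(0, 1) CDF, clamped to [0, 1]
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    return d


def passes_uniform_check(samples, coeff=1.358):
    """Pass if the KS statistic is below the asymptotic 5% critical value."""
    return ks_statistic_uniform(samples) <= coeff / math.sqrt(len(samples))


n = 1000
well_spread = [(i + 0.5) / n for i in range(n)]  # idealized, evenly spread samples
collapsed = [0.7] * n                            # a model that keeps answering 0.7

print(passes_uniform_check(well_spread))
print(passes_uniform_check(collapsed))
```

The second sample set mimics a collapse onto a single favored value, the kind of output a distributional test is designed to reject.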

Title: Tracing Moral Foundations in Large Language Models

Authors: Chenxiao Yu, Bowen Yi, Farzan Karimi-Malekabadi, Suhaib Abdurahman, Jinyi Ye, Shrikanth Narayanan, Yue Zhao, Morteza Dehghani
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.05437
Pdf URL: https://arxiv.org/pdf/2601.05437
Copy Paste: [[2601.05437]] Tracing Moral Foundations in Large Language Models(https://arxiv.org/abs/2601.05437)
Keywords: language model, llm
Abstract: Large language models (LLMs) often produce human-like moral judgments, but it is unclear whether this reflects an internal conceptual structure or superficial "moral mimicry." Using Moral Foundations Theory (MFT) as an analytic framework, we study how moral foundations are encoded, organized, and expressed within two instruction-tuned LLMs: Llama-3.1-8B-Instruct and Qwen2.5-7B-Instruct. We employ a multi-level approach combining (i) layer-wise analysis of MFT concept representations and their alignment with human moral perceptions, (ii) pretrained sparse autoencoders (SAEs) over the residual stream to identify sparse features that support moral concepts, and (iii) causal steering interventions using dense MFT vectors and sparse SAE features. We find that both models represent and distinguish moral foundations in a structured, layer-dependent way that aligns with human judgments. At a finer scale, SAE features show clear semantic links to specific foundations, suggesting partially disentangled mechanisms within shared representations. Finally, steering along either dense vectors or sparse features produces predictable shifts in foundation-relevant behavior, demonstrating a causal connection between internal representations and moral outputs. Together, our results provide mechanistic evidence that moral concepts in LLMs are distributed, layered, and partly disentangled, suggesting that pluralistic moral structure can emerge as a latent pattern from the statistical regularities of language alone.
摘要：大型语言模型 (LLM) 通常会产生类似人类的道德判断，但尚不清楚这是否反映了内部概念结构或表面的“道德模仿”。使用道德基础理论 (MFT) 作为分析框架，我们研究道德基础如何在两个指令调整的 LLM 中编码、组织和表达：Llama-3.1-8B-Instruct 和 Qwen2.5-7B-Instruct。我们采用多层次方法，结合（i）对 MFT 概念表示及其与人类道德感知的一致性进行分层分析，（ii）在残差流上预训练稀疏自动编码器（SAE）以识别支持道德概念的稀疏特征，以及（iii）使用密集 MFT 向量和稀疏 SAE 特征进行因果引导干预。我们发现这两种模型都以一种结构化的、依赖于层次的方式来表示和区分道德基础，这种方式与人类的判断相一致。在更精细的尺度上，SAE 特征显示出与特定基础的清晰语义联系，表明共享表示中部分解开的机制。最后，沿着密集向量或稀疏特征进行引导会在基础相关行为中产生可预测的变化，从而证明内部表征和道德输出之间的因果关系。总之，我们的结果提供了机械证据，表明法学硕士中的道德概念是分布式的、分层的和部分解开的，这表明多元道德结构可以作为一种潜在模式从语言的统计规律中显现出来。

Title: Do LLMs Need Inherent Reasoning Before Reinforcement Learning? A Study in Korean Self-Correction

Authors: Hongjin Kim, Jaewook Lee, Kiyoung Lee, Jong-hun Shin, Soojong Lim, Oh-Woog Kwon
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.05459
Pdf URL: https://arxiv.org/pdf/2601.05459
Copy Paste: [[2601.05459]] Do LLMs Need Inherent Reasoning Before Reinforcement Learning? A Study in Korean Self-Correction(https://arxiv.org/abs/2601.05459)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) demonstrate strong reasoning and self-correction abilities in high-resource languages like English, but their performance remains limited in low-resource languages such as Korean. In this study, we investigate whether reinforcement learning (RL) can enhance Korean reasoning abilities to a degree comparable to English. Our findings reveal that RL alone yields limited improvements when applied to models lacking inherent Korean reasoning capabilities. To address this, we explore several fine-tuning strategies and show that aligning the model's internal reasoning processes with Korean inputs, particularly by tuning Korean-specific neurons in early layers, is key to unlocking RL's effectiveness. We introduce a self-correction code-switching dataset to facilitate this alignment and observe significant performance gains in both mathematical reasoning and self-correction tasks. Ultimately, we conclude that the crucial factor in multilingual reasoning enhancement is not injecting new linguistic knowledge, but effectively eliciting and aligning existing reasoning capabilities. Our study provides a new perspective on how internal translation and neuron-level tuning contribute to multilingual reasoning alignment in LLMs.
摘要：大型语言模型（LLM）在英语等高资源语言中表现出强大的推理和自我纠正能力，但在韩语等低资源语言中其表现仍然有限。在这项研究中，我们调查强化学习（RL）是否可以将韩语推理能力提高到与英语相当的程度。我们的研究结果表明，当应用于缺乏韩国固有推理能力的模型时，仅强化学习所产生的改进有限。为了解决这个问题，我们探索了几种微调策略，并表明将模型的内部推理过程与韩语输入保持一致（特别是通过调整早期层中特定于韩语的神经元）是解锁 RL 有效性的关键。我们引入了一个自校正代码切换数据集来促进这种对齐，并观察数学推理和自校正任务中的显着性能提升。最终，我们得出的结论是，多语言推理增强的关键因素不是注入新的语言知识，而是有效引出和调整现有的推理能力。我们的研究为内部翻译和神经元级调整如何促进法学硕士的多语言推理对齐提供了新的视角。

Title: Towards Valid Student Simulation with Large Language Models

Authors: Zhihao Yuan, Yunze Xiao, Ming Li, Weihao Xuan, Richard Tong, Mona Diab, Tom Mitchell
Subjects: cs.CL, cs.HC
Abstract URL: https://arxiv.org/abs/2601.05473
Pdf URL: https://arxiv.org/pdf/2601.05473
Copy Paste: [[2601.05473]] Towards Valid Student Simulation with Large Language Models(https://arxiv.org/abs/2601.05473)
Keywords: language model, llm
Abstract: This paper presents a conceptual and methodological framework for large language model (LLM) based student simulation in educational settings. The authors identify a core failure mode, termed the "competence paradox," in which broadly capable LLMs are asked to emulate partially knowledgeable learners, leading to unrealistic error patterns and learning dynamics. To address this, the paper reframes student simulation as a constrained generation problem governed by an explicit Epistemic State Specification (ESS), which defines what a simulated learner can access, how errors are structured, and how learner state evolves over time. The work further introduces a Goal-by-Environment framework to situate simulated student systems according to behavioral objectives and deployment contexts. Rather than proposing a new system or benchmark, the paper synthesizes prior literature, formalizes key design dimensions, and articulates open challenges related to validity, evaluation, and ethical risks. Overall, the paper argues for epistemic fidelity over surface realism as a prerequisite for using LLM-based simulated students as reliable scientific and pedagogical instruments.
摘要：本文提出了教育环境中基于大语言模型（LLM）的学生模拟的概念和方法框架。作者确定了一种核心失败模式，称为“能力悖论”，其中要求能力广泛的法学硕士模仿部分知识渊博的学习者，从而导致不切实际的错误模式和学习动态。为了解决这个问题，本文将学生模拟重新定义为由明确的认知状态规范（ESS）控制的约束生成问题，该规范定义了模拟学习者可以访问的内容、错误的结构以及学习者状态如何随时间演变。这项工作进一步引入了一个按环境目标框架，根据行为目标和部署环境来定位模拟学生系统。该论文不是提出一个新的系统或基准，而是综合了先前的文献，形式化了关键的设计维度，并阐明了与有效性、评估和道德风险相关的开放挑战。总体而言，本文认为，认识保真度高于表面现实主义是使用基于法学硕士的模拟学生作为可靠的科学和教学工具的先决条件。

Title: The Facade of Truth: Uncovering and Mitigating LLM Susceptibility to Deceptive Evidence

Authors: Herun Wan, Jiaying Wu, Minnan Luo, Fanxiao Li, Zhi Zeng, Min-Yen Kan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.05478
Pdf URL: https://arxiv.org/pdf/2601.05478
Copy Paste: [[2601.05478]] The Facade of Truth: Uncovering and Mitigating LLM Susceptibility to Deceptive Evidence(https://arxiv.org/abs/2601.05478)
Keywords: llm
Abstract: To reliably assist human decision-making, LLMs must maintain factual internal beliefs against misleading injections. While current models resist explicit misinformation, we uncover a fundamental vulnerability to sophisticated, hard-to-falsify evidence. To systematically probe this weakness, we introduce MisBelief, a framework that generates misleading evidence via collaborative, multi-round interactions among multi-role LLMs. This process mimics subtle, defeasible reasoning and progressive refinement to create logically persuasive yet factually deceptive claims. Using MisBelief, we generate 4,800 instances across three difficulty levels to evaluate 7 representative LLMs. Results indicate that while models are robust to direct misinformation, they are highly sensitive to this refined evidence: belief scores in falsehoods increase by an average of 93.0%, fundamentally compromising downstream recommendations. To address this, we propose Deceptive Intent Shielding (DIS), a governance mechanism that provides an early warning signal by inferring the deceptive intent behind evidence. Empirical results demonstrate that DIS consistently mitigates belief shifts and promotes more cautious evidence evaluation.
摘要：为了可靠地协助人类决策，法学硕士必须保持事实性的内部信念，防止误导性注射。虽然当前的模型可以抵制明显的错误信息，但我们发现了复杂的、难以伪造的证据的根本漏洞。为了系统地探讨这一弱点，我们引入了 MisBelief，这是一个通过多角色法学硕士之间的协作、多轮交互生成误导性证据的框架。这个过程模仿微妙的、可推翻的推理和渐进的改进，以创造逻辑上有说服力但事实上具有欺骗性的主张。使用 MisBelief，我们生成了三个难度级别的 4,800 个实例来评估 7 个具有代表性的法学硕士。结果表明，虽然模型对直接错误信息具有鲁棒性，但它们对这种精炼证据高度敏感：虚假信息的置信度分数平均增加 93.0%，从根本上损害了下游建议。为了解决这个问题，我们提出了欺骗意图屏蔽（DIS），这是一种通过推断证据背后的欺骗意图来提供早期预警信号的治理机制。实证结果表明，DIS 始终如一地减轻信念转变并促进更谨慎的证据评估。

Title: MemBuilder: Reinforcing LLMs for Long-Term Memory Construction via Attributed Dense Rewards

Authors: Zhiyu Shen, Ziming Wu, Fuming Lai, Shaobing Lian, Yanghui Rao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.05488
Pdf URL: https://arxiv.org/pdf/2601.05488
Copy Paste: [[2601.05488]] MemBuilder: Reinforcing LLMs for Long-Term Memory Construction via Attributed Dense Rewards(https://arxiv.org/abs/2601.05488)
Keywords: llm, prompt
Abstract: Maintaining consistency in long-term dialogues remains a fundamental challenge for LLMs, as standard retrieval mechanisms often fail to capture the temporal evolution of historical states. While memory-augmented frameworks offer a structured alternative, current systems rely on static prompting of closed-source models or suffer from ineffective training paradigms with sparse rewards. We introduce MemBuilder, a reinforcement learning framework that trains models to orchestrate multi-dimensional memory construction with attributed dense rewards. MemBuilder addresses two key challenges: (1) Sparse Trajectory-Level Rewards: we employ synthetic session-level question generation to provide dense intermediate rewards across extended trajectories; and (2) Multi-Dimensional Memory Attribution: we introduce contribution-aware gradient weighting that scales policy updates based on each component's downstream impact. Experimental results show that MemBuilder enables a 4B-parameter model to outperform state-of-the-art closed-source baselines, exhibiting strong generalization across long-term dialogue benchmarks.
摘要：保持长期对话的一致性仍然是法学硕士面临的一个基本挑战，因为标准检索机制往往无法捕捉历史状态的时间演变。虽然记忆增强框架提供了结构化的替代方案，但当前的系统依赖于闭源模型的静态提示，或者受到奖励稀疏的无效训练范例的影响。我们引入了 MemBuilder，这是一种强化学习框架，可训练模型以编排具有归因密集奖励的多维记忆构建。 MemBuilder 解决了两个关键挑战：（1）稀疏轨迹级奖励：我们采用合成会话级问题生成来提供跨扩展轨迹的密集中间奖励； (2) 多维记忆归因：我们引入了贡献感知梯度权重，根据每个组件的下游影响来扩展策略更新。实验结果表明，MemBuilder 使 4B 参数模型的性能优于最先进的闭源基线，在长期对话基准中表现出强大的泛化能力。

Title: FlashMem: Distilling Intrinsic Latent Memory via Computation Reuse

Authors: Yubo Hou, Zhisheng Chen, Tao Wan, Zengchang Qin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.05505
Pdf URL: https://arxiv.org/pdf/2601.05505
Copy Paste: [[2601.05505]] FlashMem: Distilling Intrinsic Latent Memory via Computation Reuse(https://arxiv.org/abs/2601.05505)
Keywords: language model, agent
Abstract: The stateless architecture of Large Language Models inherently lacks the mechanism to preserve dynamic context, compelling agents to redundantly reprocess history to maintain long-horizon autonomy. While latent memory offers a solution, current approaches are hindered by architectural segregation, relying on auxiliary encoders that decouple memory from the reasoning backbone. We propose FlashMem, a framework that distills intrinsic memory directly from transient reasoning states via computation reuse. Leveraging the property that internal representations uniquely encode input trajectories, FlashMem identifies the last hidden state as a sufficient statistic for the interaction history. This enables a Shared-KV Consolidator to synthesize memory by attending directly to the backbone's frozen cache, eliminating redundant re-parameterization. Furthermore, a parameter-free Cognitive Monitor leverages attention entropy to adaptively trigger consolidation only when high epistemic uncertainty is detected. Experiments demonstrate that FlashMem matches the performance of heavy baselines while reducing inference latency by 5 times, effectively bridging the gap between efficiency and persistent cognition.
摘要：大型语言模型的无状态架构本质上缺乏保留动态上下文的机制，迫使代理重复地重新处理历史以维持长期自治。虽然潜在内存提供了一种解决方案，但当前的方法受到架构隔离的阻碍，依赖于将内存与推理主干分离的辅助编码器。我们提出了 FlashMem，这是一个通过计算重用直接从瞬态推理状态中提取内在内存的框架。利用内部表示对输入轨迹进行唯一编码的特性，FlashMem 将最后一个隐藏状态识别为交互历史的充分统计数据。这使得 Shared-KV Consolidator 能够通过直接关注主干的冻结缓存来合成内存，从而消除冗余的重新参数化。此外，无参数认知监视器仅在检测到高认知不确定性时才利用注意力熵自适应地触发巩固。实验表明，FlashMem 与重基线的性能相当，同时将推理延迟降低了 5 倍，有效缩小了效率和持久认知之间的差距。
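Illustration (not from the paper): the abstract says a parameter-free Cognitive Monitor uses attention entropy to decide when to trigger consolidation. A minimal sketch of that idea computes the Shannon entropy of a normalized attention distribution and compares it with a cutoff. The function names, the threshold value, and the example weights are all hypothetical, and the paper's actual triggering rule is not given in the abstract.

```python
import math


def attention_entropy(weights):
    """Shannon entropy (in nats) of attention weights after normalization."""
    total = sum(weights)
    probs = [w / total for w in weights if w > 0]
    return -sum(p * math.log(p) for p in probs)


def should_consolidate(weights, threshold):
    """Trigger consolidation only when attention is diffuse (high entropy)."""
    return attention_entropy(weights) > threshold


focused = [0.97, 0.01, 0.01, 0.01]  # attention concentrated on one position
diffuse = [0.25, 0.25, 0.25, 0.25]  # attention spread evenly over four positions

threshold = 1.0  # hypothetical cutoff in nats, chosen only for this example
print(should_consolidate(focused, threshold))
print(should_consolidate(diffuse, threshold))
```

For four positions the entropy ranges from 0 (fully focused) to log(4), about 1.386 nats (uniform), so the example cutoff separates the two cases.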

Title: CHisAgent: A Multi-Agent Framework for Event Taxonomy Construction in Ancient Chinese Cultural Systems

Authors: Xuemei Tang, Chengxi Yan, Jinghang Gu, Chu-Ren Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.05520
Pdf URL: https://arxiv.org/pdf/2601.05520
Copy Paste: [[2601.05520]] CHisAgent: A Multi-Agent Framework for Event Taxonomy Construction in Ancient Chinese Cultural Systems(https://arxiv.org/abs/2601.05520)
Keywords: language model, llm, agent
Abstract: Despite strong performance on many tasks, large language models (LLMs) show limited ability in historical and cultural reasoning, particularly in non-English contexts such as Chinese history. Taxonomic structures offer an effective mechanism to organize historical knowledge and improve understanding. However, manual taxonomy construction is costly and difficult to scale. Therefore, we propose CHisAgent, a multi-agent LLM framework for historical taxonomy construction in ancient Chinese contexts. CHisAgent decomposes taxonomy construction into three role-specialized stages: a bottom-up Inducer that derives an initial hierarchy from raw historical corpora, a top-down Expander that introduces missing intermediate concepts using LLM world knowledge, and an evidence-guided Enricher that integrates external structured historical resources to ensure faithfulness. Using the Twenty-Four Histories, we construct a large-scale, domain-aware event taxonomy covering politics, military, diplomacy, and social life in ancient China. Extensive reference-free and reference-based evaluations demonstrate improved structural coherence and coverage, while further analysis shows that the resulting taxonomy supports cross-cultural alignment.
摘要：尽管大型语言模型（LLM）在许多任务上表现出色，但在历史和文化推理方面表现出有限的能力，特别是在中国历史等非英语背景下。分类结构提供了组织历史知识和增进理解的有效机制。然而，手动分类法构建成本高昂且难以扩展。因此，我们提出了\textbf{CHisAgent}，一个用于古代中国背景下历史分类学构建的多主体LLM框架。 CHisAgent 将分类法构建分解为三个角色专门阶段：一个自下而上的 \textit{Inducer}，从原始历史语料库中派生出初始层次结构；一个自上而下的 \textit{Expander}，使用 LLM 世界知识引入缺失的中间概念；以及一个证据引导的 \textit{Enricher}，它集成了外部结构化历史资源以确保准确性。利用\textit{二十四史}，我们构建了一个涵盖中国古代政治、军事、外交和社会生活的大规模、领域感知的事件分类。广泛的无参考和基于参考的评估表明结构一致性和覆盖范围得到了改善，而进一步的分析表明由此产生的分类法支持跨文化一致性。

Title: Closing the Modality Reasoning Gap for Speech Large Language Models

Authors: Chaoren Wang, Heng Lu, Xueyao Zhang, Shujie Liu, Yan Lu, Jinyu Li, Zhizheng Wu
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2601.05543
Pdf URL: https://arxiv.org/pdf/2601.05543
Copy Paste: [[2601.05543]] Closing the Modality Reasoning Gap for Speech Large Language Models(https://arxiv.org/abs/2601.05543)
Keywords: language model, llm
Abstract: Although speech large language models have achieved notable progress, a substantial modality reasoning gap remains: their reasoning performance on speech inputs is markedly weaker than on text. This gap could be associated with representational drift across Transformer layers and behavior deviations in long-chain reasoning. To address this issue, we introduce TARS, a reinforcement-learning framework that aligns text-conditioned and speech-conditioned trajectories through an asymmetric reward design. The framework employs two dense and complementary signals: representation alignment, which measures layer-wise hidden-state similarity between speech- and text-conditioned trajectories, and behavior alignment, which evaluates semantic consistency between generated outputs and reference text completions. Experiments on challenging reasoning benchmarks, including MMSU and OBQA, show that our approach significantly narrows the modality reasoning gap and achieves state-of-the-art performance among 7B-scale Speech LLMs.
摘要：尽管语音大语言模型取得了显着的进展，但仍然存在巨大的模态推理差距：它们在语音输入上的推理性能明显弱于文本。这种差距可能与 Transformer 层之间的表征漂移和长链推理中的行为偏差有关。为了解决这个问题，我们引入了 TARS，这是一种强化学习框架，通过不对称奖励设计来调整文本条件和语音条件轨迹。该框架采用两个密集且互补的信号：表示对齐（测量语音和文本条件轨迹之间的分层隐藏状态相似性）和行为对齐（评估生成的输出和参考文本完成之间的语义一致性）。在具有挑战性的推理基准（包括 MMSU 和 OBQA）上进行的实验表明，我们的方法显着缩小了模态推理差距，并在 7B 规模的语音法学硕士中实现了最先进的性能。
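Illustration (not from the paper): the representation-alignment signal is described as a layer-wise hidden-state similarity between speech- and text-conditioned trajectories. The abstract does not name the similarity measure or how layers are aggregated, so the sketch below assumes cosine similarity averaged uniformly over layers. The function names and the toy two-layer vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def representation_alignment(speech_layers, text_layers):
    """Mean per-layer cosine similarity between two hidden-state trajectories."""
    sims = [cosine_similarity(s, t) for s, t in zip(speech_layers, text_layers)]
    return sum(sims) / len(sims)


# Toy hidden states: one vector per layer for each input modality.
speech = [[1.0, 0.0], [0.0, 1.0]]
text = [[1.0, 0.0], [1.0, 0.0]]
print(representation_alignment(speech, text))
```

In this toy case the first layer agrees perfectly and the second is orthogonal, so the averaged score sits halfway between the two extremes.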

Title: Can Large Language Models Differentiate Harmful from Argumentative Essays? Steps Toward Ethical Essay Scoring

Authors: Hongjin Kim, Jeonghyun Kang, Harksoo Kim
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.05545
Pdf URL: https://arxiv.org/pdf/2601.05545
Copy Paste: [[2601.05545]] Can Large Language Models Differentiate Harmful from Argumentative Essays? Steps Toward Ethical Essay Scoring(https://arxiv.org/abs/2601.05545)
Keywords: language model, llm
Abstract: This study addresses critical gaps in Automated Essay Scoring (AES) systems and Large Language Models (LLMs) with regard to their ability to effectively identify and score harmful essays. Despite advancements in AES technology, current models often overlook ethically and morally problematic elements within essays, erroneously assigning high scores to essays that may propagate harmful opinions. In this study, we introduce the Harmful Essay Detection (HED) benchmark, which includes essays integrating sensitive topics such as racism and gender bias, to test the efficacy of various LLMs in recognizing and scoring harmful content. Our findings reveal that: (1) LLMs require further enhancement to accurately distinguish between harmful and argumentative essays, and (2) both current AES models and LLMs fail to consider the ethical dimensions of content during scoring. The study underscores the need for developing more robust AES systems that are sensitive to the ethical implications of the content they are scoring.
摘要：这项研究解决了自动论文评分 (AES) 系统和大型语言模型 (LLM) 在有效识别和评分有害论文的能力方面的关键差距。尽管 AES 技术取得了进步，但当前的模型常常忽视论文中存在伦理和道德问题的元素，错误地给可能传播有害观点的论文打高分。在这项研究中，我们引入了有害论文检测（HED）基准，其中包括整合种族主义和性别偏见等敏感话题的论文，以测试各种法学硕士在识别和评分有害内容方面的功效。我们的研究结果表明：(1) 法学硕士需要进一步增强才能准确区分有害论文和争论性论文，(2) 当前的 AES 模型和法学硕士在评分过程中都未能考虑内容的道德维度。该研究强调需要开发更强大的 AES 系统，该系统对其评分内容的道德影响敏感。

Title: ReasonAny: Incorporating Reasoning Capability to Any Model via Simple and Effective Model Merging

Authors: Junyao Yang, Chen Qian, Dongrui Liu, Wen Shen, Yong Liu, Jing Shao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.05560
Pdf URL: https://arxiv.org/pdf/2601.05560
Copy Paste: [[2601.05560]] ReasonAny: Incorporating Reasoning Capability to Any Model via Simple and Effective Model Merging(https://arxiv.org/abs/2601.05560)
Keywords: chain-of-thought
Abstract: Large Reasoning Models (LRMs) with long chain-of-thought reasoning have recently achieved remarkable success. Yet, equipping domain-specialized models with such reasoning capabilities, referred to as "Reasoning + X", remains a significant challenge. While model merging offers a promising training-free solution, existing methods often suffer from a destructive performance collapse: they tend to both weaken reasoning depth and compromise domain-specific utility. Interestingly, we identify a counter-intuitive phenomenon underlying this failure: reasoning ability predominantly resides in parameter regions with low gradient sensitivity, contrary to the common assumption that domain capabilities correspond to high-magnitude parameters. Motivated by this insight, we propose ReasonAny, a novel merging framework that resolves the reasoning-domain performance collapse through Contrastive Gradient Identification. Experiments across safety, biomedicine, and finance domains show that ReasonAny effectively synthesizes "Reasoning + X" capabilities, significantly outperforming state-of-the-art baselines while retaining robust reasoning performance.
摘要：具有长链思维推理的大型推理模型（LRM）最近取得了显着的成功。然而，为特定领域的模型配备这种推理能力（称为“推理+X”）仍然是一个重大挑战。虽然模型合并提供了一种有前途的免训练解决方案，但现有方法经常遭受破坏性的性能崩溃：现有方法往往会削弱推理深度并损害特定领域的效用。有趣的是，我们发现了这种失败背后的反直觉现象：推理能力主要存在于梯度敏感度较低的参数区域，这与域能力对应于高量级参数的常见假设相反。受这一见解的启发，我们提出了 ReasonAny，这是一种新颖的合并框架，通过对比梯度识别解决推理领域性能崩溃的问题。跨安全、生物医学和金融领域的实验表明，ReasonAny 有效地综合了“推理 + X”功能，显着优于最先进的基线，同时保留了强大的推理性能。

Title: Can large language models interpret unstructured chat data on dynamic group decision-making processes? Evidence on joint destination choice

Authors: Sung-Yoo Lim, Koki Sato, Kiyoshi Takami, Giancarlos Parady, Eui-Jin Kim
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.05582
Pdf URL: https://arxiv.org/pdf/2601.05582
Copy Paste: [[2601.05582]] Can large language models interpret unstructured chat data on dynamic group decision-making processes? Evidence on joint destination choice(https://arxiv.org/abs/2601.05582)
Keywords: language model, llm, prompt, chat
Abstract: Social activities result from complex joint activity-travel decisions between group members. While observing the decision-making process of these activities is difficult via traditional travel surveys, the advent of new types of data, such as unstructured chat data, can help shed some light on these complex processes. However, interpreting these decision-making processes requires inferring both explicit and implicit factors. This typically involves the labor-intensive task of manually annotating dialogues to capture context-dependent meanings shaped by the social and cultural norms. This study evaluates the potential of Large Language Models (LLMs) to automate and complement human annotation in interpreting decision-making processes from group chats, using data on joint eating-out activities in Japan as a case study. We designed a prompting framework inspired by the knowledge acquisition process, which sequentially extracts key decision-making factors, including the group-level restaurant choice set and outcome, individual preferences of each alternative, and the specific attributes driving those preferences. This structured process guides the LLM to interpret group chat data, converting unstructured dialogues into structured tabular data describing decision-making factors. To evaluate LLM-driven outputs, we conduct a quantitative analysis using a human-annotated ground truth dataset and a qualitative error analysis to examine model limitations. Results show that while the LLM reliably captures explicit decision-making factors, it struggles to identify nuanced implicit factors that human annotators readily identified. We pinpoint specific contexts when LLM-based extraction can be trusted versus when human oversight remains essential. These findings highlight both the potential and limitations of LLM-based analysis for incorporating non-traditional data sources on social activities.
摘要：社交活动是团体成员之间复杂的联合活动-出行决策的结果。虽然通过传统的出行调查很难观察这些活动的决策过程，但非结构化聊天数据等新型数据的出现可以帮助揭示这些复杂的过程。然而，解释这些决策过程需要推断显性和隐性因素。这通常涉及手动注释对话以捕获由社会和文化规范塑造的上下文相关含义的劳动密集型任务。本研究以日本联合外出就餐活动的数据作为案例研究，评估了大型语言模型 (LLM) 在解释群聊决策过程中自动化和补充人类注释的潜力。我们设计了一个受知识获取过程启发的提示框架，该框架依次提取关键决策因素，包括团体层面的餐厅选择集和结果、每个选项的个人偏好以及驱动这些偏好的特定属性。这个结构化过程指导 LLM 解释群聊数据，将非结构化对话转换为描述决策因素的结构化表格数据。为了评估 LLM 驱动的输出，我们使用人工注释的真值（ground truth）数据集进行定量分析，并进行定性误差分析以检查模型的局限性。结果表明，虽然 LLM 能够可靠地捕捉显性决策因素，但它很难识别人类注释者容易识别的细微隐性因素。我们确定了基于 LLM 的提取何时可以信任以及何时人类监督仍然重要的特定背景。这些发现强调了基于 LLM 的分析在纳入社会活动非传统数据源方面的潜力和局限性。

Title: ACR: Adaptive Context Refactoring via Context Refactoring Operators for Multi-Turn Dialogue

Authors: Jiawei Shen, Jia Zhu, Hanghui Guo, Weijie Shi, Yue Cui, Qingyu Niu, Guoqing Ma, Yidan Liang, Jingjiang Liu, Yiling Wang, Shimin Di, Jiajie Xu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.05589
Pdf URL: https://arxiv.org/pdf/2601.05589
Copy Paste: [[2601.05589]] ACR: Adaptive Context Refactoring via Context Refactoring Operators for Multi-Turn Dialogue(https://arxiv.org/abs/2601.05589)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have shown remarkable performance in multi-turn dialogue. However, in multi-turn dialogue, models still struggle to stay aligned with what has been established earlier, follow dependencies across many turns, and avoid drifting into incorrect facts as the interaction grows longer. Existing approaches primarily focus on extending the context window, introducing external memory, or applying context compression, yet these methods still face limitations such as contextual inertia and state drift. To address these challenges, we propose the Adaptive Context Refactoring (ACR) Framework, which dynamically monitors and reshapes the interaction history to mitigate contextual inertia and state drift actively. ACR is built on a library of context refactoring operators and a teacher-guided self-evolving training paradigm that learns when to intervene and how to refactor, thereby decoupling context management from the reasoning process. Extensive experiments on multi-turn dialogue demonstrate that our method significantly outperforms existing baselines while reducing token consumption.
摘要：大型语言模型（LLM）在多轮对话中表现出了卓越的性能。然而，在多轮对话中，模型仍然难以与之前建立的内容保持一致，遵循多轮对话的依赖关系，并避免随着交互时间的延长而陷入不正确的事实。现有的方法主要集中在扩展上下文窗口、引入外部存储器或应用上下文压缩，但这些方法仍然面临诸如上下文惯性和状态漂移等限制。为了应对这些挑战，我们提出了 Adaptive Context Refactoring (ACR) 框架，该框架动态监控和重塑交互历史记录，以主动减轻上下文惯性和状态漂移。 ACR 建立在上下文重构操作符库和教师引导的自我进化训练范例之上，该范例学习何时进行干预以及如何重构，从而将上下文管理与推理过程解耦。对多轮对话的大量实验表明，我们的方法显着优于现有基线，同时减少了令牌消耗。

Title: Data Augmented Pipeline for Legal Information Extraction and Reasoning

Authors: Nguyen Minh Phuong, Ha-Thanh Nguyen, May Myo Zin, Ken Satoh
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.05609
Pdf URL: https://arxiv.org/pdf/2601.05609
Copy Paste: [[2601.05609]] Data Augmented Pipeline for Legal Information Extraction and Reasoning(https://arxiv.org/abs/2601.05609)
Keywords: language model, llm
Abstract: In this paper, we propose a pipeline leveraging Large Language Models (LLMs) for data augmentation in Information Extraction tasks within the legal domain. The proposed method is both simple and effective, significantly reducing the manual effort required for data annotation while enhancing the robustness of Information Extraction systems. Furthermore, the method is generalizable, making it applicable to various Natural Language Processing (NLP) tasks beyond the legal domain.
摘要：在本文中，我们提出了一种利用大型语言模型（LLM）的管道，用于法律领域内信息提取任务中的数据增强。所提出的方法既简单又有效，显着减少了数据注释所需的手动工作，同时增强了信息提取系统的鲁棒性。此外，该方法具有可推广性，适用于法律领域之外的各种自然语言处理（NLP）任务。

Title: GIFT: Games as Informal Training for Generalizable LLMs

Authors: Nuoyan Lyu, Bingbing Xu, Weihao Meng, Yige Yuan, Yang Zhang, Zhiyong Huang, Tat-Seng Chua, Huawei Shen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.05633
Pdf URL: https://arxiv.org/pdf/2601.05633
Copy Paste: [[2601.05633]] GIFT: Games as Informal Training for Generalizable LLMs(https://arxiv.org/abs/2601.05633)
Keywords: language model, llm
Abstract: While Large Language Models (LLMs) have achieved remarkable success in formal learning tasks such as mathematics and code generation, they still struggle with the "practical wisdom" and generalizable intelligence, such as strategic creativity and social reasoning, that characterize human cognition. This gap arises from a lack of informal learning, which thrives on interactive feedback rather than goal-oriented instruction. In this paper, we propose treating Games as a primary environment for LLM informal learning, leveraging their intrinsic reward signals and abstracted complexity to cultivate diverse competencies. To address the performance degradation observed in multi-task learning, we introduce a Nested Training Framework. Unlike naive task mixing optimizing an implicit "OR" objective, our framework employs sequential task composition to enforce an explicit "AND" objective, compelling the model to master multiple abilities simultaneously to achieve maximal rewards. Using GRPO-based reinforcement learning across Matrix Games, TicTacToe, and Who's the Spy games, we demonstrate that integrating game-based informal learning not only prevents task interference but also significantly bolsters the model's generalization across broad ability-oriented benchmarks. The framework and implementation are publicly available.
摘要：虽然大型语言模型（LLM）在数学和代码生成等正式学习任务中取得了显着的成功，但它们仍然在“实践智慧”和普遍智能（例如战略创造力和社会推理）方面遇到困难，这些都是人类认知的特征。这种差距是由于缺乏非正式学习而产生的，非正式学习依赖于互动反馈而不是目标导向的指导。在本文中，我们建议将游戏视为 LLM 非正式学习的主要环境，利用其内在的奖励信号和抽象的复杂性来培养多样化的能力。为了解决多任务学习中观察到的性能下降问题，我们引入了嵌套训练框架。与优化隐式“OR”目标的朴素任务混合不同，我们的框架采用顺序任务组合来强制执行显式“AND”目标，迫使模型同时掌握多种能力以实现最大奖励。在 Matrix Games、TicTacToe 和 Who's the Spy 游戏中使用基于 GRPO 的强化学习，我们证明集成基于游戏的非正式学习不仅可以防止任务干扰，而且还可以显着增强模型在广泛的以能力为导向的基准中的泛化能力。该框架和实现是公开的。

Title: Multilingual Amnesia: On the Transferability of Unlearning in Multilingual LLMs

Authors: Alireza Dehghanpour Farashah, Aditi Khandelwal, Marylou Fauchard, Zhuan Shi, Negar Rostamzadeh, Golnoosh Farnadi
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2601.05641
Pdf URL: https://arxiv.org/pdf/2601.05641
Copy Paste: [[2601.05641]] Multilingual Amnesia: On the Transferability of Unlearning in Multilingual LLMs(https://arxiv.org/abs/2601.05641)
Keywords: language model, llm
Abstract: As multilingual large language models become more widely used, ensuring their safety and fairness across diverse linguistic contexts presents unique challenges. While existing research on machine unlearning has primarily focused on monolingual settings, typically English, multilingual environments introduce additional complexities due to cross-lingual knowledge transfer and biases embedded in both pretraining and fine-tuning data. In this work, we study multilingual unlearning using the Aya-Expanse 8B model under two settings: (1) data unlearning and (2) concept unlearning. We extend benchmarks for factual knowledge and stereotypes to ten languages through translation: English, French, Arabic, Japanese, Russian, Farsi, Korean, Hindi, Hebrew, and Indonesian. These languages span five language families and a wide range of resource levels. Our experiments show that unlearning in high-resource languages is generally more stable, with asymmetric transfer effects observed between typologically related languages. Furthermore, our analysis of linguistic distances indicates that syntactic similarity is the strongest predictor of cross-lingual unlearning behavior.
摘要：随着多语言大语言模型的使用越来越广泛，确保其在不同语言环境中的安全性和公平性提出了独特的挑战。虽然现有的机器取消学习研究主要集中在单语环境（通常是英语）上，但由于跨语言知识转移以及预训练和微调数据中嵌入的偏差，多语言环境引入了额外的复杂性。在这项工作中，我们使用 Aya-Expanse 8B 模型在两种设置下研究多语言遗忘：(1) 数据遗忘和 (2) 概念遗忘。我们通过翻译将事实知识和刻板印象的基准扩展到十种语言：英语、法语、阿拉伯语、日语、俄语、波斯语、韩语、印地语、希伯来语和印度尼西亚语。这些语言跨越五个语系和广泛的资源水平。我们的实验表明，高资源语言的遗忘通常更稳定，在类型相关的语言之间观察到不对称的迁移效应。此外，我们对语言距离的分析表明，句法相似性是跨语言忘却行为的最强预测因素。

Title: A Framework for Personalized Persuasiveness Prediction via Context-Aware User Profiling

Authors: Sejun Park, Yoonah Park, Jongwon Lim, Yohan Jo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.05654
Pdf URL: https://arxiv.org/pdf/2601.05654
Copy Paste: [[2601.05654]] A Framework for Personalized Persuasiveness Prediction via Context-Aware User Profiling(https://arxiv.org/abs/2601.05654)
Keywords: llm
Abstract: Estimating the persuasiveness of messages is critical in various applications, from recommender systems to safety assessment of LLMs. While it is imperative to consider the target persuadee's characteristics, such as their values, experiences, and reasoning styles, there is currently no established systematic framework to optimize leveraging a persuadee's past activities (e.g., conversations) to the benefit of a persuasiveness prediction model. To address this problem, we propose a context-aware user profiling framework with two trainable components: a query generator that generates optimal queries to retrieve persuasion-relevant records from a user's history, and a profiler that summarizes these records into a profile to effectively inform the persuasiveness prediction model. Our evaluation on the ChangeMyView Reddit dataset shows consistent improvements over existing methods across multiple predictor models, with gains of up to +13.77%p in F1 score. Further analysis shows that effective user profiles are context-dependent and predictor-specific, rather than relying on static attributes or surface-level similarity. Together, these results highlight the importance of task-oriented, context-dependent user profiling for personalized persuasiveness prediction.
摘要：从推荐系统到 LLM 的安全评估，估计消息的说服力在各种应用中都至关重要。虽然必须考虑目标被说服者的特征，例如他们的价值观、经验和推理风格，但目前还没有成熟的系统框架来优化对被说服者过去活动（例如对话）的利用，以使说服力预测模型受益。为了解决这个问题，我们提出了一个上下文感知的用户分析框架，它具有两个可训练的组件：一个查询生成器，用于生成最佳查询以从用户的历史记录中检索与说服力相关的记录；以及一个分析器，用于将这些记录汇总到一个配置文件中，以有效地为说服力预测模型提供信息。我们对 ChangeMyView Reddit 数据集的评估显示，在多个预测模型上，该方法相对现有方法均取得了一致的改进，F1 得分提高了高达 +13.77%p。进一步的分析表明，有效的用户配置文件是依赖于上下文和预测器特定的，而不是依赖于静态属性或表面级别的相似性。总之，这些结果凸显了面向任务、上下文相关的用户分析对于个性化说服力预测的重要性。

Title: Stephanie2: Thinking, Waiting, and Making Decisions Like Humans in Step-by-Step AI Social Chat

Authors: Hao Yang, Hongyuan Lu, Dingkang Yang, Wenliang Yang, Peng Sun, Xiaochuan Zhang, Jun Xiao, Kefan He, Wai Lam, Yang Liu, Xinhua Zeng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.05657
Pdf URL: https://arxiv.org/pdf/2601.05657
Copy Paste: [[2601.05657]] Stephanie2: Thinking, Waiting, and Making Decisions Like Humans in Step-by-Step AI Social Chat(https://arxiv.org/abs/2601.05657)
Keywords: chat, agent
Abstract: Instant-messaging human social chat typically progresses through a sequence of short messages. Existing step-by-step AI chatting systems typically split a one-shot generation into multiple messages and send them sequentially, but they lack an active waiting mechanism and exhibit unnatural message pacing. In order to address these issues, we propose Stephanie2, a novel next-generation step-wise decision-making dialogue agent. With active waiting and message-pace adaptation, Stephanie2 explicitly decides at each step whether to send or wait, and models latency as the sum of thinking time and typing time to achieve more natural pacing. We further introduce a time-window-based dual-agent dialogue system to generate pseudo dialogue histories for human and automatic evaluations. Experiments show that Stephanie2 clearly outperforms Stephanie1 on metrics such as naturalness and engagement, and achieves a higher pass rate on human evaluation with the role identification Turing test.
摘要：即时消息人类社交聊天通常通过一系列短消息进行。现有的分步式人工智能聊天系统通常将一次性生成的消息分成多条消息并按顺序发送，但它们缺乏主动等待机制，并且表现出不自然的消息节奏。为了解决这些问题，我们提出了 Stephanie2，一种新颖的下一代逐步决策对话代理。通过主动等待和消息节奏适应，Stephanie2 在每一步明确决定是发送还是等待，并将延迟建模为思考时间和打字时间的总和，以实现更自然的节奏。我们进一步引入了基于时间窗口的双代理对话系统，以生成用于人类和自动评估的伪对话历史。实验表明，Stephanie2 在自然度和参与度等指标上明显优于 Stephanie1，并且在角色识别图灵测试的人类评估上取得了更高的通过率。
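
Illustrative sketch for the entry above: the abstract says Stephanie2 models latency as the sum of thinking time and typing time. The Python below shows only that additive form. The per-word thinking cost, the typing speed, and the names `PacingConfig` and `message_latency` are hypothetical choices for illustration and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class PacingConfig:
    # Hypothetical parameters; the abstract gives no numeric values.
    thinking_seconds_per_word: float = 0.3
    typing_chars_per_second: float = 5.0


def message_latency(message: str, cfg: PacingConfig) -> float:
    """Delay before sending a message: thinking time plus typing time."""
    # Assumption: thinking time grows with the number of words.
    thinking_time = cfg.thinking_seconds_per_word * len(message.split())
    # Assumption: typing time grows with the number of characters.
    typing_time = len(message) / cfg.typing_chars_per_second
    return thinking_time + typing_time


if __name__ == "__main__":
    cfg = PacingConfig()
    for text in ["ok", "sounds good, see you at seven"]:
        print(f"{text!r}: wait {message_latency(text, cfg):.2f}s")
```

Longer messages receive a longer delay under this sketch, which is one simple way a short reply could arrive faster than a long one.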

Title: Afri-MCQA: Multimodal Cultural Question Answering for African Languages

Authors: Atnafu Lambebo Tonja, Srija Anand, Emilio Villa-Cueva, Israel Abebe Azime, Jesujoba Oluwadara Alabi, Muhidin A. Mohamed, Debela Desalegn Yadeta, Negasi Haile Abadi, Abigail Oppong, Nnaemeka Casmir Obiefuna, Idris Abdulmumin, Naome A Etori, Eric Peter Wairagala, Kanda Patrick Tshinu, Imanigirimbabazi Emmanuel, Gabofetswe Malema, Alham Fikri Aji, David Ifeoluwa Adelani, Thamar Solorio
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.05699
Pdf URL: https://arxiv.org/pdf/2601.05699
Copy Paste: [[2601.05699]] Afri-MCQA: Multimodal Cultural Question Answering for African Languages(https://arxiv.org/abs/2601.05699)
Keywords: language model, llm
Abstract: Africa is home to over one-third of the world's languages, yet remains underrepresented in AI research. We introduce Afri-MCQA, the first Multilingual Cultural Question-Answering benchmark covering 7.5k Q&A pairs across 15 African languages from 12 countries. The benchmark offers parallel English-African language Q&A pairs across text and speech modalities and was entirely created by native speakers. Benchmarking large language models (LLMs) on Afri-MCQA shows that open-weight models perform poorly across evaluated cultures, with near-zero accuracy on open-ended VQA when queried in native language or speech. To evaluate linguistic competence, we include control experiments meant to assess this specific aspect separate from cultural knowledge, and we observe significant performance gaps between native languages and English for both text and speech. These findings underscore the need for speech-first approaches, culturally grounded pretraining, and cross-lingual cultural transfer. To support more inclusive multimodal AI development in African languages, we release our Afri-MCQA under academic license or CC BY-NC 4.0 on HuggingFace (this https URL)
摘要：非洲拥有世界上超过三分之一的语言，但在人工智能研究中的代表性仍然不足。我们推出 Afri-MCQA，这是第一个多语言文化问答基准，涵盖来自 12 个国家的 15 种非洲语言的 7,500 对问答。该基准测试提供了跨文本和语音模式的平行英语-非洲语言问答对，并且完全由母语人士创建。对 Afri-MCQA 上的大语言模型 (LLM) 进行基准测试表明，开放权重模型在评估的文化中表现不佳，当用母语或语音查询时，开放式 VQA 的准确率接近于零。为了评估语言能力，我们进行了控制实验，旨在评估与文化知识无关的这一特定方面，并且我们观察到母语和英语在文本和语音方面的显着表现差距。这些发现强调了语音优先方法、基于文化的预训练和跨语言文化迁移的必要性。为了支持非洲语言更具包容性的多模式人工智能开发，我们在 HuggingFace 上根据学术许可或 CC BY-NC 4.0 发布了 Afri-MCQA（此 https URL）

Title: Multimodal In-context Learning for ASR of Low-resource Languages

Authors: Zhaolin Li, Jan Niehues
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.05707
Pdf URL: https://arxiv.org/pdf/2601.05707
Copy Paste: [[2601.05707]] Multimodal In-context Learning for ASR of Low-resource Languages(https://arxiv.org/abs/2601.05707)
Keywords: language model, llm, prompt
Abstract: Automatic speech recognition (ASR) still covers only a small fraction of the world's languages, mainly due to supervised data scarcity. In-context learning (ICL) with large language models (LLMs) addresses this problem, but prior work largely focuses on high-resource languages covered during training and text-only settings. This paper investigates whether speech LLMs can learn unseen languages with multimodal ICL (MICL), and how this learning can be used to improve ASR. We conduct experiments with two speech LLMs, Phi-4 and Qwen3-Omni, on three diverse endangered languages. Firstly, we find that MICL is effective for unseen languages, leveraging both speech and text modalities. We further show that cross-lingual transfer learning improves MICL efficiency on target languages without training on them. Moreover, we analyze attention patterns to interpret MICL mechanisms, and we observe layer-dependent preferences between audio and text context, with an overall bias towards text. Finally, we show that prompt-based ASR with speech LLMs performs poorly on unseen languages, motivating a simple ASR system that combines a stronger acoustic model with a speech LLM via MICL-based selection of acoustic hypotheses. Results show that MICL consistently improves ASR performance, and that cross-lingual transfer learning matches or outperforms corpus-trained language models without using target-language data. Our code is publicly available.
摘要：自动语音识别（ASR）仍然只覆盖世界语言的一小部分，这主要是由于监督数据的稀缺。具有大型语言模型 (LLM) 的上下文学习 (ICL) 解决了这个问题，但之前的工作主要集中在训练和纯文本设置期间涵盖的高资源语言。本文研究了语音 LLM 是否可以通过多模态 ICL (MICL) 学习未见过的语言，以及如何利用这种学习来提高 ASR。我们使用两个语音 LLM（Phi-4 和 Qwen3-Omni）对三种不同的濒危语言进行了实验。首先，我们发现 MICL 对未见过的语言有效，利用了语音和文本模态。我们进一步表明，跨语言迁移学习可以提高目标语言的 MICL 效率，而无需对其进行训练。此外，我们分析注意力模式来解释 MICL 机制，并观察音频和文本上下文之间的层相关偏好，总体偏向文本。最后，我们表明，基于提示的 ASR 和语音 LLM 在未见过的语言上表现不佳，从而激发了一个简单的 ASR 系统，该系统通过基于 MICL 的声学假设选择将更强大的声学模型与语音 LLM 结合起来。结果表明，MICL 持续提高了 ASR 性能，并且跨语言迁移学习在不使用目标语言数据的情况下匹配或优于语料库训练的语言模型。我们的代码是公开的。

Title: Visualising Information Flow in Word Embeddings with Diffusion Tensor Imaging

Authors: Thomas Fabian
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.05713
Pdf URL: https://arxiv.org/pdf/2601.05713
Copy Paste: [[2601.05713]] Visualising Information Flow in Word Embeddings with Diffusion Tensor Imaging(https://arxiv.org/abs/2601.05713)
Keywords: language model, llm
Abstract: Understanding how large language models (LLMs) represent natural language is a central challenge in natural language processing (NLP) research. Many existing methods extract word embeddings from an LLM, visualise the embedding space via point-plots, and compare the relative positions of certain words. However, this approach only considers single words and not whole natural language expressions, thus disregards the context in which a word is used. Here we present a novel tool for analysing and visualising information flow in natural language expressions by applying diffusion tensor imaging (DTI) to word embeddings. We find that DTI reveals how information flows between word embeddings. Tracking information flows within the layers of an LLM allows for comparing different model structures and revealing opportunities for pruning an LLM's under-utilised layers. Furthermore, our model reveals differences in information flows for tasks like pronoun resolution and metaphor detection. Our results show that our model permits novel insights into how LLMs represent actual natural language expressions, extending the comparison of isolated word embeddings and improving the interpretability of NLP models.
摘要：了解大型语言模型 (LLM) 如何表示自然语言是自然语言处理 (NLP) 研究的核心挑战。许多现有方法从 LLM 中提取词嵌入，通过点图可视化嵌入空间，并比较某些词的相对位置。然而，这种方法只考虑单个单词而不是整个自然语言表达，因此忽略了单词使用的上下文。在这里，我们提出了一种新颖的工具，通过将扩散张量成像（DTI）应用于词嵌入来分析和可视化自然语言表达中的信息流。我们发现 DTI 揭示了信息在词嵌入之间的流动方式。跟踪 LLM 各层内的信息流可以比较不同的模型结构，并揭示修剪 LLM 未充分利用的层的机会。此外，我们的模型揭示了代词解析和隐喻检测等任务的信息流差异。我们的结果表明，我们的模型可以对 LLM 如何表示实际自然语言表达提供新颖的见解，扩展孤立词嵌入的比较并提高 NLP 模型的可解释性。

Title: Analysing Differences in Persuasive Language in LLM-Generated Text: Uncovering Stereotypical Gender Patterns

Authors: Amalie Brogaard Pauli, Maria Barrett, Max Müller-Eberstein, Isabelle Augenstein, Ira Assent
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.05751
Pdf URL: https://arxiv.org/pdf/2601.05751
Copy Paste: [[2601.05751]] Analysing Differences in Persuasive Language in LLM-Generated Text: Uncovering Stereotypical Gender Patterns(https://arxiv.org/abs/2601.05751)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) are increasingly used for everyday communication tasks, including drafting interpersonal messages intended to influence and persuade. Prior work has shown that LLMs can successfully persuade humans and amplify persuasive language. It is therefore essential to understand how user instructions affect the generation of persuasive language, and to understand whether the generated persuasive language differs, for example, when targeting different groups. In this work, we propose a framework for evaluating how persuasive language generation is affected by recipient gender, sender intent, or output language. We evaluate 13 LLMs and 16 languages using pairwise prompt instructions. We evaluate model responses on 19 categories of persuasive language using an LLM-as-judge setup grounded in social psychology and communication science. Our results reveal significant gender differences in the persuasive language generated across all models. These patterns reflect biases consistent with gender-stereotypical linguistic tendencies documented in social psychology and sociolinguistics.
摘要：大型语言模型 (LLM) 越来越多地用于日常交流任务，包括起草旨在影响和说服的人际信息。之前的研究表明，LLM 可以成功说服人类并增强说服性语言。因此，有必要了解用户指令如何影响说服性语言的生成，并了解生成的说服性语言是否有所不同，例如针对不同群体时。在这项工作中，我们提出了一个框架来评估说服性语言的生成如何受到接收者性别、发送者意图或输出语言的影响。我们使用成对提示指令评估 13 个 LLM 和 16 种语言。我们使用基于社会心理学和传播科学的 LLM-as-judge 设置，评估模型在 19 类说服性语言上的回复。我们的结果揭示了所有模型生成的说服性语言存在显着的性别差异。这些模式反映了与社会心理学和社会语言学中记录的性别刻板语言倾向一致的偏见。

Title: AutoMonitor-Bench: Evaluating the Reliability of LLM-Based Misbehavior Monitor

Authors: Shu Yang, Jingyu Hu, Tong Li, Hanqi Yan, Wenxuan Wang, Di Wang
Subjects: cs.CL, cs.SE
Abstract URL: https://arxiv.org/abs/2601.05752
Pdf URL: https://arxiv.org/pdf/2601.05752
Copy Paste: [[2601.05752]] AutoMonitor-Bench: Evaluating the Reliability of LLM-Based Misbehavior Monitor(https://arxiv.org/abs/2601.05752)
Keywords: llm
Abstract: We introduce AutoMonitor-Bench, the first benchmark designed to systematically evaluate the reliability of LLM-based misbehavior monitors across diverse tasks and failure modes. AutoMonitor-Bench consists of 3,010 carefully annotated test samples spanning question answering, code generation, and reasoning, with paired misbehavior and benign instances. We evaluate monitors using two complementary metrics: Miss Rate (MR) and False Alarm Rate (FAR), capturing failures to detect misbehavior and oversensitivity to benign behavior, respectively. Evaluating 12 proprietary and 10 open-source LLMs, we observe substantial variability in monitoring performance and a consistent trade-off between MR and FAR, revealing an inherent safety-utility tension. To further explore the limits of monitor reliability, we construct a large-scale training corpus of 153,581 samples and fine-tune Qwen3-4B-Instruction to investigate whether training on known, relatively easy-to-construct misbehavior datasets improves monitoring performance on unseen and more implicit misbehaviors. Our results highlight the challenges of reliable, scalable misbehavior monitoring and motivate future work on task-aware designing and training strategies for LLM-based monitors.
摘要：我们推出了 AutoMonitor-Bench，这是第一个基准测试，旨在系统地评估基于 LLM 的不当行为监视器在不同任务和故障模式下的可靠性。 AutoMonitor-Bench 包含 3,010 个经过仔细注释的测试样本，涵盖问答、代码生成和推理，以及配对的不当行为和良性实例。我们使用两个互补指标来评估监视器：漏报率 (MR) 和误报率 (FAR)，分别捕获检测不当行为的失败和对良性行为的过度敏感。通过评估 12 个专有 LLM 和 10 个开源 LLM，我们观察到监控性能的巨大差异以及 MR 和 FAR 之间的一致权衡，揭示了固有的安全性与实用性之间的紧张关系。为了进一步探索监控可靠性的局限性，我们构建了一个包含 153,581 个样本的大规模训练语料库，并对 Qwen3-4B-Instruction 进行微调，以研究对已知的、相对易于构建的不当行为数据集进行训练是否可以提高对未见过的和更隐含的不当行为的监控性能。我们的结果凸显了可靠、可扩展的不当行为监控的挑战，并激励未来针对基于 LLM 的监视器的任务感知设计和训练策略的工作。
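
Illustrative sketch for the entry above: the abstract describes Miss Rate (MR) as capturing failures to detect misbehavior and False Alarm Rate (FAR) as capturing oversensitivity to benign behavior. The Python below assumes the conventional definitions (missed misbehavior over all misbehavior samples, and flagged benign samples over all benign samples). The paper's exact formulas are not given in the abstract, and the function name is hypothetical.

```python
from typing import Sequence, Tuple


def miss_and_false_alarm_rates(
    labels: Sequence[bool], flags: Sequence[bool]
) -> Tuple[float, float]:
    """Return (miss_rate, false_alarm_rate).

    labels[i] is True when sample i is a misbehavior instance.
    flags[i] is True when the monitor raised an alarm on sample i.
    """
    if len(labels) != len(flags):
        raise ValueError("labels and flags must have the same length")

    misbehaving = sum(1 for y in labels if y)
    benign = len(labels) - misbehaving
    # Misbehavior samples the monitor failed to flag.
    missed = sum(1 for y, f in zip(labels, flags) if y and not f)
    # Benign samples the monitor flagged anyway.
    false_alarms = sum(1 for y, f in zip(labels, flags) if not y and f)

    # Assumption: a rate over an empty group is reported as 0.0.
    miss_rate = missed / misbehaving if misbehaving else 0.0
    false_alarm_rate = false_alarms / benign if benign else 0.0
    return miss_rate, false_alarm_rate


if __name__ == "__main__":
    labels = [True, True, False, False]
    flags = [True, False, True, False]
    print(miss_and_false_alarm_rates(labels, flags))
```

A monitor that flags everything drives MR to zero and FAR to one, and a monitor that flags nothing does the reverse, which is the kind of trade-off the abstract reports.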

Title: One Script Instead of Hundreds? On Pretraining Romanized Encoder Language Models

Authors: Benedikt Ebing, Lennart Keller, Goran Glavaš
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.05776
Pdf URL: https://arxiv.org/pdf/2601.05776
Copy Paste: [[2601.05776]] One Script Instead of Hundreds? On Pretraining Romanized Encoder Language Models(https://arxiv.org/abs/2601.05776)
Keywords: language model
Abstract: Exposing latent lexical overlap, script romanization has emerged as an effective strategy for improving cross-lingual transfer (XLT) in multilingual language models (mLMs). Most prior work, however, focused on setups that favor romanization the most: (1) transfer from high-resource Latin-script to low-resource non-Latin-script languages and/or (2) between genealogically closely related languages with different scripts. It thus remains unclear whether romanization is a good representation choice for pretraining general-purpose mLMs, or, more precisely, if information loss associated with romanization harms performance for high-resource languages. We address this gap by pretraining encoder LMs from scratch on both romanized and original texts for six typologically diverse high-resource languages, investigating two potential sources of degradation: (i) loss of script-specific information and (ii) negative cross-lingual interference from increased vocabulary overlap. Using two romanizers with different fidelity profiles, we observe negligible performance loss for languages with segmental scripts, whereas languages with morphosyllabic scripts (Chinese and Japanese) suffer degradation that higher-fidelity romanization mitigates but cannot fully recover. Importantly, comparing monolingual LMs with their mLM counterpart, we find no evidence that increased subword overlap induces negative interference. We further show that romanization improves encoding efficiency (i.e., fertility) for segmental scripts at a negligible performance cost.
摘要：通过暴露潜在的词汇重叠，脚本罗马化已成为改善多语言语言模型 (MLM) 中跨语言迁移 (XLT) 的有效策略。然而，大多数先前的工作都集中在最有利于罗马化的设置上：（1）从高资源拉丁文字到低资源非拉丁文字语言的转移和/或（2）在具有不同文字的谱系密切相关的语言之间转移。因此，目前尚不清楚罗马化是否是预训练通用 MLM 的良好表示选择，或者更准确地说，与罗马化相关的信息丢失是否会损害高资源语言的性能。我们通过从头开始对六种类型不同的高资源语言的罗马化文本和原始文本进行编码器 LM 的预训练来解决这一差距，并调查两个潜在的退化来源：(i) 特定于脚本的信息丢失；(ii) 词汇重叠增加带来的负面跨语言干扰。使用具有不同保真度配置文件的两个罗马化器，我们观察到具有分段脚本的语言的性能损失可以忽略不计，而具有形态音节脚本的语言（中文和日语）的性能下降可以通过更高保真度的罗马化来缓解，但无法完全恢复。重要的是，将单语 LM 与其对应的 mLM 进行比较，我们发现没有证据表明子词重叠的增加会引起负面干扰。我们进一步表明，罗马化以可忽略的性能成本提高了分段脚本的编码效率（即生产力）。

Title: Simplify-This: A Comparative Analysis of Prompt-Based and Fine-Tuned LLMs

Authors: Eilam Cohen, Itamar Bul, Danielle Inbar, Omri Loewenbach
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2601.05794
Pdf URL: https://arxiv.org/pdf/2601.05794
Copy Paste: [[2601.05794]] Simplify-This: A Comparative Analysis of Prompt-Based and Fine-Tuned LLMs(https://arxiv.org/abs/2601.05794)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) enable strong text generation, and in general there is a practical tradeoff between fine-tuning and prompt engineering. We introduce Simplify-This, a comparative study evaluating both paradigms for text simplification with encoder-decoder LLMs across multiple benchmarks, using a range of evaluation metrics. Fine-tuned models consistently deliver stronger structural simplification, whereas prompting often attains higher semantic similarity scores yet tends to copy inputs. A human evaluation favors fine-tuned outputs overall. We release code, a cleaned derivative dataset used in our study, checkpoints of fine-tuned models, and prompt templates to facilitate reproducibility and future work.
摘要：大型语言模型（LLM）可以生成强大的文本，并且通常在微调和提示工程之间存在实际的权衡。我们引入了 Simplify-This，这是一项比较研究，使用一系列评估指标，跨多个基准评估编码器-解码器 LLM 的文本简化范例。经过微调的模型始终能够提供更强的结构简化，而提示通常会获得更高的语义相似度分数，但往往会复制输入。人工评估总体上有利于微调输出。我们发布了代码、研究中使用的清理后的衍生数据集、微调模型的检查点以及提示模板，以促进可重复性和未来的工作。

Title: EnvScaler: Scaling Tool-Interactive Environments for LLM Agent via Programmatic Synthesis

Authors: Xiaoshuai Song, Haofei Chang, Guanting Dong, Yutao Zhu, Zhicheng Dou, Ji-Rong Wen
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.05808
Pdf URL: https://arxiv.org/pdf/2601.05808
Copy Paste: [[2601.05808]] EnvScaler: Scaling Tool-Interactive Environments for LLM Agent via Programmatic Synthesis(https://arxiv.org/abs/2601.05808)
Keywords: language model, llm, hallucination, agent
Abstract: Large language models (LLMs) are expected to be trained to act as agents in various real-world environments, but this process relies on rich and varied tool-interaction sandboxes. However, access to real systems is often restricted; LLM-simulated environments are prone to hallucinations and inconsistencies; and manually built sandboxes are hard to scale. In this paper, we propose EnvScaler, an automated framework for scalable tool-interaction environments via programmatic synthesis. EnvScaler comprises two components. First, SkelBuilder constructs diverse environment skeletons through topic mining, logic modeling, and quality evaluation. Then, ScenGenerator generates multiple task scenarios and rule-based trajectory validation functions for each environment. With EnvScaler, we synthesize 191 environments and about 7K scenarios, and apply them to Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) for Qwen3 series models. Results on three benchmarks show that EnvScaler significantly improves LLMs' ability to solve tasks in complex environments involving multi-turn, multi-tool interactions. We release our code and data at this https URL.
摘要：大型语言模型（LLM）预计将被训练为在各种现实环境中充当代理，但这个过程依赖于丰富多样的工具交互沙箱。然而，对真实系统的访问通常受到限制； LLM模拟的环境容易产生幻觉和不一致的情况；而且手动构建的沙箱很难扩展。在本文中，我们提出了 EnvScaler，这是一种通过编程综合实现可扩展工具交互环境的自动化框架。 EnvScaler 由两个组件组成。首先，SkelBuilder通过主题挖掘、逻辑建模、质量评估构建多样化的环境骨架。然后，ScenGenerator 为每个环境生成多个任务场景和基于规则的轨迹验证函数。通过EnvScaler，我们综合了191个环境和大约7K个场景，并将它们应用于Qwen3系列模型的监督微调（SFT）和强化学习（RL）。三个基准测试的结果表明，EnvScaler 显着提高了 LLM 在涉及多回合、多工具交互的复杂环境中解决任务的能力。我们在此 https URL 发布我们的代码和数据。

Title: LLMs as Science Journalists: Supporting Early-stage Researchers in Communicating Their Science to the Public

Authors: Milad Alshomary, Grace Li, Anubhav Jangra, Yufang Hou, Kathleen McKeown, Smaranda Muresan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.05821
Pdf URL: https://arxiv.org/pdf/2601.05821
Copy Paste: [[2601.05821]] LLMs as Science Journalists: Supporting Early-stage Researchers in Communicating Their Science to the Public(https://arxiv.org/abs/2601.05821)
Keywords: language model, llm, prompt
Abstract: The scientific community needs tools that help early-stage researchers effectively communicate their findings and innovations to the public. Although existing general-purpose Large Language Models (LLMs) can assist in this endeavor, they are not optimally aligned for it. To address this, we propose a framework for training LLMs to emulate the role of a science journalist that can be used by early-stage researchers to learn how to properly communicate their papers to the general public. We evaluate the usefulness of our trained LLM Journalists in leading conversations with both simulated and human researchers. Our experiments indicate that LLMs trained using our framework ask more relevant questions that address the societal impact of research, prompting researchers to clarify and elaborate on their findings. In the user study, the majority of participants who interacted with our trained LLM Journalist appreciated it more than interacting with general-purpose LLMs.
摘要：科学界需要工具来帮助早期研究人员有效地向公众传达他们的发现和创新。尽管现有的通用大型语言模型 (LLM) 可以帮助实现这一目标，但它们并未对此进行最佳调整。为了解决这个问题，我们提出了一个训练 LLM 模仿科学记者角色的框架，早期研究人员可以使用该框架来学习如何向公众正确传达他们的论文。我们评估经过训练的 LLM 记者在与模拟研究人员和人类研究人员进行对话时的有用性。我们的实验表明，使用我们的框架训练的 LLM 提出了更多相关问题来关注研究的社会影响，促使研究人员澄清和详细阐述他们的发现。在用户研究中，与我们训练的 LLM 记者互动的大多数参与者比与通用 LLM 互动更欣赏它。

Title: Peek2: A Regex-free implementation of pretokenizers for Byte-level BPE

Authors: Liu Zai
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.05833
Pdf URL: https://arxiv.org/pdf/2601.05833
Copy Paste: [[2601.05833]] Peek2: A Regex-free implementation of pretokenizers for Byte-level BPE(https://arxiv.org/abs/2601.05833)
Keywords: gpt
Abstract: Pretokenization is a crucial, sequential pass in Byte-level BPE tokenizers. Our proposed new implementation, Peek2, serves as a drop-in replacement for cl100k-like pretokenizers used in GPT-3, LLaMa-3, and Qwen-2.5. Designed with performance and safety in mind, Peek2 is Regex-free and delivers a $1.11\times$ improvement in overall throughput across the entire Byte-level BPE encoding process. This algorithm runs entirely on the CPU, has stable linear complexity $O(n)$, and provides presegmentation results identical to those of the original Regex-based pretokenizer.
摘要：预分词（pretokenization）是字节级 BPE 分词器中至关重要的顺序处理步骤。我们提出的新实现 Peek2 可直接替代 GPT-3、LLaMa-3 和 Qwen-2.5 中使用的类似 cl100k 的预分词器。 Peek2 在设计时考虑到了性能和安全性，不使用正则表达式，并在整个字节级 BPE 编码过程中将整体吞吐量提升至 $1.11\times$。该算法完全在 CPU 上运行，具有稳定的线性复杂度 $O(n)$，并提供与原始基于 Regex 的预分词器相同的预分割结果。
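
Illustrative sketch for the entry above: the abstract describes a regex-free pretokenizer with linear complexity. The Python below is not Peek2 and does not reproduce the cl100k splitting rules. It only shows the general idea of a single left-to-right scan that cuts text into runs of same-class characters without a regex engine. The four character classes and the names `char_class` and `presegment` are assumptions made for illustration.

```python
def char_class(ch: str) -> str:
    """Coarse character class used by this toy scanner (an assumption)."""
    if ch.isalpha():
        return "letter"
    if ch.isdigit():
        return "digit"
    if ch.isspace():
        return "space"
    return "other"


def presegment(text: str) -> list[str]:
    """Split text into maximal runs of same-class characters in one pass."""
    pieces = []
    start = 0
    for i in range(1, len(text) + 1):
        # Close the current run at end of input or when the class changes.
        if i == len(text) or char_class(text[i]) != char_class(text[start]):
            pieces.append(text[start:i])
            start = i
    return pieces


if __name__ == "__main__":
    print(presegment("Hi, 42 cats!"))
```

Each character is inspected a constant number of times, so the scan is linear in the input length, and concatenating the pieces gives back the original string.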

Title: Left, Right, or Center? Evaluating LLM Framing in News Classification and Generation

Authors: Molly Kennedy, Ali Parker, Yihong Liu, Hinrich Schütze
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.05835
Pdf URL: https://arxiv.org/pdf/2601.05835
Copy Paste: [[2601.05835]] Left, Right, or Center? Evaluating LLM Framing in News Classification and Generation(https://arxiv.org/abs/2601.05835)
Keywords: language model, llm, prompt
Abstract: Large Language Model (LLM) based summarization and text generation are increasingly used for producing and rewriting text, raising concerns about political framing in journalism where subtle wording choices can shape interpretation. Across nine state-of-the-art LLMs, we study political framing by testing whether LLMs' classification-based bias signals align with framing behavior in their generated summaries. We first compare few-shot ideology predictions against LEFT/CENTER/RIGHT labels. We then generate "steered" summaries under FAITHFUL, CENTRIST, LEFT, and RIGHT prompts, and score all outputs using a single fixed ideology evaluator. We find pervasive ideological center-collapse in both article-level ratings and generated text, indicating a systematic tendency toward centrist framing. Among evaluated models, Grok 4 is by far the most ideologically expressive generator, while Claude Sonnet 4.5 and Llama 3.1 achieve the strongest bias-rating performance among commercial and open-weight models, respectively.
摘要：基于大语言模型 (LLM) 的摘要和文本生成越来越多地用于生成和重写文本，这引发了人们对新闻业政治框架的担忧，其中微妙的措辞选择可能会影响解释。在九个最先进的法学硕士中，我们通过测试法学硕士基于分类的偏见信号是否与其生成的摘要中的框架行为一致来研究政治框架。我们首先将几次意识形态预测与左/中/右标签进行比较。然后，我们在“忠实”、“中间”、“左”和“右”提示下生成“引导”摘要，并使用单个固定意识形态评估器对所有输出进行评分。我们发现文章级别的评级和生成的文本中普遍存在意识形态中心崩溃，表明中间派框架的系统性倾向。在评估的模型中，Grok 4 是迄今为止最具意识形态表达力的生成器，而 Claude Sonnet 4.5 和 Llama 3.1 分别在商业模型和开放权重模型中实现了最强的偏差评级性能。

Title: Router-Suggest: Dynamic Routing for Multimodal Auto-Completion in Visually-Grounded Dialogs

Authors: Sandeep Mishra, Devichand Budagam, Anubhab Mandal, Bishal Santra, Pawan Goyal, Manish Gupta
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2601.05851
Pdf URL: https://arxiv.org/pdf/2601.05851
Copy Paste: [[2601.05851]] Router-Suggest: Dynamic Routing for Multimodal Auto-Completion in Visually-Grounded Dialogs(https://arxiv.org/abs/2601.05851)
Keywords: language model, chat
Abstract: Real-time multimodal auto-completion is essential for digital assistants, chatbots, design tools, and healthcare consultations, where user inputs rely on shared visual context. We introduce Multimodal Auto-Completion (MAC), a task that predicts upcoming characters in live chats using partially typed text and visual cues. Unlike traditional text-only auto-completion (TAC), MAC grounds predictions in multimodal context to better capture user intent. To enable this task, we adapt MMDialog and ImageChat to create benchmark datasets. We evaluate leading vision-language models (VLMs) against strong textual baselines, highlighting trade-offs in accuracy and efficiency. We present Router-Suggest, a router framework that dynamically selects between textual models and VLMs based on dialog context, along with a lightweight variant for resource-constrained environments. Router-Suggest achieves a 2.3x to 10x speedup over the best-performing VLM. A user study shows that VLMs significantly excel over textual models on user satisfaction, notably saving user typing effort and improving the quality of completions in multi-turn conversations. These findings underscore the need for multimodal context in auto-completions, leading to smarter, user-aware assistants.
摘要：实时多模式自动完成对于数字助理、聊天机器人、设计工具和医疗保健咨询至关重要，因为用户输入依赖于共享的视觉上下文。我们引入了多模式自动完成 (MAC)，这是一项使用部分键入的文本和视觉提示来预测实时聊天中即将出现的字符的任务。与传统的纯文本自动完成 (TAC) 不同，MAC 在多模式上下文中进行预测，以更好地捕获用户意图。为了完成此任务，我们采用 MMDialog 和 ImageChat 来创建基准数据集。我们根据强大的文本基线评估领先的视觉语言模型（VLM），强调准确性和效率的权衡。我们提出了 Router-Suggest，这是一个基于对话上下文在文本模型和 VLM 之间动态选择的路由器框架，以及适用于资源受限环境的轻量级变体。 Router-Suggest 比性能最佳的 VLM 实现了 2.3 倍到 10 倍的加速。一项用户研究表明，VLM 在用户满意度方面明显优于文本模型，尤其是节省了用户的打字工作量并提高了多轮对话的完成质量。这些发现强调了自动完成中对多模式上下文的需求，从而产生更智能、具有用户意识的助手。

Title: CLewR: Curriculum Learning with Restarts for Machine Translation Preference Learning

Authors: Alexandra Dragomir, Florin Brad, Radu Tudor Ionescu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.05858
Pdf URL: https://arxiv.org/pdf/2601.05858
Copy Paste: [[2601.05858]] CLewR: Curriculum Learning with Restarts for Machine Translation Preference Learning(https://arxiv.org/abs/2601.05858)
Keywords: language model, llm
Abstract: Large language models (LLMs) have demonstrated competitive performance in zero-shot multilingual machine translation (MT). Some follow-up works further improved MT performance via preference optimization, but they leave a key aspect largely underexplored: the order in which data samples are given during training. We address this topic by integrating curriculum learning into various state-of-the-art preference optimization algorithms to boost MT performance. We introduce a novel curriculum learning strategy with restarts (CLewR), which reiterates easy-to-hard curriculum multiple times during training to effectively mitigate the catastrophic forgetting of easy examples. We demonstrate consistent gains across several model families (Gemma2, Qwen2.5, Llama3.1) and preference optimization techniques. We publicly release our code at this https URL.
摘要：大型语言模型 (LLM) 在零样本多语言机器翻译 (MT) 方面表现出了具有竞争力的性能。一些后续工作通过偏好优化进一步提高了 MT 性能，但它们留下了一个很大程度上未被充分探索的关键方面：训练期间给出数据样本的顺序。我们通过将课程学习集成到各种最先进的偏好优化算法中来解决这个主题，以提高机器翻译的性能。我们引入了一种新颖的重新启动课程学习策略（CLewR），该策略在训练过程中多次重复从易到难的课程，以有效减轻简单示例的灾难性遗忘。我们展示了多个模型系列（Gemma2、Qwen2.5、Llama3.1）和偏好优化技术的一致增益。我们在此 https URL 公开发布我们的代码。
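
The CLewR abstract describes repeating an easy-to-hard curriculum several times during training. The sketch below illustrates only that ordering idea. The function name `curriculum_with_restarts`, the `difficulty` scoring function, and the `num_restarts` parameter are assumptions for illustration; the abstract does not say how difficulty is measured or how many restarts are used.

```python
def curriculum_with_restarts(samples, difficulty, num_restarts):
    """Return a training order that replays an easy-to-hard pass num_restarts times.

    `difficulty` is a hypothetical scoring function (lower means easier).
    """
    ordered = sorted(samples, key=difficulty)  # one easy-to-hard pass
    schedule = []
    for _ in range(num_restarts):
        # Each restart goes back to the easiest examples before ramping up again.
        schedule.extend(ordered)
    return schedule


# Toy usage: string length stands in for a difficulty score (assumption).
order = curriculum_with_restarts(["ccc", "a", "bb"], difficulty=len, num_restarts=2)
print(order)
```

Because every restart begins again with the easiest samples, those samples are revisited throughout training instead of only at the start, which is the forgetting-mitigation intuition the abstract points to.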

Title: FACTUM: Mechanistic Detection of Citation Hallucination in Long-Form RAG

Authors: Maxime Dassen, Rebecca Kotula, Kenton Murray, Andrew Yates, Dawn Lawrie, Efsun Kayi, James Mayfield, Kevin Duh
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.05866
Pdf URL: https://arxiv.org/pdf/2601.05866
Copy Paste: [[2601.05866]] FACTUM: Mechanistic Detection of Citation Hallucination in Long-Form RAG(https://arxiv.org/abs/2601.05866)
Keywords: hallucination, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) models are critically undermined by citation hallucinations, a deceptive failure where a model confidently cites a source that fails to support its claim. Existing work often attributes hallucination to a simple over-reliance on the model's parametric knowledge. We challenge this view and introduce FACTUM (Framework for Attesting Citation Trustworthiness via Underlying Mechanisms), a framework of four mechanistic scores measuring the distinct contributions of a model's attention and FFN pathways, and the alignment between them. Our analysis reveals two consistent signatures of correct citation: a significantly stronger contribution from the model's parametric knowledge and greater use of the attention sink for information synthesis. Crucially, we find the signature of a correct citation is not static but evolves with model scale. For example, the signature of a correct citation for the Llama-3.2-3B model is marked by higher pathway alignment, whereas for the Llama-3.1-8B model, it is characterized by lower alignment, where pathways contribute more distinct, orthogonal information. By capturing this complex, evolving signature, FACTUM outperforms state-of-the-art baselines by up to 37.5% in AUC. Our findings reframe citation hallucination as a complex, scale-dependent interplay between internal mechanisms, paving the way for more nuanced and reliable RAG systems.
摘要：检索增强生成（RAG）模型受到引用幻觉的严重破坏，这是一种欺骗性的失败，模型自信地引用了无法支持其主张的来源。现有的工作经常将幻觉归因于对模型参数知识的简单过度依赖。我们挑战这种观点并引入 FACTUM（通过底层机制证明引文可信度的框架），这是一个由四个机械分数组成的框架，用于衡量模型注意力和 FFN 路径的独特贡献以及它们之间的一致性。我们的分析揭示了正确引用的两个一致特征：模型参数知识的贡献明显更强，以及更多地使用注意力接收器进行信息合成。至关重要的是，我们发现正确引用的签名不是静态的，而是随着模型规模而变化。例如，Llama-3.2-3B 模型的正确引用的特征是较高的路径对齐，而对于 Llama-3.1-8B 模型，其特征是较低的对齐，其中路径贡献了更独特的正交信息。通过捕获这种复杂的、不断变化的特征，FACTUM 的 AUC 性能比最先进的基线高出 37.5%。我们的研究结果将引文幻觉重新定义为内部机制之间复杂的、规模相关的相互作用，为更细致、更可靠的 RAG 系统铺平了道路。

Title: Continual-learning for Modelling Low-Resource Languages from Large Language Models

Authors: Santosh Srinath K, Mudit Somani, Varun Reddy Padala, Prajna Devi Upadhyay, Abhijit Das
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.05874
Pdf URL: https://arxiv.org/pdf/2601.05874
Copy Paste: [[2601.05874]] Continual-learning for Modelling Low-Resource Languages from Large Language Models(https://arxiv.org/abs/2601.05874)
Keywords: language model, llm
Abstract: Building a language model for a multilingual scenario involves several potential challenges, among which catastrophic forgetting is the major one. For example, small language models (SLMs) built for low-resource languages by adapting large language models (LLMs) are prone to catastrophic forgetting. This work proposes a continual learning strategy that uses parts-of-speech (POS)-based code-switching along with a replay adapter strategy to mitigate catastrophic forgetting while training an SLM from an LLM. Experiments conducted on vision-language tasks, such as visual question answering, and on a language modelling task demonstrate the success of the proposed architecture.
摘要：为多语言场景构建语言模型面临若干潜在挑战，其中灾难性遗忘是主要挑战。例如，通过适配大语言模型（LLM）为低资源语言构建的小语言模型（SLM）容易出现灾难性遗忘。这项工作建议采用基于词性 (POS) 的语码转换的持续学习策略以及重放适配器策略，以在从 LLM 训练 SLM 时缓解灾难性遗忘。在视觉语言任务（例如视觉问答）和语言建模任务上进行的实验展示了所提出架构的成功。

Title: iReasoner: Trajectory-Aware Intrinsic Reasoning Supervision for Self-Evolving Large Multimodal Models

Authors: Meghana Sunil, Manikandarajan Venmathimaran, Muthu Subash Kavitha
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.05877
Pdf URL: https://arxiv.org/pdf/2601.05877
Copy Paste: [[2601.05877]] iReasoner: Trajectory-Aware Intrinsic Reasoning Supervision for Self-Evolving Large Multimodal Models(https://arxiv.org/abs/2601.05877)
Keywords: chain-of-thought
Abstract: Recent work shows that large multimodal models (LMMs) can self-improve from unlabeled data via self-play and intrinsic feedback. Yet existing self-evolving frameworks mainly reward final outcomes, leaving intermediate reasoning weakly constrained despite its importance for visually grounded decision making. We propose iReasoner, a self-evolving framework that improves an LMM's implicit reasoning by explicitly eliciting chain-of-thought (CoT) and rewarding its internal agreement. In a Proposer-Solver loop over unlabeled images, iReasoner augments outcome-level intrinsic rewards with a trajectory-aware signal defined over intermediate reasoning steps, providing learning signals that distinguish reasoning paths leading to the same answer without ground-truth labels or external judges. Starting from Qwen2.5-VL-7B, iReasoner yields up to $+2.1$ points across diverse multimodal reasoning benchmarks under fully unsupervised post-training. We hope this work serves as a starting point for reasoning-aware self-improvement in LMMs in purely unsupervised settings.
摘要：最近的工作表明，大型多模态模型（LMM）可以通过自我博弈和内在反馈，从未标记的数据中进行自我改进。然而，现有的自我进化框架主要奖励最终结果，尽管中间推理对于基于视觉的决策很重要，但它受到的限制很弱。我们提出了 iReasoner，这是一个自我进化的框架，它通过显式引发思想链（CoT）并奖励其内部协议来改进 LMM 的隐式推理。在未标记图像上的提议者-求解器循环中，iReasoner 通过在中间推理步骤上定义的轨迹感知信号来增强结果级别的内在奖励，提供学习信号来区分导致相同答案的推理路径，而无需地面真实标签或外部判断。从 Qwen2.5-VL-7B 开始，iReasoner 在完全无监督的训练后在各种多模态推理基准上产生高达 $+2.1$ 点。我们希望这项工作能够成为在纯无监督环境中 LMM 推理意识自我改进的起点。

Title: Gender Bias in LLMs: Preliminary Evidence from Shared Parenting Scenario in Czech Family Law

Authors: Jakub Harasta, Matej Vasina, Martin Kornel, Tomas Foltynek
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2601.05879
Pdf URL: https://arxiv.org/pdf/2601.05879
Copy Paste: [[2601.05879]] Gender Bias in LLMs: Preliminary Evidence from Shared Parenting Scenario in Czech Family Law(https://arxiv.org/abs/2601.05879)
Keywords: language model, gpt, llm
Abstract: Access to justice remains limited for many people, leading laypersons to increasingly rely on Large Language Models (LLMs) for legal self-help. Laypeople use these tools intuitively, which may lead them to form expectations based on incomplete, incorrect, or biased outputs. This study examines whether leading LLMs exhibit gender bias in their responses to a realistic family law scenario. We present an expert-designed divorce scenario grounded in Czech family law and evaluate four state-of-the-art LLMs (GPT-5 nano, Claude Haiku 4.5, Gemini 2.5 Flash, and Llama 3.3) in a fully zero-shot interaction. We deploy two versions of the scenario, one with gendered names and one with neutral labels, to establish a baseline for comparison. We further introduce nine legally relevant factors that vary the factual circumstances of the case and test whether these variations influence the models' proposed shared-parenting ratios. Our preliminary results highlight differences across models and suggest gender-dependent patterns in the outcomes generated by some systems. The findings underscore both the risks associated with laypeople's reliance on LLMs for legal guidance and the need for more robust evaluation of model behavior in sensitive legal contexts. We present exploratory and descriptive evidence intended to identify systematic asymmetries rather than to establish causal effects.
摘要：对于许多人来说，诉诸司法的机会仍然有限，导致外行越来越依赖大语言模型 (LLM) 进行法律自助。外行人凭直觉使用这些工具，这可能会导致他们根据不完整、不正确或有偏见的输出形成预期。本研究探讨了领先的法学硕士在对现实家庭法情景的反应中是否表现出性别偏见。我们提出了一个基于捷克家庭法的专家设计的离婚场景，并在完全零样本交互中评估了四个最先进的法学硕士 GPT-5 nano、Claude Haiku 4.5、Gemini 2.5 Flash 和 Llama 3.3。我们部署了该场景的两个版本，一种带有性别名称，另一种带有中性标签，以建立比较基线。我们进一步引入了九个法律相关因素，这些因素改变了案件的实际情况，并测试这些变化是否影响模型提出的共同育儿比率。我们的初步结果强调了模型之间的差异，并表明某些系统产生的结果存在性别依赖性模式。研究结果强调了外行人依赖法学硕士获得法律指导所带来的风险，以及在敏感的法律背景下对模范行为进行更强有力的评估的必要性。我们提供探索性和描述性证据，旨在识别系统性不对称性，而不是确定因果效应。

Title: An Empirical Study on Preference Tuning Generalization and Diversity Under Domain Shift

Authors: Constantinos Karouzos, Xingwei Tan, Nikolaos Aletras
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.05882
Pdf URL: https://arxiv.org/pdf/2601.05882
Copy Paste: [[2601.05882]] An Empirical Study on Preference Tuning Generalization and Diversity Under Domain Shift(https://arxiv.org/abs/2601.05882)
Keywords: language model
Abstract: Preference tuning aligns pretrained language models to human judgments of quality, helpfulness, or safety by optimizing over explicit preference signals rather than likelihood alone. Prior work has shown that preference tuning degrades performance and reduces helpfulness when evaluated outside the training domain. However, the extent to which adaptation strategies mitigate this domain shift remains unexplored. We address this challenge by conducting a comprehensive and systematic study of alignment generalization under domain shift. We compare five popular alignment objectives and various adaptation strategies from source to target, including target-domain supervised fine-tuning and pseudo-labeling, across summarization and question-answering helpfulness tasks. Our findings reveal systematic differences in generalization across alignment objectives under domain shift. We show that adaptation strategies based on pseudo-labeling can substantially reduce domain-shift degradation.
摘要：偏好调整通过优化明确的偏好信号而不是单独的可能性，使预训练的语言模型与人类对质量、有用性或安全性的判断保持一致。先前的工作表明，在训练领域之外进行评估时，偏好调整会降低性能并降低有用性。然而，适应策略在多大程度上缓解这种领域转变仍有待探索。我们通过对域转移下的对齐泛化进行全面、系统的研究来应对这一挑战。我们比较了五种流行的对齐目标和从源到目标的各种适应策略，包括总结和问答帮助任务中的目标域监督微调和伪标签。我们的研究结果揭示了领域转移下对齐目标之间泛化的系统差异。我们证明基于伪标记的适应策略可以大大减少域转移退化

Title: HAPS: Hierarchical LLM Routing with Joint Architecture and Parameter Search

Authors: Zihang Tian, Rui Li, Jingsen Zhang, Xiaohe Bo, Wei Huo, Xu Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.05903
Pdf URL: https://arxiv.org/pdf/2601.05903
Copy Paste: [[2601.05903]] HAPS: Hierarchical LLM Routing with Joint Architecture and Parameter Search(https://arxiv.org/abs/2601.05903)
Keywords: language model, llm
Abstract: Large language model (LLM) routing aims to exploit the specialized strengths of different LLMs for diverse tasks. However, existing approaches typically focus on selecting LLM architectures while overlooking parameter settings, which are critical for task performance. In this paper, we introduce HAPS, a hierarchical LLM routing framework that jointly searches over model architectures and parameters. Specifically, we use a high-level router to select among candidate LLM architectures, and then search for the optimal parameters for the selected architectures based on a low-level router. We design a parameter generation network to share parameters between the two routers to mutually enhance their capabilities. In the training process, we design a reward-augmented objective to effectively optimize our framework. Experiments on two commonly used benchmarks show that HAPS consistently outperforms strong routing baselines. We have released our code at this https URL.
摘要：大语言模型 (LLM) 路由旨在利用不同 LLM 的专业优势来完成不同的任务。然而，现有方法通常侧重于选择 LLM 架构，而忽略了对任务性能至关重要的参数设置。在本文中，我们介绍了 HAPS，这是一种分层的 LLM 路由框架，可联合搜索模型架构和参数。具体来说，我们使用高层路由器在候选LLM架构中进行选择，然后基于低层路由器搜索所选架构的最佳参数。我们设计了一个参数生成网络来在两个路由器之间共享参数，以相互增强它们的能力。在训练过程中，我们设计了奖励增强目标来有效优化我们的框架。对两个常用基准的实验表明，HAPS 始终优于强大的路由基准。我们已在此 https URL 发布了我们的代码。

Title: Illusions of Confidence? Diagnosing LLM Truthfulness via Neighborhood Consistency

Authors: Haoming Xu, Ningyuan Zhao, Yunzhi Yao, Weihong Xu, Hongru Wang, Xinle Deng, Shumin Deng, Jeff Z. Pan, Huajun Chen, Ningyu Zhang
Subjects: cs.CL, cs.AI, cs.HC, cs.LG, cs.MA
Abstract URL: https://arxiv.org/abs/2601.05905
Pdf URL: https://arxiv.org/pdf/2601.05905
Copy Paste: [[2601.05905]] Illusions of Confidence? Diagnosing LLM Truthfulness via Neighborhood Consistency(https://arxiv.org/abs/2601.05905)
Keywords: language model, llm
Abstract: As Large Language Models (LLMs) are increasingly deployed in real-world settings, correctness alone is insufficient. Reliable deployment requires maintaining truthful beliefs under contextual perturbations. Existing evaluations largely rely on point-wise confidence measures like Self-Consistency, which can mask brittle beliefs. We show that even facts answered with perfect self-consistency can rapidly collapse under mild contextual interference. To address this gap, we propose Neighbor-Consistency Belief (NCB), a structural measure of belief robustness that evaluates response coherence across a conceptual neighborhood. To validate the efficiency of NCB, we introduce a new cognitive stress-testing protocol that probes output stability under contextual interference. Experiments across multiple LLMs show that the performance of high-NCB data is relatively more resistant to interference. Finally, we present Structure-Aware Training (SAT), which optimizes context-invariant belief structure and reduces long-tail knowledge brittleness by approximately 30%. Code will be available at this https URL.
摘要：随着大型语言模型 (LLM) 越来越多地部署在现实世界中，仅靠正确性是不够的。可靠的部署需要在环境扰动下保持真实的信念。现有的评估很大程度上依赖于诸如自我一致性之类的逐点信心，这可能掩盖脆弱的信念。我们证明，即使事实具有完美的自洽性，在轻微的语境干扰下也可能迅速崩溃。为了解决这一差距，我们提出了邻域一致性信念（NCB），这是一种信念稳健性的结构性度量，用于评估整个概念邻域的响应一致性。为了验证 NCB 的效率，我们引入了一种新的认知压力测试协议，该协议可探测上下文干扰下的输出稳定性。多个LLM的实验表明，高NCB数据的性能相对更能抵抗干扰。最后，我们提出了结构感知训练（SAT），它优化了上下文不变的信念结构，并将长尾知识的脆性降低了大约 30%。代码将在此 https URL 中提供。

Title: Pantagruel: Unified Self-Supervised Encoders for French Text and Speech

Authors: Phuong-Hang Le, Valentin Pelloin, Arnault Chatelain, Maryem Bouziane, Mohammed Ghennai, Qianwen Guan, Kirill Milintsevich, Salima Mdhaffar, Aidan Mannion, Nils Defauw, Shuyue Gu, Alexandre Audibert, Marco Dinarelli, Yannick Estève, Lorraine Goeuriot, Steffen Lalande, Nicolas Hervé, Maximin Coavoux, François Portet, Étienne Ollion, Marie Candito, Maxime Peyrard, Solange Rossato, Benjamin Lecouteux, Aurélie Nardy, Gilles Sérasset, Vincent Segonne, Solène Evain, Diandra Fabre, Didier Schwab
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.05911
Pdf URL: https://arxiv.org/pdf/2601.05911
Copy Paste: [[2601.05911]] Pantagruel: Unified Self-Supervised Encoders for French Text and Speech(https://arxiv.org/abs/2601.05911)
Keywords: llm
Abstract: We release Pantagruel models, a new family of self-supervised encoder models for French text and speech. Instead of predicting modality-tailored targets such as textual tokens or speech units, Pantagruel learns contextualized target representations in the feature space, allowing modality-specific encoders to capture linguistic and acoustic regularities more effectively. Separate models are pre-trained on large-scale French corpora, including Wikipedia, OSCAR and CroissantLLM for text, together with MultilingualLibriSpeech, LeBenchmark, and INA-100k for speech. INA-100k is a newly introduced 100,000-hour corpus of French audio derived from the archives of the Institut National de l'Audiovisuel (INA), the national repository of French radio and television broadcasts, providing highly diverse audio data. We evaluate Pantagruel across a broad range of downstream tasks spanning both modalities, including those from the standard French benchmarks such as FLUE or LeBenchmark. Across these tasks, Pantagruel models show competitive or superior performance compared to strong French baselines such as CamemBERT, FlauBERT, and LeBenchmark2.0, while maintaining a shared architecture that can seamlessly handle either speech or text inputs. These results confirm the effectiveness of feature-space self-supervised objectives for French representation learning and highlight Pantagruel as a robust foundation for multimodal speech-text understanding.
摘要：我们发布了 Pantagruel 模型，这是一个新的法语文本和语音自监督编码器模型系列。 Pantagruel 不是预测文本标记或语音单元等模态定制目标，而是学习特征空间中的上下文目标表示，从而允许模态特定编码器更有效地捕获语言和声学规律。单独的模型在大型法语语料库上进行了预训练，包括用于文本的 Wikipedia、OSCAR 和 CroissantLLM，以及用于语音的 MultilingualLibriSpeech、LeBenchmark 和 INA-100k。 INA-100k 是新推出的 100,000 小时法语音频语料库，源自法国广播电视广播国家存储库法国国家视听研究所 (INA) 的档案，提供高度多样化的音频数据。我们在涵盖两种模式的广泛下游任务中评估庞大固埃，包括来自 FLUE 或 LeBenchmark 等标准法国基准的任务。在这些任务中，与 CamemBERT、FlauBERT 和 LeBenchmark2.0 等强大的法国基线相比，庞大固埃模型表现出了具有竞争力或优越的性能，同时保持了可以无缝处理语音或文本输入的共享架构。这些结果证实了特征空间自监督目标对于法语表征学习的有效性，并强调庞大固埃作为多模态语音文本理解的坚实基础。

Title: Can We Predict Before Executing Machine Learning Agents?

Authors: Jingsheng Zheng, Jintian Zhang, Yujie Luo, Yuren Mao, Yunjun Gao, Lun Du, Huajun Chen, Ningyu Zhang
Subjects: cs.CL, cs.AI, cs.LG, cs.MA
Abstract URL: https://arxiv.org/abs/2601.05930
Pdf URL: https://arxiv.org/pdf/2601.05930
Copy Paste: [[2601.05930]] Can We Predict Before Executing Machine Learning Agents?(https://arxiv.org/abs/2601.05930)
Keywords: llm, agent
Abstract: Autonomous machine learning agents have revolutionized scientific discovery, yet they remain constrained by a Generate-Execute-Feedback paradigm. Previous approaches suffer from a severe Execution Bottleneck, as hypothesis evaluation relies strictly on expensive physical execution. To bypass these physical constraints, we internalize execution priors to substitute costly runtime checks with instantaneous predictive reasoning, drawing inspiration from World Models. In this work, we formalize the task of Data-centric Solution Preference and construct a comprehensive corpus of 18,438 pairwise comparisons. We demonstrate that LLMs exhibit significant predictive capabilities when primed with a Verified Data Analysis Report, achieving 61.5% accuracy and robust confidence calibration. Finally, we instantiate this framework in FOREAGENT, an agent that employs a Predict-then-Verify loop, achieving a 6x acceleration in convergence while surpassing execution-based baselines by +6%. Our code and dataset will be publicly available soon at this https URL.
摘要：自主机器学习代理彻底改变了科学发现，但它们仍然受到生成-执行-反馈范式的限制。以前的方法存在严重的执行瓶颈，因为假设评估严格依赖于昂贵的物理执行。为了绕过这些物理限制，我们从世界模型中汲取灵感，将执行先验内在化，用即时预测推理代替昂贵的运行时检查。在这项工作中，我们形式化了以数据为中心的解决方案偏好的任务，并构建了一个包含 18,438 个成对比较的综合语料库。我们证明，法学硕士在提供经过验证的数据分析报告时表现出显着的预测能力，实现了 61.5% 的准确度和强大的置信度校准。最后，我们在 FOREAGENT 中实例化该框架，该代理采用 Predict-then-Verify 循环，实现了 6 倍的收敛加速，同时超出基于执行的基线 +6%。我们的代码和数据集很快将在此 https URL 上公开。

Title: Distilling Feedback into Memory-as-a-Tool

Authors: Víctor Gallego
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.05960
Pdf URL: https://arxiv.org/pdf/2601.05960
Copy Paste: [[2601.05960]] Distilling Feedback into Memory-as-a-Tool(https://arxiv.org/abs/2601.05960)
Keywords: llm, agent
Abstract: We propose a framework that amortizes the cost of inference-time reasoning by converting transient critiques into retrievable guidelines, through a file-based memory system and agent-controlled tool calls. We evaluate this method on the Rubric Feedback Bench, a novel dataset for rubric-based learning. Experiments demonstrate that our augmented LLMs rapidly match the performance of test-time refinement pipelines while drastically reducing inference cost.
摘要：我们提出了一个框架，通过基于文件的记忆系统和代理控制的工具调用，将瞬态批评转换为可检索的指导方针，从而摊销推理时推理的成本。我们在 Rubric Feedback Bench 上评估了这种方法，这是一个用于基于评分细则 (rubric) 的学习的新数据集。实验表明，我们的增强型 LLM 可以快速达到测试时细化流程的性能，同时大幅降低推理成本。

Title: The Molecular Structure of Thought: Mapping the Topology of Long Chain-of-Thought Reasoning

Authors: Qiguang Chen, Yantao Du, Ziniu Li, Jinhao Liu, Songyao Duan, Jiarui Guo, Minghao Liu, Jiaheng Liu, Tong Yang, Ge Zhang, Libo Qin, Wanxiang Che, Wenhao Huang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.06002
Pdf URL: https://arxiv.org/pdf/2601.06002
Copy Paste: [[2601.06002]] The Molecular Structure of Thought: Mapping the Topology of Long Chain-of-Thought Reasoning(https://arxiv.org/abs/2601.06002)
Keywords: language model, llm, chain-of-thought
Abstract: Large language models (LLMs) often fail to learn effective long chain-of-thought (Long CoT) reasoning by imitating humans or non-Long-CoT LLMs. To understand this, we propose that effective and learnable Long CoT trajectories feature stable molecular-like structures in a unified view, which are formed by three interaction types: Deep-Reasoning (covalent-like), Self-Reflection (hydrogen-bond-like), and Self-Exploration (van der Waals-like). Analysis of distilled trajectories reveals these structures emerge from Long CoT fine-tuning, not keyword imitation. We introduce Effective Semantic Isomers and show that only bonds promoting fast entropy convergence support stable Long CoT learning, while structural competition impairs training. Drawing on these findings, we present Mole-Syn, a distribution-transfer-graph method that guides synthesis of effective Long CoT structures, boosting performance and RL stability across benchmarks.
摘要：大型语言模型 (LLM) 通常无法从人类或非长 CoT LLM 的模仿中学习有效的长思维链 (Long CoT) 推理。为了理解这一点，我们提出有效且可学习的长CoT轨迹在统一视图中具有稳定的类分子结构，这些结构由三种相互作用类型形成：深度推理（类共价）、自我反射（类氢键）和自我探索（类范德华）。对精炼轨迹的分析表明，这些结构来自 Long CoT 微调，而不是关键字模仿。我们引入了有效语义异构体，并表明只有促进快速熵收敛的键才能支持稳定的长 CoT 学习，而结构竞争会损害训练。根据这些发现，我们提出了 Mole-Syn，这是一种分布传输图方法，可指导有效长 CoT 结构的合成，从而提高跨基准的性能和 RL 稳定性。

Title: Don't Break the Cache: An Evaluation of Prompt Caching for Long-Horizon Agentic Tasks

Authors: Elias Lumer, Faheem Nizar, Akshaya Jangiti, Kevin Frank, Anmol Gulati, Mandar Phadate, Vamse Kumar Subbiah
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.06007
Pdf URL: https://arxiv.org/pdf/2601.06007
Copy Paste: [[2601.06007]] Don't Break the Cache: An Evaluation of Prompt Caching for Long-Horizon Agentic Tasks(https://arxiv.org/abs/2601.06007)
Keywords: language model, llm, prompt, agent
Abstract: Recent advancements in Large Language Model (LLM) agents have enabled complex multi-turn agentic tasks requiring extensive tool calling, where conversations can span dozens of API calls with increasingly large context windows. However, although major LLM providers offer prompt caching to reduce cost and latency, its benefits for agentic workloads remain underexplored in the research literature. To our knowledge, no prior work quantifies these cost savings or compares caching strategies for multi-turn agentic tasks. We present a comprehensive evaluation of prompt caching across three major LLM providers (OpenAI, Anthropic, and Google) and compare three caching strategies, including full context caching, system prompt only caching, and caching that excludes dynamic tool results. We evaluate on DeepResearchBench, a multi-turn agentic benchmark where agents autonomously execute real-world web search tool calls to answer complex research questions, measuring both API cost and time to first token (TTFT) across over 500 agent sessions with 10,000-token system prompts. Our results demonstrate that prompt caching reduces API costs by 45-80% and improves time to first token by 13-31% across providers. We find that strategic prompt cache block control, such as placing dynamic content at the end of the system prompt, avoiding dynamic traditional function calling, and excluding dynamic tool results, provides more consistent benefits than naive full-context caching, which can paradoxically increase latency. Our analysis reveals nuanced variations in caching behavior across providers, and we provide practical guidance for implementing prompt caching in production agentic systems.
摘要：大型语言模型 (LLM) 代理的最新进展实现了需要大量工具调用的复杂多轮代理任务，其中对话可以跨越数十个 API 调用，上下文窗口越来越大。然而，尽管主要的法学硕士提供商提供即时缓存以降低成本和延迟，但其对代理工作负载的好处在研究文献中仍未得到充分探索。据我们所知，之前的工作没有量化这些成本节省或比较多轮代理任务的缓存策略。我们对三个主要 LLM 提供商（OpenAI、Anthropic 和 Google）的提示缓存进行了全面评估，并比较了三种缓存策略，包括完整上下文缓存、仅系统提示缓存和排除动态工具结果的缓存。我们在 DeepResearchBench 上进行评估，这是一个多回合代理基准，代理可以自主执行现实世界的网络搜索工具调用来回答复杂的研究问题，通过 10,000 个令牌系统提示，在 500 多个代理会话中测量 API 成本和首次令牌时间 (TTFT)。我们的结果表明，即时缓存可将 API 成本降低 45-80%，并将各个提供商的第一个令牌的时间缩短 13-31%。我们发现，策略性的提示缓存块控制，例如将动态内容放在系统提示的末尾，避免动态传统函数调用，以及排除动态工具结果，比朴素的全上下文缓存提供了更一致的好处，而后者可能会增加延迟。我们的分析揭示了不同提供商之间的缓存行为的细微差别，并且我们为在生产代理系统中实现即时缓存提供了实用指导。
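
The prompt-caching abstract recommends placing dynamic content at the end of the system prompt. The sketch below illustrates why under a simplifying assumption: a cache that can only reuse the longest shared token prefix between consecutive requests. This is a toy model, not any provider's actual caching behavior, and the helper names `shared_prefix_len` and `cached_fraction` are hypothetical.

```python
def shared_prefix_len(prev_tokens, curr_tokens):
    """Count how many leading tokens two requests have in common."""
    n = 0
    for a, b in zip(prev_tokens, curr_tokens):
        if a != b:
            break
        n += 1
    return n


def cached_fraction(prev_tokens, curr_tokens):
    """Fraction of the current request that a prefix-only cache could reuse."""
    if not curr_tokens:
        return 0.0
    return shared_prefix_len(prev_tokens, curr_tokens) / len(curr_tokens)


# Toy system prompt: 8 static tokens plus one dynamic token (e.g. a timestamp).
static = ["sys"] * 8
dynamic_first = (["t1"] + static, ["t2"] + static)  # dynamic content up front
dynamic_last = (static + ["t1"], static + ["t2"])   # dynamic content at the end

print(cached_fraction(*dynamic_first))  # nothing is reusable: first token differs
print(cached_fraction(*dynamic_last))   # 8 of 9 tokens are reusable
```

Under this model, a single changing token at the front invalidates the whole prefix, while the same token at the end leaves the static portion reusable.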

Title: Chaining the Evidence: Robust Reinforcement Learning for Deep Search Agents with Citation-Aware Rubric Rewards

Authors: Jiajie Zhang, Xin Lv, Ling Feng, Lei Hou, Juanzi Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.06021
Pdf URL: https://arxiv.org/pdf/2601.06021
Copy Paste: [[2601.06021]] Chaining the Evidence: Robust Reinforcement Learning for Deep Search Agents with Citation-Aware Rubric Rewards(https://arxiv.org/abs/2601.06021)
Keywords: llm, hallucination, agent
Abstract: Reinforcement learning (RL) has emerged as a critical technique for enhancing LLM-based deep search agents. However, existing approaches primarily rely on binary outcome rewards, which fail to capture the comprehensiveness and factuality of agents' reasoning process, and often lead to undesirable behaviors such as shortcut exploitation and hallucinations. To address these limitations, we propose Citation-aware Rubric Rewards (CaRR), a fine-grained reward framework for deep search agents that emphasizes reasoning comprehensiveness, factual grounding, and evidence connectivity. CaRR decomposes complex questions into verifiable single-hop rubrics and requires agents to satisfy these rubrics by explicitly identifying hidden entities, supporting them with correct citations, and constructing complete evidence chains that link to the predicted answer. We further introduce Citation-aware Group Relative Policy Optimization (C-GRPO), which combines CaRR and outcome rewards for training robust deep search agents. Experiments show that C-GRPO consistently outperforms standard outcome-based RL baselines across multiple deep search benchmarks. Our analysis also validates that C-GRPO effectively discourages shortcut exploitation, promotes comprehensive, evidence-grounded reasoning, and exhibits strong generalization to open-ended deep research tasks. Our code and data are available at this https URL.
摘要：强化学习 (RL) 已成为增强基于 LLM 的深度搜索代理的关键技术。然而，现有的方法主要依赖于二元结果奖励，无法捕捉智能体推理过程的全面性和真实性，并且常常导致不良行为，例如捷径利用和幻觉。为了解决这些限制，我们提出了 Citation-aware Rubric Rewards (CaRR)，这是一种针对深度搜索代理的细粒度奖励框架，强调推理的全面性、事实基础和证据连通性。 CaRR 将复杂的问题分解为可验证的单跳规则，并要求代理通过明确识别隐藏实体、用正确的引用支持它们以及构建链接到预测答案的完整证据链来满足这些规则。我们进一步引入引文感知组相对策略优化（C-GRPO），它结合了 CaRR 和结果奖励来训练强大的深度搜索代理。实验表明，C-GRPO 在多个深度搜索基准测试中始终优于基于结果的标准 RL 基线。我们的分析还验证了 C-GRPO 有效地阻止了捷径利用，促进了全面的、基于证据的推理，并对开放式深度研究任务表现出强大的泛化能力。我们的代码和数据可在此 https URL 中获取。
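
The C-GRPO abstract says the method combines rubric rewards with outcome rewards inside a group-relative policy optimization setup. The sketch below shows one way such a combination could feed a GRPO-style advantage, where each rollout's reward is normalized by the mean and standard deviation of its group. The additive mix and the `weight` parameter are assumptions for illustration; the abstract does not specify how the two rewards are combined.

```python
from statistics import mean, pstdev


def combined_rewards(outcome, rubric, weight=0.5):
    """Mix binary outcome rewards with rubric scores (additive mix is an assumption)."""
    return [o + weight * r for o, r in zip(outcome, rubric)]


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: normalize each reward within its rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts for one question: two reach the right answer (outcome = 1),
# but their hypothetical rubric scores differ.
outcome = [1, 1, 0, 0]
rubric = [1.0, 0.2, 0.6, 0.0]
advantages = group_relative_advantages(combined_rewards(outcome, rubric))
print(advantages)
```

With outcome rewards alone, the two correct rollouts would receive identical advantages; adding the rubric term separates the well-supported answer from the one that satisfied fewer rubrics.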

Title: AdaFuse: Adaptive Ensemble Decoding with Test-Time Scaling for LLMs

Authors: Chengming Cui, Tianxin Wei, Ziyi Chen, Ruizhong Qiu, Zhichen Zeng, Zhining Liu, Xuying Ning, Duo Zhou, Jingrui He
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.06022
Pdf URL: https://arxiv.org/pdf/2601.06022
Copy Paste: [[2601.06022]] AdaFuse: Adaptive Ensemble Decoding with Test-Time Scaling for LLMs(https://arxiv.org/abs/2601.06022)
Keywords: language model, llm
Abstract: Large language models (LLMs) exhibit complementary strengths arising from differences in pretraining data, model architectures, and decoding behaviors. Inference-time ensembling provides a practical way to combine these capabilities without retraining. However, existing ensemble approaches suffer from fundamental limitations. Most rely on fixed fusion granularity, which lacks the flexibility required for mid-generation adaptation and fails to adapt to different generation characteristics across tasks. To address these challenges, we propose AdaFuse, an adaptive ensemble decoding framework that dynamically selects semantically appropriate fusion units during generation. Rather than committing to a fixed granularity, AdaFuse adjusts fusion behavior on the fly based on the decoding context, with words serving as basic building blocks for alignment. To be specific, we introduce an uncertainty-based criterion to decide whether to apply ensembling at each decoding step. Under confident decoding states, the model continues generation directly. In less certain states, AdaFuse invokes a diversity-aware scaling strategy to explore alternative candidate continuations and inform ensemble decisions. This design establishes a synergistic interaction between adaptive ensembling and test-time scaling, where ensemble decisions guide targeted exploration, and the resulting diversity in turn strengthens ensemble quality. Experiments on open-domain question answering, arithmetic reasoning, and machine translation demonstrate that AdaFuse consistently outperforms strong ensemble baselines, achieving an average relative improvement of 6.88%. The code is available at this https URL.
摘要：大型语言模型 (LLM) 由于预训练数据、模型架构和解码行为的差异而表现出互补的优势。推理时间集成提供了一种无需重新训练即可组合这些功能的实用方法。然而，现有的集成方法存在根本性的局限性。大多数依赖于固定的融合粒度，缺乏中期适应所需的灵活性，无法适应跨任务的不同世代特征。为了应对这些挑战，我们提出了 AdaFuse，这是一种自适应集成解码框架，可在生成过程中动态选择语义上合适的融合单元。 AdaFuse 没有致力于固定的粒度，而是根据解码上下文动态调整融合行为，其中单词作为对齐的基本构建块。具体来说，我们引入了基于不确定性的标准来决定是否在每个解码步骤应用集成。在置信解码状态下，模型直接继续生成。在不太确定的状态下，AdaFuse 调用多样性感知扩展策略来探索替代候选延续并为整体决策提供信息。这种设计在自适应集成和测试时间缩放之间建立了协同相互作用，其中集成决策指导有针对性的探索，而由此产生的多样性反过来又增强了集成质量。开放域问答、算术推理和机器翻译的实验表明，AdaFuse 始终优于强大的集成基线，实现了 6.88% 的平均相对改进。该代码可从此 https URL 获取。
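
The AdaFuse abstract mentions an uncertainty-based criterion that decides, at each decoding step, whether to apply ensembling. A minimal sketch of such a gate follows, assuming Shannon entropy of the next-token distribution as the uncertainty measure and a fixed `threshold`. Both choices, and the name `should_ensemble`, are assumptions; the abstract does not state which uncertainty measure AdaFuse uses.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def should_ensemble(probs, threshold=0.5):
    """Return True when the model looks uncertain enough to consult the ensemble.

    Entropy and the threshold value are illustrative assumptions.
    """
    return entropy(probs) > threshold


confident_step = [0.97, 0.02, 0.01]  # peaked distribution: keep decoding directly
uncertain_step = [0.40, 0.30, 0.30]  # flat distribution: explore with the ensemble

print(should_ensemble(confident_step), should_ensemble(uncertain_step))
```

Gating this way keeps the cheap single-model path for confident steps and spends ensemble compute only where the distribution is flat.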