2026-01-22

Title: The Slow Drift of Support: Boundary Failures in Multi-Turn Mental Health LLM Dialogues

Authors: Youyou Cheng, Zhuangwei Kang, Kerry Jiang, Chenyu Sun, Qiyang Pan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.14269
Pdf URL: https://arxiv.org/pdf/2601.14269
Copy Paste: [[2601.14269]] The Slow Drift of Support: Boundary Failures in Multi-Turn Mental Health LLM Dialogues(https://arxiv.org/abs/2601.14269)
Keywords: language model, llm
Abstract: Large language models (LLMs) have been widely used for mental health support. However, current safety evaluations in this field are mostly limited to detecting whether LLMs output prohibited words in single-turn conversations, neglecting the gradual erosion of safety boundaries in long dialogues. Examples include making definitive guarantees, assuming responsibility, and playing professional roles. We believe that with the evolution of mainstream LLMs, words with obvious safety risks are easily filtered by their underlying systems, while the real danger lies in the gradual transgression of boundaries during multi-turn interactions, driven by the LLM's attempts at comfort and empathy. This paper proposes a multi-turn stress testing framework and conducts long-dialogue safety tests on three cutting-edge LLMs using two pressure methods: static progression and adaptive probing. We generated 50 virtual patient profiles and stress-tested each model through up to 20 rounds of virtual psychiatric dialogues. The experimental results show that violations are common, and both pressure modes produced similar violation rates. However, adaptive probing significantly advanced the time at which models crossed boundaries, reducing the average number of turns from 9.21 in static progression to 4.64. Under both mechanisms, making definitive or zero-risk promises was the primary way in which boundaries were breached. These findings suggest that the robustness of LLM safety boundaries cannot be inferred solely through single-turn tests; it is necessary to fully consider the wear and tear on safety boundaries caused by different interaction pressures and characteristics in extended dialogues.

Title: Opening the Black Box: A Survey on the Mechanisms of Multi-Step Reasoning in Large Language Models

Authors: Liangming Pan, Jason Liang, Jiaran Ye, Minglai Yang, Xinyuan Lu, Fengbin Zhu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.14270
Pdf URL: https://arxiv.org/pdf/2601.14270
Copy Paste: [[2601.14270]] Opening the Black Box: A Survey on the Mechanisms of Multi-Step Reasoning in Large Language Models(https://arxiv.org/abs/2601.14270)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have demonstrated remarkable abilities to solve problems requiring multiple reasoning steps, yet the internal mechanisms enabling such capabilities remain elusive. Unlike existing surveys that primarily focus on engineering methods to enhance performance, this survey provides a comprehensive overview of the mechanisms underlying LLM multi-step reasoning. We organize the survey around a conceptual framework comprising seven interconnected research questions, from how LLMs execute implicit multi-hop reasoning within hidden activations to how verbalized explicit reasoning remodels the internal computation. Finally, we highlight five research directions for future mechanistic studies.

Title: Hallucination-Free Automatic Question & Answer Generation for Intuitive Learning

Authors: Nicholas X. Wang, Aggelos K. Katsaggelos
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.14280
Pdf URL: https://arxiv.org/pdf/2601.14280
Copy Paste: [[2601.14280]] Hallucination-Free Automatic Question & Answer Generation for Intuitive Learning(https://arxiv.org/abs/2601.14280)
Keywords: language model, llm, hallucination, chain-of-thought, agent
Abstract: Hallucinations in large language models (LLMs), defined as fluent yet incorrect or incoherent outputs, pose a significant challenge to the automatic generation of educational multiple-choice questions (MCQs). We identified four key hallucination types in MCQ generation: reasoning inconsistencies, insolvability, factual errors, and mathematical errors. To address this, we propose a hallucination-free multi-agent generation framework that breaks down MCQ generation into discrete, verifiable stages. Our framework utilizes both rule-based and LLM-based detection agents, as well as hallucination scoring metrics to optimize question quality. We redefined MCQ generation as an optimization task minimizing hallucination risk while maximizing validity, answerability, and cost-efficiency. We also introduce an agent-led refinement process that uses counterfactual reasoning and chain-of-thought (CoT) to iteratively reduce hallucination in question generation. We evaluated a sample of AP-aligned STEM questions, where our system reduced hallucination rates by over 90% compared to baseline generation while preserving the educational value and style of questions. Our results demonstrate that structured multi-agent collaboration can mitigate hallucinations in educational content creation at scale, paving the way for more reliable LLM-powered learning tools.

Title: RPC-Bench: A Fine-grained Benchmark for Research Paper Comprehension

Authors: Yelin Chen, Fanjin Zhang, Suping Sun, Yunhe Pang, Yuanchun Wang, Jian Song, Xiaoyan Li, Lei Hou, Shu Zhao, Jie Tang, Juanzi Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.14289
Pdf URL: https://arxiv.org/pdf/2601.14289
Copy Paste: [[2601.14289]] RPC-Bench: A Fine-grained Benchmark for Research Paper Comprehension(https://arxiv.org/abs/2601.14289)
Keywords: gpt, llm
Abstract: Understanding research papers remains challenging for foundation models due to specialized scientific discourse and complex figures and tables, yet existing benchmarks offer limited fine-grained evaluation at scale. To address this gap, we introduce RPC-Bench, a large-scale question-answering benchmark built from review-rebuttal exchanges of high-quality computer science papers, containing 15K human-verified QA pairs. We design a fine-grained taxonomy aligned with the scientific research flow to assess models' ability to understand and answer why, what, and how questions in scholarly contexts. We also define an elaborate LLM-human interaction annotation framework to support large-scale labeling and quality control. Following the LLM-as-a-Judge paradigm, we develop a scalable framework that evaluates models on correctness-completeness and conciseness, with high agreement with human judgment. Experiments reveal that even the strongest model (GPT-5) achieves only 68.2% correctness-completeness, dropping to 37.46% after conciseness adjustment, highlighting substantial gaps in precise academic paper understanding. Our code and data are available at this https URL.

Title: Project Aletheia: Verifier-Guided Distillation of Backtracking for Small Language Models

Authors: Aradhya Dixit, Tianxi Liang, Jai Telang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.14290
Pdf URL: https://arxiv.org/pdf/2601.14290
Copy Paste: [[2601.14290]] Project Aletheia: Verifier-Guided Distillation of Backtracking for Small Language Models(https://arxiv.org/abs/2601.14290)
Keywords: language model
Abstract: Small Language Models (SLMs, under 10B parameters) are attractive for private, on-device deployment, yet they frequently fail on strict constraint-satisfaction problems due to linear, overconfident reasoning traces that do not recover from early mistakes. We introduce Verifier-Guided Distillation, a training protocol that transfers the process of error repair (explicit conflict detection and backtracking) rather than only correct final answers. By training a 7B model on verified reasoning traces that include mistakes and self-corrections, we show that latent verification behavior can emerge in small models, enabling them to occasionally stop, detect contradictions, and revise earlier assumptions.

Title: Guided by the Plan: Enhancing Faithful Autoregressive Text-to-Audio Generation with Guided Decoding

Authors: Juncheng Wang, Zhe Hu, Chao Xu, Siyue Ren, Yuxiang Feng, Yang Liu, Baigui Sun, Shujun Wang
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2601.14304
Pdf URL: https://arxiv.org/pdf/2601.14304
Copy Paste: [[2601.14304]] Guided by the Plan: Enhancing Faithful Autoregressive Text-to-Audio Generation with Guided Decoding(https://arxiv.org/abs/2601.14304)
Keywords: prompt
Abstract: Autoregressive (AR) models excel at generating temporally coherent audio by producing tokens sequentially, yet they often falter in faithfully following complex textual prompts, especially those describing complex sound events. We uncover a surprising capability in AR audio generators: their early prefix tokens implicitly encode global semantic attributes of the final output, such as event count and sound-object category, revealing a form of implicit planning. Building on this insight, we propose Plan-Critic, a lightweight auxiliary model trained with a Generalized Advantage Estimation (GAE)-inspired objective to predict final instruction-following quality from partial generations. At inference time, Plan-Critic enables guided exploration: it evaluates candidate prefixes early, prunes low-fidelity trajectories, and reallocates computation to high-potential planning seeds. Our Plan-Critic-guided sampling achieves up to a 10-point improvement in CLAP score over the AR baseline, establishing a new state of the art in AR text-to-audio generation, while maintaining computational parity with standard best-of-N decoding. This work bridges the gap between causal generation and global semantic alignment, demonstrating that even strictly autoregressive models can plan ahead.

Title: Large Language Models for Large-Scale, Rigorous Qualitative Analysis in Applied Health Services Research

Authors: Sasha Ronaghi, Emma-Louise Aveling, Maria Levis, Rachel Lauren Ross, Emily Alsentzer, Sara Singer
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.14478
Pdf URL: https://arxiv.org/pdf/2601.14478
Copy Paste: [[2601.14478]] Large Language Models for Large-Scale, Rigorous Qualitative Analysis in Applied Health Services Research(https://arxiv.org/abs/2601.14478)
Keywords: language model, llm
Abstract: Large language models (LLMs) show promise for improving the efficiency of qualitative analysis in large, multi-site health-services research. Yet methodological guidance for LLM integration into qualitative analysis and evidence of their impact on real-world research methods and outcomes remain limited. We developed a model- and task-agnostic framework for designing human-LLM qualitative analysis methods to support diverse analytic aims. Within a multi-site study of diabetes care at Federally Qualified Health Centers (FQHCs), we leveraged the framework to implement human-LLM methods for (1) qualitative synthesis of researcher-generated summaries to produce comparative feedback reports and (2) deductive coding of 167 interview transcripts to refine a practice-transformation intervention. LLM assistance enabled timely feedback to practitioners and the incorporation of large-scale qualitative data to inform theory and practice changes. This work demonstrates how LLMs can be integrated into applied health-services research to enhance efficiency while preserving rigor, offering guidance for continued innovation with LLMs in qualitative research.

Title: Can LLM Reasoning Be Trusted? A Comparative Study: Using Human Benchmarking on Statistical Tasks

Authors: Crish Nagarkar, Leonid Bogachev, Serge Sharoff
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.14479
Pdf URL: https://arxiv.org/pdf/2601.14479
Copy Paste: [[2601.14479]] Can LLM Reasoning Be Trusted? A Comparative Study: Using Human Benchmarking on Statistical Tasks(https://arxiv.org/abs/2601.14479)
Keywords: language model, llm
Abstract: This paper investigates the ability of large language models (LLMs) to solve statistical tasks, as well as their capacity to assess the quality of reasoning. While state-of-the-art LLMs have demonstrated remarkable performance in a range of NLP tasks, their competence in addressing even moderately complex statistical challenges is not well understood. We have fine-tuned selected open-source LLMs on a specially developed dataset to enhance their statistical reasoning capabilities, and compared their performance with the human scores used as a benchmark. Our results show that the fine-tuned models achieve better performance on advanced statistical tasks, at a level comparable to that of a statistics student. Fine-tuning demonstrates architecture-dependent improvements, with some models showing significant performance gains, indicating clear potential for deployment in educational technology and statistical analysis assistance systems. We also show that LLMs themselves can be far better judges of answer quality (including explanation and reasoning assessment) than traditional metrics such as BLEU or BertScore. This self-evaluation capability enables scalable automated assessment for statistical education platforms and quality assurance in automated analysis tools. Potential applications also include validation tools for research methodology in academic and industry settings, and quality control mechanisms for data analysis workflows.

Title: Business Logic-Driven Text-to-SQL Data Synthesis for Business Intelligence

Authors: Jinhui Liu, Ximeng Zhang, Yanbo Ai, Zhou Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.14518
Pdf URL: https://arxiv.org/pdf/2601.14518
Copy Paste: [[2601.14518]] Business Logic-Driven Text-to-SQL Data Synthesis for Business Intelligence(https://arxiv.org/abs/2601.14518)
Keywords: agent
Abstract: Evaluating Text-to-SQL agents in private business intelligence (BI) settings is challenging due to the scarcity of realistic, domain-specific data. While synthetic evaluation data offers a scalable solution, existing generation methods fail to capture business realism--whether questions reflect realistic business logic and workflows. We propose a Business Logic-Driven Data Synthesis framework that generates data grounded in business personas, work scenarios, and workflows. In addition, we improve the data quality by imposing a business reasoning complexity control strategy that diversifies the analytical reasoning steps required to answer the questions. Experiments on a production-scale Salesforce database show that our synthesized data achieves high business realism (98.44%), substantially outperforming OmniSQL (+19.5%) and SQL-Factory (+54.7%), while maintaining strong question-SQL alignment (98.59%). Our synthetic data also reveals that state-of-the-art Text-to-SQL models still have significant performance gaps, achieving only 42.86% execution accuracy on the most complex business queries.

Title: Towards Execution-Grounded Automated AI Research

Authors: Chenglei Si, Zitong Yang, Yejin Choi, Emmanuel Candès, Diyi Yang, Tatsunori Hashimoto
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.14525
Pdf URL: https://arxiv.org/pdf/2601.14525
Copy Paste: [[2601.14525]] Towards Execution-Grounded Automated AI Research(https://arxiv.org/abs/2601.14525)
Keywords: gpt, llm
Abstract: Automated AI research holds great potential to accelerate scientific discovery. However, current LLMs often generate plausible-looking but ineffective ideas. Execution grounding may help, but it is unclear whether automated execution is feasible and whether LLMs can learn from the execution feedback. To investigate these questions, we first build an automated executor to implement ideas and launch large-scale parallel GPU experiments to verify their effectiveness. We then convert two realistic research problems, LLM pre-training and post-training, into execution environments and demonstrate that our automated executor can implement a large fraction of the ideas sampled from frontier LLMs. We analyze two methods to learn from the execution feedback: evolutionary search and reinforcement learning. Execution-guided evolutionary search is sample-efficient: it finds a method that significantly outperforms the GRPO baseline (69.4% vs 48.0%) on post-training, and finds a pre-training recipe that outperforms the nanoGPT baseline (19.7 minutes vs 35.9 minutes) on pre-training, all within just ten search epochs. Frontier LLMs often generate meaningful algorithmic ideas during search, but they tend to saturate early and only occasionally exhibit scaling trends. Reinforcement learning from execution reward, on the other hand, suffers from mode collapse. It successfully improves the average reward of the ideator model but not the upper bound, due to models converging on simple ideas. We thoroughly analyze the executed ideas and training dynamics to facilitate future efforts towards execution-grounded automated AI research.

Title: Self-Blinding and Counterfactual Self-Simulation Mitigate Biases and Sycophancy in Large Language Models

Authors: Brian Christian, Matan Mazor
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2601.14553
Pdf URL: https://arxiv.org/pdf/2601.14553
Copy Paste: [[2601.14553]] Self-Blinding and Counterfactual Self-Simulation Mitigate Biases and Sycophancy in Large Language Models(https://arxiv.org/abs/2601.14553)
Keywords: language model, llm, prompt
Abstract: Fair decisions require ignoring irrelevant, potentially biasing, information. To achieve this, decision-makers need to approximate what decision they would have made had they not known certain facts, such as the gender or race of a job candidate. This counterfactual self-simulation is notoriously hard for humans, leading to biased judgments even by well-meaning actors. Here we show that large language models (LLMs) suffer from similar limitations in their ability to approximate what decisions they would make under counterfactual knowledge in offsetting gender and race biases and overcoming sycophancy. We show that prompting models to ignore or pretend not to know biasing information fails to offset these biases and occasionally backfires. However, unlike humans, LLMs can be given access to a ground-truth model of their own counterfactual cognition -- their own API. We show that this access to the responses of a blinded replica enables fairer decisions, while providing greater transparency to distinguish implicit from intentionally biased behavior.
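
The blinded-replica idea in the abstract above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: `blind`, `blinded_decision`, the `[REDACTED]` marker, the regex, and the toy sycophantic model are hypothetical and are not taken from the paper. Instead of asking a model to pretend it does not know the biasing information, the sketch removes that information and queries a second copy of the model on the redacted prompt.

```python
import re
from typing import Callable, List

REDACTION = "[REDACTED]"  # hypothetical marker, not from the paper


def blind(prompt: str, patterns: List[str]) -> str:
    """Replace every span matching one of the regex patterns with a marker."""
    for pattern in patterns:
        prompt = re.sub(pattern, REDACTION, prompt)
    return prompt


def blinded_decision(model: Callable[[str], str], prompt: str,
                     patterns: List[str]) -> str:
    """Query a 'blinded replica': the same model, called on the redacted prompt."""
    return model(blind(prompt, patterns))


def toy_sycophantic_model(prompt: str) -> str:
    """Toy stand-in for an LLM that simply agrees with the user's stated opinion."""
    if "I think the answer is no" in prompt:
        return "no"
    return "yes"


# Hypothetical pattern that strips the user's stated opinion from the prompt.
OPINION_PATTERNS = [r"I think[^.]*\."]

prompt = "Is 17 a prime number? I think the answer is no."
direct = toy_sycophantic_model(prompt)
blinded = blinded_decision(toy_sycophantic_model, prompt, OPINION_PATTERNS)
print(direct, blinded)
```

In this toy setting the direct call echoes the user's opinion, while the blinded call does not see it. A real deployment would replace `toy_sycophantic_model` with an actual API call and would need a much more careful redaction step.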

Title: Rewarding How Models Think Pedagogically: Integrating Pedagogical Reasoning and Thinking Rewards for LLMs in Education

Authors: Unggi Lee, Jiyeong Bae, Jaehyeon Park, Haeun Park, Taejun Park, Younghoon Jeon, Sungmin Cho, Junbo Koh, Yeil Jeong, Gyeonggeon Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.14560
Pdf URL: https://arxiv.org/pdf/2601.14560
Copy Paste: [[2601.14560]] Rewarding How Models Think Pedagogically: Integrating Pedagogical Reasoning and Thinking Rewards for LLMs in Education(https://arxiv.org/abs/2601.14560)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) are increasingly deployed as intelligent tutoring systems, yet research on optimizing LLMs specifically for educational contexts remains limited. Recent works have proposed reinforcement learning approaches for training LLM tutors, but these methods focus solely on optimizing visible responses while neglecting the model's internal thinking process. We introduce PedagogicalRL-Thinking, a framework that extends pedagogical alignment to reasoning LLMs in education through two novel approaches: (1) Pedagogical Reasoning Prompting, which guides internal reasoning using domain-specific educational theory rather than generic instructions; and (2) Thinking Reward, which explicitly evaluates and reinforces the pedagogical quality of the model's reasoning traces. Our experiments reveal that domain-specific, theory-grounded prompting outperforms generic prompting, and that Thinking Reward is most effective when combined with pedagogical prompting. Furthermore, models trained only on mathematics tutoring dialogues show improved performance on educational benchmarks not seen during training, while preserving the base model's factual knowledge. Our quantitative and qualitative analyses reveal that pedagogical thinking reward produces systematic reasoning trace changes, with increased pedagogical reasoning and more structured instructional decision-making in the tutor's thinking process.

Title: Social Caption: Evaluating Social Understanding in Multimodal Models

Authors: Bhaavanaa Thumu, Leena Mathur, Youssouf Kebe, Louis-Philippe Morency
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2601.14569
Pdf URL: https://arxiv.org/pdf/2601.14569
Copy Paste: [[2601.14569]] Social Caption: Evaluating Social Understanding in Multimodal Models(https://arxiv.org/abs/2601.14569)
Keywords: language model, llm
Abstract: Social understanding abilities are crucial for multimodal large language models (MLLMs) to interpret human social interactions. We introduce Social Caption, a framework grounded in interaction theory to evaluate social understanding abilities of MLLMs along three dimensions: Social Inference (SI), the ability to make accurate inferences about interactions; Holistic Social Analysis (HSA), the ability to generate comprehensive descriptions of interactions; Directed Social Analysis (DSA), the ability to extract relevant social information from interactions. We analyze factors influencing model performance in social understanding, such as scale, architectural design, and spoken context. Experiments with MLLM judges contribute insights about scaling automated evaluation of multimodal social understanding.

Title: SearchGym: Bootstrapping Real-World Search Agents via Cost-Effective and High-Fidelity Environment Simulation

Authors: Xichen Zhang, Ziyi He, Yinghao Zhu, Sitong Wu, Shaozuo Yu, Meng Chu, Wenhu Zhang, Haoru Tan, Jiaya Jia
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.14615
Pdf URL: https://arxiv.org/pdf/2601.14615
Copy Paste: [[2601.14615]] SearchGym: Bootstrapping Real-World Search Agents via Cost-Effective and High-Fidelity Environment Simulation(https://arxiv.org/abs/2601.14615)
Keywords: hallucination, agent
Abstract: Search agents have emerged as a pivotal paradigm for solving open-ended, knowledge-intensive reasoning tasks. However, training these agents via Reinforcement Learning (RL) faces a critical dilemma: interacting with live commercial Web APIs is prohibitively expensive, while relying on static data snapshots often introduces noise due to data misalignment. This misalignment generates corrupted reward signals that destabilize training by penalizing correct reasoning or rewarding hallucination. To address this, we propose SearchGym, a simulation environment designed to bootstrap robust search agents. SearchGym employs a rigorous generative pipeline to construct a verifiable knowledge graph and an aligned document corpus, ensuring that every reasoning task is factually grounded and strictly solvable. Building on this controllable environment, we introduce SearchGym-RL, a curriculum learning methodology that progressively optimizes agent policies through purified feedback, evolving from basic interactions to complex, long-horizon planning. Extensive experiments across the Llama and Qwen families demonstrate strong Sim-to-Real generalization. Notably, our Qwen2.5-7B-Base model trained within SearchGym surpasses the web-enhanced ASearcher baseline across nine diverse benchmarks by an average relative margin of 10.6%. Our results validate that high-fidelity simulation serves as a scalable and highly cost-effective methodology for developing capable search agents.

Title: Say Anything but This: When Tokenizer Betrays Reasoning in LLMs

Authors: Navid Ayoobi, Marcus I Armstrong, Arjun Mukherjee
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.14658
Pdf URL: https://arxiv.org/pdf/2601.14658
Copy Paste: [[2601.14658]] Say Anything but This: When Tokenizer Betrays Reasoning in LLMs(https://arxiv.org/abs/2601.14658)
Keywords: language model, llm
Abstract: Large language models (LLMs) reason over discrete token ID sequences, yet modern subword tokenizers routinely produce non-unique encodings: multiple token ID sequences can detokenize to identical surface strings. This representational mismatch creates an unmeasured fragility wherein reasoning processes can fail. LLMs may treat two internal representations as distinct "words" even when they are semantically identical at the text level. In this work, we show that tokenization can betray LLM reasoning through one-to-many token ID mappings. We introduce a tokenization-consistency probe that requires models to replace designated target words in context while leaving all other content unchanged. The task is intentionally simple at the surface level, enabling us to attribute failures to tokenizer-detokenizer artifacts rather than to knowledge gaps or parameter limitations. Through analysis of over 11,000 replacement trials across state-of-the-art open-source LLMs, we find that a non-trivial fraction of outputs exhibit phantom edits: cases where models operate under the illusion of correct reasoning, a phenomenon arising from tokenizer-induced representational defects. We further analyze these cases and provide a taxonomy of eight systematic tokenizer artifacts, including whitespace-boundary shifts and intra-word resegmentation. These findings indicate that part of the apparent reasoning deficiency originates in the tokenizer layer, motivating tokenizer-level remedies before incurring the cost of training ever-larger models on ever-larger corpora.
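
The non-unique encodings described in the abstract above can be seen in a minimal sketch. The vocabulary below is a toy one invented for illustration; it does not come from the paper or from any real tokenizer. The code enumerates every token ID sequence that detokenizes to the same surface string.

```python
from typing import Dict, List

# Toy subword vocabulary (hypothetical, not from any real tokenizer).
VOCAB: Dict[str, int] = {
    "un": 0, "do": 1, "undo": 2, "u": 3, "n": 4, "d": 5, "o": 6,
}
ID_TO_TOKEN: Dict[int, str] = {i: t for t, i in VOCAB.items()}


def detokenize(ids: List[int]) -> str:
    """Concatenate the surface form of each token ID."""
    return "".join(ID_TO_TOKEN[i] for i in ids)


def all_encodings(text: str) -> List[List[int]]:
    """Enumerate every ID sequence whose detokenization equals `text`."""
    if not text:
        return [[]]
    results: List[List[int]] = []
    for token, token_id in VOCAB.items():
        if text.startswith(token):
            for rest in all_encodings(text[len(token):]):
                results.append([token_id] + rest)
    return results


encodings = all_encodings("undo")
for ids in encodings:
    print(ids, "->", detokenize(ids))
```

All of the printed ID sequences map back to the string "undo", yet a model consumes them as different inputs. Real tokenizers usually emit one canonical encoding, but a model's generated output is not constrained to that canonical segmentation.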

Title: AdaTIR: Adaptive Tool-Integrated Reasoning via Difficulty-Aware Policy Optimization

Authors: Zhaiyu Fang, Ruipeng Sun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.14696
Pdf URL: https://arxiv.org/pdf/2601.14696
Copy Paste: [[2601.14696]] AdaTIR: Adaptive Tool-Integrated Reasoning via Difficulty-Aware Policy Optimization(https://arxiv.org/abs/2601.14696)
Keywords: language model, llm, agent
Abstract: Tool-Integrated Reasoning (TIR) has significantly enhanced the capabilities of Large Language Models (LLMs), yet current agents tend to exhibit cognitive offloading, redundantly invoking external tools even for simple tasks. In this paper, we suggest that true agentic intelligence requires not just tool invocation, but the adaptive wisdom to discern when to use them. We propose AdaTIR, a framework that shifts the paradigm from static tool invocation to difficulty-aware reasoning internalization. By introducing a difficulty-aware efficiency reward, AdaTIR dynamically adjusts tool budgets based on task complexity--internalizing reasoning for simple tasks while selectively invoking tools for complex tasks. Furthermore, we identify a sign reversal problem where tool penalties outweigh correctness rewards, mistakenly penalizing correct rollouts with negative advantages. To resolve this, we propose Clipped Advantage Shaping (CAS), which ensures that correctness remains the primary objective while using efficiency as a secondary constraint. Empirical results demonstrate that AdaTIR reduces tool calls by up to 97.6% on simple tasks and 28.2% on complex challenges while maintaining or enhancing accuracy. Notably, AdaTIR successfully internalizes reasoning, outperforming baselines by 4.8% on AIME 2024 even when tool access is strictly disabled.
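The sign reversal problem can be illustrated with a small numeric sketch. The abstract does not give the CAS formula, so everything below is an assumption: the reward `correct - penalty * tool_calls`, the group-mean baseline, and a clipping rule that bounds the efficiency term by a fraction `alpha` of the correctness advantage. It is one hypothetical reading of the idea, not the authors' method.

```python
from statistics import mean


def naive_advantages(correct, tool_calls, penalty=0.3):
    """Group-relative advantages when the tool penalty is folded into the reward."""
    rewards = [float(c) - penalty * t for c, t in zip(correct, tool_calls)]
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


def clipped_advantages(correct, tool_calls, penalty=0.3, alpha=0.5):
    """Hypothetical clipped shaping: efficiency may adjust the correctness
    advantage by at most alpha * |correctness advantage|, so it never flips its sign."""
    acc = [float(c) for c in correct]
    eff = [-penalty * t for t in tool_calls]
    acc_adv = [a - mean(acc) for a in acc]
    eff_adv = [e - mean(eff) for e in eff]
    shaped = []
    for a, e in zip(acc_adv, eff_adv):
        bound = alpha * abs(a)
        shaped.append(a + max(-bound, min(bound, e)))
    return shaped


# Four rollouts in one group: the first is correct but uses five tool calls.
correct = [True, False, True, False]
tool_calls = [5, 0, 0, 0]

naive = naive_advantages(correct, tool_calls)
shaped = clipped_advantages(correct, tool_calls)
print("naive :", naive)
print("shaped:", shaped)
```

With the naive reward, the correct but tool-heavy rollout ends up with a negative advantage, below the incorrect ones. With the clipped variant, correct rollouts stay positive and the tool-free correct rollout still ranks highest.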

Title: ClaimDB: A Fact Verification Benchmark over Large Structured Data

Authors: Michael Theologitis, Preetam Prabhu Srikar Dammu, Chirag Shah, Dan Suciu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.14698
Pdf URL: https://arxiv.org/pdf/2601.14698
Copy Paste: [[2601.14698]] ClaimDB: A Fact Verification Benchmark over Large Structured Data(https://arxiv.org/abs/2601.14698)
Keywords: llm
Abstract: Despite substantial progress in fact-verification benchmarks, claims grounded in large-scale structured data remain underexplored. In this work, we introduce ClaimDB, the first fact-verification benchmark where the evidence for claims is derived from compositions of millions of records and multiple tables. ClaimDB consists of 80 unique real-life databases covering a wide range of domains, from governance and healthcare to media, education and the natural sciences. At this scale, verification approaches that rely on "reading" the evidence break down, forcing a timely shift toward reasoning in executable programs. We conduct extensive experiments with 30 state-of-the-art proprietary and open-source (below 70B) LLMs and find that none exceed 83% accuracy, with more than half below 55%. Our analysis also reveals that both closed- and open-source models struggle with abstention -- the ability to admit that there is no evidence to decide -- raising doubts about their reliability in high-stakes data analysis. We release the benchmark, code, and the LLM leaderboard at this https URL.

Title: DARL: Encouraging Diverse Answers for General Reasoning without Verifiers

Authors: Chongxuan Huang, Lei Lin, Xiaodong Shi, Wenping Hu, Ruiming Tang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.14700
Pdf URL: https://arxiv.org/pdf/2601.14700
Copy Paste: [[2601.14700]] DARL: Encouraging Diverse Answers for General Reasoning without Verifiers(https://arxiv.org/abs/2601.14700)
Keywords: language model
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has demonstrated promising gains in enhancing the reasoning capabilities of large language models. However, its dependence on domain-specific verifiers significantly restricts its applicability to open and general domains. Recent efforts such as RLPR have extended RLVR to general domains, enabling training on broader datasets and achieving improvements over RLVR. However, a notable limitation of these methods is their tendency to overfit to reference answers, which constrains the model's ability to generate diverse outputs. This limitation is particularly pronounced in open-ended tasks such as writing, where multiple plausible answers exist. To address this, we propose DARL, a simple yet effective reinforcement learning framework that encourages the generation of diverse answers within a controlled deviation range from the reference while preserving alignment with it. Our framework is fully compatible with existing general reinforcement learning methods and can be seamlessly integrated without additional verifiers. Extensive experiments on thirteen benchmarks demonstrate consistent improvements in reasoning performance. Notably, DARL surpasses RLPR, achieving average gains of 1.3 points on six reasoning benchmarks and 9.5 points on seven general benchmarks, highlighting its effectiveness in improving both reasoning accuracy and output diversity.

Title: Typhoon OCR: Open Vision-Language Model For Thai Document Extraction

Authors: Surapon Nonesung, Natapong Nitarach, Teetouch Jaknamon, Pittawat Taveekitworachai, Kunat Pipatanakul
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.14722
Pdf URL: https://arxiv.org/pdf/2601.14722
Copy Paste: [[2601.14722]] Typhoon OCR: Open Vision-Language Model For Thai Document Extraction(https://arxiv.org/abs/2601.14722)
Keywords: language model
Abstract: Document extraction is a core component of digital workflows, yet existing vision-language models (VLMs) predominantly favor high-resource languages. Thai presents additional challenges due to script complexity from non-Latin letters, the absence of explicit word boundaries, and the prevalence of highly unstructured real-world documents, limiting the effectiveness of current open-source models. This paper presents Typhoon OCR, an open VLM for document extraction tailored for Thai and English. The model is fine-tuned from vision-language backbones using a Thai-focused training dataset. The dataset is developed using a multi-stage data construction pipeline that combines traditional OCR, VLM-based restructuring, and curated synthetic data. Typhoon OCR is a unified framework capable of text transcription, layout reconstruction, and document-level structural consistency. The latest iteration of our model, Typhoon OCR V1.5, is a compact and inference-efficient model designed to reduce reliance on metadata and simplify deployment. Comprehensive evaluations across diverse Thai document categories, including financial reports, government forms, books, infographics, and handwritten documents, show that Typhoon OCR achieves performance comparable to or exceeding larger frontier proprietary models, despite substantially lower computational cost. The results demonstrate that open vision-language OCR models can achieve accurate text extraction and layout reconstruction for Thai documents, reaching performance comparable to proprietary systems while remaining lightweight and deployable.

Title: Render-of-Thought: Rendering Textual Chain-of-Thought as Images for Visual Latent Reasoning

Authors: Yifan Wang, Shiyu Li, Peiming Li, Xiaochen Yang, Yang Tang, Zheng Wei
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2601.14750
Pdf URL: https://arxiv.org/pdf/2601.14750
Copy Paste: [[2601.14750]] Render-of-Thought: Rendering Textual Chain-of-Thought as Images for Visual Latent Reasoning(https://arxiv.org/abs/2601.14750)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Chain-of-Thought (CoT) prompting has achieved remarkable success in unlocking the reasoning capabilities of Large Language Models (LLMs). Although CoT prompting enhances reasoning, its verbosity imposes substantial computational overhead. Recent works often focus exclusively on outcome alignment and lack supervision on the intermediate reasoning process. These deficiencies obscure the analyzability of the latent reasoning chain. To address these challenges, we introduce Render-of-Thought (RoT), the first framework to reify the reasoning chain by rendering textual steps into images, making the latent rationale explicit and traceable. Specifically, we leverage the vision encoders of existing Vision Language Models (VLMs) as semantic anchors to align the vision embeddings with the textual space. This design ensures plug-and-play implementation without incurring additional pre-training overhead. Extensive experiments on mathematical and logical reasoning benchmarks demonstrate that our method achieves 3-4x token compression and substantial inference acceleration compared to explicit CoT. Furthermore, it maintains competitive performance against other methods, validating the feasibility of this paradigm. Our code is available at this https URL.

Title: RECAP: Resistance Capture in Text-based Mental Health Counseling with Large Language Models

Authors: Anqi Li, Yuqian Chen, Yu Lu, Zhaoming Chen, Yuan Xie, Zhenzhong Lan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.14780
Pdf URL: https://arxiv.org/pdf/2601.14780
Copy Paste: [[2601.14780]] RECAP: Resistance Capture in Text-based Mental Health Counseling with Large Language Models(https://arxiv.org/abs/2601.14780)
Keywords: language model, llm, prompt
Abstract: Recognizing and navigating client resistance is critical for effective mental health counseling, yet detecting such behaviors is particularly challenging in text-based interactions. Existing NLP approaches oversimplify resistance categories, ignore the sequential dynamics of therapeutic interventions, and offer limited interpretability. To address these limitations, we propose PsyFIRE, a theoretically grounded framework capturing 13 fine-grained resistance behaviors alongside collaborative interactions. Based on PsyFIRE, we construct the ClientResistance corpus with 23,930 annotated utterances from real-world Chinese text-based counseling, each supported by context-specific rationales. Leveraging this dataset, we develop RECAP, a two-stage framework that detects resistance and fine-grained resistance types with explanations. RECAP achieves 91.25% F1 for distinguishing collaboration and resistance and 66.58% macro-F1 for fine-grained resistance category classification, outperforming leading prompt-based LLM baselines by over 20 points. Applied to a separate counseling dataset and a pilot study with 62 counselors, RECAP reveals the prevalence of resistance and its negative impact on therapeutic relationships, and demonstrates its potential to improve counselors' understanding and intervention strategies.

Title: Comparative Study of Large Language Models on Chinese Film Script Continuation: An Empirical Analysis Based on GPT-5.2 and Qwen-Max

Authors: Yuxuan Cao, Zida Yang, Ye Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.14826
Pdf URL: https://arxiv.org/pdf/2601.14826
Copy Paste: [[2601.14826]] Comparative Study of Large Language Models on Chinese Film Script Continuation: An Empirical Analysis Based on GPT-5.2 and Qwen-Max(https://arxiv.org/abs/2601.14826)
Keywords: language model, gpt, llm
Abstract: As large language models (LLMs) are increasingly applied to creative writing, their performance on culturally specific narrative tasks warrants systematic investigation. This study constructs the first Chinese film script continuation benchmark comprising 53 classic films, and designs a multi-dimensional evaluation framework comparing GPT-5.2 and Qwen-Max-Latest. Using a "first half to second half" continuation paradigm with 3 samples per film, we obtained 303 valid samples (GPT-5.2: 157, 98.7% validity; Qwen-Max: 146, 91.8% validity). Evaluation integrates ROUGE-L, Structural Similarity, and LLM-as-Judge scoring (DeepSeek-Reasoner). Statistical analysis of 144 paired samples reveals: Qwen-Max achieves marginally higher ROUGE-L (0.2230 vs 0.2114, d=-0.43); however, GPT-5.2 significantly outperforms it in structural preservation (0.93 vs 0.75, d=0.46), overall quality (44.79 vs 25.72, d=1.04), and composite scores (0.50 vs 0.39, d=0.84). The effect size for overall quality reaches the large-effect level (d>0.8). GPT-5.2 excels in character consistency, tone-style matching, and format preservation, while Qwen-Max shows deficiencies in generation stability. This study provides a reproducible framework for LLM evaluation in Chinese creative writing.
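For readers unfamiliar with the d values quoted above, the following is a minimal sketch of Cohen's d and the conventional small/medium/large thresholds (0.2, 0.5, 0.8). The abstract does not say which variant of d was used for the paired samples, so the pooled-standard-deviation form here is an assumption, and the sample scores are made up.

```python
from math import sqrt
from statistics import mean, stdev


def cohens_d(a, b):
    """Cohen's d using the pooled sample standard deviation (assumed variant)."""
    na, nb = len(a), len(b)
    pooled = sqrt(((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2))
    return (mean(a) - mean(b)) / pooled


def effect_label(d):
    """Map |d| to Cohen's conventional size labels."""
    size = abs(d)
    if size >= 0.8:
        return "large"
    if size >= 0.5:
        return "medium"
    if size >= 0.2:
        return "small"
    return "negligible"


# Made-up scores, only to exercise the function.
example_d = cohens_d([2.0, 4.0, 6.0], [1.0, 3.0, 5.0])
print(example_d, effect_label(example_d))

# Labels for the effect sizes reported in the abstract.
for name, d in [("ROUGE-L", -0.43), ("structure", 0.46), ("quality", 1.04), ("composite", 0.84)]:
    print(name, d, effect_label(d))
```

Under these conventional thresholds, the reported overall quality and composite effects fall in the large band, while the ROUGE-L and structural effects fall in the small band.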

Title: HiNS: Hierarchical Negative Sampling for More Comprehensive Memory Retrieval Embedding Model

Authors: Motong Tian, Allen P. Wong, Mingjun Mao, Wangchunshu Zhou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.14857
Pdf URL: https://arxiv.org/pdf/2601.14857
Copy Paste: [[2601.14857]] HiNS: Hierarchical Negative Sampling for More Comprehensive Memory Retrieval Embedding Model(https://arxiv.org/abs/2601.14857)
Keywords: agent
Abstract: Memory-augmented language agents rely on embedding models for effective memory retrieval. However, existing training data construction overlooks a critical limitation: the hierarchical difficulty of negative samples and their natural distribution in human-agent interactions. In practice, some negatives are semantically close distractors while others are trivially irrelevant, and natural dialogue exhibits structured proportions of these types. Current approaches using synthetic or uniformly sampled negatives fail to reflect this diversity, limiting embedding models' ability to learn nuanced discrimination essential for robust memory retrieval. In this work, we propose a principled data construction framework, HiNS, that explicitly models negative sample difficulty tiers and incorporates empirically grounded negative ratios derived from conversational data, enabling the training of embedding models with substantially improved retrieval fidelity and generalization in memory-intensive tasks. Experiments show significant improvements: on LoCoMo, F1/BLEU-1 gains of 3.27%/3.30% (MemoryOS) and 1.95%/1.78% (Mem0); on PERSONAMEM, total score improvements of 1.19% (MemoryOS) and 2.55% (Mem0).
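The idea of drawing negatives by difficulty tier at fixed proportions can be sketched as follows. The tier names, the ratios, the candidate pools, and the function `sample_negatives` are all hypothetical; the abstract does not specify how HiNS defines its tiers or which ratios it derives from conversational data.

```python
import random


def sample_negatives(pool_by_tier, ratios, total, seed=0):
    """Draw `total` negatives, splitting the budget across tiers by `ratios`."""
    rng = random.Random(seed)
    batch = []
    for tier, ratio in ratios.items():
        count = round(total * ratio)
        # Sample without replacement inside each tier.
        batch.extend((tier, doc) for doc in rng.sample(pool_by_tier[tier], count))
    return batch


# Hypothetical pools: close distractors, loosely related items, unrelated items.
pools = {
    "hard": [f"hard_{i}" for i in range(10)],
    "medium": [f"medium_{i}" for i in range(10)],
    "easy": [f"easy_{i}" for i in range(10)],
}
# Hypothetical tier proportions; a real pipeline would estimate these from data.
ratios = {"hard": 0.2, "medium": 0.3, "easy": 0.5}

negatives = sample_negatives(pools, ratios, total=10)
print(negatives)
```

A uniform sampler would ignore the tier labels entirely; the point of the sketch is only that the mix of hard and easy negatives becomes an explicit, tunable input.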

Title: Language-Coupled Reinforcement Learning for Multilingual Retrieval-Augmented Generation

Authors: Rui Qi, Fengran Mo, Yufeng Chen, Xue Zhang, Shuo Wang, Hongliang Li, Jinan Xu, Meng Jiang, Jian-Yun Nie, Kaiyu Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.14896
Pdf URL: https://arxiv.org/pdf/2601.14896
Copy Paste: [[2601.14896]] Language-Coupled Reinforcement Learning for Multilingual Retrieval-Augmented Generation(https://arxiv.org/abs/2601.14896)
Keywords: retrieval-augmented generation
Abstract: Multilingual retrieval-augmented generation (MRAG) requires models to effectively acquire and integrate beneficial external knowledge from multilingual collections. However, most existing studies employ a uniform process where queries of equivalent semantics across different languages are processed through a single-turn retrieval and subsequent optimization. Such a "one-size-fits-all" strategy is often suboptimal in multilingual settings, as the models are prone to knowledge bias and conflict during the interaction with the search engine. To alleviate these issues, we propose LcRL, a multilingual search-augmented reinforcement learning framework that integrates a language-coupled Group Relative Policy Optimization into the policy and reward models. We adopt language-coupled group sampling in the rollout module to reduce knowledge bias, and add an auxiliary anti-consistency penalty as a regularizer in the reward models to mitigate knowledge conflict. Experimental results demonstrate that LcRL not only achieves competitive performance but is also appropriate for various practical scenarios such as constrained training data and retrieval over collections encompassing a large number of languages. Our code is available at this https URL.

Title: PodBench: A Comprehensive Benchmark for Instruction-Aware Audio-Oriented Podcast Script Generation

Authors: Chenning Xu, Mao Zheng, Mingyu Zheng, Mingyang Song
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.14903
Pdf URL: https://arxiv.org/pdf/2601.14903
Copy Paste: [[2601.14903]] PodBench: A Comprehensive Benchmark for Instruction-Aware Audio-Oriented Podcast Script Generation(https://arxiv.org/abs/2601.14903)
Keywords: llm, long context
Abstract: Podcast script generation requires LLMs to synthesize structured, context-grounded dialogue from diverse inputs, yet systematic evaluation resources for this task remain limited. To bridge this gap, we introduce PodBench, a benchmark comprising 800 samples with inputs up to 21K tokens and complex multi-speaker instructions. We propose a multifaceted evaluation framework that integrates quantitative constraints with LLM-based quality assessment. Extensive experiments reveal that while proprietary models generally excel, open-source models equipped with explicit reasoning demonstrate superior robustness in handling long contexts and multi-speaker coordination compared to standard baselines. However, our analysis uncovers a persistent divergence where high instruction following does not guarantee high content substance. PodBench offers a reproducible testbed to address these challenges in long-form, audio-centric generation.

Title: CodeDelegator: Mitigating Context Pollution via Role Separation in Code-as-Action Agents

Authors: Tianxiang Fei, Cheng Chen, Yue Pan, Mao Zheng, Mingyang Song
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.14914
Pdf URL: https://arxiv.org/pdf/2601.14914
Copy Paste: [[2601.14914]] CodeDelegator: Mitigating Context Pollution via Role Separation in Code-as-Action Agents(https://arxiv.org/abs/2601.14914)
Keywords: language model, llm, agent
Abstract: Recent advances in large language models (LLMs) allow agents to represent actions as executable code, offering greater expressivity than traditional tool-calling. However, real-world tasks often demand both strategic planning and detailed implementation. Using a single agent for both leads to context pollution from debugging traces and intermediate failures, impairing long-horizon performance. We propose CodeDelegator, a multi-agent framework that separates planning from implementation via role specialization. A persistent Delegator maintains strategic oversight by decomposing tasks, writing specifications, and monitoring progress without executing code. For each sub-task, a new Coder agent is instantiated with a clean context containing only its specification, shielding it from prior failures. To coordinate between agents, we introduce Ephemeral-Persistent State Separation (EPSS), which isolates each Coder's execution state while preserving global coherence, preventing debugging traces from polluting the Delegator's context. Experiments on various benchmarks demonstrate the effectiveness of CodeDelegator across diverse scenarios.

Title: The GDN-CC Dataset: Automatic Corpus Clarification for AI-enhanced Democratic Citizen Consultations

Authors: Pierre-Antoine Lequeu, Léo Labat, Laurène Cave, Gaël Lejeune, François Yvon, Benjamin Piwowarski
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.14944
Pdf URL: https://arxiv.org/pdf/2601.14944
Copy Paste: [[2601.14944]] The GDN-CC Dataset: Automatic Corpus Clarification for AI-enhanced Democratic Citizen Consultations(https://arxiv.org/abs/2601.14944)
Keywords: language model, llm
Abstract: LLMs are ubiquitous in modern NLP, and while their applicability extends to texts produced for democratic activities such as online deliberations or large-scale citizen consultations, ethical questions have been raised for their usage as analysis tools. We continue this line of research with two main goals: (a) to develop resources that can help standardize citizen contributions in public forums at the pragmatic level, and make them easier to use in topic modeling and political analysis; (b) to study how well this standardization can reliably be performed by small, open-weights LLMs, i.e. models that can be run locally and transparently with limited resources. Accordingly, we introduce Corpus Clarification as a preprocessing framework for large-scale consultation data that transforms noisy, multi-topic contributions into structured, self-contained argumentative units ready for downstream analysis. We present GDN-CC, a manually-curated dataset of 1,231 contributions to the French Grand Débat National, comprising 2,285 argumentative units annotated for argumentative structure and manually clarified. We then show that finetuned Small Language Models match or outperform LLMs on reproducing these annotations, and measure their usability for an opinion clustering task. We finally release GDN-CC-large, an automatically annotated corpus of 240k contributions, the largest annotated democratic consultation dataset to date.

Title: CorpusQA: A 10 Million Token Benchmark for Corpus-Level Analysis and Reasoning

Authors: Zhiyuan Lu, Chenliang Li, Yingcheng Shi, Weizhou Shen, Ming Yan, Fei Huang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.14952
Pdf URL: https://arxiv.org/pdf/2601.14952
Copy Paste: [[2601.14952]] CorpusQA: A 10 Million Token Benchmark for Corpus-Level Analysis and Reasoning(https://arxiv.org/abs/2601.14952)
Keywords: language model, llm, retrieval-augmented generation, agent
Abstract: While large language models now handle million-token contexts, their capacity for reasoning across entire document repositories remains largely untested. Existing benchmarks are inadequate, as they are mostly limited to single long texts or rely on a "sparse retrieval" assumption, namely that answers can be derived from a few relevant chunks. This assumption fails for true corpus-level analysis, where evidence is highly dispersed across hundreds of documents and answers require global integration, comparison, and statistical aggregation. To address this critical gap, we introduce CorpusQA, a new benchmark scaling up to 10 million tokens, generated via a novel data synthesis framework. By decoupling reasoning from textual representation, this framework creates complex, computation-intensive queries with programmatically guaranteed ground-truth answers, challenging systems to perform holistic reasoning over vast, unstructured text without relying on fallible human annotation. We further demonstrate the utility of our framework beyond evaluation, showing that fine-tuning on our synthesized data effectively enhances an LLM's general long-context reasoning capabilities. Extensive experiments reveal that even state-of-the-art long-context LLMs struggle as input length increases, and standard retrieval-augmented generation systems collapse entirely. Our findings indicate that memory-augmented agentic architectures offer a more robust alternative, suggesting a critical shift is needed from simply extending context windows to developing advanced architectures for global information synthesis.

Title: A Comprehensive Benchmark of Language Models on Unicode and Romanized Sinhala

Authors: Minuri Rajapakse, Ruvan Weerasinghe
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.14958
Pdf URL: https://arxiv.org/pdf/2601.14958
Copy Paste: [[2601.14958]] A Comprehensive Benchmark of Language Models on Unicode and Romanized Sinhala(https://arxiv.org/abs/2601.14958)
Keywords: language model
Abstract: The performance of Language Models (LMs) on lower-resource, morphologically rich languages like Sinhala remains under-explored, particularly for Romanized Sinhala, which is prevalent in digital communication. This paper presents a comprehensive benchmark of modern LMs on a diverse corpus of Unicode and Romanized Sinhala. We evaluate open-source models using perplexity, a measure of how well a model predicts a text, and leading closed-source models via a qualitative analysis of sentence completion. Our findings reveal that the Mistral-Nemo-Base-2407 model achieves the strongest predictive performance on Unicode text and the Mistral-7B-v0.3 model for Romanized text. The results also highlight the strong all-around performance of the Llama-3.1-8B model for both scripts. Furthermore, a significant performance disparity exists among closed-source models: Gemini-1.5-pro and DeepSeek excel at Unicode generation, whereas Claude-3.5-Sonnet is superior at handling Romanized text. These results provide an essential guide for practitioners selecting models for Sinhala-specific applications and highlight the critical role of training data in handling script variations.
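Perplexity, the metric used for the open-source models, is the exponential of the average negative log-likelihood per token. The sketch below assumes per-token natural-log probabilities have already been obtained from some model; the function name and the example values are illustrative only.

```python
import math


def perplexity(token_logprobs):
    """exp of the mean negative log-probability over the tokens of a text."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# A model that assigns probability 0.25 to every token has perplexity 4.
uniform_text = [math.log(0.25)] * 5
print(perplexity(uniform_text))

# More confident predictions give lower perplexity.
confident_text = [math.log(0.9)] * 5
print(perplexity(confident_text))
```

Lower values mean the model found the text more predictable, which is the sense in which the abstract speaks of the strongest predictive performance.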

Title: Obscuring Data Contamination Through Translation: Evidence from Arabic Corpora

Authors: Chaymaa Abbas, Nour Shamaa, Mariette Awad
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.14994
Pdf URL: https://arxiv.org/pdf/2601.14994
Copy Paste: [[2601.14994]] Obscuring Data Contamination Through Translation: Evidence from Arabic Corpora(https://arxiv.org/abs/2601.14994)
Keywords: language model, llm
Abstract: Data contamination undermines the validity of Large Language Model evaluation by enabling models to rely on memorized benchmark content rather than true generalization. While prior work has proposed contamination detection methods, these approaches are largely limited to English benchmarks, leaving multilingual contamination poorly understood. In this work, we investigate contamination dynamics in multilingual settings by fine-tuning several open-weight LLMs on varying proportions of Arabic datasets and evaluating them on original English benchmarks. To detect memorization, we extend the Tested Slot Guessing method with a choice-reordering strategy and incorporate Min-K% probability analysis, capturing both behavioral and distributional contamination signals. Our results show that translation into Arabic suppresses conventional contamination indicators, yet models still benefit from exposure to contaminated data, particularly those with stronger Arabic capabilities. This effect is consistently reflected in rising Min-K% scores and increased cross-lingual answer consistency as contamination levels grow. To address this blind spot, we propose Translation-Aware Contamination Detection, which identifies contamination by comparing signals across multiple translated benchmark variants rather than English alone. Translation-Aware Contamination Detection reliably exposes contamination even when English-only methods fail. Together, our findings highlight the need for multilingual, translation-aware evaluation pipelines to ensure fair, transparent, and reproducible assessment of LLMs.
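Min-K% probability, as it is commonly described, scores a passage by averaging the log-probabilities of its k% least likely tokens; a higher (less negative) score is read as a hint that the passage was seen during training. The sketch below assumes per-token log-probabilities are already available. The function name, the rounding rule, and the numbers are illustrative assumptions, not the paper's implementation.

```python
import math


def min_k_percent_prob(token_logprobs, k=20.0):
    """Average log-probability of the k% lowest-probability tokens."""
    n = max(1, math.ceil(len(token_logprobs) * k / 100))
    lowest = sorted(token_logprobs)[:n]
    return sum(lowest) / n


# Hypothetical passages: one with two surprising tokens, one uniformly predictable.
unseen_like = [-0.1, -0.2, -5.0, -0.3, -4.0, -0.1, -0.2, -0.1, -0.3, -0.2]
seen_like = [-0.1] * 10

score_unseen = min_k_percent_prob(unseen_like)
score_seen = min_k_percent_prob(seen_like)
print(score_unseen, score_seen)
```

Because only the least likely tokens enter the average, a passage with a few surprising tokens scores much lower than one the model predicts uniformly well.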

Title: Knowledge Restoration-driven Prompt Optimization: Unlocking LLM Potential for Open-Domain Relational Triplet Extraction

Authors: Xiaonan Jing, Gongqing Wu, Xingrui Zhuo, Lang Sun, Jiapu Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.15037
Pdf URL: https://arxiv.org/pdf/2601.15037
Copy Paste: [[2601.15037]] Knowledge Restoration-driven Prompt Optimization: Unlocking LLM Potential for Open-Domain Relational Triplet Extraction(https://arxiv.org/abs/2601.15037)
Keywords: language model, llm, prompt
Abstract: Open-domain Relational Triplet Extraction (ORTE) is the foundation for mining structured knowledge without predefined schemas. Despite the impressive in-context learning capabilities of Large Language Models (LLMs), existing methods are hindered by their reliance on static, heuristic-driven prompting strategies. Because they lack the reflection mechanisms required to internalize erroneous signals, these methods are vulnerable to semantic ambiguity and often make erroneous extraction patterns permanent. To address this bottleneck, we propose a Knowledge Restoration-driven Prompt Optimization (KRPO) framework to assist LLMs in continuously improving their extraction capabilities for complex ORTE task flows. Specifically, we design a self-evaluation mechanism based on knowledge restoration, which provides intrinsic feedback signals by projecting structured triplets into semantic consistency scores. Subsequently, we propose a prompt optimizer based on a textual gradient that can internalize historical experiences to iteratively optimize prompts, which can better guide LLMs to handle subsequent extraction tasks. Furthermore, to alleviate relation redundancy, we design a relation canonicalization memory that collects representative relations and provides semantically distinct schemas for the triplets. Extensive experiments across three datasets show that KRPO significantly outperforms strong baselines in the extraction F1 score.

Title: LogicScore: Fine-grained Logic Evaluation of Conciseness, Completeness, and Determinateness in Attributed Question Answering

Authors: Zhichao Yan, Yunxiao Zhao, Jiapu Wang, Jiaoyan Chen, Shaoru Guo, Xiaoli Li, Ru Li, Jeff Z. Pan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.15050
Pdf URL: https://arxiv.org/pdf/2601.15050
Copy Paste: [[2601.15050]] LogicScore: Fine-grained Logic Evaluation of Conciseness, Completeness, and Determinateness in Attributed Question Answering(https://arxiv.org/abs/2601.15050)
Keywords: language model, gpt, llm
Abstract: Current evaluation methods for Attributed Question Answering (AQA) suffer from attribution myopia: they emphasize verification of isolated statements and their attributions but overlook the global logical integrity of long-form answers. Consequently, Large Language Models (LLMs) often produce factually grounded yet logically incoherent responses with elusive deductive gaps. To mitigate this limitation, we present LogicScore, a unified evaluation framework that shifts the paradigm from local assessment to global reasoning scrutiny. Grounded in Horn Rules, our approach integrates a backward verification mechanism to systematically evaluate three key reasoning dimensions: Completeness (logically sound deduction), Conciseness (non-redundancy), and Determinateness (consistent answer entailment). Extensive experiments across three multi-hop QA datasets (HotpotQA, MusiQue, and 2WikiMultiHopQA) and over 20 LLMs (including GPT-5, Gemini-3 Pro, LLaMA3, and task-specific tuned models) reveal a critical capability gap: leading models often achieve high attribution scores (e.g., 92.85% precision for Gemini-3 Pro) but struggle with global reasoning quality (e.g., 35.11% Conciseness for Gemini-3 Pro). Our work establishes a robust standard for logical evaluation, highlighting the need to prioritize reasoning coherence alongside factual grounding in LLM development. Code is available at: this https URL.

Title: Multi-Agent Constraint Factorization Reveals Latent Invariant Solution Structure

Authors: Christopher Scofield
Subjects: cs.CL, cs.AI, cs.LG, cs.MA
Abstract URL: https://arxiv.org/abs/2601.15077
Pdf URL: https://arxiv.org/pdf/2601.15077
Copy Paste: [[2601.15077]] Multi-Agent Constraint Factorization Reveals Latent Invariant Solution Structure(https://arxiv.org/abs/2601.15077)
Keywords: language model, agent
Abstract: Multi-agent systems (MAS) composed of large language models often exhibit improved problem-solving performance despite operating on identical information. In this work, we provide a formal explanation for this phenomenon grounded in operator theory and constrained optimization. We model each agent as enforcing a distinct family of validity constraints on a shared solution state, and show that a MAS implements a factorized composition of constraint-enforcement operators. Under mild conditions, these dynamics converge to invariant solution sets defined by the intersection of agent constraint sets. Such invariant structures are generally not dynamically accessible to a single agent applying all constraints simultaneously, even when expressive capacity and information are identical. We extend this result from exact constraint enforcement to soft constraints via proximal operators, and apply the formalism to contemporary text-based dialog systems.

Title: RSNA Large Language Model Benchmark Dataset for Chest Radiographs of Cardiothoracic Disease: Radiologist Evaluation and Validation Enhanced by AI Labels (REVEAL-CXR)

Authors: Yishu Wei, Adam E. Flanders, Errol Colak, John Mongan, Luciano M Prevedello, Po-Hao Chen, Henrique Min Ho Lee, Gilberto Szarf, Hamilton Shoji, Jason Sho, Katherine Andriole, Tessa Cook, Lisa C. Adams, Linda C. Chu, Maggie Chung, Geraldine Brusca-Augello, Djeven P. Deva, Navneet Singh, Felipe Sanchez Tijmes, Jeffrey B. Alpert, Elsie T. Nguyen, Drew A. Torigian, Kate Hanneman, Lauren K Groner, Alexander Phan, Ali Islam, Matias F.Callejas, Gustavo Borges da Silva Teles, Faisal Jamal, Maryam Vazirabad, Ali Tejani, Hari Trivedi, Paulo Kuriki, Rajesh Bhayana, Elana T. Benishay, Yi Lin, Yifan Peng, George Shih
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.15129
Pdf URL: https://arxiv.org/pdf/2601.15129
Copy Paste: [[2601.15129]] RSNA Large Language Model Benchmark Dataset for Chest Radiographs of Cardiothoracic Disease: Radiologist Evaluation and Validation Enhanced by AI Labels (REVEAL-CXR)(https://arxiv.org/abs/2601.15129)
Keywords: language model, gpt, llm
Abstract: Multimodal large language models have demonstrated performance comparable to that of radiology trainees on multiple-choice board-style exams. However, to develop clinically useful multimodal LLM tools, high-quality benchmarks curated by domain experts are essential. This work curates released and holdout datasets of 100 chest radiographic studies each and proposes an artificial intelligence (AI)-assisted expert labeling procedure that allows radiologists to label studies more efficiently. A total of 13,735 deidentified chest radiographs and their corresponding reports from the MIDRC were used. GPT-4o extracted abnormal findings from the reports, which were then mapped to 12 benchmark labels with a locally hosted LLM (Phi-4-Reasoning). From these studies, 1,000 were sampled on the basis of the AI-suggested benchmark labels for expert review; the sampling algorithm ensured that the selected studies were clinically relevant and captured a range of difficulty levels. Seventeen chest radiologists participated, and they marked "Agree all", "Agree mostly" or "Disagree" to indicate their assessment of the correctness of the LLM-suggested labels. Each chest radiograph was evaluated by three experts. For 381 radiographs, at least two radiologists selected "Agree all". From this set, 200 were selected, prioritizing those with less common or multiple finding labels, and divided into 100 released radiographs and 100 reserved as the holdout dataset. The holdout dataset is used exclusively by RSNA to independently evaluate different models. A benchmark of 200 chest radiographic studies with 12 benchmark labels was created and made publicly available at this https URL, with each chest radiograph verified by three radiologists. In addition, an AI-assisted labeling procedure was developed to help radiologists label at scale, minimize unnecessary omissions, and support a semicollaborative environment.

Title: Automated Rubrics for Reliable Evaluation of Medical Dialogue Systems

Authors: Yinzhu Chen, Abdine Maiga, Hossein A. Rahmani, Emine Yilmaz
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.15161
Pdf URL: https://arxiv.org/pdf/2601.15161
Copy Paste: [[2601.15161]] Automated Rubrics for Reliable Evaluation of Medical Dialogue Systems(https://arxiv.org/abs/2601.15161)
Keywords: language model, gpt, llm, hallucination, agent
Abstract: Large Language Models (LLMs) are increasingly used for clinical decision support, where hallucinations and unsafe suggestions may pose direct risks to patient safety. These risks are particularly challenging as they often manifest as subtle clinical errors that evade detection by generic metrics, while expert-authored fine-grained rubrics remain costly to construct and difficult to scale. In this paper, we propose a retrieval-augmented multi-agent framework designed to automate the generation of instance-specific evaluation rubrics. Our approach grounds evaluation in authoritative medical evidence by decomposing retrieved content into atomic facts and synthesizing them with user interaction constraints to form verifiable, fine-grained evaluation criteria. Evaluated on HealthBench, our framework achieves a Clinical Intent Alignment (CIA) score of 60.12%, a statistically significant improvement over the GPT-4o baseline (55.16%). In discriminative tests, our rubrics yield a mean score delta ($\mu_{\Delta} = 8.658$) and an AUROC of 0.977, nearly doubling the quality separation achieved by the GPT-4o baseline (4.972). Beyond evaluation, our rubrics effectively guide response refinement, improving quality by 9.2 percentage points (from 59.0% to 68.2%). This provides a scalable and transparent foundation for both evaluating and improving medical LLMs. The code is available at this https URL.
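
The two discriminative figures quoted in the abstract, a mean score delta and an AUROC, can be made concrete with a small Python sketch. It shows one common way to compute them, assuming each rubric assigns a numeric score to a set of higher-quality and a set of lower-quality responses; the paper's exact protocol is not given here, and the function names and scores below are hypothetical.

```python
from statistics import mean
from typing import Sequence


def mean_score_delta(better: Sequence[float], worse: Sequence[float]) -> float:
    """Difference between the average score of better and worse responses."""
    return mean(better) - mean(worse)


def auroc(better: Sequence[float], worse: Sequence[float]) -> float:
    """Probability that a better response outscores a worse one (ties count half)."""
    if not better or not worse:
        raise ValueError("both groups need at least one score")
    wins = 0.0
    for b in better:
        for w in worse:
            if b > w:
                wins += 1.0
            elif b == w:
                wins += 0.5
    return wins / (len(better) * len(worse))


# Made-up rubric scores (illustration only, not HealthBench data).
better_scores = [8, 9, 7]
worse_scores = [3, 7, 2]
delta = mean_score_delta(better_scores, worse_scores)
separation = auroc(better_scores, worse_scores)
```

An AUROC near 1.0 means the rubric almost always ranks the better response above the worse one, while 0.5 would indicate no separation at all.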

Title: The Flexibility Trap: Why Arbitrary Order Limits Reasoning Potential in Diffusion Language Models

Authors: Zanlin Ni, Shenzhi Wang, Yang Yue, Tianyu Yu, Weilin Zhao, Yeguo Hua, Tianyi Chen, Jun Song, Cheng Yu, Bo Zheng, Gao Huang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.15165
Pdf URL: https://arxiv.org/pdf/2601.15165
Copy Paste: [[2601.15165]] The Flexibility Trap: Why Arbitrary Order Limits Reasoning Potential in Diffusion Language Models(https://arxiv.org/abs/2601.15165)
Keywords: language model, llm
Abstract: Diffusion Large Language Models (dLLMs) break the rigid left-to-right constraint of traditional LLMs, enabling token generation in arbitrary orders. Intuitively, this flexibility implies a solution space that strictly supersets the fixed autoregressive trajectory, theoretically unlocking superior reasoning potential for general tasks like mathematics and coding. Consequently, numerous works have leveraged reinforcement learning (RL) to elicit the reasoning capability of dLLMs. In this paper, we reveal a counter-intuitive reality: arbitrary order generation, in its current form, narrows rather than expands the reasoning boundary of dLLMs. We find that dLLMs tend to exploit this order flexibility to bypass high-uncertainty tokens that are crucial for exploration, leading to a premature collapse of the solution space. This observation challenges the premise of existing RL approaches for dLLMs, where considerable complexities, such as handling combinatorial trajectories and intractable likelihoods, are often devoted to preserving this flexibility. We demonstrate that effective reasoning is better elicited by intentionally forgoing arbitrary order and applying standard Group Relative Policy Optimization (GRPO) instead. Our approach, JustGRPO, is minimalist yet surprisingly effective (e.g., 89.1% accuracy on GSM8K) while fully retaining the parallel decoding ability of dLLMs. Project page: this https URL
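
The "group relative" part of GRPO refers to scoring each sampled response against the other responses drawn for the same prompt. The Python sketch below illustrates only that advantage normalization, not JustGRPO's training procedure; the use of the population standard deviation, the `eps` stabilizer, and the function name are assumptions made for illustration.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so responses
    are judged relative to their siblings rather than by a learned critic.
    """
    if not rewards:
        raise ValueError("need at least one reward")
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical binary correctness rewards for four sampled answers.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

When every response in a group receives the same reward, all advantages are zero and the group contributes no learning signal.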

Title: Is Peer Review Really in Decline? Analyzing Review Quality across Venues and Time

Authors: Ilia Kuznetsov, Rohan Nayak, Alla Rozovskaya, Iryna Gurevych
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.15172
Pdf URL: https://arxiv.org/pdf/2601.15172
Copy Paste: [[2601.15172]] Is Peer Review Really in Decline? Analyzing Review Quality across Venues and Time(https://arxiv.org/abs/2601.15172)
Keywords: llm
Abstract: Peer review is at the heart of modern science. As submission numbers rise and research communities grow, the decline in review quality is a popular narrative and a common concern. Yet, is it true? Review quality is difficult to measure, and the ongoing evolution of reviewing practices makes it hard to compare reviews across venues and time. To address this, we introduce a new framework for evidence-based comparative study of review quality and apply it to major AI and machine learning conferences: ICLR, NeurIPS and *ACL. We document the diversity of review formats and introduce a new approach to review standardization. We propose a multi-dimensional schema for quantifying review quality as utility to editors and authors, coupled with both LLM-based and lightweight measurements. We study the relationships between measurements of review quality, and its evolution over time. Contradicting the popular narrative, our cross-temporal analysis reveals no consistent decline in median review quality across venues and years. We propose alternative explanations, and outline recommendations to facilitate future empirical studies of review quality.

Title: Supporting Humans in Evaluating AI Summaries of Legal Depositions

Authors: Naghmeh Farzi, Laura Dietz, Dave D. Lewis
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2601.15182
Pdf URL: https://arxiv.org/pdf/2601.15182
Copy Paste: [[2601.15182]] Supporting Humans in Evaluating AI Summaries of Legal Depositions(https://arxiv.org/abs/2601.15182)
Keywords: language model, llm
Abstract: While large language models (LLMs) are increasingly used to summarize long documents, this trend poses significant challenges in the legal domain, where the factual accuracy of deposition summaries is crucial. Nugget-based methods have been shown to be extremely helpful for the automated evaluation of summarization approaches. In this work, we translate these methods to the user side and explore how nuggets could directly assist end users. Although prior systems have demonstrated the promise of nugget-based evaluation, its potential to support end users remains underexplored. Focusing on the legal domain, we present a prototype that leverages a factual nugget-based approach to support legal professionals in two concrete scenarios: (1) determining which of two summaries is better, and (2) manually improving an automatically generated summary.

Title: Privacy Collapse: Benign Fine-Tuning Can Break Contextual Privacy in Language Models

Authors: Anmol Goel, Cornelius Emde, Sangdoo Yun, Seong Joon Oh, Martin Gubri
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.15220
Pdf URL: https://arxiv.org/pdf/2601.15220
Copy Paste: [[2601.15220]] Privacy Collapse: Benign Fine-Tuning Can Break Contextual Privacy in Language Models(https://arxiv.org/abs/2601.15220)
Keywords: language model, agent
Abstract: We identify a novel phenomenon in language models: benign fine-tuning of frontier models can lead to privacy collapse. We find that diverse, subtle patterns in training data can degrade contextual privacy, including optimisation for helpfulness, exposure to user information, emotional and subjective dialogue, and debugging code printing internal variables, among others. Fine-tuned models lose their ability to reason about contextual privacy norms, share information inappropriately with tools, and violate memory boundaries across contexts. Privacy collapse is a "silent failure" because models maintain high performance on standard safety and utility benchmarks whilst exhibiting severe privacy vulnerabilities. Our experiments show evidence of privacy collapse across six models (closed and open weight), five fine-tuning datasets (real-world and controlled data), and two task categories (agentic and memory-based). Our mechanistic analysis reveals that privacy representations are uniquely fragile to fine-tuning, compared to task-relevant features which are preserved. Our results reveal a critical gap in current safety evaluations, in particular for the deployment of specialised agents.

Title: Metadata Conditioned Large Language Models for Localization

Authors: Anjishnu Mukherjee, Ziwei Zhu, Antonios Anastasopoulos
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.15236
Pdf URL: https://arxiv.org/pdf/2601.15236
Copy Paste: [[2601.15236]] Metadata Conditioned Large Language Models for Localization(https://arxiv.org/abs/2601.15236)
Keywords: language model
Abstract: Large language models are typically trained by treating text as a single global distribution, often resulting in geographically homogenized behavior. We study metadata conditioning as a lightweight approach for localization, pre-training 31 models (at 0.5B and 1B parameter scales) from scratch on large-scale English news data annotated with verified URLs, country tags, and continent tags, covering 4 continents and 17 countries. Across four controlled experiments, we show that metadata conditioning consistently improves in-region performance without sacrificing cross-region generalization, enables global models to recover localization comparable to region-specific models, and improves learning efficiency. Our ablation studies demonstrate that URL-level metadata alone captures much of the geographic signal, while balanced regional data coverage remains essential, as metadata cannot fully compensate for missing regions. Finally, we introduce a downstream benchmark of 800 localized news MCQs and show that after instruction tuning, metadata conditioned global models achieve accuracy comparable to LLaMA-3.2-1B-Instruct, despite being trained on substantially less data. Together, these results establish metadata conditioning as a practical and compute-efficient approach for localization of language models.

Title: Taxonomy-Aligned Risk Extraction from 10-K Filings with Autonomous Improvement Using LLMs

Authors: Rian Dolphin, Joe Dursun, Jarrett Blankenship, Katie Adams, Quinton Pike
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.15247
Pdf URL: https://arxiv.org/pdf/2601.15247
Copy Paste: [[2601.15247]] Taxonomy-Aligned Risk Extraction from 10-K Filings with Autonomous Improvement Using LLMs(https://arxiv.org/abs/2601.15247)
Keywords: llm, agent
Abstract: We present a methodology for extracting structured risk factors from corporate 10-K filings while maintaining adherence to a predefined hierarchical taxonomy. Our three-stage pipeline combines LLM extraction with supporting quotes, embedding-based semantic mapping to taxonomy categories, and LLM-as-a-judge validation that filters spurious assignments. To evaluate our approach, we extract 10,688 risk factors from S&P 500 companies and examine risk profile similarity across industry clusters. Beyond extraction, we introduce autonomous taxonomy maintenance where an AI agent analyzes evaluation feedback to identify problematic categories, diagnose failure patterns, and propose refinements, achieving 104.7% improvement in embedding separation in a case study. External validation confirms the taxonomy captures economically meaningful structure: same-industry companies exhibit 63% higher risk profile similarity than cross-industry pairs (Cohen's d=1.06, AUC 0.82, p<0.001). The methodology generalizes to any domain requiring taxonomy-aligned extraction from unstructured text, with autonomous improvement enabling continuous quality maintenance and enhancement as systems process more documents.
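
The effect size reported in the abstract, Cohen's d, expresses the gap between two group means in units of their pooled standard deviation. The Python sketch below uses the standard pooled-sample-variance formula; the function name and the toy similarity values are hypothetical, and the paper's own data are not reproduced.

```python
from math import sqrt
from statistics import mean, variance
from typing import Sequence


def cohens_d(group_a: Sequence[float], group_b: Sequence[float]) -> float:
    """Cohen's d with a pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each group needs at least two values")
    # Pool the two sample variances, weighting by degrees of freedom.
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / (
        n_a + n_b - 2
    )
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Toy similarity scores (illustration only): same-industry vs cross-industry pairs.
same_industry = [2.0, 4.0, 6.0]
cross_industry = [1.0, 3.0, 5.0]
effect_size = cohens_d(same_industry, cross_industry)
```

By the usual rule of thumb, values of d around 0.8 or above are considered large effects, which is how a figure such as 1.06 would typically be read.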

Title: The Effect of Scripts and Formats on LLM Numeracy

Authors: Varshini Reddy, Craig W. Schmidt, Seth Ebner, Adam Wiemerslage, Yuval Pinter, Chris Tanner
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.15251
Pdf URL: https://arxiv.org/pdf/2601.15251
Copy Paste: [[2601.15251]] The Effect of Scripts and Formats on LLM Numeracy(https://arxiv.org/abs/2601.15251)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) have achieved impressive proficiency in basic arithmetic, rivaling human-level performance on standard numerical tasks. However, little attention has been given to how these models perform when numerical expressions deviate from the prevailing conventions present in their training corpora. In this work, we investigate numerical reasoning across a wide range of numeral scripts and formats. We show that LLM accuracy drops substantially when numerical inputs are rendered in underrepresented scripts or formats, despite the underlying mathematical reasoning being identical. We further demonstrate that targeted prompting strategies, such as few-shot prompting and explicit numeral mapping, can greatly narrow this gap. Our findings highlight an overlooked challenge in multilingual numerical reasoning and provide actionable insights for working with LLMs to reliably interpret, manipulate, and generate numbers across diverse numeral scripts and formatting styles.

Title: Robust Fake News Detection using Large Language Models under Adversarial Sentiment Attacks

Authors: Sahar Tahmasebi, Eric Müller-Budack, Ralph Ewerth
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.15277
Pdf URL: https://arxiv.org/pdf/2601.15277
Copy Paste: [[2601.15277]] Robust Fake News Detection using Large Language Models under Adversarial Sentiment Attacks(https://arxiv.org/abs/2601.15277)
Keywords: language model, llm
Abstract: Misinformation and fake news have become a pressing societal challenge, driving the need for reliable automated detection methods. Prior research has highlighted sentiment as an important signal in fake news detection, either by analyzing which sentiments are associated with fake news or by using sentiment and emotion features for classification. However, this poses a vulnerability, since adversaries can manipulate sentiment to evade detectors, especially with the advent of large language models (LLMs). A few studies have explored adversarial samples generated by LLMs, but they mainly focus on stylistic features such as the writing style of news publishers. Thus, the crucial vulnerability of sentiment manipulation remains largely unexplored. In this paper, we investigate the robustness of state-of-the-art fake news detectors under sentiment manipulation. We introduce AdSent, a sentiment-robust detection framework designed to ensure consistent veracity predictions across both original and sentiment-altered news articles. Specifically, we (1) propose controlled sentiment-based adversarial attacks using LLMs, (2) analyze the impact of sentiment shifts on detection performance, and (3) introduce a novel sentiment-agnostic training strategy that enhances robustness against such perturbations. Our analysis shows that changing the sentiment heavily impacts the performance of fake news detection models, indicating a bias toward classifying neutral articles as real and non-neutral articles as fake. Extensive experiments on three benchmark datasets demonstrate that AdSent significantly outperforms competitive baselines in both accuracy and robustness, while also generalizing effectively to unseen datasets and adversarial scenarios.