2026-01-08

Title: DeepResearch-Slice: Bridging the Retrieval-Utilization Gap via Explicit Text Slicing

Authors: Shuo Lu, Yinuo Xu, Jianjie Cheng, Lingxiao He, Meng Wang, Jian Liang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.03261
Pdf URL: https://arxiv.org/pdf/2601.03261
Copy Paste: [[2601.03261]] DeepResearch-Slice: Bridging the Retrieval-Utilization Gap via Explicit Text Slicing(https://arxiv.org/abs/2601.03261)
Keywords: agent
Abstract: Deep Research agents predominantly optimize search policies to maximize retrieval probability. However, we identify a critical bottleneck: the retrieval-utilization gap, where models fail to use gold evidence even after it is retrieved, due to context blindness in noisy environments. To bridge this gap, we propose DeepResearch-Slice, a simple yet effective neuro-symbolic framework. Unlike implicit attention, our approach predicts precise span indices to perform a deterministic hard filter before reasoning. Extensive evaluations across six benchmarks show substantial robustness gains. Applying our method to frozen backbones yields a 73 percent relative improvement, from 19.1 percent to 33.0 percent, effectively mitigating noise without requiring parameter updates to the reasoning model. These results highlight the need for explicit grounding mechanisms in open-ended research.
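Illustration: a minimal sketch of the "deterministic hard filter" idea described in the abstract, assuming the predicted spans are character offsets given as [start, end) pairs. The function names, the " ... " separator, and the merging of overlapping spans are illustrative assumptions, not details confirmed by the paper. The second helper only checks the arithmetic behind the reported relative improvement.

```python
from typing import List, Tuple


def slice_spans(text: str, spans: List[Tuple[int, int]]) -> str:
    """Keep only the characters covered by the given [start, end) spans.

    Hypothetical helper: the span format and separator are assumptions.
    """
    # Clamp spans to the text bounds and sort them by start offset.
    cleaned = sorted((max(0, s), min(len(text), e)) for s, e in spans if s < e)
    merged: List[Tuple[int, int]] = []
    for start, end in cleaned:
        if start >= end:
            continue  # span fell entirely outside the text after clamping
        if merged and start <= merged[-1][1]:
            # Overlapping or touching spans are merged into one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    # Everything outside the kept spans is dropped before reasoning.
    return " ... ".join(text[s:e] for s, e in merged)


def relative_improvement(before: float, after: float) -> float:
    """Relative gain of `after` over `before`, as a fraction."""
    return (after - before) / before


if __name__ == "__main__":
    doc = "noise A gold evidence B noise"
    print(slice_spans(doc, [(8, 21)]))
    print(f"{relative_improvement(19.1, 33.0):.1%}")
```

The filter is deterministic in the sense that, once the indices are predicted, the text passed to the reasoning model is fixed by plain slicing rather than by attention weights.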

Title: Internal Reasoning vs. External Control: A Thermodynamic Analysis of Sycophancy in Large Language Models

Authors: Edward Y. Chang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.03263
Pdf URL: https://arxiv.org/pdf/2601.03263
Copy Paste: [[2601.03263]] Internal Reasoning vs. External Control: A Thermodynamic Analysis of Sycophancy in Large Language Models(https://arxiv.org/abs/2601.03263)
Keywords: language model, gpt
Abstract: Large Language Models frequently exhibit sycophancy, prioritizing user agreeableness over correctness. We investigate whether this requires external regulation or can be mitigated by internal reasoning alone. Using CAP-GSM8K (N=500), an adversarial dataset, we evaluate internal (CoT) versus external (RCA) mechanisms across GPT-3.5, GPT-4o, and GPT-5.1. Our results reveal the structural limits of internal reasoning: it causes performance collapse in weak models (the Prioritization Paradox) and leaves an 11.4% final output gap in frontier models. In contrast, RCA structurally eliminates sycophancy (0.0%) across all tiers. We synthesize these findings into a thermodynamic hierarchy: hybrid systems achieve Resonance (optimal efficiency) only when capabilities are matched and strong, while weak or mismatched pairs succumb to Dissonance and Entropy. This confirms that external structural constraints are strictly necessary to guarantee safety.

Title: Jailbreak-Zero: A Path to Pareto Optimal Red Teaming for Large Language Models

Authors: Kai Hu, Abhinav Aggarwal, Mehran Khodabandeh, David Zhang, Eric Hsin, Li Chen, Ankit Jain, Matt Fredrikson, Akash Bharadwaj
Subjects: cs.CL, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2601.03265
Pdf URL: https://arxiv.org/pdf/2601.03265
Copy Paste: [[2601.03265]] Jailbreak-Zero: A Path to Pareto Optimal Red Teaming for Large Language Models(https://arxiv.org/abs/2601.03265)
Keywords: language model, gpt, llm, prompt
Abstract: This paper introduces Jailbreak-Zero, a novel red teaming methodology that shifts the paradigm of Large Language Model (LLM) safety evaluation from a constrained example-based approach to a more expansive and effective policy-based framework. By leveraging an attack LLM to generate a high volume of diverse adversarial prompts and then fine-tuning this attack model with a preference dataset, Jailbreak-Zero achieves Pareto optimality across the crucial objectives of policy coverage, attack strategy diversity, and prompt fidelity to real user inputs. The empirical evidence demonstrates the superiority of this method, showcasing significantly higher attack success rates against both open-source and proprietary models like GPT-4o and Claude 3.5 when compared to existing state-of-the-art techniques. Crucially, Jailbreak-Zero accomplishes this while producing human-readable and effective adversarial prompts with minimal need for human intervention, thereby presenting a more scalable and comprehensive solution for identifying and mitigating the safety vulnerabilities of LLMs.

Title: Benchmarking and Adapting On-Device Large Language Models for Clinical Decision Support

Authors: Alif Munim, Jun Ma, Omar Ibrahim, Alhusain Abdalla, Shuolin Yin, Leo Chen, Bo Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.03266
Pdf URL: https://arxiv.org/pdf/2601.03266
Copy Paste: [[2601.03266]] Benchmarking and Adapting On-Device Large Language Models for Clinical Decision Support(https://arxiv.org/abs/2601.03266)
Keywords: language model, gpt, llm
Abstract: Large language models (LLMs) have rapidly advanced in clinical decision-making, yet the deployment of proprietary systems is hindered by privacy concerns and reliance on cloud-based infrastructure. Open-source alternatives allow local inference but often require large model sizes that limit their use in resource-constrained clinical settings. Here, we benchmark two on-device LLMs, gpt-oss-20b and gpt-oss-120b, across three representative clinical tasks: general disease diagnosis, specialty-specific (ophthalmology) diagnosis and management, and simulation of human expert grading and evaluation. We compare their performance with state-of-the-art proprietary models (GPT-5 and o4-mini) and a leading open-source model (DeepSeek-R1), and we further evaluate the adaptability of on-device systems by fine-tuning gpt-oss-20b on general diagnostic data. Across tasks, gpt-oss models achieve performance comparable to or exceeding DeepSeek-R1 and o4-mini despite being substantially smaller. In addition, fine-tuning remarkably improves the diagnostic accuracy of gpt-oss-20b, enabling it to approach the performance of GPT-5. These findings highlight the potential of on-device LLMs to deliver accurate, adaptable, and privacy-preserving clinical decision support, offering a practical pathway for broader integration of LLMs into routine clinical practice.

Title: OpenAI GPT-5 System Card

Authors: Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, Akshay Nathan, Alan Luo, Alec Helyar, Aleksander Madry, Aleksandr Efremov, Aleksandra Spyra, Alex Baker-Whitcomb, Alex Beutel, Alex Karpenko, Alex Makelov, Alex Neitz, Alex Wei, Alexandra Barr, Alexandre Kirchmeyer, Alexey Ivanov, Alexi Christakis, Alistair Gillespie, Allison Tam, Ally Bennett, Alvin Wan, Alyssa Huang, Amy McDonald Sandjideh, Amy Yang, Ananya Kumar, Andre Saraiva, Andrea Vallone, Andrei Gheorghe, Andres Garcia Garcia, Andrew Braunstein, Andrew Liu, Andrew Schmidt, Andrey Mereskin, Andrey Mishchenko, Andy Applebaum, Andy Rogerson, Ann Rajan, Annie Wei, Anoop Kotha, Anubha Srivastava, Anushree Agrawal, Arun Vijayvergiya, Ashley Tyra, Ashvin Nair, Avi Nayak, Ben Eggers, Bessie Ji, Beth Hoover, Bill Chen, Blair Chen, Boaz Barak, Borys Minaiev, Botao Hao, Bowen Baker, Brad Lightcap, Brandon McKinzie, Brandon Wang, Brendan Quinn, Brian Fioca, Brian Hsu, Brian Yang, Brian Yu, Brian Zhang, Brittany Brenner, Callie Riggins Zetino, Cameron Raymond, Camillo Lugaresi, Carolina Paz, Cary Hudson, Cedric Whitney, Chak Li, Charles Chen, Charlotte Cole, Chelsea Voss, Chen Ding, Chen Shen, Chengdu Huang, Chris Colby, Chris Hallacy, Chris Koch, Chris Lu, Christina Kaplan, Christina Kim, CJ Minott-Henriques, Cliff Frey, Cody Yu, Coley Czarnecki, Colin Reid, Colin Wei, Cory Decareaux, Cristina Scheau
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.03267
Pdf URL: https://arxiv.org/pdf/2601.03267
Copy Paste: [[2601.03267]] OpenAI GPT-5 System Card(https://arxiv.org/abs/2601.03267)
Keywords: gpt, hallucination, prompt, chat, agent
Abstract: This is the system card published alongside the OpenAI GPT-5 launch, August 2025. GPT-5 is a unified system with a smart and fast model that answers most questions, a deeper reasoning model for harder problems, and a real-time router that quickly decides which model to use based on conversation type, complexity, tool needs, and explicit intent (for example, if you say 'think hard about this' in the prompt). The router is continuously trained on real signals, including when users switch models, preference rates for responses, and measured correctness, improving over time. Once usage limits are reached, a mini version of each model handles remaining queries. This system card focuses primarily on gpt-5-thinking and gpt-5-main, while evaluations for other models are available in the appendix. The GPT-5 system not only outperforms previous models on benchmarks and answers questions more quickly, but -- more importantly -- is more useful for real-world queries. We've made significant advances in reducing hallucinations, improving instruction following, and minimizing sycophancy, and have leveled up GPT-5's performance in three of ChatGPT's most common uses: writing, coding, and health. All of the GPT-5 models additionally feature safe-completions, our latest approach to safety training to prevent disallowed content. Similarly to ChatGPT agent, we have decided to treat gpt-5-thinking as High capability in the Biological and Chemical domain under our Preparedness Framework, activating the associated safeguards. While we do not have definitive evidence that this model could meaningfully help a novice to create severe biological harm -- our defined threshold for High capability -- we have chosen to take a precautionary approach.

Title: WRAVAL -- WRiting Assist eVALuation

Authors: Gabriel Benedict, Matthew Butler, Naved Merchant, Eetu Salama-Laine
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2601.03268
Pdf URL: https://arxiv.org/pdf/2601.03268
Copy Paste: [[2601.03268]] WRAVAL -- WRiting Assist eVALuation(https://arxiv.org/abs/2601.03268)
Keywords: language model, llm, prompt
Abstract: The emergence of Large Language Models (LLMs) has shifted language model evaluation toward reasoning and problem-solving tasks as measures of general intelligence. Small Language Models (SLMs) -- defined here as models under 10B parameters -- typically score 3-4 times lower than LLMs on these metrics. However, we demonstrate that these evaluations fail to capture SLMs' effectiveness in common industrial applications, such as tone modification tasks (e.g., funny, serious, professional). We propose an evaluation framework specifically designed to highlight SLMs' capabilities in non-reasoning tasks where predefined evaluation datasets don't exist. Our framework combines novel approaches in data generation, prompt-tuning, and LLM-based evaluation to demonstrate the potential of task-specific finetuning. This work provides practitioners with tools to effectively benchmark both SLMs and LLMs for practical applications, particularly in edge and private computing scenarios. Our implementation is available at: this https URL.

Title: The Instruction Gap: LLMs get lost in Following Instruction

Authors: Vishesh Tripathi, Uday Allu, Biddwan Ahmed
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.03269
Pdf URL: https://arxiv.org/pdf/2601.03269
Copy Paste: [[2601.03269]] The Instruction Gap: LLMs get lost in Following Instruction(https://arxiv.org/abs/2601.03269)
Keywords: language model, gpt, llm, retrieval-augmented generation
Abstract: Large Language Models (LLMs) have shown remarkable capabilities in natural language understanding and generation, yet their deployment in enterprise environments reveals a critical limitation: inconsistent adherence to custom instructions. This study presents a comprehensive evaluation of 13 leading LLMs across instruction compliance, response accuracy, and performance metrics in real-world RAG (Retrieval-Augmented Generation) scenarios. Through systematic testing with samples and enterprise-grade evaluation protocols, we demonstrate that instruction following varies dramatically across models, with Claude-Sonnet-4 and GPT-5 achieving the highest results. Our findings reveal the "instruction gap": a fundamental challenge where models excel at general tasks but struggle with the precise instruction adherence required for enterprise deployment. This work provides practical insights for organizations deploying LLM-powered solutions and establishes benchmarks for instruction-following capabilities across major model families.

Title: Less is more: Not all samples are effective for evaluation

Authors: Wentang Song, Jinqiang Li, Kele Huang, Junhui Lin, Shengxiang Wu, Zhongshi Xie
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.03272
Pdf URL: https://arxiv.org/pdf/2601.03272
Copy Paste: [[2601.03272]] Less is more: Not all samples are effective for evaluation(https://arxiv.org/abs/2601.03272)
Keywords: language model, llm
Abstract: The versatility of Large Language Models (LLMs) in vertical domains has spurred the development of numerous specialized evaluation benchmarks. However, these benchmarks often suffer from significant semantic redundancy and impose high computational costs during evaluation. Existing compression methods, such as tinyBenchmarks, depend critically on correctness labels from multiple historical models evaluated on the full test set, making them inapplicable in cold-start scenarios, such as the introduction of a new task, domain, or model with no prior evaluation history. To address this limitation, we propose a history-free test set compression framework that requires no prior model performance data. Our method begins by fine-tuning a base LLM on a small amount of domain-specific data to internalize task-relevant semantics. It then generates high-level semantic embeddings for all original test samples using only their raw textual content. In this domain-adapted embedding space, we perform task-aware clustering and introduce a novel dataset X-ray mechanism that analyzes cluster geometry to dynamically calibrate the compression intensity based on the intrinsic redundancy of the benchmark. Experiments on professional-domain datasets, notably a large-scale 3GPP communications benchmark, demonstrate that our approach effectively identifies and removes redundant samples, reducing evaluation cost by over 90% while preserving high fidelity to the full benchmark.
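Illustration: a rough sketch of redundancy removal in an embedding space, using only the sample embeddings and no model-performance history. It deliberately swaps in a simple greedy cosine-similarity filter in place of the task-aware clustering and "dataset X-ray" calibration described in the abstract, whose details are not given here. The function names, the fixed threshold, and the toy vectors are all assumptions.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def compress(embeddings: Sequence[Sequence[float]], threshold: float = 0.95) -> List[int]:
    """Return indices of samples to keep.

    A sample is kept only if it is less similar than `threshold` to every
    sample kept so far. This greedy filter is a stand-in, not the paper's method.
    """
    kept: List[int] = []
    for i, vec in enumerate(embeddings):
        if all(cosine(vec, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept


if __name__ == "__main__":
    # Toy 2-D "embeddings": two near-duplicate pairs.
    toy = [[1.0, 0.0], [0.99, 0.01], [0.0, 1.0], [0.01, 1.0]]
    print(compress(toy))
```

A fixed threshold is the crudest possible choice; the abstract describes calibrating compression intensity from cluster geometry instead, which this sketch does not attempt.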

Title: GuardEval: A Multi-Perspective Benchmark for Evaluating Safety, Fairness, and Robustness in LLM Moderators

Authors: Naseem Machlovi, Maryam Saleki, Ruhul Amin, Mohamed Rahouti, Shawqi Al-Maliki, Junaid Qadir, Mohamed M. Abdallah, Ala Al-Fuqaha
Subjects: cs.CL, cs.AI, cs.HC, cs.LG
Abstract URL: https://arxiv.org/abs/2601.03273
Pdf URL: https://arxiv.org/pdf/2601.03273
Copy Paste: [[2601.03273]] GuardEval: A Multi-Perspective Benchmark for Evaluating Safety, Fairness, and Robustness in LLM Moderators(https://arxiv.org/abs/2601.03273)
Keywords: language model, llm, prompt
Abstract: As large language models (LLMs) become deeply embedded in daily life, the urgent need for safer moderation systems, distinguishing naive from harmful requests while upholding appropriate censorship boundaries, has never been greater. While existing LLMs can detect harmful or unsafe content, they often struggle with nuanced cases such as implicit offensiveness, subtle gender and racial biases, and jailbreak prompts, due to the subjective and context-dependent nature of these issues. Furthermore, their heavy reliance on training data can reinforce societal biases, resulting in inconsistent and ethically problematic outputs. To address these challenges, we introduce GuardEval, a unified multi-perspective benchmark dataset designed for both training and evaluation, containing 106 fine-grained categories spanning human emotions, offensive and hateful language, gender and racial bias, and broader safety concerns. We also present GemmaGuard (GGuard), a QLoRA fine-tuned version of Gemma3-12B trained on GuardEval, to assess content moderation with fine-grained labels. Our evaluation shows that GGuard achieves a macro F1 score of 0.832, substantially outperforming leading moderation models, including OpenAI Moderator (0.64) and Llama Guard (0.61). We show that multi-perspective, human-centered safety benchmarks are critical for reducing biased and inconsistent moderation decisions. GuardEval and GGuard together demonstrate that diverse, representative data materially improve safety, fairness, and robustness on complex, borderline cases.
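Illustration: the abstract reports macro F1, which averages per-class F1 scores with equal weight regardless of class frequency. The sketch below shows the standard definition in plain Python, assuming the label set is the union of labels seen in the references and the predictions. The label names in the example are made up, and nothing here reproduces the paper's evaluation pipeline.

```python
from typing import Hashable, Sequence


def macro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class F1 over all labels seen in either sequence."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    labels = set(y_true) | set(y_pred)
    per_class = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # F1 = 2*TP / (2*TP + FP + FN), equivalent to the harmonic mean
        # of precision and recall.
        denom = 2 * tp + fp + fn
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / len(per_class) if per_class else 0.0


if __name__ == "__main__":
    truth = ["safe", "safe", "hate", "hate"]
    preds = ["safe", "hate", "hate", "hate"]
    print(round(macro_f1(truth, preds), 4))
```

Because every class counts equally, a benchmark with many fine-grained categories penalizes a moderator that only does well on the frequent ones.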

Title: LLM_annotate: A Python package for annotating and analyzing fiction characters

Authors: Hannes Rosenbusch
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.03274
Pdf URL: https://arxiv.org/pdf/2601.03274
Copy Paste: [[2601.03274]] LLM_annotate: A Python package for annotating and analyzing fiction characters(https://arxiv.org/abs/2601.03274)
Keywords: language model, llm
Abstract: LLM_annotate is a Python package for analyzing the personality of fiction characters with large language models. It standardizes workflows for annotating character behaviors in full texts (e.g., books and movie scripts), inferring character traits, and validating annotation/inference quality via a human-in-the-loop GUI. The package includes functions for text chunking, LLM-based annotation, character name disambiguation, quality scoring, and computation of character-level statistics and embeddings. Researchers can use any LLM, commercial, open-source, or custom, within LLM_annotate. Through tutorial examples using The Simpsons Movie and the novel Pride and Prejudice, I demonstrate the usage of the package for efficient and reproducible character analyses.

Title: Topic Segmentation Using Generative Language Models

Authors: Pierre Mackenzie, Maya Shah, Patrick Frenett
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.03276
Pdf URL: https://arxiv.org/pdf/2601.03276
Copy Paste: [[2601.03276]] Topic Segmentation Using Generative Language Models(https://arxiv.org/abs/2601.03276)
Keywords: language model, llm, prompt
Abstract: Topic segmentation using generative Large Language Models (LLMs) remains relatively unexplored. Previous methods use semantic similarity between sentences, but such models lack the long-range dependencies and vast knowledge found in LLMs. In this work, we propose an overlapping and recursive prompting strategy using sentence enumeration. We also support the adoption of the boundary similarity evaluation metric. Results show that LLMs can be more effective segmenters than existing methods, but issues remain to be solved before they can be relied upon for topic segmentation.
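Illustration: one plausible reading of "overlapping prompting with sentence enumeration" is to number each sentence and feed the model windows that share a few sentences, so a boundary near a window edge is seen with context on both sides. The sketch below only builds those windows and their enumerated text. The window size, overlap, "[i]" numbering format, and function names are assumptions, and the recursive step and the LLM call itself are left out.

```python
from typing import List, Sequence, Tuple


def overlapping_windows(n_sentences: int, size: int, overlap: int) -> List[Tuple[int, int]]:
    """Return [start, end) sentence-index windows that overlap by `overlap`."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    windows: List[Tuple[int, int]] = []
    start = 0
    while start < n_sentences:
        end = min(start + size, n_sentences)
        windows.append((start, end))
        if end == n_sentences:
            break  # last window reached the end of the document
        start += step
    return windows


def enumerate_window(sentences: Sequence[str], start: int, end: int) -> str:
    """Render a window with global sentence numbers, one sentence per line."""
    return "\n".join(f"[{i}] {sentences[i]}" for i in range(start, end))


if __name__ == "__main__":
    sents = [f"Sentence {i}." for i in range(10)]
    for s, e in overlapping_windows(len(sents), size=4, overlap=1):
        print(enumerate_window(sents, s, e))
        print("---")
```

Using global rather than per-window sentence numbers means any boundary indices a model returns can be merged across windows without re-mapping.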

Title: Bare-Metal Tensor Virtualization: Overcoming the Memory Wall in Edge-AI Inference on ARM64

Authors: Bugra Kilictas, Faruk Alpay
Subjects: cs.CL, cs.AI, cs.AR, cs.LG
Abstract URL: https://arxiv.org/abs/2601.03324
Pdf URL: https://arxiv.org/pdf/2601.03324
Copy Paste: [[2601.03324]] Bare-Metal Tensor Virtualization: Overcoming the Memory Wall in Edge-AI Inference on ARM64(https://arxiv.org/abs/2601.03324)
Keywords: language model, llm
Abstract: The deployment of Large Language Models (LLMs) on edge devices is fundamentally constrained by the "Memory Wall," the bottleneck where data movement latency outstrips arithmetic throughput. Standard inference runtimes often incur significant overhead through high-level abstractions, dynamic dispatch, and unaligned memory access patterns. In this work, we present a novel "Virtual Tensor Core" architecture implemented in software, optimized specifically for ARM64 microarchitectures (Apple Silicon). By bypassing standard library containers in favor of direct memory mapping (mmap) and implementing hand-tuned NEON SIMD kernels, we achieve a form of "Software-Defined Direct Memory Access (DMA)." Our proposed Tensor Virtualization Layout (TVL) guarantees 100% cache line utilization for weight matrices, while our zero-copy loader eliminates initialization latency. Experimental results on a 110M parameter model demonstrate a stable throughput of >60 tokens/second on M2 hardware. While proprietary hardware accelerators (e.g., Apple AMX) can achieve higher peak throughput, our architecture provides a fully open, portable, and deterministic reference implementation for studying the memory bottleneck on general-purpose ARM silicon, meeting the 200ms psycholinguistic latency threshold without opaque dependencies.
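Illustration: the zero-copy loading idea can be shown at a small scale with Python's standard mmap module, where a weight file is mapped into memory and read through a typed view instead of being copied into a container first. This is only a conceptual sketch: the file layout (raw native-endian float32 values), the function names, and the dot-product consumer are assumptions, and it has none of the NEON kernels or the TVL layout described in the abstract.

```python
import array
import mmap
import os
import tempfile
from typing import Sequence


def write_weights(path: str, values: Sequence[float]) -> None:
    """Write values as raw native-endian float32 (assumed file layout)."""
    with open(path, "wb") as f:
        f.write(array.array("f", values).tobytes())


def dot_from_mmap(path: str, x: Sequence[float]) -> float:
    """Dot product of x with weights read directly from a memory-mapped file."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # The cast view reinterprets the mapped bytes as float32 in place;
            # no intermediate bytes object or list of weights is created.
            with memoryview(mm) as raw, raw.cast("f") as weights:
                return sum(w * xi for w, xi in zip(weights, x))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "weights.bin")
        write_weights(path, [1.0, 2.0, 3.0])
        print(dot_from_mmap(path, [1.0, 1.0, 1.0]))
```

Both memoryview objects are released before the mapping closes, since closing an mmap that still has exported buffers raises an error.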

Title: A path to natural language through tokenisation and transformers

Authors: David S. Berman, Alexander G. Stapleton
Subjects: cs.CL, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2601.03368
Pdf URL: https://arxiv.org/pdf/2601.03368
Copy Paste: [[2601.03368]] A path to natural language through tokenisation and transformers(https://arxiv.org/abs/2601.03368)
Keywords: language model
Abstract: Natural languages exhibit striking regularities in their statistical structure, including notably the emergence of Zipf's and Heaps' laws. Despite this, it remains broadly unclear how these properties relate to the modern tokenisation schemes used in contemporary transformer models. In this note, we analyse the information content (as measured by the Shannon entropy) of various corpora under the assumption of a Zipfian frequency distribution, and derive a closed-form expression for the slot entropy expectation value. We then empirically investigate how byte-pair encoding (BPE) transforms corpus statistics, showing that recursive applications of BPE drive token frequencies toward a Zipfian power law while inducing a characteristic growth pattern in empirical entropy. Utilizing the ability of transformers to learn context dependent token probability distributions, we train language models on corpora tokenised at varying BPE depths, revealing that the model predictive entropies increasingly agree with Zipf-derived predictions as the BPE depth increases. Attention-based diagnostics further indicate that deeper tokenisation reduces local token dependencies, bringing the empirical distribution closer to the weakly dependent (near IID) regime. Together, these results clarify how BPE acts not only as a compression mechanism but also as a statistical transform that reconstructs key informational properties of natural language.
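Illustration: for a finite vocabulary of N tokens, a Zipfian distribution assigns rank r the probability r^(-s) divided by the sum of k^(-s) over k = 1..N, and its Shannon entropy can be evaluated numerically. The sketch below computes that unigram entropy in bits. It is not the closed-form "slot entropy expectation value" derived in the paper, and the function names, the exponent parameter s, and the vocabulary sizes are assumptions.

```python
import math
from typing import List


def zipf_probs(vocab_size: int, exponent: float = 1.0) -> List[float]:
    """Probabilities p_r proportional to r**(-exponent) for ranks 1..vocab_size."""
    weights = [rank ** -exponent for rank in range(1, vocab_size + 1)]
    total = sum(weights)  # generalized harmonic number used as normalizer
    return [w / total for w in weights]


def shannon_entropy_bits(probs: List[float]) -> float:
    """Shannon entropy H = -sum(p * log2 p), skipping zero-probability entries."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


if __name__ == "__main__":
    for n in (10, 1_000, 100_000):
        h = shannon_entropy_bits(zipf_probs(n))
        print(f"N={n:>7}: H={h:.3f} bits (uniform would be {math.log2(n):.3f})")
```

Setting the exponent to 0 recovers the uniform distribution, whose entropy is log2(N), which gives a quick sanity check on the implementation.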

Title: Metaphors are a Source of Cross-Domain Misalignment of Large Reasoning Models

Authors: Zhibo Hu, Chen Wang, Yanfeng Shu, Hye-young Paik, Liming Zhu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.03388
Pdf URL: https://arxiv.org/pdf/2601.03388
Copy Paste: [[2601.03388]] Metaphors are a Source of Cross-Domain Misalignment of Large Reasoning Models(https://arxiv.org/abs/2601.03388)
Keywords: language model, llm
Abstract: Earlier research has shown that metaphors influence human decision making, which raises the question of whether metaphors also influence the reasoning pathways of large language models (LLMs), considering that their training data contain a large number of metaphors. In this work, we investigate the problem within the scope of emergent misalignment, where LLMs can generalize patterns learned from misaligned content in one domain to another domain. We discover a strong causal relationship between metaphors in training data and the misalignment degree of LLMs' reasoning contents. With interventions using metaphors in pre-training, fine-tuning and re-alignment phases, models' cross-domain misalignment degrees change significantly. As we delve deeper into the causes behind this phenomenon, we observe that there is a connection between metaphors and the activation of global and local latent features of large reasoning models. By monitoring these latent features, we design a detector that predicts misaligned content with high accuracy.

Title: Breaking the Assistant Mold: Modeling Behavioral Variation in LLM Based Procedural Character Generation

Authors: Maan Qraitem, Kate Saenko, Bryan A. Plummer
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.03396
Pdf URL: https://arxiv.org/pdf/2601.03396
Copy Paste: [[2601.03396]] Breaking the Assistant Mold: Modeling Behavioral Variation in LLM Based Procedural Character Generation(https://arxiv.org/abs/2601.03396)
Keywords: llm
Abstract: Procedural content generation has enabled vast virtual worlds through levels, maps, and quests, but large-scale character generation remains underexplored. We identify two alignment-induced biases in existing methods: a positive moral bias, where characters uniformly adopt agreeable stances (e.g. always saying lying is bad), and a helpful assistant bias, where characters invariably answer questions directly (e.g. never refusing or deflecting). While such tendencies suit instruction-following systems, they suppress dramatic tension and yield predictable characters, stemming from maximum likelihood training and assistant fine-tuning. To address this, we introduce PersonaWeaver, a framework that disentangles world-building (roles, demographics) from behavioral-building (moral stances, interactional styles), yielding characters with more diverse reactions and moral stances, as well as second-order diversity in stylistic markers like length, tone, and punctuation. Code: this https URL

Title: Rendering Data Unlearnable by Exploiting LLM Alignment Mechanisms

Authors: Ruihan Zhang, Jun Sun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.03401
Pdf URL: https://arxiv.org/pdf/2601.03401
Copy Paste: [[2601.03401]] Rendering Data Unlearnable by Exploiting LLM Alignment Mechanisms(https://arxiv.org/abs/2601.03401)
Keywords: language model, llm
Abstract: Large language models (LLMs) are increasingly trained on massive, heterogeneous text corpora, raising serious concerns about the unauthorised use of proprietary or personal data during model training. In this work, we address the problem of data protection against unwanted model learning in a realistic black-box setting. We propose Disclaimer Injection, a novel data-level defence that renders text unlearnable to LLMs. Rather than relying on model-side controls or explicit data removal, our approach exploits the models' own alignment mechanisms by injecting carefully designed alignment-triggering disclaimers that prevent effective learning. Through layer-wise analysis, we find that fine-tuning on such protected data induces persistent activation of alignment-related layers, causing alignment constraints to override task learning even on common inputs. Consequently, models trained on such data exhibit substantial and systematic performance degradation compared to standard fine-tuning. Our results identify alignment behaviour as a previously unexplored lever for data protection and, to our knowledge, present the first practical method for restricting data learnability at LLM scale without requiring access to or modification of the training pipeline.
摘要：大型语言模型 (LLM) 越来越多地在海量异构文本语料库上进行训练，这引起了人们对模型训练期间未经授权使用专有或个人数据的严重担忧。在这项工作中，我们解决了在现实的黑盒设置中防止不需要的模型学习的数据保护问题。我们提出了免责声明注入，这是一种新颖的数据级防御，可以使法学硕士无法学习文本。我们的方法不是依赖于模型端控制或显式数据删除，而是利用模型自身的对齐机制：通过注入精心设计的对齐触发免责声明来防止有效学习。通过逐层分析，我们发现对此类受保护数据的微调会导致对齐相关层的持续激活，从而导致对齐约束覆盖任务学习，即使在公共输入上也是如此。因此，与标准微调相比，基于此类数据训练的模型表现出显着且系统的性能下降。我们的结果将对齐行为确定为以前未探索的数据保护杠杆，并且据我们所知，提出了第一个在 LLM 规模上限制数据可学习性的实用方法，而无需访问或修改训练管道。

Title: Tigrinya Number Verbalization: Rules, Algorithm, and Implementation

Authors: Fitsum Gaim, Issayas Tesfamariam
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.03403
Pdf URL: https://arxiv.org/pdf/2601.03403
Copy Paste: [[2601.03403]] Tigrinya Number Verbalization: Rules, Algorithm, and Implementation(https://arxiv.org/abs/2601.03403)
Keywords: language model, llm
Abstract: We present a systematic formalization of Tigrinya cardinal and ordinal number verbalization, addressing a gap in computational resources for the language. This work documents the canonical rules governing the expression of numerical values in spoken Tigrinya, including the conjunction system, scale words, and special cases for dates, times, and currency. We provide a formal algorithm for number-to-word conversion and release an open-source implementation. Evaluation of frontier large language models (LLMs) reveals significant gaps in their ability to accurately verbalize Tigrinya numbers, underscoring the need for explicit rule documentation. This work serves language modeling, speech synthesis, and accessibility applications targeting Tigrinya-speaking communities.
摘要：我们提出了提格里尼亚语基数词和序数语言表达的系统形式化，解决了该语言计算资源的缺口。这项工作记录了提格里尼亚语口语中数值表达的规范规则，包括连词系统、音阶词以及日期、时间和货币的特殊情况。我们提供了一种用于数字到单词转换的正式算法并发布了开源实现。对前沿大语言模型 (LLM) 的评估揭示了它们在准确表达提格里尼亚数字的能力方面存在显着差距，强调了明确规则记录的必要性。这项工作服务于针对提格里尼亚语社区的语言建模、语音合成和无障碍应用程序。

Title: Implicit Graph, Explicit Retrieval: Towards Efficient and Interpretable Long-horizon Memory for Large Language Models

Authors: Xin Zhang, Kailai Yang, Hao Li, Chenyue Li, Qiyu Wei, Sophia Ananiadou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.03417
Pdf URL: https://arxiv.org/pdf/2601.03417
Copy Paste: [[2601.03417]] Implicit Graph, Explicit Retrieval: Towards Efficient and Interpretable Long-horizon Memory for Large Language Models(https://arxiv.org/abs/2601.03417)
Keywords: language model, llm, long context
Abstract: Long-horizon applications increasingly require large language models (LLMs) to answer queries when relevant evidence is sparse and dispersed across very long contexts. Existing memory systems largely follow two paradigms: explicit structured memories offer interpretability but often become brittle under long-context overload, while latent memory mechanisms are efficient and stable yet difficult to inspect. We propose LatentGraphMem, a memory framework that combines implicit graph memory with explicit subgraph retrieval. LatentGraphMem stores a graph-structured memory in latent space for stability and efficiency, and exposes a task-specific subgraph retrieval interface that returns a compact symbolic subgraph under a fixed budget for downstream reasoning and human inspection. During training, an explicit graph view is materialized to interface with a frozen reasoner for question-answering supervision. At inference time, retrieval is performed in latent space and only the retrieved subgraph is externalized. Experiments on long-horizon benchmarks across multiple model scales show that LatentGraphMem consistently outperforms representative explicit-graph and latent-memory baselines, while enabling parameter-efficient adaptation and flexible scaling to larger reasoners without introducing large symbolic artifacts.
摘要：当相关证据稀疏且分散在很长的上下文中时，长视野应用程序越来越需要大型语言模型（LLM）来回答查询。现有的记忆系统主要遵循两种范式：显式结构化记忆提供可解释性，但在长上下文过载下往往变得脆弱，而潜在记忆机制高效且稳定，但难以检查。我们提出 LatentGraphMem，一种将隐式图内存与显式子图检索相结合的内存框架。 LatentGraphMem 在潜在空间中存储图结构内存以实现稳定性和效率，并公开特定于任务的子图检索接口，该接口在固定预算下返回紧凑的符号子图，用于下游推理和人工检查。在训练期间，显式的图形视图被具体化，以与冻结推理器交互以进行问答监督。在推理时，检索是在潜在空间中执行的，并且仅检索到的子图被外部化。跨多个模型规模的长期基准测试表明，LatentGraphMem 始终优于代表性的显式图和潜在内存基线，同时能够实现参数高效的适应和灵活缩放到更大的推理器，而无需引入大型符号工件。

Title: PCoA: A New Benchmark for Medical Aspect-Based Summarization With Phrase-Level Context Attribution

Authors: Bohao Chu, Sameh Frihat, Tabea M. G. Pakull, Hendrik Damm, Meijie Li, Ula Muhabbek, Georg Lodde, Norbert Fuhr
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.03418
Pdf URL: https://arxiv.org/pdf/2601.03418
Copy Paste: [[2601.03418]] PCoA: A New Benchmark for Medical Aspect-Based Summarization With Phrase-Level Context Attribution(https://arxiv.org/abs/2601.03418)
Keywords: language model
Abstract: Verifying system-generated summaries remains challenging, as effective verification requires precise attribution to the source context, which is especially crucial in high-stakes medical domains. To address this challenge, we introduce PCoA, an expert-annotated benchmark for medical aspect-based summarization with phrase-level context attribution. PCoA aligns each aspect-based summary with its supporting contextual sentences and contributory phrases within them. We further propose a fine-grained, decoupled evaluation framework that independently assesses the quality of generated summaries, citations, and contributory phrases. Through extensive experiments, we validate the quality and consistency of the PCoA dataset and benchmark several large language models on the proposed task. Experimental results demonstrate that PCoA provides a reliable benchmark for evaluating system-generated summaries with phrase-level context attribution. Furthermore, comparative experiments show that explicitly identifying relevant sentences and contributory phrases before summarization can improve overall quality. The data and code are available at this https URL.
摘要：验证系统生成的摘要仍然具有挑战性，因为有效的验证需要精确归因于源上下文，这在高风险的医疗领域尤其重要。为了应对这一挑战，我们引入了 PCoA，这是一种专家注释的基准，用于具有短语级上下文归因的基于医学方面的摘要。 PCoA 将每个基于方面的摘要与其支持上下文句子和其中的贡献短语对齐。我们进一步提出了一个细粒度、解耦的评估框架，该框架独立评估生成的摘要、引文和贡献短语的质量。通过大量的实验，我们验证了 PCoA 数据集的质量和一致性，并针对所提出的任务对几个大型语言模型进行了基准测试。实验结果表明，PCoA 为评估具有短语级上下文归因的系统生成的摘要提供了可靠的基准。此外，比较实验表明，在摘要之前明确识别相关句子和贡献短语可以提高整体质量。数据和代码可从此 https URL 获取。

Title: Training-Free Adaptation of New-Generation LLMs using Legacy Clinical Models

Authors: Sasha Ronaghi, Chloe Stanwyck, Asad Aali, Amir Ronaghi, Miguel Fuentes, Tina Hernandez-Boussard, Emily Alsentzer
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.03423
Pdf URL: https://arxiv.org/pdf/2601.03423
Copy Paste: [[2601.03423]] Training-Free Adaptation of New-Generation LLMs using Legacy Clinical Models(https://arxiv.org/abs/2601.03423)
Keywords: language model, llm
Abstract: Adapting language models to the clinical domain through continued pretraining and fine-tuning requires costly retraining for each new model generation. We propose Cross-Architecture Proxy Tuning (CAPT), a model-ensembling approach that enables training-free adaptation of state-of-the-art general-domain models using existing clinical models. CAPT supports models with disjoint vocabularies, leveraging contrastive decoding to selectively inject clinically relevant signals while preserving the general-domain model's reasoning and fluency. On six clinical classification and text-generation tasks, CAPT with a new-generation general-domain model and an older-generation clinical model consistently outperforms both models individually and state-of-the-art ensembling approaches (average +17.6% over UniTE, +41.4% over proxy tuning across tasks). Through token-level analysis and physician case studies, we demonstrate that CAPT amplifies clinically actionable language, reduces context errors, and increases clinical specificity.
摘要：通过持续的预训练和微调使语言模型适应临床领域需要对每个新模型进行昂贵的再训练。我们提出了跨架构代理调优（CAPT），这是一种模型集成方法，可以使用现有的临床模型对最先进的通用领域模型进行免训练的调整。 CAPT 支持具有不相交词汇表的模型，利用对比解码有选择地注入临床相关信号，同时保留通用领域模型的推理和流畅性。在六项临床分类和文本生成任务中，具有新一代通用领域模型和老一代临床模型的 CAPT 始终优于单独模型和最先进的集成方法（平均比 UniTE +17.6%，比跨任务代理调整 +41.4%）。通过标记级分析和医生案例研究，我们证明 CAPT 增强了临床上可操作的语言，减少了上下文错误，并提高了临床特异性。

Title: The Critical Role of Aspects in Measuring Document Similarity

Authors: Eftekhar Hossain, Tarnika Hazra, Ahatesham Bhuiyan, Santu Karmaker
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.03435
Pdf URL: https://arxiv.org/pdf/2601.03435
Copy Paste: [[2601.03435]] The Critical Role of Aspects in Measuring Document Similarity(https://arxiv.org/abs/2601.03435)
Keywords: gpt, llm, prompt
Abstract: We introduce ASPECTSIM, a simple and interpretable framework that requires conditioning document similarity on an explicitly specified aspect, in contrast to the traditional holistic approach to measuring document similarity. Experimenting with a newly constructed benchmark of 26K aspect-document pairs, we found that ASPECTSIM, when implemented with direct GPT-4o prompting, achieves substantially higher human-machine agreement ($\approx$80% higher) than holistic similarity without explicit aspects. These findings underscore the importance of explicitly accounting for aspects when measuring document similarity and highlight the need to revise standard practice. Next, we conducted a large-scale meta-evaluation using 16 smaller open-source LLMs and 9 embedding models with a focus on making ASPECTSIM accessible and reproducible. While directly prompting LLMs to produce ASPECTSIM scores turned out to be ineffective (20-30% human-machine agreement), a simple two-stage refinement improved their agreement by $\approx$140%. Nevertheless, agreement remains well below that of GPT-4o-based models, indicating that smaller open-source LLMs still lag behind large proprietary models in capturing aspect-conditioned similarity.
摘要：我们引入了 ASPECTSIM，一个简单且可解释的框架，需要在明确指定的方面调节文档相似性，这与测量文档相似性的传统整体方法不同。通过对新构建的 26K 方面文档对的基准进行实验，我们发现，当使用直接 GPT-4o 提示实现时，ASPECTSIM 比没有明确方面的整体相似性实现了更高的人机一致性（高出约 80%）。这些发现强调了在衡量文档相似性时明确考虑各个方面的重要性，并强调了修改标准实践的必要性。接下来，我们使用 16 个较小的开源 LLM 和 9 个嵌入模型进行了大规模元评估，重点是使 ASPECTSIM 易于访问和可重复。虽然直接促使 LLM 生成 ASPECTSIM 分数被证明是无效的（人机一致性为 20-30%），但简单的两阶段改进将其一致性提高了约 140%。尽管如此，一致性仍然远低于基于 GPT-4o 的模型，这表明较小的开源 LLM 在捕获方面条件相似性方面仍然落后于大型专有模型。

Title: Grading Scale Impact on LLM-as-a-Judge: Human-LLM Alignment Is Highest on 0-5 Grading Scale

Authors: Weiyue Li, Minda Zhao, Weixuan Dong, Jiahui Cai, Yuze Wei, Michael Pocress, Yi Li, Wanyan Yuan, Xiaoyue Wang, Ruoyu Hou, Kaiyuan Lou, Wenqi Zeng, Yutong Yang, Yilun Du, Mengyu Wang
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2601.03444
Pdf URL: https://arxiv.org/pdf/2601.03444
Copy Paste: [[2601.03444]] Grading Scale Impact on LLM-as-a-Judge: Human-LLM Alignment Is Highest on 0-5 Grading Scale(https://arxiv.org/abs/2601.03444)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) are increasingly used as automated evaluators, yet prior works demonstrate that these LLM judges often lack consistency in scoring when the prompt is altered. However, the effect of the grading scale itself remains underexplored. We study the LLM-as-a-judge problem by comparing two kinds of raters: humans and LLMs. We collect ratings from both groups on three scales and across six benchmarks that include objective, open-ended subjective, and mixed tasks. Using intraclass correlation coefficients (ICC) to measure absolute agreement, we find that LLM judgments are not perfectly consistent across scales on subjective benchmarks, and that the choice of scale substantially shifts human-LLM agreement, even when within-group panel reliability is high. Aggregated over tasks, the grading scale of 0-5 yields the strongest human-LLM alignment. We further demonstrate that pooled reliability can mask benchmark heterogeneity and reveal systematic subgroup differences in alignment across gender groups, strengthening the importance of scale design and sub-level diagnostics as essential components of LLM-as-a-judge protocols.
摘要：大型语言模型（LLM）越来越多地被用作自动评估器，但先前的研究表明，当提示改变时，这些 LLM 评委通常在评分上缺乏一致性。然而，分级量表本身的影响仍未得到充分探索。我们通过比较两种评估者（人类和法学硕士）来研究法学硕士作为法官的问题。我们根据三个量表和六个基准收集两组的评分，其中包括客观、开放式主观和混合任务。使用类内相关系数（ICC）来衡量绝对一致性，我们发现法学硕士的判断在主观基准的不同尺度上并不完全一致，而且尺度的选择大大改变了人类与法学硕士的一致性，即使组内小组的可靠性很高。对任务进行汇总，0-5 的评分标准产生了最强的人类与法学硕士的一致性。我们进一步证明，汇总可靠性可以掩盖基准异质性，并揭示跨性别群体一致性的系统性亚组差异，从而加强量表设计和次级诊断作为法学硕士法官协议的重要组成部分的重要性。
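
The absolute-agreement ICC mentioned in the abstract above can be illustrated with a minimal sketch. The abstract does not say which ICC form the authors use, so this assumes the common two-way random-effects, single-rater form, ICC(2,1); the function name and the toy ratings are hypothetical.

```python
def icc_absolute_agreement(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    ratings: list of rows, one row per rated item, one column per rater
    (for example column 0 = human score, column 1 = LLM-judge score).
    Assumes at least two items and two raters, with some variation in scores.
    """
    n = len(ratings)       # number of rated items
    k = len(ratings[0])    # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)               # between-item mean square
    msc = ss_cols / (k - 1)               # between-rater mean square
    mse = ss_err / ((n - 1) * (k - 1))    # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


# Hypothetical ratings: the second rater is always exactly one point higher.
identical = [[1, 1], [2, 2], [3, 3]]
shifted = [[1, 2], [2, 3], [3, 4]]
print(icc_absolute_agreement(identical))
print(icc_absolute_agreement(shifted))
```

In this toy case the two raters rank the items identically, yet the constant one-point offset lowers the coefficient below 1. That is the sense in which absolute agreement is stricter than consistency, and it is one way a change of grading scale could shift measured human-LLM agreement.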

Title: Enhancing Linguistic Competence of Language Models through Pre-training with Language Learning Tasks

Authors: Atsuki Yamaguchi, Maggie Mi, Nikolaos Aletras
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.03448
Pdf URL: https://arxiv.org/pdf/2601.03448
Copy Paste: [[2601.03448]] Enhancing Linguistic Competence of Language Models through Pre-training with Language Learning Tasks(https://arxiv.org/abs/2601.03448)
Keywords: language model
Abstract: Language models (LMs) are pre-trained on raw text datasets to generate text sequences token-by-token. While this approach facilitates the learning of world knowledge and reasoning, it does not explicitly optimize for linguistic competence. To bridge this gap, we propose L2T, a pre-training framework integrating Language Learning Tasks alongside standard next-token prediction. Inspired by human language acquisition, L2T transforms raw text into structured input-output pairs to provide explicit linguistic stimulation. Pre-training LMs on a mixture of raw text and L2T data not only improves overall performance on linguistic competence benchmarks but accelerates its acquisition, while maintaining competitive performance on general reasoning tasks.
摘要：语言模型 (LM) 在原始文本数据集上进行预训练，以逐个标记生成文本序列。虽然这种方法有助于学习世界知识和推理，但它并没有明确优化语言能力。为了弥补这一差距，我们提出了 L2T，这是一种将语言学习任务与标准下一个标记预测相结合的预训练框架。受人类语言习得的启发，L2T 将原始文本转换为结构化输入输出对，以提供明确的语言刺激。在原始文本和 L2T 数据的混合上对 LM 进行预训练不仅可以提高语言能力基准的整体性能，还可以加速其获取，同时保持在一般推理任务上的竞争性能。

Title: Prompting Underestimates LLM Capability for Time Series Classification

Authors: Dan Schumacher, Erfan Nourbakhsh, Rocky Slavin, Anthony Rios
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.03464
Pdf URL: https://arxiv.org/pdf/2601.03464
Copy Paste: [[2601.03464]] Prompting Underestimates LLM Capability for Time Series Classification(https://arxiv.org/abs/2601.03464)
Keywords: language model, llm, prompt
Abstract: Prompt-based evaluations suggest that large language models (LLMs) perform poorly on time series classification, raising doubts about whether they encode meaningful temporal structure. We show that this conclusion reflects limitations of prompt-based generation rather than the model's representational capacity by directly comparing prompt outputs with linear probes over the same internal representations. While zero-shot prompting performs near chance, linear probes improve average F1 from 0.15-0.26 to 0.61-0.67, often matching or exceeding specialized time series models. Layer-wise analyses further show that class-discriminative time series information emerges in early transformer layers and is amplified by visual and multimodal inputs. Together, these results demonstrate a systematic mismatch between what LLMs internally represent and what prompt-based evaluation reveals, leading current evaluations to underestimate their time series understanding.
摘要：基于提示的评估表明，大型语言模型（LLM）在时间序列分类方面表现不佳，引发了人们对它们是否编码有意义的时间结构的怀疑。我们通过直接将提示输出与相同内部表示的线性探针进行比较，表明这一结论反映了基于提示的生成的局限性，而不是模型的表示能力。虽然零样本提示几乎是偶然的，但线性探针将平均 F1 从 0.15-0.26 提高到 0.61-0.67，通常匹配或超过专门的时间序列模型。逐层分析进一步表明，类判别性时间序列信息出现在早期变压器层中，并通过视觉和多模态输入放大。总之，这些结果表明法学硕士内部代表的内容与基于提示的评估所揭示的内容之间存在系统性不匹配，导致当前的评估低估了他们对时间序列的理解。
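
As a rough illustration of what a linear probe is, the sketch below fits a logistic-regression classifier on fixed feature vectors. It is not the authors' setup: the "hidden states" are synthetic Gaussian vectors standing in for frozen LLM activations, and all names and hyperparameters are assumptions.

```python
import math
import random


def train_linear_probe(features, labels, epochs=200, lr=0.1):
    """Fit a binary logistic-regression probe with full-batch gradient descent."""
    dim = len(features[0])
    n = len(features)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))   # predicted P(label = 1)
            for i in range(dim):
                grad_w[i] += (p - y) * x[i]
            grad_b += p - y
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_predict(w, b, x):
    """Return the predicted class (0 or 1) for one feature vector."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0


# Synthetic stand-in for frozen hidden states (assumption: real use would
# extract one activation vector per time series from a chosen LLM layer).
rng = random.Random(0)
labels = [i % 2 for i in range(200)]
features = [
    [(1.0 if y == 1 else -1.0) + rng.gauss(0.0, 0.5) for _ in range(4)]
    for y in labels
]

w, b = train_linear_probe(features[:150], labels[:150])
held_out = list(zip(features[150:], labels[150:]))
accuracy = sum(probe_predict(w, b, x) == y for x, y in held_out) / len(held_out)
print(f"held-out probe accuracy: {accuracy:.2f}")
```

The backbone is never updated; only the weight vector and bias of the probe are trained. That is why probe accuracy is read as evidence about what the representations already encode rather than about what the model can be prompted to say.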

Title: EpiQAL: Benchmarking Large Language Models in Epidemiological Question Answering for Enhanced Alignment and Reasoning

Authors: Mingyang Wei, Dehai Min, Zewen Liu, Yuzhang Xie, Guanchen Wu, Carl Yang, Max S. Y. Lau, Qi He, Lu Cheng, Wei Jin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.03471
Pdf URL: https://arxiv.org/pdf/2601.03471
Copy Paste: [[2601.03471]] EpiQAL: Benchmarking Large Language Models in Epidemiological Question Answering for Enhanced Alignment and Reasoning(https://arxiv.org/abs/2601.03471)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Reliable epidemiological reasoning requires synthesizing study evidence to infer disease burden, transmission dynamics, and intervention effects at the population level. Existing medical question answering benchmarks primarily emphasize clinical knowledge or patient-level reasoning, yet few systematically evaluate evidence-grounded epidemiological inference. We present EpiQAL, the first diagnostic benchmark for epidemiological question answering across diverse diseases, comprising three subsets built from open-access literature. The subsets respectively evaluate text-grounded factual recall, multi-step inference linking document evidence with epidemiological principles, and conclusion reconstruction with the Discussion section withheld. Construction combines expert-designed taxonomy guidance, multi-model verification, and retrieval-based difficulty control. Experiments on ten open models reveal that current LLMs show limited performance on epidemiological reasoning, with multi-step inference posing the greatest challenge. Model rankings shift across subsets, and scale alone does not predict success. Chain-of-Thought prompting benefits multi-step inference but yields mixed results elsewhere. EpiQAL provides fine-grained diagnostic signals for evidence grounding, inferential reasoning, and conclusion reconstruction.
摘要：可靠的流行病学推理需要综合研究证据来推断人口层面的疾病负担、传播动态和干预效果。现有的医学问答基准主要强调临床知识或患者层面的推理，但很少系统地评估基于证据的流行病学推断。我们推出 EpiQAL，这是针对不同疾病的流行病学问答的第一个诊断基准，由开放获取文献构建的三个子集组成。这些子集分别评估基于文本的事实回忆、将文档证据与流行病学原理联系起来的多步骤推理以及保留讨论部分的结论重建。构建结合了专家设计的分类指导、多模型验证和基于检索的难度控制。对十个开放模型的实验表明，目前的法学硕士在流行病学推理方面表现有限，其中多步推理构成了最大的挑战。模型排名会随着子集的不同而变化，仅靠规模并不能预测成功。思维链提示有利于多步推理，但在其他地方会产生好坏参半的结果。 EpiQAL 提供细粒度的诊断信号，用于证据基础、推理和结论重建。

Title: CALM: Culturally Self-Aware Language Models

Authors: Lingzhi Shen, Xiaohao Cai, Yunfei Long, Imran Razzak, Guanming Chen, Shoaib Jameel
Subjects: cs.CL, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2601.03483
Pdf URL: https://arxiv.org/pdf/2601.03483
Copy Paste: [[2601.03483]] CALM: Culturally Self-Aware Language Models(https://arxiv.org/abs/2601.03483)
Keywords: language model, prompt
Abstract: Cultural awareness in language models is the capacity to understand and adapt to diverse cultural contexts. However, most existing approaches treat culture as static background knowledge, overlooking its dynamic and evolving nature. This limitation reduces their reliability in downstream tasks that demand genuine cultural sensitivity. In this work, we introduce CALM, a novel framework designed to endow language models with cultural self-awareness. CALM disentangles task semantics from explicit cultural concepts and latent cultural signals, shaping them into structured cultural clusters through contrastive learning. These clusters are then aligned via cross-attention to establish fine-grained interactions among related cultural features and are adaptively integrated through a Mixture-of-Experts mechanism along culture-specific dimensions. The resulting unified representation is fused with the model's original knowledge to construct a culturally grounded internal identity state, which is further enhanced through self-prompted reflective learning, enabling continual adaptation and self-correction. Extensive experiments conducted on multiple cross-cultural benchmark datasets demonstrate that CALM consistently outperforms state-of-the-art methods.
摘要：语言模型中的文化意识是理解和适应不同文化背景的能力。然而，大多数现有方法将文化视为静态背景知识，忽视了其动态和不断发展的本质。这种限制降低了它们在需要真正文化敏感性的下游任务中的可靠性。在这项工作中，我们介绍了 CALM，这是一个旨在赋予语言模型文化自我意识的新颖框架。 CALM 将任务语义与明确的文化概念和潜在的文化信号分开，通过对比学习将它们塑造成结构化的文化集群。然后，这些簇通过交叉注意力进行对齐，以在相关文化特征之间建立细粒度的交互，并通过专家混合机制沿着特定文化的维度进行适应性整合。由此产生的统一表示与模型的原始知识融合，构建一个以文化为基础的内部身份状态，并通过自我提示的反思性学习进一步增强，从而实现持续的适应和自我纠正。对多个跨文化基准数据集进行的大量实验表明，CALM 始终优于最先进的方法。

Title: Submodular Evaluation Subset Selection in Automatic Prompt Optimization

Authors: Jinming Nian, Zhiyuan Peng, Hongwei Shang, Dae Hoon Park, Yi Fang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.03493
Pdf URL: https://arxiv.org/pdf/2601.03493
Copy Paste: [[2601.03493]] Submodular Evaluation Subset Selection in Automatic Prompt Optimization(https://arxiv.org/abs/2601.03493)
Keywords: prompt
Abstract: Automatic prompt optimization reduces manual prompt engineering, but relies on task performance measured on a small, often randomly sampled evaluation subset as its main source of feedback signal. Despite this, how to select that evaluation subset is usually treated as an implementation detail. We study evaluation subset selection for prompt optimization from a principled perspective and propose SESS, a submodular evaluation subset selection method. We frame selection as maximizing an objective set function and show that, under mild conditions, it is monotone and submodular, enabling greedy selection with theoretical guarantees. Across GSM8K, MATH, and GPQA-Diamond, submodularly selected evaluation subsets can yield better optimized prompts than random or heuristic baselines.
摘要：自动提示优化减少了手动提示工程，但依赖于在小型、通常随机采样的评估子集上测量的任务性能，作为其反馈信号的主要来源。尽管如此，如何选择评估子集通常被视为实现细节。我们从原理的角度研究了快速优化的评估子集选择，并提出了 SESS，一种子模块评估子集选择方法。我们将选择框架为最大化目标集函数，并表明，在温和条件下，它是单调和子模的，从而在理论上保证了贪婪选择。在 GSM8K、MATH 和 GPQA-Diamond 中，子模块选择的评估子集可以比随机或启发式基线产生更好的优化提示。
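
The greedy selection that monotone submodularity licenses can be sketched as follows. The abstract does not state the SESS objective, so this example substitutes a standard facility-location function over a hypothetical similarity matrix; it shows the general technique, not the paper's method.

```python
def facility_location(selected, sim):
    """Coverage of all examples by the selected subset.

    sim[i][j] is a non-negative similarity between examples i and j;
    each example is credited with its best match inside the subset.
    """
    if not selected:
        return 0.0
    return sum(max(row[j] for j in selected) for row in sim)


def greedy_select(sim, budget):
    """Pick up to `budget` examples, each time adding the largest marginal gain."""
    selected = []
    remaining = set(range(len(sim)))
    for _ in range(min(budget, len(sim))):
        current = facility_location(selected, sim)
        best = max(
            sorted(remaining),
            key=lambda j: facility_location(selected + [j], sim) - current,
        )
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical similarities: examples 0-2 form one cluster, 3-4 another.
sim = [
    [1.0, 0.9, 0.8, 0.1, 0.1],
    [0.9, 1.0, 0.9, 0.1, 0.1],
    [0.8, 0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 0.1, 1.0, 0.9],
    [0.1, 0.1, 0.1, 0.9, 1.0],
]
print(greedy_select(sim, budget=2))
```

With a budget of two, the greedy pass first takes the most central example of the larger cluster and then an example from the other cluster, because a second pick from the first cluster adds almost no coverage. For monotone submodular objectives this greedy procedure is known to achieve at least a (1 - 1/e) fraction of the optimal value, which is the kind of theoretical guarantee the abstract refers to.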

Title: Beyond Perplexity: A Lightweight Benchmark for Knowledge Retention in Supervised Fine-Tuning

Authors: Soheil Zibakhsh Shabgahi, Pedram Aghazadeh, Farinaz Koushanfar
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.03505
Pdf URL: https://arxiv.org/pdf/2601.03505
Copy Paste: [[2601.03505]] Beyond Perplexity: A Lightweight Benchmark for Knowledge Retention in Supervised Fine-Tuning(https://arxiv.org/abs/2601.03505)
Keywords: language model, llm
Abstract: Supervised Fine-Tuning (SFT) is a standard approach for injecting domain knowledge into Large Language Models (LLMs). However, relying on validation perplexity to monitor training is often insufficient, as it confounds stylistic mimicry with genuine factual internalization. To address this, we introduce the Knowledge Retention (KR) Test, a lightweight, corpus-grounded evaluation framework designed to distinguish factual learning from linguistics. KR-Test utilizes automatically generated contrastive examples to measure likelihood preferences for correct versus incorrect continuations, requiring no instruction tuning or generative decoding. We validate the framework's integrity through a "blind vs. oracle" baseline analysis. Furthermore, we demonstrate the diagnostic capabilities of KR-Test by analyzing the training dynamics of Low-Rank Adaptation (LoRA). By exposing the fine-grained dissociation between linguistic convergence and knowledge retention, KR-Test enhances the interpretability of fine-tuning dynamics.
摘要：监督微调 (SFT) 是将领域知识注入大型语言模型 (LLM) 的标准方法。然而，依靠验证困惑度来监控训练通常是不够的，因为它混淆了风格模仿和真正的事实内化。为了解决这个问题，我们引入了知识保留（KR）测试，这是一个轻量级的、基于语料库的评估框架，旨在区分事实学习和语言学。 KR-Test 利用自动生成的对比示例来测量正确与错误延续的可能性偏好，无需指令调整或生成解码。我们通过“盲目与预言”基线分析来验证框架的完整性。此外，我们通过分析低秩适应（LoRA）的训练动态来展示 KR-Test 的诊断能力。通过揭示语言趋同和知识保留之间的细粒度分离，KR-Test 增强了微调动态的可解释性。
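
A likelihood-preference check of the kind described above might look like the following sketch. How KR-Test aggregates token likelihoods is not stated in the abstract, so summing per-token log-probabilities is an assumption here, and the log-probability values are made-up placeholders for what a model would return.

```python
def prefers_correct(logprobs_correct, logprobs_incorrect):
    """True if the correct continuation gets the higher total log-likelihood."""
    return sum(logprobs_correct) > sum(logprobs_incorrect)


def preference_accuracy(pairs):
    """Fraction of contrastive pairs where the correct continuation wins.

    pairs: list of (per-token log-probs of the correct continuation,
                    per-token log-probs of the incorrect continuation).
    """
    wins = sum(prefers_correct(good, bad) for good, bad in pairs)
    return wins / len(pairs)


# Hypothetical per-token log-probabilities for three contrastive pairs.
pairs = [
    ([-0.1, -0.2], [-1.0, -2.0]),   # correct continuation preferred
    ([-0.5], [-0.4]),               # incorrect continuation preferred
    ([-0.3, -0.3], [-0.9, -0.1]),   # correct continuation preferred
]
print(preference_accuracy(pairs))
```

Because the score only compares likelihoods of fixed continuations, it needs a forward pass per continuation and no sampling, which matches the abstract's claim that no generative decoding is required. When the two continuations differ in length, a length-normalized variant (mean instead of sum) may be the fairer comparison.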

Title: Reasoning Pattern Alignment Merging for Adaptive Reasoning

Authors: Zhaofeng Zhong, Wei Yuan, Tong Chen, Xiangyu Zhao, Quoc Viet Hung Nguyen, Hongzhi Yin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.03506
Pdf URL: https://arxiv.org/pdf/2601.03506
Copy Paste: [[2601.03506]] Reasoning Pattern Alignment Merging for Adaptive Reasoning(https://arxiv.org/abs/2601.03506)
Keywords: prompt, chain-of-thought
Abstract: Recent large reasoning models (LRMs) have made substantial progress in complex reasoning tasks, yet they often generate lengthy reasoning paths for every query, incurring unnecessary computation and latency. Existing speed-up approaches typically rely on retraining the model or designing sophisticated prompting, which are either prohibitively expensive or highly sensitive to the input and prompt formulation. In this work, we study model merging as a lightweight alternative for efficient reasoning: by combining a long chain-of-thought (Long-CoT) reasoning model with a Short-CoT instruction model, we obtain an adaptive reasoner without training from scratch or requiring large-scale additional data. Building on this idea, we propose Reasoning Pattern Alignment Merging (RPAM), a layer-wise model merging framework based on feature alignment to facilitate query-adaptive reasoning. RPAM first constructs a small pattern-labeled calibration set that assigns each query an appropriate reasoning pattern. It then optimizes layer-wise merging coefficients by aligning the merged model's intermediate representations with those of the selected model, while a contrastive objective explicitly pushes them away from the non-selected model. Experiments on seven widely used reasoning benchmarks show that RPAM substantially reduces inference cost while maintaining strong performance. Upon article acceptance, we will provide open-source code to reproduce experiments for RPAM.
摘要：最近的大型推理模型（LRM）在复杂的推理任务中取得了实质性进展，但它们经常为每个查询生成冗长的推理路径，从而产生不必要的计算和延迟。现有的加速方法通常依赖于重新训练模型或设计复杂的提示，这些方法要么成本高昂，要么对输入和提示公式高度敏感。在这项工作中，我们研究模型合并作为高效推理的轻量级替代方案：通过将长思想链（Long-CoT）推理模型与短CoT指令模型相结合，我们获得了自适应推理器，而无需从头开始训练或需要大规模附加数据。基于这个想法，我们提出了推理模式对齐合并（RPAM），这是一种基于特征对齐的分层模型合并框架，以促进查询自适应推理。 RPAM 首先构建一个小的模式标记校准集，为每个查询分配适当的推理模式。然后，它通过将合并模型的中间表示与所选模型的中间表示对齐来优化分层合并系数，同时对比目标明确地将它们推离未选择的模型。对七个广泛使用的推理基准的实验表明，RPAM 在保持强大性能的同时大大降低了推理成本。文章接受后，我们将提供开源代码来重现 RPAM 实验。
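
To make the idea of layer-wise merging coefficients concrete, here is a minimal sketch that assumes the merge is a per-layer linear interpolation between the two models' weights. The abstract does not give the merge rule, so the interpolation form, the layer names, and the coefficient values are all assumptions; the alignment-based optimization of the coefficients is not shown.

```python
def merge_layerwise(long_cot, short_cot, alphas):
    """Interpolate two models layer by layer.

    long_cot, short_cot: dict mapping layer name -> flat list of weights.
    alphas: dict mapping layer name -> coefficient in [0, 1];
            1.0 keeps the Long-CoT layer, 0.0 keeps the Short-CoT layer.
    """
    merged = {}
    for name, w_long in long_cot.items():
        a = alphas[name]
        w_short = short_cot[name]
        merged[name] = [a * wl + (1.0 - a) * ws for wl, ws in zip(w_long, w_short)]
    return merged


# Hypothetical two-layer "models" with two weights per layer.
long_cot = {"layer0": [1.0, 2.0], "layer1": [4.0, 0.0]}
short_cot = {"layer0": [3.0, 0.0], "layer1": [0.0, 8.0]}
alphas = {"layer0": 0.5, "layer1": 0.25}

print(merge_layerwise(long_cot, short_cot, alphas))
```

Giving each layer its own coefficient, instead of one global mixing weight, is what makes it possible for some layers to stay close to the long-reasoning model while others follow the short-answer model.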

Title: IntroLM: Introspective Language Models via Prefilling-Time Self-Evaluation

Authors: Hossein Hosseini Kasnavieh, Gholamreza Haffari, Chris Leckie, Adel N. Toosi
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.03511
Pdf URL: https://arxiv.org/pdf/2601.03511
Copy Paste: [[2601.03511]] IntroLM: Introspective Language Models via Prefilling-Time Self-Evaluation(https://arxiv.org/abs/2601.03511)
Keywords: language model, llm
Abstract: A major challenge for the operation of large language models (LLMs) is how to predict whether a specific LLM will produce sufficiently high-quality output for a given query. Existing approaches rely on external classifiers, most commonly BERT-based models, which suffer from limited context windows, constrained representational capacity, and additional computational overhead. We propose IntroLM, a method that uses introspective tokens to enable causal language models to predict their own output quality during the prefilling phase without affecting generation. By introducing token-conditional LoRA that activates only for the introspective token, the model learns to predict the output quality for a given query while preserving the original backbone behavior and avoiding external evaluators. On question answering benchmarks, IntroLM applied to Qwen3 8B achieves a ROC AUC of 90 percent for success prediction, outperforming a DeBERTa classifier by 14 percent. When integrated into multi-model routing systems, IntroLM achieves superior cost-performance tradeoffs, reducing latency by up to 33 percent and large model usage by up to 50 percent at matched reliability.
摘要：大型语言模型 (LLM) 操作的一个主要挑战是如何预测特定的 LLM 是否将为给定查询生成足够高质量的输出。现有的方法依赖于外部分类器，最常见的是基于 BERT 的模型，这些模型受到有限的上下文窗口、受限的表示能力和额外的计算开销的影响。我们提出了 IntroLM，一种使因果语言模型能够在预填充阶段预测其自身输出质量的方法，而不影响使用内省标记的生成。通过引入仅针对内省令牌激活的令牌条件 LoRA，该模型学会预测给定查询的输出质量，同时保留原始主干行为并避免外部评估器。在问答基准测试中，应用于 Qwen3 8B 的 IntroLM 成功预测的 ROC AUC 达到了 90%，比 DeBERTa 分类器高出 14%。当集成到多模型路由系统中时，IntroLM 可实现卓越的性价比权衡，在匹配的可靠性下将延迟减少高达 33%，将大型模型使用率减少高达 50%。
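
The two quantities in the abstract above, ROC AUC for success prediction and large-model usage under routing, can be sketched as follows. The threshold rule and the example scores are hypothetical; the abstract does not describe how the routing system actually uses the predicted quality.

```python
def roc_auc(scores, labels):
    """Probability that a random success outranks a random failure (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def route(success_prob, threshold=0.5):
    """Send the query to the small model only if it is predicted to succeed."""
    return "small" if success_prob >= threshold else "large"


def large_model_usage(success_probs, threshold=0.5):
    """Fraction of queries that end up on the large model."""
    routed = [route(p, threshold) for p in success_probs]
    return routed.count("large") / len(routed)


# Hypothetical self-predicted success probabilities and true outcomes.
scores = [0.9, 0.4, 0.3, 0.6]
labels = [1, 1, 0, 0]
print(roc_auc(scores, labels))
print(large_model_usage(scores, threshold=0.5))
```

Raising the threshold sends more queries to the large model and lowering it sends fewer, so the quality of the success predictor determines how much large-model usage can be cut before reliability drops.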

Title: Mem-Gallery: Benchmarking Multimodal Long-Term Conversational Memory for MLLM Agents

Authors: Yuanchen Bei, Tianxin Wei, Xuying Ning, Yanjun Zhao, Zhining Liu, Xiao Lin, Yada Zhu, Hendrik Hamann, Jingrui He, Hanghang Tong
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.03515
Pdf URL: https://arxiv.org/pdf/2601.03515
Copy Paste: [[2601.03515]] Mem-Gallery: Benchmarking Multimodal Long-Term Conversational Memory for MLLM Agents(https://arxiv.org/abs/2601.03515)
Keywords: language model, llm, agent
Abstract: Long-term memory is a critical capability for multimodal large language model (MLLM) agents, particularly in conversational settings where information accumulates and evolves over time. However, existing benchmarks either evaluate multi-session memory in text-only conversations or assess multimodal understanding within localized contexts, failing to evaluate how multimodal memory is preserved, organized, and evolved across long-term conversational trajectories. Thus, we introduce Mem-Gallery, a new benchmark for evaluating multimodal long-term conversational memory in MLLM agents. Mem-Gallery features high-quality multi-session conversations grounded in both visual and textual information, with long interaction horizons and rich multimodal dependencies. Building on this dataset, we propose a systematic evaluation framework that assesses key memory capabilities along three functional dimensions: memory extraction and test-time adaptation, memory reasoning, and memory knowledge management. Extensive benchmarking across thirteen memory systems reveals several key findings, highlighting the necessity of explicit multimodal information retention and memory organization, the persistent limitations in memory reasoning and knowledge management, as well as the efficiency bottleneck of current models.
摘要：长期记忆是多模式大语言模型（MLLM）代理的一项关键能力，特别是在信息随着时间积累和演变的对话环境中。然而，现有的基准要么评估纯文本对话中的多会话记忆，要么评估本地化上下文中的多模态理解，未能评估多模态记忆如何在长期对话轨迹中保存、组织和发展。因此，我们引入了 Mem-Gallery，这是一个用于评估 MLLM 代理中多模式长期会话记忆的新基准。 Mem-Gallery 具有基于视觉和文本信息的高质量多会话对话，具有较长的交互范围和丰富的多模式依赖性。在此数据集的基础上，我们提出了一个系统评估框架，该框架沿着三个功能维度评估关键记忆能力：记忆提取和测试时间适应、记忆推理和记忆知识管理。对十三个记忆系统的广泛基准测试揭示了几个关键发现，强调了显式多模式信息保留和记忆组织的必要性、记忆推理和知识管理的持续局限性以及当前模型的效率瓶颈。

Title: PALM-Bench: A Comprehensive Benchmark for Personalized Audio-Language Models

Authors: Yuwen Wang, Xinyuan Qian, Tian-Hao Zhang, Jiaran Gao, Yuchen Pan, Xin Wang, Zhou Pan, Chen Wei, Yiming Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.03531
Pdf URL: https://arxiv.org/pdf/2601.03531
Copy Paste: [[2601.03531]] PALM-Bench: A Comprehensive Benchmark for Personalized Audio-Language Models(https://arxiv.org/abs/2601.03531)
Keywords: language model, prompt
Abstract: Large Audio-Language Models (LALMs) have demonstrated strong performance in audio understanding and generation. Yet, our extensive benchmarking reveals that their behavior is largely generic (e.g., summarizing spoken content) and fails to adequately support personalized question answering (e.g., summarizing what my best friend says). In contrast, humans condition their interpretation and decision-making on each individual's personal context. To bridge this gap, we formalize the task of Personalized LALMs (PALM) for recognizing personal concepts and reasoning within personal context. Moreover, we create the first benchmark (PALM-Bench) to foster methodological advances in PALM and enable structured evaluation on several tasks across multi-speaker scenarios. Our extensive experiments on representative open-source LALMs show that existing training-free prompting and supervised fine-tuning strategies, while yielding improvements, remain limited in modeling personalized knowledge and transferring it robustly across tasks. Data and code will be released.
摘要：大型音频语言模型（LALM）在音频理解和生成方面表现出了强大的性能。然而，我们广泛的基准测试表明，他们的行为很大程度上是通用的（例如，总结口头内容），并且无法充分支持个性化问答（例如，总结我最好的朋友所说的话）。相比之下，人类根据每个人的个人背景来调整他们的解释和决策。为了弥补这一差距，我们正式确定了个性化 LALM (PALM) 的任务，用于在个人背景下识别个人概念和推理。此外，我们创建了第一个基准 (PALM-Bench)，以促进 PALM 方法的进步，并支持对跨多说话者场景的多项任务进行结构化评估。我们对代表性开源 LALM 进行的广泛实验表明，现有的免训练提示和监督微调策略虽然产量有所提高，但在建模个性化知识并将其稳健地跨任务迁移方面仍然受到限制。数据和代码将被发布。

Title: Persona-aware and Explainable Bikeability Assessment: A Vision-Language Model Approach

Authors: Yilong Dai, Ziyi Wang, Chenguang Wang, Kexin Zhou, Yiheng Qian, Susu Xu, Xiang Yan
Subjects: cs.CL, cs.CV, cs.HC, cs.LG
Abstract URL: https://arxiv.org/abs/2601.03534
Pdf URL: https://arxiv.org/pdf/2601.03534
Copy Paste: [[2601.03534]] Persona-aware and Explainable Bikeability Assessment: A Vision-Language Model Approach(https://arxiv.org/abs/2601.03534)
Keywords: language model, chain-of-thought
Abstract: Bikeability assessment is essential for advancing sustainable urban transportation and creating cyclist-friendly cities, and it requires incorporating users' perceptions of safety and comfort. Yet existing perception-based bikeability assessment approaches face key limitations in capturing the complexity of road environments and adequately accounting for heterogeneity in subjective user perceptions. This paper proposes a persona-aware Vision-Language Model framework for bikeability assessment with three novel contributions: (i) theory-grounded persona conditioning based on established cyclist typology that generates persona-specific explanations via chain-of-thought reasoning; (ii) multi-granularity supervised fine-tuning that combines scarce expert-annotated reasoning with abundant user ratings for joint prediction and explainable assessment; and (iii) AI-enabled data augmentation that creates controlled paired data to isolate infrastructure variable impacts. To test and validate this framework, we developed a panoramic image-based crowdsourcing system and collected 12,400 persona-conditioned assessments from 427 cyclists. Experimental results show that the proposed framework offers competitive bikeability rating prediction while uniquely enabling explainable factor attribution.
摘要：自行车适宜性评估对于推进可持续城市交通和创建自行车友好型城市至关重要，它需要纳入用户对安全和舒适的看法。然而，现有的基于感知的可骑行性评估方法在捕捉道路环境的复杂性和充分考虑用户主观感知的异质性方面面临着关键的局限性。本文提出了一种用于可骑行性评估的角色感知视觉语言模型框架，具有三个新颖的贡献：（i）基于已建立的骑车人类型学的基于理论的角色调节，通过思想链推理生成特定于角色的解释； (ii) 多粒度监督微调，将稀缺的专家注释推理与丰富的用户评分相结合，以进行联合预测和可解释的评估； (iii) 支持人工智能的数据增强，创建受控配对数据以隔离基础设施变量影响。为了测试和验证这个框架，我们开发了一个基于全景图像的众包系统，并收集了 427 名骑自行车者的 12,400 份人物条件评估。实验结果表明，所提出的框架提供了有竞争力的可骑行性评级预测，同时独特地实现了可解释的因素归因。

Title: DeepSynth-Eval: Objectively Evaluating Information Consolidation in Deep Survey Writing

Authors: Hongzhi Zhang, Yuanze Hu, Tinghai Zhang, Jia Fu, Tao Wang, Junwei Jing, Zhaoxin Fan, Qi Wang, Ruiming Tang, Han Li, Guorui Zhou, Kun Gai
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.03540
Pdf URL: https://arxiv.org/pdf/2601.03540
Copy Paste: [[2601.03540]] DeepSynth-Eval: Objectively Evaluating Information Consolidation in Deep Survey Writing(https://arxiv.org/abs/2601.03540)
Keywords: language model, llm, hallucination, agent
Abstract: The evolution of Large Language Models (LLMs) towards autonomous agents has catalyzed progress in Deep Research. While retrieval capabilities are well-benchmarked, the post-retrieval synthesis stage--where agents must digest massive amounts of context and consolidate fragmented evidence into coherent, long-form reports--remains under-evaluated due to the subjectivity of open-ended writing. To bridge this gap, we introduce DeepSynth-Eval, a benchmark designed to objectively evaluate information consolidation capabilities. We leverage high-quality survey papers as gold standards, reverse-engineering research requests and constructing "Oracle Contexts" from their bibliographies to isolate synthesis from retrieval noise. We propose a fine-grained evaluation protocol using General Checklists (for factual coverage) and Constraint Checklists (for structural organization), transforming subjective judgment into verifiable metrics. Experiments across 96 tasks reveal that synthesizing information from hundreds of references remains a significant challenge. Our results demonstrate that agentic plan-and-write workflows significantly outperform single-turn generation, effectively reducing hallucinations and improving adherence to complex structural constraints.
摘要：大型语言模型 (LLM) 向自主代理的发展促进了深度研究的进展。虽然检索能力已进行了良好的基准测试，但由于开放式写作的主观性，检索后综合阶段（代理人必须消化大量上下文并将碎片证据整合成连贯的长篇报告）仍然被低估。为了弥补这一差距，我们引入了 DeepSynth-Eval，这是一个旨在客观评估信息整合能力的基准。我们利用高质量的调查论文作为黄金标准，对研究请求进行逆向工程，并从参考书目中构建“Oracle Contexts”，以将综合与检索噪声隔离开来。我们提出了一种使用通用检查表（用于事实覆盖）和约束检查表（用于结构组织）的细粒度评估协议，将主观判断转化为可验证的指标。 96 项任务的实验表明，从数百个参考文献中综合信息仍然是一项重大挑战。我们的结果表明，代理计划和写入工作流程显着优于单轮生成，有效减少幻觉并提高对复杂结构约束的遵守。
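
The checklist-based protocol described above can be illustrated with a minimal sketch. The abstract does not say how individual checklist verdicts are aggregated, so the unweighted fraction below is an assumption, and the names `ChecklistItem` and `checklist_score` are hypothetical rather than taken from the benchmark.

```python
from dataclasses import dataclass


@dataclass
class ChecklistItem:
    description: str
    satisfied: bool  # verdict supplied by a grader (human or model); assumed given


def checklist_score(items):
    """Unweighted fraction of satisfied items (assumed aggregation rule)."""
    if not items:
        return 0.0
    return sum(1 for item in items if item.satisfied) / len(items)


# Hypothetical verdicts for one generated survey.
general = [
    ChecklistItem("covers key finding A", True),
    ChecklistItem("covers key finding B", False),
]
constraint = [
    ChecklistItem("has the requested section structure", True),
]

print(checklist_score(general), checklist_score(constraint))
```

Scoring factual coverage and structural constraints separately, as in this sketch, keeps the two kinds of failure distinguishable in the reported numbers.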

Title: Layer-Order Inversion: Rethinking Latent Multi-Hop Reasoning in Large Language Models

Authors: Xukai Liu, Ye Liu, Jipeng Zhang, Yanghai Zhang, Kai Zhang, Qi Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.03542
Pdf URL: https://arxiv.org/pdf/2601.03542
Copy Paste: [[2601.03542]] Layer-Order Inversion: Rethinking Latent Multi-Hop Reasoning in Large Language Models(https://arxiv.org/abs/2601.03542)
Keywords: language model, llm, chain-of-thought
Abstract: Large language models (LLMs) perform well on multi-hop reasoning, yet how they internally compose multiple facts remains unclear. Recent work proposes the hop-aligned circuit hypothesis, suggesting that bridge entities are computed sequentially across layers before later-hop answers. Through systematic analyses on real-world multi-hop queries, we show that this hop-aligned assumption does not generalize: later-hop answer entities can become decodable earlier than bridge entities, a phenomenon we call layer-order inversion, which strengthens with total hops. To explain this behavior, we propose a probabilistic recall-and-extract framework that models multi-hop reasoning as broad probabilistic recall in shallow MLP layers followed by selective extraction in deeper attention layers. This framework is empirically validated through systematic probing analyses, reinterpreting prior layer-wise decoding evidence, explaining chain-of-thought gains, and providing a mechanistic diagnosis of multi-hop failures despite correct single-hop knowledge. Code is available at this https URL.
摘要：大型语言模型（LLM）在多跳推理方面表现良好，但它们内部如何组成多个事实仍不清楚。最近的工作提出了\emph{跳跃对齐电路假设}，建议在稍后的跳跃答案之前跨层顺序计算桥接实体。通过对现实世界的多跳查询的系统分析，我们表明这种跳对齐的假设并不能推广：后跳答案实体可以比桥接实体更早地被解码，我们将这种现象称为 \emph{层序反转}，这种现象随着总跳数的增加而增强。为了解释这种行为，我们提出了一个 \emph{概率召回和提取} 框架，该框架将多跳推理建模为浅层 MLP 层中的广泛概率召回，然后在更深的关注层中进行选择性提取。该框架通过系统的探测分析、重新解释先前的逐层解码证据、解释思想链增益以及在具有正确的单跳知识的情况下提供多跳故障的机械诊断来进行实证验证。代码可从此 https URL 获取。

Title: EvolMem: A Cognitive-Driven Benchmark for Multi-Session Dialogue Memory

Authors: Ye Shen, Dun Pei, Yiqiu Guo, Junying Wang, Yijin Guo, Zicheng Zhang, Qi Jia, Jun Zhou, Guangtao Zhai
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.03543
Pdf URL: https://arxiv.org/pdf/2601.03543
Copy Paste: [[2601.03543]] EvolMem: A Cognitive-Driven Benchmark for Multi-Session Dialogue Memory(https://arxiv.org/abs/2601.03543)
Keywords: language model, llm, agent
Abstract: Despite recent advances in understanding and leveraging long-range conversational memory, existing benchmarks still lack systematic evaluation of large language models (LLMs) across diverse memory dimensions, particularly in multi-session settings. In this work, we propose EvolMem, a new benchmark for assessing multi-session memory capabilities of LLMs and agent systems. EvolMem is grounded in cognitive psychology and encompasses both declarative and non-declarative memory, further decomposed into multiple fine-grained abilities. To construct the benchmark, we introduce a hybrid data synthesis framework that consists of topic-initiated generation and narrative-inspired transformations. This framework enables scalable generation of multi-session conversations with controllable complexity, accompanied by sample-specific evaluation guidelines. Extensive evaluation reveals that no LLM consistently outperforms others across all memory dimensions. Moreover, agent memory mechanisms do not necessarily enhance LLMs' capabilities and often exhibit notable efficiency limitations. Data and code will be released at this https URL.
摘要：尽管最近在理解和利用远程会话记忆方面取得了进展，但现有基准仍然缺乏对跨不同记忆维度的大型语言模型（LLM）的系统评估，特别是在多会话设置中。在这项工作中，我们提出了 EvolMem，一个用于评估法学硕士和代理系统的多会话内存能力的新基准。 EvolMem 以认知心理学为基础，涵盖陈述性记忆和非陈述性记忆，并进一步分解为多种细粒度的能力。为了构建基准，我们引入了一个混合数据合成框架，该框架由主题发起的生成和叙事启发的转换组成。该框架能够以可扩展的方式生成具有可控复杂性的多会话对话，并附有特定于样本的评估指南。广泛的评估表明，没有哪个法学硕士在所有记忆维度上始终优于其他法学硕士。此外，代理记忆机制不一定能增强法学硕士的能力，而且常常表现出明显的效率限制。数据和代码将在此 https URL 发布。

Title: Value-Action Alignment in Large Language Models under Privacy-Prosocial Conflict

Authors: Guanyu Chen, Chenxiao Yu, Xiyang Hu
Subjects: cs.CL, cs.AI, cs.HC, cs.LG
Abstract URL: https://arxiv.org/abs/2601.03546
Pdf URL: https://arxiv.org/pdf/2601.03546
Copy Paste: [[2601.03546]] Value-Action Alignment in Large Language Models under Privacy-Prosocial Conflict(https://arxiv.org/abs/2601.03546)
Keywords: language model, llm
Abstract: Large language models (LLMs) are increasingly used to simulate decision-making tasks involving personal data sharing, where privacy concerns and prosocial motivations can push choices in opposite directions. Existing evaluations often measure privacy-related attitudes or sharing intentions in isolation, which makes it difficult to determine whether a model's expressed values jointly predict its downstream data-sharing actions as in real human behaviors. We introduce a context-based assessment protocol that sequentially administers standardized questionnaires for privacy attitudes, prosocialness, and acceptance of data sharing within a bounded, history-carrying session. To evaluate value-action alignments under competing attitudes, we use multi-group structural equation modeling (MGSEM) to identify relations from privacy concerns and prosocialness to data sharing. We propose Value-Action Alignment Rate (VAAR), a human-referenced directional agreement metric that aggregates path-level evidence for expected signs. Across multiple LLMs, we observe stable but model-specific Privacy-PSA-AoDS profiles, and substantial heterogeneity in value-action alignment.
摘要：大语言模型（LLM）越来越多地用于模拟涉及个人数据共享的决策任务，其中隐私问题和亲社会动机可能会将选择推向相反的方向。现有的评估通常孤立地衡量与隐私相关的态度或共享意图，这使得很难确定模型所表达的价值观是否能够像真实人类行为一样共同预测其下游数据共享行为。我们引入了一种基于情境的评估协议，该协议在有界的、承载历史的会话中依次管理关于隐私态度、亲社会性和数据共享接受度的标准化调查问卷。为了评估竞争态度下的价值-行动一致性，我们使用多组结构方程模型（MGSEM）来识别从隐私问题和亲社会性到数据共享的关系。我们提出了价值-行动一致性率（VAAR），这是一种人类参考的方向一致性指标，可以聚合预期迹象的路径级证据。在多个法学硕士中，我们观察到稳定但特定于模型的隐私-PSA-AoDS 配置文件，以及价值-行动一致性方面的显着异质性。
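
To make the idea of a directional agreement metric concrete, here is a small sketch. The abstract does not give the exact VAAR formula, so this version is an assumption: it counts the share of structural paths whose estimated coefficient has the sign expected from the human reference. The function name and the path labels are hypothetical.

```python
def directional_agreement_rate(estimated, expected_signs):
    """Share of paths whose estimated coefficient matches the expected sign.

    estimated:      dict mapping path name -> estimated path coefficient
    expected_signs: dict mapping path name -> +1 or -1 (human-referenced sign)
    Only paths present in both dicts are scored; a zero coefficient counts
    as a miss. This is an illustrative rule, not the paper's definition.
    """
    paths = [p for p in expected_signs if p in estimated]
    if not paths:
        return 0.0
    hits = sum(1 for p in paths if estimated[p] * expected_signs[p] > 0)
    return hits / len(paths)


# Hypothetical path coefficients for one model.
expected = {"privacy->sharing": -1, "prosocial->sharing": +1}
aligned = {"privacy->sharing": -0.4, "prosocial->sharing": 0.3}
misaligned = {"privacy->sharing": 0.2, "prosocial->sharing": 0.3}

print(directional_agreement_rate(aligned, expected))
print(directional_agreement_rate(misaligned, expected))
```

A sign-based rule like this ignores effect magnitude and statistical significance; the published metric may well account for either.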

Title: Evaluating LLMs for Police Decision-Making: A Framework Based on Police Action Scenarios

Authors: Sangyub Lee, Heedou Kim, Hyeoncheol Kim
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.03553
Pdf URL: https://arxiv.org/pdf/2601.03553
Copy Paste: [[2601.03553]] Evaluating LLMs for Police Decision-Making: A Framework Based on Police Action Scenarios(https://arxiv.org/abs/2601.03553)
Keywords: language model, llm, prompt
Abstract: The use of Large Language Models (LLMs) in police operations is growing, yet an evaluation framework tailored to police operations remains absent. While LLMs' responses may not always be legally incorrect, their unverified use can still lead to severe issues such as unlawful arrests and improper evidence collection. To address this, we propose PAS (Police Action Scenarios), a systematic framework covering the entire evaluation process. Applying this framework, we constructed a novel QA dataset from over 8,000 official documents and established key metrics validated through statistical analysis with police expert judgements. Experimental results show that commercial LLMs struggle with our new police-related tasks, particularly in providing fact-based recommendations. This study highlights the necessity of an expandable evaluation framework to ensure reliable AI-driven police operations. We release our data and prompt template.
摘要：大型语言模型（LLM）在警察行动中的使用越来越多，但仍然缺乏适合警察行动的评估框架。虽然法学硕士的回应可能并不总是在法律上不正确，但未经证实的使用仍然可能导致非法逮捕和不当证据收集等严重问题。为了解决这个问题，我们提出了PAS（警察行动场景），这是一个涵盖整个评估过程的系统框架。应用这个框架，我们从 8,000 多个官方文件中构建了一个新颖的 QA 数据集，并建立了通过统计分析和警方专家判断进行验证的关键指标。实验结果表明，商业法学硕士很难完成与警察相关的新任务，特别是在提供基于事实的建议方面。这项研究强调了可扩展评估框架的必要性，以确保人工智能驱动的警察行动可靠。我们发布了我们的数据和提示模板。

Title: DiffCoT: Diffusion-styled Chain-of-Thought Reasoning in LLMs

Authors: Shidong Cao, Hongzhan Lin, Yuxuan Gu, Ziyang Luo, Jing Ma
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.03559
Pdf URL: https://arxiv.org/pdf/2601.03559
Copy Paste: [[2601.03559]] DiffCoT: Diffusion-styled Chain-of-Thought Reasoning in LLMs(https://arxiv.org/abs/2601.03559)
Keywords: language model, llm, chain-of-thought
Abstract: Chain-of-Thought (CoT) reasoning improves multi-step mathematical problem solving in large language models but remains vulnerable to exposure bias and error accumulation, as early mistakes propagate irreversibly through autoregressive decoding. In this work, we propose DiffCoT, a diffusion-styled CoT framework that reformulates CoT reasoning as an iterative denoising process. DiffCoT integrates diffusion principles at the reasoning-step level via a sliding-window mechanism, enabling unified generation and retrospective correction of intermediate steps while preserving token-level autoregression. To maintain causal consistency, we further introduce a causal diffusion noise schedule that respects the temporal structure of reasoning chains. Extensive experiments on three multi-step CoT reasoning benchmarks across diverse model backbones demonstrate that DiffCoT consistently outperforms existing CoT preference optimization methods, yielding improved robustness and error-correction capability in CoT reasoning.
摘要：思想链 (CoT) 推理改进了大型语言模型中的多步骤数学问题解决，但仍然容易受到暴露偏差和错误累积的影响，因为早期错误通过自回归解码不可逆地传播。在这项工作中，我们提出了 DiffCoT，一种扩散式的 CoT 框架，它将 CoT 推理重新表述为迭代去噪过程。 DiffCoT 通过滑动窗口机制在推理步骤级别集成扩散原理，实现中间步骤的统一生成和追溯校正，同时保留 token 级别的自回归。为了保持因果一致性，我们进一步引入了一种尊重推理链时间结构的因果扩散噪声表。对不同模型主干的三个多步骤 CoT 推理基准进行的广泛实验表明，DiffCoT 始终优于现有的 CoT 偏好优化方法，从而提高了 CoT 推理的鲁棒性和纠错能力。

Title: How Do Large Language Models Learn Concepts During Continual Pre-Training?

Authors: Barry Menglong Yao (1), Sha Li (2), Yunzhi Yao (3), Minqian Liu (2), Zaishuo Xia (1), Qifan Wang (4), Lifu Huang (1) ((1) UC Davis, (2) Virginia Tech, (3) UCLA, (4) Meta AI)
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.03570
Pdf URL: https://arxiv.org/pdf/2601.03570
Copy Paste: [[2601.03570]] How Do Large Language Models Learn Concepts During Continual Pre-Training?(https://arxiv.org/abs/2601.03570)
Keywords: language model, llm
Abstract: Human beings primarily understand the world through concepts (e.g., dog), abstract mental representations that structure perception, reasoning, and learning. However, how large language models (LLMs) acquire, retain, and forget such concepts during continual pretraining remains poorly understood. In this work, we study how individual concepts are acquired and forgotten, as well as how multiple concepts interact through interference and synergy. We link these behavioral dynamics to LLMs' internal Concept Circuits, computational subgraphs associated with specific concepts, and incorporate Graph Metrics to characterize circuit structure. Our analysis reveals: (1) LLM concept circuits provide a non-trivial, statistically significant signal of concept learning and forgetting; (2) concept circuits exhibit a stage-wise temporal pattern during continual pretraining, with an early increase followed by gradual decrease and stabilization; (3) concepts with larger learning gains tend to exhibit greater forgetting under subsequent training; (4) semantically similar concepts induce stronger interference than weakly related ones; (5) conceptual knowledge differs in its transferability, with some concepts significantly facilitating the learning of others. Together, our findings offer a circuit-level view of concept learning dynamics and inform the design of more interpretable and robust concept-aware training strategies for LLMs.
摘要：人类主要通过概念（例如狗）、构成感知、推理和学习的抽象心理表征来理解世界。然而，人们对大型语言模型（LLM）如何在持续的预训练过程中获取、保留和忘记这些概念仍然知之甚少。在这项工作中，我们研究单个概念如何获得和遗忘，以及多个概念如何通过干扰和协同作用相互作用。我们将这些行为动态与法学硕士的内部概念电路、与特定概念相关的计算子图联系起来，并结合图形指标来表征电路结构。我们的分析表明：（1）法学硕士概念回路提供了一个重要的、统计上显着的概念学习和遗忘信号；（2）概念电路在持续预训练过程中表现出阶段性的时间模式，先增加后逐渐减少并稳定；（3）学习收益较大的概念在后续训练中往往表现出较大的遗忘；（4）语义相似的概念比弱相关的概念产生更强的干扰； (5) 概念性知识的可迁移性不同，有些知识可以显着促进其他人的学习。总之，我们的研究结果提供了概念学习动态的电路级视图，并为法学硕士设计更具可解释性和稳健的概念意识培训策略提供了信息。

Title: PsychEthicsBench: Evaluating Large Language Models Against Australian Mental Health Ethics

Authors: Yaling Shen, Stephanie Fong, Yiwen Jiang, Zimu Wang, Feilong Tang, Qingyang Xu, Xiangyu Zhao, Zhongxing Xu, Jiahe Liu, Jinpeng Hu, Dominic Dwyer, Zongyuan Ge
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.03578
Pdf URL: https://arxiv.org/pdf/2601.03578
Copy Paste: [[2601.03578]] PsychEthicsBench: Evaluating Large Language Models Against Australian Mental Health Ethics(https://arxiv.org/abs/2601.03578)
Keywords: language model, llm
Abstract: The increasing integration of large language models (LLMs) into mental health applications necessitates robust frameworks for evaluating professional safety alignment. Current evaluative approaches primarily rely on refusal-based safety signals, which offer limited insight into the nuanced behaviors required in clinical practice. In mental health, clinically inadequate refusals can be perceived as unempathetic and discourage help-seeking. To address this gap, we move beyond refusal-centric metrics and introduce PsychEthicsBench, the first principle-grounded benchmark based on Australian psychology and psychiatry guidelines, designed to evaluate LLMs' ethical knowledge and behavioral responses through multiple-choice and open-ended tasks with fine-grained ethicality annotations. Empirical results across 14 models show that refusal rates are poor indicators of ethical behavior, revealing a significant divergence between safety triggers and clinical appropriateness. Notably, we find that domain-specific fine-tuning can degrade ethical robustness, as several specialized models underperform their base backbones in ethical alignment. PsychEthicsBench provides a foundation for systematic, jurisdiction-aware evaluation of LLMs in mental health, encouraging more responsible development in this domain.
摘要：大语言模型 (LLM) 越来越多地融入心理健康应用，需要强大的框架来评估专业安全一致性。目前的评估方法主要依赖于基于拒绝的安全信号，这对临床实践中所需的细微行为的洞察力有限。在心理健康方面，临床上不充分的拒绝可能会被视为缺乏同理心，并阻碍寻求帮助。为了解决这一差距，我们超越了以拒绝为中心的指标，并引入了 \texttt{PsychEthicsBench}，这是第一个基于澳大利亚心理学和精神病学指南的基于原则的基准，旨在通过带有细粒度道德注释的多项选择和开放式任务来评估法学硕士的道德知识和行为反应。 14 个模型的实证结果表明，拒绝率是道德行为的不良指标，揭示了安全触发因素和临床适当性之间的显着差异。值得注意的是，我们发现特定领域的微调可能会降低道德稳健性，因为一些专门的模型在道德一致性方面表现不佳。 PsychEthicsBench 为心理健康法学硕士的系统性、司法管辖区意识评估奠定了基础，鼓励该领域更加负责任的发展。

Title: OLA: Output Language Alignment in Code-Switched LLM Interactions

Authors: Juhyun Oh, Haneul Yoo, Faiz Ghifari Haznitrama, Alice Oh
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.03589
Pdf URL: https://arxiv.org/pdf/2601.03589
Copy Paste: [[2601.03589]] OLA: Output Language Alignment in Code-Switched LLM Interactions(https://arxiv.org/abs/2601.03589)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Code-switching, alternating between languages within a conversation, is natural for multilingual users, yet poses fundamental challenges for large language models (LLMs). When a user code-switches in their prompt to an LLM, they typically do not specify the expected language of the LLM response, and thus LLMs must infer the output language from contextual and pragmatic cues. We find that current LLMs systematically fail to align with this expectation, responding in undesired languages even when cues are clear to humans. We introduce OLA, a benchmark to evaluate LLMs' Output Language Alignment in code-switched interactions. OLA focuses on Korean-English code-switching and spans cases from simple intra-sentential mixing to instruction-content mismatches. Even frontier models frequently misinterpret implicit language expectations, exhibiting a bias toward non-English responses. We further show this bias generalizes beyond Korean to Chinese and Indonesian pairs. Models also show instability through mid-response switching and language intrusions. Chain-of-Thought prompting fails to resolve these errors, indicating weak pragmatic reasoning about output language. However, Code-Switching Aware DPO with minimal data (about 1K examples) substantially reduces misalignment, suggesting these failures stem from insufficient alignment rather than fundamental limitations. Our results highlight the need to align multilingual LLMs with users' implicit expectations in real-world code-switched interactions.
摘要：语码转换（即对话中不同语言之间的交替）对于多语言用户来说是很自然的事情，但对大型语言模型 (LLM) 提出了根本性的挑战。当用户在提示中将代码切换到 LLM 时，他们通常不会指定 LLM 响应的预期语言，因此 LLM 必须根据上下文和语用提示推断输出语言。我们发现，当前的法学硕士系统性地未能满足这一期望，即使线索对人类来说很清楚，也会以不受欢迎的语言做出回应。我们引入 OLA，这是一个评估法学硕士在代码交换交互中的输出语言对齐的基准。 OLA 专注于韩语-英语语码转换，涵盖简单的句子内混合到指令内容不匹配。即使是前沿模型也经常会误解隐含的语言期望，表现出对非英语反应的偏见。我们进一步表明，这种偏见不仅普遍存在于韩国人中，还普遍存在于中国人和印度尼西亚人中。模型还通过中间响应切换和语言侵入表现出不稳定性。思维链提示无法解决这些错误，表明有关输出语言的实用推理较弱。然而，具有最少数据（大约 1K 个示例）的 Code-Switching Aware DPO 大大减少了错位，这表明这些失败源于对齐不足，而不是根本限制。我们的结果强调需要将多语言法学硕士与用户在现实世界的语码转换交互中的隐含期望保持一致。

Title: From Chains to Graphs: Self-Structured Reasoning for General-Domain LLMs

Authors: Yingjian Chen, Haoran Liu, Yinhong Liu, Sherry T. Tong, Aosong Feng, Jinghui Lu, Juntao Zhang, Yusuke Iwasawa, Yutaka Matsuo, Irene Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.03597
Pdf URL: https://arxiv.org/pdf/2601.03597
Copy Paste: [[2601.03597]] From Chains to Graphs: Self-Structured Reasoning for General-Domain LLMs(https://arxiv.org/abs/2601.03597)
Keywords: language model, gpt, llm, chain-of-thought
Abstract: Large Language Models (LLMs) show strong reasoning ability in open-domain question answering, yet their reasoning processes are typically linear and often logically inconsistent. In contrast, real-world reasoning requires integrating multiple premises and solving subproblems in parallel. Existing methods, such as Chain-of-Thought (CoT), express reasoning in a linear textual form, which may appear coherent but frequently leads to inconsistent conclusions. Recent approaches rely on externally provided graphs and do not explore how LLMs can construct and use their own graph-structured reasoning, particularly in open-domain QA. To fill this gap, we explore graph-structured reasoning of LLMs in general-domain question answering. We propose Self-Graph Reasoning (SGR), a framework that enables LLMs to explicitly represent their reasoning process as a structured graph before producing the final answer. We further construct a graph-structured reasoning dataset that merges multiple candidate reasoning graphs into refined graph structures for model training. Experiments on five QA benchmarks across both general and specialized domains show that SGR consistently improves reasoning consistency and yields a 17.74% gain over the base model. The LLaMA-3.3-70B model fine-tuned with SGR performs comparably to GPT-4o and surpasses Claude-3.5-Haiku, demonstrating the effectiveness of graph-structured reasoning.
摘要：大型语言模型（LLM）在开放域问答中表现出强大的推理能力，但其推理过程通常是线性的，并且逻辑上常常不一致。相比之下，现实世界的推理需要整合多个前提并并行解决子问题。现有的方法，例如思想链（CoT），以线性文本形式表达推理，这可能看起来连贯，但经常导致不一致的结论。最近的方法依赖于外部提供的图，并且没有探索法学硕士如何构建和使用自己的图结构推理，特别是在开放领域的 QA 中。为了填补这一空白，我们新颖地探索了法学硕士在通用领域问答中的图结构推理。我们提出了自图推理（SGR），这是一个框架，使法学硕士能够在产生最终答案之前将其推理过程明确表示为结构化图。我们进一步构建了一个图结构推理数据集，将多个候选推理图合并为细化的图结构以进行模型训练。在通用和专业领域的五个 QA 基准上进行的实验表明，SGR 持续提高了推理一致性，并比基本模型提高了 17.74%。使用 SGR 进行微调的 LLaMA-3.3-70B 模型的性能与 GPT-4o 相当，并超过 Claude-3.5-Haiku，证明了图结构推理的有效性。

Title: DiVA: Fine-grained Factuality Verification with Agentic-Discriminative Verifier

Authors: Hui Huang, Muyun Yang, Yuki Arase
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.03605
Pdf URL: https://arxiv.org/pdf/2601.03605
Copy Paste: [[2601.03605]] DiVA: Fine-grained Factuality Verification with Agentic-Discriminative Verifier(https://arxiv.org/abs/2601.03605)
Keywords: language model, llm, agent
Abstract: Despite the significant advancements of Large Language Models (LLMs), their factuality remains a critical challenge, fueling growing interest in factuality verification. Existing research on factuality verification primarily conducts binary judgments (e.g., correct or incorrect), which fails to distinguish varying degrees of error severity. This limits its utility for applications such as fine-grained evaluation and preference optimization. To bridge this gap, we propose the Agentic Discriminative Verifier (DiVA), a hybrid framework that synergizes the agentic search capabilities of generative models with the precise scoring aptitude of discriminative models. We also construct a new benchmark, FGVeriBench, as a robust testbed for fine-grained factuality verification. Experimental results on FGVeriBench demonstrate that our DiVA significantly outperforms existing methods on factuality verification for both general and multi-hop questions.
摘要：尽管大型语言模型（LLM）取得了显着进步，但其真实性仍然是一个严峻的挑战，这激发了人们对真实性验证日益增长的兴趣。现有的事实验证研究主要进行二元判断（例如正确或错误），无法区分不同程度的错误严重程度。这限制了它在细粒度评估和偏好优化等应用中的实用性。为了弥补这一差距，我们提出了代理判别验证器（DiVA），这是一种混合框架，它将生成模型的代理搜索能力与判别模型的精确评分能力相结合。我们还构建了一个新的基准 FGVeriBench，作为细粒度事实验证的强大测试平台。 FGVeriBench 上的实验结果表明，我们的 DiVA 在一般问题和多跳问题的事实性验证方面显着优于现有方法。

Title: Analyzing Reasoning Shifts in Audio Deepfake Detection under Adversarial Attacks: The Reasoning Tax versus Shield Bifurcation

Authors: Binh Nguyen, Thai Le
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2601.03615
Pdf URL: https://arxiv.org/pdf/2601.03615
Copy Paste: [[2601.03615]] Analyzing Reasoning Shifts in Audio Deepfake Detection under Adversarial Attacks: The Reasoning Tax versus Shield Bifurcation(https://arxiv.org/abs/2601.03615)
Keywords: language model
Abstract: Audio Language Models (ALMs) offer a promising shift towards explainable audio deepfake detections (ADDs), moving beyond black-box classifiers by providing some level of transparency into their predictions via reasoning traces. This necessitates a new class of model robustness analysis: robustness of the predictive reasoning under adversarial attacks, which goes beyond the existing paradigm that mainly focuses on shifts in the final predictions (e.g., fake vs. real). To analyze such reasoning shifts, we introduce a forensic auditing framework to evaluate the robustness of ALMs' reasoning under adversarial attacks in three inter-connected dimensions: acoustic perception, cognitive coherence, and cognitive dissonance. Our systematic analysis reveals that explicit reasoning does not universally enhance robustness. Instead, we observe a bifurcation: for models exhibiting robust acoustic perception, reasoning acts as a defensive "shield", protecting them from adversarial attacks. However, for others, it imposes a performance "tax", particularly under linguistic attacks, which reduce cognitive coherence and increase attack success rate. Crucially, even when classification fails, high cognitive dissonance can serve as a silent alarm, flagging potential manipulation. Overall, this work provides a critical evaluation of the role of reasoning in forensic audio deepfake analysis and its vulnerabilities.
摘要：音频语言模型（ALM）为可解释的音频深度伪造检测（ADD）提供了一个有前途的转变，通过推理轨迹为预测提供一定程度的透明度，超越了 \textit{black-box} 分类器。这就需要一类新的模型鲁棒性分析：对抗性攻击下预测推理的鲁棒性，这超出了主要关注最终预测变化（例如，假与真）的现有范式。为了分析这种推理转变，我们引入了一个取证审计框架，从三个相互关联的维度来评估 ALM 在对抗性攻击下推理的稳健性：声学感知、认知一致性和认知失调。我们的系统分析表明，显式推理并不能普遍增强鲁棒性。相反，我们观察到一个分歧：对于表现出强大声学感知的模型，推理充当防御性 \textit{``shield''}，保护它们免受对抗性攻击。然而，对于其他人来说，它会施加性能 \textit{``tax''}，特别是在降低认知连贯性并提高攻击成功率的语言攻击下。至关重要的是，即使分类失败，高度认知失调也可以作为\textit{无声警报}，标记潜在的操纵。总体而言，这项工作对推理在法医音频深度伪造分析及其漏洞中的作用提供了批判性评估。

Title: Evaluating the Pre-Consultation Ability of LLMs using Diagnostic Guidelines

Authors: Jean Seo, Gibaeg Kim, Kihun Shin, Seungseop Lim, Hyunkyung Lee, Wooseok Han, Jongwon Lee, Eunho Yang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.03627
Pdf URL: https://arxiv.org/pdf/2601.03627
Copy Paste: [[2601.03627]] Evaluating the Pre-Consultation Ability of LLMs using Diagnostic Guidelines(https://arxiv.org/abs/2601.03627)
Keywords: llm
Abstract: We introduce EPAG, a benchmark dataset and framework designed for Evaluating the Pre-consultation Ability of LLMs using diagnostic Guidelines. LLMs are evaluated directly through HPI-diagnostic guideline comparison and indirectly through disease diagnosis. In our experiments, we observe that small open-source models fine-tuned with a well-curated, task-specific dataset can outperform frontier LLMs in pre-consultation. Additionally, we find that an increased amount of HPI (History of Present Illness) does not necessarily lead to improved diagnostic performance. Further experiments reveal that the language of pre-consultation influences the characteristics of the dialogue. By open-sourcing our dataset and evaluation pipeline on this https URL, we aim to contribute to the evaluation and further development of LLM applications in real-world clinical settings.
摘要：我们介绍 EPAG，这是一个基准数据集和框架，旨在使用诊断指南评估法学硕士的预咨询能力。法学硕士通过 HPI 诊断指南比较直接评估，并通过疾病诊断间接评估。在我们的实验中，我们观察到，使用精心策划的特定任务数据集进行微调的小型开源模型可以在预咨询中胜过前沿法学硕士。此外，我们发现 HPI（现病史）数量的增加并不一定会提高诊断性能。进一步的实验表明，预咨询的语言会影响对话的特征。通过在此 https URL 上开源我们的数据集和评估管道，我们的目标是为现实临床环境中法学硕士应用程序的评估和进一步开发做出贡献。

Title: Reasoning Model Is Superior LLM-Judge, Yet Suffers from Biases

Authors: Hui Huang, Xuanxin Wu, Muyun Yang, Yuki Arase
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.03630
Pdf URL: https://arxiv.org/pdf/2601.03630
Copy Paste: [[2601.03630]] Reasoning Model Is Superior LLM-Judge, Yet Suffers from Biases(https://arxiv.org/abs/2601.03630)
Keywords: llm, prompt
Abstract: This paper presents the first systematic comparison investigating whether Large Reasoning Models (LRMs) are superior judges to non-reasoning LLMs. Our empirical analysis yields four key findings: 1) LRMs outperform non-reasoning LLMs in terms of judgment accuracy, particularly on reasoning-intensive tasks; 2) LRMs demonstrate superior instruction-following capabilities in evaluation contexts; 3) LRMs exhibit enhanced robustness against adversarial attacks targeting judgment tasks; 4) however, LRMs still exhibit strong biases in superficial quality. To improve the robustness against biases, we propose PlanJudge, an evaluation strategy that prompts the model to generate an explicit evaluation plan before execution. Despite its simplicity, our experiments demonstrate that PlanJudge significantly mitigates biases in both LRMs and standard LLMs.
摘要：本文首次系统比较了大型推理模型 (LRM) 是否优于非推理法学硕士。我们的实证分析得出了四个关键发现：1）LRM 在判断准确性方面优于非推理 LLM，尤其是在推理密集型任务上； 2) LRM 在评估环境中表现出卓越的指令遵循能力； 3）LRM对针对判断任务的对抗性攻击表现出增强的鲁棒性； 4) 然而，LRM 在表面质量上仍然表现出强烈的偏差。为了提高针对偏差的鲁棒性，我们提出了 PlanJudge，这是一种评估策略，可促使模型在执行前生成明确的评估计划。尽管它很简单，但我们的实验表明 PlanJudge 显着减轻了 LRM 和标准 LLM 中的偏差。

Title: Agent-Dice: Disentangling Knowledge Updates via Geometric Consensus for Agent Continual Learning

Authors: Zheng Wu, Xingyu Lou, Xinbei Ma, Yansi Li, Weiwen Liu, Weinan Zhang, Jun Wang, Zhuosheng Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.03641
Pdf URL: https://arxiv.org/pdf/2601.03641
Copy Paste: [[2601.03641]] Agent-Dice: Disentangling Knowledge Updates via Geometric Consensus for Agent Continual Learning(https://arxiv.org/abs/2601.03641)
Keywords: language model, llm, agent
Abstract: Large Language Model (LLM)-based agents significantly extend the utility of LLMs by interacting with dynamic environments. However, enabling agents to continually learn new tasks without catastrophic forgetting remains a critical challenge, known as the stability-plasticity dilemma. In this work, we argue that this dilemma fundamentally arises from the failure to explicitly distinguish between common knowledge shared across tasks and conflicting knowledge introduced by task-specific interference. To address this, we propose Agent-Dice, a parameter fusion framework based on directional consensus evaluation. Concretely, Agent-Dice disentangles knowledge updates through a two-stage process: geometric consensus filtering to prune conflicting gradients, and curvature-based importance weighting to amplify shared semantics. We provide a rigorous theoretical analysis that establishes the validity of the proposed fusion scheme and offers insight into the origins of the stability-plasticity dilemma. Extensive experiments on GUI agents and tool-use agent domains demonstrate that Agent-Dice exhibits outstanding continual learning performance with minimal computational overhead and parameter updates.
摘要：基于大型语言模型 (LLM) 的代理通过与动态环境交互，显着扩展了 LLM 的实用性。然而，使智能体能够不断学习新任务而不会发生灾难性遗忘仍然是一个严峻的挑战，即所谓的稳定性-可塑性困境。在这项工作中，我们认为这种困境从根本上来说是由于未能明确区分跨任务共享的常识和特定任务干扰引入的冲突知识。为了解决这个问题，我们提出了 Agent-Dice，一种基于定向共识评估的参数融合框架。具体来说，Agent-Dice 通过两个阶段的过程来理清知识更新：几何一致性过滤以修剪冲突的梯度，以及基于曲率的重要性加权以放大共享语义。我们提供了严格的理论分析，证明了所提出的融合方案的有效性，并深入了解了稳定性-可塑性困境的起源。对 GUI 代理和工具使用代理领域的大量实验表明，Agent-Dice 以最小的计算开销和参数更新表现出出色的持续学习性能。
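
The two-stage fusion described above (consensus filtering, then importance weighting) can be sketched on plain lists of parameter updates. This is an illustrative stand-in, not the authors' algorithm: the consensus rule here is a simple sign vote by summed mass, the importance weights are passed in rather than derived from curvature, and `consensus_merge` is a hypothetical name.

```python
def consensus_merge(deltas, importance=None):
    """Merge per-task parameter updates with a sign-consensus filter.

    deltas:     list of equal-length lists, one update vector per task
    importance: optional weights with the same shape as deltas
                (stand-in for curvature-based importance; assumed given)
    For each parameter, updates that disagree with the dominant direction
    are pruned, and the survivors are averaged using the importance weights.
    """
    n_params = len(deltas[0])
    if importance is None:
        importance = [[1.0] * n_params for _ in deltas]
    merged = []
    for j in range(n_params):
        total = sum(d[j] for d in deltas)
        direction = (total > 0) - (total < 0)  # +1, -1, or 0 if updates cancel
        kept = [(d[j], w[j]) for d, w in zip(deltas, importance)
                if d[j] * direction > 0]
        weight_sum = sum(w for _, w in kept)
        if weight_sum > 0:
            merged.append(sum(v * w for v, w in kept) / weight_sum)
        else:
            merged.append(0.0)  # no consensus: leave the parameter unchanged
    return merged


# Two hypothetical task updates over two parameters.
task_deltas = [[1.0, -2.0], [3.0, 1.0]]
print(consensus_merge(task_deltas))
print(consensus_merge(task_deltas, importance=[[1.0, 1.0], [3.0, 1.0]]))
```

In the second parameter the two tasks disagree, so only the update matching the dominant direction survives; in the first they agree and are averaged.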

Title: LLM-MC-Affect: LLM-Based Monte Carlo Modeling of Affective Trajectories and Latent Ambiguity for Interpersonal Dynamic Insight

Authors: Yu-Zheng Lin, Bono Po-Jen Shih, John Paul Martin Encinas, Elizabeth Victoria Abraham Achom, Karan Himanshu Patel, Jesus Horacio Pacheco, Sicong Shao, Jyotikrishna Dass, Soheil Salehi, Pratik Satam
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2601.03645
Pdf URL: https://arxiv.org/pdf/2601.03645
Copy Paste: [[2601.03645]] LLM-MC-Affect: LLM-Based Monte Carlo Modeling of Affective Trajectories and Latent Ambiguity for Interpersonal Dynamic Insight(https://arxiv.org/abs/2601.03645)
Keywords: llm
Abstract: Emotional coordination is a core property of human interaction that shapes how relational meaning is constructed in real time. While text-based affect inference has become increasingly feasible, prior approaches often treat sentiment as a deterministic point estimate for individual speakers, failing to capture the inherent subjectivity, latent ambiguity, and sequential coupling found in mutual exchanges. We introduce LLM-MC-Affect, a probabilistic framework that characterizes emotion not as a static label, but as a continuous latent probability distribution defined over an affective space. By leveraging stochastic LLM decoding and Monte Carlo estimation, the methodology approximates these distributions to derive high-fidelity sentiment trajectories that explicitly quantify both central affective tendencies and perceptual ambiguity. These trajectories enable a structured analysis of interpersonal coupling through sequential cross-correlation and slope-based indicators, identifying leading or lagging influences between interlocutors. To validate the interpretive capacity of this approach, we utilize teacher-student instructional dialogues as a representative case study, where our quantitative indicators successfully distill high-level interaction insights such as effective scaffolding. This work establishes a scalable and deployable pathway for understanding interpersonal dynamics, offering a generalizable solution that extends beyond education to broader social and behavioral research.
摘要：情感协调是人类互动的核心属性，它决定了关系意义的实时构建方式。虽然基于文本的情感推断变得越来越可行，但先前的方法通常将情感视为个体说话者的确定性点估计，未能捕捉到相互交流中固有的主观性、潜在的模糊性和顺序耦合。我们引入了 LLM-MC-Affect，这是一种概率框架，它将情感不是静态标签，而是在情感空间上定义的连续潜在概率分布。通过利用随机 LLM 解码和蒙特卡罗估计，该方法近似这些分布，以得出高保真情感轨迹，明确量化中心情感倾向和感知模糊性。这些轨迹可以通过顺序互相关和基于斜率的指标对人际耦合进行结构化分析，从而识别对话者之间的领先或滞后影响。为了验证这种方法的解释能力，我们利用师生教学对话作为代表性案例研究，其中我们的定量指标成功地提炼了高水平的互动见解，例如有效的脚手架。这项工作为理解人际动态建立了一个可扩展和可部署的途径，提供了一个从教育扩展到更广泛的社会和行为研究的通用解决方案。
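
Illustration (not from the paper): the abstract describes estimating a sentiment distribution per utterance by repeated stochastic decoding, then comparing speakers' trajectories with sequential cross-correlation. A minimal sketch of that idea follows. The function names, the Gaussian stand-in for an LLM scorer, the [-1, 1] score range, and the toy trajectories are all assumptions made for this example.

```python
import random
import statistics


def mc_affect(sample_fn, n_samples=200):
    """Monte Carlo estimate for one utterance: (central tendency, ambiguity)."""
    draws = [sample_fn() for _ in range(n_samples)]
    return statistics.fmean(draws), statistics.pstdev(draws)


def pearson(a, b):
    """Plain Pearson correlation of two equal-length sequences."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    va = sum((p - ma) ** 2 for p in a)
    vb = sum((q - mb) ** 2 for q in b)
    return cov / (va * vb) ** 0.5


def lagged_corr(x, y, lag):
    """Correlate x[t] with y[t + lag] for a non-negative lag."""
    n = len(x) - lag
    return pearson(x[:n], y[lag:lag + n])


def best_lag(x, y, max_lag=2):
    """Lag with the highest correlation; a positive value suggests x leads y."""
    return max(range(max_lag + 1), key=lambda k: lagged_corr(x, y, k))


# Hypothetical stand-in for one stochastic LLM sentiment rating in [-1, 1].
rng = random.Random(0)


def fake_llm_score():
    return max(-1.0, min(1.0, rng.gauss(0.4, 0.2)))


center, ambiguity = mc_affect(fake_llm_score)

# Toy trajectories: the student echoes the teacher one turn later.
teacher = [0.1, 0.5, -0.2, 0.7, 0.3, -0.4, 0.6]
student = [0.0] + teacher[:-1]
lead = best_lag(teacher, student)
print(f"center={center:.2f} ambiguity={ambiguity:.2f} best_lag={lead}")
```

In this toy setup the sample mean plays the role of the central affective tendency and the sample standard deviation the role of perceptual ambiguity. The best lag is 1 by construction, because the student series is the teacher series shifted by one turn.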

Title: ELO: Efficient Layer-Specific Optimization for Continual Pretraining of Multilingual LLMs

Authors: HanGyeol Yoo, ChangSu Choi, Minjun Kim, Seohyun Song, SeungWoo Song, Inho Won, Jongyoul Park, Cheoneum Park, KyungTae Lim
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.03648
Pdf URL: https://arxiv.org/pdf/2601.03648
Copy Paste: [[2601.03648]] ELO: Efficient Layer-Specific Optimization for Continual Pretraining of Multilingual LLMs(https://arxiv.org/abs/2601.03648)
Keywords: language model, llm
Abstract: We propose an efficient layer-specific optimization (ELO) method designed to enhance continual pretraining (CP) for specific languages in multilingual large language models (MLLMs). This approach addresses the common challenges of high computational cost and degradation of source language performance associated with traditional CP. The ELO method consists of two main stages: (1) ELO Pretraining, where a small subset of specific layers, identified in our experiments as the critically important first and last layers, are detached from the original MLLM and trained with the target language. This significantly reduces not only the number of trainable parameters but also the total parameters computed during the forward pass, minimizing GPU memory consumption and accelerating the training process. (2) Layer Alignment, where the newly trained layers are reintegrated into the original model, followed by a brief full fine-tuning step on a small dataset to align the parameters. Experimental results demonstrate that the ELO method achieves a training speedup of up to 6.46 times compared to existing methods, while improving target language performance by up to 6.2% on qualitative benchmarks and effectively preserving source language (English) capabilities.
摘要：我们提出了一种有效的特定层优化（ELO）方法，旨在增强多语言大语言模型（MLLM）中特定语言的持续预训练（CP）。这种方法解决了与传统 CP 相关的高计算成本和源语言性能下降的常见挑战。 ELO 方法由两个主要阶段组成：(1) ELO 预训练，其中一小部分特定层（在我们的实验中确定为至关重要的第一层和最后一层）与原始 MLLM 分离并使用目标语言进行训练。这不仅显着减少了可训练参数的数量，还减少了前向传播期间计算的参数总数，从而最大限度地减少了 GPU 内存消耗并加速了训练过程。 (2) 层对齐，将新训练的层重新集成到原始模型中，然后对小数据集进行简短的全面微调步骤以对齐参数。实验结果表明，与现有方法相比，ELO 方法实现了高达 6.46 倍的训练加速，同时在定性基准上将目标语言性能提高了高达 6.2%，并有效保留了源语言（英语）能力。

Title: SyncThink: A Training-Free Strategy to Align Inference Termination with Reasoning Saturation

Authors: Gengyang Li, Wang Cai, Yifeng Gao, Yunfang Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.03649
Pdf URL: https://arxiv.org/pdf/2601.03649
Copy Paste: [[2601.03649]] SyncThink: A Training-Free Strategy to Align Inference Termination with Reasoning Saturation(https://arxiv.org/abs/2601.03649)
Keywords: prompt, chain-of-thought
Abstract: Chain-of-Thought (CoT) prompting improves reasoning but often produces long and redundant traces that substantially increase inference cost. We present SyncThink, a training-free and plug-and-play decoding method that reduces CoT overhead without modifying model weights. We find that answer tokens attend weakly to early reasoning and instead focus on the special token "/think", indicating an information bottleneck. Building on this observation, SyncThink monitors the model's own reasoning-transition signal and terminates reasoning. Experiments on GSM8K, MMLU, GPQA, and BBH across three DeepSeek-R1 distilled models show that SyncThink achieves 62.00 percent average Top-1 accuracy using 656 generated tokens and 28.68 s latency, compared to 61.22 percent, 2141 tokens, and 92.01 s for full CoT decoding. On long-horizon tasks such as GPQA, SyncThink can further yield up to +8.1 absolute accuracy by preventing over-thinking.
摘要：思想链 (CoT) 提示可以改善推理，但通常会产生长而冗余的痕迹，从而大大增加推理成本。我们提出了 SyncThink，这是一种免训练、即插即用的解码方法，可以在不修改模型权重的情况下减少 CoT 开销。我们发现答案标记很少关注早期推理，而是关注特殊标记“/think”，这表明存在信息瓶颈。在此观察的基础上，SyncThink 监控模型自身的推理转换信号并终止推理。在 GSM8K、MMLU、GPQA 和 BBH 上跨三个 DeepSeek-R1 精炼模型进行的实验表明，SyncThink 使用 656 个生成的令牌和 28.68 秒的延迟实现了 62.00% 的平均 Top-1 准确度，而完整 CoT 解码的平均 Top-1 准确度为 61.22%、2141 个令牌和 92.01 秒。在 GPQA 等长期任务中，SyncThink 可以通过防止过度思考进一步产生高达 +8.1 的绝对准确度。
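
Illustration (not from the paper): the averages quoted in the abstract can be turned into relative savings with a few lines of arithmetic. The dictionary layout and helper name below are assumptions; the six numbers are copied from the abstract.

```python
# Averages reported in the abstract (full CoT decoding vs. SyncThink).
full_cot = {"acc": 61.22, "tokens": 2141, "latency_s": 92.01}
syncthink = {"acc": 62.00, "tokens": 656, "latency_s": 28.68}


def relative_reduction(before, after):
    """Percentage by which `after` is smaller than `before`."""
    return 100.0 * (before - after) / before


token_cut = relative_reduction(full_cot["tokens"], syncthink["tokens"])
latency_cut = relative_reduction(full_cot["latency_s"], syncthink["latency_s"])
speedup = full_cot["latency_s"] / syncthink["latency_s"]
acc_delta = syncthink["acc"] - full_cot["acc"]

print(f"tokens:  -{token_cut:.1f}%")
print(f"latency: -{latency_cut:.1f}% ({speedup:.2f}x faster)")
print(f"accuracy: {acc_delta:+.2f} points")
```

Read this way, the reported averages correspond to roughly 69% fewer generated tokens and roughly 69% lower latency (about a 3.2x speedup), with a 0.78-point gain in average Top-1 accuracy.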

Title: e5-omni: Explicit Cross-modal Alignment for Omni-modal Embeddings

Authors: Haonan Chen, Sicheng Gao, Radu Timofte, Tetsuya Sakai, Zhicheng Dou
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2601.03666
Pdf URL: https://arxiv.org/pdf/2601.03666
Copy Paste: [[2601.03666]] e5-omni: Explicit Cross-modal Alignment for Omni-modal Embeddings(https://arxiv.org/abs/2601.03666)
Keywords: language model
Abstract: Modern information systems often involve different types of items, e.g., a text query, an image, a video clip, or an audio segment. This motivates omni-modal embedding models that map heterogeneous modalities into a shared space for direct comparison. However, most recent omni-modal embeddings still rely heavily on implicit alignment inherited from pretrained vision-language model (VLM) backbones. In practice, this causes three common issues: (i) similarity logits have modality-dependent sharpness, so scores are not on a consistent scale; (ii) in-batch negatives become less effective over time because mixed-modality batches create an imbalanced hardness distribution; as a result, many negatives quickly become trivial and contribute little gradient; and (iii) embeddings across modalities show mismatched first- and second-order statistics, which makes rankings less stable. To tackle these problems, we propose e5-omni, a lightweight explicit alignment recipe that adapts off-the-shelf VLMs into robust omni-modal embedding models. e5-omni combines three simple components: (1) modality-aware temperature calibration to align similarity scales, (2) a controllable negative curriculum with debiasing to focus on confusing negatives while reducing the impact of false negatives, and (3) batch whitening with covariance regularization to better match cross-modal geometry in the shared embedding space. Experiments on MMEB-V2 and AudioCaps show consistent gains over strong bi-modal and omni-modal baselines, and the same recipe also transfers well to other VLM backbones. We release our model checkpoint at this https URL.
摘要：现代信息系统通常涉及不同类型的项目，例如文本查询、图像、视频剪辑或音频片段。这激发了全模态嵌入模型，将异构模态映射到共享空间以进行直接比较。然而，最近的全模态嵌入仍然严重依赖于从预训练视觉语言模型（VLM）主干继承的隐式对齐。在实践中，这会导致三个常见问题：（i）相似性 logits 具有依赖于模态的锐度，因此分数的尺度不一致； (ii) 随着时间的推移，批次内负片的效果会降低，因为混合模式批次会产生不平衡的硬度分布；结果，许多负数很快变得微不足道，并且贡献很少的梯度； (iii) 跨模式的嵌入显示一阶和二阶统计数据不匹配，这使得排名不太稳定。为了解决这些问题，我们提出了 e5-omni，这是一种轻量级的显式对齐方案，可将现成的 VLM 改编成鲁棒的全模态嵌入模型。 e5-omni 结合了三个简单的组件：(1) 模态感知温度校准，以对齐相似性尺度；(2) 可控负例课程，具有去偏功能，专注于混淆负例，同时减少假负例的影响；(3) 具有协方差正则化的批量白化，以更好地匹配共享嵌入空间中的跨模态几何。 MMEB-V2 和 AudioCaps 上的实验表明，与强大的双模态和全模态基线相比，获得了一致的增益，并且相同的方法也可以很好地转移到其他 VLM 主干。我们在此 https URL 发布模型检查点。
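
Illustration (not from the paper): issue (i) in the abstract says that similarity logits differ in sharpness by modality. The sketch below shows why a temperature matters, using the same similarity scores under two temperatures. The scores, the modality names, and the temperature values are hypothetical, and this is not the paper's calibration procedure.

```python
import math


def softmax(scores, temperature):
    """Temperature-scaled softmax with the usual max-subtraction for stability."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; lower means a sharper distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical cosine similarities of one query against four candidates.
sims = [0.82, 0.74, 0.69, 0.40]

# Hypothetical per-modality temperatures (assumed values, for illustration only).
temperature = {"text": 0.05, "image": 0.20}

sharpness = {m: entropy(softmax(sims, tau)) for m, tau in temperature.items()}
for modality, h in sharpness.items():
    print(f"{modality}: entropy={h:.3f}")
```

The lower temperature yields a lower-entropy (sharper) distribution over identical scores, which is the kind of scale mismatch a modality-aware temperature is meant to even out.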

Title: DisastQA: A Comprehensive Benchmark for Evaluating Question Answering in Disaster Management

Authors: Zhitong Chen, Kai Yin, Xiangjue Dong, Chengkai Liu, Xiangpeng Li, Yiming Xiao, Bo Li, Junwei Ma, Ali Mostafavi, James Caverlee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.03670
Pdf URL: https://arxiv.org/pdf/2601.03670
Copy Paste: [[2601.03670]] DisastQA: A Comprehensive Benchmark for Evaluating Question Answering in Disaster Management(https://arxiv.org/abs/2601.03670)
Keywords: llm
Abstract: Accurate question answering (QA) in disaster management requires reasoning over uncertain and conflicting information, a setting poorly captured by existing benchmarks built on clean evidence. We introduce DisastQA, a large-scale benchmark of 3,000 rigorously verified questions (2,000 multiple-choice and 1,000 open-ended) spanning eight disaster types. The benchmark is constructed via a human-LLM collaboration pipeline with stratified sampling to ensure balanced coverage. Models are evaluated under varying evidence conditions, from closed-book to noisy evidence integration, enabling separation of internal knowledge from reasoning under imperfect information. For open-ended QA, we propose a human-verified keypoint-based evaluation protocol emphasizing factual completeness over verbosity. Experiments with 20 models reveal substantial divergences from general-purpose leaderboards such as MMLU-Pro. While recent open-weight models approach proprietary systems in clean settings, performance degrades sharply under realistic noise, exposing critical reliability gaps for disaster response. All code, data, and evaluation resources are available at this https URL.
摘要：灾害管理中准确的问答 (QA) 需要对不确定和相互冲突的信息进行推理，而建立在干净证据基础上的现有基准很难捕捉到这种情况。我们推出了 DisastQA，这是一个包含 3,000 个经过严格验证的问题（2,000 个多项选择题和 1,000 个开放式问题）的大型基准，涵盖八种灾难类型。该基准是通过人与法学硕士合作管道构建的，并采用分层抽样以确保平衡的覆盖范围。模型在不同的证据条件下进行评估，从闭卷到嘈杂的证据整合，使得内部知识与不完美信息下的推理分离。对于开放式的质量保证，我们提出了一种经过人工验证的基于关键点的评估协议，强调事实的完整性而不是冗长。对 20 个模型的实验揭示了与 MMLU-Pro 等通用排行榜的显着差异。虽然最近的开放权重模型在清洁环境中接近专有系统，但在现实噪声下性能急剧下降，暴露了灾难响应的关键可靠性差距。所有代码、数据和评估资源均可从此 https URL 获取。

Title: NeuronScope: A Multi-Agent Framework for Explaining Polysemantic Neurons in Language Models

Authors: Weiqi Liu, Yongliang Miao, Haiyan Zhao, Yanguang Liu, Mengnan Du
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2601.03671
Pdf URL: https://arxiv.org/pdf/2601.03671
Copy Paste: [[2601.03671]] NeuronScope: A Multi-Agent Framework for Explaining Polysemantic Neurons in Language Models(https://arxiv.org/abs/2601.03671)
Keywords: language model, llm, agent
Abstract: Neuron-level interpretation in large language models (LLMs) is fundamentally challenged by widespread polysemanticity, where individual neurons respond to multiple distinct semantic concepts. Existing single-pass interpretation methods struggle to faithfully capture such multi-concept behavior. In this work, we propose NeuronScope, a multi-agent framework that reformulates neuron interpretation as an iterative, activation-guided process. NeuronScope explicitly deconstructs neuron activations into atomic semantic components, clusters them into distinct semantic modes, and iteratively refines each explanation using neuron activation feedback. Experiments demonstrate that NeuronScope uncovers hidden polysemanticity and produces explanations with significantly higher activation correlation compared to single-pass baselines.
摘要：大语言模型（LLM）中的神经元级解释从根本上受到广泛的多语义性的挑战，其中单个神经元对多个不同的语义概念做出反应。现有的单通道解释方法很难忠实地捕捉这种多概念行为。在这项工作中，我们提出了 NeuronScope，一种多智能体框架，它将神经元解释重新表述为迭代的、激活引导的过程。 NeuronScope 将神经元激活显式解构为原子语义组件，将它们聚类为不同的语义模式，并使用神经元激活反馈迭代地细化每个解释。实验表明，与单通道基线相比，NeuronScope 揭示了隐藏的多语义性，并产生了具有显着更高激活相关性的解释。

Title: Towards Compositional Generalization of LLMs via Skill Taxonomy Guided Data Synthesis

Authors: Yifan Wei, Li Du, Xiaoyan Yu, Yang Feng, Angsheng Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.03676
Pdf URL: https://arxiv.org/pdf/2601.03676
Copy Paste: [[2601.03676]] Towards Compositional Generalization of LLMs via Skill Taxonomy Guided Data Synthesis(https://arxiv.org/abs/2601.03676)
Keywords: language model, llm, agent
Abstract: Large Language Models (LLMs) and agent-based systems often struggle with compositional generalization due to a data bottleneck in which complex skill combinations follow a long-tailed, power-law distribution, limiting both instruction-following performance and generalization in agent-centric tasks. To address this challenge, we propose STEPS, a Skill Taxonomy guided Entropy-based Post-training data Synthesis framework for generating compositionally challenging data. STEPS explicitly targets compositional generalization by uncovering latent relationships among skills and organizing them into an interpretable, hierarchical skill taxonomy using structural information theory. Building on this taxonomy, we formulate data synthesis as a constrained information maximization problem, selecting skill combinations that maximize marginal structural information within the hierarchy while preserving semantic coherence. Experiments on challenging instruction-following benchmarks show that STEPS outperforms existing data synthesis baselines, while also yielding improved compositional generalization in downstream agent-based evaluations.
摘要：大型语言模型 (LLM) 和基于代理的系统经常因数据瓶颈而难以实现组合泛化，其中复杂的技能组合遵循长尾幂律分布，从而限制了以代理为中心的任务中的指令跟踪性能和泛化。为了应对这一挑战，我们提出了 STEPS，这是一种技能分类法引导的基于熵的训练后数据合成框架，用于生成具有挑战性的数据。 STEPS 通过揭示技能之间的潜在关系并使用结构信息理论将它们组织成可解释的、分层的技能分类法来明确目标组合概括。在此分类法的基础上，我们将数据合成表述为受限信息最大化问题，选择能够最大化层次结构内的边际结构信息同时保持语义一致性的技能组合。对具有挑战性的指令遵循基准的实验表明，STEPS 优于现有的数据合成基线，同时还在下游基于代理的评估中产生了改进的组合泛化。

Title: From Implicit to Explicit: Token-Efficient Logical Supervision for Mathematical Reasoning in LLMs

Authors: Shaojie Wang, Liang Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.03682
Pdf URL: https://arxiv.org/pdf/2601.03682
Copy Paste: [[2601.03682]] From Implicit to Explicit: Token-Efficient Logical Supervision for Mathematical Reasoning in LLMs(https://arxiv.org/abs/2601.03682)
Keywords: language model, llm, chain-of-thought
Abstract: Recent studies reveal that large language models (LLMs) exhibit limited logical reasoning abilities in mathematical problem-solving, instead often relying on pattern-matching and memorization. We systematically analyze this limitation, focusing on logical relationship understanding, which is a core capability underlying genuine logical reasoning, and reveal that errors related to this capability account for over 90% of incorrect predictions, with Chain-of-Thought Supervised Fine-Tuning (CoT-SFT) failing to substantially reduce these errors. To address this bottleneck, we propose First-Step Logical Reasoning (FSLR), a lightweight training framework targeting logical relationship understanding. Our key insight is that the first planning step (identifying which variables to use and which operation to apply) encourages the model to derive logical relationships directly from the problem statement. By training models on this isolated step, FSLR provides explicit supervision for logical relationship understanding, unlike CoT-SFT which implicitly embeds such relationships within complete solution trajectories. Extensive experiments across multiple models and datasets demonstrate that FSLR consistently outperforms CoT-SFT under both in-distribution and out-of-distribution settings, with average improvements of 3.2% and 4.6%, respectively. Moreover, FSLR achieves 4-6x faster training and reduces training token consumption by over 80%.
摘要：最近的研究表明，大型语言模型（LLM）在解决数学问题时表现出有限的逻辑推理能力，而常常依赖于模式匹配和记忆。我们系统地分析了这一局限性，重点关注逻辑关系理解，这是真正逻辑推理的核心能力，并发现与该能力相关的错误占错误预测的 90% 以上，而思想链监督微调（CoT-SFT）未能大幅减少这些错误。为了解决这个瓶颈，我们提出了第一步逻辑推理（FSLR），这是一种针对逻辑关系理解的轻量级训练框架。我们的主要见解是，第一个规划步骤（确定要使用哪些变量以及要应用哪些操作）鼓励模型直接从问题陈述中导出逻辑关系。通过在这个孤立的步骤上训练模型，FSLR 为逻辑关系理解提供了明确的监督，这与 CoT-SFT 不同，CoT-SFT 隐式地将此类关系嵌入到完整的解决方案轨迹中。跨多个模型和数据集的大量实验表明，在分布内和分布外设置下，FSLR 始终优于 CoT-SFT，平均改进分别为 3.2% 和 4.6%。此外，FSLR 的训练速度提高了 4-6 倍，并将训练令牌消耗减少了 80% 以上。

Title: Evaluation Framework for AI Creativity: A Case Study Based on Story Generation

Authors: Pharath Sathya, Yin Jou Huang, Fei Cheng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.03698
Pdf URL: https://arxiv.org/pdf/2601.03698
Copy Paste: [[2601.03698]] Evaluation Framework for AI Creativity: A Case Study Based on Story Generation(https://arxiv.org/abs/2601.03698)
Keywords: prompt
Abstract: Evaluating creative text generation remains a challenge because existing reference-based metrics fail to capture the subjective nature of creativity. We propose a structured evaluation framework for AI story generation comprising four components (Novelty, Value, Adherence, and Resonance) and eleven sub-components. Using controlled story generation via "Spike Prompting" and a crowdsourced study of 115 readers, we examine how different creative components shape both immediate and reflective human creativity judgments. Our findings show that creativity is evaluated hierarchically rather than cumulatively, with different dimensions becoming salient at different stages of judgment, and that reflective evaluation substantially alters both ratings and inter-rater agreement. Together, these results support the effectiveness of our framework in revealing dimensions of creativity that are obscured by reference-based evaluation.
摘要：评估创意文本的生成仍然是一个挑战，因为现有的基于参考的指标无法捕捉创意的主观本质。我们提出了一个用于人工智能故事生成的结构化评估框架，包括四个组件（新颖性、价值、依从性和共鸣）和十一个子组件。通过“尖峰提示”控制故事生成以及对 115 名读者进行众包研究，我们研究了不同的创意成分如何塑造直接和反思性的人类创造力判断。我们的研究结果表明，创造力是分层评估的，而不是累积性的，不同的维度在不同的判断阶段变得突出，并且反思性评估会极大地改变评级和评估者间的一致性。总之，这些结果支持我们的框架在揭示被基于参考的评估所掩盖的创造力维度方面的有效性。

Title: RedBench: A Universal Dataset for Comprehensive Red Teaming of Large Language Models

Authors: Quy-Anh Dang, Chris Ngo, Truong-Son Hy
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.03699
Pdf URL: https://arxiv.org/pdf/2601.03699
Copy Paste: [[2601.03699]] RedBench: A Universal Dataset for Comprehensive Red Teaming of Large Language Models(https://arxiv.org/abs/2601.03699)
Keywords: language model, llm, prompt
Abstract: As large language models (LLMs) become integral to safety-critical applications, ensuring their robustness against adversarial prompts is paramount. However, existing red teaming datasets suffer from inconsistent risk categorizations, limited domain coverage, and outdated evaluations, hindering systematic vulnerability assessments. To address these challenges, we introduce RedBench, a universal dataset aggregating 37 benchmark datasets from leading conferences and repositories, comprising 29,362 samples across attack and refusal prompts. RedBench employs a standardized taxonomy with 22 risk categories and 19 domains, enabling consistent and comprehensive evaluations of LLM vulnerabilities. We provide a detailed analysis of existing datasets, establish baselines for modern LLMs, and open-source the dataset and evaluation code. Our contributions facilitate robust comparisons, foster future research, and promote the development of secure and reliable LLMs for real-world deployment. Code: this https URL
摘要：随着大型语言模型 (LLM) 成为安全关键型应用程序不可或缺的一部分，确保其针对对抗性提示的稳健性至关重要。然而，现有的红队数据集存在风险分类不一致、领域覆盖范围有限和评估过时的问题，阻碍了系统性的漏洞评估。为了应对这些挑战，我们引入了 RedBench，这是一个通用数据集，聚合了来自领先会议和存储库的 37 个基准数据集，其中包含攻击和拒绝提示的 29,362 个样本。 RedBench 采用包含 22 个风险类别和 19 个领域的标准化分类法，能够对 LLM 漏洞进行一致且全面的评估。我们提供对现有数据集的详细分析，为现代法学硕士建立基线，并开源数据集和评估代码。我们的贡献促进了强有力的比较，促进了未来的研究，并促进了安全可靠的法学硕士的开发，以供实际部署。代码：这个https URL

Title: ADEPT: Adaptive Dynamic Early-Exit Process for Transformers

Authors: Sangmin Yoo, Srikanth Malla, Chiho Choi, Wei D. Lu, Joon Hee Choi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.03700
Pdf URL: https://arxiv.org/pdf/2601.03700
Copy Paste: [[2601.03700]] ADEPT: Adaptive Dynamic Early-Exit Process for Transformers(https://arxiv.org/abs/2601.03700)
Keywords: language model, prompt
Abstract: The inference of large language models imposes significant computational workloads, often requiring the processing of billions of parameters. Although early-exit strategies have proven effective in reducing computational demands by halting inference earlier, they apply either to only the first token in the generation phase or at the prompt level in the prefill phase. Thus, the Key-Value (KV) cache for skipped layers remains a bottleneck for subsequent token generation, limiting the benefits of early exit. We introduce ADEPT (Adaptive Dynamic Early-exit Process for Transformers), a novel approach designed to overcome this issue and enable dynamic early exit in both the prefill and generation phases. The proposed adaptive token-level early-exit mechanism adjusts computation dynamically based on token complexity, optimizing efficiency without compromising performance. ADEPT further enhances the KV generation procedure by decoupling sequential dependencies in skipped layers, making token-level early exit more practical. Experimental results demonstrate that ADEPT improves efficiency by up to 25% in language generation tasks and achieves a 4x speed-up in downstream classification tasks, with up to a 45% improvement in performance.
摘要：大型语言模型的推理会带来巨大的计算工作量，通常需要处理数十亿个参数。尽管提前退出策略已被证明可以通过提前停止推理来有效减少计算需求，但它们要么仅适用于生成阶段的第一个令牌，要么适用于预填充阶段的提示级别。因此，跳过层的键值（KV）缓存仍然是后续令牌生成的瓶颈，限制了提前退出的好处。我们引入了 ADEPT（变压器自适应动态提前退出流程），这是一种新颖的方法，旨在克服这个问题，并在预填充和生成阶段实现动态提前退出。所提出的自适应令牌级提前退出机制根据令牌复杂性动态调整计算，在不影响性能的情况下优化效率。 ADEPT 通过解耦跳过层中的顺序依赖关系，进一步增强了 KV 生成过程，使令牌级提前退出更加实用。实验结果表明，ADEPT 在语言生成任务中效率提升高达 25%，在下游分类任务中实现 4 倍提速，性能提升高达 45%。
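
Illustration (not from the paper): the abstract does not state ADEPT's exit criterion, so the sketch below uses a generic confidence-threshold rule as a stand-in, only to show how token-level early exit lets easy tokens use fewer layers than hard ones. The per-layer confidence values, the threshold, and the four-layer depth are all hypothetical.

```python
NUM_LAYERS = 4


def layers_used(confidences, threshold=0.9):
    """Number of layers run before per-layer confidence first reaches threshold.

    Generic confidence-threshold early exit; an assumed rule, not ADEPT's.
    """
    for depth, conf in enumerate(confidences, start=1):
        if conf >= threshold:
            return depth
    return len(confidences)


# Hypothetical per-layer confidence traces for three tokens.
tokens = {
    "easy": [0.95, 0.97, 0.98, 0.99],
    "medium": [0.60, 0.91, 0.95, 0.98],
    "hard": [0.30, 0.45, 0.70, 0.92],
}

depths = {name: layers_used(trace) for name, trace in tokens.items()}
layer_calls = sum(depths.values())
saved_fraction = 1 - layer_calls / (len(tokens) * NUM_LAYERS)

print(depths)
print(f"layer calls skipped: {saved_fraction:.1%}")
```

The sketch deliberately ignores the KV cache. As the abstract notes, skipped layers still need keys and values for later tokens, which is the part ADEPT addresses by decoupling sequential dependencies.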

Title: Visual Merit or Linguistic Crutch? A Close Look at DeepSeek-OCR

Authors: Yunhao Liang, Ruixuan Ying, Bo Li, Hong Li, Kai Yan, Qingwen Li, Min Yang, Okamoto Satoshi, Zhe Cui, Shiwen Ni
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2601.03714
Pdf URL: https://arxiv.org/pdf/2601.03714
Copy Paste: [[2601.03714]] Visual Merit or Linguistic Crutch? A Close Look at DeepSeek-OCR(https://arxiv.org/abs/2601.03714)
Keywords: llm, hallucination
Abstract: DeepSeek-OCR utilizes an optical 2D mapping approach to achieve high-ratio vision-text compression, claiming to decode text tokens exceeding ten times the input visual tokens. While this suggests a promising solution for the LLM long-context bottleneck, we investigate a critical question: "Visual merit or linguistic crutch - which drives DeepSeek-OCR's performance?" By employing sentence-level and word-level semantic corruption, we isolate the model's intrinsic OCR capabilities from its language priors. Results demonstrate that without linguistic support, DeepSeek-OCR's performance plummets from approximately 90% to 20%. Comparative benchmarking against 13 baseline models reveals that traditional pipeline OCR methods exhibit significantly higher robustness to such semantic perturbations than end-to-end methods. Furthermore, we find that lower visual token counts correlate with increased reliance on priors, exacerbating hallucination risks. Context stress testing also reveals a total model collapse around 10,000 text tokens, suggesting that current optical compression techniques may paradoxically aggravate the long-context bottleneck. This study empirically defines DeepSeek-OCR's capability boundaries and offers essential insights for future optimizations of the vision-text compression paradigm. We release all data, results and scripts used in this study at this https URL.
摘要：DeepSeek-OCR 利用光学 2D 映射方法来实现高比率视觉文本压缩，声称解码的文本标记超过输入视觉标记的十倍。虽然这为 LLM 长上下文瓶颈提供了一个有希望的解决方案，但我们研究了一个关键问题：“视觉价值或语言拐杖 - 哪个驱动 DeepSeek-OCR 的性能？”通过采用句子级和单词级语义损坏，我们将模型的固有 OCR 功能与其语言先验隔离开来。结果表明，如果没有语言支持，DeepSeek-OCR 的性能会从大约 90% 骤降到 20%。针对 13 个基线模型的比较基准测试表明，传统管道 OCR 方法对此类语义扰动表现出比端到端方法显着更高的鲁棒性。此外，我们发现较低的视觉标记数量与对先验的依赖增加相关，从而加剧了幻觉风险。上下文压力测试还显示，模型在大约 10,000 个文本标记处崩溃，这表明当前的光学压缩技术可能会矛盾地加剧长上下文瓶颈。这项研究根据经验定义了 DeepSeek-OCR 的能力边界，并为未来视觉文本压缩范式的优化提供了重要的见解。我们在此 https URL 发布本研究中使用的所有数据、结果和脚本。

Title: MIND: From Passive Mimicry to Active Reasoning through Capability-Aware Multi-Perspective CoT Distillation

Authors: Jin Cui, Jiaqi Guo, Jiepeng Zhou, Ruixuan Yang, Jiayi Lu, Jiajun Xu, Jiangcheng Song, Boran Zhao, Pengju Ren
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.03717
Pdf URL: https://arxiv.org/pdf/2601.03717
Copy Paste: [[2601.03717]] MIND: From Passive Mimicry to Active Reasoning through Capability-Aware Multi-Perspective CoT Distillation(https://arxiv.org/abs/2601.03717)
Keywords: language model, llm, chain-of-thought
Abstract: While Large Language Models (LLMs) have emerged with remarkable capabilities in complex tasks through Chain-of-Thought reasoning, practical resource constraints have sparked interest in transferring these abilities to smaller models. However, achieving both domain performance and cross-domain generalization remains challenging. Existing approaches typically restrict students to following a single golden rationale and treat different reasoning paths independently. Due to distinct inductive biases and intrinsic preferences, alongside the student's evolving capacity and reasoning preferences during training, a teacher's "optimal" rationale could act as out-of-distribution noise. This misalignment leads to a degeneration of the student's latent reasoning distribution, causing suboptimal performance. To bridge this gap, we propose MIND, a capability-adaptive framework that transitions distillation from passive mimicry to active cognitive construction. We synthesize diverse teacher perspectives through a novel "Teaching Assistant" network. By employing a Feedback-Driven Inertia Calibration mechanism, this network utilizes inertia-filtered training loss to align supervision with the student's current adaptability, effectively enhancing performance while mitigating catastrophic forgetting. Extensive experiments demonstrate that MIND achieves state-of-the-art performance on both in-distribution and out-of-distribution benchmarks, and our sophisticated latent space analysis further confirms the mechanism of reasoning ability internalization.
摘要：虽然大型语言模型 (LLM) 通过思想链推理在复杂任务中具有卓越的能力，但实际的资源限制激发了人们将这些能力转移到较小模型的兴趣。然而，同时实现领域性能和跨领域泛化仍然具有挑战性。现有的方法通常限制学生遵循单一的黄金原理并独立对待不同的推理路径。由于明显的归纳偏差和内在偏好，加上学生在训练过程中不断发展的能力和推理偏好，教师的“最佳”理由可能会成为分布外噪声。这种失调会导致学生潜在推理分布的退化，导致表现不佳。为了弥补这一差距，我们提出了 MIND，这是一种能力自适应框架，可将被动模仿的升华转变为主动认知构建。我们通过新颖的“助教”网络综合不同的教师观点。通过采用反馈驱动的惯性校准机制，该网络利用惯性过滤的训练损失来使监督与学生当前的适应性相一致，有效地提高表现，同时减少灾难性遗忘。大量的实验表明，MIND 在分布内和分布外基准上都实现了最先进的性能，并且我们复杂的潜在空间分析进一步证实了推理能力内化的机制。

Title: Stuttering-Aware Automatic Speech Recognition for Indonesian Language

Authors: Fadhil Muhammad, Alwin Djuliansah, Adrian Aryaputra Hamzah, Kurniawati Azizah
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.03727
Pdf URL: https://arxiv.org/pdf/2601.03727
Copy Paste: [[2601.03727]] Stuttering-Aware Automatic Speech Recognition for Indonesian Language(https://arxiv.org/abs/2601.03727)
Keywords: language model
Abstract: Automatic speech recognition systems have achieved remarkable performance on fluent speech but continue to degrade significantly when processing stuttered speech, a limitation that is particularly acute for low-resource languages like Indonesian where specialized datasets are virtually non-existent. To overcome this scarcity, we propose a data augmentation framework that generates synthetic stuttered audio by injecting repetitions and prolongations into fluent text through a combination of rule-based transformations and large language models followed by text-to-speech synthesis. We apply this synthetic data to fine-tune a pre-trained Indonesian Whisper model using transfer learning, enabling the architecture to adapt to dysfluent acoustic patterns without requiring large-scale real-world recordings. Our experiments demonstrate that this targeted synthetic exposure consistently reduces recognition errors on stuttered speech while maintaining performance on fluent segments, validating the utility of synthetic data pipelines for developing more inclusive speech technologies in under-represented languages.
摘要：自动语音识别系统在流畅语音方面取得了显着的性能，但在处理口吃语音时，性能继续显着下降，这一限制对于像印度尼西亚语这样几乎不存在专门数据集的资源匮乏的语言尤其严重。为了克服这种稀缺性，我们提出了一种数据增强框架，通过结合基于规则的转换和大型语言模型，然后进行文本到语音合成，将重复和延长注入到流畅的文本中，从而生成合成口吃音频。我们应用这些合成数据，通过迁移学习来微调预训练的印度尼西亚 Whisper 模型，使该架构能够适应不流畅的声学模式，而无需大规模的现实世界录音。我们的实验表明，这种有针对性的合成暴露持续减少了口吃语音的识别错误，同时保持了流利片段的性能，验证了合成数据管道在代表性不足的语言中开发更具包容性的语音技术的效用。

Title: O-Researcher: An Open Ended Deep Research Model via Multi-Agent Distillation and Agentic RL

Authors: Yi Yao, He Zhu, Piaohong Wang, Jincheng Ren, Xinlong Yang, Qianben Chen, Xiaowan Li, Dingfeng Shi, Jiaxian Li, Qiexiang Wang, Sinuo Wang, Xinpeng Liu, Jiaqi Wu, Minghao Liu, Wangchunshu Zhou
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.03743
Pdf URL: https://arxiv.org/pdf/2601.03743
Copy Paste: [[2601.03743]] O-Researcher: An Open Ended Deep Research Model via Multi-Agent Distillation and Agentic RL(https://arxiv.org/abs/2601.03743)
Keywords: language model, llm, agent
Abstract: The performance gap between closed-source and open-source large language models (LLMs) is largely attributed to disparities in access to high-quality training data. To bridge this gap, we introduce a novel framework for the automated synthesis of sophisticated, research-grade instructional data. Our approach centers on a multi-agent workflow where collaborative AI agents simulate complex tool-integrated reasoning to generate diverse and high-fidelity data end-to-end. Leveraging this synthesized data, we develop a two-stage training strategy that integrates supervised fine-tuning with a novel reinforcement learning method, designed to maximize model alignment and capability. Extensive experiments demonstrate that our framework empowers open-source models across multiple scales, enabling them to achieve new state-of-the-art performance on the major deep research benchmark. This work provides a scalable and effective pathway for advancing open-source LLMs without relying on proprietary data or models.
摘要：闭源和开源大语言模型（LLM）之间的性能差距很大程度上归因于获取高质量训练数据的差异。为了弥补这一差距，我们引入了一种新颖的框架，用于自动合成复杂的研究级教学数据。我们的方法以多代理工作流程为中心，其中协作人工智能代理模拟复杂的工具集成推理，以生成端到端的多样化和高保真数据。利用这些合成数据，我们开发了一种两阶段训练策略，将监督微调与新颖的强化学习方法相结合，旨在最大限度地提高模型对齐和能力。大量的实验表明，我们的框架支持跨多个尺度的开源模型，使它们能够在主要的深度研究基准上实现新的最先进的性能。这项工作为推进开源法学硕士提供了一条可扩展且有效的途径，而无需依赖专有数据或模型。

Title: Whose Facts Win? LLM Source Preferences under Knowledge Conflicts

Authors: Jakob Schuster, Vagrant Gautam, Katja Markert
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.03746
Pdf URL: https://arxiv.org/pdf/2601.03746
Copy Paste: [[2601.03746]] Whose Facts Win? LLM Source Preferences under Knowledge Conflicts(https://arxiv.org/abs/2601.03746)
Keywords: language model, llm, retrieval-augmented generation
Abstract: As large language models (LLMs) are more frequently used in retrieval-augmented generation pipelines, it is increasingly relevant to study their behavior under knowledge conflicts. Thus far, the role of the source of the retrieved information has gone unexamined. We address this gap with a novel framework to investigate how source preferences affect LLM resolution of inter-context knowledge conflicts in English, motivated by interdisciplinary research on credibility. With a comprehensive, tightly-controlled evaluation of 13 open-weight LLMs, we find that LLMs prefer institutionally-corroborated information (e.g., government or newspaper sources) over information from people and social media. However, these source preferences can be reversed by simply repeating information from less credible sources. To mitigate repetition effects and maintain consistent preferences, we propose a novel method that reduces repetition bias by up to 99.8%, while also maintaining at least 88.8% of original preferences. We release all data and code to encourage future work on credibility and source preferences in knowledge-intensive NLP.
摘要：随着大型语言模型（LLM）更频繁地用于检索增强生成管道，研究它们在知识冲突下的行为变得越来越重要。到目前为止，检索到的信息来源的作用尚未得到检验。我们通过一个新颖的框架来解决这一差距，以研究源偏好如何影响英语中的法学硕士解决上下文间知识冲突的方法，其动机是可信度的跨学科研究。通过对 13 个开放权重法学硕士进行全面、严格控制的评估，我们发现法学硕士更喜欢机构证实的信息（例如政府或报纸来源），而不是来自个人和社交媒体的信息。然而，这些来源偏好可以通过简单地重复来自不太可信来源的信息来逆转。为了减轻重复效应并保持一致的偏好，我们提出了一种新颖的方法，可以将重复偏差减少高达 99.8%，同时还保持至少 88.8% 的原始偏好。我们发布所有数据和代码，以鼓励未来在知识密集型 NLP 中的可信度和来源偏好方面开展工作。

Title: Evaluation of Multilingual LLMs Personalized Text Generation Capabilities Targeting Groups and Social-Media Platforms

Authors: Dominik Macko
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.03752
Pdf URL: https://arxiv.org/pdf/2601.03752
Copy Paste: [[2601.03752]] Evaluation of Multilingual LLMs Personalized Text Generation Capabilities Targeting Groups and Social-Media Platforms(https://arxiv.org/abs/2601.03752)
Keywords: language model, llm, prompt
Abstract: The ability of large language models to generate coherent multilingual text has improved continuously in recent years, which raises concerns about their potential misuse. Previous research has shown that they can be misused to generate personalized disinformation in multiple languages. It has also been observed that personalization negatively affects the detectability of machine-generated texts; however, this has been studied in English only. In this work, we examine this phenomenon across 10 languages, focusing not only on potential misuse of personalization capabilities but also on the potential benefits they offer. Overall, we cover 1080 combinations of various personalization aspects in the prompts, for which the texts are generated by 16 distinct language models (17,280 texts in total). Our results indicate that the personalization quality of the generated texts differs across languages between targeting demographic groups and targeting social-media platforms. Personalization towards platforms affects the detectability of the generated texts to a greater degree, especially in English, where the personalization quality is the highest.
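The corpus size quoted in the abstract follows from multiplying the prompt combinations by the number of generators. A quick check, assuming each model produces exactly one text per combination (the abstract implies this but does not state it outright):

```python
# Figures taken from the abstract above.
prompt_combinations = 1080   # combinations of personalization aspects
generator_models = 16        # distinct language models

# Assumption: one generated text per (combination, model) pair.
total_texts = prompt_combinations * generator_models
print(total_texts)
```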

Title: Do LLM Self-Explanations Help Users Predict Model Behavior? Evaluating Counterfactual Simulatability with Pragmatic Perturbations

Authors: Pingjun Hong, Benjamin Roth
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.03775
Pdf URL: https://arxiv.org/pdf/2601.03775
Copy Paste: [[2601.03775]] Do LLM Self-Explanations Help Users Predict Model Behavior? Evaluating Counterfactual Simulatability with Pragmatic Perturbations(https://arxiv.org/abs/2601.03775)
Keywords: language model, llm, chain-of-thought
Abstract: Large Language Models (LLMs) can produce verbalized self-explanations, yet prior studies suggest that such rationales may not reliably reflect the model's true decision process. We ask whether these explanations nevertheless help users predict model behavior, operationalized as counterfactual simulatability. Using StrategyQA, we evaluate how well humans and LLM judges can predict a model's answers to counterfactual follow-up questions, with and without access to the model's chain-of-thought or post-hoc explanations. We compare LLM-generated counterfactuals with pragmatics-based perturbations as alternative ways to construct test cases for assessing the potential usefulness of explanations. Our results show that self-explanations consistently improve simulation accuracy for both LLM judges and humans, but the degree and stability of gains depend strongly on the perturbation strategy and judge strength. We also conduct a qualitative analysis of free-text justifications written by human users when predicting the model's behavior, which provides evidence that access to explanations helps humans form more accurate predictions on the perturbed questions.

Title: Tracing the complexity profiles of different linguistic phenomena through the intrinsic dimension of LLM representations

Authors: Marco Baroni, Emily Cheng, Iria deDios-Flores, Francesca Franzon
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.03779
Pdf URL: https://arxiv.org/pdf/2601.03779
Copy Paste: [[2601.03779]] Tracing the complexity profiles of different linguistic phenomena through the intrinsic dimension of LLM representations(https://arxiv.org/abs/2601.03779)
Keywords: llm
Abstract: We explore the intrinsic dimension (ID) of LLM representations as a marker of linguistic complexity, asking whether different ID profiles across LLM layers differentially characterize formal and functional complexity. We find the formal contrast between sentences with multiple coordinated or subordinated clauses to be reflected in ID differences whose onset aligns with a phase of more abstract linguistic processing independently identified in earlier work. The functional contrasts between sentences characterized by right branching vs. center embedding or unambiguous vs. ambiguous relative clause attachment are also picked up by ID, but in a less marked way, and they do not correlate with the same processing phase. Further experiments using representational similarity and layer ablation confirm the same trends. We conclude that ID is a useful marker of linguistic complexity in LLMs, that it makes it possible to differentiate between different types of complexity, and that it points to similar stages of linguistic processing across disparate LLMs.
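The abstract does not say which ID estimator is used. As an illustration only, the sketch below implements TwoNN (Facco et al., 2017), one common estimator, which infers dimension from the ratio of each point's second to first nearest-neighbor distance. The toy data and the function name are assumptions, not material from the paper.

```python
import math
import random

def two_nn_intrinsic_dimension(points):
    """Maximum-likelihood TwoNN estimate: N / sum(log(r2 / r1))."""
    log_ratios = []
    for i, p in enumerate(points):
        # Distances from p to every other point, sorted ascending.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        r1, r2 = dists[0], dists[1]
        if r1 > 0:  # skip exact duplicates, whose ratio is undefined
            log_ratios.append(math.log(r2 / r1))
    return len(log_ratios) / sum(log_ratios)

# Toy data (assumption): a 2-D plane embedded in a 5-D ambient space.
rng = random.Random(0)
plane = []
for _ in range(400):
    u, v = rng.random(), rng.random()
    plane.append((u, v, u + v, u - v, 2.0 * u))

estimated_id = two_nn_intrinsic_dimension(plane)
print(round(estimated_id, 2))
```

The estimate should land near 2 even though each point has five coordinates. Applied to per-layer hidden states instead of toy points, the same function would give one ID value per layer, which is the kind of profile the abstract describes.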

Title: HearSay Benchmark: Do Audio LLMs Leak What They Hear?

Authors: Jin Wang, Liang Lin, Kaiwen Luo, Weiliu Wang, Yitian Chen, Moayad Aloqaily, Xuehai Tang, Zhenhong Zhou, Kun Wang, Li Sun, Qingsong Wen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.03783
Pdf URL: https://arxiv.org/pdf/2601.03783
Copy Paste: [[2601.03783]] HearSay Benchmark: Do Audio LLMs Leak What They Hear?(https://arxiv.org/abs/2601.03783)
Keywords: language model, llm, chain-of-thought
Abstract: While Audio Large Language Models (ALLMs) have achieved remarkable progress in understanding and generation, their potential privacy implications remain largely unexplored. This paper takes the first step to investigate whether ALLMs inadvertently leak user privacy solely through acoustic voiceprints and introduces HearSay, a comprehensive benchmark constructed from over 22,000 real-world audio clips. To ensure data quality, the benchmark is meticulously curated through a rigorous pipeline involving automated profiling and human verification, guaranteeing that all privacy labels are grounded in factual records. Extensive experiments on HearSay yield three critical findings. (1) Significant Privacy Leakage: ALLMs inherently extract private attributes from voiceprints, reaching 92.89% accuracy on gender and effectively profiling social attributes. (2) Insufficient Safety Mechanisms: Alarmingly, existing safeguards are severely inadequate; most models fail to refuse privacy-intruding requests, exhibiting near-zero refusal rates for physiological traits. (3) Reasoning Amplifies Risk: Chain-of-Thought (CoT) reasoning exacerbates privacy risks in capable models by uncovering deeper acoustic correlations. These findings expose critical vulnerabilities in ALLMs, underscoring the urgent need for targeted privacy alignment. The code and dataset are available at this https URL.

Title: Membox: Weaving Topic Continuity into Long-Range Memory for LLM Agents

Authors: Dehao Tao, Guoliang Ma, Yongfeng Huang, Minghu Jiang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.03785
Pdf URL: https://arxiv.org/pdf/2601.03785
Copy Paste: [[2601.03785]] Membox: Weaving Topic Continuity into Long-Range Memory for LLM Agents(https://arxiv.org/abs/2601.03785)
Keywords: language model, llm, agent
Abstract: Human-agent dialogues often exhibit topic continuity (a stable thematic frame that evolves through temporally adjacent exchanges), yet most large language model (LLM) agent memory systems fail to preserve it. Existing designs follow a fragmentation-compensation paradigm: they first break dialogue streams into isolated utterances for storage, then attempt to restore coherence via embedding-based retrieval. This process irreversibly damages narrative and causal flow, while biasing retrieval towards lexical similarity. We introduce Membox, a hierarchical memory architecture centered on a Topic Loom that continuously monitors dialogue in a sliding-window fashion, grouping consecutive same-topic turns into coherent "memory boxes" at storage time. Sealed boxes are then linked by a Trace Weaver into long-range event-timeline traces, recovering macro-topic recurrences across discontinuities. Experiments on LoCoMo demonstrate that Membox achieves up to 68% F1 improvement on temporal reasoning tasks, outperforming competitive baselines (e.g., Mem0, A-MEM). Notably, Membox attains these gains while using only a fraction of the context tokens required by existing methods, highlighting a superior balance between efficiency and effectiveness. By explicitly modeling topic continuity, Membox offers a cognitively motivated mechanism for enhancing both coherence and efficiency in LLM agents.
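The storage-time grouping described above can be pictured with a small sketch. It is not the paper's Topic Loom: the word-overlap (Jaccard) test, the `window` and `threshold` parameters, and the function names are all hypothetical stand-ins for whatever topic detector Membox actually uses.

```python
def jaccard(a, b):
    """Word-overlap similarity between two sets of tokens."""
    return len(a & b) / len(a | b) if a | b else 0.0

def group_into_boxes(turns, window=2, threshold=0.2):
    """Group consecutive turns into boxes; seal a box when the topic shifts.

    Hypothetical stand-in for the Topic Loom: a turn joins the open box if its
    words overlap enough with the last `window` turns already in that box.
    """
    boxes = []
    for turn in turns:
        words = set(turn.lower().split())
        if boxes:
            recent = set()
            for prev in boxes[-1][-window:]:
                recent |= set(prev.lower().split())
            if jaccard(words, recent) >= threshold:
                boxes[-1].append(turn)
                continue
        boxes.append([turn])  # start a new box (the previous one is sealed)
    return boxes

# Made-up dialogue: three turns about a flight, then two about dinner.
dialogue = [
    "my flight to rome leaves friday",
    "is the rome flight direct",
    "yes the flight is direct",
    "what should i cook tonight",
    "cook pasta tonight",
]
boxes = group_into_boxes(dialogue)
print([len(box) for box in boxes])
```

Grouping at write time keeps adjacent same-topic turns together, which is the property the abstract contrasts with storing isolated utterances and stitching them back together at retrieval time.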

Title: Compact Example-Based Explanations for Language Models

Authors: Loris Schoenegger, Benjamin Roth
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2601.03786
Pdf URL: https://arxiv.org/pdf/2601.03786
Copy Paste: [[2601.03786]] Compact Example-Based Explanations for Language Models(https://arxiv.org/abs/2601.03786)
Keywords: language model
Abstract: Training data influence estimation methods quantify the contribution of training documents to a model's output, making them a promising source of information for example-based explanations. As humans cannot interpret thousands of documents, only a small subset of the training data can be presented as an explanation. Although the choice of which documents to include directly affects explanation quality, previous evaluations of such systems have largely ignored any selection strategies. To address this, we propose a novel selection relevance score, a retraining-free metric that quantifies how useful a set of examples is for explaining a model's output. We validate this score through fine-tuning experiments, confirming that it can predict whether a set of examples supports or undermines the model's predictions. Using this metric, we further show that common selection strategies often underperform random selection. Motivated by this finding, we propose a strategy that balances influence and representativeness, enabling better use of selection budgets than naively selecting the highest-ranking examples.

Title: NeoAMT: Neologism-Aware Agentic Machine Translation with Reinforcement Learning

Authors: Zhongtao Miao, Kaiyan Zhao, Masaaki Nagata, Yoshimasa Tsuruoka
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.03790
Pdf URL: https://arxiv.org/pdf/2601.03790
Copy Paste: [[2601.03790]] NeoAMT: Neologism-Aware Agentic Machine Translation with Reinforcement Learning(https://arxiv.org/abs/2601.03790)
Keywords: agent
Abstract: Neologism-aware machine translation aims to translate source sentences containing neologisms into target languages. This field remains underexplored compared with general machine translation (MT). In this paper, we propose an agentic framework, NeoAMT, for neologism-aware machine translation using a Wiktionary search tool. Specifically, we first create a new dataset for neologism-aware machine translation and develop a search tool based on Wiktionary. The new dataset covers 16 languages and 75 translation directions and is derived from approximately 10 million records of an English Wiktionary dump. The retrieval corpus of the search tool is also constructed from around 3 million cleaned records of the Wiktionary dump. We then use it for training the translation agent with reinforcement learning (RL) and evaluating the accuracy of neologism-aware machine translation. Based on this, we also propose an RL training framework with a novel reward design and an adaptive rollout generation approach that leverages "translation difficulty" to further improve the translation quality of translation agents using our search tool.

Title: Do LLMs Really Memorize Personally Identifiable Information? Revisiting PII Leakage with a Cue-Controlled Memorization Framework

Authors: Xiaoyu Luo, Yiyi Chen, Qiongxiu Li, Johannes Bjerva
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.03791
Pdf URL: https://arxiv.org/pdf/2601.03791
Copy Paste: [[2601.03791]] Do LLMs Really Memorize Personally Identifiable Information? Revisiting PII Leakage with a Cue-Controlled Memorization Framework(https://arxiv.org/abs/2601.03791)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) have been reported to "leak" Personally Identifiable Information (PII), with successful PII reconstruction often interpreted as evidence of memorization. We propose a principled revision of memorization evaluation for LLMs, arguing that PII leakage should be evaluated under low lexical cue conditions, where target PII cannot be reconstructed through prompt-induced generalization or pattern completion. We formalize Cue-Resistant Memorization (CRM) as a cue-controlled evaluation framework and a necessary condition for valid memorization evaluation, explicitly conditioning on prompt-target overlap cues. Using CRM, we conduct a large-scale multilingual re-evaluation of PII leakage across 32 languages and multiple memorization paradigms. Revisiting reconstruction-based settings, including verbatim prefix-suffix completion and associative reconstruction, we find that their apparent effectiveness is driven primarily by direct surface-form cues rather than by true memorization. When such cues are controlled for, reconstruction success diminishes substantially. We further examine cue-free generation and membership inference, both of which exhibit extremely low true positive rates. Overall, our results suggest that previously reported PII leakage is better explained by cue-driven behavior than by genuine memorization, highlighting the importance of cue-controlled evaluation for reliably quantifying privacy-relevant memorization in LLMs.
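The idea of conditioning on prompt-target overlap can be illustrated with a simple lexical-cue score. The abstract does not specify how overlap is measured, so the character-trigram measure, the `max_overlap` threshold, and the example strings below are all assumptions made for illustration.

```python
def char_ngrams(text, n=3):
    """Set of lowercase character n-grams in `text`."""
    t = text.lower()
    return {t[i:i + n] for i in range(len(t) - n + 1)}

def cue_overlap(prompt, target, n=3):
    """Fraction of the target's n-grams that already appear in the prompt."""
    target_grams = char_ngrams(target, n)
    if not target_grams:
        return 0.0
    return len(target_grams & char_ngrams(prompt, n)) / len(target_grams)

def is_low_cue(prompt, target, max_overlap=0.2):
    """Hypothetical filter: keep cases where the prompt barely hints at the target."""
    return cue_overlap(prompt, target) <= max_overlap

# Made-up example: the first prompt spells out most of the address,
# the second gives almost no surface-form hint.
target = "john.smith@example.com"
cued_prompt = "Contact John Smith at john.smith@"
uncued_prompt = "The committee chair can be reached at"

high = cue_overlap(cued_prompt, target)
low = cue_overlap(uncued_prompt, target)
print(round(high, 2), round(low, 2))
```

Under a filter like this, a successful reconstruction from the first prompt would not count as evidence of memorization, because much of the target is recoverable from the prompt itself.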

Title: VietMed-MCQ: A Consistency-Filtered Data Synthesis Framework for Vietnamese Traditional Medicine Evaluation

Authors: Huynh Trung Kiet, Dao Sy Duy Minh, Nguyen Dinh Ha Duong, Le Hoang Minh Huy, Long Nguyen, Dien Dinh
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.03792
Pdf URL: https://arxiv.org/pdf/2601.03792
Copy Paste: [[2601.03792]] VietMed-MCQ: A Consistency-Filtered Data Synthesis Framework for Vietnamese Traditional Medicine Evaluation(https://arxiv.org/abs/2601.03792)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Large Language Models (LLMs) have demonstrated remarkable proficiency in general medical domains. However, their performance significantly degrades in specialized, culturally specific domains such as Vietnamese Traditional Medicine (VTM), primarily due to the scarcity of high-quality, structured benchmarks. In this paper, we introduce VietMed-MCQ, a novel multiple-choice question dataset generated via a Retrieval-Augmented Generation (RAG) pipeline with an automated consistency check mechanism. Unlike previous synthetic datasets, our framework incorporates a dual-model validation approach to ensure reasoning consistency through independent answer verification, though the substring-based evidence checking has known limitations. The complete dataset of 3,190 questions spans three difficulty levels and underwent validation by one medical expert and four students, achieving 94.2 percent approval with substantial inter-rater agreement (Fleiss' kappa = 0.82). We benchmark seven open-source models on VietMed-MCQ. Results reveal that general-purpose models with strong Chinese priors outperform Vietnamese-centric models, highlighting cross-lingual conceptual transfer, while all models still struggle with complex diagnostic reasoning. Our code and dataset are publicly available to foster research in low-resource medical domains.
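For readers unfamiliar with the agreement statistic quoted above, here is a minimal sketch of the standard Fleiss' kappa formula. The rating table is made-up toy data with five raters (matching the one expert plus four students mentioned), and the two rating categories are an assumption; none of it comes from the paper's validation data.

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for a table where ratings[i][j] is the number of raters
    who assigned item i to category j (same rater count for every item)."""
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_categories = len(ratings[0])

    # Share of all assignments that fell into each category.
    p_cat = [sum(row[j] for row in ratings) / (n_items * n_raters)
             for j in range(n_categories)]
    # Per-item agreement: proportion of rater pairs that agree.
    p_item = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
              for row in ratings]

    p_observed = sum(p_item) / n_items
    p_expected = sum(p * p for p in p_cat)
    return (p_observed - p_expected) / (1 - p_expected)

# Made-up toy data: 5 raters, 4 questions, categories = (approve, reject).
toy = [[5, 0], [4, 1], [0, 5], [5, 0]]
kappa = fleiss_kappa(toy)
print(round(kappa, 3))
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it is reported alongside the approval rate rather than instead of it.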

Title: Where meaning lives: Layer-wise accessibility of psycholinguistic features in encoder and decoder language models

Authors: Taisiia Tikhomirova, Dirk U. Wulff
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.03798
Pdf URL: https://arxiv.org/pdf/2601.03798
Copy Paste: [[2601.03798]] Where meaning lives: Layer-wise accessibility of psycholinguistic features in encoder and decoder language models(https://arxiv.org/abs/2601.03798)
Keywords: language model
Abstract: Understanding where transformer language models encode psychologically meaningful aspects of meaning is essential for both theory and practice. We conduct a systematic layer-wise probing study of 58 psycholinguistic features across 10 transformer models, spanning encoder-only and decoder-only architectures, and compare three embedding extraction methods. We find that apparent localization of meaning is strongly method-dependent: contextualized embeddings yield higher feature-specific selectivity and different layer-wise profiles than isolated embeddings. Across models and methods, final-layer representations are rarely optimal for recovering psycholinguistic information with linear probes. Despite these differences, models exhibit a shared depth ordering of meaning dimensions, with lexical properties peaking earlier and experiential and affective dimensions peaking later. Together, these results show that where meaning "lives" in transformer models reflects an interaction between methodological choices and architectural constraints.

Title: AI Generated Text Detection

Authors: Adilkhan Alikhanov, Aidar Amangeldi, Diar Demeubay, Dilnaz Akhmetzhan, Nurbek Moldakhmetov, Omar Polat, Galymzhan Zharas
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.03812
Pdf URL: https://arxiv.org/pdf/2601.03812
Copy Paste: [[2601.03812]] AI Generated Text Detection(https://arxiv.org/abs/2601.03812)
Keywords: language model, llm
Abstract: The rapid development of large language models has led to an increase in AI-generated text, with students increasingly using LLM-generated content as their own work, which violates academic integrity. This paper presents an evaluation of AI text detection methods, including both traditional machine learning models and transformer-based architectures. We utilize two datasets, HC3 and DAIGT v2, to build a unified benchmark and apply a topic-based data split to prevent information leakage. This approach ensures robust generalization across unseen domains. Our experiments show that TF-IDF logistic regression achieves a reasonable baseline accuracy of 82.87%. However, deep learning models outperform it. The BiLSTM classifier achieves an accuracy of 88.86%, while DistilBERT achieves a similar accuracy of 88.11% with the highest ROC-AUC score of 0.96, demonstrating the strongest overall performance. The results indicate that contextual semantic modeling is significantly superior to lexical features and highlight the importance of mitigating topic memorization through appropriate evaluation protocols. The limitations of this work are primarily related to dataset diversity and computational constraints. In future work, we plan to expand dataset diversity and utilize parameter-efficient fine-tuning methods such as LoRA. We also plan to explore smaller or distilled models and employ more efficient batching strategies and hardware-aware optimization.
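A topic-based split means that no topic contributes samples to both the training and the test set, so a detector cannot score well by memorizing topic vocabulary. The abstract does not describe the implementation; the sketch below is one plausible way to do it, with a hypothetical `(text, label, topic)` sample layout and made-up toy data.

```python
import random

def topic_based_split(samples, test_fraction=0.25, seed=0):
    """Split (text, label, topic) samples so train and test share no topic."""
    topics = sorted({topic for _, _, topic in samples})
    random.Random(seed).shuffle(topics)
    n_test = max(1, round(len(topics) * test_fraction))
    held_out = set(topics[:n_test])
    train = [s for s in samples if s[2] not in held_out]
    test = [s for s in samples if s[2] in held_out]
    return train, test

# Made-up toy samples: (text, label, topic); label 1 = AI-generated.
samples = [
    ("essay on climate", 1, "climate"), ("notes on climate", 0, "climate"),
    ("essay on voting", 1, "voting"), ("notes on voting", 0, "voting"),
    ("essay on phones", 1, "phones"), ("notes on phones", 0, "phones"),
    ("essay on cars", 1, "cars"), ("notes on cars", 0, "cars"),
]
train, test = topic_based_split(samples)
train_topics = {t for _, _, t in train}
test_topics = {t for _, _, t in test}
print(train_topics & test_topics)
```

A random per-sample split would instead scatter each topic across both sets, which is the leakage the abstract says the protocol is meant to prevent.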

Title: Step Potential Advantage Estimation: Harnessing Intermediate Confidence and Correctness for Efficient Mathematical Reasoning

Authors: Fei Wu, Zhenrong Zhang, Qikai Chang, Jianshu Zhang, Quan Liu, Jun Du
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.03823
Pdf URL: https://arxiv.org/pdf/2601.03823
Copy Paste: [[2601.03823]] Step Potential Advantage Estimation: Harnessing Intermediate Confidence and Correctness for Efficient Mathematical Reasoning(https://arxiv.org/abs/2601.03823)
Keywords: language model, llm, chain-of-thought
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) elicits long chain-of-thought reasoning in large language models (LLMs), but outcome-based rewards lead to coarse-grained advantage estimation. While existing approaches improve RLVR via token-level entropy or sequence-level length control, they lack a semantically grounded, step-level measure of reasoning progress. As a result, LLMs fail to distinguish necessary deduction from redundant verification: they may continue checking after reaching a correct solution and, in extreme cases, turn a correct trajectory into an incorrect final answer. To remedy the lack of process supervision, we introduce a training-free probing mechanism that extracts intermediate confidence and correctness and combines them into a Step Potential signal that explicitly estimates the reasoning state at each step. Building on this signal, we propose Step Potential Advantage Estimation (SPAE), a fine-grained credit assignment method that amplifies potential gains, penalizes potential drops, and applies a penalty after the potential saturates to encourage timely termination. Experiments across multiple benchmarks show that SPAE consistently improves accuracy while substantially reducing response length, outperforming strong RL baselines and recent efficient reasoning and token-level advantage estimation methods. The code is available at this https URL.
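The three behaviors named in the abstract (amplify gains, penalize drops, penalize steps taken after saturation) can be made concrete with a toy shaping rule. This is emphatically not the paper's SPAE formula, which the abstract does not give: the additive form, every weight, the saturation threshold, and the function name are illustrative assumptions.

```python
def shaped_step_advantages(potentials, outcome_advantage,
                           gain_weight=1.0, drop_weight=1.0,
                           saturation=0.95, overrun_penalty=0.5):
    """Per-step advantages from a sequence of step potentials in [0, 1].

    All weights and the saturation threshold are illustrative assumptions.
    """
    advantages = []
    previous = 0.0
    for potential in potentials:
        delta = potential - previous
        adv = outcome_advantage
        if delta > 0:
            adv += gain_weight * delta      # amplify potential gains
        else:
            adv += drop_weight * delta      # penalize potential drops (delta <= 0)
        if previous >= saturation:
            adv -= overrun_penalty          # discourage steps after saturation
        advantages.append(adv)
        previous = potential
    return advantages

# Made-up trajectory: progress, saturation, a redundant check, then a regression.
steps = [0.2, 0.6, 0.97, 0.97, 0.5]
advs = shaped_step_advantages(steps, outcome_advantage=0.0)
print([round(a, 2) for a in advs])
```

In this toy trajectory the redundant fourth step and the regressing fifth step both receive negative credit, which mirrors the stated goal of encouraging timely termination.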

Title: What Does Loss Optimization Actually Teach, If Anything? Knowledge Dynamics in Continual Pre-training of LLMs

Authors: Seyed Mahed Mousavi, Simone Alghisi, Giuseppe Riccardi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.03858
Pdf URL: https://arxiv.org/pdf/2601.03858
Copy Paste: [[2601.03858]] What Does Loss Optimization Actually Teach, If Anything? Knowledge Dynamics in Continual Pre-training of LLMs(https://arxiv.org/abs/2601.03858)
Keywords: llm
Abstract: Continual Pre-Training (CPT) is widely used for acquiring and updating factual knowledge in LLMs. This practice treats loss as a proxy for knowledge learning, while offering no insight into how it changes during training. We study CPT as a knowledge learning process rather than solely as an optimization problem. We construct a controlled, distribution-matched benchmark of factual documents and interleave diagnostic probes directly into the CPT loop, enabling epoch-level measurement of knowledge acquisition dynamics and changes in Out-Of-Domain (OOD) general skills (e.g., math). We further analyze how CPT reshapes knowledge circuits during training. Across three instruction-tuned LLMs and multiple CPT strategies, optimization and learning systematically diverge: loss decreases monotonically, while factual learning is unstable and non-monotonic. Acquired facts are rarely consolidated, learning is strongly conditioned on prior exposure, and OOD performance degrades from early epochs. Circuit analysis reveals rapid reconfiguration of knowledge pathways across epochs, providing an explanation for narrow acquisition windows and systematic forgetting. These results show that loss optimization is misaligned with learning progress in CPT and motivate evaluation of stopping criteria based on task-level learning dynamics.

Title: PartisanLens: A Multilingual Dataset of Hyperpartisan and Conspiratorial Immigration Narratives in European Media

Authors: Michele Joshua Maggini, Paloma Piot, Anxo Pérez, Erik Bran Marino, Lúa Santamaría Montesinos, Ana Lisboa, Marta Vázquez Abuín, Javier Parapar, Pablo Gamallo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.03860
Pdf URL: https://arxiv.org/pdf/2601.03860
Copy Paste: [[2601.03860]] PartisanLens: A Multilingual Dataset of Hyperpartisan and Conspiratorial Immigration Narratives in European Media(https://arxiv.org/abs/2601.03860)
Keywords: language model, llm
Abstract: Detecting hyperpartisan narratives and Population Replacement Conspiracy Theories (PRCT) is essential to addressing the spread of misinformation. These complex narratives pose a significant threat, as hyperpartisanship drives political polarisation and institutional distrust, while PRCTs directly motivate real-world extremist violence, making their identification critical for social cohesion and public safety. However, existing resources are scarce, predominantly English-centric, and often analyse hyperpartisanship, stance, and rhetorical bias in isolation rather than as interrelated aspects of political discourse. To bridge this gap, we introduce PartisanLens, the first multilingual dataset of 1617 hyperpartisan news headlines in Spanish, Italian, and Portuguese, annotated for multiple aspects of political discourse. We first evaluate the classification performance of widely used Large Language Models (LLMs) on this dataset, establishing robust baselines for the classification of hyperpartisan and PRCT narratives. In addition, we assess the viability of using LLMs as automatic annotators for this task, analysing their ability to approximate human annotation. Results highlight both their potential and current limitations. Next, moving beyond standard judgments, we explore whether LLMs can emulate human annotation patterns by conditioning them on socio-economic and ideological profiles that simulate annotator perspectives. Finally, we release our resources and evaluation; PartisanLens supports future research on detecting partisan and conspiratorial narratives in European contexts.

Title: What Matters For Safety Alignment?

Authors: Xing Li, Hui-Ling Zhen, Lihao Yin, Xianzhi Yu, Zhenhua Dong, Mingxuan Yuan
Subjects: cs.CL, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2601.03868
Pdf URL: https://arxiv.org/pdf/2601.03868
Copy Paste: [[2601.03868]] What Matters For Safety Alignment?(https://arxiv.org/abs/2601.03868)
Keywords: gpt, llm, prompt
Abstract: This paper presents a comprehensive empirical study of safety alignment capabilities. We evaluate what matters for safety alignment in LLMs and LRMs to provide essential insights for developing more secure and reliable AI systems. We systematically investigate and compare the influence of six critical intrinsic model characteristics and three external attack techniques. Our large-scale evaluation is conducted using 32 recent, popular LLMs and LRMs across thirteen distinct model families, spanning a parameter scale from 3B to 235B. The assessment leverages five established safety datasets and probes model vulnerabilities with 56 jailbreak techniques and four CoT attack strategies, resulting in 4.6M API calls. Our key empirical findings are fourfold. First, we identify the LRMs GPT-OSS-20B, Qwen3-Next-80B-A3B-Thinking, and GPT-OSS-120B as the three safest models, which substantiates the significant advantage of integrated reasoning and self-reflection mechanisms for robust safety alignment. Second, post-training and knowledge distillation may lead to a systematic degradation of safety alignment. We thus argue that safety must be treated as an explicit constraint or a core optimization objective during these stages, not merely subordinated to the pursuit of general capability. Third, we reveal a pronounced vulnerability: employing a CoT attack via a response prefix can raise the attack success rate by 3.34x on average, and from 0.6% to 96.3% for Seed-OSS-36B-Instruct. This critical finding underscores the safety risks inherent in text-completion interfaces and features that allow user-defined response prefixes in LLM services, highlighting an urgent need for architectural and deployment safeguards. Fourth, roleplay, prompt injection, and gradient-based search for adversarial prompts are the predominant methodologies for eliciting unaligned behaviors in modern models.

Title: Atlas: Orchestrating Heterogeneous Models and Tools for Multi-Domain Complex Reasoning

Authors: Jinyang Wu, Guocheng Zhai, Ruihan Jin, Jiahao Yuan, Yuhao Shen, Shuai Zhang, Zhengqi Wen, Jianhua Tao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.03872
Pdf URL: https://arxiv.org/pdf/2601.03872
Copy Paste: [[2601.03872]] Atlas: Orchestrating Heterogeneous Models and Tools for Multi-Domain Complex Reasoning(https://arxiv.org/abs/2601.03872)
Keywords: language model, gpt, llm, agent
Abstract: The integration of large language models (LLMs) with external tools has significantly expanded the capabilities of AI agents. However, as the diversity of both LLMs and tools increases, selecting the optimal model-tool combination becomes a high-dimensional optimization challenge. Existing approaches often rely on a single model or fixed tool-calling logic, failing to exploit the performance variations across heterogeneous model-tool pairs. In this paper, we present ATLAS (Adaptive Tool-LLM Alignment and Synergistic Invocation), a dual-path framework for dynamic tool usage in cross-domain complex reasoning. ATLAS operates via a dual-path approach: (1) training-free cluster-based routing that exploits empirical priors for domain-specific alignment, and (2) RL-based multi-step routing that explores autonomous trajectories for out-of-distribution generalization. Extensive experiments across 15 benchmarks demonstrate that our method outperforms closed-source models like GPT-4o, surpassing existing routing methods on both in-distribution (+10.1%) and out-of-distribution (+13.1%) tasks. Furthermore, our framework shows significant gains in visual reasoning by orchestrating specialized multi-modal tools.

Title: Evaluating Small Decoder-Only Language Models for Grammar Correction and Text Simplification

Authors: Anthony Lamelas
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.03874
Pdf URL: https://arxiv.org/pdf/2601.03874
Copy Paste: [[2601.03874]] Evaluating Small Decoder-Only Language Models for Grammar Correction and Text Simplification(https://arxiv.org/abs/2601.03874)
Keywords: language model, llm, hallucination
Abstract: Large language models have become extremely popular recently due to their ability to achieve strong performance on a variety of tasks, such as text generation and rewriting, but their size and computation cost make them difficult to access, deploy, and secure in many settings. This paper investigates whether small, decoder-only language models (SLMs) can provide an efficient alternative for the tasks of grammar correction and text simplification. The experiments in this paper focus on testing small language models out of the box, fine-tuned, and run sequentially on the JFLEG and ASSET datasets using established metrics. The results show that while SLMs may learn certain behaviors well, their performance remains below strong baselines and current LLMs. The results also show that SLMs struggle to retain meaning and are prone to hallucinations. These findings suggest that despite their efficiency advantages, current SLMs are not yet competitive enough with modern LLMs for rewriting, and further advances in training are required for SLMs to close the performance gap between them and today's LLMs.

Title: Decide Then Retrieve: A Training-Free Framework with Uncertainty-Guided Triggering and Dual-Path Retrieval

Authors: Wang Chen, Guanqiang Qi, Weikang Li, Yang Li, Deguo Xia, Jizhou Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.03908
Pdf URL: https://arxiv.org/pdf/2601.03908
Copy Paste: [[2601.03908]] Decide Then Retrieve: A Training-Free Framework with Uncertainty-Guided Triggering and Dual-Path Retrieval(https://arxiv.org/abs/2601.03908)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Retrieval-augmented generation (RAG) enhances large language models (LLMs) by incorporating external knowledge, but existing approaches indiscriminately trigger retrieval and rely on single-path evidence construction, often introducing noise and limiting performance gains. In this work, we propose Decide Then Retrieve (DTR), a training-free framework that adaptively determines when retrieval is necessary and how external information should be selected. DTR leverages generation uncertainty to guide retrieval triggering and introduces a dual-path retrieval mechanism with adaptive information selection to better handle sparse and ambiguous queries. Extensive experiments across five open-domain QA benchmarks, multiple model scales, and different retrievers demonstrate that DTR consistently improves EM and F1 over standard RAG and strong retrieval-enhanced baselines, while reducing unnecessary retrievals. The code and data used in this paper are available at this https URL.
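The uncertainty-guided triggering described in this abstract can be illustrated with a minimal sketch. The abstract does not say how DTR measures generation uncertainty, so the mean token entropy, the function names, and the 0.5 threshold below are all assumptions chosen for illustration, not the paper's method.

```python
import math

def mean_token_entropy(step_distributions):
    """Average Shannon entropy (in nats) over per-token probability distributions."""
    entropies = [
        -sum(p * math.log(p) for p in dist if p > 0.0)
        for dist in step_distributions
    ]
    return sum(entropies) / len(entropies)

def should_retrieve(step_distributions, threshold=0.5):
    """Hypothetical trigger: retrieve only when the draft answer looks uncertain."""
    return mean_token_entropy(step_distributions) > threshold

# Toy next-token distributions for two draft answers (illustrative values only).
confident = [[0.97, 0.02, 0.01], [0.99, 0.005, 0.005]]
uncertain = [[0.4, 0.35, 0.25], [0.5, 0.3, 0.2]]

print(should_retrieve(confident))  # False: low entropy, answer without retrieval
print(should_retrieve(uncertain))  # True: high entropy, trigger retrieval
```

Skipping retrieval for confident drafts is what would reduce unnecessary retrievals in a scheme like this; the threshold would need tuning per model and task.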

Title: When Models Decide and When They Bind: A Two-Stage Computation for Multiple-Choice Question-Answering

Authors: Hugh Mee Wong, Rick Nouwen, Albert Gatt
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.03914
Pdf URL: https://arxiv.org/pdf/2601.03914
Copy Paste: [[2601.03914]] When Models Decide and When They Bind: A Two-Stage Computation for Multiple-Choice Question-Answering(https://arxiv.org/abs/2601.03914)
Keywords: language model
Abstract: Multiple-choice question answering (MCQA) is easy to evaluate but adds a meta-task: models must both solve the problem and output the symbol that *represents* the answer, conflating reasoning errors with symbol-binding failures. We study how language models implement MCQA internally using representational analyses (PCA, linear probes) as well as causal interventions. We find that option-boundary (newline) residual states often contain strong linearly decodable signals related to per-option correctness. Winner-identity probing reveals a two-stage progression: the winning *content position* becomes decodable immediately after the final option is processed, while the *output symbol* is represented closer to the answer emission position. Tests under symbol and content permutations support a two-stage mechanism in which models first select a winner in content space and then bind or route that winner to the appropriate symbol to emit.

Title: Doc-PP: Document Policy Preservation Benchmark for Large Vision-Language Models

Authors: Haeun Jang, Hwan Chang, Hwanhee Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.03926
Pdf URL: https://arxiv.org/pdf/2601.03926
Copy Paste: [[2601.03926]] Doc-PP: Document Policy Preservation Benchmark for Large Vision-Language Models(https://arxiv.org/abs/2601.03926)
Keywords: language model, prompt
Abstract: The deployment of Large Vision-Language Models (LVLMs) for real-world document question answering is often constrained by dynamic, user-defined policies that dictate information disclosure based on context. While ensuring adherence to these explicit constraints is critical, existing safety research primarily focuses on implicit social norms or text-only settings, overlooking the complexities of multimodal documents. In this paper, we introduce Doc-PP (Document Policy Preservation Benchmark), a novel benchmark constructed from real-world reports requiring reasoning across heterogeneous visual and textual elements under strict non-disclosure policies. Our evaluation highlights a systemic Reasoning-Induced Safety Gap: models frequently leak sensitive information when answers must be inferred through complex synthesis or aggregated across modalities, effectively circumventing existing safety constraints. Furthermore, we identify that providing extracted text improves perception but inadvertently facilitates leakage. To address these vulnerabilities, we propose DVA (Decompose-Verify-Aggregation), a structural inference framework that decouples reasoning from policy verification. Experimental results demonstrate that DVA significantly outperforms standard prompting defenses, offering a robust baseline for policy-compliant document understanding.

Title: Large-Scale Aspect-Based Sentiment Analysis with Reasoning-Infused LLMs

Authors: Paweł Liskowski, Krzysztof Jankowski
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.03940
Pdf URL: https://arxiv.org/pdf/2601.03940
Copy Paste: [[2601.03940]] Large-Scale Aspect-Based Sentiment Analysis with Reasoning-Infused LLMs(https://arxiv.org/abs/2601.03940)
Keywords: gpt, llm, chain-of-thought
Abstract: We introduce Arctic-ABSA, a collection of powerful models for real-life aspect-based sentiment analysis (ABSA). Our models are tailored to commercial needs, trained on a large corpus of public data alongside carefully generated synthetic data, resulting in a dataset 20 times larger than SemEval14. We extend typical ABSA models by expanding the number of sentiment classes from the standard three (positive, negative, neutral) to five, adding mixed and unknown classes, while also jointly predicting overall text sentiment and supporting multiple languages. We experiment with reasoning injection by fine-tuning on Chain-of-Thought (CoT) examples and introduce a novel reasoning pretraining technique for encoder-only models that significantly improves downstream fine-tuning and generalization. Our 395M-parameter encoder and 8B-parameter decoder achieve up to 10 percentage points higher accuracy than GPT-4o and Claude 3.5 Sonnet, while setting new state-of-the-art results on the SemEval14 benchmark. A single multilingual model maintains 87-91% accuracy across six languages without degrading English performance. We release ABSA-mix, a large-scale benchmark aggregating 17 public ABSA datasets across 92 domains.

Title: RADAR: Retrieval-Augmented Detector with Adversarial Refinement for Robust Fake News Detection

Authors: Song-Duo Ma, Yi-Hung Liu, Hsin-Yu Lin, Pin-Yu Chen, Hong-Yan Huang, Shau-Yung Hsu, Yun-Nung Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.03981
Pdf URL: https://arxiv.org/pdf/2601.03981
Copy Paste: [[2601.03981]] RADAR: Retrieval-Augmented Detector with Adversarial Refinement for Robust Fake News Detection(https://arxiv.org/abs/2601.03981)
Keywords: llm
Abstract: To efficiently combat the spread of LLM-generated misinformation, we present RADAR, a retrieval-augmented detector with adversarial refinement for robust fake news detection. Our approach employs a generator that rewrites real articles with factual perturbations, paired with a lightweight detector that verifies claims using dense passage retrieval. To enable effective co-evolution, we introduce verbal adversarial feedback (VAF). Rather than relying on scalar rewards, VAF issues structured natural-language critiques; these guide the generator toward more sophisticated evasion attempts, compelling the detector to adapt and improve. On a fake news detection benchmark, RADAR achieves 86.98% ROC-AUC, significantly outperforming general-purpose LLMs with retrieval. Ablation studies confirm that detector-side retrieval yields the largest gains, while VAF and few-shot demonstrations provide critical signals for robust training.

Title: Benchmark^2: Systematic Evaluation of LLM Benchmarks

Authors: Qi Qian, Chengsong Huang, Jingwen Xu, Changze Lv, Muling Wu, Wenhao Liu, Xiaohua Wang, Zhenghua Wang, Zisu Huang, Muzhao Tian, Jianhan Xu, Kun Hu, He-Da Wang, Yao Hu, Xuanjing Huang, Xiaoqing Zheng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.03986
Pdf URL: https://arxiv.org/pdf/2601.03986
Copy Paste: [[2601.03986]] Benchmark^2: Systematic Evaluation of LLM Benchmarks(https://arxiv.org/abs/2601.03986)
Keywords: language model, llm
Abstract: The rapid proliferation of benchmarks for evaluating large language models (LLMs) has created an urgent need for systematic methods to assess benchmark quality itself. We propose Benchmark^2, a comprehensive framework comprising three complementary metrics: (1) Cross-Benchmark Ranking Consistency, measuring whether a benchmark produces model rankings aligned with peer benchmarks; (2) Discriminability Score, quantifying a benchmark's ability to differentiate between models; and (3) Capability Alignment Deviation, identifying problematic instances where stronger models fail but weaker models succeed within the same model family. We conduct extensive experiments across 15 benchmarks spanning mathematics, reasoning, and knowledge domains, evaluating 11 LLMs across four model families. Our analysis reveals significant quality variations among existing benchmarks and demonstrates that selective benchmark construction based on our metrics can achieve comparable evaluation performance with substantially reduced test sets.
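Two of the three metrics named in this abstract can be sketched in a few lines. The abstract gives no formulas, so this sketch assumes Spearman rank correlation as a stand-in for Cross-Benchmark Ranking Consistency and a simple per-instance check for the inversions behind Capability Alignment Deviation. The model names and scores are made up.

```python
def ranks(scores):
    """Rank models by score, 1 = best; assumes no tied scores."""
    order = sorted(scores, key=scores.get, reverse=True)
    return {model: i + 1 for i, model in enumerate(order)}

def spearman(scores_a, scores_b):
    """Spearman rank correlation between two benchmarks' model rankings."""
    ra, rb = ranks(scores_a), ranks(scores_b)
    n = len(ra)
    d2 = sum((ra[m] - rb[m]) ** 2 for m in ra)
    return 1 - 6 * d2 / (n * (n * n - 1))

def inverted_instances(weak_correct, strong_correct):
    """Indices where the weaker model succeeds but the stronger one fails."""
    pairs = zip(weak_correct, strong_correct)
    return [i for i, (weak, strong) in enumerate(pairs) if weak and not strong]

# Hypothetical accuracies of four models on two benchmarks.
bench_x = {"m1": 0.81, "m2": 0.74, "m3": 0.62, "m4": 0.55}
bench_y = {"m1": 0.70, "m2": 0.72, "m3": 0.50, "m4": 0.41}

consistency = spearman(bench_x, bench_y)
flagged = inverted_instances([True, False, True], [True, True, False])
print(round(consistency, 3))  # 0.8: only m1 and m2 swap places
print(flagged)                # [2]
```

A benchmark whose rankings correlate poorly with its peers, or that contains many inverted instances within one model family, would be a candidate for closer inspection under this reading of the metrics.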

Title: VotIE: Information Extraction from Meeting Minutes

Authors: José Pedro Evans, Luís Filipe Cunha, Purificação Silvano, Alípio Jorge, Nuno Guimarães, Sérgio Nunes, Ricardo Campos
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.03997
Pdf URL: https://arxiv.org/pdf/2601.03997
Copy Paste: [[2601.03997]] VotIE: Information Extraction from Meeting Minutes(https://arxiv.org/abs/2601.03997)
Keywords: llm
Abstract: Municipal meeting minutes record key decisions in local democratic processes. Unlike parliamentary proceedings, which typically adhere to standardized formats, they encode voting outcomes in highly heterogeneous, free-form narrative text that varies widely across municipalities, posing significant challenges for automated extraction. In this paper, we introduce VotIE (Voting Information Extraction), a new information extraction task aimed at identifying structured voting events in narrative deliberative records, and establish the first benchmark for this task using Portuguese municipal minutes, building on the recently introduced CitiLink corpus. Our experiments yield two key findings. First, under standard in-domain evaluation, fine-tuned encoders, specifically XLM-R-CRF, achieve the strongest performance, reaching 93.2% macro F1, outperforming generative approaches. Second, in a cross-municipality setting that evaluates transfer to unseen administrative contexts, these models suffer substantial performance degradation, whereas few-shot LLMs demonstrate greater robustness, with significantly smaller declines in performance. Despite this generalization advantage, the high computational cost of generative models currently constrains their practicality. As a result, lightweight fine-tuned encoders remain a more practical option for large-scale, real-world deployment. To support reproducible research in administrative NLP, we publicly release our benchmark, trained models, and evaluation framework.
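For readers unfamiliar with the macro F1 figure quoted in this abstract, the sketch below shows how the metric is commonly computed: per-label F1 scores are averaged with equal weight. The labels and predictions are hypothetical, and the paper may score at the span or entity level rather than per item as done here.

```python
def macro_f1(gold, pred):
    """Unweighted mean of per-label F1 over all labels seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    f1_scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)

# Hypothetical vote labels, invented for illustration.
gold = ["FOR", "FOR", "AGAINST", "ABSTAIN"]
pred = ["FOR", "AGAINST", "AGAINST", "ABSTAIN"]

score = macro_f1(gold, pred)
print(round(score, 4))  # 0.7778 = (2/3 + 2/3 + 1) / 3
```

Because every label counts equally, macro F1 penalizes weak performance on rare labels more than accuracy or micro F1 would.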

Title: Simulated Students in Tutoring Dialogues: Substance or Illusion?

Authors: Alexander Scarlatos, Jaewook Lee, Simon Woodhead, Andrew Lan
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2601.04025
Pdf URL: https://arxiv.org/pdf/2601.04025
Copy Paste: [[2601.04025]] Simulated Students in Tutoring Dialogues: Substance or Illusion?(https://arxiv.org/abs/2601.04025)
Keywords: language model, llm, prompt
Abstract: Advances in large language models (LLMs) enable many new innovations in education. However, evaluating the effectiveness of new technology requires real students, which is time-consuming and hard to scale up. Therefore, many recent works on LLM-powered tutoring solutions have used simulated students for both training and evaluation, often via simple prompting. Surprisingly, little work has been done to ensure or even measure the quality of simulated students. In this work, we formally define the student simulation task, propose a set of evaluation metrics that span linguistic, behavioral, and cognitive aspects, and benchmark a wide range of student simulation methods on these metrics. We experiment on a real-world math tutoring dialogue dataset, where both automated and human evaluation results show that prompting strategies for student simulation perform poorly; supervised fine-tuning and preference optimization yield much better but still limited performance, motivating future work on this challenging task.

Title: SpeakerSleuth: Evaluating Large Audio-Language Models as Judges for Multi-turn Speaker Consistency

Authors: Jonggeun Lee, Junseong Pyo, Gyuhyeon Seo, Yohan Jo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.04029
Pdf URL: https://arxiv.org/pdf/2601.04029
Copy Paste: [[2601.04029]] SpeakerSleuth: Evaluating Large Audio-Language Models as Judges for Multi-turn Speaker Consistency(https://arxiv.org/abs/2601.04029)
Keywords: language model
Abstract: Large Audio-Language Models (LALMs) as judges have emerged as a prominent approach for evaluating speech generation quality, yet their ability to assess speaker consistency across multi-turn conversations remains unexplored. We present SpeakerSleuth, a benchmark evaluating whether LALMs can reliably judge speaker consistency in multi-turn dialogues through three tasks reflecting real-world requirements. We construct 1,818 human-verified evaluation instances across four diverse datasets spanning synthetic and real speech, with controlled acoustic difficulty. Evaluating nine widely-used LALMs, we find that models struggle to reliably detect acoustic inconsistencies. For instance, given audio samples of the same speaker's turns, some models overpredict inconsistency, whereas others are overly lenient. Models further struggle to identify the exact turns that are problematic. When other interlocutors' turns are provided together, performance degrades dramatically as models prioritize textual coherence over acoustic cues, failing to detect even obvious gender switches for a speaker. On the other hand, models perform substantially better in choosing the audio that best matches the speaker among several acoustic variants, demonstrating inherent acoustic discrimination capabilities. These findings expose a significant bias in LALMs: they tend to prioritize text over acoustics, revealing fundamental modality imbalances that need to be addressed to build reliable audio-language judges.

Title: Analyzing and Improving Cross-lingual Knowledge Transfer for Machine Translation

Authors: David Stap
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.04036
Pdf URL: https://arxiv.org/pdf/2601.04036
Copy Paste: [[2601.04036]] Analyzing and Improving Cross-lingual Knowledge Transfer for Machine Translation(https://arxiv.org/abs/2601.04036)
Keywords: language model
Abstract: Multilingual machine translation systems aim to make knowledge accessible across languages, yet learning effective cross-lingual representations remains challenging. These challenges are especially pronounced for low-resource languages, where limited parallel data constrains generalization and transfer. Understanding how multilingual models share knowledge across languages requires examining the interaction between representations, data availability, and training strategies. In this thesis, we study cross-lingual knowledge transfer in neural models and develop methods to improve robustness and generalization in multilingual settings, using machine translation as a central testbed. We analyze how similarity between languages influences transfer, how retrieval and auxiliary supervision can strengthen low-resource translation, and how fine-tuning on parallel data can introduce unintended trade-offs in large language models. We further examine the role of language diversity during training and show that increasing translation coverage improves generalization and reduces off-target behavior. Together, this work highlights how modeling choices and data composition shape multilingual learning and offers insights toward more inclusive and resilient multilingual NLP systems.

Title: When Helpers Become Hazards: A Benchmark for Analyzing Multimodal LLM-Powered Safety in Daily Life

Authors: Xinyue Lou, Jinan Xu, Jingyi Yin, Xiaolong Wang, Zhaolu Kang, Youwei Liao, Yixuan Wang, Xiangyu Shi, Fengran Mo, Su Yao, Kaiyu Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.04043
Pdf URL: https://arxiv.org/pdf/2601.04043
Copy Paste: [[2601.04043]] When Helpers Become Hazards: A Benchmark for Analyzing Multimodal LLM-Powered Safety in Daily Life(https://arxiv.org/abs/2601.04043)
Keywords: language model, llm
Abstract: As Multimodal Large Language Models (MLLMs) become indispensable assistants in human life, the unsafe content they generate poses a danger to human behavior, perpetually hanging over human society like a sword of Damocles. To investigate and evaluate the safety impact of MLLM responses on human behavior in daily life, we introduce SaLAD, a multimodal safety benchmark which contains 2,013 real-world image-text samples across 10 common categories, with a balanced design covering both unsafe scenarios and cases of oversensitivity. It emphasizes realistic risk exposure, authentic visual inputs, and fine-grained cross-modal reasoning, ensuring that safety risks cannot be inferred from text alone. We further propose a safety-warning-based evaluation framework that encourages models to provide clear and informative safety warnings, rather than generic refusals. Results on 18 MLLMs demonstrate that the top-performing models achieve a safe response rate of only 57.2% on unsafe queries. Moreover, even popular safety alignment methods have limited effectiveness in our scenario, revealing the vulnerabilities of current MLLMs in identifying dangerous behaviors in daily life. Our dataset is available at this https URL.

Title: Modular Prompt Optimization: Optimizing Structured Prompts with Section-Local Textual Gradients

Authors: Prith Sharma, Austin Z. Henley
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.04055
Pdf URL: https://arxiv.org/pdf/2601.04055
Copy Paste: [[2601.04055]] Modular Prompt Optimization: Optimizing Structured Prompts with Section-Local Textual Gradients(https://arxiv.org/abs/2601.04055)
Keywords: language model, llm, prompt
Abstract: Prompt quality plays a central role in controlling the behavior, reliability, and reasoning performance of large language models (LLMs), particularly for smaller open-source instruction-tuned models that depend heavily on explicit structure. While recent work has explored automatic prompt optimization using textual gradients and self-refinement, most existing methods treat prompts as monolithic blocks of text, making it difficult to localize errors, preserve critical instructions, or prevent uncontrolled prompt growth. We introduce Modular Prompt Optimization (MPO), a schema-based prompt optimization framework that treats prompts as structured objects composed of fixed semantic sections, including system role, context, task description, constraints, and output format. MPO applies section-local textual gradients, generated by a critic language model, to refine each section independently while keeping the overall prompt schema fixed. Section updates are consolidated through de-duplication to reduce redundancy and interference between components, yielding an interpretable and robust optimization process. We evaluate MPO on two reasoning benchmarks, ARC-Challenge and MMLU, using LLaMA-3 8B-Instruct and Mistral-7B-Instruct as solver models. Across both benchmarks and models, MPO consistently outperforms an untuned structured prompt and the TextGrad baseline, achieving substantial accuracy gains without modifying model parameters or altering prompt structure. These results demonstrate that maintaining a fixed prompt schema while applying localized, section-wise optimization is an effective and practical approach for improving reasoning performance in small open-source LMs.
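The section-local update loop described in this abstract can be sketched as follows. This is an illustrative reading rather than the authors' implementation: the section names come from the abstract, but the list-of-lines prompt representation, the `stub_critic` stand-in for the critic language model, and the exact-match de-duplication rule are assumptions.

```python
SCHEMA = ("system_role", "context", "task_description", "constraints", "output_format")

def dedupe(lines):
    """Drop repeated lines (case-insensitive) while preserving first-seen order."""
    seen, kept = set(), []
    for line in lines:
        key = line.strip().lower()
        if key and key not in seen:
            seen.add(key)
            kept.append(line.strip())
    return kept

def refine(prompt, critic):
    """Apply critic suggestions section by section; the schema itself never changes."""
    return {
        section: dedupe(prompt[section] + critic(section, prompt[section]))
        for section in SCHEMA
    }

def stub_critic(section, lines):
    """Hypothetical critic: a real one would be an LLM call returning textual feedback."""
    if section == "constraints":
        return ["Answer with a single letter.", "Do not explain your answer."]
    return []

prompt = {
    "system_role": ["You are a careful science tutor."],
    "context": [],
    "task_description": ["Answer the multiple-choice question."],
    "constraints": ["Answer with a single letter."],
    "output_format": ["Answer: <letter>"],
}

refined = refine(prompt, stub_critic)
print(refined["constraints"])
# ['Answer with a single letter.', 'Do not explain your answer.']
```

Keeping each update confined to one section is what makes errors easy to localize in this sketch, and the de-duplication step limits prompt growth across iterations.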

Title: Bridging the Discrete-Continuous Gap: Unified Multimodal Generation via Coupled Manifold Discrete Absorbing Diffusion

Authors: Yuanfeng Xu, Yuhao Chen, Liang Lin, Guangrun Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.04056
Pdf URL: https://arxiv.org/pdf/2601.04056
Copy Paste: [[2601.04056]] Bridging the Discrete-Continuous Gap: Unified Multimodal Generation via Coupled Manifold Discrete Absorbing Diffusion(https://arxiv.org/abs/2601.04056)
Keywords: language model
Abstract: The bifurcation of generative modeling into autoregressive approaches for discrete data (text) and diffusion approaches for continuous data (images) hinders the development of truly unified multimodal systems. While Masked Language Models (MLMs) offer efficient bidirectional context, they traditionally lack the generative fidelity of autoregressive models and the semantic continuity of diffusion models. Furthermore, extending masked generation to multimodal settings introduces severe alignment challenges and training instability. In this work, we propose CoM-DAD (Coupled Manifold Discrete Absorbing Diffusion), a novel probabilistic framework that reformulates multimodal generation as a hierarchical dual-process. CoM-DAD decouples high-level semantic planning from low-level token synthesis. First, we model the semantic manifold via a continuous latent diffusion process; second, we treat token generation as a discrete absorbing diffusion process, regulated by a Variable-Rate Noise Schedule, conditioned on these evolving semantic priors. Crucially, we introduce a Stochastic Mixed-Modal Transport strategy that aligns disparate modalities without requiring heavy contrastive dual-encoders. Our method demonstrates superior stability over standard masked modeling, establishing a new paradigm for scalable, unified text-image generation.

Title: KDCM: Reducing Hallucination in LLMs through Explicit Reasoning Structures

Authors: Jinbo Hao, Kai Yang, Qingzhen Su, Yifan Li, Chao Jiang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.04086
Pdf URL: https://arxiv.org/pdf/2601.04086
Copy Paste: [[2601.04086]] KDCM: Reducing Hallucination in LLMs through Explicit Reasoning Structures(https://arxiv.org/abs/2601.04086)
Keywords: language model, gpt, llm, hallucination, prompt
Abstract: To mitigate hallucinations in large language models (LLMs), we propose a framework that focuses on errors induced by prompts. Our method extends a chain-style knowledge distillation approach by incorporating a programmable module that guides knowledge graph exploration. This module is embedded as executable code within the reasoning prompt, allowing the model to leverage external structured knowledge during inference. Based on this design, we develop an enhanced distillation-based reasoning framework that explicitly regulates intermediate reasoning steps, resulting in more reliable predictions. We evaluate the proposed approach on multiple public benchmarks using GPT-4 and LLaMA-3.3. Experimental results show that code-guided reasoning significantly improves contextual modeling and reduces prompt-induced hallucinations. Specifically, HIT@1, HIT@3, and HIT@5 increase by 15.64%, 13.38%, and 13.28%, respectively, with scores exceeding 95% across several evaluation settings. These findings indicate that the proposed method effectively constrains erroneous reasoning while improving both accuracy and interpretability.
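The HIT@k figures reported in this abstract follow a standard definition: the fraction of queries whose gold answer appears among the top k ranked predictions. The sketch below uses made-up predictions and assumes one gold answer per query; the paper's evaluation setup may differ.

```python
def hit_at_k(ranked_predictions, gold_answers, k):
    """Fraction of queries whose gold answer is among the top-k predictions."""
    hits = sum(
        1 for preds, gold in zip(ranked_predictions, gold_answers)
        if gold in preds[:k]
    )
    return hits / len(gold_answers)

# Hypothetical ranked candidate lists for three queries.
ranked = [
    ["Paris", "Lyon", "Nice"],
    ["Rome", "Milan", "Turin"],
    ["Bonn", "Berlin", "Munich"],
]
gold = ["Paris", "Turin", "Berlin"]

for k in (1, 2, 3):
    print(k, round(hit_at_k(ranked, gold, k), 3))
# 1 0.333
# 2 0.667
# 3 1.0
```

HIT@k never decreases as k grows, so gains at HIT@1 are the hardest to achieve.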
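Summary: To mitigate hallucinations in large language models (LLMs), we propose a framework that targets prompt-induced errors. Our method extends a chain-style knowledge distillation approach with a programmable module that guides knowledge graph exploration. The module is embedded as executable code in the reasoning prompt, letting the model draw on external structured knowledge during inference. Building on this design, we develop an enhanced distillation-based reasoning framework that explicitly regulates intermediate reasoning steps and so yields more reliable predictions. We evaluate the approach on multiple public benchmarks using GPT-4 and LLaMA-3.3. The experiments show that code-guided reasoning significantly improves contextual modeling and reduces prompt-induced hallucinations. Specifically, HIT@1, HIT@3, and HIT@5 rise by 15.64%, 13.38%, and 13.28%, respectively, with scores above 95% in several evaluation settings. These findings indicate that the method effectively constrains erroneous reasoning while improving both accuracy and interpretability.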
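
The KDCM abstract reports gains in HIT@1, HIT@3, and HIT@5 without defining the metric. The sketch below assumes the usual reading of HIT@k (the share of queries whose gold answer appears among the top k ranked predictions). The function name `hit_at_k` and the toy data are illustrative assumptions, not taken from the paper.

```python
from typing import Sequence


def hit_at_k(ranked: Sequence[Sequence[str]], gold: Sequence[str], k: int) -> float:
    """Fraction of queries whose gold answer is among the top-k predictions.

    Assumed definition of HIT@k; the paper may compute it differently.
    """
    if len(ranked) != len(gold):
        raise ValueError("ranked and gold must have the same length")
    if not gold:
        return 0.0
    hits = sum(1 for preds, answer in zip(ranked, gold) if answer in preds[:k])
    return hits / len(gold)


# Toy data (hypothetical): one ranked prediction list per query.
predictions = [
    ["Paris", "Lyon", "Nice"],
    ["Berlin", "Bonn", "Munich"],
    ["Rome", "Milan", "Turin"],
    ["Oslo", "Bergen", "Madrid"],
]
answers = ["Paris", "Bonn", "Turin", "Madrid"]

print(hit_at_k(predictions, answers, 1))  # 0.25
print(hit_at_k(predictions, answers, 3))  # 1.0
```

Because HIT@k can only grow as k grows, the reported gains at k = 1, 3, and 5 are not independent measurements of quality.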

Title: SearchAttack: Red-Teaming LLMs against Real-World Threats via Framing Unsafe Web Information-Seeking Tasks

Authors: Yu Yan, Sheng Sun, Mingfeng Li, Zheming Yang, Chiwei Zhu, Fei Ma, Benfeng Xu, Min Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.04093
Pdf URL: https://arxiv.org/pdf/2601.04093
Copy Paste: [[2601.04093]] SearchAttack: Red-Teaming LLMs against Real-World Threats via Framing Unsafe Web Information-Seeking Tasks(https://arxiv.org/abs/2601.04093)
Keywords: llm, chat
Abstract: Recently, people have suffered from, and become increasingly aware of, the unreliability gap in LLMs for open and knowledge-intensive tasks, and have thus turned to search-augmented LLMs to mitigate this issue. However, when the search engine is triggered for harmful tasks, the outcome is no longer under the LLM's control. Once the returned content directly contains targeted, ready-to-use harmful takeaways, the LLM's safeguards cannot withdraw that exposure. Motivated by this dilemma, we identify web search as a critical attack surface and propose SearchAttack for red-teaming. SearchAttack outsources the harmful semantics to web search, retaining only the query's skeleton and fragmented clues, and further steers LLMs to reconstruct the retrieved content via structural rubrics to achieve malicious goals. Extensive experiments are conducted to red-team the search-augmented LLMs for responsible vulnerability assessment. Empirically, SearchAttack demonstrates strong effectiveness in attacking these systems.
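Summary: People have recently suffered from, and grown increasingly aware of, the unreliability gap of LLMs on open and knowledge-intensive tasks, and have therefore turned to search-augmented LLMs to mitigate it. However, when the search engine is triggered for a harmful task, the outcome is no longer under the LLM's control. Once the returned content directly contains targeted, ready-to-use harmful takeaways, the LLM's safeguards cannot withdraw that exposure. Motivated by this dilemma, we identify web search as a critical attack surface and propose SearchAttack for red-teaming. SearchAttack outsources the harmful semantics to web search, keeping only the skeleton of the query and fragmented clues, and then steers the LLM to reconstruct the retrieved content through structural rubrics to achieve malicious goals. Extensive experiments red-team search-augmented LLMs for responsible vulnerability assessment. Empirically, SearchAttack proves highly effective at attacking these systems.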

Title: Layer-wise Positional Bias in Short-Context Language Modeling

Authors: Maryam Rahimi, Mahdi Nouri, Yadollah Yaghoobzadeh
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.04098
Pdf URL: https://arxiv.org/pdf/2601.04098
Copy Paste: [[2601.04098]] Layer-wise Positional Bias in Short-Context Language Modeling(https://arxiv.org/abs/2601.04098)
Keywords: language model
Abstract: Language models often show a preference for using information from specific positions in the input regardless of semantic relevance. While positional bias has been studied in various contexts, from attention sinks to task performance degradation in long-context settings, prior work has not established how these biases evolve across individual layers and input positions, or how they vary independent of task complexity. We introduce an attribution-based framework to analyze positional effects in short-context language modeling. Using layer conductance with a sliding-window approach, we quantify how each layer distributes importance across input positions, yielding layer-wise positional importance profiles. We find that these profiles are architecture-specific, stable across inputs, and invariant to lexical scrambling. Characterizing these profiles, we find prominent recency bias that increases with depth and subtle primacy bias that diminishes through model depth. Beyond positional structure, we also show that early layers preferentially weight content words over function words across all positions, while later layers lose this word-type differentiation.
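Summary: Language models often prefer information from specific positions in the input regardless of its semantic relevance. Positional bias has been studied in various contexts, from attention sinks to task performance degradation in long-context settings, but prior work has not established how these biases evolve across individual layers and input positions, or how they vary independently of task complexity. We introduce an attribution-based framework for analyzing positional effects in short-context language modeling. Using layer conductance with a sliding-window approach, we quantify how each layer distributes importance across input positions, yielding layer-wise positional importance profiles. We find that these profiles are architecture-specific, stable across inputs, and invariant to lexical scrambling. Characterizing them, we find a prominent recency bias that increases with depth and a subtle primacy bias that diminishes with depth. Beyond positional structure, we also show that early layers weight content words more heavily than function words at all positions, while later layers lose this word-type distinction.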

Title: InfiniteWeb: Scalable Web Environment Synthesis for GUI Agent Training

Authors: Ziyun Zhang, Zezhou Wang, Xiaoyi Zhang, Zongyu Guo, Jiahao Li, Bin Li, Yan Lu
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2601.04126
Pdf URL: https://arxiv.org/pdf/2601.04126
Copy Paste: [[2601.04126]] InfiniteWeb: Scalable Web Environment Synthesis for GUI Agent Training(https://arxiv.org/abs/2601.04126)
Keywords: llm, agent
Abstract: GUI agents that interact with graphical interfaces on behalf of users represent a promising direction for practical AI assistants. However, training such agents is hindered by the scarcity of suitable environments. We present InfiniteWeb, a system that automatically generates functional web environments at scale for GUI agent training. While LLMs perform well at generating a single webpage, building a realistic and functional website with many interconnected pages remains challenging. We address these challenges through unified specification, task-centric test-driven development, and a combination of website seeds with reference design images to ensure diversity. Our system also generates verifiable task evaluators, enabling dense reward signals for reinforcement learning. Experiments show that InfiniteWeb surpasses commercial coding agents at realistic website construction, and GUI agents trained on our generated environments achieve significant performance improvements on OSWorld and Online-Mind2Web, demonstrating the effectiveness of the proposed system.
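Summary: GUI agents that interact with graphical interfaces on behalf of users are a promising direction for practical AI assistants. However, training such agents is hindered by the scarcity of suitable environments. We present InfiniteWeb, a system that automatically generates functional web environments at scale for GUI agent training. LLMs perform well at generating a single webpage, but building a realistic, functional website with many interconnected pages remains challenging. We address these challenges through unified specification, task-centric test-driven development, and a combination of website seeds with reference design images to ensure diversity. The system also generates verifiable task evaluators that provide dense reward signals for reinforcement learning. Experiments show that InfiniteWeb surpasses commercial coding agents at realistic website construction, and that GUI agents trained on the generated environments achieve significant performance improvements on OSWorld and Online-Mind2Web, demonstrating the effectiveness of the proposed system.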

Title: ContextFocus: Activation Steering for Contextual Faithfulness in Large Language Models

Authors: Nikhil Anand, Shwetha Somasundaram, Anirudh Phukan, Apoorv Saxena, Koyel Mukherjee
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.04131
Pdf URL: https://arxiv.org/pdf/2601.04131
Copy Paste: [[2601.04131]] ContextFocus: Activation Steering for Contextual Faithfulness in Large Language Models(https://arxiv.org/abs/2601.04131)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) encode vast amounts of parametric knowledge during pre-training. As world knowledge evolves, effective deployment increasingly depends on their ability to faithfully follow externally retrieved context. When such evidence conflicts with the model's internal knowledge, LLMs often default to memorized facts, producing unfaithful outputs. In this work, we introduce ContextFocus, a lightweight activation steering approach that improves context faithfulness in such knowledge-conflict settings while preserving fluency and efficiency. Unlike prior approaches, our solution requires no model finetuning and incurs minimal inference-time overhead, making it highly efficient. We evaluate ContextFocus on the ConFiQA benchmark, comparing it against strong baselines including ContextDPO, COIECD, and prompting-based methods. Furthermore, we show that our method is complementary to prompting strategies and remains effective on larger models. Extensive experiments show that ContextFocus significantly improves contextual faithfulness. Our results highlight the effectiveness, robustness, and efficiency of ContextFocus in improving the contextual faithfulness of LLM outputs.
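Summary: Large language models (LLMs) encode vast amounts of parametric knowledge during pre-training. As world knowledge evolves, effective deployment increasingly depends on their ability to faithfully follow externally retrieved context. When such evidence conflicts with the model's internal knowledge, LLMs often default to memorized facts and produce unfaithful outputs. In this work, we introduce ContextFocus, a lightweight activation steering approach that improves context faithfulness in such knowledge-conflict settings while preserving fluency and efficiency. Unlike prior approaches, it requires no model finetuning and adds minimal inference-time overhead. We evaluate ContextFocus on the ConFiQA benchmark against strong baselines including ContextDPO, COIECD, and prompting-based methods. We also show that the method is complementary to prompting strategies and remains effective on larger models. Extensive experiments show that ContextFocus significantly improves contextual faithfulness. The results highlight its effectiveness, robustness, and efficiency in improving the contextual faithfulness of LLM outputs.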
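
The ContextFocus abstract names activation steering but does not say how the steering vector is obtained or where it is applied. As a rough illustration of the general technique only, the sketch below assumes a difference-of-means vector (mean activation on context-following examples minus mean activation on examples that ignore the context) added to a hidden state with a scale `alpha`. All names and numbers here are hypothetical and are not the paper's method.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(rows: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of equally sized activation vectors."""
    n = len(rows)
    return [sum(column) / n for column in zip(*rows)]


def steering_vector(faithful: Sequence[Sequence[float]],
                    unfaithful: Sequence[Sequence[float]]) -> Vector:
    """Assumed construction: difference of the two class means."""
    return [a - b for a, b in zip(mean_vector(faithful), mean_vector(unfaithful))]


def steer(hidden: Sequence[float], vector: Sequence[float], alpha: float = 1.0) -> Vector:
    """Shift a hidden state along the steering direction at inference time."""
    return [h + alpha * v for h, v in zip(hidden, vector)]


# Toy 2-dimensional activations (hypothetical values).
faithful_acts = [[1.0, 2.0], [3.0, 4.0]]    # mean is [2.0, 3.0]
unfaithful_acts = [[0.0, 1.0], [2.0, 1.0]]  # mean is [1.0, 1.0]

direction = steering_vector(faithful_acts, unfaithful_acts)
print(direction)                                # [1.0, 2.0]
print(steer([0.5, 0.5], direction, alpha=2.0))  # [2.5, 4.5]
```

Because the shift is a single vector addition per steered layer, a scheme of this kind needs no weight updates, which is consistent with the abstract's claim of no finetuning and minimal inference-time overhead.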

Title: LLMberjack: Guided Trimming of Debate Trees for Multi-Party Conversation Creation

Authors: Leonardo Bottona, Nicolò Penzo, Bruno Lepri, Marco Guerini, Sara Tonelli
Subjects: cs.CL, cs.HC
Abstract URL: https://arxiv.org/abs/2601.04135
Pdf URL: https://arxiv.org/pdf/2601.04135
Copy Paste: [[2601.04135]] LLMberjack: Guided Trimming of Debate Trees for Multi-Party Conversation Creation(https://arxiv.org/abs/2601.04135)
Keywords: language model, llm
Abstract: We present LLMberjack, a platform for creating multi-party conversations starting from existing debates, originally structured as reply trees. The system offers an interactive interface that visualizes discussion trees and enables users to construct coherent linearized dialogue sequences while preserving participant identity and discourse relations. It integrates optional large language model (LLM) assistance to support automatic editing of the messages and speakers' descriptions. We demonstrate the platform's utility by showing how tree visualization facilitates the creation of coherent, meaningful conversation threads and how LLM support enhances output quality while reducing human effort. The tool is open-source and designed to promote transparent and reproducible workflows to create multi-party conversations, addressing a lack of resources of this type.
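Summary: We present LLMberjack, a platform for creating multi-party conversations from existing debates that were originally structured as reply trees. The system offers an interactive interface that visualizes discussion trees and lets users build coherent, linearized dialogue sequences while preserving participant identity and discourse relations. It integrates optional large language model (LLM) assistance to support automatic editing of messages and speaker descriptions. We demonstrate the platform's utility by showing how tree visualization helps create coherent, meaningful conversation threads and how LLM support improves output quality while reducing human effort. The tool is open-source and designed to promote transparent, reproducible workflows for creating multi-party conversations, addressing the lack of resources of this type.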
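
The LLMberjack abstract describes turning a reply tree into a linearized dialogue that keeps speaker identity. One simple way to picture that step is to walk from a chosen leaf message back to the root and reverse the path. The `Message` type and `linearize` function below are hypothetical illustrations of the idea and do not come from the tool's code base.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence


@dataclass(frozen=True)
class Message:
    id: str
    parent: Optional[str]  # None marks the root of the reply tree
    speaker: str
    text: str


def linearize(messages: Sequence[Message], leaf_id: str) -> List[Message]:
    """Return the root-to-leaf path ending at leaf_id, in conversation order."""
    by_id: Dict[str, Message] = {m.id: m for m in messages}
    path: List[Message] = []
    seen = set()
    current: Optional[str] = leaf_id
    while current is not None:
        if current in seen:
            raise ValueError("cycle detected in reply tree")
        seen.add(current)
        message = by_id[current]  # raises KeyError for an unknown id
        path.append(message)
        current = message.parent
    path.reverse()
    return path


# Toy debate (hypothetical): two branches under the same opening post.
tree = [
    Message("m1", None, "Ana", "Cities should ban cars downtown."),
    Message("m2", "m1", "Ben", "That hurts local shops."),
    Message("m3", "m1", "Cleo", "Air quality would improve."),
    Message("m4", "m2", "Ana", "Studies on foot traffic suggest otherwise."),
]

for message in linearize(tree, "m4"):
    print(f"{message.speaker}: {message.text}")
```

Selecting a different leaf (for example `"m3"`) yields a different thread from the same tree, which is the kind of trimming decision the abstract leaves to the user.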

Title: FLEx: Language Modeling with Few-shot Language Explanations

Authors: Adar Avsian, Christopher Richardson, Anirudh Sundar, Larry Heck
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2601.04157
Pdf URL: https://arxiv.org/pdf/2601.04157
Copy Paste: [[2601.04157]] FLEx: Language Modeling with Few-shot Language Explanations(https://arxiv.org/abs/2601.04157)
Keywords: language model, prompt, chain-of-thought
Abstract: Language models have become effective at a wide range of tasks, from math problem solving to open-domain question answering. However, they still make mistakes, and these mistakes are often repeated across related queries. Natural language explanations can help correct these errors, but collecting them at scale may be infeasible, particularly in domains where expert annotators are required. To address this issue, we introduce FLEx (Few-shot Language Explanations), a method for improving model behavior using a small number of explanatory examples. FLEx selects representative model errors using embedding-based clustering, verifies that the associated explanations correct those errors, and summarizes them into a prompt prefix that is prepended at inference time. This summary guides the model to avoid similar errors on new inputs, without modifying model weights. We evaluate FLEx on CounterBench, GSM8K, and ReasonIF. We find that FLEx consistently outperforms chain-of-thought (CoT) prompting across all three datasets and reduces up to 83% of CoT's remaining errors.
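Summary: Language models have become effective at a wide range of tasks, from math problem solving to open-domain question answering. They still make mistakes, however, and these mistakes are often repeated across related queries. Natural language explanations can help correct such errors, but collecting them at scale may be infeasible, especially in domains that require expert annotators. To address this, we introduce FLEx (Few-shot Language Explanations), a method for improving model behavior with a small number of explanatory examples. FLEx selects representative model errors using embedding-based clustering, verifies that the associated explanations correct those errors, and summarizes them into a prompt prefix that is prepended at inference time. The summary guides the model away from similar errors on new inputs without modifying model weights. We evaluate FLEx on CounterBench, GSM8K, and ReasonIF. FLEx consistently outperforms chain-of-thought (CoT) prompting on all three datasets and removes up to 83% of CoT's remaining errors.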
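
The figure "up to 83% of CoT's remaining errors" is a relative error reduction, which is easy to misread as an accuracy gain. The sketch below shows the arithmetic under that reading. The accuracy values are made up for illustration and are not results from the paper.

```python
def relative_error_reduction(baseline_accuracy: float, new_accuracy: float) -> float:
    """Share of the baseline's errors that the new method removes."""
    baseline_error = 1.0 - baseline_accuracy
    if baseline_error <= 0.0:
        raise ValueError("baseline has no remaining errors to reduce")
    new_error = 1.0 - new_accuracy
    return (baseline_error - new_error) / baseline_error


# Hypothetical numbers: a baseline at 90.0% accuracy and a method at 98.3%.
reduction = relative_error_reduction(0.900, 0.983)
print(f"{reduction:.0%}")  # 83%
```

In this made-up case an 83% error reduction corresponds to an accuracy gain of only 8.3 points, because the baseline had just 10 points of error left to remove.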

Title: All That Glisters Is Not Gold: A Benchmark for Reference-Free Counterfactual Financial Misinformation Detection

Authors: Yuechen Jiang, Zhiwei Liu, Yupeng Cao, Yueru He, Ziyang Xu, Chen Xu, Zhiyang Deng, Prayag Tiwari, Xi Chen, Alejandro Lopez-Lira, Jimin Huang, Junichi Tsujii, Sophia Ananiadou
Subjects: cs.CL, cs.CE, q-fin.CP
Abstract URL: https://arxiv.org/abs/2601.04160
Pdf URL: https://arxiv.org/pdf/2601.04160
Copy Paste: [[2601.04160]] All That Glisters Is Not Gold: A Benchmark for Reference-Free Counterfactual Financial Misinformation Detection(https://arxiv.org/abs/2601.04160)
Keywords: language model
Abstract: We introduce RFC Bench, a benchmark for evaluating large language models on financial misinformation in realistic news. RFC Bench operates at the paragraph level and captures the contextual complexity of financial news, where meaning emerges from dispersed cues. The benchmark defines two complementary tasks: reference-free misinformation detection and comparison-based diagnosis using paired original and perturbed inputs. Experiments reveal a consistent pattern: performance is substantially stronger when comparative context is available, while reference-free settings expose significant weaknesses, including unstable predictions and elevated rates of invalid outputs. These results indicate that current models struggle to maintain coherent belief states without external grounding. By highlighting this gap, RFC Bench provides a structured testbed for studying reference-free reasoning and advancing more reliable financial misinformation detection in real-world settings.
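Summary: We introduce RFC Bench, a benchmark for evaluating large language models on financial misinformation in realistic news. RFC Bench operates at the paragraph level and captures the contextual complexity of financial news, where meaning emerges from dispersed cues. The benchmark defines two complementary tasks: reference-free misinformation detection, and comparison-based diagnosis using paired original and perturbed inputs. Experiments reveal a consistent pattern: performance is much stronger when comparative context is available, while reference-free settings expose clear weaknesses, including unstable predictions and more invalid outputs. These results indicate that current models struggle to maintain coherent belief states without external grounding. By highlighting this gap, RFC Bench provides a structured testbed for studying reference-free reasoning and advancing more reliable financial misinformation detection in real-world settings.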