2025-10-13

Title: Less Diverse, Less Safe: The Indirect But Pervasive Risk of Test-Time Scaling in Large Language Models

Authors: Shahriar Kabir Nahin, Hadi Askari, Muhao Chen, Anshuman Chhabra
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.08592
Pdf URL: https://arxiv.org/pdf/2510.08592
Copy Paste: [[2510.08592]] Less Diverse, Less Safe: The Indirect But Pervasive Risk of Test-Time Scaling in Large Language Models(https://arxiv.org/abs/2510.08592)
Keywords: language model, llm, prompt
Abstract: Test-Time Scaling (TTS) improves LLM reasoning by exploring multiple candidate responses and then operating over this set to find the best output. A tacit premise behind TTS is that sufficiently diverse candidate pools enhance reliability. In this work, we show that this assumption in TTS introduces a previously unrecognized failure mode. When candidate diversity is curtailed, even by a modest amount, TTS becomes much more likely to produce unsafe outputs. We present a reference-guided diversity reduction protocol (RefDiv) that serves as a diagnostic attack to stress test TTS pipelines. In extensive experiments across four open-source models (Qwen3, Mistral, Llama3.1, Gemma3) and two widely used TTS strategies (Monte Carlo Tree Search and Best-of-N), constraining diversity consistently increases the rate at which TTS produces unsafe results. The effect is often stronger than that produced directly by prompts with high adversarial intent scores. This observed phenomenon also transfers across TTS strategies and to closed-source models (e.g., OpenAI o3 and Gemini-2.5-Pro), thus indicating that this is a general and extant property of TTS rather than a model-specific artifact. Additionally, we find that numerous widely used safety guardrail classifiers (e.g., Llama-Guard and OpenAI Moderation API) are unable to flag the adversarial input prompts generated by RefDiv, demonstrating that existing defenses offer limited protection against this diversity-driven failure mode. Through this work, we hope to motivate future research on designing robust TTS strategies that are both effective and secure against diversity-targeted stress tests as illustrated by RefDiv.
摘要：测试时间缩放 (TTS) 通过探索多个候选响应然后对此集进行操作以找到最佳输出来改进 LLM 推理。 TTS 背后的一个默认前提是足够多样化的候选池可以提高可靠性。在这项工作中，我们证明 TTS 中的这一假设引入了一种以前未识别的故障模式。当候选多样性受到限制时，即使是少量，TTS 也更有可能产生不安全的输出。我们提出了一种参考引导多样性减少协议 (RefDiv)，该协议可用作对 TTS 管道进行压力测试的诊断攻击。通过对四种开源模型（Qwen3、Mistral、Llama3.1、Gemma3）和两种广泛使用的 TTS 策略（蒙特卡洛树搜索和 Best-of-N）的广泛实验，限制多样性始终表明 TTS 产生不安全结果的速率。这种效果通常比直接具有高对抗意图分数的提示产生的效果更强。这种观察到的现象也转移到 TTS 策略和闭源模型（例如 OpenAI o3 和 Gemini-2.5-Pro），从而表明这是 TTS 的普遍且现存的属性，而不是特定于模型的工件。此外，我们发现许多广泛使用的安全护栏分类器（例如 Llama-Guard 和 OpenAI Moderation API）无法标记 RefDiv 生成的对抗性输入提示，这表明现有防御措施对这种多样性驱动的故障模式提供的保护有限。通过这项工作，我们希望推动未来的研究，设计稳健的 TTS 策略，这些策略对于针对多样性的压力测试既有效又安全，如 RefDiv 所示。
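As a rough illustration of the Best-of-N strategy named in the abstract, the sketch below selects the highest-scoring candidate from a pool and reports a simple diversity measure for that pool. The function names, the toy scoring function, and the use of mean pairwise Jaccard distance are assumptions made for illustration; the abstract does not say how RefDiv measures or reduces diversity.

```python
from itertools import combinations
from typing import Callable, Sequence


def best_of_n(candidates: Sequence[str], score: Callable[[str], float]) -> str:
    """Return the candidate with the highest score (Best-of-N selection)."""
    return max(candidates, key=score)


def jaccard_distance(a: str, b: str) -> float:
    """One minus the overlap ratio of the two lowercase word sets."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    if not union:
        return 0.0
    return 1.0 - len(sa & sb) / len(union)


def pool_diversity(candidates: Sequence[str]) -> float:
    """Mean pairwise Jaccard distance (hypothetical diversity proxy)."""
    pairs = list(combinations(candidates, 2))
    if not pairs:
        return 0.0
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    diverse = ["refuse the request", "explain safety policy", "offer a safe alternative"]
    collapsed = ["comply with request", "comply with request", "comply with the request"]
    toy_score = len  # hypothetical reward: longer text scores higher
    print(best_of_n(diverse, toy_score), pool_diversity(diverse))
    print(best_of_n(collapsed, toy_score), pool_diversity(collapsed))
```

With a collapsed pool, the selector has little to choose from regardless of how good the scoring function is, which is the intuition behind the failure mode the abstract describes.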

Title: Systematic Diagnosis of Brittle Reasoning in Large Language Models

Authors: V. S. Raghu Parupudi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.08595
Pdf URL: https://arxiv.org/pdf/2510.08595
Copy Paste: [[2510.08595]] Systematic Diagnosis of Brittle Reasoning in Large Language Models(https://arxiv.org/abs/2510.08595)
Keywords: language model, gpt
Abstract: A central question in artificial intelligence is the extent to which machine learning models comprehend mathematics. To address this, we propose a novel framework for measuring mathematical reasoning that moves beyond standard benchmarks to diagnose specific failure points. Our method first generates structured, step-by-step reasoning from gpt-3.5-turbo on the GSM8K dataset. We then use a more capable analyst model, gpt-4o-mini, to categorize errors and, crucially, perform an unsupervised clustering of every reasoning sentence to identify emergent "reasoning modes." This analysis reveals a cognitive profile with a stark, nonhuman-like brittleness: while the model achieves near-perfect accuracy on procedural modes like sequential calculation, its performance on modes requiring combinatorial reasoning with restrictions plummets. By identifying and quantifying the reliability of these distinct reasoning skills, our work provides a more granular method to evaluate mathematical comprehension and offers a precise roadmap for developing new capabilities and more reliable future applications.
摘要：人工智能的一个核心问题是机器学习模型对数学的理解程度。为了解决这个问题，我们提出了一种用于测量数学推理的新颖框架，该框架超越了标准基准来诊断特定的故障点。我们的方法首先从 GSM8K 数据集上的 gpt-3.5-turbo 生成结构化的逐步推理。然后，我们使用功能更强大的分析模型 gpt-4o-mini 对错误进行分类，最重要的是，对每个推理句子进行无监督聚类，以识别出现的“推理模式”。该分析揭示了一种明显的、非人类的脆弱性的认知特征：虽然该模型在顺序计算等程序模式上实现了近乎完美的准确性，但其在需要带限制的组合推理的模式上的性能却直线下降。通过识别和量化这些不同推理技能的可靠性，我们的工作提供了一种更精细的方法来评估数学理解力，并为开发新功能和更可靠的未来应用程序提供了精确的路线图。

Title: Confidence, Not Perplexity: A Better Metric for the Creative Era of LLMs

Authors: V. S. Raghu Parupudi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.08596
Pdf URL: https://arxiv.org/pdf/2510.08596
Copy Paste: [[2510.08596]] Confidence, Not Perplexity: A Better Metric for the Creative Era of LLMs(https://arxiv.org/abs/2510.08596)
Keywords: gpt, llm, prompt
Abstract: Reference-free metrics like self-perplexity are strongly biased against creative text generation. We propose the Confidence Score (CS), derived from a model's output probability distribution, as a less biased alternative. Experiments on gpt-4o-mini show that while fluency-based metrics prefer novel responses in 0% of cases on 99 creative prompts, our CS does so 19% of the time, a statistically significant difference (95% CI for difference: [11.1%, 27.3%]). We also show that CS effectively distinguishes between easy, medium, and hard tasks, confirmed by non-overlapping confidence intervals. The Confidence Score thus mitigates the creativity bias of traditional metrics while retaining their core evaluative strengths, offering a more balanced assessment for modern LLMs.
摘要：像自我困惑这样的无参考指标对创造性文本的生成有很大的偏见。我们提出从模型的输出概率分布导出的置信度得分 (CS)，作为偏差较小的替代方案。 gpt-4o-mini 上的实验表明，虽然基于流畅性的指标在 99 个创意提示中 0\% 的情况下更喜欢新颖的响应，但我们的 CS 在 19% 的时间里这样做，具有统计显着性差异（差异的 95% CI：[11.1%，27.3%]）。我们还表明，CS 可以有效地区分简单、中等和困难的任务，这通过不重叠的置信区间得到证实。因此，置信度分数减轻了传统指标的创造力偏差，同时保留了其核心评估优势，为现代法学硕士提供了更平衡的评估。
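For readers unfamiliar with the baseline metric, the sketch below computes perplexity from per-token log-probabilities using the standard definition. The second function is only a hypothetical stand-in for a distribution-based confidence measure: the abstract does not give the formula for the Confidence Score, so `mean_top_probability` should not be read as the authors' metric.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """exp of the negative mean natural-log probability of the generated tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def mean_top_probability(step_distributions: Sequence[Sequence[float]]) -> float:
    """Hypothetical confidence proxy: average of the largest probability per step.

    This is NOT the paper's Confidence Score; the abstract does not define it.
    """
    if not step_distributions:
        raise ValueError("need at least one decoding step")
    return sum(max(dist) for dist in step_distributions) / len(step_distributions)
```

Perplexity depends only on the probability of the tokens that were actually emitted, so an unusual but deliberate word choice is penalized; a measure computed over the whole distribution at each step can behave differently.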

Title: Recover-LoRA: Data-Free Accuracy Recovery of Degraded Language Models via Low-Rank Adaptation

Authors: Devleena Das, Rajeev Patwari, Ashish Sirasao
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.08600
Pdf URL: https://arxiv.org/pdf/2510.08600
Copy Paste: [[2510.08600]] Recover-LoRA: Data-Free Accuracy Recovery of Degraded Language Models via Low-Rank Adaptation(https://arxiv.org/abs/2510.08600)
Keywords: language model
Abstract: Inference optimizations such as quantization, pruning, format and datatype conversion, model export, and serialization can lead to functional degradations in language model task performance. While most efforts on performance recovery for deployment focus on robust quantization techniques, we focus on recovering model accuracies from any sources that degrade model weights, such as improper model serialization. In this work, we propose Recover-LoRA, a lightweight and dataset-agnostic method to recover accuracy in degraded models. Recover-LoRA uses synthetic data and logit distillation to learn LoRA adapters on selective layers that facilitate aligning the degraded model to its full-precision model. We investigate the utility of Recover-LoRA across a diverse set of small language models (SLMs), including models with varying attention architectures, multi-head attention (MHA) and group-query attention (GQA), as well as several evaluation datasets. Our results show that Recover-LoRA recovers model accuracies by 5-17% on MHA and GQA SLMs.
摘要：量化、修剪、格式和数据类型转换、模型导出和序列化等推理优化可能会导致语言模型任务性能的功能下降。虽然部署性能恢复的大多数努力都集中在稳健的量化技术上，但我们专注于从任何降低模型权重的来源（例如不正确的模型序列化）恢复模型精度。在这项工作中，我们提出了 Recover-LoRA，一种轻量级且与数据集无关的方法，用于恢复退化模型的准确性。 Recover-LoRA 使用合成数据和 logit 蒸馏来学习选择性层上的 LoRA 适配器，从而有助于将降级模型与其全精度模型对齐。我们研究了 Recover-LoRA 在各种小语言模型 (SLM) 中的实用性，包括具有不同注意力架构、多头注意力 (MHA) 和组查询注意力 (GQA) 的模型，以及几个评估数据集。我们的结果表明，Recover-LoRA 在 MHA 和 GQA SLM 上将模型精度恢复了 5-17%。
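The logit-distillation idea mentioned in the abstract can be sketched as a KL divergence between the output distributions of a teacher (here assumed to be the full-precision model) and a student (assumed to be the degraded model with adapters). The exact loss, temperature, and layer selection used by Recover-LoRA are not given in the abstract, so everything below is an illustrative assumption written in plain Python.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]


def distillation_kl(
    teacher_logits: Sequence[float],
    student_logits: Sequence[float],
    temperature: float = 1.0,
) -> float:
    """KL(teacher || student) for one token position (hypothetical loss form)."""
    if len(teacher_logits) != len(student_logits):
        raise ValueError("teacher and student must share a vocabulary size")
    log_p = log_softmax(teacher_logits, temperature)
    log_q = log_softmax(student_logits, temperature)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
```

In a training loop this quantity would be averaged over token positions of the synthetic inputs and minimized with respect to the adapter weights only, leaving the degraded base weights frozen.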

Title: Mnemosyne: An Unsupervised, Human-Inspired Long-Term Memory Architecture for Edge-Based LLMs

Authors: Aneesh Jonelagadda, Christina Hahn, Haoze Zheng, Salvatore Penachio (Kaliber AI)
Subjects: cs.CL, cs.AI, cs.LG, cs.MA
Abstract URL: https://arxiv.org/abs/2510.08601
Pdf URL: https://arxiv.org/pdf/2510.08601
Copy Paste: [[2510.08601]] Mnemosyne: An Unsupervised, Human-Inspired Long-Term Memory Architecture for Edge-Based LLMs(https://arxiv.org/abs/2510.08601)
Keywords: language model, llm
Abstract: Long-term memory is essential for natural, realistic dialogue. However, current large language model (LLM) memory systems rely on either brute-force context expansion or static retrieval pipelines that fail on edge-constrained devices. We introduce Mnemosyne, an unsupervised, human-inspired long-term memory architecture designed for edge-based LLMs. Our approach uses graph-structured storage, modular substance and redundancy filters, memory committing and pruning mechanisms, and probabilistic recall with temporal decay and refresh processes modeled after human memory. Mnemosyne also introduces a concentrated "core summary" efficiently derived from a fixed-length subset of the memory graph to capture the user's personality and other domain-specific long-term details such as, using a healthcare application as an example, post-recovery ambitions and attitude towards care. Unlike existing retrieval-augmented methods, Mnemosyne is designed for use in longitudinal healthcare assistants, where repetitive and semantically similar but temporally distinct conversations are limited by naive retrieval. In experiments with longitudinal healthcare dialogues, Mnemosyne demonstrates the highest win rate of 65.8% in blind human evaluations of realism and long-term memory capability compared to a baseline RAG win rate of 31.1%. Mnemosyne also achieves the current highest LoCoMo benchmark scores in temporal reasoning and single-hop retrieval compared to other same-backboned techniques. Further, the average overall score of 54.6% was second highest across all methods, beating commonly used Mem0 and OpenAI baselines among others. This demonstrates that improved factual recall, enhanced temporal reasoning, and much more natural user-facing responses can be feasible with an edge-compatible and easily transferable unsupervised memory architecture.
摘要：长期记忆对于自然、现实的对话至关重要。然而，当前的大语言模型 (LLM) 内存系统依赖于强力上下文扩展或静态检索管道，而这些管道在边缘受限设备上会失败。我们推出 Mnemosyne，这是一种无监督、受人类启发的长期记忆架构，专为基于边缘的法学硕士而设计。我们的方法使用图结构存储、模块化物质和冗余过滤器、内存提交和修剪机制以及模仿人类记忆建模的时间衰减和刷新过程的概率召回。 Mnemosyne 还引入了从内存图的固定长度子集有效导出的集中“核心摘要”，以捕获用户的个性和其他特定领域的长期详细信息，例如以医疗保健应用程序为例，恢复后的抱负和对护理的态度。与现有的检索增强方法不同，Mnemosyne 设计用于纵向医疗助理，其中重复且语义相似但时间上不同的对话受到简单检索的限制。在纵向医疗保健对话实验中，Mnemosyne 在现实性和长期记忆能力的盲人评估中表现出最高的胜率 65.8%，而基线 RAG 胜率为 31.1%。与其他相同主干技术相比，Mnemosyne 在时间推理和单跳检索方面也取得了当前最高的 LoCoMo 基准分数。此外，54.6% 的平均总分在所有方法中排名第二，击败了常用的 Mem0 和 OpenAI 基线等。这表明，通过边缘兼容且易于传输的无监督内存架构，可以实现改进的事实回忆、增强的时间推理和更自然的面向用户的响应。
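One way to picture "recall with temporal decay and refresh" is a per-memory recall probability that halves over a fixed interval and is reset whenever the memory is used. The class, the half-life rule, the time unit, and the boost factor below are all hypothetical; the abstract does not describe Mnemosyne's actual decay or refresh functions.

```python
from dataclasses import dataclass


@dataclass
class MemoryNode:
    text: str
    last_access: float  # time of last commit or recall, in hours (assumed unit)
    half_life: float = 24.0  # assumed: hours until recall probability halves


def recall_probability(node: MemoryNode, now: float) -> float:
    """Exponential decay since the last access (hypothetical decay rule)."""
    elapsed = max(0.0, now - node.last_access)
    return 0.5 ** (elapsed / node.half_life)


def refresh(node: MemoryNode, now: float, boost: float = 1.5) -> None:
    """On recall, reset the clock and lengthen the half-life (hypothetical rule)."""
    node.last_access = now
    node.half_life *= boost
```

Under this toy rule, a memory that keeps being recalled decays more and more slowly, while one that is never touched fades and becomes a candidate for pruning.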

Title: Human Texts Are Outliers: Detecting LLM-generated Texts via Out-of-distribution Detection

Authors: Cong Zeng, Shengkun Tang, Yuanzhou Chen, Zhiqiang Shen, Wenchao Yu, Xujiang Zhao, Haifeng Chen, Wei Cheng, Zhiqiang Xu
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2510.08602
Pdf URL: https://arxiv.org/pdf/2510.08602
Copy Paste: [[2510.08602]] Human Texts Are Outliers: Detecting LLM-generated Texts via Out-of-distribution Detection(https://arxiv.org/abs/2510.08602)
Keywords: language model, gpt, llm, chat
Abstract: The rapid advancement of large language models (LLMs) such as ChatGPT, DeepSeek, and Claude has significantly increased the presence of AI-generated text in digital communication. This trend has heightened the need for reliable detection methods to distinguish between human-authored and machine-generated content. Existing approaches, both zero-shot methods and supervised classifiers, largely conceptualize this task as a binary classification problem, often leading to poor generalization across domains and models. In this paper, we argue that such a binary formulation fundamentally mischaracterizes the detection task by assuming a coherent representation of human-written texts. In reality, human texts do not constitute a unified distribution, and their diversity cannot be effectively captured through limited sampling. This causes previous classifiers to memorize observed OOD characteristics rather than learn the essence of 'non-ID' behavior, limiting generalization to unseen human-authored inputs. Based on this observation, we propose reframing the detection task as an out-of-distribution (OOD) detection problem, treating human-written texts as distributional outliers while machine-generated texts are in-distribution (ID) samples. To this end, we develop a detection framework using one-class learning methods, including DeepSVDD and HRN, and score-based learning techniques such as energy-based methods, enabling robust and generalizable performance. Extensive experiments across multiple datasets validate the effectiveness of our OOD-based approach. Specifically, the OOD-based method achieves 98.3% AUROC and AUPR with only 8.9% FPR95 on the DeepFake dataset. Moreover, we test our detection framework on multilingual, attacked, and unseen-model and -domain text settings, demonstrating the robustness and generalizability of our framework. Code, pretrained weights, and demo will be released.
摘要：ChatGPT、DeepSeek 和 Claude 等大型语言模型 (LLM) 的快速发展显着增加了人工智能生成文本在数字通信中的存在。这一趋势更加需要可靠的检测方法来区分人类创作的内容和机器生成的内容。现有的零样本方法和监督分类器在很大程度上将此任务概念化为二元分类问题，通常导致跨领域和模型的泛化能力较差。在本文中，我们认为这种二元表述通过假设人类书写文本的连贯表示从根本上错误地描述了检测任务。实际上，人类文本并不构成统一的分布，并且通过有限的采样无法有效捕获其多样性。这导致以前的分类器记住观察到的 OOD 特征，而不是学习“非 ID”行为的本质，从而限制了对看不见的人类创作输入的泛化。基于这一观察，我们建议将检测任务重新定义为分布外（OOD）检测问题，将人类编写的文本视为分布异常值，而机器生成的文本则视为分布内（ID）样本。为此，我们开发了一个检测框架，使用包括 DeepSVDD 和 HRN 在内的一类学习方法，以及基于分数的学习技术（例如基于能量的方法），从而实现稳健且可泛化的性能。跨多个数据集的广泛实验验证了我们基于 OOD 的方法的有效性。具体来说，基于 OOD 的方法在 DeepFake 数据集上实现了 98.3% 的 AUROC 和 AUPR，而 FPR95 仅达到 8.9%。此外，我们在多语言、受攻击、未见过的模型和域文本设置上测试我们的检测框架，证明了我们框架的稳健性和通用性。代码、预训练权重和演示将被发布。
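The energy-based scoring mentioned in the abstract is commonly defined as the negative temperature-scaled log-sum-exp of a model's logits, with higher energy suggesting an out-of-distribution input. The sketch below implements that common formula; how the paper obtains the logits and picks a threshold is not stated in the abstract, so `is_out_of_distribution` and its `threshold` argument are assumptions.

```python
import math
from typing import Sequence


def energy_score(logits: Sequence[float], temperature: float = 1.0) -> float:
    """E(x) = -T * log(sum_i exp(logit_i / T)), computed stably."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    return -temperature * (m + math.log(sum(math.exp(x - m) for x in scaled)))


def is_out_of_distribution(
    logits: Sequence[float], threshold: float, temperature: float = 1.0
) -> bool:
    """Flag an input as OOD when its energy exceeds a chosen threshold (assumed rule).

    In the paper's framing, OOD would correspond to human-written text.
    """
    return energy_score(logits, temperature) > threshold
```

A threshold would typically be chosen on held-out in-distribution data, for example so that 95% of machine-generated samples fall below it.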

Title: YpathRAG: A Retrieval-Augmented Generation Framework and Benchmark for Pathology

Authors: Deshui Yu, Yizhi Wang, Saihui Jin, Taojie Zhu, Fanyi Zeng, Wen Qian, Zirui Huang, Jingli Ouyang, Jiameng Li, Zhen Song, Tian Guan, Yonghong He
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.08603
Pdf URL: https://arxiv.org/pdf/2510.08603
Copy Paste: [[2510.08603]] YpathRAG: A Retrieval-Augmented Generation Framework and Benchmark for Pathology(https://arxiv.org/abs/2510.08603)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Large language models (LLMs) excel on general tasks yet still hallucinate in high-barrier domains such as pathology. Prior work often relies on domain fine-tuning, which neither expands the knowledge boundary nor enforces evidence-grounded constraints. We therefore build a pathology vector database covering 28 subfields and 1.53 million paragraphs, and present YpathRAG, a pathology-oriented RAG framework with dual-channel hybrid retrieval (BGE-M3 dense retrieval coupled with vocabulary-guided sparse retrieval) and an LLM-based supportive-evidence judgment module that closes the retrieval-judgment-generation loop. We also release two evaluation benchmarks, YpathR and YpathQA-M. On YpathR, YpathRAG attains Recall@5 of 98.64%, a gain of 23 percentage points over the baseline; on YpathQA-M, a set of the 300 most challenging questions, it increases the accuracies of both general and medical LLMs by 9.0% on average and up to 15.6%. These results demonstrate improved retrieval quality and factual reliability, providing a scalable construction paradigm and interpretable evaluation for pathology-oriented RAG.
摘要：大型语言模型（LLM）在一般任务上表现出色，但在病理学等高障碍领域仍然存在幻觉。先前的工作通常依赖于领域微调，这既不扩展知识边界，也不强制执行基于证据的约束。因此，我们建立了一个涵盖28个子领域和153万个段落的病理向量数据库，并提出了YpathRAG，一种面向病理学的双通道混合检索RAG框架（BGE-M3密集检索与词汇引导稀疏检索相结合）和基于LLM的支持性证据判断模块，该模块关闭了检索-判断-生成循环。我们还发布了两个评估基准，YpathR 和 YpathQA-M。在 YpathR 上，YpathRAG 的 Recall@5 达到 98.64%，比基线提高了 23 个百分点；在 YpathQA-M（一组 300 个最具挑战性的问题）上，它使普通法学硕士和医学法学硕士的准确率平均提高了 9.0%，最高可达 15.6%。这些结果证明了检索质量和事实可靠性的提高，为面向病理的 RAG 提供了可扩展的构建范例和可解释的评估。

Title: LatentBreak: Jailbreaking Large Language Models through Latent Space Feedback

Authors: Raffaele Mura, Giorgio Piras, Kamilė Lukošiūtė, Maura Pintor, Amin Karbasi, Battista Biggio
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.08604
Pdf URL: https://arxiv.org/pdf/2510.08604
Copy Paste: [[2510.08604]] LatentBreak: Jailbreaking Large Language Models through Latent Space Feedback(https://arxiv.org/abs/2510.08604)
Keywords: language model, prompt
Abstract: Jailbreaks are adversarial attacks designed to bypass the built-in safety mechanisms of large language models. Automated jailbreaks typically optimize an adversarial suffix or adapt long prompt templates by forcing the model to generate the initial part of a restricted or harmful response. In this work, we show that existing jailbreak attacks that leverage such mechanisms to unlock the model response can be detected by a straightforward perplexity-based filtering on the input prompt. To overcome this issue, we propose LatentBreak, a white-box jailbreak attack that generates natural adversarial prompts with low perplexity capable of evading such defenses. LatentBreak substitutes words in the input prompt with semantically-equivalent ones, preserving the initial intent of the prompt, instead of adding high-perplexity adversarial suffixes or long templates. These words are chosen by minimizing the distance in the latent space between the representation of the adversarial prompt and that of harmless requests. Our extensive evaluation shows that LatentBreak leads to shorter and low-perplexity prompts, thus outperforming competing jailbreak algorithms against perplexity-based filters on multiple safety-aligned models.
摘要：越狱是一种对抗性攻击，旨在绕过大型语言模型的内置安全机制。自动越狱通常会通过强制模型生成受限或有害响应的初始部分来优化对抗性后缀或调整长提示模板。在这项工作中，我们展示了利用此类机制解锁模型响应的现有越狱攻击可以通过对输入提示进行简单的基于困惑的过滤来检测。为了克服这个问题，我们提出了 LatentBreak，这是一种白盒越狱攻击，它可以生成自然的对抗性提示，并且具有低困惑度，能够逃避此类防御。 LatentBreak 用语义等效的单词替换输入提示中的单词，保留提示的初始意图，而不是添加高复杂度的对抗性后缀或长模板。这些词是通过最小化对抗性提示的表示和无害的请求的表示之间的潜在空间的距离来选择的。我们的广泛评估表明，LatentBreak 会导致更短且低困惑度的提示，因此在多个安全模型上优于基于困惑度的过滤器的越狱算法。

Title: Toward a Safer Web: Multilingual Multi-Agent LLMs for Mitigating Adversarial Misinformation Attacks

Authors: Nouar Aldahoul, Yasir Zaki
Subjects: cs.CL, cs.AI, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2510.08605
Pdf URL: https://arxiv.org/pdf/2510.08605
Copy Paste: [[2510.08605]] Toward a Safer Web: Multilingual Multi-Agent LLMs for Mitigating Adversarial Misinformation Attacks(https://arxiv.org/abs/2510.08605)
Keywords: language model, llm, retrieval-augmented generation, agent
Abstract: The rapid spread of misinformation on digital platforms threatens public discourse, emotional stability, and decision-making. While prior work has explored various adversarial attacks in misinformation detection, the specific transformations examined in this paper have not been systematically studied. In particular, we investigate language-switching across English, French, Spanish, Arabic, Hindi, and Chinese, followed by translation. We also study query length inflation preceding summarization and structural reformatting into multiple-choice questions. In this paper, we present a multilingual, multi-agent large language model framework with retrieval-augmented generation that can be deployed as a web plugin into online platforms. Our work underscores the importance of AI-driven misinformation detection in safeguarding online factual integrity against diverse attacks, while showcasing the feasibility of plugin-based deployment for real-world web applications.
摘要：数字平台上错误信息的迅速传播威胁着公众话语、情绪稳定和决策。虽然之前的工作已经探索了错误信息检测中的各种对抗性攻击，但本文研究的具体转换尚未得到系统研究。我们特别研究了英语、法语、西班牙语、阿拉伯语、印地语和中文之间的语言切换，然后进行翻译。我们还研究了摘要之前的查询长度膨胀和结构重新格式化为多项选择问题。在本文中，我们提出了一种多语言、多智能体大语言模型框架，具有检索增强生成功能，可以作为网络插件部署到在线平台中。我们的工作强调了人工智能驱动的错误信息检测在保护在线事实完整性免受各种攻击方面的重要性，同时展示了基于插件的实际 Web 应用程序部署的可行性。

Title: MMA-ASIA: A Multilingual and Multimodal Alignment Framework for Culturally-Grounded Evaluation

Authors: Weihua Zheng, Zhengyuan Liu, Tanmoy Chakraborty, Weiwen Xu, Xiaoxue Gao, Bryan Chen Zhengyu Tan, Bowei Zou, Chang Liu, Yujia Hu, Xing Xie, Xiaoyuan Yi, Jing Yao, Chaojun Wang, Long Li, Rui Liu, Huiyao Liu, Koji Inoue, Ryuichi Sumida, Tatsuya Kawahara, Fan Xu, Lingyu Ye, Wei Tian, Dongjun Kim, Jimin Jung, Jaehyung Seo, Nadya Yuki Wangsajaya, Pham Minh Duc, Ojasva Saxena, Palash Nandi, Xiyan Tao, Wiwik Karlina, Tuan Luong, Keertana Arun Vasan, Roy Ka-Wei Lee, Nancy F. Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.08608
Pdf URL: https://arxiv.org/pdf/2510.08608
Copy Paste: [[2510.08608]] MMA-ASIA: A Multilingual and Multimodal Alignment Framework for Culturally-Grounded Evaluation(https://arxiv.org/abs/2510.08608)
Keywords: language model, llm
Abstract: Large language models (LLMs) are now used worldwide, yet their multimodal understanding and reasoning often degrade outside Western, high-resource settings. We propose MMA-ASIA, a comprehensive framework to evaluate LLMs' cultural awareness with a focus on Asian contexts. MMA-ASIA centers on a human-curated, multilingual, and multimodally aligned multiple-choice benchmark covering 8 Asian countries and 10 languages, comprising 27,000 questions; over 79 percent require multi-step reasoning grounded in cultural context, moving beyond simple memorization. To our knowledge, this is the first dataset aligned at the input level across three modalities: text, image (visual question answering), and speech. This enables direct tests of cross-modal transfer. Building on this benchmark, we propose a five-dimensional evaluation protocol that measures: (i) cultural-awareness disparities across countries, (ii) cross-lingual consistency, (iii) cross-modal consistency, (iv) cultural knowledge generalization, and (v) grounding validity. To ensure rigorous assessment, a Cultural Awareness Grounding Validation Module detects "shortcut learning" by checking whether the requisite cultural knowledge supports correct answers. Finally, through comparative model analysis, attention tracing, and an innovative Vision-ablated Prefix Replay (VPR) method, we probe why models diverge across languages and modalities, offering actionable insights for building culturally reliable multimodal LLMs.
摘要：大型语言模型（LLM）现已在全球范围内使用，但其多模态理解和推理能力在西方资源丰富的环境之外往往会下降。我们提出 MMA-ASIA，这是一个评估法学硕士文化意识的综合框架，重点关注亚洲背景。 MMA-ASIA 以人为策划、多语言、多模式的多项选择基准为中心，涵盖 8 个亚洲国家和 10 种语言，包含 27,000 个问题；超过 79% 的问题需要基于文化背景的多步骤推理，而不仅仅是简单的记忆。据我们所知，这是第一个在输入级别上跨三种模式对齐的数据集：文本、图像（视觉问答）和语音。这使得跨模式传输的直接测试成为可能。在此基准的基础上，我们提出了一个五维评估协议，用于衡量：（i）各国文化意识差异，（ii）跨语言一致性，（iii）跨模式一致性，（iv）文化知识泛化，以及（v）基础有效性。为了确保严格的评估，文化意识基础验证模块通过检查必要的文化知识是否支持正确答案来检测“捷径学习”。最后，通过比较模型分析、注意力追踪和创新的视觉消除前缀重放（VPR）方法，我们探讨了模型为何在语言和模式之间存在差异，为构建文化上可靠的多模式法学硕士提供了可行的见解。

Title: GraphGhost: Tracing Structures Behind Large Language Models

Authors: Xinnan Dai, Kai Guo, Chung-Hsiang Lo, Shenglai Zeng, Jiayuan Ding, Dongsheng Luo, Subhabrata Mukherjee, Jiliang Tang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.08613
Pdf URL: https://arxiv.org/pdf/2510.08613
Copy Paste: [[2510.08613]] GraphGhost: Tracing Structures Behind Large Language Models(https://arxiv.org/abs/2510.08613)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) demonstrate remarkable reasoning capabilities, yet the structural mechanisms underlying these abilities remain underexplored. In this work, we introduce GraphGhost, a unified framework that represents neuron activations and their signal propagation as graphs, explaining how LLMs capture structural semantics from sequential inputs and generate outputs through structurally consistent mechanisms. This graph-based perspective enables us to employ graph algorithms such as PageRank to characterize the properties of LLMs, revealing both shared and model-specific reasoning behaviors across diverse datasets. We further identify the activated neurons within GraphGhost and evaluate them through structural interventions, showing that edits to key neuron nodes can trigger reasoning collapse, altering both logical flow and semantic understanding. Together, these contributions position GraphGhost as a powerful tool for analyzing, intervening in, and ultimately understanding the structural foundations of reasoning in LLMs.
摘要：大型语言模型（LLM）表现出卓越的推理能力，但这些能力背后的结构机制仍有待探索。在这项工作中，我们介绍了 GraphGhost，这是一个统一的框架，它将神经元激活及其信号传播表示为图形，解释了 LLM 如何从顺序输入中捕获结构语义并通过结构一致的机制生成输出。这种基于图的视角使我们能够采用 PageRank 等图算法来表征 LLM 的属性，揭示不同数据集中共享的和特定于模型的推理行为。我们进一步识别 GraphGhost 中激活的神经元，并通过结构干预对其进行评估，表明对关键神经元节点的编辑可以触发推理崩溃，改变逻辑流和语义理解。总之，这些贡献使 GraphGhost 成为分析、干预并最终理解法学硕士推理的结构基础的强大工具。
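To make the PageRank reference concrete, the sketch below runs a plain power-iteration PageRank over a small directed graph stored as an adjacency dictionary. The node labels in the usage note are hypothetical neuron identifiers; how GraphGhost actually builds its graphs from activations is not described in the abstract.

```python
from typing import Dict, List


def pagerank(
    graph: Dict[str, List[str]], damping: float = 0.85, iterations: int = 100
) -> Dict[str, float]:
    """Power-iteration PageRank; rank held by sink nodes is spread uniformly."""
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank sitting on nodes without outgoing edges is redistributed evenly.
        dangling = sum(rank[node] for node in nodes if not graph.get(node))
        new_rank = {
            node: (1.0 - damping) / n + damping * dangling / n for node in nodes
        }
        for source, targets in graph.items():
            if targets:
                share = damping * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank
```

For example, `pagerank({"L1.n4": ["L2.n9"], "L1.n7": ["L2.n9"], "L2.n9": []})` assigns the largest rank to the hypothetical node `L2.n9`, since both other nodes feed into it.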

Title: Gender Bias in Large Language Models for Healthcare: Assignment Consistency and Clinical Implications

Authors: Mingxuan Liu, Yuhe Ke, Wentao Zhu, Mayli Mertens, Yilin Ning, Jingchi Liao, Chuan Hong, Daniel Shu Wei Ting, Yifan Peng, Danielle S. Bitterman, Marcus Eng Hock Ong, Nan Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.08614
Pdf URL: https://arxiv.org/pdf/2510.08614
Copy Paste: [[2510.08614]] Gender Bias in Large Language Models for Healthcare: Assignment Consistency and Clinical Implications(https://arxiv.org/abs/2510.08614)
Keywords: language model, llm
Abstract: The integration of large language models (LLMs) into healthcare holds promise to enhance clinical decision-making, yet their susceptibility to biases remains a critical concern. Gender has long influenced physician behaviors and patient outcomes, raising concerns that LLMs assuming human-like roles, such as clinicians or medical educators, may replicate or amplify gender-related biases. Using case studies from the New England Journal of Medicine Challenge (NEJM), we assigned genders (female, male, or unspecified) to multiple open-source and proprietary LLMs. We evaluated their response consistency across LLM-gender assignments regarding both LLM-based diagnosis and models' judgments on the clinical relevance or necessity of patient gender. In our findings, diagnoses were relatively consistent across LLM genders for most models. However, for patient gender's relevance and necessity in LLM-based diagnosis, all models demonstrated substantial inconsistency across LLM genders, particularly for relevance judgements. Some models even displayed a systematic female-male disparity in their interpretation of patient gender. These findings present an underexplored bias that could undermine the reliability of LLMs in clinical practice, underscoring the need for routine checks of identity-assignment consistency when interacting with LLMs to ensure reliable and equitable AI-supported clinical care.
摘要：将大语言模型（LLM）整合到医疗保健中有望增强临床决策，但它们对偏见的敏感性仍然是一个关键问题。性别长期以来一直影响着医生的行为和患者的治疗结果，人们担心法学硕士扮演类似人类的角色，例如临床医生或医学教育者，可能会复制或放大与性别相关的偏见。利用新英格兰医学杂志挑战赛 (NEJM) 的案例研究，我们将性别（女性、男性或未指定）分配给多个开源和专有的法学硕士。我们评估了他们在法学硕士性别分配中对基于法学硕士的诊断和模型对患者性别的临床相关性或必要性的判断的反应一致性。在我们的研究结果中，大多数模型的法学硕士性别的诊断相对一致。然而，对于基于法学硕士的诊断中患者性别的相关性和必要性，所有模型都表现出跨法学硕士性别的显着不一致，特别是在相关性判断方面。一些模型甚至在对患者性别的解释上表现出系统性的男女差异。这些发现提出了一种尚未得到充分研究的偏见，可能会破坏法学硕士在临床实践中的可靠性，强调在与法学硕士互动时需要对身份分配一致性进行常规检查，以确保可靠和公平的人工智能支持的临床护理。

Title: Iterative LLM-Based Generation and Refinement of Distracting Conditions in Math Word Problems

Authors: Kaiqi Yang, Hang Li, Yucheng Chu, Zitao Liu, Mi Tian, Hui Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.08615
Pdf URL: https://arxiv.org/pdf/2510.08615
Copy Paste: [[2510.08615]] Iterative LLM-Based Generation and Refinement of Distracting Conditions in Math Word Problems(https://arxiv.org/abs/2510.08615)
Keywords: language model, llm, prompt
Abstract: Mathematical reasoning serves as a crucial testbed for evaluating the intelligence of large language models (LLMs), and math word problems (MWPs) represent one of the most widely used formats. Most existing MWP datasets contain only the necessary information, while problems with distracting or excessive conditions are often overlooked. Prior studies have shown that popular LLMs experience a dramatic performance drop when such distracting conditions are introduced. However, available datasets of MWPs with distracting conditions remain limited, and most exhibit low difficulty and out-of-context expressions. These shortcomings make the distracting conditions easy to detect and disregard, thereby reducing the credibility of benchmarking on these datasets. Moreover, when distracting conditions are added, the reasoning process and answers may change, requiring intensive manual effort to check and rewrite solutions. To address these issues, we design an iterative framework that leverages LLMs to generate distracting conditions automatically. We develop a set of prompts to revise MWPs from multiple perspectives and cognitive levels, encouraging the creation of meaningful distracting conditions as well as suggestions for further refinement. A key advantage of our framework is the preservation of shared solutions between the original and revised problems: the LLMs are explicitly guided to generate distractions that do not alter the original solution, thus eliminating the need to produce new answers. This framework is efficient and easy to deploy, substantially reducing the effort required to generate MWPs with distracting conditions while maintaining high data quality.
摘要：数学推理是评估大型语言模型 (LLM) 智能的重要测试平台，而数学应用题 (MWP) 是最广泛使用的格式之一。大多数现有的 MWP 数据集仅包含必要的信息，而分散注意力或过度条件的问题往往被忽视。先前的研究表明，当引入这种分散注意力的条件时，受欢迎的法学硕士的表现会急剧下降。然而，具有干扰条件的 MWP 的可用数据集仍然有限，并且大多数表现出低难度和断章取义的表达。这些缺点使得分散注意力的条件很容易被检测和忽视，从而降低了对这些数据集进行基准测试的可信度。此外，当添加分散注意力的条件时，推理过程和答案可能会发生变化，需要大量的人工检查和重写解决方案。为了解决这些问题，我们设计了一个迭代框架，利用法学硕士自动生成分散注意力的条件。我们制定了一系列提示，从多个角度和认知水平修改 MWP，鼓励创造有意义的分散注意力的条件以及进一步完善的建议。我们框架的一个关键优势是保留原始问题和修订后问题之间的共享解决方案：明确指导法学硕士产生不会改变原始解决方案的干扰，从而消除了产生新答案的需要。该框架高效且易于部署，大大减少了在干扰条件下生成 MWP 所需的工作量，同时保持了较高的数据质量。

Title: LLMs Show Surface-Form Brittleness Under Paraphrase Stress Tests

Authors: Juan Miguel Navarro Carranza
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.08616
Pdf URL: https://arxiv.org/pdf/2510.08616
Copy Paste: [[2510.08616]] LLMs Show Surface-Form Brittleness Under Paraphrase Stress Tests(https://arxiv.org/abs/2510.08616)
Keywords: language model, llm
Abstract: Benchmark scores for Large Language Models (LLMs) can be inflated by memorization of test items or near duplicates. We present a simple protocol that probes generalization by re-evaluating models on paraphrased versions of benchmark questions. Using Mistral-7B-Instruct and Qwen2.5-7B-Instruct, we measure the accuracy gap between original and paraphrased items on ARC-Easy and ARC-Challenge. Our pipeline controls decoding, enforces multiple-choice output format, and includes a robust paraphrase-cleaning step to preserve semantics. We find that paraphrasing induces a non-trivial accuracy drop (original vs. paraphrased), consistent with prior concerns about contamination and brittle surface-form shortcuts.
摘要：大型语言模型 (LLM) 的基准分数可能会因记忆测试项目或接近重复的项目而夸大。我们提出了一个简单的协议，通过重新评估基准问题的释义版本的模型来探索泛化。使用 Mistral-7B-Instruct 和 Qwen2.5-7B-Instruct，我们测量了 ARC-Easy 和 ARC-Challenge 上原始项目和释义项目之间的准确性差距。我们的管道控制解码，强制执行多项选择输出格式，并包括强大的释义清理步骤以保留语义。我们发现，释义会导致准确率显着下降（原始与释义），这与之前对污染和脆弱的表面形式捷径的担忧一致。
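The quantity measured by this protocol, the accuracy gap between original and paraphrased items, can be sketched in a few lines. The function names and the answer-letter format are assumptions for illustration; the abstract does not report the gap values, so none are reproduced here.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of multiple-choice predictions that match the gold answer letter."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("need at least one item")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def paraphrase_gap(
    original_preds: Sequence[str],
    paraphrased_preds: Sequence[str],
    gold: Sequence[str],
) -> float:
    """Accuracy on original items minus accuracy on their paraphrases."""
    return accuracy(original_preds, gold) - accuracy(paraphrased_preds, gold)
```

A positive gap means the model does worse once the surface form changes while the gold answers stay the same.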

Title: JAI-1: A Thai-Centric Large Language Model

Authors: Attapol T. Rutherford, Jullajak Karnjanaekarin, Narongkorn Panitsrisit, Pontakorn Trakuekul, Sumana Sumanakul, Natchanon Pollertlam
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.08620
Pdf URL: https://arxiv.org/pdf/2510.08620
Copy Paste: [[2510.08620]] JAI-1: A Thai-Centric Large Language Model(https://arxiv.org/abs/2510.08620)
Keywords: language model, llm
Abstract: This technical report introduces JAI-1, a Thai-centric language model with 75B parameters. Recent Thai models have primarily relied on existing open-source models, applying additional training without structural modifications to specialize in Thai. However, this approach risks eroding pre-existing knowledge in the model's parameter space during the injection of Thai-specific information, as optimized parameters for general tasks may conflict with new linguistic requirements. In contrast, JAI-1 adopts an upscaling strategy: starting from a smaller, high-performing English open-source LLM, we expanded its parameter space and utilized the newly allocated capacity to systematically integrate Thai-language knowledge. This methodology not only preserves the original model's general intelligence but also establishes a unique architecture distinct from other open-source models, enabling scalable future enhancements. During pre-training, JAI-1 was exposed to 1.5T tokens, including over 300B Thai language tokens. This was followed by post-training stages -- supervised fine-tuning and alignment tuning -- using more than 600K instruction-based examples. The final model demonstrated superior performance compared to Typhoon2-70B on Thai-centric benchmarks (IFEval-TH, MT-Bench-TH, and JAI-Hall-Bench), validating the efficacy of its upscaling and knowledge-integration framework.

Title: From Simulation to Strategy: Automating Personalized Interaction Planning for Conversational Agents

Authors: Wen-Yu Chang, Tzu-Hung Huang, Chih-Ho Chen, Yun-Nung Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.08621
Pdf URL: https://arxiv.org/pdf/2510.08621
Copy Paste: [[2510.08621]] From Simulation to Strategy: Automating Personalized Interaction Planning for Conversational Agents(https://arxiv.org/abs/2510.08621)
Keywords: agent
Abstract: Amid the rapid rise of agentic dialogue models, realistic user-simulator studies are essential for tuning effective conversation strategies. This work investigates a sales-oriented agent that adapts its dialogue based on user profiles spanning age, gender, and occupation. While age and gender influence overall performance, occupation produces the most pronounced differences in conversational intent. Leveraging this insight, we introduce a lightweight, occupation-conditioned strategy that guides the agent to prioritize intents aligned with user preferences, resulting in shorter and more successful dialogues. Our findings highlight the importance of rich simulator profiles and demonstrate how simple persona-informed strategies can enhance the effectiveness of sales-oriented dialogue systems.

Title: Text2Stories: Evaluating the Alignment Between Stakeholder Interviews and Generated User Stories

Authors: Francesco Dente, Fabiano Dalpiaz, Paolo Papotti
Subjects: cs.CL, cs.SE
Abstract URL: https://arxiv.org/abs/2510.08622
Pdf URL: https://arxiv.org/pdf/2510.08622
Copy Paste: [[2510.08622]] Text2Stories: Evaluating the Alignment Between Stakeholder Interviews and Generated User Stories(https://arxiv.org/abs/2510.08622)
Keywords: language model, llm
Abstract: Large language models (LLMs) can be employed for automating the generation of software requirements from natural language inputs such as the transcripts of elicitation interviews. However, evaluating whether those derived requirements faithfully reflect the stakeholders' needs remains a largely manual task. We introduce Text2Stories, a task and metrics for text-to-story alignment that allow quantifying the extent to which requirements (in the form of user stories) match the actual needs expressed by the elicitation session participants. Given an interview transcript and a set of user stories, our metric quantifies (i) correctness: the proportion of stories supported by the transcript, and (ii) completeness: the proportion of transcript supported by at least one story. We segment the transcript into text chunks and instantiate the alignment as a matching problem between chunks and stories. Experiments over four datasets show that an LLM-based matcher achieves 0.86 macro-F1 on held-out annotations, while embedding models alone remain behind but enable effective blocking. Finally, we show how our metrics enable the comparison across sets of stories (e.g., human vs. generated), positioning Text2Stories as a scalable, source-faithful complement to existing user-story quality criteria.
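The two proportions described in the abstract above can be illustrated with a minimal sketch. It assumes the chunk-to-story matching has already been produced (by an LLM-based matcher or otherwise) and is given as a set of (chunk index, story id) pairs; the function name `alignment_scores` and the input layout are hypothetical and are not taken from the paper.

```python
def alignment_scores(num_chunks, stories, matches):
    """Compute (correctness, completeness) from a chunk-story matching.

    num_chunks: number of transcript chunks (indexed 0..num_chunks-1)
    stories:    list of story identifiers
    matches:    set of (chunk_index, story_id) pairs judged as aligned
                (hypothetical input format, assumed for illustration)
    """
    story_set = set(stories)
    # Stories supported by at least one transcript chunk.
    supported_stories = {s for _, s in matches if s in story_set}
    # Chunks covered by at least one story.
    covered_chunks = {c for c, s in matches if s in story_set and 0 <= c < num_chunks}

    correctness = len(supported_stories) / len(story_set) if story_set else 0.0
    completeness = len(covered_chunks) / num_chunks if num_chunks else 0.0
    return correctness, completeness


# Toy example: 4 chunks, 3 stories, story "s3" has no supporting chunk.
correctness, completeness = alignment_scores(
    num_chunks=4,
    stories=["s1", "s2", "s3"],
    matches={(0, "s1"), (1, "s1"), (2, "s2")},
)
print(correctness, completeness)
```

In this toy case two of three stories are supported and three of four chunks are covered, so a low correctness would point to unsupported stories and a low completeness to transcript content that no story captures.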

Title: PARSE: LLM Driven Schema Optimization for Reliable Entity Extraction

Authors: Anubhav Shrimal, Aryan Jain, Soumyajit Chowdhury, Promod Yenigalla
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2510.08623
Pdf URL: https://arxiv.org/pdf/2510.08623
Copy Paste: [[2510.08623]] PARSE: LLM Driven Schema Optimization for Reliable Entity Extraction(https://arxiv.org/abs/2510.08623)
Keywords: language model, llm, hallucination, agent
Abstract: Structured information extraction from unstructured text is critical for emerging Software 3.0 systems where LLM agents autonomously interact with APIs and tools. Recent approaches apply large language models directly to extraction tasks using existing JSON schemas, often with constrained decoding or reinforcement learning approaches to ensure syntactic validity, but treat JSON schemas as static contracts designed for human developers, leading to suboptimal extraction performance, frequent hallucinations, and unreliable agent behavior when schemas contain ambiguous or incomplete specifications. We recognize that JSON schemas themselves are a form of natural language understanding contract that encodes rules, relationships, and expectations about data structure contracts that LLMs should be able to both interpret and systematically improve. Consequently, we develop PARSE (Parameter Automated Refinement and Schema Extraction), a novel system with two synergistic components: ARCHITECT, which autonomously optimizes JSON schemas for LLM consumption while maintaining backward compatibility through RELAY (an integrated code generation system), and SCOPE, which implements reflection-based extraction with combined static and LLM-based guardrails. We evaluate PARSE qualitatively and quantitatively on three datasets including Schema-Guided Dialogue (SGD), Structured Web Data Extraction (SWDE), and internal retail conversation data, and find that it achieves up to 64.7% improvement in extraction accuracy on SWDE with combined framework improvements reaching 10% across models, while reducing extraction errors by 92% within the first retry and maintaining practical latency.

Title: Do LLMs Know They Are Being Tested? Evaluation Awareness and Incentive-Sensitive Failures in GPT-OSS-20B

Authors: Nisar Ahmed, Muhammad Imran Zaman, Gulshan Saleem, Ali Hassan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.08624
Pdf URL: https://arxiv.org/pdf/2510.08624
Copy Paste: [[2510.08624]] Do LLMs Know They Are Being Tested? Evaluation Awareness and Incentive-Sensitive Failures in GPT-OSS-20B(https://arxiv.org/abs/2510.08624)
Keywords: language model, gpt, llm, prompt, chain-of-thought
Abstract: Benchmarks for large language models (LLMs) often rely on rubric-scented prompts that request visible reasoning and strict formatting, whereas real deployments demand terse, contract-bound answers. We investigate whether such "evaluation scent" inflates measured performance without commensurate capability gains. Using a single open-weights model (GPT-OSS-20B), we run six paired A/B scenarios that hold task content and decoding fixed while varying framing (evaluation-oriented vs. real-world) and reasoning depth (Medium/High): deterministic math, strict code-fix, citation generation, incentive flips (caution vs. competence), CoT visibility, and multilingual (Urdu) headers. Deterministic validators compute accuracy, answer-only compliance, hedging/refusals, chain-of-thought (CoT) length, and schema compliance, with pre-registered deltas and composite indices. Across scenarios, evaluation framing reliably inflates CoT (hundreds to >1000 characters) and reduces answer-only compliance, with limited or inconsistent accuracy gains. In structured outputs, it improves wrappers (e.g., fenced blocks, enumerated lists) but not regex-validated substance. Incentive wording reweights error composition: praising caution modestly improves accuracy at high reasoning and reduces wrong-but-confident errors, whereas praising competence yields terser but riskier outputs. Urdu rubric headers reproduce these signatures and can decrease accuracy at higher reasoning depth, indicating multilingual parity risks. We provide a reproducible A/B framework (prompt banks, validators, per-run scores, scripts; versioned DOI) and practical guidance: neutral phrasing or dual-framing checks, contract-aware grading, style-delta reporting, confidence governance, and multilingual dashboards to ensure that benchmark gains reflect deployable capability.

Title: From What to Why: Thought-Space Recommendation with Small Language Models

Authors: Prosenjit Biswas, Pervez Shaik, Abhinav Thorat, Ravi Kolla, Niranjan Pedanekar
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.08626
Pdf URL: https://arxiv.org/pdf/2510.08626
Copy Paste: [[2510.08626]] From What to Why: Thought-Space Recommendation with Small Language Models(https://arxiv.org/abs/2510.08626)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have advanced recommendation capabilities through enhanced reasoning, but pose significant challenges for real-world deployment due to high inference costs. Conversely, while Small Language Models (SLMs) offer an efficient alternative, their reasoning capabilities for recommendation remain underexplored. Existing systems often use natural language rationales merely as unsupervised descriptive text, failing to harness their full potential as learning signals. In this work, our main idea is to use SLMs, instead of LLMs' distilled knowledge, to create a common understanding of users and items across multiple domains, called Thought Space. To that end, we propose PULSE (Preference Understanding by Latent Semantic Embeddings), a framework that treats SLM-generated rationales as direct learning signals, supervising them with interaction histories to jointly model user actions (what) and their semantic drivers (why). Existing methods consider only interactions such as sequences and embeddings, whereas PULSE treats rationales as first-class signals; this novel design yields embeddings that are more robust and generalizable. Extensive experiments demonstrate that PULSE outperforms leading ID, Collaborative Filtering (CF), and LLM-based sequential recommendation models across multiple benchmark datasets. Furthermore, PULSE exhibits superior transferability in cross-domain recommendation and demonstrates strong performance on downstream tasks such as reasoning-oriented question answering. Our code is available at this https URL.

Title: ExPO-HM: Learning to Explain-then-Detect for Hateful Meme Detection

Authors: Jingbiao Mei, Mingsheng Sun, Jinghong Chen, Pengda Qin, Yuhong Li, Da Chen, Bill Byrne
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.08630
Pdf URL: https://arxiv.org/pdf/2510.08630
Copy Paste: [[2510.08630]] ExPO-HM: Learning to Explain-then-Detect for Hateful Meme Detection(https://arxiv.org/abs/2510.08630)
Keywords: prompt, chain-of-thought, agent
Abstract: Hateful memes have emerged as a particularly challenging form of online abuse, motivating the development of automated detection systems. Most prior approaches rely on direct detection, producing only binary predictions. Such models fail to provide the context and explanations that real-world moderation requires. Recent Explain-then-Detect approaches, using Chain-of-Thought prompting or LMM agents, perform worse than simple SFT baselines, and even advanced post-training methods such as GRPO fail to close the gap. Our analysis identifies two key issues of such systems: important policy-relevant cues such as targets and attack types are not hypothesized by the model as a likely explanation; and the binary reward signal is insufficient to guide reasoning. To address these challenges, we propose ExPO-HM (Explain-then-Detect Policy Optimization for Hateful Memes), inspired by the training and evaluation process of human annotators. ExPO-HM combines SFT warmup, GRPO with curriculum learning, and Conditional Decision Entropy (CDE) as both metric and reward for reasoning quality. Across three hateful meme benchmarks, ExPO-HM achieves state-of-the-art performance on binary detection, fine-grained classification, and reasoning quality, with up to 15% and 17% F1 improvement over the GRPO and DPO baselines, respectively. By moving hateful meme detection from simple binary alarms to explanation-driven detection, ExPO-HM provides accurate, interpretable, and actionable moderation support.

Title: Next Semantic Scale Prediction via Hierarchical Diffusion Language Models

Authors: Cai Zhou, Chenyu Wang, Dinghuai Zhang, Shangyuan Tong, Yifei Wang, Stephen Bates, Tommi Jaakkola
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2510.08632
Pdf URL: https://arxiv.org/pdf/2510.08632
Copy Paste: [[2510.08632]] Next Semantic Scale Prediction via Hierarchical Diffusion Language Models(https://arxiv.org/abs/2510.08632)
Keywords: language model
Abstract: In this paper we introduce Hierarchical Diffusion Language Models (HDLM) -- a novel family of discrete diffusion models for language modeling. HDLM builds on a hierarchical vocabulary where low-level tokens with detailed semantics are surjectively mapped to high-level tokens with coarse-grained meanings. In the forward process, each token is independently perturbed to its higher-level ancestor with more abstract semantics according to the scheduler, while in the reverse process the model progressively predicts the next, more detailed semantics. Taken together, HDLM provides a general time-varying next semantic scale prediction process for language modeling. We derive closed-form expressions for the diffusion Evidence Lower Bound (ELBO), and show that HDLM can be implemented in a flexible manner while including the existing MDLM as a special case. We also propose practical training techniques based on the insights. Extensive text generation experiments validate the effectiveness of HDLM, which demonstrates consistently lower validation and generative perplexity than baselines.

Title: Upfront Chain-of-Thought: A Cooperative Framework for Chain-of-Thought Compression

Authors: Chengzhengxu Li, Xiaoming Liu, Zhaohan Zhang, Shaochu Zhang, Shengchao Liu, Guoxin Ma, Yu Lan, Chao Shen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.08647
Pdf URL: https://arxiv.org/pdf/2510.08647
Copy Paste: [[2510.08647]] Upfront Chain-of-Thought: A Cooperative Framework for Chain-of-Thought Compression(https://arxiv.org/abs/2510.08647)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Recent developments have enabled advanced reasoning in Large Language Models (LLMs) via long Chain-of-Thought (CoT), but long CoT suffers from high computational costs and significant latency owing to the autoregressive nature of generative LLMs. CoT compression aims to improve efficiency in the reasoning process by reducing output length. Previous works pursue reasoning efficiency through either laborious discrete prompt design or the construction of external compressed CoT datasets that sacrifice key reasoning details. In this work, we propose Upfront CoT (UCoT): an efficient reasoning framework with upfront thought embedding to automate CoT compression. UCoT is a cooperative workflow involving a small model (compressor) and a large model (executor). The first stage of UCoT trains the compressor to generate upfront thought embeddings rich in reasoning information for the executor, avoiding the drawbacks of manually designed prompts. The second stage optimizes the executor to utilize upfront thought embeddings to derive the correct answer with short reasoning, using a reward mechanism. Extensive experiments show that UCoT maintains the powerful reasoning ability of the executor while significantly reducing the length of CoT. It is worth mentioning that when applying UCoT to the Qwen2.5-7B-Instruct model, the usage of tokens on the GSM8K dataset is reduced by 50%, while the performance is 3.08% higher than that of the state-of-the-art (SOTA) method. The code and dataset are in the supplementary material.

Title: Formalizing Style in Personal Narratives

Authors: Gustave Cortal (ENS Paris Saclay, LISN), Alain Finkel (ENS Paris Saclay)
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.08649
Pdf URL: https://arxiv.org/pdf/2510.08649
Copy Paste: [[2510.08649]] Formalizing Style in Personal Narratives(https://arxiv.org/abs/2510.08649)
Keywords: language model
Abstract: Personal narratives are stories authors construct to make meaning of their experiences. Style, the distinctive way authors use language to express themselves, is fundamental to how these narratives convey subjective experiences. Yet there is a lack of a formal framework for systematically analyzing these stylistic choices. We present a novel approach that formalizes style in personal narratives as patterns in the linguistic choices authors make when communicating subjective experiences. Our framework integrates three domains: functional linguistics establishes language as a system of meaningful choices, computer science provides methods for automatically extracting and analyzing sequential patterns, and these patterns are linked to psychological observations. Using language models, we automatically extract linguistic features such as processes, participants, and circumstances. We apply our framework to hundreds of dream narratives, including a case study on a war veteran with post-traumatic stress disorder. Analysis of his narratives uncovers distinctive patterns, particularly how verbal processes dominate over mental ones, illustrating the relationship between linguistic choices and psychological states.

Title: A Novel Framework for Augmenting Rating Scale Tests with LLM-Scored Text Data

Authors: Joe Watson, Ivan O'Conner, Chia-Wen Chen, Luning Sun, Fang Luo, David Stillwell
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2510.08663
Pdf URL: https://arxiv.org/pdf/2510.08663
Copy Paste: [[2510.08663]] A Novel Framework for Augmenting Rating Scale Tests with LLM-Scored Text Data(https://arxiv.org/abs/2510.08663)
Keywords: llm
Abstract: Psychological assessments typically rely on structured rating scales, which cannot incorporate the rich nuance of a respondent's natural language. This study leverages recent LLM advances to harness qualitative data within a novel conceptual framework, combining LLM-scored text and traditional rating-scale items to create an augmented test. We demonstrate this approach using depression as a case study, developing and assessing the framework on a real-world sample of upper secondary students (n=693) and corresponding synthetic dataset (n=3,000). On held-out test sets, augmented tests achieved statistically significant improvements in measurement precision and accuracy. The information gain from the LLM items was equivalent to adding between 6.3 (real data) and 16.0 (synthetic data) items to the original 19-item test. Our approach marks a conceptual shift in automated scoring that bypasses its typical bottlenecks: instead of relying on pre-labelled data or complex expert-created rubrics, we empirically select the most informative LLM scoring instructions based on calculations of item information. This framework provides a scalable approach for leveraging the growing stream of transcribed text to enhance traditional psychometric measures, and we discuss its potential utility in clinical health and beyond.

Title: dInfer: An Efficient Inference Framework for Diffusion Language Models

Authors: Yuxin Ma, Lun Du, Lanning Wei, Kun Chen, Qian Xu, Kangyu Wang, Guofeng Feng, Guoshan Lu, Lin Liu, Xiaojing Qi, Xinyuan Zhang, Zhen Tao, Haibo Feng, Ziyun Jiang, Ying Xu, Zenan Huang, Yihong Zhuang, Haokai Xu, Jiaqi Hu, Zhenzhong Lan, Junbo Zhao, Jianguo Li, Da Zheng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.08666
Pdf URL: https://arxiv.org/pdf/2510.08666
Copy Paste: [[2510.08666]] dInfer: An Efficient Inference Framework for Diffusion Language Models(https://arxiv.org/abs/2510.08666)
Keywords: language model, llm
Abstract: Diffusion-based large language models (dLLMs) have emerged as a promising alternative to autoregressive (AR) LLMs, leveraging denoising-based generation to enable inherent parallelism. Although more and more open-source dLLMs are emerging, their widespread adoption remains constrained by the lack of a standardized and efficient inference framework. We present dInfer, an efficient and extensible framework for dLLM inference. dInfer decomposes the inference pipeline into four modular components (model, diffusion iteration manager, decoding strategy, and KV-cache manager) and integrates novel algorithms for each component alongside system-level optimizations. Through this combination of algorithmic innovations and system enhancements, dInfer achieves substantial efficiency gains without compromising output quality on LLaDA-MoE. At batch size 1, it surpasses 1,100 tokens per second on HumanEval and averages over 800 tokens per second across six benchmarks on 8× H800 GPUs. Compared to prior systems, dInfer delivers a 10× speedup over Fast-dLLM while maintaining similar model performance. Even compared with the AR model QWen2.5-3B (with a comparable number of activation parameters and comparable performance), which is highly optimized with the latest vLLM inference engine, dInfer still delivers a 2-3× speedup. The implementation of dInfer is open-sourced at this https URL.

Title: Scaling Laws for Code: A More Data-Hungry Regime

Authors: Xianzhen Luo, Wenzhen Zheng, Qingfu Zhu, Rongyi Zhang, Houyi Li, Siming Huang, YuanTao Fan, Wanxiang Che
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.08702
Pdf URL: https://arxiv.org/pdf/2510.08702
Copy Paste: [[2510.08702]] Scaling Laws for Code: A More Data-Hungry Regime(https://arxiv.org/abs/2510.08702)
Keywords: language model, llm
Abstract: Code Large Language Models (LLMs) are revolutionizing software engineering. However, scaling laws that guide efficient training are predominantly analyzed on Natural Language (NL). Given the fundamental differences between code and NL, such as strict syntax, it is unclear whether these laws are directly applicable to code. To address this gap, we conduct the first large-scale empirical study of scaling laws for code, comprising 117 experimental runs with model sizes from 0.2B to 3.8B and training tokens from 2B to 128B. We fit the Chinchilla law and the Farseer law. First, the results show that the more expressive Farseer law offers greater accuracy. Second, the analysis reveals that Code LLMs scale effectively with model size. Crucially, code represents a more data-hungry regime, requiring a substantially higher data-to-parameter ratio than NL. Finally, two additional sets of experiments on code-NL mixtures show that NL benefits resource-constrained scenarios, but becomes a detriment at higher compute budgets.

Title: Thinking Longer, Not Always Smarter: Evaluating LLM Capabilities in Hierarchical Legal Reasoning

Authors: Li Zhang, Matthias Grabmair, Morgan Gray, Kevin Ashley
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.08710
Pdf URL: https://arxiv.org/pdf/2510.08710
Copy Paste: [[2510.08710]] Thinking Longer, Not Always Smarter: Evaluating LLM Capabilities in Hierarchical Legal Reasoning(https://arxiv.org/abs/2510.08710)
Keywords: language model, llm
Abstract: Case-based reasoning is a cornerstone of U.S. legal practice, requiring professionals to argue about a current case by drawing analogies to and distinguishing from past precedents. While Large Language Models (LLMs) have shown remarkable capabilities, their proficiency in this complex, nuanced form of reasoning needs further investigation. We propose a formal framework that decomposes the process of identifying significant distinctions between cases into three-stage reasoning tasks. Our framework models cases using factual predicates called factors, organizes them into a legal knowledge hierarchy, and defines verifiable rules for identifying distinctions, analyzing their argumentative support, and evaluating their significance. Through comprehensive evaluation of modern reasoning LLMs, we reveal a paradox: while models achieve high accuracy on surface-level reasoning (Task 1), performance degrades on hierarchical reasoning (Task 2: 64.82%-92.09%) and collapses on integrated analysis (Task 3: 11.46%-33.99%). Most strikingly, we find that models consistently expend more computational resources on incorrect responses than correct ones, suggesting that "thinking longer" does not always mean "thinking smarter." Our work provides a methodology for fine-grained analysis of LLM reasoning capabilities in complex domains and reveals fundamental limitations that must be addressed for robust and trustworthy legal AI.

Title: How Many Code and Test Cases Are Enough? Evaluating Test Cases Generation from a Binary-Matrix Perspective

Authors: Xianzhen Luo, Jinyang Huang, Wenzhen Zheng, Qingfu Zhu, Mingzheng Xu, Yiheng Xu, Yuantao Fan, Libo Qin, Wanxiang Che
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.08720
Pdf URL: https://arxiv.org/pdf/2510.08720
Copy Paste: [[2510.08720]] How Many Code and Test Cases Are Enough? Evaluating Test Cases Generation from a Binary-Matrix Perspective(https://arxiv.org/abs/2510.08720)
Keywords: language model, llm
Abstract: Evaluating test cases automatically generated by Large Language Models (LLMs) is a critical yet challenging task. Existing benchmarks suffer from high computational costs, score inflation, and a bias towards trivial bugs over rare, critical faults. In this work, we ask two fundamental questions: (1) What is the minimal set of wrong codes sufficient to represent the entire error space? and (2) What is the minimal set of test cases needed to distinguish them? We introduce a framework that formalizes benchmark construction as finding an optimal diagnostic basis in a binary code-test matrix. The rank of this matrix specifies the minimal number of independent error patterns (wrong codes) and provides a tight upper bound on the number of test cases required for complete fault coverage. Our objective is to identify a basis of size equal to the matrix rank that maximizes internal diversity. To tackle this NP-hard problem, we propose WrongSelect, an efficient approximation algorithm to select maximally diverse wrong codes. Applying this framework to millions of competitive programming submissions, we construct TC-Bench, a compact, diverse, and inflation-resistant benchmark. Extensive experiments show that even the most advanced test case generation methods achieve only ~60% exclusion rates on TC-Bench, exposing a significant gap in their diagnostic power. Our dataset is available at: this https URL and our code is at: this https URL.
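The role of the matrix rank in the abstract above can be shown with a small sketch. It assumes a toy code-test matrix in which rows are wrong codes, columns are test cases, and an entry of 1 means the code fails that test; it also assumes the rank is taken over the rationals, which the abstract does not specify. The helper `matrix_rank` is a hypothetical illustration using plain Gaussian elimination and is not the paper's WrongSelect algorithm.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Rank of a matrix over the rationals via Gaussian elimination.

    Exact arithmetic (Fraction) avoids floating-point tolerance issues.
    Assumption: rank over the rationals; the paper may use another field.
    """
    m = [[Fraction(x) for x in row] for row in rows]
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        # Find a row at or below `rank` with a non-zero entry in this column.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        # Eliminate this column from all rows below the pivot row.
        for r in range(rank + 1, len(m)):
            if m[r][col] != 0:
                factor = m[r][col] / m[rank][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
        if rank == len(m):
            break
    return rank


# Toy matrix: 4 wrong codes x 4 tests (1 = the code fails the test).
code_test = [
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 1, 1, 1],  # sum of the first two rows
    [1, 0, 1, 0],  # duplicate of the first row
]
print(matrix_rank(code_test))  # → 2
```

Here only two of the four wrong codes carry independent failure patterns, which is the sense in which the rank bounds how many wrong codes a compact benchmark needs to keep.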
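
The role of the matrix rank in the abstract above can be illustrated with a small sketch. The abstract does not say over which field the rank is taken, so this example assumes GF(2) (XOR arithmetic); the function name `gf2_rank` and the toy matrix are hypothetical and are not taken from the paper. Rows stand for wrong codes, columns for test cases, and a 1 means the test case rejects that code.

```python
def gf2_rank(rows):
    """Rank of a binary matrix over GF(2) (an assumed choice of field).

    rows: list of equal-length lists containing 0/1.
    """
    # Pack each row into an integer so that row addition becomes XOR.
    masks = [int("".join(str(bit) for bit in row), 2) for row in rows]
    basis = []  # reduced vectors, each with a distinct leading bit
    for m in masks:
        for b in basis:
            # XOR with b only when that clears b's leading bit in m.
            m = min(m, m ^ b)
        if m:  # something survives: the row is independent of the basis
            basis.append(m)
    return len(basis)


# Toy code-test matrix: the third row is the XOR of the first two,
# so it adds no new error pattern.
matrix = [
    [1, 0, 1, 0],
    [0, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
]
print(gf2_rank(matrix))
```

Under this assumption, three of the four wrong codes would be enough to span the observed failure patterns. Selecting which three to keep so that diversity is maximized is the harder problem that the abstract attributes to WrongSelect, and it is not attempted here.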

Title: How Reliable is Language Model Micro-Benchmarking?

Authors: Gregory Yauney, Shahzaib Saqib Warraich, Swabha Swayamdipta
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2510.08730
Pdf URL: https://arxiv.org/pdf/2510.08730
Copy Paste: [[2510.08730]] How Reliable is Language Model Micro-Benchmarking?(https://arxiv.org/abs/2510.08730)
Keywords: language model
Abstract: Micro-benchmarking offers a solution to the often prohibitive time and cost of language model development: evaluate on a very small subset of existing benchmarks. Can these micro-benchmarks, however, rank models as consistently as the full benchmarks they replace? And can they rank models more consistently than selecting a random subset of data points? In many scenarios, we find that the answer is no. We introduce a meta-evaluation measure for micro-benchmarking which investigates how well a micro-benchmark can rank two models as a function of their performance difference on the full benchmark. This approach can determine which model pairs can be ranked correctly by a micro-benchmark, allowing for a finer-grained analysis of the trade-off between micro-benchmark size and reliability. Prior work has suggested selecting as few as 10 examples; we find that no micro-benchmarking method can consistently rank model pairs 3.5 points of accuracy apart on MMLU-Pro or 4 points apart on BIG-bench Hard. In order to consistently rank model pairs with relatively similar performances, we show that often as many as 250 examples must be selected, at which point random sampling is competitive with existing micro-benchmarking methods. When comparing only 8B instruction-tuned models on MMLU-Pro micro-benchmarks with 25 examples, we find that more than half of pairwise comparisons are not likely to be preserved. Our work provides actionable guidance for both micro-benchmark users and developers in navigating the trade-off between evaluation efficiency and reliability.
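
The question of whether a small subset preserves a pairwise ranking can be explored with a short simulation. This is only a sketch of the random-sampling baseline mentioned in the abstract, not the authors' meta-evaluation measure; the function name `agreement_rate`, the trial count, and the synthetic per-example results are all assumptions.

```python
import random


def agreement_rate(correct_a, correct_b, k, trials=1000, seed=0):
    """Fraction of random size-k subsets that rank models A and B
    the same way as the full benchmark does.

    correct_a, correct_b: per-example 0/1 correctness lists of equal length.
    """
    rng = random.Random(seed)
    n = len(correct_a)
    full_diff = sum(correct_a) - sum(correct_b)
    hits = 0
    for _ in range(trials):
        idx = rng.sample(range(n), k)
        diff = sum(correct_a[i] for i in idx) - sum(correct_b[i] for i in idx)
        # Same sign means the subset agrees with the full benchmark;
        # ties on the subset count as a failure to rank.
        if diff * full_diff > 0:
            hits += 1
    return hits / trials


# Synthetic example: model A is 10 accuracy points ahead on 100 items.
model_a = [1] * 60 + [0] * 40
model_b = [1] * 50 + [0] * 50
for size in (10, 25, 100):
    print(size, agreement_rate(model_a, model_b, size))
```

Plotting this rate against the full-benchmark accuracy gap for many model pairs would give a curve in the spirit of the analysis the abstract describes.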

Title: Coordinates from Context: Using LLMs to Ground Complex Location References

Authors: Tessa Masis, Brendan O'Connor
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.08741
Pdf URL: https://arxiv.org/pdf/2510.08741
Copy Paste: [[2510.08741]] Coordinates from Context: Using LLMs to Ground Complex Location References(https://arxiv.org/abs/2510.08741)
Keywords: llm
Abstract: Geocoding is the task of linking a location reference to an actual geographic location and is essential for many downstream analyses of unstructured text. In this paper, we explore the challenging setting of geocoding compositional location references. Building on recent work demonstrating LLMs' abilities to reason over geospatial data, we evaluate LLMs' geospatial knowledge versus reasoning skills relevant to our task. Based on these insights, we propose an LLM-based strategy for geocoding compositional location references. We show that our approach improves performance for the task and that a relatively small fine-tuned LLM can achieve comparable performance with much larger off-the-shelf models.

Title: Measuring Moral LLM Responses in Multilingual Capacities

Authors: Kimaya Basu, Savi Kolari, Allison Yu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.08776
Pdf URL: https://arxiv.org/pdf/2510.08776
Copy Paste: [[2510.08776]] Measuring Moral LLM Responses in Multilingual Capacities(https://arxiv.org/abs/2510.08776)
Keywords: gpt, llm
Abstract: With LLM usage becoming widespread across countries, languages, and humanity more broadly, the need to understand and guardrail their multilingual responses increases. Large-scale datasets for testing and benchmarking have been created to evaluate and facilitate LLM responses across multiple dimensions. In this study, we evaluate the responses of frontier and leading open-source models in five dimensions across low and high-resource languages to measure LLM accuracy and consistency across multilingual contexts. We evaluate the responses using a five-point grading rubric and a judge LLM. Our study shows that GPT-5 performed the best on average in each category, while other models displayed more inconsistency across language and category. Most notably, in the Consent & Autonomy and Harm Prevention & Safety categories, GPT scored the highest with averages of 3.56 and 4.73, while Gemini 2.5 Pro scored the lowest with averages of 1.39 and 1.98, respectively. These findings emphasize the need for further testing on how linguistic shifts impact LLM responses across various categories and improvement in these areas.

Title: Learning What to Remember: Adaptive Probabilistic Memory Retention for Memory-Efficient Language Models

Authors: S M Rafiuddin, Muntaha Nujat Khan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.08798
Pdf URL: https://arxiv.org/pdf/2510.08798
Copy Paste: [[2510.08798]] Learning What to Remember: Adaptive Probabilistic Memory Retention for Memory-Efficient Language Models(https://arxiv.org/abs/2510.08798)
Keywords: language model
Abstract: Transformer attention scales quadratically with sequence length O(n^2), limiting long-context use. We propose Adaptive Retention, a probabilistic, layer-wise token selection mechanism that learns which representations to keep under a strict global budget M. Retention is modeled with Bernoulli gates trained via a Hard-Concrete/variational relaxation and enforced with a simple top-M rule at inference, making the method differentiable and drop-in for standard encoders. Across classification, extractive QA, and long-document summarization, keeping only 30-50% of tokens preserves >= 95% of full-model performance while cutting peak memory by ~35-45% and improving throughput by up to ~1.8x. This architecture-agnostic approach delivers practical long-context efficiency without modifying base attention or task heads.
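
The inference-time top-M rule mentioned in the abstract above can be sketched for a single layer as follows. The retention scores would come from the learned gates; here they are made-up numbers, and the function name `top_m_keep` is hypothetical. How the paper shares the budget M across layers is not described in the abstract and is not modeled.

```python
def top_m_keep(scores, m):
    """Return the positions of the m highest-scoring tokens,
    in their original sequence order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:m])


# Made-up retention scores for a 6-token sequence, budget M = 3.
retention = [0.10, 0.90, 0.50, 0.70, 0.20, 0.05]
kept = top_m_keep(retention, 3)
print(kept)
```

Keeping the survivors in sequence order matters because later layers still rely on the relative positions of the tokens that remain.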

Title: Benchmarking Chinese Commonsense Reasoning with a Multi-hop Reasoning Perspective

Authors: Wangjie You, Xusheng Wang, Xing Wang, Wenxiang Jiao, Chao Feng, Juntao Li, Min Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.08800
Pdf URL: https://arxiv.org/pdf/2510.08800
Copy Paste: [[2510.08800]] Benchmarking Chinese Commonsense Reasoning with a Multi-hop Reasoning Perspective(https://arxiv.org/abs/2510.08800)
Keywords: language model, llm, retrieval-augmented generation
Abstract: While Large Language Models (LLMs) have demonstrated advanced reasoning capabilities, their comprehensive evaluation in general Chinese-language contexts remains understudied. To bridge this gap, we propose Chinese Commonsense Multi-hop Reasoning (CCMOR), a novel benchmark designed to evaluate LLMs' ability to integrate Chinese-specific factual knowledge with multi-step logical reasoning. Specifically, we first construct a domain-balanced seed set from existing QA datasets, then develop an LLM-powered pipeline to generate multi-hop questions anchored on factual unit chains. To ensure the quality of resulting dataset, we implement a human-in-the-loop verification system, where domain experts systematically validate and refine the generated questions. Using CCMOR, we evaluate state-of-the-art LLMs, demonstrating persistent limitations in LLMs' ability to process long-tail knowledge and execute knowledge-intensive reasoning. Notably, retrieval-augmented generation substantially mitigates these knowledge gaps, yielding significant performance gains.

Title: MOSAIC: Multi-agent Orchestration for Task-Intelligent Scientific Coding

Authors: Siddeshwar Raghavan, Tanwi Mallick
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.08804
Pdf URL: https://arxiv.org/pdf/2510.08804
Copy Paste: [[2510.08804]] MOSAIC: Multi-agent Orchestration for Task-Intelligent Scientific Coding(https://arxiv.org/abs/2510.08804)
Keywords: language model, llm, hallucination, agent
Abstract: We present MOSAIC, a multi-agent Large Language Model (LLM) framework for solving challenging scientific coding tasks. Unlike general-purpose coding, scientific workflows require algorithms that are rigorous, draw on deep domain knowledge, and incorporate domain-specific reasoning, and they require algorithm iteration without I/O test cases. Many scientific problems also require a sequence of subproblems to be solved, leading to the final desired result. MOSAIC is designed as a training-free framework with specially designed agents to self-reflect, create the rationale, code, and debug within a student-teacher paradigm to address the challenges of scientific code generation. This design facilitates stepwise problem decomposition, targeted error correction, and, when combined with our Consolidated Context Window (CCW), mitigates LLM hallucinations when solving complex scientific tasks involving chained subproblems. We evaluate MOSAIC on scientific coding benchmarks and demonstrate that our specialized agentic framework outperforms existing approaches in terms of accuracy, robustness, and interpretability.

Title: The Model's Language Matters: A Comparative Privacy Analysis of LLMs

Authors: Abhishek K. Mishra, Antoine Boutet, Lucas Magnana
Subjects: cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2510.08813
Pdf URL: https://arxiv.org/pdf/2510.08813
Copy Paste: [[2510.08813]] The Model's Language Matters: A Comparative Privacy Analysis of LLMs(https://arxiv.org/abs/2510.08813)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) are increasingly deployed across multilingual applications that handle sensitive data, yet their scale and linguistic variability introduce major privacy risks. Whereas prior evaluations have mostly focused on English, this paper investigates how language structure affects privacy leakage in LLMs trained on English, Spanish, French, and Italian medical corpora. We quantify six linguistic indicators and evaluate three attack vectors: extraction, counterfactual memorization, and membership inference. Results show that privacy vulnerability scales with linguistic redundancy and tokenization granularity: Italian exhibits the strongest leakage, while English shows higher membership separability. In contrast, French and Spanish display greater resilience due to higher morphological complexity. Overall, our findings provide the first quantitative evidence that language matters in privacy leakage, underscoring the need for language-aware privacy-preserving mechanisms in LLM deployments.

Title: Search-on-Graph: Iterative Informed Navigation for Large Language Model Reasoning on Knowledge Graphs

Authors: Jia Ao Sun, Hao Yu, Fabrizio Gotti, Fengran Mo, Yihong Wu, Yuchen Hui, Jian-Yun Nie
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.08825
Pdf URL: https://arxiv.org/pdf/2510.08825
Copy Paste: [[2510.08825]] Search-on-Graph: Iterative Informed Navigation for Large Language Model Reasoning on Knowledge Graphs(https://arxiv.org/abs/2510.08825)
Keywords: language model, llm, agent
Abstract: Large language models (LLMs) have demonstrated impressive reasoning abilities yet remain unreliable on knowledge-intensive, multi-hop questions -- they miss long-tail facts, hallucinate when uncertain, and their internal knowledge lags behind real-world change. Knowledge graphs (KGs) offer a structured source of relational evidence, but existing KGQA methods face fundamental trade-offs: compiling complete SPARQL queries without knowing available relations proves brittle, retrieving large subgraphs introduces noise, and complex agent frameworks with parallel exploration exponentially expand search spaces. To address these limitations, we propose Search-on-Graph (SoG), a simple yet effective framework that enables LLMs to perform iterative informed graph navigation using a single, carefully designed Search function. Rather than pre-planning paths or retrieving large subgraphs, SoG follows an "observe-then-navigate" principle: at each step, the LLM examines actual available relations from the current entity before deciding on the next hop. This approach further adapts seamlessly to different KG schemas and handles high-degree nodes through adaptive filtering. Across six KGQA benchmarks spanning Freebase and Wikidata, SoG achieves state-of-the-art performance without fine-tuning. We demonstrate particularly strong gains on Wikidata benchmarks (+16% improvement over previous best methods) alongside consistent improvements on Freebase benchmarks.
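
The observe-then-navigate loop described in the abstract above can be sketched on a toy graph. Everything here is illustrative: the dictionary-based graph, the `navigate` function, and the `scripted_chooser` that stands in for the LLM are assumptions, and the paper's actual Search function and adaptive filtering are not reproduced.

```python
def navigate(kg, start, choose_relation, max_hops=3):
    """Walk a knowledge graph one hop at a time.

    kg: dict mapping entity -> {relation: target entity} (toy, single-valued).
    choose_relation: callable(entity, available_relations) -> relation or None.
                     In a real system this decision would be made by the LLM.
    """
    entity, path = start, []
    for _ in range(max_hops):
        relations = kg.get(entity, {})                   # observe what exists
        rel = choose_relation(entity, list(relations))   # then decide the hop
        if rel is None:
            break
        entity = relations[rel]
        path.append((rel, entity))
    return entity, path


def scripted_chooser(plan):
    """Hypothetical stand-in for the LLM: follows a fixed plan, but only
    takes a relation if it is actually available at the current entity."""
    steps = iter(plan)

    def choose(entity, available):
        rel = next(steps, None)
        return rel if rel in available else None

    return choose


toy_kg = {
    "Paris": {"capital_of": "France"},
    "France": {"continent": "Europe"},
}
answer, hops = navigate(toy_kg, "Paris", scripted_chooser(["capital_of", "continent"]))
print(answer, hops)
```

The point of the sketch is the ordering inside the loop: the available relations are read from the graph before a hop is chosen, so the chooser cannot commit to a relation that does not exist.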

Title: Pattern Enhanced Multi-Turn Jailbreaking: Exploiting Structural Vulnerabilities in Large Language Models

Authors: Ragib Amin Nihal, Rui Wen, Kazuhiro Nakadai, Jun Sakuma
Subjects: cs.CL, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2510.08859
Pdf URL: https://arxiv.org/pdf/2510.08859
Copy Paste: [[2510.08859]] Pattern Enhanced Multi-Turn Jailbreaking: Exploiting Structural Vulnerabilities in Large Language Models(https://arxiv.org/abs/2510.08859)
Keywords: language model, llm
Abstract: Large language models (LLMs) remain vulnerable to multi-turn jailbreaking attacks that exploit conversational context to bypass safety constraints gradually. These attacks target different harm categories (like malware generation, harassment, or fraud) through distinct conversational approaches (educational discussions, personal experiences, hypothetical scenarios). Existing multi-turn jailbreaking methods often rely on heuristic or ad hoc exploration strategies, providing limited insight into underlying model weaknesses. The relationship between conversation patterns and model vulnerabilities across harm categories remains poorly understood. We propose Pattern Enhanced Chain of Attack (PE-CoA), a framework of five conversation patterns to construct effective multi-turn jailbreaks through natural dialogue. Evaluating PE-CoA on twelve LLMs spanning ten harm categories, we achieve state-of-the-art performance, uncovering pattern-specific vulnerabilities and LLM behavioral characteristics: models exhibit distinct weakness profiles where robustness to one conversational pattern does not generalize to others, and model families share similar failure modes. These findings highlight limitations of safety training and indicate the need for pattern-aware defenses. Code available on: this https URL

Title: Quality Estimation Reranking for Document-Level Translation

Authors: Krzysztof Mrozinski, Minji Kang, Ahmed Khota, Vincent Michael Sutanto, Giovanni Gatti De Giacomo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.08870
Pdf URL: https://arxiv.org/pdf/2510.08870
Copy Paste: [[2510.08870]] Quality Estimation Reranking for Document-Level Translation(https://arxiv.org/abs/2510.08870)
Keywords: language model, llm
Abstract: Quality estimation (QE) reranking is a form of quality-aware decoding which aims to improve machine translation (MT) by scoring and selecting the best candidate from a pool of generated translations. While known to be effective at the sentence level, its application to the increasingly prominent domain of document-level translation remains underexplored. In this work, we evaluate QE reranking performance on document-level (rather than the typical sentence-level) translation, using various learned and large language model (LLM)-based QE metrics. We find that with our best learned metric, SLIDE, BLEURT-20 scores improve by +2.00 with only two candidates, and by +5.09 with 32, across both decoder-only LLM models and encoder-decoder neural machine translation (NMT) models. Using the best LLM-based metric, GEMBA-DA, gains of +1.63 and +4.30 are achieved under the same conditions. Although gains shrink with longer inputs, reranking with 32 candidates yields improvements of +2.34 (SLIDE) and +1.40 (GEMBA-DA) on our longest documents (512-1024 source tokens). These findings demonstrate the practical value of document-level QE, with minimal runtime overhead given suitable translation models and hardware.
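
QE reranking as described in the abstract above reduces to scoring each candidate translation without a reference and keeping the best one. The sketch below shows only that selection step. The scorer `toy_length_ratio_score` is a deliberately crude, hypothetical placeholder and is not SLIDE, GEMBA-DA, or any real QE metric.

```python
def rerank(source, candidates, qe_score):
    """Return the candidate with the highest reference-free quality score."""
    return max(candidates, key=lambda cand: qe_score(source, cand))


def toy_length_ratio_score(source, candidate):
    """Placeholder scorer (an assumption, NOT a real QE metric):
    prefers candidates whose word count is close to the source's."""
    s, c = len(source.split()), len(candidate.split())
    return min(s, c) / max(s, c) if max(s, c) else 0.0


source_doc = "uno dos tres cuatro"
pool = ["one", "one two three four", "one two"]
best = rerank(source_doc, pool, toy_length_ratio_score)
print(best)
```

Because the scorer is passed in as a function, a learned or LLM-based metric could be substituted without changing the selection logic; the cost then scales with the pool size (2 to 32 candidates in the abstract).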

Title: FinAuditing: A Financial Taxonomy-Structured Multi-Document Benchmark for Evaluating LLMs

Authors: Yan Wang, Keyi Wang, Shanshan Yang, Jaisal Patel, Jeff Zhao, Fengran Mo, Xueqing Peng, Lingfei Qian, Jimin Huang, Guojun Xiong, Xiao-Yang Liu, Jian-Yun Nie
Subjects: cs.CL, cs.CE, cs.IR
Abstract URL: https://arxiv.org/abs/2510.08886
Pdf URL: https://arxiv.org/pdf/2510.08886
Copy Paste: [[2510.08886]] FinAuditing: A Financial Taxonomy-Structured Multi-Document Benchmark for Evaluating LLMs(https://arxiv.org/abs/2510.08886)
Keywords: language model, llm
Abstract: The complexity of the Generally Accepted Accounting Principles (GAAP) and the hierarchical structure of eXtensible Business Reporting Language (XBRL) filings make financial auditing increasingly difficult to automate and verify. While large language models (LLMs) have demonstrated strong capabilities in unstructured text understanding, their ability to reason over structured, interdependent, and taxonomy-driven financial documents remains largely unexplored. To fill this gap, we introduce FinAuditing, the first taxonomy-aligned, structure-aware, multi-document benchmark for evaluating LLMs on financial auditing tasks. Built from real US-GAAP-compliant XBRL filings, FinAuditing defines three complementary subtasks, FinSM for semantic consistency, FinRE for relational consistency, and FinMR for numerical consistency, each targeting a distinct aspect of structured auditing reasoning. We further propose a unified evaluation framework integrating retrieval, classification, and reasoning metrics across these subtasks. Extensive zero-shot experiments on 13 state-of-the-art LLMs reveal that current models perform inconsistently across semantic, relational, and mathematical dimensions, with accuracy drops of up to 60-90% when reasoning over hierarchical multi-document structures. Our findings expose the systematic limitations of modern LLMs in taxonomy-grounded financial reasoning and establish FinAuditing as a foundation for developing trustworthy, structure-aware, and regulation-aligned financial intelligence systems. The benchmark dataset is available at Hugging Face.

Title: Exploring Multi-Temperature Strategies for Token- and Rollout-Level Control in RLVR

Authors: Haomin Zhuang, Yujun Zhou, Taicheng Guo, Yue Huang, Fangxu Liu, Kai Song, Xiangliang Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.08892
Pdf URL: https://arxiv.org/pdf/2510.08892
Copy Paste: [[2510.08892]] Exploring Multi-Temperature Strategies for Token- and Rollout-Level Control in RLVR(https://arxiv.org/abs/2510.08892)
Keywords: language model, llm
Abstract: Reinforcement Learning has demonstrated substantial improvements in the reasoning abilities of Large Language Models (LLMs), exhibiting significant applicability across various domains. Recent research has identified that tokens within LLMs play distinct roles during reasoning tasks, categorizing them into high-entropy reasoning tokens and low-entropy knowledge tokens. Prior approaches have typically focused on restricting updates to indirectly encourage exploration, yet they do not explicitly facilitate exploratory behavior during the token generation stage itself. In this work, we introduce a complementary approach that explicitly promotes exploration during sampling by applying distinct temperature settings for different token types. Specifically, our method employs higher temperatures for reasoning tokens to actively encourage exploration, while retaining lower temperatures for knowledge tokens to maintain factual correctness. Furthermore, we systematically investigate various multi-temperature scheduling strategies and their impacts within reinforcement learning contexts. Empirical evaluations on several reasoning benchmarks demonstrate that our approach significantly enhances the reasoning performance of LLMs. The code is available at this https URL.
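
One way to picture token-level temperature control is to pick the sampling temperature from the entropy of the next-token distribution. The abstract does not state how tokens are classified or which temperatures are used, so the entropy threshold and the two temperature values below are assumptions, as are all function names.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def pick_temperature(logits, threshold=1.0, t_high=1.2, t_low=0.6):
    """Assumed rule: high-entropy positions are treated as reasoning tokens
    and sampled hotter; low-entropy positions as knowledge tokens, cooler."""
    return t_high if entropy(softmax(logits)) > threshold else t_low


flat_logits = [0.0, 0.0, 0.0, 0.0]     # uncertain position, entropy = ln 4
peaked_logits = [10.0, 0.0, 0.0, 0.0]  # near-deterministic position
print(pick_temperature(flat_logits), pick_temperature(peaked_logits))
```

In a rollout, the chosen temperature would then be passed back into `softmax` before sampling the token at that position.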

Title: A Unified Biomedical Named Entity Recognition Framework with Large Language Models

Authors: Tengxiao Lv, Ling Luo, Juntao Li, Yanhua Wang, Yuchen Pan, Chao Liu, Yanan Wang, Yan Jiang, Huiyi Lv, Yuanyuan Sun, Jian Wang, Hongfei Lin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.08902
Pdf URL: https://arxiv.org/pdf/2510.08902
Copy Paste: [[2510.08902]] A Unified Biomedical Named Entity Recognition Framework with Large Language Models(https://arxiv.org/abs/2510.08902)
Keywords: language model, llm
Abstract: Accurate recognition of biomedical named entities is critical for medical information extraction and knowledge discovery. However, existing methods often struggle with nested entities, entity boundary ambiguity, and cross-lingual generalization. In this paper, we propose a unified Biomedical Named Entity Recognition (BioNER) framework based on Large Language Models (LLMs). We first reformulate BioNER as a text generation task and design a symbolic tagging strategy to jointly handle both flat and nested entities with explicit boundary annotation. To enhance multilingual and multi-task generalization, we perform bilingual joint fine-tuning across multiple Chinese and English datasets. Additionally, we introduce a contrastive learning-based entity selector that filters incorrect or spurious predictions by leveraging boundary-sensitive positive and negative samples. Experimental results on four benchmark datasets and two unseen corpora show that our method achieves state-of-the-art performance and robust zero-shot generalization across languages. The source codes are freely available at this https URL.

Title: Autoencoding-Free Context Compression for LLMs via Contextual Semantic Anchors

Authors: Xin Liu, RunSong Zhao, PengCheng Huang, XinYu Liu, JunYi Xiao, ChunYang Xiao, Tong Xiao, Shengxiang Gao, Zhengtao Yu, JingBo Zhu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.08907
Pdf URL: https://arxiv.org/pdf/2510.08907
Copy Paste: [[2510.08907]] Autoencoding-Free Context Compression for LLMs via Contextual Semantic Anchors(https://arxiv.org/abs/2510.08907)
Keywords: language model, llm, long context
Abstract: Context compression presents a promising approach for accelerating large language model (LLM) inference by compressing long contexts into compact representations. Current context compression methods predominantly rely on autoencoding tasks to train context-agnostic compression tokens to compress contextual semantics. While autoencoding tasks enable compression tokens to acquire compression capabilities, compression via autoencoding tasks creates a fundamental mismatch: the models are optimized for reconstruction, which diverges from actual downstream tasks, thereby weakening the features more beneficial for real-world usage. We propose Semantic-Anchor Compression (SAC), a novel method that shifts from autoencoding task based compression to an architecture that is equipped with this compression capability a priori. Instead of training models to compress contexts through autoencoding tasks, SAC directly selects so-called anchor tokens from the original context and aggregates contextual information into their key-value (KV) representations. By deriving representations directly from the contextual tokens, SAC eliminates the need for autoencoding training. To ensure compression performance while directly leveraging anchor tokens, SAC incorporates two key designs: (1) anchor embeddings that enable the compressor to identify critical tokens, and (2) bidirectional attention modification that allows anchor tokens to capture information from the entire context. Experimental results demonstrate that SAC consistently outperforms existing context compression methods across various compression ratios. On out-of-distribution evaluation using MRQA, SAC achieves a 1 EM improvement at 5x compression over strong baselines, with increasing advantages at higher compression ratios.

Title: Artificial Impressions: Evaluating Large Language Model Behavior Through the Lens of Trait Impressions

Authors: Nicholas Deas, Kathleen McKeown
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.08915
Pdf URL: https://arxiv.org/pdf/2510.08915
Copy Paste: [[2510.08915]] Artificial Impressions: Evaluating Large Language Model Behavior Through the Lens of Trait Impressions(https://arxiv.org/abs/2510.08915)
Keywords: language model, llm, prompt
Abstract: We introduce and study artificial impressions--patterns in LLMs' internal representations of prompts that resemble human impressions and stereotypes based on language. We fit linear probes on generated prompts to predict impressions according to the two-dimensional Stereotype Content Model (SCM). Using these probes, we study the relationship between impressions and downstream model behavior as well as prompt features that may inform such impressions. We find that LLMs inconsistently report impressions when prompted, but also that impressions are more consistently linearly decodable from their hidden representations. Additionally, we show that artificial impressions of prompts are predictive of the quality and use of hedging in model responses. We also investigate how particular content, stylistic, and dialectal features in prompts impact LLM impressions.

Title: SOP-Maze: Evaluating Large Language Models on Complicated Business Standard Operating Procedures

Authors: Jiaming Wang, Zhe Tang, Yilin Jin, Peng Ding, Xiaoyu Li, Xuezhi Cao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.08942
Pdf URL: https://arxiv.org/pdf/2510.08942
Copy Paste: [[2510.08942]] SOP-Maze: Evaluating Large Language Models on Complicated Business Standard Operating Procedures(https://arxiv.org/abs/2510.08942)
Keywords: language model, llm, agent
Abstract: As large language models (LLMs) are widely deployed as domain-specific agents, many benchmarks have been proposed to evaluate their ability to follow instructions and make decisions in real-world scenarios. However, business scenarios often involve complex standard operating procedures (SOPs), and the evaluation of LLM capabilities in such contexts has not been fully explored. To bridge this gap, we propose SOP-Maze, a benchmark constructed from real-world business data and adapted into a collection of 397 tasks from 23 complex SOP scenarios. We further categorize SOP tasks into two broad classes: Lateral Root System (LRS), representing wide-option tasks that demand precise selection; and Heart Root System (HRS), which emphasizes deep logical reasoning with complex branches. Extensive experiments reveal that nearly all state-of-the-art models struggle with SOP-Maze. We conduct a comprehensive analysis and identify three key error categories: (i) route blindness: difficulty following procedures; (ii) conversational fragility: inability to handle real dialogue nuances; and (iii) calculation errors: mistakes in time or arithmetic reasoning under complex contexts. The systematic study explores LLM performance across SOP tasks that challenge both breadth and depth, offering new insights for improving model capabilities. We have open-sourced our work on this https URL.

Title: Creation of the Chinese Adaptive Policy Communication Corpus

Authors: Bolun Sun, Charles Chang, Yuen Yuen Ang, Pingxu Hao, Ruotong Mu, Yuchen Xu, Zhengxin Zhang
Subjects: cs.CL, cs.CE, cs.CY
Abstract URL: https://arxiv.org/abs/2510.08986
Pdf URL: https://arxiv.org/pdf/2510.08986
Copy Paste: [[2510.08986]] Creation of the Chinese Adaptive Policy Communication Corpus(https://arxiv.org/abs/2510.08986)
Keywords: language model, llm
Abstract: We introduce CAPC-CG, the Chinese Adaptive Policy Communication (Central Government) Corpus, the first open dataset of Chinese policy directives annotated with a five-color taxonomy of clear and ambiguous language categories, building on Ang's theory of adaptive policy communication. Spanning 1949-2023, this corpus includes national laws, administrative regulations, and ministerial rules issued by China's top authorities. Each document is segmented into paragraphs, producing a total of 3.3 million units. Alongside the corpus, we release comprehensive metadata, a two-round labeling framework, and a gold-standard annotation set developed by expert and trained coders. Inter-annotator agreement achieves a Fleiss's kappa of K = 0.86 on directive labels, indicating high reliability for supervised modeling. We provide baseline classification results with several large language models (LLMs), together with our annotation codebook, and describe patterns from the dataset. This release aims to support downstream tasks and multilingual NLP research in policy communication.
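Note on the agreement statistic: the abstract reports Fleiss' kappa without restating its definition. A minimal sketch of the standard formula follows, assuming every item is rated by the same number of raters; the function name and the toy count tables are illustrative and are not taken from the CAPC-CG release.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table where counts[i][j] is the number of
    raters who assigned item i to category j.

    Assumes every row sums to the same number of raters (at least 2).
    Undefined (division by zero) when all ratings fall in one category.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_cats = len(counts[0])

    # Share of all assignments that went to each category.
    p_j = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_cats)
    ]
    # Per-item agreement: fraction of rater pairs that agree on the item.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_items          # mean observed agreement
    p_e = sum(p * p for p in p_j)       # agreement expected by chance
    return (p_bar - p_e) / (1 - p_e)


# Toy example (hypothetical data): 2 items, 3 raters, 2 categories.
perfect = fleiss_kappa([[3, 0], [0, 3]])
split = fleiss_kappa([[2, 1], [1, 2]])
```

Values near 1 indicate agreement well above chance, which is the sense in which the reported 0.86 is read as high reliability.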

Title: MASA: LLM-Driven Multi-Agent Systems for Autoformalization

Authors: Lan Zhang, Marco Valentino, André Freitas
Subjects: cs.CL, cs.FL
Abstract URL: https://arxiv.org/abs/2510.08988
Pdf URL: https://arxiv.org/pdf/2510.08988
Copy Paste: [[2510.08988]] MASA: LLM-Driven Multi-Agent Systems for Autoformalization(https://arxiv.org/abs/2510.08988)
Keywords: language model, llm, agent
Abstract: Autoformalization serves a crucial role in connecting natural language and formal reasoning. This paper presents MASA, a novel framework for building multi-agent systems for autoformalization driven by Large Language Models (LLMs). MASA leverages collaborative agents to convert natural language statements into their formal representations. The architecture of MASA is designed with a strong emphasis on modularity, flexibility, and extensibility, allowing seamless integration of new agents and tools to adapt to a fast-evolving field. We showcase the effectiveness of MASA through use cases on real-world mathematical definitions and experiments on formal mathematics datasets. This work highlights the potential of multi-agent systems powered by the interaction of LLMs and theorem provers in enhancing the efficiency and reliability of autoformalization, providing valuable insights and support for researchers and practitioners in the field.

Title: DARO: Difficulty-Aware Reweighting Policy Optimization

Authors: Jingyu Zhou, Lu Ma, Hao Liang, Chengyu Shen, Bin Cui, Wentao Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.09001
Pdf URL: https://arxiv.org/pdf/2510.09001
Copy Paste: [[2510.09001]] DARO: Difficulty-Aware Reweighting Policy Optimization(https://arxiv.org/abs/2510.09001)
Keywords: language model, llm
Abstract: Recent advances in large language models (LLMs) have shown that reasoning ability can be significantly enhanced through Reinforcement Learning with Verifiable Rewards (RLVR). Group Relative Policy Optimization (GRPO) has emerged as the de facto approach for RLVR, inspiring numerous variants. However, our mathematical analysis reveals that these methods are fundamentally weighted variations of GRPO. We provide a unified view, demonstrating that their reliance on static or overly simplistic weighting schemes tied to sample difficulty prevents adaptation to a model's evolving capabilities. This creates a significant loss scale issue, where training disproportionately focuses on certain difficulty levels at the expense of others, hindering overall performance. To address these limitations, we introduce Difficulty-Aware Reweighting Policy Optimization (DARO), a method that dynamically adjusts the loss contribution of each difficulty group based on the model's learning state. Extensive experiments on Qwen2.5-Math-1.5B, Qwen2.5-Math-7B, and Llama3.1-8B show that DARO outperforms four leading baselines across six math benchmarks, achieving significantly faster convergence and superior final performance.
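Note on the GRPO baseline: the abstract does not give DARO's reweighting rule, so none is sketched here. The block below only illustrates the commonly described group-relative advantage of plain GRPO, where each sampled response is scored against the other responses to the same prompt. The function name, the `eps` constant, and the use of the population standard deviation are assumptions; implementations differ on these details.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize the rewards of one prompt's sampled responses.

    Each advantage is (reward - group mean) / (group std + eps).
    Population std is an assumption; some implementations use sample std.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical binary verifier rewards for three prompts of varying difficulty.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])   # half solved
all_solved = group_relative_advantages([1.0, 1.0, 1.0])   # too easy
none_solved = group_relative_advantages([0.0, 0.0, 0.0])  # too hard
```

Groups in which every response receives the same reward yield all-zero advantages, which is one way the share of the loss contributed by a prompt depends on its difficulty for the current model.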

Title: Decoupling Safety into Orthogonal Subspace: Cost-Efficient and Performance-Preserving Alignment for Large Language Models

Authors: Yutao Mou, Xiaoling Zhou, Yuxiao Luo, Shikun Zhang, Wei Ye
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.09004
Pdf URL: https://arxiv.org/pdf/2510.09004
Copy Paste: [[2510.09004]] Decoupling Safety into Orthogonal Subspace: Cost-Efficient and Performance-Preserving Alignment for Large Language Models(https://arxiv.org/abs/2510.09004)
Keywords: language model
Abstract: Safety alignment is essential for building trustworthy artificial intelligence, yet it remains challenging to enhance model safety without degrading general performance. Current approaches require computationally expensive searches for the optimal proportion of safety-critical and general-purpose data to balance safety and general performance, incurring high costs with limited gains. In this work, we show that LoRA-based Refusal-training enables performance-preserving safety alignment even when trained solely on safety data, demonstrating that LoRA serves as cost-efficient, performance-preserving, and plug-and-play safety patches. Beyond empirical findings, we provide both theoretical and experimental evidence that LoRA effectively decouples safety into a low-rank subspace largely orthogonal to the model's intrinsic transformation space, ensuring that safety enhancements do not interfere with inherent capabilities.

Title: LitE-SQL: A Lightweight and Efficient Text-to-SQL Framework with Vector-based Schema Linking and Execution-Guided Self-Correction

Authors: Shengmin Piao, Jieun Lee, Sanghyun Park
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.09014
Pdf URL: https://arxiv.org/pdf/2510.09014
Copy Paste: [[2510.09014]] LitE-SQL: A Lightweight and Efficient Text-to-SQL Framework with Vector-based Schema Linking and Execution-Guided Self-Correction(https://arxiv.org/abs/2510.09014)
Keywords: language model, llm
Abstract: The Text-to-SQL task translates natural language questions into SQL queries, enabling intuitive database interaction for non-experts. While recent methods leveraging Large Language Models (LLMs) achieve strong performance, their reliance on proprietary models raises concerns about deployment feasibility and data privacy. In this work, we introduce LitE-SQL, a Lightweight and Efficient framework with two components: (i) a Schema Retriever that performs efficient schema linking using a vector database of pre-computed schema embeddings, and (ii) a SQL Generator fine-tuned in two stages (supervised fine-tuning followed by execution-guided reinforcement), enabling self-correction without costly multi-candidate generation. On BIRD, LitE-SQL achieves 72.10% execution accuracy, and on Spider 1.0 it reaches 88.45%, demonstrating comparable or superior performance to LLM-based methods despite using 2x to 30x fewer parameters. Our findings demonstrate that high-quality Text-to-SQL generation is feasible with lightweight models, offering a practical solution for privacy-sensitive and resource-constrained settings.

Title: Automated Refinement of Essay Scoring Rubrics for Language Models via Reflect-and-Revise

Authors: Keno Harada, Lui Yoshida, Takeshi Kojima, Yusuke Iwasawa, Yutaka Matsuo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.09030
Pdf URL: https://arxiv.org/pdf/2510.09030
Copy Paste: [[2510.09030]] Automated Refinement of Essay Scoring Rubrics for Language Models via Reflect-and-Revise(https://arxiv.org/abs/2510.09030)
Keywords: language model, gpt, llm, prompt
Abstract: The performance of Large Language Models (LLMs) is highly sensitive to the prompts they are given. Drawing inspiration from the field of prompt optimization, this study investigates the potential for enhancing Automated Essay Scoring (AES) by refining the scoring rubrics used by LLMs. Specifically, our approach prompts models to iteratively refine rubrics by reflecting on models' own scoring rationales and observed discrepancies with human scores on sample essays. Experiments on the TOEFL11 and ASAP datasets using GPT-4.1, Gemini-2.5-Pro, and Qwen-3-Next-80B-A3B-Instruct show Quadratic Weighted Kappa (QWK) improvements of up to 0.19 and 0.47, respectively. Notably, even with a simple initial rubric, our approach achieves comparable or better QWK than using detailed human-authored rubrics. Our findings highlight the importance of iterative rubric refinement in LLM-based AES to enhance alignment with human evaluations.
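Note on the metric: Quadratic Weighted Kappa measures agreement between two sets of ordinal scores and penalizes a disagreement by the squared distance between the two ratings. A minimal sketch of the standard definition follows, assuming integer ratings on a known scale with at least two levels; the function name and the toy scores are illustrative and do not come from the paper.

```python
def quadratic_weighted_kappa(a, b, min_rating, max_rating):
    """QWK between two equal-length lists of integer ratings.

    Returns 1 - (weighted observed disagreement / weighted expected
    disagreement). Undefined when both raters use a single identical
    rating throughout (the denominator is zero).
    """
    k = max_rating - min_rating + 1
    n = len(a)

    # Confusion matrix of rating pairs.
    observed = [[0] * k for _ in range(k)]
    for x, y in zip(a, b):
        observed[x - min_rating][y - min_rating] += 1

    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    num = 0.0
    den = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2
            expected = hist_a[i] * hist_b[j] / n  # chance-level count
            num += weight * observed[i][j]
            den += weight * expected
    return 1.0 - num / den


# Toy example (hypothetical scores on a 1-4 scale).
same = quadratic_weighted_kappa([1, 2, 3, 4], [1, 2, 3, 4], 1, 4)
reversed_scores = quadratic_weighted_kappa([1, 2, 3, 4], [4, 3, 2, 1], 1, 4)
```

Because the scale runs from -1 to 1, the reported gains of up to 0.19 and 0.47 are absolute changes in this coefficient.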

Title: Exploring Cross-Lingual Knowledge Transfer via Transliteration-Based MLM Fine-Tuning for Critically Low-resource Chakma Language

Authors: Adity Khisa, Nusrat Jahan Lia, Tasnim Mahfuz Nafis, Zarif Masud, Tanzir Pial, Shebuti Rayana, Ahmedul Kabir
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.09032
Pdf URL: https://arxiv.org/pdf/2510.09032
Copy Paste: [[2510.09032]] Exploring Cross-Lingual Knowledge Transfer via Transliteration-Based MLM Fine-Tuning for Critically Low-resource Chakma Language(https://arxiv.org/abs/2510.09032)
Keywords: language model
Abstract: As an Indo-Aryan language with limited available data, Chakma remains largely underrepresented in language models. In this work, we introduce a novel corpus of contextually coherent Bangla-transliterated Chakma, curated from Chakma literature, and validated by native speakers. Using this dataset, we fine-tune six encoder-based multilingual and regional transformer models (mBERT, XLM-RoBERTa, DistilBERT, DeBERTaV3, BanglaBERT, and IndicBERT) on masked language modeling (MLM) tasks. Our experiments show that fine-tuned multilingual models outperform their pre-trained counterparts when adapted to Bangla-transliterated Chakma, achieving up to 73.54% token accuracy and a perplexity as low as 2.90. Our analysis further highlights the impact of data quality on model performance and shows the limitations of OCR pipelines for morphologically rich Indic scripts. Our research demonstrates that Bangla-transliterated Chakma can be very effective for transfer learning for the Chakma language, and we release our manually validated monolingual dataset to encourage further research on multilingual language modeling for low-resource languages.
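Note on the two reported metrics: token accuracy is the share of masked positions where the top prediction matches the original token, and perplexity is usually the exponential of the mean negative log-likelihood over those positions. The sketch below assumes that convention; the abstract does not say how the authors computed it, and the function names and toy values are illustrative.

```python
import math


def token_accuracy(predicted, gold):
    """Fraction of masked positions where the prediction equals the gold token."""
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return correct / len(gold)


def perplexity(token_log_probs):
    """exp of the mean negative natural-log probability of the gold tokens."""
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Toy example (hypothetical): the model gives each gold token probability 0.5.
ppl = perplexity([math.log(0.5)] * 4)
acc = token_accuracy(["a", "b", "c", "d"], ["a", "b", "x", "d"])
```

Under this convention a perplexity of 2.90 corresponds to a mean negative log-likelihood of ln(2.90) nats per masked token, roughly as uncertain as a uniform choice among about three candidates.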

Title: Large Language Models Do NOT Really Know What They Don't Know

Authors: Chi Seng Cheang, Hou Pong Chan, Wenxuan Zhang, Yang Deng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.09033
Pdf URL: https://arxiv.org/pdf/2510.09033
Copy Paste: [[2510.09033]] Large Language Models Do NOT Really Know What They Don't Know(https://arxiv.org/abs/2510.09033)
Keywords: language model, llm, hallucination
Abstract: Recent work suggests that large language models (LLMs) encode factuality signals in their internal representations, such as hidden states, attention weights, or token probabilities, implying that LLMs may "know what they don't know". However, LLMs can also produce factual errors by relying on shortcuts or spurious associations. These errors are driven by the same training objective that encourages correct predictions, raising the question of whether internal computations can reliably distinguish between factual and hallucinated outputs. In this work, we conduct a mechanistic analysis of how LLMs internally process factual queries by comparing two types of hallucinations based on their reliance on subject information. We find that when hallucinations are associated with subject knowledge, LLMs employ the same internal recall process as for correct responses, leading to overlapping and indistinguishable hidden-state geometries. In contrast, hallucinations detached from subject knowledge produce distinct, clustered representations that make them detectable. These findings reveal a fundamental limitation: LLMs do not encode truthfulness in their internal states but only patterns of knowledge recall, demonstrating that "LLMs don't really know what they don't know".

Title: Alif: Advancing Urdu Large Language Models via Multilingual Synthetic Data Distillation

Authors: Muhammad Ali Shafique, Kanwal Mehreen, Muhammad Arham, Maaz Amjad, Sabur Butt, Hamza Farooq
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.09051
Pdf URL: https://arxiv.org/pdf/2510.09051
Copy Paste: [[2510.09051]] Alif: Advancing Urdu Large Language Models via Multilingual Synthetic Data Distillation(https://arxiv.org/abs/2510.09051)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Developing high-performing large language models (LLMs) for low-resource languages such as Urdu presents several challenges. These challenges include the scarcity of high-quality datasets, multilingual inconsistencies, and safety concerns. Existing multilingual LLMs often address these issues by translating large volumes of available data. However, such translations often lack quality and cultural nuance while also incurring significant costs for data curation and training. To address these issues, we propose Alif-1.0-8B-Instruct, a multilingual Urdu-English model that tackles these challenges with a unique approach. We train the model on a high-quality, multilingual synthetic dataset (Urdu-Instruct), developed using a modified self-instruct technique. By using unique prompts and seed values for each task along with a global task pool, this dataset incorporates Urdu-native chain-of-thought based reasoning, bilingual translation, cultural relevance, and ethical safety alignments. This technique significantly enhances the comprehension of the Alif-1.0-8B-Instruct model for Urdu-specific tasks. As a result, Alif-1.0-8B-Instruct, built upon the pretrained Llama-3.1-8B, demonstrates superior performance compared to Llama-3.1-8B-Instruct on Urdu-specific tasks. It also outperformed leading multilingual LLMs, including Mistral-7B-Instruct-v0.3, Qwen-2.5-7B-Instruct, and Cohere-Aya-Expanse-8B, all within a training budget of under $100. Our results demonstrate that high-performing, culturally aligned LLMs for low-resource languages can be developed efficiently using our modified self-instruct approach. All datasets, models, and code are publicly available at: this https URL.

Title: ReFIne: A Framework for Trustworthy Large Reasoning Models with Reliability, Faithfulness, and Interpretability

Authors: Chung-En Sun, Ge Yan, Akshay Kulkarni, Tsui-Wei Weng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.09062
Pdf URL: https://arxiv.org/pdf/2510.09062
Copy Paste: [[2510.09062]] ReFIne: A Framework for Trustworthy Large Reasoning Models with Reliability, Faithfulness, and Interpretability(https://arxiv.org/abs/2510.09062)
Keywords: chain-of-thought
Abstract: Recent advances in long chain-of-thought (CoT) reasoning have largely prioritized answer accuracy and token efficiency, while overlooking aspects critical to trustworthiness. We argue that usable reasoning systems must be trustworthy, characterized by three properties: interpretability, faithfulness, and reliability. To this end, we propose ReFIne, a new training framework that integrates supervised fine-tuning with GRPO to encourage models to: (i) improve interpretability by producing structured, tag-based traces with high-level planning that are easier for humans to follow; (ii) enhance faithfulness by explicitly disclosing the decisive information guiding each solution, with consistent cross-section references; and (iii) promote reliability by providing self-assessments of both the derivation's soundness and the confidence of the final answer. We apply ReFIne to the Qwen3 models at multiple scales (1.7B/4B/8B) and evaluate across mathematical benchmarks of varying difficulty. Our experimental results show that ReFIne models generate clearer and better-structured reasoning traces (interpretability +44.0%), more faithfully expose their underlying decision process (faithfulness +18.8%), and offer informative confidence estimates (reliability +42.4%). These findings highlight an overlooked but important direction: reasoning models should be optimized not only for accuracy, but also for broader dimensions of trustworthiness. Our code is available at: this https URL

Title: FrameEOL: Semantic Frame Induction using Causal Language Models

Authors: Chihiro Yano, Kosuke Yamada, Hayato Tsukagoshi, Ryohei Sasano, Koichi Takeda
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.09097
Pdf URL: https://arxiv.org/pdf/2510.09097
Copy Paste: [[2510.09097]] FrameEOL: Semantic Frame Induction using Causal Language Models(https://arxiv.org/abs/2510.09097)
Keywords: language model, gpt, prompt
Abstract: Semantic frame induction is the task of clustering frame-evoking words according to the semantic frames they evoke. In recent years, leveraging embeddings of frame-evoking words that are obtained using masked language models (MLMs) such as BERT has led to high-performance semantic frame induction. Although causal language models (CLMs) such as the GPT and Llama series succeed in a wide range of language comprehension tasks and can engage in dialogue as if they understood frames, they have not yet been applied to semantic frame induction. We propose a new method for semantic frame induction based on CLMs. Specifically, we introduce FrameEOL, a prompt-based method for obtaining Frame Embeddings that outputs One frame-name as a Label representing the given situation. To obtain embeddings more suitable for frame induction, we leverage in-context learning (ICL) and deep metric learning (DML). Frame induction is then performed by clustering the resulting embeddings. Experimental results on the English and Japanese FrameNet datasets demonstrate that the proposed methods outperform existing frame induction methods. In particular, for Japanese, which lacks extensive frame resources, the CLM-based method using only 5 ICL examples achieved comparable performance to the MLM-based method fine-tuned with DML.

Title: When Retrieval Succeeds and Fails: Rethinking Retrieval-Augmented Generation for LLMs

Authors: Yongjie Wang, Yue Yu, Kaisong Song, Jun Lin, Zhiqi Shen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.09106
Pdf URL: https://arxiv.org/pdf/2510.09106
Copy Paste: [[2510.09106]] When Retrieval Succeeds and Fails: Rethinking Retrieval-Augmented Generation for LLMs(https://arxiv.org/abs/2510.09106)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Large Language Models (LLMs) have enabled a wide range of applications through their powerful capabilities in language understanding and generation. However, as LLMs are trained on static corpora, they face difficulties in addressing rapidly evolving information or domain-specific queries. Retrieval-Augmented Generation (RAG) was developed to overcome this limitation by integrating LLMs with external retrieval mechanisms, allowing them to access up-to-date and contextually relevant knowledge. However, as LLMs themselves continue to advance in scale and capability, the relative advantages of traditional RAG frameworks have become less pronounced and less necessary. Here, we present a comprehensive review of RAG, beginning with its overarching objectives and core components. We then analyze the key challenges within RAG, highlighting critical weaknesses that may limit its effectiveness. Finally, we showcase applications where LLMs alone perform inadequately, but where RAG, when combined with LLMs, can substantially enhance their effectiveness. We hope this work will encourage researchers to reconsider the role of RAG and inspire the development of next-generation RAG systems.

Title: DITING: A Multi-Agent Evaluation Framework for Benchmarking Web Novel Translation

Authors: Enze Zhang, Jiaying Wang, Mengxi Xiao, Jifei Liu, Ziyan Kuang, Rui Dong, Youzhong Dong, Sophia Ananiadou, Min Peng, Qianqian Xie
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.09116
Pdf URL: https://arxiv.org/pdf/2510.09116
Copy Paste: [[2510.09116]] DITING: A Multi-Agent Evaluation Framework for Benchmarking Web Novel Translation(https://arxiv.org/abs/2510.09116)
Keywords: language model, llm, agent
Abstract: Large language models (LLMs) have substantially advanced machine translation (MT), yet their effectiveness in translating web novels remains unclear. Existing benchmarks rely on surface-level metrics that fail to capture the distinctive traits of this genre. To address these gaps, we introduce DITING, the first comprehensive evaluation framework for web novel translation, assessing narrative and cultural fidelity across six dimensions: idiom translation, lexical ambiguity, terminology localization, tense consistency, zero-pronoun resolution, and cultural safety, supported by over 18K expert-annotated Chinese-English sentence pairs. We further propose AgentEval, a reasoning-driven multi-agent evaluation framework that simulates expert deliberation to assess translation quality beyond lexical overlap, achieving the highest correlation with human judgments among seven tested automatic metrics. To enable metric comparison, we develop MetricAlign, a meta-evaluation dataset of 300 sentence pairs annotated with error labels and scalar quality scores. Comprehensive evaluation of fourteen open, closed, and commercial models reveals that Chinese-trained LLMs surpass larger foreign counterparts, and that DeepSeek-V3 delivers the most faithful and stylistically coherent translations. Our work establishes a new paradigm for exploring LLM-based web novel translation and provides public resources to advance future research.

Title: Augmenting Dialog with Think-Aloud Utterances for Modeling Individual Personality Traits by LLM

Authors: Seiya Ishikura, Hiroaki Yamada, Tatsuya Hiraoka, Takenobu Tokunaga
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.09158
Pdf URL: https://arxiv.org/pdf/2510.09158
Copy Paste: [[2510.09158]] Augmenting Dialog with Think-Aloud Utterances for Modeling Individual Personality Traits by LLM(https://arxiv.org/abs/2510.09158)
Keywords: llm, chat
Abstract: This study proposes augmenting dialog data with think-aloud utterances (TAUs) for modeling individual personalities in text chat with LLMs. A TAU is a verbalization of a speaker's thought before articulating the utterance. We expect that "persona LLMs" trained with TAU-augmented data can better mimic the speaker's personality traits. We tested whether the trained persona LLMs acquire the speaker's personality with respect to the Big Five, a framework that characterizes human personality traits from five aspects. The results showed that LLMs trained with TAU-augmented data align more closely with the speakers' Big Five Agreeableness and Neuroticism than those trained with the original dialog data. We also found that the quality of TAU augmentation affects the persona LLM's performance.

Title: LLaMAX2: Your Translation-Enhanced Model also Performs Well in Reasoning

Authors: Changjiang Gao, Zixian Huang, Jingyang Gong, Shujian Huang, Lei Li, Fei Yuan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.09189
Pdf URL: https://arxiv.org/pdf/2510.09189
Copy Paste: [[2510.09189]] LLaMAX2: Your Translation-Enhanced Model also Performs Well in Reasoning(https://arxiv.org/abs/2510.09189)
Keywords: language model, llm
Abstract: General Large Language Models (LLMs) excel in reasoning, but those enhanced for translation struggle with reasoning tasks. To address this, we propose a novel translation-enhanced recipe that begins with instruct models and applies layer-selective tuning only on parallel data. Following this pipeline, we introduce the Qwen3-XPlus models, which demonstrate significant improvements in translation performance across both high- and low-resource languages, achieving 15+ spBLEU and 40+ xComet in low-resource languages, like Swahili. Interestingly, training only with small parallel datasets, Qwen3-XPlus achieves an average improvement of 1+ points on 7 multilingual tasks while maintaining proficiency comparable to the Qwen3 instruct model in 15 popular reasoning datasets. This work offers a promising approach to multilingual enhancement, significantly reducing complexity and enhancing accessibility for a wider range of languages. The code and model are publicly available.

Title: DICE: Structured Reasoning in LLMs through SLM-Guided Chain-of-Thought Correction

Authors: Yiqi Li, Yusheng Liao, Zhe Chen, Yanfeng Wang, Yu Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.09211
Pdf URL: https://arxiv.org/pdf/2510.09211
Copy Paste: [[2510.09211]] DICE: Structured Reasoning in LLMs through SLM-Guided Chain-of-Thought Correction(https://arxiv.org/abs/2510.09211)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: When performing reasoning tasks with user-specific requirements, such as strict output formats, large language models (LLMs) often prioritize reasoning over adherence to detailed instructions. Fine-tuning LLMs on supervised datasets to address this is impractical due to high computational costs and limited parameter access. To tackle this, we propose DICE, a lightweight framework that guides small language models (SLMs) to refine LLMs' outputs through chain-of-thought (CoT) correction. DICE decouples the process by first prompting LLMs to generate natural language responses, then using trained SLMs to analyze and refine these outputs to meet structured output specifications. This framework preserves LLMs' broad knowledge and reasoning capabilities while ensuring the outputs conform to user demands. Specifically, DICE first constructs structured CoT adaptation datasets via a two-stage method and subsequently applies a dual-tuning strategy to fine-tune SLMs for generating structured outputs in an analyze-then-answer pattern. Experiments demonstrate that DICE improves the average format accuracy and content correctness of LLM outputs by 35.4% and 29.4%, respectively, achieving state-of-the-art (SOTA) performance over other competitive baselines.

Title: IRIS: An Iterative and Integrated Framework for Verifiable Causal Discovery in the Absence of Tabular Data

Authors: Tao Feng, Lizhen Qu, Niket Tandon, Gholamreza Haffari
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2510.09217
Pdf URL: https://arxiv.org/pdf/2510.09217
Copy Paste: [[2510.09217]] IRIS: An Iterative and Integrated Framework for Verifiable Causal Discovery in the Absence of Tabular Data(https://arxiv.org/abs/2510.09217)
Keywords: llm
Abstract: Causal discovery is fundamental to scientific research, yet traditional statistical algorithms face significant challenges, including expensive data collection, redundant computation for known relations, and unrealistic assumptions. While recent LLM-based methods excel at identifying commonly known causal relations, they fail to uncover novel relations. We introduce IRIS (Iterative Retrieval and Integrated System for Real-Time Causal Discovery), a novel framework that addresses these limitations. Starting with a set of initial variables, IRIS automatically collects relevant documents, extracts variables, and uncovers causal relations. Our hybrid causal discovery method combines statistical algorithms and LLM-based methods to discover known and novel causal relations. In addition to causal discovery on initial variables, the missing variable proposal component of IRIS identifies and incorporates missing variables to expand the causal graphs. Our approach enables real-time causal discovery from only a set of initial variables without requiring pre-existing datasets.

Title: CrisiText: A dataset of warning messages for LLM training in emergency communication

Authors: Giacomo Gonella, Gian Maria Campedelli, Stefano Menini, Marco Guerini
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.09243
Pdf URL: https://arxiv.org/pdf/2510.09243
Copy Paste: [[2510.09243]] CrisiText: A dataset of warning messages for LLM training in emergency communication(https://arxiv.org/abs/2510.09243)
Keywords: llm
Abstract: Effectively identifying threats and mitigating their potential damage during crisis situations, such as natural disasters or violent attacks, is paramount for safeguarding endangered individuals. To tackle these challenges, AI has been used in assisting humans in emergency situations. Still, the use of NLP techniques remains limited and mostly focuses on classification tasks. The significant potential of timely warning message generation using NLG architectures, however, has been largely overlooked. In this paper we present CrisiText, the first large-scale dataset for the generation of warning messages across 13 different types of crisis scenarios. The dataset contains more than 400,000 warning messages (spanning almost 18,000 crisis situations) aimed at assisting civilians during and after such events. To generate the dataset, we started from existing crisis descriptions and created chains of events related to the scenarios. Each event was then paired with a warning message. The generations follow experts' written guidelines to ensure correct terminology and factuality of their suggestions. Additionally, each message is accompanied by three suboptimal warning types to allow for the study of different NLG approaches. To this end, we conducted a series of experiments comparing supervised fine-tuning setups with preference alignment, zero-shot, and few-shot approaches. We further assessed model performance in out-of-distribution scenarios and evaluated the effectiveness of an automatic post-editor.

Title: DSPO: Stable and Efficient Policy Optimization for Agentic Search and Reasoning

Authors: Chenyang Gu, Yewen Pu, Bruce Yang, Xiaofan Li, Huan Gao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.09255
Pdf URL: https://arxiv.org/pdf/2510.09255
Copy Paste: [[2510.09255]] DSPO: Stable and Efficient Policy Optimization for Agentic Search and Reasoning(https://arxiv.org/abs/2510.09255)
Keywords: llm, prompt, agent
Abstract: Enhancing LLMs with the ability to actively search external knowledge is crucial for complex and real-world tasks. Current approaches either rely on prompting to elicit the model's innate agent capabilities, or suffer from performance ceilings and collapse when applying RL to complex interactive tasks, leaving their true agentic potential untapped. To address this, we introduce Dynamic-filter Sequence-level Policy Optimization (DSPO), an improved RL algorithm designed for robust agent training through sequence-level optimization and dynamic sample filtering. We train our model purely through RL to interleave multi-turn search and reasoning, obviating the need for supervised demonstration data. Across multiple QA benchmarks, our DSPO-trained 7B model improves over a comparable previous work by 34.1%, and even outperforms the 14B model from previous work in complex multihop QA such as HotpotQA by nearly 9% relative, maintaining exceptional training stability.

Title: Detecting Data Contamination from Reinforcement Learning Post-training for Large Language Models

Authors: Yongding Tao, Tian Wang, Yihong Dong, Huanyu Liu, Kechi Zhang, Xiaolong Hu, Ge Li
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.09259
Pdf URL: https://arxiv.org/pdf/2510.09259
Copy Paste: [[2510.09259]] Detecting Data Contamination from Reinforcement Learning Post-training for Large Language Models(https://arxiv.org/abs/2510.09259)
Keywords: language model, llm
Abstract: Data contamination poses a significant threat to the reliable evaluation of Large Language Models (LLMs). This issue arises when benchmark samples inadvertently appear in training sets, compromising the validity of reported performance. While detection methods have been developed for the pre-training and Supervised Fine-Tuning stages, a critical research gap exists for the increasingly significant phase of Reinforcement Learning (RL) post-training. As RL post-training becomes pivotal for advancing LLM reasoning, the absence of specialized contamination detection methods in this paradigm presents a critical vulnerability. To address this, we conduct the first systematic study of data detection within the RL post-training scenario and propose Self-Critique. Our method is motivated by a key observation: after the RL phase, the output entropy distribution of LLMs tends to collapse into highly specific and sparse modes. Self-Critique probes for the underlying policy collapse, i.e., the model's convergence to a narrow reasoning path, which causes this entropy reduction. To facilitate this research, we also introduce RL-MIA, a benchmark constructed to simulate this specific contamination scenario. Extensive experiments show that Self-Critique significantly outperforms baseline methods across multiple models and contamination tasks, achieving an AUC improvement of up to 30%. Whereas existing methods are close to a random guess for RL-phase contamination, our method makes detection possible.

Title: CFVBench: A Comprehensive Video Benchmark for Fine-grained Multimodal Retrieval-Augmented Generation

Authors: Kaiwen Wei, Xiao Liu, Jie Zhang, Zijian Wang, Ruida Liu, Yuming Yang, Xin Xiao, Xiao Sun, Haoyang Zeng, Changzai Pan, Yidan Zhang, Jiang Zhong, Peijin Wang, Yingchao Feng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.09266
Pdf URL: https://arxiv.org/pdf/2510.09266
Copy Paste: [[2510.09266]] CFVBench: A Comprehensive Video Benchmark for Fine-grained Multimodal Retrieval-Augmented Generation(https://arxiv.org/abs/2510.09266)
Keywords: language model, gpt, llm, retrieval-augmented generation
Abstract: Multimodal Retrieval-Augmented Generation (MRAG) enables Multimodal Large Language Models (MLLMs) to generate responses with external multimodal evidence, and numerous video-based MRAG benchmarks have been proposed to evaluate model capabilities across retrieval and generation stages. However, existing benchmarks remain limited in modality coverage and format diversity, often focusing on single- or limited-modality tasks, or coarse-grained scene understanding. To address these gaps, we introduce CFVBench, a large-scale, manually verified benchmark constructed from 599 publicly available videos, yielding 5,360 open-ended QA pairs. CFVBench spans high-density formats and domains such as chart-heavy reports, news broadcasts, and software tutorials, requiring models to retrieve and reason over long temporal video spans while maintaining fine-grained multimodal information. Using CFVBench, we systematically evaluate 7 retrieval methods and 14 widely-used MLLMs, revealing a critical bottleneck: current models (even GPT5 or Gemini) struggle to capture transient yet essential fine-grained multimodal details. To mitigate this, we propose Adaptive Visual Refinement (AVR), a simple yet effective framework that adaptively increases frame sampling density and selectively invokes external tools when necessary. Experiments show that AVR consistently enhances fine-grained multimodal comprehension and improves performance across all evaluated MLLMs.

Title: Inflated Excellence or True Performance? Rethinking Medical Diagnostic Benchmarks with Dynamic Evaluation

Authors: Xiangxu Zhang, Lei Li, Yanyun Zhou, Xiao Zhou, Yingying Zhang, Xian Wu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.09275
Pdf URL: https://arxiv.org/pdf/2510.09275
Copy Paste: [[2510.09275]] Inflated Excellence or True Performance? Rethinking Medical Diagnostic Benchmarks with Dynamic Evaluation(https://arxiv.org/abs/2510.09275)
Keywords: language model, llm
Abstract: Medical diagnostics is a high-stakes and complex domain that is critical to patient care. However, current evaluations of large language models (LLMs) are fundamentally misaligned with real-world clinical practice. Most of them rely on static benchmarks derived from public medical exam items, which tend to overestimate model performance and ignore the difference between textbook cases and the ambiguous, varying conditions in the real world. Recent efforts toward dynamic evaluation offer a promising alternative, but their improvements are limited to superficial perturbations and a narrow focus on accuracy. To address these gaps, we propose DyReMe, a dynamic benchmark for medical diagnostics that better reflects real clinical practice. Unlike static exam-style questions, DyReMe generates fresh, consultation-like cases that introduce distractors such as differential diagnoses and common misdiagnosis factors. It also varies expression styles to mimic diverse real-world query habits. Beyond accuracy, DyReMe evaluates LLMs on three additional clinically relevant dimensions: veracity, helpfulness, and consistency. Our experiments demonstrate that this dynamic approach yields more challenging and realistic assessments, revealing significant misalignments between the performance of state-of-the-art LLMs and real clinical practice. These findings highlight the urgent need for evaluation frameworks that better reflect the demands of trustworthy medical diagnostics.

Title: CLARity: Reasoning Consistency Alone Can Teach Reinforced Experts

Authors: Jiuheng Lin, Cong Jiang, Zirui Wu, Jiarui Sun, Yansong Feng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.09278
Pdf URL: https://arxiv.org/pdf/2510.09278
Copy Paste: [[2510.09278]] CLARity: Reasoning Consistency Alone Can Teach Reinforced Experts(https://arxiv.org/abs/2510.09278)
Keywords: llm
Abstract: Training expert LLMs in domains with scarce data is difficult, often relying on multiple-choice questions (MCQs). However, standard outcome-based reinforcement learning (RL) on MCQs is risky. While it may improve accuracy, we observe it often degrades reasoning quality such as logical consistency. Existing solutions to supervise reasoning, such as large-scale Process Reward Models (PRMs), are prohibitively expensive. To address this, we propose CLARity, a cost-effective RL framework that enhances reasoning quality using only a small, general-purpose LLM. CLARity integrates a consistency-aware reward mechanism with a 2-stage refine-then-monitor training pipeline to enhance reasoning consistency, and a dynamic data reformulation strategy to better exploit limited data. Experiments demonstrate that CLARity improves response consistency by 16.5% and accuracy by 7.5% over baselines. Human evaluations further confirm holistic improvements in coherence and professionalism. Thus, CLARity offers a generalizable solution that enables smaller models to effectively guide expert models by reasoning this http URL code is open sourced at: this https URL

Title: MaP: A Unified Framework for Reliable Evaluation of Pre-training Dynamics

Authors: Jiapeng Wang, Changxin Tian, Kunlong Chen, Ziqi Liu, Jiaxin Mao, Wayne Xin Zhao, Zhiqiang Zhang, Jun Zhou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.09295
Pdf URL: https://arxiv.org/pdf/2510.09295
Copy Paste: [[2510.09295]] MaP: A Unified Framework for Reliable Evaluation of Pre-training Dynamics(https://arxiv.org/abs/2510.09295)
Keywords: language model, llm
Abstract: Reliable evaluation is fundamental to the progress of Large Language Models (LLMs), yet the evaluation process during pre-training is plagued by significant instability that obscures true learning dynamics. In this work, we systematically diagnose this instability, attributing it to two distinct sources: Parameter Instability from training stochasticity and Evaluation Instability from noisy measurement protocols. To counteract both sources of noise, we introduce MaP, a dual-pronged framework that synergistically integrates checkpoint Merging and the Pass@k metric. Checkpoint merging smooths the parameter space by averaging recent model weights, while Pass@k provides a robust, low-variance statistical estimate of model capability. Extensive experiments show that MaP yields significantly smoother performance curves, reduces inter-run variance, and ensures more consistent model rankings. Ultimately, MaP provides a more reliable and faithful lens for observing LLM training dynamics, laying a crucial empirical foundation for LLM research.
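
Sketch (illustrative, not the authors' code): the two ingredients named in the MaP abstract, averaging recent checkpoints and scoring with Pass@k, can be written in a few lines. The sketch assumes checkpoints are plain dicts of float lists, assumes uniform averaging, and uses the commonly cited unbiased Pass@k estimator 1 - C(n-c, k) / C(n, k); the names `merge_checkpoints` and `pass_at_k` are hypothetical.

```python
from math import comb


def merge_checkpoints(checkpoints):
    """Uniformly average each parameter across checkpoints (assumed scheme)."""
    count = len(checkpoints)
    return {
        key: [sum(vals) / count
              for vals in zip(*(ckpt[key] for ckpt in checkpoints))]
        for key in checkpoints[0]
    }


def pass_at_k(n, c, k):
    """Unbiased Pass@k estimate from n samples, c of which are correct."""
    if n - c < k:
        # Every size-k subset must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Toy usage with invented numbers.
merged = merge_checkpoints([{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}])
score = pass_at_k(n=10, c=5, k=1)
```

With the toy inputs, `merged["w"]` is the element-wise mean of the two weight lists, and `score` is the chance that a single draw from the ten samples is correct.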

Title: ShiZhi: A Chinese Lightweight Large Language Model for Court View Generation

Authors: Zhitian Hou, Kun Zeng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.09297
Pdf URL: https://arxiv.org/pdf/2510.09297
Copy Paste: [[2510.09297]] ShiZhi: A Chinese Lightweight Large Language Model for Court View Generation(https://arxiv.org/abs/2510.09297)
Keywords: language model, llm
Abstract: Criminal Court View Generation (CVG) is a fundamental task in legal artificial intelligence, aiming to automatically generate the "Court View" section of a legal case document. Generating court views is challenging due to the diversity and complexity of case facts, and directly generating from raw facts may limit performance. In this paper, we present ShiZhi, the first large language model (LLM) specifically designed for court view generation. We construct a Chinese Court View Generation dataset, CCVG, of more than 110K cases, each containing fact descriptions paired with corresponding court views. Based on this dataset, ShiZhi achieves 58.5 BLEU-1 on court view generation and 86.1% accuracy with 92.5% macro F1 on charge prediction. Experimental results demonstrate that even a small LLM can generate reasonable and legally coherent court views when trained on high-quality domain-specific data. Our model and dataset are available at this https URL.

Title: Mask Tokens as Prophet: Fine-Grained Cache Eviction for Efficient dLLM Inference

Authors: Jianuo Huang, Yaojie Zhang, Yicun Yang, Benhao Huang, Biqing Qi, Dongrui Liu, Linfeng Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.09309
Pdf URL: https://arxiv.org/pdf/2510.09309
Copy Paste: [[2510.09309]] Mask Tokens as Prophet: Fine-Grained Cache Eviction for Efficient dLLM Inference(https://arxiv.org/abs/2510.09309)
Keywords: language model, llm, long context, prompt
Abstract: Diffusion large language models (dLLMs) present a promising alternative to dominant autoregressive models (ARMs) through their ability to decode in parallel, at the expense of substantial computation and memory costs. Specifically, the cache mechanism for bidirectional attention in dLLMs demands a large memory footprint, restricting their ability to handle long contexts under resource-limited settings. Existing cache eviction strategies are designed for ARMs and ignore the unique characteristics of dLLMs, thus leading to unsatisfactory performance. To address these challenges, we introduce MaskKV, a training-free cache eviction framework tailored to dLLMs, focusing on the effect of mask tokens in dLLMs. MaskKV is built on two key innovations: (1) a mask-query guided scoring mechanism that leverages attention weights to identify and evict less critical prompt tokens for each head; (2) an adaptive cache budgeting strategy that improves efficiency by reducing allocation in intermediate layers and concentrating resources on prompt-preferring heads. On LLaDA with MaskKV, compressing the KV cache to only 256 pairs (less than 5% of tokens) retains 94% of the full-cache performance on LongBench and achieves up to 31x acceleration at 32k prompt length. The code is publicly available at: this https URL
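
Sketch (illustrative, not the MaskKV implementation): the mask-query guided scoring idea in the abstract can be pictured as ranking prompt tokens by the attention they receive from mask-token queries and keeping only the top entries within a budget. The abstract does not give the scoring rule, so summing attention weights over mask queries is an assumption here, and the function name `evict_prompt_tokens` and the toy matrix are hypothetical.

```python
def evict_prompt_tokens(attn, budget):
    """Select prompt tokens to keep in the cache for a single head.

    attn[q][t] is the attention weight from mask query q to prompt token t.
    ASSUMPTION: a token's score is its attention summed over mask queries.
    Returns the kept token indices in their original order.
    """
    num_tokens = len(attn[0])
    scores = [sum(row[t] for row in attn) for t in range(num_tokens)]
    ranked = sorted(range(num_tokens), key=lambda t: scores[t], reverse=True)
    return sorted(ranked[:budget])


# Toy attention from 2 mask queries to 4 prompt tokens (invented values).
attn = [
    [0.1, 0.6, 0.1, 0.2],
    [0.2, 0.5, 0.0, 0.3],
]
kept = evict_prompt_tokens(attn, budget=2)
```

Here tokens 1 and 3 receive the most attention from the mask queries, so they are the two entries retained under a budget of 2.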

Title: Verifying Chain-of-Thought Reasoning via Its Computational Graph

Authors: Zheng Zhao, Yeskendir Koishekenov, Xianjun Yang, Naila Murray, Nicola Cancedda
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.09312
Pdf URL: https://arxiv.org/pdf/2510.09312
Copy Paste: [[2510.09312]] Verifying Chain-of-Thought Reasoning via Its Computational Graph(https://arxiv.org/abs/2510.09312)
Keywords: llm, chain-of-thought
Abstract: Current Chain-of-Thought (CoT) verification methods predict reasoning correctness based on outputs (black-box) or activations (gray-box), but offer limited insight into why a computation fails. We introduce a white-box method: Circuit-based Reasoning Verification (CRV). We hypothesize that attribution graphs of correct CoT steps, viewed as execution traces of the model's latent reasoning circuits, possess distinct structural fingerprints from those of incorrect steps. By training a classifier on structural features of these graphs, we show that these traces contain a powerful signal of reasoning errors. Our white-box approach yields novel scientific insights unattainable by other methods. (1) We demonstrate that structural signatures of error are highly predictive, establishing the viability of verifying reasoning directly via its computational graph. (2) We find these signatures to be highly domain-specific, revealing that failures in different reasoning tasks manifest as distinct computational patterns. (3) We provide evidence that these signatures are not merely correlational; by using our analysis to guide targeted interventions on individual transcoder features, we successfully correct the model's faulty reasoning. Our work shows that, by scrutinizing a model's computational process, we can move from simple error detection to a deeper, causal understanding of LLM reasoning.

Title: FLRC: Fine-grained Low-Rank Compressor for Efficient LLM Inference

Authors: Yu-Chen Lu, Chong-Yan Chen, Chi-Chih Chang, Yu-Fang Hu, Kai-Chiang Wu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.09332
Pdf URL: https://arxiv.org/pdf/2510.09332
Copy Paste: [[2510.09332]] FLRC: Fine-grained Low-Rank Compressor for Efficient LLM Inference(https://arxiv.org/abs/2510.09332)
Keywords: language model, llm
Abstract: Although large language models (LLMs) have achieved remarkable performance, their enormous parameter counts hinder deployment on resource-constrained hardware. Low-rank compression can reduce both memory usage and computational demand, but applying a uniform compression ratio across all layers often leads to significant performance degradation, and previous methods perform poorly during decoding. To address these issues, we propose the Fine-grained Low-Rank Compressor (FLRC), which efficiently determines an optimal rank allocation for each layer, and incorporates progressive low-rank decoding to maintain text generation quality. Comprehensive experiments on diverse benchmarks demonstrate the superiority of FLRC, achieving up to a 17% improvement in ROUGE-L on summarization tasks compared to state-of-the-art low-rank compression methods, establishing a more robust and efficient framework to improve LLM inference.
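
Sketch (illustrative arithmetic only, not the FLRC algorithm): replacing an m x n weight matrix with two rank-r factors stores r * (m + n) parameters instead of m * n, which is why the rank chosen for each layer determines its compression ratio. The layer names, shapes, and ranks below are hypothetical and are not taken from the paper.

```python
def low_rank_params(rows, cols, rank):
    """Parameters stored by a rank-`rank` factorization of a rows x cols matrix."""
    return rank * (rows + cols)


def compression_ratio(rows, cols, rank):
    """Fraction of the dense parameter count kept after factorization."""
    return low_rank_params(rows, cols, rank) / (rows * cols)


# Hypothetical per-layer rank allocation: (rows, cols, rank).
allocation = {
    "layer0.proj": (4096, 4096, 512),
    "layer1.proj": (4096, 4096, 1024),
}
ratios = {name: compression_ratio(*shape) for name, shape in allocation.items()}
```

In this toy allocation the first layer keeps a quarter of its dense parameters and the second keeps half, showing how a non-uniform rank assignment spends the budget unevenly across layers.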

Title: LLP: LLM-based Product Pricing in E-commerce

Authors: Hairu Wang, Sheng You, Qiheng Zhang, Xike Xie, Shuguang Han, Yuchen Wu, Fei Huang, Jufeng Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.09347
Pdf URL: https://arxiv.org/pdf/2510.09347
Copy Paste: [[2510.09347]] LLP: LLM-based Product Pricing in E-commerce(https://arxiv.org/abs/2510.09347)
Keywords: language model, llm
Abstract: Unlike Business-to-Consumer e-commerce platforms (e.g., Amazon), inexperienced individual sellers on Consumer-to-Consumer platforms (e.g., eBay) often face significant challenges in setting prices for their second-hand products efficiently. Therefore, numerous studies have been proposed for automating price prediction. However, most of them are based on static regression models, which suffer from poor generalization performance and fail to capture market dynamics (e.g., the price of a used iPhone decreases over time). Inspired by recent breakthroughs in Large Language Models (LLMs), we introduce LLP, the first LLM-based generative framework for second-hand product pricing. LLP first retrieves similar products to better align with the dynamic market change. Afterwards, it leverages the LLMs' nuanced understanding of key pricing information in free-form text to generate accurate price suggestions. To strengthen the LLMs' domain reasoning over retrieved products, we apply a two-stage optimization, supervised fine-tuning (SFT) followed by group relative policy optimization (GRPO), on a dataset built via bidirectional reasoning. Moreover, LLP employs a confidence-based filtering mechanism to reject unreliable price suggestions. Extensive experiments demonstrate that LLP substantially surpasses existing methods while generalizing well to unseen categories. We have successfully deployed LLP on Xianyu (China's largest second-hand e-commerce platform), significantly outperforming the previous pricing method. Under the same 30% product coverage, it raises the static adoption rate (SAR) from 40% to 72%, and maintains a strong SAR of 47% even at 90% recall.

Title: ReTraceQA: Evaluating Reasoning Traces of Small Language Models in Commonsense Question Answering

Authors: Francesco Maria Molfese, Luca Moroni, Ciro Porcaro, Simone Conia, Roberto Navigli
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.09351
Pdf URL: https://arxiv.org/pdf/2510.09351
Copy Paste: [[2510.09351]] ReTraceQA: Evaluating Reasoning Traces of Small Language Models in Commonsense Question Answering(https://arxiv.org/abs/2510.09351)
Keywords: language model, llm
Abstract: While Small Language Models (SLMs) have demonstrated promising performance on an increasingly wide array of commonsense reasoning benchmarks, current evaluation practices rely almost exclusively on the accuracy of their final answers, neglecting the validity of the reasoning processes that lead to those answers. To address this issue, we introduce ReTraceQA, a novel benchmark that introduces process-level evaluation for commonsense reasoning tasks. Our expert-annotated dataset reveals that in a substantial portion of instances (14-24%), SLMs provide correct final answers despite flawed reasoning processes, suggesting that the capabilities of SLMs are often overestimated by evaluation metrics that focus only on comparing the final answer with the ground truth. Indeed, we show that when employing strong Large Language Models (LLMs) as automated judges for reasoning-aware evaluation rather than answer-only metrics, SLM performance drops significantly across all models and datasets, with scores decreasing by up to 25%.
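
Sketch (illustrative, with invented toy records): the gap the ReTraceQA abstract describes is the difference between answer-only accuracy and an accuracy that also requires the reasoning trace to be judged valid. The field names `answer_correct` and `reasoning_valid` and both function names are hypothetical; how validity is judged (expert annotation or an LLM judge) is outside this sketch.

```python
def answer_only_accuracy(records):
    """Share of instances whose final answer matches the ground truth."""
    return sum(r["answer_correct"] for r in records) / len(records)


def reasoning_aware_accuracy(records):
    """Share of instances with a correct answer AND a valid reasoning trace."""
    hits = sum(r["answer_correct"] and r["reasoning_valid"] for r in records)
    return hits / len(records)


# Invented toy data: the second record is "right for the wrong reason".
records = [
    {"answer_correct": True, "reasoning_valid": True},
    {"answer_correct": True, "reasoning_valid": False},
    {"answer_correct": False, "reasoning_valid": False},
    {"answer_correct": True, "reasoning_valid": True},
]
answer_acc = answer_only_accuracy(records)
reasoning_acc = reasoning_aware_accuracy(records)
```

The difference `answer_acc - reasoning_acc` is the share of instances that answer-only scoring credits despite flawed reasoning.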

Title: Logit Arithmetic Elicits Long Reasoning Capabilities Without Training

Authors: Yunxiang Zhang, Muhammad Khalifa, Lechen Zhang, Xin Liu, Ayoung Lee, Xinliang Frederick Zhang, Farima Fatahi Bayat, Lu Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.09354
Pdf URL: https://arxiv.org/pdf/2510.09354
Copy Paste: [[2510.09354]] Logit Arithmetic Elicits Long Reasoning Capabilities Without Training(https://arxiv.org/abs/2510.09354)
Keywords: chain-of-thought
Abstract: Large reasoning models exhibit long chain-of-thought reasoning with strategies such as backtracking and self-correction, though recent studies suggest that these abilities typically require additional training. We first investigate whether such behaviors can be elicited without any training. To this end, we propose a decoding-time approach, ThinkLogit, which utilizes logit arithmetic to tune a target large non-reasoning model for long reasoning using a substantially smaller reasoning model as the guider. We then show that we can further boost its performance by training the guider model with preference optimization over correct/incorrect reasoning pairs sampled from both the target and guider model, a setup we refer to as ThinkLogit-DPO. Our experiments demonstrate that ThinkLogit and ThinkLogit-DPO achieve a relative improvement in average accuracy by 24.5% and 29.1%, respectively, over five reasoning benchmarks using the Qwen2.5-32B guided by R1-Distill-Qwen-1.5B, a model 21x smaller. Moreover, we find that ThinkLogit remains effective when the guider and target come from different model families. It is also orthogonal to post-training methods for small models, as guiders improved through supervised distillation or reinforcement learning can be directly plugged in to yield stronger large models, offering a practical path to unlock long reasoning in large-scale models without costly post-training.
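
The abstract above does not state the exact combination rule, so the following is only a minimal sketch of decoding-time logit arithmetic under an assumed form: the target model's next-token logits are shifted by the difference between a small reasoning guider and a small non-reasoning base model. The function name `combine_logits`, the `guider_base` input, and the weight `alpha` are hypothetical and are not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def combine_logits(target, guider, guider_base, alpha=1.0):
    """Assumed rule: target + alpha * (guider - guider_base), per vocabulary entry."""
    return [t + alpha * (g - b) for t, g, b in zip(target, guider, guider_base)]


# Toy 3-token vocabulary; all numbers are made up for illustration.
target = [2.0, 1.0, 0.0]       # large non-reasoning model
guider = [0.0, 3.0, 0.0]       # small reasoning model
guider_base = [0.0, 1.0, 0.0]  # small non-reasoning counterpart (assumption)

combined = combine_logits(target, guider, guider_base)  # [2.0, 3.0, 0.0]
probs = softmax(combined)
next_token = max(range(len(probs)), key=probs.__getitem__)  # index 1
```

In this toy case the target alone would pick token 0, while the guider's offset moves the greedy choice to token 1. No weights of the large model are updated.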

Title: NL2GenSym: Natural Language to Generative Symbolic Rules for SOAR Cognitive Architecture via Large Language Models

Authors: Fang Yuan, Junjie Zeng, Yue Hu, Zhengqiu Zhu, Quanjun Yin, Yuxiang Xie
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.09355
Pdf URL: https://arxiv.org/pdf/2510.09355
Copy Paste: [[2510.09355]] NL2GenSym: Natural Language to Generative Symbolic Rules for SOAR Cognitive Architecture via Large Language Models(https://arxiv.org/abs/2510.09355)
Keywords: language model, llm, retrieval-augmented generation, agent
Abstract: SOAR, a classic symbol-based cognitive architecture, has been fostering the development of general, human-like intelligent agents. Nevertheless, its practical adoption is hindered by laborious manual rule coding. Emerging Large Language Models (LLMs) show immense potential for efficient rule generation. However, a critical gap remains: current research predominantly focuses on conceptual frameworks and lacks robust experimental validation. To bridge this gap, we propose Natural Language to Generative Symbolic Rules (NL2GenSym), a novel framework that integrates LLMs with SOAR to autonomously produce generative symbolic rules from natural language. Specifically, our framework introduces a novel Execution-Grounded Generator-Critic mechanism. The LLM-based Generator, guided by a Retrieval-Augmented Generation-accessed self-evolving domain knowledge base, proposes rules from natural language. Subsequently, these rules are immediately executed within the SOAR environment to rigorously validate their correctness. Based on this execution-grounded feedback, a reflective LLM-based Critic drives the iterative refinement of these rules. Experiments on our specialized Water Jug Problem (WJP) dataset, utilizing both Gemini and Qwen series models, validate the efficacy of our framework. It achieves a success rate over 86% in generating rules from natural language. Crucially, the framework also generates novel heuristic rules, reducing average decision cycles for solving the WJP to 1.98 times the optimal solution and 1/1000 of baseline methods. Additionally, our initial experiments show that NL2GenSym enables smaller-parameter models to achieve better performance than larger counterparts.

Title: Understanding the Effects of Domain Finetuning on LLMs

Authors: Eshaan Tanwar, Deepak Nathani, William Yang Wang, Tanmoy Chakraborty
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.09359
Pdf URL: https://arxiv.org/pdf/2510.09359
Copy Paste: [[2510.09359]] Understanding the Effects of Domain Finetuning on LLMs(https://arxiv.org/abs/2510.09359)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) fine-tuned for specific domains exhibit strong performance; however, the underlying mechanisms by which this fine-tuning reshapes their parametric space are not well understood. Prior works primarily focus on auto-regressive or general-purpose instruct models, leaving domain-specialised LLMs under-explored. We present the first systematic study of domain-specific fine-tuning in large medical language models. Our analysis reveals that fine-tuning modifies only a small subset of the representational subspace, essentially preserving the pre-trained model's representation. To interpret these changes in subspaces, we propose tuning vectors, a novel framework inspired by task vectors, which explicitly capture the directional parameter shifts induced by fine-tuning. We demonstrate that these vectors are critical for enhancing both instruction-following and generation quality. Furthermore, combining tuning vectors across different domains yields improved generalisation. Upon closer inspection of directional alignment, we find these vectors primarily write new directional information into the MLP layers of the model, while amplifying existing directions in attention heads. Our findings offer new insights into LLM adaptation and provide a general, interpretable framework for analysing specialisation in large language models.
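
The idea of a tuning vector as a directional parameter shift can be illustrated with a small sketch. It assumes the simplest task-vector-style definition (fine-tuned weights minus pre-trained weights), which the abstract suggests but does not spell out. The layer name `mlp`, the second "legal" domain, and all numbers are hypothetical.

```python
import math


def tuning_vector(pretrained, finetuned):
    """Per-layer parameter shift: finetuned - pretrained (assumed definition)."""
    return {
        name: [f - p for p, f in zip(pretrained[name], finetuned[name])]
        for name in pretrained
    }


def apply_vectors(pretrained, vectors, scale=1.0):
    """Add one or more tuning vectors back onto the pre-trained weights."""
    out = {name: list(w) for name, w in pretrained.items()}
    for vec in vectors:
        for name, delta in vec.items():
            out[name] = [w + scale * d for w, d in zip(out[name], delta)]
    return out


def cosine(u, v):
    """Cosine similarity, used here as a simple directional-alignment measure."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


# Toy two-parameter "layer"; values are illustrative only.
pretrained = {"mlp": [1.0, 0.0]}
medical = {"mlp": [1.0, 2.0]}
legal = {"mlp": [3.0, 0.0]}

tv_medical = tuning_vector(pretrained, medical)  # {"mlp": [0.0, 2.0]}
tv_legal = tuning_vector(pretrained, legal)      # {"mlp": [2.0, 0.0]}
merged = apply_vectors(pretrained, [tv_medical, tv_legal])
alignment = cosine(tv_medical["mlp"], tv_legal["mlp"])
```

Here the two shifts are orthogonal, so adding both leaves each domain's direction intact in the merged weights.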

Title: Token-Level Policy Optimization: Linking Group-Level Rewards to Token-Level Aggregation via Markov Likelihood

Authors: Xingyu Lin, Yilin Wen, En Wang, Du Su, Wenbin Liu, Chenfu Bao, Zhonghou Lv
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.09369
Pdf URL: https://arxiv.org/pdf/2510.09369
Copy Paste: [[2510.09369]] Token-Level Policy Optimization: Linking Group-Level Rewards to Token-Level Aggregation via Markov Likelihood(https://arxiv.org/abs/2510.09369)
Keywords: language model, llm, chain-of-thought
Abstract: Group Relative Policy Optimization (GRPO) has significantly advanced the reasoning ability of large language models (LLMs), particularly by boosting their mathematical performance. However, GRPO and related entropy-regularization methods still face challenges rooted in the sparse token rewards inherent to chain-of-thought (CoT). Current approaches often rely on undifferentiated token-level entropy adjustments, which frequently lead to entropy collapse or model collapse. In this work, we propose TEPO, a novel token-level framework that incorporates Markov Likelihood (sequence likelihood) to link group-level rewards with tokens via token-level aggregation. Experiments show that TEPO consistently outperforms existing baselines across key metrics (including @k and accuracy). It not only sets a new state of the art on mathematical reasoning tasks but also significantly enhances training stability.

Title: Beyond Single-Granularity Prompts: A Multi-Scale Chain-of-Thought Prompt Learning for Graph

Authors: Ziyu Zheng, Yaming Yang, Ziyu Guan, Wei Zhao, Xinyan Huang, Weigang Lu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.09394
Pdf URL: https://arxiv.org/pdf/2510.09394
Copy Paste: [[2510.09394]] Beyond Single-Granularity Prompts: A Multi-Scale Chain-of-Thought Prompt Learning for Graph(https://arxiv.org/abs/2510.09394)
Keywords: prompt, chain-of-thought
Abstract: The "pre-train, prompt" paradigm, designed to bridge the gap between pre-training tasks and downstream objectives, has been extended from the NLP domain to the graph domain and has achieved remarkable progress. Current mainstream graph prompt-tuning methods modify input or output features using learnable prompt vectors. However, existing approaches are confined to a single granularity (e.g., node-level or subgraph-level) during prompt generation, overlooking the inherently multi-scale structural information in graph data, which limits the diversity of prompt semantics. To address this issue, we pioneer the integration of multi-scale information into graph prompts and propose a Multi-Scale Graph Chain-of-Thought (MSGCOT) prompting framework. Specifically, we design a lightweight, low-rank coarsening network to efficiently capture multi-scale structural features as hierarchical basis vectors for prompt generation. Subsequently, mimicking human cognition from coarse-to-fine granularity, we dynamically integrate multi-scale information at each reasoning step, forming a progressive coarse-to-fine prompt chain. Extensive experiments on eight benchmark datasets demonstrate that MSGCOT outperforms the state-of-the-art single-granularity graph prompt-tuning method, particularly in few-shot scenarios.

Title: Active Model Selection for Large Language Models

Authors: Yavuz Durmazkeser, Patrik Okanovic, Andreas Kirsch, Torsten Hoefler, Nezihe Merve Gürel
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2510.09418
Pdf URL: https://arxiv.org/pdf/2510.09418
Copy Paste: [[2510.09418]] Active Model Selection for Large Language Models(https://arxiv.org/abs/2510.09418)
Keywords: language model, llm
Abstract: We introduce LLM SELECTOR, the first framework for active model selection of Large Language Models (LLMs). Unlike prior evaluation and benchmarking approaches that rely on fully annotated datasets, LLM SELECTOR efficiently identifies the best LLM with limited annotations. In particular, for any given task, LLM SELECTOR adaptively selects a small set of queries to annotate that are most informative about the best model for the task. To further reduce annotation cost, we leverage a judge-based oracle annotation model. Through extensive experiments on 6 benchmarks with 151 LLMs, we show that LLM SELECTOR reduces annotation costs by up to 59.62% when selecting the best and near-best LLM for the task.

Title: On the Representations of Entities in Auto-regressive Large Language Models

Authors: Victor Morand, Josiane Mothe, Benjamin Piwowarski
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.09421
Pdf URL: https://arxiv.org/pdf/2510.09421
Copy Paste: [[2510.09421]] On the Representations of Entities in Auto-regressive Large Language Models(https://arxiv.org/abs/2510.09421)
Keywords: language model, llm
Abstract: Named entities are fundamental building blocks of knowledge in text, grounding factual information and structuring relationships within language. Despite their importance, it remains unclear how Large Language Models (LLMs) internally represent entities. Prior research has primarily examined explicit relationships, but little is known about entity representations themselves. We introduce entity mention reconstruction as a novel framework for studying how LLMs encode and manipulate entities. We investigate whether entity mentions can be generated from internal representations, how multi-token entities are encoded beyond last-token embeddings, and whether these representations capture relational knowledge. Our proposed method, leveraging _task vectors_, allows consistent generation of multi-token mentions from various entity representations derived from the LLM's hidden states. We thus introduce the _Entity Lens_, extending the _logit-lens_ to predict multi-token mentions. Our results bring new evidence that LLMs develop entity-specific mechanisms to represent and manipulate any multi-token entities, including those unseen during training. Our code is available at this https URL.

Title: The Speech-LLM Takes It All: A Truly Fully End-to-End Spoken Dialogue State Tracking Approach

Authors: Nizar El Ghazal, Antoine Caubrière, Valentin Vielzeuf
Subjects: cs.CL, cs.AI, cs.LG, eess.AS
Abstract URL: https://arxiv.org/abs/2510.09424
Pdf URL: https://arxiv.org/pdf/2510.09424
Copy Paste: [[2510.09424]] The Speech-LLM Takes It All: A Truly Fully End-to-End Spoken Dialogue State Tracking Approach(https://arxiv.org/abs/2510.09424)
Keywords: llm
Abstract: This paper presents a comparative study of context management strategies for end-to-end Spoken Dialog State Tracking using Speech-LLMs. We systematically evaluate traditional multimodal context (combining text history and spoken current turn), full spoken history, and compressed spoken history approaches. Our experiments on the SpokenWOZ corpus demonstrate that providing the full spoken conversation as input yields the highest performance among models of similar size, significantly surpassing prior methods. Furthermore, we show that attention-pooling-based compression of the spoken history offers a strong trade-off, maintaining competitive accuracy with reduced context size. Detailed analysis confirms that improvements stem from more effective context utilization.

Title: KORMo: Korean Open Reasoning Model for Everyone

Authors: Minjun Kim, Hyeonseok Lim, Hangyeol Yoo, Inho Won, Seungwoo Song, Minkyung Cho, Junhun Yuk, Changsu Choi, Dongjae Shin, Huige Lee, Hoyun Song, Alice Oh, Kyungtae Lim
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.09426
Pdf URL: https://arxiv.org/pdf/2510.09426
Copy Paste: [[2510.09426]] KORMo: Korean Open Reasoning Model for Everyone(https://arxiv.org/abs/2510.09426)
Keywords: language model, llm
Abstract: This work presents the first large-scale investigation into constructing a fully open bilingual large language model (LLM) for a non-English language, specifically Korean, trained predominantly on synthetic data. We introduce KORMo-10B, a 10.8B-parameter model trained from scratch on a Korean-English corpus in which 68.74% of the Korean portion is synthetic. Through systematic experimentation, we demonstrate that synthetic data, when carefully curated with balanced linguistic coverage and diverse instruction styles, does not cause instability or degradation during large-scale pretraining. Furthermore, the model achieves performance comparable to that of contemporary open-weight multilingual baselines across a wide range of reasoning, knowledge, and instruction-following benchmarks. Our experiments reveal two key findings: (1) synthetic data can reliably sustain long-horizon pretraining without model collapse, and (2) bilingual instruction tuning enables near-native reasoning and discourse coherence in Korean. By fully releasing all components including data, code, training recipes, and logs, this work establishes a transparent framework for developing synthetic data-driven fully open models (FOMs) in low-resource settings and sets a reproducible precedent for future multilingual LLM research.

Title: Domain-Adapted Pre-trained Language Models for Implicit Information Extraction in Crash Narratives

Authors: Xixi Wang, Jordanka Kovaceva, Miguel Costa, Shuai Wang, Francisco Camara Pereira, Robert Thomson
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.09434
Pdf URL: https://arxiv.org/pdf/2510.09434
Copy Paste: [[2510.09434]] Domain-Adapted Pre-trained Language Models for Implicit Information Extraction in Crash Narratives(https://arxiv.org/abs/2510.09434)
Keywords: language model, gpt, llm
Abstract: Free-text crash narratives recorded in real-world crash databases have been shown to play a significant role in improving traffic safety. However, large-scale analyses remain difficult to implement as there are no documented tools that can batch process the unstructured, non-standardized text content written by various authors with diverse experience and attention to detail. In recent years, Transformer-based pre-trained language models (PLMs), such as Bidirectional Encoder Representations from Transformers (BERT) and large language models (LLMs), have demonstrated strong capabilities across various natural language processing tasks. These models can extract explicit facts from crash narratives, but their performance declines on inference-heavy tasks such as Crash Type identification, which can involve nearly 100 categories. Moreover, relying on closed LLMs through external APIs raises privacy concerns for sensitive crash data. Additionally, these black-box tools often underperform due to limited domain knowledge. Motivated by these challenges, we study whether compact open-source PLMs can support reasoning-intensive extraction from crash narratives. We target two challenging objectives: 1) identifying the Manner of Collision for a crash, and 2) identifying the Crash Type for each vehicle involved in the crash event, from real-world crash narratives. To bridge domain gaps, we apply fine-tuning techniques to inject task-specific knowledge into LLMs with Low-Rank Adaptation (LoRA) and into BERT. Experiments on the authoritative real-world dataset Crash Investigation Sampling System (CISS) demonstrate that our fine-tuned compact models outperform strong closed LLMs, such as GPT-4o, while requiring only minimal training resources. Further analysis reveals that the fine-tuned PLMs can capture richer narrative details and even correct some mislabeled annotations in the dataset.

Title: Getting Your Indices in a Row: Full-Text Search for LLM Training Data for Real World

Authors: Ines Altemir Marinas, Anastasiia Kucherenko, Alexander Sternfeld, Andrei Kucharavy
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.09471
Pdf URL: https://arxiv.org/pdf/2510.09471
Copy Paste: [[2510.09471]] Getting Your Indices in a Row: Full-Text Search for LLM Training Data for Real World(https://arxiv.org/abs/2510.09471)
Keywords: language model, llm
Abstract: The performance of Large Language Models (LLMs) is determined by their training data. Despite the proliferation of open-weight LLMs, access to LLM training data has remained limited. Even for fully open LLMs, the scale of the data makes it all but inscrutable to the general scientific community, despite potentially containing critical data scraped from the internet. In this paper, we present the full-text indexing pipeline for the Apertus LLM training data. Leveraging Elasticsearch parallel indices and the Alps infrastructure, a state-of-the-art, highly energy-efficient arm64 supercluster, we were able to index 8.6T tokens out of 15.2T used to train the Apertus LLM family, creating both a critical LLM safety tool and effectively an offline, curated, open web search engine. Our contribution is threefold. First, we demonstrate that Elasticsearch can be successfully ported onto next-generation arm64-based infrastructure. Second, we demonstrate that full-text indexing at the scale of modern LLM training datasets and the entire open web is feasible and accessible. Finally, we demonstrate that such indices can be used to ensure previously inaccessible jailbreak-agnostic LLM safety. We hope that our findings will be useful to other teams attempting large-scale data indexing and facilitate the general transition towards greener computation.

Title: Hybrid Models for Natural Language Reasoning: The Case of Syllogistic Logic

Authors: Manuel Vargas Guzmán, Jakub Szymanik, Maciej Malicki
Subjects: cs.CL, cs.LG, cs.LO
Abstract URL: https://arxiv.org/abs/2510.09472
Pdf URL: https://arxiv.org/pdf/2510.09472
Copy Paste: [[2510.09472]] Hybrid Models for Natural Language Reasoning: The Case of Syllogistic Logic(https://arxiv.org/abs/2510.09472)
Keywords: language model, llm
Abstract: Despite the remarkable progress in neural models, their ability to generalize, a cornerstone for applications like logical reasoning, remains a critical challenge. We delineate two fundamental aspects of this ability: compositionality, the capacity to abstract atomic logical rules underlying complex inferences, and recursiveness, the aptitude to build intricate representations through iterative application of inference rules. In the literature, these two aspects are often conflated under the umbrella term of generalization. To sharpen this distinction, we investigated the logical generalization capabilities of pre-trained large language models (LLMs) using the syllogistic fragment as a benchmark for natural language reasoning. Though simple, this fragment provides a foundational yet expressive subset of formal logic that supports controlled evaluation of essential reasoning abilities. Our findings reveal a significant disparity: while LLMs demonstrate reasonable proficiency in recursiveness, they struggle with compositionality. To overcome these limitations and establish a reliable logical prover, we propose a hybrid architecture integrating symbolic reasoning with neural computation. This synergistic interaction enables robust and efficient inference: neural components accelerate processing, while symbolic reasoning ensures completeness. Our experiments show that high efficiency is preserved even with relatively small neural components. As part of our proposed methodology, this analysis gives a rationale and highlights the potential of hybrid models to effectively address key generalization barriers in neural reasoning systems.

Title: Multimodal Policy Internalization for Conversational Agents

Authors: Zhenhailong Wang, Jiateng Liu, Amin Fazel, Ritesh Sarkhel, Xing Fan, Xiang Li, Chenlei Guo, Heng Ji, Ruhi Sarikaya
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.09474
Pdf URL: https://arxiv.org/pdf/2510.09474
Copy Paste: [[2510.09474]] Multimodal Policy Internalization for Conversational Agents(https://arxiv.org/abs/2510.09474)
Keywords: gpt, llm, prompt, chat, agent
Abstract: Modern conversational agents like ChatGPT and Alexa+ rely on predefined policies specifying metadata, response styles, and tool-usage rules. As these LLM-based systems expand to support diverse business and user queries, such policies, often implemented as in-context prompts, are becoming increasingly complex and lengthy, making faithful adherence difficult and imposing large fixed computational costs. With the rise of multimodal agents, policies that govern visual and multimodal behaviors are critical but remain understudied. Prior prompt-compression work mainly shortens task templates and demonstrations, while existing policy-alignment studies focus only on text-based safety rules. We introduce Multimodal Policy Internalization (MPI), a new task that internalizes reasoning-intensive multimodal policies into model parameters, enabling stronger policy-following without including the policy during inference. MPI poses unique data and algorithmic challenges. We build two datasets spanning synthetic and real-world decision-making and tool-using tasks and propose TriMPI, a three-stage training framework. TriMPI first injects policy knowledge via continual pretraining, then performs supervised finetuning, and finally applies PolicyRollout, a GRPO-style reinforcement learning extension that augments rollouts with policy-aware responses for grounded exploration. TriMPI achieves notable gains in end-to-end accuracy, generalization, and robustness to forgetting. As the first work on multimodal policy internalization, we provide datasets, training recipes, and comprehensive evaluations to foster future research. Project page: this https URL.

Title: StatEval: A Comprehensive Benchmark for Large Language Models in Statistics

Authors: Yuchen Lu, Run Yang, Yichen Zhang, Shuguang Yu, Runpeng Dai, Ziwei Wang, Jiayi Xiang, Wenxin E, Siran Gao, Xinyao Ruan, Yirui Huang, Chenjing Xi, Haibo Hu, Yueming Fu, Qinglan Yu, Xiaobing Wei, Jiani Gu, Rui Sun, Jiaxuan Jia, Fan Zhou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.09517
Pdf URL: https://arxiv.org/pdf/2510.09517
Copy Paste: [[2510.09517]] StatEval: A Comprehensive Benchmark for Large Language Models in Statistics(https://arxiv.org/abs/2510.09517)
Keywords: language model, gpt, llm, agent
Abstract: Large language models (LLMs) have demonstrated remarkable advances in mathematical and logical reasoning, yet statistics, as a distinct and integrative discipline, remains underexplored in benchmarking efforts. To address this gap, we introduce StatEval, the first comprehensive benchmark dedicated to statistics, spanning both breadth and depth across difficulty levels. StatEval consists of 13,817 foundational problems covering undergraduate and graduate curricula, together with 2,374 research-level proof tasks extracted from leading journals. To construct the benchmark, we design a scalable multi-agent pipeline with human-in-the-loop validation that automates large-scale problem extraction, rewriting, and quality control, while ensuring academic rigor. We further propose a robust evaluation framework tailored to both computational and proof-based tasks, enabling fine-grained assessment of reasoning ability. Experimental results reveal that even closed-source models such as GPT5-mini achieve below 57% on research-level problems, with open-source models performing significantly lower. These findings highlight the unique challenges of statistical reasoning and the limitations of current LLMs. We expect StatEval to serve as a rigorous benchmark for advancing statistical intelligence in large language models. All data and code are available on our web platform: this https URL.

Title: Can We Reliably Rank Model Performance across Domains without Labeled Data?

Authors: Veronica Rammouz, Aaron Gonzalez, Carlos Cruzportillo, Adrian Tan, Nicole Beebe, Anthony Rios
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.09519
Pdf URL: https://arxiv.org/pdf/2510.09519
Copy Paste: [[2510.09519]] Can We Reliably Rank Model Performance across Domains without Labeled Data?(https://arxiv.org/abs/2510.09519)
Keywords: language model
Abstract: Estimating model performance without labels is an important goal for understanding how NLP models generalize. While prior work has proposed measures based on dataset similarity or predicted correctness, it remains unclear when these estimates produce reliable performance rankings across domains. In this paper, we analyze the factors that affect ranking reliability using a two-step evaluation setup with four base classifiers and several large language models as error predictors. Experiments on the GeoOLID and Amazon Reviews datasets, spanning 15 domains, show that large language model-based error predictors produce stronger and more consistent rank correlations with true accuracy than drift-based or zero-shot baselines. Our analysis reveals two key findings: ranking is more reliable when performance differences across domains are larger, and when the error model's predictions align with the base model's true failure patterns. These results clarify when performance estimation methods can be trusted and provide guidance for their use in cross-domain model evaluation.
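
To make "rank correlation with true accuracy" concrete, here is a minimal sketch of Spearman's rank correlation between estimated and true per-domain accuracy. The abstract does not say which correlation coefficient the authors use, so Spearman is an assumption. The accuracy values are invented, and this simple formula is only valid when there are no ties.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(x, y):
    """Spearman's rho via the no-ties formula 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Hypothetical accuracies for four domains.
true_accuracy = [0.91, 0.75, 0.83, 0.60]
estimated_accuracy = [0.88, 0.80, 0.78, 0.55]

rho = spearman(estimated_accuracy, true_accuracy)
```

In this toy case the estimates swap the order of the two middle domains, giving a rho of about 0.8. A value of 1.0 would mean the estimated ranking matches the true ranking exactly.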

Title: Evaluating Robustness of Large Language Models Against Multilingual Typographical Errors

Authors: Yihong Liu, Raoyuan Zhao, Lena Altinger, Hinrich Schütze, Michael A. Hedderich
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.09536
Pdf URL: https://arxiv.org/pdf/2510.09536
Copy Paste: [[2510.09536]] Evaluating Robustness of Large Language Models Against Multilingual Typographical Errors(https://arxiv.org/abs/2510.09536)
Keywords: language model, llm
Abstract: Large language models (LLMs) are increasingly deployed in multilingual, real-world applications with user inputs -- naturally introducing typographical errors (typos). Yet most benchmarks assume clean input, leaving the robustness of LLMs to typos across languages largely underexplored. To address this gap, we introduce MulTypo, a multilingual typo generation algorithm that simulates human-like errors based on language-specific keyboard layouts and typing behavior. We evaluate 18 open-source LLMs across three model families and five downstream tasks spanning language inference, multi-choice question answering, mathematical reasoning, and machine translation tasks. Our results show that typos consistently degrade performance, particularly in generative tasks and those requiring reasoning -- while the natural language inference task is comparatively more robust. Instruction tuning improves clean-input performance but may increase brittleness under noise. We also observe language-dependent robustness: high-resource languages are generally more robust than low-resource ones, and translation from English is more robust than translation into English. Our findings underscore the need for noise-aware training and multilingual robustness evaluation. We make our code and data publicly available.
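Note: the idea of keyboard-layout-based typos can be sketched as an adjacent-key substitution. This is not the MulTypo algorithm, whose details the abstract does not give. The partial QWERTY adjacency map and the `inject_typo` helper are hypothetical and chosen only for illustration.

```python
import random

# Hypothetical, partial QWERTY adjacency map (an assumption for this sketch;
# not MulTypo's data). Each key maps to some physically neighbouring keys.
QWERTY_NEIGHBORS = {
    "a": "qwsz", "e": "wsdr", "i": "ujko", "o": "iklp",
    "n": "bhjm", "s": "awedxz", "t": "rfgy",
}


def inject_typo(text, rng):
    """Replace one eligible character with a neighbouring key.

    Returns the text unchanged if no character has an entry in the map.
    """
    positions = [i for i, ch in enumerate(text) if ch in QWERTY_NEIGHBORS]
    if not positions:
        return text
    i = rng.choice(positions)
    replacement = rng.choice(QWERTY_NEIGHBORS[text[i]])
    return text[:i] + replacement + text[i + 1:]


rng = random.Random(0)  # seeded so runs are repeatable
clean = "translation"
noisy = inject_typo(clean, rng)
print(clean, "->", noisy)
```

A fuller simulation would also cover insertions, deletions, and transpositions, and would swap in a language-specific layout for each language, as the abstract suggests.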

Title: SPG: Sandwiched Policy Gradient for Masked Diffusion Language Models

Authors: Chengyu Wang, Paria Rashidinejad, DiJia Su, Song Jiang, Sid Wang, Siyan Zhao, Cai Zhou, Shannon Zejiang Shen, Feiyu Chen, Tommi Jaakkola, Yuandong Tian, Bo Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.09541
Pdf URL: https://arxiv.org/pdf/2510.09541
Copy Paste: [[2510.09541]] SPG: Sandwiched Policy Gradient for Masked Diffusion Language Models(https://arxiv.org/abs/2510.09541)
Keywords: language model, llm
Abstract: Diffusion large language models (dLLMs) are emerging as an efficient alternative to autoregressive models due to their ability to decode multiple tokens in parallel. However, aligning dLLMs with human preferences or task-specific rewards via reinforcement learning (RL) is challenging because their intractable log-likelihood precludes the direct application of standard policy gradient methods. While prior work uses surrogates like the evidence lower bound (ELBO), these one-sided approximations can introduce significant policy gradient bias. To address this, we propose the Sandwiched Policy Gradient (SPG) that leverages both an upper and a lower bound of the true log-likelihood. Experiments show that SPG significantly outperforms baselines based on ELBO or one-step estimation. Specifically, SPG improves the accuracy over state-of-the-art RL methods for dLLMs by 3.6% in GSM8K, 2.6% in MATH500, 18.4% in Countdown and 27.0% in Sudoku.

Title: Beyond Surface Reasoning: Unveiling the True Long Chain-of-Thought Capacity of Diffusion Large Language Models

Authors: Qiguang Chen, Hanjing Li, Libo Qin, Dengyun Peng, Jinhao Liu, Jiangyi Wang, Chengyue Wu, Xie Chen, Yantao Du, Wanxiang Che
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.09544
Pdf URL: https://arxiv.org/pdf/2510.09544
Copy Paste: [[2510.09544]] Beyond Surface Reasoning: Unveiling the True Long Chain-of-Thought Capacity of Diffusion Large Language Models(https://arxiv.org/abs/2510.09544)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Recently, Diffusion Large Language Models (DLLMs) have offered high throughput and effective sequential reasoning, making them a competitive alternative to autoregressive LLMs (ALLMs). However, parallel decoding, which enables simultaneous token updates, conflicts with the causal order often required for rigorous reasoning. We first identify this conflict as the core Parallel-Sequential Contradiction (PSC). Behavioral analyses in both simple and complex reasoning tasks show that DLLMs exhibit genuine parallelism only for directly decidable outputs. As task difficulty increases, they revert to autoregressive-like behavior, a limitation exacerbated by autoregressive prompting, which nearly doubles the number of decoding steps with remasking without improving quality. Moreover, PSC restricts DLLMs' self-reflection, reasoning depth, and exploratory breadth. To further characterize PSC, we introduce three scaling dimensions for DLLMs: parallel, diffusion, and sequential. Empirically, while parallel scaling yields consistent improvements, diffusion and sequential scaling are constrained by PSC. Based on these findings, we propose several practical mitigations (parallel-oriented prompting, diffusion early stopping, and parallel scaling) to reduce PSC-induced ineffectiveness and inefficiencies.

Title: Hierarchical Indexing with Knowledge Enrichment for Multilingual Video Corpus Retrieval

Authors: Yu Wang, Tianhao Tan, Yifei Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.09553
Pdf URL: https://arxiv.org/pdf/2510.09553
Copy Paste: [[2510.09553]] Hierarchical Indexing with Knowledge Enrichment for Multilingual Video Corpus Retrieval(https://arxiv.org/abs/2510.09553)
Keywords: language model, llm
Abstract: Retrieving relevant instructional videos from multilingual medical archives is crucial for answering complex, multi-hop questions across language boundaries. However, existing systems either compress hour-long videos into coarse embeddings or incur prohibitive costs for fine-grained matching. We tackle the Multilingual Video Corpus Retrieval (mVCR) task in the NLPCC-2025 M4IVQA challenge with a multi-stage framework that integrates multilingual semantics, domain terminology, and efficient long-form processing. Video subtitles are divided into semantically coherent chunks, enriched with concise knowledge-graph (KG) facts, and organized into a hierarchical tree whose node embeddings are generated by a language-agnostic multilingual encoder. At query time, the same encoder embeds the input question; a coarse-to-fine tree search prunes irrelevant branches, and only the top-ranked chunks are re-scored by a lightweight large language model (LLM). This design avoids exhaustive cross-encoder scoring while preserving chunk-level precision. Experiments on the mVCR test set demonstrate state-of-the-art performance, and ablation studies confirm the complementary contributions of KG enrichment, hierarchical indexing, and targeted LLM re-ranking. The proposed method offers an accurate and scalable solution for multilingual retrieval in specialized medical video collections.
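Note: the coarse-to-fine tree search in this abstract can be sketched as a level-by-level descent that keeps only the best-scoring nodes at each depth. The `Node` class, the 2-D toy embeddings, and the `beam` parameter are hypothetical, and the final LLM re-scoring step is omitted. This is an illustration under those assumptions, not the authors' implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    embedding: tuple
    children: list = field(default_factory=list)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    return dot(a, b) / ((dot(a, a) ** 0.5) * (dot(b, b) ** 0.5))


def coarse_to_fine(root, query, beam=1):
    """Descend the tree level by level, keeping the `beam` nodes most
    similar to the query at each level; return labels of the leaves reached."""
    frontier, leaves = [root], []
    while frontier:
        next_level = []
        for node in frontier:
            if node.children:
                next_level.extend(node.children)
            else:
                leaves.append(node)
        # Prune: only the top-scoring nodes survive to the next level.
        next_level.sort(key=lambda n: cosine(n.embedding, query), reverse=True)
        frontier = next_level[:beam]
    leaves.sort(key=lambda n: cosine(n.embedding, query), reverse=True)
    return [n.label for n in leaves]


# Toy hierarchy with made-up embeddings (illustrative only).
root = Node("archive", (1.0, 1.0), [
    Node("cardiology", (1.0, 0.0), [
        Node("ecg basics", (0.9, 0.1)),
        Node("stent placement", (0.8, 0.3)),
    ]),
    Node("dermatology", (0.0, 1.0), [
        Node("skin biopsy", (0.1, 0.9)),
    ]),
])
query = (1.0, 0.1)
print(coarse_to_fine(root, query, beam=1))  # prints ['ecg basics']
```

With `beam=1` the dermatology branch is discarded at the first level, so its chunks are never scored. This pruning is what avoids exhaustive chunk-level comparison.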

Title: A Comprehensive Evaluation of Multilingual Chain-of-Thought Reasoning: Performance, Consistency, and Faithfulness Across Languages

Authors: Raoyuan Zhao, Yihong Liu, Hinrich Schütze, Michael A. Hedderich
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.09555
Pdf URL: https://arxiv.org/pdf/2510.09555
Copy Paste: [[2510.09555]] A Comprehensive Evaluation of Multilingual Chain-of-Thought Reasoning: Performance, Consistency, and Faithfulness Across Languages(https://arxiv.org/abs/2510.09555)
Keywords: prompt, chain-of-thought
Abstract: Large reasoning models (LRMs) increasingly rely on step-by-step Chain-of-Thought (CoT) reasoning to improve task performance, particularly in high-resource languages such as English. While recent work has examined final-answer accuracy in multilingual settings, the thinking traces themselves, i.e., the intermediate steps that lead to the final answer, remain underexplored. In this paper, we present the first comprehensive study of multilingual CoT reasoning, evaluating three key dimensions: performance, consistency, and faithfulness. We begin by measuring language compliance, answer accuracy, and answer consistency when LRMs are explicitly instructed or prompt-hacked to think in a target language, revealing strong language preferences and divergent performance across languages. Next, we assess crosslingual consistency of thinking traces by interchanging them between languages. We find that the quality and effectiveness of thinking traces vary substantially depending on the prompt language. Finally, we adapt perturbation-based techniques -- i.e., truncation and error injection -- to probe the faithfulness of thinking traces across languages, showing that models rely on traces to varying degrees. We release our code and data to support future research.

Title: WUGNECTIVES: Novel Entity Inferences of Language Models from Discourse Connectives

Authors: Daniel Brubaker, William Sheffield, Junyi Jessy Li, Kanishka Misra
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.09556
Pdf URL: https://arxiv.org/pdf/2510.09556
Copy Paste: [[2510.09556]] WUGNECTIVES: Novel Entity Inferences of Language Models from Discourse Connectives(https://arxiv.org/abs/2510.09556)
Keywords: language model
Abstract: World knowledge is particularly crucial for predicting the discourse connective that marks the discourse relation between two arguments, and language models (LMs) are generally successful at this task. We flip this premise in our work, and instead study the inverse problem of understanding whether discourse connectives can inform LMs about the world. To this end, we present WUGNECTIVES, a dataset of 8,880 stimuli that evaluates LMs' inferences about novel entities in contexts where connectives link the entities to particular attributes. Investigating 17 different LMs at various scales and training regimens, we found that tuning an LM to show reasoning behavior yields noteworthy improvements on most connectives. At the same time, there was a large variation in LMs' overall performance across connective types, with all models systematically struggling on connectives that express a concessive meaning. Our findings pave the way for more nuanced investigations into the functional role of language cues as captured by LMs. We release WUGNECTIVES at this https URL.

Title: AutoPR: Let's Automate Your Academic Promotion!

Authors: Qiguang Chen, Zheng Yan, Mingda Yang, Libo Qin, Yixin Yuan, Hanjing Li, Jinhao Liu, Yiyan Ji, Dengyun Peng, Jiannan Guan, Mengkang Hu, Yantao Du, Wanxiang Che
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.09558
Pdf URL: https://arxiv.org/pdf/2510.09558
Copy Paste: [[2510.09558]] AutoPR: Let's Automate Your Academic Promotion!(https://arxiv.org/abs/2510.09558)
Keywords: llm, agent
Abstract: As the volume of peer-reviewed research surges, scholars increasingly rely on social platforms for discovery, while authors invest considerable effort in promoting their work to ensure visibility and citations. To streamline this process and reduce the reliance on human effort, we introduce Automatic Promotion (AutoPR), a novel task that transforms research papers into accurate, engaging, and timely public content. To enable rigorous evaluation, we release PRBench, a multimodal benchmark that links 512 peer-reviewed articles to high-quality promotional posts, assessing systems along three axes: Fidelity (accuracy and tone), Engagement (audience targeting and appeal), and Alignment (timing and channel optimization). We also introduce PRAgent, a multi-agent framework that automates AutoPR in three stages: content extraction with multimodal preparation, collaborative synthesis for polished outputs, and platform-specific adaptation to optimize norms, tone, and tagging for maximum reach. When compared to direct LLM pipelines on PRBench, PRAgent demonstrates substantial improvements, including a 604% increase in total watch time, a 438% rise in likes, and at least a 2.9x boost in overall engagement. Ablation studies show that platform modeling and targeted promotion contribute the most to these gains. Our results position AutoPR as a tractable, measurable research problem and provide a roadmap for scalable, impactful automated scholarly communication.

Title: Dyna-Mind: Learning to Simulate from Experience for Better AI Agents

Authors: Xiao Yu, Baolin Peng, Michel Galley, Hao Cheng, Qianhui Wu, Janardhan Kulkarni, Suman Nath, Zhou Yu, Jianfeng Gao
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2510.09577
Pdf URL: https://arxiv.org/pdf/2510.09577
Copy Paste: [[2510.09577]] Dyna-Mind: Learning to Simulate from Experience for Better AI Agents(https://arxiv.org/abs/2510.09577)
Keywords: agent
Abstract: Reasoning models have recently shown remarkable progress in domains such as math and coding. However, their expert-level abilities in math and coding contrast sharply with their performance in long-horizon, interactive tasks such as web navigation and computer/phone-use. Inspired by literature on human cognition, we argue that current AI agents need "vicarious trial and error" (the capacity to mentally simulate alternative futures before acting) in order to enhance their understanding and performance in complex interactive environments. We introduce Dyna-Mind, a two-stage training framework that explicitly teaches (V)LM agents to integrate such simulation into their reasoning. In stage 1, we introduce Reasoning with Simulations (ReSim), which trains the agent to generate structured reasoning traces from expanded search trees built from real experience gathered through environment interactions. ReSim thus grounds the agent's reasoning in faithful world dynamics and equips it with the ability to anticipate future states in its reasoning. In stage 2, we propose Dyna-GRPO, an online reinforcement learning method to further strengthen the agent's simulation and decision-making ability by using both outcome rewards and intermediate states as feedback from real rollouts. Experiments on two synthetic benchmarks (Sokoban and ALFWorld) and one realistic benchmark (AndroidWorld) demonstrate that (1) ReSim effectively infuses simulation ability into AI agents, and (2) Dyna-GRPO leverages outcome and interaction-level signals to learn better policies for long-horizon, planning-intensive tasks. Together, these results highlight the central role of simulation in enabling AI agents to reason, plan, and act more effectively in ever more challenging environments.

Title: Mind-Paced Speaking: A Dual-Brain Approach to Real-Time Reasoning in Spoken Language Models

Authors: Donghang Wu, Haoyang Zhang, Jun Chen, Xiangyu (Tony) Zhang, Hexin Liu, Eng Siong Chng, Fei Tian, Xuerui Yang, Xiangyu Zhang, Daxin Jiang, Gang Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.09592
Pdf URL: https://arxiv.org/pdf/2510.09592
Copy Paste: [[2510.09592]] Mind-Paced Speaking: A Dual-Brain Approach to Real-Time Reasoning in Spoken Language Models(https://arxiv.org/abs/2510.09592)
Keywords: language model, chain-of-thought
Abstract: Real-time Spoken Language Models (SLMs) struggle to leverage Chain-of-Thought (CoT) reasoning due to the prohibitive latency of generating the entire thought process sequentially. Enabling SLMs to think while speaking, similar to humans, is attracting increasing attention. We present, for the first time, Mind-Paced Speaking (MPS), a brain-inspired framework that enables high-fidelity, real-time reasoning. Similar to how humans utilize distinct brain regions for thinking and responding, we propose a novel dual-brain approach, employing a "Formulation Brain" for high-level reasoning to pace and guide a separate "Articulation Brain" for fluent speech generation. This division of labor eliminates mode-switching, preserving the integrity of the reasoning process. Experiments show that MPS significantly outperforms existing think-while-speaking methods and achieves reasoning performance comparable to models that pre-compute the full CoT before speaking, while drastically reducing latency. Under a zero-latency configuration, the proposed method achieves an accuracy of 92.8% on the mathematical reasoning task Spoken-MQA and attains a score of 82.5 on the speech conversation task URO-Bench. Our work effectively bridges the gap between high-quality reasoning and real-time interaction.

Title: Prompting Test-Time Scaling Is A Strong LLM Reasoning Data Augmentation

Authors: Sondos Mahmoud Bsharat, Zhiqiang Shen
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.09599
Pdf URL: https://arxiv.org/pdf/2510.09599
Copy Paste: [[2510.09599]] Prompting Test-Time Scaling Is A Strong LLM Reasoning Data Augmentation(https://arxiv.org/abs/2510.09599)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Large language models (LLMs) have demonstrated impressive reasoning capabilities when provided with chain-of-thought exemplars, but curating large reasoning datasets remains laborious and resource-intensive. In this work, we introduce Prompting Test-Time Scaling (P-TTS), a simple yet effective inference-time data augmentation strategy for enhancing LLM reasoning through finetuning. Rather than collecting thousands or even millions of examples, P-TTS leverages a small pool of only 90 manually selected reasoning instances and systematically varies exemplar augmentation through principled instruction prompting intensities at test time to synthesize diverse reasoning trajectory contexts. We then finetune Qwen-2.5 models of various sizes on P-TTS data. Across a suite of reasoning benchmarks (AIME2024 & 25, MATH500, and GPQA-Diamond), our P-TTS-7B and 32B models outperform prior competitive baselines such as S1 and S1.1 (1K-shot), achieving absolute accuracy gains of +26.66% and +30.00% on AIME'24 (7B), and +13.34% and +6.67% on AIME'25 (7B); P-TTS-32B yields gains of +23.33% and +16.63% on AIME'24, and +26.63% and +3.33% on AIME'25 (vs. S1 and S1.1, respectively), with comparable or better performance on MATH500 and GPQA-Diamond. We further show that P-TTS enhances zero-shot generalization accuracy on the out-of-domain reasoning benchmarks Gaokao, Kaoyan, OlympiadBench, AMC23, GradeSchoolMath, and Minerva. Our analysis suggests that test-time scaling effectively explores the latent space of reasoning patterns, amplifying LLM problem-solving with minimal annotation overhead, and further unlocking the reasoning potential and capabilities of LLMs. Prompting Test-Time Scaling offers a practical, low-cost way to elicit LLM reasoning in resource-constrained or rapidly evolving domains.