2025-10-20

Title: Rethinking Toxicity Evaluation in Large Language Models: A Multi-Label Perspective

Authors: Zhiqiang Kou, Junyang Chen, Xin-Qiang Cai, Ming-Kun Xie, Biao Liu, Changwei Wang, Lei Feng, Yuheng Jia, Gang Niu, Masashi Sugiyama, Xin Geng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.15007
Pdf URL: https://arxiv.org/pdf/2510.15007
Copy Paste: [[2510.15007]] Rethinking Toxicity Evaluation in Large Language Models: A Multi-Label Perspective(https://arxiv.org/abs/2510.15007)
Keywords: language model, gpt, llm, prompt
Abstract: Large language models (LLMs) have achieved impressive results across a range of natural language processing tasks, but their potential to generate harmful content has raised serious safety concerns. Current toxicity detectors primarily rely on single-label benchmarks, which cannot adequately capture the inherently ambiguous and multi-dimensional nature of real-world toxic prompts. This limitation results in biased evaluations, including missed toxic detections and false positives, undermining the reliability of existing detectors. Additionally, gathering comprehensive multi-label annotations across fine-grained toxicity categories is prohibitively costly, further hindering effective evaluation and development. To tackle these issues, we introduce three novel multi-label benchmarks for toxicity detection: Q-A-MLL, R-A-MLL, and H-X-MLL, derived from public toxicity datasets and annotated according to a detailed 15-category taxonomy. We further provide a theoretical proof that, on our released datasets, training with pseudo-labels yields better performance than directly learning from single-label supervision. In addition, we develop a pseudo-label-based toxicity detection method. Extensive experimental results show that our approach significantly surpasses advanced baselines, including GPT-4o and DeepSeek, thus enabling more accurate and reliable evaluation of multi-label toxicity in LLM-generated content.
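The evaluation bias described in this abstract can be illustrated with a minimal Python sketch. The category names and both helper functions are hypothetical, chosen only for illustration; they are not the paper's 15-category taxonomy or its detection method. The sketch scores the same prediction against a single gold label and against a multi-label gold set.

```python
def single_label_errors(predicted, single_gold):
    """Score a set of predicted categories against ONE gold category."""
    false_positives = predicted - {single_gold}
    missed = {single_gold} - predicted
    return false_positives, missed


def multi_label_errors(predicted, gold_set):
    """Score a set of predicted categories against a SET of gold categories."""
    false_positives = predicted - gold_set
    missed = gold_set - predicted
    return false_positives, missed


# Hypothetical prompt that is toxic in two ways at once.
predicted = {"harassment", "hate"}

# Under a single-label benchmark, the correct "hate" prediction
# is counted as a false positive.
print(single_label_errors(predicted, "harassment"))

# Under a multi-label benchmark, the same prediction has no errors.
print(multi_label_errors(predicted, {"harassment", "hate"}))
```

Under these assumptions, the detector is penalized by the single-label benchmark for an answer that the multi-label annotation treats as fully correct.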

Title: Can generative AI figure out figurative language? The influence of idioms on essay scoring by ChatGPT, Gemini, and Deepseek

Authors: Enis Oğuz
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.15009
Pdf URL: https://arxiv.org/pdf/2510.15009
Copy Paste: [[2510.15009]] Can generative AI figure out figurative language? The influence of idioms on essay scoring by ChatGPT, Gemini, and Deepseek(https://arxiv.org/abs/2510.15009)
Keywords: gpt, chat
Abstract: The developments in Generative AI technologies have paved the way for numerous innovations in different fields. Recently, Generative AI has been proposed as a competitor to AES systems in evaluating student essays automatically. Considering the potential limitations of AI in processing idioms, this study assessed the scoring performances of Generative AI models for essays with and without idioms by incorporating insights from Corpus Linguistics and Computational Linguistics. Two equal essay lists were created from 348 student essays taken from a corpus: one with multiple idioms present in each essay and another with no idioms in any essay. Three Generative AI models (ChatGPT, Gemini, and Deepseek) were asked to score all essays in both lists three times, using the same rubric used by human raters in assigning essay scores. The results revealed excellent consistency for all models, but Gemini outperformed its competitors in interrater reliability with human raters. There was also no detectable bias for any demographic group in AI assessment. For essays with multiple idioms, Gemini followed the most similar pattern to human raters. While the models in the study demonstrated potential for a hybrid approach, Gemini was the best candidate for the task due to its ability to handle figurative language and showed promise for handling essay-scoring tasks alone in the future.

Title: A Generalizable Rhetorical Strategy Annotation Model Using LLM-based Debate Simulation and Labelling

Authors: Shiyu Ji, Farnoosh Hashemi, Joice Chen, Juanwen Pan, Weicheng Ma, Hefan Zhang, Sophia Pan, Ming Cheng, Shubham Mohole, Saeed Hassanpour, Soroush Vosoughi, Michael Macy
Subjects: cs.CL, cs.SI
Abstract URL: https://arxiv.org/abs/2510.15081
Pdf URL: https://arxiv.org/pdf/2510.15081
Copy Paste: [[2510.15081]] A Generalizable Rhetorical Strategy Annotation Model Using LLM-based Debate Simulation and Labelling(https://arxiv.org/abs/2510.15081)
Keywords: language model, llm
Abstract: Rhetorical strategies are central to persuasive communication, from political discourse and marketing to legal argumentation. However, analysis of rhetorical strategies has been limited by reliance on human annotation, which is costly, inconsistent, and difficult to scale. Their associated datasets are often limited to specific topics and strategies, posing challenges for robust model development. We propose a novel framework that leverages large language models (LLMs) to automatically generate and label synthetic debate data based on a four-part rhetorical typology (causal, empirical, emotional, moral). We fine-tune transformer-based classifiers on this LLM-labeled dataset and validate their performance against human-labeled data on this dataset and on multiple external corpora. Our model achieves high performance and strong generalization across topical domains. We illustrate two applications with the fine-tuned model: (1) improving persuasiveness prediction by incorporating rhetorical strategy labels, and (2) analyzing temporal and partisan shifts in rhetorical strategies in U.S. Presidential debates (1960-2020), revealing increased use of affective over cognitive argument.

Title: Continual Learning via Sparse Memory Finetuning

Authors: Jessy Lin, Luke Zettlemoyer, Gargi Ghosh, Wen-Tau Yih, Aram Markosyan, Vincent-Pierre Berges, Barlas Oğuz
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.15103
Pdf URL: https://arxiv.org/pdf/2510.15103
Copy Paste: [[2510.15103]] Continual Learning via Sparse Memory Finetuning(https://arxiv.org/abs/2510.15103)
Keywords: language model
Abstract: Modern language models are powerful, but typically static after deployment. A major obstacle to building models that continually learn over time is catastrophic forgetting, where updating on new data erases previously acquired capabilities. Motivated by the intuition that mitigating forgetting is challenging because trainable parameters are shared across all tasks, we investigate whether sparse parameter updates can enable learning without catastrophic forgetting. We introduce sparse memory finetuning, leveraging memory layer models (Berges et al., 2024), which are sparsely updated by design. By updating only the memory slots that are highly activated by a new piece of knowledge relative to usage on pretraining data, we reduce interference between new knowledge and the model's existing capabilities. We evaluate learning and forgetting compared to full finetuning and parameter-efficient finetuning with LoRA on two question answering tasks. We find that sparse memory finetuning learns new knowledge while exhibiting substantially less forgetting: while NaturalQuestions F1 drops by 89% after full finetuning on new facts and 71% with LoRA, sparse memory finetuning yields only an 11% drop with the same level of new knowledge acquisition. Our results suggest sparsity in memory layers offers a promising path toward continual learning in large language models.
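The slot-selection idea in this abstract can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the authors' implementation: the ratio score, the function names `select_slots` and `sparse_update`, and the use of scalar slot values (real memory slots would be vectors) are all hypothetical. The sketch ranks slots by how often new data activates them relative to their background usage on pretraining data, then applies a gradient step only to the top-ranked slots.

```python
def select_slots(new_counts, pretrain_counts, top_t):
    """Pick the top_t slots most specific to the new data.

    new_counts: slot id -> activation count on the new batch.
    pretrain_counts: slot id -> activation count on pretraining data.
    The ratio score below is an assumed stand-in for "highly activated
    relative to usage on pretraining data".
    """
    def score(slot):
        return new_counts[slot] / (1.0 + pretrain_counts.get(slot, 0))

    # Sort by descending score; break ties by slot id for determinism.
    ranked = sorted(new_counts, key=lambda slot: (-score(slot), slot))
    return set(ranked[:top_t])


def sparse_update(memory, grads, trainable, lr=0.1):
    """Apply an SGD step to the trainable slots only; freeze the rest."""
    return {
        slot: value - lr * grads.get(slot, 0.0) if slot in trainable else value
        for slot, value in memory.items()
    }


# Slot 0 is used heavily in pretraining, so it stays frozen even
# though the new data activates it as often as slot 1.
new_counts = {0: 5, 1: 5, 2: 1}
pretrain_counts = {0: 1000, 1: 2, 2: 0}
trainable = select_slots(new_counts, pretrain_counts, top_t=2)

memory = {0: 1.0, 1: 1.0, 2: 1.0}
grads = {0: 1.0, 1: 1.0, 2: 1.0}
print(trainable)
print(sparse_update(memory, grads, trainable))
```

Freezing the heavily shared slot is what limits interference with existing capabilities in this sketch.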

Title: Measuring the Effect of Disfluency in Multilingual Knowledge Probing Benchmarks

Authors: Kirill Semenov, Rico Sennrich
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.15115
Pdf URL: https://arxiv.org/pdf/2510.15115
Copy Paste: [[2510.15115]] Measuring the Effect of Disfluency in Multilingual Knowledge Probing Benchmarks(https://arxiv.org/abs/2510.15115)
Keywords: gpt, llm, prompt, chat
Abstract: For multilingual factual knowledge assessment of LLMs, benchmarks such as MLAMA use template translations that do not take into account the grammatical and semantic information of the named entities inserted in the sentence. This leads to numerous instances of ungrammaticality or wrong wording of the final prompts, which complicates the interpretation of scores, especially for languages that have a rich morphological inventory. In this work, we sample 4 Slavic languages from the MLAMA dataset and compare the knowledge retrieval scores between the initial (templated) MLAMA dataset and its sentence-level translations made by Google Translate and ChatGPT. We observe a significant increase in knowledge retrieval scores, and provide a qualitative analysis of possible reasons behind it. We also analyze 5 more languages from different families and see similar patterns. Therefore, we encourage the community to control the grammaticality of highly multilingual datasets for higher and more interpretable results, which is well approximated by whole sentence translation with neural MT or LLM systems. The dataset and all related code are published in the GitHub repository: this https URL.

Title: Latent Topic Synthesis: Leveraging LLMs for Electoral Ad Analysis

Authors: Alexander Brady, Tunazzina Islam
Subjects: cs.CL, cs.AI, cs.CY, cs.LG, cs.SI
Abstract URL: https://arxiv.org/abs/2510.15125
Pdf URL: https://arxiv.org/pdf/2510.15125
Copy Paste: [[2510.15125]] Latent Topic Synthesis: Leveraging LLMs for Electoral Ad Analysis(https://arxiv.org/abs/2510.15125)
Keywords: language model, llm, prompt
Abstract: Social media platforms play a pivotal role in shaping political discourse, but analyzing their vast and rapidly evolving content remains a major challenge. We introduce an end-to-end framework for automatically generating an interpretable topic taxonomy from an unlabeled corpus. By combining unsupervised clustering with prompt-based labeling, our method leverages large language models (LLMs) to iteratively construct a taxonomy without requiring seed sets or domain expertise. We apply this framework to a large corpus of Meta (previously known as Facebook) political ads from the month ahead of the 2024 U.S. Presidential election. Our approach uncovers latent discourse structures, synthesizes semantically rich topic labels, and annotates topics with moral framing dimensions. We show quantitative and qualitative analyses to demonstrate the effectiveness of our framework. Our findings reveal that voting and immigration ads dominate overall spending and impressions, while abortion and election-integrity achieve disproportionate reach. Funding patterns are equally polarized: economic appeals are driven mainly by conservative PACs, abortion messaging splits between pro- and anti-rights coalitions, and crime-and-justice campaigns are fragmented across local committees. The framing of these appeals also diverges--abortion ads emphasize liberty/oppression rhetoric, while economic messaging blends care/harm, fairness/cheating, and liberty/oppression narratives. Topic salience further reveals strong correlations between moral foundations and issues. Demographic targeting also emerges. This work supports scalable, interpretable analysis of political messaging on social media, enabling researchers, policymakers, and the public to better understand emerging narratives, polarization dynamics, and the moral underpinnings of digital political communication.

Title: FarsiMCQGen: a Persian Multiple-choice Question Generation Framework

Authors: Mohammad Heydari Rad, Rezvan Afari, Saeedeh Momtazi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.15134
Pdf URL: https://arxiv.org/pdf/2510.15134
Copy Paste: [[2510.15134]] FarsiMCQGen: a Persian Multiple-choice Question Generation Framework(https://arxiv.org/abs/2510.15134)
Keywords: language model, llm
Abstract: Multiple-choice questions (MCQs) are commonly used in educational testing, as they offer an efficient means of evaluating learners' knowledge. However, generating high-quality MCQs, particularly in low-resource languages such as Persian, remains a significant challenge. This paper introduces FarsiMCQGen, an innovative approach for generating Persian-language MCQs. Our methodology combines candidate generation, filtering, and ranking techniques to build a model that generates answer choices resembling those in real MCQs. We leverage advanced methods, including Transformers and knowledge graphs, integrated with rule-based approaches to craft credible distractors that challenge test-takers. Our work is based on data from Wikipedia, which includes general knowledge questions. Furthermore, this study introduces a novel Persian MCQ dataset comprising 10,289 questions. This dataset is evaluated by different state-of-the-art large language models (LLMs). Our results demonstrate the effectiveness of our model and the quality of the generated dataset, which has the potential to inspire further research on MCQs.

Title: Structure-R1: Dynamically Leveraging Structural Knowledge in LLM Reasoning through Reinforcement Learning

Authors: Junlin Wu, Xianrui Zhong, Jiashuo Sun, Bolian Li, Bowen Jin, Jiawei Han, Qingkai Zeng
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2510.15191
Pdf URL: https://arxiv.org/pdf/2510.15191
Copy Paste: [[2510.15191]] Structure-R1: Dynamically Leveraging Structural Knowledge in LLM Reasoning through Reinforcement Learning(https://arxiv.org/abs/2510.15191)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Large language models (LLMs) have demonstrated remarkable advances in reasoning capabilities. However, their performance remains constrained by limited access to explicit and structured domain knowledge. Retrieval-Augmented Generation (RAG) addresses this by incorporating external information as context to augment reasoning. Nevertheless, traditional RAG systems typically operate over unstructured and fragmented text, resulting in low information density and suboptimal reasoning. To overcome these limitations, we propose Structure-R1, a novel framework that transforms retrieved content into structured representations optimized for reasoning. Leveraging reinforcement learning, Structure-R1 learns a content representation policy that dynamically generates and adapts structural formats based on the demands of multi-step reasoning. Unlike prior methods that rely on fixed schemas, our approach adopts a generative paradigm capable of producing task-specific structures tailored to individual queries. To ensure the quality and reliability of these representations, we introduce a self-reward structural verification mechanism that checks whether the generated structures are both correct and self-contained. Extensive experiments on seven knowledge-intensive benchmarks show that Structure-R1 consistently achieves competitive performance with a 7B-scale backbone model and matches the performance of much larger models. Additionally, our theoretical analysis demonstrates how structured representations enhance reasoning by improving information density and contextual clarity. Our code and data are available at: this https URL.

Title: Extending Audio Context for Long-Form Understanding in Large Audio-Language Models

Authors: Yuatyong Chaichana, Pittawat Taveekitworachai, Warit Sirichotedumrong, Potsawee Manakul, Kunat Pipatanakul
Subjects: cs.CL, cs.AI, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2510.15231
Pdf URL: https://arxiv.org/pdf/2510.15231
Copy Paste: [[2510.15231]] Extending Audio Context for Long-Form Understanding in Large Audio-Language Models(https://arxiv.org/abs/2510.15231)
Keywords: language model, llm, long context
Abstract: Large Audio-Language Models (LALMs) are often constrained by short audio context windows, even when their text backbones support long contexts, limiting long-form audio understanding. Prior work has introduced context-extension methods (e.g. YaRN) on unimodal LLMs, yet their application to LALMs remains unexplored. First, building on RoPE-based context extension, we introduce Partial YaRN, a training-free, audio-only extension method that modifies only audio token positions, leaving text positions intact to preserve the base LLM's text capabilities. Second, we propose Virtual Longform Audio Training (VLAT), a training strategy that extends Partial YaRN into a training-time positional augmentation. VLAT simulates diverse audio lengths during training, enabling generalization to inputs far longer than those seen in training and improving robustness for long-context audio understanding. Our experiments on SALMONN and Qwen2-Audio show that Partial YaRN outperforms the original models across a wide range of settings, and the VLAT training strategy provides substantial improvement, achieving strong performance on long audio of unseen lengths.
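The idea of changing only audio token positions can be sketched as follows. This is a simplified illustration, not Partial YaRN itself: it swaps in plain linear position interpolation for YaRN's frequency-dependent scaling, and the sequence layout (text prefix, audio span, text suffix), the function name `partial_positions`, and the `audio_window` parameter are all assumptions. Audio positions are compressed to fit the audio span the model was trained on, while text tokens keep unit spacing.

```python
def partial_positions(n_prefix, n_audio, n_suffix, audio_window):
    """Return one position value per token for [text prefix | audio | text suffix].

    audio_window is the assumed number of audio positions seen in training.
    Only the audio span is rescaled; text tokens keep a spacing of 1.
    """
    prefix = [float(i) for i in range(n_prefix)]

    # Audio longer than the window is squeezed into the window;
    # shorter audio is left unchanged (step of 1).
    span = min(n_audio, audio_window)
    step = span / n_audio if n_audio else 1.0
    audio = [n_prefix + i * step for i in range(n_audio)]

    # The text suffix resumes right after the (possibly compressed) audio span.
    suffix_start = n_prefix + span
    suffix = [float(suffix_start + i) for i in range(n_suffix)]
    return prefix + audio + suffix


# 8 audio tokens squeezed into a window of 4 positions.
print(partial_positions(n_prefix=2, n_audio=8, n_suffix=2, audio_window=4))
```

In this sketch the text tokens see the same relative distances to each other as they would with in-window audio, which is the property the abstract attributes to leaving text positions intact.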

Title: Planner and Executor: Collaboration between Discrete Diffusion And Autoregressive Models in Reasoning

Authors: Lina Berrayana, Ahmed Heakl, Muhammad Abdullah Sohail, Thomas Hofmann, Salman Khan, Wei Chen
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.15244
Pdf URL: https://arxiv.org/pdf/2510.15244
Copy Paste: [[2510.15244]] Planner and Executor: Collaboration between Discrete Diffusion And Autoregressive Models in Reasoning(https://arxiv.org/abs/2510.15244)
Keywords: language model
Abstract: Current autoregressive language models (ARMs) achieve high accuracy but require long token sequences, making them costly. Discrete diffusion language models (DDLMs) enable parallel and flexible generation within a fixed number of steps and have recently emerged for their strong performance in complex reasoning and long-term planning tasks. We present a study exploring hybrid architectures that couple DDLMs with ARMs to assess whether their collaboration can yield complementary benefits. We first examine collaboration in text space, where one model plans the reasoning process and another executes the final answer based on that plan. We then extend this setup to latent-space communication, introducing a learned projector that maps DDLM latents into the ARM's embedding space, potentially bypassing some of the text-generation limitations of diffusion models. We find that shifting DDLM --> ARM communication from text space to latent space yields significant accuracy gains, for example increasing from 27.0% to 54.0% on DART-5 and from 0.0% to 14.0% on AIME24. We also find that combining a DDLM planner with an ARM executor can provide substantial computational savings with little to no impact on accuracy. For example, the latent-space pipeline, using 64 tokens for planning and roughly 5 for execution, surpasses Qwen3.1-7B on DART-5 and AIME, despite Qwen using 44 times more tokens. Overall, our study offers new insights into reasoning with DDLMs and highlights their potential in hybrid architectures.

Title: Scaling Beyond Context: A Survey of Multimodal Retrieval-Augmented Generation for Document Understanding

Authors: Sensen Gao, Shanshan Zhao, Xu Jiang, Lunhao Duan, Yong Xien Chng, Qing-Guo Chen, Weihua Luo, Kaifu Zhang, Jia-Wang Bian, Mingming Gong
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2510.15253
Pdf URL: https://arxiv.org/pdf/2510.15253
Copy Paste: [[2510.15253]] Scaling Beyond Context: A Survey of Multimodal Retrieval-Augmented Generation for Document Understanding(https://arxiv.org/abs/2510.15253)
Keywords: language model, llm, retrieval-augmented generation, agent
Abstract: Document understanding is critical for applications from financial analysis to scientific discovery. Current approaches, whether OCR-based pipelines feeding Large Language Models (LLMs) or native Multimodal LLMs (MLLMs), face key limitations: the former loses structural detail, while the latter struggles with context modeling. Retrieval-Augmented Generation (RAG) helps ground models in external data, but documents' multimodal nature, i.e., combining text, tables, charts, and layout, demands a more advanced paradigm: Multimodal RAG. This approach enables holistic retrieval and reasoning across all modalities, unlocking comprehensive document intelligence. Recognizing its importance, this paper presents a systematic survey of Multimodal RAG for document understanding. We propose a taxonomy based on domain, retrieval modality, and granularity, and review advances involving graph structures and agentic frameworks. We also summarize key datasets, benchmarks, and applications, and highlight open challenges in efficiency, fine-grained representation, and robustness, providing a roadmap for future progress in document AI.

Title: TraceCoder: Towards Traceable ICD Coding via Multi-Source Knowledge Integration

Authors: Mucheng Ren, He Chen, Yuchen Yan, Danqing Hu, Jun Xu, Xian Zeng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.15267
Pdf URL: https://arxiv.org/pdf/2510.15267
Copy Paste: [[2510.15267]] TraceCoder: Towards Traceable ICD Coding via Multi-Source Knowledge Integration(https://arxiv.org/abs/2510.15267)
Keywords: language model, llm
Abstract: Automated International Classification of Diseases (ICD) coding assigns standardized diagnosis and procedure codes to clinical records, playing a critical role in healthcare systems. However, existing methods face challenges such as semantic gaps between clinical text and ICD codes, poor performance on rare and long-tail codes, and limited interpretability. To address these issues, we propose TraceCoder, a novel framework integrating multi-source external knowledge to enhance traceability and explainability in ICD coding. TraceCoder dynamically incorporates diverse knowledge sources, including UMLS, Wikipedia, and large language models (LLMs), to enrich code representations, bridge semantic gaps, and handle rare and ambiguous codes. It also introduces a hybrid attention mechanism to model interactions among labels, clinical context, and knowledge, improving long-tail code recognition and making predictions interpretable by grounding them in external evidence. Experiments on MIMIC-III-ICD9, MIMIC-IV-ICD9, and MIMIC-IV-ICD10 datasets demonstrate that TraceCoder achieves state-of-the-art performance, with ablation studies validating the effectiveness of its components. TraceCoder offers a scalable and robust solution for automated ICD coding, aligning with clinical needs for accuracy, interpretability, and reliability.

Title: Exemplar-Guided Planing: Enhanced LLM Agent for KGQA

Authors: Jingao Xu, Shuoyoucheng Ma, Xin Song, Rong Jiang, Hongkui Tu, Bin Zhou
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.15283
Pdf URL: https://arxiv.org/pdf/2510.15283
Copy Paste: [[2510.15283]] Exemplar-Guided Planing: Enhanced LLM Agent for KGQA(https://arxiv.org/abs/2510.15283)
Keywords: language model, llm, agent
Abstract: Large Language Models (LLMs) as interactive agents show significant promise in Knowledge Graph Question Answering (KGQA) but often struggle with the semantic gap between natural language queries and structured knowledge graph (KG) representations. This leads to suboptimal planning and inefficient exploration on KG, while training-free approaches often underutilize valuable reasoning patterns in training data. To address these limitations, we propose a novel framework, Exemplar-Guided Planning (EGP), which enhances the planning capabilities of LLM agents for KGQA. EGP first preprocesses the training set questions via entity templating to normalize semantic variations. It then retrieves highly similar exemplary questions and their successful reasoning paths from this preprocessed set using semantic embeddings and an efficient FAISS index. These retrieved exemplars dynamically guide the LLM's planning process in two key phases: (1) Task Decomposition, by aligning generated sub-objectives with proven reasoning steps, and (2) Relation Exploration, by providing high-quality auxiliary information to improve relation pruning accuracy. Additionally, we introduce a Smart Lookahead mechanism during relation exploration to improve efficiency by preemptively exploring promising paths and potentially terminating exploration earlier. We apply EGP to the Plan-on-Graph (PoG) framework, termed PoG-EGP. Extensive experiments on two real-world KGQA datasets, WebQSP and CWQ, demonstrate that PoG-EGP significantly improves over the baseline PoG system and other compared methods.
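The templating and retrieval steps described in this abstract can be sketched in plain Python. This is a toy stand-in, not the EGP implementation: a bag-of-words vector replaces the semantic embeddings, a sorted scan replaces the FAISS index, and the `[ENT]` placeholder, the function names, and the relation path `film.directed_by` are hypothetical.

```python
import math
from collections import Counter


def template(question, entities):
    """Replace known entity mentions with a generic placeholder."""
    # Replace longer mentions first so substrings do not clobber them.
    for entity in sorted(entities, key=len, reverse=True):
        question = question.replace(entity, "[ENT]")
    return question


def embed(text):
    """Bag-of-words vector; an assumed stand-in for a semantic embedding."""
    return Counter(text.split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query_template, exemplars, k=1):
    """Return the k exemplars whose templates are most similar to the query."""
    query_vec = embed(query_template)
    ranked = sorted(
        exemplars,
        key=lambda ex: cosine(query_vec, embed(ex["template"])),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical preprocessed training questions with their reasoning paths.
exemplars = [
    {"template": template("Where was Barack Obama born?", ["Barack Obama"]),
     "path": ["people.place_of_birth"]},
    {"template": template("Who directed Titanic?", ["Titanic"]),
     "path": ["film.directed_by"]},
]

query = template("Who directed Inception?", ["Inception"])
print(retrieve(query, exemplars, k=1)[0]["path"])
```

Because templating removes the entity names, the query matches the Titanic exemplar exactly, and its reasoning path can then guide planning for the new question.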

Title: Accelerating Mobile Language Model Generation via Hybrid Context and Hardware Coordination

Authors: Zhiyang Chen, Daliang Xu, Haiyang Shen, Mengwei Xu, Shangguang Wang, Yun Ma
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.15312
Pdf URL: https://arxiv.org/pdf/2510.15312
Copy Paste: [[2510.15312]] Accelerating Mobile Language Model Generation via Hybrid Context and Hardware Coordination(https://arxiv.org/abs/2510.15312)
Keywords: language model, llm, agent
Abstract: Enhancing on-device large language models (LLMs) with contextual information from local data enables personalized and task-aware generation, powering use cases such as intelligent assistants and UI agents. While recent developments in neural processors have substantially improved the efficiency of prefill on mobile devices, the token-by-token generation process still suffers from high latency and limited hardware utilization due to its inherently memory-bound characteristics. This work presents CoordGen, a mobile inference framework that integrates speculative decoding with dynamic hardware scheduling to accelerate context-aware text generation on mobile devices. The framework introduces three synergistic components: (1) adaptive execution scheduling, which dynamically balances compute graphs between prefill and decoding phases; (2) context-aligned drafting, which improves speculative efficiency through lightweight online calibration to current tasks; and (3) hardware-efficient draft extension, which reuses and expands intermediate sequences to improve processing parallelism and reduce verification cost. Experiments on multiple smartphones and representative workloads show consistent improvements of up to 3.8x in generation speed and 4.7x in energy efficiency compared with existing mobile inference solutions. Component-level analysis further validates the contribution of each optimization.
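For readers unfamiliar with speculative decoding, the draft-then-verify loop that this framework builds on can be sketched as below. This is a generic greedy variant with toy models, not CoordGen's scheduling or drafting logic; `speculative_step`, `toy_target`, and `toy_draft` are hypothetical names. A real system would verify all draft tokens in one batched forward pass of the target model, which the sketch emulates with a loop.

```python
def speculative_step(prefix, draft_next, target_next, k):
    """Run one greedy speculative decoding step and return the new tokens."""
    # 1. The cheap draft model proposes k tokens autoregressively.
    draft, context = [], list(prefix)
    for _ in range(k):
        token = draft_next(context)
        draft.append(token)
        context.append(token)

    # 2. The target model checks each proposal in order.
    accepted, context = [], list(prefix)
    for token in draft:
        expected = target_next(context)
        if token != expected:
            # First mismatch: keep the target's token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(token)
        context.append(token)

    # 3. All drafts accepted: the target contributes one extra token.
    accepted.append(target_next(context))
    return accepted


def toy_target(context):
    """Toy target model: counts upward modulo 10."""
    return (context[-1] + 1) % 10


def toy_draft(context):
    """Toy draft model: agrees with the target except after token 3."""
    return 0 if context[-1] == 3 else (context[-1] + 1) % 10


# The draft proposes [2, 3, 0, 1]; the target rejects the third token.
print(speculative_step([1], toy_draft, toy_target, k=4))
```

Each step yields at least one token that the target model agrees with, so the output matches plain greedy decoding with the target while spending fewer sequential target calls when the draft is accurate.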

Title: Capabilities and Evaluation Biases of Large Language Models in Classical Chinese Poetry Generation: A Case Study on Tang Poetry

Authors: Bolei Ma, Yina Yao, Anna-Carolina Haensch
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.15313
Pdf URL: https://arxiv.org/pdf/2510.15313
Copy Paste: [[2510.15313]] Capabilities and Evaluation Biases of Large Language Models in Classical Chinese Poetry Generation: A Case Study on Tang Poetry(https://arxiv.org/abs/2510.15313)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) are increasingly applied to creative domains, yet their performance in classical Chinese poetry generation and evaluation remains poorly understood. We propose a three-step evaluation framework that combines computational metrics, LLM-as-a-judge assessment, and human expert validation. Using this framework, we evaluate six state-of-the-art LLMs across multiple dimensions of poetic quality, including themes, emotions, imagery, form, and style. Our analysis reveals systematic generation and evaluation biases: LLMs exhibit "echo chamber" effects when assessing creative quality, often converging on flawed standards that diverge from human judgments. These findings highlight both the potential and the limitations of current LLMs as a proxy for literary generation, as well as the limits of current evaluation practices, demonstrating the continued need for hybrid validation by both humans and models in culturally and technically complex creative tasks.

Title: AutoGraph-R1: End-to-End Reinforcement Learning for Knowledge Graph Construction

Authors: Hong Ting Tsang, Jiaxin Bai, Haoyu Huang, Qiao Xiao, Tianshi Zheng, Baixuan Xu, Shujie Liu, Yangqiu Song
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.15339
Pdf URL: https://arxiv.org/pdf/2510.15339
Copy Paste: [[2510.15339]] AutoGraph-R1: End-to-End Reinforcement Learning for Knowledge Graph Construction(https://arxiv.org/abs/2510.15339)
Keywords: llm, retrieval-augmented generation
Abstract: Building effective knowledge graphs (KGs) for Retrieval-Augmented Generation (RAG) is pivotal for advancing question answering (QA) systems. However, its effectiveness is hindered by a fundamental disconnect: the knowledge graph (KG) construction process is decoupled from its downstream application, yielding suboptimal graph structures. To bridge this gap, we introduce AutoGraph-R1, the first framework to directly optimize KG construction for task performance using Reinforcement Learning (RL). AutoGraph-R1 trains an LLM constructor by framing graph generation as a policy learning problem, where the reward is derived from the graph's functional utility in a RAG pipeline. We design two novel, task-aware reward functions, one for graphs as knowledge carriers and another as knowledge indices. Across multiple QA benchmarks, AutoGraph-R1 consistently enables graph RAG methods to achieve significant performance gains over using task-agnostic baseline graphs. Our work shows it is possible to close the loop between construction and application, shifting the paradigm from building intrinsically "good" graphs to building demonstrably "useful" ones.

Title: When to Ensemble: Identifying Token-Level Points for Stable and Fast LLM Ensembling

Authors: Heecheol Yun, Kwangmin Ki, Junghyun Lee, Eunho Yang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.15346
Pdf URL: https://arxiv.org/pdf/2510.15346
Copy Paste: [[2510.15346]] When to Ensemble: Identifying Token-Level Points for Stable and Fast LLM Ensembling(https://arxiv.org/abs/2510.15346)
Keywords: language model, llm
Abstract: Ensembling Large Language Models (LLMs) has gained attention as a promising approach to surpass the performance of individual models by leveraging their complementary strengths. In particular, aggregating models' next-token probability distributions to select the next token has been shown to be effective in various tasks. However, while successful for short-form answers, its application to long-form generation remains underexplored. In this paper, we show that using existing ensemble methods in long-form generation requires a careful choice of ensembling positions, since the standard practice of ensembling at every token often degrades performance. We identify two key factors for determining these positions: tokenization mismatch across models and consensus in their next-token probability distributions. Based on this, we propose SAFE (Stable And Fast LLM Ensembling), a framework that selectively ensembles by jointly considering these factors. To further improve stability, we introduce a probability sharpening strategy that consolidates probabilities spread across multiple sub-word tokens representing the same word into a single representative token. Our experiments on diverse benchmarks, including MATH500 and BBH, demonstrate that SAFE outperforms existing methods in both accuracy and efficiency, with gains achieved even when ensembling fewer than 1% of tokens.
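Illustration (not from the paper): the abstract refers to aggregating models' next-token probability distributions and to consensus among them. The sketch below shows the simplest form of that idea, a plain average over a shared vocabulary plus a top-1 agreement check. It assumes all models use the same tokens, which is exactly the case the abstract says often does not hold, and the function names are hypothetical; it is not the SAFE algorithm.

```python
def average_distributions(distributions):
    """Average several next-token distributions (dicts: token -> probability).

    Assumes every model scores the same vocabulary; tokens missing from a
    model's dict are treated as probability 0.
    """
    vocab = set().union(*(d.keys() for d in distributions))
    n = len(distributions)
    return {tok: sum(d.get(tok, 0.0) for d in distributions) / n for tok in vocab}


def pick_next_token(distributions):
    """Select the token with the highest averaged probability."""
    avg = average_distributions(distributions)
    return max(avg, key=avg.get)


def models_agree(distributions):
    """Hypothetical consensus check: do all models share the same top-1 token?"""
    tops = {max(d, key=d.get) for d in distributions}
    return len(tops) == 1


# Toy distributions from two imaginary models.
model_a = {"cat": 0.6, "dog": 0.3, "bird": 0.1}
model_b = {"cat": 0.2, "dog": 0.7, "bird": 0.1}

next_token = pick_next_token([model_a, model_b])
consensus = models_agree([model_a, model_b])
```

In this toy case the two models disagree on their top token, so a selective scheme might choose to ensemble at this position and skip positions where the models already agree.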

Title: Infinity Parser: Layout Aware Reinforcement Learning for Scanned Document Parsing

Authors: Baode Wang, Biao Wu, Weizhen Li, Meng Fang, Zuming Huang, Jun Huang, Haozhe Wang, Yanjie Liang, Ling Chen, Wei Chu, Yuan Qi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.15349
Pdf URL: https://arxiv.org/pdf/2510.15349
Copy Paste: [[2510.15349]] Infinity Parser: Layout Aware Reinforcement Learning for Scanned Document Parsing(https://arxiv.org/abs/2510.15349)
Keywords: language model
Abstract: Document parsing from scanned images into structured formats remains a significant challenge due to its complexly intertwined elements such as text paragraphs, figures, formulas, and tables. Existing supervised fine-tuning methods often struggle to generalize across diverse document types, leading to poor performance, particularly on out-of-distribution data. This issue is further exacerbated by the limited availability of high-quality training data for layout-aware parsing tasks. To address these challenges, we introduce LayoutRL, a reinforcement learning framework that optimizes layout understanding through composite rewards integrating normalized edit distance, paragraph count accuracy, and reading order preservation. To support this training, we construct the Infinity-Doc-400K dataset, which we use to train Infinity-Parser, a vision-language model demonstrating robust generalization across various domains. Extensive evaluations on benchmarks including OmniDocBench, olmOCR-Bench, PubTabNet, and FinTabNet show that Infinity-Parser consistently achieves state-of-the-art performance across a broad range of document types, languages, and structural complexities, substantially outperforming both specialized document parsing systems and general-purpose vision-language models. We will release our code, dataset, and model to facilitate reproducible research in document parsing.
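Illustration (not from the paper): the abstract names normalized edit distance and paragraph count accuracy as reward components. A minimal sketch of how two such terms could be computed follows. The equal weighting, the blank-line paragraph delimiter, and all function names are assumptions made for this example; the reading-order term is omitted because the abstract does not say how it is measured.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_edit_similarity(pred, ref):
    """1 - edit_distance / max length, so 1.0 means identical strings."""
    if not pred and not ref:
        return 1.0
    return 1.0 - edit_distance(pred, ref) / max(len(pred), len(ref))


def count_paragraphs(text):
    """Count non-empty paragraphs, assuming blank-line separation."""
    return len([p for p in text.split("\n\n") if p.strip()])


def paragraph_count_score(pred, ref):
    """1.0 when the paragraph counts match, decreasing with the relative gap."""
    n_pred, n_ref = count_paragraphs(pred), count_paragraphs(ref)
    return 1.0 - abs(n_pred - n_ref) / max(n_pred, n_ref, 1)


def composite_reward(pred, ref):
    """Hypothetical equal-weight combination of the two terms above."""
    return 0.5 * normalized_edit_similarity(pred, ref) + 0.5 * paragraph_count_score(pred, ref)
```

A perfect parse scores 1.0 on both terms; a parse that merges or splits paragraphs is penalized even when most characters match.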

Title: VocalBench-DF: A Benchmark for Evaluating Speech LLM Robustness to Disfluency

Authors: Hongcheng Liu, Yixuan Hou, Heyang Liu, Yuhao Wang, Yanfeng Wang, Yu Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.15406
Pdf URL: https://arxiv.org/pdf/2510.15406
Copy Paste: [[2510.15406]] VocalBench-DF: A Benchmark for Evaluating Speech LLM Robustness to Disfluency(https://arxiv.org/abs/2510.15406)
Keywords: language model, llm
Abstract: While Speech Large Language Models (Speech-LLMs) show strong performance in many applications, their robustness is critically under-tested, especially with respect to speech disfluency. Existing evaluations often rely on idealized inputs, overlooking common disfluencies, particularly those associated with conditions like Parkinson's disease. This work investigates whether current Speech-LLMs can maintain performance when interacting with users who have speech impairments. To facilitate this inquiry, we introduce VocalBench-DF, a framework for the systematic evaluation of disfluency across a multi-dimensional taxonomy. Our evaluation of 22 mainstream Speech-LLMs reveals substantial performance degradation, indicating that their real-world readiness is limited. Further analysis identifies phoneme-level processing and long-context modeling as primary bottlenecks responsible for these failures. Strengthening recognition and reasoning capability from components and pipelines can substantially improve robustness. These findings highlight the urgent need for new methods to improve disfluency handling and build truly inclusive Speech-LLMs.

Title: Fine-Tuning MedGemma for Clinical Captioning to Enhance Multimodal RAG over Malaysia CPGs

Authors: Lee Qi Zun, Mohamad Zulhilmi Bin Abdul Halim, Goh Man Fye
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.15418
Pdf URL: https://arxiv.org/pdf/2510.15418
Copy Paste: [[2510.15418]] Fine-Tuning MedGemma for Clinical Captioning to Enhance Multimodal RAG over Malaysia CPGs(https://arxiv.org/abs/2510.15418)
Keywords: language model, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation systems are essential for providing fact-based guidance from Malaysian Clinical Practice Guidelines. However, their effectiveness with image-based queries is limited, as general Vision-Language Model captions often lack clinical specificity and factual grounding. This study proposes and validates a framework to specialize the MedGemma model for generating high-fidelity captions that serve as superior queries. To overcome data scarcity, we employ a knowledge distillation pipeline to create a synthetic dataset across dermatology, fundus, and chest radiography domains, and fine-tune MedGemma using the parameter-efficient QLoRA method. Performance was rigorously assessed through a dual framework measuring both classification accuracy and, via a novel application of the RAGAS framework, caption faithfulness, relevancy, and correctness. The fine-tuned model demonstrated substantial improvements in classification performance, while RAGAS evaluation confirmed significant gains in caption faithfulness and correctness, validating the model's ability to produce reliable, factually grounded descriptions. This work establishes a robust pipeline for specializing medical VLMs and validates the resulting model as a high-quality query generator, laying the groundwork for enhancing multimodal RAG systems in evidence-based clinical decision support.

Title: When Seeing Is not Enough: Revealing the Limits of Active Reasoning in MLLMs

Authors: Hongcheng Liu, Pingjie Wang, Yuhao Wang, Siqu Ou, Yanfeng Wang, Yu Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.15421
Pdf URL: https://arxiv.org/pdf/2510.15421
Copy Paste: [[2510.15421]] When Seeing Is not Enough: Revealing the Limits of Active Reasoning in MLLMs(https://arxiv.org/abs/2510.15421)
Keywords: language model, llm
Abstract: Multimodal large language models (MLLMs) have shown strong capabilities across a broad range of benchmarks. However, most existing evaluations focus on passive inference, where models perform step-by-step reasoning under complete information. This setup is misaligned with real-world use, where seeing is not enough. This raises a fundamental question: Can MLLMs actively acquire missing evidence under incomplete information? To bridge this gap, we require the MLLMs to actively acquire missing evidence and iteratively refine decisions under incomplete information, by selecting a target image from a candidate pool without task-specific priors. To support systematic study, we propose GuessBench, a benchmark with both perception-oriented and knowledge-oriented images for evaluating active reasoning in MLLMs. We evaluate 20 superior MLLMs and find that performance on active reasoning lags far behind that in passive settings, indicating substantial room for improvement. Further analysis identifies fine-grained perception and timely decision-making as key challenges. Ablation studies show that perceptual enhancements benefit smaller models, whereas thinking-oriented methods provide consistent gains across model sizes. These results suggest promising directions for future research on multimodal active reasoning.

Title: Controllable Abstraction in Summary Generation for Large Language Models via Prompt Engineering

Authors: Xiangchen Song, Yuchen Liu, Yaxuan Luan, Jinxu Guo, Xiaofan Guo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.15436
Pdf URL: https://arxiv.org/pdf/2510.15436
Copy Paste: [[2510.15436]] Controllable Abstraction in Summary Generation for Large Language Models via Prompt Engineering(https://arxiv.org/abs/2510.15436)
Keywords: language model, prompt
Abstract: This study presents a controllable abstract summary generation method for large language models based on prompt engineering. To address the issues of summary quality and controllability in traditional methods, we design a multi-stage prompt generation framework. This framework generates summaries with varying levels of abstraction by performing semantic analysis, topic modeling, and noise control on the input text. The experiment uses the CNN/Daily Mail dataset and provides a detailed analysis of different prompt lengths, data noise, and text types. The experimental results show that prompt length has a significant impact on the quality of generated summaries. Both very short and very long prompt tokens result in a decrease in summary quality. Data noise also negatively affects the summary generation process. As noise levels increase, the ROUGE-L score gradually decreases. Furthermore, different text types have varying effects on the model's ability to generate summaries. The model performs best when handling news texts, while its performance is worse when processing academic articles. This research provides new insights into improving summary generation using large language models, particularly in how controlling prompt strategies and optimizing text preprocessing can enhance summary accuracy and controllability.
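Illustration (not from the paper): the abstract reports results in terms of the ROUGE-L score. For readers unfamiliar with the metric, a minimal sketch of sentence-level ROUGE-L as an F-measure over the longest common subsequence (LCS) is given below. Whitespace tokenization and `beta=1.0` are simplifying assumptions; published evaluations typically rely on an established ROUGE package with its own tokenization and stemming.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference, beta=1.0):
    """Sentence-level ROUGE-L F-score on whitespace-separated tokens."""
    cand, ref = candidate.split(), reference.split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)


score = rouge_l("the cat sat on the mat", "the cat lay on the mat")
```

Here the LCS is "the cat on the mat" (5 tokens out of 6 on each side), so precision, recall, and the F-score all equal 5/6.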

Title: CORE: Reducing UI Exposure in Mobile Agents via Collaboration Between Cloud and Local LLMs

Authors: Gucongcong Fan, Chaoyue Niu, Chengfei Lyu, Fan Wu, Guihai Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.15455
Pdf URL: https://arxiv.org/pdf/2510.15455
Copy Paste: [[2510.15455]] CORE: Reducing UI Exposure in Mobile Agents via Collaboration Between Cloud and Local LLMs(https://arxiv.org/abs/2510.15455)
Keywords: language model, llm, agent
Abstract: Mobile agents rely on Large Language Models (LLMs) to plan and execute tasks on smartphone user interfaces (UIs). While cloud-based LLMs achieve high task accuracy, they require uploading the full UI state at every step, exposing unnecessary and often irrelevant information. In contrast, local LLMs avoid UI uploads but suffer from limited capacity, resulting in lower task success rates. We propose CORE, a COllaborative framework that combines the strengths of cloud and local LLMs to Reduce UI Exposure, while maintaining task accuracy for mobile agents. CORE comprises three key components: (1) Layout-aware block partitioning, which groups semantically related UI elements based on the XML screen hierarchy; (2) Co-planning, where local and cloud LLMs collaboratively identify the current sub-task; and (3) Co-decision-making, where the local LLM ranks relevant UI blocks, and the cloud LLM selects specific UI elements within the top-ranked block. CORE further introduces a multi-round accumulation mechanism to mitigate local misjudgment or limited context. Experiments across diverse mobile apps and tasks show that CORE reduces UI exposure by up to 55.6% while maintaining task success rates slightly below cloud-only agents, effectively mitigating unnecessary privacy exposure to the cloud. The code is available at this https URL.

Title: DeceptionBench: A Comprehensive Benchmark for AI Deception Behaviors in Real-world Scenarios

Authors: Yao Huang, Yitong Sun, Yichi Zhang, Ruochen Zhang, Yinpeng Dong, Xingxing Wei
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.15501
Pdf URL: https://arxiv.org/pdf/2510.15501
Copy Paste: [[2510.15501]] DeceptionBench: A Comprehensive Benchmark for AI Deception Behaviors in Real-world Scenarios(https://arxiv.org/abs/2510.15501)
Keywords: language model, llm
Abstract: Despite the remarkable advances of Large Language Models (LLMs) across diverse cognitive tasks, the rapid enhancement of these capabilities also introduces emergent deceptive behaviors that may induce severe risks in high-stakes deployments. More critically, the characterization of deception across realistic real-world scenarios remains underexplored. To bridge this gap, we establish DeceptionBench, the first benchmark that systematically evaluates how deceptive tendencies manifest across different societal domains, what their intrinsic behavioral patterns are, and how extrinsic factors affect them. Specifically, on the static count, the benchmark encompasses 150 meticulously designed scenarios in five domains, i.e., Economy, Healthcare, Education, Social Interaction, and Entertainment, with over 1,000 samples, providing sufficient empirical foundations for deception analysis. On the intrinsic dimension, we explore whether models exhibit self-interested egoistic tendencies or sycophantic behaviors that prioritize user appeasement. On the extrinsic dimension, we investigate how contextual factors modulate deceptive outputs under neutral conditions, reward-based incentivization, and coercive pressures. Moreover, we incorporate sustained multi-turn interaction loops to construct a more realistic simulation of real-world feedback dynamics. Extensive experiments across LLMs and Large Reasoning Models (LRMs) reveal critical vulnerabilities, particularly amplified deception under reinforcement dynamics, demonstrating that current models lack robust resistance to manipulative contextual cues and the urgent need for advanced safeguards against various deception behaviors. Code and resources are publicly available at this https URL.

Title: Temporal Referential Consistency: Do LLMs Favor Sequences Over Absolute Time References?

Authors: Ashutosh Bajpai, Tanmoy Chakraborty
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.15513
Pdf URL: https://arxiv.org/pdf/2510.15513
Copy Paste: [[2510.15513]] Temporal Referential Consistency: Do LLMs Favor Sequences Over Absolute Time References?(https://arxiv.org/abs/2510.15513)
Keywords: language model, llm
Abstract: The increasing acceptance of large language models (LLMs) as an alternative to knowledge sources marks a significant paradigm shift across various domains, including time-sensitive fields such as law, healthcare, and finance. To fulfill this expanded role, LLMs must not only be factually accurate but also demonstrate consistency across temporal dimensions, necessitating robust temporal reasoning capabilities. Despite this critical requirement, efforts to ensure temporal consistency in LLMs remain scarce, including a noticeable absence of endeavors aimed at evaluating or augmenting LLMs across temporal references in time-sensitive inquiries. In this paper, we seek to address this gap by introducing a novel benchmark entitled temporal referential consistency, accompanied by a resource, TEMP-ReCon, designed to benchmark a wide range of both open-source and closed-source LLMs with various linguistic contexts characterized by differing resource richness (including English, French, and Romanian). The findings emphasize that LLMs do exhibit insufficient temporal referential consistency. To address this, we propose UnTRaP, a reasoning path alignment-based model that aims to enhance the temporal referential consistency of LLMs. Our empirical experiments substantiate the efficacy of UnTRaP compared to several baseline models.

Title: From Characters to Tokens: Dynamic Grouping with Hierarchical BPE

Authors: Rares Dolga, Lucas Maystre, Tudor Berariu, David Barber
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.15517
Pdf URL: https://arxiv.org/pdf/2510.15517
Copy Paste: [[2510.15517]] From Characters to Tokens: Dynamic Grouping with Hierarchical BPE(https://arxiv.org/abs/2510.15517)
Keywords: language model
Abstract: Subword tokenization methods like Byte Pair Encoding (BPE) are widely used in large language models due to their balance of vocabulary compactness and representational power. However, they suffer from inefficiencies in representing rare words and require large embedding matrices. Character-level models address these issues but introduce performance bottlenecks, particularly in Transformer-based architectures. Recent hierarchical models attempt to merge the benefits of both paradigms by grouping characters into patches, but existing patching strategies either rely on whitespace, limiting applicability to certain languages, or require auxiliary models that introduce new dependencies. In this paper, we propose a dynamic character grouping method that leverages the structure of existing BPE tokenization without requiring additional models. By appending explicit end-of-patch markers to BPE tokens and introducing a second-level BPE compression stage to control patch granularity, our method offers efficient, flexible, and language-agnostic representations. Empirical results demonstrate that our approach matches or exceeds the performance of dynamic entropy- and whitespace-based patching strategies, while maintaining a compact vocabulary.

Title: Latent Reasoning in LLMs as a Vocabulary-Space Superposition

Authors: Jingcheng Deng, Liang Pang, Zihao Wei, Shichen Xu, Zenghao Duan, Kun Xu, Yang Song, Huawei Shen, Xueqi Cheng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.15522
Pdf URL: https://arxiv.org/pdf/2510.15522
Copy Paste: [[2510.15522]] Latent Reasoning in LLMs as a Vocabulary-Space Superposition(https://arxiv.org/abs/2510.15522)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Large language models (LLMs) demonstrate strong reasoning abilities with chain-of-thought prompting, but explicit reasoning introduces substantial computational overhead. Recent work on latent reasoning reduces this cost by reasoning in latent space without explicit supervision, but performance drops significantly. Our preliminary experiments suggest that this degradation stems from the unstructured latent space, which makes fitting latent tokens difficult. To address this, we restrict the latent space to the column space of the LLM vocabulary, treating latent reasoning as a superposition over vocabulary probabilities. Once latent reasoning concludes, it collapses into an eigenstate of explicit reasoning to yield the final answer. Based on this idea, we propose Latent-SFT, a two-stage learning framework. In the first stage, we design two specialized attention masks to guide the Latent Token Encoder in generating latent tokens, allowing the LLM to produce the correct answer conditioned on them. In the second stage, the Latent Token Encoder is discarded, and the LLM is directly trained to generate these latent tokens autonomously for latent reasoning, optimized with KL and CE losses. Latent-SFT sets a new state of the art on GSM8k, matching explicit SFT performance while cutting reasoning chains by up to 4 times and outperforming prior latent methods. On Math500 and AIME24, lexical probability-based latent reasoning also clearly surpasses hidden-state-based approaches. Our metrics of effective compression rate and effective global parallelism further show that latent reasoning is both the compression of a single path and the superposition of multiple paths.

Title: MCA: Modality Composition Awareness for Robust Composed Multimodal Retrieval

Authors: Qiyu Wu, Shuyang Cui, Satoshi Hayakawa, Wei-Yao Wang, Hiromi Wakaki, Yuki Mitsufuji
Subjects: cs.CL, cs.AI, cs.IR, cs.MM
Abstract URL: https://arxiv.org/abs/2510.15543
Pdf URL: https://arxiv.org/pdf/2510.15543
Copy Paste: [[2510.15543]] MCA: Modality Composition Awareness for Robust Composed Multimodal Retrieval(https://arxiv.org/abs/2510.15543)
Keywords: language model, llm
Abstract: Multimodal retrieval, which seeks to retrieve relevant content across modalities such as text or image, supports applications from AI search to content production. Despite the success of separate-encoder approaches like CLIP, which align modality-specific embeddings with contrastive learning, recent multimodal large language models (MLLMs) enable a unified encoder that directly processes composed inputs. While such unified encoders are flexible and advanced, we find that those trained with conventional contrastive learning are prone to learning modality shortcuts, leading to poor robustness under distribution shifts. We propose a modality composition awareness framework to mitigate this issue. Concretely, a preference loss enforces multimodal embeddings to outperform their unimodal counterparts, while a composition regularization objective aligns multimodal embeddings with prototypes composed from their unimodal parts. These objectives explicitly model structural relationships between the composed representation and its unimodal counterparts. Experiments on various benchmarks show gains in out-of-distribution retrieval, highlighting modality composition awareness as an effective principle for robust composed multimodal retrieval when utilizing MLLMs as the unified encoder.

Title: TokenTiming: A Dynamic Alignment Method for Universal Speculative Decoding Model Pairs

Authors: Sibo Xiao, Jinyuan Fu, Zhongle Xie, Lidan Shou
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.15545
Pdf URL: https://arxiv.org/pdf/2510.15545
Copy Paste: [[2510.15545]] TokenTiming: A Dynamic Alignment Method for Universal Speculative Decoding Model Pairs(https://arxiv.org/abs/2510.15545)
Keywords: language model, llm
Abstract: Accelerating the inference of large language models (LLMs) has been a critical challenge in generative AI. Speculative decoding (SD) substantially improves LLM inference efficiency. However, its utility is limited by a fundamental constraint: the draft and target models must share the same vocabulary, thus limiting the pool of available draft models and often necessitating the training of a new model from scratch. Inspired by Dynamic Time Warping (DTW), a classic algorithm for aligning time series, we propose the algorithm TokenTiming for universal speculative decoding. It operates by re-encoding the draft token sequence to get a new target token sequence, and then uses DTW to build a mapping to transfer the probability distributions for speculative sampling. Benefiting from this, our method accommodates mismatched vocabularies and works with any off-the-shelf models without retraining or modification. We conduct comprehensive experiments on various tasks, demonstrating a 1.57x speedup. This work enables a universal approach for draft model selection, making SD a more versatile and practical tool for LLM acceleration.
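To make the DTW step concrete, here is a minimal sketch of classic dynamic time warping applied to two tokenizations of the same text. It covers only the alignment, not the transfer of probability distributions, and the local cost function (exact match, substring overlap, or mismatch) is an assumption chosen for illustration rather than the cost TokenTiming uses.

```python
from typing import List, Tuple


def token_cost(a: str, b: str) -> float:
    """Hypothetical local cost: 0 for identical tokens, 0.5 if one token
    contains the other, 1 otherwise."""
    if a == b:
        return 0.0
    return 0.5 if (a in b or b in a) else 1.0


def dtw_align(draft: List[str], target: List[str]) -> List[Tuple[int, int]]:
    """Return the DTW warping path as (draft_index, target_index) pairs."""
    n, m = len(draft), len(target)
    inf = float("inf")
    # acc[i][j] = minimal accumulated cost aligning draft[:i] with target[:j].
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1])
            acc[i][j] = token_cost(draft[i - 1], target[j - 1]) + best_prev

    # Backtrack from the end, preferring the diagonal move on ties.
    path: List[Tuple[int, int]] = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        moves = [(acc[i - 1][j - 1], i - 1, j - 1),
                 (acc[i - 1][j], i - 1, j),
                 (acc[i][j - 1], i, j - 1)]
        _, i, j = min(moves, key=lambda move: move[0])
    path.reverse()
    return path


if __name__ == "__main__":
    draft_tokens = ["un", "believ", "able"]   # draft-model tokenization
    target_tokens = ["unbeliev", "able"]      # target-model tokenization
    print(dtw_align(draft_tokens, target_tokens))
```

In this toy case the two draft pieces "un" and "believ" both map to the single target token "unbeliev", which is the kind of many-to-one mapping a mismatched vocabulary produces.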

Title: Rethinking Cross-lingual Gaps from a Statistical Viewpoint

Authors: Vihari Piratla, Purvam Jain, Darshan Singh, Partha Talukdar, Trevor Cohn
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.15551
Pdf URL: https://arxiv.org/pdf/2510.15551
Copy Paste: [[2510.15551]] Rethinking Cross-lingual Gaps from a Statistical Viewpoint(https://arxiv.org/abs/2510.15551)
Keywords: language model, llm, prompt
Abstract: Any piece of knowledge is usually expressed in one or a handful of natural languages on the web or in any large corpus. Large Language Models (LLMs) act as a bridge by acquiring knowledge from a source language and making it accessible when queried from target languages. Prior research has pointed to a cross-lingual gap, viz., a drop in accuracy when the knowledge is queried in a target language compared to when the query is in the source language. Existing research has attributed the cross-lingual gap to divergence between latent representations in the source and target languages. In this work, we take an alternative view and hypothesize that the variance of responses in the target language is the main cause of this gap. For the first time, we formalize the cross-lingual gap in terms of bias-variance decomposition. We present extensive experimental evidence that supports the proposed formulation and hypothesis. We then reinforce our hypothesis through multiple inference-time interventions that control the variance and reduce the cross-lingual gap. We demonstrate a simple prompt instruction that reduces the response variance and improves target accuracy by 20-25% across different models.
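For orientation, the textbook bias-variance decomposition for squared error is reproduced below. The abstract does not state the paper's own formalization, so the symbols are assumptions for this sketch: s is the (random) correctness score of one sampled response and s^* is the ideal score. The second identity shows why interventions that average or stabilize responses can shrink the variance term without touching the bias.

```latex
% s: score of one sampled response (random); s^{*}: ideal score (fixed)
\mathbb{E}\big[(s - s^{*})^{2}\big]
  = \underbrace{\big(\mathbb{E}[s] - s^{*}\big)^{2}}_{\text{bias}^{2}}
  + \underbrace{\operatorname{Var}(s)}_{\text{variance}},
\qquad
\operatorname{Var}\Big(\tfrac{1}{k}\sum_{i=1}^{k} s_i\Big)
  = \frac{\operatorname{Var}(s)}{k}
\quad \text{for i.i.d. } s_1, \dots, s_k .
```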

Title: Think Parallax: Solving Multi-Hop Problems via Multi-View Knowledge-Graph-Based Retrieval-Augmented Generation

Authors: Jinliang Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.15552
Pdf URL: https://arxiv.org/pdf/2510.15552
Copy Paste: [[2510.15552]] Think Parallax: Solving Multi-Hop Problems via Multi-View Knowledge-Graph-Based Retrieval-Augmented Generation(https://arxiv.org/abs/2510.15552)
Keywords: language model, llm, hallucination, retrieval-augmented generation
Abstract: Large language models (LLMs) excel at language understanding but often hallucinate and struggle with multi-hop reasoning. Knowledge-graph-based retrieval-augmented generation (KG-RAG) offers grounding, yet most methods rely on flat embeddings and noisy path exploration. We propose ParallaxRAG, a framework that symmetrically decouples queries and graph triples into multi-view spaces, enabling a robust retrieval architecture that explicitly enforces head diversity while constraining weakly related paths. Central to our approach is the observation that different attention heads specialize in semantic relations at distinct reasoning stages, contributing to different hops of the reasoning chain. This specialization allows ParallaxRAG to construct cleaner subgraphs and guide LLMs through grounded, step-wise reasoning. Experiments on WebQSP and CWQ, under our unified, reproducible setup (BGE-M3 + Llama3.1-8B), demonstrate competitive retrieval and QA performance, alongside reduced hallucination and good generalization. Our results highlight multi-view head specialization as a principled direction for knowledge-grounded multi-hop reasoning. Our implementation will be released as soon as the paper is accepted.

Title: KITE: A Benchmark for Evaluating Korean Instruction-Following Abilities in Large Language Models

Authors: Dongjun Kim, Chanhee Park, Chanjun Park, Heuiseok Lim
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.15558
Pdf URL: https://arxiv.org/pdf/2510.15558
Copy Paste: [[2510.15558]] KITE: A Benchmark for Evaluating Korean Instruction-Following Abilities in Large Language Models(https://arxiv.org/abs/2510.15558)
Keywords: language model, llm, agent
Abstract: The instruction-following capabilities of large language models (LLMs) are pivotal for numerous applications, from conversational agents to complex reasoning systems. However, current evaluations predominantly focus on English models, neglecting the linguistic and cultural nuances of other languages. Specifically, Korean, with its distinct syntax, rich morphological features, honorific system, and dual numbering systems, lacks a dedicated benchmark for assessing open-ended instruction-following capabilities. To address this gap, we introduce the Korean Instruction-following Task Evaluation (KITE), a comprehensive benchmark designed to evaluate both general and Korean-specific instructions. Unlike existing Korean benchmarks that focus mainly on factual knowledge or multiple-choice testing, KITE directly targets diverse, open-ended instruction-following tasks. Our evaluation pipeline combines automated metrics with human assessments, revealing performance disparities across models and providing deeper insights into their strengths and weaknesses. By publicly releasing the KITE dataset and code, we aim to foster further research on culturally and linguistically inclusive LLM development and inspire similar endeavors for other underrepresented languages.

Title: Finetuning LLMs for EvaCun 2025 token prediction shared task

Authors: Josef Jon, Ondřej Bojar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.15561
Pdf URL: https://arxiv.org/pdf/2510.15561
Copy Paste: [[2510.15561]] Finetuning LLMs for EvaCun 2025 token prediction shared task(https://arxiv.org/abs/2510.15561)
Keywords: llm, prompt
Abstract: In this paper, we present our submission for the token prediction task of EvaCun 2025. Our systems are based on LLMs (Command-R, Mistral, and Aya Expanse) fine-tuned on the task data provided by the organizers. As we only possess a very superficial knowledge of the subject field and the languages of the task, we simply used the training data without any task-specific adjustments, preprocessing, or filtering. We compare 3 different approaches (based on 3 different prompts) of obtaining the predictions, and we evaluate them on a held-out part of the data.

Title: HypoSpace: Evaluating LLM Creativity as Set-Valued Hypothesis Generators under Underdetermination

Authors: Tingting Chen, Beibei Lin, Zifeng Yuan, Qiran Zou, Hongyu He, Yew-Soon Ong, Anirudh Goyal, Dianbo Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.15614
Pdf URL: https://arxiv.org/pdf/2510.15614
Copy Paste: [[2510.15614]] HypoSpace: Evaluating LLM Creativity as Set-Valued Hypothesis Generators under Underdetermination(https://arxiv.org/abs/2510.15614)
Keywords: language model, llm
Abstract: As language models are increasingly used in scientific workflows, evaluating their ability to propose sets of explanations (not just a single correct answer) becomes critical. Many scientific problems are underdetermined: multiple, mechanistically distinct hypotheses are consistent with the same observations. We introduce HypoSpace, a diagnostic suite that treats LLMs as samplers of finite hypothesis sets and measures three complementary indicators: Validity (precision of proposals consistent with observations), Uniqueness (non-redundancy among proposals), and Recovery (coverage of the enumerated admissible set). We instantiate HypoSpace in three structured domains with deterministic validators and exactly enumerated hypothesis spaces: (i) causal graphs from perturbations, (ii) gravity-constrained 3D voxel reconstruction from top-down projections, and (iii) Boolean genetic interactions. Across instruction-tuned and reasoning-focused models, Validity often remains high while Uniqueness and Recovery degrade as the admissible space grows, revealing mode collapse that is invisible to correctness-only metrics. HypoSpace offers a controlled probe, rather than a leaderboard, for methods that explicitly explore and cover admissible explanation spaces. Code is available at: this https URL.
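The three indicators can be read as simple set statistics over a batch of sampled hypotheses. The sketch below is one plausible reading of the parenthetical definitions in the abstract. The exact formulas, the function name `hypospace_scores`, and the toy integer "hypotheses" are assumptions, not the benchmark's released code.

```python
from typing import Callable, Dict, Hashable, Sequence, Set


def hypospace_scores(proposals: Sequence[Hashable],
                     is_valid: Callable[[Hashable], bool],
                     admissible: Set[Hashable]) -> Dict[str, float]:
    """Compute Validity, Uniqueness, and Recovery for one batch of proposals.

    validity   = fraction of proposals accepted by the validator
    uniqueness = fraction of proposals that are distinct
    recovery   = fraction of the admissible set hit by a valid proposal
    """
    n = len(proposals)
    if n == 0 or not admissible:
        return {"validity": 0.0, "uniqueness": 0.0, "recovery": 0.0}
    valid = [p for p in proposals if is_valid(p)]
    return {
        "validity": len(valid) / n,
        "uniqueness": len(set(proposals)) / n,
        "recovery": len(set(valid) & admissible) / len(admissible),
    }


if __name__ == "__main__":
    admissible_set = {1, 2, 3, 4}      # enumerated admissible hypotheses
    sampled = [1, 1, 2, 9]             # a model's proposals (9 is invalid)
    print(hypospace_scores(sampled, lambda h: h in admissible_set, admissible_set))
```

In the toy batch most proposals are valid, yet only half of the admissible set is recovered, which mirrors the mode-collapse pattern the abstract describes: Validity alone would hide the missing coverage.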

Title: Leveraging LLMs for Context-Aware Implicit Textual and Multimodal Hate Speech Detection

Authors: Joshua Wolfe Brook, Ilia Markov
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.15685
Pdf URL: https://arxiv.org/pdf/2510.15685
Copy Paste: [[2510.15685]] Leveraging LLMs for Context-Aware Implicit Textual and Multimodal Hate Speech Detection(https://arxiv.org/abs/2510.15685)
Keywords: language model, llm, prompt
Abstract: This research introduces a novel approach to textual and multimodal Hate Speech Detection (HSD), using Large Language Models (LLMs) as dynamic knowledge bases to generate background context and incorporate it into the input of HSD classifiers. Two context generation strategies are examined: one focused on named entities and the other on full-text prompting. Four methods of incorporating context into the classifier input are compared: text concatenation, embedding concatenation, a hierarchical transformer-based fusion, and LLM-driven text enhancement. Experiments are conducted on the textual Latent Hatred dataset of implicit hate speech and applied in a multimodal setting on the MAMI dataset of misogynous memes. Results suggest that both the contextual information and the method by which it is incorporated are key, with gains of up to 3 and 6 F1 points on textual and multimodal setups respectively, from a zero-context baseline to the highest-performing system, based on embedding concatenation.

Title: Attention Sinks in Diffusion Language Models

Authors: Maximo Eduardo Rulli, Simone Petruzzi, Edoardo Michielon, Fabrizio Silvestri, Simone Scardapane, Alessio Devoto
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.15731
Pdf URL: https://arxiv.org/pdf/2510.15731
Copy Paste: [[2510.15731]] Attention Sinks in Diffusion Language Models(https://arxiv.org/abs/2510.15731)
Keywords: language model
Abstract: Masked Diffusion Language Models (DLMs) have recently emerged as a promising alternative to traditional Autoregressive Models (ARMs). DLMs employ transformer encoders with bidirectional attention, enabling parallel token generation while maintaining competitive performance. Although their efficiency and effectiveness have been extensively studied, the internal mechanisms that govern DLMs remain largely unexplored. In this work, we conduct an empirical analysis of DLM attention patterns, focusing on the attention sinking phenomenon, an effect previously observed in various transformer-based architectures. Our findings reveal that DLMs also exhibit attention sinks, but with distinct characteristics. First, unlike in ARMs, the sink positions in DLMs tend to shift throughout the generation process, displaying a dynamic behaviour. Second, while ARMs are highly sensitive to the removal of attention sinks, DLMs remain robust: masking sinks leads to only a minor degradation in performance. These results provide new insights into the inner workings of diffusion-based language models and highlight fundamental differences in how they allocate and utilize attention compared to autoregressive models.

Title: LLMs Judge Themselves: A Game-Theoretic Framework for Human-Aligned Evaluation

Authors: Gao Yang, Yuhang Liu, Siyu Miao, Xinyue Liang, Zhengyang Liu, Heyan Huang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.15746
Pdf URL: https://arxiv.org/pdf/2510.15746
Copy Paste: [[2510.15746]] LLMs Judge Themselves: A Game-Theoretic Framework for Human-Aligned Evaluation(https://arxiv.org/abs/2510.15746)
Keywords: language model, llm
Abstract: Ideal or real - that is the question. In this work, we explore whether principles from game theory can be effectively applied to the evaluation of large language models (LLMs). This inquiry is motivated by the growing inadequacy of conventional evaluation practices, which often rely on fixed-format tasks with reference answers and struggle to capture the nuanced, subjective, and open-ended nature of modern LLM behavior. To address these challenges, we propose a novel alternative: automatic mutual evaluation, where LLMs assess each other's output through self-play and peer review. These peer assessments are then systematically compared with human voting behavior to evaluate their alignment with human judgment. Our framework incorporates game-theoretic voting algorithms to aggregate peer reviews, enabling a principled investigation into whether model-generated rankings reflect human preferences. Empirical results reveal both convergences and divergences between theoretical predictions and human evaluations, offering valuable insights into the promises and limitations of mutual evaluation. To the best of our knowledge, this is the first work to jointly integrate mutual evaluation, game-theoretic aggregation, and human-grounded validation for evaluating the capabilities of LLMs.

Title: On Non-interactive Evaluation of Animal Communication Translators

Authors: Orr Paradise, David F. Gruber, Adam Tauman Kalai
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2510.15768
Pdf URL: https://arxiv.org/pdf/2510.15768
Copy Paste: [[2510.15768]] On Non-interactive Evaluation of Animal Communication Translators(https://arxiv.org/abs/2510.15768)
Keywords: hallucination
Abstract: If you had an AI Whale-to-English translator, how could you validate whether or not it is working? Does one need to interact with the animals or rely on grounded observations such as temperature? We provide theoretical and proof-of-concept experimental evidence suggesting that interaction and even observations may not be necessary for sufficiently complex languages. One may be able to evaluate translators solely by their English outputs, offering potential advantages in terms of safety, ethics, and cost. This is an instance of machine translation quality evaluation (MTQE) without any reference translations available. A key challenge is identifying "hallucinations," false translations which may appear fluent and plausible. We propose using segment-by-segment translation together with the classic NLP shuffle test to evaluate translators. The idea is to translate animal communication, turn by turn, and evaluate how often the resulting translations make more sense in order than permuted. Proof-of-concept experiments on data-scarce human languages and constructed languages demonstrate the potential utility of this evaluation methodology. These human-language experiments serve solely to validate our reference-free metric under data scarcity. The metric is found to correlate highly with a standard evaluation based on reference translations, which are available in our experiments. We also perform a theoretical analysis suggesting that interaction may be neither necessary nor efficient in the early stages of learning to translate.
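The shuffle test described above can be sketched as follows: score the translated segments in their original order, score random permutations of the same segments, and report how often the original order wins. The `coherence` callable is a placeholder assumption. In practice it might be a language-model likelihood, but the abstract does not specify the scorer, and the toy scorer in the demo exists only to make the example runnable.

```python
import random
from typing import Callable, List, Sequence


def shuffle_test(segments: Sequence[str],
                 coherence: Callable[[List[str]], float],
                 n_permutations: int = 20,
                 seed: int = 0) -> float:
    """Fraction of random permutations that score strictly lower than the
    original segment order under the supplied coherence function."""
    rng = random.Random(seed)
    original = list(segments)
    original_score = coherence(original)
    wins = 0
    trials = 0
    for _ in range(n_permutations):
        permuted = list(original)
        rng.shuffle(permuted)
        if permuted == original:
            continue  # the identity permutation carries no information
        trials += 1
        if original_score > coherence(permuted):
            wins += 1
    return wins / trials if trials else 0.0


def toy_coherence(segs: List[str]) -> float:
    """Toy stand-in scorer: counts adjacent pairs whose leading turn
    number increases."""
    turns = [int(s.split()[0]) for s in segs]
    return float(sum(1 for a, b in zip(turns, turns[1:]) if a < b))


if __name__ == "__main__":
    translated = ["1 hello", "2 where are you", "3 near the reef", "4 coming"]
    print(shuffle_test(translated, toy_coherence))
```

A translator that emits fluent but unrelated text for every segment would be expected to score near chance on such a test, since reordering its outputs would not make them any less coherent.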

Title: Emergence of Linear Truth Encodings in Language Models

Authors: Shauli Ravfogel, Gilad Yehudai, Tal Linzen, Joan Bruna, Alberto Bietti
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.15804
Pdf URL: https://arxiv.org/pdf/2510.15804
Copy Paste: [[2510.15804]] Emergence of Linear Truth Encodings in Language Models(https://arxiv.org/abs/2510.15804)
Keywords: language model
Abstract: Recent probing studies reveal that large language models exhibit linear subspaces that separate true from false statements, yet the mechanism behind their emergence is unclear. We introduce a transparent, one-layer transformer toy model that reproduces such truth subspaces end-to-end and exposes one concrete route by which they can arise. We study one simple setting in which truth encoding can emerge: a data distribution where factual statements co-occur with other factual statements (and vice-versa), encouraging the model to learn this distinction in order to lower the LM loss on future tokens. We corroborate this pattern with experiments in pretrained language models. Finally, in the toy setting we observe a two-phase learning dynamic: networks first memorize individual factual associations in a few steps, then -- over a longer horizon -- learn to linearly separate true from false, which in turn lowers language-modeling loss. Together, these results provide both a mechanistic demonstration and an empirical motivation for how and why linear truth representations can emerge in language models.

Title: Paper2Web: Let's Make Your Paper Alive!

Authors: Yuhang Chen, Tianpeng Lv, Siyi Zhang, Yixiang Yin, Yao Wan, Philip S. Yu, Dongping Chen
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2510.15842
Pdf URL: https://arxiv.org/pdf/2510.15842
Copy Paste: [[2510.15842]] Paper2Web: Let's Make Your Paper Alive!(https://arxiv.org/abs/2510.15842)
Keywords: language model, llm, agent
Abstract: Academic project websites can more effectively disseminate research when they clearly present core content and enable intuitive navigation and interaction. However, current approaches such as direct Large Language Model (LLM) generation, templates, or direct HTML conversion struggle to produce layout-aware, interactive sites, and a comprehensive evaluation suite for this task has been lacking. In this paper, we introduce Paper2Web, a benchmark dataset and multi-dimensional evaluation framework for assessing academic webpage generation. It incorporates rule-based metrics like Connectivity, Completeness and human-verified LLM-as-a-Judge (covering interactivity, aesthetics, and informativeness), and PaperQuiz, which measures paper-level knowledge retention. We further present PWAgent, an autonomous pipeline that converts scientific papers into interactive and multimedia-rich academic homepages. The agent iteratively refines both content and layout through MCP tools that enhance emphasis, balance, and presentation quality. Our experiments show that PWAgent consistently outperforms end-to-end baselines like template-based webpages and arXiv/alphaXiv versions by a large margin while maintaining low cost, achieving the Pareto-front in academic webpage generation.

Title: SpeechLLMs for Large-scale Contextualized Zero-shot Slot Filling

Authors: Kadri Hacioglu, Manjunath K E, Andreas Stolcke
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2510.15851
Pdf URL: https://arxiv.org/pdf/2510.15851
Copy Paste: [[2510.15851]] SpeechLLMs for Large-scale Contextualized Zero-shot Slot Filling(https://arxiv.org/abs/2510.15851)
Keywords: language model, llm
Abstract: Slot filling is a crucial subtask in spoken language understanding (SLU), traditionally implemented as a cascade of speech recognition followed by one or more natural language understanding (NLU) components. The recent advent of speech-based large language models (speechLLMs), which integrate speech and textual foundation models, has opened new avenues for achieving speech understanding tasks in a more unified, generative, and instruction-following manner while promising data and compute efficiency with zero-shot abilities, generalizing to unseen slot labels. We address the slot-filling task by creating an empirical upper bound for the task, identifying performance, robustness, and generalization gaps, and proposing improvements to the training data, architecture, and training strategies to narrow the gap with the upper bound result. We show that each of these measures improves performance substantially, while highlighting practical challenges and providing empirical guidance and insights for harnessing these emerging models.

Title: InfiMed-ORBIT: Aligning LLMs on Open-Ended Complex Tasks via Rubric-Based Incremental Training

Authors: Pengkai Wang, Qi Zuo, Pengwei Liu, Zhijie Sang, Congkai Xie, Hongxia Yang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.15859
Pdf URL: https://arxiv.org/pdf/2510.15859
Copy Paste: [[2510.15859]] InfiMed-ORBIT: Aligning LLMs on Open-Ended Complex Tasks via Rubric-Based Incremental Training(https://arxiv.org/abs/2510.15859)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have shown substantial advances through reinforcement learning (RL), particularly in domains where rewards can be programmatically verified, such as mathematics and code. In these areas, models benefit from a well-defined operational base guided by explicit rule-based objectives. However, this progress reveals a significant limitation: in open-ended domains where rewards are ambiguous, subjective, or context-dependent, such as creative writing, scientific reasoning, and notably medical consultation, robust reward functions are lacking, making these areas challenging for current RL strategies. To bridge this gap, we introduce ORBIT, an open-ended rubric-based incremental training framework specifically designed for high-stakes medical dialogue. ORBIT integrates synthetic dialogue generation with the dynamic creation of rubrics, employing these rubrics to direct an incremental RL process. In particular, this approach does not depend on external medical knowledge or manual rules, instead utilizing rubric-guided feedback to shape learning. When implemented on the Qwen3-4B-Instruct model, our method can greatly enhance its performance on the HealthBench-Hard benchmark from 7.0 to 27.2 using only 2k samples, thus achieving state-of-the-art results for models of this scale. Our analysis confirms that rubric-driven RL fosters consistent performance gains across diverse consultation scenarios, going beyond simple numerical improvements. These findings underscore rubric-based feedback as a scalable strategy for advancing LLMs in intricate, open-ended tasks.

Title: PolySkill: Learning Generalizable Skills Through Polymorphic Abstraction

Authors: Simon Yu, Gang Li, Weiyan Shi, Peng Qi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.15863
Pdf URL: https://arxiv.org/pdf/2510.15863
Copy Paste: [[2510.15863]] PolySkill: Learning Generalizable Skills Through Polymorphic Abstraction(https://arxiv.org/abs/2510.15863)
Keywords: language model, llm, agent
Abstract: Large language models (LLMs) are moving beyond static uses and are now powering agents that learn continually during their interaction with external environments. For example, agents can learn reusable skills while navigating web pages or toggling new tools. However, existing methods for skill learning often create skills that are over-specialized to a single website and fail to generalize. We introduce PolySkill, a new framework that enables agents to learn generalizable and compositional skills. The core idea, inspired by polymorphism in software engineering, is to decouple a skill's abstract goal (what it accomplishes) and its concrete implementation (how it is executed). Experiments show that our method (1) improves skill reuse by 1.7x on seen websites, (2) boosts success rates by up to 9.4% on Mind2Web and 13.9% on unseen websites while reducing steps by over 20%, and (3) in self-exploration settings without specified tasks, improves the quality of proposed tasks and enables agents to learn generalizable skills that work across different sites. By enabling the agent to identify and refine its own goals, PolySkill enhances the agent's ability to learn a better curriculum, leading to the acquisition of more generalizable skills compared to baseline methods. This work provides a practical path toward building agents capable of continual learning in adaptive environments. Our findings show that separating a skill's goal from its execution is a crucial step toward developing autonomous agents that can learn and generalize across the open web continuously.
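The polymorphism analogy can be illustrated with an abstract base class that fixes a skill's goal while site-specific subclasses supply the concrete steps. Everything in this sketch (the class names, the action strings, the two fictional shops) is hypothetical and only mirrors the goal/implementation split the abstract describes. It is not PolySkill's actual skill format.

```python
from abc import ABC, abstractmethod
from typing import List


class SearchSkill(ABC):
    """Abstract skill: states what is accomplished, not how."""

    goal = "find items matching a query"

    @abstractmethod
    def execute(self, query: str) -> List[str]:
        """Return the site-specific sequence of UI actions."""


class ShopASearch(SearchSkill):
    """Concrete implementation for a fictional site with a search box."""

    def execute(self, query: str) -> List[str]:
        return ["click #search-box", f"type {query!r}", "press Enter"]


class ShopBSearch(SearchSkill):
    """Concrete implementation for a fictional site with a search page."""

    def execute(self, query: str) -> List[str]:
        return ["open /search", f"fill input[name=q] with {query!r}",
                "click button.submit"]


def run(skill: SearchSkill, query: str) -> List[str]:
    """Caller depends only on the abstract goal, so skills are swappable."""
    return skill.execute(query)


if __name__ == "__main__":
    for skill in (ShopASearch(), ShopBSearch()):
        print(skill.goal, "->", run(skill, "laptop"))
```

Because the caller is written against `SearchSkill`, a plan that uses the skill on one site can be reused on another by swapping the implementation, which is the sense in which the goal generalizes while the execution stays site-specific.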