2025-09-17

Title: MTEB-NL and E5-NL: Embedding Benchmark and Models for Dutch

Authors: Nikolay Banar, Ehsan Lotfi, Jens Van Nooten, Cristina Arhiliuc, Marija Kliocaite, Walter Daelemans
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.12340
Pdf URL: https://arxiv.org/pdf/2509.12340
Copy Paste: [[2509.12340]] MTEB-NL and E5-NL: Embedding Benchmark and Models for Dutch(https://arxiv.org/abs/2509.12340)
Keywords: language model
Abstract: Recently, embedding resources, including models, benchmarks, and datasets, have been widely released to support a variety of languages. However, the Dutch language remains underrepresented, typically comprising only a small fraction of the published multilingual resources. To address this gap and encourage the further development of Dutch embeddings, we introduce new resources for their evaluation and generation. First, we introduce the Massive Text Embedding Benchmark for Dutch (MTEB-NL), which includes both existing Dutch datasets and newly created ones, covering a wide range of tasks. Second, we provide a training dataset compiled from available Dutch retrieval datasets, complemented with synthetic data generated by large language models to expand task coverage beyond retrieval. Finally, we release a series of E5-NL models: compact yet efficient embedding models that demonstrate strong performance across multiple tasks. We make our resources publicly available through the Hugging Face Hub and the MTEB package.

Title: MORABLES: A Benchmark for Assessing Abstract Moral Reasoning in LLMs with Fables

Authors: Matteo Marcuzzo, Alessandro Zangari, Andrea Albarelli, Jose Camacho-Collados, Mohammad Taher Pilehvar
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.12371
Pdf URL: https://arxiv.org/pdf/2509.12371
Copy Paste: [[2509.12371]] MORABLES: A Benchmark for Assessing Abstract Moral Reasoning in LLMs with Fables(https://arxiv.org/abs/2509.12371)
Keywords: llm
Abstract: As LLMs excel on standard reading comprehension benchmarks, attention is shifting toward evaluating their capacity for complex abstract reasoning and inference. Literature-based benchmarks, with their rich narrative and moral depth, provide a compelling framework for evaluating such deeper comprehension skills. Here, we present MORABLES, a human-verified benchmark built from fables and short stories drawn from historical literature. The main task is structured as multiple-choice questions targeting moral inference, with carefully crafted distractors that challenge models to go beyond shallow, extractive question answering. To further stress-test model robustness, we introduce adversarial variants designed to surface LLM vulnerabilities and shortcuts due to issues such as data contamination. Our findings show that, while larger models outperform smaller ones, they remain susceptible to adversarial manipulation and often rely on superficial patterns rather than true moral reasoning. This brittleness results in significant self-contradiction, with the best models refuting their own answers in roughly 20% of cases depending on the framing of the moral choice. Interestingly, reasoning-enhanced models fail to bridge this gap, suggesting that scale - not reasoning ability - is the primary driver of performance.

Title: LLM-as-a-Judge: Rapid Evaluation of Legal Document Recommendation for Retrieval-Augmented Generation

Authors: Anu Pradhan, Alexandra Ortan, Apurv Verma, Madhavan Seshadri
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.12382
Pdf URL: https://arxiv.org/pdf/2509.12382
Copy Paste: [[2509.12382]] LLM-as-a-Judge: Rapid Evaluation of Legal Document Recommendation for Retrieval-Augmented Generation(https://arxiv.org/abs/2509.12382)
Keywords: language model, llm, retrieval-augmented generation
Abstract: The evaluation bottleneck in recommendation systems has become particularly acute with the rise of Generative AI, where traditional metrics fall short of capturing nuanced quality dimensions that matter in specialized domains like legal research. Can we trust Large Language Models to serve as reliable judges of their own kind? This paper investigates LLM-as-a-Judge as a principled approach to evaluating Retrieval-Augmented Generation systems in legal contexts, where the stakes of recommendation quality are exceptionally high. We tackle two fundamental questions that determine practical viability: which inter-rater reliability metrics best capture the alignment between LLM and human assessments, and how do we conduct statistically sound comparisons between competing systems? Through systematic experimentation, we discover that traditional agreement metrics like Krippendorff's alpha can be misleading in the skewed distributions typical of AI system evaluations. Instead, Gwet's AC2 and rank correlation coefficients emerge as more robust indicators for judge selection, while the Wilcoxon Signed-Rank Test with Benjamini-Hochberg corrections provides the statistical rigor needed for reliable system comparisons. Our findings suggest a path toward scalable, cost-effective evaluation that maintains the precision demanded by legal applications, transforming what was once a human-intensive bottleneck into an automated, yet statistically principled, evaluation framework.
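The Benjamini-Hochberg correction named in the abstract can be illustrated with a minimal sketch. It assumes the p-values have already been computed (for example, one paired Wilcoxon signed-rank test per system comparison); the function name, the example p-values, and the `alpha` level are hypothetical and not taken from the paper.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per hypothesis: True if rejected at FDR level alpha."""
    m = len(p_values)
    # Indices of the hypotheses, sorted by ascending p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k with p_(k) <= (k / m) * alpha.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank
    # Reject every hypothesis whose rank is at most max_k.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_k:
            rejected[idx] = True
    return rejected


# Hypothetical p-values from four pairwise system comparisons.
p_values = [0.001, 0.04, 0.03, 0.2]
decisions = benjamini_hochberg(p_values)
print(decisions)
```

Here only the first comparison survives the correction, even though three of the raw p-values are below 0.05, which is the kind of false-discovery control the abstract refers to.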

Title: SENTRA: Selected-Next-Token Transformer for LLM Text Detection

Authors: Mitchell Plyler, Yilun Zhang, Alexander Tuzhilin, Saoud Khalifah, Sen Tian
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2509.12385
Pdf URL: https://arxiv.org/pdf/2509.12385
Copy Paste: [[2509.12385]] SENTRA: Selected-Next-Token Transformer for LLM Text Detection(https://arxiv.org/abs/2509.12385)
Keywords: llm
Abstract: LLMs are becoming increasingly capable and widespread. Consequently, the potential and reality of their misuse are also growing. In this work, we address the problem of detecting LLM-generated text that is not explicitly declared as such. We present a novel, general-purpose, and supervised LLM text detector, SElected-Next-Token tRAnsformer (SENTRA). SENTRA is a Transformer-based encoder leveraging selected-next-token-probability sequences and utilizing contrastive pre-training on large amounts of unlabeled data. Our experiments on three popular public datasets across 24 domains of text demonstrate that SENTRA is a general-purpose classifier that significantly outperforms popular baselines in the out-of-domain setting.

Title: MORQA: Benchmarking Evaluation Metrics for Medical Open-Ended Question Answering

Authors: Wen-wai Yim, Asma Ben Abacha, Zixuan Yu, Robert Doerning, Fei Xia, Meliha Yetisgen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.12405
Pdf URL: https://arxiv.org/pdf/2509.12405
Copy Paste: [[2509.12405]] MORQA: Benchmarking Evaluation Metrics for Medical Open-Ended Question Answering(https://arxiv.org/abs/2509.12405)
Keywords: language model, gpt, llm
Abstract: Evaluating natural language generation (NLG) systems in the medical domain presents unique challenges due to the critical demands for accuracy, relevance, and domain-specific expertise. Traditional automatic evaluation metrics, such as BLEU, ROUGE, and BERTScore, often fall short in distinguishing between high-quality outputs, especially given the open-ended nature of medical question answering (QA) tasks where multiple valid responses may exist. In this work, we introduce MORQA (Medical Open-Response QA), a new multilingual benchmark designed to assess the effectiveness of NLG evaluation metrics across three medical visual and text-based QA datasets in English and Chinese. Unlike prior resources, our datasets feature 2-4+ gold-standard answers authored by medical professionals, along with expert human ratings for three English and Chinese subsets. We benchmark both traditional metrics and large language model (LLM)-based evaluators, such as GPT-4 and Gemini, finding that LLM-based approaches significantly outperform traditional metrics in correlating with expert judgments. We further analyze factors driving this improvement, including LLMs' sensitivity to semantic nuances and robustness to variability among reference answers. Our results provide the first comprehensive, multilingual qualitative study of NLG evaluation in the medical domain, highlighting the need for human-aligned evaluation methods. All datasets and annotations will be publicly released to support future research.

Title: MedFact: Benchmarking the Fact-Checking Capabilities of Large Language Models on Chinese Medical Texts

Authors: Jiayi He, Yangmin Huang, Qianyun Du, Xiangying Zhou, Zhiyang He, Jiaxue Hu, Xiaodong Tao, Lixian Lai
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.12440
Pdf URL: https://arxiv.org/pdf/2509.12440
Copy Paste: [[2509.12440]] MedFact: Benchmarking the Fact-Checking Capabilities of Large Language Models on Chinese Medical Texts(https://arxiv.org/abs/2509.12440)
Keywords: language model, llm, agent
Abstract: The increasing deployment of Large Language Models (LLMs) in healthcare necessitates a rigorous evaluation of their factual reliability. However, existing benchmarks are often limited by narrow domains of data, failing to capture the complexity of real-world medical information. To address this critical gap, we introduce MedFact, a new and challenging benchmark for Chinese medical fact-checking. MedFact comprises 2,116 expert-annotated instances curated from diverse real-world texts, spanning 13 medical specialties, 8 fine-grained error types, 4 writing styles, and multiple difficulty levels. Its construction employs a hybrid AI-human framework where iterative expert feedback refines an AI-driven, multi-criteria filtering process, ensuring both high data quality and difficulty. We conduct a comprehensive evaluation of 20 leading LLMs, benchmarking their performance on veracity classification and error localization against a human expert baseline. Our results reveal that while models can often determine if a text contains an error, precisely localizing it remains a substantial challenge, with even top-performing models falling short of human performance. Furthermore, our analysis uncovers a frequent "over-criticism" phenomenon, a tendency for models to misidentify correct information as erroneous, which is exacerbated by advanced reasoning techniques such as multi-agent collaboration and inference-time scaling. By highlighting these critical challenges for deploying LLMs in medical applications, MedFact provides a robust resource to drive the development of more factually reliable and medically aware models.

Title: Topic Coverage-based Demonstration Retrieval for In-Context Learning

Authors: Wonbin Kweon, SeongKu Kang, Runchu Tian, Pengcheng Jiang, Jiawei Han, Hwanjo Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.12451
Pdf URL: https://arxiv.org/pdf/2509.12451
Copy Paste: [[2509.12451]] Topic Coverage-based Demonstration Retrieval for In-Context Learning(https://arxiv.org/abs/2509.12451)
Keywords: llm
Abstract: The effectiveness of in-context learning relies heavily on selecting demonstrations that provide all the necessary information for a given test input. To achieve this, it is crucial to identify and cover fine-grained knowledge requirements. However, prior methods often retrieve demonstrations based solely on embedding similarity or generation probability, resulting in irrelevant or redundant examples. In this paper, we propose TopicK, a topic coverage-based retrieval framework that selects demonstrations to comprehensively cover topic-level knowledge relevant to both the test input and the model. Specifically, TopicK estimates the topics required by the input and assesses the model's knowledge on those topics. TopicK then iteratively selects demonstrations that introduce previously uncovered required topics, in which the model exhibits low topical knowledge. We validate the effectiveness of TopicK through extensive experiments across various datasets and both open- and closed-source LLMs. Our source code is available at this https URL.
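The iterative selection step described in the abstract can be pictured as a greedy coverage loop. The sketch below is only an illustration under stated assumptions: the scoring rule (required-topic weight times one minus the model's knowledge of that topic), the function name, and all data values are hypothetical and are not the paper's actual formulation.

```python
def select_demonstrations(required, knowledge, demo_topics, k):
    """Greedily pick up to k demonstrations covering still-uncovered required topics.

    required:    topic -> how strongly the test input needs it (assumed weights)
    knowledge:   topic -> model's estimated knowledge in [0, 1] (assumed scores)
    demo_topics: demonstration id -> set of topics it covers
    """
    uncovered = set(required)
    candidates = {demo_id: set(topics) for demo_id, topics in demo_topics.items()}
    selected = []
    for _ in range(k):
        best_id, best_gain = None, 0.0
        for demo_id, topics in candidates.items():
            # Gain counts only topics not yet covered, weighted toward
            # topics the model knows least about.
            gain = sum(
                required[t] * (1.0 - knowledge.get(t, 0.0))
                for t in topics
                if t in uncovered
            )
            if gain > best_gain:
                best_id, best_gain = demo_id, gain
        if best_id is None:
            break  # no remaining demonstration adds uncovered topics
        selected.append(best_id)
        uncovered -= candidates.pop(best_id)
    return selected


# Hypothetical topic requirements, model knowledge, and demonstration pool.
required = {"algebra": 1.0, "geometry": 0.8, "units": 0.5}
knowledge = {"algebra": 0.9, "geometry": 0.2, "units": 0.4}
demo_topics = {
    "d1": {"algebra"},
    "d2": {"geometry", "units"},
    "d3": {"geometry"},
    "d4": {"units"},
}
selected = select_demonstrations(required, knowledge, demo_topics, k=2)
print(selected)
```

In this toy pool, "d2" is chosen first because it covers two topics the model knows poorly; "d3" and "d4" then add nothing new, so the second pick is "d1", which avoids the redundant examples that pure similarity-based retrieval can return.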

Title: Does Language Model Understand Language?

Authors: Suvojit Acharjee, Utathya Aich, Asfak Ali
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.12459
Pdf URL: https://arxiv.org/pdf/2509.12459
Copy Paste: [[2509.12459]] Does Language Model Understand Language?(https://arxiv.org/abs/2509.12459)
Keywords: language model
Abstract: Despite advances in natural language generation and understanding, LMs still struggle with fine-grained linguistic phenomena such as tense, negation, voice, and modality, which are central to effective human communication. In the context of the United Nations SDG 4, where linguistic clarity is critical, the deployment of LMs in educational technologies demands careful scrutiny. As LMs increasingly power applications like tutoring systems, automated grading, and translation, their alignment with human linguistic interpretation becomes essential for effective learning. In this study, we conduct an evaluation of SOTA language models across these challenging contexts in both English and Bengali. To ensure a structured assessment, we introduce the new Route for Evaluation of Cognitive Inference in Systematic Environments guidelines. Our proposed LUCID dataset, composed of carefully crafted sentence pairs in English and Bengali, specifically challenges these models on critical aspects of language comprehension, including negation, tense, and voice variations. We assess the performance of SOTA models including MISTRAL-SABA-24B, LLaMA-4-Scout-17B, LLaMA-3.3-70B, Gemma2-9B, and Compound-Beta using standard metrics like Pearson correlation, Spearman correlation, and Mean Absolute Error, as well as a novel, linguistically inspired metric, HCE accuracy. HCE accuracy measures how often model predictions fall within one standard deviation of the mean human rating, thus capturing human-like tolerance for variability in language interpretation. Our findings highlight Compound-Beta as the most balanced model, consistently achieving high correlations and low MAEs across diverse language conditions. It records the highest Pearson correlation in English and demonstrates robust performance on mixed-language data, indicating a strong alignment with human judgments in cross-lingual scenarios.
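The HCE accuracy definition in the abstract (the share of model predictions within one standard deviation of the mean human rating) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say whether a sample or population standard deviation is used, so the sample version is an assumption, and the function name and ratings are hypothetical.

```python
from statistics import mean, stdev


def hce_accuracy(model_preds, human_ratings):
    """Fraction of predictions within one standard deviation of the mean human rating.

    model_preds:   one model score per item
    human_ratings: one list of human scores per item (at least two per item)
    """
    hits = 0
    for pred, ratings in zip(model_preds, human_ratings):
        mu = mean(ratings)
        sigma = stdev(ratings)  # assumption: sample standard deviation
        if abs(pred - mu) <= sigma:
            hits += 1
    return hits / len(model_preds)


# Hypothetical ratings for three sentence pairs, three annotators each.
human_ratings = [[3, 4, 5], [1, 1, 2], [2, 4, 6]]
model_preds = [4.5, 3.0, 2.5]
score = hce_accuracy(model_preds, human_ratings)
print(round(score, 3))
```

Items where annotators disagree more (the third item here) tolerate a larger model deviation, which is how the metric reflects human-like variability.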

Title: Audited Reasoning Refinement: Fine-Tuning Language Models via LLM-Guided Step-Wise Evaluation and Correction

Authors: Sumanta Bhattacharyya, Sara Riaz, Pedram Rooshenas
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.12476
Pdf URL: https://arxiv.org/pdf/2509.12476
Copy Paste: [[2509.12476]] Audited Reasoning Refinement: Fine-Tuning Language Models via LLM-Guided Step-Wise Evaluation and Correction(https://arxiv.org/abs/2509.12476)
Keywords: language model, llm, hallucination, prompt
Abstract: Training a task-specific small reasoning model is challenging when direct human supervision or high-quality labels are scarce. However, LLMs with reasoning capabilities produce abundant intermediate reasoning traces that can be systematically refined to create effective supervision signals. We propose Reason-Refine-then-Align (R2tA), which turns refined model rationales into supervision for training task-specific reasoning models. Our method generates initial reasoning and responses from an open-source base model on task-specific inputs, then refines these traces, fixing hallucinations and inconsistencies, to form a high-fidelity dataset. We perform a two-stage alignment, supervised fine-tuning (SFT), followed by direct preference optimization (DPO) to calibrate the model's intermediate reasoning with human-validated conceptual preferences and then condition the final output on that aligned reasoning. As a case study, we apply R2tA to evaluate extended entity relationship diagrams (EERDs) in database system design, a structurally complex task where prompt-only methods miss or hallucinate errors. We curated a dataset of 600 EERD variants (train/test split of 450/150, respectively) with induced mistakes spanning 11 categories. Empirical evaluation suggests R2tA provides a practical, cost-effective path to scalable LLM adaptation in data-scarce domains, enabling reproducible AI tools for education and beyond.

Title: FunAudio-ASR Technical Report

Authors: Keyu An, Yanni Chen, Chong Deng, Changfeng Gao, Zhifu Gao, Bo Gong, Xiangang Li, Yabin Li, Xiang Lv, Yunjie Ji, Yiheng Jiang, Bin Ma, Haoneng Luo, Chongjia Ni, Zexu Pan, Yiping Peng, Zhendong Peng, Peiyao Wang, Hao Wang, Wen Wang, Wupeng Wang, Biao Tian, Zhentao Tan, Nan Yang, Bin Yuan, Jieping Ye, Jixing Yu, Qinglin Zhang, Kun Zou, Han Zhao, Shengkui Zhao, Jingren Zhou
Subjects: cs.CL, cs.AI, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2509.12508
Pdf URL: https://arxiv.org/pdf/2509.12508
Copy Paste: [[2509.12508]] FunAudio-ASR Technical Report(https://arxiv.org/abs/2509.12508)
Keywords: language model, llm, hallucination
Abstract: In recent years, automatic speech recognition (ASR) has witnessed transformative advancements driven by three complementary paradigms: data scaling, model size scaling, and deep integration with large language models (LLMs). However, LLMs are prone to hallucination, which can significantly degrade user experience in real-world ASR applications. In this paper, we present FunAudio-ASR, a large-scale, LLM-based ASR system that synergistically combines massive data, large model capacity, LLM integration, and reinforcement learning to achieve state-of-the-art performance across diverse and complex speech recognition scenarios. Moreover, FunAudio-ASR is specifically optimized for practical deployment, with enhancements in streaming capability, noise robustness, code-switching, hotword customization, and satisfying other real-world application requirements. Experimental results show that while most LLM-based ASR systems achieve strong performance on open-source benchmarks, they often underperform on real industry evaluation sets. Thanks to production-oriented optimizations, FunAudio-ASR achieves SOTA performance on real application datasets, demonstrating its effectiveness and robustness in practical settings.

Title: MAGIC-Enhanced Keyword Prompting for Zero-Shot Audio Captioning with CLIP Models

Authors: Vijay Govindarajan, Pratik Patel, Sahil Tripathi, Md Azizul Hoque, Gautam Siddharth Kashyap
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.12591
Pdf URL: https://arxiv.org/pdf/2509.12591
Copy Paste: [[2509.12591]] MAGIC-Enhanced Keyword Prompting for Zero-Shot Audio Captioning with CLIP Models(https://arxiv.org/abs/2509.12591)
Keywords: language model, llm, prompt
Abstract: Automated Audio Captioning (AAC) generates captions for audio clips but faces challenges due to limited datasets compared to image captioning. To overcome this, we propose a zero-shot AAC system that leverages pre-trained models, eliminating the need for extensive training. Our approach uses a pre-trained audio CLIP model to extract auditory features and generate a structured prompt, which guides a Large Language Model (LLM) in caption generation. Unlike traditional greedy decoding, our method refines token selection through the audio CLIP model, ensuring alignment with the audio content. Experimental results demonstrate a 35% improvement in NLG mean score (from 4.7 to 7.3) using MAGIC search with the WavCaps model. The performance is heavily influenced by the audio-text matching model and keyword selection, with optimal results achieved using a single keyword prompt, and a 50% performance drop when no keyword list is used.

Title: EconProver: Towards More Economical Test-Time Scaling for Automated Theorem Proving

Authors: Mukai Li, Linfeng Song, Zhenwen Liang, Jiahao Xu, Shansan Gong, Qi Liu, Haitao Mi, Dong Yu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.12603
Pdf URL: https://arxiv.org/pdf/2509.12603
Copy Paste: [[2509.12603]] EconProver: Towards More Economical Test-Time Scaling for Automated Theorem Proving(https://arxiv.org/abs/2509.12603)
Keywords: language model, llm, chain-of-thought
Abstract: Large Language Models (LLMs) have recently advanced the field of Automated Theorem Proving (ATP), attaining substantial performance gains through widely adopted test-time scaling strategies, notably reflective Chain-of-Thought (CoT) reasoning and increased sampling passes. However, they both introduce significant computational overhead for inference. Moreover, existing cost analyses typically regulate only the number of sampling passes, while neglecting the substantial disparities in sampling costs introduced by different scaling strategies. In this paper, we systematically compare the efficiency of different test-time scaling strategies for ATP models and demonstrate the inefficiency of the current state-of-the-art (SOTA) open-source approaches. We then investigate approaches to significantly reduce token usage and sample passes while maintaining the original performance. Specifically, we propose two complementary methods that can be integrated into a unified EconRL pipeline for amplified benefits: (1) a dynamic Chain-of-Thought (CoT) switching mechanism designed to mitigate unnecessary token consumption, and (2) diverse parallel-scaled reinforcement learning (RL) with trainable prefixes to enhance pass rates under constrained sampling passes. Experiments on miniF2F and ProofNet demonstrate that our EconProver achieves comparable performance to baseline methods with only 12% of the computational cost. This work provides actionable insights for deploying lightweight ATP models without sacrificing performance.

Title: PAC: Pronunciation-Aware Contextualized Large Language Model-based Automatic Speech Recognition

Authors: Li Fu, Yu Xin, Sunlu Zeng, Lu Fan, Youzheng Wu, Xiaodong He
Subjects: cs.CL, eess.AS
Abstract URL: https://arxiv.org/abs/2509.12647
Pdf URL: https://arxiv.org/pdf/2509.12647
Copy Paste: [[2509.12647]] PAC: Pronunciation-Aware Contextualized Large Language Model-based Automatic Speech Recognition(https://arxiv.org/abs/2509.12647)
Keywords: language model, llm
Abstract: This paper presents a Pronunciation-Aware Contextualized (PAC) framework to address two key challenges in Large Language Model (LLM)-based Automatic Speech Recognition (ASR) systems: effective pronunciation modeling and robust homophone discrimination. Both are essential for raw or long-tail word recognition. The proposed approach adopts a two-stage learning paradigm. First, we introduce a pronunciation-guided context learning method. It employs an interleaved grapheme-phoneme context modeling strategy that incorporates grapheme-only distractors, encouraging the model to leverage phonemic cues for accurate recognition. Then, we propose a pronunciation-discriminative reinforcement learning method with perturbed label sampling to further enhance the model's ability to distinguish contextualized homophones. Experimental results on the public English Librispeech and Mandarin AISHELL-1 datasets indicate that PAC: (1) reduces relative Word Error Rate (WER) by 30.2% and 53.8% compared to pre-trained LLM-based ASR models, and (2) achieves 31.8% and 60.5% relative reductions in biased WER for long-tail words compared to strong baselines, respectively.

Title: Don't Change My View: Ideological Bias Auditing in Large Language Models

Authors: Paul Kröger, Emilio Barkett
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.12652
Pdf URL: https://arxiv.org/pdf/2509.12652
Copy Paste: [[2509.12652]] Don't Change My View: Ideological Bias Auditing in Large Language Models(https://arxiv.org/abs/2509.12652)
Keywords: language model, llm, prompt
Abstract: As large language models (LLMs) become increasingly embedded in products used by millions, their outputs may influence individual beliefs and, cumulatively, shape public opinion. If the behavior of LLMs can be intentionally steered toward specific ideological positions, such as political or religious views, then those who control these systems could gain disproportionate influence over public discourse. Although it remains an open question whether LLMs can reliably be guided toward coherent ideological stances and whether such steering can be effectively prevented, a crucial first step is to develop methods for detecting when such steering attempts occur. In this work, we adapt a previously proposed statistical method to the new context of ideological bias auditing. Our approach carries over the model-agnostic design of the original framework, which does not require access to the internals of the language model. Instead, it identifies potential ideological steering by analyzing distributional shifts in model outputs across prompts that are thematically related to a chosen topic. This design makes the method particularly suitable for auditing proprietary black-box systems. We validate our approach through a series of experiments, demonstrating its practical applicability and its potential to support independent post hoc audits of LLM behavior.

Title: Mitigating Strategy Preference Bias in Emotional Support Conversation via Uncertainty Estimations

Authors: Yougen Zhou, Qin Chen, Ningning Zhou, Jie Zhou, Xingjiao Wu, Liang He
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.12661
Pdf URL: https://arxiv.org/pdf/2509.12661
Copy Paste: [[2509.12661]] Mitigating Strategy Preference Bias in Emotional Support Conversation via Uncertainty Estimations(https://arxiv.org/abs/2509.12661)
Keywords: language model, llm
Abstract: Emotional support conversation (ESC) aims to alleviate distress through empathetic dialogue, yet large language models (LLMs) face persistent challenges in delivering effective ESC due to low accuracy in strategy planning. Moreover, there is a considerable preference bias towards specific strategies. Prior methods using fine-tuned strategy planners have shown potential in reducing such bias, while the underlying causes of the preference bias in LLMs have not been well studied. To address these issues, we first reveal the fundamental causes of the bias by identifying the knowledge boundaries of LLMs in strategy planning. Then, we propose an approach to mitigate the bias by reinforcement learning with a dual reward function, which optimizes strategy planning via both accuracy and entropy-based confidence for each region according to the knowledge boundaries. Experiments on the ESConv and ExTES datasets with multiple LLM backbones show that our approach outperforms the baselines, confirming the effectiveness of our approach.
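
The entropy-based confidence mentioned in the abstract above can be illustrated with a minimal sketch. It assumes confidence is defined as one minus the normalized Shannon entropy of the model's distribution over support strategies; the function name `entropy_confidence` and this normalization are illustrative assumptions, not the authors' reward definition.

```python
import math

def entropy_confidence(probs):
    """Return 1 - normalized Shannon entropy of a distribution over strategies.

    1.0 means all mass on one strategy; 0.0 means a uniform distribution.
    This normalization is an assumption made for illustration.
    """
    k = len(probs)
    if k < 2:
        return 1.0
    total = sum(probs)
    norm = [p / total for p in probs]  # renormalize in case of rounding drift
    entropy = -sum(p * math.log(p) for p in norm if p > 0)
    return 1.0 - entropy / math.log(k)

# Hypothetical strategy distributions for two dialogue contexts.
peaked = entropy_confidence([0.85, 0.05, 0.05, 0.05])
flat = entropy_confidence([0.25, 0.25, 0.25, 0.25])
```

Under this assumption, a planner that always spreads probability evenly scores near zero confidence, while one that commits to a single strategy scores near one.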

Title: Chat-Driven Text Generation and Interaction for Person Retrieval

Authors: Zequn Xie, Chuxin Wang, Sihang Cai, Yeqiang Wang, Shulei Wang, Tao Jin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.12662
Pdf URL: https://arxiv.org/pdf/2509.12662
Copy Paste: [[2509.12662]] Chat-Driven Text Generation and Interaction for Person Retrieval(https://arxiv.org/abs/2509.12662)
Keywords: llm, chat
Abstract: Text-based person search (TBPS) enables the retrieval of person images from large-scale databases using natural language descriptions, offering critical value in surveillance applications. However, a major challenge lies in the labor-intensive process of obtaining high-quality textual annotations, which limits scalability and practical deployment. To address this, we introduce two complementary modules: Multi-Turn Text Generation (MTG) and Multi-Turn Text Interaction (MTI). MTG generates rich pseudo-labels through simulated dialogues with MLLMs, producing fine-grained and diverse visual descriptions without manual supervision. MTI refines user queries at inference time through dynamic, dialogue-based reasoning, enabling the system to interpret and resolve vague, incomplete, or ambiguous descriptions - characteristics often seen in real-world search scenarios. Together, MTG and MTI form a unified and annotation-free framework that significantly improves retrieval accuracy, robustness, and usability. Extensive evaluations demonstrate that our method achieves competitive or superior results while eliminating the need for manual captions, paving the way for scalable and practical deployment of TBPS systems.

Title: Towards Inclusive Toxic Content Moderation: Addressing Vulnerabilities to Adversarial Attacks in Toxicity Classifiers Tackling LLM-generated Content

Authors: Shaz Furniturewala, Arkaitz Zubiaga
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.12672
Pdf URL: https://arxiv.org/pdf/2509.12672
Copy Paste: [[2509.12672]] Towards Inclusive Toxic Content Moderation: Addressing Vulnerabilities to Adversarial Attacks in Toxicity Classifiers Tackling LLM-generated Content(https://arxiv.org/abs/2509.12672)
Keywords: language model, llm
Abstract: The volume of machine-generated content online has grown dramatically due to the widespread use of Large Language Models (LLMs), leading to new challenges for content moderation systems. Conventional content moderation classifiers, which are usually trained on text produced by humans, suffer from misclassifications due to LLM-generated text deviating from their training data and adversarial attacks that aim to avoid detection. Present-day defence tactics are reactive rather than proactive, since they rely on adversarial training or external detection models to identify attacks. In this work, we aim to identify the vulnerable components of toxicity classifiers that contribute to misclassification, proposing a novel strategy based on mechanistic interpretability techniques. Our study focuses on fine-tuned BERT and RoBERTa classifiers, testing on diverse datasets spanning a variety of minority groups. We use adversarial attacking techniques to identify vulnerable circuits. Finally, we suppress these vulnerable circuits, improving performance against adversarial attacks. We also provide demographic-level insights into these vulnerable circuits, exposing fairness and robustness gaps in model training. We find that models have distinct heads that are either crucial for performance or vulnerable to attack, and that suppressing the vulnerable heads improves performance on adversarial input. We also find that different heads are responsible for vulnerability across different demographic groups, which can inform more inclusive development of toxicity detection models.

Title: HistoryBankQA: Multilingual Temporal Question Answering on Historical Events

Authors: Biswadip Mandal, Anant Khandelwal, Manish Gupta
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.12720
Pdf URL: https://arxiv.org/pdf/2509.12720
Copy Paste: [[2509.12720]] HistoryBankQA: Multilingual Temporal Question Answering on Historical Events(https://arxiv.org/abs/2509.12720)
Keywords: language model, gpt, llm
Abstract: Temporal reasoning about historical events is a critical skill for NLP tasks like event extraction, historical entity linking, temporal question answering, timeline summarization, temporal event clustering and temporal natural language inference. Yet efforts on benchmarking temporal reasoning capabilities of large language models (LLMs) are rather limited. Existing temporal reasoning datasets are limited in scale, lack multilingual coverage and focus more on contemporary events. To address these limitations, we present HistoryBank, a multilingual database of 10M+ historical events extracted from Wikipedia timeline pages and article infoboxes. Our database provides unprecedented coverage in both historical depth and linguistic breadth with 10 languages. Additionally, we construct a comprehensive question answering benchmark for temporal reasoning across all languages. This benchmark covers a diverse set of 6 temporal QA reasoning tasks, and we evaluate a suite of popular language models (LLaMA-3-8B, Mistral-7B, Gemma-2-9b, Qwen3-8B, GPT4o) to assess their performance on these tasks. As expected, GPT4o performs best across all answer types and languages; Gemma-2 outperforms the other small language models. Our work aims to provide a comprehensive resource for advancing multilingual and temporally-aware natural language understanding of historical events. To facilitate further research, we will make our code and datasets publicly available upon acceptance of this paper.

Title: ConvergeWriter: Data-Driven Bottom-Up Article Construction

Authors: Binquan Ji, Jiaqi Wang, Ruiting Li, Xingchen Han, Yiyang Qi, Shichao Wang, Yifei Lu, Yuantao Han, Feiliang Ren
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.12811
Pdf URL: https://arxiv.org/pdf/2509.12811
Copy Paste: [[2509.12811]] ConvergeWriter: Data-Driven Bottom-Up Article Construction(https://arxiv.org/abs/2509.12811)
Keywords: language model, llm, hallucination
Abstract: Large Language Models (LLMs) have shown remarkable prowess in text generation, yet producing long-form, factual documents grounded in extensive external knowledge bases remains a significant challenge. Existing "top-down" methods, which first generate a hypothesis or outline and then retrieve evidence, often suffer from a disconnect between the model's plan and the available knowledge, leading to content fragmentation and factual inaccuracies. To address these limitations, we propose a novel "bottom-up," data-driven framework that inverts the conventional generation pipeline. Our approach is predicated on a "Retrieval-First for Knowledge, Clustering for Structure" strategy, which first establishes the "knowledge boundaries" of the source corpus before any generative planning occurs. Specifically, we perform exhaustive iterative retrieval from the knowledge base and then employ an unsupervised clustering algorithm to organize the retrieved documents into distinct "knowledge clusters." These clusters form an objective, data-driven foundation that directly guides the subsequent generation of a hierarchical outline and the final document content. This bottom-up process ensures that the generated text is strictly constrained by and fully traceable to the source material, proactively adapting to the finite scope of the knowledge base and fundamentally mitigating the risk of hallucination. Experimental results on both 14B and 32B parameter models demonstrate that our method achieves performance comparable to or exceeding state-of-the-art baselines, and is expected to demonstrate unique advantages in knowledge-constrained scenarios that demand high fidelity and structural coherence. Our work presents an effective paradigm for generating reliable, structured, long-form documents, paving the way for more robust LLM applications in high-stakes, knowledge-intensive domains.

Title: Benchmarking and Improving LVLMs on Event Extraction from Multimedia Documents

Authors: Fuyu Xing, Zimu Wang, Wei Wang, Haiyang Zhang
Subjects: cs.CL, cs.MM
Abstract URL: https://arxiv.org/abs/2509.12876
Pdf URL: https://arxiv.org/pdf/2509.12876
Copy Paste: [[2509.12876]] Benchmarking and Improving LVLMs on Event Extraction from Multimedia Documents(https://arxiv.org/abs/2509.12876)
Keywords: language model, prompt
Abstract: The proliferation of multimedia content necessitates the development of effective Multimedia Event Extraction (M2E2) systems. Though Large Vision-Language Models (LVLMs) have shown strong cross-modal capabilities, their utility in the M2E2 task remains underexplored. In this paper, we present the first systematic evaluation of representative LVLMs, including DeepSeek-VL2 and the Qwen-VL series, on the M2E2 dataset. Our evaluations cover text-only, image-only, and cross-media subtasks, assessed under both few-shot prompting and fine-tuning settings. Our key findings highlight the following valuable insights: (1) Few-shot LVLMs perform notably better on visual tasks but struggle significantly with textual tasks; (2) Fine-tuning LVLMs with LoRA substantially enhances model performance; and (3) LVLMs exhibit strong synergy when combining modalities, achieving superior performance in cross-modal settings. We further provide a detailed error analysis to reveal persistent challenges in areas such as semantic precision, localization, and cross-modal grounding, which remain critical obstacles for advancing M2E2 capabilities.

Title: The LLM Already Knows: Estimating LLM-Perceived Question Difficulty via Hidden Representations

Authors: Yubo Zhu, Dongrui Liu, Zecheng Lin, Wei Tong, Sheng Zhong, Jing Shao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.12886
Pdf URL: https://arxiv.org/pdf/2509.12886
Copy Paste: [[2509.12886]] The LLM Already Knows: Estimating LLM-Perceived Question Difficulty via Hidden Representations(https://arxiv.org/abs/2509.12886)
Keywords: language model, llm
Abstract: Estimating the difficulty of input questions as perceived by large language models (LLMs) is essential for accurate performance evaluation and adaptive inference. Existing methods typically rely on repeated response sampling, auxiliary models, or fine-tuning the target model itself, which may incur substantial computational costs or compromise generality. In this paper, we propose a novel approach for difficulty estimation that leverages only the hidden representations produced by the target LLM. We model the token-level generation process as a Markov chain and define a value function to estimate the expected output quality given any hidden state. This allows for efficient and accurate difficulty estimation based solely on the initial hidden state, without generating any output tokens. Extensive experiments across both textual and multimodal tasks demonstrate that our method consistently outperforms existing baselines in difficulty estimation. Moreover, we apply our difficulty estimates to guide adaptive reasoning strategies, including Self-Consistency, Best-of-N, and Self-Refine, achieving higher inference efficiency with fewer generated tokens.

Title: Conan-Embedding-v2: Training an LLM from Scratch for Text Embeddings

Authors: Shiyu Li, Yang Tang, Ruijie Liu, Shi-Zhe Chen, Xi Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.12892
Pdf URL: https://arxiv.org/pdf/2509.12892
Copy Paste: [[2509.12892]] Conan-Embedding-v2: Training an LLM from Scratch for Text Embeddings(https://arxiv.org/abs/2509.12892)
Keywords: language model, llm
Abstract: Large language models (LLMs) have recently demonstrated excellent performance in text embedding tasks. Previous work usually uses LoRA to fine-tune existing LLMs, which are limited by the data and training gap between LLMs and embedding models. In this work, we introduce Conan-embedding-v2, a new 1.4B-parameter LLM trained from scratch and fine-tuned as a text embedder. First, we add news data and multilingual pairs for LLM pretraining to bridge the data gap. Based on this, we propose a cross-lingual retrieval dataset that enables the LLM to better integrate embeddings across different languages. Second, whereas LLMs use a causal mask with token-level loss, embedding models use a bidirectional mask with sentence-level loss. This training gap makes full fine-tuning less effective than LoRA. We introduce a soft-masking mechanism to gradually transition between these two types of masks, enabling the model to learn more comprehensive representations. Based on this, we propose a dynamic hard negative mining method that exposes the model to more difficult negative examples throughout the training process. Being intuitive and effective, with only approximately 1.4B parameters, Conan-embedding-v2 achieves SOTA performance on both the Massive Text Embedding Benchmark (MTEB) and Chinese MTEB (May 19, 2025).
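
One way to picture a gradual transition between a causal and a bidirectional attention mask is sketched below. It assumes the mask is a matrix of weights in which future positions receive a value `alpha` that grows from 0 (causal) to 1 (bidirectional); both `soft_mask` and the linear schedule `linear_alpha` are hypothetical names and choices made for illustration, since the abstract does not specify the mechanism.

```python
def soft_mask(seq_len, alpha):
    """Build a seq_len x seq_len weight mask.

    Past and current positions (j <= i) get weight 1.0; future positions
    (j > i) get weight alpha. alpha = 0 is a causal mask, alpha = 1 is a
    fully bidirectional mask. This parameterization is an assumption.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [[1.0 if j <= i else alpha for j in range(seq_len)]
            for i in range(seq_len)]

def linear_alpha(step, total_steps):
    """Hypothetical schedule: alpha rises linearly from 0 to 1 over training."""
    return min(1.0, max(0.0, step / total_steps))

# Mask halfway through a hypothetical 1000-step transition.
halfway = soft_mask(3, linear_alpha(500, 1000))
```

A real implementation would apply such weights inside the attention computation; the sketch only shows the shape of the interpolation.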

Title: All Roads Lead to Rome: Graph-Based Confidence Estimation for Large Language Model Reasoning

Authors: Caiqi Zhang, Chang Shu, Ehsan Shareghi, Nigel Collier
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.12908
Pdf URL: https://arxiv.org/pdf/2509.12908
Copy Paste: [[2509.12908]] All Roads Lead to Rome: Graph-Based Confidence Estimation for Large Language Model Reasoning(https://arxiv.org/abs/2509.12908)
Keywords: language model, llm
Abstract: Confidence estimation is essential for the reliable deployment of large language models (LLMs). Existing methods are primarily designed for factual QA tasks and often fail to generalize to reasoning tasks. To address this gap, we propose a set of training-free, graph-based confidence estimation methods tailored to reasoning tasks. Our approach models reasoning paths as directed graphs and estimates confidence by exploiting graph properties such as centrality, path convergence, and path weighting. Experiments with two LLMs on three reasoning datasets demonstrate improved confidence estimation and enhanced performance on two downstream tasks.
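
As a rough illustration of treating reasoning paths as a directed graph, the sketch below merges several sampled paths into one edge-count graph and scores each final answer by the share of paths that converge on it. This vote-share score is a deliberately simple stand-in for the centrality, path-convergence, and path-weighting measures named in the abstract, and `answer_confidence` is a hypothetical helper, not the authors' method.

```python
from collections import defaultdict

def answer_confidence(paths):
    """Merge sampled reasoning paths into a directed graph and score answers.

    Each path is a non-empty list of step labels whose last element is the
    final answer. Returns (confidence per answer, edge counts).
    """
    edges = defaultdict(int)     # (src, dst) -> number of paths using the edge
    arrivals = defaultdict(int)  # answer node -> number of paths ending there
    for path in paths:
        for src, dst in zip(path, path[1:]):
            edges[(src, dst)] += 1
        arrivals[path[-1]] += 1
    total = len(paths)
    confidence = {ans: count / total for ans, count in arrivals.items()}
    return confidence, dict(edges)

# Three hypothetical sampled paths for one question; two converge on "42".
sampled = [["q", "a", "42"], ["q", "b", "42"], ["q", "c", "41"]]
confidence, graph_edges = answer_confidence(sampled)
```

The edge counts are returned so that richer graph statistics could be computed on top of the same structure.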

Title: Automated Generation of Research Workflows from Academic Papers: A Full-text Mining Framework

Authors: Heng Zhang, Chengzhi Zhang
Subjects: cs.CL, cs.DL, cs.IR
Abstract URL: https://arxiv.org/abs/2509.12955
Pdf URL: https://arxiv.org/pdf/2509.12955
Copy Paste: [[2509.12955]] Automated Generation of Research Workflows from Academic Papers: A Full-text Mining Framework(https://arxiv.org/abs/2509.12955)
Keywords: gpt, prompt, chat
Abstract: The automated generation of research workflows is essential for improving the reproducibility of research and accelerating the paradigm of "AI for Science". However, existing methods typically extract merely fragmented procedural components and thus fail to capture complete research workflows. To address this gap, we propose an end-to-end framework that generates comprehensive, structured research workflows by mining full-text academic papers. As a case study in the Natural Language Processing (NLP) domain, our paragraph-centric approach first employs Positive-Unlabeled (PU) Learning with SciBERT to identify workflow-descriptive paragraphs, achieving an F1-score of 0.9772. Subsequently, we utilize Flan-T5 with prompt learning to generate workflow phrases from these paragraphs, yielding ROUGE-1, ROUGE-2, and ROUGE-L scores of 0.4543, 0.2877, and 0.4427, respectively. These phrases are then systematically categorized into data preparation, data processing, and data analysis stages using ChatGPT with few-shot learning, achieving a classification precision of 0.958. By mapping categorized phrases to their locations in the documents, we finally generate readable visual flowcharts of the entire research workflows. This approach facilitates the analysis of workflows derived from an NLP corpus and reveals key methodological shifts over the past two decades, including the increasing emphasis on data analysis and the transition from feature engineering to ablation studies. Our work offers a validated technical framework for automated workflow generation, along with a novel, process-oriented perspective for the empirical investigation of evolving scientific paradigms. Source code and data are available at: this https URL.

Title: Investigating ReLoRA: Effects on the Learning Dynamics of Small Language Models

Authors: Yuval Weiss, David Demitri Africa, Paula Buttery, Richard Diehl Martinez
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.12960
Pdf URL: https://arxiv.org/pdf/2509.12960
Copy Paste: [[2509.12960]] Investigating ReLoRA: Effects on the Learning Dynamics of Small Language Models(https://arxiv.org/abs/2509.12960)
Keywords: language model, llm
Abstract: Parameter-efficient methods such as LoRA have revolutionised the fine-tuning of LLMs. Still, their extension to pretraining via ReLoRA is less well understood, especially for small language models (SLMs), which offer lower computational and environmental costs. This work is the first systematic study of ReLoRA in SLMs (11M-66M parameters), evaluating both performance and learning dynamics. Through ablation experiments, we find that ReLoRA generally performs worse than standard training on loss, Paloma perplexity and BLiMP, with the gap widening for the larger models. Further analysis of the learning dynamics of the models indicates that ReLoRA reinforces the rank deficiencies found in smaller models. These results indicate that low-rank update strategies may not transfer easily to SLM pretraining, highlighting the need for more research in the low-compute regime.

Title: Do LLMs Understand Wine Descriptors Across Cultures? A Benchmark for Cultural Adaptations of Wine Reviews

Authors: Chenye Zou, Xingyue Wen, Tianyi Hu, Qian Janice Wang, Daniel Hershcovich
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.12961
Pdf URL: https://arxiv.org/pdf/2509.12961
Copy Paste: [[2509.12961]] Do LLMs Understand Wine Descriptors Across Cultures? A Benchmark for Cultural Adaptations of Wine Reviews(https://arxiv.org/abs/2509.12961)
Keywords: language model, llm
Abstract: Recent advances in large language models (LLMs) have opened the door to culture-aware language tasks. We introduce the novel problem of adapting wine reviews across Chinese and English, which goes beyond literal translation by incorporating regional taste preferences and culture-specific flavor descriptors. In a case study on cross-cultural wine review adaptation, we compile the first parallel corpus of professional reviews, containing 8k Chinese and 16k Anglophone reviews. We benchmark both neural-machine-translation baselines and state-of-the-art LLMs with automatic metrics and human evaluation. For the latter, we propose three culture-oriented criteria -- Cultural Proximity, Cultural Neutrality, and Cultural Genuineness -- to assess how naturally a translated review resonates with target-culture readers. Our analysis shows that current models struggle to capture cultural nuances, especially in translating wine descriptions across different cultures. This highlights the challenges and limitations of translation models in handling cultural content.

Title: SitLLM: Large Language Models for Sitting Posture Health Understanding via Pressure Sensor Data

Authors: Jian Gao, Fufangchen Zhao, Yiyang Zhang, Danfeng Yan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.12994
Pdf URL: https://arxiv.org/pdf/2509.12994
Copy Paste: [[2509.12994]] SitLLM: Large Language Models for Sitting Posture Health Understanding via Pressure Sensor Data(https://arxiv.org/abs/2509.12994)
Keywords: language model, llm, prompt
Abstract: Poor sitting posture is a critical yet often overlooked factor contributing to long-term musculoskeletal disorders and physiological dysfunctions. Existing sitting posture monitoring systems, although leveraging visual, IMU, or pressure-based modalities, often suffer from coarse-grained recognition and lack the semantic expressiveness necessary for personalized feedback. In this paper, we propose SitLLM, a lightweight multimodal framework that integrates flexible pressure sensing with large language models (LLMs) to enable fine-grained posture understanding and personalized health-oriented response generation. SitLLM comprises three key components: (1) a Gaussian-Robust Sensor Embedding Module that partitions pressure maps into spatial patches and injects local noise perturbations for robust feature extraction; (2) a Prompt-Driven Cross-Modal Alignment Module that reprograms sensor embeddings into the LLM's semantic space via multi-head cross-attention using the pre-trained vocabulary embeddings; and (3) a Multi-Context Prompt Module that fuses feature-level, structure-level, statistical-level, and semantic-level contextual information to guide instruction comprehension.

Title: Multi-Model Synthetic Training for Mission-Critical Small Language Models

Authors: Nolan Platt, Pragyansmita Nayak
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.13047
Pdf URL: https://arxiv.org/pdf/2509.13047
Copy Paste: [[2509.13047]] Multi-Model Synthetic Training for Mission-Critical Small Language Models(https://arxiv.org/abs/2509.13047)
Keywords: language model, gpt, llm
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across many domains, yet their application to specialized fields remains constrained by the scarcity and complexity of domain-specific training data. We present a novel approach that achieves a 261x cost reduction for maritime intelligence by using LLMs as one-time teachers rather than using them directly for inference. Our method transforms 3.2 billion Automatic Identification System (AIS) vessel tracking records into 21,543 synthetic question and answer pairs through multi-model generation (GPT-4o and o3-mini), preventing overfitting and ensuring accurate reasoning. The resulting fine-tuned Qwen2.5-7B model achieves 75% accuracy on maritime tasks, while being substantially cheaper than using a larger model for inference. We show that smaller, cheaper models - when fine tuned properly - can provide similar accuracy compared to larger models that are prohibitively expensive. Our work contributes to the growing field of synthetic dataset generation for specialized AI applications and presents a highly reproducible framework for domains where manual annotation is infeasible. Beyond expanding research in the growing field of specialized small language models, our approach has immediate applications in maritime safety, security operations, and vessel traffic management systems in various industries.

Title: Shaping Explanations: Semantic Reward Modeling with Encoder-Only Transformers for GRPO

Authors: Francesco Pappone, Ruggero Marino Lazzaroni, Federico Califano, Niccolò Gentile, Roberto Marras
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.13081
Pdf URL: https://arxiv.org/pdf/2509.13081
Copy Paste: [[2509.13081]] Shaping Explanations: Semantic Reward Modeling with Encoder-Only Transformers for GRPO(https://arxiv.org/abs/2509.13081)
Keywords: language model, llm
Abstract: While Large Language Models (LLMs) excel at generating human-like text, aligning their outputs with complex, qualitative goals like pedagogical soundness remains a significant challenge. Standard reinforcement learning techniques often rely on slow and expensive LLM-as-a-judge evaluations or on brittle, keyword-based metrics like ROUGE, which fail to capture the semantic essence of a high-quality explanation. In this work, we introduce a novel approach to reward shaping within the Group Relative Policy Optimisation (GRPO) framework. Our central contribution is the use of a small, efficient encoder-only transformer as a semantic reward model. This model provides a dense, semantically rich reward signal based on the cosine similarity between a generated explanation and a ground-truth reference, guiding the policy towards explanations that are not just factually correct but also structurally and conceptually aligned with expert reasoning. We apply this method to the task of training a model for the Italian medical-school entrance examinations, following standard domain-adaptive continued pre-training (CPT) and supervised fine-tuning (SFT). Our results demonstrate that GRPO with our proposed semantic reward significantly improves explanation faithfulness and clarity over a strong SFT baseline, showcasing the power of using lightweight encoder models for nuanced reward shaping in complex generation tasks.
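
The cosine-similarity reward described in the abstract above can be sketched as follows. The sketch assumes the encoder has already mapped the generated explanation and the reference to fixed-length vectors; `cosine_reward` and the toy vectors are hypothetical, and the actual encoder, pooling, and any reward scaling are not specified in the abstract.

```python
import math

def cosine_reward(generated_vec, reference_vec):
    """Cosine similarity between two embedding vectors, used as a scalar reward.

    Returns a value in [-1, 1]; returns 0.0 if either vector is all zeros.
    """
    if len(generated_vec) != len(reference_vec):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(g * r for g, r in zip(generated_vec, reference_vec))
    norm_g = math.sqrt(sum(g * g for g in generated_vec))
    norm_r = math.sqrt(sum(r * r for r in reference_vec))
    if norm_g == 0.0 or norm_r == 0.0:
        return 0.0
    return dot / (norm_g * norm_r)

# Toy 3-dimensional embeddings standing in for encoder outputs.
reward = cosine_reward([0.2, 0.9, 0.1], [0.25, 0.8, 0.0])
```

Because the score is continuous, every sampled explanation in a group receives a graded signal rather than a pass/fail label.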

Title: Empowering LLMs with Parameterized Skills for Adversarial Long-Horizon Planning

Authors: Sijia Cui, Shuai Xu, Aiyao He, Yanna Wang, Bo Xu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.13127
Pdf URL: https://arxiv.org/pdf/2509.13127
Copy Paste: [[2509.13127]] Empowering LLMs with Parameterized Skills for Adversarial Long-Horizon Planning(https://arxiv.org/abs/2509.13127)
Keywords: language model, gpt, llm, agent
Abstract: Recent advancements in Large Language Models (LLMs) have led to the development of LLM-based AI agents. A key challenge is the creation of agents that can effectively ground themselves in complex, adversarial long-horizon environments. Existing methods mainly focus on (1) using LLMs as policies to interact with the environment through generating low-level feasible actions, and (2) utilizing LLMs to generate high-level tasks or language guides to stimulate action generation. However, the former struggles to generate reliable actions, while the latter relies heavily on expert experience to translate high-level tasks into specific action sequences. To address these challenges, we introduce the Plan with Language, Act with Parameter (PLAP) planning framework that facilitates the grounding of LLM-based agents in long-horizon environments. The PLAP method comprises three key components: (1) a skill library containing environment-specific parameterized skills, (2) a skill planner powered by LLMs, and (3) a skill executor converting the parameterized skills into executable action sequences. We implement PLAP in MicroRTS, a long-horizon real-time strategy game that provides an unfamiliar and challenging environment for LLMs. The experimental results demonstrate the effectiveness of PLAP. In particular, GPT-4o-driven PLAP in a zero-shot setting outperforms 80% of baseline agents, and Qwen2-72B-driven PLAP, with carefully crafted few-shot examples, surpasses the top-tier scripted agent, CoacAI. Additionally, we design comprehensive evaluation metrics and test 6 closed-source and 2 open-source LLMs within the PLAP framework, ultimately releasing an LLM leaderboard ranking long-horizon skill planning ability. Our code is available at this https URL.
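The three components named in the abstract (skill library, skill planner, skill executor) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration: the skill names `harvest` and `attack`, their parameters, the action strings, and the rule-based `plan` function that stands in for the LLM planner. None of it is taken from the PLAP code or from MicroRTS.

```python
from typing import Callable, Dict, List, Tuple

Action = str
Skill = Callable[..., List[Action]]


def harvest(worker_id: int, trips: int) -> List[Action]:
    """Hypothetical parameterized skill: gather resources for a number of trips."""
    actions: List[Action] = []
    for _ in range(trips):
        actions += [
            f"worker{worker_id}:move_to_resource",
            f"worker{worker_id}:gather",
            f"worker{worker_id}:return_to_base",
        ]
    return actions


def attack(unit_type: str, target: str) -> List[Action]:
    """Hypothetical parameterized skill: move a unit type to a target and attack."""
    return [f"{unit_type}:move_to:{target}", f"{unit_type}:attack:{target}"]


# (1) Skill library: maps skill names to functions that emit low-level actions.
SKILL_LIBRARY: Dict[str, Skill] = {"harvest": harvest, "attack": attack}


def plan(observation: str) -> List[Tuple[str, dict]]:
    """(2) Stand-in for the LLM planner: returns skill names plus parameters.

    A real planner would prompt an LLM with the observation and the library.
    """
    if "enemy" in observation:
        return [("attack", {"unit_type": "light", "target": "enemy_base"})]
    return [("harvest", {"worker_id": 0, "trips": 1})]


def execute(plan_steps: List[Tuple[str, dict]]) -> List[Action]:
    """(3) Executor: expands each parameterized skill into an action sequence."""
    actions: List[Action] = []
    for name, params in plan_steps:
        if name not in SKILL_LIBRARY:
            raise KeyError(f"unknown skill: {name}")
        actions.extend(SKILL_LIBRARY[name](**params))
    return actions


actions = execute(plan("enemy spotted near base"))
```

The point of the split is that the planner only has to choose a skill and fill in its parameters, while the executor is responsible for producing actions the environment will accept.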

Title: LLM Hallucination Detection: A Fast Fourier Transform Method Based on Hidden Layer Temporal Signals

Authors: Jinxin Li, Gang Tu, ShengYu Cheng, Junjie Hu, Jinting Wang, Rui Chen, Zhilong Zhou, Dongbo Shan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.13154
Pdf URL: https://arxiv.org/pdf/2509.13154
Copy Paste: [[2509.13154]] LLM Hallucination Detection: A Fast Fourier Transform Method Based on Hidden Layer Temporal Signals(https://arxiv.org/abs/2509.13154)
Keywords: language model, llm, hallucination
Abstract: Hallucination remains a critical barrier for deploying large language models (LLMs) in reliability-sensitive applications. Existing detection methods largely fall into two categories: factuality checking, which is fundamentally constrained by external knowledge coverage, and static hidden-state analysis, which fails to capture deviations in reasoning dynamics. As a result, their effectiveness and robustness remain limited. We propose HSAD (Hidden Signal Analysis-based Detection), a novel hallucination detection framework that models the temporal dynamics of hidden representations during autoregressive generation. HSAD constructs hidden-layer signals by sampling activations across layers, applies Fast Fourier Transform (FFT) to obtain frequency-domain representations, and extracts the strongest non-DC frequency component as spectral features. Furthermore, by leveraging the autoregressive nature of LLMs, HSAD identifies optimal observation points for effective and reliable detection. Across multiple benchmarks, including TruthfulQA, HSAD achieves an improvement of over 10 percentage points compared to prior state-of-the-art methods. By integrating reasoning-process modeling with frequency-domain analysis, HSAD establishes a new paradigm for robust hallucination detection in LLMs.
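The spectral step described in the abstract (apply an FFT, then take the strongest non-DC frequency component) can be sketched in pure Python as follows. This is an illustration under stated assumptions, not the HSAD implementation: the input is a synthetic sinusoid standing in for a sampled hidden-layer activation, the function names are hypothetical, and the signal length is assumed to be a power of two.

```python
import cmath
import math


def fft(signal):
    """Recursive radix-2 Cooley-Tukey FFT; the length must be a power of two."""
    n = len(signal)
    if n == 1:
        return [complex(signal[0])]
    if n % 2:
        raise ValueError("length must be a power of two")
    even = fft(signal[0::2])
    odd = fft(signal[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def strongest_non_dc(signal):
    """Return (bin index, magnitude) of the largest non-DC frequency component."""
    if len(signal) < 2:
        raise ValueError("need at least two samples")
    spectrum = fft(signal)
    half = len(signal) // 2
    # Bin 0 is the DC term (proportional to the mean) and is skipped.
    # For real input, bins above n/2 mirror the lower ones, so they are skipped too.
    best = max(range(1, half + 1), key=lambda k: abs(spectrum[k]))
    return best, abs(spectrum[best])


# Synthetic stand-in for one activation sampled over 16 steps:
# a constant offset (DC) plus a sinusoid with 3 cycles per window.
signal = [0.5 + math.sin(2 * math.pi * 3 * t / 16) for t in range(16)]
bin_index, magnitude = strongest_non_dc(signal)
```

Skipping bin 0 matters because a nonzero mean activation would otherwise dominate the spectrum and hide the oscillatory structure the feature is meant to capture.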

Title: The Few-shot Dilemma: Over-prompting Large Language Models

Authors: Yongjian Tang, Doruk Tuncel, Christian Koerner, Thomas Runkler
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.13196
Pdf URL: https://arxiv.org/pdf/2509.13196
Copy Paste: [[2509.13196]] The Few-shot Dilemma: Over-prompting Large Language Models(https://arxiv.org/abs/2509.13196)
Keywords: language model, gpt, llm, prompt
Abstract: Over-prompting, a phenomenon where excessive examples in prompts lead to diminished performance in Large Language Models (LLMs), challenges the conventional wisdom about in-context few-shot learning. To investigate this few-shot dilemma, we outline a prompting framework that leverages three standard few-shot selection methods - random sampling, semantic embedding, and TF-IDF vectors - and evaluate these methods across multiple LLMs, including GPT-4o, GPT-3.5-turbo, DeepSeek-V3, Gemma-3, LLaMA-3.1, LLaMA-3.2, and Mistral. Our experimental results reveal that incorporating excessive domain-specific examples into prompts can paradoxically degrade performance in certain LLMs, which contradicts the prior empirical conclusion that more relevant few-shot examples universally benefit LLMs. Given the trend of LLM-assisted software engineering and requirement analysis, we experiment with two real-world software requirement classification datasets. By gradually increasing the number of TF-IDF-selected and stratified few-shot examples, we identify their optimal quantity for each LLM. This combined approach achieves superior performance with fewer examples, avoiding the over-prompting problem, thus surpassing the state-of-the-art by 1% in classifying functional and non-functional requirements.
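One of the selection methods named in the abstract, choosing few-shot examples by TF-IDF similarity to the query, can be sketched in pure Python as below. The smoothed IDF formula, the whitespace tokenizer, the example requirements, and the function names are all assumptions made for illustration. The stratification step mentioned in the abstract is omitted.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (dicts) with a smoothed IDF term."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}
    return [
        {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for tokens in tokenized
    ]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_examples(query, pool, k):
    """Return the k labelled examples in the pool most similar to the query."""
    vectors = tfidf_vectors([text for text, _ in pool] + [query])
    query_vec = vectors[-1]
    ranked = sorted(
        range(len(pool)),
        key=lambda i: cosine(query_vec, vectors[i]),
        reverse=True,
    )
    return [pool[i] for i in ranked[:k]]


# Hypothetical labelled requirements, not taken from the paper's datasets.
pool = [
    ("The system shall export reports as PDF", "functional"),
    ("The system shall respond within two seconds", "non-functional"),
    ("Users can reset their password by email", "functional"),
    ("The interface must be available in Dutch", "non-functional"),
]
chosen = select_examples("The system shall export invoices as PDF", pool, k=2)
```

Varying `k` in a loop and scoring the resulting prompts is one way to look for the per-model optimum the abstract refers to.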

Title: Evaluating LLM Alignment on Personality Inference from Real-World Interview Data

Authors: Jianfeng Zhu, Julina Maharjan, Xinyu Li, Karin G. Coifman, Ruoming Jin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.13244
Pdf URL: https://arxiv.org/pdf/2509.13244
Copy Paste: [[2509.13244]] Evaluating LLM Alignment on Personality Inference from Real-World Interview Data(https://arxiv.org/abs/2509.13244)
Keywords: language model, gpt, llm, prompt, chain-of-thought, agent
Abstract: Large Language Models (LLMs) are increasingly deployed in roles requiring nuanced psychological understanding, such as emotional support agents, counselors, and decision-making assistants. However, their ability to interpret human personality traits, a critical aspect of such applications, remains unexplored, particularly in ecologically valid conversational settings. While prior work has simulated LLM "personas" using discrete Big Five labels on social media data, the alignment of LLMs with continuous, ground-truth personality assessments derived from natural interactions is largely unexamined. To address this gap, we introduce a novel benchmark comprising semi-structured interview transcripts paired with validated continuous Big Five trait scores. Using this dataset, we systematically evaluate LLM performance across three paradigms: (1) zero-shot and chain-of-thought prompting with GPT-4.1 Mini, (2) LoRA-based fine-tuning applied to both RoBERTa and Meta-LLaMA architectures, and (3) regression using static embeddings from pretrained BERT and OpenAI's text-embedding-3-small. Our results reveal that all Pearson correlations between model predictions and ground-truth personality traits remain below 0.26, highlighting the limited alignment of current LLMs with validated psychological constructs. Chain-of-thought prompting offers minimal gains over zero-shot, suggesting that personality inference relies more on latent semantic representation than explicit reasoning. These findings underscore the challenges of aligning LLMs with complex human attributes and motivate future work on trait-specific prompting, context-aware modeling, and alignment-oriented fine-tuning.
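The headline metric in this abstract is the Pearson correlation between predicted and ground-truth trait scores. A minimal sketch of that computation follows; the score lists are invented placeholders, not data from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (spread_x * spread_y)


# Placeholder values for one trait: model predictions vs. questionnaire scores.
predicted = [3.1, 2.4, 4.0, 3.3, 2.9]
observed = [3.5, 2.0, 3.0, 4.2, 2.5]
r = pearson(predicted, observed)
```

A coefficient below 0.26, as reported in the abstract, means the predictions explain under 7% of the variance in the trait scores (0.26 squared is about 0.068).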

Title: ChartGaze: Enhancing Chart Understanding in LVLMs with Eye-Tracking Guided Attention Refinement

Authors: Ali Salamatian, Amirhossein Abaskohi, Wan-Cyuan Fan, Mir Rayat Imtiaz Hossain, Leonid Sigal, Giuseppe Carenini
Subjects: cs.CL, cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2509.13282
Pdf URL: https://arxiv.org/pdf/2509.13282
Copy Paste: [[2509.13282]] ChartGaze: Enhancing Chart Understanding in LVLMs with Eye-Tracking Guided Attention Refinement(https://arxiv.org/abs/2509.13282)
Keywords: language model
Abstract: Charts are a crucial visual medium for communicating and representing information. While Large Vision-Language Models (LVLMs) have made progress on chart question answering (CQA), the task remains challenging, particularly when models attend to irrelevant regions of the chart. In this work, we present ChartGaze, a new eye-tracking dataset that captures human gaze patterns during chart reasoning tasks. Through a systematic comparison of human and model attention, we find that LVLMs often diverge from human gaze, leading to reduced interpretability and accuracy. To address this, we propose a gaze-guided attention refinement that aligns image-text attention with human fixations. Our approach improves both answer accuracy and attention alignment, yielding gains of up to 2.56 percentage points across multiple models. These results demonstrate the promise of incorporating human gaze to enhance both the reasoning quality and interpretability of chart-focused LVLMs.

Title: WebResearcher: Unleashing unbounded reasoning capability in Long-Horizon Agents

Authors: Zile Qiao, Guoxin Chen, Xuanzhong Chen, Donglei Yu, Wenbiao Yin, Xinyu Wang, Zhen Zhang, Baixuan Li, Huifeng Yin, Kuan Li, Rui Min, Minpeng Liao, Yong Jiang, Pengjun Xie, Fei Huang, Jingren Zhou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.13309
Pdf URL: https://arxiv.org/pdf/2509.13309
Copy Paste: [[2509.13309]] WebResearcher: Unleashing unbounded reasoning capability in Long-Horizon Agents(https://arxiv.org/abs/2509.13309)
Keywords: agent
Abstract: Recent advances in deep-research systems have demonstrated the potential for AI agents to autonomously discover and synthesize knowledge from external sources. In this paper, we introduce WebResearcher, a novel framework for building such agents through two key components: (1) WebResearcher, an iterative deep-research paradigm that reformulates deep research as a Markov Decision Process, where agents periodically consolidate findings into evolving reports while maintaining focused workspaces, overcoming the context suffocation and noise contamination that plague existing mono-contextual approaches; and (2) WebFrontier, a scalable data synthesis engine that generates high-quality training data through tool-augmented complexity escalation, enabling systematic creation of research tasks that bridge the gap between passive knowledge recall and active knowledge construction. Notably, we find that the training data from our paradigm significantly enhances tool-use capabilities even for traditional mono-contextual methods. Furthermore, our paradigm naturally scales through parallel thinking, enabling concurrent multi-agent exploration for more comprehensive conclusions. Extensive experiments across 6 challenging benchmarks demonstrate that WebResearcher achieves state-of-the-art performance, even surpassing frontier proprietary systems.

Title: Scaling Agents via Continual Pre-training

Authors: Liangcai Su, Zhen Zhang, Guangyu Li, Zhuo Chen, Chenxi Wang, Maojia Song, Xinyu Wang, Kuan Li, Jialong Wu, Xuanzhong Chen, Zile Qiao, Zhongwang Zhang, Huifeng Yin, Shihao Cai, Runnan Fang, Zhengwei Tao, Wenbiao Yin, Chenxiong Qian, Yong Jiang, Pengjun Xie, Fei Huang, Jingren Zhou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.13310
Pdf URL: https://arxiv.org/pdf/2509.13310
Copy Paste: [[2509.13310]] Scaling Agents via Continual Pre-training(https://arxiv.org/abs/2509.13310)
Keywords: language model, llm, agent
Abstract: Large language models (LLMs) have evolved into agentic systems capable of autonomous tool use and multi-step reasoning for complex problem-solving. However, post-training approaches building upon general-purpose foundation models consistently underperform in agentic tasks, particularly in open-source implementations. We identify the root cause: the absence of robust agentic foundation models forces models during post-training to simultaneously learn diverse agentic behaviors while aligning them to expert demonstrations, thereby creating fundamental optimization tensions. To this end, we are the first to propose incorporating Agentic Continual Pre-training (Agentic CPT) into the deep research agents training pipeline to build powerful agentic foundational models. Based on this approach, we develop a deep research agent model named AgentFounder. We evaluate our AgentFounder-30B on 10 benchmarks and achieve state-of-the-art performance while retaining strong tool-use ability, notably 39.9% on BrowseComp-en, 43.3% on BrowseComp-zh, and 31.5% Pass@1 on HLE.

Title: Towards General Agentic Intelligence via Environment Scaling

Authors: Runnan Fang, Shihao Cai, Baixuan Li, Jialong Wu, Guangyu Li, Wenbiao Yin, Xinyu Wang, Xiaobin Wang, Liangcai Su, Zhen Zhang, Shibin Wu, Zhengwei Tao, Yong Jiang, Pengjun Xie, Fei Huang, Jingren Zhou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.13311
Pdf URL: https://arxiv.org/pdf/2509.13311
Copy Paste: [[2509.13311]] Towards General Agentic Intelligence via Environment Scaling(https://arxiv.org/abs/2509.13311)
Keywords: language model, agent
Abstract: Advanced agentic intelligence is a prerequisite for deploying Large Language Models in practical, real-world applications. Diverse real-world APIs demand precise, robust function-calling intelligence, which needs agents to develop these capabilities through interaction in varied environments. The breadth of function-calling competence is closely tied to the diversity of environments in which agents are trained. In this work, we scale up environments as a step towards advancing general agentic intelligence. This gives rise to two central challenges: (i) how to scale environments in a principled manner, and (ii) how to effectively train agentic capabilities from experiences derived through interactions with these environments. To address these, we design a scalable framework that automatically constructs heterogeneous environments that are fully simulated, systematically broadening the space of function-calling scenarios. We further adapt a two-phase agent fine-tuning strategy: first endowing agents with fundamental agentic capabilities, then specializing them for domain-specific contexts. Extensive experiments on agentic benchmarks, tau-bench, tau2-Bench, and ACEBench, demonstrate that our trained model, AgentScaler, significantly enhances the function-calling capability of models.

Title: WebWeaver: Structuring Web-Scale Evidence with Dynamic Outlines for Open-Ended Deep Research

Authors: Zijian Li, Xin Guan, Bo Zhang, Shen Huang, Houquan Zhou, Shaopeng Lai, Ming Yan, Yong Jiang, Pengjun Xie, Fei Huang, Jun Zhang, Jingren Zhou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.13312
Pdf URL: https://arxiv.org/pdf/2509.13312
Copy Paste: [[2509.13312]] WebWeaver: Structuring Web-Scale Evidence with Dynamic Outlines for Open-Ended Deep Research(https://arxiv.org/abs/2509.13312)
Keywords: hallucination, agent
Abstract: This paper tackles open-ended deep research (OEDR), a complex challenge where AI agents must synthesize vast web-scale information into insightful reports. Current approaches are plagued by dual-fold limitations: static research pipelines that decouple planning from evidence acquisition and one-shot generation paradigms that easily suffer from long-context failure issues like "loss in the middle" and hallucinations. To address these challenges, we introduce WebWeaver, a novel dual-agent framework that emulates the human research process. The planner operates in a dynamic cycle, iteratively interleaving evidence acquisition with outline optimization to produce a comprehensive, source-grounded outline linking to a memory bank of evidence. The writer then executes a hierarchical retrieval and writing process, composing the report section by section. By performing targeted retrieval of only the necessary evidence from the memory bank for each part, it effectively mitigates long-context issues. Our framework establishes a new state-of-the-art across major OEDR benchmarks, including DeepResearch Bench, DeepConsult, and DeepResearchGym. These results validate our human-centric, iterative methodology, demonstrating that adaptive planning and focused synthesis are crucial for producing high-quality, reliable, and well-structured reports.

Title: ReSum: Unlocking Long-Horizon Search Intelligence via Context Summarization

Authors: Xixi Wu, Kuan Li, Yida Zhao, Liwen Zhang, Litu Ou, Huifeng Yin, Zhongwang Zhang, Yong Jiang, Pengjun Xie, Fei Huang, Minhao Cheng, Shuai Wang, Hong Cheng, Jingren Zhou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.13313
Pdf URL: https://arxiv.org/pdf/2509.13313
Copy Paste: [[2509.13313]] ReSum: Unlocking Long-Horizon Search Intelligence via Context Summarization(https://arxiv.org/abs/2509.13313)
Keywords: language model, llm, agent
Abstract: Large Language Model (LLM)-based web agents demonstrate strong performance on knowledge-intensive tasks but are hindered by context window limitations in paradigms like ReAct. Complex queries involving multiple entities, intertwined relationships, and high uncertainty demand extensive search cycles that rapidly exhaust context budgets before reaching complete solutions. To overcome this challenge, we introduce ReSum, a novel paradigm that enables indefinite exploration through periodic context summarization. ReSum converts growing interaction histories into compact reasoning states, maintaining awareness of prior discoveries while bypassing context constraints. For paradigm adaptation, we propose ReSum-GRPO, integrating GRPO with segmented trajectory training and advantage broadcasting to familiarize agents with summary-conditioned reasoning. Extensive experiments on web agents of varying scales across three benchmarks demonstrate that ReSum delivers an average absolute improvement of 4.5% over ReAct, with further gains of up to 8.2% following ReSum-GRPO training. Notably, with only 1K training samples, our WebResummer-30B (a ReSum-GRPO-trained version of WebSailor-30B) achieves 33.3% Pass@1 on BrowseComp-zh and 18.3% on BrowseComp-en, surpassing existing open-source web agents.
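The core idea in the abstract, periodically replacing a growing interaction history with a compact summary, can be sketched as a simple control loop. This is a toy illustration under stated assumptions: `step_fn` and `summarize_fn` are hypothetical stand-ins for the agent's tool-using step and an LLM summarizer, and the budget is measured in characters rather than tokens. The ReSum-GRPO training procedure is not covered.

```python
def run_with_resum(question, step_fn, summarize_fn, budget_chars, max_steps):
    """Run an agent loop, compressing the context whenever it exceeds the budget."""
    context = [question]
    for _ in range(max_steps):
        observation, done = step_fn(context)
        context.append(observation)
        if done:
            return context
        if sum(len(item) for item in context) > budget_chars:
            # Replace the history with the question plus one compact summary,
            # then continue reasoning from that state.
            context = [question, summarize_fn(context)]
    return context


def make_toy_agent(total_steps):
    """Hypothetical agent step: emits a fixed-size observation, stops after N steps."""
    state = {"n": 0}

    def step_fn(context):
        state["n"] += 1
        return f"observation {state['n']}: " + "x" * 40, state["n"] >= total_steps

    return step_fn


def toy_summarize(context):
    """Hypothetical summarizer; a real one would call an LLM on the history."""
    return f"summary covering {len(context) - 1} earlier items"


final_context = run_with_resum(
    "Who founded X?", make_toy_agent(10), toy_summarize,
    budget_chars=150, max_steps=20,
)
```

Because the history is reset to the question plus a summary each time the budget is exceeded, the context size stays bounded no matter how many steps the search takes.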

Title: Do Natural Language Descriptions of Model Activations Convey Privileged Information?

Authors: Millicent Li, Alberto Mario Ceballos Arroyo, Giordano Rogers, Naomi Saphra, Byron C. Wallace
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2509.13316
Pdf URL: https://arxiv.org/pdf/2509.13316
Copy Paste: [[2509.13316]] Do Natural Language Descriptions of Model Activations Convey Privileged Information?(https://arxiv.org/abs/2509.13316)
Keywords: llm
Abstract: Recent interpretability methods have proposed to translate LLM internal representations into natural language descriptions using a second verbalizer LLM. This is intended to illuminate how the target model represents and operates on inputs. But do such activation verbalization approaches actually provide privileged knowledge about the internal workings of the target model, or do they merely convey information about its inputs? We critically evaluate popular verbalization methods across datasets used in prior work and find that they succeed at benchmarks without any access to target model internals, suggesting that these datasets are not ideal for evaluating verbalization methods. We then run controlled experiments which reveal that verbalizations often reflect the parametric knowledge of the verbalizer LLM which generated them, rather than the activations of the target LLM being decoded. Taken together, our results indicate a need for targeted benchmarks and experimental controls to rigorously assess whether verbalization methods provide meaningful insights into the operations of LLMs.