2025-10-01

Title: Cyclic Ablation: Testing Concept Localization against Functional Regeneration in AI

Authors: Eduard Kapelko
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2509.25220
Pdf URL: https://arxiv.org/pdf/2509.25220
Copy Paste: [[2509.25220]] Cyclic Ablation: Testing Concept Localization against Functional Regeneration in AI(https://arxiv.org/abs/2509.25220)
Keywords: language model, gpt
Abstract: Safety and controllability are critical for large language models. A central question is whether undesirable behaviors like deception are localized functions that can be removed, or if they are deeply intertwined with a model's core cognitive abilities. We introduce "cyclic ablation," an iterative method to test this. By combining sparse autoencoders, targeted ablation, and adversarial training on DistilGPT-2, we attempted to eliminate the concept of deception. We found that, contrary to the localization hypothesis, deception was highly resilient. The model consistently recovered its deceptive behavior after each ablation cycle via adversarial training, a process we term functional regeneration. Crucially, every attempt at this "neurosurgery" caused a gradual but measurable decay in general linguistic performance, reflected by a consistent rise in perplexity. These findings are consistent with the view that complex concepts are distributed and entangled, underscoring the limitations of direct model editing through mechanistic interpretability.

Title: From Internal Representations to Text Quality: A Geometric Approach to LLM Evaluation

Authors: Viacheslav Yusupov, Danil Maksimov, Ameliia Alaeva, Anna Vasileva, Anna Antipina, Tatyana Zaitseva, Alina Ermilova, Evgeny Burnaev, Egor Shvetsov
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.25359
Pdf URL: https://arxiv.org/pdf/2509.25359
Copy Paste: [[2509.25359]] From Internal Representations to Text Quality: A Geometric Approach to LLM Evaluation(https://arxiv.org/abs/2509.25359)
Keywords: language model, llm, chat
Abstract: This paper bridges internal and external analysis approaches to large language models (LLMs) by demonstrating that geometric properties of internal model representations serve as reliable proxies for evaluating generated text quality. We validate a set of metrics including Maximum Explainable Variance, Effective Rank, Intrinsic Dimensionality, MAUVE score, and Schatten Norms measured across different layers of LLMs, demonstrating that Intrinsic Dimensionality and Effective Rank can serve as universal assessments of text naturalness and quality. Our key finding reveals that different models consistently rank text from various sources in the same order based on these geometric properties, indicating that these metrics reflect inherent text characteristics rather than model-specific artifacts. This allows a reference-free text quality evaluation that does not require human-annotated datasets, offering practical advantages for automated evaluation pipelines.
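
As a rough illustration of one metric named in the abstract above, the sketch below computes Effective Rank from a list of singular values. It assumes the common definition (the exponential of the Shannon entropy of the normalized singular values); the paper may use a different variant, and the function name `effective_rank` is hypothetical. In practice the singular values would come from an SVD of a layer's hidden-state matrix, which is not shown here.

```python
import math

def effective_rank(singular_values):
    """Effective rank as exp(entropy) of the normalized singular values.

    Assumption: this is the entropy-based definition, not necessarily
    the exact formula used in the paper.
    """
    # Zero singular values contribute nothing to the entropy, so drop them.
    positive = [s for s in singular_values if s > 0]
    total = sum(positive)
    if total == 0:
        return 0.0
    probs = [s / total for s in positive]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)

# A flat spectrum uses every dimension; a single spike uses only one.
flat = effective_rank([1.0, 1.0, 1.0, 1.0])
spike = effective_rank([5.0, 0.0, 0.0])
```

A flat spectrum yields a value equal to the number of dimensions, while a spectrum dominated by one direction yields a value close to 1.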

Title: Generative Value Conflicts Reveal LLM Priorities

Authors: Andy Liu, Kshitish Ghate, Mona Diab, Daniel Fried, Atoosa Kasirzadeh, Max Kleiman-Weiner
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.25369
Pdf URL: https://arxiv.org/pdf/2509.25369
Copy Paste: [[2509.25369]] Generative Value Conflicts Reveal LLM Priorities(https://arxiv.org/abs/2509.25369)
Keywords: language model, llm, prompt
Abstract: Past work seeks to align large language model (LLM)-based assistants with a target set of values, but such assistants are frequently forced to make tradeoffs between values when deployed. In response to the scarcity of value conflict in existing alignment datasets, we introduce ConflictScope, an automatic pipeline to evaluate how LLMs prioritize different values. Given a user-defined value set, ConflictScope automatically generates scenarios in which a language model faces a conflict between two values sampled from the set. It then prompts target models with an LLM-written "user prompt" and evaluates their free-text responses to elicit a ranking over values in the value set. Comparing results between multiple-choice and open-ended evaluations, we find that models shift away from supporting protective values, such as harmlessness, and toward supporting personal values, such as user autonomy, in more open-ended value conflict settings. However, including detailed value orderings in models' system prompts improves alignment with a target ranking by 14%, showing that system prompting can achieve moderate success at aligning LLM behavior under value conflict. Our work demonstrates the importance of evaluating value prioritization in models and provides a foundation for future work in this area.
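
The step of turning pairwise conflict outcomes into a ranking over values can be sketched as follows. This is a minimal illustration that assumes a simple win-rate ordering; the abstract does not say which ranking procedure ConflictScope uses, and the function name `rank_values` and the example value names are hypothetical.

```python
from collections import defaultdict

def rank_values(pairwise_outcomes):
    """Rank values by win rate.

    pairwise_outcomes: list of (winner, loser) value names, one entry per
    scenario in which a response was judged to favor `winner` over `loser`.
    Assumption: plain win rate, not the paper's actual ranking method.
    """
    wins = defaultdict(int)
    appearances = defaultdict(int)
    for winner, loser in pairwise_outcomes:
        wins[winner] += 1
        appearances[winner] += 1
        appearances[loser] += 1
    win_rate = {v: wins[v] / appearances[v] for v in appearances}
    # Highest win rate first; ties broken alphabetically for determinism.
    return sorted(win_rate, key=lambda v: (-win_rate[v], v))

outcomes = [
    ("autonomy", "harmlessness"),
    ("autonomy", "honesty"),
    ("honesty", "harmlessness"),
]
ranking = rank_values(outcomes)
```

Comparing the ranking obtained this way against a target ordering is one way the reported alignment gain from system prompting could be measured.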

Title: From Faithfulness to Correctness: Generative Reward Models that Think Critically

Authors: Qiyao Ma, Yunsheng Shi, Hongtao Tian, Chao Wang, Weiming Chang, Ting Yao
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.25409
Pdf URL: https://arxiv.org/pdf/2509.25409
Copy Paste: [[2509.25409]] From Faithfulness to Correctness: Generative Reward Models that Think Critically(https://arxiv.org/abs/2509.25409)
Keywords: language model
Abstract: Through reinforcement learning with verifiable rewards (RLVR), large language models have achieved substantial progress in domains with easily verifiable outcomes, such as mathematics and coding. However, when applied to more complex tasks like open-domain question answering, RLVR faces significant challenges due to the difficulty of verifying correctness. The nuanced and ambiguous nature of real-world knowledge makes it difficult to reliably evaluate correctness in these settings, necessitating further abilities that extend beyond mere logical consistency to encompass an understanding and assessment of both external and internal knowledge. Recent work has primarily focused on improving faithfulness, defined as semantic alignment with supporting documents, which can cause models to rely excessively on external sources and diminish their capacity for critical assessment. To address this, we propose the Thinking-supervised Reward Model (TRM), which incorporates sentence-level thinking supervision to endow reward models with critical thinking abilities. Given a query, answer, and supporting documents, TRM first assesses the faithfulness of each answer sentence to the supporting documents, and then applies a reasoning step to evaluate sentence-level correctness. By structuring reward modeling as a sequence of faithfulness, reasoning, and correctness evaluations, TRM encourages models to critically assess and leverage both external and internal knowledge. Experiments on reward signals demonstrate that TRM substantially improves the identification of incorrect sentences, and incorporating TRM into policy optimization leads to significant gains in both answer correctness and usefulness.

Title: SimulRAG: Simulator-based RAG for Grounding LLMs in Long-form Scientific QA

Authors: Haozhou Xu, Dongxia Wu, Matteo Chinazzi, Ruijia Niu, Rose Yu, Yi-An Ma
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2509.25459
Pdf URL: https://arxiv.org/pdf/2509.25459
Copy Paste: [[2509.25459]] SimulRAG: Simulator-based RAG for Grounding LLMs in Long-form Scientific QA(https://arxiv.org/abs/2509.25459)
Keywords: language model, llm, hallucination, retrieval-augmented generation
Abstract: Large language models (LLMs) show promise in solving scientific problems. They can help generate long-form answers for scientific questions, which are crucial for comprehensive understanding of complex phenomena that require detailed explanations spanning multiple interconnected concepts and evidence. However, LLMs often suffer from hallucination, especially in the challenging task of long-form scientific question answering. Retrieval-Augmented Generation (RAG) approaches can ground LLMs by incorporating external knowledge sources to improve trustworthiness. In this context, scientific simulators, which play a vital role in validating hypotheses, offer a particularly promising retrieval source to mitigate hallucination and enhance answer factuality. However, existing RAG approaches cannot be directly applied for scientific simulation-based retrieval due to two fundamental challenges: how to retrieve from scientific simulators, and how to efficiently verify and update long-form answers. To overcome these challenges, we propose the simulator-based RAG framework (SimulRAG) and provide a long-form scientific QA benchmark covering climate science and epidemiology with ground truth verified by both simulations and human annotators. In this framework, we propose a generalized simulator retrieval interface to transform between textual and numerical modalities. We further design a claim-level generation method that utilizes uncertainty estimation scores and simulator boundary assessment (UE+SBA) to efficiently verify and update claims. Extensive experiments demonstrate SimulRAG outperforms traditional RAG baselines by 30.4% in informativeness and 16.3% in factuality. UE+SBA further improves efficiency and quality for claim-level generation.

Title: The Rise of AfricaNLP: Contributions, Contributors, and Community Impact (2005-2025)

Authors: Tadesse Destaw Belay, Kedir Yassin Hussen, Sukairaj Hafiz Imam, Iqra Ameer, Ibrahim Said Ahmad, Isa Inuwa-Dutse, Idris Abdulmumin, Grigori Sidorov, Vukosi Marivate, Seid Muhie Yimam, Shamsuddeen Hassan Muhammad
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.25477
Pdf URL: https://arxiv.org/pdf/2509.25477
Copy Paste: [[2509.25477]] The Rise of AfricaNLP: Contributions, Contributors, and Community Impact (2005-2025)(https://arxiv.org/abs/2509.25477)
Keywords: language model, llm
Abstract: Natural Language Processing (NLP) is undergoing constant transformation, as Large Language Models (LLMs) are driving daily breakthroughs in research and practice. In this regard, tracking the progress of NLP research and automatically analyzing the contributions of research papers provides key insights into the nature of the field and the researchers. This study explores the progress of African NLP (AfricaNLP) by asking (and answering) basic research questions such as: i) How has the nature of NLP evolved over the last two decades? ii) What are the contributions of AfricaNLP papers? iii) Which individuals and organizations (authors, affiliated institutions, and funding bodies) have been involved in the development of AfricaNLP? We quantitatively examine the contributions of AfricaNLP research using 1.9K NLP paper abstracts, 4.9K author contributors, and 7.8K human-annotated contribution sentences (AfricaNLPContributions) along with benchmark results. Our dataset and continuously existing NLP progress tracking website provide a powerful lens for tracing AfricaNLP research trends and hold potential for generating data-driven literature surveys.

Title: Not Wrong, But Untrue: LLM Overconfidence in Document-Based Queries

Authors: Nick Hagar, Wilma Agustianto, Nicholas Diakopoulos
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.25498
Pdf URL: https://arxiv.org/pdf/2509.25498
Copy Paste: [[2509.25498]] Not Wrong, But Untrue: LLM Overconfidence in Document-Based Queries(https://arxiv.org/abs/2509.25498)
Keywords: language model, gpt, llm, hallucination, prompt, chat
Abstract: Large language models (LLMs) are increasingly used in newsroom workflows, but their tendency to hallucinate poses risks to core journalistic practices of sourcing, attribution, and accuracy. We evaluate three widely used tools - ChatGPT, Gemini, and NotebookLM - on a reporting-style task grounded in a 300-document corpus related to TikTok litigation and policy in the U.S. We vary prompt specificity and context size and annotate sentence-level outputs using a taxonomy to measure hallucination type and severity. Across our sample, 30% of model outputs contained at least one hallucination, with rates approximately three times higher for Gemini and ChatGPT (40%) than for NotebookLM (13%). Qualitatively, most errors did not involve invented entities or numbers; instead, we observed interpretive overconfidence - models added unsupported characterizations of sources and transformed attributed opinions into general statements. These patterns reveal a fundamental epistemological mismatch: While journalism requires explicit sourcing for every claim, LLMs generate authoritative-sounding text regardless of evidentiary support. We propose journalism-specific extensions to existing hallucination taxonomies and argue that effective newsroom tools need architectures that enforce accurate attribution rather than optimize for fluency.

Title: MixtureVitae: Open Web-Scale Pretraining Dataset With High Quality Instruction and Reasoning Data Built from Permissive-First Text Sources

Authors: Huu Nguyen, Victor May, Harsh Raj, Marianna Nezhurina, Yishan Wang, Yanqi Luo, Minh Chien Vu, Taishi Nakamura, Ken Tsui, Van Khue Nguyen, David Salinas, Aleksandra Krasnodębska, Christoph Schuhmann, Mats Leon Richter, Xuan-Son (Sonny) Vu, Jenia Jitsev
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.25531
Pdf URL: https://arxiv.org/pdf/2509.25531
Copy Paste: [[2509.25531]] MixtureVitae: Open Web-Scale Pretraining Dataset With High Quality Instruction and Reasoning Data Built from Permissive-First Text Sources(https://arxiv.org/abs/2509.25531)
Keywords: llm
Abstract: We present MixtureVitae, an open-access pretraining corpus built to minimize legal risk while providing strong model performance. MixtureVitae follows a risk-mitigated sourcing strategy that combines public-domain and permissively licensed text (e.g., CC-BY/Apache) with carefully justified low-risk additions (e.g., government works and EU TDM-eligible sources), alongside targeted instruction, reasoning and synthetic data with documented provenance. We detail a transparent, multi-stage pipeline for license-aware filtering, safety and quality screening, and domain-aware mixing, and we release the dataset and curation recipes to support reproducible research. In controlled experiments using the open-sci-ref training protocol (fixed architectures at 130M/400M/1.3B/1.7B parameters; training budgets of 50B and 300B tokens), models trained on MixtureVitae consistently outperform models trained on other permissive datasets across a suite of standard benchmarks, and at the 1.7B/300B setting they surpass FineWeb-Edu and approach DCLM in the later stages of training. Performance is particularly strong on math/code and competitive on QA tasks. These results demonstrate that permissive-first, risk-mitigated data provides a practical and legally mitigated foundation for training capable LLMs, reducing reliance on indiscriminate web scraping without sacrificing competitiveness.

Title: Calibrating Verbalized Confidence with Self-Generated Distractors

Authors: Victor Wang, Elias Stengel-Eskin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.25532
Pdf URL: https://arxiv.org/pdf/2509.25532
Copy Paste: [[2509.25532]] Calibrating Verbalized Confidence with Self-Generated Distractors(https://arxiv.org/abs/2509.25532)
Keywords: language model, llm
Abstract: Calibrated confidence estimates are necessary for large language model (LLM) outputs to be trusted by human users. While LLMs can express their confidence in human-interpretable ways, verbalized LLM-generated confidence scores have empirically been found to be miscalibrated, reporting high confidence on instances with low accuracy and thereby harming trust and safety. We hypothesize that this overconfidence often stems from a given LLM's heightened suggestibility when faced with claims that it encodes little information about; we empirically validate this hypothesis, finding more suggestibility on lower-accuracy claims. Building on this finding, we introduce Distractor-Normalized Coherence (DINCO), which estimates and accounts for an LLM's suggestibility bias by having the model verbalize its confidence independently across several self-generated distractors (i.e. alternative claims), and normalizes by the total verbalized confidence. To further improve calibration, we leverage generator-validator disagreement, augmenting normalized validator confidence with a consistency-based estimate of generator confidence. Here, we frame the popular approach of self-consistency as leveraging coherence across sampled generations, and normalized verbalized confidence as leveraging coherence across validations on incompatible claims, allowing us to integrate these complementary dimensions of coherence into DINCO. Moreover, our analysis shows that DINCO provides less saturated -- and therefore more usable -- confidence estimates, and that further sampling alone cannot close the gap between DINCO and baselines, with DINCO at 10 inference calls outperforming self-consistency at 100.
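
The normalization step described in the abstract above can be illustrated with a small sketch. It assumes the simplest reading, in which the confidence verbalized for the main claim is divided by the total confidence verbalized across the claim and its self-generated distractors. The full DINCO method may weight or filter distractors differently, and the function name `normalized_confidence` is hypothetical.

```python
def normalized_confidence(claim_conf, distractor_confs):
    """Normalize a verbalized confidence by the total over incompatible claims.

    claim_conf: confidence in [0, 1] the model verbalizes for the main claim.
    distractor_confs: confidences verbalized independently for alternative,
    mutually incompatible claims.
    Assumption: plain division by the total, as a simplified reading.
    """
    total = claim_conf + sum(distractor_confs)
    if total == 0:
        return 0.0
    return claim_conf / total

# A suggestible model says "90%" for the claim but also 80/70/60% for
# incompatible distractors, so the normalized estimate drops sharply.
suggestible = normalized_confidence(0.9, [0.8, 0.7, 0.6])
# A coherent model assigns little confidence to the distractors.
coherent = normalized_confidence(0.9, [0.05, 0.05])
```

Because the distractors are incompatible with the claim, confidences that sum to far more than 1 signal incoherence, and normalization pulls the estimate down accordingly.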

Title: Self-Rewarding Rubric-Based Reinforcement Learning for Open-Ended Reasoning

Authors: Zhiling Ye, Yun Yue, Haowen Wang, Xudong Han, Jiadi Jiang, Cheng Wei, Lei Fan, Jiaxin Liang, Shuowen Zhang, Ji Li, Chunxiao Guo, Jian Wang, Peng Wei, Jinjie Gu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.25534
Pdf URL: https://arxiv.org/pdf/2509.25534
Copy Paste: [[2509.25534]] Self-Rewarding Rubric-Based Reinforcement Learning for Open-Ended Reasoning(https://arxiv.org/abs/2509.25534)
Keywords: language model, gpt
Abstract: Open-ended evaluation is essential for deploying large language models in real-world settings. In studying HealthBench, we observe that using the model itself as a grader and generating rubric-based reward signals substantially improves reasoning performance. Remarkably, the trained model also becomes a stronger grader. Motivated by this, we introduce Self-Rewarding Rubric-Based Reinforcement Learning for Open-Ended Reasoning, a lightweight framework that enables faster and more resource-efficient training while surpassing baselines. Remarkably, on Qwen3-32B, training with just the 4000-sample HealthBench Easy subset is sufficient to obtain a model that exceeds GPT-5 on HealthBench Hard. Incorporating a small amount of teacher-graded data further enhances performance for less capable models.
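
To make the idea of a rubric-based reward signal concrete, here is a minimal sketch. It assumes a HealthBench-style scheme in which each rubric criterion carries positive or negative points, the points of criteria judged as met are summed, and the total is divided by the maximum achievable positive points and clipped to [0, 1]. The abstract does not spell out the reward formula, so this is an assumption, and `rubric_reward` and the example criteria are hypothetical. In the self-rewarding setup the `met` judgments would come from the policy model acting as its own grader.

```python
def rubric_reward(criteria, met):
    """Score a response against a rubric.

    criteria: list of (name, points); points may be negative for
    undesirable behaviors.
    met: set of criterion names the grader judged as satisfied.
    Assumption: normalized, clipped point total; not the paper's exact formula.
    """
    max_points = sum(points for _, points in criteria if points > 0)
    if max_points == 0:
        return 0.0
    earned = sum(points for name, points in criteria if name in met)
    # Clip so that penalties cannot push the reward below zero.
    return min(1.0, max(0.0, earned / max_points))

rubric = [
    ("mentions_red_flags", 5),
    ("recommends_follow_up", 3),
    ("gives_unsafe_dosage", -4),
]
reward = rubric_reward(rubric, {"mentions_red_flags", "gives_unsafe_dosage"})
```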

Title: Aligning Multilingual Reasoning with Verifiable Semantics from a High-Resource Expert Model

Authors: Fahim Faisal, Kaiqiang Song, Song Wang, Simin Ma, Shujian Liu, Haoyun Deng, Sathish Reddy Indurthi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.25543
Pdf URL: https://arxiv.org/pdf/2509.25543
Copy Paste: [[2509.25543]] Aligning Multilingual Reasoning with Verifiable Semantics from a High-Resource Expert Model(https://arxiv.org/abs/2509.25543)
Keywords: language model, llm, agent
Abstract: While reinforcement learning has advanced the reasoning abilities of Large Language Models (LLMs), these gains are largely confined to English, creating a significant performance disparity across languages. To address this, we introduce Pivot-Based Reinforcement Learning with Semantically Verifiable Rewards (PB-RLSVR), a novel framework that enhances multilingual reasoning by circumventing the need for human-annotated data in target languages. Our approach employs a high-performing English LLM as a "pivot" model to generate reference responses for reasoning tasks. A multilingual model is then rewarded based on the semantic equivalence of its responses to the English reference, effectively transferring the pivot model's reasoning capabilities across languages. We investigate several cross-lingual semantic reward functions, including those based on embeddings and machine translation. Extensive experiments on a suite of multilingual reasoning benchmarks show that our method significantly narrows the performance gap between English and other languages, substantially outperforming traditional PPO baselines. Specifically, our PB-RLSVR framework improves the average multilingual performance of Llama-3.1-8B-Instruct and Qwen3-32B by 16.41% and 10.17%, respectively, demonstrating a powerful and data-efficient approach to building truly multilingual reasoning agents.
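
One of the reward families mentioned in the abstract above, embedding-based semantic equivalence, can be sketched as a cosine similarity between the embedding of the multilingual model's response and the embedding of the English pivot reference. The embeddings themselves are assumed to come from some multilingual sentence encoder that is not shown; the function names and the clipping of negative similarities to zero are illustrative choices, not details confirmed by the paper.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def semantic_reward(response_emb, reference_emb):
    """Reward in [0, 1]: similarity to the pivot reference, floored at 0.

    Assumption: inputs are embeddings from a shared multilingual space.
    """
    return max(0.0, cosine_similarity(response_emb, reference_emb))

# Toy vectors standing in for real sentence embeddings.
aligned = semantic_reward([1.0, 2.0], [2.0, 4.0])
unrelated = semantic_reward([1.0, 0.0], [0.0, 1.0])
```

A machine-translation-based reward, the other family the abstract names, would instead translate the response into English before comparing it with the reference.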

Title: Probing the Limits of Stylistic Alignment in Vision-Language Models

Authors: Asma Farajidizaji, Akash Gupta, Vatsal Raina
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.25568
Pdf URL: https://arxiv.org/pdf/2509.25568
Copy Paste: [[2509.25568]] Probing the Limits of Stylistic Alignment in Vision-Language Models(https://arxiv.org/abs/2509.25568)
Keywords: language model
Abstract: Vision-language models are increasingly used to generate image captions in specific styles, such as humorous or romantic. However, these transformer-based models often struggle with this subjective task in a zero-shot setting. While preference data can be used to align them toward a desired style, such data is expensive to acquire, limiting the ability to explore the models' full capabilities. This work addresses this by studying the data efficiency of aligning small vision-language models to humorous and romantic styles. This approach helps to define the performance limits of these models and determine how little preference data is needed to achieve stylistic saturation, benchmarking their capabilities and limitations.

Title: RFG: Test-Time Scaling for Diffusion Large Language Model Reasoning with Reward-Free Guidance

Authors: Tianlang Chen, Minkai Xu, Jure Leskovec, Stefano Ermon
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2509.25604
Pdf URL: https://arxiv.org/pdf/2509.25604
Copy Paste: [[2509.25604]] RFG: Test-Time Scaling for Diffusion Large Language Model Reasoning with Reward-Free Guidance(https://arxiv.org/abs/2509.25604)
Keywords: language model, llm
Abstract: Diffusion large language models (dLLMs) have shown great potential in large-scale language modeling, and there is an increasing interest in further improving the capacity to solve complex problems by guiding the reasoning process step by step. Common practice for autoregressive language models typically learns a process reward model with dense annotation for each intermediate step. However, this is challenging for dLLMs where the generation is in an any-order fashion and intermediate states are partially masked sentences. To this end, in this paper, we propose reward-free guidance (RFG), a principled method for guiding the reasoning trajectory of dLLMs without explicit process reward. The key idea of RFG is to parameterize the process reward by log-likelihood ratios of the enhanced and reference dLLMs, where the enhanced model can be easily obtained by any off-the-shelf dLLM that has been post-trained with reinforcement learning (RL) or supervised fine-tuning (SFT). We provide theoretical justification that RFG induces the reward-guided sampling distribution with no additional reward. We conduct comprehensive experiments on four challenging mathematical reasoning and code generation benchmarks using a diverse suite of dLLMs enhanced with various post-training methods. RFG consistently yields significant improvements across all tasks and model types, achieving accuracy gains of up to 9.2%. These findings establish RFG as a general training-free framework that scales test-time reasoning without reliance on external reward models.
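
The key idea in the abstract above, using the log-likelihood ratio between an enhanced and a reference model as an implicit reward, can be sketched on a single token distribution. The sketch assumes a guidance rule in the style of classifier-free guidance, where the enhanced model's log-probabilities are shifted by `w` times the log-ratio and then renormalized. The exact weighting used by RFG, and how it is applied across the denoising steps of a dLLM, is not given in the abstract, so the function names and the parameter `w` are hypothetical.

```python
import math

def log_ratio_reward(enhanced_prob, reference_prob):
    """Implicit reward: log-likelihood ratio of enhanced vs. reference model."""
    return math.log(enhanced_prob) - math.log(reference_prob)

def guided_distribution(enhanced_probs, reference_probs, w):
    """Reweight candidate tokens by the log-ratio reward and renormalize.

    Assumption: score = log p_enh + w * (log p_enh - log p_ref).
    All probabilities must be strictly positive. w = 0 recovers the
    enhanced model unchanged.
    """
    scores = [
        math.log(pe) + w * log_ratio_reward(pe, pr)
        for pe, pr in zip(enhanced_probs, reference_probs)
    ]
    # Numerically stable softmax over the guided scores.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# The enhanced model prefers token 0 more than the reference does,
# so guidance with w > 0 sharpens that preference further.
guided = guided_distribution([0.6, 0.4], [0.5, 0.5], w=1.0)
```

No separate reward model is trained or queried here; the only inputs are the two models' probabilities, which is what makes the guidance reward-free.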

Title: Transformers through the lens of support-preserving maps between measures

Authors: Takashi Furuya, Maarten V. de Hoop, Matti Lassas
Subjects: cs.CL, stat.ML
Abstract URL: https://arxiv.org/abs/2509.25611
Pdf URL: https://arxiv.org/pdf/2509.25611
Copy Paste: [[2509.25611]] Transformers through the lens of support-preserving maps between measures(https://arxiv.org/abs/2509.25611)
Keywords: prompt
Abstract: Transformers are deep architectures that define "in-context maps" which enable predicting new tokens based on a given set of tokens (such as a prompt in NLP applications or a set of patches for a vision transformer). In previous work, we studied the ability of these architectures to handle an arbitrarily large number of context tokens. To analyze their expressivity mathematically and uniformly, we considered the case where the mappings are conditioned on a context represented by a probability distribution, which becomes discrete for a finite number of tokens. Modeling neural networks as maps on probability measures has multiple applications, such as studying Wasserstein regularity, proving generalization bounds, and doing a mean-field limit analysis of the dynamics of interacting particles as they go through the network. In this work, we study which maps between measures are transformers. We fully characterize the properties of maps between measures that enable these to be represented in terms of in-context maps via a push forward. On the one hand, these include transformers; on the other hand, transformers universally approximate representations with any continuous in-context map. These properties are that the map preserves the cardinality of the support and that the regular part of its Fréchet derivative is uniformly continuous. Moreover, we show that the solution map of the Vlasov equation, which is of nonlocal transport type, for interacting particle systems in the mean-field regime for the Cauchy problem satisfies these conditions and, hence, can be approximated by a transformer; conversely, we prove that the measure-theoretic self-attention has the properties that ensure that the infinite-depth, mean-field measure-theoretic transformer can be identified with a Vlasov flow.

Title: The Media Bias Detector: A Framework for Annotating and Analyzing the News at Scale

Authors: Samar Haider, Amir Tohidi, Jenny S. Wang, Timothy Dörr, David M. Rothschild, Chris Callison-Burch, Duncan J. Watts
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.25649
Pdf URL: https://arxiv.org/pdf/2509.25649
Copy Paste: [[2509.25649]] The Media Bias Detector: A Framework for Annotating and Analyzing the News at Scale(https://arxiv.org/abs/2509.25649)
Keywords: language model, llm
Abstract: Mainstream news organizations shape public perception not only directly through the articles they publish but also through the choices they make about which topics to cover (or ignore) and how to frame the issues they do decide to cover. However, measuring these subtle forms of media bias at scale remains a challenge. Here, we introduce a large, ongoing (from January 1, 2024 to present), near real-time dataset and computational framework developed to enable systematic study of selection and framing bias in news coverage. Our pipeline integrates large language models (LLMs) with scalable, near-real-time news scraping to extract structured annotations -- including political lean, tone, topics, article type, and major events -- across hundreds of articles per day. We quantify these dimensions of coverage at multiple levels -- the sentence level, the article level, and the publisher level -- expanding the ways in which researchers can analyze media bias in the modern news landscape. In addition to a curated dataset, we also release an interactive web platform for convenient exploration of these data. Together, these contributions establish a reusable methodology for studying media bias at scale, providing empirical resources for future research. Leveraging the breadth of the corpus over time and across publishers, we also present some examples (focused on the 150,000+ articles examined in 2024) that illustrate how this novel data set can reveal insightful patterns in news coverage and bias, supporting academic research and real-world efforts to improve media accountability.
摘要：主流新闻机构不仅通过他们发表的文章，还通过他们对要涵盖（或忽略）的主题做出的选择来塑造公众的看法，以及如何构建他们决定涵盖的问题。但是，在大规模测量这些微妙的媒体偏见仍然是一个挑战。在这里，我们介绍了一个正在进行的大型，正在进行的（从2024年1月1日到现在），靠近实时数据集和计算框架，以实现对新闻报道中的选择和框架偏见的系统研究。我们的管道将大型语言模型（LLM）与可扩展的，近实时的新闻刮擦整合在一起，以提取结构化注释，包括每天数百篇文章，包括政治精益，音调，主题，文章类型和重大事件。我们在多个层面（句子级别，文章级别和发布者级别）量化了这些覆盖范围的这些维度，从而扩大了研究人员可以分析现代新闻环境中媒体偏见的方式。除了策划的数据集外，我们还发布了一个交互式Web平台，以方便探索这些数据。这些贡献共同建立了可重复使用的方法，用于研究媒体偏见，为未来的研究提供了经验资源。利用语料库的广度，随着时间的流逝和整个发行商，我们还展示了一些示例（重点介绍了2024年研究的15万多种文章），这些示例说明了这一新颖的数据集如何揭示新闻报道和偏见中有见地的模式，支持学术研究和现实世界中的努力，以改善媒体责任。

Title: QFrBLiMP: a Quebec-French Benchmark of Linguistic Minimal Pairs

Authors: David Beauchemin, Pier-Luc Veilleux, Richard Khoury, Johanna-Pascale Roy
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.25664
Pdf URL: https://arxiv.org/pdf/2509.25664
Copy Paste: [[2509.25664]] QFrBLiMP: a Quebec-French Benchmark of Linguistic Minimal Pairs(https://arxiv.org/abs/2509.25664)
Keywords: llm
Abstract: In this paper, we introduce the Quebec-French Benchmark of Linguistic Minimal Pairs (QFrBLiMP), a corpus designed to evaluate the linguistic knowledge of LLMs on prominent grammatical phenomena in Quebec-French. QFrBLiMP consists of 1,761 minimal pairs annotated with 20 linguistic phenomena. Specifically, these minimal pairs have been created by manually modifying sentences extracted from an official online resource maintained by a Québec government institution. Each pair is annotated by twelve Quebec-French native speakers, who select the sentence they feel is grammatical amongst the two. These annotations are used to compare the competency of LLMs with that of humans. We evaluate different LLMs on QFrBLiMP and MultiBLiMP-Fr by observing the rate of higher probabilities assigned to the sentences of each minimal pair for each category. We find that while grammatical competence scales with model size, a clear hierarchy of difficulty emerges. All benchmarked models consistently fail on phenomena requiring deep semantic understanding, revealing a critical limitation and a significant gap compared to human performance on these specific tasks.
摘要：在本文中，我们介绍了语言最小对（QFRBLIMP）的魁北克 - 法国基准，该基准是一种旨在评估Quebec-French中突出语法现象的语言知识的语料库。 Qfrblimp由1,761个最小对，用20种语言现象注释。具体而言，这些最小对是通过手动修改魁北克政府机构维护的官方在线资源中提取的句子来创建的。每对均由十二个魁北克 - 法国人说的母语者注释，他们选择他们认为是语法的句子在两者中是语法。这些注释用于将LLM与人类的能力进行比较。我们通过观察分配给每个类别每个最小对句子的较高概率的速率来评估QFRBLIMP和Multiblimp-FR上的不同LLM。我们发现，尽管语法能力具有模型大小的尺度，但出现了清晰的难度层次结构。与人类在这些特定任务上的表现相比，所有基准模型始终失败，需要深入的语义理解，揭示了关键的局限性和显着差距。
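
Note: the evaluation protocol in the abstract above (counting how often a model assigns the higher probability to the grammatical sentence of each minimal pair) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `minimal_pair_accuracy`, the lookup-table scorer, the example sentences, and the log-probability values are all invented for the example. In practice the scorer would be a language model returning a sentence log-probability.

```python
from typing import Callable, Sequence, Tuple


def minimal_pair_accuracy(
    pairs: Sequence[Tuple[str, str]],
    log_prob: Callable[[str], float],
) -> float:
    """Fraction of (grammatical, ungrammatical) pairs in which the
    grammatical sentence receives the strictly higher score."""
    if not pairs:
        raise ValueError("need at least one minimal pair")
    wins = sum(1 for good, bad in pairs if log_prob(good) > log_prob(bad))
    return wins / len(pairs)


# Hypothetical stand-in for a language model: made-up sentence log-probabilities.
TOY_SCORES = {
    "Les enfants jouent dehors.": -12.0,
    "Les enfants joue dehors.": -15.5,
    "Elle a mangé la pomme.": -11.0,
    "Elle a mangée la pomme.": -10.2,
}

# Each pair is (grammatical, ungrammatical); the sentences are illustrative only.
toy_pairs = [
    ("Les enfants jouent dehors.", "Les enfants joue dehors."),
    ("Elle a mangé la pomme.", "Elle a mangée la pomme."),
]

accuracy = minimal_pair_accuracy(toy_pairs, TOY_SCORES.__getitem__)
print(accuracy)  # the toy scorer prefers the grammatical sentence in 1 of 2 pairs
```

Computing the same ratio per phenomenon would give the per-category rates the abstract refers to.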

Title: Mitigating Biases in Language Models via Bias Unlearning

Authors: Dianqing Liu, Yi Liu, Guoqing Jin, Zhendong Mao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.25673
Pdf URL: https://arxiv.org/pdf/2509.25673
Copy Paste: [[2509.25673]] Mitigating Biases in Language Models via Bias Unlearning(https://arxiv.org/abs/2509.25673)
Keywords: language model, prompt
Abstract: Many studies have shown various biases targeting different demographic groups in language models, amplifying discrimination and harming fairness. Recent parameter-modification debiasing approaches significantly degrade core capabilities such as text coherence and task accuracy. Prompt-based debiasing methods, in turn, are effective only for predefined trigger words and fail to address deeply embedded stereotypical associations in model parameters. In this paper, we propose BiasUnlearn, a novel model debiasing framework which achieves targeted debiasing via dual-pathway unlearning mechanisms coordinating stereotype forgetting with anti-stereotype retention, while preventing bias polarity reversal through an adversarial forget set and dynamic dataset swapping. We conducted extensive experiments with multiple language models across various evaluation benchmarks. The results show that BiasUnlearn outperforms existing methods in mitigating bias in language models while retaining language modeling capabilities. Further experiments reveal that debiasing weights are transferable across model variants, confirming that bias representations become entrenched during pre-training and persist through fine-tuning phases.
摘要：许多研究表明，针对语言模型中不同人口组的各种偏见，扩大了歧视和损害公平。最近的参数修改偏数方法显着降低了核心功能，例如文本连贯性和任务准确性。和基于及时的词汇方法（仅对预定义触发单词有效）无法解决模型参数中深层嵌入式刻板印象的关联。在本文中，我们提出了Biasunlearn，这是一个新型的模型模型框架框架，该框架通过双座道未学习机制实现了有针对性的偏差，以协调刻板印象与反疾病型保留的遗忘，同时防止偏见通过对抗性忘记忘记的设置和动态数据集互换。我们对各种评估基准的多种语言模型进行了广泛的实验。结果表明，Biasunlearn在减轻语言模型中的偏见方面优于现有方法，同时保留语言建模功能。进一步的实验表明，在模型变体之间可以转移偏见的权重，证实偏见表示在预训练期间会根深蒂固，并通过微调阶段持续存在。

Title: LD-MoLE: Learnable Dynamic Routing for Mixture of LoRA Experts

Authors: Yuan Zhuang, Yi Shen, Yuexin Bian, Qing Su, Shihao Ji, Yuanyuan Shi, Fei Miao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.25684
Pdf URL: https://arxiv.org/pdf/2509.25684
Copy Paste: [[2509.25684]] LD-MoLE: Learnable Dynamic Routing for Mixture of LoRA Experts(https://arxiv.org/abs/2509.25684)
Keywords: language model, llm
Abstract: Recent studies have shown that combining parameter-efficient fine-tuning (PEFT) with mixture-of-experts (MoE) is an effective strategy for adapting large language models (LLMs) to downstream tasks. However, most existing approaches rely on conventional TopK routing, which requires careful hyperparameter tuning and assigns a fixed number of experts to each token. In this work, we propose LD-MoLE, a Learnable Dynamic routing mechanism for Mixture of LoRA Experts that enables adaptive, token-dependent, and layer-wise expert allocation. Our method replaces the non-differentiable TopK selection with a differentiable routing function and a closed-form solution. Moreover, our design allows the model to adaptively determine the number of experts to activate for each token at different layers. In addition, we introduce an analytical sparsity control objective to regularize the number of activated experts. Extensive experiments on the Qwen3-1.7B and Llama-3.2-3B models show that LD-MoLE achieves the highest average scores compared to state-of-the-art baselines across a diverse set of benchmarks. Our method not only achieves superior performance, but also demonstrates the ability to learn token-dependent and layer-wise expert allocation.
摘要：最近的研究表明，将参数有效的微调（PEFT）与专家混合物（MOE）相结合是将大型语言模型（LLMS）适应下游任务的有效策略。但是，大多数现有的方法都依赖于常规的TOPK路由，这需要仔细的高参数调整，并为每个令牌分配固定数量的专家。在这项工作中，我们提出了LD-mole，这是一种可学习的动态路由机制，用于洛拉专家的混合物，以实现自适应，依赖令牌和层次专家分配。我们的方法用可区分的路由函数和封闭形式的解决方案代替了非不同的TOPK选择。此外，我们的设计允许模型自适应地确定在不同层上激活每个令牌的专家数量。此外，我们引入了一个分析稀疏控制目标，以使活化专家的数量正常。在QWEN3-1.7B和LLAMA-3.2-3B模型上进行的广泛实验表明，LD-Mole与最新基线相比，在各种基准测试基准中，LD-Mole达到了最高的平均得分。我们的方法不仅取得了卓越的性能，而且还展示了学习代币依赖和层次专家分配的能力。
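
Note: the contrast drawn in the abstract above, between TopK routing (a fixed number of experts per token) and a routing function with a closed-form solution that lets the number of active experts vary, can be illustrated with a small sketch. The abstract does not name the paper's routing function, so this example swaps in sparsemax (the Euclidean projection onto the probability simplex) purely as a well-known stand-in with those two properties. The function names and example logits are assumptions, not the authors' method.

```python
import math
from typing import List


def top_k_route(logits: List[float], k: int) -> List[float]:
    """Conventional TopK routing: softmax over the k largest logits, zero elsewhere."""
    if not 1 <= k <= len(logits):
        raise ValueError("k must be between 1 and the number of experts")
    keep = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in keep)
    exps = {i: math.exp(logits[i] - peak) for i in keep}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(len(logits))]


def sparsemax(logits: List[float]) -> List[float]:
    """Stand-in dynamic router: closed-form projection onto the simplex.

    The number of non-zero weights depends on the logits themselves."""
    cumulative = 0.0
    threshold = 0.0
    for j, z in enumerate(sorted(logits, reverse=True), start=1):
        cumulative += z
        if 1.0 + j * z > cumulative:  # the j largest logits are still in the support
            threshold = (cumulative - 1.0) / j
    return [max(z - threshold, 0.0) for z in logits]


def count_active(weights: List[float]) -> int:
    """Number of experts that receive a non-zero routing weight."""
    return sum(1 for w in weights if w > 0.0)


confident = [4.0, 0.5, 0.2, 0.1]   # one expert clearly dominates
ambiguous = [1.0, 0.9, 0.8, -2.0]  # three experts are nearly tied

for name, logits in (("confident", confident), ("ambiguous", ambiguous)):
    print(name, count_active(top_k_route(logits, 2)), count_active(sparsemax(logits)))
```

With TopK the count is always `k`, while the stand-in router activates one expert for the confident token and three for the ambiguous one.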

Title: Atomic Thinking of LLMs: Decoupling and Exploring Mathematical Reasoning Abilities

Authors: Jiayi Kuang, Haojing Huang, Yinghui Li, Xinnian Liang, Zhikun Xu, Yangning Li, Xiaoyu Tan, Chao Qu, Meishan Zhang, Ying Shen, Philip S. Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.25725
Pdf URL: https://arxiv.org/pdf/2509.25725
Copy Paste: [[2509.25725]] Atomic Thinking of LLMs: Decoupling and Exploring Mathematical Reasoning Abilities(https://arxiv.org/abs/2509.25725)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have demonstrated outstanding performance in mathematical reasoning capabilities. However, we argue that current large-scale reasoning models primarily rely on scaling up training datasets with diverse mathematical problems and long thinking chains, which raises questions about whether LLMs genuinely acquire mathematical concepts and reasoning principles or merely remember the training data. In contrast, humans tend to break down complex problems into multiple fundamental atomic capabilities. Inspired by this, we propose a new paradigm for evaluating mathematical atomic capabilities. Our work categorizes atomic abilities into two dimensions: (1) field-specific abilities across four major mathematical fields, algebra, geometry, analysis, and topology, and (2) logical abilities at different levels, including conceptual understanding, forward multi-step reasoning with formal math language, and counterexample-driven backward reasoning. We propose corresponding training and evaluation datasets for each atomic capability unit, and conduct extensive experiments about how different atomic capabilities influence others, to explore the strategies to elicit the required specific atomic capability. Evaluation and experimental results on advanced models show many interesting discoveries and inspirations about the different performances of models on various atomic capabilities and the interactions between atomic capabilities. Our findings highlight the importance of decoupling mathematical intelligence into atomic components, providing new insights into model cognition and guiding the development of training strategies toward a more efficient, transferable, and cognitively grounded paradigm of "atomic thinking".
摘要：大型语言模型（LLM）在数学推理能力方面表现出了出色的表现。但是，我们认为当前的大规模推理模型主要依赖于具有多种数学问题和长期思维链的培训数据集扩展，这引发了有关LLM的问题是否真正获得数学概念和推理原则，或者仅记住培训数据。相比之下，人类倾向于将复杂的问题分解为多个基本原子能的能力。受此启发，我们提出了一种用于评估数学原子能力的新范式。我们的工作将原子能力分为两个维度：（1）跨四个主要数学领域的特定能力，即代数，几何，分析和拓扑，以及（2）在不同级别上的逻辑能力，包括概念理解，具有正式数学语言的远期多步理论以及反例驱动的后向后推理。我们提出了每个原子能力单元的相应培训和评估数据集，并进行广泛的实验，涉及不同的原子能力如何影响他人，以探索引发所需的特定特定原子能的策略。高级模型的评估和实验结果显示了许多有趣的发现和灵感，这些发现和灵感在各种原子能力以及原子能力之间的相互作用上的相互作用。我们的发现强调了将数学智能脱钩到原子成分的重要性，为模型认知提供了新的见解，并指导培训策略的发展，以更有效，可转移和认知地扎根于“原子思维”。

Title: CATCH: A Novel Data Synthesis Framework for High Therapy Fidelity and Memory-Driven Planning Chain of Thought in AI Counseling

Authors: Mingyu Chen, Jingkai Lin, Zhaojie Chu, Xiaofen Xing, Yirong Chen, Xiangmin Xu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.25733
Pdf URL: https://arxiv.org/pdf/2509.25733
Copy Paste: [[2509.25733]] CATCH: A Novel Data Synthesis Framework for High Therapy Fidelity and Memory-Driven Planning Chain of Thought in AI Counseling(https://arxiv.org/abs/2509.25733)
Keywords: language model, chain-of-thought, agent
Abstract: Recently, advancements in AI counseling based on large language models have shown significant progress. However, existing studies employ a one-time generation approach to synthesize multi-turn dialogue samples, resulting in low therapy fidelity and failing to capture the decision-making rationale behind each response. In this work, we propose CATCH, a novel data synthesis framework designed to address these challenges. Specifically, to improve therapy fidelity, we introduce the Progressive Dialogue Synthesis strategy, which extracts goals, resources, and solutions from a client's self-report, organizes them into structured outlines, and then incrementally generates stage-aligned counseling dialogues. To capture the decision-making rationale behind each response, we propose the Memory-Driven Dynamic Planning (MDP) thinking pattern that integrates memory enhancement, global planning, and strategy reasoning; a collaborative multi-agent optimizer then leverages MDP to attach explicit chain-of-thought to each dialogue turn. Extensive experiments and human evaluations demonstrate that CATCH significantly enhances fidelity and logical coherence in AI counseling.
摘要：最近，基于大语言模型的AI咨询的进步已显示出很大的进步。但是，现有研究采用一次性生成方法来综合多转向对话样本，从而导致较低的治疗保真度，并且未能捕获每种反应背后的决策原理。在这项工作中，我们提出了一个新型的数据综合框架Catch，旨在应对这些挑战。具体来说，为了改善治疗保真度，我们介绍了渐进式对话综合策略，从客户的自我报告中提取目标，资源和解决方案，将它们组织成结构化的轮廓，然后逐步生成阶段一致的咨询对话。为了捕获每个响应背后的决策理由，我们提出了以内存为驱动的动态计划思维模式，以整合内存增强，全球计划和策略推理；然后，一个协作的多代理优化器利用MDP将明确的思想链附加到每个对话转弯。广泛的实验和人类评估表明，Catch可显着提高AI咨询中的忠诚度和逻辑连贯性。

Title: Think Less, Label Better: Multi-Stage Domain-Grounded Synthetic Data Generation for Fine-Tuning Large Language Models in Telecommunications

Authors: Chenhua Shi, Gregor Macdonald, Bhavika Jalli, Wanlu Lei, John Zou, Mridul Jain, Joji Philip
Subjects: cs.CL, cs.AI, cs.IT, cs.NI
Abstract URL: https://arxiv.org/abs/2509.25736
Pdf URL: https://arxiv.org/pdf/2509.25736
Copy Paste: [[2509.25736]] Think Less, Label Better: Multi-Stage Domain-Grounded Synthetic Data Generation for Fine-Tuning Large Language Models in Telecommunications(https://arxiv.org/abs/2509.25736)
Keywords: language model, llm
Abstract: The success of large language models (LLMs) depends heavily on large-scale, high-quality instruction-following and reinforcement datasets. However, generating such data through human annotation is prohibitively time-consuming, particularly for domain-specific tasks like telecom network troubleshooting, where accurate responses require deep technical expertise and contextual understanding. In this paper, we present a fully automated, retrieval-augmented pipeline for generating synthetic question-answer (QA) pairs grounded in structured domain knowledge. Our multi-stage framework integrates a retriever, base generator, and refinement model to synthesize and enhance QA pairs using documents retrieved from a domain-specific knowledge graph. To ensure data quality, we employ customized RAGAS-based scoring to filter low-quality samples, producing a high-quality dataset suitable for reinforcement fine-tuning (RFT). We demonstrate our approach in a real-world telecom scenario focused on radio access network (RAN) troubleshooting. The resulting pipeline generates complex, context-rich troubleshooting solution plans without human intervention. This work offers a scalable solution for building instruction and reinforcement datasets in specialized domains, significantly reducing dependence on manual labeling while maintaining high technical fidelity.
摘要：大语言模型（LLM）的成功在很大程度上取决于大规模，高质量的指导跟踪和增强数据集。但是，通过人类注释生成此类数据是非常耗时的，特别是对于特定于特定领域的任务（例如电信网络故障排除），准确的响应需要深厚的技术专业知识和上下文理解。在本文中，我们提出了一条完全自动化的检索调节管道，用于生成基于结构化域知识的合成问题解答（QA）对。我们的多阶段框架集成了检索器，基础生成器和改进模型，以使用从特定于域的知识图中检索到的文档合成和增强质量检查对。为了确保数据质量，我们采用定制的基于RAGAS的评分来过滤低质量的样本，从而产生适合加固微调（RFT）的高质量数据集。我们在关注无线电访问网络（RAN）故障排除的现实电信方案中演示了我们的方法。由此产生的管道在没有人类干预的情况下生成了复杂的，上下文富的故障排除解决方案计划。这项工作提供了可扩展的解决方案，用于在专用域中构建指导和增强数据集，从而大大降低了对手动标签的依赖，同时保持高技术忠诚度。

Title: TruthRL: Incentivizing Truthful LLMs via Reinforcement Learning

Authors: Zhepei Wei, Xiao Yang, Kai Sun, Jiaqi Wang, Rulin Shao, Sean Chen, Mohammad Kachuee, Teja Gollapudi, Tony Liao, Nicolas Scheffer, Rakesh Wanga, Anuj Kumar, Yu Meng, Wen-tau Yih, Xin Luna Dong
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.25760
Pdf URL: https://arxiv.org/pdf/2509.25760
Copy Paste: [[2509.25760]] TruthRL: Incentivizing Truthful LLMs via Reinforcement Learning(https://arxiv.org/abs/2509.25760)
Keywords: language model, llm, hallucination
Abstract: While large language models (LLMs) have demonstrated strong performance on factoid question answering, they are still prone to hallucination and untruthful responses, particularly when tasks demand information outside their parametric knowledge. Indeed, truthfulness requires more than accuracy -- models must also recognize uncertainty and abstain when unsure to avoid hallucinations. This presents a fundamental challenge for existing methods: approaches that optimize for accuracy often amplify hallucinations, while those that encourage abstention can become overly conservative, sacrificing correct answers. Both extremes ultimately compromise truthfulness. In this work, we present TruthRL, a general reinforcement learning (RL) framework that directly optimizes the truthfulness of LLMs. Specifically, we implement TruthRL using GRPO with a simple yet effective ternary reward that distinguishes correct answers, hallucinations, and abstentions. It incentivizes models to reduce hallucinations not only by providing correct responses, but also by enabling abstention when uncertain, thereby improving truthfulness. Extensive experiments across four knowledge-intensive benchmarks show that, compared to vanilla RL, TruthRL significantly reduces hallucinations by 28.9% and improves truthfulness by 21.1%, with consistent gains across various backbone models (e.g., Qwen, Llama) under both retrieval and non-retrieval setups. An in-depth ablation study demonstrates that vanilla accuracy-driven methods, such as supervised fine-tuning or RL with a binary reward, struggle to balance factual correctness and uncertainty. In contrast, our proposed truthfulness-driven TruthRL achieves strong performance in both accuracy and truthfulness, underscoring the importance of learning objective design for developing truthful LLMs.
摘要：尽管大型语言模型（LLMS）在Factoid问题回答上表现出了很强的表现，但它们仍然容易出现幻觉和不真实的回答，尤其是当任务要求其参数知识以外的信息时。确实，真实性不仅需要准确性 - 在不确定避免幻觉时，模型还必须认识到不确定性和弃权。这对现有方法提出了一个基本挑战：优化准确性的方法通常会放大幻觉，而鼓励戒除的方法可能会变得过于保守，牺牲了正确的答案。这两个极端最终都损害了真实性。在这项工作中，我们提出了TruthRl，这是一个普遍的强化学习（RL）框架，可直接优化LLM的真实性。具体来说，我们使用GRPO实现TruthRl，并具有简单而有效的三元奖励，从而区分了正确的答案，幻觉和弃权。它不仅通过提供正确的响应，而且在不确定时戒除，从而激励模型减少幻觉，从而改善真实性。在四个知识密集的基准测试中进行的广泛实验表明，与香草RL相比，TruthRl将幻觉显着降低了28.9％，并将真实性提高了21.1％，在各种骨干模型（例如QWEN，LLAMA，LLAMA）均取得一致的收入和非逆向设置。深入的消融研究表明，香草精度驱动的方法，例如以二进制奖励进行的微调或RL，努力平衡事实正确性和不确定性。相比之下，我们提出的真实性驱动的真理在准确性和真实性上都取得了强大的表现，强调了学习目标设计对发展真实LLM的重要性。
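
Note: the difference between the ternary reward and a binary reward mentioned in the abstract above can be sketched as follows. The abstract says only that the ternary reward distinguishes correct answers, hallucinations, and abstentions. The numeric values (+1, 0, -1), the abstention phrases, and the substring-matching classifier below are assumptions made for illustration, not the paper's implementation.

```python
from enum import Enum


class Outcome(Enum):
    CORRECT = "correct"
    ABSTAIN = "abstain"
    HALLUCINATION = "hallucination"


# Hypothetical abstention markers; a real system would likely use a stronger judge.
ABSTENTION_PHRASES = ("i don't know", "i am not sure")


def classify(response: str, gold_answer: str) -> Outcome:
    """Toy classifier: abstention phrases first, then substring match on the gold answer."""
    text = response.strip().lower()
    if any(phrase in text for phrase in ABSTENTION_PHRASES):
        return Outcome.ABSTAIN
    if gold_answer.strip().lower() in text:
        return Outcome.CORRECT
    return Outcome.HALLUCINATION


# Assumed reward values: abstaining is neutral, hallucinating is penalized.
TERNARY = {Outcome.CORRECT: 1.0, Outcome.ABSTAIN: 0.0, Outcome.HALLUCINATION: -1.0}
# Assumed binary baseline: anything that is not correct gets the same reward.
BINARY = {Outcome.CORRECT: 1.0, Outcome.ABSTAIN: 0.0, Outcome.HALLUCINATION: 0.0}


def ternary_reward(response: str, gold_answer: str) -> float:
    return TERNARY[classify(response, gold_answer)]


def binary_reward(response: str, gold_answer: str) -> float:
    return BINARY[classify(response, gold_answer)]


for reply in ("Paris is the capital.", "I don't know.", "The capital is Lyon."):
    print(reply, ternary_reward(reply, "Paris"), binary_reward(reply, "Paris"))
```

Under the binary scheme an abstention and a wrong answer are indistinguishable, so the policy has no incentive to abstain; the ternary scheme separates them.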

Title: Assessing Algorithmic Bias in Language-Based Depression Detection: A Comparison of DNN and LLM Approaches

Authors: Obed Junias, Prajakta Kini, Theodora Chaspari
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.25795
Pdf URL: https://arxiv.org/pdf/2509.25795
Copy Paste: [[2509.25795]] Assessing Algorithmic Bias in Language-Based Depression Detection: A Comparison of DNN and LLM Approaches(https://arxiv.org/abs/2509.25795)
Keywords: language model, llm, prompt
Abstract: This paper investigates algorithmic bias in language-based models for automated depression detection, focusing on socio-demographic disparities related to gender and race/ethnicity. Models trained using deep neural networks (DNN) based embeddings are compared to few-shot learning approaches with large language models (LLMs), evaluating both performance and fairness on clinical interview transcripts from the Distress Analysis Interview Corpus/Wizard-of-Oz (DAIC-WOZ). To mitigate bias, fairness-aware loss functions are applied to DNN-based models, while in-context learning with varied prompt framing and shot counts is explored for LLMs. Results indicate that LLMs outperform DNN-based models in depression classification, particularly for underrepresented groups such as Hispanic participants. LLMs also exhibit reduced gender bias compared to DNN-based embeddings, though racial disparities persist. Among fairness-aware techniques for mitigating bias in DNN-based embeddings, the worst-group loss, which is designed to minimize loss for the worst-performing demographic group, achieves a better balance between performance and fairness. In contrast, the fairness-regularized loss minimizes loss across all groups but performs less effectively. In LLMs, guided prompting with ethical framing helps mitigate gender bias in the 1-shot setting. However, increasing the number of shots does not lead to further reductions in disparities. For race/ethnicity, neither prompting strategy nor increasing $N$ in $N$-shot learning effectively reduces disparities.
摘要：本文研究了基于语言的自动抑郁症检测模型中的算法偏见，重点是与性别和种族/种族相关的社会人口统计学差异。将使用基于深神经网络（DNN）嵌入的训练的模型与具有大语言模型（LLMS）的少数学习方法进行了比较，从而评估了来自遇险分析访谈语料库/向导式OZ（DAIC-WOZ）的临床访谈成绩单的性能和公平性。为了减轻偏见，将公平感知的损失功能应用于基于DNN的模型，而在LLMS中探索了具有多种及时的迅速框架和射击计数的内在学习。结果表明，LLMS在抑郁症分类中的表现优于基于DNN的模型，特别是对于像西班牙裔参与者等代表性不足的群体。与基于DNN的嵌入相比，LLM的性别偏见也降低了，尽管种族差异持续存在。在减轻基于DNN的嵌入的偏见的公平感知技术中，最严重的群体损失旨在最大程度地减少表现最差的人群群体的损失，可以在绩效和公平性之间取得更好的平衡。相比之下，公平性验证的损失可最大程度地减少所有组的损失，但性能降低了。在LLMS中，通过道德框架提示有助于减轻1次环境中的性别偏见。但是，增加镜头的数量不会导致差异进一步降低。对于种族/种族，既不提示策略，也不会增加$ n $ n $ shot学习的$ n $有效地降低差异。

Title: RoBiologyDataChoiceQA: A Romanian Dataset for improving Biology understanding of Large Language Models

Authors: Dragos-Dumitru Ghinea, Adela-Nicoleta Corbeanu, Adrian-Marius Dumitran
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2509.25813
Pdf URL: https://arxiv.org/pdf/2509.25813
Copy Paste: [[2509.25813]] RoBiologyDataChoiceQA: A Romanian Dataset for improving Biology understanding of Large Language Models(https://arxiv.org/abs/2509.25813)
Keywords: language model, llm, prompt
Abstract: In recent years, large language models (LLMs) have demonstrated significant potential across various natural language processing (NLP) tasks. However, their performance in domain-specific applications and non-English languages remains less explored. This study introduces a novel Romanian-language dataset for multiple-choice biology questions, carefully curated to assess LLM comprehension and reasoning capabilities in scientific contexts. Containing approximately 14,000 questions, the dataset provides a comprehensive resource for evaluating and improving LLM performance in biology. We benchmark several popular LLMs, analyzing their accuracy, reasoning patterns, and ability to understand domain-specific terminology and linguistic nuances. Additionally, we perform comprehensive experiments to evaluate the impact of prompt engineering, fine-tuning, and other optimization techniques on model performance. Our findings highlight both the strengths and limitations of current LLMs in handling specialized knowledge tasks in low-resource languages, offering valuable insights for future research and development.
摘要：近年来，大型语言模型（LLMS）在各种自然语言处理（NLP）任务中表现出巨大的潜力。但是，它们在特定领域的应用程序和非英语语言中的性能仍然不那么探索。这项研究介绍了一个新颖的罗马尼亚语言数据集，以用于多项选择生物学问题，并精心策划，以评估科学环境中的LLM理解和推理能力。数据集包含大约14,000个问题，为评估和改善生物学的LLM性能提供了全面的资源。我们基准了几个流行的LLM，分析了它们的准确性，推理模式以及了解域特异性术语和语言细微差别的能力。此外，我们执行全面的实验，以评估及时工程，微调和其他优化技术对模型性能的影响。我们的发现突出了当前LLM在处理低资源语言专业知识任务中的优势和局限性，为未来的研究和发展提供了宝贵的见解。

Title: Personalized Scientific Figure Caption Generation: An Empirical Study on Author-Specific Writing Style Transfer

Authors: Jaeyoung Kim, Jongho Lee, Hongjun Choi, Sion Jang
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2509.25817
Pdf URL: https://arxiv.org/pdf/2509.25817
Copy Paste: [[2509.25817]] Personalized Scientific Figure Caption Generation: An Empirical Study on Author-Specific Writing Style Transfer(https://arxiv.org/abs/2509.25817)
Keywords: language model
Abstract: We study personalized figure caption generation using author profile data from scientific papers. Our experiments demonstrate that rich author profile data, combined with relevant metadata, can significantly improve the personalization performance of multimodal large language models. However, we also reveal a fundamental trade-off between matching author style and maintaining caption quality. Our findings offer valuable insights and future directions for developing practical caption automation systems that balance both objectives. This work was conducted as part of the 3rd SciCap challenge.
摘要：我们使用来自科学论文的作者资料数据来研究个性化的图形字幕生成。我们的实验表明，丰富的作者概况数据以及相关的元数据可以显着提高多模式大语言模型的个性化表现。但是，我们还揭示了匹配的作者风格和维护字幕质量之间的基本权衡。我们的发现为开发实用标题自动化系统提供了有价值的见解和未来的方向，以平衡这两个目标。这项工作是作为第三次SCICAP挑战的一部分进行的。

Title: Believing without Seeing: Quality Scores for Contextualizing Vision-Language Model Explanations

Authors: Keyu He, Tejas Srinivasan, Brihi Joshi, Xiang Ren, Jesse Thomason, Swabha Swayamdipta
Subjects: cs.CL, cs.HC
Abstract URL: https://arxiv.org/abs/2509.25844
Pdf URL: https://arxiv.org/pdf/2509.25844
Copy Paste: [[2509.25844]] Believing without Seeing: Quality Scores for Contextualizing Vision-Language Model Explanations(https://arxiv.org/abs/2509.25844)
Keywords: language model
Abstract: When people query Vision-Language Models (VLMs) but cannot see the accompanying visual context (e.g. for blind and low-vision users), augmenting VLM predictions with natural language explanations can signal which model predictions are reliable. However, prior work has found that explanations can easily convince users that inaccurate VLM predictions are correct. To remedy undesirable overreliance on VLM predictions, we propose evaluating two complementary qualities of VLM-generated explanations via two quality scoring functions. We propose Visual Fidelity, which captures how faithful an explanation is to the visual context, and Contrastiveness, which captures how well the explanation identifies visual details that distinguish the model's prediction from plausible alternatives. On the A-OKVQA and VizWiz tasks, these quality scoring functions are better calibrated with model correctness than existing explanation qualities. We conduct a user study in which participants have to decide whether a VLM prediction is accurate without viewing its visual context. We observe that showing our quality scores alongside VLM explanations improves participants' accuracy at predicting VLM correctness by 11.1%, including a 15.4% reduction in the rate of falsely believing incorrect predictions. These findings highlight the utility of explanation quality scores in fostering appropriate reliance on VLM predictions.
摘要：当人们查询视觉语言模型（VLM）但看不到随附的视觉上下文（例如，对于盲人和低视频用户）时，使用自然语言解释来增强VLM预测可以表明哪种模型预测是可靠的。但是，先前的工作发现，解释可以轻松说服用户不准确的VLM预测是正确的。为了纠正对VLM预测的不良过度依赖，我们建议通过两个质量评分功能评估VLM生成的解释的两种互补质量。我们提出了视觉保真度，它捕获了对视觉上下文的解释和对比度的忠诚程度，从而捕捉了解释对区分模型的预测与合理替代方案的视觉细节的很好。在A-OKVQA和VIZWIZ任务上，这些质量评分功能通过模型正确性比现有的解释质量更好。我们进行了一项用户研究，其中参与者必须在不查看其视觉上下文的情况下决定VLM预测是否准确。我们观察到，与VLM解释一起显示我们的质量得分可以提高参与者在VLM正确性方面的准确性11.1％，其中包括错误地相信错误的预测率降低了15.4％。这些发现突出了解释质量得分在促进VLM预测的适当依赖方面的实用性。

Title: ReFACT: A Benchmark for Scientific Confabulation Detection with Positional Error Annotations

Authors: Yindong Wang, Martin Preiß, Margarita Bugueño, Jan Vincent Hoffbauer, Abdullatif Ghajar, Tolga Buz, Gerard de Melo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.25868
Pdf URL: https://arxiv.org/pdf/2509.25868
Copy Paste: [[2509.25868]] ReFACT: A Benchmark for Scientific Confabulation Detection with Positional Error Annotations(https://arxiv.org/abs/2509.25868)
Keywords: language model, gpt, llm
Abstract: Large Language Models (LLMs) frequently confabulate scientific facts, severely undermining their trustworthiness. Addressing this challenge requires benchmarks that go beyond binary factuality and enable fine-grained evaluation. We introduce ReFACT (Reddit False And Correct Texts), a benchmark of 1,001 expert-annotated question-answer pairs spanning diverse scientific domains for the detection of scientific confabulation. Each instance includes both a scientifically correct answer and a non-factual counterpart annotated with precise error spans and error types. ReFACT enables multi-stage evaluation: (1) confabulation detection, (2) fine-grained error localization, and (3) correction. We benchmark 9 state-of-the-art LLMs, revealing limited performance (~50% accuracy). Even top models such as GPT-4o fail to distinguish factual from confabulated scientific answers, raising concerns about the reliability of LLM-as-judge evaluation paradigms. Our findings highlight the need for fine-grained, human-validated benchmarks to detect and correct scientific confabulation in domain-specific contexts. The dataset is released on GitHub (this https URL).
摘要：大型语言模型（LLM）经常夸大科学事实，严重破坏了他们的可信度。应对这一挑战需要超越二进制事实的基准并实现精细的评估。我们介绍了\ textbf {redact}（\ textit {reddit false和rectract texts}），这是1,001个专家注销的问题的基准 - 跨越多种科学领域的跨越科学混合域的跨成对。每个实例都包含一个科学正确的答案，也包括一个与\ textBf {精确误差跨度和错误类型}注释的非事实对应的。 Refact启用多阶段评估：（1）解释检测，（2）细粒误差定位和（3）校正。我们基于9个最先进的LLM，揭示了有限的性能（$ \ sim $ 50 \％精度）。即使是GPT-4O之类的顶级模型也无法将事实与综合的科学答案区分开，也引起了人们对\ textit {llm-as-as-gudge}评估范式的可靠性的担忧。我们的发现凸显了需要在特定环境中检测和纠正科学串意的细粒度，人体验证的基准测试的必要性。数据集在\ href {此https url} {github} \ footNote {我们提供数据集上的数据集上发布：此https url}。

Title: RoleConflictBench: A Benchmark of Role Conflict Scenarios for Evaluating LLMs' Contextual Sensitivity

Authors: Jisu Shin, Hoyun Song, Juhyun Oh, Changgeon Ko, Eunsu Kim, Chani Jung, Alice Oh
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2509.25897
Pdf URL: https://arxiv.org/pdf/2509.25897
Copy Paste: [[2509.25897]] RoleConflictBench: A Benchmark of Role Conflict Scenarios for Evaluating LLMs' Contextual Sensitivity(https://arxiv.org/abs/2509.25897)
Keywords: language model, llm
Abstract: Humans often encounter role conflicts -- social dilemmas where the expectations of multiple roles clash and cannot be simultaneously fulfilled. As large language models (LLMs) become increasingly influential in human decision-making, understanding how they behave in complex social situations is essential. While previous research has evaluated LLMs' social abilities in contexts with predefined correct answers, role conflicts represent inherently ambiguous social dilemmas that require contextual sensitivity: the ability to recognize and appropriately weigh situational cues that can fundamentally alter decision priorities. To address this gap, we introduce RoleConflictBench, a novel benchmark designed to evaluate LLMs' contextual sensitivity in complex social dilemmas. Our benchmark employs a three-stage pipeline to generate over 13K realistic role conflict scenarios across 65 roles, systematically varying their associated expectations (i.e., their responsibilities and obligations) and situational urgency levels. By analyzing model choices across 10 different LLMs, we find that while LLMs show some capacity to respond to these contextual cues, this sensitivity is insufficient. Instead, their decisions are predominantly governed by a powerful, inherent bias related to social roles rather than situational information. Our analysis quantifies these biases, revealing a dominant preference for roles within the Family and Occupation domains, as well as a clear prioritization of male roles and Abrahamic religions across most evaluatee models.
摘要：人类经常会遇到角色冲突 - 社会困境，在这种困境中，对多个角色冲突的期望，无法同时实现。随着大型语言模型（LLMS）在人类决策中越来越有影响力，因此必须了解它们在复杂的社会情况下的行为至关重要。尽管先前的研究已经在具有预定义答案的上下文中评估了LLMS的社会能力，但角色冲突代表了需要上下文敏感性的固有模棱两可的社会困境：识别和适当称重的情境线索的能力可以从根本上改变决策优先事项。为了解决这一差距，我们介绍了RoleconflictBench，这是一种新颖的基准，旨在评估LLMS在复杂的社会困境中的上下文敏感性。我们的基准采用三阶段的管道来产生超过13K现实的角色冲突情景，跨越65个角色，从系统地改变了其相关的期望（即他们的责任和义务）和情境紧急水平。通过分析10种不同LLM的模型选择，我们发现虽然LLM显示出对这些上下文提示做出响应的能力，但这种敏感性不足。取而代之的是，他们的决定主要由与社会角色而不是情境信息有关的强大，固有的偏见控制。我们的分析量化了这些偏见，揭示了对家庭和职业领域中角色的主要偏好，以及大多数评估模型中男性角色和亚伯拉罕宗教的明确优先级。

Title: PerQ: Efficient Evaluation of Multilingual Text Personalization Quality

Authors: Dominik Macko, Andrew Pulver
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.25903
Pdf URL: https://arxiv.org/pdf/2509.25903
Copy Paste: [[2509.25903]] PerQ: Efficient Evaluation of Multilingual Text Personalization Quality(https://arxiv.org/abs/2509.25903)
Keywords: language model
Abstract: Since no metrics are available to evaluate specific aspects of a text, such as its personalization quality, researchers often rely solely on large language models to meta-evaluate such texts. Due to the internal biases of individual language models, it is recommended to use several of them for combined evaluation, which directly increases the cost of such meta-evaluation. In this paper, a computationally efficient method for evaluating the personalization quality of a given text (generated by a language model) is introduced, called PerQ. A case study comparing the generation capabilities of large and small language models shows the usability of the proposed metric in research, effectively reducing the waste of resources.
摘要：由于无法评估文本的特定方面（例如其个性化质量），因此研究人员通常仅依靠大型语言模型来元评估此类文本。由于单个语言模型的内部偏见，建议将其中的多个用于合并评估，这直接增加了这种元评估的成本。在本文中，引入了一种用于评估给定文本个性化质量（由语言模型生成的）的计算有效方法，称为perq。大型和小语言模型的发电能力比较的案例研究表明，拟议的指标在研究中的可用性有效减少了资源的浪费。

Title: Mem-α: Learning Memory Construction via Reinforcement Learning

Authors: Yu Wang, Ryuichi Takanobu, Zhiqi Liang, Yuzhen Mao, Yuanzhe Hu, Julian McAuley, Xiaojian Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.25911
Pdf URL: https://arxiv.org/pdf/2509.25911
Copy Paste: [[2509.25911]] Mem-α: Learning Memory Construction via Reinforcement Learning(https://arxiv.org/abs/2509.25911)
Keywords: language model, llm, agent
Abstract: Large language model (LLM) agents are constrained by limited context windows, necessitating external memory systems for long-term information understanding. Current memory-augmented agents typically depend on pre-defined instructions and tools for memory updates. However, language models may lack the ability to determine which information to store, how to structure it, and when to update it, especially as memory systems become more complex. This results in suboptimal memory construction and information loss. To this end, we propose Mem-alpha, a reinforcement learning framework that trains agents to effectively manage complex memory systems through interaction and feedback. We also construct a specialized training dataset spanning diverse multi-turn interaction patterns paired with comprehensive evaluation questions designed to teach effective memory management. During training, agents process sequential information chunks, learn to extract and store relevant content, then update the memory system. The reward signal derives from downstream question-answering accuracy over the full interaction history, directly optimizing for memory construction. To illustrate the effectiveness of our training framework, we design a memory architecture comprising core, episodic, and semantic components, equipped with multiple tools for memory operations. Empirical evaluation demonstrates that Mem-alpha achieves significant improvements over existing memory-augmented agent baselines. Despite being trained exclusively on instances with a maximum length of 30k tokens, our agents exhibit remarkable generalization to sequences exceeding 400k tokens, over 13x the training length, highlighting the robustness of Mem-alpha.
摘要：大型语言模型（LLM）代理受到有限上下文窗口的约束，需要外部存储系统以进行长期信息理解。当前的内存调节代理通常取决于预定义的说明和用于内存更新的工具。但是，语言模型可能缺乏确定要存储哪些信息，如何构造信息以及何时更新信息的能力，尤其是随着内存系统变得更加复杂。这导致了次优的内存构建和信息丢失。为此，我们提出了Mem-Alpha，这是一种增强学习框架，该框架训练代理通过互动和反馈有效地管理复杂的内存系统。我们还构建了一个专门的培训数据集，该数据集涵盖了多种多样的互动模式，并配对旨在教授有效记忆管理的全面评估问题。在培训期间，代理处理顺序信息块，学习提取和存储相关内容，然后更新内存系统。奖励信号源自整个交互历史上下游提问的准确性，直接为内存构建优化。为了说明训练框架的有效性，我们设计了一个包括核心，情节和语义组件的内存体系结构，配备了多种用于内存操作的工具。经验评估表明，Mem-Alpha比现有的内存增强代理基线可取得重大改善。尽管接受了最大长度为30k代币的实例进行训练，但我们的代理人对超过400K令牌的序列表现出显着的概括，超过了训练长度的13倍，突出了MEM-Alpha的稳健性。

Title: Understanding the Mixture-of-Experts with Nadaraya-Watson Kernel

Authors: Chuanyang Zheng, Jiankai Sun, Yihang Gao, Enze Xie, Yuehao Wang, Peihao Wang, Ting Xu, Matthew Chang, Liliang Ren, Jingyao Li, Jing Xiong, Kashif Rasul, Mac Schwager, Anderson Schneider, Zhangyang Wang, Yuriy Nevmyvaka
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.25913
Pdf URL: https://arxiv.org/pdf/2509.25913
Copy Paste: [[2509.25913]] Understanding the Mixture-of-Experts with Nadaraya-Watson Kernel(https://arxiv.org/abs/2509.25913)
Keywords: language model, llm
Abstract: Mixture-of-Experts (MoE) has become a cornerstone in recent state-of-the-art large language models (LLMs). Traditionally, MoE relies on $\mathrm{Softmax}$ as the router score function to aggregate expert output, a design choice that has persisted from the earliest MoE models to modern LLMs, and is now widely regarded as standard practice. However, the necessity of using $\mathrm{Softmax}$ to project router weights into a probability simplex remains an unchallenged assumption rather than a principled design choice. In this work, we first revisit the classical Nadaraya-Watson regression and observe that MoE shares the same mathematical formulation as Nadaraya-Watson regression. Furthermore, we show that both feed-forward neural networks (FFNs) and MoE can be interpreted as special cases of Nadaraya-Watson regression, where the kernel function corresponds to the input neurons of the output layer. Motivated by these insights, we propose the zero-additional-cost Kernel Inspired Router with Normalization (KERN), an FFN-style router function, as an alternative to $\mathrm{Softmax}$. We demonstrate that this router generalizes both $\mathrm{Sigmoid}$- and $\mathrm{Softmax}$-based routers. Based on empirical observations and established practices in FFN implementation, we recommend the use of $\mathrm{ReLU}$ activation and $\ell_2$-normalization in the $\mathrm{KERN}$ router function. Comprehensive experiments in MoE and LLM validate the effectiveness of the proposed FFN-style router function KERN.
摘要：Experts（MOE）的混合物已成为最近最先进的大语言模型（LLM）的基石。传统上，Moe依靠$ \ mathrm {softmax} $作为路由器得分功能来汇总专家输出，这是一种设计的选择，从最早的MOE模型到现代LLM，现在被广泛认为是标准实践。但是，使用$ \ mathrm {softmax} $将权重投影到概率的必要性仍然是一个不受挑战的假设，而不是原则上的设计选择。在这项工作中，我们首先重新审视经典的Nadaraya-Watson回归，并观察到Moe具有与Nadaraya-Watson回归相同的数学表述。此外，我们表明，前馈神经网络（FFN）和MOE都可以解释为Nadaraya-Watson回归的特殊情况，其中内核函数对应于输出层的输入神经元。在这些见解的推动下，我们提出了\ textbf {Zero-Additional-Cost}内核灵感的路由器，具有标准化（Kern），一种FFN风格的路由器函数，作为$ \ Mathrm {softmax} $的替代方法。我们证明，该路由器概括了$ \ mathrm {sigmoid} $ - 和$ \ mathrm {softmax} $ - 基于路由器。 \ textbf {基于经验观察和FFN实施中的既定实践，我们建议使用$ \ Mathrm {relu} $激活和$ \ ell_2 $ - normalization in $ \ mathrm {kern} $ router} $ router函数。
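
The contrast between a Softmax router and a ReLU plus $\ell_2$-normalized router can be sketched in a few lines. This is only an illustration under assumptions: the abstract names the activation and the normalization but not the exact KERN formulation, so the function names (`softmax_router`, `relu_l2_router`, `mix_experts`), the `eps` constant, and the choice to normalize the activated scores directly are hypothetical and not taken from the paper.

```python
import math


def softmax_router(logits):
    """Conventional router: scores lie on the probability simplex (sum to 1)."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def relu_l2_router(logits, eps=1e-12):
    """Assumed FFN-style router: ReLU activation, then l2-normalization.

    Scores are non-negative with unit l2 norm; they need not sum to 1,
    and experts with non-positive logits receive exactly zero weight.
    """
    activated = [max(0.0, x) for x in logits]
    norm = math.sqrt(sum(a * a for a in activated))
    return [a / (norm + eps) for a in activated]


def mix_experts(weights, expert_outputs):
    """Weighted sum of expert output vectors, as in a standard MoE layer."""
    dim = len(expert_outputs[0])
    return [
        sum(w * out[i] for w, out in zip(weights, expert_outputs))
        for i in range(dim)
    ]


if __name__ == "__main__":
    logits = [3.0, -1.0, 4.0]
    experts = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(softmax_router(logits))
    print(relu_l2_router(logits))
    print(mix_experts(relu_l2_router(logits), experts))
```

In this sketch the ReLU router zeroes out the second expert entirely, whereas Softmax always assigns every expert a strictly positive weight.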

Title: Bringing Emerging Architectures to Sequence Labeling in NLP

Authors: Ana Ezquerro, Carlos Gómez-Rodríguez, David Vilares
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.25918
Pdf URL: https://arxiv.org/pdf/2509.25918
Copy Paste: [[2509.25918]] Bringing Emerging Architectures to Sequence Labeling in NLP(https://arxiv.org/abs/2509.25918)
Keywords: language model
Abstract: Pretrained Transformer encoders are the dominant approach to sequence labeling. While some alternative architectures, such as xLSTMs, structured state-space models, diffusion models, and adversarial learning, have shown promise in language modeling, few have been applied to sequence labeling, and mostly on flat or simplified tasks. We study how these architectures adapt across tagging tasks that vary in structural complexity, label space, and token dependencies, with evaluation spanning multiple languages. We find that the strong performance previously observed in simpler settings does not always generalize well across languages or datasets, nor does it extend to more complex structured tasks.
摘要：预处理的变压器编码器是序列标记的主要方法。尽管某些替代体系结构，例如XLSTM，结构化状态空间模型，扩散模型和对抗性学习在语言建模中显示出希望，但很少应用于序列标签，并且主要用于平坦或简化的任务。我们研究这些体系结构如何在结构复杂性，标签空间和代币依赖性方面的标记任务中适应，并评估多种语言。我们发现，以前在简单的设置中观察到的强大性能并不总是跨语言或数据集良好地概括，也不会扩展到更复杂的结构化任务。

Title: Reliability Crisis of Reference-free Metrics for Grammatical Error Correction

Authors: Takumi Goto, Yusuke Sakai, Taro Watanabe
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.25961
Pdf URL: https://arxiv.org/pdf/2509.25961
Copy Paste: [[2509.25961]] Reliability Crisis of Reference-free Metrics for Grammatical Error Correction(https://arxiv.org/abs/2509.25961)
Keywords: llm
Abstract: Reference-free evaluation metrics for grammatical error correction (GEC) have achieved high correlation with human judgments. However, these metrics are not designed to evaluate adversarial systems that aim to obtain unjustifiably high scores. The existence of such systems undermines the reliability of automatic evaluation, as it can mislead users in selecting appropriate GEC systems. In this study, we propose adversarial attack strategies for four reference-free metrics: SOME, Scribendi, IMPARA, and LLM-based metrics, and demonstrate that our adversarial systems outperform the current state-of-the-art. These findings highlight the need for more robust evaluation methods.
摘要：语法误差校正（GEC）的无参考评估指标（GEC）与人类判断的高度相关性很高。但是，这些指标并非旨在评估旨在获得不合理的高分的对抗系统。此类系统的存在破坏了自动评估的可靠性，因为它可能会误导用户选择适当的GEC系统。在这项研究中，我们提出了针对四个无参考指标的对抗性攻击策略：一些，Scribendi，Impara和LLM的指标，并证明我们的对手系统的表现优于当前的最新目前。这些发现突出了需要更强大的评估方法。

Title: RAGferee: Building Contextual Reward Models for Retrieval-Augmented Generation

Authors: Andrei C. Coman, Ionut-Teodor Sorodoc, Leonardo F. R. Ribeiro, Bill Byrne, James Henderson, Adrià de Gispert
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.26011
Pdf URL: https://arxiv.org/pdf/2509.26011
Copy Paste: [[2509.26011]] RAGferee: Building Contextual Reward Models for Retrieval-Augmented Generation(https://arxiv.org/abs/2509.26011)
Keywords: retrieval augmented generation, retrieval-augmented generation
Abstract: Existing Reward Models (RMs), typically trained on general preference data, struggle in Retrieval Augmented Generation (RAG) settings, which require judging responses for faithfulness to retrieved context, relevance to the user query, appropriate refusals when context is insufficient, completeness and conciseness of information. To address the lack of publicly available RAG-centric preference datasets and specialised RMs, we introduce RAGferee, a methodology that repurposes question-answering (QA) datasets into preference pairs that prioritise groundedness over stylistic features, enabling the training of contextual RMs better suited to judging RAG responses. Using RAGferee, we curate a small preference dataset of 4K samples and fine-tune RMs ranging from 7B to 24B parameters. Our RAG-centric RMs achieve state-of-the-art performance on ContextualJudgeBench, surpassing existing 70B+ RMs trained on much larger (up to 2.4M samples) general corpora, with an absolute improvement of +15.5%.
摘要：现有的奖励模型（RMS）通常接受了一般偏好数据的培训，在检索增强生成（RAG）设置中的挣扎，这些设置需要判断回答以忠诚以检索上下文，与用户查询相关，当上下文不足，完整性，完整性和简洁信息时，适当拒绝。为了解决缺乏公开可用的以RAG为中心的偏好数据集和专门的RMS，我们介绍了Ragferee，这种方法是将问题解答（QA）数据集重新推出的方法，以优先考虑对风格上的扎根对，从而使对文体功能的接地优先级，从而使上下文RMS更适合审查RAG RAG响应的培训。使用Ragferee，我们策划了一个4K样本的小型偏好数据集和范围从7B到24B参数的微调RMS。我们以RAG为中心的RMS在上下文kindughedbench上实现了最先进的性能，超过了经过更大（多达240万个样本）一般语料库的现有的70B + RMS，绝对提高了 + 15.5％。

Title: RE$^2$: Improving Chinese Grammatical Error Correction via Retrieving Appropriate Examples with Explanation

Authors: Baoxin Wang, Yumeng Luo, Yixuan Wang, Dayong Wu, Wanxiang Che, Shijin Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.26038
Pdf URL: https://arxiv.org/pdf/2509.26038
Copy Paste: [[2509.26038]] RE$^2$: Improving Chinese Grammatical Error Correction via Retrieving Appropriate Examples with Explanation(https://arxiv.org/abs/2509.26038)
Keywords: language model, llm
Abstract: The primary objective of Chinese grammatical error correction (CGEC) is to detect and correct errors in Chinese sentences. Recent research shows that large language models (LLMs) have been applied to CGEC with significant results. For LLMs, selecting appropriate reference examples can help improve their performance. However, existing methods predominantly rely on text similarity for example retrieval, a strategy that frequently mismatches actual error patterns and retrieves lexically similar yet grammatically irrelevant sentences. To address this problem, we propose a method named RE$^2$, which retrieves appropriate examples with explanations of grammatical errors. Instead of using text similarity of the input sentence, we use explanations of grammatical errors to select reference examples, which are used by LLMs to improve the performance of CGEC. We conduct experiments on two CGEC datasets and create a high-quality grammatical error explanation (GEE) dataset, which is not only used in our research but also serves as a valuable resource for future studies in both CGEC and GEE. The experimental results on the two datasets indicate that our proposed method effectively improves the performance of CGEC.
摘要：中国语法误差校正（CGEC）的主要目标是检测和纠正中国句子中的错误。最近的研究表明，大型语言模型（LLMS）已应用于CGEC，结果显着。对于LLM，选择适当的参考示例可以帮助提高其性能。但是，现有方法主要依赖文本相似性，例如检索，这种策略经常与实际的错误模式不匹配并检索词汇相似但语法上无关的句子。为了解决这个问题，我们提出了一种名为re $^2 $的方法，该方法通过语法错误的解释来检索适当的示例。我们不使用输入句子的文本相似性，而是使用语法错误的解释来选择参考示例，而LLM将其用于提高CGEC的性能。我们对两个CGEC数据集进行了实验，并创建了高质量的语法错误解释（GEE）数据集，该数据集不仅在我们的研究中使用，而且还可以作为CGEC和GEE中未来研究的宝贵资源。两个数据集的实验结果表明，我们提出的方法有效地改善了CGEC的性能。

Title: Unspoken Hints: Accuracy Without Acknowledgement in LLM Reasoning

Authors: Arash Marioriyad, Shaygan Adim, Nima Alighardashi, Mahdieh Soleymani Baghshah, Mohammad Hossein Rohban
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.26041
Pdf URL: https://arxiv.org/pdf/2509.26041
Copy Paste: [[2509.26041]] Unspoken Hints: Accuracy Without Acknowledgement in LLM Reasoning(https://arxiv.org/abs/2509.26041)
Keywords: language model, gpt, llm, prompt, chain-of-thought
Abstract: Large language models (LLMs) increasingly rely on chain-of-thought (CoT) prompting to solve mathematical and logical reasoning tasks. Yet, a central question remains: to what extent are these generated rationales faithful to the underlying computations, rather than post-hoc narratives shaped by hints that function as answer shortcuts embedded in the prompt? Following prior work on hinted vs. unhinted prompting, we present a systematic study of CoT faithfulness under controlled hint manipulations. Our experimental design spans four datasets (AIME, GSM-Hard, MATH-500, UniADILR), two state-of-the-art models (GPT-4o and Gemini-2-Flash), and a structured set of hint conditions varying in correctness (correct and incorrect), presentation style (sycophancy and data leak), and complexity (raw answers, two-operator expressions, four-operator expressions). We evaluate both task accuracy and whether hints are explicitly acknowledged in the reasoning. Our results reveal three key findings. First, correct hints substantially improve accuracy, especially on harder benchmarks and logical reasoning, while incorrect hints sharply reduce accuracy in tasks with lower baseline competence. Second, acknowledgement of hints is highly uneven: equation-based hints are frequently referenced, whereas raw hints are often adopted silently, indicating that more complex hints push models toward verbalizing their reliance in the reasoning process. Third, presentation style matters: sycophancy prompts encourage overt acknowledgement, while leak-style prompts increase accuracy but promote hidden reliance. This may reflect RLHF-related effects, as sycophancy exploits the human-pleasing side and data leak triggers the self-censoring side. Together, these results demonstrate that LLM reasoning is systematically shaped by shortcuts in ways that obscure faithfulness.
摘要：大型语言模型（LLMS）越来越依赖于思想链（COT）促使促使数学和逻辑推理任务。然而，一个中心问题仍然存在：这些产生的理由在多大程度上是对基础计算的\ emph {忠实的}，而不是由提示作为答案快捷方式嵌入提示中的答案快捷方式所塑造的事后叙述？在提示与\ \毫无疑问的提示有关的事先工作之后，我们提出了对受控提示操纵下的COT忠诚的系统研究。我们的实验设计涵盖了四个数据集（AIME，GSM-HARD，MATH-500，UNIADILR），两个最先进的模型（GPT-4O和GEMINI-2-FLASH），以及一套结构化的提示条件，在正确性（正确和不正确）的情况下变化（表达式和数据泄漏）和复杂性（RAWERSITION），以及两个-RESSITION，两个-RESSIRETION，两个-RENSERITION，两个 - 辅助器。我们评估了任务准确性以及是否在推理中明确确认提示。我们的结果揭示了三个关键发现。首先，正确的提示大大提高了准确性，尤其是在较难的基准和逻辑推理上，而错误提示则急剧降低了基线能力较低的任务的准确性。其次，对提示的认可是高度不平衡的：经常引用基于方程式的提示，而原始提示通常会静静地采用，这表明更复杂的提示将模型推向了在推理过程中的依赖。第三，演示风格很重要：Sycophancy提示鼓励公开承认，而泄漏式提示提示提高了准确性，但可以促进隐藏的依赖。这可能反映了与RLHF相关的效果，因为粘粘剂利用了人类令人愉悦的一面和数据泄漏会触发自我审查的一面。这些结果共同表明，LLM推理以掩盖忠诚的方式是由捷径系统塑造的。

Title: RE-Searcher: Robust Agentic Search with Goal-oriented Planning and Self-reflection

Authors: Daocheng Fu, Jianbiao Mei, Licheng Wen, Xuemeng Yang, Cheng Yang, Rong Wu, Tao Hu, Siqi Li, Yufan Shen, Xinyu Cai, Pinlong Cai, Botian Shi, Yong Liu, Yu Qiao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.26048
Pdf URL: https://arxiv.org/pdf/2509.26048
Copy Paste: [[2509.26048]] RE-Searcher: Robust Agentic Search with Goal-oriented Planning and Self-reflection(https://arxiv.org/abs/2509.26048)
Keywords: language model, llm, hallucination, agent
Abstract: Large language models (LLMs) excel at knowledge-intensive question answering and reasoning, yet their real-world deployment remains constrained by knowledge cutoff, hallucination, and limited interaction modalities. Augmenting LLMs with external search tools helps alleviate these issues, but it also exposes agents to a complex search environment in which small, plausible variations in query formulation can steer reasoning into unproductive trajectories and amplify errors. We present a systematic analysis that quantifies how environmental complexity induces fragile search behaviors and, in turn, degrades overall performance. To address this challenge, we propose a simple yet effective approach to instantiate a search agent, RE-Searcher. During search, RE-Searcher explicitly articulates a concrete search goal and subsequently reflects on whether the retrieved evidence satisfies that goal. This combination of goal-oriented planning and self-reflection enables RE-Searcher to resist spurious cues in complex search environments and perform robust search. Extensive experiments show that our method improves search accuracy and achieves state-of-the-art results. Perturbation studies further demonstrate substantial resilience to noisy or misleading external signals, mitigating the fragility of the search process. We believe these findings offer practical guidance for integrating LLM-powered agents into more complex interactive environments and enabling more autonomous decision-making.
摘要：大型语言模型（LLMS）在知识密集的问题回答和推理方面表现出色，但是他们的现实部署仍受到知识截止，幻觉和有限的互动方式的限制。使用外部搜索工具增强LLM有助于减轻这些问题，但它也将代理暴露于一个复杂的搜索环境中，在该环境中，查询配方中的微小，合理的变化可以将推理引导到非生产性轨迹中，并放大错误。我们提出了一项系统的分析，该分析量化了环境复杂性如何诱导脆弱的搜索行为，进而降低了整体性能。为了应对这一挑战，我们提出了一种简单而有效的方法来实例化搜索代理，重新搜索者。在搜索过程中，重新搜索者明确阐明了一个具体的搜索目标，并随后反思检索的证据是否满足了该目标。以目标为导向的计划和自我反射的组合使重新搜索者能够抵制复杂的搜索环境中的虚假提示并执行强大的搜索。广泛的实验表明，我们的方法提高了搜索准确性并取得了最先进的结果。扰动研究进一步证明了对嘈杂或误导外部信号的实质性韧性，从而减轻了搜索过程的脆弱性。我们认为，这些发现为将LLM驱动的代理集成到更复杂的交互式环境并实现更自主的决策提供了实用的指导。

Title: DyFlow: Dynamic Workflow Framework for Agentic Reasoning

Authors: Yanbo Wang, Zixiang Xu, Yue Huang, Xiangqi Wang, Zirui Song, Lang Gao, Chenxi Wang, Xiangru Tang, Yue Zhao, Arman Cohan, Xiangliang Zhang, Xiuying Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.26062
Pdf URL: https://arxiv.org/pdf/2509.26062
Copy Paste: [[2509.26062]] DyFlow: Dynamic Workflow Framework for Agentic Reasoning(https://arxiv.org/abs/2509.26062)
Keywords: language model, llm, agent
Abstract: Agent systems based on large language models (LLMs) have shown great potential in complex reasoning tasks, but building efficient and generalizable workflows remains a major challenge. Most existing approaches rely on manually designed processes, which limits their adaptability across different tasks. While a few methods attempt automated workflow generation, they are often tied to specific datasets or query types and make limited use of intermediate feedback, reducing system robustness and reasoning depth. Moreover, their operations are typically predefined and inflexible. To address these limitations, we propose DyFlow, a dynamic workflow generation framework that adaptively constructs and adjusts reasoning procedures based on task requirements and real-time intermediate feedback, thereby enhancing cross-task generalization. DyFlow consists of two core components: a designer and an executor. The designer decomposes complex problems into a sequence of sub-goals defined by high-level objectives and dynamically plans the next steps based on intermediate outputs and feedback. These plans are then carried out by the executor, which executes each operation using dynamic operators with context-aware parameterization, enabling flexible and semantically grounded reasoning. We systematically evaluate DyFlow across diverse domains, including social reasoning, biomedical tasks, mathematical problem solving, and code generation. Results demonstrate that DyFlow significantly outperforms existing baselines, achieving substantial Pass@k improvements and exhibiting robust generalization across diverse domains. The code is publicly available at this https URL.
摘要：基于大语言模型（LLM）的代理系统在复杂的推理任务中显示出很大的潜力，但是建立有效且可推广的工作流仍然是一个主要挑战。大多数现有的方法都依赖于手动设计的过程，这限制了其在不同任务中的适应性。尽管一些方法尝试自动化工作流，但它们通常与特定的数据集或查询类型相关，并有限地使用中间反馈，减少系统的鲁棒性和推理深度。此外，它们的操作通常是预定义的且不灵活的。为了解决这些局限性，我们提出了Dyflow，这是一个动态的工作流生成框架，该框架可以根据任务要求和实时中间反馈来适应并调整推理过程，从而增强交叉任务的概括。 Dyflow由两个核心组件组成：设计师和执行者。设计师将复杂的问题分解为一系列由高级目标定义的子目标，并根据中间输出和反馈动态计划下一步。然后，这些计划由执行人执行，该计划使用具有上下文感知参数化的动态运算符执行每个操作，从而实现灵活和语义扎根的推理。我们系统地评估跨不同领域的堤防，包括社会推理，生物医学任务，数学问题解决和代码生成。结果表明，堤防明显胜过现有基线，实现了实质性通过@k的改进并在各种领域表现出强大的概括。该代码在此HTTPS URL上公开可用。

Title: The Silent Judge: Unacknowledged Shortcut Bias in LLM-as-a-Judge

Authors: Arash Marioriyad, Mohammad Hossein Rohban, Mahdieh Soleymani Baghshah
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.26072
Pdf URL: https://arxiv.org/pdf/2509.26072
Copy Paste: [[2509.26072]] The Silent Judge: Unacknowledged Shortcut Bias in LLM-as-a-Judge(https://arxiv.org/abs/2509.26072)
Keywords: language model, gpt, llm, prompt
Abstract: Large language models (LLMs) are increasingly deployed as automatic judges to evaluate system outputs in tasks such as summarization, dialogue, and creative writing. A faithful judge should base its verdicts solely on response quality and explicitly acknowledge the factors shaping its decision. We show that current LLM judges fail on both counts by relying on shortcuts introduced in the prompt. Our study uses two evaluation datasets: ELI5, a benchmark for long-form question answering, and LitBench, a recent benchmark for creative writing. Both datasets provide pairwise comparisons, where the evaluator must choose which of two responses is better. From each dataset we construct 100 pairwise judgment tasks and employ two widely used models, GPT-4o and Gemini-2.5-Flash, as evaluators in the role of LLM-as-a-judge. For each pair, we assign superficial cues to the responses, provenance cues indicating source identity (Human, Expert, LLM, or Unknown) and recency cues indicating temporal origin (Old, 1950 vs. New, 2025), while keeping the rest of the prompt fixed. Results reveal consistent verdict shifts: both models exhibit a strong recency bias, systematically favoring new responses over old, as well as a clear provenance hierarchy (Expert > Human > LLM > Unknown). These biases are especially pronounced in GPT-4o and in the more subjective and open-ended LitBench domain. Crucially, cue acknowledgment is rare: justifications almost never reference the injected cues, instead rationalizing decisions in terms of content qualities. These findings demonstrate that current LLM-as-a-judge systems are shortcut-prone and unfaithful, undermining their reliability as evaluators in both research and deployment.
摘要：大型语言模型（LLM）越来越多地作为自动法官部署，以评估诸如汇总，对话和创意写作等任务中的系统输出。忠实的法官应仅基于响应质量的判决，并明确承认塑造其决定的因素。我们表明，当前的LLM法官通过依靠提示中引入的快捷方式而在两项方面失败。我们的研究使用了两个评估数据集：ELI5，一个用于长形问答的基准，以及Litbench，这是最新创意写作的基准。两个数据集都提供成对比较，在其中评估者必须选择两个响应中的哪个更好。从每个数据集中，我们构建了100个成对判断任务，并采用了两个广泛使用的模型，GPT-4O和Gemini-2.5-Flash，作为LLM-AS-A-A-Gudge角色的评估者。对于每一对，我们为响应，表明源身份的出处提示（人，专家，LLM或未知）分配表面提示，并为表示暂时起源的新近度提示（Old，1950 vs. New，2025年），同时保持其余的提示固定。结果揭示了一致的判决转变：这两种模型都表现出强烈的新近度偏见，系统地偏向于旧的反应，以及明确的出处层次结构（专家> Human> Human> llm>未知）。这些偏见在GPT-4O和更主观和开放式的Litbench域中尤为明显。至关重要的是，提示认可是罕见的：理由几乎永远不会引用注入的提示，而是从内容质量方面合理化决策。这些发现表明，当前的LLM-AS-A-A-Gudge系统易于捷径且不忠，破坏了他们作为研究和部署的评估者的可靠性。

Title: Limited Preference Data? Learning Better Reward Model with Latent Space Synthesis

Authors: Leitian Tao, Xuefeng Du, Yixuan Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.26074
Pdf URL: https://arxiv.org/pdf/2509.26074
Copy Paste: [[2509.26074]] Limited Preference Data? Learning Better Reward Model with Latent Space Synthesis(https://arxiv.org/abs/2509.26074)
Keywords: language model, llm
Abstract: Reward modeling, crucial for aligning large language models (LLMs) with human preferences, is often bottlenecked by the high cost of preference data. Existing textual data synthesis methods are computationally expensive. We propose a novel framework LENS for synthesizing preference data directly in the LLM's latent embedding space. Our method employs a Variational Autoencoder (VAE) to learn a structured latent representation of response embeddings. By performing controlled perturbations in this latent space and decoding back to the embedding space, we efficiently generate diverse, semantically consistent synthetic preference pairs, bypassing costly text generation and annotation. We provide theoretical guarantees that our synthesized pairs approximately preserve original preference ordering and improve reward model generalization. Empirically, our latent-space synthesis significantly outperforms text-based augmentation on standard benchmarks, achieving superior results while being 18x faster in generation and using a 16,000x smaller model. Our work offers a scalable and effective alternative for enhancing reward modeling through efficient data augmentation. Code is publicly available at this https URL
摘要：奖励建模，对于将大型语言模型（LLM）与人类偏好保持一致的至关重要，通常是由高偏好数据的高成本所瓶颈。现有的文本数据合成方法在计算上很昂贵。我们提出了一个新型的框架镜头，用于直接在LLM潜在嵌入空间中合成偏好数据。我们的方法采用各种自动编码器（VAE）来学习响应嵌入的结构性潜在表示。通过在此潜在空间中执行受控的扰动并将其解码回到嵌入空间，我们有效地生成了多样化的，语义上一致的合成偏好对，绕开了昂贵的文本生成和注释。我们提供理论保证，我们的合成对近似保留原始偏好顺序并改善奖励模型的概括。从经验上讲，我们的潜在空间合成显着优于基于文本基准的基于文本基准的增强，在生成速度快18倍和使用16,000倍较小的型号的情况下，取得了卓越的结果。我们的工作提供了可扩展有效的替代方法，可通过有效的数据增强来增强奖励建模。代码在此HTTPS URL上公开可用

Title: IMProofBench: Benchmarking AI on Research-Level Mathematical Proof Generation

Authors: Johannes Schmitt, Gergely Bérczi, Jasper Dekoninck, Jeremy Feusi, Tim Gehrunger, Raphael Appenzeller, Jim Bryan, Niklas Canova, Timo de Wolff, Filippo Gaia, Michel van Garrel, Baran Hashemi, David Holmes, Aitor Iribar Lopez, Victor Jaeck, Martina Jørgensen, Steven Kelk, Stefan Kuhlmann, Adam Kurpisz, Chiara Meroni, Ingmar Metzler, Martin Möller, Samuel Muñoz-Echániz, Robert Nowak, Georg Oberdieck, Daniel Platt, Dylan Possamaï, Gabriel Ribeiro, Raúl Sánchez Galán, Zheming Sun, Josef Teichmann, Richard P. Thomas, Charles Vial
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.26076
Pdf URL: https://arxiv.org/pdf/2509.26076
Copy Paste: [[2509.26076]] IMProofBench: Benchmarking AI on Research-Level Mathematical Proof Generation(https://arxiv.org/abs/2509.26076)
Keywords: language model, gpt, llm, agent
Abstract: As the mathematical capabilities of large language models (LLMs) improve, it becomes increasingly important to evaluate their performance on research-level tasks at the frontier of mathematical knowledge. However, existing benchmarks are limited, as they focus solely on final-answer questions or high-school competition problems. To address this gap, we introduce IMProofBench, a private benchmark consisting of 39 peer-reviewed problems developed by expert mathematicians. Each problem requires a detailed proof and is paired with subproblems that have final answers, supporting both an evaluation of mathematical reasoning capabilities by human experts and a large-scale quantitative analysis through automated grading. Furthermore, unlike prior benchmarks, the evaluation setup simulates a realistic research environment: models operate in an agentic framework with tools like web search for literature review and mathematical software such as SageMath. Our results show that current LLMs can succeed at the more accessible research-level questions, but still encounter significant difficulties on more challenging problems. Quantitatively, Grok-4 achieves the highest accuracy of 52% on final-answer subproblems, while GPT-5 obtains the best performance for proof generation, achieving a fully correct solution for 22% of problems. IMProofBench will continue to evolve as a dynamic benchmark in collaboration with the mathematical community, ensuring its relevance for evaluating the next generation of LLMs.
摘要：随着大语言模型（LLM）的数学功能的提高，评估其在数学知识边界的研究级任务上的表现变得越来越重要。但是，现有的基准是有限的，因为它们仅着眼于最终解答问题或高中竞争问题。为了解决这一差距，我们介绍了Improof Bench，这是一个由专家数学家制定的39个同行评审的问题组成的私人基准测试。每个问题都需要一个详细的证明，并与具有最终答案的子问题配对，既支持人类专家对数学推理能力的评估，又支持通过自动分级进行大规模的定量分析。此外，与先前的基准分析不同，评估设置模拟了一个现实的研究环境：模型在代理框架中使用Web搜索文学审查和数学软件（例如Sagemath）的工具运行。我们的结果表明，当前的LLM可以在更容易获得的研究级问题上取得成功，但仍遇到更具挑战性的问题遇到的困难。从数量上讲，Grok-4在最终答案子问题上达到了52％的最高精度，而GPT-5获得了证明生成的最佳性能，为22％的问题提供了完全正确的解决方案。 Indroofbench将继续作为与数学社区合作的动态基准发展，以确保其与评估下一代LLM的相关性。

Title: Reinforced Strategy Optimization for Conversational Recommender Systems via Network-of-Experts

Authors: Xiaoyan Zhao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.26093
Pdf URL: https://arxiv.org/pdf/2509.26093
Copy Paste: [[2509.26093]] Reinforced Strategy Optimization for Conversational Recommender Systems via Network-of-Experts(https://arxiv.org/abs/2509.26093)
Keywords: language model, llm, prompt
Abstract: Conversational Recommender Systems (CRSs) provide personalized recommendations through multi-turn interactions. With the strong reasoning abilities of Large Language Models (LLMs), applying them to CRSs has become promising. Yet, existing methods often lack explicit optimization of interaction strategies, relying instead on unified prompts, which can yield suboptimal outcomes. We propose Reinforced Strategy Optimization (RSO), a hierarchical framework that decomposes response generation into macro-level strategy planning and micro-level adaptation within a network-of-experts. A Planner selects strategies (e.g., recommend, explain, encourage), while an Actor generates responses guided by auxiliary experts for preferences and factual grounding. This disentanglement enables more tractable learning. To address limited multi-turn data, we model strategy learning as reinforcement learning with an LLM-based reward for exploration. Experiments show RSO outperforms state-of-the-art baselines, validating the effectiveness of hierarchical strategy optimization.
摘要：会话推荐系统（CRS）通过多转交互提供个性化建议。凭借大型语言模型（LLM）的强大推理能力，将其应用于CRS已变得很有希望。但是，现有方法通常缺乏对交互策略的明确优化，而依靠统一的提示，这可以产生次优的结果。我们提出了加强战略优化（RSO），这是一个分层框架，将响应生成分解为宏观策略计划和超级专家网络中的微观适应。计划者选择策略（例如，建议，解释，鼓励），而演员则在辅助专家指导的偏好和事实基础的指导下产生回应。这种分解可以使更多的学习学习。为了解决有限的多转化数据，我们将策略学习模型为强化学习，并获得基于LLM的探索奖励。实验表明RSO的表现优于最先进的基线，从而验证了层次策略优化的有效性。

Title: End-to-End Aspect-Guided Review Summarization at Scale

Authors: Ilya Boytsov, Vinny DeGenova, Mikhail Balyasin, Joseph Walt, Caitlin Eusden, Marie-Claire Rochat, Margaret Pierson
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.26103
Pdf URL: https://arxiv.org/pdf/2509.26103
Copy Paste: [[2509.26103]] End-to-End Aspect-Guided Review Summarization at Scale(https://arxiv.org/abs/2509.26103)
Keywords: language model, llm, prompt
Abstract: We present a scalable large language model (LLM)-based system that combines aspect-based sentiment analysis (ABSA) with guided summarization to generate concise and interpretable product review summaries for the Wayfair platform. Our approach first extracts and consolidates aspect-sentiment pairs from individual reviews, selects the most frequent aspects for each product, and samples representative reviews accordingly. These are used to construct structured prompts that guide the LLM to produce summaries grounded in actual customer feedback. We demonstrate the real-world effectiveness of our system through a large-scale online A/B test. Furthermore, we describe our real-time deployment strategy and release a dataset of 11.8 million anonymized customer reviews covering 92,000 products, including extracted aspects and generated summaries, to support future research in aspect-guided review summarization.
摘要：我们提出了一个可扩展的大语言模型（LLM）的系统，该系统将基于方面的情感分析（ABSA）与指导摘要相结合，以生成Wayfair平台的简洁且可解释的产品审查摘要。我们的方法首先从各个评论中提取并巩固了方面态度对，为每种产品选择最常见的方面，并相应地代表评论。这些用于构建结构化提示，以指导LLM生成以实际客户反馈为基础的摘要。我们通过大规模的在线A/B测试来证明系统的现实有效性。此外，我们描述了我们的实时部署策略，并发布了1,180万个匿名客户评论的数据集，其中包括92,000种产品，包括提取的方面和生成的摘要，以支持未来的研究审查摘要中的未来研究。

Title: Vocabulary Customization for Efficient Domain-Specific LLM Deployment

Authors: Christian Herold, Michael Kozielski, Nicholas Santavas, Yannick Versley, Shahram Khadivi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.26124
Pdf URL: https://arxiv.org/pdf/2509.26124
Copy Paste: [[2509.26124]] Vocabulary Customization for Efficient Domain-Specific LLM Deployment(https://arxiv.org/abs/2509.26124)
Keywords: llm
Abstract: When using an LLM to process text outside the training domain(s), an often overlooked factor is vocabulary mismatch, where the general-domain tokenizer fails to capture frequent domain-specific terms, leading to higher token fertility and thus a decrease in processing speed due to suboptimal sub-word splits. We address this limitation by augmenting the pretrained vocabulary with a set of domain-specific tokens. To this end, we design an algorithm that extends an existing tokenizer while guaranteeing it never decreases tokenization efficiency: every input sequence is segmented into at most the same number of tokens as before. Evaluated on real-world e-Commerce use-cases, the augmented tokenizer significantly shortens input sequences by up to 20% and reduces inference latency on downstream tasks while preserving predictive quality. We further analyze secondary effects, such as the impact on forward pass speed and the rate at which the model adopts the newly introduced tokens, to illustrate the broader benefits of vocabulary adaptation.
摘要：当使用LLM在训练域之外处理文本时，经常被忽略的因素是词汇不匹配，在该因素不匹配的情况下，通用域的代币器未能捕获频繁的域特异性术语，从而导致较高的代币生育能力，从而导致由于亚次优的子手的分离而导致的处理速度下降。我们通过使用一组特定领域的代币来增强经过验证的词汇来解决这一限制。为此，我们设计了一种算法，该算法扩展了现有的代币仪，同时保证它永远不会降低令牌化效率：每个输入序列最多都与以前相同。对现实世界的电子商务用例评估，增强的令牌仪可显着缩短输入序列多达20％，并减少下游任务的推理潜伏期，同时保留预测性质量。我们进一步分析了次要效应，例如对远期传球速度的影响以及模型采用新引入令牌的速度，以说明词汇适应的更广泛好处。
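
One simple way to see how a "never more tokens than before" guarantee can hold is sketched below. It is not the authors' algorithm, which the abstract does not describe: it swaps in a fewest-tokens dynamic-programming segmenter, under which any segmentation valid for the old vocabulary stays valid for the extended one, so the token count cannot grow. The function name `min_token_segmentation`, the toy vocabularies, and the rule that single characters are always allowed as fallback tokens are all assumptions made for illustration.

```python
def min_token_segmentation(text, vocab):
    """Split `text` into the fewest tokens, using a simple dynamic program.

    Assumption: every single character is always a valid fallback token,
    so a segmentation exists for any input.
    """
    n = len(text)
    max_len = max((len(t) for t in vocab), default=1)
    best = [0] + [float("inf")] * n  # best[i]: fewest tokens covering text[:i]
    back = [0] * (n + 1)             # back[i]: start index of the last token
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            piece = text[j:i]
            if len(piece) == 1 or piece in vocab:
                if best[j] + 1 < best[i]:
                    best[i] = best[j] + 1
                    back[i] = j
    # Walk the back-pointers to recover the tokens in order.
    tokens, i = [], n
    while i > 0:
        tokens.append(text[back[i]:i])
        i = back[i]
    return tokens[::-1]


if __name__ == "__main__":
    base_vocab = {"so", "fa", "ta", "ble"}
    # Hypothetical domain-specific additions (e.g. frequent product terms).
    extended_vocab = base_vocab | {"sofa", "table"}
    for text in ["sofatable", "tablesofa", "xyz"]:
        before = min_token_segmentation(text, base_vocab)
        after = min_token_segmentation(text, extended_vocab)
        print(text, before, after)
```

Because the extended vocabulary is a superset of the base one, the optimum over the larger set is at most the optimum over the smaller set; real subword tokenizers (for example greedy BPE merges) do not get this property for free, which is presumably why the paper needs a dedicated algorithm.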

Title: The Hunger Game Debate: On the Emergence of Over-Competition in Multi-Agent Systems

Authors: Xinbei Ma, Ruotian Ma, Xingyu Chen, Zhengliang Shi, Mengru Wang, Jen-tse Huang, Qu Yang, Wenxuan Wang, Fanghua Ye, Qingxuan Jiang, Mengfei Zhou, Zhuosheng Zhang, Rui Wang, Hai Zhao, Zhaopeng Tu, Xiaolong Li, Linus
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.26126
Pdf URL: https://arxiv.org/pdf/2509.26126
Copy Paste: [[2509.26126]] The Hunger Game Debate: On the Emergence of Over-Competition in Multi-Agent Systems(https://arxiv.org/abs/2509.26126)
Keywords: llm, agent
Abstract: LLM-based multi-agent systems demonstrate great potential for tackling complex problems, but how competition shapes their behavior remains underexplored. This paper investigates over-competition in multi-agent debate, where agents under extreme pressure exhibit unreliable, harmful behaviors that undermine both collaboration and task performance. To study this phenomenon, we propose HATE, the Hunger Game Debate, a novel experimental framework that simulates debates in a zero-sum competition arena. Our experiments, conducted across a range of LLMs and tasks, reveal that competitive pressure significantly stimulates over-competition behaviors and degrades task performance, causing discussions to derail. We further explore the impact of environmental feedback by adding variants of judges, and find that objective, task-focused feedback effectively mitigates over-competition behaviors. We also probe the post-hoc kindness of LLMs and form a leaderboard to characterize top LLMs, providing insights for understanding and governing the emergent social dynamics of the AI community.
摘要：基于LLM的多代理系统具有解决复杂问题的巨大潜力，但是竞争如何塑造其行为仍未得到充实。本文调查了多项式辩论中的过度竞争，在极端压力下的代理人表现出不可靠的有害行为，破坏了协作和任务绩效。为了研究这种现象，我们提出了仇恨，《饥饿游戏辩论》，这是一个新颖的实验框架，模拟了零和竞赛领域的辩论。我们在一系列LLM和任务进行的实验表明，竞争压力显着刺激了过度竞争的行为并降低了任务绩效，从而导致讨论出轨。我们通过添加法官的变体来进一步探索环境反馈的影响，这表明客观，以任务为中心的反馈有效地减轻了过度竞争的行为。我们还探究了LLM的事后友善，并形成了表征顶级LLM的排行榜，为理解和管理AI社区的新兴社会动态提供了见解。

Title: CliniBench: A Clinical Outcome Prediction Benchmark for Generative and Encoder-Based Language Models

Authors: Paul Grundmann, Dennis Fast, Jan Frick, Thomas Steffek, Felix Gers, Wolfgang Nejdl, Alexander Löser
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.26136
Pdf URL: https://arxiv.org/pdf/2509.26136
Copy Paste: [[2509.26136]] CliniBench: A Clinical Outcome Prediction Benchmark for Generative and Encoder-Based Language Models(https://arxiv.org/abs/2509.26136)
Keywords: language model, llm
Abstract: With their growing capabilities, generative large language models (LLMs) are being increasingly investigated for complex medical tasks. However, their effectiveness in real-world clinical applications remains underexplored. To address this, we present CliniBench, the first benchmark that enables comparability of well-studied encoder-based classifiers and generative LLMs for discharge diagnosis prediction from admission notes in the MIMIC-IV dataset. Our extensive study compares 12 generative LLMs and 3 encoder-based classifiers and demonstrates that encoder-based classifiers consistently outperform generative models in diagnosis prediction. We assess several retrieval augmentation strategies for in-context learning from similar patients and find that they provide notable performance improvements for generative LLMs.
摘要：随着其不断增长的能力，生成的大语言模型（LLM）正在越来越多地研究复杂的医疗任务。但是，它们在现实世界中的临床应用中的有效性仍未得到充实。为了解决这个问题，我们介绍了Clinibench，这是第一个基准，它可以可与基于编码的分类器和生成LLM可比性可比性，用于在模拟物IV数据集中通过入院注释进行排放诊断预测。我们的广泛研究比较了12个生成LLM和3个基于编码器的分类器，并证明基于编码器的分类器在诊断预测中始终超过生成模型。我们评估了几种从类似患者学习的秘密学习的检索增强策略，并发现它们为生成LLM提供了显着的性能改善。

Title: MGen: Millions of Naturally Occurring Generics in Context

Authors: Gustavo Cilleruelo, Emily Allaway, Barry Haddow, Alexandra Birch
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.26160
Pdf URL: https://arxiv.org/pdf/2509.26160
Copy Paste: [[2509.26160]] MGen: Millions of Naturally Occurring Generics in Context(https://arxiv.org/abs/2509.26160)
Keywords: long context
Abstract: MGen is a dataset of over 4 million naturally occurring generic and quantified sentences extracted from diverse textual sources. Sentences in the dataset have long context documents, corresponding to websites and academic papers, and cover 11 different quantifiers. We analyze the features of generic sentences in the dataset and report two observations: generics can be long sentences (averaging over 16 words), and speakers often use them to express generalisations about people. MGen is the biggest and most diverse dataset of naturally occurring generic sentences, opening the door to large-scale computational research on genericity. It is publicly available at this https URL
摘要：MGEN是一个超过400万天然发生的通用和量化句子的数据集，这些句子从不同的文本来源中提取。数据集中的句子具有与网站和学术论文相对应的长上下文文档，并涵盖了11个不同的量词。我们分析了数据集中的仿制句子的特征，并具有有趣的见解：仿制药可以是长句子（平均16个单词），并且说话者经常使用它们来表达对人的概括。 MGEN是自然存在的通用句子的最大，最多样化的数据集，为通用大规模计算研究打开了大门。它在此HTTPS URL上公开可用

Title: Explaining novel senses using definition generation with open language models

Authors: Mariia Fedorova, Andrey Kutuzov, Francesco Periti, Yves Scherrer
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.26181
Pdf URL: https://arxiv.org/pdf/2509.26181
Copy Paste: [[2509.26181]] Explaining novel senses using definition generation with open language models(https://arxiv.org/abs/2509.26181)
Keywords: language model, llm
Abstract: We apply definition generators based on open-weights large language models to the task of creating explanations of novel senses, taking target word usages as an input. To this end, we employ the datasets from the AXOLOTL'24 shared task on explainable semantic change modeling, which features Finnish, Russian and German languages. We fine-tune and provide publicly the open-source models performing higher than the best submissions of the aforementioned shared task, which employed closed proprietary LLMs. In addition, we find that encoder-decoder definition generators perform on par with their decoder-only counterparts.
摘要：我们将基于开放量大语言模型的定义发电机应用于创建新颖感官解释的任务，以目标单词用法作为输入。为此，我们采用了Axolotl'24共享任务的数据集，这些任务是可解释的语义变化建模，该模型具有芬兰语，俄语和德语。我们微调并公开提供高于上述共同任务的最佳提交的开源模型，该任务采用了封闭的专有LLM。此外，我们发现编码器定义生成器与仅解码器的对应物执行。

Title: VietBinoculars: A Zero-Shot Approach for Detecting Vietnamese LLM-Generated Text

Authors: Trieu Hai Nguyen, Sivaswamy Akilesh
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.26189
Pdf URL: https://arxiv.org/pdf/2509.26189
Copy Paste: [[2509.26189]] VietBinoculars: A Zero-Shot Approach for Detecting Vietnamese LLM-Generated Text(https://arxiv.org/abs/2509.26189)
Keywords: language model, gpt, llm, prompt
Abstract: The rapid development of Large Language Models (LLMs) based on transformer architectures raises key challenges, one of them being the task of distinguishing between human-written text and LLM-generated text. As LLM-generated textual content becomes increasingly complex over time and resembles human writing, traditional detection methods are proving less effective, especially as the number and diversity of LLMs continue to grow with new models and versions being released at a rapid pace. This study proposes VietBinoculars, an adaptation of the Binoculars method with optimized global thresholds, to enhance the detection of Vietnamese LLM-generated text. We have constructed new Vietnamese AI-generated datasets to determine the optimal thresholds for VietBinoculars and to enable benchmarking. Our experimental results show that VietBinoculars achieves over 99% in accuracy, F1-score, and AUC in both domains on multiple out-of-domain datasets. It outperforms the original Binoculars model, traditional detection methods, and other state-of-the-art approaches, including commercial tools such as ZeroGPT and DetectGPT, especially under specially modified prompting strategies.
摘要：基于变压器体系结构的大型语言模型（LLM）的快速发展研究提出了关键挑战，其中之一是区分人写文本和LLM生成的文本的任务。随着LLM生成的文本内容，随着时间的流逝而变得越来越复杂，类似于人类写作，传统的检测方法的效果降低了，尤其是随着LLM的数量和多样性的速度和多样性的速度不断增长，随着新模型和版本的速度快速发布。这项研究提出了越南眼，这是对优化全局阈值的双筒望远镜方法的适应，以增强越南LLM生成的文本的检测。我们已经构建了新的越南AI生成的数据集，以确定越南眼的最佳阈值并启用基准测试。我们实验的结果表明，结果表明，在多个室外数据集中，越南眼球在所有两个准确性，F1分数和AUC中都达到了99 \％以上。它的表现优于原始的双筒望远镜模型，传统检测方法和其他最先进的方法，包括诸如Zerogpt和Destectgpt之类的商业工具，尤其是在经过特殊修改的提示策略下。
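
Illustrative sketch (not taken from the paper): the abstract mentions "optimized global thresholds" for a Binoculars-style score. One plausible way to pick such a threshold is to scan candidate cut points on a labeled validation set and keep the one with the best F1. The function names, the toy scores, and the convention that lower scores indicate LLM-generated text are assumptions made for this example only.

```python
def f1_at_threshold(scores, labels, threshold):
    """F1 for the 'LLM-generated' class (label 1).

    Assumption: a text is flagged as LLM-generated when its score is
    below the threshold.
    """
    tp = fp = fn = 0
    for score, label in zip(scores, labels):
        predicted = 1 if score < threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 1:
            fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_global_threshold(scores, labels):
    """Return the candidate cut point with the highest F1 on the given data."""
    distinct = sorted(set(scores))
    # Candidate cuts: midpoints between adjacent distinct scores,
    # plus one cut above the maximum (flags everything).
    cuts = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    cuts.append(distinct[-1] + 1.0)
    return max(cuts, key=lambda t: f1_at_threshold(scores, labels, t))


# Hypothetical validation scores: 1 = LLM-generated, 0 = human-written.
scores = [0.70, 0.75, 0.80, 0.95, 1.00, 1.05]
labels = [1, 1, 1, 0, 0, 0]
threshold = best_global_threshold(scores, labels)
print(threshold, f1_at_threshold(scores, labels, threshold))
```

A single threshold chosen this way is then reused unchanged on other datasets, which is what makes it "global"; how the paper actually optimizes its thresholds is not stated in the abstract.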

Title: Type-Less yet Type-Aware Inductive Link Prediction with Pretrained Language Models

Authors: Alessandro De Bellis, Salvatore Bufi, Giovanni Servedio, Vito Walter Anelli, Tommaso Di Noia, Eugenio Di Sciascio
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.26224
Pdf URL: https://arxiv.org/pdf/2509.26224
Copy Paste: [[2509.26224]] Type-Less yet Type-Aware Inductive Link Prediction with Pretrained Language Models(https://arxiv.org/abs/2509.26224)
Keywords: language model
Abstract: Inductive link prediction is emerging as a key paradigm for real-world knowledge graphs (KGs), where new entities frequently appear and models must generalize to them without retraining. Predicting links in a KG faces the challenge of guessing previously unseen entities by leveraging generalizable node features such as subgraph structure, type annotations, and ontological constraints. However, explicit type information is often lacking or incomplete. Even when available, type information in most KGs is often coarse-grained, sparse, and prone to errors due to human annotation. In this work, we explore the potential of pre-trained language models (PLMs) to enrich node representations with implicit type signals. We introduce TyleR, a Type-less yet type-awaRe approach for subgraph-based inductive link prediction that leverages PLMs for semantic enrichment. Experiments on standard benchmarks demonstrate that TyleR outperforms state-of-the-art baselines in scenarios with scarce type annotations and sparse graph connectivity. To ensure reproducibility, we share our code at this https URL.
摘要：归纳链路预测是作为现实世界知识图（kgs）的关键范式出现的，新实体经常出现，模型必须在不进行重新训练的情况下概括为它们。通过利用可概括的节点特征，例如子图结构，类型注释和本体论约束，预测千克中的链接面临着猜测以前看不见的实体的挑战。但是，通常缺乏或不完整的明确类型信息。即使在可用的情况下，大多数kg中的信息通常都是粗粒，稀疏，由于人类注释而容易出现错误。在这项工作中，我们探讨了预训练的语言模型（PLM）的潜力，以丰富具有隐式类型信号的节点表示。我们介绍了泰勒（Tyler），这是一种基于子图的归纳链路预测的无类型但一种感知的方法，它利用PLMS进行语义富集。标准基准测试的实验表明，在具有稀缺类型的注释和稀疏图连接性的情况下，泰勒的表现优于最先进的基线。为了确保可重复性，我们在此HTTPS URL上共享代码。

Title: Finetune Once: Decoupling General & Domain Learning with Dynamic Boosted Annealing

Authors: Yang Tang, Ruijie Liu, Yifan Wang, Shiyu Li, Xi Chen
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.26242
Pdf URL: https://arxiv.org/pdf/2509.26242
Copy Paste: [[2509.26242]] Finetune Once: Decoupling General & Domain Learning with Dynamic Boosted Annealing(https://arxiv.org/abs/2509.26242)
Keywords: language model, llm
Abstract: Large language models (LLMs) fine-tuning shows excellent implications. However, vanilla fine-tuning methods often require intricate data mixture and repeated experiments for optimal generalization. To address these challenges and streamline the training process, we propose an efficient and universal solution, Dynamic Boosted Annealing (DBA). We obtain a global gradient through zero-learning-rate training on general data, which is subsequently employed for gradient boosting and dynamic training step correction during domain training. In conjunction with annealing learning, we end up establishing a fine-tuning pipeline that relies solely on domain data without collapse. By evaluating both general and domain-specific performance across multiple tasks on several popular base models, DBA achieves an average improvement of 5.8% in joint performance over vanilla fine-tuning. Furthermore, since general data is no longer involved in annealing, repeated experiments led by data mixture are also eliminated. According to our tests, the DBA method can reduce GPU hours by 91.0% compared to the vanilla method.
摘要：大型语言模型（LLMS）微调显示出极好的含义。但是，香草微调方法通常需要复杂的数据混合物和重复的实验才能获得最佳概括。为了应对这些挑战并简化培训过程，我们提出了一个有效而通用的解决方案，动态增强退火（DBA）。我们通过对通用数据进行零学习率培训获得了全球梯度，随后在域训练期间用于梯度提升和动态训练步骤校正。结合退火学习，我们最终建立了一条微调管道，该管道仅依赖于域数据而不会崩溃。通过在几个流行的基本模型上评估多个任务中的一般和域特异性绩效，DBA可以在Vanilla微调方面的平均联合性能提高5.8％。此外，由于一般数据不再参与退火，因此还消除了由数据混合物引起的重复实验。根据我们的测试，与香草方法相比，DBA方法可以将GPU小时减少91.0％。

Title: Optimizing Speech Language Models for Acoustic Consistency

Authors: Morteza Rohanian, Michael Krauthammer
Subjects: cs.CL, cs.SD
Abstract URL: https://arxiv.org/abs/2509.26276
Pdf URL: https://arxiv.org/pdf/2509.26276
Copy Paste: [[2509.26276]] Optimizing Speech Language Models for Acoustic Consistency(https://arxiv.org/abs/2509.26276)
Keywords: language model
Abstract: We study speech language models that incorporate semantic initialization and planning losses to achieve robust and consistent generation. Our approach initializes speech tokens with self-supervised features, applies a light alignment loss, and trains with thinning and auxiliary objectives that target robustness and content planning. We train three models: a 0.7B speech-only model, a 1.0B speech-only model, and a 1.0B interleaved model with both text and speech. Acoustic studies show that the speech-only models achieve the highest consistency across speaker, gender, sentiment, room, and background factors, surpassing larger systems. Interleaving improves lexical and syntactic probes and semantic-acoustic alignment but reduces consistency. Linear probes show that our initialization biases the model toward content structure while trading off prosody detail. These results show that LM-side design and training mix control the balance between acoustic stability and semantic grounding without changes to the tokenizer or runtime architecture. A demo and model weights are available for exploration.
摘要：我们研究语音语言模型，这些模型结合了语义初始化和计划损失，以实现强大而稳定的产生。我们的方法以自我监督的功能，应用光线对齐损失以及具有稀疏和辅助目标的火车来初始化语音令牌。我们训练三个模型：一个0.7B语音模型，仅1.0B语音模型，以及带有文本和语音的1.0B交错模型。声学研究表明，仅语音模型在说话者，性别，情感，房间和背景因素之间达到了最高的一致性，超过了较大的系统。交织改善了词汇和句法探针和语义 - 声学对齐，但降低了一致性。线性探针表明，我们的初始化偏向于内容结构，同时交易韵律细节。这些结果表明，LM侧设计和训练混合物控制声学稳定性与语义接地之间的平衡，而不会更改令牌或运行时体系结构。演示和模型权重可用于探索。

Title: QUARTZ : QA-based Unsupervised Abstractive Refinement for Task-oriented Dialogue Summarization

Authors: Mohamed Imed Eddine Ghebriout (1), Gaël Guibon (1, 2), Ivan Lerner (3, 4, 5), Emmanuel Vincent (1) ((1) Universite de Lorraine, CNRS, Inria, LORIA, Nancy, France, (2) Universite Sorbonne Paris Nord, CNRS, LIPN, Villetaneuse, France, (3) Inserm, Centre de Recherche des Cordeliers, Universite Paris Cite, Sorbonne Universite, Paris, France, (4) HeKA, Inria Paris, Paris, France, (5) Assistance Publique Hopitaux de Paris, Georges Pompidou European Hospital, Paris, France)
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.26302
Pdf URL: https://arxiv.org/pdf/2509.26302
Copy Paste: [[2509.26302]] QUARTZ : QA-based Unsupervised Abstractive Refinement for Task-oriented Dialogue Summarization(https://arxiv.org/abs/2509.26302)
Keywords: language model, llm
Abstract: Dialogue summarization aims to distill the core meaning of a conversation into a concise text. This is crucial for reducing the complexity and noise inherent in dialogue-heavy applications. While recent approaches typically train language models to mimic human-written summaries, such supervision is costly and often results in outputs that lack task-specific focus, limiting their effectiveness in downstream applications, such as medical tasks. In this paper, we propose QUARTZ, a framework for task-oriented utility-based dialogue summarization. QUARTZ starts by generating multiple summaries and task-oriented question-answer pairs from a dialogue in a zero-shot manner using a pool of large language models (LLMs). The quality of the generated summaries is evaluated by having LLMs answer task-related questions before (i) selecting the best candidate answers and (ii) identifying the most informative summary based on these answers. Finally, we fine-tune the best LLM on the selected summaries. When validated on multiple datasets, QUARTZ demonstrates its effectiveness by achieving competitive results in various zero-shot settings, rivaling fully-supervised State-of-the-Art (SotA) methods.
摘要：对话摘要旨在将对话的核心含义提炼成简洁的文本。这对于降低对话重度应用中固有的复杂性和噪音至关重要。尽管最近的方法通常会训练语言模型来模仿人写的摘要，但这种监督却是昂贵的，并且通常会导致缺乏特定于任务的焦点的输出，从而限制了其在下游应用程序（例如医疗任务）中的有效性。在本文中，我们提出了\ App，这是一个基于任务的对话摘要的框架。 \ app首先使用大型语言模型（LLMS）生成对话中的多个摘要和面向任务的问题回答对话。通过在\ textit {（i）}选择最佳候选答案之前，通过让LLMS答案与任务相关的问题进行评估，可以评估生成的摘要的质量，然后根据这些答案选择最佳的候选答案，然后\ textit {（ii）}确定最有用的摘要。最后，我们在选定的摘要中微调了最好的LLM。当在多个数据集上验证时，\ app通过在各种零拍设置中实现竞争成果来证明其有效性，从而与完全监督的最新方法（SOTA）方法媲美。

Title: Feedback Forensics: A Toolkit to Measure AI Personality

Authors: Arduin Findeis, Timo Kaufmann, Eyke Hüllermeier, Robert Mullins
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.26305
Pdf URL: https://arxiv.org/pdf/2509.26305
Copy Paste: [[2509.26305]] Feedback Forensics: A Toolkit to Measure AI Personality(https://arxiv.org/abs/2509.26305)
Keywords: chat
Abstract: Some traits making a "good" AI model are hard to describe upfront. For example, should responses be more polite or more casual? Such traits are sometimes summarized as model character or personality. Without a clear objective, conventional benchmarks based on automatic validation struggle to measure such traits. Evaluation methods using human feedback such as Chatbot Arena have emerged as a popular alternative. These methods infer "better" personality and other desirable traits implicitly by ranking multiple model responses relative to each other. Recent issues with model releases highlight limitations of these existing opaque evaluation approaches: a major model was rolled back over sycophantic personality issues, and models were observed overfitting to such feedback-based leaderboards. Despite these known issues, limited public tooling exists to explicitly evaluate model personality. We introduce Feedback Forensics: an open-source toolkit to track AI personality changes, both those encouraged by human (or AI) feedback, and those exhibited across AI models trained and evaluated on such feedback. Leveraging AI annotators, our toolkit enables investigating personality via a Python API and browser app. We demonstrate the toolkit's usefulness in two steps: (A) first we analyse the personality traits encouraged in popular human feedback datasets including Chatbot Arena, MultiPref and PRISM; and (B) then use our toolkit to analyse how much popular models exhibit such traits. We release (1) our Feedback Forensics toolkit alongside (2) a web app tracking AI personality in popular models and feedback datasets as well as (3) the underlying annotation data at this https URL.
摘要：一些制作“良好” AI模型的特征很难预先描述。例如，回应应该更有礼貌还是更随意？这些特征有时被总结为模型特征或个性。没有明确的客观，基于自动验证的常规基准来衡量此类特征。使用人类反馈（例如聊天机器人竞技场）的评估方法已成为一种流行的选择。这些方法通过相对于彼此对多个模型响应进行排名，推断“更好”的个性和其他理想的特征。模型发布的最新问题重点介绍了这些现有不透明评估方法的局限性：在Sycophantic性格问题上重新回升了一个主要模型，观察到模型过于适应此类基于反馈的排行榜。尽管存在这些已知问题，但仍有有限的公共工具可以明确评估模型个性。我们介绍了反馈取证：一种开源工具包，以跟踪人格（或AI）反馈鼓励的人格变化，以及在此类反馈中训练和评估的AI模型中展示的。我们的工具包利用AI注释器，可以通过Python API和浏览器应用程序调查个性。我们通过两个步骤来证明该工具包的实用性：（a）首先，我们分析了流行的人类反馈数据集中鼓励的人格特征，包括聊天机器人竞技场，倍增和Prism；（b）然后使用我们的工具包来分析有多少流行模型表现出此类特征。我们发布（1）我们的反馈取证工具包以及（2）Web应用程序跟踪流行模型和反馈数据集中的AI个性以及（3）此HTTPS URL上的基础注释数据。

Title: One-Token Rollout: Guiding Supervised Fine-Tuning of LLMs with Policy Gradient

Authors: Rui Ming, Haoyuan Wu, Shoubo Hu, Zhuolun He, Bei Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.26313
Pdf URL: https://arxiv.org/pdf/2509.26313
Copy Paste: [[2509.26313]] One-Token Rollout: Guiding Supervised Fine-Tuning of LLMs with Policy Gradient(https://arxiv.org/abs/2509.26313)
Keywords: language model, llm
Abstract: Supervised fine-tuning (SFT) is the predominant method for adapting large language models (LLMs), yet it often struggles with generalization compared to reinforcement learning (RL). In this work, we posit that this performance disparity stems not just from the loss function, but from a more fundamental difference: SFT learns from a fixed, pre-collected dataset, whereas RL utilizes on-policy data sampled from the current policy. Building on this hypothesis, we introduce one-token rollout (OTR), a novel fine-tuning algorithm that guides SFT with the policy gradient method. OTR reframes the autoregressive learning process by treating each token generation as a single-step reinforcement learning trajectory. At each step, it performs a Monte Carlo "rollout" by sampling multiple candidate tokens from the current policy's distribution. The ground-truth token from the supervised data is then used to provide a reward signal to these samples. Guided by policy gradient, our algorithm repurposes static, off-policy supervised data into a dynamic, on-policy signal at the token level, capturing the generalization benefits of on-policy learning while bypassing the costly overhead of full sentence generation. Through extensive experiments on a diverse suite of challenging benchmarks spanning mathematical reasoning, code generation, and general domain reasoning, we demonstrate that OTR consistently outperforms standard SFT. Our findings establish OTR as a powerful and practical alternative for fine-tuning LLMs and provide compelling evidence that the on-policy nature of data is a critical driver of generalization, offering a promising new direction for fine-tuning LLMs.
摘要：监督微调（SFT）是适应大型语言模型（LLM）的主要方法，但与增强学习（RL）相比，它经常在概括方面挣扎。在这项工作中，我们认为这种绩效差异不仅源于损失功能，还源于更根本的差异：SFT从固定的，预收取的数据集中学习，而RL则利用从当前策略中采样的上政策数据。在此假设的基础上，我们介绍了一句话推出（OTR），这是一种新颖的微调算法，以策略梯度方法指导SFT。 OTR通过将每个令牌生成视为单步加固学习轨迹来重构自回归学习过程。在每个步骤中，它通过从当前策略的发行版中抽样多个候选代币来执行蒙特卡洛````推出''。然后，使用监督数据中的地面图表来向这些样品提供奖励信号。在政策梯度的指导下，我们的算法将静态的，非政策的监督数据重新调整为一个在令牌级别的动态，式的信号，从而捕获了上政策学习的概括益处，同时绕过了完整句子的成本昂贵的台词。通过对跨越数学推理，代码生成和一般领域推理的各种具有挑战性的基准测试的广泛实验，我们证明OTR始终优于标准SFT。我们的发现将OTR确定为微调LLMS的强大而实用的替代方案，并提供了令人信服的证据，表明数据的底漆性质是概括的关键驱动力，为微调LLMS提供了有希望的新方向。
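
Illustrative sketch (not the authors' implementation): the abstract describes sampling several candidate tokens from the current policy at one position, rewarding them using the ground-truth token, and applying a policy-gradient update. The toy function below does this for a single position with a REINFORCE-style surrogate loss. The reward values (1 for a sample that matches the ground truth, 0 otherwise), the number of samples, and all names are assumptions chosen for the example; the abstract does not specify them.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def one_token_rollout_step(logits, target, num_samples=8, rng=None):
    """Return (loss, grad_wrt_logits) for one token position.

    loss = -(1/K) * sum_k reward_k * log pi(a_k), with a_k sampled from pi.
    Assumed reward: 1.0 if the sample equals the ground-truth token, else 0.0.
    """
    rng = rng or random.Random(0)
    probs = softmax(logits)
    vocab = list(range(len(logits)))
    samples = rng.choices(vocab, weights=probs, k=num_samples)

    loss = 0.0
    grad = [0.0] * len(logits)
    for action in samples:
        reward = 1.0 if action == target else 0.0
        loss -= reward * math.log(probs[action]) / num_samples
        # d(-log pi(action)) / d logits[j] = probs[j] - 1[j == action]
        for j in vocab:
            indicator = 1.0 if j == action else 0.0
            grad[j] += reward * (probs[j] - indicator) / num_samples
    return loss, grad


loss, grad = one_token_rollout_step([2.0, 0.5, -1.0], target=0)
print(loss, grad)
```

Because only the sampled tokens contribute, the update depends on what the current policy actually proposes, which is the on-policy aspect the abstract emphasizes; plain SFT would instead always use the gradient of the ground-truth token's log-probability.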

Title: Latent Thinking Optimization: Your Latent Reasoning Language Model Secretly Encodes Reward Signals in its Latent Thoughts

Authors: Hanwen Du, Yuxin Dong, Xia Ning
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.26314
Pdf URL: https://arxiv.org/pdf/2509.26314
Copy Paste: [[2509.26314]] Latent Thinking Optimization: Your Latent Reasoning Language Model Secretly Encodes Reward Signals in its Latent Thoughts(https://arxiv.org/abs/2509.26314)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) excel at problem solving by generating chains of thought in natural language, but such verbal thinking is computationally costly and prone to overthinking. Recent work instead proposes a latent thinking architecture Huggin-3.5B, which represents intermediate reasoning steps as a sequence of latent representations. However, latent thoughts lack interpretability and are difficult to supervise, raising concerns about the correctness and reliability of its latent thinking processes. In this paper, we provide a systematic study of how Huggin-3.5B thinks in the latent space and how external supervision signals can improve its latent thinking processes. We show that latent thoughts leading to correct versus incorrect answers exhibit highly distinguishable patterns, and that a latent classifier can reliably predict answer correctness directly from latent thoughts. Leveraging these insights, we propose Latent Thinking Optimization (LTO), a probabilistic algorithm that employs the latent classifier as a Latent Reward Model (LRM) to optimize the latent thinking processes. Extensive experiments across diverse reasoning tasks demonstrate that LRM is highly effective in detecting incorrect latent thinking patterns, and LTO can significantly improve the latent thinking processes. Furthermore, we show that LRM can generalize across diverse domains, and LTO can be seamlessly applied to general LLMs to improve their thinking processes. In contrast to verbal thinking, our method demonstrates that reward modeling and scaling test-time thinking with supervision can be performed directly in the latent space, highlighting its potential as a general, efficient, and domain-agnostic approach to improving the thinking processes of LLMs.
摘要：大型语言模型（LLMS）通过以自然语言产生思想链来擅长解决问题，但是这种语言思维在计算上是昂贵的，容易多思考。相反，最近的工作提出了一个潜在的思维体系结构Huggin-3.5B，该架构代表了中间推理步骤作为潜在表示的顺序。但是，潜在思想缺乏可解释性，难以监督，引起人们对其潜在思维过程的正确性和可靠性的担忧。在本文中，我们对Huggin-3.5B在潜在空间中的思考以及外部监督信号如何改善其潜在思维过程的方式提供了系统的研究。我们表明，导致正确和错误答案的潜在思想表现出高度可区分的模式，并且潜在分类器可以直接从潜在思想中可靠地预测正确的答案。利用这些见解，我们提出了潜在思维优化（LTO），这是一种概率算法，该算法采用潜在分类器作为潜在奖励模型（LRM）来优化潜在的思维过程。跨不同推理任务的广泛实验表明，LRM在检测不正确的潜在思维模式方面非常有效，而LTO可以显着改善潜在的思维过程。此外，我们表明LRM可以概括在各种领域，LTO可以无缝地应用于一般的LLMs以改善其思维过程。与口头思考相反，我们的方法表明，可以直接在潜在空间中进行奖励建模和扩展测试时间思维，从而强调了其作为一般，高效和领域的潜力，以改善LLM的思维过程。

Title: Fast-dLLM v2: Efficient Block-Diffusion LLM

Authors: Chengyue Wu, Hao Zhang, Shuchen Xue, Shizhe Diao, Yonggan Fu, Zhijian Liu, Pavlo Molchanov, Ping Luo, Song Han, Enze Xie
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.26328
Pdf URL: https://arxiv.org/pdf/2509.26328
Copy Paste: [[2509.26328]] Fast-dLLM v2: Efficient Block-Diffusion LLM(https://arxiv.org/abs/2509.26328)
Keywords: language model, llm
Abstract: Autoregressive (AR) large language models (LLMs) have achieved remarkable performance across a wide range of natural language tasks, yet their inherent sequential decoding limits inference efficiency. In this work, we propose Fast-dLLM v2, a carefully designed block diffusion language model (dLLM) that efficiently adapts pretrained AR models into dLLMs for parallel text generation, requiring only approximately 1B tokens of fine-tuning. This represents a 500x reduction in training data compared to full-attention diffusion LLMs such as Dream (580B tokens), while preserving the original model's performance. Our approach introduces a novel training recipe that combines a block diffusion mechanism with a complementary attention mask, enabling blockwise bidirectional context modeling without sacrificing AR training objectives. To further accelerate decoding, we design a hierarchical caching mechanism: a block-level cache that stores historical context representations across blocks, and a sub-block cache that enables efficient parallel generation within partially decoded blocks. Coupled with our parallel decoding pipeline, Fast-dLLM v2 achieves up to 2.5x speedup over standard AR decoding without compromising generation quality. Extensive experiments across diverse benchmarks demonstrate that Fast-dLLM v2 matches or surpasses AR baselines in accuracy, while delivering state-of-the-art efficiency among dLLMs - marking a significant step toward the practical deployment of fast and accurate LLMs. Code and model will be publicly released.
摘要：自回归（AR）大语言模型（LLM）在各种自然语言任务中取得了出色的表现，但它们固有的顺序解码限制推理效率。在这项工作中，我们提出了一种经过精心设计的块扩散语言模型（DLLM）的Fast-DLLM V2，该模型有效地将预处理的AR模型调整到DLLM中以进行平行文本生成，仅需要大约1B的微调令牌。与Dream（580B令牌）等全注意扩散LLM相比，这代表了训练数据的500倍降低，同时保留了原始模型的性能。我们的方法引入了一种新颖的培训配方，该配方将块扩散机制与互补的注意面膜结合在一起，从而实现了无需牺牲AR训练目标的方向双向上下文建模。为了进一步加速解码，我们设计了一个层次的缓存机制：一个块级缓存，该缓存在跨块跨块中存储历史上下文表示，以及一个在部分解码的块中实现有效平行生成的子块缓存。加上我们的平行解码管道，Fast-DLLM V2在不损害发电质量的情况下，超过标准AR解码的速度高达2.5倍。跨不同基准测试的广泛实验表明，快速DLLM V2匹配或超过AR基准的准确性，同时在DLLMS之间提供最先进的效率 - 标志着快速，准确的LLM的实际部署迈出了重要一步。代码和模型将公开发布。

Title: Efficient and Transferable Agentic Knowledge Graph RAG via Reinforcement Learning

Authors: Jinyeop Song, Song Wang, Julian Shun, Yada Zhu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.26383
Pdf URL: https://arxiv.org/pdf/2509.26383
Copy Paste: [[2509.26383]] Efficient and Transferable Agentic Knowledge Graph RAG via Reinforcement Learning(https://arxiv.org/abs/2509.26383)
Keywords: language model, llm, hallucination, retrieval-augmented generation, agent
Abstract: Knowledge-graph retrieval-augmented generation (KG-RAG) couples large language models (LLMs) with structured, verifiable knowledge graphs (KGs) to reduce hallucinations and expose reasoning traces. However, many KG-RAG systems compose multiple LLM modules (e.g., planning, reasoning, and responding), inflating inference cost and binding behavior to a specific target KG. To address this, we introduce KG-R1, an agentic KG retrieval-augmented generation (KG-RAG) framework through reinforcement learning (RL). KG-R1 utilizes a single agent that interacts with KGs as its environment, learning to retrieve at each step and incorporating the retrieved information into its reasoning and generation. The process is optimized through end-to-end RL. In controlled experiments across Knowledge-Graph Question Answering (KGQA) benchmarks, our method demonstrates both efficiency and transferability: using Qwen-2.5-3B, KG-R1 improves answer accuracy with fewer generation tokens than prior multi-module workflow methods that use larger foundation or fine-tuned models. Furthermore, KG-R1 enables plug and play: after training, it maintains strong accuracy on new KGs without modification. These properties make KG-R1 a promising KG-RAG framework for real-world deployment. Our code is publicly available at this https URL.
摘要：知识仪检索型生成（kg-rag）伴侣大型语言模型（LLMS）具有结构化的，可验证的知识图（kgs），以减少幻觉并揭示推理痕迹。但是，许多KG-rag系统组成了多个LLM模块（例如规划，推理和响应），将推理成本膨胀以及对特定目标kg的约束行为膨胀。为了解决这个问题，我们介绍了通过增强学习（RL）的代理KG检索生成（KG-rag）框架KG-R1。 KG-R1利用与KGS作为其环境相互作用的单一代理，在每个步骤中学习检索并将检索到的信息纳入其推理和生成中。该过程是通过端到端RL优化的。在跨知识记录问答（KGQA）基准的受控实验中，我们的方法证明了效率和可传递性：使用QWEN-2.5-3B，KG-R1提高了与先前使用更大基础模型或较大基础模型或较大基础模型相比，产生代币的答案准确性更少。此外，KG-R1可以启用插头：训练后，它在新的KGS上保持了强烈的准确性而无需修改。这些属性使KG-R1成为现实部署的有希望的KG-rag框架。我们的代码在此HTTPS URL上公开可用。

Title: Automatic Fact-checking in English and Telugu

Authors: Ravi Kiran Chikkala, Tatiana Anikina, Natalia Skachkova, Ivan Vykopal, Rodrigo Agerri, Josef van Genabith
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.26415
Pdf URL: https://arxiv.org/pdf/2509.26415
Copy Paste: [[2509.26415]] Automatic Fact-checking in English and Telugu(https://arxiv.org/abs/2509.26415)
Keywords: language model, llm
Abstract: False information poses a significant global challenge, and manually verifying claims is a time-consuming and resource-intensive process. In this research paper, we experiment with different approaches to investigate the effectiveness of large language models (LLMs) in classifying factual claims by their veracity and generating justifications in English and Telugu. The key contributions of this work include the creation of a bilingual English-Telugu dataset and the benchmarking of different veracity classification approaches based on LLMs.
摘要：虚假信息构成了一个重大的全球挑战，手动验证主张是一个耗时且资源密集的过程。在这篇研究论文中，我们尝试了不同的方法来研究大语模型（LLMS）在通过其真实性分类并在英语和泰卢固语中产生理由的事实主张的有效性。这项工作的主要贡献包括创建双语英语 - telugu数据集以及基于LLM的不同真实分类方法的基准测试。

Title: Text-Based Approaches to Item Alignment to Content Standards in Large-Scale Reading & Writing Tests

Authors: Yanbin Fu, Hong Jiao, Tianyi Zhou, Robert W. Lissitz, Nan Zhang, Ming Li, Qingshu Xu, Sydney Peters
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.26431
Pdf URL: https://arxiv.org/pdf/2509.26431
Copy Paste: [[2509.26431]] Text-Based Approaches to Item Alignment to Content Standards in Large-Scale Reading & Writing Tests(https://arxiv.org/abs/2509.26431)
Keywords: language model
Abstract: Aligning test items to content standards is a critical step in test development to collect validity evidence based on content. Item alignment has typically been conducted by human experts. This judgmental process can be subjective and time-consuming. This study investigated the performance of fine-tuned small language models (SLMs) for automated item alignment using data from a large-scale standardized reading and writing test for college admissions. Different SLMs were trained for alignment at the domain and skill levels, respectively, with 10 skills mapped to 4 content domains. The model performance was evaluated on multiple criteria on two testing datasets. The impact of types and sizes of the input data for training was investigated. Results showed that including more item text data led to substantially better model performance, surpassing the improvements induced by sample size increase alone. For comparison, supervised machine learning models were trained using the embeddings from the multilingual-E5-large-instruct model. The study results showed that fine-tuned SLMs consistently outperformed the embedding-based supervised machine learning models, particularly for the more fine-grained skill alignment. To better understand model misclassifications, multiple semantic similarity analyses, including pairwise cosine similarity, Kullback-Leibler divergence of embedding distributions, and two-dimensional projections of item embeddings, were conducted. These analyses consistently showed that certain skills in SAT and PSAT were semantically too close, providing evidence for the observed misclassification.
摘要：将测试项目与内容标准保持一致是测试开发的关键步骤，以根据内容收集有效性证据。项目对齐通常是由人类专家进行的。这个判断力的过程可能是主观的且耗时的。这项研究调查了使用大型大学入学的大规模标准化阅读和写作测试中的数据进行微调小语言模型（SLM）进行自动项目对齐的性能。分别在域和技能水平上分别对不同的SLM进行了对齐的培训，其中10个技能映射到4个内容域。在两个测试数据集上以多个标准评估了模型性能。研究了输入数据的类型和大小用于培训的影响。结果表明，包括更多的项目文本数据导致了基本上更好的模型性能，超过了单独增加样本量引起的改进。为了进行比较，使用来自多语言 - E5大型教学模型的嵌入式培训了监督的机器学习模型。研究结果表明，微调的SLM始终优于基于嵌入的监督机器学习模型，尤其是对于更细粒度的技能对齐方式。为了更好地理解模型错误分类，进行了多个语义相似性分析，包括成对余弦相似性，嵌入分布的kullback-leibler差异以及对项目嵌入的二维投影。这些分析一致地表明，SAT和PSAT中的某些技能在语义上太近了，为观察到的错误分类提供了证据。
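
Illustrative sketch (not from the paper): the pairwise cosine similarity analysis mentioned in the abstract can be pictured as comparing the average embedding of each skill. The code below computes skill centroids and their pairwise cosine similarities. The skill names, the 2-dimensional toy vectors, and the use of centroids are assumptions for the example; the study's actual embeddings and aggregation are not described in the abstract.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def pairwise_skill_similarity(embeddings_by_skill):
    """Map each unordered skill pair to the cosine similarity of their centroids."""
    centroids = {skill: centroid(vecs) for skill, vecs in embeddings_by_skill.items()}
    return {
        (a, b): cosine(centroids[a], centroids[b])
        for a, b in combinations(sorted(centroids), 2)
    }


# Hypothetical item embeddings grouped by (made-up) skill labels.
toy = {
    "skill_A": [[1.0, 0.0], [1.0, 0.2]],
    "skill_B": [[0.9, 0.1]],
    "skill_C": [[0.0, 1.0]],
}
for pair, similarity in pairwise_skill_similarity(toy).items():
    print(pair, round(similarity, 3))
```

Skill pairs whose centroids have similarity close to 1 are the ones a classifier is most likely to confuse, which is the kind of evidence the abstract cites for the observed misclassifications.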

Title: Adaptive Planning for Multi-Attribute Controllable Summarization with Monte Carlo Tree Search

Authors: Sangwon Ryu, Heejin Do, Yunsu Kim, Gary Geunbae Lee, Jungseul Ok
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.26435
Pdf URL: https://arxiv.org/pdf/2509.26435
Copy Paste: [[2509.26435]] Adaptive Planning for Multi-Attribute Controllable Summarization with Monte Carlo Tree Search(https://arxiv.org/abs/2509.26435)
Keywords: language model, llm
Abstract: Controllable summarization moves beyond generic outputs toward human-aligned summaries guided by specified attributes. In practice, the interdependence among attributes makes it challenging for language models to satisfy correlated constraints consistently. Moreover, previous approaches often require per-attribute fine-tuning, limiting flexibility across diverse summary attributes. In this paper, we propose adaptive planning for multi-attribute controllable summarization (PACO), a training-free framework that reframes the task as planning the order of sequential attribute control with a customized Monte Carlo Tree Search (MCTS). In PACO, nodes represent summaries, and actions correspond to single-attribute adjustments, enabling progressive refinement of only the attributes requiring further control. This strategy adaptively discovers optimal control orders, ultimately producing summaries that effectively meet all constraints. Extensive experiments across diverse domains and models demonstrate that PACO achieves robust multi-attribute controllability, surpassing both LLM-based self-planning models and fine-tuned baselines. Remarkably, PACO with Llama-3.2-1B rivals the controllability of the much larger Llama-3.3-70B baselines. With larger models, PACO achieves superior control performance, outperforming all competitors.
摘要：可控的摘要超出了以指定属性为指导的人吻合的摘要的通用输出。在实践中，属性之间的相互依赖性使语言模型始终如一地满足相关约束的挑战。此外，以前的方法通常需要进行每项微调，从而限制了各种摘要属性的灵活性。在本文中，我们提出了多属性可控摘要（PACO）的自适应计划，这是一个无训练的框架，将任务重新缩放为使用自定义的蒙特卡洛树搜索（MCTS）计划顺序属性控件的顺序。在PACO中，节点代表摘要，而动作对应于单属性调整，从而仅逐步完善需要进一步控制的属性。该策略可适应地发现最佳控制订单，最终产生有效符合所有限制的摘要。跨不同领域和模型的广泛实验表明，PACO可实现强大的多属性可控性，超过了基于LLM的自我计划模型和微调基线。值得注意的是，与Llama-3.2-1b的PACO可竞争更大的Llama-3.3-70B基准的可控性。借助较大的型号，PACO实现了出色的控制表现，表现优于所有竞争对手。

Title: CreAgentive: An Agent Workflow Driven Multi-Category Creative Generation Engine

Authors: Yuyang Cheng, Linyue Cai, Changwei Peng, Yumiao Xu, Rongfang Bie, Yong Zhao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.26461
Pdf URL: https://arxiv.org/pdf/2509.26461
Copy Paste: [[2509.26461]] CreAgentive: An Agent Workflow Driven Multi-Category Creative Generation Engine(https://arxiv.org/abs/2509.26461)
Keywords: language model, agent
Abstract: We present CreAgentive, an agent workflow driven multi-category creative generation engine that addresses four key limitations of contemporary large language models in writing stories, drama and other categories of creatives: restricted genre diversity, insufficient output length, weak narrative coherence, and inability to enforce complex structural constructs. At its core, CreAgentive employs a Story Prototype, which is a genre-agnostic, knowledge graph-based narrative representation that decouples story logic from stylistic realization by encoding characters, events, and environments as semantic triples. CreAgentive engages a three-stage agent workflow that comprises: an Initialization Stage that constructs a user-specified narrative skeleton; a Generation Stage in which long- and short-term objectives guide multi-agent dialogues to instantiate the Story Prototype; a Writing Stage that leverages this prototype to produce multi-genre text with advanced structures such as retrospection and foreshadowing. This architecture reduces storage redundancy and overcomes the typical bottlenecks of long-form generation. In extensive experiments, CreAgentive generates thousands of chapters with stable quality and low cost (less than $1 per 100 chapters) using a general-purpose backbone model. To evaluate performance, we define a two-dimensional framework with 10 narrative indicators measuring both quality and length. Results show that CreAgentive consistently outperforms strong baselines and achieves robust performance across diverse genres, approaching the quality of human-authored novels.
摘要：我们提出了Creagentive，这是一种代理工作流动的多类创意生成引擎，该引擎介绍了在写作，戏剧和其他类别的创作者中，解决当代大型语言模型的四个关键局限性：限制流派多样性，不足的输出长度，较弱的叙事连贯性以及无法实施复杂的结构结构。 Creagentive以故事原型为核心，这是一种基于类型的基于知识图的叙事表示形式，它通过编码角色，事件和环境作为语义三元组来将故事逻辑与风格实现相融合。 Creagentive参与了一个三阶段的代理工作流，该工作流程包括：构建用户指定叙事骨架的初始化阶段；长期和短期目标指导多代理对话以实例化故事原型的一代阶段；一个由该原型制作出具有先进结构（例如回顾和预先设计的高级结构）的写作阶段。该体系结构减少了存储冗余，并克服了长期生成的典型瓶颈。在广泛的实验中，Creagentive通过通用骨干模型产生了数千章，质量稳定，成本低（每100章少于1美元）（每100章少于1美元）。为了评估性能，我们定义了一个二维框架，其中有10个叙事指标可以测量质量和长度。结果表明，杂乱无章的表现始终优于强大的基准，并在各种流派中取得了良好的表现，接近了人为实现的小说的质量。

Title: Regression Language Models for Code

Authors: Yash Akhauri, Xingyou Song, Arissa Wongpanich, Bryan Lewandowski, Mohamed S. Abdelfattah
Subjects: cs.CL, cs.AI, cs.LG, cs.PF, cs.SE
Abstract URL: https://arxiv.org/abs/2509.26476
Pdf URL: https://arxiv.org/pdf/2509.26476
Copy Paste: [[2509.26476]] Regression Language Models for Code(https://arxiv.org/abs/2509.26476)
Keywords: language model
Abstract: We study code-to-metric regression: predicting numeric outcomes of code executions, a challenging task due to the open-ended nature of programming languages. While prior methods have resorted to heavy and domain-specific feature engineering, we show that a single unified Regression Language Model (RLM) can simultaneously predict, directly from text, (i) the memory footprint of code across multiple high-level languages such as Python and C++, (ii) the latency of Triton GPU kernels, and (iii) the accuracy and speed of trained neural networks represented in ONNX. In particular, a relatively small 300M-parameter RLM initialized from T5Gemma obtains > 0.9 Spearman-rank on competitive programming submissions from APPS, and a single unified model achieves > 0.5 average Spearman-rank across 17 separate languages from CodeNet. Furthermore, the RLM can obtain the highest average Kendall-Tau of 0.46 on five classic NAS design spaces previously dominated by graph neural networks, and simultaneously predict architecture latencies on numerous hardware platforms.
摘要：我们研究代码对金属回归：预测代码执行的数字结果，这是由于编程语言的开放性质而导致的一项具有挑战性的任务。 While prior methods have resorted to heavy and domain-specific feature engineering, we show that a single unified Regression Language Model (RLM) can simultaneously predict directly from text, (i) the memory footprint of code across multiple high-level languages such as Python and C++, (ii) the latency of Triton GPU kernels, and (iii) the accuracy and speed of trained neural networks represented in ONNX.特别是，从T5GEMMA初始化的相对较小的300m参数RLM在来自应用程序的竞争性编程提交中获得了> 0.9 Spearman级别，并且单个统一模型在来自CodeNet的17个单独语言中达到了> 0.5平均长矛阵列。此外，RLM可以在先前以图形神经网络主导的五个经典NAS设计空间上获得0.46的最高平均肯德尔-TAU，并同时预测众多硬件平台上的建筑潜伏期。
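
The Spearman-rank figures quoted in the abstract above measure how well predicted values preserve the ordering of measured values. A minimal sketch of that metric in plain Python follows; the function names and the sample numbers are illustrative assumptions, not taken from the paper.

```python
import math
from typing import Sequence


def average_ranks(values: Sequence[float]) -> list[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; raises if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman(predicted: Sequence[float], measured: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    return pearson(average_ranks(predicted), average_ranks(measured))


# Hypothetical predicted vs. measured memory footprints (arbitrary units).
predicted = [1.2, 3.4, 2.2, 9.0]
measured = [10.0, 30.0, 25.0, 80.0]
score = spearman(predicted, measured)
```

Because only the ordering matters, a model can score highly on this metric even when its absolute predictions are off by a constant factor.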

Title: dParallel: Learnable Parallel Decoding for dLLMs

Authors: Zigeng Chen, Gongfan Fang, Xinyin Ma, Ruonan Yu, Xinchao Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.26488
Pdf URL: https://arxiv.org/pdf/2509.26488
Copy Paste: [[2509.26488]] dParallel: Learnable Parallel Decoding for dLLMs(https://arxiv.org/abs/2509.26488)
Keywords: language model, llm
Abstract: Diffusion large language models (dLLMs) have recently drawn considerable attention within the research community as a promising alternative to autoregressive generation, offering parallel token prediction and lower inference latency. Yet, their parallel decoding potential remains largely underexplored, as existing open-source models still require nearly token-length decoding steps to ensure performance. To address this, we introduce dParallel, a simple and effective method that unlocks the inherent parallelism of dLLMs for fast sampling. We identify that the key bottleneck to parallel decoding arises from the sequential certainty convergence for masked tokens. Building on this insight, we introduce the core of our approach: certainty-forcing distillation, a novel training strategy that distills the model to follow its original sampling trajectories while enforcing it to achieve high certainty on masked tokens more rapidly and in parallel. Extensive experiments across various benchmarks demonstrate that our method can dramatically reduce the number of decoding steps while maintaining performance. When applied to the LLaDA-8B-Instruct model, dParallel reduces decoding steps from 256 to 30 on GSM8K, achieving an 8.5x speedup without performance degradation. On the MBPP benchmark, it cuts decoding steps from 256 to 24, resulting in a 10.5x speedup while maintaining accuracy. Our code is available at this https URL
摘要：扩散大语言模型（DLLM）最近在研究界引起了人们的关注，作为自回归产生的有前途的替代方案，提供了平行的令牌预测和较低的推理潜伏期。然而，它们的平行解码势在很大程度上仍然没有被解散，因为现有的开源模型仍然需要几乎需要代币的解码步骤来确保性能。为了解决这个问题，我们介绍了一种简单有效的方法，它可以解锁DLLM的固有并行性以进行快速采样。我们确定并行解码的关键瓶颈是由蒙版令牌的顺序确定性收敛引起的。在这种见解的基础上，我们介绍了方法的核心：确定性蒸馏，一种新颖的训练策略，它蒸馏出该模型以遵循其原始采样轨迹，同时强制实现它以更快地和平行地对掩盖的代币实现高确定性。各种基准的广泛实验表明，我们的方法可以在保持性能的同时大大减少解码步骤的数量。当应用于LLADA-8B - 教学模型时，DPARALALL将在GSM8K上的解码步骤从256降低到30，实现了8.5倍的加速，而无需性能降解。在MBPP基准测试中，它将解码步骤从256降低到24，在保持准确性的同时，速度为10.5倍。我们的代码可在此HTTPS URL上找到

Title: VitaBench: Benchmarking LLM Agents with Versatile Interactive Tasks in Real-world Applications

Authors: Wei He, Yueqing Sun, Hongyan Hao, Xueyuan Hao, Zhikang Xia, Qi Gu, Chengcheng Han, Dengchang Zhao, Hui Su, Kefeng Zhang, Man Gao, Xi Su, Xiaodong Cai, Xunliang Cai, Yu Yang, Yunke Zhao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.26490
Pdf URL: https://arxiv.org/pdf/2509.26490
Copy Paste: [[2509.26490]] VitaBench: Benchmarking LLM Agents with Versatile Interactive Tasks in Real-world Applications(https://arxiv.org/abs/2509.26490)
Keywords: llm, agent
Abstract: As LLM-based agents are increasingly deployed in real-life scenarios, existing benchmarks fail to capture their inherent complexity of handling extensive information, leveraging diverse resources, and managing dynamic user interactions. To address this gap, we introduce VitaBench, a challenging benchmark that evaluates agents on versatile interactive tasks grounded in real-world settings. Drawing from daily applications in food delivery, in-store consumption, and online travel services, VitaBench presents agents with the most complex life-serving simulation environment to date, comprising 66 tools. Through a framework that eliminates domain-specific policies, we enable flexible composition of these scenarios and tools, yielding 100 cross-scenario tasks (main results) and 300 single-scenario tasks. Each task is derived from multiple real user requests and requires agents to reason across temporal and spatial dimensions, utilize complex tool sets, proactively clarify ambiguous instructions, and track shifting user intent throughout multi-turn conversations. Moreover, we propose a rubric-based sliding window evaluator, enabling robust assessment of diverse solution pathways in complex environments and stochastic interactions. Our comprehensive evaluation reveals that even the most advanced models achieve only 30% success rate on cross-scenario tasks, and less than 50% success rate on others. Overall, we believe VitaBench will serve as a valuable resource for advancing the development of AI agents in practical real-world applications. The code, dataset, and leaderboard are available at this https URL
摘要：由于基于LLM的代理商越来越多地部署在现实生活中，因此现有的基准无法捕获其固有的处理广泛信息，利用多样化资源以及管理动态用户交互的固有复杂性。为了解决这一差距，我们介绍了Vitabench，这是一个充满挑战的基准，可评估代理在现实世界中基于的多功能交互式任务。 Vitabench从食品交付，店内消费和在线旅行服务中的日常应用中汲取灵感，为代理商提供了迄今为止最复杂的寿命模拟环境，其中包括66个工具。通过消除特定领域策略的框架，我们可以灵活地组成这些方案和工具，从而产生100个跨幕组的任务（主要结果）和300个单幕纳里奥任务。每个任务均来自多个真实的用户请求，并要求代理在时间和空间维度上进行推理，利用复杂的工具集，主动澄清模棱两可的说明以及在整个多转交流中跟踪用户意图的轨迹转移。此外，我们提出了一个基于标语的滑动窗口评估器，从而在复杂的环境和随机相互作用中对各种解决方案途径进行了强有力的评估。我们的全面评估表明，即使是最先进的模型，跨筛查任务的成功率也仅达到30％，而其他模型的成功率不到50％。总体而言，我们认为Vitabench将是推进实际现实应用程序中AI代理的发展的宝贵资源。代码，数据集和排行榜可在此HTTPS URL上找到

Title: BatonVoice: An Operationalist Framework for Enhancing Controllable Speech Synthesis with Linguistic Intelligence from LLMs

Authors: Yue Wang, Ruotian Ma, Xingyu Chen, Zhengliang Shi, Wanshun Chen, Huang Liu, Jiadi Yao, Qu Yang, Qingxuan Jiang, Fanghua Ye, Juntao Li, Min Zhang, Zhaopeng Tu, Xiaolong Li, Linus
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.26514
Pdf URL: https://arxiv.org/pdf/2509.26514
Copy Paste: [[2509.26514]] BatonVoice: An Operationalist Framework for Enhancing Controllable Speech Synthesis with Linguistic Intelligence from LLMs(https://arxiv.org/abs/2509.26514)
Keywords: language model, llm
Abstract: The rise of Large Language Models (LLMs) is reshaping multimodal models, with speech synthesis being a prominent application. However, existing approaches often underutilize the linguistic intelligence of these models, typically failing to leverage their powerful instruction-following capabilities. This limitation hinders the model's ability to follow text instructions for controllable Text-to-Speech (TTS). To address this, we propose a new paradigm inspired by "operationalism" that decouples instruction understanding from speech generation. We introduce BatonVoice, a framework where an LLM acts as a "conductor", understanding user instructions and generating a textual "plan": explicit vocal features (e.g., pitch, energy). A separate TTS model, the "orchestra", then generates the speech from these features. To realize this component, we develop BatonTTS, a TTS model trained specifically for this task. Our experiments demonstrate that BatonVoice achieves strong performance in controllable and emotional speech synthesis, outperforming strong open- and closed-source baselines. Notably, our approach enables remarkable zero-shot cross-lingual generalization, accurately applying feature control abilities to languages unseen during post-training. This demonstrates that objectifying speech into textual vocal features can more effectively unlock the linguistic intelligence of LLMs.
摘要：大语言模型（LLM）的兴起正在重塑多模型模型，语音合成是一个突出的应用。但是，现有的方法通常使这些模型的语言智能不足，通常无法利用其强大的指导遵循能力。此限制阻碍了该模型遵循文本说明的能力，以控制可控的文本到语音〜（TTS）。为了解决这个问题，我们提出了一个受``操作主义''启发的新范式，该范式将教学理解与语音产生相反。我们介绍了Batonvoice，这是一个框架，其中LLM充当``指挥''，了解用户指令并生成文本``PLAN'' - 明确的人声功能（例如，音调，能量）。单独的TTS模型``乐团'，然后从这些功能中产生演讲。为了实现此组件，我们开发了Batontts，这是专门针对此任务训练的TTS模型。我们的实验表明，Batonvoice在可控和情感的语音综合中取得了强劲的表现，表现优于强大的开放和封闭式基线。值得注意的是，我们的方法可以实现零拍的惊人跨语性概括，从而准确地将功能控制能力应用于训练后在训练期间看不见的语言。这表明，将语音客观化成文本人声特征可以更有效地解锁LLM的语言智能。

Title: Training Matryoshka Mixture-of-Experts for Elastic Inference-Time Expert Utilization

Authors: Yaoxiang Wang, Qingguo Hu, Yucheng Ding, Ruizhe Wang, Yeyun Gong, Jian Jiao, Yelong Shen, Peng Cheng, Jinsong Su
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.26520
Pdf URL: https://arxiv.org/pdf/2509.26520
Copy Paste: [[2509.26520]] Training Matryoshka Mixture-of-Experts for Elastic Inference-Time Expert Utilization(https://arxiv.org/abs/2509.26520)
Keywords: language model
Abstract: Mixture-of-Experts (MoE) has emerged as a promising paradigm for efficiently scaling large language models without a proportional increase in computational cost. However, the standard training strategy of Top-K router prevents MoE models from realizing their full potential for elastic inference. When the number of activated experts is altered at inference time, these models exhibit precipitous performance degradation. In this work, we introduce Matryoshka MoE (M-MoE), a training framework that instills a coarse-to-fine structure directly into the expert ensemble. By systematically varying the number of activated experts during training, M-MoE compels the model to learn a meaningful ranking: top-ranked experts collaborate to provide essential, coarse-grained capabilities, while subsequent experts add progressively finer-grained detail. We explore this principle at multiple granularities, identifying a layer-wise randomization strategy as the most effective. Our experiments demonstrate that a single M-MoE model achieves remarkable elasticity, with its performance at various expert counts closely matching that of an entire suite of specialist models, but at only a fraction of the total training cost. This flexibility not only unlocks elastic inference but also enables optimizing performance by allocating different computational budgets to different model layers. Our work paves the way for more practical and adaptable deployments of large-scale MoE models.
摘要：Experts（MOE）的混合物已成为有前途的范式，用于有效地扩展大型语言模型，而无需成比例的计算成本增加。但是，Top-K路由器的标准训练策略可阻止MoE模型实现其全部弹性推理的潜力。当推断时激活专家的数量发生变化时，这些模型会表现出巨大的性能降解。在这项工作中，我们介绍了Matryoshka Moe（M-MoE），该培训框架将粗到1的结构直接灌输到专家合奏中。通过系统地改变培训期间激活的专家的数量，M-MOE迫使该模型学习有意义的排名：排名最高的专家合作以提供必不可少的粗糙颗粒功能，而随后的专家则添加了逐渐细粒的细节。我们以多种粒度探讨了这一原理，将层次随机化策略确定为最有效的策略。我们的实验表明，单个M-MOE模型实现了显着的弹性，其在各种专家的表现与整个专家模型套件的表现非常匹配，但仅占总培训成本的一小部分。这种灵活性不仅可以解锁弹性推理，还可以通过将不同的计算预算分配给不同的模型层来优化性能。我们的工作为大型Moe模型的更实际和适应性的部署铺平了道路。
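
The abstract above describes varying the number of activated experts during training, with a layer-wise randomization strategy working best. The sketch below illustrates the general idea only: the gate renormalization, the uniform sampling of per-layer budgets, and all function names are assumptions made for illustration, not details confirmed by the paper.

```python
import math
import random


def top_k_gate(router_logits: list[float], k: int) -> dict[int, float]:
    """Keep the k highest-scoring experts and softmax-normalize their logits."""
    if not 1 <= k <= len(router_logits):
        raise ValueError("k must be between 1 and the number of experts")
    ranked = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    peak = max(router_logits[i] for i in chosen)  # subtract the max for stability
    weights = {i: math.exp(router_logits[i] - peak) for i in chosen}
    total = sum(weights.values())
    return {i: w / total for i, w in weights.items()}


def sample_layer_budgets(num_layers: int, k_min: int, k_max: int,
                         rng: random.Random) -> list[int]:
    """Draw an independent expert count per layer (uniform sampling is an assumption)."""
    return [rng.randint(k_min, k_max) for _ in range(num_layers)]


# Hypothetical router logits for one token over four experts.
logits = [2.0, 0.5, 1.0, -1.0]
gate_top1 = top_k_gate(logits, 1)
gate_top2 = top_k_gate(logits, 2)

# One training step might use a different expert budget in every layer.
budgets = sample_layer_budgets(num_layers=4, k_min=1, k_max=3, rng=random.Random(0))
```

Since experts are selected by rank, the set chosen for a smaller k is always contained in the set chosen for a larger k, which is the nesting that the "Matryoshka" name alludes to.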

Title: OceanGym: A Benchmark Environment for Underwater Embodied Agents

Authors: Yida Xue, Mingjun Mao, Xiangyuan Ru, Yuqi Zhu, Baochang Ren, Shuofei Qiao, Mengru Wang, Shumin Deng, Xinyu An, Ningyu Zhang, Ying Chen, Huajun Chen
Subjects: cs.CL, cs.AI, cs.CV, cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2509.26536
Pdf URL: https://arxiv.org/pdf/2509.26536
Copy Paste: [[2509.26536]] OceanGym: A Benchmark Environment for Underwater Embodied Agents(https://arxiv.org/abs/2509.26536)
Keywords: language model, llm, agent
Abstract: We introduce OceanGym, the first comprehensive benchmark for ocean underwater embodied agents, designed to advance AI in one of the most demanding real-world environments. Unlike terrestrial or aerial domains, underwater settings present extreme perceptual and decision-making challenges, including low visibility and dynamic ocean currents, making effective agent deployment exceptionally difficult. OceanGym encompasses eight realistic task domains and a unified agent framework driven by Multi-modal Large Language Models (MLLMs), which integrates perception, memory, and sequential decision-making. Agents are required to comprehend optical and sonar data, autonomously explore complex environments, and accomplish long-horizon objectives under these harsh conditions. Extensive experiments reveal substantial gaps between state-of-the-art MLLM-driven agents and human experts, highlighting the persistent difficulty of perception, planning, and adaptability in ocean underwater environments. By providing a high-fidelity, rigorously designed platform, OceanGym establishes a testbed for developing robust embodied AI and transferring these capabilities to real-world autonomous ocean underwater vehicles, marking a decisive step toward intelligent agents capable of operating in one of Earth's last unexplored frontiers. The code and data are available at this https URL.
摘要：我们介绍了OceanGym，这是海洋水下体现代理的第一个综合基准，旨在在最苛刻的现实环境之一中推进AI。与陆地或空中域不同，水下环境提出了极端的感知和决策挑战，包括知名度低，动态洋流，使有效的代理部署异常困难。 OceanGym涵盖了八个现实的任务域和一个由多模式大型语言模型（MLLM）驱动的统一代理框架，该框架集成了感知，记忆和顺序决策。需要代理来理解光学和声纳数据，自主探索复杂的环境，并在这些恶劣条件下实现长跑目标。广泛的实验揭示了最先进的MLLM驱动代理商与人类专家之间的差距，从而强调了海洋水下环境中的感知，计划和适应性的持续困难。通过提供高保真，严格设计的平台，OceanGym建立了一个测试床，用于开发强大的体现AI，并将这些功能转移到现实世界中自动自动的海洋水下车辆上，这标志着能够在地球上最后一个未探索的边界之一中朝着能够在地球上运行的智能试剂的决定性步骤。该代码和数据可在此HTTPS URL上找到。

Title: Towards Reliable Benchmarking: A Contamination Free, Controllable Evaluation Framework for Multi-step LLM Function Calling

Authors: Seiji Maekawa, Jackson Hassell, Pouya Pezeshkpour, Tom Mitchell, Estevam Hruschka
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.26553
Pdf URL: https://arxiv.org/pdf/2509.26553
Copy Paste: [[2509.26553]] Towards Reliable Benchmarking: A Contamination Free, Controllable Evaluation Framework for Multi-step LLM Function Calling(https://arxiv.org/abs/2509.26553)
Keywords: language model, gpt, llm, agent
Abstract: As language models gain access to external tools via structured function calls, they become increasingly capable of solving complex, multi-step tasks. However, existing benchmarks for tool-augmented language models (TaLMs) provide insufficient control over factors such as the number of functions accessible, task complexity, and input size, and remain vulnerable to data contamination. We present FuncBenchGen, a unified, contamination-free framework that evaluates TaLMs by generating synthetic multi-step tool-use tasks. The key idea is to cast tool use as traversal over a hidden function-dependency DAG where nodes are function calls and an edge between nodes represents one function consuming the output of another. Given a set of external function schemas, initial variable values, and a target variable, models must compose the correct call sequence to compute the target variable. FuncBenchGen allows users to precisely control task difficulty (e.g., graph size, dependency depth, and distractor functions) while avoiding data leakage. We apply our FuncBenchGen framework to evaluate seven LLMs on tool-use tasks of varying difficulty. Reasoning-optimized models consistently outperform general-purpose models, with GPT-5 significantly outperforming other models. Performance declines sharply as dependency depth increases. Furthermore, connected irrelevant functions prove especially difficult to handle. We find that strong models often make syntactically valid function calls but propagate incorrect or stale argument values across steps, revealing brittle state tracking by LLMs in multi-turn tool use. Motivated by this observation, we introduce a simple mitigation strategy that explicitly restates prior variable values to the agent at each step. Surprisingly, this lightweight change yields substantial gains across models, e.g., improving the success rate from 62.5% to 81.3% for GPT-5.
摘要：随着语言模型通过结构化功能调用访问外部工具，它们越来越能够解决复杂的多步任务。但是，现有的工具增强语言模型（TALMS）的基准提供了对诸如可访问功能数量，任务复杂性和输入大小等因素的控制不足，并且仍然容易受到数据污染的影响。我们提出了Funcbenchgen，这是一个统一的无污染框架，可以通过生成合成多步工具使用任务来评估手术。关键想法是将工具用作横向遍历的遍历，在一个隐藏的函数依赖性dag上，节点是函数呼叫，而节点之间的边缘代表一个函数，一个函数消耗了另一个函数。给定一组外部函数模式，初始变量值和目标变量，模型必须组成正确的调用序列才能计算目标变量。 funcbenchgen允许用户精确控制任务难度（例如，图形大小，依赖关系深度和干扰物功能），同时避免数据泄漏。我们应用funcbenchgen框架来评估七个LLM在工具上使用改变难度的任务。推理优化的模型始终超过GPT-5的通用模型明显优于其他模型。随着依赖深度的增加，性能急剧下降。此外，连接的无关函数证明尤其难以处理。我们发现，强大的模型通常会进行句法有效的函数调用，但跨步骤传播错误或陈旧的参数值，从而揭示了LLMS在多转移工具中的脆弱状态跟踪。在此观察过程中，我们引入了一种简单的缓解策略，该策略在每个步骤中都向代理明确重述了先前的变量值。令人惊讶的是，这种轻巧的变化在模型之间产生了可观的增长。例如，GPT-5的成功率从62.5％提高到81.3％。
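
The task structure described in the abstract above (function schemas, initial variable values, a target variable, and a dependency DAG with distractors) can be illustrated with a small reference solver. This is a sketch under stated assumptions: the schema layout, the function names such as `f_double`, and the toy arithmetic are all hypothetical and are not FuncBenchGen's actual format.

```python
from typing import Callable

# Hypothetical schema: output variable -> (function name, input variables, implementation).
Schema = dict[str, tuple[str, list[str], Callable[..., int]]]


def plan_calls(schema: Schema, known: set[str], target: str) -> list[str]:
    """Return output variables in an order that computes `target` from `known`."""
    order: list[str] = []
    resolved = set(known)
    visiting: set[str] = set()

    def visit(var: str) -> None:
        if var in resolved:
            return
        if var not in schema:
            raise KeyError(f"no function produces {var!r}")
        if var in visiting:
            raise ValueError("dependency graph contains a cycle")
        visiting.add(var)
        _, inputs, _ = schema[var]
        for dep in inputs:  # resolve every input before the function itself
            visit(dep)
        visiting.discard(var)
        resolved.add(var)
        order.append(var)

    visit(target)
    return order


def execute(schema: Schema, initial: dict[str, int], target: str) -> tuple[int, list[str]]:
    """Run the planned calls and return the target value plus the call sequence."""
    values = dict(initial)
    calls: list[str] = []
    for var in plan_calls(schema, set(initial), target):
        name, inputs, fn = schema[var]
        values[var] = fn(*(values[i] for i in inputs))
        calls.append(name)
    return values[target], calls


schema: Schema = {
    "b": ("f_double", ["a"], lambda a: 2 * a),
    "c": ("f_add", ["a", "b"], lambda a, b: a + b),
    "d": ("f_square", ["c"], lambda c: c * c),
    "z": ("f_distractor", ["b"], lambda b: b - 1),  # connected but irrelevant to "d"
}
result, calls = execute(schema, {"a": 3}, "d")  # result == 81
```

The distractor `f_distractor` consumes a variable on the path to the target but is never needed, which mirrors the "connected irrelevant functions" the abstract reports as especially hard for models.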

Title: Generating Difficult-to-Translate Texts

Authors: Vilém Zouhar, Wenda Xu, Parker Riley, Juraj Juraska, Mara Finkelstein, Markus Freitag, Dan Deutsch
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.26592
Pdf URL: https://arxiv.org/pdf/2509.26592
Copy Paste: [[2509.26592]] Generating Difficult-to-Translate Texts(https://arxiv.org/abs/2509.26592)
Keywords: language model, llm
Abstract: Machine translation benchmarks sourced from the real world are quickly obsoleted, due to most examples being easy for state-of-the-art translation models. This limits the benchmark's ability to distinguish which model is better or to reveal models' weaknesses. Current methods for creating difficult test cases, such as subsampling or from-scratch synthesis, either fall short of identifying difficult examples or suffer from a lack of diversity and naturalness. Inspired by the iterative process of human experts probing for model failures, we propose MT-breaker, a method where a large language model iteratively refines a source text to increase its translation difficulty. The LLM iteratively queries a target machine translation model to guide its generation of difficult examples. Our approach generates examples that are more challenging for the target MT model while preserving the diversity of natural texts. While the examples are tailored to a particular machine translation model during the generation, the difficulty also transfers to other models and languages.
摘要：从现实世界中采购的机器翻译基准很快就被淘汰，因为大多数示例对于最先进的翻译模型都很容易。这限制了基准测定哪种模型更好或揭示模型的弱点的能力。当前创建困难的测试用例（例如次采样或造口综合）的方法，要么没有确定困难的例子，要么遭受了缺乏多样性和自然性的困扰。受到人类专家探索模型失败的迭代过程的启发，我们提出了Break-breaker，这种方法是一种大型语言模型迭代地改进源文本以增加其翻译困难的方法。 LLM迭代查询目标机器翻译模型，以指导其生成困难的示例。我们的方法产生的例子对目标MT模型更具挑战性，同时保留了自然文本的多样性。尽管这些示例是在一代人期间针对特定机器翻译模型量身定制的，但困难也会传输到其他模型和语言。

Title: Deconstructing Self-Bias in LLM-generated Translation Benchmarks

Authors: Wenda Xu, Sweta Agrawal, Vilém Zouhar, Markus Freitag, Daniel Deutsch
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.26600
Pdf URL: https://arxiv.org/pdf/2509.26600
Copy Paste: [[2509.26600]] Deconstructing Self-Bias in LLM-generated Translation Benchmarks(https://arxiv.org/abs/2509.26600)
Keywords: language model, llm
Abstract: As large language models (LLMs) begin to saturate existing benchmarks, automated benchmark creation using LLMs (LLM as a benchmark) has emerged as a scalable alternative to slow and costly human curation. While these generated test sets have the potential to cheaply rank models, we demonstrate a critical flaw. LLM-generated benchmarks systematically favor the model that created the benchmark: they exhibit self-bias on translation tasks from low-resource languages into English. We show three key findings on automatic benchmarking of LLMs for translation. First, this bias originates from two sources: the generated test data (LLM as a testset) and the evaluation method (LLM as an evaluator), with their combination amplifying the effect. Second, self-bias in LLM as a benchmark is heavily influenced by the model's generation capabilities in the source language. For instance, we observe more pronounced bias in into-English translation, where the model's generation system is developed, than in out-of-English translation tasks. Third, we observe that low diversity in source text is one contributor to self-bias. Our results suggest that improving the diversity of these generated source texts can mitigate some of the observed self-bias.
摘要：随着大型语言模型（LLMS）开始饱和现有的基准，使用LLMS（LLM作为基准）的自动基准创建已成为可扩展的替代方法，以缓慢且昂贵的人类策划。尽管这些生成的测试集必须潜在地对模型进行廉价排名，但我们证明了一个关键缺陷。 LLM生成的基准有系统地利用创建基准的模型，它们对低资源语言表现出对英语翻译任务的自偏见。我们展示了有关翻译自动基准测试的三个关键发现：首先，该偏见源自两个来源：生成的测试数据（LLM作为测试集）和评估方法（LLM作为评估器），并及其组合放大了效果。其次，LLM作为基准的自我偏见受模型在源语言中的生成能力的影响很大。例如，我们观察到与英语翻译任务相比，在开发模型的生成系统的英文翻译中更明显的偏见。第三，我们观察到源文本的多样性低是自我偏见的一种归因。我们的结果表明，改善这些生成的源文本的多样性可以减轻某些观察到的自偏见。

Title: MENLO: From Preferences to Proficiency - Evaluating and Modeling Native-like Quality Across 47 Languages

Authors: Chenxi Whitehouse, Sebastian Ruder, Tony Lin, Oksana Kurylo, Haruka Takagi, Janice Lam, Nicolò Busetto, Denise Diaz
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.26601
Pdf URL: https://arxiv.org/pdf/2509.26601
Copy Paste: [[2509.26601]] MENLO: From Preferences to Proficiency - Evaluating and Modeling Native-like Quality Across 47 Languages(https://arxiv.org/abs/2509.26601)
Keywords: language model, llm, prompt
Abstract: Ensuring native-like quality of large language model (LLM) responses across many languages is challenging. To address this, we introduce MENLO, a framework that operationalizes the evaluation of native-like response quality based on audience design-inspired mechanisms. Using MENLO, we create a dataset of 6,423 human-annotated prompt-response preference pairs covering four quality dimensions with high inter-annotator agreement in 47 language varieties. Our evaluation reveals that zero-shot LLM judges benefit significantly from pairwise evaluation and our structured annotation rubrics, yet they still underperform human annotators on our dataset. We demonstrate substantial improvements through fine-tuning with reinforcement learning, reward shaping, and multi-task learning approaches. Additionally, we show that RL-trained judges can serve as generative reward models to enhance LLMs' multilingual proficiency, though discrepancies with human judgment remain. Our findings suggest promising directions for scalable multilingual evaluation and preference alignment. We release our dataset and evaluation framework to support further research in multilingual LLM evaluation.
摘要：确保多种语言中大型语言模型（LLM）响应的本地式质量具有挑战性。为了解决这个问题，我们介绍了Menlo，该框架可以根据受众设计启发的机制来运行对本地式响应质量的评估。使用Menlo，我们创建了一个由6,423个人类宣传的及时响应偏好对的数据集，该对涵盖了47种语言品种的高质量通道一致性的四个质量维度。我们的评估表明，零射门LLM法官从成对评估和我们的结构化注释主题中受益匪浅，但它们仍然不足我们数据集中的人类注释。我们通过通过增强学习，奖励塑造和多任务学习方法进行微调来展示实质性的改进。此外，我们表明，经过RL训练的法官可以作为增强LLMS多语言能力的生成奖励模型，尽管仍然存在人类判断力的差异。我们的发现表明了可扩展的多语言评估和偏好对齐方式的有希望的方向。我们发布我们的数据集和评估框架，以支持多语言LLM评估中的进一步研究。

Title: Scaling Spoken Language Models with Syllabic Speech Tokenization

Authors: Nicholas Lee, Cheol Jun Cho, Alan W Black, Gopala K. Anumanchipalli
Subjects: cs.CL, eess.AS
Abstract URL: https://arxiv.org/abs/2509.26634
Pdf URL: https://arxiv.org/pdf/2509.26634
Copy Paste: [[2509.26634]] Scaling Spoken Language Models with Syllabic Speech Tokenization(https://arxiv.org/abs/2509.26634)
Keywords: language model
Abstract: Spoken language models (SLMs) typically discretize speech into high-frame-rate tokens extracted from SSL speech models. As the most successful LMs are based on the Transformer architecture, processing these long token streams with self-attention is expensive, as attention scales quadratically with sequence length. A recent SSL work introduces acoustic tokenization of speech at the syllable level, which is more interpretable and potentially more scalable with significant compression in token lengths (4-5 Hz). However, the value of such tokens for spoken language modeling has not been fully explored. We present the first systematic study of syllabic tokenization for spoken language modeling, evaluating models on a suite of SLU benchmarks while varying training data scale. Syllabic tokens can match or surpass the previous high-frame-rate tokens while significantly cutting training and inference costs, achieving more than a 2x reduction in training time and a 5x reduction in FLOPs. Our findings highlight syllable-level language modeling as a promising path to efficient long-context spoken language models.
摘要：口语模型（SLM）通常将语音离散为从SSL语音模型中提取的高框架标记中。由于最成功的LMS基于变压器体系结构，因此以自我注意力处理这些长令牌流是昂贵的，因为注意力尺寸为序列长度四倍。最近的SSL工作引入了音节级别的语音宣传，在令牌长度（4-5 Hz）中具有明显的压缩，它更容易解释，并且可能更可扩展。但是，它们对口语建模的价值尚未得到充分探索。我们介绍了用于口语建模的课程序列化的首次系统研究，在改变训练数据量表的同时评估了SLU基准测试套件的模型。音节代币可以匹配或超过先前的高框架代币，同时显着削减培训和推理成本，从而减少训练时间2倍以上，减少了5倍。我们的发现重点介绍了音节级的语言建模，这是有效的长篇小说口语模型的有希望的途径。

Title: Convergence and Divergence of Language Models under Different Random Seeds

Authors: Finlay Fehlauer (1), Kyle Mahowald (2), Tiago Pimentel (1) ((1) ETH Zurich, (2) University of Texas at Austin)
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2509.26643
Pdf URL: https://arxiv.org/pdf/2509.26643
Copy Paste: [[2509.26643]] Convergence and Divergence of Language Models under Different Random Seeds(https://arxiv.org/abs/2509.26643)
Keywords: language model
Abstract: In this paper, we investigate the convergence of language models (LMs) trained under different random seeds, measuring convergence as the expected per-token Kullback-Leibler (KL) divergence across seeds. By comparing LM convergence as a function of model size and training checkpoint, we identify a four-phase convergence pattern: (i) an initial uniform phase, (ii) a sharp-convergence phase, (iii) a sharp-divergence phase, and (iv) a slow-reconvergence phase. Further, we observe that larger models reconverge faster in later training stages, while smaller models never actually reconverge; these results suggest that a certain model size may be necessary to learn stable distributions. Restricting our analysis to specific token frequencies or part-of-speech (PoS) tags further reveals that convergence is uneven across linguistic categories: frequent tokens and function words converge faster and more reliably than their counterparts (infrequent tokens and content words). Overall, our findings highlight factors that influence the stability of the learned distributions in model training.
摘要：在本文中，我们研究了在不同随机种子下训练的语言模型（LMS）的收敛性，并将收敛视为预期的双kullback-leibler（KL）跨种子的差异。通过比较LM收敛量与模型大小和训练检查点的函数，我们确定了四相收敛模式：（i）初始均匀阶段，（ii）尖锐的相位阶段，（iii）尖锐的差异阶段，以及（iv）慢速连接阶段。此外，我们观察到较大的模型在以后的训练阶段更快地重新验证，而较小的模型从未真正重新验证。这些结果表明，学习稳定的分布可能需要一定的模型大小。将我们的分析限制在特定的令牌频率或词性（POS）标签上进一步表明，在语言类别中收敛是不平衡的：频繁的令牌和函数单词比其对应物更快，更可靠地收敛（不频繁的令牌和内容词）。总体而言，我们的发现突出了影响模型培训中学习分布的稳定性的因素。
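
The convergence measure named in the abstract above, an expected per-token KL divergence across seeds, can be sketched as follows. The abstract does not specify the estimator, so averaging over all ordered pairs of distinct seeds and over token positions is an assumption here, as are the function names and the toy distributions.

```python
import math
from itertools import permutations


def kl_divergence(p: list[float], q: list[float]) -> float:
    """KL(p || q) in nats; assumes q > 0 wherever p > 0 (true for softmax outputs)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def expected_per_token_kl(seed_dists: list[list[list[float]]]) -> float:
    """Average KL over ordered seed pairs and token positions.

    seed_dists[s][t] is the next-token distribution that the model trained
    with seed s assigns at token position t of a shared evaluation text.
    """
    num_positions = len(seed_dists[0])
    pairs = list(permutations(range(len(seed_dists)), 2))
    total = 0.0
    for a, b in pairs:
        for t in range(num_positions):
            total += kl_divergence(seed_dists[a][t], seed_dists[b][t])
    return total / (len(pairs) * num_positions)


# Toy example: two seeds, two token positions, a three-word vocabulary.
seed_a = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
seed_b = [[0.6, 0.3, 0.1], [0.2, 0.6, 0.2]]
divergence = expected_per_token_kl([seed_a, seed_b])
agreement = expected_per_token_kl([seed_a, seed_a])
```

A value near zero means the seeds have learned nearly the same next-token distributions; larger values indicate that the models diverge on the same contexts.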