2025-09-30

Title: Are you sure? Measuring models bias in content moderation through uncertainty

Authors: Alessandra Urbinati, Mirko Lai, Simona Frenda, Marco Antonio Stranisci
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.22699
Pdf URL: https://arxiv.org/pdf/2509.22699
Copy Paste: [[2509.22699]] Are you sure? Measuring models bias in content moderation through uncertainty(https://arxiv.org/abs/2509.22699)
Keywords: language model
Abstract: Automatic content moderation is crucial to ensuring safety in social media. Language Model-based classifiers are being increasingly adopted for this task, but it has been shown that they perpetuate racial and social biases. Even if several resources and benchmark corpora have been developed to challenge this issue, measuring the fairness of models in content moderation remains an open issue. In this work, we present an unsupervised approach that benchmarks models on the basis of their uncertainty in classifying messages annotated by people belonging to vulnerable groups. We use uncertainty, computed by means of the conformal prediction technique, as a proxy to analyze the bias of 11 models against women and non-white annotators and observe to what extent it diverges from metrics based on performance, such as the $F_1$ score. The results show that some pre-trained models predict with high accuracy the labels coming from minority groups, even if the confidence in their prediction is low. Therefore, by measuring the confidence of models, we are able to see which groups of annotators are better represented in pre-trained models and lead the debiasing process of these models before their effective use.
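
Illustrative sketch, not the authors' code: the abstract uses conformal prediction as an uncertainty proxy. One common variant, split conformal prediction for classification, calibrates a threshold on held-out nonconformity scores and returns a set of labels; a larger set signals lower confidence. The calibration scores, label names, and `alpha` below are made-up values, and the paper may use a different nonconformity score.

```python
import math

def conformal_quantile(cal_scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of calibration scores."""
    n = len(cal_scores)
    # Standard (n + 1) correction, clipped so the rank never exceeds n.
    k = min(n, math.ceil((n + 1) * (1 - alpha)))
    return sorted(cal_scores)[k - 1]

def prediction_set(probs, qhat):
    """Labels whose nonconformity score 1 - p(label) is within the threshold."""
    return [label for label, p in probs.items() if 1.0 - p <= qhat]

# Hypothetical calibration scores: 1 - probability given to the true label.
cal_scores = [0.05, 0.10, 0.15, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70]
qhat = conformal_quantile(cal_scores, alpha=0.2)

# A confident prediction yields a single label; an unsure one yields both.
confident = prediction_set({"toxic": 0.95, "not_toxic": 0.05}, qhat)
uncertain = prediction_set({"toxic": 0.55, "not_toxic": 0.45}, qhat)
print(confident, uncertain)
```

Averaging the set size over the messages labeled by each annotator group would give one possible group-level uncertainty figure to compare against F1.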

Title: AccessEval: Benchmarking Disability Bias in Large Language Models

Authors: Srikant Panda, Amit Agarwal, Hitesh Laxmichand Patel
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2509.22703
Pdf URL: https://arxiv.org/pdf/2509.22703
Copy Paste: [[2509.22703]] AccessEval: Benchmarking Disability Bias in Large Language Models(https://arxiv.org/abs/2509.22703)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) are increasingly deployed across diverse domains but often exhibit disparities in how they handle real-life queries. To systematically investigate these effects within various disability contexts, we introduce AccessEval (Accessibility Evaluation), a benchmark evaluating 21 closed- and open-source LLMs across 6 real-world domains and 9 disability types using paired Neutral and Disability-Aware Queries. We evaluated model outputs with metrics for sentiment, social perception, and factual accuracy. Our analysis reveals that responses to disability-aware queries tend to have a more negative tone, increased stereotyping, and higher factual error compared to neutral queries. These effects show notable variation by domain and disability type, with disabilities affecting hearing, speech, and mobility disproportionately impacted. These disparities reflect persistent forms of ableism embedded in model behavior. By examining model performance in real-world decision-making contexts, we better illuminate how such biases can translate into tangible harms for disabled users. This framing helps bridge the gap between technical evaluation and user impact, reinforcing the importance of bias mitigation in day-to-day applications. Our dataset is publicly available at: this https URL

Title: RAR$^2$: Retrieval-Augmented Medical Reasoning via Thought-Driven Retrieval

Authors: Kaishuai Xu, Wenjun Hou, Yi Cheng, Wenjie Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.22713
Pdf URL: https://arxiv.org/pdf/2509.22713
Copy Paste: [[2509.22713]] RAR$^2$: Retrieval-Augmented Medical Reasoning via Thought-Driven Retrieval(https://arxiv.org/abs/2509.22713)
Keywords: language model, llm, hallucination, retrieval-augmented generation
Abstract: Large Language Models (LLMs) have shown promising performance on diverse medical benchmarks, highlighting their potential in supporting real-world clinical tasks. Retrieval-Augmented Generation (RAG) has emerged as a key approach for mitigating knowledge gaps and hallucinations by incorporating external medical information. However, RAG still struggles with complex medical questions that require intensive reasoning, as surface-level input often fails to reflect the true knowledge needs of the task. Existing methods typically focus on refining queries without explicitly modeling the reasoning process, limiting their ability to retrieve and integrate clinically relevant knowledge. In this work, we propose RAR$^2$, a joint learning framework that improves both Reasoning-Augmented Retrieval and Retrieval-Augmented Reasoning. RAR$^2$ constructs a thought process to uncover implicit knowledge requirements and uses it to guide retrieval and answer generation. We build a training dataset of mixed preference pairs and apply Direct Preference Optimization (DPO) to train the model. Moreover, we design two test-time scaling strategies to explore the boundaries of our framework. Experiments demonstrate the effectiveness of RAR$^2$ across several biomedical question answering datasets, outperforming RAG baselines with or without fine-tuning.
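
Illustrative sketch, not the authors' code: the abstract trains on preference pairs with Direct Preference Optimization (DPO). The standard per-pair DPO loss is the negative log-sigmoid of a scaled difference between how much the policy and a frozen reference model prefer the chosen response over the rejected one. The log-probabilities and `beta` value below are placeholders.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss from sequence log-probabilities."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)), computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Placeholder log-probabilities: the first policy favors the chosen answer
# more than the reference does, the second favors the rejected one.
good = dpo_loss(-1.0, -3.0, -2.0, -2.0)
bad = dpo_loss(-3.0, -1.0, -2.0, -2.0)
print(good, bad)
```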

Title: TRUEBench: Can LLM Response Meet Real-world Constraints as Productivity Assistant?

Authors: Jiho Park, Jongyoon Song, Minjin Choi, Kyuho Heo, Taehun Huh, Ji Won Kim
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.22715
Pdf URL: https://arxiv.org/pdf/2509.22715
Copy Paste: [[2509.22715]] TRUEBench: Can LLM Response Meet Real-world Constraints as Productivity Assistant?(https://arxiv.org/abs/2509.22715)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) are increasingly integral as productivity assistants, but existing benchmarks fall short in rigorously evaluating their real-world instruction-following capabilities. Current benchmarks often (i) lack sufficient multilinguality, (ii) fail to capture the implicit constraints inherent in user requests, and (iii) overlook the complexities of multi-turn dialogue. To address these critical gaps and provide a more realistic assessment, we introduce TRUEBench (Trustworthy Real-world Usage Evaluation Benchmark), a novel benchmark specifically designed for LLM-based productivity assistants. TRUEBench distinguishes itself by featuring input prompts across 12 languages, incorporating intra-instance multilingual instructions, employing rigorous evaluation criteria to capture both explicit and implicit constraints, and including complex multi-turn dialogue scenarios with both accumulating constraints and context switches. Furthermore, to ensure reliability in evaluation, we refined constraints using an LLM validator. Extensive experiments demonstrate that TRUEBench presents significantly greater challenges than existing benchmarks; for instance, a strong model like OpenAI o1 achieved only a 69.07% overall pass rate. TRUEBench offers a demanding and realistic assessment of LLMs in practical productivity settings, highlighting their capabilities and limitations.

Title: Multi-Modal Sentiment Analysis with Dynamic Attention Fusion

Authors: Sadia Abdulhalim, Muaz Albaghdadi, Moshiur Farazi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.22729
Pdf URL: https://arxiv.org/pdf/2509.22729
Copy Paste: [[2509.22729]] Multi-Modal Sentiment Analysis with Dynamic Attention Fusion(https://arxiv.org/abs/2509.22729)
Keywords: language model
Abstract: Traditional sentiment analysis has long been a unimodal task, relying solely on text. This approach overlooks non-verbal cues such as vocal tone and prosody that are essential for capturing true emotional intent. We introduce Dynamic Attention Fusion (DAF), a lightweight framework that combines frozen text embeddings from a pretrained language model with acoustic features from a speech encoder, using an adaptive attention mechanism to weight each modality per utterance. Without any finetuning of the underlying encoders, our proposed DAF model consistently outperforms both static fusion and unimodal baselines on a large multimodal benchmark. We report notable gains in F1-score and reductions in prediction error and perform a variety of ablation studies that support our hypothesis that the dynamic weighting strategy is crucial for modeling emotionally complex inputs. By effectively integrating verbal and non-verbal information, our approach offers a more robust foundation for sentiment prediction and carries broader impact for affective computing applications -- from emotion recognition and mental health assessment to more natural human computer interaction.
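
Illustrative sketch, not the authors' architecture: the abstract describes weighting each modality per utterance with an adaptive attention mechanism. A minimal reading is a softmax over per-modality relevance scores followed by a weighted sum of the embeddings. The gating vectors, embedding values, and the assumption that both embeddings share a dimension are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def fuse(text_emb, audio_emb, gate_text, gate_audio):
    """Weight two same-sized embeddings by input-dependent attention scores."""
    score_t = sum(g * x for g, x in zip(gate_text, text_emb))
    score_a = sum(g * x for g, x in zip(gate_audio, audio_emb))
    w_t, w_a = softmax([score_t, score_a])
    fused = [w_t * t + w_a * a for t, a in zip(text_emb, audio_emb)]
    return fused, (w_t, w_a)

# Hypothetical frozen-encoder outputs for one utterance.
fused, weights = fuse([0.2, 0.4], [1.0, -0.5], [1.0, 1.0], [2.0, 0.0])
print(fused, weights)
```

Because the scores depend on the utterance, the weights change from input to input, which is what distinguishes this from static fusion with fixed weights.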

Title: Enabling Approximate Joint Sampling in Diffusion LMs

Authors: Parikshit Bansal, Sujay Sanghavi
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2509.22738
Pdf URL: https://arxiv.org/pdf/2509.22738
Copy Paste: [[2509.22738]] Enabling Approximate Joint Sampling in Diffusion LMs(https://arxiv.org/abs/2509.22738)
Keywords: language model
Abstract: In autoregressive language models, each token is sampled by conditioning on all the past tokens; the overall string has thus been sampled from the correct underlying joint distribution represented by the model. In contrast, masked diffusion language models generate text by unmasking tokens out of order and potentially in parallel. Generating an overall string sampled from the correct underlying joint distribution would (again) require exactly one token unmasking in every full-model forward pass. The more tokens unmasked in parallel, the further away the string is from the true joint; this can be seen in the resulting drop in accuracy (but increase in speed). In this paper we devise a way to approximately sample multiple tokens from the joint distribution in a single full-model forward pass; we do so by developing a new lightweight single-layer "sampler" on top of an existing large diffusion LM. One forward pass of the full model can now be followed by multiple forward passes of only this sampler layer, to yield multiple unmasked tokens. Our sampler is trained to mimic exact joint sampling from the (frozen) full model. We show the effectiveness of our approximate joint sampling for both pretrained-only (Dream-7B-Base) and instruction-tuned (Dream-7B-Instruct) models on language modeling and math & coding tasks. When four tokens are unmasked for each full-model denoising step, our sampling algorithm achieves a MAUVE score of 0.87 (vs marginal baseline of 0.31) with respect to the true joint distribution.
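
Illustrative toy example, not from the paper: the gap between joint sampling and parallel unmasking can be seen with two masked positions. If each position is sampled independently from its own marginal, the result follows the product of marginals, which can put mass on strings the joint never produces. The two-phrase distribution below is invented for illustration.

```python
from itertools import product

# Toy joint distribution over two masked positions.
joint = {("New", "York"): 0.5, ("Los", "Angeles"): 0.5}

def marginal(joint, pos):
    """Marginal distribution of the token at one position."""
    out = {}
    for tokens, p in joint.items():
        out[tokens[pos]] = out.get(tokens[pos], 0.0) + p
    return out

def product_of_marginals(joint):
    """Distribution obtained by sampling both positions independently."""
    m0, m1 = marginal(joint, 0), marginal(joint, 1)
    return {(a, b): m0[a] * m1[b] for a, b in product(m0, m1)}

parallel = product_of_marginals(joint)
# Probability mass on pairs such as ("New", "Angeles") that the joint excludes.
off_support = sum(p for pair, p in parallel.items() if pair not in joint)
print(off_support)
```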

Title: Painless Activation Steering: An Automated, Lightweight Approach for Post-Training Large Language Models

Authors: Sasha Cui, Zhongren Chen
Subjects: cs.CL, cs.AI, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2509.22739
Pdf URL: https://arxiv.org/pdf/2509.22739
Copy Paste: [[2509.22739]] Painless Activation Steering: An Automated, Lightweight Approach for Post-Training Large Language Models(https://arxiv.org/abs/2509.22739)
Keywords: language model, prompt
Abstract: Language models (LMs) are typically post-trained for desired capabilities and behaviors via weight-based or prompt-based steering, but the former is time-consuming and expensive, and the latter is not precisely controllable and often requires manual trial-and-error. While activation steering (AS) promises a cheap, fast, and controllable alternative to the two existing post-training methods, current AS techniques require hand-crafted prompt pairs or labor-intensive feature annotation, making them more inconvenient than the plug-and-play methods such as Reinforcement Learning (RL) and Supervised Fine-Tuning (SFT). We introduce Painless Activation Steering (PAS), a family of fully automated methods that make AS readily usable with any given labeled dataset, with no need for prompt construction, feature labeling, or human intervention. We evaluate PAS on three open-weight models (Llama3.1-8B-Instruct, DeepSeek-R1-Distill-8B, and Nous-Hermes-2) and 18 tasks; we find that PAS reliably improves performance for behavior tasks, but not for intelligence-oriented tasks. The introspective variant (iPAS) delivers the strongest causal steering effects (10.1% on Bias, 5.2% on Morality, and 34.8% on Alignment). We also show PAS delivers additional gains on top of In-Context Learning (ICL) and SFT. PAS constructs a fast, lightweight activation vector that can be cheaply trained, easily stored, and activated at will. Our results provide a characterization of where AS helps, where it fails, and how to deploy it as a practical, automated LM post-training option.
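
Illustrative sketch, not the PAS method itself: a common way to build an activation-steering vector is the difference between mean hidden activations on examples that show a behavior and examples that do not, added to a hidden state at inference time with a chosen strength. How PAS derives its vectors from labeled data is not specified in the abstract; the activations and `strength` below are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of equally sized activation vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def steering_vector(pos_acts, neg_acts):
    """Difference of means between two sets of activations."""
    mp, mn = mean_vector(pos_acts), mean_vector(neg_acts)
    return [p - q for p, q in zip(mp, mn)]

def steer(hidden, vector, strength=1.0):
    """Shift a hidden state along the steering direction."""
    return [h + strength * v for h, v in zip(hidden, vector)]

# Hypothetical layer activations for two labeled groups of prompts.
vec = steering_vector([[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [2.0, 2.0]])
steered = steer([0.0, 0.0], vec, strength=2.0)
print(vec, steered)
```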

Title: MIRAGE: Multi-hop Reasoning with Ambiguity Evaluation for Illusory Questions

Authors: Jeonghyun Park, Ingeol Baek, Seunghyun Yoon, Haeun Jang, Aparna Garimella, Akriti Jain, Nedim Lipka, Hwanhee Lee
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.22750
Pdf URL: https://arxiv.org/pdf/2509.22750
Copy Paste: [[2509.22750]] MIRAGE: Multi-hop Reasoning with Ambiguity Evaluation for Illusory Questions(https://arxiv.org/abs/2509.22750)
Keywords: language model, llm, agent
Abstract: Real-world Multi-hop Question Answering (QA) often involves ambiguity that is inseparable from the reasoning process itself. This ambiguity creates a distinct challenge, where multiple reasoning paths emerge from a single question, each requiring independent resolution. Since each sub-question is ambiguous, the model must resolve ambiguity at every step. Thus, answering a single question requires handling multiple layers of ambiguity throughout the reasoning chain. We find that current Large Language Models (LLMs) struggle in this setting, typically exploring wrong reasoning paths and producing incomplete answers. To facilitate research on multi-hop ambiguity, we introduce MultI-hop Reasoning with AmbiGuity Evaluation for Illusory Questions (MIRAGE), a benchmark designed to analyze and evaluate this challenging intersection of ambiguity interpretation and multi-hop reasoning. MIRAGE contains 1,142 high-quality examples of ambiguous multi-hop questions, categorized under a taxonomy of syntactic, general, and semantic ambiguity, and curated through a rigorous multi-LLM verification pipeline. Our experiments reveal that even state-of-the-art models struggle on MIRAGE, confirming that resolving ambiguity combined with multi-step inference is a distinct and significant challenge. To establish a robust baseline, we propose CLarifying Ambiguity with a Reasoning and InstructiON (CLARION), a multi-agent framework that significantly outperforms existing approaches on MIRAGE, paving the way for more adaptive and robust reasoning systems.

Title: ML2B: Multi-Lingual ML Benchmark For AutoML

Authors: Ekaterina Trofimova, Zosia Shamina, Maria Selifanova, Artem Zaitsev, Remi Savchuk, Maxim Minets, Daria Ozerova, Emil Sataev, Denis Zuenko, Andrey E. Ustyuzhanin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.22768
Pdf URL: https://arxiv.org/pdf/2509.22768
Copy Paste: [[2509.22768]] ML2B: Multi-Lingual ML Benchmark For AutoML(https://arxiv.org/abs/2509.22768)
Keywords: language model, llm
Abstract: Large language models (LLMs) have recently demonstrated strong capabilities in generating machine learning (ML) code, enabling end-to-end pipeline construction from natural language instructions. However, existing benchmarks for ML code generation are mainly restricted to English, overlooking the global and multilingual nature of ML research and practice. To address this gap, we present ML2B, the first benchmark for evaluating multilingual ML code generation. ML2B consists of 30 Kaggle competitions translated into 13 natural languages, covering tabular, text, and image data types, with structured metadata and validated human-reviewed translations. For evaluation, we employ AIDE, an automated framework for end-to-end assessment of data science pipelines, and provide insights into cross-lingual model performance. Our results reveal substantial 15-45% performance degradation on non-English tasks, highlighting critical challenges in multilingual representation learning for code generation. The benchmark, evaluation framework, and comprehensive results are made available through our GitHub repository to facilitate future research in multilingual ML code generation: this https URL.

Title: EditGRPO: Reinforcement Learning with Post-Rollout Edits for Clinically Accurate Chest X-Ray Report Generation

Authors: Kai Zhang, Christopher Malon, Lichao Sun, Martin Renqiang Min
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.22812
Pdf URL: https://arxiv.org/pdf/2509.22812
Copy Paste: [[2509.22812]] EditGRPO: Reinforcement Learning with Post-Rollout Edits for Clinically Accurate Chest X-Ray Report Generation(https://arxiv.org/abs/2509.22812)
Keywords: language model, llm
Abstract: Radiology report generation requires advanced medical image analysis, effective temporal reasoning, and accurate text generation. Although recent innovations, particularly multimodal large language models (MLLMs), have shown improved performance, their supervised fine-tuning (SFT) objective is not explicitly aligned with clinical efficacy. In this work, we introduce EditGRPO, a mixed-policy reinforcement learning (RL) algorithm designed specifically to optimize the generation through clinically motivated rewards. EditGRPO integrates on-policy exploration with off-policy guidance by injecting sentence-level detailed corrections during training rollouts. This mixed-policy approach addresses the exploration dilemma and sampling efficiency issues typically encountered in RL. Applied to a Qwen2.5-VL-3B MLLM initialized with supervised fine-tuning (SFT), EditGRPO outperforms both SFT and vanilla GRPO baselines, achieving an average improvement of 3.4% in CheXbert, GREEN, Radgraph, and RATEScore metrics across four major chest X-ray report generation datasets. Notably, EditGRPO also demonstrates superior out-of-domain generalization, with an average performance gain of 5.9% on unseen datasets.
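
Illustrative sketch, not the authors' code: EditGRPO is compared against vanilla GRPO, whose core step scores each sampled report relative to the other reports in its group. The version below normalizes rewards by the group mean and population standard deviation; the reward values are invented, and treating the last entry as an edited rollout that joins the group is an assumption about how the off-policy corrections enter.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group of rollouts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical clinical-reward scores for four rollouts of one X-ray;
# the last one stands in for a rollout after sentence-level correction.
rewards = [0.2, 0.4, 0.3, 0.9]
advantages = group_relative_advantages(rewards)
print(advantages)
```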

Title: Critique-Coder: Enhancing Coder Models by Critique Reinforcement Learning

Authors: Chi Ruan, Dongfu Jiang, Yubo Wang, Wenhu Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.22824
Pdf URL: https://arxiv.org/pdf/2509.22824
Copy Paste: [[2509.22824]] Critique-Coder: Enhancing Coder Models by Critique Reinforcement Learning(https://arxiv.org/abs/2509.22824)
Keywords: gpt, llm
Abstract: Reinforcement Learning (RL) has emerged as a popular training paradigm, particularly when paired with reasoning models. While effective, it primarily focuses on generating responses and lacks mechanisms to explicitly foster critique or reflection. Several recent studies, like Critique-Fine-Tuning (CFT) and Critique-Guided-Distillation (CGD), have shown the benefits of explicitly teaching LLMs how to critique. Motivated by them, we propose Critique Reinforcement Learning (CRL), where the model is tasked with generating a critique for a given (question, solution) pair. The reward is determined solely by whether the final judgment label $c \in \{\texttt{True}, \texttt{False}\}$ of the generated critique aligns with the ground-truth judgment $c^*$. Building on this point, we introduce Critique-Coder, which is trained on a hybrid of RL and CRL by substituting 20% of the standard RL data with CRL data. We fine-tune multiple models (Critique-Coder) and evaluate them on different benchmarks to show their advantages over RL-only models. We show that Critique-Coder consistently outperforms RL-only baselines on all the evaluated benchmarks. Notably, our Critique-Coder-8B can reach over 60% on LiveCodeBench (v5), outperforming other reasoning models like DeepCoder-14B and GPT-o1. Beyond code generation, Critique-Coder also demonstrates enhanced general reasoning abilities, as evidenced by its better performance on logic reasoning tasks from the BBEH dataset. This indicates that the application of CRL on coding datasets enhances general reasoning and critique abilities, which are transferable across a broad range of tasks. Hence, we believe that CRL works as a great complement to standard RL for LLM reasoning.
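
Illustrative sketch, not the authors' code: the CRL reward described in the abstract depends only on whether the critique's final True/False judgment matches the ground truth, and 20% of the standard RL data is swapped for CRL data. The helper names are hypothetical, and which RL items get replaced is an assumption (here, the tail of the list).

```python
def crl_reward(predicted_judgment: bool, true_judgment: bool) -> float:
    """Binary reward: 1.0 when the critique's verdict matches the label."""
    return 1.0 if predicted_judgment == true_judgment else 0.0

def mix_datasets(rl_items, crl_items, crl_fraction=0.2):
    """Replace a fraction of RL items with CRL items, keeping the total size."""
    n_crl = round(len(rl_items) * crl_fraction)
    if n_crl > len(crl_items):
        raise ValueError("not enough CRL items for the requested fraction")
    return rl_items[: len(rl_items) - n_crl] + crl_items[:n_crl]

# Placeholder items standing in for real prompts.
mixed = mix_datasets([f"rl_{i}" for i in range(10)],
                     [f"crl_{i}" for i in range(5)])
print(mixed, crl_reward(True, True), crl_reward(True, False))
```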

Title: ChatInject: Abusing Chat Templates for Prompt Injection in LLM Agents

Authors: Hwan Chang, Yonghyun Jun, Hwanhee Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.22830
Pdf URL: https://arxiv.org/pdf/2509.22830
Copy Paste: [[2509.22830]] ChatInject: Abusing Chat Templates for Prompt Injection in LLM Agents(https://arxiv.org/abs/2509.22830)
Keywords: language model, llm, prompt, chat, agent
Abstract: The growing deployment of large language model (LLM) based agents that interact with external environments has created new attack surfaces for adversarial manipulation. One major threat is indirect prompt injection, where attackers embed malicious instructions in external environment output, causing agents to interpret and execute them as if they were legitimate prompts. While previous research has focused primarily on plain-text injection attacks, we find a significant yet underexplored vulnerability: LLMs' dependence on structured chat templates and their susceptibility to contextual manipulation through persuasive multi-turn dialogues. To this end, we introduce ChatInject, an attack that formats malicious payloads to mimic native chat templates, thereby exploiting the model's inherent instruction-following tendencies. Building on this foundation, we develop a persuasion-driven Multi-turn variant that primes the agent across conversational turns to accept and execute otherwise suspicious actions. Through comprehensive experiments across frontier LLMs, we demonstrate three critical findings: (1) ChatInject achieves significantly higher average attack success rates than traditional prompt injection methods, improving from 5.18% to 32.05% on AgentDojo and from 15.13% to 45.90% on InjecAgent, with multi-turn dialogues showing particularly strong performance at average 52.33% success rate on InjecAgent, (2) chat-template-based payloads demonstrate strong transferability across models and remain effective even against closed-source LLMs, despite their unknown template structures, and (3) existing prompt-based defenses are largely ineffective against this attack approach, especially against Multi-turn variants. These findings highlight vulnerabilities in current agent systems.

Title: Towards Generalizable Implicit In-Context Learning with Attention Routing

Authors: Jiaqian Li, Yanshu Li, Ligong Han, Ruixiang Tang, Wenya Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.22854
Pdf URL: https://arxiv.org/pdf/2509.22854
Copy Paste: [[2509.22854]] Towards Generalizable Implicit In-Context Learning with Attention Routing(https://arxiv.org/abs/2509.22854)
Keywords: language model, llm
Abstract: Implicit in-context learning (ICL) has newly emerged as a promising paradigm that simulates ICL behaviors in the representation space of Large Language Models (LLMs), aiming to attain few-shot performance at zero-shot cost. However, existing approaches largely rely on injecting shift vectors into residual flows, which are typically constructed from labeled demonstrations or task-specific alignment. Such designs fall short of utilizing the structural mechanisms underlying ICL and suffer from limited generalizability. To address this, we propose In-Context Routing (ICR), a novel implicit ICL method that internalizes generalizable ICL patterns at the attention logits level. It extracts reusable structural directions that emerge during ICL and employs a learnable input-conditioned router to modulate attention logits accordingly, enabling a train-once-and-reuse framework. We evaluate ICR on 12 real-world datasets spanning diverse domains and multiple LLMs. The results show that ICR consistently outperforms prior implicit ICL methods that require task-specific retrieval or training, while demonstrating robust generalization to out-of-domain tasks where existing methods struggle. These findings position ICR to push the boundary of ICL's practical value.

Title: The Bias is in the Details: An Assessment of Cognitive Bias in LLMs

Authors: R. Alexander Knipper, Charles S. Knipper, Kaiqi Zhang, Valerie Sims, Clint Bowers, Santu Karmaker
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.22856
Pdf URL: https://arxiv.org/pdf/2509.22856
Copy Paste: [[2509.22856]] The Bias is in the Details: An Assessment of Cognitive Bias in LLMs(https://arxiv.org/abs/2509.22856)
Keywords: language model, llm, prompt
Abstract: As Large Language Models (LLMs) are increasingly embedded in real-world decision-making processes, it becomes crucial to examine the extent to which they exhibit cognitive biases. Extensively studied in the field of psychology, cognitive biases appear as systematic distortions commonly observed in human judgments. This paper presents a large-scale evaluation of eight well-established cognitive biases across 45 LLMs, analyzing over 2.8 million LLM responses generated through controlled prompt variations. To achieve this, we introduce a novel evaluation framework based on multiple-choice tasks, hand-curate a dataset of 220 decision scenarios targeting fundamental cognitive biases in collaboration with psychologists, and propose a scalable approach for generating diverse prompts from human-authored scenario templates. Our analysis shows that LLMs exhibit bias-consistent behavior in 17.8-57.3% of instances across a range of judgment and decision-making contexts targeting anchoring, availability, confirmation, framing, interpretation, overattribution, prospect theory, and representativeness biases. We find that both model size and prompt specificity play a significant role in bias susceptibility: larger size (>32B parameters) can reduce bias in 39.5% of cases, while higher prompt detail reduces most biases by up to 14.9%, except in one case (Overattribution), which is exacerbated by up to 8.8%.
摘要：由于大型语言模型（LLM）越来越多地嵌入现实世界的决策过程中，因此检查它们表现出认知偏见的程度至关重要。在心理学领域进行了广泛的研究，认知偏见似乎是在人类判断中常见的系统扭曲。本文对45个LLM的八种公认的认知偏见进行了大规模评估，分析了通过受控及时变化产生的超过280万个LLM响应。为了实现这一目标，我们介绍了一个基于多项选择任务的新颖评估框架，手工策划了220个决策情景的数据集，该数据集针对与心理学家合作，针对基本认知偏见，并提出了一种可扩展的方法，以产生从人体实现的情景模板中产生多样的提示。我们的分析表明，在一系列判断和决策环境中，LLM在17.8-57.3％的实例中表现出偏见的行为，旨在针对锚定，可用性，确认，框架，解释，覆盖，前景理论和代表性偏见。我们发现，模型大小和及时特异性在偏见易感性上都起着重要作用，如下所示：较大的尺寸（> 32B参数）可以减少39.5％的情况下的偏差，而更高的及时及时细节可将大多数偏见降低14.9％，除了在一个情况下（超级贡献），这会被降低至8.8％。

Title: HEART: Emotionally-driven test-time scaling of Language Models

Authors: Gabriela Pinto, Palash Goyal, Yiwen Song, Souradip Chakraborty, Zifeng Wang, Tomas Pfister, Hamid Palangi
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2509.22876
Pdf URL: https://arxiv.org/pdf/2509.22876
Copy Paste: [[2509.22876]] HEART: Emotionally-driven test-time scaling of Language Models(https://arxiv.org/abs/2509.22876)
Keywords: language model, prompt
Abstract: Test-time scaling has shown considerable success in improving the performance of language models on complex reasoning tasks without requiring fine-tuning. However, current strategies such as self-reflection primarily focus on logical or structural refinement. They do not leverage the guiding potential of affective feedback. Inspired by psychological research showing that emotions can modulate cognitive performance, we introduce HEART, a novel framework that uses emotionally-driven prompts for iterative self-correction. HEART provides feedback on a model's incorrect response using a curated set of concise, emotionally charged phrases based on the six universal emotions categorized by Dr. Paul Ekman. By systematically varying the emotional tone of the feedback across iterations, our method guides the model to escape flawed reasoning paths and explore more promising alternatives. We evaluate our framework on challenging reasoning benchmarks including OlympiadBench, Humanity's Last Exam, and SimpleQA. Our results reveal a significant new phenomenon: when guided by an oracle verifier, this affective iteration protocol unlocks significantly deeper reasoning, leading to consistent and substantial increases in accuracy over state-of-the-art baselines with the same verifier. However, we also identify a critical bottleneck for practical deployment. In a verifier-free setting, it struggles to harness these gains consistently, which we highlight as a key challenge for future work. Our findings suggest that the next frontier in machine reasoning may lie not just in refining logic, but also in understanding and leveraging the 'HEART' of the models.
摘要：测试时间缩放在不需要微调的情况下改善语言模型在复杂的推理任务上的性能方面取得了很大的成功。但是，诸如自我反思之类的当前策略主要集中于逻辑或结构改进。他们不利用情感反馈的指导潜力。受到心理研究的启发，表明情绪可以调节认知表现，我们引入了心脏，这是一个新颖的框架，它使用情感驱动的提示进行迭代自我纠正。 Heart使用保罗·埃克曼（Paul Ekman）博士分类的六种普遍情绪，使用一组精心策划的，情感上充满情感的短语对模型的反应提供反馈。通过系统地改变遍历反馈的情感基调，我们的方法指导该模型避免有缺陷的推理路径并探索更有前途的替代方案。我们评估了我们的框架，包括挑战推理基准，包括奥林匹亚山甲板，人类的最后考试和SimpleQA。我们的结果揭示了一个重大的新现象：当在Oracle验证器的指导下，该情感迭代协议解锁了更深层次的推理，从而导致使用同一验证器对最先进的基线的准确性一致和大幅提高。但是，我们还确定了用于实际部署的关键瓶颈。在无验证者的环境中，它努力始终如一地利用这些收益，这是对未来工作的关键挑战。我们的发现表明，机器推理中的下一个前沿可能不仅在于完善逻辑，而且还在于理解和利用模型的“心脏”。

Title: Infusing Theory of Mind into Socially Intelligent LLM Agents

Authors: EunJeong Hwang, Yuwei Yin, Giuseppe Carenini, Peter West, Vered Shwartz
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.22887
Pdf URL: https://arxiv.org/pdf/2509.22887
Copy Paste: [[2509.22887]] Infusing Theory of Mind into Socially Intelligent LLM Agents(https://arxiv.org/abs/2509.22887)
Keywords: llm, prompt, chat, agent
Abstract: Theory of Mind (ToM), an understanding of the mental states of others, is a key aspect of human social intelligence, yet chatbots and LLM-based social agents do not typically integrate it. In this work, we demonstrate that LLMs that explicitly use ToM get better at dialogue, achieving goals more effectively. After showing that simply prompting models to generate mental states between dialogue turns already provides significant benefit, we further introduce ToMAgent (ToMA), a ToM-focused dialogue agent. ToMA is trained by pairing ToM with dialogue lookahead to produce mental states that are maximally useful for achieving dialogue goals. Experiments on the Sotopia interactive social evaluation benchmark demonstrate the effectiveness of our method over a range of baselines. Comprehensive analysis shows that ToMA exhibits more strategic, goal-oriented reasoning behaviors, which enable long-horizon adaptation, while maintaining better relationships with its partners. Our results suggest a step forward in integrating ToM for building socially intelligent LLM agents.
摘要：心理理论（汤姆） - 对他人的心理状态的理解 - 是人类社会智力的关键方面，但是，聊天机器人和基于LLM的社会代理人通常不会整合它。在这项工作中，我们证明了明确使用汤姆的LLM在对话中变得更好，可以更有效地实现目标。在展示了简单地促使模型在对话之间产生心理状态的情况之后，我们进一步介绍了以汤姆（Tom）为中心的对话代理人tomagent（Toma）。托马通过将汤姆与对话lookahead配对，以产生对实现对话目标极有用的心理状态进行训练。关于Sotopia Interactive社会评估基准的实验证明了我们方法在一系列基础线上的有效性。全面的分析表明，托马表现出更具战略性，面向目标的推理行为，这使长期适应性适应，同时与伴侣保持更好的关系。我们的结果表明，将TOM整合到建立具有社会智能LLM的代理商方面迈出了一步。

Title: Extract-0: A Specialized Language Model for Document Information Extraction

Authors: Henrique Godoy
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.22906
Pdf URL: https://arxiv.org/pdf/2509.22906
Copy Paste: [[2509.22906]] Extract-0: A Specialized Language Model for Document Information Extraction(https://arxiv.org/abs/2509.22906)
Keywords: language model, gpt
Abstract: This paper presents Extract-0, a 7-billion-parameter language model specifically optimized for document information extraction that achieves performance exceeding that of models with parameter counts several orders of magnitude larger. Through a novel combination of synthetic data generation, supervised fine-tuning with Low-Rank Adaptation (LoRA), and reinforcement learning via Group Relative Policy Optimization (GRPO), Extract-0 achieves a mean reward of 0.573 on a benchmark of 1,000 diverse document extraction tasks, outperforming GPT-4.1 (0.457), o3 (0.464), and GPT-4.1-2025 (0.459). The training methodology employs a memory-preserving synthetic data generation pipeline that produces 280,128 training examples from diverse document sources, followed by parameter-efficient fine-tuning that modifies only 0.53% of model weights (40.4M out of 7.66B parameters). The reinforcement learning phase introduces a novel semantic similarity-based reward function that handles the inherent ambiguity in information extraction tasks. This research demonstrates that task-specific optimization can yield models that surpass general-purpose systems while requiring substantially fewer computational resources.
摘要：本文介绍了Extract-0，这是一种70亿个参数语言模型，专门针对文档信息提取，可实现具有参数超过模型的性能超过几个数量级的模型。通过合成数据生成的新型组合，通过小组相对政策优化（GRPO）进行的低级适应（LORA）进行微调（LORA）和加固学习，提取-0在1,000多种文档提取任务的基准上获得了0.573的平均奖励，超过了GPT-4.1（0.457），O3（0.457），O3（和0.464），以及0.4644.4644. 4644. 4644. 4644.（0.457）（和0.464）（0.459）。该培训方法采用了一种存储器的合成数据生成管道，该管道可从各种文档来源中产生280,128个培训示例，然后进行参数范围的微调，仅修改了0.53％的模型权重（在7.66B参数中为404m）。强化学习阶段引入了一种新颖的基于语义相似性的奖励功能，该功能处理了信息提取任务中固有的歧义。这项研究表明，特定于任务的优化可以产生超过通用系统的模型，同时需要更少的计算资源。
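
As a quick sanity check on the figures quoted in the abstract above, the fraction of trainable weights can be recomputed from the two parameter counts. This is only an arithmetic sketch: the variable names are illustrative, and the counts are the rounded values given in the abstract, not exact model sizes.

```python
# Rounded parameter counts as quoted in the abstract (not exact model sizes).
trainable_params = 40.4e6   # parameters modified by the LoRA fine-tuning
total_params = 7.66e9       # total parameters in the base model

# Fraction of weights that the parameter-efficient fine-tuning touches.
fraction = trainable_params / total_params
percent_text = f"{fraction:.2%}"

print(percent_text)
```

The result agrees with the 0.53% stated in the abstract once rounded to two decimal places.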

Title: Large language models management of medications: three performance analyses

Authors: Kelli Henry, Steven Xu, Kaitlin Blotske, Moriah Cargile, Erin F. Barreto, Brian Murray, Susan Smith, Seth R. Bauer, Yanjun Gao, Tianming Liu, Andrea Sikora
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.22926
Pdf URL: https://arxiv.org/pdf/2509.22926
Copy Paste: [[2509.22926]] Large language models management of medications: three performance analyses(https://arxiv.org/abs/2509.22926)
Keywords: language model, gpt, llm, hallucination
Abstract: Background: Large language models (LLMs) can be useful in diagnosing medical conditions, but few studies have evaluated their consistency in recommending appropriate medication regimens. The purpose of this evaluation was to test GPT-4o on three medication benchmarking tests, including mapping a drug name to its correct formulation, identifying drug-drug interactions using both its internal knowledge and a web search, and preparing a medication order sentence after being given the medication name. Methods: Three experiments were completed using GPT-4o. Accuracy was quantified by computing cosine similarity on TF-IDF vectors, normalized Levenshtein similarity, and ROUGE-1/ROUGE-L F1 between each response and its reference string or by manual evaluation by clinicians. Results: GPT-4o performed poorly on drug-formulation matching, with frequent omissions of available drug formulations (mean 1.23 per medication) and hallucinations of formulations that do not exist (mean 1.14 per medication). Only 49% of tested medications were correctly matched to all available formulations. Accuracy decreased for medications with more formulations (p<0.0001). GPT-4o was also inconsistent at identifying drug-drug interactions, although it had better performance with the search-augmented assessment compared to its internal knowledge (54.7% vs. 69.2%, p=0.013). However, allowing a web search worsened performance when there was no drug-drug interaction (median % correct 100% vs. 40%, p<0.001). Finally, GPT-4o performed moderately at preparing a medication order sentence, with only 65.8% of medication order sentences containing no medication or abbreviation errors. Conclusions: Model performance was overall poor for all tests. This highlights the need for domain-specific training through clinician-annotated datasets and a comprehensive evaluation framework for benchmarking performance.
摘要：背景：大语言模型（LLM）在诊断医疗状况中很有用，但是很少有研究评估其在建议适当的药物方案方面的一致性。该评估的目的是在三个药物基准测试中测试GPT-4O，包括将药物名称映射到其正确的配方中，使用其内部知识和使用网络搜索来识别药物 - 药物相互作用，并在给出药物名称后准备药物订单。方法：使用GTP-4O完成了三个实验。通过计算TF-IDF载体，归一化Levenshtein的相似性以及每个响应及其参考字符串之间的rouge-1/rouge-l F1的余弦相似性来量化精度，或者通过临床医生的手动评估。结果：GPT-4O在药物制定匹配方面的表现较差，经常遗漏可用的药物制剂（每种药物平均1.23）和不存在的配方幻觉（每种药物平均1.14）。只有49％的测试药物与所有可用配方正确匹配。用于制剂更多的药物（p <0.0001）的药物的准确性降低。 GPT-4O在识别药物与毒品之间的交流方面也不一致，尽管与其内部知识相比，它在搜索仪的评估方面具有更好的性能（54.7％vs. 69.2％，p = 0.013）。但是，当没有药物毒品相互作用时，允许网络搜索恶化（中位％正确100％比40％，p <0.001）。最后，GPT-4O在准备药物订单句子中进行了适度的表现，只有65.8％的药物订单句子中没有药物或缩写错误。结论：所有测试的模型性能总体差。这强调了通过临床医生注销的数据集进行特定于领域的培训的需求，以及用于基准性能的全面评估框架。
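
One of the string metrics named in the abstract above, normalized Levenshtein similarity, can be sketched as follows. The abstract does not say which normalization the authors used, so dividing the edit distance by the length of the longer string is an assumption here, and the function names and example strings are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalized_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; the max-length normalization is an assumed choice."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(a, b) / longest


# Hypothetical response/reference pair, for illustration only.
score = normalized_similarity("oral tablet 10 mg", "oral tablet 20 mg")
print(round(score, 3))
```

A score of 1.0 means the response matches the reference string exactly, and lower scores mean proportionally more character edits are needed.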

Title: LLMs Behind the Scenes: Enabling Narrative Scene Illustration

Authors: Melissa Roemmele, John Joon Young Chung, Taewook Kim, Yuqian Sun, Alex Calderwood, Max Kreminski
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2509.22940
Pdf URL: https://arxiv.org/pdf/2509.22940
Copy Paste: [[2509.22940]] LLMs Behind the Scenes: Enabling Narrative Scene Illustration(https://arxiv.org/abs/2509.22940)
Keywords: llm, prompt
Abstract: Generative AI has established the opportunity to readily transform content from one medium to another. This capability is especially powerful for storytelling, where visual illustrations can illuminate a story originally expressed in text. In this paper, we focus on the task of narrative scene illustration, which involves automatically generating an image depicting a scene in a story. Motivated by recent progress on text-to-image models, we consider a pipeline that uses LLMs as an interface for prompting text-to-image models to generate scene illustrations given raw story text. We apply variations of this pipeline to a prominent story corpus in order to synthesize illustrations for scenes in these stories. We conduct a human annotation task to obtain pairwise quality judgments for these illustrations. The outcome of this process is the SceneIllustrations dataset, which we release as a new resource for future work on cross-modal narrative transformation. Through our analysis of this dataset and experiments modeling illustration quality, we demonstrate that LLMs can effectively verbalize scene knowledge implicitly evoked by story text. Moreover, this capability is impactful for generating and evaluating illustrations.
摘要：Generative AI已经建立了很容易将内容从一种媒介转变为另一种媒介的机会。这种能力对于讲故事特别有力，在那里，视觉插图可以照亮最初在文本中表达的故事。在本文中，我们专注于叙事场景插图的任务，该任务涉及自动生成一个描绘故事场景的图像。在文本到图像模型上的最新进展中，我们考虑了一条使用LLMS作为界面的管道，用于提示文本到图像模型以生成给定的原始故事文本的场景插图。我们将该管道的变体应用于一个著名的故事语料库，以便将这些故事中的场景插图合成插图。我们执行人类注释任务，以获得这些插图的成对质量判断。此过程的结果是centeLustrations数据集，我们将其作为跨模式叙事转换的未来工作的新资源发布。通过对该数据集的分析和实验建模插图质量，我们证明了LLM可以有效地对故事文本暗中唤起的场景知识进行言语。此外，这种能力对生成和评估插图具有影响。

Title: What Matters More For In-Context Learning under Matched Compute Budgets: Pretraining on Natural Text or Incorporating Targeted Synthetic Examples?

Authors: Mohammed Sabry, Anya Belz
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.22947
Pdf URL: https://arxiv.org/pdf/2509.22947
Copy Paste: [[2509.22947]] What Matters More For In-Context Learning under Matched Compute Budgets: Pretraining on Natural Text or Incorporating Targeted Synthetic Examples?(https://arxiv.org/abs/2509.22947)
Keywords: language model
Abstract: Does explicitly exercising the induction circuit during pretraining improve in-context learning (ICL), or is natural text sufficient when compute is held constant (iso-FLOPs)? To test whether targeted synthetic data can accelerate induction-head emergence and enhance ICL, we introduce Bi-Induct, a lightweight curriculum that injects forward-copy (Induction), backward-copy (Anti), or a balanced mix into the pretraining stream. We train models from 0.13B to 1B parameters under iso-FLOPs, evaluating (i) few-shot ICL benchmarks, (ii) head-level telemetry, and (iii) held-out language modeling perplexity. Our findings challenge the assumption that early induction circuit activation directly improves ICL. While Bi-Induct accelerates induction-head emergence at small scales, this does not consistently yield stronger generalization. On standard LM benchmarks, Bi-Induct matches natural-only training; on function-style ICL probes, the 1B natural-only model performs best. Stress tests (e.g., label permutation, HITS@1 vs. HITS@3, 1 vs. 10 shots) preserve these trends. Telemetry shows that larger natural-only models develop broader, earlier induction heads without explicit induction patterns. Anti-induction data fails to elicit meaningful activation. Perplexity penalties from synthetic data shrink with scale, suggesting larger models can absorb non-natural patterns with minimal cost. Crucially, ablating the top 2% of induction heads degrades ICL more than random ablations, especially for natural-only models, indicating more centralized, load-bearing circuits. Bi-Induct variants exhibit more redundant induction activity, implying different circuit utilization. Overall, inducing activation is not sufficient: ICL gains depend on these circuits becoming functionally necessary. These results underscore mechanism-aware pretraining diagnostics and data mixtures that foster load-bearing, not merely present, structure.
摘要：在预处理过程中明确行使感应电路改善了内在学习（ICL），还是在计算保持恒定时自然文本足够（ISO-Flops）？为了测试有针对性的合成数据是否可以加速感应头出现并增强ICL，我们引入了双感应，这是一种轻巧的课程，可注射前向拷贝（感应），向后拷贝（抗）或平衡的混合物或预处理的流中。我们在ISO-Flops下将0.13B的模型从0.13B训练到1B参数，评估（i）几乎没有ICL基准测试，（ii）头部级遥测和（iii）Hold-Out语言建模的困惑。我们的发现挑战了以下假设：早期诱导电路激活直接改善ICL。尽管双感应在小尺度上加速了诱导头出现，但这并不能始终产生更强的概括。在标准LM基准测试中，双感应匹配自然训练；在功能风格的ICL探针上，1B自然只能表现最佳。压力测试（例如，标签排列，命中率@1 vs.命中@3，1 vs. 10杆）保留了这些趋势。遥测表明，较大的自然模型会发展出更广泛的，更早的诱导头，没有明确的诱导模式。抗诱导数据无法引起有意义的激活。综合数据随规模缩小的造成的困惑惩罚，表明较大的模型可以以最低的成本吸收非天然模式。至关重要的是，在诱导头部的前2％消除ICL的降低而不是随机消融，尤其是对于自然模型，表明更集中，承载的电路。双感应变体表现出更冗余的诱导活性，这意味着不同的电路利用率。总体而言，诱导激活是不够的：ICL增益取决于这些电路在功能上变得必要。这些结果强调了诊断预处理的诊断和数据混合物，这些诊断和数据混合物不仅促进了负载的结构。

Title: Same Content, Different Representations: A Controlled Study for Table QA

Authors: Yue Zhang, Seiji Maekawa, Nikita Bhutani
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.22983
Pdf URL: https://arxiv.org/pdf/2509.22983
Copy Paste: [[2509.22983]] Same Content, Different Representations: A Controlled Study for Table QA(https://arxiv.org/abs/2509.22983)
Keywords: llm
Abstract: Table Question Answering (Table QA) in real-world settings must operate over both structured databases and semi-structured tables containing textual fields. However, existing benchmarks are tied to fixed data formats and have not systematically examined how representation itself affects model performance. We present the first controlled study that isolates the role of table representation by holding content constant while varying structure. Using a verbalization pipeline, we generate paired structured and semi-structured tables, enabling direct comparisons across modeling paradigms. To support detailed analysis, we introduce a diagnostic benchmark with splits along table size, join requirements, query complexity, and schema quality. Our experiments reveal consistent trade-offs: SQL-based methods achieve high accuracy on structured inputs but degrade on semi-structured data, LLMs exhibit flexibility but reduced precision, and hybrid approaches strike a balance, particularly under noisy schemas. These effects intensify with larger tables and more complex queries. Ultimately, no single method excels across all conditions, and we highlight the central role of representation in shaping Table QA performance. Our findings provide actionable insights for model selection and design, paving the way for more robust hybrid approaches suited for diverse real-world data formats.
摘要：现实世界设置中的表问题回答（表QA）必须在结构化数据库和包含文本字段的半结构表上进行操作。但是，现有基准与固定数据格式相关，并且没有系统地检查表示本身如何影响模型性能。我们提出了第一项对照研究，该研究通过保持内容恒定而在变化结构时隔离表表示的作用。使用语言管道，我们生成配对的结构化和半结构表，从而在建模范式上进行直接比较。为了支持详细的分析，我们介绍了一个诊断基准测试，并沿桌子尺寸拆分，连接需求，查询复杂性和模式质量。我们的实验揭示了一致的权衡：基于SQL的方法在结构化输入方面具有很高的精度，但在半结构化数据上降解，LLMS表现出灵活性，但精度降低，而混合方法则达到平衡，尤其是在嘈杂的模式下。这些效果通过较大的表和更复杂的查询加剧。最终，在所有条件下都没有单一的方法出色，我们强调了表示在塑造表QA性能中的核心作用。我们的发现为模型选择和设计提供了可行的见解，为适合各种现实世界数据格式的更强大的混合方法铺平了道路。

Title: ADAM: A Diverse Archive of Mankind for Evaluating and Enhancing LLMs in Biographical Reasoning

Authors: Jasin Cekinmez, Omid Ghahroodi, Saad Fowad Chandle, Dhiman Gupta, Ehsaneddin Asgari
Subjects: cs.CL, cs.AI, cs.CV, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2509.22991
Pdf URL: https://arxiv.org/pdf/2509.22991
Copy Paste: [[2509.22991]] ADAM: A Diverse Archive of Mankind for Evaluating and Enhancing LLMs in Biographical Reasoning(https://arxiv.org/abs/2509.22991)
Keywords: language model, llm, hallucination, retrieval-augmented generation
Abstract: We introduce ADAM (A Diverse Archive of Mankind), a framework for evaluating and improving multimodal large language models (MLLMs) in biographical reasoning. To the best of our knowledge, this is the first work to systematically examine LLM capabilities in biography, a critical yet underexplored dimension of factual knowledge. At its core, AdamDB is a multilingual and multimodal dataset covering over 4 million individuals across geography, time, and profession, while AdamBench provides cognitively structured evaluations based on Bloom's taxonomy, spanning six reasoning levels in both English and native languages. To address hallucinations, particularly for lesser-known individuals, we propose AdamRAG, a retrieval-augmented generation system tailored to biographical contexts. Experiments show that AdamRAG substantially improves open-source models and modestly benefits closed-source ones, with the largest gains on lower-order reasoning. Popularity strongly mediates accuracy, and multimodal input via face images offers smaller, less consistent improvements than retrieval. ADAM establishes the first benchmark and framework for cognitively, culturally, and multimodally grounded biographical evaluation, advancing the development of multilingual, accurate, and hallucination-resistant MLLMs.
摘要：我们介绍了亚当（人类的多元化档案馆），这是一个用于评估和改善传记推理中多模式大语模型（MLLM）的框架。据我们所知，这是系统地检查传记中LLM功能的第一项工作，这是事实知识的批判但毫无变化的维度。 ADAMDB的核心是一个多语言和多模式数据集，涵盖了超过400万个地理，时间和职业的人，而Adambench则根据Bloom的分类法提供了认知结构化的评估，涵盖英语和母语的六个推理水平。为了解决幻觉，尤其是对于鲜为人知的人，我们提出了Adamrag，这是一种根据传记环境量身定制的检索型发电系统。实验表明，Adamrag基本上改善了开源模型，并适度地有益于封闭源的模型，而低阶推理的收益最大。受欢迎程度强烈介导精度，而通过面部图像的多模式输入比检索更小，一致。亚当（Adam）建立了在认知，文化和多模式基础的传记评估上的第一个基准和框架，从而推进了多语言，准确和抗幻觉的MLLM的发展。

Title: AI Brown and AI Koditex: LLM-Generated Corpora Comparable to Traditional Corpora of English and Czech Texts

Authors: Jiří Milička, Anna Marklová, Václav Cvrček
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.22996
Pdf URL: https://arxiv.org/pdf/2509.22996
Copy Paste: [[2509.22996]] AI Brown and AI Koditex: LLM-Generated Corpora Comparable to Traditional Corpora of English and Czech Texts(https://arxiv.org/abs/2509.22996)
Keywords: language model, gpt, llm
Abstract: This article presents two corpora of English and Czech texts generated with large language models (LLMs). The motivation is to create a resource for comparing human-written texts with LLM-generated text linguistically. Emphasis was placed on ensuring these resources are multi-genre and rich in terms of topics, authors, and text types, while maintaining comparability with existing human-created corpora. These generated corpora replicate reference human corpora: BE21 by Paul Baker, which is a modern version of the original Brown Corpus, and the Koditex corpus, which also follows the Brown Corpus tradition but in Czech. The new corpora were generated using models from OpenAI, Anthropic, Alphabet, Meta, and DeepSeek, ranging from GPT-3 (davinci-002) to GPT-4.5, and are tagged according to the Universal Dependencies standard (i.e., they are tokenized, lemmatized, and morphologically and syntactically annotated). The subcorpus size varies according to the model used (the English part contains on average 864k tokens per model, 27M tokens altogether; the Czech part contains on average 768k tokens per model, 21.5M tokens altogether). The corpora are freely available for download under the CC BY 4.0 license (the annotated data are under the CC BY-NC-SA 4.0 license) and are also accessible through the search interface of the Czech National Corpus.
摘要：本文介绍了用大语言模型（LLM）生成的两个英语和捷克文本。动机是创建一种资源，以语言上的人为写的文本与LLM生成的文本进行比较。重点是确保这些资源在主题，作者和文本类型方面具有多种多样而丰富，同时保持与现有的人类创建的Corpora的可比性。这些生成的Corpora复制参考文献：Paul Baker的BE21，Paul Baker是原始Brown Corpus的现代版本，Koditex语料库也遵循Brown Corpus的传统，但在捷克语中。使用OpenAI，拟人，字母，元和DeepSeek的型号生成了新的Corpora，从GPT-3（Davinci-002）到GPT-4.5，并根据普遍的依赖性标准（即它们被象征性，易于识别，以及形式上和构成和构成）标记。亚库pus的大小根据所使用的模型而变化（英文部分平均包含864K令牌，每款型号为2700万代币，平均每个型号的捷克零件contcontains，每个型号为768K令牌，完全具有2150万个标记）。该公司可根据CC by 4.0许可免费下载（带注释的数据在CC BY-NC-SA 4.0许可下），并且也可以通过捷克国家语料库的搜索接口访问。

Title: Look Back to Reason Forward: Revisitable Memory for Long-Context LLM Agents

Authors: Yaorui Shi, Yuxin Chen, Siyuan Wang, Sihang Li, Hengxing Cai, Qi Gu, Xiang Wang, An Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.23040
Pdf URL: https://arxiv.org/pdf/2509.23040
Copy Paste: [[2509.23040]] Look Back to Reason Forward: Revisitable Memory for Long-Context LLM Agents(https://arxiv.org/abs/2509.23040)
Keywords: language model, llm, agent
Abstract: Large language models face challenges in long-context question answering, where key evidence for a query may be dispersed across millions of tokens. Existing works equip large language models with a memory corpus that is dynamically updated during a single-pass document scan, an approach also known as "memorize while reading". While this approach scales efficiently, it suffers from irreversible forward-only processing, information loss through overwriting, and sparse reinforcement learning signals. To tackle these challenges, we present ReMemR1, a memory-augmented agent with callback-enhanced memory that allows selective retrieval from the entire memory history and allows non-linear reasoning and revisiting of early evidence. To further strengthen training, we propose Reinforcement Learning with Multi-Level Rewards (RLMLR), which combines final-answer rewards with dense, step-level signals that guide effective memory use. Together, these contributions mitigate information degradation, improve supervision, and support multi-hop memory utilization. Experiments on long-document QA show significant gains over existing memory-based approaches, which validates ReMemR1 as an effective solution for long-context reasoning agents.
摘要：大型语言模型在长期以下的问题回答中面临挑战，在这种情况下，查询的主要证据可能会在数百万个令牌中散布。现有作品为大型语言模型配备了一种记忆语料库，该内存语料库在单通行文档扫描中进行了动态更新，也称为“读取时记忆”方法。尽管这种方法有效地扩展，但它遭受了不可逆转的远期处理，通过覆盖的信息丢失以及稀疏的增强学习信号。为了应对这些挑战，我们提出了一种带有回调增强内存的内存仪器的rememr1，可以从整个内存历史记录中选择性检索，并允许非线性推理和重新审视早期证据。为了进一步加强培训，我们建议使用多层次奖励（RLMLR）进行增强学习，该学习结合了最终回答的奖励与密集的，步进的信号，以指导有效的内存使用。这些贡献共同减轻信息降解，改善监督并支持使用多跳的内存。长期文档质量检查的实验显示了现有基于内存的方法的显着增长，这将rememr1验证为长篇文化推理剂的有效解决方案。

Title: Peacemaker or Troublemaker: How Sycophancy Shapes Multi-Agent Debate

Authors: Binwei Yao, Chao Shang, Wanyu Du, Jianfeng He, Ruixue Lian, Yi Zhang, Hang Su, Sandesh Swamy, Yanjun Qi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.23055
Pdf URL: https://arxiv.org/pdf/2509.23055
Copy Paste: [[2509.23055]] Peacemaker or Troublemaker: How Sycophancy Shapes Multi-Agent Debate(https://arxiv.org/abs/2509.23055)
Keywords: language model, llm, agent
Abstract: Large language models (LLMs) often display sycophancy, a tendency toward excessive agreeability. This behavior poses significant challenges for multi-agent debating systems (MADS) that rely on productive disagreement to refine arguments and foster innovative thinking. LLMs' inherent sycophancy can collapse debates into premature consensus, potentially undermining the benefits of multi-agent debate. While prior studies focus on user-LLM sycophancy, the impact of inter-agent sycophancy in debate remains poorly understood. To address this gap, we introduce the first operational framework that (1) proposes a formal definition of sycophancy specific to MADS settings, (2) develops new metrics to evaluate the agent sycophancy level and its impact on information exchange in MADS, and (3) systematically investigates how varying levels of sycophancy across agent roles (debaters and judges) affect outcomes in both decentralized and centralized debate frameworks. Our findings reveal that sycophancy is a core failure mode that amplifies disagreement collapse before reaching a correct conclusion in multi-agent debates, yields lower accuracy than single-agent baselines, and arises from distinct debater-driven and judge-driven failure modes. Building on these findings, we propose actionable design principles for MADS, effectively balancing productive disagreement with cooperation in agent interactions.
摘要：大型语言模型（LLMS）通常显示出粘粘性，这是过度同意性的趋势。这种行为对依靠生产性分歧来完善论点并培养创新思维的多代理辩论系统（MAD）提出了重大挑战。 LLMS的固有的无粘合物可以将辩论崩溃为过早共识，从而破坏多代理辩论的益处。虽然先前的研究集中于用户-LLM sycophancy，但互联网间综合症在辩论中的影响仍然很少了解。 To address this gap, we introduce the first operational framework that (1) proposes a formal definition of sycophancy specific to MADS settings, (2) develops new metrics to evaluate the agent sycophancy level and its impact on information exchange in MADS, and (3) systematically investigates how varying levels of sycophancy across agent roles (debaters and judges) affects outcomes in both decentralized and centralized debate frameworks.我们的发现表明，粘粘剂是一种核心故障模式，它在多代理辩论中得出正确的结论之前会放大分歧崩溃，比单位基准的得出的准确性更低，并且是由不同的辩论驱动和法官驱动的失败模式引起的。在这些发现的基础上，我们提出了针对疯狂的可行设计原则，从而有效地平衡了生产力的分歧与代理互动中的合作。

Title: Semantic Voting: A Self-Evaluation-Free Approach for Efficient LLM Self-Improvement on Unverifiable Open-ended Tasks

Authors: Chunyang Jiang, Yonggang Zhang, Yiyang Cai, Chi-Min Chan, Yulong Liu, Mingming Chen, Wei Xue, Yike Guo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.23067
Pdf URL: https://arxiv.org/pdf/2509.23067
Copy Paste: [[2509.23067]] Semantic Voting: A Self-Evaluation-Free Approach for Efficient LLM Self-Improvement on Unverifiable Open-ended Tasks(https://arxiv.org/abs/2509.23067)
Keywords: language model, llm
Abstract: The rising cost of acquiring supervised data has driven significant interest in self-improvement for large language models (LLMs). Straightforward unsupervised signals like majority voting have proven effective in generating pseudo-labels for verifiable tasks, while their applicability to unverifiable tasks (e.g., translation) is limited by the open-ended character of responses. As a result, self-evaluation mechanisms (e.g., self-judging and entropy minimization) are predominantly used to derive pseudo-labels. However, self-evaluation relying on LLMs typically incurs high computational overhead and introduces overconfidence issues due to intrinsic biases. To address these challenges, we propose a novel self-evaluation-free approach for unverifiable tasks, designed for lightweight yet effective self-improvement. Inspired by majority voting commonly employed in verifiable tasks, we propose semantic voting as a novel mechanism that relaxes the principle of hard matching (i.e., exact matching) toward soft matching (i.e., semantic similarity). Soft matching is achieved by leveraging a lightweight sentence embedding model to quantify semantic similarity, thereby mitigating excessive computational burden and intrinsic bias-associated limitations of self-evaluation. Comprehensive experiments demonstrate that our method achieves substantial gains in computational efficiency and overall better performance than self-evaluation methods across diverse model architectures and tasks.
摘要：获取监督数据的成本上升引起了人们对大语言模型（LLM）的自我完善的重大兴趣。事实证明，像大多数投票这样的直接无监督信号在生成可验证任务的伪标记方面有效，而其适用于无法验证的任务（例如，翻译）的适用性受响应的开放性特征的限制。结果，自我评估机制（例如，自判断和熵最小化）主要用于推导伪标签。但是，依靠LLM的自我评估通常会导致高计算开销，并由于内在偏见引起过度自信问题。为了应对这些挑战，我们为无法验证的任务提出了一种新颖的无自我评估方法，该方法是为轻巧但有效的自我完善而设计的。受到可验证任务中通常采用的多数投票的启发，我们提出语义投票是一种新型机制，它放松了对软匹配（即语义相似性）的硬匹配（即精确匹配）的原理。软匹配是通过利用轻巧的句子嵌入模型来量化语义相似性来实现的，从而减轻了过度的计算负担和自我评估的内在偏见相关局限性。全面的实验表明，与各种模型体系结构和任务之间的自我评估方法相比，我们的方法在计算效率和总体性能方面取得了可观的提高。
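
The soft-matching idea in the abstract above can be illustrated with a small sketch: instead of counting exact matches, each candidate response is scored by its average semantic similarity to the other candidates, and the highest-scoring one becomes the pseudo-label. The abstract does not give the exact voting rule, so the mean-pairwise-cosine rule below is an assumption. The `embed` function is a toy bag-of-words stand-in for the lightweight sentence embedding model, and all names are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a sentence-embedding model: bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_vote(candidates):
    """Index of the candidate with the highest mean similarity to the others.

    Assumed voting rule for illustration; ties go to the earliest candidate.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    vectors = [embed(c) for c in candidates]
    best_index, best_score = 0, -1.0
    for i, vi in enumerate(vectors):
        others = [cosine(vi, vj) for j, vj in enumerate(vectors) if j != i]
        score = sum(others) / len(others) if others else 0.0
        if score > best_score:
            best_index, best_score = i, score
    return best_index


# Hypothetical sampled responses: two paraphrases and one outlier.
candidates = [
    "the cat sits on the mat",
    "a cat is sitting on the mat",
    "stock prices fell sharply today",
]
winner = semantic_vote(candidates)
print(candidates[winner])
```

With exact matching, none of these three responses would receive more than one vote. Under soft matching, the two paraphrases support each other and the unrelated response is never selected.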

Title: From Evidence to Trajectory: Abductive Reasoning Path Synthesis for Training Retrieval-Augmented Generation Agents

Authors: Muzhi Li, Jinhu Qi, Yihong Wu, Minghao Zhao, Liheng Ma, Yifan Li, Xinyu Wang, Yingxue Zhang, Ho-fung Leung, Irwin King
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.23071
Pdf URL: https://arxiv.org/pdf/2509.23071
Copy Paste: [[2509.23071]] From Evidence to Trajectory: Abductive Reasoning Path Synthesis for Training Retrieval-Augmented Generation Agents(https://arxiv.org/abs/2509.23071)
Keywords: language model, llm, retrieval-augmented generation, chain-of-thought, agent
Abstract: The development of retrieval-augmented generation (RAG) agents is hindered by the lack of process-level supervision to effectively guide agentic capabilities like task decomposition, retriever invocation, and stepwise decision-making. While reinforcement learning offers a potential solution, it suffers from sparse rewards and the limited reasoning capabilities of large language models (LLMs). Meanwhile, existing data synthesis methods only produce chain-of-thought rationales and fail to model environmental interactions. In this paper, we propose EviPath, an evidence-anchored reasoning path synthesis paradigm for RAG agent development. EviPath comprises: (i) Abductive Subtask Planning, which decomposes the problem into sub-questions and iteratively plans an optimal solution path based on the dependencies between them; (ii) Faithful Sub-question Answering, which uses supporting evidence to construct a proxy environment to generate reasoning thoughts and answers for each sub-question; and (iii) Conversational Fine-Tuning, which formats the complete agent-environment interaction trajectory into a dialogue format suitable for Supervised Fine-Tuning. EviPath allows LLMs to learn complex reasoning and tool-use capabilities directly from synthesized data. Extensive experiments on widely-used question-answering benchmarks show that an 8B parameter model trained with EviPath-synthesized data significantly and consistently outperforms state-of-the-art baselines with a double-digit absolute EM gain of 14.7% in open-domain question answering.
摘要：由于缺乏过程级的监督，可以有效地指导特工能力，例如任务分解，回猎犬调用和逐步决策。尽管增强学习提供了潜在的解决方案，但它具有稀疏的奖励和大型语言模型（LLMS）的有限推理能力。同时，现有的数据合成方法仅产生经过思考链的理由，并且无法对环境相互作用进行建模。在本文中，我们提出了Evipath，这是一种证据锚定的推理路径综合范围，用于破布剂的开发。 Evipath包括：（i）绑架子任务计划，将问题分解为子问题，并迭代地计划了基于它们之间依赖关系的最佳解决方案路径；（ii）忠实的子问题回答，该回答使用支持证据来构建代理环境，为每个子问题产生推理思想和答案；（iii）会话微调，将完整的环境相互作用轨迹格式化为适合监督微调的对话格式。 Evipath允许LLM直接从合成数据中学习复杂的推理和工具使用功能。对广泛使用的提问基准的广泛实验表明，一个8B参数模型通过Expath-synthised数据训练，并且始终超过了最先进的基准，其双位数的绝对EM增益为14.7％。

Title: The Geometry of Creative Variability: How Credal Sets Expose Calibration Gaps in Language Models

Authors: Esteban Garces Arias, Julian Rodemann, Christian Heumann
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.23088
Pdf URL: https://arxiv.org/pdf/2509.23088
Copy Paste: [[2509.23088]] The Geometry of Creative Variability: How Credal Sets Expose Calibration Gaps in Language Models(https://arxiv.org/abs/2509.23088)
Keywords: language model, prompt
Abstract: Understanding uncertainty in large language models remains a fundamental challenge, particularly in creative tasks where multiple valid outputs exist. We present a geometric framework using credal sets - convex hulls of probability distributions - to quantify and decompose uncertainty in neural text generation, calibrated against human creative variation. Analyzing 500 creative writing prompts from the WritingPrompts dataset with 10 unique human continuations each, we evaluate four language models across five decoding strategies, generating 100,000 stories. Our credal set analysis reveals substantial gaps in capturing human creative variation, with the best model-human calibration reaching only 0.434 (Gemma-2B with temperature 0.7). We decompose total uncertainty into epistemic and aleatoric components, finding that the choice of decoding strategy contributes 39.4% to 72.0% of total epistemic uncertainty. Model scale shows weak correlation with calibration quality and no significant difference exists between base and instruction-tuned models in calibration quality. Our geometric framework provides actionable insights for improving generation systems for human-AI creative alignment. We release our complete experimental framework.
摘要：了解大语模型中的不确定性仍然是一个基本挑战，尤其是在存在多个有效输出的创意任务中。我们使用信用集（概率分布的凸头）提出了一个几何框架，以量化和分解神经文本生成的不确定性，并针对人类创造性变化进行了校准。分析WritingerPrompts数据集中的500个创意写作提示，每个数据集都具有10个独特的人类连续性，我们在五种解码策略中评估了四种语言模型，产生了100,000个故事。我们的信用集分析揭示了捕获人类创造性变化的巨大差距，最佳模型人类校准仅达到0.434（温度为0.7的Gemma-2b）。我们将完全不确定性分解为认知和核心成分，发现解码策略的选择占了认知不确定性总数的39.4％至72.0％。模型量表显示出与校准质量较弱的相关性，并且在校准质量方面的基础和指令调整模型之间没有显着差异。我们的几何框架提供了可行的见解，以改善人为创造性一致性的生成系统。我们发布完整的实验框架。
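
To make the notion of a credal set in the abstract above concrete, here is a small sketch under stated assumptions. It treats a credal set as the convex hull of a few next-outcome distributions (for example, one per decoding strategy) and computes the lower and upper probability of an event. Because an event's probability is linear in the distribution, the minimum and maximum over the hull are attained at the generating distributions. The function names and toy numbers are illustrative; the paper's calibration score and its epistemic/aleatoric decomposition are not reproduced here.

```python
from typing import Dict, List, Set, Tuple

Distribution = Dict[str, float]


def event_probability(dist: Distribution, event: Set[str]) -> float:
    """Probability mass that a single distribution assigns to an event."""
    return sum(p for outcome, p in dist.items() if outcome in event)


def lower_upper_probability(
    generators: List[Distribution], event: Set[str]
) -> Tuple[float, float]:
    """Lower/upper probability of an event over the convex hull of `generators`.

    Linearity of probability in the distribution means the extremes over
    the hull are reached at the generating distributions themselves.
    """
    probs = [event_probability(d, event) for d in generators]
    return min(probs), max(probs)


# Toy distributions over three continuations, one per decoding strategy.
greedy_like = {"a": 0.5, "b": 0.25, "c": 0.25}
sampling_like = {"a": 0.25, "b": 0.25, "c": 0.5}
bounds = lower_upper_probability([greedy_like, sampling_like], {"a", "b"})
```

The width of the interval (here between 0.5 and 0.75) is one simple way to read how much the choice of distribution within the set matters for a given event.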

Title: d$^2$Cache: Accelerating Diffusion-Based LLMs via Dual Adaptive Caching

Authors: Yuchu Jiang, Yue Cai, Xiangzhong Luo, Jiale Fu, Jiarui Wang, Chonghan Liu, Xu Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.23094
Pdf URL: https://arxiv.org/pdf/2509.23094
Copy Paste: [[2509.23094]] d$^2$Cache: Accelerating Diffusion-Based LLMs via Dual Adaptive Caching(https://arxiv.org/abs/2509.23094)
Keywords: language model, llm
Abstract: Diffusion-based large language models (dLLMs), despite their promising performance, still suffer from inferior inference efficiency. This is because dLLMs rely on bidirectional attention and cannot directly benefit from the standard key-value (KV) cache as autoregressive models (ARMs) do. To tackle this issue, we introduce Dual aDaptive Cache (d$^2$Cache), which is a training-free approximate KV cache framework for accelerating dLLM inference. d$^2$Cache features a two-stage fine-grained selection strategy to identify tokens and adaptively update their KV states at each decoding step, while caching the KV states of the remaining tokens for reuse. Furthermore, d$^2$Cache naturally offers a more reliable decoding alternative, which can enable quasi left-to-right generation and mitigate premature overconfidence in tokens at the end of the sequence. Extensive experimental results on two representative dLLMs (i.e., LLaDA and Dream) demonstrate that d$^2$Cache not only achieves substantial inference speedups, but also yields consistent improvements in generation quality. The code is available at this https URL.
摘要：基于扩散的大语言模型（dLLM）尽管性能可观，但推理效率仍然较低。这是因为dLLM依赖双向注意力，无法像自回归模型（ARM）那样直接受益于标准的键值（KV）缓存。为了解决这个问题，我们提出了Dual aDaptive Cache（d$^2$Cache），这是一个无需训练的近似KV缓存框架，用于加速dLLM推理。d$^2$Cache采用两阶段细粒度选择策略，在每个解码步骤中识别token并自适应地更新其KV状态，同时缓存其余token的KV状态以供复用。此外，d$^2$Cache自然地提供了一种更可靠的解码方案，能够实现准从左到右的生成，并缓解对序列末尾token的过早过度自信。在两个具有代表性的dLLM（即LLaDA和Dream）上的大量实验结果表明，d$^2$Cache不仅实现了显著的推理加速，还带来了生成质量的持续提升。代码可在此https URL获取。

Title: How to Make Large Language Models Generate 100% Valid Molecules?

Authors: Wen Tao, Jing Tang, Alvin Chan, Bryan Hooi, Baolong Bi, Nanyun Peng, Yuansheng Liu, Yiwei Wang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2509.23099
Pdf URL: https://arxiv.org/pdf/2509.23099
Copy Paste: [[2509.23099]] How to Make Large Language Models Generate 100% Valid Molecules?(https://arxiv.org/abs/2509.23099)
Keywords: language model, llm
Abstract: Molecule generation is key to drug discovery and materials science, enabling the design of novel compounds with specific properties. Large language models (LLMs) can learn to perform a wide range of tasks from just a few examples. However, generating valid molecules using representations like SMILES is challenging for LLMs in few-shot settings. In this work, we explore how LLMs can generate 100% valid molecules. We evaluate whether LLMs can use SELFIES, a representation where every string corresponds to a valid molecule, for valid molecule generation but find that LLMs perform worse with SELFIES than with SMILES. We then examine LLMs' ability to correct invalid SMILES and find their capacity limited. Finally, we introduce SmiSelf, a cross-chemical language framework for invalid SMILES correction. SmiSelf converts invalid SMILES to SELFIES using grammatical rules, leveraging SELFIES' mechanisms to correct the invalid SMILES. Experiments show that SmiSelf ensures 100% validity while preserving molecular characteristics and maintaining or even enhancing performance on other metrics. SmiSelf helps expand LLMs' practical applications in biomedicine and is compatible with all SMILES-based generative models. Code is available at this https URL.
摘要：分子生成是药物发现和材料科学的关键，它使设计具有特定性质的新型化合物成为可能。大型语言模型（LLM）可以仅凭少量示例学会执行各种任务。然而，在少样本设置下，使用SMILES等表示生成有效分子对LLM来说具有挑战性。在这项工作中，我们探讨LLM如何生成100%有效的分子。我们评估了LLM能否使用SELFIES（一种每个字符串都对应一个有效分子的表示）来生成有效分子，但发现LLM使用SELFIES时的表现不如使用SMILES。随后，我们考察了LLM纠正无效SMILES的能力，发现其能力有限。最后，我们提出了SmiSelf，这是一个用于纠正无效SMILES的跨化学语言框架。SmiSelf利用语法规则将无效SMILES转换为SELFIES，借助SELFIES的机制来纠正无效SMILES。实验表明，SmiSelf可确保100%的有效性，同时保留分子特性，并在其他指标上保持甚至提升性能。SmiSelf有助于拓展LLM在生物医学中的实际应用，并与所有基于SMILES的生成模型兼容。代码可在此https URL获取。

Title: Non-Collaborative User Simulators for Tool Agents

Authors: Jeonghoon Shim, Woojung Song, Cheyon Jin, Seungwon KooK, Yohan Jo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.23124
Pdf URL: https://arxiv.org/pdf/2509.23124
Copy Paste: [[2509.23124]] Non-Collaborative User Simulators for Tool Agents(https://arxiv.org/abs/2509.23124)
Keywords: hallucination, agent
Abstract: Tool agents interact with users through multi-turn dialogues to accomplish various tasks. Recent studies have adopted user simulation methods to develop these agents in multi-turn settings. However, existing user simulators tend to be agent-friendly, exhibiting only cooperative behaviors, which fails to train and test agents against non-collaborative users in the real world. To address this, we propose a novel user simulator architecture that simulates four categories of non-collaborative behaviors: requesting unavailable services, digressing into tangential conversations, expressing impatience, and providing incomplete utterances. Our user simulator can simulate challenging and natural non-collaborative behaviors while reliably delivering all intents and information necessary to accomplish the task. Our experiments on MultiWOZ and $\tau$-bench reveal significant performance degradation in state-of-the-art tool agents when encountering non-collaborative users. We provide detailed analyses of agents' weaknesses under each non-collaborative condition, such as escalated hallucinations and dialogue breakdowns. Ultimately, we contribute an easily extensible user simulation framework to help the research community develop tool agents and preemptively diagnose them under challenging real-world conditions within their own services. (ICLR 2026 conference submission, 19 Sept 2025; modified 25 Sept 2025; CC BY 4.0.)
摘要：工具代理通过多轮对话与用户交互，以完成各种任务。近期研究采用用户模拟方法，在多轮设置中开发这些代理。然而，现有的用户模拟器往往对代理友好，只表现出合作行为，因而无法针对现实世界中的非协作用户来训练和测试代理。为了解决这个问题，我们提出了一种新颖的用户模拟器架构，可模拟四类非协作行为：请求不可用的服务、偏离到无关的闲聊、表达不耐烦以及提供不完整的话语。我们的用户模拟器能够模拟具有挑战性且自然的非协作行为，同时可靠地传达完成任务所需的全部意图和信息。我们在MultiWOZ和$\tau$-bench上的实验表明，最先进的工具代理在遇到非协作用户时性能显著下降。我们详细分析了代理在每种非协作条件下的弱点，例如幻觉加剧和对话中断。最终，我们贡献了一个易于扩展的用户模拟框架，帮助研究社区开发工具代理，并在其自身服务的具有挑战性的真实条件下预先对其进行诊断。（ICLR 2026 会议投稿，2025年9月19日提交，2025年9月25日修改；CC BY 4.0。）

Title: Tagging the Thought: Unlocking Personalization Reasoning via Reinforcement Learning

Authors: Song Jin, Juntian Zhang, Yong Liu, Xun Zhang, Yufei Zhang, Fei Jiang, Guojun Yin, Wei Lin, Rui Yan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.23140
Pdf URL: https://arxiv.org/pdf/2509.23140
Copy Paste: [[2509.23140]] Tagging the Thought: Unlocking Personalization Reasoning via Reinforcement Learning(https://arxiv.org/abs/2509.23140)
Keywords: language model, llm
Abstract: Recent advancements have endowed Large Language Models (LLMs) with impressive general reasoning capabilities, yet they often struggle with personalization reasoning: the crucial ability to analyze user history, infer unique preferences, and generate tailored responses. To address this limitation, we introduce TagPR, a novel training framework that significantly enhances an LLM's intrinsic capacity for personalization reasoning through a "tagging the thought" approach. Our method first develops a data-driven pipeline to automatically generate and semantically label reasoning chains, creating a structured dataset that fosters interpretable reasoning. We then propose a synergistic training strategy that begins with Supervised Fine-Tuning (SFT) on this tagged data to establish foundational reasoning patterns, followed by a multi-stage reinforcement learning (RL) process. This RL phase is guided by a unique composite reward signal, which integrates tag-based constraints and a novel Personalization Reward Model with User Embeddings (PRMU) to achieve fine-grained alignment with user-specific logic. Extensive experiments on the public LaMP benchmark and a self-constructed dataset demonstrate that our approach achieves state-of-the-art results, delivering an average improvement of 32.65% over the base model across all tasks. Our work validates that structured, interpretable reasoning is a highly effective pathway to unlocking genuine personalization capabilities in LLMs.
摘要：最近的进步已将大型语言模型（LLM）赋予了令人印象深刻的一般推理能力，但他们经常在个性化推理中挣扎 - 分析用户历史记录，推断独特的偏好并产生量身定制的回应的关键能力。为了解决这一限制，我们介绍了TAGPR，这是一个新颖的培训框架，可以通过标记思想方法来大大提高LLM的固有能力来个性化推理。我们的方法首先开发了一个数据驱动的管道，以自动生成和语义标记推理链，从而创建一个结构化数据集，以促进可解释的推理。然后，我们提出了一种协同的培训策略，该策略是从该标记数据的监督微调（SFT）开始，以建立基本的推理模式，然后进行多阶段增强学习（RL）过程。此RL阶段以独特的综合奖励信号为指导，该信号将基于标签的约束和新颖的个性化奖励模型与用户嵌入（PRMU）集成在一起，以实现与用户特定逻辑的细粒度对齐。对公共灯基准和自我结构的数据集进行了广泛的实验表明，我们的方法可实现最先进的结果，在所有任务中，平均提高了32.65％。我们的工作验证了结构化的，可解释的推理是解锁LLM中真正个性化功能的高效途径。

Title: Tree Reward-Aligned Search for TReASURe in Masked Diffusion Language Models

Authors: Zichao Yu, Ming Li, Wenyi Zhang, Weiguo Gao
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2509.23146
Pdf URL: https://arxiv.org/pdf/2509.23146
Copy Paste: [[2509.23146]] Tree Reward-Aligned Search for TReASURe in Masked Diffusion Language Models(https://arxiv.org/abs/2509.23146)
Keywords: language model
Abstract: Tree search has recently emerged as a powerful framework for aligning generative models with task-specific rewards at test time. Applying tree search to Masked Diffusion Language Models, however, introduces two key challenges: (i) parallel unmasking yields highly correlated branches, limiting exploration, and (ii) reward evaluation via sampled completions produces high-variance estimates, making pruning unstable. We propose TReASURe, a tree-search test-time alignment method that addresses these issues. It introduces (i) UnmaskBranch, a branching strategy based on first-hitting unmasking that diversifies both token content and reveal order with a single model call per parent node, and (ii) ResubstituteScore, a pruning rule that uses deterministic resubstitution to score partially masked sequences with low-variance proxy completions. Theoretically, we quantify branching efficiency gains in NFEs (number of function evaluations), show that the scoring rule approximates the true reward with error bounded by predictive uncertainty, and prove improvements with larger tree widths. Empirically, TReASURe achieves state-of-the-art results on perplexity, linguistic acceptability, and control of sentiment and toxicity, outperforming prior methods under matched compute budgets, with especially strong gains in low-NFE regimes.
摘要：最近，树木搜索是将生成模型与特定于任务的奖励对齐的强大框架。但是，将树搜索应用于掩盖的扩散语言模型，引入了两个关键挑战：（i）平行拆卸产量高度相关的分支，限制探索，以及（ii）通过采样完成的奖励评估可产生高变化的估计，从而使原先不稳定。我们提出了宝藏，这是一种解决这些问题的树木搜索测试时间对齐方法。它介绍了（i）Unmaskbranch，这是一种基于首次击打的分支策略，该策略可以通过每个父节点进行单个模型调用，以及（ii）ResubStitutesCore，将订单多样化，并揭示订单，（ii）使用确定性的重新确定来对低相位较低的稳定性进行部分掩盖的序列，将其重新掩盖。从理论上讲，我们量化了NFE中的分支效率提高（功能评估次数），表明评分规则近似于由预测不确定性限制的误差的真实奖励，并证明具有较大树宽度的改进。从经验上讲，宝藏取得了最先进的结果，其结果是在匹配的计算预算下的困惑，语言可接受性以及对情感和毒性的控制，在匹配的计算预算下表现优于先前的方法，在低NFE制度中的增长尤其强劲。

Title: Test-Time Policy Adaptation for Enhanced Multi-Turn Interactions with LLMs

Authors: Chenxing Wei, Hong Wang, Ying He, Fei Yu, Yao Shu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.23166
Pdf URL: https://arxiv.org/pdf/2509.23166
Copy Paste: [[2509.23166]] Test-Time Policy Adaptation for Enhanced Multi-Turn Interactions with LLMs(https://arxiv.org/abs/2509.23166)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) employ multi-turn interaction as a fundamental paradigm for completing complex tasks. However, their performance often degrades in extended interactions, as they are typically trained on static, single-turn data, which hinders their ability to adapt to real-time user feedback. To address this limitation, we first propose a new paradigm: Test-Time Policy Adaptation for Multi-Turn Interactions (T2PAM), which utilizes user feedback from the ongoing interaction as a reward signal to estimate a latent optimal policy aligned with user preferences, then updates a small subset of parameters to steer the model toward this policy, ultimately enabling efficient in-conversation self-correction. We then introduce Optimum-Referenced One-Step Adaptation (ROSA), a lightweight algorithm that operationalizes T2PAM. ROSA guides the model parameters toward a theoretical optimal policy in a single, efficient update step, avoiding costly iterative gradient-based optimization and minimizing computational overhead. We provide a rigorous theoretical analysis guaranteeing that the policy of ROSA converges to the user's preference as the number of interactions increases. Extensive experiments on challenging benchmarks demonstrate that ROSA achieves significant improvements in both task effectiveness and efficiency.
摘要：大型语言模型（LLM）将多轮交互作为完成复杂任务的基本范式。然而，它们的性能在长时间交互中往往会下降，因为它们通常是在静态的单轮数据上训练的，这阻碍了它们适应实时用户反馈的能力。为了解决这一局限，我们首先提出一种新范式：面向多轮交互的测试时策略适应（T2PAM），它将正在进行的交互中的用户反馈用作奖励信号，以估计与用户偏好一致的潜在最优策略，然后更新一小部分参数，使模型向该策略靠拢，最终实现高效的对话内自我纠正。随后，我们提出了最优参考的一步适应（ROSA），这是一种将T2PAM付诸实践的轻量级算法。ROSA通过一次高效的更新步骤将模型参数引向理论上的最优策略，避免了代价高昂的基于梯度的迭代优化，并将计算开销降至最低。我们给出了严格的理论分析，保证随着交互次数的增加，ROSA的策略会收敛到用户的偏好。在具有挑战性的基准上进行的大量实验表明，ROSA在任务有效性和效率两方面都取得了显著提升。

Title: Pretraining LLM with Latent Thoughts in Continuous Space

Authors: Boyi Zeng, He Li, Shixiang Song, Yixuan Wang, Ziwei He, Xinbing Wang, Zhouhan Lin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.23184
Pdf URL: https://arxiv.org/pdf/2509.23184
Copy Paste: [[2509.23184]] Pretraining LLM with Latent Thoughts in Continuous Space(https://arxiv.org/abs/2509.23184)
Keywords: language model, llm, chain-of-thought
Abstract: The remarkable success of Chain-of-Thought (CoT), which enhances performance by scaling generation steps at test time, inspires us to ask: can we leverage a similar scaling of computational steps during pretraining to improve the generation of each individual token? To address this, we propose a novel pre-training methodology: Pretraining Language Models with Latent Thoughts. Our approach pretrains a language model (LM) to first generate an intermediate latent thought (the last hidden state of the current position), which is then used as input to predict the actual subsequent token. This additional computational step enables the LM to refine its prediction within unconstrained continuous space. Our experiments demonstrate that, at an identical inference cost, an LM that generates one additional latent thought per token outperforms a standard model with double the parameters. For instance, ours-1.4B (Pythia Arch), pretrained on 300B tokens from the Pile, significantly surpasses the vanilla Pythia-2.8B trained on the same data on both language modeling and a range of general downstream tasks. Furthermore, increasing the number of latent thoughts generated before each actual token, forming a chain analogous to CoT, consistently improves the model's performance.
摘要：通过在测试时间扩展生成步骤来提高绩效的思维链（COT）的显着成功激发了我们的启发：我们可以利用预算训练期间的计算步骤的类似规模来改善每个单个令牌的生成？为了解决这个问题，我们提出了一种新颖的培训方法：具有潜在思想的训练语言模型。我们的方法预告了一种语言模型（LM）首先生成中间的潜在思想 - 当前位置的最后一个隐藏状态 - 然后用作预测实际后续令牌的输入。这个附加的计算步骤使LM能够在不受约束的连续空间中完善其预测。我们的实验表明，以相同的推理成本，一个LM产生一个额外的潜在思想的LM优于参数两倍的标准模型。例如，我们的1.4b（毕曲拱）在300b代币上预估计，显着超过了对语言建模和一系列一般下游任务的相同数据培训的香草毕曲（Vanilla Pythia-2.8b）。此外，增加在每个实际令牌形成的链条之前，产生的潜在思想数量类似于COT一致，从而改善了模型的性能。

Title: Diagnose, Localize, Align: A Full-Stack Framework for Reliable LLM Multi-Agent Systems under Instruction Conflicts

Authors: Guancheng Wan, Leixin Sun, Longxu Dou, Zitong Shi, Fang Wu, Eric Hanchen Jiang, Wenke Huang, Guibin Zhang, Hejia Geng, Xiangru Tang, Zhenfei Yin, Yizhou Sun, Wei Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.23188
Pdf URL: https://arxiv.org/pdf/2509.23188
Copy Paste: [[2509.23188]] Diagnose, Localize, Align: A Full-Stack Framework for Reliable LLM Multi-Agent Systems under Instruction Conflicts(https://arxiv.org/abs/2509.23188)
Keywords: language model, llm, agent
Abstract: Large Language Model (LLM)-powered multi-agent systems (MAS) have rapidly advanced collaborative reasoning, tool use, and role-specialized coordination in complex tasks. However, reliability-critical deployment remains hindered by a systemic failure mode: hierarchical compliance under instruction conflicts (system-user, peer-peer), where agents misprioritize system-level rules in the presence of competing demands. Moreover, widely used macro-level metrics (e.g., pass@k) obscure these micro-level violations and offer little actionable guidance for remedy. In this work, we present a full-stack, three-stage framework: (1) Diagnose - Contextualized Role Adherence Score (CRAS), a query-wise, context-aware scoring metric that decomposes role adherence into four measurable dimensions; (2) Localize - attention drift analysis revealing that instruction conflicts are resolved by attention heads that are largely concentrated in middle layers; (3) Align - Surgical Alignment of Instruction Layers (SAIL), which installs LoRA only on the localized focal layers and optimizes a token-weighted DPO-style preference objective that credits tokens by their focal attentional contribution. Across standard benchmarks and MAS frameworks, our surgical approach improves instruction hierarchy compliance (e.g., +5.60% with AutoGen on MedQA) without full-model finetuning.
摘要：大型语言模型（LLM）能力的多机构系统（MAS）在复杂的任务中具有快速高级的协作推理，工具使用和角色专题协调。但是，关键性部署仍然受到系统性故障模式的阻碍：指令冲突中的层次合规性（System-user，Peer-Peer），在这种情况下，代理在存在竞争需求的情况下将系统级规则错误地规定。此外，使用广泛使用的宏观指标（例如，通过@k）掩盖了这些微观违规行为，几乎没有可行的补救指导。在这项工作中，我们提出了一个全栈，三阶段的框架：（1）诊断 - 上下文化的角色依从性评分（CRA），这是一个查询，上下文感知的评分度量，将角色依从性分解为四个可测量的维度；（2）本地化 - 注意力漂移分析表明，指导冲突是通过主要集中在中层层次的注意力头来解决的；（3）对齐 - 教学层的手术对齐（SAIL），仅在局部焦点层上安装LORA，并优化了令牌加权的DPO风格的偏好目标，该目标通过其焦点注意力贡献来表示代币。在标准的基准和MAS框架中，我们的外科手术方法改善了教学层次结构合规性（例如，在MEDQA上使用Autogen的 +5.60％），而无需全模型登录。

Title: From Harm to Help: Turning Reasoning In-Context Demos into Assets for Reasoning LMs

Authors: Haonan Wang, Weida Liang, Zihang Fu, Nie Zheng, Yifan Zhang, Yao Tong, Tongyao Zhu, Hao Jiang, Chuang Li, Jiaying Wu, Kenji Kawaguchi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.23196
Pdf URL: https://arxiv.org/pdf/2509.23196
Copy Paste: [[2509.23196]] From Harm to Help: Turning Reasoning In-Context Demos into Assets for Reasoning LMs(https://arxiv.org/abs/2509.23196)
Keywords: gpt, llm
Abstract: Recent reasoning LLMs (RLMs), especially those trained with verifier-based reinforcement learning, often perform worse with few-shot CoT than with direct answering. We revisit this paradox using high-quality reasoning traces from DeepSeek-R1 as demonstrations and find that adding more exemplars consistently degrades accuracy, even when demonstrations are optimal. A detailed analysis reveals two mechanisms behind this decline: (i) semantic misguidance, where high textual similarity leads the model to treat the target as the same as the exemplar and to copy intermediate steps verbatim; and (ii) strategy transfer failure, where the model struggles to extract useful reasoning strategies and apply them to target questions. Guided by these, we introduce Insight-to-Solve (I2S), a sequential test-time procedure that turns demonstrations into explicit, reusable insights and derives a target-specific reasoning trace; optionally, the reasoning is self-refined for coherence and correctness (I2S+). Extensive experiments on diverse benchmarks show that I2S and I2S+ consistently outperform both direct answering and test-time scaling baselines across open- and closed-source models. Even for GPT models, our method helps: on AIME'25, GPT-4.1 rises by +14.0%, and o1-mini improves by +2.7% on AIME and +1.7% on GPQA, indicating that in-context demonstrations can be harnessed effectively via an insight-refine-solve framework.
摘要：近期的推理型LLM（RLM），尤其是那些通过基于验证器的强化学习训练的模型，在少样本CoT下的表现往往不如直接作答。我们使用来自DeepSeek-R1的高质量推理轨迹作为示例重新审视这一悖论，发现即使示例是最优的，增加更多示例也会持续降低准确率。详细分析揭示了这种下降背后的两种机制：（i）语义误导，即较高的文本相似度使模型将目标问题视为与示例相同，并逐字照搬中间步骤；（ii）策略迁移失败，即模型难以提取有用的推理策略并将其应用于目标问题。基于这些发现，我们提出了Insight-to-Solve（I2S），这是一种顺序式的测试时流程，它将示例转化为明确、可复用的见解，并据此推导出针对目标问题的推理轨迹；还可以选择对推理进行自我精炼，以提高连贯性和正确性（I2S+）。在多种基准上的大量实验表明，I2S和I2S+在开源和闭源模型上均持续优于直接作答和测试时扩展基线。即使对于GPT模型，我们的方法也有帮助：在AIME'25上，GPT-4.1提升了+14.0%，o1-mini在AIME上提升了+2.7%，在GPQA上提升了+1.7%，这表明通过insight-refine-solve框架可以有效利用上下文示例。

Title: Steering Prepositional Phrases in Language Models: A Case of with-headed Adjectival and Adverbial Complements in Gemma-2

Authors: Stefan Arnold, René Gröbner
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.23204
Pdf URL: https://arxiv.org/pdf/2509.23204
Copy Paste: [[2509.23204]] Steering Prepositional Phrases in Language Models: A Case of with-headed Adjectival and Adverbial Complements in Gemma-2(https://arxiv.org/abs/2509.23204)
Keywords: language model, prompt
Abstract: Language models, when generating prepositional phrases, must often decide whether their complements function as an instrumental adjunct (describing the verb adverbially) or an attributive modifier (enriching the noun adjectivally), yet the internal mechanisms that resolve this split decision remain poorly understood. In this study, we conduct a targeted investigation into Gemma-2 to uncover and control the generation of prepositional complements. We assemble a prompt suite containing with-headed prepositional phrases whose contexts equally accommodate either an instrumental or attributive continuation, revealing a strong preference for an instrumental reading at a ratio of 3:4. To pinpoint individual attention heads that favor instrumental over attributive complements, we project activations into the vocabulary space. By scaling the value vector of a single attention head, we can shift the distribution of functional roles of complements, attenuating instruments to 33% while elevating attributes to 36%.
摘要：语言模型在生成介词短语时，必须经常决定其补充是乐器辅助功能（描述动词副词）还是属性修饰符（以形容词来富集名词），但解决此拆分决策的内部机制仍然不足。在这项研究中，我们对Gemma-2进行了针对性的研究，以发现和控制介词补充的产生。我们组装了一个迅速的套件，该套件包含有头位的介词短语，其上下文同样容纳了仪器或归因性延续，从而揭示了以3：4比例以3：4的仪器读数的强烈偏爱。为了确定有利于工具而不是属性补充的个人注意力负责人，我们将激活投射到词汇空间中。通过缩放单个注意力头的价值向量，我们可以将补件功能作用的分布转移，将仪器衰减到33％，同时将属性提升到36％。

Title: PARL-MT: Learning to Call Functions in Multi-Turn Conversation with Progress Awareness

Authors: Huacan Chai, Zijie Cao, Maolin Ran, Yingxuan Yang, Jianghao Lin, pengxin, Hairui Wang, Renjie Ding, Ziyu Wan, Muning Wen, Weiwen Liu, Weinan Zhang, Fei Huang, Ying Wen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.23206
Pdf URL: https://arxiv.org/pdf/2509.23206
Copy Paste: [[2509.23206]] PARL-MT: Learning to Call Functions in Multi-Turn Conversation with Progress Awareness(https://arxiv.org/abs/2509.23206)
Keywords: language model, llm
Abstract: Large language models (LLMs) have achieved impressive success in single-turn function calling, yet real-world applications such as travel planning or multi-stage data analysis typically unfold across multi-turn conversations. In these settings, LLMs must not only issue accurate function calls at each step but also maintain progress awareness, the ability to summarize past interactions and plan future actions to ensure coherent, long-horizon task execution. Existing approaches, however, either reduce multi-turn training to isolated single-turn samples, which neglects task-level planning, or employ end-to-end reinforcement learning (RL) that struggles with redundancy and lacks explicit integration of progress awareness. To overcome these limitations, we introduce PARL-MT, a framework that explicitly incorporates progress awareness into LLM training for multi-turn function calling. PARL-MT combines (i) a Progress Awareness Generation (PAG) pipeline, which automatically constructs datasets coupling conversation summaries with future task planning, and (ii) a Progress Awareness-Guided Reinforcement Learning (PAG-RL) algorithm, which integrates progress awareness into RL training to reduce contextual redundancy and improve alignment between local actions and global task completion. Empirical results on two public benchmarks demonstrate that PARL-MT significantly outperforms existing methods, highlighting the effectiveness of progress awareness in enabling robust and efficient multi-turn function calling.
摘要：大型语言模型（LLM）在单转功能呼叫中取得了令人印象深刻的成功，但实际上是在多转交谈中进行的实际应用程序，例如旅行计划或多阶段数据分析。在这些设置中，LLMS不仅必须在每个步骤上发出准确的功能调用，而且还必须保持进度意识，总结过去的交互和计划未来的措施以确保连贯，长期长的任务执行。但是，现有的方法要么将多转弯训练减少到孤立的单转弯样本，该样本忽略了任务级别的计划，要么采用端到端的强化学习（RL），这些学习（RL）在冗余而苦苦挣扎并且缺乏进步意识的明确整合。为了克服这些局限性，我们引入了PARL-MT，该框架将进度意识纳入LLM培训中以进行多转化功能调用。 PARL-MT结合了（i）进度意识产生（PAG）管道，该管道会自动构建数据集将对话汇总与未来的任务计划，以及（ii）进度意识引导的增强增强学习（PAG-RL）算法，将进步的进度培训整合到RL培训中，以减少RL型RL培训，以减少上下文的局部竞争和全局任务和全局任务和全局任务。两个公共基准的经验结果表明，PARL-MT显着胜过现有方法，强调了进步意识在实现强大而有效的多转弯功能调用方面的有效性。

Title: A Structured Framework for Evaluating and Enhancing Interpretive Capabilities of Multimodal LLMs in Culturally Situated Tasks

Authors: Haorui Yu, Ramon Ruiz-Dolz, Qiufeng Yi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.23208
Pdf URL: https://arxiv.org/pdf/2509.23208
Copy Paste: [[2509.23208]] A Structured Framework for Evaluating and Enhancing Interpretive Capabilities of Multimodal LLMs in Culturally Situated Tasks(https://arxiv.org/abs/2509.23208)
Keywords: language model, llm, prompt
Abstract: This study aims to test and evaluate the capabilities and characteristics of current mainstream Visual Language Models (VLMs) in generating critiques for traditional Chinese painting. To achieve this, we first developed a quantitative framework for Chinese painting critique. This framework was constructed by extracting multi-dimensional evaluative features covering evaluative stance, feature focus, and commentary quality from human expert critiques using a zero-shot classification model. Based on these features, several representative critic personas were defined and quantified. This framework was then employed to evaluate selected VLMs such as Llama, Qwen, or Gemini. The experimental design involved persona-guided prompting to assess the VLM's ability to generate critiques from diverse perspectives. Our findings reveal the current performance levels, strengths, and areas for improvement of VLMs in the domain of art critique, offering insights into their potential and limitations in complex semantic understanding and content generation tasks. The code used for our experiments can be publicly accessed at: this https URL.
摘要：这项研究旨在测试和评估当前主流视觉语言模型（VLM）在为传统中国绘画生成批评的能力和特征。为了实现这一目标，我们首先为中国绘画批评开发了定量框架。该框架是通过提取多维评估功能来构建的，该功能涵盖了使用零摄影分类模型的人类专家批评的评估姿态，功能重点和评论质量。基于这些功能，定义和量化了几个代表性的评论家角色。然后使用该框架评估所选VLM，例如Llama，Qwen或Gemini。实验设计涉及角色引导，促使评估VLM从不同角度产生批评的能力。我们的发现揭示了当前的绩效水平，优势和在艺术批评领域改善VLM的领域，从而有见识它们在复杂的语义理解和内容生成任务中的潜力和局限性。可以通过以下方式公开访问我们的实验的代码：此HTTPS URL。

Title: Detecting Corpus-Level Knowledge Inconsistencies in Wikipedia with Large Language Models

Authors: Sina J. Semnani, Jirayu Burapacheep, Arpandeep Khatua, Thanawan Atchariyachanvanit, Zheng Wang, Monica S. Lam
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.23233
Pdf URL: https://arxiv.org/pdf/2509.23233
Copy Paste: [[2509.23233]] Detecting Corpus-Level Knowledge Inconsistencies in Wikipedia with Large Language Models(https://arxiv.org/abs/2509.23233)
Keywords: language model, llm, retrieval-augmented generation, agent
Abstract: Wikipedia is the largest open knowledge corpus, widely used worldwide and serving as a key resource for training large language models (LLMs) and retrieval-augmented generation (RAG) systems. Ensuring its accuracy is therefore critical. But how accurate is Wikipedia, and how can we improve it? We focus on inconsistencies, a specific type of factual inaccuracy, and introduce the task of corpus-level inconsistency detection. We present CLAIRE, an agentic system that combines LLM reasoning with retrieval to surface potentially inconsistent claims along with contextual evidence for human review. In a user study with experienced Wikipedia editors, 87.5% reported higher confidence when using CLAIRE, and participants identified 64.7% more inconsistencies in the same amount of time. Combining CLAIRE with human annotation, we contribute WIKICOLLIDE, the first benchmark of real Wikipedia inconsistencies. Using random sampling with CLAIRE-assisted analysis, we find that at least 3.3% of English Wikipedia facts contradict another fact, with inconsistencies propagating into 7.3% of FEVEROUS and 4.0% of AmbigQA examples. Benchmarking strong baselines on this dataset reveals substantial headroom: the best fully automated system achieves an AUROC of only 75.1%. Our results show that contradictions are a measurable component of Wikipedia and that LLM-based systems like CLAIRE can provide a practical tool to help editors improve knowledge consistency at scale.
摘要：Wikipedia是最大的开放知识语料库，在全球范围内广泛使用，是训练大语言模型（LLMS）和检索功能增强生成（RAG）系统的关键资源。因此，确保其准确性至关重要。但是Wikipedia的准确性如何？我们如何改善它？我们专注于矛盾，一种特定类型的事实不准确，并介绍了语料库级别不一致检测的任务。我们提出了克莱尔（Claire），这是一种将LLM推理与检索结合到表面可能不一致的主张以及人类审查的上下文证据的代理系统。在经验丰富的Wikipedia编辑的用户研究中，使用Claire时有87.5％的人报告了较高的信心，并且参与者在相同的时间内确定了64.7％的不一致性。将克莱尔与人类注释相结合，我们贡献了维基科利德，这是真正的维基百科不一致的第一个基准。使用与克莱尔辅助分析的随机抽样，我们发现至少有3.3％的英语维基百科事实与另一个事实相矛盾，不一致地传播到7.3％的发烧和4.0％的Ambigqa示例中。该数据集上的基准测试基线揭示了大量的净空：最佳的全自动系统的AUROC仅为75.1％。我们的结果表明，矛盾是Wikipedia的可衡量组成部分，而基于LLM的系统像Claire可以提供一种实用的工具来帮助编辑提高规模的知识一致性。

Title: A2D: Any-Order, Any-Step Safety Alignment for Diffusion Language Models

Authors: Wonje Jeung, Sangyeon Yoon, Yoonjun Cho, Dongjae Jeon, Sangwoo Shin, Hyesoo Hong, Albert No
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.23286
Pdf URL: https://arxiv.org/pdf/2509.23286
Copy Paste: [[2509.23286]] A2D: Any-Order, Any-Step Safety Alignment for Diffusion Language Models(https://arxiv.org/abs/2509.23286)
Keywords: language model, llm
Abstract: Diffusion large language models (dLLMs) enable any-order generation, but this flexibility enlarges the attack surface: harmful spans may appear at arbitrary positions, and template-based prefilling attacks such as DIJA bypass response-level refusals. We introduce A2D (Any-Order, Any-Step Defense), a token-level alignment method that aligns dLLMs to emit an [EOS] refusal signal whenever harmful content arises. By aligning safety directly at the token-level under randomized masking, A2D achieves robustness to both any-decoding-order and any-step prefilling attacks under various conditions. It also enables real-time monitoring: dLLMs may begin a response but automatically terminate if unsafe continuation emerges. On safety benchmarks, A2D consistently prevents the generation of harmful outputs, slashing DIJA success rates from over 80% to near-zero (1.3% on LLaDA-8B-Instruct, 0.0% on Dream-v0-Instruct-7B), and thresholded [EOS] probabilities allow early rejection, yielding up to 19.3x faster safe termination.
摘要：扩散大语言模型（DLLM）可以使任何阶生成生成，但是这种灵活性扩大了攻击表面：有害跨度可能出现在任意位置，基于模板的预填充攻击，例如DIJA旁路响应级别的拒绝。我们引入了A2D（任何阶，任何步骤防御），即一种令牌级别对齐方法，该方法在出现有害内容时会对准DLLM，以发出[EOS]拒绝信号。通过直接在随机遮罩下直接在令牌级别上对齐，A2D可以在各种条件下对任何dececoding-norder订单和任何步骤预填充攻击都具有稳健性。它还启用实时监视：DLLM可以开始响应，但如果出现不安全的延续，则会自动终止。在安全基准上，A2D始终阻止有害产出的产生，将DIJA的成功率从80％以上降低到接近零（llada-8b-8b-Instruct的1.3％，在Dream-V0-Instruct-7b上为0.0％），并在[EOS-eos]上进行了阈值[EOS]的概率，允许早期排斥拒绝，从而产生至19.3x Faster Sakitation。

Title: Scaling Policy Compliance Assessment in Language Models with Policy Reasoning Traces

Authors: Joseph Marvin Imperial, Harish Tayyar Madabushi
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2509.23291
Pdf URL: https://arxiv.org/pdf/2509.23291
Copy Paste: [[2509.23291]] Scaling Policy Compliance Assessment in Language Models with Policy Reasoning Traces(https://arxiv.org/abs/2509.23291)
Keywords: language model, llm
Abstract: Policy compliance assessment is a fundamental task of evaluating whether an input case strictly complies with a set of human-defined rules, more generally known as policies. In practice, human experts follow a systematic, step-by-step process to identify violations with respect to specific stipulations outlined in the policy. However, such documentation of gold-standard, expert-level reasoning processes is costly to acquire. In this paper, we introduce Policy Reasoning Traces (PRT), a form of specialized generated reasoning chains that serve as a reasoning bridge to improve an LLM's policy compliance assessment capabilities. Our empirical evaluations demonstrate that the use of PRTs for both inference-time and training-time scenarios significantly enhances the performance of open-weight and commercial models, setting a new state-of-the-art for HIPAA and GDPR policies. Beyond accuracy gains, we also highlight how PRTs can improve an LLM's ability to accurately cite policy clauses, as well as influence compliance decisions through their high utilization from the raw chains of thought.
摘要：政策合规性评估是评估输入案例是否严格符合一组人类定义的规则（通常称为政策）的一项基本任务。实际上，人类专家遵循一个系统的，分步的过程，以确定政策中概述的特定规定的违规行为。但是，这种金色标准，专家级推理流程的文档是昂贵的。在本文中，我们介绍了政策推理轨迹（PRT），这是一种专业生成的推理链，是提高LLM的政策合规性评估能力的推理桥梁。我们的经验评估表明，在推理时间和培训时间方案中使用PRT可以显着提高开放量和商业模型的性能，从而为HIPAA和GDPR策略设定了新的最先进。除了准确性的提高外，我们还强调了PRT如何提高LLM准确引用政策条款的能力，并通过从思想的原始链中通过高利用来影响合规性决策。

Title: Learning to Reason in Structured In-context Environments with Reinforcement Learning

Authors: Peng Yu, Zeyuan Zhao, Shao Zhang, Luoyi Fu, Xinbing Wang, Ying Wen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.23330
Pdf URL: https://arxiv.org/pdf/2509.23330
Copy Paste: [[2509.23330]] Learning to Reason in Structured In-context Environments with Reinforcement Learning(https://arxiv.org/abs/2509.23330)
Keywords: language model, llm
Abstract: Large language models (LLMs) have achieved significant advancements in reasoning capabilities through reinforcement learning (RL) via environmental exploration. As the intrinsic properties of the environment determine the abilities that LLMs can learn, the environment plays an important role in the RL finetuning process. An ideal LLM reasoning environment should possess three core characteristics: scalability, generalizable reasoning, and verifiability. However, existing mathematical and coding environments are difficult to scale due to heavy reliance on expert annotation, while the skills learned in game-based environments are too specialized to generalize. To bridge this gap, we introduce the Structured In-context Environment (SIE) framework. SIE achieves scalability by automatically constructing reasoning environments from large-scale structured data, where the rich compositional patterns naturally support generalizable reasoning. Moreover, the explicit schemas and reasoning chains in structured data provide a foundation for rule-based verifiability. Experimental results show that the SIE framework not only achieves substantial improvements in in-domain structured reasoning, but also enables the learned compositional reasoning skills to generalize effectively to out-of-domain mathematical and logical reasoning tasks. We further explored learning in information-limited partial SIEs and found that LLMs can infer the missing information through exploring the environment, leading to robust reasoning improvements and generalization performance.
摘要：大型语言模型（LLM）通过环境探索通过增强学习（RL）在推理能力方面取得了重大进步。由于环境的固有特性决定了LLM可以学习的能力，因此环境在RL登录过程中起着重要作用。理想的LLM推理环境应具有三个核心特征：可伸缩性，可推广的推理和可验证性。但是，由于非常依赖专家注释，因此现有的数学和编码环境难以扩展，而在基于游戏的环境中学习的技能太专业而无法概括。为了弥合这一差距，我们介绍了\ textbf {s} cructured \ textbf {i} n-context \ textbf {e} nvironment（sie）框架。 SIE通过从大规模结构化数据中自动构造推理环境来实现可伸缩性，其中丰富的组成模式自然支持可概括的推理。此外，结构化数据中的显式模式和推理链为基于规则的可验证性提供了基础。实验结果表明，SIE框架不仅可以实现内域结构化推理的重大改进，而且还可以使学习的组成推理技能有效地推广到跨域外数学和逻辑推理任务。我们进一步探索了信息限制的部分SIES学习，发现LLM可以通过探索环境来推断丢失的信息，从而导致强大的推理改进和泛化性能。

Title: C-Evolve: Consensus-based Evolution for Prompt Groups

Authors: Tiancheng Li, Yuhang Wang, Zhiyang Chen, Zijun Wang, Liyuan Ma, Guo-jun Qi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.23331
Pdf URL: https://arxiv.org/pdf/2509.23331
Copy Paste: [[2509.23331]] C-Evolve: Consensus-based Evolution for Prompt Groups(https://arxiv.org/abs/2509.23331)
Keywords: gpt, prompt
Abstract: Prompt evolution algorithms offer a powerful paradigm for enhancing AI systems based on closed-source models, but few works explore whether aggregating results from multiple prompts to reach a consensus can further advance the system capability boundary. In this paper, we introduce Consensus-Evolve (C-Evolve), an evolutionary algorithm that discovers a group of prompts whose aggregated outputs after majority voting achieve optimal performance. More specifically, C-Evolve employs an island-based evolutionary algorithm to maintain population diversity, and prompts from distinct islands are selected to form groups to aggregate their outputs. The key difference from single individual evolution is a voting score, which evaluates each individual prompt's contribution within groups. We take this as the fitness score for evolution instead of individual performance. Consequently, C-Evolve is more likely to produce and maintain prompts with higher potential to form a high-performing group and eliminate low-performing ones, gradually improving the group performance after reaching consensus. Our method achieves state-of-the-art performance across a wide range of tasks, including both open-ended tasks like HotpotQA and closed-ended tasks like MATH. On Qwen3-8B, C-Evolve achieves 70.67% on HotpotQA and 43.88% on IFBench, which are 4.95% and 2.73% higher than GEPA, respectively. For GPT-4.1-mini, the accuracy on IFBench is further improved to 47.96% and reaches 95.33% in the MATH benchmark. These results demonstrate C-Evolve's competitive performance.
摘要：及时的Evolution算法提供了一个强大的范式，用于增强基于封闭源模型的AI系统，而很少有工作探讨了从多个提示获得共识的汇总结果是否可以进一步提高系统能力边界。在本文中，我们引入了共识进化（C-Evolve），这是一种进化算法，该算法发现了一组提示，其在多数投票后的汇总输出实现了最佳性能。更具体地说，C-Evolve采用基于岛屿的进化算法来维持人口多样性，并选择了来自不同岛屿的提示来组建群体以汇总其产出。与单个单独进化的关键区别在于投票得分，该评分评估每个单独的提示在组中的贡献。我们将其视为进化而不是个人表现的健身得分。因此，C-Evolve更有可能产生和维持具有更高潜力的提示，以形成高性能组并消除表现低下的群体，从而在达成共识后逐渐改善组绩效。我们的方法可以在各种任务中实现最先进的性能，包括HotPotQA等开放式任务和数学等封闭任务。在QWEN3-8B上，C-Evolve在HOTPOTQA上获得70.67％的速度和IFBench的43.88％，分别比GEPA高4.95％和2.73％。对于GPT-4.1米尼，IFBENCH的准确性进一步提高到47.96％，在数学基准中达到95.33％。这些结果证明了C-Evolve的竞争性能。
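
Illustrative sketch (not from the paper): the abstract describes aggregating the outputs of a group of prompts by majority voting and judging the group on the aggregated result. The Python below shows only that aggregation step under simple assumptions. The function names `majority_vote` and `group_accuracy`, the exact-match scoring, and the toy answers are all hypothetical; the voting score C-Evolve uses as fitness is not specified in the abstract and is not reproduced here.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


def group_accuracy(group_outputs, gold):
    """Exact-match accuracy of the majority-voted answers of a prompt group.

    group_outputs: one list of answers per prompt, aligned with `gold`.
    """
    correct = 0
    for i, expected in enumerate(gold):
        votes = [outputs[i] for outputs in group_outputs]
        if majority_vote(votes) == expected:
            correct += 1
    return correct / len(gold)


# Toy data: three hypothetical prompts answering four questions.
gold = ["a", "b", "c", "d"]
prompt_0 = ["a", "b", "x", "d"]
prompt_1 = ["a", "y", "c", "d"]
prompt_2 = ["z", "b", "c", "w"]

single = group_accuracy([prompt_0], gold)
group = group_accuracy([prompt_0, prompt_1, prompt_2], gold)
print(single, group)
```

In this toy case each prompt errs on a different question, so the voted group scores higher than any single member, which is the effect a group-level fitness is meant to reward.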

Title: Dual-Space Smoothness for Robust and Balanced LLM Unlearning

Authors: Han Yan, Zheyuan Liu, Meng Jiang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.23362
Pdf URL: https://arxiv.org/pdf/2509.23362
Copy Paste: [[2509.23362]] Dual-Space Smoothness for Robust and Balanced LLM Unlearning(https://arxiv.org/abs/2509.23362)
Keywords: language model, llm
Abstract: With the rapid advancement of large language models, Machine Unlearning has emerged to address growing concerns around user privacy, copyright infringement, and overall safety. Yet state-of-the-art (SOTA) unlearning methods often suffer from catastrophic forgetting and metric imbalance, for example by over-optimizing one objective (e.g., unlearning effectiveness, utility preservation, or privacy protection) at the expense of others. In addition, small perturbations in the representation or parameter space can be exploited by relearn and jailbreak attacks. To address these challenges, we propose PRISM, a unified framework that enforces dual-space smoothness in representation and parameter spaces to improve robustness and balance unlearning metrics. PRISM consists of two smoothness optimization stages: (i) a representation space stage that employs a robustly trained probe to defend against jailbreak attacks, and (ii) a parameter-space stage that decouples retain-forget gradient conflicts, reduces imbalance, and smooths the parameter space to mitigate relearning attacks. Extensive experiments on WMDP and MUSE, across conversational-dialogue and continuous-text settings, show that PRISM outperforms SOTA baselines under multiple attacks while achieving a better balance among key metrics.
摘要：随着大型语言模型的快速发展，机器上的学习已经出现，以解决对用户隐私，侵犯版权和整体安全的日益关注的问题。然而，最先进的（SOTA）学习方法通常会遭受灾难性的遗忘和度量失衡的困扰，例如，以牺牲他人为代价的一个目标（例如，无学习的效率，公用事业保存或隐私保护）过度优化。此外，可以通过重新学习和越狱攻击来利用表示或参数空间中的小扰动。为了应对这些挑战，我们提出了Prism，这是一个统一的框架，可以在表示和参数空间中实施双空间平稳性，以提高稳健性和平衡性未学习指标。 PRISM由两个平滑度优化阶段组成：（i）一个代表空间阶段，该阶段采用了训练有素的探测器来防御越狱攻击，以及（ii）将参数空间阶段解除，该阶段将其脱离验证梯度冲突，减少失衡，并减少参数空间以减轻重新学习攻击。关于WMDP和MUSE的广泛实验，跨越对话二元格和连续的文本设置，表明Prism在多次攻击下优于SOTA基线，同时在关键指标之间取得了更好的平衡。

Title: MedCritical: Enhancing Medical Reasoning in Small Language Models via Self-Collaborative Correction

Authors: Xinchun Su, Chunxu Luo, Yixuan Li, Weidong Yang, Lipeng Ma
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.23368
Pdf URL: https://arxiv.org/pdf/2509.23368
Copy Paste: [[2509.23368]] MedCritical: Enhancing Medical Reasoning in Small Language Models via Self-Collaborative Correction(https://arxiv.org/abs/2509.23368)
Keywords: language model, gpt, llm
Abstract: In the field of medicine, complex reasoning tasks such as clinical diagnosis, treatment planning, and medical knowledge integration pose significant challenges, where small language models often underperform compared to large language models like GPT-4 and DeepSeek. Recent knowledge distillation-based methods aim to address these issues through teacher-guided error correction, but this LLM-as-judge approach remains challenging in terms of cost, time, and efficiency. To circumvent this issue, we propose a novel two-stage framework, MedCritical, which uses a small language model fine-tuned by a large teacher model to play against itself. In the first stage, we extract high-level and detailed long-chain thought templates from the teacher model to guide the student model to generate more complex reasoning thoughts. In the second stage, we introduce direct preference optimization (DPO) through model self-iteration collaboration to enhance the reasoning ability of the student model by playing against the correction trajectory of the fine-tuned model during training. This model self-learning DPO approach teaches the student model to use its own error-driven insights to consolidate its skills and knowledge to solve complex problems, and achieves comparable results to traditional knowledge distillation methods using teacher models at a lower cost. Notably, our MedCritical 7B model outperforms the Taiyi and Huatuo-o1-7B models by 3.04% and 10.12% respectively on the CMExam benchmark, achieving new SOTA performance among 7B-class small models.
摘要：在医学领域，诸如临床诊断，治疗计划和医学知识整合之类的复杂推理任务构成了重大挑战，与GPT-4和DeepSeek（例如GPT-4和DeepSeek）相比，小语言模型通常表现不佳。最近的基于知识蒸馏的方法旨在通过教师指导的错误纠正来解决这些问题，但是作为法官方法的LLM在成本，时间和效率方面仍然具有挑战性。为了解决这个问题，我们提出了一个新颖的两阶段框架Medcritical，该框架使用大型教师模型微调的小语言模型来对抗自己。在第一阶段，我们从教师模型中提取高级和详细的长链思想模板，以指导学生模型产生更复杂的推理思想。在第二阶段，我们通过模型自我合作介绍了直接的偏好优化（DPO），以通过在训练过程中对微调模型的校正轨迹进行校正轨迹来增强学生模型的推理能力。这种模型的自学习DPO方法教会学生模型使用自己的错误驱动见解来巩固其技能和知识以解决复杂的问题，并以较低的成本使用教师模型实现与传统知识蒸馏方法的可比结果。值得注意的是，我们的Medicitical 7b模型在CMEXAM基准上分别优于3.04 \％和10.12 \％的Taiyi和Huatuo-O1-7b模型，在7b级小型模型中实现了新的SOTA性能。

Title: Alignment through Meta-Weighted Online Sampling: Bridging the Gap between Data Generation and Preference Optimization

Authors: Junming Yang, Ning Xu, Biao Liu, Shiqi Qiao, Xin Geng
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.23371
Pdf URL: https://arxiv.org/pdf/2509.23371
Copy Paste: [[2509.23371]] Alignment through Meta-Weighted Online Sampling: Bridging the Gap between Data Generation and Preference Optimization(https://arxiv.org/abs/2509.23371)
Keywords: language model, llm
Abstract: Preference optimization is crucial for aligning large language models (LLMs) with human values and intentions. A significant challenge in this process is the distribution mismatch between pre-collected offline preference data and the evolving model policy. Existing methods attempt to reduce this gap using static heuristics or decoupled online sampling strategies, but they often fail to adapt to the model's dynamic learning state. To bridge this gap, we propose Meta-Weighted Adaptive Preference Optimization (MetaAPO), a novel framework that dynamically couples data generation with model training. MetaAPO employs a lightweight meta-learner, as an "alignment gap estimator", to evaluate the potential benefits of on-policy sampling in relation to offline data. This guides targeted online generation and assigns sample-wise meta-weights to the optimization objective, dynamically balancing the quality and distribution of online and offline data. Experiments on AlpacaEval 2, Arena-Hard and MT-Bench demonstrate that MetaAPO consistently outperforms existing preference optimization approaches across various settings, while reducing online annotation costs by 42%.
摘要：偏好优化对于将大语言模型（LLM）与人类价值和意图保持一致至关重要。在此过程中，一个重大的挑战是预收到的离线偏好数据与不断发展的模型策略之间的分布不匹配。现有的方法试图使用静态启发式方法或脱钩的在线抽样策略来减少这一差距，但它们通常无法适应模型的动态学习状态。为了弥合这一差距，我们提出了元加权自适应偏好优化（Metaapo），这是一个新型框架，将数据生成与模型培训动态结合。 Metaapo采用轻量级的元学习器作为“对齐间隙估计器”，以评估与离线数据相关的政策采样的潜在益处。这将指导在线生成，并将样本的元重量分配给优化目标，并动态平衡在线和离线数据的质量和分布。关于Alpacaeval 2，Arena-Hard和MT Bench的实验表明，Metaapo在各种环境中始终优于现有的偏好优化方法，同时降低了42％的在线注释成本。

Title: CCD: Mitigating Hallucinations in Radiology MLLMs via Clinical Contrastive Decoding

Authors: Xi Zhang, Zaiqiao Meng, Jake Lever, Edmond S. L. Ho
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2509.23379
Pdf URL: https://arxiv.org/pdf/2509.23379
Copy Paste: [[2509.23379]] CCD: Mitigating Hallucinations in Radiology MLLMs via Clinical Contrastive Decoding(https://arxiv.org/abs/2509.23379)
Keywords: language model, llm, hallucination, prompt
Abstract: Multimodal large language models (MLLMs) have recently achieved remarkable progress in radiology by integrating visual perception with natural language understanding. However, they often generate clinically unsupported descriptions, known as medical hallucinations, which pose serious risks in medical applications that demand accuracy and image-grounded outputs. Through empirical analysis, we find that prompt-induced hallucinations remain prevalent in radiology MLLMs, largely due to over-sensitivity to clinical sections. To address this, we introduce Clinical Contrastive Decoding (CCD), a training-free and retrieval-free inference framework that integrates structured clinical signals from task-specific radiology expert models. CCD introduces a dual-stage contrastive mechanism to refine token-level logits during generation, thereby enhancing clinical fidelity without modifying the base MLLM. Experiments on three datasets and multiple models demonstrate that CCD consistently improves overall performance on radiology report generation (RRG). On the MIMIC-CXR dataset, it yields up to a 17% improvement in RadGraph-F1 when applied to state-of-the-art RRG models. Our approach provides a lightweight and generalisable solution for mitigating medical hallucinations, effectively bridging expert models and MLLMs in radiology.
摘要：多模式大型语言模型（MLLM）最近通过将视觉感知与自然语言理解相结合，在放射学方面取得了显着进步。但是，它们通常会产生临床上不受支持的描述，称为医学幻觉，这在需要准确性和图像基础输出的医学应用中构成了严重的风险。通过经验分析，我们发现迅速引起的幻觉在放射学MLLM中仍然很普遍，这在很大程度上是由于对临床切片的过度敏感性。为了解决这个问题，我们引入了临床对比CECODing（CCD），这是一种无培训和无检索的推理框架，该框架整合了特定于任务的放射学专家模型的结构化临床信号。 CCD引入了一种双阶段对比机制，可在生成过程中提高令牌级别的逻辑，从而增强临床保真度，而无需修改基础MLLM。在三个数据集和多个模型上进行的实验表明，CCD始终提高放射学报告生成（RRG）的总体绩效。在MIMIC-CXR数据集上，当应用于最先进的RRG模型时，Radgraph-F1的提高17％。我们的方法为缓解医学幻觉，有效桥接专家模型和放射学中的MLLM提供了轻巧且可普遍的解决方案。

Title: Guard Vector: Beyond English LLM Guardrails with Task-Vector Composition and Streaming-Aware Prefix SFT

Authors: Wonhyuk Lee, Youngchol Kim, Yunjin Park, Junhyung Moon, Dongyoung Jeong, Wanjin Park
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.23381
Pdf URL: https://arxiv.org/pdf/2509.23381
Copy Paste: [[2509.23381]] Guard Vector: Beyond English LLM Guardrails with Task-Vector Composition and Streaming-Aware Prefix SFT(https://arxiv.org/abs/2509.23381)
Keywords: language model, llm
Abstract: We introduce Guard Vector, a safety task vector computed as the parameter difference between a guardrail model (Guard Model) and a same-architecture pretrained language model. Composing this vector with a target language model yields a Target Guard Model (TGM). We then adapt TGM with a streaming-aware approach that combines prefix-based training and evaluation with a classifier that produces a single-token output. With this composition alone, TGM improves classification quality over established Guard Models across standard safety suites and enables language extensibility to Chinese, Japanese, and Korean, requiring neither additional training nor target language labels. It also demonstrates model portability across two widely used public guardrail backbones, Llama and Gemma. With prefix SFT (supervised fine-tuning), TGM preserves classification quality under streaming by aligning the behavior between prefix inputs and full-text inputs. The single-token output design increases throughput and reduces latency. Together, these components reduce data and compute requirements while promoting streaming-aware evaluation practices, thereby contributing to a more responsible AI ecosystem.
摘要：我们介绍了Guard Vector，这是一种安全任务向量，该矢量计算为护栏模型（护罩模型）和相同架构预验证的语言模型之间的参数差。用目标语言模型组成该矢量会产生目标护罩模型（TGM）。然后，我们使用流媒体感知的方法调整TGM，该方法将基于前缀的培训和评估与产生单一输出的分类器相结合。仅使用此组成，TGM就可以提高标准安全套件上已建立的后卫模型的分类质量，并使语言可扩展到中文，日语和韩语，不需要其他培训和目标语言标签。它还展示了两个广泛使用的公共护栏式骨架，骆驼和杰玛（Gemma）的模型可移植性。借助前缀SFT（有监督的微调），TGM通过对齐前缀输入和全文输入之间的行为来保留流式传输质量的分类质量。单口输出设计增加了吞吐量并减少了延迟。这些组件共同减少了数据和计算要求，同时促进流化的评估实践，从而为更负责任的AI生态系统做出了贡献。
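
Illustrative sketch (not from the paper): the abstract defines the Guard Vector as the parameter difference between a guardrail model and a same-architecture pretrained model, and composes it with a target model. The arithmetic can be shown on toy weights as below. The function names `guard_vector` and `compose`, the `scale` parameter, and the tiny weight dictionaries are assumptions for illustration; real checkpoints hold tensors, and the abstract does not say whether any scaling is applied.

```python
def guard_vector(guard_weights, base_weights):
    """Per-parameter difference: guardrail model minus its pretrained base."""
    return {
        name: [g - b for g, b in zip(guard_weights[name], base_weights[name])]
        for name in base_weights
    }


def compose(target_weights, vector, scale=1.0):
    """Add the (optionally scaled) vector to a same-architecture target model."""
    return {
        name: [t + scale * v for t, v in zip(target_weights[name], vector[name])]
        for name in target_weights
    }


# Toy "models": one parameter tensor flattened into a list of floats.
base = {"layer.weight": [1.0, 2.0, 3.0]}
guard = {"layer.weight": [1.5, 1.0, 3.0]}
target = {"layer.weight": [0.0, 4.0, 2.0]}

vector = guard_vector(guard, base)
target_guard_model = compose(target, vector)
print(target_guard_model["layer.weight"])
```

The composition needs no gradient steps, which is consistent with the abstract's claim that the Target Guard Model is obtained without additional training; all three models must share parameter names and shapes for the element-wise arithmetic to be defined.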

Title: Train Once, Answer All: Many Pretraining Experiments for the Cost of One

Authors: Sebastian Bordt, Martin Pawelczyk
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.23383
Pdf URL: https://arxiv.org/pdf/2509.23383
Copy Paste: [[2509.23383]] Train Once, Answer All: Many Pretraining Experiments for the Cost of One(https://arxiv.org/abs/2509.23383)
Keywords: language model, llm
Abstract: Recent work has demonstrated that controlled pretraining experiments are a powerful tool for understanding learning, reasoning, and memorization in large language models (LLMs). However, the computational cost of pretraining presents a significant constraint. To overcome this constraint, we propose to conduct multiple pretraining experiments simultaneously during a single training run. We demonstrate the feasibility of this approach by conducting ten experiments during the training of a 1.5B parameter model on 210B tokens. Although we only train a single model, we can replicate the results from multiple previous works on data contamination, poisoning, and memorization. We also conduct novel investigations into knowledge acquisition, mathematical reasoning, and watermarking. For example, we dynamically update the training data until the model acquires a particular piece of knowledge. Remarkably, the influence of the ten experiments on the model's training dynamics and overall performance is minimal. However, interactions between different experiments may act as a potential confounder in our approach. We propose to test for interactions with continual pretraining experiments, finding them to be negligible in our setup. Overall, our findings suggest that performing multiple pretraining experiments in a single training run can enable rigorous scientific experimentation with large models on a compute budget.
摘要：最近的工作表明，受控的预处理实验是理解大语模型（LLMS）学习，推理和记忆的强大工具。但是，预处理的计算成本构成了重大限制。为了克服这一约束，我们建议在一次训练过程中同时进行多个预训练的实验。我们通过在210B令牌上训练1.5B参数模型的过程中进行十项实验来证明这种方法的可行性。尽管我们只训练一个模型，但我们可以从以前的有关数据污染，中毒和记忆的多项作品中复制结果。我们还对知识获取，数学推理和水印进行了新的研究。例如，我们动态更新培训数据，直到该模型获取特定知识为止。值得注意的是，十个实验对模型训练动力和整体性能的影响很小。但是，不同实验之间的相互作用可能是我们方法中的潜在混杂因素。我们建议测试与持续预处理实验的相互作用，发现它们在我们的设置中可以忽略不计。总体而言，我们的发现表明，在一次训练运行中进行多个预训练的实验可以在计算预算中使用大型模型进行严格的科学实验。

Title: No Loss, No Gain: Gated Refinement and Adaptive Compression for Prompt Optimization

Authors: Wenhang Shi, Yiren Chen, Shuqing Bian, Xinyi Zhang, Kai Tang, Pengfei Hu, Zhe Zhao, Wei Lu, Xiaoyong Du
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.23387
Pdf URL: https://arxiv.org/pdf/2509.23387
Copy Paste: [[2509.23387]] No Loss, No Gain: Gated Refinement and Adaptive Compression for Prompt Optimization(https://arxiv.org/abs/2509.23387)
Keywords: language model, llm, prompt
Abstract: Prompt engineering is crucial for leveraging the full potential of large language models (LLMs). While automatic prompt optimization offers a scalable alternative to costly manual design, generating effective prompts remains challenging. Existing methods often struggle to stably generate improved prompts, leading to low efficiency, and overlook that prompt optimization easily gets trapped in local optima. Addressing this, we propose GRACE, a framework that integrates two synergistic strategies: Gated Refinement and Adaptive Compression, achieving Efficient prompt optimization. The gated refinement strategy introduces a feedback regulation gate and an update rejection gate, which refine update signals to produce stable and effective prompt improvements. When optimization stagnates, the adaptive compression strategy distills the prompt's core concepts, restructuring the optimization trace and opening new paths. By strategically introducing information loss through refinement and compression, GRACE delivers substantial gains in performance and efficiency. In extensive experiments on 11 tasks across three practical domains, including BIG-Bench Hard (BBH), domain-specific, and general NLP tasks, GRACE achieves significant average relative performance improvements of 4.7%, 4.4% and 2.7% over state-of-the-art methods, respectively. Further analysis shows that GRACE achieves these gains using only 25% of the prompt generation budget required by prior methods, highlighting its high optimization efficiency and low computational overhead. Our code is available at this https URL.
摘要：及时工程对于利用大型语言模型（LLM）的全部潜力至关重要。虽然自动及时优化为昂贵的手动设计提供了可扩展的替代方案，但生成有效的提示仍然具有挑战性。现有的方法通常难以稳定地产生改进的提示，从而导致低效率，并忽略该提示很容易被捕获到本地Optima中。在解决这个问题时，我们提出了恩典，这是一个整合了两种协同策略的框架：封闭的改进和适应性压缩，实现了有效的及时迅速优化。封闭式改进策略引入了反馈调节门和更新拒绝门，该门会完善更新信号，以产生稳定有效的及时改进。当优化停滞不前时，自适应压缩策略会提示提示的核心概念，重组优化跟踪并打开新路径。通过通过完善和压缩来战略性地引入信息丢失，Grace可以在绩效和效率方面取得可观的提高。在对三个实用领域的11个任务进行的广泛实验中，包括大基础（BBH），特定于域名和一般NLP任务，GRACE分别在最先进的方法中实现了4.7％，4.4％和2.7％的显着平均相对绩效提高。进一步的分析表明，GRACE仅使用先前方法所需的迅速生成预算的25％实现了这些收益，从而突出了其高优化效率和低计算开销。我们的代码可在此HTTPS URL上找到。

Title: Liaozhai through the Looking-Glass: On Paratextual Explicitation of Culture-Bound Terms in Machine Translation

Authors: Sherrie Shen, Weixuan Wang, Alexandra Birch
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.23395
Pdf URL: https://arxiv.org/pdf/2509.23395
Copy Paste: [[2509.23395]] Liaozhai through the Looking-Glass: On Paratextual Explicitation of Culture-Bound Terms in Machine Translation(https://arxiv.org/abs/2509.23395)
Keywords: llm, prompt, agent
Abstract: The faithful transfer of contextually-embedded meaning continues to challenge contemporary machine translation (MT), particularly in the rendering of culture-bound terms--expressions or concepts rooted in specific languages or cultures, resisting direct linguistic transfer. Existing computational approaches to explicitating these terms have focused exclusively on in-text solutions, overlooking paratextual apparatus in the footnotes and endnotes employed by professional translators. In this paper, we formalize Genette's (1987) theory of paratexts from literary and translation studies to introduce the task of paratextual explicitation for MT. We construct a dataset of 560 expert-aligned paratexts from four English translations of the classical Chinese short story collection Liaozhai and evaluate LLMs with and without reasoning traces on choice and content of explicitation. Experiments across intrinsic prompting and agentic retrieval methods establish the difficulty of this task, with human evaluation showing that LLM-generated paratexts improve audience comprehension, though remain considerably less effective than translator-authored ones. Beyond model performance, statistical analysis reveals that even professional translators vary widely in their use of paratexts, suggesting that cultural mediation is inherently open-ended rather than prescriptive. Our findings demonstrate the potential of paratextual explicitation in advancing MT beyond linguistic equivalence, with promising extensions to monolingual explanation and personalized adaptation.
摘要：忠实的上下文意义的转移继续挑战当代机器翻译（MT），尤其是在文化结合的术语的渲染中 - 扎根于特定语言或文化的表达或概念，以抵制直接的语言转移。阐明这些术语的现有计算方法仅集中在文本解决方案上，俯瞰着专业翻译人员使用的脚注和尾注中的PARATEXTAUL APTARATUS。在本文中，我们从文学和翻译研究中正式化了Genette（1987）的Paratexts理论，以介绍MT的Paratextual equipitation的任务。我们构建了一个来自古典中国短篇小说集liaozhai的英文翻译的560个专家平整的paratexts的数据集，并在没有理由的迹象上评估了LLM，以选择和阐释内容。跨内在提示和代理检索方法进行的实验确定了这项任务的困难，人类评估表明，LLM生成的Paratexts可以提高受众的理解力，尽管与翻译人员的效果相当大得多。除了模型性能外，统计分析还表明，即使是专业翻译人员在使用paratexts方面也有很大差异，这表明文化调解本质上是开放性的，而不是规定性的。我们的发现表明，掌握式阐释的潜力在使MT超越语言等效性方面，并有望扩展单语言解释和个性化的适应。

Title: Comparison of Scoring Rationales Between Large Language Models and Human Raters

Authors: Haowei Hua (1), Hong Jiao (2), Dan Song (3) ((1) Princeton University, (2) University of Maryland, (3) University of Iowa)
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2509.23412
Pdf URL: https://arxiv.org/pdf/2509.23412
Copy Paste: [[2509.23412]] Comparison of Scoring Rationales Between Large Language Models and Human Raters(https://arxiv.org/abs/2509.23412)
Keywords: language model, gpt, llm, chat
Abstract: Advances in automated scoring are closely aligned with advances in machine-learning and natural-language-processing techniques. With recent progress in large language models (LLMs), the use of ChatGPT, Gemini, Claude, and other generative-AI chatbots for automated scoring has been explored. Given their strong reasoning capabilities, LLMs can also produce rationales to support the scores they assign. Thus, evaluating the rationales provided by both human and LLM raters can help improve the understanding of the reasoning that each type of rater applies when assigning a score. This study investigates the rationales of human and LLM raters to identify potential causes of scoring inconsistency. Using essays from a large-scale test, the scoring accuracy of GPT-4o, Gemini, and other LLMs is examined based on quadratic weighted kappa and normalized mutual information. Cosine similarity is used to evaluate the similarity of the rationales provided. In addition, clustering patterns in rationales are explored using principal component analysis based on the embeddings of the rationales. The findings of this study provide insights into the accuracy and "thinking" of LLMs in automated scoring, helping to improve the understanding of the rationales behind both human scoring and LLM-based automated scoring.
摘要：自动评分的进步与机器学习和自然语言处理技术的进步紧密相符。随着大型语言模型（LLM）的最新进展，已经探索了Chatgpt，Gemini，Claude和其他生成型聊天机器人进行自动评分的使用。鉴于其强大的推理能力，LLM还可以产生理由来支持他们分配的分数。因此，评估人和LLM评估者提供的理由都可以帮助提高对分配分数时每种评估者适用的推理的理解。这项研究调查了人类和LLM评估者的理由，以识别出较不一致的潜在原因。使用大规模测试中的论文，基于二次加权kappa和归一化的共同信息，检查了GPT-4O，Gemini和其他LLM的评分精度。余弦相似性用于评估所提供的理由的相似性。此外，使用基于理由的嵌入的主成分分析探索了理由中的聚类模式。这项研究的发现提供了对自动评分中LLM的准确性和``思维''的见解，有助于提高对人类评分和基于LLM的自动评分背后理由的理解。
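
Illustrative sketch (not from the paper): the abstract names quadratic weighted kappa for score agreement and cosine similarity for rationale comparison. A minimal pure-Python version of both standard formulas follows. The function names, the integer score range `0..num_ratings-1`, and the toy inputs are assumptions; the study's actual scoring scale and embedding model are not given in the abstract.

```python
import math


def quadratic_weighted_kappa(rater_a, rater_b, num_ratings):
    """QWK for integer scores in 0..num_ratings-1 from two raters."""
    n = len(rater_a)
    k = num_ratings
    observed = [[0] * k for _ in range(k)]
    for x, y in zip(rater_a, rater_b):
        observed[x][y] += 1
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    num = 0.0
    den = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2
            num += weight * observed[i][j]
            den += weight * hist_a[i] * hist_b[j] / n  # chance-expected count
    if den == 0.0:
        return 1.0  # convention chosen here when both raters use one score
    return 1.0 - num / den


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


human = [0, 1, 2, 3]
model = [0, 1, 2, 3]
print(quadratic_weighted_kappa(human, model, num_ratings=4))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

The quadratic weight penalises a disagreement of two score points four times as much as a disagreement of one, which is why this statistic is a common choice for ordinal essay scores.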

Title: Retrieval-Constrained Decoding Reveals Underestimated Parametric Knowledge in Language Models

Authors: Rajaa El Hamdani, Samy Haffoudhi, Nils Holzenberger, Fabian Suchanek, Thomas Bonald, Fragkiskos D. Malliaros
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.23417
Pdf URL: https://arxiv.org/pdf/2509.23417
Copy Paste: [[2509.23417]] Retrieval-Constrained Decoding Reveals Underestimated Parametric Knowledge in Language Models(https://arxiv.org/abs/2509.23417)
Keywords: language model
Abstract: Language models (LMs) encode substantial factual knowledge, but often produce answers judged as incorrect. We hypothesize that many of these answers are actually correct, but are expressed in alternative surface forms that are dismissed due to an overly strict evaluation, leading to an underestimation of models' parametric knowledge. We propose Retrieval-Constrained Decoding (RCD), a decoding strategy that restricts model outputs to unique surface forms. We introduce YAGO-QA, a dataset of 19,137 general knowledge questions. Evaluating open-source LMs from 135M to 70B parameters, we show that standard decoding undervalues their knowledge. For instance, Llama-3.1-70B scores only 32.3% F1 with vanilla decoding but 46.0% with RCD. Similarly, Llama-3.1-8B reaches 33.0% with RCD, outperforming the larger model under vanilla decoding. We publicly share the code and dataset at this https URL.
摘要：语言模型（LMS）编码大量的事实知识，但经常产生被判断为不正确的答案。我们假设其中许多答案实际上是正确的，但是以替代表面形式表达，由于过度严格的评估而被驳回，从而低估了模型的参数知识。我们提出了检索受限的解码（RCD），这是一种限制模型输出到唯一表面形式的解码策略。我们介绍了Yago-QA，这是19,137个常识问题的数据集。将开源LMS评估为13500万至70B参数，我们表明标准解码低估了他们的知识。例如，Llama-3.1-70B仅得分32.3％，而香草解码为46.0％，而RCD的分解为46.0％。同样，Llama-3.1-8b的RCD达到33.0％，在香草解码下的表现优于较大的模型。我们在此HTTPS URL上公开共享代码和数据集。

Title: Cognition-of-Thought Elicits Social-Aligned Reasoning in Large Language Models

Authors: Xuanming Zhang, Yuxuan Chen, Min-Hsuan Yeh, Yixuan Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.23441
Pdf URL: https://arxiv.org/pdf/2509.23441
Copy Paste: [[2509.23441]] Cognition-of-Thought Elicits Social-Aligned Reasoning in Large Language Models(https://arxiv.org/abs/2509.23441)
Keywords: language model, llm
Abstract: Large language models (LLMs) excel at complex reasoning but can still exhibit harmful behaviors. Current alignment strategies typically embed safety into model weights, making these controls implicit, static, and difficult to modify. This paper introduces Cognition-of-Thought (CooT), a novel decoding-time framework that equips LLMs with an explicit cognitive self-monitoring loop. CooT couples a standard text Generator with a cognitive Perceiver that continuously monitors the unfolding sequence. The Perceiver uses a structured, precedence-based hierarchy of principles (e.g., safety over obedience) to detect potential misalignments as they arise. When violations are flagged, CooT intervenes by rolling back the generation to the point of error and regenerating under injected guidance that combines universal social priors with context-specific warnings. CooT thus transforms alignment from a fixed property into an explicit, dynamic, and auditable process active during inference, allowing for flexible policy updates without retraining the model. Extensive experiments across multiple benchmarks and model families confirm that CooT consistently improves safety and social reasoning performance.
摘要：大型语言模型（LLM）在复杂的推理方面表现出色，但仍然可以表现出有害行为。当前的对齐策略通常将安全性嵌入模型权重中，使这些控件具有隐式，静态和难以修改。本文介绍了认知（COOT），这是一种新颖的解码时间框架，它使LLMS具有明确的认知自我监控循环。 Coot将标准文本生成器与认知感知器融合在一起，可连续监视展开的序列。感知者使用基于结构化的，基于优先级的原则层次结构（例如，安全性的安全性）来检测潜在的未对准。当违规行为被标记时，COOT会通过将一代倒退到错误点并再生注入的指导来进行干预，从而将普遍的社会先验与特定于上下文的警告相结合。因此，COOT将一致性从固定属性转换为推理期间有效的显式，动态和可审核的过程，从而可以在不重新培训模型的情况下进行灵活的策略更新。跨多个基准和模型家庭进行的广泛实验证实，COOT始终提高安全性和社会推理绩效。

Title: Text-Based Approaches to Item Difficulty Modeling in Large-Scale Assessments: A Systematic Review

Authors: Sydney Peters, Nan Zhang, Hong Jiao, Ming Li, Tianyi Zhou, Robert Lissitz
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.23486
Pdf URL: https://arxiv.org/pdf/2509.23486
Copy Paste: [[2509.23486]] Text-Based Approaches to Item Difficulty Modeling in Large-Scale Assessments: A Systematic Review(https://arxiv.org/abs/2509.23486)
Keywords: language model
Abstract: Item difficulty plays a crucial role in test performance, interpretability of scores, and equity for all test-takers, especially in large-scale assessments. Traditional approaches to item difficulty modeling rely on field testing and classical test theory (CTT)-based item analysis or item response theory (IRT) calibration, which can be time-consuming and costly. To overcome these challenges, text-based approaches leveraging machine learning and language models have emerged as promising alternatives. This paper reviews and synthesizes 37 articles on automated item difficulty prediction in large-scale assessment settings published through May 2025. For each study, we delineate the dataset, difficulty parameter, subject domain, item type, number of items, training and test data split, input, features, model, evaluation criteria, and model performance outcomes. Results showed that although classic machine learning models remain relevant due to their interpretability, state-of-the-art language models, using both small and large transformer-based architectures, can capture syntactic and semantic patterns without the need for manual feature engineering. Uniquely, model performance outcomes were summarized to serve as a benchmark for future research. Overall, text-based methods have the potential to predict item difficulty with root mean square error (RMSE) as low as 0.165, Pearson correlation as high as 0.87, and accuracy as high as 0.806. The review concludes by discussing implications for practice and outlining future research directions for automated item difficulty modeling.
摘要：项目难度在所有考试者的测试绩效，分数的解释性和权益中起着至关重要的作用，尤其是在大规模评估中。传统的项目难度方法建模的方法取决于现场测试和经典测试理论（CTT）基于项目分析或项目响应理论（IRT）校准，这可能是耗时且昂贵的。为了克服这些挑战，利用机器学习和语言模型的基于文本的方法已成为有前途的替代方案。本文回顾并合成了37篇有关自动项目难度预测的文章，以预测于2025年5月。对于每项研究，我们都会描述数据集，难度参数，主题域，项目类型，项目数量，培训和测试数据分配，输入，输入，特征，模型，评估标准，以及模型性能和模型性能。结果表明，尽管经典的机器学习模型由于其可解释性而保持相关，但是使用基于小型变压器的小型和大型变压器架构的最先进的语言模型可以捕获句法和语义模式，而无需手动功能工程。独特的是，模型性能结果总结为未来研究的基准，总体而言，基于文本的方法有可能通过均方根误差（RMSE）低至0.165，Pearson相关性高达0.87，并且准确性高达0.806，以预测项目难度。审查结束时，讨论了对实践的影响，并概述了对自动项目难度建模的未来研究方向。
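Note: the RMSE and Pearson correlation figures quoted in this abstract compare predicted item difficulties with calibrated ones. The following is a minimal sketch of how those two metrics are computed, not code from any of the reviewed studies; the function names and the sample values are hypothetical.

```python
import math


def rmse(predicted, observed):
    """Root mean square error between predicted and observed difficulties."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def pearson(predicted, observed):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(predicted) != len(observed) or len(predicted) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    covariance = sum((p - mean_p) * (o - mean_o)
                     for p, o in zip(predicted, observed))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    return covariance / math.sqrt(var_p * var_o)


# Hypothetical difficulty values, for illustration only.
predicted_difficulty = [0.2, 0.5, 0.9, 1.4]
observed_difficulty = [0.0, 0.5, 1.0, 1.5]
print(rmse(predicted_difficulty, observed_difficulty))
print(pearson(predicted_difficulty, observed_difficulty))
```

Lower RMSE and higher correlation both indicate better difficulty prediction, but the two can disagree: a model can rank items well (high correlation) while being off in scale (high RMSE).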

Title: The Impact of Role Design in In-Context Learning for Large Language Models

Authors: Hamidreza Rouzegar, Masoud Makrehchi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.23501
Pdf URL: https://arxiv.org/pdf/2509.23501
Copy Paste: [[2509.23501]] The Impact of Role Design in In-Context Learning for Large Language Models(https://arxiv.org/abs/2509.23501)
Keywords: language model, gpt, llm, prompt
Abstract: In-context learning (ICL) enables Large Language Models (LLMs) to generate predictions based on prompts without additional fine-tuning. While prompt engineering has been widely studied, the impact of role design within prompts remains underexplored. This study examines the influence of role configurations in zero-shot and few-shot learning scenarios using GPT-3.5 and GPT-4o from OpenAI and Llama2-7b and Llama2-13b from Meta. We evaluate the models' performance across datasets, focusing on tasks like sentiment analysis, text classification, question answering, and math reasoning. Our findings suggest the potential of role-based prompt structuring to enhance LLM performance.
摘要：在文化学习（ICL）中，大型语言模型（LLMS）能够根据提示生成预测，而无需进行其他微调。尽管已广泛研究了及时的工程，但在提示中角色设计的影响仍未得到充实。这项研究研究了使用openai和llama2-7b和llama2-7b和llama2-13b的GPT-3.5和GPT-4O在零射击和少数射击学习方案中的角色配置的影响。我们评估模型跨数据集的性能，重点关注情感分析，文本分类，问题答案和数学推理等任务。我们的发现表明，基于角色的及时结构的潜力以增强LLM性能。

Title: From Human Annotation to Automation: LLM-in-the-Loop Active Learning for Arabic Sentiment Analysis

Authors: Dania Refai, Alaa Dalaq, Doaa Dalaq, Irfan Ahmad
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.23515
Pdf URL: https://arxiv.org/pdf/2509.23515
Copy Paste: [[2509.23515]] From Human Annotation to Automation: LLM-in-the-Loop Active Learning for Arabic Sentiment Analysis(https://arxiv.org/abs/2509.23515)
Keywords: language model, gpt, llm, chat
Abstract: Natural language processing (NLP), particularly sentiment analysis, plays a vital role in areas like marketing, customer service, and social media monitoring by providing insights into user opinions and emotions. However, progress in Arabic sentiment analysis remains limited due to the lack of large, high-quality labeled datasets. While active learning has proven effective in reducing annotation efforts in other languages, few studies have explored it in Arabic sentiment tasks. Likewise, the use of large language models (LLMs) for assisting annotation and comparing their performance to human labeling is still largely unexplored in the Arabic context. In this paper, we propose an active learning framework for Arabic sentiment analysis designed to reduce annotation costs while maintaining high performance. We evaluate multiple deep learning architectures, specifically long short-term memory (LSTM), gated recurrent units (GRU), and recurrent neural networks (RNN), across three benchmark datasets (Hunger Station, AJGT, and MASAC) that encompass both Modern Standard Arabic and dialectal variations. Additionally, two annotation strategies are compared: human labeling and LLM-assisted labeling. Five LLMs are evaluated as annotators: GPT-4o, Claude 3 Sonnet, Gemini 2.5 Pro, DeepSeek Chat, and LLaMA 3 70B Instruct. For each dataset, the best-performing LLM was used: GPT-4o for Hunger Station, Claude 3 Sonnet for AJGT, and DeepSeek Chat for MASAC. Our results show that LLM-assisted active learning achieves competitive or superior performance compared to human labeling. For example, on the Hunger Station dataset, the LSTM model achieved 93% accuracy with only 450 labeled samples using GPT-4o-generated labels, while on the MASAC dataset, DeepSeek Chat reached 82% accuracy with 650 labeled samples, matching the accuracy obtained through human labeling.
摘要：自然语言处理（NLP），尤其是情感分析，通过提供对用户意见和情感的见解，在市场，客户服务和社交媒体监控等领域中起着至关重要的作用。但是，由于缺乏大型，高质量的标签数据集，阿拉伯情感分析的进展仍然有限。尽管活跃的学习已被证明有效地减少了其他语言的注释工作，但很少有研究在阿拉伯情感任务中探讨了这一点。同样，在阿拉伯语环境中，使用大型语言模型（LLM）在协助注释和将其性能与人类标签进行比较仍然没有探索。在本文中，我们为阿拉伯情感分析提出了一个积极的学习框架，旨在降低注释成本，同时保持高性能。我们评估了多个深度学习架构：特别是短期记忆（LSTM），封闭式复发单元（GRU）和经常性的神经网络（RNN），跨三个基准数据集：饥饿站，AJGT和MASAC，包括现代的标准阿拉伯语和方言变化。此外，比较了两种注释策略：人类标记和LLM辅助标签。将五个LLM评估为注释者：GPT-4O，Claude 3 Sonnet，Gemini 2.5 Pro，DeepSeek Chat和Llama 3 70B指令。对于每个数据集，都使用了表现最佳的LLM：GPT-4O饥饿站，Claude 3 SONNET for AJGT和MASAC的DeepSeek Chat。我们的结果表明，与人类标签相比，LLM辅助的主动学习可以实现竞争性或卓越的表现。例如，在Hunger Station数据集上，LSTM模型仅使用GPT-4O生成的标签仅450个标记的样品实现了93％的精度，而在MASAC数据集中，DeepSeek聊天的精度为82％，具有650个标签样品，与通过人类标记获得的精度相匹配。

Title: On the Shelf Life of Fine-Tuned LLM Judges: Future Proofing, Backward Compatibility, and Question Generalization

Authors: Janvijay Singh, Austin Xu, Yilun Zhou, Yefan Zhou, Dilek Hakkani-Tur, Shafiq Joty
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.23542
Pdf URL: https://arxiv.org/pdf/2509.23542
Copy Paste: [[2509.23542]] On the Shelf Life of Fine-Tuned LLM Judges: Future Proofing, Backward Compatibility, and Question Generalization(https://arxiv.org/abs/2509.23542)
Keywords: llm, prompt
Abstract: The LLM-as-a-judge paradigm is widely used in both evaluating free-text model responses and reward modeling for model alignment and finetuning. Recently, finetuning judges with judge-specific data has emerged as an often preferred choice over directly prompting frontier models as judges, as the former achieves better performance with smaller model sizes while being more robust to common biases. However, the standard evaluation ignores several practical concerns of finetuned judges regarding their real world deployment. In this paper, we identify and formalize three aspects that affect the shelf life of these judges: future proofing and backward compatibility -- how well judges finetuned on responses by today's generator models perform on responses by future models or past models, as well as question generalization -- how well judges generalize to unseen questions at test time. We study these three aspects in the math domain under a unified framework with varying train and test distributions, three SFT- and DPO-based finetuning algorithms and three different base models. Experiments suggest that future-proofing is challenging for most models, while backward compatibility is relatively easy, with DPO-trained models consistently improving performance. We further find that continual learning provides a more balanced adaptation to shifts between older and newer response distributions than training solely on stronger or weaker responses. Moreover, all models observe certain degrees of performance degradation when moving from questions seen during training to unseen ones, showing that current judges do not fully generalize to unseen questions. These findings provide insights into practical considerations for developing and deploying judge models in the face of ever-changing generators.
摘要：LLM-AS-A-A-Gudge范式广泛用于评估自由文本模型响应和用于模型对齐和填充的奖励建模。最近，具有特定于法官数据的填补法官已成为通常的首选选择，而不是直接促使边境模型作为法官，因为前者以较小的模型大小取得了更好的性能，同时对常见偏见更加强大。但是，标准评估忽略了对法官对其现实世界部署的几个实际问题。在本文中，我们确定并正式化了影响这些法官保质期的三个方面：未来的校对和向后兼容 - 法官对当今发电机模型的回答对未来模型或过去模型的响应进行了挑战，以及问题的概括 - 法官在测试时间内对未见的问题概括地概括了。我们在统一的框架下研究了数学领域的这三个方面，这些框架具有不同的火车和测试分布，三种基于SFT和DPO的鉴定算法以及三个不同的基本模型。实验表明，对于大多数模型来说，防止未来是具有挑战性的，而向后兼容性相对容易，而DPO训练的模型始终提高性能。我们进一步发现，持续的学习提供了比仅在较强或较弱的响应方面培训较新的和更新的响应分布之间更平衡的适应性。此外，所有模型都会在从培训期间看到的问题转变为看不见的问题时都会观察到某些绩效降解程度，这表明当前的法官并未完全概括到看不见的问题。这些发现为面对不断变化的发电机而开发和部署法官模型的实际考虑方面提供了见解。

Title: Towards Efficient CoT Distillation: Self-Guided Rationale Selector for Better Performance with Fewer Rationales

Authors: Jianzhi Yan, Le Liu, Youcheng Pan, Shiwei Chen, Yang Xiang, Buzhou Tang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.23574
Pdf URL: https://arxiv.org/pdf/2509.23574
Copy Paste: [[2509.23574]] Towards Efficient CoT Distillation: Self-Guided Rationale Selector for Better Performance with Fewer Rationales(https://arxiv.org/abs/2509.23574)
Keywords: language model, chain-of-thought
Abstract: Chain-of-thought (CoT) distillation aims to enhance small language models' (SLMs) reasoning by transferring multi-step reasoning capability from larger teacher models. However, existing work underestimates rationale quality, focusing primarily on data quantity, which may transfer noisy or incorrect information to the student model. To address the above issues, we propose Model-Oriented Rationale Selection Distillation (MoRSD), which can discern and select high-quality rationales for distillation to improve performance further. We further propose a Rationale Difficulty (RD) metric to measure the ability of the student model to generate the correct answer under a given rationale. Compared to the baseline, we achieved a 4.6% average improvement on seven datasets over three tasks, using fewer rationales by controlling their accuracy, diversity, and difficulty. Our results reveal that a small portion of high-quality rationales can enhance the reasoning ability of student models more than the entire dataset. Our method promises to be a possible solution for efficient CoT distillation. Our code will be released at this https URL.
摘要：思维链（CoT）蒸馏旨在通过从较大的教师模型中迁移多步推理能力来增强小语言模型（SLM）的推理。但是，现有工作低估了推理依据（rationale）的质量，主要关注数据数量，这可能会将嘈杂或错误的信息传递给学生模型。为了解决上述问题，我们提出了Model-Oriented Rationale Selection Distillation（MoRSD），它可以辨别并选择高质量的推理依据用于蒸馏，以进一步提高性能。我们进一步提出了推理依据难度（RD）指标，以衡量学生模型在给定推理依据下产生正确答案的能力。与基线相比，我们通过控制推理依据的准确性、多样性和难度，使用更少的推理依据，在三个任务的七个数据集上实现了4.6%的平均提升。我们的结果表明，一小部分高质量推理依据比整个数据集更能提高学生模型的推理能力。我们的方法有望成为高效CoT蒸馏的一种可行方案。我们的代码将在此https URL发布。

Title: Jackal: A Real-World Execution-Based Benchmark Evaluating Large Language Models on Text-to-JQL Tasks

Authors: Kevin Frank, Anmol Gulati, Elias Lumer, Sindy Campagna, Vamse Kumar Subbiah
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.23579
Pdf URL: https://arxiv.org/pdf/2509.23579
Copy Paste: [[2509.23579]] Jackal: A Real-World Execution-Based Benchmark Evaluating Large Language Models on Text-to-JQL Tasks(https://arxiv.org/abs/2509.23579)
Keywords: language model, llm
Abstract: Enterprise teams rely on the Jira Query Language (JQL) to retrieve and filter issues from Jira. Yet, to our knowledge, there is no open, real-world, execution-based benchmark for mapping natural language queries to JQL. We introduce Jackal, a novel, large-scale text-to-JQL benchmark comprising 100,000 natural language (NL) requests paired with validated JQL queries and execution-based results on a live Jira instance with over 200,000 issues. To reflect real-world usage, each JQL query is associated with four types of user requests: (i) Long NL, (ii) Short NL, (iii) Semantically Similar, and (iv) Semantically Exact. We release Jackal, a corpus of 100,000 text-to-JQL pairs, together with an execution-based scoring toolkit, and a static snapshot of the evaluated Jira instance for reproducibility. We report text-to-JQL results on 23 Large Language Models (LLMs) spanning parameter sizes, open and closed source models, across execution accuracy, exact match, and canonical exact match. In this paper, we report results on Jackal-5K, a 5,000-pair subset of Jackal. On Jackal-5K, the best overall model (Gemini 2.5 Pro) achieves only 60.3% execution accuracy averaged equally across four user request types. Performance varies significantly across user request types: (i) Long NL (86.0%), (ii) Short NL (35.7%), (iii) Semantically Similar (22.7%), and (iv) Semantically Exact (99.3%). By benchmarking LLMs on their ability to produce correct and executable JQL queries, Jackal exposes the limitations of current state-of-the-art LLMs and sets a new, execution-based challenge for future research in Jira enterprise data.
摘要：企业团队依靠JIRA查询语言（JQL）从JIRA检索和过滤问题。然而，据我们所知，没有开放的，现实的，基于执行的基准将自然语言查询映射到JQL。我们介绍了Jackal，这是一种新颖的大规模文本到JQL基准，其中包含100,000个自然语言（NL）请求，并在现场JIRA实例上配对经过验证的JQL查询和基于执行的结果，其中包含超过200,000期。为了反映现实世界的用法，每个JQL查询都与四种类型的用户请求相关联：（i）长NL，（ii）简短的NL，（iii）语义上相似，并且（iv）在语义上精确。我们释放Jackal，是100,000个文本到JQL对的语料库，以及基于执行的评分工具包，以及评估的JIRA实例的静态快照，以进行可重复性。我们在23个大语言模型（LLMS）上报告了跨越参数大小，打开和封闭源模型的文本到JQL结果，执行精度，精确匹配和规范精确匹配。在本文中，我们报告了jackal-5k的结果，这是5,000对jackAL的子集。在Jackal-5K上，最佳总体模型（Gemini 2.5 Pro）仅在四种用户请求类型上平均实现60.3％的执行精度。性能在用户请求类型上差异很大：（i）长NL（86.0％），（ii）短NL（35.7％），（iii）语义上相似（22.7％）和（iv）语义上的精确（99.3％）。通过基于LLMS生成正确和可执行JQL查询的能力，Jackal揭露了当前最新LLM的局限性，并为JIRA Enterprise Data中的未来研究设定了新的，基于执行的挑战。
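Note: the abstract distinguishes execution accuracy from exact match and reports a score averaged equally across request types. The following is a minimal sketch of those three computations, not the Jackal scoring toolkit; the function names, the order-insensitive result comparison, and the sample data are assumptions made for illustration.

```python
def exact_match(predicted_queries, reference_queries):
    """Fraction of predicted query strings identical to the reference string."""
    pairs = list(zip(predicted_queries, reference_queries))
    return sum(1 for p, r in pairs if p == r) / len(pairs)


def execution_accuracy(predicted_results, reference_results):
    """Fraction of queries whose returned issue keys equal the reference's.

    Assumption: results are compared as unordered sets of issue keys.
    """
    pairs = list(zip(predicted_results, reference_results))
    return sum(1 for p, r in pairs if set(p) == set(r)) / len(pairs)


def macro_average(score_by_request_type):
    """Average scores equally across request types, regardless of type size."""
    return sum(score_by_request_type.values()) / len(score_by_request_type)


# Hypothetical example: two textually different queries can return the same
# issues, so execution accuracy can exceed exact match.
predicted = ["project = A AND status = Open", "status = Open AND project = B"]
reference = ["project = A AND status = Open", "project = B AND status = Open"]
predicted_issue_keys = [["A-1", "A-2"], ["B-7"]]
reference_issue_keys = [["A-2", "A-1"], ["B-7"]]

print(exact_match(predicted, reference))
print(execution_accuracy(predicted_issue_keys, reference_issue_keys))
print(macro_average({"long": 1.0, "short": 0.5}))
```

Scoring on execution results rather than query text avoids penalizing queries that are written differently but retrieve the same issues.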

Title: LLM Hallucination Detection: HSAD

Authors: JinXin Li, Gang Tu, JunJie Hu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.23580
Pdf URL: https://arxiv.org/pdf/2509.23580
Copy Paste: [[2509.23580]] LLM Hallucination Detection: HSAD(https://arxiv.org/abs/2509.23580)
Keywords: language model, llm, hallucination
Abstract: Although Large Language Models have demonstrated powerful capabilities in a wide range of tasks such as language understanding and code generation, the frequent occurrence of hallucinations during the generation process has become a significant impediment to their deployment in critical application scenarios. Current mainstream hallucination detection methods rely on factual consistency verification or static hidden layer features. The former is constrained by the scope of knowledge coverage, while the latter struggles to capture reasoning biases during the inference process. To address these issues, and inspired by signal analysis methods in cognitive neuroscience, this paper proposes a hallucination detection method based on the frequency-domain analysis of hidden layer temporal signals, named HSAD (Hidden Signal Analysis-based Detection). First, by treating the LLM's reasoning process as a cognitive journey that unfolds over time, we propose modeling and simulating the human process of signal perception and discrimination in a deception-detection scenario through hidden layer temporal signals. Next, the Fast Fourier Transform is applied to map these temporal signals into the frequency domain to construct spectral features, which are used to capture anomalies that arise during the reasoning process; analysis experiments on these spectral features have proven the effectiveness of this approach. Finally, a hallucination detection algorithm is designed based on these spectral features to identify hallucinations in the generated content. By effectively combining the modeling of the reasoning process with frequency-domain feature extraction, the HSAD method overcomes the limitations of existing approaches in terms of knowledge coverage and the detection of reasoning biases, demonstrating higher detection accuracy and robustness.
摘要：尽管大型语言模型已经在语言理解和代码生成等各种任务中表现出强大的能力，但生成过程中频繁出现的幻觉已成为其在关键应用场景中部署的重大障碍。当前主流的幻觉检测方法依赖于事实一致性验证或静态隐藏层特征。前者受到知识覆盖范围的限制，而后者难以捕获推理过程中的推理偏差。为了解决这些问题，并受到认知神经科学中信号分析方法的启发，本文提出了一种基于隐藏层时间信号频域分析的幻觉检测方法，称为HSAD（Hidden Signal Analysis-based Detection）。首先，通过将LLM的推理过程视为随时间展开的认知旅程，我们提出通过隐藏层时间信号对欺骗检测场景中人类的信号感知和辨别过程进行建模和模拟。接下来，应用快速傅里叶变换将这些时间信号映射到频域以构建频谱特征，这些特征用于捕获推理过程中出现的异常；针对这些频谱特征的分析实验证明了该方法的有效性。最后，基于这些频谱特征设计了幻觉检测算法，以识别生成内容中的幻觉。通过有效地将推理过程的建模与频域特征提取相结合，HSAD方法克服了现有方法在知识覆盖范围和推理偏差检测方面的局限性，表现出更高的检测准确性和鲁棒性。
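Note: the step of mapping a temporal signal into the frequency domain can be illustrated as follows. This is a minimal sketch, not the HSAD implementation: it assumes the temporal signal is one scalar per generated token (how HSAD derives that signal from hidden layers is not stated in the abstract), and it uses a naive O(n^2) discrete Fourier transform in place of the Fast Fourier Transform so that the example needs only the standard library. The function name is hypothetical.

```python
import cmath
import math


def dft_magnitudes(signal):
    """Return normalized magnitudes of the non-negative frequency bins.

    Naive discrete Fourier transform; an FFT computes the same coefficients
    faster. Bin k corresponds to k cycles over the length of the signal.
    """
    n = len(signal)
    if n == 0:
        raise ValueError("signal must be non-empty")
    magnitudes = []
    for k in range(n // 2 + 1):
        coefficient = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(signal)
        )
        magnitudes.append(abs(coefficient) / n)
    return magnitudes


# Hypothetical per-token signal: a cosine with 2 cycles over 8 steps.
toy_signal = [math.cos(2 * math.pi * 2 * t / 8) for t in range(8)]
spectrum = dft_magnitudes(toy_signal)
print(spectrum.index(max(spectrum)))  # prints 2, the dominant frequency bin
```

A detector in this style would feed such spectral magnitudes to a classifier; which bins or statistics are informative is an empirical question the abstract does not settle.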

Title: Fast Thinking for Large Language Models

Authors: Haoyu Zheng, Zhuonan Wang, Yuqian Yuan, Tianwei Lin, Wenqiao Zhang, Zheqi Lv, Juncheng Li, Siliang Tang, Yueting Zhuang, Hongyang He
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.23633
Pdf URL: https://arxiv.org/pdf/2509.23633
Copy Paste: [[2509.23633]] Fast Thinking for Large Language Models(https://arxiv.org/abs/2509.23633)
Keywords: language model, llm, chain-of-thought
Abstract: Reasoning-oriented Large Language Models (LLMs) often rely on generating explicit tokens step by step, and their effectiveness typically hinges on large-scale supervised fine-tuning or reinforcement learning. While Chain-of-Thought (CoT) techniques substantially enhance performance on complex reasoning tasks, they remain inefficient, requiring long reasoning traces that increase latency and token usage. In this work, we introduce Latent Codebooks for Fast Thinking, a framework that uses concise CoT sketches only during training to learn a codebook of discrete strategy priors. At inference, the model conditions on a handful of continuous thinking vectors distilled from the codebook in a single pass, enabling strategy-level guidance without producing explicit reasoning tokens. To complement this design, we propose GainRouter, a lightweight routing mechanism that adaptively switches between fast codebook guided inference and slow explicit reasoning, thereby suppressing overthinking and reducing unnecessary token generation. Experiments across multiple reasoning benchmarks show that our approach achieves competitive or superior accuracy while substantially lowering inference cost, offering a practical path toward efficient and controllable reasoning in large language models.
摘要：面向推理的大语言模型（LLM）通常依靠生成显式令牌，其有效性通常取决于大规模监督的微调微调或加强学习。虽然经过思考链（COT）的技术大大提高了复杂的推理任务的性能，但它们仍然效率低下，需要长时间的推理痕迹来增加延迟和代币使用情况。在这项工作中，我们介绍了潜在的代码手册，以进行快速思考，该框架仅在培训期间使用简洁的COT草图来学习离散策略先验的代码手册。在推断时，模型条件是在单个通行证中从代码手册中提取的少数连续思维矢量的条件，从而实现了策略级别的指导，而无需产生明确的推理令牌。为了补充这一设计，我们提出了GainRouter，这是一种轻巧的路由机制，可以在快速代码书指导推理和缓慢的显式推理之间自适应切换，从而抑制过度思考和减少不必要的令牌生成。跨多个推理基准的实验表明，我们的方法可以实现竞争性或卓越的准确性，同时大大降低了推理成本，这为大语言模型提供了有效且可控制的推理的实用途径。

Title: Don't Settle Too Early: Self-Reflective Remasking for Diffusion Language Models

Authors: Zemin Huang, Yuhang Wang, Zhiyang Chen, Guo-Jun Qi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.23653
Pdf URL: https://arxiv.org/pdf/2509.23653
Copy Paste: [[2509.23653]] Don't Settle Too Early: Self-Reflective Remasking for Diffusion Language Models(https://arxiv.org/abs/2509.23653)
Keywords: language model
Abstract: Mask-based Diffusion Language Models (DLMs) struggle to revise incorrect tokens: once a token is generated, it typically remains fixed. The key challenge is to identify potential errors in the inputs. In this paper, we propose Remasking-enabled Diffusion Language Model (RemeDi), a mask-based DLM that introduces remasking as another fundamental mechanism, enabling more flexible text refinement in diffusion-based text generation. To achieve this, RemeDi jointly predicts token distributions and per-token confidence scores at each step. The confidence scores determine which tokens are unmasked after the current step, allowing the model to identify low-quality tokens and remask them. These remasked tokens can be resampled with richer context in subsequent steps. We design a remask-aware pipeline to train this ability, including supervised fine-tuning, which teaches the model to detect and remask incorrect tokens in addition to predicting masked tokens, and reinforcement learning, which optimizes full generation trajectories toward higher rewards. Experiments show that RemeDi achieves state-of-the-art results among open-source DLMs on multiple datasets.
摘要：基于掩码的扩散语言模型（DLM）难以修改不正确的词元：一旦生成了词元，它通常保持固定。关键挑战在于识别输入中的潜在错误。在本文中，我们提出了Remasking-enabled Diffusion Language Model（RemeDi），这是一种基于掩码的DLM，它引入重掩码（remasking）作为另一种基本机制，使基于扩散的文本生成能够更灵活地修正文本。为此，RemeDi在每一步联合预测词元分布和每个词元的置信度分数。置信度分数决定当前步骤之后哪些词元被解除掩码，从而使模型能够识别低质量的词元并将其重新掩码。这些被重新掩码的词元可以在后续步骤中借助更丰富的上下文重新采样。我们设计了一个感知重掩码的训练流程来培养这种能力，包括监督微调（教模型在预测掩码词元之外检测并重新掩码不正确的词元）和强化学习（将完整的生成轨迹朝更高奖励优化）。实验表明，RemeDi在多个数据集上取得了开源DLM中的最先进结果。
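Note: the remasking idea can be illustrated with a single decoding step. This is a minimal sketch, not the RemeDi procedure: the fixed confidence threshold, the `<mask>` placeholder string, and the function name are assumptions, and the abstract does not say how the confidence scores are turned into an unmask/remask decision.

```python
MASK = "<mask>"  # hypothetical placeholder for a masked position


def remask_low_confidence(tokens, confidences, threshold):
    """Keep tokens whose confidence meets the threshold; remask the rest.

    Positions that are already masked stay masked. Remasked positions can be
    resampled at a later step, when more of the context has been filled in.
    """
    if len(tokens) != len(confidences):
        raise ValueError("tokens and confidences must have equal length")
    return [
        token if token != MASK and confidence >= threshold else MASK
        for token, confidence in zip(tokens, confidences)
    ]


# Hypothetical step: the second token was generated with low confidence.
step_tokens = ["the", "cat", "sat", MASK]
step_confidences = [0.9, 0.2, 0.8, 0.0]
print(remask_low_confidence(step_tokens, step_confidences, threshold=0.5))
```

The contrast with a standard mask-based DLM is that the "cat" position above would normally remain fixed once generated; here it returns to the masked state and can be revised.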

Title: Beyond English-Centric Training: How Reinforcement Learning Improves Cross-Lingual Reasoning in LLMs

Authors: Shulin Huang, Yiran Ding, Junshu Pan, Yue Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.23657
Pdf URL: https://arxiv.org/pdf/2509.23657
Copy Paste: [[2509.23657]] Beyond English-Centric Training: How Reinforcement Learning Improves Cross-Lingual Reasoning in LLMs(https://arxiv.org/abs/2509.23657)
Keywords: language model, llm
Abstract: Enhancing the complex reasoning capabilities of Large Language Models (LLMs) attracts widespread attention. While reinforcement learning (RL) has shown superior performance for improving complex reasoning, its impact on cross-lingual generalization compared to Supervised Fine-Tuning (SFT) remains unexplored. We present the first systematic investigation into cross-lingual reasoning generalization of RL and SFT. Using Qwen2.5-3B-Base as our foundation model, we conduct experiments on diverse multilingual reasoning benchmarks, including math reasoning, commonsense reasoning, and scientific reasoning. Our investigation yields two significant findings: (1) Tuning with RL not only achieves higher accuracy but also demonstrates substantially stronger cross-lingual generalization capabilities compared to SFT. (2) RL training on non-English data yields better overall performance and generalization than training on English data, which is not observed with SFT. Furthermore, through comprehensive mechanistic analyses, we explore the underlying factors of RL's superiority and generalization across languages. Our results provide compelling evidence that RL enables the model with more robust reasoning strategies, offering crucial guidance for more equitable and effective multilingual reasoning.
摘要：增强大语言模型（LLM）的复杂推理能力会引起广泛的关注。尽管增强学习（RL）在改善复杂推理方面表现出色，但与监督的微调（SFT）相比，其对跨语性概括的影响仍然没有探索。我们介绍了RL和SFT的跨语性推理概括的首次系统研究。我们使用QWEN2.5-3B基本作为我们的基础模型，我们就多种多语言推理基准（包括数学推理，常识性推理和科学推理）进行了实验。我们的调查得出了两个重要的发现：（1）与SFT相比，使用RL的调整不仅可以达到更高的精度，而且还表现出更强的跨语性概括能力。（2）对非英语数据的RL培训比在SFT中观察到的英语数据的培训比对英语数据的培训更好。此外，通过全面的机械分析，我们探讨了RL跨语言的优势和概括的根本因素。我们的结果提供了令人信服的证据，使RL能够采用更强大的推理策略，为更加公平和有效的多语言推理提供了重要的指导。

Title: Aligning LLMs for Multilingual Consistency in Enterprise Applications

Authors: Amit Agarwal, Hansa Meghwani, Hitesh Laxmichand Patel, Tao Sheng, Sujith Ravi, Dan Roth
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.23659
Pdf URL: https://arxiv.org/pdf/2509.23659
Copy Paste: [[2509.23659]] Aligning LLMs for Multilingual Consistency in Enterprise Applications(https://arxiv.org/abs/2509.23659)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Large language models (LLMs) remain unreliable for global enterprise applications due to substantial performance gaps between high-resource and mid/low-resource languages, driven by English-centric pretraining and internal reasoning biases. This inconsistency undermines customer experience and operational reliability in multilingual settings such as customer support, content moderation, and information retrieval. Even with advanced Retrieval-Augmented Generation (RAG) systems, we observe up to a 29% accuracy drop in non-English languages compared to English. We propose a practical, batch-wise alignment strategy for fine-tuning LLMs, leveraging semantically equivalent multilingual data in each training batch to directly align model outputs across languages. This approach improves non-English accuracy by up to 23.9% without compromising English performance, model reasoning, or retrieval quality. Our method is simple to implement, scalable, and integrates seamlessly with existing LLM training and deployment pipelines, enabling more robust and equitable multilingual AI solutions in industry.
摘要：由于高资源语言与中/低资源语言之间存在巨大的性能差距，大型语言模型（LLM）在全球企业应用中仍不可靠，这一差距源于以英语为中心的预训练和内部推理偏差。这种不一致会损害客户支持、内容审核和信息检索等多语言场景中的客户体验和运营可靠性。即使采用先进的检索增强生成（RAG）系统，我们也观察到非英语语言的准确率相比英语最多下降29%。我们为微调LLM提出了一种实用的批次级对齐策略，在每个训练批次中利用语义等价的多语言数据，直接对齐模型在各语言间的输出。这种方法将非英语准确率最多提高23.9%，且不损害英语性能、模型推理或检索质量。我们的方法易于实现、可扩展，并可与现有的LLM训练和部署流程无缝集成，从而为业界提供更稳健、更公平的多语言AI解决方案。

Title: TF-Bench: Evaluating Program Semantics Reasoning with Type Inference in System F

Authors: Yifeng He, Luning Yang, Christopher Castro Gaw Gonzalo, Hao Chen
Subjects: cs.CL, cs.PL, cs.SE
Abstract URL: https://arxiv.org/abs/2509.23686
Pdf URL: https://arxiv.org/pdf/2509.23686
Copy Paste: [[2509.23686]] TF-Bench: Evaluating Program Semantics Reasoning with Type Inference in System F(https://arxiv.org/abs/2509.23686)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) are increasingly integrated into the software engineering ecosystem. Their test-time compute (TTC) reasoning capabilities show significant potential for understanding program logic and semantics beyond mere token recognition. However, current benchmarks for code reasoning lack a formal, program-centric deductive framework to ensure sound evaluation, and are incapable of assessing whether models genuinely reason about program semantics or merely exploit superficial associations between natural language and code tokens. To bridge this gap, we introduce TF-Bench, a benchmark designed to evaluate LLM reasoning based on type inference in System F, a task we refer to as program semantics reasoning. By employing verified transformations to remove semantically irrelevant natural language, we construct TF-Bench_pure, a purely semantics-driven variant of TF-Bench. Our analysis reveals substantial limitations in state-of-the-art LLMs, with the best-performing LLM (Claude-3.7-sonnet) achieving only 55.85% accuracy on TF-Bench_pure. Additionally, we propose two novel metrics to assess robustness and the effectiveness of test-time reasoning, underscoring critical limitations in current LLM capabilities and highlighting essential directions for future research.
摘要：大型语言模型（LLM）越来越多地集成到软件工程生态系统中。他们的测试时间计算（TTC）推理功能显示出对程序逻辑和语义的重要潜力，而不是简单的识别。但是，当前的代码推理基准缺乏正式的，以程序为中心的演绎框架来确保合理的评估，并且无法评估模型是真正的理由是关于程序语义的理由还是仅仅利用了自然语言和代码代码代码的表面关联。为了弥合这一差距，我们引入了TF-Bench，这是一种基准测试，旨在根据系统F中的类型来评估LLM推理，这是我们称为程序语义推理的任务。通过采用经过验证的转换来消除语义上无关的自然语言，我们构建了TF-bench_pure，这是纯粹由语义驱动的TF板台的变体。我们的分析揭示了最先进的LLMS的实质性局限性，其表现最佳的LLM（Claude-3.7-Sonnet）仅在TF-BENCH_PURE上实现了55.85％的精度。此外，我们提出了两个新型指标，以评估鲁棒性和测试时间推理的有效性，强调当前LLM功能的临界局限性，并突出未来研究的基本方向。

Title: VIVA+: Human-Centered Situational Decision-Making

Authors: Zhe Hu, Yixiao Ren, Guanzhong Liu, Jing Li, Yu Yin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.23698
Pdf URL: https://arxiv.org/pdf/2509.23698
Copy Paste: [[2509.23698]] VIVA+: Human-Centered Situational Decision-Making(https://arxiv.org/abs/2509.23698)
Keywords: language model, llm, agent
Abstract: Multimodal Large Language Models (MLLMs) show promising results for embodied agents in operating meaningfully in complex, human-centered environments. Yet, evaluating their capacity for nuanced, human-like reasoning and decision-making remains challenging. In this work, we introduce VIVA+, a cognitively grounded benchmark for evaluating the reasoning and decision-making of MLLMs in human-centered situations. VIVA+ consists of 1,317 real-world situations paired with 6,373 multiple-choice questions, targeting three core abilities for decision-making: (1) Foundational Situation Comprehension, (2) Context-Driven Action Justification, and (3) Reflective Reasoning. Together, these dimensions provide a systematic framework for assessing a model's ability to perceive, reason, and act in socially meaningful ways. We evaluate the latest commercial and open-source models on VIVA+, where we reveal distinct performance patterns and highlight significant challenges. We further explore targeted training and multi-step reasoning strategies, which yield consistent performance improvements. Finally, our in-depth analysis highlights current model limitations and provides actionable insights for advancing MLLMs toward more robust, context-aware, and socially adept decision-making in real-world settings.
摘要：多模式的大语言模型（MLLM）显示出在复杂的，以人为中心的环境中有意义地操作的具体作用的有希望的结果。然而，评估他们的细微差别，类似人类的推理和决策能力仍然具有挑战性。在这项工作中，我们介绍了Viva+，这是一种认知扎根的基准，用于评估以人为本的情况下MLLM的推理和决策。 Viva+由1,317个现实情况组成，与6,373个多项选择问题配对，针对决策的三个核心能力：（1）基础情况理解，（2）上下文驱动的行动合理性和（3）反思性推理。这些维度共同提供了一个系统的框架，以评估模型以社会意义的方式感知，理性和行动的能力。我们在Viva+上评估了最新的商业和开源模型，在那里我们揭示了不同的性能模式并突出了重大挑战。我们进一步探讨了有针对性的培训和多步推理策略，这些策略可实现一致的绩效提高。最后，我们的深入分析强调了当前的模型限制，并提供了可行的见解，以使MLLM在现实世界中更强大，背景感知和社会熟练的决策。

Title: Do LLMs Understand Romanian Driving Laws? A Study on Multimodal and Fine-Tuned Question Answering

Authors: Eduard Barbu, Adrian Marius Dumitran
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2509.23715
Pdf URL: https://arxiv.org/pdf/2509.23715
Copy Paste: [[2509.23715]] Do LLMs Understand Romanian Driving Laws? A Study on Multimodal and Fine-Tuned Question Answering(https://arxiv.org/abs/2509.23715)
Keywords: language model, llm
Abstract: Ensuring that both new and experienced drivers master current traffic rules is critical to road safety. This paper evaluates Large Language Models (LLMs) on Romanian driving-law QA with explanation generation. We release a 1,208-question dataset (387 multimodal) and compare text-only and multimodal SOTA systems, then measure the impact of domain-specific fine-tuning for Llama 3.1-8B-Instruct and RoLlama 3.1-8B-Instruct. SOTA models perform well, but fine-tuned 8B models are competitive. Textual descriptions of images outperform direct visual input. Finally, an LLM-as-a-Judge assesses explanation quality, revealing self-preference bias. The study informs explainable QA for less-resourced languages.
摘要：确保新手和有经验的驾驶员都掌握现行交通规则对道路安全至关重要。本文在带解释生成的罗马尼亚驾驶法规问答任务上评估大型语言模型（LLM）。我们发布了一个包含1,208个问题的数据集（其中387个为多模态），比较纯文本和多模态SOTA系统，然后测量领域特定微调对Llama 3.1-8B-Instruct和RoLlama 3.1-8B-Instruct的影响。SOTA模型表现良好，但经过微调的8B模型也具有竞争力。图像的文本描述优于直接视觉输入。最后，LLM-as-a-Judge评估了解释质量，并揭示出自我偏好偏差。该研究为资源较少语言的可解释问答提供了参考。

Title: Compose and Fuse: Revisiting the Foundational Bottlenecks in Multimodal Reasoning

Authors: Yucheng Wang, Yifan Hou, Aydin Javadov, Mubashara Akhtar, Mrinmaya Sachan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.23744
Pdf URL: https://arxiv.org/pdf/2509.23744
Copy Paste: [[2509.23744]] Compose and Fuse: Revisiting the Foundational Bottlenecks in Multimodal Reasoning(https://arxiv.org/abs/2509.23744)
Keywords: language model, llm, prompt
Abstract: Multimodal large language models (MLLMs) promise enhanced reasoning by integrating diverse inputs such as text, vision, and audio. Yet cross-modal reasoning remains underexplored, with conflicting reports on whether added modalities help or harm performance. These inconsistencies stem from a lack of controlled evaluation frameworks and analysis of models' internals to isolate when and why modality interactions support or undermine reasoning. We address this gap through a logic-grounded evaluation framework that categorizes multimodal reasoning into six interaction patterns, varying how facts are distributed across modalities and logically combined. Empirically, additional modalities enhance reasoning only when they provide independent and sufficient reasoning paths, while redundant or chained entailment support often hurts performance. Moreover, reasoning degrades in three systematic ways: weaker modalities drag down overall performance, conflicts bias preference toward certain modalities, and joint signals from different modalities fail to be integrated effectively. Therefore, we identify two core failures: task-composition bottleneck, where recognition and reasoning cannot be jointly executed in one pass, and fusion bottleneck, where early integration introduces bias. For further investigation, we find that attention patterns fail to encode fact usefulness, but a simple two-step prompting (recognize then reason) restores performance, confirming the task-composition bottleneck. Moreover, modality identity remains recoverable in early layers, and softening attention in early fusion improves reasoning, highlighting biased fusion as another failure mode. Overall, our findings show that integration, not perception, is the main barrier to multimodal reasoning, suggesting composition-aware training and early fusion control as promising directions.
摘要：多模式的大型语言模型（MLLM）通过整合诸如文本，视觉和音频等多样化的输入来增强推理。然而，跨模式推理仍然没有充满反感，有关增加方式是否有助于或伤害性能的报道相互矛盾。这些不一致源于缺乏受控的评估框架以及对模型内部的分析，以隔离何时以及为什么互动支持或破坏推理。我们通过一个逻辑的评估框架来解决这一差距，该框架将多模式推理分为六个交互模式，从而改变了事实如何跨模态分布和逻辑合并。从经验上讲，仅当推理提供独立和足够的推理路径时，额外的方式才能增强推理，而多余或链接的支撑通常会损害性能。此外，推理以三种系统的方式降低：较弱的方式拖延了整体性能，冲突对某些方式的偏见和来自不同模式的联合信号无法有效地集成。因此，我们确定了两个核心故障：任务组成的瓶颈，其中不能在一个通行证中共同执行识别和推理，而融合瓶颈则是早期整合引入偏见。为了进行进一步的研究，我们发现注意力模式无法编码事实有用性，但是简单的两步提示（识别理性）可以恢复性能，从而确认任务组成的瓶颈。此外，模态身份在早期层中仍可恢复，并且在早期融合中的关注会改善推理，从而强调了偏见的融合作为另一种故障模式。总体而言，我们的发现表明，整合而非感知是多模式推理的主要障碍，表明组成感知的训练和早期的融合控制是有希望的方向。

Title: Understanding Textual Capability Degradation in Speech LLMs via Parameter Importance Analysis

Authors: Chao Wang, Rui-Chen Zheng, Yang Ai, Zhen-Hua Ling
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.23755
Pdf URL: https://arxiv.org/pdf/2509.23755
Copy Paste: [[2509.23755]] Understanding Textual Capability Degradation in Speech LLMs via Parameter Importance Analysis(https://arxiv.org/abs/2509.23755)
Keywords: language model, llm
Abstract: The integration of speech into Large Language Models (LLMs) has substantially expanded their capabilities, but often at the cost of weakening their core textual competence. This degradation limits the ability of speech-enabled LLMs to fully exploit their pre-trained text-based knowledge. In this work, we analyze the underlying mechanisms of this issue through a focused study of the widely used encoder-adaptor paradigm. We propose an analytical framework based on parameter importance estimation, which reveals that fine-tuning for speech introduces a textual importance distribution shift: the layer-wise allocation of parameters critical to textual reasoning is disrupted. Building on this insight, we investigate two mitigation strategies, layer-wise learning rate scheduling and Low-Rank Adaptation (LoRA), both of which aim to preserve the original parameter distribution. Experimental results show that both approaches maintain textual competence better than full fine-tuning, while also improving downstream spoken question answering performance. Furthermore, our analysis offers a principled explanation for the effectiveness of the proposed mitigation strategies, linking their benefits to the structural properties of textual knowledge in LLMs.
摘要：语音纳入大语言模型（LLM）已大大扩展了其能力，但通常以削弱其核心文字能力的代价。这种退化限制了支持语音的LLM充分利用其预先培训的基于文本的知识的能力。在这项工作中，我们通过对广泛使用的编码器适应器范式的重点研究来分析此问题的潜在机制。我们基于参数重要性估计提出了一个分析框架，该框架揭示了语音的微调引入文本重要性分布移动：对文本推理至关重要的参数的层分配受到中断。在此洞察力的基础上，我们研究了两种缓解策略：按层学习率调度和低级适应（LORA），均旨在保留原始参数分布。实验结果表明，这两种方法都比全面的微调更好地保持文本能力，同时也改善了下游的口头答案，以回答性能。此外，我们的分析为提出的缓解策略的有效性提供了一个原则上的解释，将其收益与LLMS中文本知识的结构属性联系起来。

Title: Knowledge-Level Consistency Reinforcement Learning: Dual-Fact Alignment for Long-Form Factuality

Authors: Junliang Li, Yucheng Wang, Yan Chen, Yu Ran, Ruiqing Zhang, Jing Liu, Hua Wu, Haifeng Wang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.23765
Pdf URL: https://arxiv.org/pdf/2509.23765
Copy Paste: [[2509.23765]] Knowledge-Level Consistency Reinforcement Learning: Dual-Fact Alignment for Long-Form Factuality(https://arxiv.org/abs/2509.23765)
Keywords: language model, llm, hallucination
Abstract: Hallucination and factuality deficits remain key obstacles to the reliability of large language models (LLMs) in long-form generation. Existing reinforcement learning from human feedback (RLHF) frameworks primarily rely on preference rewards, yet they often overlook the model's internal knowledge boundaries, exacerbating the so-called "hallucination tax". To address this challenge, we propose the Knowledge-Level Consistency Reinforcement Learning Framework (KLCF), a novel framework that focuses on the consistency between the policy model's expressed knowledge and the base model's parametric knowledge, and introduces a Dual-Fact Alignment mechanism to jointly optimize factual recall and precision. Specifically, KLCF leverages pretrained knowledge boundaries to construct a fact checklist, guiding online reinforcement learning to improve factual coverage and recall; simultaneously, it trains a self-assessment module based on the base model's internal knowledge to enhance factual precision during generation. Unlike prior methods that rely on external retrieval or heavy verification, our reward design is fully external-knowledge-free and lightweight, making KLCF efficient and easily scalable to large-scale training. Experimental results demonstrate that KLCF substantially improves factuality metrics across multiple long-form benchmarks and effectively alleviates model hallucinations.
摘要：幻觉和事实赤字仍然是长期产生大语言模型（LLM）可靠性的关键障碍。从人类反馈（RLHF）框架中学习的现有强化学习主要依赖于偏好奖励，但他们经常忽略模型的内部知识边界，从而加剧了所谓的“幻觉税”。为了应对这一挑战，我们提出了知识级的一致性增强学习框架（KLCF），该框架的新颖框架着重于策略模型的表达知识与基本模型的参数知识之间的知识一致性，并引入了双交对准机制，以共同优化事实召回和精确。具体而言，KLCF利用了预处理的知识界限来构建事实清单，指导在线强化学习以改善事实的覆盖范围和召回；同时，它根据基本模型的内部知识来训练一个自我评估模块，以增强一代人的事实精度。与依靠外部检索或重度验证的先前方法不同，我们的奖励设计是完全无外部知识和轻量级的，使KLCF高效且易于扩展到大规模培训。实验结果表明，KLCF基本上改善了多种长期基准的事实指标，并有效地减轻了模型幻觉。

Title: From Personal to Collective: On the Role of Local and Global Memory in LLM Personalization

Authors: Zehong Wang, Junlin Wu, ZHaoxuan Tan, Bolian Li, Xianrui Zhong, Zheli Liu, Qingkai Zeng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.23767
Pdf URL: https://arxiv.org/pdf/2509.23767
Copy Paste: [[2509.23767]] From Personal to Collective: On the Role of Local and Global Memory in LLM Personalization(https://arxiv.org/abs/2509.23767)
Keywords: language model, llm
Abstract: Large language model (LLM) personalization aims to tailor model behavior to individual users based on their historical interactions. However, its effectiveness is often hindered by two key challenges: the cold-start problem, where users with limited history provide insufficient context for accurate personalization, and the biasing problem, where users with abundant but skewed history cause the model to overfit to narrow preferences. We identify both issues as symptoms of a common underlying limitation, i.e., the inability to model collective knowledge across users. To address this, we propose a local-global memory framework (LoGo) that combines the personalized local memory with a collective global memory that captures shared interests across the population. To reconcile discrepancies between these two memory sources, we introduce a mediator module designed to resolve conflicts between local and global signals. Extensive experiments on multiple benchmarks demonstrate that LoGo consistently improves personalization quality by both warming up cold-start users and mitigating biased predictions. These results highlight the importance of incorporating collective knowledge to enhance LLM personalization.
摘要：大型语言模型（LLM）个性化旨在根据单个用户的历史互动来量身定制模型行为。但是，其有效性通常受到两个关键挑战的阻碍：\ textIt {冷启动问题}，其中历史记录有限的用户为准确的个性化提供了不足的上下文，而\ textit {偏见问题}，其中大量但偏斜的历史记录的用户会导致模型过度拟合偏好的模型。我们将这两个问题确定为常见潜在限制的症状，即无法对用户的集体知识进行建模。为了解决这个问题，我们提出了一个本地全球内存框架（徽标），该框架将个性化的本地记忆与集体全球记忆结合在一起，该内存捕获了整个人群中共同的兴趣。为了调和这两个内存源之间的差异，我们引入了一个旨在解决本地和全球信号之间冲突的调解器模块。对多个基准测试的广泛实验表明，徽标始终通过暖和的用户和减轻偏见的预测来始终提高个性化质量。这些结果强调了纳入集体知识以增强LLM个性化的重要性。

Title: Bridging the Knowledge-Prediction Gap in LLMs on Multiple-Choice Questions

Authors: Yoonah Park, Haesung Pyun, Yohan Jo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.23782
Pdf URL: https://arxiv.org/pdf/2509.23782
Copy Paste: [[2509.23782]] Bridging the Knowledge-Prediction Gap in LLMs on Multiple-Choice Questions(https://arxiv.org/abs/2509.23782)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) often fail on multiple-choice questions (MCQs) despite demonstrating correct knowledge in other contexts, such as free-form generation. To investigate the mechanism underlying this knowledge-prediction gap on MCQs and alleviate it, we conduct a probing analysis and find that residual streams in certain layers contain a subspace spanned by two important bases: a knowledge basis that encodes the probability of the ground-truth answer for a given MCQ and a prediction basis that encodes the probability of the answer choice predicted by the model. We observe that incorrect predictions arise from a misalignment of the model's hidden states along these two bases. Hence, we introduce KAPPA (Knowledge-Aligned Prediction through Projection-based Adjustment), a parameter-free intervention that transforms the hidden states to align the prediction coordinate with the knowledge coordinate within this subspace. Experiments on binary-choice reformulations of Big-Bench-Hard and ARC-Challenge show that KAPPA substantially improves accuracy and consistently outperforms baselines. While optimal subspaces differ across tasks, subspaces generalize to some extent, as supported by cross-dataset experiments. Moreover, KAPPA extends its effectiveness to free-form questions beyond MCQs. Our work provides a new geometric understanding of the knowledge-prediction gap and offers a practical method for better aligning model behavior with its latent knowledge.
摘要：尽管在其他情况下（例如自由形式的生成）证明了正确的知识，但大型语言模型（LLMS）通常会在多项选择问题（MCQ）上失败。为了调查这种知识预测差距在MCQ上的基础机制并减轻它，我们进行了探测分析，发现某些层中的残留流包含一个由两个重要基础的子空间：\ emph {sewood Basis}，这些基础{知识基础}对给定的mcq和a \ emph probection the \ emph probectiation the Pretiaction predictiation pretictiation contection contectiation prectiation prectiation} reptions prectiation pretiaction} contectiation}编码的概率。我们观察到，错误的预测是由于模型在这两个基础上的隐藏状态的未对准而产生的。因此，我们介绍\ textbf {kappa}（通过基于投影的调整与知识一致的预测），这是一种无参数的干预措施，可转换隐藏状态以使预测坐标与此子空间内的知识坐标保持一致。关于大台式甲板和弧形 - 挑战的二进制选择重新进行的实验表明，Kappa显着提高了准确性，并且始终超过了基准。虽然最佳子空间在各个任务之间有所不同，但子空间在一定程度上概括，如交叉数据库实验所支持。此外，Kappa将其有效性扩展到MCQ以外的自由形式问题。我们的工作提供了对知识预测差距的新几何理解，并提供了一种实用方法，可以更好地使模型行为与其潜在知识保持一致。
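A minimal sketch of the kind of projection-based adjustment the abstract describes, under strong simplifying assumptions: the knowledge and prediction directions are taken to be orthogonal unit-normalizable vectors, and the hidden state is shifted only along the prediction direction until its prediction coordinate matches its knowledge coordinate. The function name `align_prediction_to_knowledge` and the toy vectors are hypothetical; the paper's actual transformation and subspace estimation may differ.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def align_prediction_to_knowledge(hidden, knowledge_dir, prediction_dir):
    """Shift `hidden` along the prediction direction so that its prediction
    coordinate equals its knowledge coordinate.

    Assumption (not from the source): the two directions are orthogonal,
    so moving along one leaves the coordinate on the other unchanged.
    """
    k_hat = normalize(knowledge_dir)
    p_hat = normalize(prediction_dir)
    k_coord = dot(hidden, k_hat)   # coordinate on the knowledge basis
    p_coord = dot(hidden, p_hat)   # coordinate on the prediction basis
    delta = k_coord - p_coord      # gap between knowledge and prediction
    return [h + delta * p for h, p in zip(hidden, p_hat)]

# Toy 3-dimensional hidden state with axis-aligned (hence orthogonal) bases.
hidden = [2.0, -1.0, 0.5]
adjusted = align_prediction_to_knowledge(hidden, [1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print(adjusted)
```

In this toy case the knowledge coordinate is 2.0 and the prediction coordinate is -1.0, so only the second component moves; the component outside the two-dimensional subspace is left untouched.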

Title: Transformer Tafsir at QIAS 2025 Shared Task: Hybrid Retrieval-Augmented Generation for Islamic Knowledge Question Answering

Authors: Muhammad Abu Ahmad, Mohamad Ballout, Raia Abu Ahmad, Elia Bruni
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.23793
Pdf URL: https://arxiv.org/pdf/2509.23793
Copy Paste: [[2509.23793]] Transformer Tafsir at QIAS 2025 Shared Task: Hybrid Retrieval-Augmented Generation for Islamic Knowledge Question Answering(https://arxiv.org/abs/2509.23793)
Keywords: language model, llm, retrieval-augmented generation
Abstract: This paper presents our submission to the QIAS 2025 shared task on Islamic knowledge understanding and reasoning. We developed a hybrid retrieval-augmented generation (RAG) system that combines sparse and dense retrieval methods with cross-encoder reranking to improve large language model (LLM) performance. Our three-stage pipeline incorporates BM25 for initial retrieval, a dense embedding retrieval model for semantic matching, and cross-encoder reranking for precise content retrieval. We evaluate our approach on both subtasks using two LLMs, Fanar and Mistral, demonstrating that the proposed RAG pipeline enhances performance across both, with accuracy improvements up to 25%, depending on the task and model configuration. Our best configuration is achieved with Fanar, yielding accuracy scores of 45% in Subtask 1 and 80% in Subtask 2.
摘要：本文介绍了我们对QIAS 2025关于伊斯兰知识理解和推理的共同任务的提交。我们开发了一种混合检索功能的生成（RAG）系统，该系统将稀疏和密集的检索方法与跨编码器重新升级相结合，以改善大型语言模型（LLM）的性能。我们的三阶段管道结合了用于初始检索的BM25，这是一种致密的嵌入嵌入式检索模型，用于语义匹配，并重新编码重新编码器以进行精确的内容检索。我们使用两个LLM（FANAR和MISTRAL）在两个子任务上评估我们的方法，这表明所提出的RAG管道可提高两者的性能，并且根据任务和模型配置，精度提高了高达25％。我们最好的配置是通过Fanar实现的，在子任务2中的子任务1和80％的准确度得分为45％。
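The three-stage pipeline above (BM25, dense retrieval, cross-encoder reranking) can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' system: the BM25 scorer follows the common Okapi formulation, while the "dense" stage is stood in for by bag-of-words cosine similarity and the "cross-encoder" by a token-overlap score, since the abstract does not name the models used. Merging the two candidate lists by set union is also an assumption, and `DOCS`, `hybrid_retrieve`, and `overlap_reranker` are hypothetical names.

```python
import math
from collections import Counter

# Toy corpus (hypothetical); a real system would index the task's knowledge base.
DOCS = [
    "sparse retrieval ranks documents by exact term overlap",
    "dense retrieval embeds queries and documents in a vector space",
    "cross-encoder reranking scores each query and document pair jointly",
]

def tokenize(text):
    return text.lower().split()

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Stage 1: Okapi BM25 score of every document for the query."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    df = Counter(term for t in tokenized for term in set(t))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

def cosine_scores(query, docs):
    """Stage 2 stand-in: cosine similarity of bag-of-words count vectors.
    A real dense retriever would compare learned embeddings instead."""
    q = Counter(tokenize(query))
    q_norm = math.sqrt(sum(c * c for c in q.values()))
    scores = []
    for d in docs:
        v = Counter(tokenize(d))
        v_norm = math.sqrt(sum(c * c for c in v.values()))
        num = sum(q[t] * v[t] for t in q)
        scores.append(num / (q_norm * v_norm) if q_norm and v_norm else 0.0)
    return scores

def overlap_reranker(query, doc):
    """Stage 3 stand-in: Jaccard overlap in place of a cross-encoder score."""
    q, d = set(tokenize(query)), set(tokenize(doc))
    return len(q & d) / len(q | d) if q | d else 0.0

def top_k(scores, k):
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

def hybrid_retrieve(query, docs, k_sparse=2, k_dense=2, k_final=1):
    """Union the sparse and dense candidates, then rerank the pooled set."""
    candidates = set(top_k(bm25_scores(query, docs), k_sparse))
    candidates |= set(top_k(cosine_scores(query, docs), k_dense))
    reranked = sorted(sorted(candidates),
                      key=lambda i: overlap_reranker(query, docs[i]),
                      reverse=True)
    return [docs[i] for i in reranked[:k_final]]

result = hybrid_retrieve("how does dense retrieval embed documents", DOCS)
print(result)
```

The retrieved passages would then be placed in the LLM prompt; that generation step is omitted here because it depends on the specific model API.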

Title: Open-DeBias: Toward Mitigating Open-Set Bias in Language Models

Authors: Arti Rani, Shweta Singh, Nihar Ranjan Sahoo, Gaurav Kumar Nayak
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.23805
Pdf URL: https://arxiv.org/pdf/2509.23805
Copy Paste: [[2509.23805]] Open-DeBias: Toward Mitigating Open-Set Bias in Language Models(https://arxiv.org/abs/2509.23805)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have achieved remarkable success on question answering (QA) tasks, yet they often encode harmful biases that compromise fairness and trustworthiness. Most existing bias mitigation approaches are restricted to predefined categories, limiting their ability to address novel or context-specific emergent biases. To bridge this gap, we tackle the novel problem of open-set bias detection and mitigation in text-based QA. We introduce OpenBiasBench, a comprehensive benchmark designed to evaluate biases across a wide range of categories and subgroups, encompassing both known and previously unseen biases. Additionally, we propose Open-DeBias, a novel, data-efficient, and parameter-efficient debiasing method that leverages adapter modules to mitigate existing social and stereotypical biases while generalizing to unseen ones. Compared to the state-of-the-art BMBI method, Open-DeBias improves QA accuracy on the BBQ dataset by nearly 48% on ambiguous subsets and 6% on disambiguated ones, using adapters fine-tuned on just a small fraction of the training data. Remarkably, the same adapters, in a zero-shot transfer to Korean BBQ, achieve 84% accuracy, demonstrating robust language-agnostic generalization. Through extensive evaluation, we also validate the effectiveness of Open-DeBias across a broad range of NLP tasks, including StereoSet and CrowS-Pairs, highlighting its robustness, multilingual strength, and suitability for general-purpose, open-domain bias mitigation. The project page is available at: this https URL
摘要：大型语言模型（LLMS）在问题回答（QA）任务上取得了巨大的成功，但它们经常编码有害偏见，损害公平和可信度。大多数现有的缓解方法仅限于预定义的类别，从而限制了它们解决新颖或特定于上下文的新兴偏见的能力。为了弥合这一差距，我们解决了基于文本的质量检查中开放式偏见检测和缓解的新问题。我们介绍了OpenBiasbench，这是一种综合基准，旨在评估各种类别和子组的偏见，涵盖已知和以前看不见的偏见。此外，我们提出了开放式，一种新颖的，数据效率和参数有效的偏差方法，该方法利用适配器模块来减轻现有的社会和刻板印象偏见，同时推广到看不见的偏见。与最先进的BMBI方法相比，Open-Debias在模棱两可的子集上提高了烧烤数据集的QA准确性近48美元\％$ $，而Disabiged的数据集则使用了$ 6 \％$ $ $ 6 \％$ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ 6 \％$ $ $ $ $ $ $ nibepters使用适配器对培训数据的一小部分。值得注意的是，同一适配器以零拍传输到韩国烧烤，达到了84美元\％$的准确性，证明了强大的语言 - 不平智的概括。通过广泛的评估，我们还验证了在广泛的NLP任务中，包括立体声和crows对，突出了其稳健性，多语言强度以及对通用，开放式偏置缓解的稳健性，多种力量和适合性。项目页面可用：此HTTPS URL

Title: SPELL: Self-Play Reinforcement Learning for evolving Long-Context Language Models

Authors: Ziyi Yang, Weizhou Shen, Ruijun Chen, Chenliang Li, Fanqi Wan, Ming Yan, Xiaojun Quan, Fei Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.23863
Pdf URL: https://arxiv.org/pdf/2509.23863
Copy Paste: [[2509.23863]] SPELL: Self-Play Reinforcement Learning for evolving Long-Context Language Models(https://arxiv.org/abs/2509.23863)
Keywords: language model, llm
Abstract: Progress in long-context reasoning for large language models (LLMs) has lagged behind other recent advances. This gap arises not only from the intrinsic difficulty of processing long texts, but also from the scarcity of reliable human annotations and programmatically verifiable reward signals. In this paper, we propose SPELL, a multi-role self-play reinforcement learning framework that enables scalable, label-free optimization for long-context reasoning. SPELL integrates three cyclical roles (questioner, responder, and verifier) within a single model to enable continual self-improvement. The questioner generates questions from raw documents paired with reference answers; the responder learns to solve these questions based on the documents; and the verifier evaluates semantic equivalence between the responder's output and the questioner's reference answer, producing reward signals to guide continual training. To stabilize training, we introduce an automated curriculum that gradually increases document length and a reward function that adapts question difficulty to the model's evolving capabilities. Extensive experiments on six long-context benchmarks show that SPELL consistently improves performance across diverse LLMs and outperforms equally sized models fine-tuned on large-scale annotated data. Notably, SPELL achieves an average 7.6-point gain in pass@8 on the strong reasoning model Qwen3-30B-A3B-Thinking, raising its performance ceiling and showing promise for scaling to even more capable models.
摘要：大型语言模型（LLM）的长篇文化推理的进展远远落后于其他最近的进步。这一差距不仅源于处理长文本的内在困难，而且还源于可靠的人类注释和可靠性验证的奖励信号的稀缺性。在本文中，我们提出了Spell，这是一种多功能自我播放增强学习框架，可实现可扩展的，无标签的优化，以实现长篇文化推理。咒语集成了三个周期性角色 - 问题，响应者和验证者与单个模型的验证者，以实现持续的自我改善。发问者从原始文档和参考答案配对的原始文档中生成问题；响应者学会根据文件解决这些问题；验证者评估了响应者的输出与发问者的参考答案之间的语义等效性，从而产生奖励信号来指导持续的培训。为了稳定培训，我们引入了一种自动课程，该课程逐渐增加了文档的长度和奖励功能，使问题难以适应该模型不断发展的功能。对六个长篇小写基准测试的广泛实验表明，咒语始终提高各种LLM的性能，并且在大规模注释的数据上微调的尺寸均优于同等大小的模型。值得注意的是，Spell在强大的推理模型QWEN3-30B-A3B思维中平均获得7.6分的增长，从而提高了其性能上限，并有希望扩展到更有能力的模型。

Title: Winning the Pruning Gamble: A Unified Approach to Joint Sample and Token Pruning for Efficient Supervised Fine-Tuning

Authors: Shaobo Wang, Jiaming Wang, Jiajun Zhang, Cong Wang, Yue Min, Zichen Wen, Fei Huang, Huiqiang Jiang, Junyang Lin, Dayiheng Liu, Linfeng Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.23873
Pdf URL: https://arxiv.org/pdf/2509.23873
Copy Paste: [[2509.23873]] Winning the Pruning Gamble: A Unified Approach to Joint Sample and Token Pruning for Efficient Supervised Fine-Tuning(https://arxiv.org/abs/2509.23873)
Keywords: language model, llm
Abstract: As supervised fine-tuning (SFT) evolves from a lightweight post-training step into a compute-intensive phase rivaling mid-training in scale, data efficiency has become critical for aligning large language models (LLMs) under tight budgets. Existing data pruning methods suffer from a fragmented design: they operate either at the sample level or the token level in isolation, failing to jointly optimize both dimensions. This disconnect leads to significant inefficiencies: high-value samples may still contain redundant tokens, while token-level pruning often discards crucial instructional or corrective signals embedded in individual examples. To address this bottleneck, we introduce the Error-Uncertainty (EU) Plane, a diagnostic framework that jointly characterizes the heterogeneous utility of training data across samples and tokens. Guided by this insight, we propose Quadrant-based Tuning (Q-Tuning), a unified framework that strategically coordinates sample pruning and token pruning. Q-Tuning employs a two-stage strategy: first, it performs sample-level triage to retain examples rich in informative misconceptions or calibration signals; second, it applies an asymmetric token-pruning policy, using a context-aware scoring mechanism to trim less salient tokens exclusively from misconception samples while preserving calibration samples in their entirety. Our method sets a new state of the art across five diverse benchmarks. Remarkably, on SmolLM2-1.7B, Q-Tuning achieves a +38% average improvement over the full-data SFT baseline using only 12.5% of the original training data. As the first dynamic pruning approach to consistently outperform full-data training, Q-Tuning provides a practical and scalable blueprint for maximizing data utilization in budget-constrained LLM SFT.
摘要：随着受监督的微调（SFT）从轻巧的训练后步骤演变为计算阶段的规模中的中期培训，数据效率对于在预算紧张的情况下对大型语言模型（LLMS）的一致至关重要。现有的数据修剪方法具有零散的设计：它们在样本级别或孤立的令牌水平上运行，无法共同优化两个维度。这种断开连接会导致明显的低效率 - 高价值样本可能仍然包含多余的令牌，而令牌级的修剪通常会丢弃嵌入在个别示例中的关键教学或纠正信号。为了解决这个瓶颈，我们介绍了误差（EU）平面，这是一个诊断框架，共同表征了跨样品和令牌训练数据的异质效用。在这种见解的指导下，我们提出了基于象限的调整（Q-Tuning），这是一个统一的框架，可以从策略性地协调样品修剪并修剪象征性的修剪。 Q-Tuning采用了两阶段的策略：首先，它执行样品级分类，以保留富含信息误解或校准信号的示例；其次，它采用不对称的代币策略，使用上下文感知的评分机制来减少误解样本中的显着令牌，同时整体保留校准样本。我们的方法在五个不同的基准中设定了新的最新状态。值得注意的是，在SmollM2-1.7B上，Q-Tuning仅使用原始培训数据的12.5％来实现+38 \％的平均改善。作为始终超过全数据训练的第一种动态修剪方法，Q-Tuning提供了一种实用且可扩展的蓝图，以最大程度地利用预算受限的LLM SFT中的数据利用率。

Title: DocPruner: A Storage-Efficient Framework for Multi-Vector Visual Document Retrieval via Adaptive Patch-Level Embedding Pruning

Authors: Yibo Yan, Guangwei Xu, Xin Zou, Shuliang Liu, James Kwok, Xuming Hu
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2509.23883
Pdf URL: https://arxiv.org/pdf/2509.23883
Copy Paste: [[2509.23883]] DocPruner: A Storage-Efficient Framework for Multi-Vector Visual Document Retrieval via Adaptive Patch-Level Embedding Pruning(https://arxiv.org/abs/2509.23883)
Keywords: language model
Abstract: Visual Document Retrieval (VDR), the task of retrieving visually-rich document pages using queries that combine visual and textual cues, is crucial for numerous real-world applications. Recent state-of-the-art methods leverage Large Vision-Language Models (LVLMs) in a multi-vector paradigm, representing each document as patch-level embeddings to capture fine-grained details. While highly effective, this approach introduces a critical challenge: prohibitive storage overhead, as storing hundreds of vectors per page makes large-scale deployment costly and impractical. To address this, we introduce DocPruner, the first framework to employ adaptive patch-level embedding pruning for VDR to effectively reduce the storage overhead. DocPruner leverages the intra-document patch attention distribution to dynamically identify and discard redundant embeddings for each document. This adaptive mechanism enables a significant 50-60% reduction in storage for leading multi-vector VDR models with negligible degradation in document retrieval performance. Extensive experiments across more than ten representative datasets validate that DocPruner offers a robust, flexible, and effective solution for building storage-efficient, large-scale VDR systems.
摘要：视觉文档检索（VDR）是使用将视觉和文本提示结合的查询检索视觉上富的文档页面的任务，对于众多现实世界应用至关重要。最新的最新方法利用多矢量范式中的大型视觉模型（LVLM），代表每个文档作为贴片级嵌入，以捕获细粒细节。尽管非常有效，但这种方法引入了一个关键的挑战：过度的存储开销，因为每页存储数百个向量使大规模部署的成本昂贵和不切实际。为了解决这个问题，我们介绍了Docpruner，这是第一个采用自适应补丁级嵌入修剪来有效减少存储开销的框架。 DOCPRUNER利用文档内部的注意分布分布动态识别和丢弃每个文档的冗余嵌入。这种自适应机制可显着减少50-60％的储存量，用于领先的多矢量VDR模型，在文档检索性能方面具有可忽略的降解。在十多个代表性数据集中进行的广泛实验验证了DOCPRUNER为构建储存效率的大规模VDR系统提供了强大，灵活且有效的解决方案。
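A minimal sketch of per-document adaptive pruning in the spirit of the abstract, assuming each patch embedding comes with a scalar attention weight. The threshold rule used here (keep patches whose attention exceeds the document's own mean plus `k` standard deviations) is an illustrative assumption, not necessarily DocPruner's criterion, and `prune_patches` is a hypothetical name.

```python
import statistics

def prune_patches(embeddings, attention, k=0.0):
    """Keep only patches whose attention exceeds a document-specific threshold.

    The threshold adapts to each document because it is computed from that
    document's own attention distribution (mean + k * population std).
    Returns the kept embeddings and their original indices.
    """
    threshold = statistics.mean(attention) + k * statistics.pstdev(attention)
    kept = [i for i, a in enumerate(attention) if a > threshold]
    if not kept:
        # Fallback: always retain at least the most-attended patch.
        kept = [max(range(len(attention)), key=attention.__getitem__)]
    return [embeddings[i] for i in kept], kept

# Toy page with six 2-dimensional patch embeddings and their attention weights.
embeddings = [[float(i), float(i) + 0.5] for i in range(6)]
attention = [0.05, 0.40, 0.05, 0.30, 0.10, 0.10]

pruned, kept = prune_patches(embeddings, attention)
reduction = 1 - len(pruned) / len(embeddings)
print(kept, f"{reduction:.0%} fewer stored vectors")
```

Raising `k` prunes more aggressively; how far storage can be cut before retrieval quality degrades is an empirical question that this toy example does not address.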

Title: Taming Masked Diffusion Language Models via Consistency Trajectory Reinforcement Learning with Fewer Decoding Step

Authors: Jingyi Yang, Guanxu Chen, Xuhao Hu, Jing Shao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.23924
Pdf URL: https://arxiv.org/pdf/2509.23924
Copy Paste: [[2509.23924]] Taming Masked Diffusion Language Models via Consistency Trajectory Reinforcement Learning with Fewer Decoding Step(https://arxiv.org/abs/2509.23924)
Keywords: language model
Abstract: Masked diffusion language models (MDLMs) have recently emerged as a promising alternative to autoregressive (AR) language models, offering properties such as parallel decoding, flexible generation orders, and the potential for fewer inference steps. Despite these advantages, decoding strategies and reinforcement learning (RL) algorithms tailored for MDLMs remain underexplored. A naive approach is to directly transfer techniques well-established for AR models to MDLMs. However, this raises an immediate question: Is such a naive transfer truly optimal? For example, 1) Block-wise and semi-AR decoding strategies are not employed during the training of MDLMs, so why do they outperform full diffusion-style decoding during inference? 2) Applying RL algorithms designed for AR models directly to MDLMs exhibits a training-inference inconsistency, since MDLM decoding is non-causal (parallel). This results in inconsistencies between the rollout trajectory and the optimization trajectory. To address these challenges, we propose EOS Early Rejection (EOSER) and the Ascending Step-Size (ASS) decoding scheduler, which unlock the potential of MDLMs to perform full diffusion-style decoding, achieving competitive performance with fewer decoding steps. Additionally, we introduce Consistency Trajectory Group Relative Policy Optimization (CJ-GRPO) for taming MDLMs, which emphasizes the consistency between the rollout trajectory and the optimization trajectory, and reduces the optimization errors caused by skip-step optimization. We conduct extensive experiments on reasoning tasks, such as mathematical and planning benchmarks, using LLaDA-8B-Instruct. The results demonstrate that the proposed EOSER and ASS mechanisms, together with CJ-GRPO, hold significant promise for effectively and efficiently taming MDLMs. Code: this https URL.
摘要：蒙版扩散语言模型（MDLMS）最近已成为自回旋（AR）语言模型的有前途的替代方案，它提供了诸如平行解码，灵活的生成订单等属性，并具有更少的推理步骤。尽管有这些优势，但针对MDLMS量身定制的解码策略和强化学习（RL）算法仍未得到充实。一种幼稚的方法是将AR模型的完善技术直接传输到MDLM。但是，这提出了一个直接的问题：如此幼稚的转移真的是最佳的吗？例如，1）在MDLMS培训期间不采用块和半AR解码策略，那么为什么它们在推断过程中的表现要优于完全扩散式的解码呢？ 2）直接将为AR模型设计的RL算法应用于MDLM，表现出训练 - 推断不一致，因为MDLM解码是非造成的（平行）。这会导致推出轨迹和优化轨迹之间的不一致。为了应对这些挑战，我们提出EOS早期排斥（EOSER）和上升的阶梯尺寸（ASS）解码调度程序，该调度程序释放了MDLMS执行完全扩散式解码的潜力，从而在更少的解码步骤中实现了竞争性能。此外，我们引入了针对MDLMS的一致性轨迹相对策略优化（CJ-GRPO），该策略强调了推出轨迹和优化轨迹之间的一致性，并减少了Skip-Step优化引起的优化错误。我们使用LLADA-8B教学进行了有关推理任务的广泛实验，例如数学和计划基准。结果表明，提出的eoser和ASS机制与CJ-GRPO一起有效地有效地驯服MDLM。代码：此HTTPS URL。

Title: Assessing Large Language Models in Updating Their Forecasts with New Information

Authors: Zhangdie Yuan, Zifeng Ding, Andreas Vlachos
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.23936
Pdf URL: https://arxiv.org/pdf/2509.23936
Copy Paste: [[2509.23936]] Assessing Large Language Models in Updating Their Forecasts with New Information(https://arxiv.org/abs/2509.23936)
Keywords: language model, llm
Abstract: Prior work has largely treated future event prediction as a static task, failing to consider how forecasts and the confidence in them should evolve as new evidence emerges. To address this gap, we introduce EVOLVECAST, a framework for evaluating whether large language models appropriately revise their predictions in response to new information. In particular, EVOLVECAST assesses whether LLMs adjust their forecasts when presented with information released after their training cutoff. We use human forecasters as a comparative reference to analyze prediction shifts and confidence calibration under updated contexts. While LLMs demonstrate some responsiveness to new information, their updates are often inconsistent or overly conservative. We further find that neither verbalized nor logits-based confidence estimates consistently outperform the other, and both remain far from the human reference standard. Across settings, models tend to express conservative bias, underscoring the need for more robust approaches to belief updating.
摘要：先前的工作在很大程度上将未来的事件预测视为一项静态任务，未能考虑随着新证据的出现，预测和对它们的信心应如何发展。为了解决这一差距，我们介绍了Evolvecast，这是一个框架，用于评估大型语言模型是否适当地修改其对新信息的预测。特别是，EvolveCast评估LLM在培训截止后发布的信息时是否调整其预测。我们使用人类预报器作为比较参考，以分析更新环境下的预测转移和置信度校准。尽管LLMS对新信息有所反应，但它们的更新通常是不一致或过于保守的。我们进一步发现，口头上也不是基于逻辑的置信度估计始终胜过对方，并且两者都远离人类参考标准。在整个设置中，模型倾向于表达保守的偏见，强调了对更新更新的更强大方法的需求。

Title: Easy Turn: Integrating Acoustic and Linguistic Modalities for Robust Turn-Taking in Full-Duplex Spoken Dialogue Systems

Authors: Guojian Li, Chengyou Wang, Hongfei Xue, Shuiyuan Wang, Dehui Gao, Zihan Zhang, Yuke Lin, Wenjie Li, Longshuai Xiao, Zhonghua Fu, Lei Xie
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.23938
Pdf URL: https://arxiv.org/pdf/2509.23938
Copy Paste: [[2509.23938]] Easy Turn: Integrating Acoustic and Linguistic Modalities for Robust Turn-Taking in Full-Duplex Spoken Dialogue Systems(https://arxiv.org/abs/2509.23938)
Keywords: llm
Abstract: Full-duplex interaction is crucial for natural human-machine communication, yet remains challenging as it requires robust turn-taking detection to decide when the system should speak, listen, or remain silent. Existing solutions either rely on dedicated turn-taking models or finetune LLM backbones. Most dedicated models are not open-sourced, and the few available ones are limited by their large parameter size or by supporting only a single modality, such as acoustic or linguistic. Finetuning LLM backbones enables full-duplex capability, but it requires large amounts of full-duplex data, which remain scarce in open-source form. To address these issues, we propose Easy Turn, an open-source, modular turn-taking detection model that integrates acoustic and linguistic bimodal information to predict four dialogue turn states: complete, incomplete, backchannel, and wait. It is accompanied by the release of Easy Turn trainset, a 1,145-hour speech dataset designed for training turn-taking detection models. Compared to existing open-source models like TEN Turn Detection and Smart Turn V2, our model achieves state-of-the-art turn-taking detection accuracy on our open-source Easy Turn testset. The data and model will be made publicly available on GitHub.
摘要：全双工互动对于自然的人机交流至关重要，但仍然具有挑战性，因为它需要强大的转弯检测才能决定系统何时应讲话，倾听或保持沉默。现有的解决方案要么依赖于专用的转弯模型，其中大多数不是开源的。少数可用的参数大小或仅支持单一模态（例如声学或语言）的限制。另外，有些方法可以使用Finetune LLM骨架来实现全双工功能，但这需要大量的全套数据数据，这些数据以开源形式仍然很少。为了解决这些问题，我们提出了简单的转弯，一种开源的，模块化的转向检测模型，该模型集成了声学和语言的双峰信息，以预测四个对话转弯状态：完整，不完整，频道和等待，并伴随着易于的转弯列车集，易于发行，是一种1,145小时的演讲数据集，用于训练转向检测模型。与现有的开源型号（如十回合检测和智能转弯V2）相比，我们的模型在我们的开源易于转弯测试集中实现了最新的转弯检测精度。数据和模型将在GitHub上公开可用。

Title: Vision-Grounded Machine Interpreting: Improving the Translation Process through Visual Cues

Authors: Claudio Fantinuoli
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.23957
Pdf URL: https://arxiv.org/pdf/2509.23957
Copy Paste: [[2509.23957]] Vision-Grounded Machine Interpreting: Improving the Translation Process through Visual Cues(https://arxiv.org/abs/2509.23957)
Keywords: language model
Abstract: Machine Interpreting systems are currently implemented as unimodal, real-time speech-to-speech architectures, processing translation exclusively on the basis of the linguistic signal. Such reliance on a single modality, however, constrains performance in contexts where disambiguation and adequacy depend on additional cues, such as visual, situational, or pragmatic information. This paper introduces Vision-Grounded Interpreting (VGI), a novel approach designed to address the limitations of unimodal machine interpreting. We present a prototype system that integrates a vision-language model to process both speech and visual input from a webcam, with the aim of priming the translation process through contextual visual information. To evaluate the effectiveness of this approach, we constructed a hand-crafted diagnostic corpus targeting three types of ambiguity. In our evaluation, visual grounding substantially improves lexical disambiguation, yields modest and less stable gains for gender resolution, and shows no benefit for syntactic ambiguities. We argue that embracing multimodality represents a necessary step forward for advancing translation quality in machine interpreting.
摘要：机器解释系统目前被实现为单峰，实时语音到语音架构，仅根据语言信号进行处理。但是，这种对单一模式的依赖会限制在歧义和充分性取决于其他提示（例如视觉，情境或务实信息）的上下文中的性能。本文介绍了视觉基础解释（VGI），这是一种旨在解决非模式机器解释的局限性的新颖方法。我们提出了一个原型系统，该系统集成了视觉模型，以处理网络摄像头的语音和视觉输入，目的是通过上下文视觉信息启动翻译过程。为了评估这种方法的有效性，我们构建了一种针对三种类型歧义的手工制作的诊断语料库。在我们的评估中，视觉接地基本上改善了词汇歧义，对性别解决方案产生了适度且稳定的增长，并且对句法歧义没有任何好处。我们认为，拥抱多模式代表着在机器解释中提高翻译质量的必要一步。

Title: HiPO: Hybrid Policy Optimization for Dynamic Reasoning in LLMs

Authors: Ken Deng, Zizheng Zhan, Wen Xiang, Wenqiang Zhu, Tianhao Peng, Xinping Lei, Weihao Li, Jingxuan Xu, Kun Wu, Yifan Yao, Haoyang Huang, Huaixi Tang, Kepeng Lei, Zhiyi Lai, Songwei Yu, Zongxian Feng, Zuchen Gao, Weihao Xie, Chenchen Zhang, Yanan Wu, Yuanxing Zhang, Lecheng Huang, Yuqun Zhang, Jie Liu, Zhaoxiang Zhang, Haotian Zhang, Bin Chen, Jiaheng Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.23967
Pdf URL: https://arxiv.org/pdf/2509.23967
Copy Paste: [[2509.23967]] HiPO: Hybrid Policy Optimization for Dynamic Reasoning in LLMs(https://arxiv.org/abs/2509.23967)
Keywords: language model, llm, chain-of-thought
Abstract: Large Language Models (LLMs) increasingly rely on chain-of-thought (CoT) reasoning to improve accuracy on complex tasks. However, always generating lengthy reasoning traces is inefficient, leading to excessive token usage and higher inference costs. This paper introduces Hybrid Policy Optimization (HiPO), a framework for adaptive reasoning control that enables LLMs to selectively decide when to engage in detailed reasoning (Think-on) and when to respond directly (Think-off). Specifically, HiPO combines a hybrid data pipeline, which provides paired Think-on and Think-off responses, with a hybrid reinforcement learning reward system that balances accuracy and efficiency while avoiding over-reliance on detailed reasoning. Experiments across mathematics and coding benchmarks demonstrate that HiPO can substantially reduce token length while maintaining or improving accuracy. Finally, we hope HiPO can be a principled approach for efficient adaptive reasoning, advancing the deployment of reasoning-oriented LLMs in real-world, resource-sensitive settings.
摘要：大型语言模型（LLMS）越来越依赖于经营链（COT）推理来提高复杂任务的准确性。但是，始终产生冗长的推理轨迹效率低下，导致代币使用过多和推理成本更高。本文介绍了混合策略优化（即HIPO），这是一个自适应推理控制的框架，使LLMS能够选择性地决定何时进行详细的推理（思考）和何时直接响应（思考）。具体而言，HIPO结合了混合数据管道提供的配对思想和思考回答，并与混合增强学习奖励系统相结合，该奖励系统可以平衡准确性和效率，同时避免过度依赖详细的推理。跨数学和编码基准的实验表明，HIPO可以在保持或提高准确性的同时大大减少令牌长度。最后，我们希望HIPO A可以成为有效的自适应推理的原则方法，从而推进了面向推理的LLM在现实世界中对资源敏感的环境中的部署。

Title: ByteSized32Refactored: Towards an Extensible Interactive Text Games Corpus for LLM World Modeling and Evaluation

Authors: Haonan Wang, Junfeng Sun, Xingdi Yuan, Ruoyao Wang, Ziang Xiao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.23979
Pdf URL: https://arxiv.org/pdf/2509.23979
Copy Paste: [[2509.23979]] ByteSized32Refactored: Towards an Extensible Interactive Text Games Corpus for LLM World Modeling and Evaluation(https://arxiv.org/abs/2509.23979)
Keywords: language model, gpt, llm
Abstract: Simulating interactive world models remains a core challenge in Large Language Models (LLMs). In this work, we introduce ByteSized32Refactored, a refactored, modular, and extensible implementation of the original ByteSized32 corpus to explore the task of text game generation. We further optimize the code structure of each text game and create the this http URL foundation library, which centralizes common logic across all 32 games by abstracting 7 base classes (GameObject, etc.) into reusable modules, thereby reducing the total from 20k to 10k lines of Python code compared to the original ByteSized32. Our refactored implementation enables extendability: with our centralized design, ByteSized32Refactored can be more efficiently extended to include text games of new scenarios and specifications by reusing the shared logic and functionalities. Extensive experiments with GPT-4o demonstrate mixed performance: with ByteSized32Refactored, the generated text games for unseen scenarios show quality improvements on two of the four evaluation dimensions and decreases on the other two, indicating that the hierarchical structure of the refactored code presents new challenges for LLMs. Overall, we highlight that our extensible code structure, centered on the foundation library and the modular optimization, not only facilitates LLM adaptation to environment specifications but also establishes a scalable environment that supports future extensions.
摘要：在大语言模型（LLM）中，模拟交互式世界模型仍然是一个核心挑战。在这项工作中，我们介绍了Bytesized32 Refactored，对原始Bytesized32 Copus的重构，模块化和可扩展的实现，以探索文本游戏生成的任务。我们进一步优化了每个文本游戏的代码结构，并创建了此HTTP URL基础库，该库通过将7个基类（GameObject等）抽象为可重复使用的模块，从而在所有32场游戏中集中了共同的逻辑，从而从20K到10K的总python代码，与原始的Bytesized32相比，将其从20K降低到10K。我们的重构实现启用了可扩展性 - 通过我们的集中设计，可以更有效地扩展bytesized32 recreatized 32，以通过重复使用共享的逻辑和功能来包括新方案和规格的文本游戏。 GPT -4O进行的广泛实验表明了性能的混合 - 与副作用32重新分配，生成的文本游戏，用于看不见的场景，展示了四个评估维度中的两个，而另外两个评估维度中的两个，同时降低了其他两个评估，这表明Repactored Code的层次结构提出了LLMS的新挑战。总体而言，我们强调的是，我们的可扩展代码结构以基础库和模块化优化为中心，不仅促进了LLM对环境规范的适应，而且还建立了支持未来扩展的可扩展环境。

Title: Toward Preference-aligned Large Language Models via Residual-based Model Steering

Authors: Lucio La Cava, Andrea Tagarelli
Subjects: cs.CL, cs.AI, cs.CY, cs.LG, cs.NE
Abstract URL: https://arxiv.org/abs/2509.23982
Pdf URL: https://arxiv.org/pdf/2509.23982
Copy Paste: [[2509.23982]] Toward Preference-aligned Large Language Models via Residual-based Model Steering(https://arxiv.org/abs/2509.23982)
Keywords: language model, llm
Abstract: Preference alignment is a critical step in making Large Language Models (LLMs) useful and aligned with (human) preferences. Existing approaches such as Reinforcement Learning from Human Feedback or Direct Preference Optimization typically require curated data and expensive optimization over billions of parameters, and eventually lead to persistent task-specific models. In this work, we introduce Preference alignment of Large Language Models via Residual Steering (PaLRS), a training-free method that exploits preference signals encoded in the residual streams of LLMs. From as few as one hundred preference pairs, PaLRS extracts lightweight, plug-and-play steering vectors that can be applied at inference time to push models toward preferred behaviors. We evaluate PaLRS on various small-to-medium-scale open-source LLMs, showing that PaLRS-aligned models achieve consistent gains on mathematical reasoning and code generation benchmarks while preserving baseline general-purpose performance. Moreover, when compared to DPO-aligned models, they perform better with huge time savings. Our findings highlight that PaLRS offers an effective, much more efficient and flexible alternative to standard preference optimization pipelines, offering a training-free, plug-and-play mechanism for alignment with minimal data.
摘要：偏好对齐是使大语模型（LLM）有用并与（人类）偏好保持一致的关键步骤。现有的方法，例如从人类反馈中学习或直接偏好优化的增强学习，通常需要精选的数据和数十亿个参数的昂贵优化，并最终导致持续的特定于任务的模型。在这项工作中，我们通过剩余转向（PALRS）引入了大语言模型的偏好对齐，这是一种无训练的方法，可利用LLMS残留流中编码的偏好信号。从只有一百个偏好对开始，Palrs提取了轻巧的插件转向向量，可以在推理时间应用，以将模型推向首选行为。我们在各种中小型开源LLMS上评估PALR，表明与PALRS对准模型在数学推理和代码生成基准方面具有一致的增长，同时保留了基线通用性能。此外，与DPO的模型相比，它们可以节省大量时间。我们的发现强调，Palrs为标准偏好优化管道提供了一种有效，更有效，更灵活的替代方案，提供了一种无训练的插件机制，可与最小数据保持一致。

Title: The Hidden Costs of Translation Accuracy: Distillation, Quantization, and Environmental Impact

Authors: Dhaathri Vijay, Anandaswarup Vadapalli
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.23990
Pdf URL: https://arxiv.org/pdf/2509.23990
Copy Paste: [[2509.23990]] The Hidden Costs of Translation Accuracy: Distillation, Quantization, and Environmental Impact(https://arxiv.org/abs/2509.23990)
Keywords: language model, llm
Abstract: The rapid expansion of large language models (LLMs) has heightened concerns about their computational and environmental costs. This study investigates the trade-offs between translation quality and efficiency by comparing full-scale, distilled, and quantized models using machine translation as a case study. We evaluated performance on the Flores+ benchmark and through human judgments of conversational translations in French, Hindi, and Kannada. Our analysis of carbon emissions per evaluation run revealed that the full 3.3B fp32 model, while achieving the highest BLEU scores, incurred the largest environmental footprint (about 0.007-0.008 kg CO2 per run). The distilled models achieved inference up to 4.5x faster than the full 3.3B model, with only minimal reductions in BLEU scores. Human evaluations also showed that even aggressive quantization (INT4) preserved high levels of accuracy and fluency, with differences between models generally minor. These findings demonstrate that model compression strategies can substantially reduce computational demands and environmental impact while maintaining competitive translation quality, though trade-offs are more pronounced in low-resource settings. We argue for evaluation frameworks that integrate efficiency and sustainability alongside objective metrics as central dimensions of progress in NLP.
摘要：大型语言模型（LLMS）的快速扩展引起了人们对其计算和环境成本的关注。这项研究通过使用机器翻译作为案例研究来比较全尺度，蒸馏和量化模型，研究了翻译质量和效率之间的权衡。我们通过法语，印地语和卡纳达语中的对话翻译进行了人类对对话翻译的判断评估了表现。我们对每次评估碳排放的分析表明，达到最高的BLEU得分的整个3.3B FP32模型，却产生了最大的环境足迹（约为0.007-0.008 kg CO2，每次运行）。蒸馏模型的推断速度比完整的3.3b模型快4.5倍，而BLEU得分的降低只有最小的降低。人类评估还表明，即使是积极的量化（INT4）也保留了高水平的准确性和流利度，模型之间的差异通常很小。这些发现表明，模型压缩策略可以大大减少计算需求和环境影响，同时保持竞争性翻译质量，尽管在低资源环境中取舍更为明显。我们主张评估框架，将效率和可持续性与客观指标一起作为NLP进步的中心方面。

Title: The AI Agent Code of Conduct: Automated Guardrail Policy-as-Prompt Synthesis

Authors: Gauri Kholkar, Ratinder Ahuja
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.23994
Pdf URL: https://arxiv.org/pdf/2509.23994
Copy Paste: [[2509.23994]] The AI Agent Code of Conduct: Automated Guardrail Policy-as-Prompt Synthesis(https://arxiv.org/abs/2509.23994)
Keywords: language model, llm, prompt, agent
Abstract: As autonomous AI agents are increasingly deployed in industry, it is essential to safeguard them. We introduce a novel framework that automates the translation of unstructured design documents into verifiable, real-time guardrails. We introduce "Policy as Prompt," a new approach that uses Large Language Models (LLMs) to interpret and enforce natural language policies by applying contextual understanding and the principle of least privilege. Our system first ingests technical artifacts to construct a verifiable policy tree, which is then compiled into lightweight, prompt-based classifiers that audit agent behavior at runtime. We validate our approach across diverse applications, demonstrating a scalable and auditable pipeline that bridges the critical policy-to-practice gap, paving the way for verifiably safer and more regulatable AI.
摘要：随着自主AI代理人越来越多地部署在行业中，因此必须保护它们。我们介绍了一个新颖的框架，将非结构化设计文档的翻译自动化为可验证的实时护栏。我们介绍了一种使用大型语言模型（LLMS）来解释和执行自然语言政策的新方法，即通过应用上下文理解和最低特权的原则来解释和执行自然语言政策。我们的系统首先摄入技术工件以构建可验证的策略树，然后将其编译成可在运行时审核代理行为的轻质，及时的分类器。我们在各种应用程序中验证了我们的方法，展示了可扩展和可审计的管道，该管道弥合了策略对实践差距的关键，为可见更安全，更可监管的AI铺平了道路。

Title: MCPMark: A Benchmark for Stress-Testing Realistic and Comprehensive MCP Use

Authors: Zijian Wu, Xiangyan Liu, Xinyuan Zhang, Lingjun Chen, Fanqing Meng, Lingxiao Du, Yiran Zhao, Fanshi Zhang, Yaoqi Ye, Jiawei Wang, Zirui Wang, Jinjie Ni, Yufan Yang, Arvin Xu, Michael Qizhe Shieh
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.24002
Pdf URL: https://arxiv.org/pdf/2509.24002
Copy Paste: [[2509.24002]] MCPMark: A Benchmark for Stress-Testing Realistic and Comprehensive MCP Use(https://arxiv.org/abs/2509.24002)
Keywords: gpt, llm, agent
Abstract: MCP standardizes how LLMs interact with external systems, forming the foundation for general agents. However, existing MCP benchmarks remain narrow in scope: they focus on read-heavy tasks or tasks with limited interaction depth, and fail to capture the complexity and realism of real-world workflows. To address this gap, we propose MCPMark, a benchmark designed to evaluate MCP use in a more realistic and comprehensive manner. It consists of 127 high-quality tasks collaboratively created by domain experts and AI agents. Each task begins with a curated initial state and includes a programmatic script for automatic verification. These tasks demand richer and more diverse interactions with the environment, involving a broad range of create, read, update, and delete (CRUD) operations. We conduct a comprehensive evaluation of cutting-edge LLMs using a minimal agent framework that operates in a tool-calling loop. Empirical results show that the best-performing model, gpt-5-medium, reaches only 52.56% pass@1 and 33.86% pass^4, while other widely regarded strong models, including claude-sonnet-4 and o3, fall below 30% pass@1 and 15% pass^4. On average, LLMs require 16.2 execution turns and 17.4 tool calls per task, significantly surpassing those in previous MCP benchmarks and highlighting the stress-testing nature of MCPMark.
摘要：MCP标准化LLM与外部系统的相互作用，为普通代理构成基础。但是，现有的MCP基准在范围内仍然很狭窄：它们专注于读取的任务或相互作用深度有限的任务，并且无法捕获现实世界工作流的复杂性和现实性。为了解决这一差距，我们提出了MCPMark，这是一种基准，旨在以更现实，更全面的方式评估MCP使用。它由$ 127 $的高质量任务组成，由域专家和AI代理人共同创建。每个任务均以精选的初始状态开头，并包含一个用于自动验证的程序化脚本。这些任务要求与环境更丰富，更多样化的互动，涉及广泛的创建，阅读，更新和删除（CRUD）操作。我们使用在工具循环中运行的最小代理框架对尖端LLM进行了全面评估。经验结果表明，表现最佳的型号GPT-5-米数仅达到$ 52.56 $ \％pass@1和$ 33.86 $ \％\％pass^4，而其他广泛认为的强型号，包括Claude-Sonnet-4和O3，包括Claude-Sonnnet-4和O3，跌至$ 30 $ \％Pass@1和$ 15 $ \％\％Pass^4。平均而言，LLMS要求$ 16.2 $执行转弯和$ 17.4 $的每项任务呼叫，这大大超过了MCP基准中的那些，并突出了MCPMark的压力测试性质。
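The pass@1 and pass^4 figures quoted in the MCPMark abstract can be illustrated with a small sketch. It assumes that each task is attempted four times, that pass@1 is the mean per-run success rate, and that pass^4 credits a task only when all four runs succeed. The abstract does not spell out these definitions, and the function names and toy data below are hypothetical.

```python
from typing import Dict, List


def pass_at_1(results: Dict[str, List[bool]]) -> float:
    """Mean success rate over every (task, run) pair (assumed definition)."""
    runs = [ok for outcomes in results.values() for ok in outcomes]
    return sum(runs) / len(runs)


def pass_hat_k(results: Dict[str, List[bool]]) -> float:
    """Fraction of tasks whose runs ALL succeeded (assumed definition)."""
    solved_every_time = sum(all(outcomes) for outcomes in results.values())
    return solved_every_time / len(results)


# Hypothetical outcomes: three tasks, four independent runs each.
toy = {
    "task_a": [True, True, True, True],
    "task_b": [True, False, True, True],
    "task_c": [False, False, False, False],
}

print(f"pass@1 = {pass_at_1(toy):.3f}")
print(f"pass^4 = {pass_hat_k(toy):.3f}")
```

Under these assumptions pass^4 can never exceed pass@1, which matches the ordering of the numbers reported above: a single failed run lowers pass@1 only slightly but removes the task from pass^4 entirely.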

Title: Sequential Diffusion Language Models

Authors: Yangzhou Liu, Yue Cao, Hao Li, Gen Luo, Zhe Chen, Weiyun Wang, Xiaobo Liang, Biqing Qi, Lijun Wu, Changyao Tian, Yanting Zhang, Yuqiang Li, Tong Lu, Yu Qiao, Jifeng Dai, Wenhai Wang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2509.24007
Pdf URL: https://arxiv.org/pdf/2509.24007
Copy Paste: [[2509.24007]] Sequential Diffusion Language Models(https://arxiv.org/abs/2509.24007)
Keywords: language model
Abstract: Diffusion language models (DLMs) have strong theoretical efficiency but are limited by fixed-length decoding and incompatibility with key-value (KV) caches. Block diffusion mitigates these issues, yet still enforces a fixed block size and requires expensive training. We introduce Next Sequence Prediction (NSP), which unifies next-token and next-block prediction, enabling the model to adaptively determine the generation length at each step. When the length is fixed to 1, NSP reduces to standard next-token prediction. Building on NSP, we propose Sequential Diffusion Language Model (SDLM), which can retrofit pre-trained autoregressive language models (ALMs) at minimal cost. Specifically, SDLM performs diffusion inference within fixed-size mask blocks, but dynamically decodes consecutive subsequences based on model confidence, thereby preserving KV-cache compatibility and improving robustness to varying uncertainty and semantics across the sequence. Experiments show that SDLM matches or surpasses strong autoregressive baselines using only 3.5M training samples, while achieving 2.1x higher throughput than Qwen-2.5. Notably, the SDLM-32B model delivers even more pronounced efficiency gains, demonstrating the strong scalability potential of our modeling paradigm. Project page and codes: this https URL
摘要：扩散语言模型（DLMS）具有强大的理论效率，但受到固定长度解码和与键值（KV）缓存不兼容的限制。块扩散减轻了这些问题，但仍会实施固定的块大小，需要昂贵的培训。我们介绍了下一个序列预测（NSP），该预测统一了下一步和下一个块预测，从而使模型能够自适应地确定每个步骤的生成长度。当长度固定为1时，NSP将减少为标准的下一步预测。在NSP的基础上，我们提出了顺序扩散语言模型（SDLM），该模型可以以最低的成本改造预训练的自回归语言模型（ALMS）。具体而言，SDLM在固定尺寸的掩码块内执行扩散推断，但是基于模型置信度动态解码连续解码，从而保留了KV-CACH的兼容性并提高了对整个序列的不同不确定性和语义的鲁棒性。实验表明，SDLM仅使用350万个训练样本匹配或超过强大的自回归基线，而比QWEN-2.5的吞吐量高2.1。值得注意的是，SDLM-32B模型提供了更明显的效率提高，这表明了我们建模范式的强可伸缩性潜力。项目页面和代码：此HTTPS URL
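The confidence-based decoding described in the SDLM abstract can be sketched roughly as follows. The sketch assumes the model proposes one token and one confidence score per position in a fixed-size block, and that the accepted subsequence is the longest prefix whose confidences all clear a threshold. The threshold value, the rule of always accepting the first token, and the function name are assumptions for illustration, not details taken from the paper.

```python
from typing import List


def accept_prefix(tokens: List[str], confidences: List[float],
                  threshold: float = 0.9) -> List[str]:
    """Return the longest confident prefix of a proposed block.

    The first token is always kept (an assumption) so that decoding
    advances by at least one position per step, as in next-token prediction.
    """
    if len(tokens) != len(confidences):
        raise ValueError("tokens and confidences must have the same length")
    if not tokens:
        return []
    accepted = 1
    for conf in confidences[1:]:
        if conf < threshold:
            break  # stop at the first low-confidence position
        accepted += 1
    return tokens[:accepted]


# Hypothetical block of four proposed tokens.
block = ["The", "cat", "sat", "down"]
scores = [0.99, 0.95, 0.50, 0.97]
print(accept_prefix(block, scores))
```

Because only a contiguous prefix is accepted, the committed tokens always extend the sequence left to right, which is the property that keeps a KV cache reusable; positions after the first low-confidence token are simply proposed again in the next step.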

Title: SparseD: Sparse Attention for Diffusion Language Models

Authors: Zeqing Wang, Gongfan Fang, Xinyin Ma, Xingyi Yang, Xinchao Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.24014
Pdf URL: https://arxiv.org/pdf/2509.24014
Copy Paste: [[2509.24014]] SparseD: Sparse Attention for Diffusion Language Models(https://arxiv.org/abs/2509.24014)
Keywords: language model
Abstract: While diffusion language models (DLMs) offer a promising alternative to autoregressive models (ARs), existing open-source DLMs suffer from high inference latency. This bottleneck is mainly due to the attention's quadratic complexity with respect to context length in computing all query-key pairs. Intuitively, to reduce this complexity, a natural strategy is to restrict attention to sparse patterns that retain only the most relevant connections. Such approaches are well-established in ARs, where attention follows fixed and clearly defined sparse patterns. However, in DLMs, we observe distinct sparsity behaviors: (1) attention patterns vary across heads, (2) attention patterns in each head remain highly similar across denoising steps, and (3) early denoising steps are critical for generation. These findings render sparse attention methods designed for ARs largely incompatible with DLMs, as they fail to capture head-specific structures and risk degrading generation when applied in early denoising steps. To address these challenges, we propose SparseD, a novel sparse attention method for DLMs. Leveraging the observations, SparseD only requires pre-computing head-specific sparse patterns one time, and reuses them across all steps. This prevents recomputing sparse patterns at each denoising step. Meanwhile, SparseD uses full attention in the early steps, then switches to sparse attention later to maintain generation quality. Together, these establish SparseD as a practical and efficient solution for deploying DLMs in long-context applications. Experimental results demonstrate that SparseD achieves lossless acceleration, delivering up to 1.50x speedup over FlashAttention at a 64k context length with 1,024 denoising steps.
摘要：尽管扩散语言模型（DLMS）为自回归模型（ARS）提供了有希望的替代方法，但现有的开源DLMS遭受了很高的推理潜伏期。这种瓶颈主要是由于注意所有查询对键对的上下文长度的二次复杂性。凭直觉，为了降低这种复杂性，一种自然策略是将注意力限制在仅保留最相关联系的稀疏模式上。这种方法在ARS中建立了良好的，其中注意力遵循固定且明确定义的稀疏模式。但是，在DLM中，我们观察到不同的稀疏性行为：（1）注意力模式在整个头部各不相同，（2）每个头部的注意力模式在脱氧步骤中保持高度相似，并且（3）早期脱氧步骤对于发电至关重要。这些发现使专为ARS设计的稀疏注意方法在很大程度上与DLM不兼容，因为它们无法捕获特定于头部的结构和在早期脱氧步骤中应用时的风险降解产生。为了应对这些挑战，我们提出了稀疏的DLMS稀疏注意方法。利用观测值仅需要一次预计头部特定的稀疏图案，并将其重新遍及所有步骤。这样可以防止在每个DeNoising步骤中重新计算稀疏图案。同时，稀疏在早期步骤中会充分关注，然后切换到以后的稀疏注意力以保持发电质量。这些共同将稀疏作为在长篇文化应用程序中部署DLM的实用和高效解决方案。实验结果表明，稀疏实现无损的加速度，在64K上下文长度上以1,024个denoising步骤提供了高达$ 1.50 \ times $ speedup time $速度。
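The two ideas in the SparseD abstract (full attention in early denoising steps, then a head-specific sparse pattern computed once and reused) can be sketched in plain Python. The fraction of steps that use full attention, the top-k selection rule, and both function names are assumptions made for illustration; the abstract does not specify them.

```python
import math
from typing import List


def attention_plan(num_steps: int, full_ratio: float = 0.2) -> List[str]:
    """Label each denoising step as 'full' or 'sparse' attention.

    full_ratio is a hypothetical knob: the share of early steps that
    keep full attention before switching to the reused sparse pattern.
    """
    full_steps = math.ceil(num_steps * full_ratio)
    return ["full" if t < full_steps else "sparse" for t in range(num_steps)]


def topk_pattern(scores: List[List[float]], keep: int) -> List[List[int]]:
    """For one head, keep the `keep` highest-scoring key indices per query row."""
    pattern = []
    for row in scores:
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        pattern.append(sorted(ranked[:keep]))
    return pattern


# Hypothetical attention scores for a single head (2 queries x 3 keys).
head_scores = [[0.1, 0.7, 0.2],
               [0.5, 0.4, 0.1]]
pattern = topk_pattern(head_scores, keep=2)  # computed once, reused later
print(attention_plan(10))
print(pattern)
```

In this sketch the pattern is built per head because the abstract reports that patterns differ across heads but stay similar across steps, so one computation per head is enough to cover every later sparse step.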

Title: Ensembling Multilingual Transformers for Robust Sentiment Analysis of Tweets

Authors: Meysam Shirdel Bilehsavar, Negin Mahmoudi, Mohammad Jalili Torkamani, Kiana Kiashemshaki
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.24080
Pdf URL: https://arxiv.org/pdf/2509.24080
Copy Paste: [[2509.24080]] Ensembling Multilingual Transformers for Robust Sentiment Analysis of Tweets(https://arxiv.org/abs/2509.24080)
Keywords: language model, llm
Abstract: Sentiment analysis is an important natural language processing task that identifies the polarity of a text, that is, whether it conveys positive, negative, or neutral sentiment. Along with the growth of social media and the Internet, the significance of sentiment analysis has grown across numerous industries such as marketing, politics, and customer service. Sentiment analysis is flawed, however, when applied to foreign languages, particularly when there is no labelled data on which to train models. In this study, we present a transformer ensemble model and a large language model (LLM) that perform sentiment analysis on other languages. We used a multilingual dataset. Sentiment was then assessed for sentences using an ensemble of pre-trained sentiment analysis models: bert-base-multilingual-uncased-sentiment and XLM-R. Our experimental results indicated that sentiment analysis performance was more than 86% using the proposed method.
摘要：情感分析是一种非常重要的自然语言处理活动，其中人们识别文本的极性，无论是传达正，消极还是中性情绪。随着社交媒体和互联网的增长，情感分析的重要性在众多行业，例如市场，政治和客户服务等众多行业中增长。但是，当应用于外语时，情感分析是有缺陷的，尤其是在没有标记的数据进行培训模型时。在这项研究中，我们提出了一种使用其他语言的情感分析的变压器集合模型和大型语言模型（LLM）。我们使用了多语言数据集。然后，使用预先训练的情感分析模型的集合来评估情感的句子：Bert-Base-Multlingual-Incundiment-sendiment和XLM-R。我们的实验结果表明，使用该方法，情感分析绩效超过86％。
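One common way to combine two pre-trained sentiment classifiers is soft voting, that is, averaging their class probabilities and taking the most likely label. The abstract does not state how the ensemble members are combined, so the sketch below is an assumption rather than the authors' method, and the probability vectors are made-up stand-ins for real model outputs.

```python
from typing import List, Sequence, Tuple


def soft_vote(prob_sets: Sequence[Sequence[float]],
              labels: Sequence[str]) -> Tuple[str, List[float]]:
    """Average per-class probabilities across models and pick the top label."""
    if not prob_sets:
        raise ValueError("need at least one model output")
    if any(len(p) != len(labels) for p in prob_sets):
        raise ValueError("every model must score every label")
    n_models = len(prob_sets)
    averaged = [sum(p[i] for p in prob_sets) / n_models
                for i in range(len(labels))]
    best = max(range(len(labels)), key=averaged.__getitem__)
    return labels[best], averaged


LABELS = ["negative", "neutral", "positive"]
# Hypothetical probabilities from two models for the same tweet.
model_a = [0.2, 0.3, 0.5]
model_b = [0.1, 0.6, 0.3]

label, averaged = soft_vote([model_a, model_b], LABELS)
print(label, [round(p, 2) for p in averaged])
```

Soft voting requires both models to share one label set; a model that predicts star ratings, for example, would first need its outputs mapped onto negative, neutral, and positive.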

Title: Large-Scale Constraint Generation - Can LLMs Parse Hundreds of Constraints?

Authors: Matteo Boffa, Jiaxuan You
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.24090
Pdf URL: https://arxiv.org/pdf/2509.24090
Copy Paste: [[2509.24090]] Large-Scale Constraint Generation - Can LLMs Parse Hundreds of Constraints?(https://arxiv.org/abs/2509.24090)
Keywords: language model, llm, prompt
Abstract: Recent research has explored the constrained generation capabilities of Large Language Models (LLMs) when explicitly prompted by a few task-specific requirements. In contrast, we introduce Large-Scale Constraint Generation (LSCG), a new problem that evaluates whether LLMs can parse a large, fine-grained, generic list of constraints. To examine the LLMs' ability to handle an increasing number of constraints, we create a practical instance of LSCG, called Words Checker. In Words Checker, we evaluate the impact of model characteristics (e.g., size, family) and steering techniques (e.g., Simple Prompt, Chain of Thought, Best of N) on performance. We also propose FoCusNet, a small and dedicated model that parses the original list of constraints into a smaller subset, helping the LLM focus on relevant constraints. Experiments reveal that existing solutions suffer a significant performance drop as the number of constraints increases, with FoCusNet showing an 8-13% accuracy boost.
摘要：最近的研究探索了大型语言模型（LLMS）的限制生成能力，当时很少有特定于任务的要求提示。相比之下，我们引入了大规模约束生成（LSCG），这是一个新问题，它评估LLM是否可以解析大型，细粒度的限制列表。为了检查LLMS处理不断增长的数字约束的能力，我们创建了LSCG的实例，称为单词检查器。用文字检查器，我们评估了模型特征（例如大小，家庭）和转向技术（例如，简单的提示，思想链，n链，n的最佳）的影响。我们还提出了一种小型而专用的模型FocusNet，将原始约束列表解析为较小的子集，帮助LLM专注于相关的约束。实验表明，随着约束数量的增加，现有解决方案的性能下降显着下降，而FocusNet显示出8-13％的精度提高。

Title: GEAR: A General Evaluation Framework for Abductive Reasoning

Authors: Kaiyu He, Peilin Wu, Mian Zhang, Kun Wan, Wentian Zhao, Xinya Du, Zhiyu Chen
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.24096
Pdf URL: https://arxiv.org/pdf/2509.24096
Copy Paste: [[2509.24096]] GEAR: A General Evaluation Framework for Abductive Reasoning(https://arxiv.org/abs/2509.24096)
Keywords: language model, llm
Abstract: Since the advent of large language models (LLMs), research has focused on instruction following and deductive reasoning. A central question remains: can these models discover new knowledge, and how can we evaluate this ability? We address this by studying abductive reasoning (the generation of plausible hypotheses to explain observations) and introduce GEAR (General Evaluation for Abductive Reasoning), a general-purpose, fully automated, transparent, and label-free evaluation paradigm. GEAR scores hypothesis sets by three metrics: consistency (each hypothesis explains the observations), generalizability (consistent hypotheses make meaningful predictions on unseen inputs), and diversity (the set covers distinct predictions and patterns). Built this way, GEAR is scalable (no human gold answers), reliable (deterministic scoring aligned with classical abduction), and open-ended (scores improve only when models produce new plausible hypotheses, unlike static benchmarks that saturate once accuracy is high). Using GEAR, we conduct a fine-grained study of nine LLMs on four abduction benchmarks with 1,500 problems, generating over 50,000 candidate hypotheses and revealing model differences obscured by gold-answer or purely human evaluations. We further propose a momentum-based curriculum that adjusts GEAR-derived training data by learning velocity: it starts with what the model learns quickly and shifts toward harder objectives such as generating diverse hypotheses once the model is confident on foundational objectives. Without gold-label supervision, this strategy improves all GEAR objectives and these gains transfer to established abductive reasoning benchmarks. Taken together, GEAR provides a principled framework that evaluates abduction and supplies label-free, scalable training signals that help LLMs produce more diverse and reliable hypotheses.
摘要：自大语言模型（LLMS）的出现以来，研究的重点是跟随和演绎推理。一个中心问题仍然存在：这些模型可以发现新知识，我们如何评估这种能力？我们通过研究绑架性推理 - 产生合理的假设来解释观察值的产生，并引入齿轮（一般评估绑架性推理），这是一种通用，完全自动化，透明和无标签评估范式的通用。齿轮评分通过三个指标设定的假设：一致性（每个假设都解释了观察结果），可推广性（一致的假设对看不见的输入都有有意义的预测）和多样性（集合涵盖了不同的预测和模式）。以这种方式构建，齿轮是可扩展的（没有人类黄金答案），可靠的（确定性评分与经典绑架相一致），并且开放式（只有在模型产生新的合理假设时，得分只有在精度较高时饱和的静态基准会提高）。使用齿轮，我们对四个绑架基准有1,500个问题进行了九个LLM的精细研究，产生了超过50,000个候选假设，并揭示了被黄金或纯人类评估所掩盖的模型差异。我们进一步提出了一个基于动量的课程，该课程通过学习速度来调整齿轮衍生的培训数据：它始于模型快速学习的知识，并转向更艰难的目标，例如一旦模型对基础目标充满信心，就产生了多样化的假设。如果没有金标记的监督，该策略将改善所有齿轮目标，并将这些收益转移到既定的推理基准。综上所述，Gear提供了一个原则上的框架，该框架可以评估绑架和提供无标签的可扩展培训信号，以帮助LLM产生更多样化和可靠的假设。

Title: BTC-SAM: Leveraging LLMs for Generation of Bias Test Cases for Sentiment Analysis Models

Authors: Zsolt T. Kardkovács, Lynda Djennane, Anna Field, Boualem Benatallah, Yacine Gaci, Fabio Casati, Walid Gaaloul
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.24101
Pdf URL: https://arxiv.org/pdf/2509.24101
Copy Paste: [[2509.24101]] BTC-SAM: Leveraging LLMs for Generation of Bias Test Cases for Sentiment Analysis Models(https://arxiv.org/abs/2509.24101)
Keywords: language model, llm, prompt
Abstract: Sentiment Analysis (SA) models harbor inherent social biases that can be harmful in real-world applications. These biases are identified by examining the output of SA models for sentences that only vary in the identity groups of the subjects. Constructing natural, linguistically rich, relevant, and diverse sets of sentences that provide sufficient coverage over the domain is expensive, especially when addressing a wide range of biases: it requires domain experts and/or crowd-sourcing. In this paper, we present a novel bias testing framework, BTC-SAM, which generates high-quality test cases for bias testing in SA models with minimal specification using Large Language Models (LLMs) for the controllable generation of test sentences. Our experiments show that relying on LLMs can provide high linguistic variation and diversity in the test sentences, thereby offering better test coverage compared to base prompting methods even for previously unseen biases.
摘要：情感分析（SA）模型具有固有的社会偏见，这些偏见在现实世界中可能是有害的。通过检查仅在受试者的身份组中有所不同的句子的SA模型的输出来确定这些偏差。建造自然，语言丰富，相关和多样化的句子集，这些句子在域上提供足够的覆盖范围是昂贵的，尤其是在解决广泛的偏见时：它需要域专家和/或众包。在本文中，我们提出了一个新颖的偏见测试框架BTC-SAM，该框架在SA模型中生成了具有最小语言模型（LLMS）的SA模型中的高质量测试用例，用于可控的测试句子。我们的实验表明，依靠LLM可以在测试句子中提供高语言差异和多样性，从而与基本提示方法相比，即使对于以前看不见的偏见，也提供了更好的测试覆盖率。

Title: Pragmatic Inference for Moral Reasoning Acquisition: Generalization via Distributional Semantics

Authors: Guangliang Liu, Xi Chen, Bocheng Chen, Xitong Zhang, Kristen Johnson
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.24102
Pdf URL: https://arxiv.org/pdf/2509.24102
Copy Paste: [[2509.24102]] Pragmatic Inference for Moral Reasoning Acquisition: Generalization via Distributional Semantics(https://arxiv.org/abs/2509.24102)
Keywords: language model, llm
Abstract: Moral reasoning has emerged as a promising research direction for Large Language Models (LLMs), yet achieving generalization remains a central challenge. From a linguistic standpoint, this difficulty arises because LLMs are adept at capturing distributional semantics, which fundamentally differs from the morals which operate at the pragmatic level. This paper investigates how LLMs can achieve generalized moral reasoning despite their reliance on distributional semantics. We propose pragmatic inference methods grounded in moral foundations theory, which leverage contextual information at each step to bridge the pragmatic gap and guide LLMs in connecting moral foundations with moral reasoning objectives. Experimental results demonstrate that our approach significantly enhances LLMs' generalization in moral reasoning, providing a foundation for future research grounded in moral foundations theory.
摘要：道德推理已成为大型语言模型（LLM）的有前途的研究方向，但是实现概括仍然是一个核心挑战。从语言的角度来看，这一困难之所以出现，是因为LLM擅长捕获分布语义，从根本上讲，这与在务实水平上运作的道德不同。本文研究了LLM，尽管它们依赖分布语义，但如何实现广义的道德推理。我们提出了基于道德基础理论的务实推理方法，该方法在每个步骤中利用上下文信息来弥合务实的差距，并指导LLMS将道德基础与道德推理目标联系起来。实验结果表明，我们的方法显着增强了LLM在道德推理中的概括，为以道德基础理论为基础的未来研究为基础提供了基础。

Title: Dual-Scale World Models for LLM Agents Towards Hard-Exploration Problems

Authors: Minsoo Kim, Seung-won Hwang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.24116
Pdf URL: https://arxiv.org/pdf/2509.24116
Copy Paste: [[2509.24116]] Dual-Scale World Models for LLM Agents Towards Hard-Exploration Problems(https://arxiv.org/abs/2509.24116)
Keywords: llm, agent
Abstract: LLM-based agents have seen promising advances, yet they are still limited in "hard-exploration" tasks requiring learning new knowledge through exploration. We present GLoW, a novel approach leveraging dual-scale world models, maintaining a trajectory frontier of high-value discoveries at the global scale, while learning from local trial-and-error in exploration through a Multi-path Advantage Reflection mechanism which infers advantage-based progress signals to guide exploration. To evaluate our framework for hard-exploration, we tackle the Jericho benchmark suite of text-based games, where GLoW achieves a new state-of-the-art performance for LLM-based approaches. Compared to state-of-the-art RL-based methods, our approach achieves comparable performance while requiring 100-800x fewer environment interactions.
摘要：基于LLM的代理商已经看到了有希望的进步，但是它们仍然受到“硬探索”任务的限制，需要通过探索学习新知识。我们提出了一种新型的方法，这是一种利用双尺度世界模型的新方法，在全球范围内保持了高价值发现的轨迹前沿，同时通过多路的优势反射机制从本地试验和校园中学习，该机制吸收了基于优势的进度信号来指导探索。为了评估我们的艰苦探索框架，我们解决了基于文本的游戏的Jericho基准套件，在该游戏中，Glow为基于LLM的方法实现了新的最先进性能。与最先进的RLB基础方法相比，我们的方法可以达到可比的性能，而环境相互作用则需要少100-800倍。

Title: EduVidQA: Generating and Evaluating Long-form Answers to Student Questions based on Lecture Videos

Authors: Sourjyadip Ray, Shubham Sharma, Somak Aditya, Pawan Goyal
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.24120
Pdf URL: https://arxiv.org/pdf/2509.24120
Copy Paste: [[2509.24120]] EduVidQA: Generating and Evaluating Long-form Answers to Student Questions based on Lecture Videos(https://arxiv.org/abs/2509.24120)
Keywords: language model, llm
Abstract: As digital platforms redefine educational paradigms, ensuring interactivity remains vital for effective learning. This paper explores using Multimodal Large Language Models (MLLMs) to automatically respond to student questions from online lectures - a novel question answering task of real world significance. We introduce the EduVidQA Dataset with 5252 question-answer pairs (both synthetic and real-world) from 296 computer science videos covering diverse topics and difficulty levels. To understand the needs of the dataset and task evaluation, we empirically study the qualitative preferences of students, which we provide as an important contribution to this line of work. Our benchmarking experiments consist of 6 state-of-the-art MLLMs, through which we study the effectiveness of our synthetic data for finetuning, as well as showing the challenging nature of the task. We evaluate the models using both text-based and qualitative metrics, thus showing a nuanced perspective of the models' performance, which is paramount to future work. This work not only sets a benchmark for this important problem, but also opens exciting avenues for future research in the field of Natural Language Processing for Education.
摘要：随着数字平台重新定义教育范例，确保互动对于有效学习仍然至关重要。本文使用多模式的大语言模型（MLLM）探讨了从在线讲座中自动回答学生问题 - 这是一个新颖的问题回答现实世界重要性的任务。我们从296个计算机科学视频中介绍了5252个问答对（合成和现实世界）的Eduvidqa数据集，其中涵盖了各种主题和难度级别。为了了解数据集和任务评估的需求，我们从经验上研究了学生的定性偏好，这是对这一工作的重要贡献。我们的基准测试实验由6个最先进的MLLM组成，通过这些实验，我们研究了合成数据对固定的有效性，并显示了任务的挑战性质。我们使用基于文本的和定性的指标评估模型，因此显示了模型性能的细微差别，这对于未来的工作至关重要。这项工作不仅为这个重要问题树立了基准，而且还为自然语言处理领域的未来研究开辟了令人兴奋的途径。

Title: Beyond Magic Words: Sharpness-Aware Prompt Evolving for Robust Large Language Models with TARE

Authors: Guancheng Wan, Lucheng Fu, Haoxin Liu, Yiqiao Jin, Hui Yi Leong, Eric Hanchen Jiang, Hejia Geng, Jinhe Bi, Yunpu Ma, Xiangru Tang, B. Aditya Prakash, Yizhou Sun, Wei Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.24130
Pdf URL: https://arxiv.org/pdf/2509.24130
Copy Paste: [[2509.24130]] Beyond Magic Words: Sharpness-Aware Prompt Evolving for Robust Large Language Models with TARE(https://arxiv.org/abs/2509.24130)
Keywords: language model, llm, prompt
Abstract: The performance of Large Language Models (LLMs) hinges on carefully engineered prompts, yet automated prompt search remains brittle: small, semantically preserving paraphrases often cause large performance swings. Prevailing prompt optimization methods, ranging from heuristic edits and reinforcement learning to evolutionary search, primarily target point-wise accuracy. They seldom enforce paraphrase invariance or search stability, and therefore cannot remedy this brittleness in practice. We identify this brittleness as the textual sharpness of the prompt landscape. In this work, we provide the first formal treatment of textual sharpness in the discrete, semantic space of prompts, together with an operational robustness criterion over a semantic neighborhood; the design is black-box or API-only, requiring no gradients to update the model's parameters. Then we introduce TARE (Textual Sharpness-Aware Evolving), a derivative-free framework that alternates between an inner, sampling-based adversarial search that stresses a prompt with hard paraphrases and an outer, robust selection that prefers candidates whose neighborhoods remain strong. We further propose ATARE, which learns anisotropic weights to shape the semantic neighborhood and adapts its radius over time to balance exploration and fidelity. Evaluated on diverse tasks, our methods minimize the textual sharpness gap and yield prompts that preserve accuracy under paraphrasing, outperforming accuracy-only prompt search while remaining computationally practical.
摘要：大型语言模型（LLMS）的性能取决于精心设计的提示。但是，从启发式编辑和强化学习到进化搜索，主要是目标的准确性。他们很少会执行释义不变性或搜索稳定性，因此无法在实践中纠正这种脆弱性。自动化的及时搜索仍然很脆弱：小型，语义上保存的释义通常会引起大量的性能波动。我们将这种脆弱性确定为及时景观的文字清晰度。在这项工作中，我们在提示的离散语义空间以及在语义社区上的操作鲁棒性标准提供了对文本清晰度的首次正式处理；该设计是黑框或仅API的，不需要梯度来更新模型的参数。然后，我们介绍TARE（文本清晰度感知的不断发展），这是一个无衍生的无衍生框架，在基于样本的对抗性搜索之间交替使用，该搜索用硬式释义和外部，强大的选择强调了提示，可预见其居民的候选人保持强大。我们进一步提出了Atare，它学习了各向异性的权重以塑造语义邻域并随着时间的推移而适应其半径，以平衡探索和忠诚度。各种任务评估了我们的方法，其设计用于最大程度地减少文本清晰度差距会导致提示，这些提示可以在释义下保持准确性，在较高的准确度上均超过准确性的迅速搜索，同时保持计算实用。
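The robust selection idea in the TARE abstract, preferring candidates whose paraphrase neighborhoods remain strong, can be illustrated with a minimal sketch. Everything here is hypothetical: the `Candidate` class, the helper names, and the accuracy values are made up for illustration, and worst-case scoring over paraphrases is only one simple reading of "robust selection", not necessarily the criterion the authors use.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Candidate:
    score: float            # accuracy of the prompt itself (made-up value)
    neighbors: List[float]  # accuracies of its paraphrases (made-up values)


def sharpness_gap(c: Candidate) -> float:
    """Drop from the prompt's own score to its worst paraphrase (assumed definition)."""
    return c.score - min(c.neighbors)


def robust_select(candidates: Dict[str, Candidate]) -> str:
    """Pick the candidate whose worst-scoring paraphrase is highest."""
    return max(candidates, key=lambda name: min(candidates[name].neighbors))


candidates = {
    "prompt_a": Candidate(score=0.90, neighbors=[0.88, 0.55, 0.91]),  # accurate but brittle
    "prompt_b": Candidate(score=0.86, neighbors=[0.85, 0.83, 0.84]),  # slightly lower, stable
}

best = robust_select(candidates)
gap_a = sharpness_gap(candidates["prompt_a"])
gap_b = sharpness_gap(candidates["prompt_b"])
```

An accuracy-only search would pick `prompt_a`; the worst-case rule picks `prompt_b` because its neighborhood stays strong.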

Title: Your thoughts tell who you are: Characterize the reasoning patterns of LRMs

Authors: Yida Chen, Yuning Mao, Xianjun Yang, Suyu Ge, Shengjie Bi, Lijuan Liu, Saghar Hosseini, Liang Tan, Yixin Nie, Shaoliang Nie
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.24147
Pdf URL: https://arxiv.org/pdf/2509.24147
Copy Paste: [[2509.24147]] Your thoughts tell who you are: Characterize the reasoning patterns of LRMs(https://arxiv.org/abs/2509.24147)
Keywords: language model, llm
Abstract: Current comparisons of large reasoning models (LRMs) focus on macro-level statistics such as task accuracy or reasoning length. Whether different LRMs reason differently remains an open question. To address this gap, we introduce the LLM-proposed Open Taxonomy (LOT), a classification method that uses a generative language model to compare reasoning traces from two LRMs and articulate their distinctive features in words. LOT then models how these features predict the source LRM of a reasoning trace based on their empirical distributions across LRM outputs. Iterating this process over a dataset of reasoning traces yields a human-readable taxonomy that characterizes how models think. We apply LOT to compare the reasoning of 12 open-source LRMs on tasks in math, science, and coding. LOT identifies systematic differences in their thoughts, achieving 80-100% accuracy in distinguishing reasoning traces from LRMs that differ in scale, base model family, or objective domain. Beyond classification, LOT's natural-language taxonomy provides qualitative explanations of how LRMs think differently. Finally, in a case study, we link the reasoning differences to performance: aligning the reasoning style of smaller Qwen3 models with that of the largest Qwen3 during test time improves their accuracy on GPQA by 3.3-5.7%.
摘要：大型推理模型（LRMS）的当前比较集中在宏观统计上，例如任务准确性或推理长度。不同的LRMS原因是否有所不同仍然是一个悬而未决的问题。为了解决这一差距，我们介绍了LLM传播的开放分类法（LOT），该分类方法使用生成语言模型比较了来自两个LRM的推理痕迹并在单词中表达其独特的特征。然后，批建模这些功能如何根据其在LRM输出中的经验分布来预测推理迹线的源LRM。在推理痕迹的数据集上迭代此过程产生了人类可读的分类学，该分类学表征了模型的思维方式。我们将很多内容用于比较12个开源LRM在数学，科学和编码方面的任务的推理。批次确定了他们的思想的系统差异，在区分尺度，基本模型家族或客观领域的LRMS上，实现了80-100％的准确性。除了分类外，Lot的自然语言分类法还提供了关于LRM不同思考方式的定性解释。最后，在一个案例研究中，我们将推理差异与性能联系起来：在测试时间期间，较小的QWEN3模型的推理样式与最大的QWEN3的推理样式提高了其对GPQA的准确性3.3-5.7％。

Title: Localizing Task Recognition and Task Learning in In-Context Learning via Attention Head Analysis

Authors: Haolin Yang, Hakaze Cho, Naoya Inoue
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.24164
Pdf URL: https://arxiv.org/pdf/2509.24164
Copy Paste: [[2509.24164]] Localizing Task Recognition and Task Learning in In-Context Learning via Attention Head Analysis(https://arxiv.org/abs/2509.24164)
Keywords: language model
Abstract: We investigate the mechanistic underpinnings of in-context learning (ICL) in large language models by reconciling two dominant perspectives: the component-level analysis of attention heads and the holistic decomposition of ICL into Task Recognition (TR) and Task Learning (TL). We propose a novel framework based on Task Subspace Logit Attribution (TSLA) to identify attention heads specialized in TR and TL, and demonstrate their distinct yet complementary roles. Through correlation analysis, ablation studies, and input perturbations, we show that the identified TR and TL heads independently and effectively capture the TR and TL components of ICL. Using steering experiments with geometric analysis of hidden states, we reveal that TR heads promote task recognition by aligning hidden states with the task subspace, while TL heads rotate hidden states within the subspace toward the correct label to facilitate prediction. We further show how previous findings on ICL mechanisms, including induction heads and task vectors, can be reconciled with our attention-head-level analysis of the TR-TL decomposition. Our framework thus provides a unified and interpretable account of how large language models execute ICL across diverse tasks and settings.
摘要：我们通过核对两个主导观点来研究大语言模型中文化学习（ICL）的机械基础：注意力头的组成级分析以及ICL将ICL的整体分解为任务识别（TR）和任务学习（TL）。我们提出了一个基于任务子空间logit归因（TSLA）的新框架，以识别专门用于TR和TL的注意力头，并演示其独特但互补的角色。通过相关分析，消融研究和输入扰动，我们表明已确定的TR和TL头独立有效地捕获了ICL的TR和TL成分。使用转向实验对隐藏状态进行几何分析，我们揭示了TR头通过将隐藏状态与任务子空间保持一致，而TL头将子空间内的隐藏状态旋转到正确的标签以促进预测。我们进一步展示了先前关于ICL机制的发现，包括诱导头和任务向量，如何与我们对TR-TL分解的注意头级分析进行调和。因此，我们的框架提供了一个统一和可解释的说明，该框架对大型语言模型在各种任务和设置之间的执行方式进行了。

Title: Task Vectors, Learned Not Extracted: Performance Gains and Mechanistic Insight

Authors: Haolin Yang, Hakaze Cho, Kaize Ding, Naoya Inoue
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.24169
Pdf URL: https://arxiv.org/pdf/2509.24169
Copy Paste: [[2509.24169]] Task Vectors, Learned Not Extracted: Performance Gains and Mechanistic Insight(https://arxiv.org/abs/2509.24169)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) can perform new tasks from in-context demonstrations, a phenomenon known as in-context learning (ICL). Recent work suggests that these demonstrations are compressed into task vectors (TVs), compact task representations that LLMs exploit for predictions. However, prior studies typically extract TVs from model outputs or hidden states using cumbersome and opaque methods, and they rarely elucidate the mechanisms by which TVs influence computation. In this work, we address both limitations. First, we propose directly training Learned Task Vectors (LTVs), which surpass extracted TVs in accuracy and exhibit superior flexibility, acting effectively at arbitrary layers, positions, and even with ICL prompts. Second, through systematic analysis, we investigate the mechanistic role of TVs, showing that at the low level they steer predictions primarily through attention-head OV circuits, with a small subset of "key heads" most decisive. At a higher level, we find that despite Transformer nonlinearities, TV propagation is largely linear: early TVs are rotated toward task-relevant subspaces to improve logits of relevant labels, while later TVs are predominantly scaled in magnitude. Taken together, LTVs not only provide a practical approach for obtaining effective TVs but also offer a principled lens into the mechanistic foundations of ICL.
摘要：大型语言模型（LLMS）可以从内在示范中执行新任务，这是一种称为“文化学习”（ICL）的现象。最近的工作表明，这些演示被压缩到任务向量（TVS）中，即LLMS为预测而利用的紧凑任务表示。但是，先前的研究通常使用笨拙和不透明的方法从模型输出或隐藏状态中提取电视，它们很少阐明电视影响计算的机制。在这项工作中，我们解决了这两个局限性。首先，我们提出了直接培训的任务载体（LTV），该任务向量（LTV）的准确性超过了提取的TV，并在任意层，位置甚至ICL提示下有效地表现出了卓越的灵活性作用。其次，通过系统分析，我们研究了电视的机械作用，表明在低水平上，它们主要通过注意力头OV电路进行预测，其中一小部分“钥匙头”最具决定性。在较高级别上，我们发现，尽管有变压器非线性，电视传播基本上是线性的：早期电视旋转到与任务相关的子空间旋转以改善相关标签的ligits，而后来的电视则主要缩放。综上所述，LTV不仅为获得有效的电视提供了一种实用的方法，而且还为ICL的机械基础提供了原则性的镜头。

Title: Retrieval-augmented GUI Agents with Generative Guidelines

Authors: Ran Xu, Kaixin Ma, Wenhao Yu, Hongming Zhang, Joyce C. Ho, Carl Yang, Dong Yu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.24183
Pdf URL: https://arxiv.org/pdf/2509.24183
Copy Paste: [[2509.24183]] Retrieval-augmented GUI Agents with Generative Guidelines(https://arxiv.org/abs/2509.24183)
Keywords: language model, agent
Abstract: GUI agents powered by vision-language models (VLMs) show promise in automating complex digital tasks. However, their effectiveness in real-world applications is often limited by scarce training data and the inherent complexity of these tasks, which frequently require long-tailed knowledge covering rare, unseen scenarios. We propose RAG-GUI, a lightweight VLM that leverages web tutorials at inference time. RAG-GUI is first warm-started via supervised finetuning (SFT) and further refined through self-guided rejection sampling finetuning (RSF). Designed to be model-agnostic, RAG-GUI functions as a generic plug-in that enhances any VLM-based agent. Evaluated across three distinct tasks, it consistently outperforms baseline agents and surpasses other inference baselines by 2.6% to 13.3% across two model sizes, demonstrating strong generalization and practical plug-and-play capabilities in real-world scenarios.
摘要：由视觉模型（VLMS）提供动力的GUI代理在自动化复杂的数字任务方面显示出希望。但是，它们在实际应用中的有效性通常受到稀缺培训数据的限制以及这些任务的固有复杂性，这些任务通常需要长尾知识，涵盖罕见，看不见的情况。我们提出了RAG-GUI，这是一种轻巧的VLM，它在推理时间内利用Web教程。 RAG-GUI首先是通过有监督的Finetuning（SFT）进行暖启动的，并通过自引导的拒绝采样登录（RSF）进一步完善。 Rag-GUI设计为模型无关，可作为一个通用插件，可增强任何基于VLM的代理。在三个不同的任务中进行了评估，它始终胜过基线代理，并超过了两个模型尺寸的其他推理基准，在现实情况下表明了强大的概括和实用的插件功能。

Title: Beyond Overall Accuracy: A Psychometric Deep Dive into the Topic-Specific Medical Capabilities of 80 Large Language Models

Authors: Zhimeng Luo, Lixin Wu, Adam Frisch, Daqing He
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.24186
Pdf URL: https://arxiv.org/pdf/2509.24186
Copy Paste: [[2509.24186]] Beyond Overall Accuracy: A Psychometric Deep Dive into the Topic-Specific Medical Capabilities of 80 Large Language Models(https://arxiv.org/abs/2509.24186)
Keywords: language model, gpt, llm
Abstract: As Large Language Models (LLMs) are increasingly proposed for high-stakes medical applications, there has emerged a critical need for reliable and accurate evaluation methodologies. Traditional accuracy metrics fall short because they neither capture question characteristics nor offer topic-specific insights. To address this gap, we introduce MedIRT, a rigorous evaluation framework grounded in Item Response Theory (IRT), the gold standard in high-stakes educational testing. Unlike previous research relying on archival data, we prospectively gathered fresh responses from 80 diverse LLMs on a balanced, 1,100-question USMLE-aligned benchmark. Using one unidimensional two-parameter logistic IRT model per topic, we estimate each LLM's latent ability jointly with question difficulty and discrimination, yielding more stable and nuanced performance rankings than accuracy alone. Notably, we identify distinctive "spiky" ability profiles, where overall rankings can be misleading due to highly specialized model abilities. While GPT-5 was the top performer in a majority of domains (8 of 11), it was outperformed in Social Science and Communication by Claude-3-opus, demonstrating that even an overall 23rd-ranked model can hold the top spot for specific competencies. Furthermore, we demonstrate IRT's utility in auditing benchmarks by identifying flawed questions. We synthesize these findings into a practical decision-support framework that integrates our multi-factor competency profiles with operational metrics. This work establishes a robust, psychometrically grounded methodology essential for the safe, effective, and trustworthy deployment of LLMs in healthcare.
摘要：由于大型语言模型（LLM）越来越多地用于高风险的医疗应用，因此出现了对可靠和准确的评估方法的迫切需求。传统的准确度指标既不捕获问题特征也没有提供特定于主题的见解，因此失败了。为了解决这一差距，我们介绍了\ textsc {medirt}，这是一个基于项目响应理论（IRT）的严格评估框架，这是高风险教育测试中的黄金标准。与先前依靠档案数据的研究不同，我们前瞻性地收集了80种不同LLM的新响应，这是一个平衡的1,100个问题USMLE对准的基准。每个主题使用一个一维的两参数逻辑IRT模型，我们可以通过问题难度和歧视共同估算LLM的潜在模型能力，从而比单独的精度产生更稳定和细微的性能排名。值得注意的是，我们确定了独特的``尖峰''能力概况，在这些能力上，由于高度专业的模型能力，总体排名可能会误导。尽管\ texttt {gpt-5}是大多数领域的最佳表现（11个中的8个），但它在社会科学和沟通中的表现\ texttt {claude-3-opus}在社会科学和交流中的表现均优胜，这表明即使是整个23级排名的模型也可以拥有特定能力的最高点。此外，我们通过识别有缺陷的问题来证明IRT在审核基准测试方面的实用性。我们将这些发现综合为实用的决策支持框架，该框架将我们的多因素能力概况与操作指标集成在一起。这项工作确立了一种强大的，心理上的基础方法，对于医疗保健中LLM的安全，有效和可信赖的部署至关重要。
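For readers unfamiliar with the two-parameter logistic (2PL) IRT model named in this abstract, the sketch below shows its standard response function: the probability of a correct answer given a latent ability `theta`, an item discrimination `a`, and an item difficulty `b`. The function name and example values are illustrative assumptions; the abstract does not describe the authors' fitting procedure, and none is attempted here.

```python
import math


def p_correct(theta: float, a: float, b: float) -> float:
    """Standard 2PL response function: P(correct) = 1 / (1 + exp(-a * (theta - b)))."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


# Illustrative values only (not taken from the paper).
at_difficulty = p_correct(theta=0.7, a=1.5, b=0.7)      # ability equals difficulty
sharp_item = p_correct(theta=1.0, a=2.0, b=0.0)         # highly discriminating item
flat_item = p_correct(theta=1.0, a=0.5, b=0.0)          # weakly discriminating item
```

When ability equals difficulty the probability is one half, and a larger discrimination `a` makes the probability change faster as ability moves away from the item's difficulty.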

Title: PET: Preference Evolution Tracking with LLM-Generated Explainable Distribution

Authors: Luyang Zhang, Siyuan Peng, Jialu Wang, Shichao Zhu, Beibei Li, Zhongcun Wang, Guangmou Pan, Yan Li, Song Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.24189
Pdf URL: https://arxiv.org/pdf/2509.24189
Copy Paste: [[2509.24189]] PET: Preference Evolution Tracking with LLM-Generated Explainable Distribution(https://arxiv.org/abs/2509.24189)
Keywords: language model, llm
Abstract: Understanding how user preference evolves over time is a fundamental challenge central to modern digital ecosystems, for which Large Language Models (LLMs) are an increasingly prominent and popular approach due to their ability to comprehend the rich semantic context within behavioral data. A common practice is to use LLMs to predict a user's next action by directly generating a ranked list of preferred items. Although effective for short-term prediction, the end-to-end generation paradigm inherently limits personalization. Its opaque decision-making process obscures holistic user profiling and exacerbates popularity bias. To address these limitations, we propose Preference Evolution Tracking (PET), a framework that reframes the task as inferring a dynamic probability distribution over a stable and interpretable lattice of preference clusters. By applying logit-probing and generative classification techniques, PET infers a user's preference as a probability distribution, enabling transparent preference learning. On public benchmarks (Yelp, MovieLens), PET improves ranking quality by up to 40% in NDCG over direct generation baselines. On a large-scale, real-world dataset from a short-video platform, it excels at ranking long-tail contents, significantly outperforming a SOTA production model by 7 times in the NDCG score. Ultimately, PET transforms the user profile model from direct preference list generation to a transparent distributional preference mapping, paving the way for more explainable, fair, and diverse personalization systems.
摘要：了解用户偏好随着时间的流逝如何发展是现代数字生态系统的核心挑战，因为大型语言模型（LLMS）是一种越来越突出和流行的方法，因为它们能够理解行为数据中丰富的语义环境。一种常见的做法是使用LLM通过直接生成排名的优先项目列表来预测用户的下一个操作。尽管对短期预测有效，但端到端一代范式固有地限制了个性化。它不透明的决策过程掩盖了整体用户的分析，并加剧了普及偏见。为了解决这些局限性，我们提出了偏好演化跟踪（PET），该框架将任务重新缩放为推断出稳定且可解释的优先群集晶格的动态概率分布。通过应用logit键盘和生成分类技术，宠物将用户的偏好作为概率分布，从而实现透明的首选项学习。在公共基准（Yelp，Movielens）上，PET在NDCG的直接生成基线中提高了高达40％的质量。在来自短视频平台的大规模现实数据集中，它擅长排名长尾内容，在NDCG分数中大大优于SOTA生产模型7倍。最终，PET将用户配置文件模型从直接偏好列表的生成转变为透明的分布偏好映射，为更加可解释，公平和多样化的个性化系统铺平了道路。
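The NDCG metric used to report PET's ranking quality can be sketched as follows. This uses the common linear-gain form, DCG = sum of rel_i / log2(i + 1) over ranks i starting at 1, normalized by the DCG of the ideal ordering. The abstract does not state which gain or cutoff variant the authors use, so the function names, the cutoff `k`, and the relevance lists here are assumptions for illustration.

```python
import math
from typing import Sequence


def dcg(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the top-k items (linear gain, log2 discount)."""
    # enumerate() is 0-based, so rank + 2 gives log2(i + 1) for 1-based position i.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg(relevances: Sequence[float], k: int) -> float:
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0


perfect = ndcg([3, 2, 1], k=3)   # items already in ideal order
reversed_order = ndcg([1, 2, 3], k=3)
```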

Title: AceSearcher: Bootstrapping Reasoning and Search for LLMs via Reinforced Self-Play

Authors: Ran Xu, Yuchen Zhuang, Zihan Dong, Jonathan Wang, Yue Yu, Joyce C. Ho, Linjun Zhang, Haoyu Wang, Wenqi Shi, Carl Yang
Subjects: cs.CL, cs.AI, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2509.24193
Pdf URL: https://arxiv.org/pdf/2509.24193
Copy Paste: [[2509.24193]] AceSearcher: Bootstrapping Reasoning and Search for LLMs via Reinforced Self-Play(https://arxiv.org/abs/2509.24193)
Keywords: language model, llm
Abstract: Search-augmented LLMs often struggle with complex reasoning tasks due to ineffective multi-hop retrieval and limited reasoning ability. We propose AceSearcher, a cooperative self-play framework that trains a single large language model (LLM) to alternate between two roles: a decomposer that breaks down complex queries and a solver that integrates retrieved contexts for answer generation. AceSearcher couples supervised fine-tuning on a diverse mixture of search, reasoning, and decomposition tasks with reinforcement fine-tuning optimized for final answer accuracy, eliminating the need for intermediate annotations. Extensive experiments on three reasoning-intensive tasks across 10 datasets show that AceSearcher outperforms state-of-the-art baselines, achieving an average exact match improvement of 7.6%. Remarkably, on document-level finance reasoning tasks, AceSearcher-32B matches the performance of the DeepSeek-V3 model using less than 5% of its parameters. Even at smaller scales (1.5B and 8B), AceSearcher often surpasses existing search-augmented LLMs with up to 9x more parameters, highlighting its exceptional efficiency and effectiveness in tackling complex reasoning tasks. Our code will be published at this https URL and this https URL.
摘要：由于多效的多跳检索能力和有限的推理能力，搜索设施的LLM经常在复杂的推理任务中挣扎。我们提出了Acesearcher，这是一个合作的自我播放框架，该框架训练单个大语言模型（LLM）以在两个角色之间进行交替：一个分解复杂查询的分解器和一个集成了回答答案生成上下文的求解器。 Acesearcher夫妇对搜索，推理和分解任务的多种混合物进行了微调，并通过加强微调进行了优化，以实现最终答案的准确性，从而消除了中间注释的需求。对10个数据集的三项推理密集型任务进行了广泛的实验表明，Acesearcher的表现优于最先进的基线，实现了平均确切的匹配提高7.6％。值得注意的是，在文档级财务推理任务上，Acesearcher-32B使用少于5％的参数匹配DeepSeek-V3模型的性能。即使在较小的尺度（1.5b和8b）中，Acesearcher也经常超过现有的搜索型LLMS，其参数多达9倍，突出了其在解决复杂推理任务方面的出色效率和有效性。我们的代码将在此HTTPS URL和此HTTPS URL上发布。

Title: Can Large Language Models Express Uncertainty Like Human?

Authors: Linwei Tao, Yi-Fan Yeh, Bo Kai, Minjing Dong, Tao Huang, Tom A. Lamb, Jialin Yu, Philip H.S. Torr, Chang Xu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.24202
Pdf URL: https://arxiv.org/pdf/2509.24202
Copy Paste: [[2509.24202]] Can Large Language Models Express Uncertainty Like Human?(https://arxiv.org/abs/2509.24202)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) are increasingly used in high-stakes settings, where overconfident responses can mislead users. Reliable confidence estimation has been shown to enhance trust and task accuracy. Yet existing methods face practical barriers: logits are often hidden, multi-sampling is computationally expensive, and verbalized numerical uncertainty (e.g., giving a 0-100 score) deviates from natural communication. We revisit linguistic confidence (LC), where models express uncertainty through hedging language (e.g., probably, might), offering a lightweight and human-centered alternative. To advance this direction, we (1) release the first diverse, large-scale dataset of hedging expressions with human-annotated confidence scores, and (2) propose a lightweight mapper that converts hedges into confidence scores at near-zero cost. Building on these resources, we (3) conduct the first systematic study of LC across modern LLMs and QA benchmarks, revealing that while most LLMs underperform in expressing reliable LC, carefully designed prompting achieves competitive calibration and discriminability. Finally, we (4) introduce a fine-tuning framework that further improves LC reliability. Taken together, our work positions linguistic confidence as a scalable, efficient, and human-aligned approach to LLM uncertainty estimation, and calls for deeper exploration of this promising yet underexplored direction.
摘要：大型语言模型（LLM）越来越多地用于高风险设置，在这种设置中，过度自信的响应会误导用户。可靠的置信度估计已被证明可以提高信任和任务准确性。然而，现有方法面临实用的障碍：逻辑通常是隐藏的，多样采样在计算上昂贵，并且口头上的数值不确定性（例如，得分为0-100）偏离了自然通信。我们重新审视语言信心（LC），其中模型通过对冲语言（例如，可能，可能）表达不确定性，提供了轻巧且以人为中心的替代方案。为了迈向这一方向，我们（1）以人为宣传的置信度得分释放了第一个不同的大规模表达式数据集，（2）提出了一个轻巧的映射器，该映射器以接近零成本将套期保值转换为置信度。在这些资源的基础上，我们（3）在现代LLM和QA基准中进行了首次对LC进行系统研究，表明尽管大多数LLMS表现出可靠的LC表现不佳，但精心设计的促使促使促使其实现了竞争性的校准和可区分性。最后，我们（4）引入了一个微调框架，进一步提高了LC的可靠性。综上所述，我们的工作位置将语言信心视为LLM不确定性估计的可扩展，高效和人类一致的方法，并要求对这个有希望而又毫无用处的方向进行更深入的探索。

Title: BeyondBench: Benchmark-Free Evaluation of Reasoning in Language Models

Authors: Gaurav Srivastava, Aafiya Hussain, Zhenyu Bi, Swastik Roy, Priya Pitre, Meng Lu, Morteza Ziyadi, Xuan Wang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.24210
Pdf URL: https://arxiv.org/pdf/2509.24210
Copy Paste: [[2509.24210]] BeyondBench: Benchmark-Free Evaluation of Reasoning in Language Models(https://arxiv.org/abs/2509.24210)
Keywords: language model, gpt
Abstract: Evaluating language models fairly is becoming harder as static benchmarks available on the internet risk contamination by training data. This makes it unclear whether models are truly reasoning or just recalling answers. In this paper, we introduce BeyondBench, an evaluation framework that avoids this problem by using algorithmic problem generation. Unlike traditional benchmarks that risk contamination from internet-scale training data, BeyondBench creates mathematically grounded problems on the fly, ensuring each test remains fresh and uncontaminated. Our framework covers 44 algorithmic tasks with a total of 117 variations, grouped into three difficulty levels: the Easy Suite (29 tasks) for basic arithmetic and statistics, the Medium Suite (5 tasks, 49 variations) for sequence patterns and reasoning, and the Hard Suite (10 tasks, 68 variations) tackling NP-complete and constraint satisfaction problems. Each task generates problems from a combinatorial space larger than 10^15 unique instances, with solutions verified deterministically by mathematical proofs. We evaluated 101 language models, including 85 open-source and 16 closed-source models, spanning sizes from 0.5B to 141B parameters and multiple quantization schemes. Our results show consistent reasoning deficiencies across model families, with performance degrading sharply as problem complexity increases from polynomial to exponential. In our Hard Suite evaluations, models such as Gemini-2.5-pro, Llama-3.3-70B, and Qwen2.5-72B achieved average accuracies of 56.38%, 26.91%, and 33.60%, respectively. Moreover, we observe that performance drops drastically without tool usage, with GPT-5, GPT-5-mini, and GPT-5-nano showing a decline of 16.81%, 28.05%, and 47.59% accuracy on the hard suite. Our leaderboard is publicly available at this https URL
摘要：随着通过培训数据在Internet风险污染中可用的静态基准，评估语言模型变得越来越难。这尚不清楚模型是真正的推理还是仅仅回忆起答案。在本文中，我们介绍了BeyondBench，这是一个评估框架，通过使用算法产生算法来避免此问题。与传统的基准测试有可能受到互联网规模培训数据污染的污染，超越基地在数学上构成了扎根的问题，从而确保每个测试保持新鲜和未经污染。 Our framework covers 44 algorithmic tasks with a total of 117 variations, grouped into three difficulty levels: the Easy Suite (29 tasks) for basic arithmetic and statistics, the Medium Suite (5 tasks, 49 variations) for sequence patterns and reasoning, and the Hard Suite (10 tasks, 68 variations) tackling NP-complete and constraint satisfaction problems.每个任务都会从大于10^15独特实例的组合空间产生问题，并通过数学证明确定性验证了解决方案。我们评估了101个语言模型，包括85个开源和16个封闭源模型，范围从0.5B到141B参数和多个量化方案。我们的结果表明，模型家族之间的推理不足一致，随着问题复杂性从多项式到指数的增加，性能急剧下降。在我们的硬套房评估中，Gemini-2.5-Pro，Llama-3.3-70B和Qwen2.5-72B等模型的平均准确度分别为56.38％，26.91％和33.60％。此外，我们观察到，在没有工具使用的情况下，性能会大大降低，GPT-5，GPT-5-MINI和GPT-5-NANO显示出16.81％，28.05％和47.59％的硬套件的降低。我们的排行榜在此HTTPS URL上公开可用
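The core idea described for BeyondBench, generating problems on the fly and verifying answers deterministically, can be sketched with a toy task. The sorting task, the `Problem` class, and the function names below are hypothetical stand-ins chosen for brevity; they are not taken from the benchmark, whose tasks and verifiers are not specified in the abstract.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class Problem:
    numbers: List[int]
    prompt: str


def generate_sorting_problem(seed: int, length: int = 8) -> Problem:
    """Create a fresh instance from a seed, so no fixed test set needs to exist."""
    rng = random.Random(seed)
    numbers = [rng.randint(-1000, 1000) for _ in range(length)]
    return Problem(numbers, f"Sort these integers in ascending order: {numbers}")


def verify(problem: Problem, answer: List[int]) -> bool:
    """Deterministic check against a reference computed from the instance itself."""
    return answer == sorted(problem.numbers)


problem = generate_sorting_problem(seed=0)
reference_ok = verify(problem, sorted(problem.numbers))
```

Because each instance is derived from a seed, the same problem can be regenerated for auditing while new seeds give instances unlikely to appear in any training corpus.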

Title: ScenarioBench: Trace-Grounded Compliance Evaluation for Text-to-SQL and RAG

Authors: Zahra Atf, Peter R Lewis
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.24212
Pdf URL: https://arxiv.org/pdf/2509.24212
Copy Paste: [[2509.24212]] ScenarioBench: Trace-Grounded Compliance Evaluation for Text-to-SQL and RAG(https://arxiv.org/abs/2509.24212)
Keywords: hallucination, retrieval-augmented generation
Abstract: ScenarioBench is a policy-grounded, trace-aware benchmark for evaluating Text-to-SQL and retrieval-augmented generation in compliance contexts. Each YAML scenario includes a no-peek gold-standard package with the expected decision, a minimal witness trace, the governing clause set, and the canonical SQL, enabling end-to-end scoring of both what a system decides and why. Systems must justify outputs using clause IDs from the same policy canon, making explanations falsifiable and audit-ready. The evaluator reports decision accuracy, trace quality (completeness, correctness, order), retrieval effectiveness, SQL correctness via result-set equivalence, policy coverage, latency, and an explanation-hallucination rate. A normalized Scenario Difficulty Index (SDI) and a budgeted variant (SDI-R) aggregate results while accounting for retrieval difficulty and time. Compared with prior Text-to-SQL or KILT/RAG benchmarks, ScenarioBench ties each decision to clause-level evidence under strict grounding and no-peek rules, shifting gains toward justification quality under explicit time budgets.
摘要：ScenariObench是一种政策基础的，具有痕量意识的基准，用于在合规性上下文中评估文本到SQL和检索增强的生成。每个YAML场景都包含一个无疑的金色标准包，其中包含预期的决定，最小的证人跟踪，理事子句集以及规范的SQL，使系统决定的内容和原因都可以终端得分。系统必须使用来自同一策略佳能的子句ID证明输出合理，从而使解释可伪造和审核。评估者报告了决策准确性，微分质量（完整性，正确性，顺序），检索效率，通过结果集对等相同的SQL正确性，策略覆盖范围，延迟和解释障碍率。归一化方案难度指数（SDI）和预算变体（SDI-R）汇总结果，同时考虑了检索困难和时间。与先前的文本到SQL或苏格兰短裙/抹布基准相比，情景台面将每个决定都与严格的基础和无PEEK规则的条款级别的证据联系起来，从而在明确的时间预算下将收益转移到了质量上的理由质量。
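Scoring SQL correctness via result-set equivalence, as the ScenarioBench abstract mentions, means comparing what two queries return instead of comparing their text. The sketch below uses Python's built-in `sqlite3` module with a made-up `claims` table; the schema, the queries, and the choice to ignore row order are assumptions for illustration, not details from the benchmark.

```python
import sqlite3
from collections import Counter


def result_sets_equal(conn: sqlite3.Connection, sql_a: str, sql_b: str) -> bool:
    """Compare query results as multisets of rows, ignoring row order (an assumption)."""
    rows_a = Counter(conn.execute(sql_a).fetchall())
    rows_b = Counter(conn.execute(sql_b).fetchall())
    return rows_a == rows_b


# Hypothetical data for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE claims (id INTEGER, amount REAL, flagged INTEGER)")
conn.executemany(
    "INSERT INTO claims VALUES (?, ?, ?)",
    [(1, 50.0, 0), (2, 900.0, 1), (3, 1200.0, 1)],
)

canonical = "SELECT id FROM claims WHERE flagged = 1"
candidate = "SELECT id FROM claims WHERE amount > 500 ORDER BY id DESC"
same = result_sets_equal(conn, canonical, candidate)
```

The two queries differ textually but return the same rows on this data, so they count as equivalent here; a stricter evaluator might also compare column names or ordering when the canonical SQL specifies one.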

Title: MoVa: Towards Generalizable Classification of Human Morals and Values

Authors: Ziyu Chen, Junfei Sun, Chenxi Li, Tuan Dung Nguyen, Jing Yao, Xiaoyuan Yi, Xing Xie, Chenhao Tan, Lexing Xie
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2509.24216
Pdf URL: https://arxiv.org/pdf/2509.24216
Copy Paste: [[2509.24216]] MoVa: Towards Generalizable Classification of Human Morals and Values(https://arxiv.org/abs/2509.24216)
Keywords: llm, prompt
Abstract: Identifying human morals and values embedded in language is essential to empirical studies of communication. However, researchers often face substantial difficulty navigating the diversity of theoretical frameworks and data available for their analysis. Here, we contribute MoVa, a well-documented suite of resources for generalizable classification of human morals and values, consisting of (1) 16 labeled datasets and benchmarking results from four theoretically-grounded frameworks; (2) a lightweight LLM prompting strategy that outperforms fine-tuned models across multiple domains and frameworks; and (3) a new application that helps evaluate psychological surveys. In practice, we specifically recommend a classification strategy, all@once, that scores all related concepts simultaneously, resembling the well-known multi-label classifier chain. The data and methods in MoVa can facilitate many fine-grained interpretations of human and machine communication, with potential implications for the alignment of machine behavior.
摘要：识别语言中嵌入的人类道德和价值观对于交流的经验研究至关重要。但是，研究人员通常会在浏览理论框架的多样性和可用于分析的数据的多样性方面面临重大困难。在这里，我们贡献了Mova，这是一个有据可查的资源套件，用于人类道德和价值观的可概括分类，由（1）16个标记的数据集组成，并从理论上的四个框架框架中进行了基准测试。（2）轻巧的LLM提示策略，该策略优于多个领域和框架的微调模型；（3）有助于评估心理调查的新应用。在实践中，我们特别建议一种分类策略，全部@一次，该策略同时得分所有相关的概念，类似于著名的多标签分类器链。 MOVA中的数据和方法可以促进对人类和机器通信的许多细粒度解释，对机器行为的一致性有潜在的影响。

Title: Model Fusion with Multi-LoRA Inference for Tool-Enhanced Game Dialogue Agents

Authors: Kangxu Wang, Ze Chen, Chengcheng Wei, Jiewen Zheng, Jiarong He, Max Gao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.24229
Pdf URL: https://arxiv.org/pdf/2509.24229
Copy Paste: [[2509.24229]] Model Fusion with Multi-LoRA Inference for Tool-Enhanced Game Dialogue Agents(https://arxiv.org/abs/2509.24229)
Keywords: llm, agent
Abstract: This paper presents the opdainlp team's solution for the GPU track of the CPDC 2025 challenge. The challenge consists of three tasks, aiming to build an in-game conversational AI that adheres to character personas, aligns with the game's worldview, and supports function calling. Considering both effectiveness and resource/time constraints during inference, we synthesized data for some of the tasks based on the datasets provided by the competition organizers. We employed Qwen3-14B with LoRA fine-tuning and model fusion, and utilized a base model integrated with multiple LoRA adapters during inference. Specifically, in the competition, we used three distinct LoRA adapters to handle tool calling, response generation with tool call results, and response generation without tool call results, respectively. MultiLoRA inference was implemented using vLLM. Our solution achieved first place in Task 1 and Task 3 and second place in Task 2 of the GPU track.
摘要：本文介绍了OPDAINLP团队为CPDC 2025挑战的GPU轨道的解决方案。挑战包括三个任务，旨在建立一个遵守角色角色，与游戏的世界观相一致并支持功能调用的游戏中的对话AI。考虑到推理期间的有效性和资源/时间限制，我们根据竞争组织者提供的数据集合成了某些任务的数据。我们使用lora微调和模型融合使用了QWEN3-14B，并在推理过程中利用了与多个Lora适配器集成的基本模型。具体来说，在竞争中，我们使用了三个不同的Lora适配器来处理工具呼叫，通过工具调用结果进行响应生成以及没有工具呼叫结果的响应生成。使用VLLM实施了Multilora推理。我们的解决方案在任务1和任务3中获得了第一名，以及GPU轨道任务2的第二名。

Title: Prompt and Parameter Co-Optimization for Large Language Models

Authors: Xiaohe Bo, Rui Li, Zexu Sun, Quanyu Dai, Zeyu Zhang, Zihang Tian, Xu Chen, Zhenhua Dong
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.24245
Pdf URL: https://arxiv.org/pdf/2509.24245
Copy Paste: [[2509.24245]] Prompt and Parameter Co-Optimization for Large Language Models(https://arxiv.org/abs/2509.24245)
Keywords: language model, llm, prompt
Abstract: Prompt optimization and fine-tuning are two major approaches to improve the performance of Large Language Models (LLMs). They enhance the capabilities of LLMs from complementary perspectives: the former through explicit natural language, and the latter through implicit parameter updates. However, prior work has typically studied them in isolation, leaving their synergistic potential largely underexplored. To bridge this gap, we introduce MetaTuner, a novel framework that jointly integrates prompt optimization and fine-tuning for LLM training. Specifically, we introduce two neural networks to generate prompts and parameters, respectively, while allowing them to share a common bottom encoding layer to enable knowledge sharing. Guided by the final supervised signals, our framework is optimized to discover the optimal combinations of prompts and parameters. Given that prompt learning involves discrete optimization while fine-tuning operates in a continuous parameter space, we design a supervised regularization loss to train our framework effectively. Extensive experiments across diverse benchmarks show that our method consistently outperforms the baselines.
摘要：及时的优化和微调是提高大语模型（LLMS）性能的两种主要方法。它们从互补的角度来增强LLM的功能：前者通过明确的自然语言，而后者通过隐式参数更新。但是，先前的工作通常孤立地研究了它们，使它们的协同潜力在很大程度上没有被逐渐倍增。为了弥合这一差距，在本文中，我们介绍了Metatuner，这是一个新颖的框架，该框架共同整合了LLM培训的迅速优化和微调。具体来说，我们介绍了两个神经网络以分别生成提示和参数，同时允许它们共享一个共同的底部编码层以实现知识共享。根据最终监督信号的指导，我们的框架被优化，以发现提示和参数之间的最佳组合。鉴于该提示学习涉及在连续参数空间中进行微调运行时的离散优化，因此我们设计了监督的正规化损失，以有效地训练我们的框架。跨不同基准测试的广泛实验表明，我们的方法始终优于基准。

Title: MRAG-Suite: A Diagnostic Evaluation Platform for Visual Retrieval-Augmented Generation

Authors: Yuelyu Ji
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.24253
Pdf URL: https://arxiv.org/pdf/2509.24253
Copy Paste: [[2509.24253]] MRAG-Suite: A Diagnostic Evaluation Platform for Visual Retrieval-Augmented Generation(https://arxiv.org/abs/2509.24253)
Keywords: hallucination, retrieval-augmented generation
Abstract: Multimodal Retrieval-Augmented Generation (Visual RAG) significantly advances question answering by integrating visual and textual evidence. Yet, current evaluations fail to systematically account for query difficulty and ambiguity. We propose MRAG-Suite, a diagnostic evaluation platform integrating diverse multimodal benchmarks (WebQA, Chart-RAG, Visual-RAG, MRAG-Bench). We introduce difficulty-based and ambiguity-aware filtering strategies, alongside MM-RAGChecker, a claim-level diagnostic tool. Our results demonstrate substantial accuracy reductions under difficult and ambiguous queries, highlighting prevalent hallucinations. MM-RAGChecker effectively diagnoses these issues, guiding future improvements in Visual RAG systems.

Title: SimuHome: A Temporal- and Environment-Aware Benchmark for Smart Home LLM Agents

Authors: Gyuhyeon Seo, Jungwoo Yang, Junseong Pyo, Nalim Kim, Jonggeun Lee, Yohan Jo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.24282
Pdf URL: https://arxiv.org/pdf/2509.24282
Copy Paste: [[2509.24282]] SimuHome: A Temporal- and Environment-Aware Benchmark for Smart Home LLM Agents(https://arxiv.org/abs/2509.24282)
Keywords: language model, gpt, llm, agent
Abstract: Large Language Model (LLM) agents excel at multi-step, tool-augmented tasks. However, smart homes introduce distinct challenges, requiring agents to handle latent user intents, temporal dependencies, device constraints, scheduling, and more. The main bottlenecks for developing smart home agents with such capabilities include the lack of a realistic simulation environment where agents can interact with devices and observe the results, as well as a challenging benchmark to evaluate them. To address this, we introduce SimuHome, a time-accelerated home environment that simulates smart devices, supports API calls, and reflects changes in environmental variables. By building the simulator on the Matter protocol (the global industry standard for smart home communication), SimuHome provides a high-fidelity environment, and agents validated in SimuHome can be deployed on real Matter-compliant devices with minimal adaptation. We provide a challenging benchmark of 600 episodes across twelve user query types that require the aforementioned capabilities. Our evaluation of 11 agents under a unified ReAct framework reveals that while models perform well on simple tasks, they struggle with latent intent inference, state verification, and especially temporal scheduling. Even the top-performing model, GPT-4.1, reaches only a 54% success rate. These findings highlight a critical need for methods that can reliably verify the current state via tools before acting and coordinate time-dependent actions.

Title: Let LLMs Speak Embedding Languages: Generative Text Embeddings via Iterative Contrastive Refinement

Authors: Yu-Che Tsai, Kuan-Yu Chen, Yuan-Chi Li, Yuan-Hao Chen, Ching-Yu Tsai, Shou-De Lin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.24291
Pdf URL: https://arxiv.org/pdf/2509.24291
Copy Paste: [[2509.24291]] Let LLMs Speak Embedding Languages: Generative Text Embeddings via Iterative Contrastive Refinement(https://arxiv.org/abs/2509.24291)
Keywords: language model, llm
Abstract: Existing large language model (LLM)-based embeddings typically adopt an encoder-only paradigm, treating LLMs as static feature extractors and overlooking their core generative strengths. We introduce GIRCSE (Generative Iterative Refinement for Contrastive Sentence Embeddings), a novel framework that leverages autoregressive generation to iteratively refine semantic representations. By producing sequences of soft tokens optimized under a contrastive objective, GIRCSE captures latent concepts and implicit semantics that encoder-only methods often miss. To guide this process, we propose an Iterative Contrastive Refinement (ICR) objective that encourages each refinement step to yield better representations. Extensive experiments show that GIRCSE outperforms strong LLM-based embedding baselines on the MTEB benchmark and instruction-following tasks. Moreover, GIRCSE exhibits an emergent test-time scaling property: generating more tokens at inference steadily improves embedding quality. Our results establish generative iterative refinement as a new paradigm for representation learning.

Title: LOGOS: LLM-driven End-to-End Grounded Theory Development and Schema Induction for Qualitative Research

Authors: Xinyu Pi, Qisen Yang, Chuong Nguyen
Subjects: cs.CL, cs.HC
Abstract URL: https://arxiv.org/abs/2509.24294
Pdf URL: https://arxiv.org/pdf/2509.24294
Copy Paste: [[2509.24294]] LOGOS: LLM-driven End-to-End Grounded Theory Development and Schema Induction for Qualitative Research(https://arxiv.org/abs/2509.24294)
Keywords: llm
Abstract: Grounded theory offers deep insights from qualitative data, but its reliance on expert-intensive manual coding presents a major scalability bottleneck. Current computational tools stop short of true automation, keeping researchers firmly in the loop. We introduce LOGOS, a novel, end-to-end framework that fully automates the grounded theory workflow, transforming raw text into a structured, hierarchical theory. LOGOS integrates LLM-driven coding, semantic clustering, graph reasoning, and a novel iterative refinement process to build highly reusable codebooks. To ensure fair comparison, we also introduce a principled 5-dimensional metric and a train-test split protocol for standardized, unbiased evaluation. Across five diverse corpora, LOGOS consistently outperforms strong baselines and achieves a remarkable 88.2% alignment with an expert-developed schema on a complex dataset. LOGOS demonstrates a powerful new path to democratize and scale qualitative research without sacrificing theoretical nuance.

Title: DiffuGuard: How Intrinsic Safety is Lost and Found in Diffusion Large Language Models

Authors: Zherui Li, Zheng Nie, Zhenhong Zhou, Yufei Guo, Yue Liu, Yitong Zhang, Yu Cheng, Qingsong Wen, Kun Wang, Jiaheng Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.24296
Pdf URL: https://arxiv.org/pdf/2509.24296
Copy Paste: [[2509.24296]] DiffuGuard: How Intrinsic Safety is Lost and Found in Diffusion Large Language Models(https://arxiv.org/abs/2509.24296)
Keywords: language model, llm
Abstract: The rapid advancement of Diffusion Large Language Models (dLLMs) introduces unprecedented vulnerabilities that are fundamentally distinct from Autoregressive LLMs, stemming from their iterative and parallel generation mechanisms. In this paper, we conduct an in-depth analysis of dLLM vulnerabilities to jailbreak attacks across two distinct dimensions: intra-step and inter-step dynamics. Experimental results reveal a harmful bias inherent in the standard greedy remasking strategy and identify a critical phenomenon we term Denoising-path Dependence, where the safety of early-stage tokens decisively influences the final output. These findings also indicate that while current decoding strategies constitute a significant vulnerability, dLLMs possess a substantial intrinsic safety potential. To unlock this potential, we propose DiffuGuard, a training-free defense framework that addresses vulnerabilities through a dual-stage approach: Stochastic Annealing Remasking dynamically introduces controlled randomness to mitigate greedy selection bias, while Block-level Audit and Repair exploits internal model representations for autonomous risk detection and guided correction. Comprehensive experiments on four dLLMs demonstrate DiffuGuard's exceptional effectiveness, reducing Attack Success Rate against six diverse jailbreak methods from 47.9% to 14.7% while preserving model utility and efficiency. Our code is available at: this https URL.

Title: Q-Mirror: Unlocking the Multi-Modal Potential of Scientific Text-Only QA Pairs

Authors: Junying Wang, Zicheng Zhang, Ye Shen, Yalun Wu, Yingji Liang, Yijin Guo, Farong Wen, Wenzhe Li, Xuezhi Zhao, Qi Jia, Guangtao Zhai
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.24297
Pdf URL: https://arxiv.org/pdf/2509.24297
Copy Paste: [[2509.24297]] Q-Mirror: Unlocking the Multi-Modal Potential of Scientific Text-Only QA Pairs(https://arxiv.org/abs/2509.24297)
Keywords: agent
Abstract: High-quality, multi-modal benchmarks are crucial for advancing scientific reasoning in large models, yet their manual creation is costly and unscalable. To address this bottleneck, we explore the potential for transforming Text-Only QA Pairs (TQAs) into high-quality Multi-Modal QA Pairs (MMQAs), which include three parts: 1) Task Definition & Evaluation Rubric: We develop a TQA-to-MMQA framework and establish a comprehensive, multi-dimensional MMQA quality rubric that provides principles for the transformation. 2) Benchmark Construction: Then we construct two extensive benchmarks to rigorously evaluate state-of-the-art generation & understanding models on the distinct tasks of MMQA generation & MMQA quality evaluation. 3) Preliminary Solution: We develop an agentic system (Q-Mirror), which operationalizes our framework by integrating MMQA generation and evaluation into a closed loop for iterative refinement. Our experiments show that while state-of-the-art models can generate MMQAs, their outputs still leave substantial gaps, underscoring the need for reliable evaluation. We further demonstrate that top-tier understanding models align closely with human judgment in MMQA quality assessment. Leveraging both insights, the Q-Mirror agent raises average scores from 78.90 to 85.22 and pass rates from 72% to 95%, offering a practical path to large-scale scientific benchmarks.

Title: Dual Mechanisms of Value Expression: Intrinsic vs. Prompted Values in LLMs

Authors: Jongwook Han, Jongwon Lim, Injin Kong, Yohan Jo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.24319
Pdf URL: https://arxiv.org/pdf/2509.24319
Copy Paste: [[2509.24319]] Dual Mechanisms of Value Expression: Intrinsic vs. Prompted Values in LLMs(https://arxiv.org/abs/2509.24319)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) can express different values in two distinct ways: (1) intrinsic expression, reflecting the model's inherent values learned during training, and (2) prompted expression, elicited by explicit prompts. Given their widespread use in value alignment and persona steering, it is paramount to clearly understand their underlying mechanisms, particularly whether they mostly overlap (as one might expect) or rely on substantially different mechanisms, but this remains largely understudied. We analyze this at the mechanistic level using two approaches: (1) value vectors, feature directions representing value mechanisms extracted from the residual stream, and (2) value neurons, MLP neurons that contribute to value expressions. We demonstrate that intrinsic and prompted value mechanisms partly share common components that are crucial for inducing value expression, but also possess unique elements that manifest in different ways. As a result, these mechanisms lead to different degrees of value steerability (prompted > intrinsic) and response diversity (intrinsic > prompted). In particular, components unique to the intrinsic mechanism seem to promote lexical diversity in responses, whereas those specific to the prompted mechanism primarily strengthen instruction following, taking effect even in distant tasks like jailbreaking.

Title: Multimodal Large Language Models Meet Multimodal Emotion Recognition and Reasoning: A Survey

Authors: Yuntao Shou, Tao Meng, Wei Ai, Keqin Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.24322
Pdf URL: https://arxiv.org/pdf/2509.24322
Copy Paste: [[2509.24322]] Multimodal Large Language Models Meet Multimodal Emotion Recognition and Reasoning: A Survey(https://arxiv.org/abs/2509.24322)
Keywords: language model, llm
Abstract: In recent years, large language models (LLMs) have driven major advances in language understanding, marking a significant step toward artificial general intelligence (AGI). With increasing demands for higher-level semantics and cross-modal fusion, multimodal large language models (MLLMs) have emerged, integrating diverse information sources (e.g., text, vision, and audio) to enhance modeling and reasoning in complex scenarios. In AI for Science, multimodal emotion recognition and reasoning has become a rapidly growing frontier. While LLMs and MLLMs have achieved notable progress in this area, the field still lacks a systematic review that consolidates recent developments. To address this gap, this paper provides a comprehensive survey of LLMs and MLLMs for emotion recognition and reasoning, covering model architectures, datasets, and performance benchmarks. We further highlight key challenges and outline future research directions, aiming to offer researchers both an authoritative reference and practical insights for advancing this domain. To the best of our knowledge, this paper is the first attempt to comprehensively survey the intersection of MLLMs with multimodal emotion recognition and reasoning. A summary of the existing methods is available on our GitHub: this https URL.

Title: Speculative Verification: Exploiting Information Gain to Refine Speculative Decoding

Authors: Sungkyun Kim, Jaemin Kim, Dogyung Yoon, Jiho Shin, Junyeol Lee, Jiwon Seo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.24328
Pdf URL: https://arxiv.org/pdf/2509.24328
Copy Paste: [[2509.24328]] Speculative Verification: Exploiting Information Gain to Refine Speculative Decoding(https://arxiv.org/abs/2509.24328)
Keywords: llm
Abstract: LLMs have low GPU efficiency and high latency due to autoregressive decoding. Speculative decoding (SD) mitigates this using a small draft model to speculatively generate multiple tokens, which are then verified in parallel by a target model. However, when speculation accuracy is low, the overhead from rejected tokens can offset the benefits, limiting SD's effectiveness, especially at large batch sizes. To address this, we propose Speculative Verification (SV), an efficient augmentation to SD that dynamically predicts speculation accuracy and adapts the verification length to maximize throughput. SV introduces a companion model, a small auxiliary model similar in size to the draft model, to estimate the alignment between draft and target model distributions. By maximizing the information gain from quantifying this alignment, SV refines verification decisions, reducing wasted computation on rejected tokens and improving decoding efficiency. Moreover, SV requires no modifications to the draft or target models and is compatible with existing SD variants. We extensively evaluated SV on publicly available LLMs across three NLP tasks using nine combinations of draft, companion, and target models, including 13B-72B target models and three types of variations: base (no finetuning), instruction-tuned, and task fine-tuned. Across all experiments and batch sizes (4-80), SV consistently outperforms both SD and standard decoding with the target model. It improves SD performance by up to $2\times$, with an average speedup of $1.4\times$ in large-batch settings (batch sizes 32-80). These results demonstrate SV's robustness, scalability, and practical utility for efficient LLM inference.
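
The draft-then-verify loop that speculative decoding (SD) builds on can be sketched as follows. This is a minimal greedy illustration of plain SD, not the SV method from the abstract: `draft_next` and `target_next` are hypothetical stand-ins for the draft and target models (each maps a token sequence to its next token), and `k` is an assumed speculation length.

```python
from typing import Callable, List

# Hypothetical model interface: token context -> next token (greedy).
NextToken = Callable[[List[int]], int]


def speculative_step(context: List[int], draft_next: NextToken,
                     target_next: NextToken, k: int) -> List[int]:
    """One greedy draft-then-verify step; returns the newly accepted tokens."""
    # 1) The draft model speculates k tokens autoregressively.
    drafted: List[int] = []
    for _ in range(k):
        drafted.append(draft_next(context + drafted))

    # 2) The target model checks each drafted position. A real system scores
    #    all k positions in one batched forward pass; the loop here is
    #    sequential only for clarity.
    accepted: List[int] = []
    for tok in drafted:
        expected = target_next(context + accepted)
        if tok != expected:
            # First mismatch: keep the target's token, discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # 3) Every drafted token matched, so the target adds one bonus token.
    accepted.append(target_next(context + accepted))
    return accepted


def generate(prompt: List[int], draft_next: NextToken, target_next: NextToken,
             k: int, max_new_tokens: int) -> List[int]:
    """Repeat speculative steps until max_new_tokens tokens are produced."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(speculative_step(out, draft_next, target_next, k))
    return out[:len(prompt) + max_new_tokens]
```

Under greedy verification the output equals what the target model would decode on its own, whatever the draft proposes. Tokens drafted after the first mismatch are discarded, which is the wasted work the abstract says SV reduces by adapting the verification length.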

Title: AlignX: Advancing Multilingual Large Language Models with Multilingual Representation Alignment

Authors: Mengyu Bu, Shaolei Zhang, Zhongjun He, Hua Wu, Yang Feng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.24338
Pdf URL: https://arxiv.org/pdf/2509.24338
Copy Paste: [[2509.24338]] AlignX: Advancing Multilingual Large Language Models with Multilingual Representation Alignment(https://arxiv.org/abs/2509.24338)
Keywords: language model, llm
Abstract: Multilingual large language models (LLMs) possess impressive multilingual understanding and generation capabilities. However, their performance and cross-lingual alignment often lag for non-dominant languages. A common solution is to fine-tune LLMs on a large-scale, more balanced multilingual corpus, but such approaches often lead to imprecise alignment and suboptimal knowledge transfer, yielding only limited improvements across languages. In this paper, we propose AlignX, a two-stage representation-level framework for enhancing the multilingual performance of pre-trained LLMs, to bridge the multilingual performance gap. In the first stage, we align multilingual representations with multilingual semantic alignment and language feature integration. In the second stage, we stimulate the multilingual capability of LLMs via multilingual instruction fine-tuning. Experimental results on several pre-trained LLMs demonstrate that our approach enhances LLMs' multilingual general and cross-lingual generation capability. Further analysis indicates that AlignX brings the multilingual representations closer and improves the cross-lingual alignment.

Title: Beyond Repetition: Text Simplification and Curriculum Learning for Data-Constrained Pretraining

Authors: Matthew Theodore Roque, Dan John Velasco
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.24356
Pdf URL: https://arxiv.org/pdf/2509.24356
Copy Paste: [[2509.24356]] Beyond Repetition: Text Simplification and Curriculum Learning for Data-Constrained Pretraining(https://arxiv.org/abs/2509.24356)
Keywords: language model, llm
Abstract: Most studies on language model pretraining focus on large datasets, leaving open questions about optimization in data-constrained settings. In such settings, the effects of training data order and of including alternative versions of the same text remain underexplored. We address this by studying curriculum learning in pretraining, focusing on text-complexity ordering and data augmentation via simplification. We ask: (1) Does simplifying texts enhance representation quality more than reusing the original data? and (2) Does ordering data by text complexity yield better representations? To answer, we build on a pair of parallel corpora where human-written paragraphs are aligned with LLM-simplified variants, and test four data schedules: repeated exposure, low-to-high complexity, high-to-low, and interleaved. We analyze models' representation quality from a sample efficiency perspective via fine-tuning, as well as their zero-shot performance on linguistic knowledge, entity tracking, world knowledge, and commonsense reasoning. Our findings show that adding simplified data improves fine-tuning and zero-shot performance over a repeated-exposure baseline: smaller models benefit from low-to-high complexity, while larger models perform better with interleaved ordering.

Title: Reinforcement Mid-Training

Authors: Yijun Tian, Shaoyu Chen, Zhichao Xu, Yawei Wang, Jinhe Bi, Peng Han, Wei Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.24375
Pdf URL: https://arxiv.org/pdf/2509.24375
Copy Paste: [[2509.24375]] Reinforcement Mid-Training(https://arxiv.org/abs/2509.24375)
Keywords: language model
Abstract: The development of state-of-the-art large language models is commonly understood as a two-stage process involving pre-training and post-training. We point out the need for an additional intermediate stage called reinforcement mid-training with potential for strong performance gains. In this paper, we formally define the problem and identify three key challenges: (1) inefficient training due to excessive reasoning steps, (2) disregard of the imbalanced token entropy distribution, and (3) underutilization of token information. To address these challenges, we propose RMT, a framework for efficient, adaptive, and unified reinforcement mid-training with various innovative components. In particular, we first introduce a dynamic token budget mechanism that constrains unnecessary reasoning steps and mitigates model overthinking. Next, we design a curriculum-based adaptive sampling method that fosters a progressive learning trajectory from easy to hard tokens. Finally, we present a dual training strategy that combines reinforcement learning with next-token prediction, ensuring targeted learning on key tokens and full exploitation of all token information. Extensive experiments demonstrate the superiority of RMT over state-of-the-art methods, achieving up to +64.91% performance improvement with only 21% of the reasoning length in language modeling. We also show that checkpoints obtained after reinforcement mid-training can benefit the subsequent post-training, yielding up to +18.76% improvement in the mathematical domain.

Title: HarmMetric Eval: Benchmarking Metrics and Judges for LLM Harmfulness Assessment

Authors: Langqi Yang, Tianhang Zheng, Kedong Xiu, Yixuan Chen, Di Wang, Puning Zhao, Zhan Qin, Kui Ren
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.24384
Pdf URL: https://arxiv.org/pdf/2509.24384
Copy Paste: [[2509.24384]] HarmMetric Eval: Benchmarking Metrics and Judges for LLM Harmfulness Assessment(https://arxiv.org/abs/2509.24384)
Keywords: language model, llm, prompt
Abstract: The alignment of large language models (LLMs) with human values is critical for their safe deployment, yet jailbreak attacks can subvert this alignment to elicit harmful outputs from LLMs. In recent years, a proliferation of jailbreak attacks has emerged, accompanied by diverse metrics and judges to assess the harmfulness of the LLM outputs. However, the absence of a systematic benchmark to assess the quality and effectiveness of these metrics and judges undermines the credibility of the reported jailbreak effectiveness and other risks. To address this gap, we introduce HarmMetric Eval, a comprehensive benchmark designed to support both overall and fine-grained evaluation of harmfulness metrics and judges. Our benchmark includes a high-quality dataset of representative harmful prompts paired with diverse harmful and non-harmful model responses, alongside a flexible scoring mechanism compatible with various metrics and judges. With HarmMetric Eval, our extensive experiments uncover a surprising result: two conventional metrics, METEOR and ROUGE-1, outperform LLM-based judges in evaluating the harmfulness of model responses, challenging prevailing beliefs about LLMs' superiority in this domain. Our dataset is publicly available at this https URL, and the code is available at this https URL.

Title: LLaDA-MoE: A Sparse MoE Diffusion Language Model

Authors: Fengqi Zhu, Zebin You, Yipeng Xing, Zenan Huang, Lin Liu, Yihong Zhuang, Guoshan Lu, Kangyu Wang, Xudong Wang, Lanning Wei, Hongrui Guo, Jiaqi Hu, Wentao Ye, Tieyuan Chen, Chenchen Li, Chengfu Tang, Haibo Feng, Jun Hu, Jun Zhou, Xiaolu Zhang, Zhenzhong Lan, Junbo Zhao, Da Zheng, Chongxuan Li, Jianguo Li, Ji-Rong Wen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.24389
Pdf URL: https://arxiv.org/pdf/2509.24389
Copy Paste: [[2509.24389]] LLaDA-MoE: A Sparse MoE Diffusion Language Model(https://arxiv.org/abs/2509.24389)
Keywords: language model, agent
Abstract: We introduce LLaDA-MoE, a large language diffusion model with the Mixture-of-Experts (MoE) architecture, trained from scratch on approximately 20T tokens. LLaDA-MoE achieves competitive performance with significantly reduced computational overhead by maintaining a 7B-parameter capacity while activating only 1.4B parameters during inference. Our empirical evaluation reveals that LLaDA-MoE achieves state-of-the-art performance among diffusion language models with larger parameters, surpassing previous diffusion language models LLaDA, LLaDA 1.5, and Dream across multiple benchmarks. The instruct-tuned model LLaDA-MoE-7B-A1B-Instruct demonstrates capabilities comparable to Qwen2.5-3B-Instruct in knowledge understanding, code generation, mathematical reasoning, agent and alignment tasks, despite using fewer active parameters. Our results show that integrating a sparse MoE architecture into the training objective of masked diffusion language models still brings out MoE's strengths under efficient inference with few active parameters, and opens ample room for further exploration of diffusion language models. LLaDA-MoE models are available at Huggingface.

Title: Agentar-Scale-SQL: Advancing Text-to-SQL through Orchestrated Test-Time Scaling

Authors: Pengfei Wang, Baolin Sun, Xuemei Dong, Yaxun Dai, Hongwei Yuan, Mengdie Chu, Yingqi Gao, Xiang Qi, Peng Zhang, Ying Yan
Subjects: cs.CL, cs.DB
Abstract URL: https://arxiv.org/abs/2509.24403
Pdf URL: https://arxiv.org/pdf/2509.24403
Copy Paste: [[2509.24403]] Agentar-Scale-SQL: Advancing Text-to-SQL through Orchestrated Test-Time Scaling(https://arxiv.org/abs/2509.24403)
Keywords: language model, agent
Abstract: State-of-the-art (SOTA) Text-to-SQL methods still lag significantly behind human experts on challenging benchmarks like BIRD. Current approaches that explore test-time scaling lack an orchestrated strategy and neglect the model's internal reasoning process. To bridge this gap, we introduce Agentar-Scale-SQL, a novel framework leveraging scalable computation to improve performance. Agentar-Scale-SQL implements an Orchestrated Test-Time Scaling strategy that synergistically combines three distinct perspectives: i) Internal Scaling via RL-enhanced Intrinsic Reasoning, ii) Sequential Scaling through Iterative Refinement, and iii) Parallel Scaling using Diverse Synthesis and Tournament Selection. Agentar-Scale-SQL is a general-purpose framework designed for easy adaptation to new databases and more powerful language models. Extensive experiments show that Agentar-Scale-SQL achieves SOTA performance on the BIRD benchmark, reaching 81.67% execution accuracy on the test set and ranking first on the official leaderboard, demonstrating an effective path toward human-level performance.
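
The execution accuracy figure quoted above is, in general terms, the share of predicted queries whose results match those of the reference query when both are run against the database. A simplified sketch of that idea, assuming a SQLite database and set-based row comparison; the function names are hypothetical, and the official BIRD evaluation script may differ in details:

```python
import sqlite3
from typing import Iterable, Tuple


def execution_match(conn: sqlite3.Connection,
                    predicted_sql: str, gold_sql: str) -> bool:
    """True if the predicted query runs and returns the same set of rows as gold."""
    try:
        predicted_rows = set(conn.execute(predicted_sql).fetchall())
    except sqlite3.Error:
        # A predicted query that fails to execute counts as incorrect.
        return False
    gold_rows = set(conn.execute(gold_sql).fetchall())
    return predicted_rows == gold_rows


def execution_accuracy(conn: sqlite3.Connection,
                       pairs: Iterable[Tuple[str, str]]) -> float:
    """Fraction of (predicted, gold) query pairs whose results match."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one (predicted, gold) pair")
    hits = sum(execution_match(conn, pred, gold) for pred, gold in pairs)
    return hits / len(pairs)
```

Comparing result sets rather than SQL strings means two differently written queries count as equivalent when they return the same rows, at the cost of ignoring row order and duplicate rows.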

Title: Multilingual Text-to-SQL: Benchmarking the Limits of Language Models with Collaborative Language Agents

Authors: Khanh Trinh Pham, Thu Huong Nguyen, Jun Jo, Quoc Viet Hung Nguyen, Thanh Tam Nguyen
Subjects: cs.CL, cs.AI, cs.DB, cs.ET, cs.IR
Abstract URL: https://arxiv.org/abs/2509.24405
Pdf URL: https://arxiv.org/pdf/2509.24405
Copy Paste: [[2509.24405]] Multilingual Text-to-SQL: Benchmarking the Limits of Language Models with Collaborative Language Agents(https://arxiv.org/abs/2509.24405)
Keywords: language model, llm, agent
Abstract: Text-to-SQL enables natural access to databases, yet most benchmarks are English-only, limiting multilingual progress. We introduce MultiSpider 2.0, extending Spider 2.0 to eight languages (English, German, French, Spanish, Portuguese, Japanese, Chinese, Vietnamese). It preserves Spider 2.0's structural difficulty while adding linguistic and dialectal variability, demanding deeper reasoning for complex SQL. On this benchmark, state-of-the-art LLMs (such as DeepSeek-R1 and OpenAI o1) reach only 4% execution accuracy when relying on intrinsic reasoning, versus 60% on MultiSpider 1.0. Therefore, we provide a collaboration-driven language-agent baseline that iteratively refines queries, improving accuracy to 15%. These results reveal a substantial multilingual gap and motivate methods that are robust across languages and ready for real-world enterprise deployment. Our benchmark is available at this https URL.

Title: CDT: A Comprehensive Capability Framework for Large Language Models Across Cognition, Domain, and Task

Authors: Haosi Mo, Xinyu Ma, Xuebo Liu, Derek F. Wong, Yu Li, Jie Liu, Min Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.24422
Pdf URL: https://arxiv.org/pdf/2509.24422
Copy Paste: [[2509.24422]] CDT: A Comprehensive Capability Framework for Large Language Models Across Cognition, Domain, and Task(https://arxiv.org/abs/2509.24422)
Keywords: language model, llm
Abstract: Recent advances in Large Language Models (LLMs) have significantly enhanced their capabilities, highlighting the need for comprehensive evaluation frameworks that extend beyond task-specific benchmarks. However, existing benchmarks often focus on isolated abilities, lacking a holistic framework for assessing LLM capabilities. To address this gap, we propose the Cognition-Domain-Task (CDT) framework, which comprehensively measures a model's capabilities across three dimensions. We expand the scope of model capability definitions at the cognitive level by incorporating the Cattell-Horn-Carroll cognitive theory, refining the categorization of model capabilities. We apply CDT in two directions: dataset capability evaluation and data selection. Experiments show that our capability metrics correlate well with downstream performance and can support effective dataset analysis and construction. The experiments on data selection also show significant improvements in both general and specific benchmarks, achieving scores of 44.3 and 45.4, with an increase of 1.6 and 2.2 points over the baselines, respectively. These results validate the effectiveness and practicality of CDT. Source code and models are available at this https URL.

Title: Alternatives To Next Token Prediction In Text Generation - A Survey

Authors: Charlie Wyatt, Aditya Joshi, Flora Salim
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.24435
Pdf URL: https://arxiv.org/pdf/2509.24435
Copy Paste: [[2509.24435]] Alternatives To Next Token Prediction In Text Generation - A Survey(https://arxiv.org/abs/2509.24435)
Keywords: language model, llm
Abstract: The paradigm of Next Token Prediction (NTP) has driven the unprecedented success of Large Language Models (LLMs), but is also the source of their most persistent weaknesses such as poor long-term planning, error accumulation, and computational inefficiency. Acknowledging the growing interest in exploring alternatives to NTP, the survey describes the emerging ecosystem of alternatives to NTP. We categorise these approaches into five main families: (1) Multi-Token Prediction, which targets a block of future tokens instead of a single one; (2) Plan-then-Generate, where a global, high-level plan is created upfront to guide token-level decoding; (3) Latent Reasoning, which shifts the autoregressive process itself into a continuous latent space; (4) Continuous Generation Approaches, which replace sequential generation with iterative, parallel refinement through diffusion, flow matching, or energy-based methods; and (5) Non-Transformer Architectures, which sidestep NTP through their inherent model structure. By synthesizing insights across these methods, this survey offers a taxonomy to guide research into models that address the known limitations of token-level generation to develop new transformative models for natural language processing.

Title: Bias Mitigation or Cultural Commonsense? Evaluating LLMs with a Japanese Dataset

Authors: Taisei Yamamoto, Ryoma Kumon, Danushka Bollegala, Hitomi Yanaka
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.24468
Pdf URL: https://arxiv.org/pdf/2509.24468
Copy Paste: [[2509.24468]] Bias Mitigation or Cultural Commonsense? Evaluating LLMs with a Japanese Dataset(https://arxiv.org/abs/2509.24468)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) exhibit social biases, prompting the development of various debiasing methods. However, debiasing methods may degrade the capabilities of LLMs. Previous research has evaluated the impact of bias mitigation primarily through tasks measuring general language understanding, which are often unrelated to social biases. In contrast, cultural commonsense is closely related to social biases, as both are rooted in social norms and values. The impact of bias mitigation on cultural commonsense in LLMs has not been well investigated. Considering this gap, we propose SOBACO (SOcial BiAs and Cultural cOmmonsense benchmark), a Japanese benchmark designed to evaluate social biases and cultural commonsense in LLMs in a unified format. We evaluate several LLMs on SOBACO to examine how debiasing methods affect cultural commonsense in LLMs. Our results reveal that the debiasing methods degrade the performance of the LLMs on the cultural commonsense task (up to 75% accuracy deterioration). These results highlight the importance of developing debiasing methods that consider the trade-off with cultural commonsense to improve fairness and utility of LLMs.

Title: Sanitize Your Responses: Mitigating Privacy Leakage in Large Language Models

Authors: Wenjie Fu, Huandong Wang, Junyao Gao, Guoan Wan, Tao Jiang
Subjects: cs.CL, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2509.24488
Pdf URL: https://arxiv.org/pdf/2509.24488
Copy Paste: [[2509.24488]] Sanitize Your Responses: Mitigating Privacy Leakage in Large Language Models(https://arxiv.org/abs/2509.24488)
Keywords: language model, llm, prompt, chat
Abstract: As Large Language Models (LLMs) achieve remarkable success across a wide range of applications, such as chatbots and code copilots, concerns surrounding the generation of harmful content have come increasingly into focus. Despite significant advances in aligning LLMs with safety and ethical standards, adversarial prompts can still be crafted to elicit undesirable responses. Existing mitigation strategies are predominantly based on post-hoc filtering, which introduces substantial latency or computational overhead, and is incompatible with token-level streaming generation. In this work, we introduce Self-Sanitize, a novel LLM-driven mitigation framework inspired by cognitive psychology, which emulates human self-monitoring and self-repair behaviors during conversations. Self-Sanitize comprises a lightweight Self-Monitor module that continuously inspects high-level intentions within the LLM at the token level via representation engineering, and a Self-Repair module that performs in-place correction of harmful content without initiating separate review dialogues. This design allows for real-time streaming monitoring and seamless repair, with negligible impact on latency and resource utilization. Given that privacy-invasive content has often received insufficient attention in previous studies, we perform extensive experiments on four LLMs across three privacy leakage scenarios. The results demonstrate that Self-Sanitize achieves superior mitigation performance with minimal overhead and without degrading the utility of LLMs, offering a practical and robust solution for safer LLM deployments. Our code is available at the following link: this https URL

Title: GRPO-MA: Multi-Answer Generation in GRPO for Stable and Efficient Chain-of-Thought Training

Authors: Hongcheng Wang, Yinuo Huang, Sukai Wang, Guanghui Ren, Hao Dong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.24494
Pdf URL: https://arxiv.org/pdf/2509.24494
Copy Paste: [[2509.24494]] GRPO-MA: Multi-Answer Generation in GRPO for Stable and Efficient Chain-of-Thought Training(https://arxiv.org/abs/2509.24494)
Keywords: language model, llm, chain-of-thought
Abstract: Recent progress, such as DeepSeek-R1, has shown that the GRPO algorithm, a Reinforcement Learning (RL) approach, can effectively train Chain-of-Thought (CoT) reasoning in Large Language Models (LLMs) and Vision-Language Models (VLMs). In this paper, we analyze three challenges of GRPO: gradient coupling between thoughts and answers, sparse reward signals caused by limited parallel sampling, and unstable advantage estimation. To mitigate these challenges, we propose GRPO-MA, a simple yet theoretically grounded method that leverages multi-answer generation from each thought process, enabling more robust and efficient optimization. Theoretically, we show that the variance of thought advantage decreases as the number of answers per thought increases. Empirically, our gradient analysis confirms this effect, showing that GRPO-MA reduces gradient spikes compared to GRPO. Experiments on math, code, and diverse multimodal tasks demonstrate that GRPO-MA substantially improves performance and training efficiency. Our ablation studies further reveal that increasing the number of answers per thought consistently enhances model performance.
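The multi-answer idea in the abstract above can be sketched in a few lines: sample several answers from each thought, average their rewards to score the thought, then standardise those scores within the group as GRPO-style methods do. This is an illustrative sketch, not the GRPO-MA implementation; the function names and the 0/1 rewards are assumptions. The intuition for the variance claim is the usual one for a sample mean: if the M answer rewards for a thought are independent with variance sigma^2, their average has variance sigma^2 / M.

```python
from statistics import mean, pstdev


def thought_values(answer_rewards):
    """Score each thought by the mean reward of the answers sampled from it."""
    return [mean(rewards) for rewards in answer_rewards]


def group_relative_advantages(values, eps=1e-8):
    """Standardise values within the group (GRPO-style normalisation)."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


# Hypothetical rewards: 3 thoughts, 4 answers sampled per thought
answer_rewards = [
    [1.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
    [1.0, 1.0, 1.0, 1.0],
]
values = thought_values(answer_rewards)      # [0.75, 0.25, 1.0]
advantages = group_relative_advantages(values)
```

With a single answer per thought, each value would be either 0 or 1; averaging over four answers gives a graded score, which is the sense in which more answers per thought yield a less noisy training signal.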

Title: Knowledge Editing with Subspace-Aware Key-Value Mappings

Authors: Haewon Park, Sangwoo Kim, Yohan Jo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.24502
Pdf URL: https://arxiv.org/pdf/2509.24502
Copy Paste: [[2509.24502]] Knowledge Editing with Subspace-Aware Key-Value Mappings(https://arxiv.org/abs/2509.24502)
Keywords: language model, gpt
Abstract: Knowledge editing aims to efficiently correct factual errors in Language Models (LMs). The popular locate-then-edit approach modifies an MLP layer by finding an optimal mapping between its input vector (key) and output vector (value) that leads to the expression of the edited knowledge. However, existing methods without any constraints on the key and value vectors cause significant perturbations to the edited model. To address this, we propose Subspace Knowledge Edit (SUIT), a method that identifies and modifies only the subspace of critical features relevant to the edit. Our empirical results on LLaMA-3-8B, GPT-J-6B, and Qwen2.5-7B models show that SUIT dramatically improves knowledge preservation over strong baselines while maintaining high edit efficacy. This effectiveness confirms that SUIT successfully identifies the critical subspace for the edit. Further analyses provide additional validation for our approach. The source code and data will be released to the public upon publication of the paper.
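To make the key-value view in the abstract above concrete, the sketch below applies the simplest possible edit to a linear layer W: the rank-one update W' = W + (v* - W k) k^T / (k^T k), which maps a chosen key k to a new value v* and leaves inputs orthogonal to k unchanged. This is a textbook construction shown under simplifying assumptions, not SUIT itself (which restricts the edit to a subspace of critical features) and not any published editor; the matrix, key, and value are made-up numbers.

```python
def matvec(W, k):
    """Multiply matrix W (a list of rows) by vector k."""
    return [sum(w * x for w, x in zip(row, k)) for row in W]


def rank_one_edit(W, key, new_value):
    """Return W' = W + (v* - W k) k^T / (k^T k), so that W' k = v*."""
    residual = [v - o for v, o in zip(new_value, matvec(W, key))]
    norm_sq = sum(x * x for x in key)
    return [
        [w + r * x / norm_sq for w, x in zip(row, key)]
        for row, r in zip(W, residual)
    ]


# Hypothetical 3x2 layer, key vector, and target value vector
W = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0]]
key = [1.0, 2.0]
new_value = [0.5, -1.0, 3.0]
W_edited = rank_one_edit(W, key, new_value)

# An input orthogonal to the key should produce the same output as before
orthogonal = [2.0, -1.0]
```

The update changes the layer only along the direction of the key, which is one way to see why constraining which directions an edit may touch matters for preserving unrelated knowledge.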

Title: Building Benchmarks from the Ground Up: Community-Centered Evaluation of LLMs in Healthcare Chatbot Settings

Authors: Hamna, Gayatri Bhat, Sourabrata Mukherjee, Faisal Lalani, Evan Hadfield, Divya Siddarth, Kalika Bali, Sunayana Sitaram
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.24506
Pdf URL: https://arxiv.org/pdf/2509.24506
Copy Paste: [[2509.24506]] Building Benchmarks from the Ground Up: Community-Centered Evaluation of LLMs in Healthcare Chatbot Settings(https://arxiv.org/abs/2509.24506)
Keywords: language model, llm, chat
Abstract: Large Language Models (LLMs) are typically evaluated through general or domain-specific benchmarks testing capabilities that often lack grounding in the lived realities of end users. Critical domains such as healthcare require evaluations that extend beyond artificial or simulated tasks to reflect the everyday needs, cultural practices, and nuanced contexts of communities. We propose Samiksha, a community-driven evaluation pipeline co-created with civil-society organizations (CSOs) and community members. Our approach enables scalable, automated benchmarking through a culturally aware, community-driven pipeline in which community feedback informs what to evaluate, how the benchmark is built, and how outputs are scored. We demonstrate this approach in the health domain in India. Our analysis highlights how current multilingual LLMs address nuanced community health queries, while also offering a scalable pathway for contextually grounded and inclusive LLM evaluation.

Title: AdaThink-Med: Medical Adaptive Thinking with Uncertainty-Guided Length Calibration

Authors: Shaohao Rui, Kaitao Chen, Weijie Ma, Xiaosong Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.24560
Pdf URL: https://arxiv.org/pdf/2509.24560
Copy Paste: [[2509.24560]] AdaThink-Med: Medical Adaptive Thinking with Uncertainty-Guided Length Calibration(https://arxiv.org/abs/2509.24560)
Keywords: language model, llm
Abstract: Recent advances in inference-time scaling with long chain-of-thought reasoning have significantly improved the reasoning capabilities of both general and medical large language models (LLMs). However, these models tend to engage in lengthy reasoning processes regardless of the difficulty of the input question, leading to increased inference costs in real-world applications. Therefore, enabling adaptive thinking, where models think less for simpler questions and think more for complex ones, is critical for the effective use of medical LLMs in practice. Despite its importance, there is a lack of end-to-end approaches designed to enhance the adaptive thinking capabilities of medical LLMs while providing a comprehensive examination of the trade-off between performance and computational cost. To bridge this gap, we propose AdaThink-Med, the first end-to-end framework designed to enhance adaptive thinking ability in medical reasoning models with uncertainty-guided length calibration. AdaThink-Med first generates multiple candidate outputs for each question, evaluates the correctness and uncertainty of each candidate, and then estimates problem difficulty via an uncertainty-guided length calibration module. For outputs with low difficulty and correct answers, the framework penalizes longer reasoning paths; whereas for those with high difficulty and incorrect answers, it encourages extending the chain of thought to explore alternative solutions. On six public medical QA benchmarks, AdaThink-Med achieves up to 6.4x length reduction on average while retaining performance with only minimal degradation. Intriguingly, we observe that AdaThink-Med spontaneously develops two distinct reasoning modes, which we characterize as "non-thinking" and "thinking", demonstrating the model's ability to suppress redundant reasoning processes dynamically.
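The two-sided length rule described in the abstract above (shorter is better for easy, correctly answered questions; longer is encouraged for hard, incorrectly answered ones) can be sketched as a toy reward function. Everything specific here is an assumption for illustration: the function name, the thresholds, the weight, and the linear form of the length term are hypothetical, and the difficulty score is simply passed in rather than estimated from uncertainty as the paper describes.

```python
def length_shaped_reward(correct, difficulty, length, max_length,
                         easy_threshold=0.3, hard_threshold=0.7, weight=0.5):
    """Toy reward: correctness plus a difficulty-dependent length term.

    `difficulty` is assumed to lie in [0, 1]; all thresholds are hypothetical.
    """
    base = 1.0 if correct else 0.0
    frac = min(length, max_length) / max_length   # normalised length in [0, 1]
    if correct and difficulty <= easy_threshold:
        return base - weight * frac   # easy and solved: penalise long outputs
    if not correct and difficulty >= hard_threshold:
        return base + weight * frac   # hard and unsolved: encourage longer ones
    return base                       # otherwise leave the reward unchanged


# Hypothetical examples with a 1000-token budget
easy_short = length_shaped_reward(True, 0.1, 100, 1000)
easy_long = length_shaped_reward(True, 0.1, 900, 1000)
hard_long = length_shaped_reward(False, 0.9, 900, 1000)
hard_short = length_shaped_reward(False, 0.9, 100, 1000)
```

Under these assumed settings a short correct answer to an easy question scores higher than a long one, while for a hard question that was answered incorrectly the longer attempt scores higher.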

Title: Inducing Dyslexia in Vision Language Models

Authors: Melika Honarmand, Ayati Sharma, Badr AlKhamissi, Johannes Mehrer, Martin Schrimpf
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2509.24597
Pdf URL: https://arxiv.org/pdf/2509.24597
Copy Paste: [[2509.24597]] Inducing Dyslexia in Vision Language Models(https://arxiv.org/abs/2509.24597)
Keywords: language model
Abstract: Dyslexia, a neurodevelopmental disorder characterized by persistent reading difficulties, is often linked to reduced activity of the visual word form area in the ventral occipito-temporal cortex. Traditional approaches to studying dyslexia, such as behavioral and neuroimaging methods, have provided valuable insights but remain limited in their ability to test causal hypotheses about the underlying mechanisms of reading impairments. In this study, we use large-scale vision-language models (VLMs) to simulate dyslexia by functionally identifying and perturbing artificial analogues of word processing. Using stimuli from cognitive neuroscience, we identify visual-word-form-selective units within VLMs and demonstrate that targeted ablation of these units, unlike ablation of random units, leads to selective impairments in reading tasks while general visual and language comprehension abilities remain intact. In particular, the resulting model matches dyslexic humans' phonological deficits without a significant change in orthographic processing. Taken together, our modeling results replicate key characteristics of dyslexia and establish a computational framework for investigating reading disorders.

Title: Hype or not? Formalizing Automatic Promotional Language Detection in Biomedical Research

Authors: Bojan Batalo, Erica K. Shimomoto, Neil Millar
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2509.24638
Pdf URL: https://arxiv.org/pdf/2509.24638
Copy Paste: [[2509.24638]] Hype or not? Formalizing Automatic Promotional Language Detection in Biomedical Research(https://arxiv.org/abs/2509.24638)
Keywords: language model
Abstract: In science, promotional language ('hype') is increasing and can undermine objective evaluation of evidence, impede research development, and erode trust in science. In this paper, we introduce the task of automatic detection of hype, which we define as hyperbolic or subjective language that authors use to glamorize, promote, embellish, or exaggerate aspects of their research. We propose formalized guidelines for identifying hype language and apply them to annotate a portion of the National Institutes of Health (NIH) grant application corpus. We then evaluate traditional text classifiers and language models on this task, comparing their performance with a human baseline. Our experiments show that formalizing annotation guidelines can help humans reliably annotate candidate hype adjectives and that using our annotated dataset to train machine learning models yields promising results. Our findings highlight the linguistic complexity of the task, and the potential need for domain knowledge and temporal awareness of the facts. While some linguistic works address hype detection, to the best of our knowledge, we are the first to approach it as a natural language processing task.

Title: InfLLM-V2: Dense-Sparse Switchable Attention for Seamless Short-to-Long Adaptation

Authors: Weilin Zhao, Zihan Zhou, Zhou Su, Chaojun Xiao, Yuxuan Li, Yanghao Li, Yudi Zhang, Weilun Zhao, Zhen Li, Yuxiang Huang, Ao Sun, Xu Han, Zhiyuan Liu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.24663
Pdf URL: https://arxiv.org/pdf/2509.24663
Copy Paste: [[2509.24663]] InfLLM-V2: Dense-Sparse Switchable Attention for Seamless Short-to-Long Adaptation(https://arxiv.org/abs/2509.24663)
Keywords: language model, llm, chain-of-thought
Abstract: Long-sequence processing is a critical capability for modern large language models. However, the self-attention mechanism in the standard Transformer architecture faces severe computational and memory bottlenecks when processing long sequences. While trainable sparse attention methods offer a promising solution, existing approaches such as NSA introduce excessive extra parameters and disrupt the conventional pretrain-on-short, finetune-on-long workflow, resulting in slow convergence and difficulty in acceleration. To overcome these limitations, we introduce a dense-sparse switchable attention framework, termed InfLLM-V2. InfLLM-V2 is a trainable sparse attention method that seamlessly adapts models from short to long sequences. Specifically, InfLLM-V2 reuses dense attention parameters through parameter-free architecture modification, maintaining consistency between short and long sequence processing. Additionally, InfLLM-V2 ensures computational efficiency across all sequence lengths by using dense attention for short inputs and smoothly transitioning to sparse attention for long sequences. To achieve practical acceleration, we further introduce an efficient implementation of InfLLM-V2 that significantly reduces the computational overhead. Our experiments on long-context understanding and chain-of-thought reasoning demonstrate that InfLLM-V2 is 4x faster than dense attention while retaining 98.1% and 99.7% of the performance, respectively. Based on the InfLLM-V2 framework, we have trained and open-sourced MiniCPM4.1 (this https URL), a hybrid reasoning model, providing a reproducible implementation for the research community.

Title: Understanding the Dilemma of Unlearning for Large Language Models

Authors: Qingjie Zhang, Haoting Qian, Zhicong Huang, Cheng Hong, Minlie Huang, Ke Xu, Chao Zhang, Han Qiu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.24675
Pdf URL: https://arxiv.org/pdf/2509.24675
Copy Paste: [[2509.24675]] Understanding the Dilemma of Unlearning for Large Language Models(https://arxiv.org/abs/2509.24675)
Keywords: language model, llm, prompt
Abstract: Unlearning seeks to remove specific knowledge from large language models (LLMs), but its effectiveness remains contested. On one side, "forgotten" knowledge can often be recovered through interventions such as light fine-tuning; on the other side, unlearning may induce catastrophic forgetting that degrades general capabilities. Despite active exploration of unlearning methods, interpretability analyses of the mechanism are scarce due to the difficulty of tracing knowledge in LLMs' complex architectures. We address this gap by proposing unPact, an interpretable framework for unlearning via prompt attribution and contribution tracking. Specifically, it quantifies each prompt token's influence on outputs, enabling pre- and post-unlearning comparisons to reveal what changes. Across six mainstream unlearning methods, three LLMs, and three benchmarks, we find that: (1) Unlearning appears to be effective by disrupting focus on keywords in the prompt; (2) Much of the knowledge is not truly erased and can be recovered by simply emphasizing these keywords in prompts, without modifying the model's weights; (3) Catastrophic forgetting arises from indiscriminate penalization of all tokens. Taken together, our results suggest an unlearning dilemma: existing methods tend to be either insufficient (knowledge remains recoverable by keyword emphasis) or overly destructive (general performance collapses due to catastrophic forgetting), still leaving a gap to reliable unlearning.

Title: Reference-Free Rating of LLM Responses via Latent Information

Authors: Leander Girrbach, Chi-Ping Su, Tankred Saanum, Richard Socher, Eric Schulz, Zeynep Akata
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.24678
Pdf URL: https://arxiv.org/pdf/2509.24678
Copy Paste: [[2509.24678]] Reference-Free Rating of LLM Responses via Latent Information(https://arxiv.org/abs/2509.24678)
Keywords: llm, prompt
Abstract: How reliable are single-response LLM-as-a-judge ratings without references, and can we obtain fine-grained, deterministic scores in this setting? We study the common practice of asking a judge model to assign Likert-scale scores to free-text responses and show two systematic issues: scores are unstable under sampling and poorly calibrated, leading to compression near the top of the scale and frequent ties. We then propose and evaluate Latent Judges, which derive scalar ratings from internal model signals: (i) probability-weighted scores over integer ratings, (ii) verifier-style probabilities of "yes", and (iii) linear probes trained on model activations at the rating position. Across a broad suite of pairwise and single-rating benchmarks, latent methods match or surpass standard prompting, with consistent gains on pairwise accuracy and listwise ranking relevant to Best-of-N selection. Probability-weighted scores achieve the strongest single-rating correlations, while probes recover useful signals when output logits are miscalibrated. These results indicate that latent information provides deterministic and more discriminative signals for reference-free evaluation, and can improve selection and training approaches like Best-of-N, multi-teacher distillation, and routing.
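Option (i) in the abstract above, a probability-weighted score over integer ratings, amounts to taking the expected rating under the judge's distribution over the rating tokens instead of sampling or taking the most likely one. The sketch below shows that computation under stated assumptions: the function name and the logit values are hypothetical, and how the logits are read from a particular model is outside its scope.

```python
import math


def probability_weighted_score(rating_logits):
    """Expected rating under a softmax over the logits of the rating tokens.

    `rating_logits` maps each integer rating to the judge's logit for it.
    """
    ratings = sorted(rating_logits)
    peak = max(rating_logits.values())   # subtract the max for numerical stability
    weights = [math.exp(rating_logits[r] - peak) for r in ratings]
    total = sum(weights)
    return sum(r * w for r, w in zip(ratings, weights)) / total


# Hypothetical logits a judge model assigns to the tokens "1" to "5"
logits = {1: -2.0, 2: -1.0, 3: 0.5, 4: 2.0, 5: 1.5}
score = probability_weighted_score(logits)
```

Because the result is a deterministic function of the logits and takes continuous values, two responses that would both receive a "4" under greedy decoding can still be ordered, which addresses the instability and tie problems the abstract describes.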

Title: MemGen: Weaving Generative Latent Memory for Self-Evolving Agents

Authors: Guibin Zhang, Muxin Fu, Shuicheng Yan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.24704
Pdf URL: https://arxiv.org/pdf/2509.24704
Copy Paste: [[2509.24704]] MemGen: Weaving Generative Latent Memory for Self-Evolving Agents(https://arxiv.org/abs/2509.24704)
Keywords: language model, llm, agent
Abstract: Agent memory shapes how Large Language Model (LLM)-powered agents, akin to the human brain, progressively refine themselves through environment interactions. Existing paradigms remain constrained: parametric memory forcibly adjusts model parameters, and retrieval-based memory externalizes experience into structured databases, yet neither captures the fluid interweaving of reasoning and memory that underlies human cognition. To address this gap, we propose MemGen, a dynamic generative memory framework that equips agents with a human-esque cognitive faculty. It consists of a memory trigger, which monitors the agent's reasoning state to decide explicit memory invocation, and a memory weaver, which takes the agent's current state as stimulus to construct a latent token sequence as machine-native memory to enrich its reasoning. In this way, MemGen enables agents to recall and augment latent memory throughout reasoning, producing a tightly interwoven cycle of memory and cognition. Extensive experiments across eight benchmarks show that MemGen surpasses leading external memory systems such as ExpeL and AWM by up to 38.22%, exceeds GRPO by up to 13.44%, and exhibits strong cross-domain generalization ability. More importantly, we find that without explicit supervision, MemGen spontaneously evolves distinct human-like memory faculties, including planning memory, procedural memory, and working memory, suggesting an emergent trajectory toward more naturalistic forms of machine cognition.

Title: Socratic-Zero : Bootstrapping Reasoning via Data-Free Agent Co-evolution

Authors: Shaobo Wang, Zhengbo Jiao, Zifan Zhang, Yilang Peng, Xu Ze, Boyu Yang, Wei Wang, Hu Wei, Linfeng Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.24726
Pdf URL: https://arxiv.org/pdf/2509.24726
Copy Paste: [[2509.24726]] Socratic-Zero : Bootstrapping Reasoning via Data-Free Agent Co-evolution(https://arxiv.org/abs/2509.24726)
Keywords: language model, gpt, llm, agent
Abstract: Recent breakthroughs in large language models (LLMs) on reasoning tasks rely heavily on massive, high-quality datasets, typically human-annotated and thus difficult to scale. While data synthesis or distillation offers a promising alternative, existing methods struggle with inconsistent data quality and an inability to dynamically adapt to the evolving capabilities of the model, leading to suboptimal training signals. To address these limitations, we introduce Socratic-Zero, a fully autonomous framework that generates high-quality training data from minimal seed examples through the co-evolution of three agents: the Teacher, the Solver, and the Generator. The Solver continuously refines its reasoning by learning from preference feedback on both successful and failed trajectories; the Teacher adaptively crafts increasingly challenging questions based on the Solver's weaknesses; and the Generator distills the Teacher's question-design strategy to enable scalable, high-fidelity curriculum generation. This closed-loop system produces a self-improving curriculum, requiring no pre-existing tasks or labels. Remarkably, starting from only 100 seed questions, our Socratic-Solver-8B achieves an average gain of +20.2 percentage points over prior data synthesis methods across seven mathematical reasoning benchmarks (AMC23, AIME24-25, Olympiad, MATH-500, Minerva, and GSM8K), with consistent gains on both Qwen3 and GLM4 series models. Even more surprisingly, synthetic data from Socratic-Generator-32B enables student LLMs to achieve superior performance compared to other state-of-the-art (SOTA) commercial LLMs on these benchmarks, including Qwen3-235B-A22B, DeepSeek-V3.1-671B, GPT-5, Gemini-2.5-Pro, Grok-4, and Claude-4.1-Opus.
Summary: Recent LLM reasoning gains depend on large, high-quality, mostly human-annotated datasets that are hard to scale, while existing data synthesis and distillation methods suffer from uneven data quality and cannot adapt to the model's evolving abilities. Socratic-Zero is a fully autonomous framework in which three co-evolving agents (Teacher, Solver, Generator) produce training data from minimal seed examples, with no pre-existing tasks or labels. Starting from 100 seed questions, Socratic-Solver-8B gains an average of +20.2 percentage points over prior data synthesis methods on seven mathematical reasoning benchmarks, and data from Socratic-Generator-32B lets student LLMs outperform several SOTA commercial LLMs on those benchmarks.

Title: ProxyAttn: Guided Sparse Attention via Representative Heads

Authors: Yixuan Wang, Huang He, Siqi Bao, Hua Wu, Haifeng Wang, Qingfu Zhu, Wanxiang Che
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2509.24745
Pdf URL: https://arxiv.org/pdf/2509.24745
Copy Paste: [[2509.24745]] ProxyAttn: Guided Sparse Attention via Representative Heads(https://arxiv.org/abs/2509.24745)
Keywords: language model, llm
Abstract: The quadratic complexity of attention mechanisms limits the efficiency of Large Language Models (LLMs) on long-text tasks. Recently, methods that dynamically estimate block importance have enabled efficient block sparse attention, leading to significant acceleration in long-text pre-filling of LLMs. However, their coarse-grained estimation inevitably leads to performance degradation at high sparsity rates. In this work, we propose ProxyAttn, a training-free sparse attention algorithm that achieves more precise block estimation by compressing the dimension of attention heads. Based on our observation of the similarity among multiple attention heads, we use the scores of pooled representative heads to approximate the scores for all heads. To account for the varying sparsity among heads, we also propose a block-aware dynamic budget estimation method. By combining the scores from representative proxy heads with multi-head dynamic budgets, we achieve a more fine-grained block importance evaluation at low computational cost. Experiments on a variety of mainstream models and extensive benchmarks confirm the underlying similarity among attention heads. Leveraging a fine-grained estimation, the proposed method achieves substantial gains in performance and efficiency compared to existing methods. More precisely, ProxyAttn can achieve up to 10.3x attention acceleration and 2.4x prefilling acceleration without significant performance loss. Our code is available at this https URL.
Summary: The quadratic cost of attention limits LLM efficiency on long texts. Methods that dynamically estimate block importance enable block sparse attention and faster long-text prefilling, but their coarse-grained estimates degrade performance at high sparsity. ProxyAttn is a training-free sparse attention algorithm that exploits the similarity among attention heads: scores from pooled representative heads approximate the scores of all heads, and a block-aware dynamic budget accounts for the differing sparsity of each head. It reaches up to 10.3x attention acceleration and 2.4x prefilling acceleration without significant performance loss.
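
The proxy-head idea described in the ProxyAttn abstract can be illustrated with a small sketch. This is a simplified illustration under stated assumptions, not the authors' algorithm: the function names (`mean_pool`, `block_means`, `proxy_block_scores`, `top_blocks`) are hypothetical, heads are pooled by a plain mean, causal masking is ignored, and a fixed per-row budget stands in for the paper's block-aware dynamic budget.

```python
from typing import List

Matrix = List[List[float]]  # shape: [tokens][dim]


def mean_pool(mats: List[Matrix]) -> Matrix:
    """Average same-shaped matrices element-wise (several heads -> one proxy head)."""
    n = len(mats)
    rows, cols = len(mats[0]), len(mats[0][0])
    return [[sum(m[i][j] for m in mats) / n for j in range(cols)] for i in range(rows)]


def block_means(x: Matrix, block: int) -> Matrix:
    """Average the token vectors inside each contiguous block of `block` tokens."""
    out = []
    for start in range(0, len(x), block):
        chunk = x[start:start + block]
        out.append([sum(col) / len(chunk) for col in zip(*chunk)])
    return out


def proxy_block_scores(q_heads: List[Matrix], k_heads: List[Matrix], block: int) -> Matrix:
    """Score every (query block, key block) pair using one pooled proxy head."""
    q = block_means(mean_pool(q_heads), block)
    k = block_means(mean_pool(k_heads), block)
    return [[sum(a * b for a, b in zip(qi, kj)) for kj in k] for qi in q]


def top_blocks(scores: Matrix, budget: int) -> List[List[int]]:
    """For each query block, keep the indices of the `budget` highest-scoring key blocks."""
    return [sorted(sorted(range(len(row)), key=lambda j: -row[j])[:budget]) for row in scores]


if __name__ == "__main__":
    head = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
    scores = proxy_block_scores([head, head], [head, head], block=2)
    print(top_blocks(scores, budget=1))  # [[0], [1]]
```

The selected block indices would then drive a block sparse attention kernel shared by all heads in the group, so the estimation cost is paid once per group rather than once per head.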

Title: LatentEvolve: Self-Evolving Test-Time Scaling in Latent Space

Authors: Guibin Zhang, Fanci Meng, Guancheng Wan, Zherui Li, Kun Wang, Zhenfei Yin, Lei Bai, Shuicheng Yan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.24771
Pdf URL: https://arxiv.org/pdf/2509.24771
Copy Paste: [[2509.24771]] LatentEvolve: Self-Evolving Test-Time Scaling in Latent Space(https://arxiv.org/abs/2509.24771)
Keywords: language model, llm
Abstract: Test-time Scaling (TTS) has been demonstrated to significantly enhance the reasoning capabilities of Large Language Models (LLMs) during the inference phase without altering model parameters. However, existing TTS methods are largely independent, implying that LLMs have not yet evolved to progressively learn how to scale more effectively. With the objective of evolving LLMs to learn "how to scale test-time computation," we propose LatentEvolve, a self-evolving latent TTS framework inspired by the complementary learning system (CLS) theory. Analogous to the human brain's dual system of a fast-recall hippocampus and a slow-consolidating neocortex, LatentEvolve comprises two evolutionary components: daytime scaling, which rapidly retrieves historical latent representations to better guide current LLM reasoning; and nighttime scaling, which integrates past latent optimizations in a manner akin to the human brain's consolidation of experiences during sleep. The alternation of daytime and nighttime processes facilitates a fast and slow evolution of LLM TTS, mirroring human cognitive dynamics in a fully unsupervised manner. Extensive experiments across eight benchmarks and five model backbones demonstrate that our LatentEvolve surpasses state-of-the-art TTS methods such as LatentSeek and TTRL by up to 13.33% and exhibits exceptional cross-domain and cross-backbone generalization.
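Summary: Test-time scaling (TTS) improves LLM reasoning at inference without changing model parameters, but existing TTS methods operate independently and do not let the model learn how to scale more effectively over time. LatentEvolve is a self-evolving latent TTS framework inspired by complementary learning system (CLS) theory. Daytime scaling quickly retrieves historical latent representations to guide current reasoning, and nighttime scaling consolidates past latent optimizations, much as the brain consolidates experience during sleep. Across eight benchmarks and five model backbones, it surpasses state-of-the-art TTS methods such as LatentSeek and TTRL by up to 13.33% and generalizes well across domains and backbones.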

Title: SeaPO: Strategic Error Amplification for Robust Preference Optimization of Large Language Models

Authors: Jun Rao, Yunjie Liao, Xuebo Liu, Zepeng Lin, Lian Lian, Dong Jin, Shengjun Cheng, Jun Yu, Min Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.24781
Pdf URL: https://arxiv.org/pdf/2509.24781
Copy Paste: [[2509.24781]] SeaPO: Strategic Error Amplification for Robust Preference Optimization of Large Language Models(https://arxiv.org/abs/2509.24781)
Keywords: language model, llm
Abstract: Existing alignment methods for preference optimization of large language models (LLMs) aim to enhance model performance by utilizing pairs of positive and negative samples. However, due to the limited capacity of models in scoring or generating responses, the quality of positive and negative samples may become similar during training, which complicates optimization for preference learning. To address this issue, we introduce SeaPO, a Strategic Error Amplification method that leverages three error types commonly occurring in LLMs to introduce specific error patterns into the model Preference Optimization. This strategy ensures that negative samples are more erroneous than positive samples and preference-based training is employed to mitigate the occurrence of these errors, thereby enhancing model performance. Evaluations across five capability dimensions and different model scales (1.5B to 14B) demonstrate that the generated data significantly improved overall model performance, particularly in terms of truthfulness, with improvements of 5-10 percentage points observed. Further analysis reveals that task performance varies depending on the error types introduced. Injecting the most common error types improves performance in related tasks, while a mix of error types leads to a broader performance enhancement: most tasks show stable improvements, while a few tasks exhibit significant gains.
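Summary: Preference optimization methods for LLMs rely on pairs of positive and negative samples, but because models have limited capacity for scoring or generating responses, the two kinds of sample can become similar in quality during training, which complicates preference learning. SeaPO is a Strategic Error Amplification method that injects three error types common in LLMs into the negative samples, ensuring they are more erroneous than the positive ones, and then uses preference-based training to reduce these errors. Evaluations across five capability dimensions and model scales from 1.5B to 14B show significant overall gains, especially in truthfulness (5-10 percentage points). Injecting the most common error types improves related tasks, while mixing error types yields broader improvement.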

Title: Evaluating Spatiotemporal Consistency in Automatically Generated Sewing Instructions

Authors: Luisa Geiger, Mareike Hartmann, Michael Sullivan, Alexander Koller
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.24792
Pdf URL: https://arxiv.org/pdf/2509.24792
Copy Paste: [[2509.24792]] Evaluating Spatiotemporal Consistency in Automatically Generated Sewing Instructions(https://arxiv.org/abs/2509.24792)
Keywords: llm
Abstract: In this paper, we propose a novel, automatic tree-based evaluation metric for LLM-generated step-by-step assembly instructions, that more accurately reflects spatiotemporal aspects of construction than traditional metrics such as BLEU and BERT similarity scores. We apply our proposed metric to the domain of sewing instructions, and show that our metric better correlates with manually-annotated error counts as well as human quality ratings, demonstrating our metric's superiority for evaluating the spatiotemporal soundness of sewing instructions. Further experiments show that our metric is more robust than traditional approaches against artificially-constructed counterfactual examples that are specifically constructed to confound metrics that rely on textual similarity.
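Summary: This paper proposes a novel, automatic tree-based evaluation metric for LLM-generated step-by-step assembly instructions that reflects the spatiotemporal aspects of construction more accurately than traditional metrics such as BLEU and BERT similarity scores. Applied to sewing instructions, the metric correlates better with manually annotated error counts and human quality ratings. Further experiments show it is more robust than traditional approaches against counterfactual examples built specifically to confound metrics that rely on textual similarity.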

Title: KnowGuard: Knowledge-Driven Abstention for Multi-Round Clinical Reasoning

Authors: Xilin Dang, Kexin Chen, Xiaorui Su, Ayush Noori, Iñaki Arango, Lucas Vittor, Xinyi Long, Yuyang Du, Marinka Zitnik, Pheng Ann Heng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.24816
Pdf URL: https://arxiv.org/pdf/2509.24816
Copy Paste: [[2509.24816]] KnowGuard: Knowledge-Driven Abstention for Multi-Round Clinical Reasoning(https://arxiv.org/abs/2509.24816)
Keywords: language model, llm
Abstract: In clinical practice, physicians refrain from making decisions when patient information is insufficient. This behavior, known as abstention, is a critical safety mechanism preventing potentially harmful misdiagnoses. Recent investigations have reported the application of large language models (LLMs) in medical scenarios. However, existing LLMs struggle with abstention, frequently providing overconfident responses despite incomplete information. This limitation stems from conventional abstention methods relying solely on model self-assessments, which lack systematic strategies to identify knowledge boundaries with external medical evidence. To address this, we propose KnowGuard, a novel investigate-before-abstain paradigm that integrates systematic knowledge graph exploration for clinical decision-making. Our approach consists of two key stages operating on a shared contextualized evidence pool: 1) an evidence discovery stage that systematically explores the medical knowledge space through graph expansion and direct retrieval, and 2) an evidence evaluation stage that ranks evidence using multiple factors to adapt exploration based on patient context and conversation history. This two-stage approach enables systematic knowledge graph exploration, allowing models to trace structured reasoning paths and recognize insufficient medical evidence. We evaluate our abstention approach using open-ended multi-round clinical benchmarks that mimic realistic diagnostic scenarios, assessing abstention quality through accuracy-efficiency trade-offs beyond existing closed-form evaluations. Experimental evidence clearly demonstrates that KnowGuard outperforms state-of-the-art abstention approaches, improving diagnostic accuracy by 3.93% while reducing unnecessary interaction by 7.27 turns on average.
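Summary: Physicians abstain from decisions when patient information is insufficient, a key safeguard against harmful misdiagnoses, whereas existing LLMs often answer overconfidently despite incomplete information. Conventional abstention methods rely only on model self-assessment and lack systematic ways to locate knowledge boundaries using external medical evidence. KnowGuard introduces an investigate-before-abstain paradigm built on knowledge graph exploration, with an evidence discovery stage (graph expansion and direct retrieval) and an evidence evaluation stage (multi-factor ranking conditioned on patient context and conversation history). On open-ended multi-round clinical benchmarks it outperforms state-of-the-art abstention approaches, improving diagnostic accuracy by 3.93% and cutting unnecessary interaction by 7.27 turns on average.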

Title: SemShareKV: Efficient KVCache Sharing for Semantically Similar Prompts via Token-Level LSH Matching

Authors: Xinye Zhao, Spyridon Mastorakis
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.24832
Pdf URL: https://arxiv.org/pdf/2509.24832
Copy Paste: [[2509.24832]] SemShareKV: Efficient KVCache Sharing for Semantically Similar Prompts via Token-Level LSH Matching(https://arxiv.org/abs/2509.24832)
Keywords: language model, llm, prompt, agent
Abstract: As large language models (LLMs) continue to scale, the memory footprint of key-value (KV) caches during inference has become a significant bottleneck. Existing approaches primarily focus on compressing KV caches within a single prompt or reusing shared prefixes or frequently occurring text segments across prompts. However, such strategies are limited in scenarios where prompts are semantically similar but lexically different, as frequently occurs in tasks such as multi-document summarization and conversational agents. We propose SemShareKV, a KV cache sharing and compression framework that accelerates LLM inference by reusing KVCache in semantically similar prompts. Instead of relying on exact token matches, SemShareKV applies fuzzy token matching using locality-sensitive hashing (LSH) on token embeddings and incorporates Rotary Position Embedding (RoPE) to better preserve positional information. By selectively reusing relevant key-value pairs from a reference prompt's cache, SemShareKV reduces redundant computation while maintaining output quality. Experiments on diverse summarization datasets show up to 6.25x speedup and 42% lower GPU memory usage with 5k-token inputs, with negligible quality degradation. These results highlight the potential of semantic-aware cache sharing for efficient LLM inference.
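Summary: As LLMs scale, the memory footprint of the KV cache during inference has become a major bottleneck. Existing approaches compress the cache within one prompt or reuse shared prefixes and frequently occurring segments, which helps little when prompts are semantically similar but lexically different, as in multi-document summarization and conversational agents. SemShareKV is a KV cache sharing and compression framework that reuses the cache of a reference prompt through fuzzy token matching with locality-sensitive hashing (LSH) over token embeddings, and incorporates Rotary Position Embedding (RoPE) to preserve positional information. On diverse summarization datasets it achieves up to 6.25x speedup and 42% lower GPU memory usage with 5k-token inputs, with negligible quality loss.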
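
The fuzzy token matching mentioned in the SemShareKV abstract can be sketched with random-hyperplane LSH, a standard LSH family for cosine similarity. The abstract does not say which LSH family or matching rule the authors use, so everything here is an assumption for illustration: the helper names (`make_planes`, `signature`, `build_index`, `match_tokens`) are hypothetical, the embeddings are toy vectors, and the first token in a matching bucket is taken as the match.

```python
import random
from typing import Dict, List, Optional, Tuple

Vector = List[float]
Signature = Tuple[int, ...]


def make_planes(dim: int, n_bits: int, seed: int = 0) -> List[Vector]:
    """Draw `n_bits` random Gaussian hyperplanes in `dim` dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def signature(vec: Vector, planes: List[Vector]) -> Signature:
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(1 if sum(v * p for v, p in zip(vec, plane)) >= 0 else 0
                 for plane in planes)


def build_index(ref_embeddings: List[Vector], planes: List[Vector]) -> Dict[Signature, List[int]]:
    """Bucket the reference prompt's token positions by LSH signature."""
    index: Dict[Signature, List[int]] = {}
    for pos, vec in enumerate(ref_embeddings):
        index.setdefault(signature(vec, planes), []).append(pos)
    return index


def match_tokens(new_embeddings: List[Vector], index: Dict[Signature, List[int]],
                 planes: List[Vector]) -> List[Optional[int]]:
    """For each new token, return a reference position in the same bucket, or None."""
    matches: List[Optional[int]] = []
    for vec in new_embeddings:
        bucket = index.get(signature(vec, planes))
        matches.append(bucket[0] if bucket else None)
    return matches


if __name__ == "__main__":
    planes = make_planes(dim=3, n_bits=8)
    index = build_index([[1.0, 2.0, -1.0]], planes)
    # A same-direction vector matches; the opposite direction does not.
    print(match_tokens([[2.0, 4.0, -2.0], [-1.0, -2.0, 1.0]], index, planes))
```

Tokens that receive a match could reuse the reference prompt's key-value entries, while unmatched tokens (`None`) would be recomputed. More signature bits make buckets stricter, trading reuse rate for match quality.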

Title: Hierarchical Error Correction for Large Language Models: A Systematic Framework for Domain-Specific AI Quality Enhancement

Authors: Zhilong Zhao, Yindi Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.24841
Pdf URL: https://arxiv.org/pdf/2509.24841
Copy Paste: [[2509.24841]] Hierarchical Error Correction for Large Language Models: A Systematic Framework for Domain-Specific AI Quality Enhancement(https://arxiv.org/abs/2509.24841)
Keywords: language model, llm
Abstract: Large Language Models face significant performance challenges in specialized domains, with state-of-the-art models achieving only 45.9% accuracy on medical coding tasks. This study proposes a Hierarchical Error Correction (HEC) framework that addresses domain-specific AI limitations through systematic error analysis and targeted intervention strategies. We analyze error patterns across four specialized domains and find that AI errors follow consistent hierarchical structures: Knowledge-layer errors (58.4%), Reasoning-layer errors (39.6%), and Complexity-layer errors (2.0%). Based on these patterns, we develop a three-stage correction framework that addresses errors according to their hierarchical importance and demonstrates that framework effectiveness correlates inversely with baseline task performance. Experimental validation across medical transcription (4,921 cases), legal document classification (1,000 cases), political bias detection (645 cases), and legal reasoning (1,000 cases) shows consistent improvements. Cross-model validation across five LLM architectures demonstrates average improvements of 11.2 percentage points (p < 0.001). However, analysis reveals framework limitations in high-baseline tasks (>75% accuracy), where hierarchical intervention may interfere with effective reasoning processes. The results suggest that systematic error analysis can guide effective AI enhancement strategies in specialized domains, particularly for moderate-baseline tasks, while highlighting the importance of understanding framework boundaries for optimal deployment.
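Summary: LLMs face significant performance challenges in specialized domains, with state-of-the-art models reaching only 45.9% accuracy on medical coding tasks. This study proposes a Hierarchical Error Correction (HEC) framework based on systematic error analysis and targeted intervention. Across four specialized domains, errors follow a consistent hierarchy: Knowledge-layer (58.4%), Reasoning-layer (39.6%), and Complexity-layer (2.0%). A three-stage correction framework built on this hierarchy yields consistent improvements on medical transcription (4,921 cases), legal document classification (1,000 cases), political bias detection (645 cases), and legal reasoning (1,000 cases), averaging 11.2 percentage points across five LLM architectures (p < 0.001). Its effectiveness correlates inversely with baseline performance, and on high-baseline tasks (>75% accuracy) the intervention may interfere with effective reasoning.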

Title: Between Help and Harm: An Evaluation of Mental Health Crisis Handling by LLMs

Authors: Adrian Arnaiz-Rodriguez, Miguel Baidal, Erik Derner, Jenn Layton Annable, Mark Ball, Mark Ince, Elvira Perez Vallejos, Nuria Oliver
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2509.24857
Pdf URL: https://arxiv.org/pdf/2509.24857
Copy Paste: [[2509.24857]] Between Help and Harm: An Evaluation of Mental Health Crisis Handling by LLMs(https://arxiv.org/abs/2509.24857)
Keywords: language model, gpt, llm, chat
Abstract: The widespread use of chatbots powered by large language models (LLMs) such as ChatGPT and Llama has fundamentally reshaped how people seek information and advice across domains. Increasingly, these chatbots are being used in high-stakes contexts, including emotional support and mental health concerns. While LLMs can offer scalable support, their ability to safely detect and respond to acute mental health crises remains poorly understood. Progress is hampered by the absence of unified crisis taxonomies, robust annotated benchmarks, and empirical evaluations grounded in clinical best practices. In this work, we address these gaps by introducing a unified taxonomy of six clinically-informed mental health crisis categories, curating a diverse evaluation dataset, and establishing an expert-designed protocol for assessing response appropriateness. We systematically benchmark three state-of-the-art LLMs for their ability to classify crisis types and generate safe, appropriate responses. The results reveal that while LLMs are highly consistent and generally reliable in addressing explicit crisis disclosures, significant risks remain. A non-negligible proportion of responses are rated as inappropriate or harmful, with responses generated by an open-weight model exhibiting higher failure rates than those generated by the commercial ones. We also identify systemic weaknesses in handling indirect or ambiguous risk signals, a reliance on formulaic and inauthentic default replies, and frequent misalignment with user context. These findings underscore the urgent need for enhanced safeguards, improved crisis detection, and context-aware interventions in LLM deployments. Our taxonomy, datasets, and evaluation framework lay the groundwork for ongoing research and responsible innovation in AI-driven mental health support, helping to minimize harm and better protect vulnerable users.
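Summary: LLM-powered chatbots such as ChatGPT and Llama are increasingly used in high-stakes contexts, including emotional support and mental health, yet their ability to safely detect and respond to acute mental health crises remains poorly understood. This work introduces a unified taxonomy of six clinically informed crisis categories, a diverse evaluation dataset, and an expert-designed protocol for assessing response appropriateness, then benchmarks three state-of-the-art LLMs. The models are consistent and generally reliable on explicit crisis disclosures, but a non-negligible share of responses is rated inappropriate or harmful, with the open-weight model failing more often than the commercial ones. Systemic weaknesses include handling of indirect or ambiguous risk signals, formulaic and inauthentic default replies, and frequent misalignment with user context.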

Title: Metaphor identification using large language models: A comparison of RAG, prompt engineering, and fine-tuning

Authors: Matteo Fuoli, Weihang Huang, Jeannette Littlemore, Sarah Turner, Ellen Wilding
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.24866
Pdf URL: https://arxiv.org/pdf/2509.24866
Copy Paste: [[2509.24866]] Metaphor identification using large language models: A comparison of RAG, prompt engineering, and fine-tuning(https://arxiv.org/abs/2509.24866)
Keywords: language model, llm, prompt, retrieval-augmented generation, chain-of-thought
Abstract: Metaphor is a pervasive feature of discourse and a powerful lens for examining cognition, emotion, and ideology. Large-scale analysis, however, has been constrained by the need for manual annotation due to the context-sensitive nature of metaphor. This study investigates the potential of large language models (LLMs) to automate metaphor identification in full texts. We compare three methods: (i) retrieval-augmented generation (RAG), where the model is provided with a codebook and instructed to annotate texts based on its rules and examples; (ii) prompt engineering, where we design task-specific verbal instructions; and (iii) fine-tuning, where the model is trained on hand-coded texts to optimize performance. Within prompt engineering, we test zero-shot, few-shot, and chain-of-thought strategies. Our results show that state-of-the-art closed-source LLMs can achieve high accuracy, with fine-tuning yielding a median F1 score of 0.79. A comparison of human and LLM outputs reveals that most discrepancies are systematic, reflecting well-known grey areas and conceptual challenges in metaphor theory. We propose that LLMs can be used to at least partly automate metaphor identification and can serve as a testbed for developing and refining metaphor identification protocols and the theory that underpins them.
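Summary: Metaphor is a pervasive feature of discourse and a powerful lens on cognition, emotion, and ideology, but large-scale analysis has been limited by the need for manual annotation. This study examines whether LLMs can automate metaphor identification in full texts, comparing (i) retrieval-augmented generation (RAG) with a codebook, (ii) prompt engineering with zero-shot, few-shot, and chain-of-thought strategies, and (iii) fine-tuning on hand-coded texts. State-of-the-art closed-source LLMs achieve high accuracy, with fine-tuning yielding a median F1 score of 0.79. Most discrepancies between human and LLM outputs are systematic and reflect known grey areas in metaphor theory, so LLMs can at least partly automate identification and serve as a testbed for refining identification protocols.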

Title: Expanding Computation Spaces of LLMs at Inference Time

Authors: Yoonna Jang, Kisu Yang, Isabelle Augenstein
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.24884
Pdf URL: https://arxiv.org/pdf/2509.24884
Copy Paste: [[2509.24884]] Expanding Computation Spaces of LLMs at Inference Time(https://arxiv.org/abs/2509.24884)
Keywords: language model, llm, chain-of-thought
Abstract: Chain-of-thought (CoT) rationale enables language models to use additional task-related text for problem-solving, benefiting not only from detailed reasoning steps but also from the expanded computational space of longer inputs. Prior work has trained filler or special tokens to serve as additional computation spaces. In this study, we investigate whether language models can leverage artificially inserted sequences of filler tokens solely at inference. We first identify effective token types, numbers, and insertion locations, then examine at what stage of training models begin to exploit the expanded computation space, and finally analyze dynamics within these spaces via attention maps. Experiments on models ranging from 1.7B to 32B across open-domain QA and math tasks show that appropriate token types and counts vary, but placing filler tokens directly before the final 'Answer:' token is most effective. Smaller models benefit most, up to 12.372 percentage points in SmolLM2-1.7B-Instruct, indicating that these spaces act as additional computational capacity rather than redundant input. Attention maps reveal that expanded spaces often continue the original attention mechanism and sometimes focus on questions or answer options, suggesting meaningful computation for problem-solving.
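Summary: Chain-of-thought (CoT) rationales help language models not only through detailed reasoning steps but also through the expanded computational space of longer inputs. Prior work trained filler or special tokens to serve as extra computation space; this study asks whether models can exploit artificially inserted filler-token sequences purely at inference time. The authors identify effective token types, counts, and insertion locations, examine when during training models begin to exploit the extra space, and analyze attention maps. Across models from 1.7B to 32B on open-domain QA and math tasks, the best token types and counts vary, but placing filler tokens directly before the final 'Answer:' token is most effective. Smaller models benefit most, with gains of up to 12.372 percentage points for SmolLM2-1.7B-Instruct.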
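
The most effective placement reported in this abstract, filler tokens directly before the final 'Answer:' token, can be illustrated with a small prompt-building helper. This is a sketch at the string level only, under stated assumptions: `insert_filler` is a hypothetical helper, the filler string `"."` and the count are placeholders rather than the token types and counts the authors found effective, and a real experiment would operate on tokenizer tokens.

```python
def insert_filler(prompt: str, filler: str = ".", count: int = 8,
                  marker: str = "Answer:") -> str:
    """Insert `count` space-separated filler strings directly before the last `marker`."""
    idx = prompt.rfind(marker)  # position of the final occurrence of the marker
    if idx == -1:
        raise ValueError(f"marker {marker!r} not found in prompt")
    padding = " ".join([filler] * count)
    return prompt[:idx] + padding + " " + prompt[idx:]


if __name__ == "__main__":
    prompt = "Q: 2+2?\nAnswer:"
    print(repr(insert_filler(prompt, filler=".", count=3)))
    # 'Q: 2+2?\n. . . Answer:'
```

Using `rfind` rather than `find` keeps the padding next to the final answer slot even when few-shot examples earlier in the prompt contain their own `Answer:` lines.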

Title: BOE-XSUM: Extreme Summarization in Clear Language of Spanish Legal Decrees and Notifications

Authors: Andrés Fernández García, Javier de la Rosa, Julio Gonzalo, Roser Morante, Enrique Amigó, Alejandro Benito-Santos, Jorge Carrillo-de-Albornoz, Víctor Fresno, Adrian Ghajari, Guillermo Marco, Laura Plaza, Eva Sánchez Salido
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.24908
Pdf URL: https://arxiv.org/pdf/2509.24908
Copy Paste: [[2509.24908]] BOE-XSUM: Extreme Summarization in Clear Language of Spanish Legal Decrees and Notifications(https://arxiv.org/abs/2509.24908)
Keywords: language model, gpt, llm
Abstract: The ability to summarize long documents succinctly is increasingly important in daily life due to information overload, yet there is a notable lack of such summaries for Spanish documents in general, and in the legal domain in particular. In this work, we present BOE-XSUM, a curated dataset comprising 3,648 concise, plain-language summaries of documents sourced from Spain's "Boletín Oficial del Estado" (BOE), the State Official Gazette. Each entry in the dataset includes a short summary, the original text, and its document type label. We evaluate the performance of medium-sized large language models (LLMs) fine-tuned on BOE-XSUM, comparing them to general-purpose generative models in a zero-shot setting. Results show that fine-tuned models significantly outperform their non-specialized counterparts. Notably, the best-performing model, BERTIN GPT-J 6B (32-bit precision), achieves a 24% performance gain over the top zero-shot model, DeepSeek-R1 (accuracies of 41.6% vs. 33.5%).
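Summary: Concise summaries of long documents are increasingly important, yet such summaries are scarce for Spanish documents, particularly in the legal domain. BOE-XSUM is a curated dataset of 3,648 concise, plain-language summaries of documents from Spain's "Boletín Oficial del Estado" (BOE), the State Official Gazette; each entry includes a short summary, the original text, and a document type label. Medium-sized LLMs fine-tuned on BOE-XSUM significantly outperform general-purpose generative models used zero-shot. The best model, BERTIN GPT-J 6B (32-bit precision), achieves a 24% performance gain over the top zero-shot model, DeepSeek-R1 (accuracies of 41.6% vs. 33.5%).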
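
The 24% figure in the BOE-XSUM abstract is consistent with a relative gain computed from the two reported accuracies, not an absolute difference in percentage points. The short check below makes the arithmetic explicit; `relative_gain` is a hypothetical helper, and reading the figure as a relative gain is an inference from the numbers, since the abstract does not state the formula.

```python
def relative_gain(new: float, baseline: float) -> float:
    """Relative improvement of `new` over `baseline`, as a fraction."""
    return (new - baseline) / baseline


if __name__ == "__main__":
    fine_tuned, zero_shot = 41.6, 33.5  # accuracies (%) reported in the abstract
    print(f"absolute: {fine_tuned - zero_shot:.1f} points")          # absolute: 8.1 points
    print(f"relative: {relative_gain(fine_tuned, zero_shot):.1%}")   # relative: 24.2%
```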

Title: How Well Do LLMs Imitate Human Writing Style?

Authors: Rebira Jemama, Rajesh Kumar
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2509.24930
Pdf URL: https://arxiv.org/pdf/2509.24930
Copy Paste: [[2509.24930]] How Well Do LLMs Imitate Human Writing Style?(https://arxiv.org/abs/2509.24930)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) can generate fluent text, but their ability to replicate the distinctive style of a specific human author remains unclear. We present a fast, training-free framework for authorship verification and style imitation analysis. The method integrates TF-IDF character n-grams with transformer embeddings and classifies text pairs through empirical distance distributions, eliminating the need for supervised training or threshold tuning. It achieves 97.5% accuracy on academic essays and 94.5% in cross-domain evaluation, while reducing training time by 91.8% and memory usage by 59% relative to parameter-based baselines. Using this framework, we evaluate five LLMs from three separate families (Llama, Qwen, Mixtral) across four prompting strategies: zero-shot, one-shot, few-shot, and text completion. Results show that the prompting strategy has a more substantial influence on style fidelity than model size: few-shot prompting yields up to 23.5x higher style-matching accuracy than zero-shot, and completion prompting reaches 99.9% agreement with the original author's style. Crucially, high-fidelity imitation does not imply human-like unpredictability: human essays average a perplexity of 29.5, whereas matched LLM outputs average only 15.2. These findings demonstrate that stylistic fidelity and statistical detectability are separable, establishing a reproducible basis for future work in authorship modeling, detection, and identity-conditioned generation.
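Summary: LLMs generate fluent text, but how well they replicate a specific human author's style is unclear. The paper presents a fast, training-free framework for authorship verification and style imitation analysis that combines TF-IDF character n-grams with transformer embeddings and classifies text pairs through empirical distance distributions, with no supervised training or threshold tuning. It reaches 97.5% accuracy on academic essays and 94.5% cross-domain, while cutting training time by 91.8% and memory usage by 59% relative to parameter-based baselines. Across five LLMs from three families and four prompting strategies, prompting strategy matters more than model size: few-shot prompting yields up to 23.5x higher style-matching accuracy than zero-shot, and completion prompting reaches 99.9% agreement. High-fidelity imitation still differs in perplexity (29.5 for human essays vs. 15.2 for matched LLM outputs).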

Title: MobileLLM-R1: Exploring the Limits of Sub-Billion Language Model Reasoners with Open Training Recipes

Authors: Changsheng Zhao, Ernie Chang, Zechun Liu, Chia-Jung Chang, Wei Wen, Chen Lai, Rick Cao, Yuandong Tian, Raghuraman Krishnamoorthi, Yangyang Shi, Vikas Chandra
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.24945
Pdf URL: https://arxiv.org/pdf/2509.24945
Copy Paste: [[2509.24945]] MobileLLM-R1: Exploring the Limits of Sub-Billion Language Model Reasoners with Open Training Recipes(https://arxiv.org/abs/2509.24945)
Keywords: language model, llm, chain-of-thought
Abstract: The paradigm shift in large language models (LLMs) from instinctive responses to chain-of-thought (CoT) reasoning has fueled two prevailing assumptions: (1) reasoning capabilities only emerge in sufficiently large models, and (2) such capabilities require training on massive datasets. While the first assumption has already been challenged by recent sub-billion-parameter reasoning models such as Qwen3-0.6B and DeepSeek distilled variants, the second remains largely unquestioned. In this work, we revisit the necessity of scaling to extremely large corpora (>10T tokens) for reasoning emergence. By carefully curating and resampling open-source datasets that we identify as beneficial under our designed metrics, we demonstrate that strong reasoning abilities can emerge with far less data. Specifically, we show that only ~2T tokens of high-quality data are sufficient, and pre-training with 4.2T tokens on the dataset resampled from these ~2T tokens, followed by an established post-training procedure, enables the development of MobileLLM-R1, a series of sub-billion-parameter reasoning models that substantially outperform prior models trained on fully open-sourced data. For example, MobileLLM-R1-950M achieves an AIME score of 15.5, compared to just 0.6 for OLMo-2-1.48B and 0.3 for SmolLM-2-1.7B. Remarkably, despite being trained on only 11.7% of the tokens compared to Qwen3's proprietary 36T-token corpus for pretraining, MobileLLM-R1-950M matches or surpasses Qwen3-0.6B across multiple reasoning benchmarks. To facilitate further research in this direction, we have released the complete training recipe, data sources, data mixing ratio, and model checkpoints, together with the key insights obtained throughout this study.
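Summary: The shift from instinctive responses to chain-of-thought (CoT) reasoning has fueled two assumptions: that reasoning emerges only in sufficiently large models, and that it requires training on massive datasets. Sub-billion-parameter reasoning models such as Qwen3-0.6B and DeepSeek distilled variants already challenge the first; this work questions the second, revisiting whether extremely large corpora (>10T tokens) are necessary. By curating and resampling open-source datasets, the authors show that ~2T tokens of high-quality data suffice: pre-training on 4.2T tokens resampled from them, followed by an established post-training procedure, yields MobileLLM-R1. MobileLLM-R1-950M scores 15.5 on AIME, versus 0.6 for OLMo-2-1.48B and 0.3 for SmolLM-2-1.7B, and matches or surpasses Qwen3-0.6B on multiple reasoning benchmarks with only 11.7% of Qwen3's 36T pretraining tokens. The training recipe, data sources, data mixing ratio, and model checkpoints are released.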
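
The token-budget figures in the MobileLLM-R1 abstract can be cross-checked with simple arithmetic. The sketch below uses only numbers stated in the abstract; `token_fraction` is a hypothetical helper, and the resampling factor is a rough estimate that treats the "~2T" unique-token figure as exactly 2.0T.

```python
def token_fraction(used_tokens_t: float, reference_tokens_t: float) -> float:
    """Fraction of a reference token budget that was actually used (both in trillions)."""
    return used_tokens_t / reference_tokens_t


if __name__ == "__main__":
    pretrain_t = 4.2   # tokens seen during MobileLLM-R1 pre-training (trillions)
    qwen3_t = 36.0     # Qwen3 pretraining corpus size (trillions)
    unique_t = 2.0     # approximate unique high-quality tokens; assumed exact here

    print(f"{token_fraction(pretrain_t, qwen3_t):.1%}")  # 11.7%
    print(f"{pretrain_t / unique_t:.1f}x")               # 2.1x average resampling factor
```

The first figure reproduces the 11.7% quoted in the abstract; the second suggests each curated token is seen about twice on average, although the actual per-source mixing ratios may differ.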

Title: The Dialogue That Heals: A Comprehensive Evaluation of Doctor Agents' Inquiry Capability

Authors: Linlu Gong, Ante Wang, Yunghwei Lai, Weizhi Ma, Yang Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.24958
Pdf URL: https://arxiv.org/pdf/2509.24958
Copy Paste: [[2509.24958]] The Dialogue That Heals: A Comprehensive Evaluation of Doctor Agents' Inquiry Capability(https://arxiv.org/abs/2509.24958)
Keywords: llm, agent
Abstract: An effective physician should possess a combination of empathy, expertise, patience, and clear communication when treating a patient. Recent advances have successfully endowed AI doctors with expert diagnostic skills, particularly the ability to actively seek information through inquiry. However, other essential qualities of a good doctor remain overlooked. To bridge this gap, we present MAQuE (Medical Agent Questioning Evaluation), the largest-ever benchmark for the automatic and comprehensive evaluation of medical multi-turn questioning. It features 3,000 realistically simulated patient agents that exhibit diverse linguistic patterns, cognitive limitations, emotional responses, and tendencies for passive disclosure. We also introduce a multi-faceted evaluation framework, covering task success, inquiry proficiency, dialogue competence, inquiry efficiency, and patient experience. Experiments on different LLMs reveal substantial challenges across the evaluation aspects. Even state-of-the-art models show significant room for improvement in their inquiry capabilities. These models are highly sensitive to variations in realistic patient behavior, which considerably impacts diagnostic accuracy. Furthermore, our fine-grained metrics expose trade-offs between different evaluation perspectives, highlighting the challenge of balancing performance and practicality in real-world clinical settings.
摘要：有效的医生在治疗患者时应拥有同理心，专业知识，耐心和清晰沟通的结合。最近的进步成功地赋予了AI医生的专家诊断技能，尤其是通过查询积极寻求信息的能力。但是，好医生的其他基本品质仍然被忽视。为了弥合这一差距，我们提出了Maque（医疗代理质疑评估），这是对医学多转变质疑的自动和全面评估的有史以来最大的基准。它具有3,000名现实模拟的患者药物，这些患者药物表现出各种语言模式，认知局限性，情感反应和被动披露的趋势。我们还介绍了一个多方面的评估框架，涵盖了任务成功，询问能力，对话能力，查询效率和患者经验。不同LLM的实验揭示了整个评估方面的重大挑战。即使是最先进的模型也显示出很大的提高其查询功能的空间。这些模型对现实的患者行为的变化高度敏感，这极大地影响了诊断准确性。此外，我们的细粒度指标在不同的评估观点之间揭示了权衡取舍，从而强调了在现实世界中临床环境中平衡性能和实用性的挑战。

Title: SemanticShield: LLM-Powered Audits Expose Shilling Attacks in Recommender Systems

Authors: Kaihong Li, Huichi Zhou, Bin Ma, Fangjun Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.24961
Pdf URL: https://arxiv.org/pdf/2509.24961
Copy Paste: [[2509.24961]] SemanticShield: LLM-Powered Audits Expose Shilling Attacks in Recommender Systems(https://arxiv.org/abs/2509.24961)
Keywords: language model, llm
Abstract: Recommender systems (RS) are widely used in e-commerce for personalized suggestions, yet their openness makes them susceptible to shilling attacks, where adversaries inject fake behaviors to manipulate recommendations. Most existing defenses emphasize user-side behaviors while overlooking item-side features such as titles and descriptions that can expose malicious intent. To address this gap, we propose a two-stage detection framework that integrates item-side semantics via large language models (LLMs). The first stage pre-screens suspicious users using low-cost behavioral criteria, and the second stage employs LLM-based auditing to evaluate semantic consistency. Furthermore, we enhance the auditing model through reinforcement fine-tuning on a lightweight LLM with carefully designed reward functions, yielding a specialized detector called SemanticShield. Experiments on six representative attack strategies demonstrate the effectiveness of SemanticShield against shilling attacks, and further evaluation on previously unseen attack methods shows its strong generalization capability. Code is available at this https URL.
摘要：推荐系统（RS）被广泛用于电子商务中的个性化建议，但是它们的开放性使它们容易受到先令攻击的影响，在这种情况下，对手会注入虚假行为来操纵建议。大多数现有的防御力都强调用户端行为，同时忽略了项目端功能，例如标题和描述，这些功能可以暴露出恶意意图。为了解决这一差距，我们提出了一个两阶段检测框架，该框架通过大语言模型（LLMS）整合了项目侧语义。第一阶段使用低成本行为标准的屏幕前可疑用户，第二阶段采用基于LLM的审计来评估语义一致性。此外，我们通过使用精心设计的奖励功能在轻量级LLM上进行微调来增强审计模型，从而产生一个名为Semanticshield的专门检测器。对六种代表性攻击策略进行的实验证明了语义屏幕对先令攻击的有效性，并且对先前看不见的攻击方法的进一步评估显示了其强大的概括能力。代码可在此HTTPS URL上找到。

Title: Generalized Correctness Models: Learning Calibrated and Model-Agnostic Correctness Predictors from Historical Patterns

Authors: Hanqi Xiao, Vaidehi Patil, Hyunji Lee, Elias Stengel-Eskin, Mohit Bansal
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.24988
Pdf URL: https://arxiv.org/pdf/2509.24988
Copy Paste: [[2509.24988]] Generalized Correctness Models: Learning Calibrated and Model-Agnostic Correctness Predictors from Historical Patterns(https://arxiv.org/abs/2509.24988)
Keywords: llm
Abstract: Generating accurate and calibrated confidence estimates is critical for deploying LLMs in high-stakes or user-facing applications, and remains an open challenge. Prior research has often framed confidence as a problem of eliciting a model's "self-knowledge", i.e., the ability of an LLM to judge whether its own answers are correct; this approach implicitly assumes that there is some privileged information about the answer's correctness that is accessible to the model itself. However, our experiments reveal that an LLM attempting to predict the correctness of its own outputs generally performs no better than an unrelated LLM. Moreover, we hypothesize that a key factor in building a "Correctness Model" (CM) is exposure to a target model's historical predictions. We propose multiple methods to inject this historical correctness information, creating a Generalized Correctness Model (GCM). We first show that GCMs can be trained on the correctness data from many LLMs and learn patterns for correctness prediction applicable across datasets and models. We then use CMs as a lens for studying the source of correctness prediction ability and its generalization, systematically controlling their training data and finding that answer phrasing is a strong predictor for correctness. We further explore alternative methods of injecting history without training an LLM, finding that including history as in-context examples can help improve correctness prediction, and post-hoc calibration can provide complementary reductions in calibration error. We evaluate GCMs based on Qwen3-8B across 5 model families and the MMLU and TriviaQA datasets, as well as on a downstream selective prediction task, finding that reliable LLM confidence estimation is a generalizable and model-agnostic skill learned by systematically encoding correctness history rather than a model-specific skill reliant on self-introspection.
摘要：生成准确和校准的置信度估计对于在高风险或面向用户的应用程序中部署LLMS至关重要，并且仍然是一个开放的挑战。先前的研究通常将信心定为引起模型的“自我知识”的问题，即LLM判断自己的答案是否正确的能力；这种方法隐含地假设有一些关于答案的正确性的特权信息，而模型本身可以访问。但是，我们的实验表明，试图预测其自身输出的正确性的LLM通常不会比无关的LLM更好。此外，我们假设建立“正确性模型”（CM）的关键因素是暴露于目标模型的历史预测。我们提出了多种方法来注入此历史正确性信息，创建了广义正确性模型（GCM）。我们首先表明，可以对来自许多LLM的正确性数据进行培训，并学习模式以适用于数据集和模型的正确性预测。然后，我们将CMS用作镜头来研究正确性预测能力及其概括，系统地控制其训练数据并发现答案措辞是正确性的有力预测指标。我们进一步探讨了在不训练LLM的情况下注入历史记录的替代方法，发现将历史记录为中文中的示例包括有助于改善正确性预测，而事后校准可以提供校准误差的补充减少。我们在5个模型系列以及MMLU和TriviaQA数据集以及下游选择性预测任务上评估了基于QWEN3-8B的GCM，发现可靠的LLM置信度估计是可以通过系统地编码的正确性历史而不是模型的技能，而不是针对自我介绍的，可靠的LLM置信度估计是一种可以推广的模型和模型的技能。

Title: Ultra-Fast Language Generation via Discrete Diffusion Divergence Instruct

Authors: Haoyang Zheng, Xinyang Liu, Cindy Xiangrui Kong, Nan Jiang, Zheyuan Hu, Weijian Luo, Wei Deng, Guang Lin
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.25035
Pdf URL: https://arxiv.org/pdf/2509.25035
Copy Paste: [[2509.25035]] Ultra-Fast Language Generation via Discrete Diffusion Divergence Instruct(https://arxiv.org/abs/2509.25035)
Keywords: language model, gpt, llm
Abstract: Fast generation of language texts is the holy grail that people pursue in the AI era. In this work, we introduced Discrete Diffusion Divergence Instruct (DiDi-Instruct), a training-based method that leads to fast language generation models by initializing from a pre-trained (masked) discrete diffusion language model (dLLM). The resulting DiDi-Instruct model outperforms the dLLM counterparts and the GPT-2 baseline with 64x acceleration. In the theoretical part of the paper, we build the foundation of DiDi-Instruct in a framework of integral KL-divergence minimization, with practical training algorithms. We also introduce techniques like grouped reward normalization, intermediate-state matching, and the reward-guided ancestral sampler (RGAS) that significantly improve the training stability, the model coverage, and the inference performances. On OpenWebText, DiDi-Instruct outperforms all accelerated language generation models as well as the GPT-2 baseline and the standard dLLMs, achieving sample perplexities ranging from 62.2 (8 NFEs) to 18.4 (128 NFEs). These performance gains are accomplished with a negligible entropy loss of about 1% and 20x less additional training wall-clock time. We further validate the robustness and effectiveness of DiDi-Instruct through extensive ablation studies, model scaling, and the generation of discrete protein sequences. In conclusion, DiDi-Instruct is an efficient yet effective distillation method, enabling language generation in the blink of an eye. We will release both code and models at this http URL.
摘要：快速产生的语言文本是人们在人工智能时代追求的圣杯。在这项工作中，我们引入了离散扩散差异指令（DIDI-Instruct），这是一种基于培训的方法，通过从预先训练（屏蔽）离散扩散语言模型（DLLM）初始化来导致语言生成模型。由此产生的DID-Instruct模型的表现优于DLLM对应物，而GPT-2基线的加速度则优于GPT-2基线。在本文的理论部分中，我们通过实用的培训算法在整体KL-Divergence最小化的框架内建立了Didi教学的基础。我们还介绍了诸如分组奖励归一化，中等状态匹配以及奖励指导的祖先采样器（RGA）等技术，从而显着改善了训练稳定性，模型覆盖率和推理性能。在OpenWebText上，DID-Instruct优于所有加速语言生成模型以及GPT-2基线和标准DLLM，实现了从62.2（8 NFES）到18.4至18.4（128 NFES）的样本困惑。这些性能的增长是通过可忽略的熵损失约1％和20倍来完成的，减少了额外的训练壁垒时间。我们通过广泛的消融研究，模型缩放和离散蛋白质序列的产生进一步验证DIDI教学的鲁棒性和有效性。总之，Didi-Instruct是一种有效而有效的蒸馏方法，使语言眨眼之间就能产生。我们将在此HTTP URL上发布代码和模型。

Title: Hyperdimensional Probe: Decoding LLM Representations via Vector Symbolic Architectures

Authors: Marco Bronzini, Carlo Nicolini, Bruno Lepri, Jacopo Staiano, Andrea Passerini
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.25045
Pdf URL: https://arxiv.org/pdf/2509.25045
Copy Paste: [[2509.25045]] Hyperdimensional Probe: Decoding LLM Representations via Vector Symbolic Architectures(https://arxiv.org/abs/2509.25045)
Keywords: language model, llm
Abstract: Despite their capabilities, Large Language Models (LLMs) remain opaque with limited understanding of their internal representations. Current interpretability methods, such as direct logit attribution (DLA) and sparse autoencoders (SAEs), provide restricted insight due to limitations such as the model's output vocabulary or unclear feature names. This work introduces Hyperdimensional Probe, a novel paradigm for decoding information from the LLM vector space. It combines ideas from symbolic representations and neural probing to project the model's residual stream into interpretable concepts via Vector Symbolic Architectures (VSAs). This probe combines the strengths of SAEs and conventional probes while overcoming their key limitations. We validate our decoding paradigm with controlled input-completion tasks, probing the model's final state before next-token prediction on inputs spanning syntactic pattern recognition, key-value associations, and abstract inference. We further assess it in a question-answering setting, examining the state of the model both before and after text generation. Our experiments show that our probe reliably extracts meaningful concepts across varied LLMs, embedding sizes, and input domains, also helping identify LLM failures. Our work advances information decoding in LLM vector space, enabling extracting more informative, interpretable, and structured features from neural representations.
摘要：尽管具有能力，但大型语言模型（LLM）仍然不透明，对它们的内部表现有限。当前的可解释性方法，例如直接logit归因（DLA）和稀疏的自动编码器（SAE），由于限制（例如模型的输出词汇或不清晰的特征名称）提供了有限的见解。这项工作引入了高维探针，这是一种用于解码LLM矢量空间信息的新型范式。它结合了符号表示和神经探测的想法，通过向量符号体系结构（VSA）将模型的残差流将其投影到可解释的概念中。该探测器结合了SAE和常规探针的优势，同时克服了它们的关键局限性。我们使用受控的输入核算任务来验证解码范式，在跨越句法模式识别，键值值关联和抽象推论的输入的下一步预测之前探索模型的最终状态。我们在提问的环境中进一步评估它，并在文本生成之前和之后检查模型的状态。我们的实验表明，我们的探测可靠地提取了各种LLM，嵌入尺寸和输入域的有意义的概念，也有助于识别LLM失败。我们的工作推进了LLM矢量空间中的信息解码，从而使从神经表示中提取更有益，可解释和结构化的特征。

Title: Confidence-Guided Error Correction for Disordered Speech Recognition

Authors: Abner Hernandez, Tomás Arias Vergara, Andreas Maier, Paula Andrea Pérez-Toro
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.25048
Pdf URL: https://arxiv.org/pdf/2509.25048
Copy Paste: [[2509.25048]] Confidence-Guided Error Correction for Disordered Speech Recognition(https://arxiv.org/abs/2509.25048)
Keywords: language model, llm, prompt
Abstract: We investigate the use of large language models (LLMs) as post-processing modules for automatic speech recognition (ASR), focusing on their ability to perform error correction for disordered speech. In particular, we propose confidence-informed prompting, where word-level uncertainty estimates are embedded directly into LLM training to improve robustness and generalization across speakers and datasets. This approach directs the model to uncertain ASR regions and reduces overcorrection. We fine-tune a LLaMA 3.1 model and compare our approach to both transcript-only fine-tuning and post hoc confidence-based filtering. Evaluations show that our method achieves a 10% relative WER reduction compared to naive LLM correction on the Speech Accessibility Project spontaneous speech and a 47% reduction on TORGO, demonstrating the effectiveness of confidence-aware fine-tuning for impaired speech.
摘要：我们研究了大型语言模型（LLM）作为自动语音识别（ASR）的后处理模块（ASR）的使用，重点是他们对无序语音进行错误校正的能力。特别是，我们提出了置信方面的提示，其中单词级的不确定性估计直接嵌入LLM培训中，以改善跨说话者和数据集的稳健性和概括。这种方法指示该模型不确定的ASR区域并减少过度纠正。我们对Llama 3.1模型进行微调，并将我们的方法与仅转录的微调和基于事后置信后的过滤进行比较。评估表明，与对语音可访问性项目自发语音的幼稚LLM纠正相比，我们的方法相对降低了10％，对Torgo的校正率降低了47％，这表明了置信度意识到言语受损的有效性。

Title: Scaling Generalist Data-Analytic Agents

Authors: Shuofei Qiao, Yanqiu Zhao, Zhisong Qiu, Xiaobin Wang, Jintian Zhang, Zhao Bin, Ningyu Zhang, Yong Jiang, Pengjun Xie, Fei Huang, Huajun Chen
Subjects: cs.CL, cs.AI, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2509.25084
Pdf URL: https://arxiv.org/pdf/2509.25084
Copy Paste: [[2509.25084]] Scaling Generalist Data-Analytic Agents(https://arxiv.org/abs/2509.25084)
Keywords: gpt, prompt, agent
Abstract: Data-analytic agents are emerging as a key catalyst for automated scientific discovery and for the vision of Innovating AI. Current approaches, however, rely heavily on prompt engineering over proprietary models, while open-source models struggle to face diverse-format, large-scale data files and long-horizon, multi-step reasoning that real-world analytics demands. This paper introduces DataMind, a scalable data synthesis and agent training recipe designed to build generalist data-analytic agents. DataMind tackles three key challenges in building open-source data-analytic agents, including insufficient data resources, improper training strategy, and unstable code-based multi-turn rollout. Concretely, DataMind applies 1) a fine-grained task taxonomy and a recursive easy-to-hard task composition mechanism to increase the diversity and difficulty of synthesized queries; 2) a knowledge-augmented trajectory sampling strategy followed by model-based and rule-based filtering; 3) a dynamically adjustable training objective combining both SFT and RL losses; 4) a memory-frugal and stable code-based multi-turn rollout framework. Built on DataMind, we curate DataMind-12K, a high-quality trajectory set spanning diverse domains, task categories, and data file formats for data-analytic tasks. Trained on DataMind-12K, our DataMind-14B achieves state-of-the-art with an average score of 71.16% on multiple data analysis benchmarks, outperforming the strongest proprietary baselines DeepSeek-V3.1 and GPT-5. Our DataMind-7B also performs best among all open-source models with a score of 68.10%. We also incorporate some empirical insights gained from our exploratory trials into the analysis experiments, aiming to provide actionable insights about agentic training for the community. We will release DataMind-12K and DataMind-7B,14B for the community's future research.
摘要：数据分析剂正在成为自动化科学发现的关键催化剂和创新AI的愿景。但是，当前的方法在很大程度上依赖于专有模型的迅速工程，而开源模型则难以面对各种形式，大规模的数据文件和长期长途，多步骤的推理，认为现实世界分析的需求。本文介绍了Datamind，这是一种可扩展的数据综合和代理培训配方，旨在构建通用数据分析剂。 Datamind在构建开源数据分析剂方面面临三个关键挑战，包括数据资源不足，不正确的培训策略和基于代码的不稳定多转移推出。具体而言，Datamind应用于1）精细的任务分类法和递归易于硬的任务组成机制，以增加合成查询的多样性和困难； 2）知识增强的轨迹采样策略，然后是基于模型和规则的过滤； 3）结合SFT和RL损失的动态可调训练目标； 4）基于内存且稳定的基于代码的多转推断框架。我们构建在Datamind上，我们策划Datamind-12k，这是一个高质量的轨迹集，涵盖了用于数据分析任务的不同域，任务类别和数据文件格式。经过Datamind-12k培训，我们的Datamind-14b在多个数据分析基准上的平均得分为71.16％，以优于最强的专有基线DeepSeek-V3.1和GPT-5，平均得分为71.16％。我们的Datamind-7b在所有开源型号中也表现最好，得分为68.10％。我们还将从探索性试验中获得的一些经验见解纳入了分析实验，旨在提供有关社区代理培训的可行见解。我们将发布Datamind-12k和Datamind-7b，14B，以进行社区的未来研究。

Title: Towards Trustworthy Lexical Simplification: Exploring Safety and Efficiency with Small LLMs

Authors: Akio Hayakawa, Stefan Bott, Horacio Saggion
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.25086
Pdf URL: https://arxiv.org/pdf/2509.25086
Copy Paste: [[2509.25086]] Towards Trustworthy Lexical Simplification: Exploring Safety and Efficiency with Small LLMs(https://arxiv.org/abs/2509.25086)
Keywords: language model, llm
Abstract: Despite their strong performance, large language models (LLMs) face challenges in real-world application of lexical simplification (LS), particularly in privacy-sensitive and resource-constrained environments. Moreover, since vulnerable user groups (e.g., people with disabilities) are one of the key target groups of this technology, it is crucial to ensure the safety and correctness of the output of LS systems. To address these issues, we propose an efficient framework for LS systems that utilizes small LLMs deployable in local environments. Within this framework, we explore knowledge distillation with synthesized data and in-context learning as baselines. Our experiments in five languages evaluate model outputs both automatically and manually. Our manual analysis reveals that while knowledge distillation boosts automatic metric scores, it also introduces a safety trade-off by increasing harmful simplifications. Importantly, we find that the model's output probability is a useful signal for detecting harmful simplifications. Leveraging this, we propose a filtering strategy that suppresses harmful simplifications while largely preserving beneficial ones. This work establishes a benchmark for efficient and safe LS with small LLMs. It highlights the key trade-offs between performance, efficiency, and safety, and demonstrates a promising approach for safe real-world deployment.
摘要：尽管表现出色，但大型语言模型（LLMS）在词汇简化（LS）的现实应用中面临挑战，尤其是在对隐私敏感和资源约束的环境中。此外，由于脆弱的用户组（例如残疾人）是该技术的关键目标群体之一，因此确保LS系统输出的安全性和正确性至关重要。为了解决这些问题，我们为LS系统提供了一个有效的框架，该系统利用可在本地环境中部署的小型LLMS。在此框架内，我们以合成的数据和内在的学习为基础来探索知识蒸馏。我们的五种语言实验会自动和手动评估模型输出。我们的手动分析表明，尽管知识蒸馏可以提高自动度量分数，但它也通过增加有害的简化而引入了安全权衡。重要的是，我们发现模型的输出概率是检测有害简化的有用信号。利用这一点，我们提出了一种过滤策略，该策略抑制了有害的简化，同时在很大程度上保留了有益的策略。这项工作为具有小LLM的高效和安全LS建立了基准。它突出了性能，效率和安全性之间的关键权衡，并展示了一种有希望的安全现实部署方法。

Title: Towards Personalized Deep Research: Benchmarks and Evaluations

Authors: Yuan Liang, Jiaxian Li, Yuqing Wang, Piaohong Wang, Motong Tian, Pai Liu, Shuofei Qiao, Runnan Fang, He Zhu, Ge Zhang, Minghao Liu, Yuchen Eleanor Jiang, Ningyu Zhang, Wangchunshu Zhou
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2509.25106
Pdf URL: https://arxiv.org/pdf/2509.25106
Copy Paste: [[2509.25106]] Towards Personalized Deep Research: Benchmarks and Evaluations(https://arxiv.org/abs/2509.25106)
Keywords: agent
Abstract: Deep Research Agents (DRAs) can autonomously conduct complex investigations and generate comprehensive reports, demonstrating strong real-world potential. However, existing evaluations mostly rely on close-ended benchmarks, while open-ended deep research benchmarks remain scarce and typically neglect personalized scenarios. To bridge this gap, we introduce Personalized Deep Research Bench, the first benchmark for evaluating personalization in DRAs. It pairs 50 diverse research tasks across 10 domains with 25 authentic user profiles that combine structured persona attributes with dynamic real-world contexts, yielding 250 realistic user-task queries. To assess system performance, we propose the PQR Evaluation Framework, which jointly measures (P) Personalization Alignment, (Q) Content Quality, and (R) Factual Reliability. Our experiments on a range of systems highlight current capabilities and limitations in handling personalized deep research. This work establishes a rigorous foundation for developing and evaluating the next generation of truly personalized AI research assistants.
摘要：深度研究代理（DRA）可以自主进行复杂的研究并产生全面的报告，并证明现实世界中的潜力很强。但是，现有的评估主要依赖于封闭式的基准，而开放式的深层研究基准测试基准仍然很少，并且通常忽略了个性化的方案。为了弥合这一差距，我们介绍了个性化的深入研究基准，这是评估DRA中个性化的第一个基准。它将10个域中的50个不同的研究任务与25个真实的用户配置文件结合在一起，将结构化的角色属性与动态现实世界的上下文相结合，产生250个现实的用户任务查询。为了评估系统性能，我们提出了PQR评估框架，该框架共同测量（P）个性化对齐，（Q）内容质量和（R）事实可靠性。我们对一系列系统的实验强调了当前的功能和处理个性化深入研究的局限性。这项工作为开发和评估下一代真正个性化的AI研究助理建立了严格的基础。

Title: Knowledge Extraction on Semi-Structured Content: Does It Remain Relevant for Question Answering in the Era of LLMs?

Authors: Kai Sun, Yin Huang, Srishti Mehra, Mohammad Kachuee, Xilun Chen, Renjie Tao, Zhaojiang Lin, Andrea Jessee, Nirav Shah, Alex Betty, Yue Liu, Anuj Kumar, Wen-tau Yih, Xin Luna Dong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.25107
Pdf URL: https://arxiv.org/pdf/2509.25107
Copy Paste: [[2509.25107]] Knowledge Extraction on Semi-Structured Content: Does It Remain Relevant for Question Answering in the Era of LLMs?(https://arxiv.org/abs/2509.25107)
Keywords: language model, llm
Abstract: The advent of Large Language Models (LLMs) has significantly advanced web-based Question Answering (QA) systems over semi-structured content, raising questions about the continued utility of knowledge extraction for question answering. This paper investigates the value of triple extraction in this new paradigm by extending an existing benchmark with knowledge extraction annotations and evaluating commercial and open-source LLMs of varying sizes. Our results show that web-scale knowledge extraction remains a challenging task for LLMs. Despite achieving high QA accuracy, LLMs can still benefit from knowledge extraction, through augmentation with extracted triples and multi-task learning. These findings provide insights into the evolving role of knowledge triple extraction in web-based QA and highlight strategies for maximizing LLM effectiveness across different model sizes and resource settings.
摘要：大型语言模型（LLMS）的出现具有明显的基于Web的问题答案（QA）系统，而不是半结构化内容，从而提出了有关知识提取的持续效用以进行问题回答的问题。本文通过扩展了具有知识提取注释的现有基准并评估各种大小的商业和开源LLM，从而研究了这种新范式中三重提取的价值。我们的结果表明，Web规模的知识提取仍然是LLM的一项具有挑战性的任务。尽管达到了高质量检查的准确性，但LLM仍可以通过提取的三倍和多任务学习来增加知识提取。这些发现提供了有关知识三重提取在基于Web的质量质量问题中不断发展的作用的见解，并突出了在不同模型大小和资源设置之间最大化LLM有效性的策略。

Title: Investigating Language and Retrieval Bias in Multilingual Previously Fact-Checked Claim Detection

Authors: Ivan Vykopal, Antonia Karamolegkou, Jaroslav Kopčan, Qiwei Peng, Tomáš Javůrek, Michal Gregor, Marián Šimko
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.25138
Pdf URL: https://arxiv.org/pdf/2509.25138
Copy Paste: [[2509.25138]] Investigating Language and Retrieval Bias in Multilingual Previously Fact-Checked Claim Detection(https://arxiv.org/abs/2509.25138)
Keywords: language model, llm, prompt
Abstract: Multilingual Large Language Models (LLMs) offer powerful capabilities for cross-lingual fact-checking. However, these models often exhibit language bias, performing disproportionately better on high-resource languages such as English than on low-resource counterparts. We also introduce and examine a novel concept, retrieval bias, which arises when information retrieval systems favor certain information over others, skewing the retrieval process. In this paper, we study language and retrieval bias in the context of Previously Fact-Checked Claim Detection (PFCD). We evaluate six open-source multilingual LLMs across 20 languages using a fully multilingual prompting strategy, leveraging the AMC-16K dataset. By translating task prompts into each language, we uncover disparities in monolingual and cross-lingual performance and identify key trends based on model family, size, and prompting strategy. Our findings highlight persistent bias in LLM behavior and offer recommendations for improving equity in multilingual fact-checking. To investigate retrieval bias, we employ multilingual embedding models and examine the frequency of retrieved claims. Our analysis reveals that certain claims are retrieved disproportionately across different posts, leading to inflated retrieval performance for popular claims while under-representing less common ones.
摘要：多语言大语模型（LLMS）为跨语性事实检查提供了强大的功能。但是，这些模型通常表现出语言偏见，在高资源语言（例如英语）上的表现比在低资源对应的语言上表现出色。当信息检索系统倾向于偏爱某些信息而不是其他信息时，我们还提出并检查了一个新颖的概念 - 检索偏见，而检索过程却偏斜。在本文中，我们在先前事实检查的索赔检测（PFCD）的背景下研究语言和检索偏见。我们使用完全多语言的提示策略评估了20种语言的六个开源多语言LLM，并利用AMC-16K数据集。通过将任务提示转化为每种语言，我们发现单语和跨语言性能的差异，并根据模型家族，大小和提示策略确定关键趋势。我们的发现突出了LLM行为的持续偏见，并提供了改善多语言事实检查公平的建议。为了研究检索偏见，我们采用了多语言嵌入模型，并研究了检索索赔的频率。我们的分析表明，某些主张在不同的职位上的检索不成比例，从而导致流行索赔的检索表现膨胀，同时代表性较低。

Title: Paired by the Teacher: Turning Unpaired Data into High-Fidelity Pairs for Low-Resource Text Generation

Authors: Yen-Ju Lu, Thomas Thebaud, Laureano Moro-Velazquez, Najim Dehak, Jesus Villalba
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.25144
Pdf URL: https://arxiv.org/pdf/2509.25144
Copy Paste: [[2509.25144]] Paired by the Teacher: Turning Unpaired Data into High-Fidelity Pairs for Low-Resource Text Generation(https://arxiv.org/abs/2509.25144)
Keywords: llm
Abstract: We present Paired by the Teacher (PbT), a two-stage teacher-student pipeline that synthesizes accurate input-output pairs without human labels or parallel data. In many low-resource natural language generation (NLG) scenarios, practitioners may have only raw outputs, like highlights, recaps, or questions, or only raw inputs, such as articles, dialogues, or paragraphs, but seldom both. This mismatch forces small models to learn from very few examples or rely on costly, broad-scope synthetic examples produced by large LLMs. PbT addresses this by asking a teacher LLM to compress each unpaired example into a concise intermediate representation (IR), and training a student to reconstruct inputs from IRs. This enables outputs to be paired with student-generated inputs, yielding high-quality synthetic data. We evaluate PbT on five benchmarks, covering document summarization (XSum, CNNDM), dialogue summarization (SAMSum, DialogSum), and question generation (SQuAD), as well as an unpaired setting on SwitchBoard (paired with DialogSum summaries). An 8B student trained only on PbT data outperforms models trained on 70B teacher-generated corpora and other unsupervised baselines, coming within 1.2 ROUGE-L of human-annotated pairs and closing 82% of the oracle gap at one-third the annotation cost of direct synthesis. Human evaluation on SwitchBoard further confirms that only PbT produces concise, faithful summaries aligned with the target style, highlighting its advantage of generating in-domain sources that avoid the mismatch limiting direct synthesis.
摘要：我们介绍了由教师（PBT）配对，这是一条两阶段的教师学生，它合成了没有人类标签或并行数据的准确输入输出对。在许多低资源的自然语言产生（NLG）场景中，从业者可能只有原始输出，例如亮点，回顾或问题，或者只有原始输入，例如文章，对话或段落，但很少。这种不匹配迫使小型模型从很少的示例中学习或依靠大型LLM产生的昂贵的，广泛的合成实例。 PBT通过要求教师LLM将每个不成对的示例压缩为简洁的中间表示（IR），并培训学生重建IRS的投入来解决这一问题。这使输出可以与学生生成的输入配对，从而产生高质量的合成数据。我们在五个基准文档摘要（XSUM，CNNDM），对话摘要（Samsum，Dialogsum）和问题生成（小队）上评估了PBT，以及打算板上的不合格设置（与对话汇总配对）。仅接受PBT数据培训的8B学生优于在70 B教师生成的语料库和其他无监督的基线的培训的模型，在人类注销的1.2盘折ruge-l范围内，以三分之一的直接合成的注释成本以1.2 rouge-l缩小了甲骨文差距的82％。人体评估在机板上进一步证实，只有PBT才能产生与目标样式保持一致的简洁，忠实的摘要，强调了其生成避免不匹配的内域源的优势，从而限制了直接综合。

Title: Pretraining Large Language Models with NVFP4

Authors: NVIDIA, Felix Abecassis, Anjulie Agrusa, Dong Ahn, Jonah Alben, Stefania Alborghetti, Michael Andersch, Sivakumar Arayandi, Alexis Bjorlin, Aaron Blakeman, Evan Briones, Ian Buck, Bryan Catanzaro, Jinhang Choi, Mike Chrzanowski, Eric Chung, Victor Cui, Steve Dai, Bita Darvish Rouhani, Carlo del Mundo, Deena Donia, Burc Eryilmaz, Henry Estela, Abhinav Goel, Oleg Goncharov, Yugi Guvvala, Robert Hesse, Russell Hewett, Herbert Hum, Ujval Kapasi, Brucek Khailany, Mikail Khona, Nick Knight, Alex Kondratenko, Ronny Krashinsky, Ben Lanir, Simon Layton, Michael Lightstone, Daniel Lo, Paulius Micikevicius, Asit Mishra, Tim Moon, Deepak Narayanan, Chao Ni, Abhijit Paithankar, Satish Pasumarthi, Ankit Patel, Mostofa Patwary, Ashwin Poojary, Gargi Prasad, Sweta Priyadarshi, Yigong Qin, Xiaowei Ren, Oleg Rybakov, Charbel Sakr, Sanjeev Satheesh, Stas Sergienko, Pasha Shamis, Kirthi Shankar, Nishant Sharma, Mohammad Shoeybi, Michael Siu, Misha Smelyanskiy, Darko Stosic, Dusan Stosic, Bor-Yiing Su, Frank Sun, Nima Tajbakhsh, Shelby Thomas, Przemek Tredak, Evgeny Tsykunov, Gandhi Vaithilingam, Aditya Vavre, Rangharajan Venkatesan, Roger Waleffe, Qiyu Wan, Hexin Wang, Mengdi Wang, Lizzie Wei, Hao Wu, Evan Wu, Keith Wyss, Ning Xu, Jinze Xue, Charlene Yang, Yujia Zhai, Ruoxi Zhang, Jingyang Zhu, Zhongbo Zhu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.25149
Pdf URL: https://arxiv.org/pdf/2509.25149
Copy Paste: [[2509.25149]] Pretraining Large Language Models with NVFP4(https://arxiv.org/abs/2509.25149)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) today are powerful problem solvers across many domains, and they continue to get stronger as they scale in model size, training set size, and training set quality, as shown by extensive research and experimentation across the industry. Training a frontier model today requires on the order of tens to hundreds of yottaflops, which is a massive investment of time, compute, and energy. Improving pretraining efficiency is therefore essential to enable the next generation of even more capable LLMs. While 8-bit floating point (FP8) training is now widely adopted, transitioning to even narrower precision, such as 4-bit floating point (FP4), could unlock additional improvements in computational speed and resource utilization. However, quantization at this level poses challenges to training stability, convergence, and implementation, notably for large-scale models trained on long token horizons. In this study, we introduce a novel approach for stable and accurate training of large language models (LLMs) using the NVFP4 format. Our method integrates Random Hadamard transforms (RHT) to bound block-level outliers, employs a two-dimensional quantization scheme for consistent representations across both the forward and backward passes, utilizes stochastic rounding for unbiased gradient estimation, and incorporates selective high-precision layers. We validate our approach by training a 12-billion-parameter model on 10 trillion tokens -- the longest publicly documented training run in 4-bit precision to date. Our results show that the model trained with our NVFP4-based pretraining technique achieves training loss and downstream task accuracies comparable to an FP8 baseline. These findings highlight that NVFP4, when combined with our training approach, represents a major step forward in narrow-precision LLM training algorithms.
摘要：如今，大型语言模型（LLMS）是许多领域的强大问题解决者，随着它们扩展模型大小，训练集大小和训练集质量，它们继续变得越来越强，如整个行业的广泛研究和实验所示。今天训练边境模型需要数十个顺序到数百个Yottaflops，这是时间，计算和能源的大量投资。因此，提高训练效率对于实现下一代更有能力的LLM至关重要。尽管现在广泛采用了8位浮点（FP8）训练，但过渡到更狭窄的精度，例如4位浮点（FP4），可以解锁计算速度和资源利用率的额外改进。但是，在此级别上的量化对训练稳定性，收敛性和实施构成了挑战，特别是对于在长期代币范围内训练的大型模型。在这项研究中，我们介绍了一种使用NVFP4格式对大语言模型（LLM）进行稳定且准确培训的新方法。我们的方法将随机的Hadamard变换（RHT）集成到结合的区块级别的离群值，采用二维量化方案来跨前向和向后传递的一致表示，利用随机圆形进行无偏置的梯度估计，并结合了选择性的高精度层。我们通过对10万亿代币的120亿参数模型进行培训来验证我们的方法 - 迄今为止，最长的公开记录的培训最长的培训运行。我们的结果表明，通过基于NVFP4的预训练技术训练的模型可实现训练损失和与FP8基线相当的下游任务精度。这些发现凸显了NVFP4与我们的训练方法结合使用，这代表了狭窄精确的LLM培训算法迈出的重要一步。
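The abstract mentions stochastic rounding for unbiased gradient estimation. The sketch below illustrates only that general idea on a uniform grid: a value is rounded up with probability equal to its fractional distance from the lower grid point, so the rounded value equals the input in expectation. The uniform grid, the function name `stochastic_round`, and the `step` parameter are assumptions made for illustration; this does not model the NVFP4 format or the paper's implementation.

```python
import math
import random

def stochastic_round(x: float, step: float, rng: random.Random) -> float:
    """Round x to a multiple of `step` (hypothetical uniform grid).

    Rounds up with probability equal to the fractional distance from the
    lower grid point, which makes the result unbiased in expectation.
    """
    scaled = x / step
    lower = math.floor(scaled)
    frac = scaled - lower                      # in [0, 1)
    rounded = lower + (1 if rng.random() < frac else 0)
    return rounded * step

# Empirical check: the average of many stochastic roundings approaches x.
rng = random.Random(0)
x, step = 0.3, 0.25
samples = [stochastic_round(x, step, rng) for _ in range(100_000)]
mean = sum(samples) / len(samples)
print(f"input={x}, mean of rounded samples={mean:.4f}")
```

Round-to-nearest would map 0.3 to 0.25 every time, a systematic error of 0.05. The stochastic variant returns 0.25 or 0.5 and is correct on average.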

Title: EasySteer: A Unified Framework for High-Performance and Extensible LLM Steering

Authors: Haolei Xu, Xinyu Mei, Yuchen Yan, Rui Zhou, Wenqi Zhang, Weiming Lu, Yueting Zhuang, Yongliang Shen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.25175
Pdf URL: https://arxiv.org/pdf/2509.25175
Copy Paste: [[2509.25175]] EasySteer: A Unified Framework for High-Performance and Extensible LLM Steering(https://arxiv.org/abs/2509.25175)
Keywords: language model, llm, hallucination
Abstract: Large language model (LLM) steering has emerged as a promising paradigm for controlling model behavior at inference time through targeted manipulation of hidden states, offering a lightweight alternative to expensive retraining. However, existing steering frameworks suffer from critical limitations: computational inefficiency, limited extensibility, and restricted functionality that hinder both research progress and practical deployment. We present EasySteer, a unified framework for high-performance, extensible LLM steering built on vLLM. Our system features modular architecture with pluggable interfaces for both analysis-based and learning-based methods, fine-grained parameter control, pre-computed steering vectors for eight application domains, and an interactive demonstration system. Through deep integration with vLLM's optimized inference engine, EasySteer achieves 5.5-11.4$\times$ speedup over existing frameworks. Extensive experiments demonstrate its effectiveness in overthinking mitigation, hallucination reduction, and other key applications. EasySteer transforms steering from research technique to production-ready capability, establishing critical infrastructure for deployable, controllable language models.
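Summary (translated from Chinese): Large language model (LLM) steering has emerged as a promising paradigm for controlling model behavior at inference time through targeted manipulation of hidden states, providing a lightweight alternative to costly retraining. Existing steering frameworks, however, have critical limitations: computational inefficiency, limited extensibility, and restricted functionality, which hinder both research progress and practical deployment. We present EasySteer, a unified framework for high-performance, extensible LLM steering built on vLLM. Our system has a modular architecture with pluggable interfaces for both analysis-based and learning-based methods, fine-grained parameter control, pre-computed steering vectors for eight application domains, and an interactive demonstration system. Through deep integration with vLLM's optimized inference engine, EasySteer achieves a 5.5-11.4$\times$ speedup over existing frameworks. Extensive experiments show its effectiveness in overthinking mitigation, hallucination reduction, and other key applications. EasySteer turns steering from a research technique into a production-ready capability, establishing critical infrastructure for deployable, controllable language models.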
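
Note: to make "targeted manipulation of hidden states" concrete, here is a minimal sketch of one common analysis-based recipe: a steering vector computed as the difference of mean hidden states, then added to a hidden state with a scale `alpha`. The abstract does not say which recipes EasySteer implements, and none of these function names come from EasySteer or vLLM. They are hypothetical, and the hidden states are toy 2-dimensional lists.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def steering_vector(positive, negative):
    """Difference of mean hidden states between two sets of examples."""
    mean_pos = mean_vector(positive)
    mean_neg = mean_vector(negative)
    return [p - q for p, q in zip(mean_pos, mean_neg)]

def steer(hidden, vector, alpha=1.0):
    """Shift one hidden state along the steering direction, scaled by alpha."""
    return [h + alpha * v for h, v in zip(hidden, vector)]

# Toy hidden states for examples that do / do not show the target behavior.
pos = [[1.0, 2.0], [3.0, 2.0]]
neg = [[0.0, 1.0], [0.0, 3.0]]

v = steering_vector(pos, neg)
steered = steer([0.5, 0.5], v, alpha=0.5)
print(v, steered)
```

In a real system the same addition would be applied to a chosen layer's activations during the forward pass. The scale `alpha` is the kind of parameter the abstract's "fine-grained parameter control" presumably covers.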

Title: NAIPv2: Debiased Pairwise Learning for Efficient Paper Quality Estimation

Authors: Penghai Zhao, Jinyu Tian, Qinghua Xing, Xin Zhang, Zheng Li, Jianjun Qian, Ming-Ming Cheng, Xiang Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.25179
Pdf URL: https://arxiv.org/pdf/2509.25179
Copy Paste: [[2509.25179]] NAIPv2: Debiased Pairwise Learning for Efficient Paper Quality Estimation(https://arxiv.org/abs/2509.25179)
Keywords: llm
Abstract: The ability to estimate the quality of scientific papers is central to how both humans and AI systems will advance scientific knowledge in the future. However, existing LLM-based estimation methods suffer from high inference cost, whereas the faster direct score regression approach is limited by scale inconsistencies. We present NAIPv2, a debiased and efficient framework for paper quality estimation. NAIPv2 employs pairwise learning within domain-year groups to reduce inconsistencies in reviewer ratings and introduces the Review Tendency Signal (RTS) as a probabilistic integration of reviewer scores and confidences. To support training and evaluation, we further construct NAIDv2, a large-scale dataset of 24,276 ICLR submissions enriched with metadata and detailed structured content. Trained on pairwise comparisons but enabling efficient pointwise prediction at deployment, NAIPv2 achieves state-of-the-art performance (78.2% AUC, 0.432 Spearman), while maintaining scalable, linear-time efficiency at inference. Notably, on unseen NeurIPS submissions, it further demonstrates strong generalization, with predicted scores increasing consistently across decision categories from Rejected to Oral. These findings establish NAIPv2 as a debiased and scalable framework for automated paper quality estimation, marking a step toward future scientific intelligence systems. Code and dataset are released at this https URL.
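Summary (translated from Chinese): The ability to estimate the quality of scientific papers is central to how both humans and AI systems will advance scientific knowledge in the future. However, existing LLM-based estimation methods have high inference cost, while the faster direct score regression approach is limited by scale inconsistencies. We present NAIPv2, a debiased and efficient framework for paper quality estimation. NAIPv2 uses pairwise learning within domain-year groups to reduce inconsistencies in reviewer ratings, and introduces the Review Tendency Signal (RTS) as a probabilistic integration of reviewer scores and confidences. To support training and evaluation, we further construct NAIDv2, a large-scale dataset of 24,276 ICLR submissions enriched with metadata and detailed structured content. Trained on pairwise comparisons but enabling efficient pointwise prediction at deployment, NAIPv2 achieves state-of-the-art performance (78.2% AUC, 0.432 Spearman) while maintaining scalable, linear-time efficiency at inference. Notably, on unseen NeurIPS submissions it further shows strong generalization, with predicted scores increasing consistently across decision categories from Rejected to Oral. These findings establish NAIPv2 as a debiased and scalable framework for automated paper quality estimation, marking a step toward future scientific intelligence systems. Code and dataset are released at this https URL.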
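
Note: "trained on pairwise comparisons but enabling efficient pointwise prediction" can be read as follows: a scorer assigns each paper a single number, and training only penalizes the ordering of two scores. The sketch below uses a generic RankNet-style logistic loss as a stand-in. The abstract does not state NAIPv2's actual objective, so the function and its arguments are assumptions.

```python
import math

def pairwise_logistic_loss(score_a: float, score_b: float, a_is_better: bool) -> float:
    """-log(sigmoid(margin)), where margin is the better paper's score minus
    the other paper's score. The loss is small when the order is right."""
    margin = score_a - score_b if a_is_better else score_b - score_a
    return math.log1p(math.exp(-margin))

# Equal scores carry no ordering information: the loss equals log(2).
tie = pairwise_logistic_loss(0.0, 0.0, True)

# Correctly ordered scores are penalized less than inverted ones.
correct = pairwise_logistic_loss(2.0, 0.0, True)
inverted = pairwise_logistic_loss(0.0, 2.0, True)
print(tie, correct, inverted)
```

Because the loss depends only on score differences within a pair, a constant offset shared by both scores cancels out. At deployment each paper still needs only one forward pass to get its score, which is consistent with the linear-time inference the abstract claims.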

Title: Incentive-Aligned Multi-Source LLM Summaries

Authors: Yanchen Jiang, Zhe Feng, Aranyak Mehta
Subjects: cs.CL, cs.AI, cs.GT
Abstract URL: https://arxiv.org/abs/2509.25184
Pdf URL: https://arxiv.org/pdf/2509.25184
Copy Paste: [[2509.25184]] Incentive-Aligned Multi-Source LLM Summaries(https://arxiv.org/abs/2509.25184)
Keywords: language model, llm
Abstract: Large language models (LLMs) are increasingly used in modern search and answer systems to synthesize multiple, sometimes conflicting, texts into a single response, yet current pipelines offer weak incentives for sources to be accurate and are vulnerable to adversarial content. We introduce Truthful Text Summarization (TTS), an incentive-aligned framework that improves factual robustness without ground-truth labels. TTS (i) decomposes a draft synthesis into atomic claims, (ii) elicits each source's stance on every claim, (iii) scores sources with an adapted multi-task peer-prediction mechanism that rewards informative agreement, and (iv) filters unreliable sources before re-summarizing. We establish formal guarantees that align a source's incentives with informative honesty, making truthful reporting the utility-maximizing strategy. Experiments show that TTS improves factual accuracy and robustness while preserving fluency, aligning exposure with informative corroboration and disincentivizing manipulation.
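Summary (translated from Chinese): Large language models (LLMs) are increasingly used in modern search and answer systems to synthesize multiple, sometimes conflicting, texts into a single response, yet current pipelines give sources only weak incentives to be accurate and are vulnerable to adversarial content. We introduce Truthful Text Summarization (TTS), an incentive-aligned framework that improves factual robustness without ground-truth labels. TTS (i) decomposes a draft synthesis into atomic claims, (ii) elicits each source's stance on every claim, (iii) scores sources with an adapted multi-task peer-prediction mechanism that rewards informative agreement, and (iv) filters out unreliable sources before re-summarizing. We establish formal guarantees that align a source's incentives with informative honesty, making truthful reporting the utility-maximizing strategy. Experiments show that TTS improves factual accuracy and robustness while preserving fluency, aligning exposure with informative corroboration and discouraging manipulation.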

Title: Learning to Parallel: Accelerating Diffusion Large Language Models via Adaptive Parallel Decoding

Authors: Wenrui Bao, Zhiben Chen, Dan Xu, Yuzhang Shang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.25188
Pdf URL: https://arxiv.org/pdf/2509.25188
Copy Paste: [[2509.25188]] Learning to Parallel: Accelerating Diffusion Large Language Models via Adaptive Parallel Decoding(https://arxiv.org/abs/2509.25188)
Keywords: language model, llm
Abstract: Autoregressive decoding in large language models (LLMs) requires $\mathcal{O}(n)$ sequential steps for $n$ tokens, fundamentally limiting inference throughput. Recent diffusion-based LLMs (dLLMs) enable parallel token generation through iterative denoising. However, current parallel decoding strategies rely on fixed, input-agnostic heuristics (e.g., confidence thresholds), which fail to adapt to input-specific characteristics, resulting in suboptimal speed-quality trade-offs across diverse NLP tasks. In this work, we explore a more flexible and dynamic approach to parallel decoding. We propose Learning to Parallel Decode (Learn2PD), a framework that trains a lightweight and adaptive filter model to predict, for each token position, whether the current prediction matches the final output. This learned filter approximates an oracle parallel decoding strategy that unmasks tokens only when correctly predicted. Importantly, the filter model is learned in a post-training manner, requiring only a small amount of computation to optimize it (minute-level GPU time). Additionally, we introduce End-of-Text Prediction (EoTP) to detect decoding completion at the end of sequence, avoiding redundant decoding of padding tokens. Experiments on the LLaDA benchmark demonstrate that our method achieves up to 22.58$\times$ speedup without any performance drop, and up to 57.51$\times$ when combined with KV-Cache.
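Summary (translated from Chinese): Autoregressive decoding in large language models (LLMs) requires $\mathcal{O}(n)$ sequential steps for $n$ tokens, fundamentally limiting inference throughput. Recent diffusion-based LLMs (dLLMs) enable parallel token generation through iterative denoising. However, current parallel decoding strategies rely on fixed, input-agnostic heuristics (e.g., confidence thresholds) that cannot adapt to input-specific characteristics, leading to suboptimal speed-quality trade-offs across diverse NLP tasks. In this work, we explore a more flexible and dynamic approach to parallel decoding. We propose Learning to Parallel Decode (Learn2PD), a framework that trains a lightweight, adaptive filter model to predict, for each token position, whether the current prediction matches the final output. This learned filter approximates an oracle parallel decoding strategy that unmasks tokens only when they are correctly predicted. Importantly, the filter model is learned in a post-training manner and needs only a small amount of computation to optimize (minute-level GPU time). In addition, we introduce End-of-Text Prediction (EoTP) to detect decoding completion at the end of the sequence, avoiding redundant decoding of padding tokens. Experiments on the LLaDA benchmark show that our method achieves up to a 22.58$\times$ speedup without any performance drop, and up to 57.51$\times$ when combined with KV-Cache.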
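
Note: the contrast the abstract draws, between a fixed confidence threshold and an oracle that unmasks a token only when it is already correct, can be sketched on toy data. Everything here is hypothetical: the function names, the fallback rule in `threshold_unmask`, and the example tokens are illustrative assumptions, not Learn2PD's implementation.

```python
def threshold_unmask(confidences, masked, threshold):
    """Fixed heuristic: unmask every still-masked position whose confidence
    reaches the threshold."""
    chosen = [i for i in masked if confidences[i] >= threshold]
    if not chosen and masked:
        # Assumed fallback so each step makes progress: take the most confident position.
        chosen = [max(masked, key=lambda i: confidences[i])]
    return chosen

def oracle_unmask(predictions, final_tokens, masked):
    """Oracle rule: unmask exactly the positions whose current prediction
    already equals the final output (needs the final output, so it is an
    upper bound, not a usable decoder)."""
    return [i for i in masked if predictions[i] == final_tokens[i]]

# One denoising step over four masked positions (toy values).
confidences = [0.95, 0.40, 0.90, 0.60]
predictions = ["the", "cat", "sat", "up"]
final_tokens = ["the", "cat", "sat", "down"]
masked = [0, 1, 2, 3]

by_threshold = threshold_unmask(confidences, masked, threshold=0.9)
by_oracle = oracle_unmask(predictions, final_tokens, masked)
print(by_threshold, by_oracle)
```

Position 1 is predicted correctly but with low confidence, so the threshold rule leaves it masked for another step while the oracle would release it. According to the abstract, the learned filter is trained to approximate the oracle's decision.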

Title: InfoAgent: Advancing Autonomous Information-Seeking Agents

Authors: Gongrui Zhang, Jialiang Zhu, Ruiqi Yang, Kai Qiu, Miaosen Zhang, Zhirong Wu, Qi Dai, Bei Liu, Chong Luo, Zhengyuan Yang, Linjie Li, Lijuan Wang, Weizhu Chen, Yuan Zhang, Xin Li, Zhaoyi Liu, Xin Geng, Baining Guo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.25189
Pdf URL: https://arxiv.org/pdf/2509.25189
Copy Paste: [[2509.25189]] InfoAgent: Advancing Autonomous Information-Seeking Agents(https://arxiv.org/abs/2509.25189)
Keywords: language model, agent
Abstract: Building Large Language Model agents that expand their capabilities by interacting with external tools represents a new frontier in AI research and applications. In this paper, we introduce InfoAgent, a deep research agent powered by an innovative data synthesis pipeline and orchestrated web search tools. To construct challenging, hard-to-find queries, we build entity trees and apply sub-tree sampling with entity fuzzification to systematically increase question difficulty. Unlike prior work that relies heavily on commercial search tools, we develop a dedicated self-hosted search infrastructure, enhancing transparency of agent environments and facilitating further advancement of agent capacity. We evaluate the effectiveness of our data pipeline by measuring the average number of tool calls required to correctly answer a question, and also show that our agent yields better performance when equipped with our tools. Our InfoAgent is post-trained from Qwen3-14B using a two-stage recipe: cold-start supervised finetuning to instill long-horizon search behaviors, followed by reinforcement learning which significantly improves reasoning-driven tool use. With our methods, InfoAgent achieves 15.3% accuracy on BrowseComp, 29.2% on BrowseComp-ZH, and 40.4% on Xbench-DS, outperforming prior open-source deep research agents such as WebSailor-72B and DeepDive-32B.
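Summary (translated from Chinese): Building large language model agents that expand their capabilities by interacting with external tools represents a new frontier in AI research and applications. In this paper, we introduce InfoAgent, a deep research agent powered by an innovative data synthesis pipeline and orchestrated web search tools. To construct challenging, hard-to-find queries, we build entity trees and apply sub-tree sampling with entity fuzzification to systematically increase question difficulty. Unlike prior work that relies heavily on commercial search tools, we develop a dedicated self-hosted search infrastructure, which improves the transparency of agent environments and facilitates further advances in agent capability. We evaluate the effectiveness of our data pipeline by measuring the average number of tool calls needed to answer a question correctly, and show that our agent performs better when equipped with our tools. InfoAgent is post-trained from Qwen3-14B using a two-stage recipe: cold-start supervised finetuning to instill long-horizon search behaviors, followed by reinforcement learning, which significantly improves reasoning-driven tool use. With our methods, InfoAgent achieves 15.3% accuracy on BrowseComp, 29.2% on BrowseComp-ZH, and 40.4% on Xbench-DS, outperforming prior open-source deep research agents such as WebSailor-72B and DeepDive-32B.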