2025-10-03

Title: Uncovering Implicit Bias in Large Language Models with Concept Learning Dataset

Authors: Leroy Z. Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.01219
Pdf URL: https://arxiv.org/pdf/2510.01219
Copy Paste: [[2510.01219]] Uncovering Implicit Bias in Large Language Models with Concept Learning Dataset(https://arxiv.org/abs/2510.01219)
Keywords: language model, prompt
Abstract: We introduce a dataset of concept learning tasks that helps uncover implicit biases in large language models. Using in-context concept learning experiments, we found that language models may have a bias toward upward monotonicity in quantifiers; such bias is less apparent when the model is tested by direct prompting without concept learning components. This demonstrates that in-context concept learning can be an effective way to discover hidden biases in language models.
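
Illustration (not from the paper): a quantifier Q(A, B) is upward monotone in its second argument when Q(A, B) and B ⊆ B' together imply Q(A, B'). The sketch below brute-forces that property over a small universe. The function names and the three sample quantifiers are assumptions chosen for illustration and are not taken from the paper's dataset.

```python
from itertools import combinations


def subsets(universe):
    """Yield every subset of `universe` as a frozenset."""
    items = list(universe)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def is_upward_monotone(quantifier, universe):
    """Brute-force check that Q(A, B) and B <= B2 imply Q(A, B2)."""
    all_sets = list(subsets(universe))
    for a in all_sets:
        for b in all_sets:
            if not quantifier(a, b):
                continue
            for b2 in all_sets:
                # b <= b2 is the subset test on frozensets.
                if b <= b2 and not quantifier(a, b2):
                    return False
    return True


# Hypothetical sample quantifiers, each read as "Q of the As are Bs".
def at_least_two(a, b):
    return len(a & b) >= 2


def at_most_one(a, b):
    return len(a & b) <= 1


def exactly_one(a, b):
    return len(a & b) == 1


UNIVERSE = {0, 1, 2}
```

Under this check, "at least two" is upward monotone, while "at most one" and "exactly one" are not, since enlarging B can push the overlap with A past one element.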

Title: Towards Open-Ended Discovery for Low-Resource NLP

Authors: Bonaventure F. P. Dossou, Henri Aïdasso
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.01220
Pdf URL: https://arxiv.org/pdf/2510.01220
Copy Paste: [[2510.01220]] Towards Open-Ended Discovery for Low-Resource NLP(https://arxiv.org/abs/2510.01220)
Keywords: language model
Abstract: Natural Language Processing (NLP) for low-resource languages remains fundamentally constrained by the lack of textual corpora, standardized orthographies, and scalable annotation pipelines. While recent advances in large language models have improved cross-lingual transfer, they remain inaccessible to underrepresented communities due to their reliance on massive, pre-collected data and centralized infrastructure. In this position paper, we argue for a paradigm shift toward open-ended, interactive language discovery, where AI systems learn new languages dynamically through dialogue rather than static datasets. We contend that the future of language technology, particularly for low-resource and under-documented languages, must move beyond static data collection pipelines toward interactive, uncertainty-driven discovery, where learning emerges dynamically from human-machine collaboration instead of being limited to pre-existing datasets. We propose a framework grounded in joint human-machine uncertainty, combining epistemic uncertainty from the model with hesitation cues and confidence signals from human speakers to guide interaction, query selection, and memory retention. This paper is a call to action: we advocate a rethinking of how AI engages with human knowledge in under-documented languages, moving from extractive data collection toward participatory, co-adaptive learning processes that respect and empower communities while discovering and preserving the world's linguistic diversity. This vision aligns with principles of human-centered AI, emphasizing interactive, cooperative model building between AI systems and speakers.

Title: Discourse vs emissions: Analysis of corporate narratives, symbolic practices, and mimicry through LLMs

Authors: Bertrand Kian Hassani, Yacoub Bahini, Rizwan Mushtaq
Subjects: cs.CL, cs.AI, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2510.01222
Pdf URL: https://arxiv.org/pdf/2510.01222
Copy Paste: [[2510.01222]] Discourse vs emissions: Analysis of corporate narratives, symbolic practices, and mimicry through LLMs(https://arxiv.org/abs/2510.01222)
Keywords: language model, llm
Abstract: Climate change has increased demands for transparent and comparable corporate climate disclosures, yet imitation and symbolic reporting often undermine their value. This paper develops a multidimensional framework to assess disclosure maturity among 828 this http URL firms using large language models (LLMs) fine-tuned for climate communication. Four classifiers (sentiment, commitment, specificity, and target ambition) extract narrative indicators from sustainability and annual reports, which are linked to firm attributes such as emissions, market capitalization, and sector. Analyses reveal three insights: (1) risk-focused narratives often align with explicit commitments, but quantitative targets (e.g., net-zero pledges) remain decoupled from tone; (2) larger and higher-emitting firms disclose more commitments and actions than peers, though inconsistently with quantitative targets; and (3) widespread similarity in disclosure styles suggests mimetic behavior, reducing differentiation and decision usefulness. These results highlight the value of LLMs for ESG narrative analysis and the need for stronger regulation to connect commitments with verifiable transition strategies.

Title: Context Matters: Comparison of commercial large language tools in veterinary medicine

Authors: Tyler J Poore, Christopher J Pinard, Aleena Shabbir, Andrew Lagree, Andre Telfer, Kuan-Chuen Wu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.01224
Pdf URL: https://arxiv.org/pdf/2510.01224
Copy Paste: [[2510.01224]] Context Matters: Comparison of commercial large language tools in veterinary medicine(https://arxiv.org/abs/2510.01224)
Keywords: language model, llm
Abstract: Large language models (LLMs) are increasingly used in clinical settings, yet their performance in veterinary medicine remains underexplored. We evaluated three commercially available veterinary-focused LLM summarization tools (Product 1 [Hachiko] and Products 2 and 3) on a standardized dataset of veterinary oncology records. Using a rubric-guided LLM-as-a-judge framework, summaries were scored across five domains: Factual Accuracy, Completeness, Chronological Order, Clinical Relevance, and Organization. Product 1 achieved the highest overall performance, with a median average score of 4.61 (IQR: 0.73), compared to 2.55 (IQR: 0.78) for Product 2 and 2.45 (IQR: 0.92) for Product 3. It also received perfect median scores in Factual Accuracy and Chronological Order. To assess the internal consistency of the grading framework itself, we repeated the evaluation across three independent runs. The LLM grader demonstrated high reproducibility, with Average Score standard deviations of 0.015 (Product 1), 0.088 (Product 2), and 0.034 (Product 3). These findings highlight the importance of veterinary-specific commercial LLM tools and demonstrate that LLM-as-a-judge evaluation is a scalable and reproducible method for assessing clinical NLP summarization in veterinary medicine.

Title: ClaimCheck: Real-Time Fact-Checking with Small Language Models

Authors: Akshith Reddy Putta, Jacob Devasier, Chengkai Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.01226
Pdf URL: https://arxiv.org/pdf/2510.01226
Copy Paste: [[2510.01226]] ClaimCheck: Real-Time Fact-Checking with Small Language Models(https://arxiv.org/abs/2510.01226)
Keywords: language model, gpt, llm, prompt
Abstract: We introduce ClaimCheck, an LLM-guided automatic fact-checking system designed to verify real-world claims using live Web evidence and small language models. Unlike prior systems that rely on large, closed-source models and static knowledge stores, ClaimCheck employs a transparent, stepwise verification pipeline that mirrors human fact-checking workflows consisting of Web search query planning, Web-based evidence retrieval and summarization, evidence synthesis and re-retrieval, and claim verdict evaluation. Each module is optimized for small LLMs, allowing the system to deliver accurate and interpretable fact-checking with significantly lower computational requirements. Despite using a much smaller Qwen3-4B model, ClaimCheck achieves state-of-the-art accuracy of 76.4% on the AVeriTeC dataset, outperforming previous approaches using LLaMA3.1 70B and GPT-4o. Extensive ablations demonstrate that careful modular design and prompting strategies can overcome the limitations of smaller LLMs. To promote accessibility and transparency, we provide a public demo at this https URL.

Title: EEFSUVA: A New Mathematical Olympiad Benchmark

Authors: Nicole N Khatibi, Daniil A. Radamovich, Michael P. Brenner
Subjects: cs.CL, math.HO
Abstract URL: https://arxiv.org/abs/2510.01227
Pdf URL: https://arxiv.org/pdf/2510.01227
Copy Paste: [[2510.01227]] EEFSUVA: A New Mathematical Olympiad Benchmark(https://arxiv.org/abs/2510.01227)
Keywords: language model, llm
Abstract: Recent breakthroughs have spurred claims that large language models (LLMs) match gold-medal Olympiad to graduate-level proficiency on mathematics benchmarks. In this work, we examine these claims in detail and assess the extent to which current benchmarks capture genuine LLM mathematical reasoning. The composition of these benchmarks, primarily drawing from the International Mathematics Olympiad (IMO) and related competitions, may overstate models' reasoning ability due to potential data contamination and a narrow focus on familiar problem types. To enable a more holistic assessment of mathematical understanding, we introduce EEFSUVA, a novel benchmark curated from under-circulated regional and national Olympiads of Eastern Europe and the countries of the former Soviet Union. These contests feature problems of comparable difficulty to the IMO and are renowned for demanding nonstandard problem-solving techniques, yet their problems are far less prevalent in online corpora. Preliminary results suggest that even state-of-the-art LLMs exhibit a notable performance decline on EEFSUVA relative to other Olympiad-style benchmarks. These findings also suggest the potential importance of broader evaluation datasets for a fuller assessment of mathematical reasoning and for guiding future model development.

Title: Who is In Charge? Dissecting Role Conflicts in Instruction Following

Authors: Siqi Zeng
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2510.01228
Pdf URL: https://arxiv.org/pdf/2510.01228
Copy Paste: [[2510.01228]] Who is In Charge? Dissecting Role Conflicts in Instruction Following(https://arxiv.org/abs/2510.01228)
Keywords: language model, prompt
Abstract: Large language models should follow hierarchical instructions where system prompts override user inputs, yet recent work shows they often ignore this rule while strongly obeying social cues such as authority or consensus. We extend these behavioral findings with mechanistic interpretations on a large-scale dataset. Linear probing shows conflict-decision signals are encoded early, with system-user and social conflicts forming distinct subspaces. Direct Logit Attribution reveals stronger internal conflict detection in system-user cases but consistent resolution only for social cues. Steering experiments show that, despite using social cues, the vectors surprisingly amplify instruction following in a role-agnostic way. Together, these results explain fragile system obedience and underscore the need for lightweight hierarchy-sensitive alignment methods.

Title: Enhancing Transformer-Based Rerankers with Synthetic Data and LLM-Based Supervision

Authors: Dimitar Peshevski, Kiril Blazhevski, Martin Popovski, Gjorgji Madjarov
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.01229
Pdf URL: https://arxiv.org/pdf/2510.01229
Copy Paste: [[2510.01229]] Enhancing Transformer-Based Rerankers with Synthetic Data and LLM-Based Supervision(https://arxiv.org/abs/2510.01229)
Keywords: language model, llm
Abstract: Effective document reranking is essential for improving search relevance across diverse applications. While Large Language Models (LLMs) excel at reranking due to their deep semantic understanding and reasoning, their high computational cost makes them impractical for many real-world deployments. Fine-tuning smaller, task-specific models is a more efficient alternative but typically depends on scarce, manually labeled data. To overcome this, we propose a novel pipeline that eliminates the need for human-labeled query-document pairs. Our method uses LLMs to generate synthetic queries from domain-specific corpora and employs an LLM-based classifier to label positive and hard-negative pairs. This synthetic dataset is then used to fine-tune a smaller transformer model with contrastive learning using Localized Contrastive Estimation (LCE) loss. Experiments on the MedQuAD dataset show that our approach significantly boosts in-domain performance and generalizes well to out-of-domain tasks. By using LLMs for data generation and supervision rather than inference, we reduce computational costs while maintaining strong reranking capabilities.
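
Illustration (not from the paper): Localized Contrastive Estimation is commonly written as a softmax cross-entropy over a group that holds one positive document and several hard negatives for the same query. The sketch below computes that loss from raw reranker scores. The function names and the example scores are assumptions, and the paper's actual training code, batching, and negative sampling may differ.

```python
import math


def lce_loss(positive_score, negative_scores):
    """Negative log-softmax of the positive score within its group.

    positive_score: reranker logit for the (query, positive document) pair.
    negative_scores: logits for (query, hard negative) pairs of the same query.
    """
    scores = [positive_score] + list(negative_scores)
    # Log-sum-exp with max subtraction for numerical stability.
    top = max(scores)
    log_sum_exp = top + math.log(sum(math.exp(s - top) for s in scores))
    return log_sum_exp - positive_score


def batch_lce_loss(groups):
    """Average the per-query loss over (positive_score, negative_scores) groups."""
    return sum(lce_loss(pos, negs) for pos, negs in groups) / len(groups)
```

The loss shrinks as the positive score rises above the hard negatives in its own group, which is what pushes the fine-tuned reranker to separate near-miss documents from relevant ones.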

Title: Trustworthy Summarization via Uncertainty Quantification and Risk Awareness in Large Language Models

Authors: Shuaidong Pan, Di Wu
Subjects: cs.CL, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2510.01231
Pdf URL: https://arxiv.org/pdf/2510.01231
Copy Paste: [[2510.01231]] Trustworthy Summarization via Uncertainty Quantification and Risk Awareness in Large Language Models(https://arxiv.org/abs/2510.01231)
Keywords: language model, prompt
Abstract: This study addresses the reliability of automatic summarization in high-risk scenarios and proposes a large language model framework that integrates uncertainty quantification and risk-aware mechanisms. Starting from the demands of information overload and high-risk decision-making, a conditional generation-based summarization model is constructed, and Bayesian inference is introduced during generation to model uncertainty in the parameter space, which helps avoid overconfident predictions. The uncertainty level of the generated content is measured using predictive distribution entropy, and a joint optimization of entropy regularization and risk-aware loss is applied to ensure that key information is preserved and risk attributes are explicitly expressed during information compression. On this basis, the model incorporates risk scoring and regulation modules, allowing summaries to cover the core content accurately while enhancing trustworthiness through explicit risk-level prompts. Comparative experiments and sensitivity analyses verify that the proposed method significantly improves the robustness and reliability of summarization in high-risk applications while maintaining fluency and semantic integrity. This research provides a systematic solution for trustworthy summarization and demonstrates both scalability and practical value at the methodological level.
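
Illustration (not from the paper): predictive distribution entropy can be sketched as the Shannon entropy of a next-token distribution averaged over several sampled parameter settings. The averaging step stands in for Bayesian inference over parameters and is an assumption here, as are the function names; the abstract does not specify the approximation the authors use.

```python
import math


def mean_predictive_distribution(samples):
    """Average several probability vectors, one per sampled parameter setting."""
    count = len(samples)
    width = len(samples[0])
    return [sum(sample[i] for sample in samples) / count for i in range(width)]


def predictive_entropy(probs):
    """Shannon entropy in nats; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)
```

A distribution concentrated on one outcome scores zero, and a uniform distribution over k outcomes scores log k, so higher values flag generated content the model is less certain about.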

Title: Benchmark Profiling: Mechanistic Diagnosis of LLM Benchmarks

Authors: Dongjun Kim, Gyuho Shim, Yongchan Chun, Minhyuk Kim, Chanjun Park, Heuiseok Lim
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.01232
Pdf URL: https://arxiv.org/pdf/2510.01232
Copy Paste: [[2510.01232]] Benchmark Profiling: Mechanistic Diagnosis of LLM Benchmarks(https://arxiv.org/abs/2510.01232)
Keywords: language model, llm
Abstract: Large Language Models are commonly judged by their scores on standard benchmarks, yet such scores often overstate real capability since they mask the mix of skills a task actually demands. For example, ARC is assumed to test reasoning, while HellaSwag is designed to evaluate commonsense. However, we lack a systematic way to verify if these benchmarks actually measure these labels. We introduce Benchmark Profiling, a diagnostic framework that decomposes benchmark performance into ten cognitively grounded abilities. The method combines gradient-based importance scoring with targeted parameter ablation to compute an Ability Impact Score (AIS) that quantifies how much each ability contributes to a model's success on a given benchmark. Profiling three instruction-tuned models across ten widely used benchmarks yields four key findings: (i) most benchmarks draw on several abilities rather than one, (ii) datasets with similar labels rely on distinct ability mixtures, (iii) code-generation benchmarks reward broad, multi-skill improvement and thus show only modest gains from narrow domain-specific fine-tuning, and (iv) abilities irrelevant to the task could negatively affect performance. Benchmark Profiling therefore explains why performance gains do not always translate into user-perceived competence and offers a transparent tool for benchmark audit and model interpretability.

Title: LLMRank: Understanding LLM Strengths for Model Routing

Authors: Shubham Agrawal, Prasang Gupta
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.01234
Pdf URL: https://arxiv.org/pdf/2510.01234
Copy Paste: [[2510.01234]] LLMRank: Understanding LLM Strengths for Model Routing(https://arxiv.org/abs/2510.01234)
Keywords: language model, llm, prompt
Abstract: The rapid growth of large language models (LLMs) with diverse capabilities, latency and computational costs presents a critical deployment challenge: selecting the most suitable model for each prompt to optimize the trade-off between performance and efficiency. We introduce LLMRank, a prompt-aware routing framework that leverages rich, human-readable features extracted from prompts, including task type, reasoning patterns, complexity indicators, syntactic cues, and signals from a lightweight proxy solver. Unlike prior one-shot routers that rely solely on latent embeddings, LLMRank predicts per-model utility using a neural ranking model trained on RouterBench, comprising 36,497 prompts spanning 11 benchmarks and 11 state-of-the-art LLMs, from small efficient models to large frontier systems. Our approach achieves up to 89.2% of oracle utility, while providing interpretable feature attributions that explain routing decisions. Extensive studies demonstrate the importance of multifaceted feature extraction and the hybrid ranking objective, highlighting the potential of feature-driven routing for efficient and transparent LLM deployment.

Title: GRPO++: Enhancing Dermatological Reasoning under Low Resource Settings

Authors: Ismam Nur Swapnil, Aranya Saha, Tanvir Ahmed Khan, Mohammad Ariful Haque
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2510.01236
Pdf URL: https://arxiv.org/pdf/2510.01236
Copy Paste: [[2510.01236]] GRPO++: Enhancing Dermatological Reasoning under Low Resource Settings(https://arxiv.org/abs/2510.01236)
Keywords: language model
Abstract: Vision-Language Models (VLMs) show promise in medical image analysis, yet their capacity for structured reasoning in complex domains like dermatology is often limited by data scarcity and the high computational cost of advanced training techniques. To address these challenges, we introduce DermIQ-VLM, a VLM developed through a multi-stage, resource-efficient methodology designed to emulate a dermatologist's diagnostic process. Our primary contribution is a modified version of Grouped Relative Policy Optimization (GRPO), called GRPO++, which stabilizes the powerful but data-intensive GRPO framework. Our proposed training pipeline first employs GRPO++ for reasoning-oriented disease recognition, followed by supervised fine-tuning for conversational ability. To mitigate factual errors introduced during this step, we then align the model using Direct Preference Optimization (DPO), leveraging a Knowledge Graph-based system as a scalable proxy for expert preference. A preliminary evaluation on a curated dermatological dataset demonstrates that our proposed methodology yields notable performance gains over standard fine-tuning approaches. These findings validate the potential of our pipeline as a feasible pathway for developing specialized, reliable VLMs in resource-constrained environments.
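
Illustration (not from the paper): standard GRPO scores each sampled response relative to the other responses drawn for the same prompt, by normalizing its reward with the group mean and standard deviation. The sketch below shows only that baseline step. The function name, the `eps` stabilizer, and the choice of population standard deviation are assumptions, and the GRPO++ modifications are not described in the abstract and are not reproduced here.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses to the same prompt.

    rewards: one scalar reward per sampled response.
    eps: small constant (assumed) that avoids division by zero
         when every response in the group gets the same reward.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(reward - mean) / (std + eps) for reward in rewards]
```

When every response in a group earns the same reward, all advantages are zero and the group contributes no learning signal.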

Title: Confidence-Aware Routing for Large Language Model Reliability Enhancement: A Multi-Signal Approach to Pre-Generation Hallucination Mitigation

Authors: Nandakishor M
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.01237
Pdf URL: https://arxiv.org/pdf/2510.01237
Copy Paste: [[2510.01237]] Confidence-Aware Routing for Large Language Model Reliability Enhancement: A Multi-Signal Approach to Pre-Generation Hallucination Mitigation(https://arxiv.org/abs/2510.01237)
Keywords: language model, llm, hallucination, retrieval-augmented generation
Abstract: Large Language Models suffer from hallucination, generating plausible yet factually incorrect content. Current mitigation strategies focus on post-generation correction, which is computationally expensive and fails to prevent unreliable content generation. We propose a confidence-aware routing system that proactively assesses model uncertainty before generation and redirects queries based on estimated reliability. Our approach combines three complementary signals: semantic alignment between internal representations and reference embeddings, internal convergence analysis across model layers, and learned confidence estimation. The unified confidence score determines routing to four pathways: local generation for high confidence, retrieval-augmented generation for medium confidence, larger models for low confidence, and human review for very low confidence. Evaluation on knowledge-intensive QA benchmarks demonstrates significant improvements in hallucination detection (0.74 vs. 0.42 baseline) while reducing computational costs by 40% compared to post-hoc methods. The F1 score improves from 0.61 to 0.82 with low false positive rates (0.09). This paradigm shift from reactive correction to proactive assessment offers a computationally efficient approach to LLM reliability enhancement.
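
Illustration (not from the paper): the four-pathway routing can be sketched as a weighted combination of the three signals followed by threshold checks. The weights, the thresholds, and all names below are hypothetical placeholders, since the abstract reports neither the combination rule nor the cutoffs.

```python
def unified_confidence(semantic_alignment, layer_convergence, learned_estimate,
                       weights=(0.4, 0.3, 0.3)):
    """Combine three signals in [0, 1] into one score (weights are assumed)."""
    signals = (semantic_alignment, layer_convergence, learned_estimate)
    return sum(w * s for w, s in zip(weights, signals))


def route(confidence, high=0.8, medium=0.6, low=0.4):
    """Map a confidence score to one of four pathways (thresholds are assumed)."""
    if confidence >= high:
        return "local_generation"
    if confidence >= medium:
        return "retrieval_augmented_generation"
    if confidence >= low:
        return "larger_model"
    return "human_review"
```

Because the score is computed before generation, a low-confidence query is redirected without first paying for an answer that would later be discarded.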

Title: Silent Tokens, Loud Effects: Padding in LLMs

Authors: Rom Himelstein, Amit LeVi, Yonatan Belinkov, Avi Mendelson
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2510.01238
Pdf URL: https://arxiv.org/pdf/2510.01238
Copy Paste: [[2510.01238]] Silent Tokens, Loud Effects: Padding in LLMs(https://arxiv.org/abs/2510.01238)
Keywords: language model, llm
Abstract: Padding tokens are widely used in large language models (LLMs) to equalize sequence lengths during batched inference. While they should be fully masked, implementation errors can cause them to influence computation, and the extent of this influence is not well understood. We systematically study this effect across three open-source model families (Llama, Gemma, Qwen), inserting controlled amounts of padding and evaluating outcomes along four axes: activations, generation quality, bias, and safety. Even small amounts of padding shift hidden representations, degrade quality in smaller models, alter bias in unpredictable ways, and weaken safety guardrails. These findings demonstrate that padding is not a harmless detail but a robustness risk that must be carefully handled in deployment.
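
Illustration (not from the paper): the general mechanism by which an unmasked padding token can leak into computation is visible in a single attention step. The toy below uses scalar values and one query. All names and numbers are assumptions, and it does not reproduce the paper's experimental setup.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(scores, values, is_padding, apply_mask=True):
    """Single-query attention over scalar values.

    With apply_mask=True, padding positions get a score of -inf and so
    receive zero attention weight. With apply_mask=False, padding
    positions take part in the softmax like real tokens.
    Assumes at least one position is not padding.
    """
    if apply_mask:
        scores = [-math.inf if pad else s for s, pad in zip(scores, is_padding)]
    weights = softmax(scores)
    return sum(w * v for w, v in zip(weights, values))


# Two real tokens followed by one padding token (toy numbers).
SCORES = [1.0, 2.0, 0.5]
VALUES = [1.0, 3.0, 10.0]
IS_PADDING = [False, False, True]
```

With the mask applied, the output matches attention over the two real tokens alone; without it, the padding token's value shifts the result.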

Title: CIFLEX: Contextual Instruction Flow for Sub-task Execution in Multi-Turn Interactions with a Single On-Device LLM

Authors: Juntae Lee, Jihwan Bang, Seunghan Yang, Simyung Chang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.01239
Pdf URL: https://arxiv.org/pdf/2510.01239
Copy Paste: [[2510.01239]] CIFLEX: Contextual Instruction Flow for Sub-task Execution in Multi-Turn Interactions with a Single On-Device LLM(https://arxiv.org/abs/2510.01239)
Keywords: language model, llm
Abstract: We present CIFLEX (Contextual Instruction Flow for Sub-task Execution), which is a novel execution system for efficient sub-task handling in multi-turn interactions with a single on-device large language model (LLM). As LLMs become increasingly capable, a single model is expected to handle diverse sub-tasks that more effectively and comprehensively support answering user requests. A naive approach reprocesses the entire conversation context when switching between main and sub-tasks (e.g., query rewriting, summarization), incurring significant computational overhead. CIFLEX mitigates this overhead by reusing the key-value (KV) cache from the main task and injecting only task-specific instructions into isolated side paths. After sub-task execution, the model rolls back to the main path via cached context, thereby avoiding redundant prefill computation. To support sub-task selection, we also develop a hierarchical classification strategy tailored for small-scale models, decomposing multi-choice decisions into binary ones. Experiments show that CIFLEX significantly reduces computational costs without degrading task performance, enabling scalable and efficient multi-task dialogue on-device.
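
Illustration (not from the paper): the saving from reusing the main-task KV cache can be sketched with a toy cache that counts how many tokens go through prefill. The class, its methods, and the token counts are hypothetical; a real KV cache stores per-layer key and value tensors, and the paper's side-path mechanism is not specified in the abstract.

```python
class ToyKVCache:
    """Toy stand-in for a KV cache that tracks prefill work."""

    def __init__(self):
        self.entries = []          # one placeholder entry per cached token
        self.tokens_processed = 0  # total tokens pushed through prefill

    def prefill(self, tokens):
        """Process tokens and append their (placeholder) KV entries."""
        self.entries.extend(tokens)
        self.tokens_processed += len(tokens)

    def checkpoint(self):
        """Remember the current cache length so a side path can be undone."""
        return len(self.entries)

    def rollback(self, mark):
        """Drop everything added after the checkpoint."""
        del self.entries[mark:]


def naive_cost(main_len, instruction_len):
    """Re-run prefill over the whole context for the sub-task."""
    main = ToyKVCache()
    main.prefill(["ctx"] * main_len)
    sub = ToyKVCache()
    sub.prefill(["ctx"] * main_len + ["instr"] * instruction_len)
    return main.tokens_processed + sub.tokens_processed


def reuse_cost(main_len, instruction_len):
    """Reuse the main cache, add only the instruction, then roll back."""
    cache = ToyKVCache()
    cache.prefill(["ctx"] * main_len)
    mark = cache.checkpoint()
    cache.prefill(["instr"] * instruction_len)
    cache.rollback(mark)
    return cache.tokens_processed
```

For a 100-token conversation and a 5-token sub-task instruction, the naive path processes 205 tokens while the reuse path processes 105, and the cache ends in its original main-task state.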

Title: SKYLENAGE Technical Report: Mathematical Reasoning and Contest-Innovation Benchmarks for Multi-Level Math Evaluation

Authors: Hu Wei, Ze Xu, Boyu Yang, Linlin Miao, Weiqi Zhai, Yihan Li, Zixuan Li, Zhijun Wang, Boya Wang, Jianwei Yu, Jialing Yuan, Xiaoyue Zhang, Cheng He, Minglei Chen, Zifan Zhang, Qianhui Li, Wei Wang, Xiang Xu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.01241
Pdf URL: https://arxiv.org/pdf/2510.01241
Copy Paste: [[2510.01241]] SKYLENAGE Technical Report: Mathematical Reasoning and Contest-Innovation Benchmarks for Multi-Level Math Evaluation(https://arxiv.org/abs/2510.01241)
Keywords: language model, llm
Abstract: Large language models (LLMs) now perform strongly on many public math suites, yet frontier separation within mathematics increasingly suffers from ceiling effects. We present two complementary benchmarks: SKYLENAGE-ReasoningMATH, a 100-item, structure-aware diagnostic set with per-item metadata on length, numeric density, and symbolic complexity; and SKYLENAGE-MATH, a 150-item contest-style suite spanning four stages from high school to doctoral under a seven-subject taxonomy. We evaluate fifteen contemporary LLM variants under a single setup and analyze subject × model and grade × model performance. On the contest suite, the strongest model reaches 44% while the runner-up reaches 37%; accuracy declines from high school to doctoral, and top systems exhibit a doctoral-to-high-school retention near 79%. On the reasoning set, the best model attains 81% overall, and hardest-slice results reveal clear robustness gaps between leaders and the mid-tier. In summary, we release SKYLENAGE-ReasoningMATH and report aggregate results for SKYLENAGE-MATH; together, SKYLENAGE provides a hard, reasoning-centered and broadly covering math benchmark with calibrated difficulty and rich metadata, serving as a reference benchmark for future evaluations of mathematical reasoning.

Title: Redundancy-as-Masking: Formalizing the Artificial Age Score (AAS) to Model Memory Aging in Generative AI

Authors: Seyma Yaman Kayadibi
Subjects: cs.CL, cs.AI, cs.IT, cs.LG
Abstract URL: https://arxiv.org/abs/2510.01242
Pdf URL: https://arxiv.org/pdf/2510.01242
Copy Paste: [[2510.01242]] Redundancy-as-Masking: Formalizing the Artificial Age Score (AAS) to Model Memory Aging in Generative AI(https://arxiv.org/abs/2510.01242)
Keywords: language model, gpt, chat
Abstract: Artificial intelligence is observed to age not through chronological time but through structural asymmetries in memory performance. In large language models, semantic cues such as the name of the day often remain stable across sessions, while episodic details like the sequential progression of experiment numbers tend to collapse when conversational context is reset. To capture this phenomenon, the Artificial Age Score (AAS) is introduced as a log-scaled, entropy-informed metric of memory aging derived from observable recall behavior. The score is formally proven to be well-defined, bounded, and monotonic under mild and model-agnostic assumptions, making it applicable across various tasks and domains. In its Redundancy-as-Masking formulation, the score interprets redundancy as overlapping information that reduces the penalized mass. However, in the present study, redundancy is not explicitly estimated; all reported values assume a redundancy-neutral setting (R = 0), yielding conservative upper bounds. The AAS framework was tested over a 25-day bilingual study involving ChatGPT-5, structured into stateless and persistent interaction phases. During persistent sessions, the model consistently recalled both semantic and episodic details, driving the AAS toward its theoretical minimum, indicative of structural youth. In contrast, when sessions were reset, the model preserved semantic consistency but failed to maintain episodic continuity, causing a sharp increase in the AAS and signaling structural memory aging. These findings support the utility of AAS as a theoretically grounded, task-independent diagnostic tool for evaluating memory degradation in artificial systems. The study builds on foundational concepts from von Neumann's work on automata, Shannon's theories of information and redundancy, and Turing's behavioral approach to intelligence.

Title: Detoxifying Large Language Models via Autoregressive Reward Guided Representation Editing

Authors: Yisong Xiao, Aishan Liu, Siyuan Liang, Zonghao Ying, Xianglong Liu, Dacheng Tao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.01243
Pdf URL: https://arxiv.org/pdf/2510.01243
Copy Paste: [[2510.01243]] Detoxifying Large Language Models via Autoregressive Reward Guided Representation Editing(https://arxiv.org/abs/2510.01243)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have demonstrated impressive performance across various tasks, yet they remain vulnerable to generating toxic content, necessitating detoxification strategies to ensure safe and responsible deployment. Test-time detoxification methods, which typically introduce static or dynamic interventions into LLM representations, offer a promising solution due to their flexibility and minimal invasiveness. However, current approaches often suffer from imprecise interventions, primarily due to their insufficient exploration of the transition space between toxic and non-toxic outputs. To address this challenge, we propose Autoregressive Reward Guided Representation Editing (ARGRE), a novel test-time detoxification framework that explicitly models toxicity transitions within the latent representation space, enabling stable and precise reward-guided editing. ARGRE identifies non-toxic semantic directions and interpolates between toxic and non-toxic representations to reveal fine-grained transition trajectories. These trajectories transform sparse toxicity annotations into dense training signals, enabling the construction of an autoregressive reward model that delivers stable and precise editing guidance. At inference, the reward model guides an adaptive two-step editing process to obtain detoxified representations: it first performs directional steering based on expected reward gaps to shift representations toward non-toxic regions, followed by lightweight gradient-based refinements. Extensive experiments across 8 widely used LLMs show that ARGRE significantly outperforms leading baselines in effectiveness (-62.21% toxicity) and efficiency (-47.58% inference time), while preserving the core capabilities of the original model with minimal degradation. Our code is available at the website.
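
The interpolation step described in the abstract above can be illustrated with a small sketch. It assumes plain linear interpolation between one toxic and one non-toxic representation vector; the abstract does not state the interpolation scheme, and the function name and toy vectors below are hypothetical.

```python
def interpolate(toxic, non_toxic, steps):
    """Evenly spaced points from the toxic to the non-toxic representation.

    Assumption: linear interpolation; the paper may use a different scheme.
    """
    return [
        [t + (k / (steps - 1)) * (n - t) for t, n in zip(toxic, non_toxic)]
        for k in range(steps)
    ]

# Hypothetical 2-dimensional representations, not values from the paper.
toxic_repr = [0.0, 2.0]
non_toxic_repr = [1.0, 0.0]

# Each intermediate point could serve as one dense training signal
# along the transition from toxic to non-toxic.
trajectory = interpolate(toxic_repr, non_toxic_repr, steps=5)
for point in trajectory:
    print(point)
```

Intermediate points like these are one way sparse pairwise annotations could be densified into a trajectory, which is the role the abstract assigns to the interpolation.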

Title: Feasibility of Structuring Stress Documentation Using an Ontology-Guided Large Language Model

Authors: Hyeoneui Kim, Jeongha Kim, Huijing Xu, Jinsun Jung, Sunghoon Kang, Sun Joo Jang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.01244
Pdf URL: https://arxiv.org/pdf/2510.01244
Copy Paste: [[2510.01244]] Feasibility of Structuring Stress Documentation Using an Ontology-Guided Large Language Model(https://arxiv.org/abs/2510.01244)
Keywords: language model, llm
Abstract: Stress, arising from the dynamic interaction between external stressors, individual appraisals, and physiological or psychological responses, significantly impacts health yet is often underreported and inconsistently documented, typically captured as unstructured free-text in electronic health records. Ambient AI technologies offer promise in reducing documentation burden, but predominantly generate unstructured narratives, limiting downstream clinical utility. This study aimed to develop an ontology for mental stress and evaluate the feasibility of using a Large Language Model (LLM) to extract ontology-guided stress-related information from narrative text. The Mental Stress Ontology (MeSO) was developed by integrating theoretical models like the Transactional Model of Stress with concepts from 11 validated stress assessment tools. MeSO's structure and content were refined using Ontology Pitfall Scanner! and expert validation. Using MeSO, six categories of stress-related information--stressor, stress response, coping strategy, duration, onset, and temporal profile--were extracted from 35 Reddit posts using Claude Sonnet 4. Human reviewers evaluated accuracy and ontology coverage. The final ontology included 181 concepts across eight top-level classes. Of 220 extractable stress-related items, the LLM correctly identified 172 (78.2%), misclassified 27 (12.3%), and missed 21 (9.5%). All correctly extracted items were accurately mapped to MeSO, although 24 relevant concepts were not yet represented in the ontology. This study demonstrates the feasibility of using an ontology-guided LLM for structured extraction of stress-related information, offering potential to enhance the consistency and utility of stress documentation in ambient AI systems. Future work should involve clinical dialogue data and comparison across LLMs.
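
The extraction rates reported in the abstract above follow directly from the stated counts. A minimal check, using only the figures given there (the variable names are illustrative):

```python
# Counts reported in the abstract for the 220 extractable items.
total = 220
counts = {"correct": 172, "misclassified": 27, "missed": 21}

# The three outcomes should partition the full set of items.
assert sum(counts.values()) == total

# Percentage of items per outcome, rounded to one decimal place.
rates = {name: round(100 * n / total, 1) for name, n in counts.items()}
print(rates)
```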

Title: SeMob: Semantic Synthesis for Dynamic Urban Mobility Prediction

Authors: Runfei Chen, Shuyang Jiang, Wei Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.01245
Pdf URL: https://arxiv.org/pdf/2510.01245
Copy Paste: [[2510.01245]] SeMob: Semantic Synthesis for Dynamic Urban Mobility Prediction(https://arxiv.org/abs/2510.01245)
Keywords: llm, agent
Abstract: Human mobility prediction is vital for urban services, but often fails to account for abrupt changes from external events. Existing spatiotemporal models struggle to leverage textual descriptions detailing these events. We propose SeMob, an LLM-powered semantic synthesis pipeline for dynamic mobility prediction. Specifically, SeMob employs a multi-agent framework where LLM-based agents automatically extract and reason about spatiotemporally related text from complex online texts. Fine-grained relevant contexts are then incorporated with spatiotemporal data through our proposed innovative progressive fusion architecture. The rich pre-trained event prior contributes enriched insights about event-driven prediction, and hence results in a more aligned forecasting model. Evaluated on a dataset constructed through our pipeline, SeMob achieves maximal reductions of 13.92% in MAE and 11.12% in RMSE compared to the spatiotemporal model. Notably, the framework exhibits pronounced superiority especially within spatiotemporal regions close to an event's location and time of occurrence.
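
The MAE and RMSE reductions quoted in the abstract above are relative improvements over a baseline model. The sketch below shows how such a figure is typically computed; the toy numbers are hypothetical and are not data from the paper.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def relative_reduction(baseline_error, model_error):
    """Percentage by which model_error is lower than baseline_error."""
    return 100 * (baseline_error - model_error) / baseline_error

# Hypothetical mobility counts and two sets of forecasts.
observed = [100, 120, 90, 150]
baseline = [110, 100, 95, 130]   # stand-in for a spatiotemporal-only model
improved = [105, 110, 92, 140]   # stand-in for a model with event context

print(relative_reduction(mae(observed, baseline), mae(observed, improved)))
print(relative_reduction(rmse(observed, baseline), rmse(observed, improved)))
```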

Title: A Comparative Analysis of Sparse Autoencoder and Activation Difference in Language Model Steering

Authors: Jiaqing Xie
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.01246
Pdf URL: https://arxiv.org/pdf/2510.01246
Copy Paste: [[2510.01246]] A Comparative Analysis of Sparse Autoencoder and Activation Difference in Language Model Steering(https://arxiv.org/abs/2510.01246)
Keywords: language model
Abstract: Sparse autoencoders (SAEs) have recently emerged as a powerful tool for language model steering. Prior work has explored top-k SAE latents for steering, but we observe that many dimensions among the top-k latents capture non-semantic features such as punctuation rather than semantic attributes like instructions. To address this, we propose focusing on a single, most relevant SAE latent (top-1), eliminating redundant features. We further identify a limitation in constant SAE steering, which often produces degenerate outputs such as repetitive single words. To mitigate this, we introduce a token-wise decaying steering strategy, enabling more faithful comparisons with mean activation difference baselines. Empirically, we show that steering an SAE latent associated with reasoning reliably elicits step-by-step mathematical reasoning and enhances inference quality, functionally resembling the effect of appending a guiding token. Our results demonstrate that SAEs outperform mean activation difference methods on mathematical reasoning benchmarks and match their performance on IF-Eval.
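
The token-wise decaying steering idea in the abstract above can be sketched as follows. The abstract does not give the decay schedule, so the exponential form, the function names, and all numeric values here are assumptions for illustration only.

```python
def decayed_coefficients(alpha0, decay, num_tokens):
    """Steering strength per generated token: alpha0 * decay**t (assumed schedule)."""
    return [alpha0 * decay ** t for t in range(num_tokens)]

def steer(hidden, direction, alpha):
    """Add alpha * direction to one hidden-state vector."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

# Hypothetical 3-dimensional hidden states and a single (top-1) latent direction.
direction = [1.0, 0.0, 0.0]
hidden_states = [[0.0, 1.0, 2.0] for _ in range(4)]  # one vector per generated token

alphas = decayed_coefficients(alpha0=8.0, decay=0.5, num_tokens=4)
steered = [steer(h, direction, a) for h, a in zip(hidden_states, alphas)]
print(alphas)
print(steered)
```

With a constant coefficient every token would receive the same push, which is the setting the abstract associates with degenerate, repetitive outputs; a decaying coefficient applies the strongest push to the earliest tokens.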

Title: Let's Play Across Cultures: A Large Multilingual, Multicultural Benchmark for Assessing Language Models' Understanding of Sports

Authors: Punit Kumar Singh, Nishant Kumar, Akash Ghosh, Kunal Pasad, Khushi Soni, Manisha Jaishwal, Sriparna Saha, Syukron Abu Ishaq Alfarozi, Asres Temam Abagissa, Kitsuchart Pasupa, Haiqin Yang, Jose G Moreno
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.01247
Pdf URL: https://arxiv.org/pdf/2510.01247
Copy Paste: [[2510.01247]] Let's Play Across Cultures: A Large Multilingual, Multicultural Benchmark for Assessing Language Models' Understanding of Sports(https://arxiv.org/abs/2510.01247)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Language Models (LMs) are primarily evaluated on globally popular sports, often overlooking regional and indigenous sporting traditions. To address this gap, we introduce CultSportQA, a benchmark designed to assess LMs' understanding of traditional sports across 60 countries and 6 continents, encompassing four distinct cultural categories. The dataset features 33,000 multiple-choice questions (MCQs) across text and image modalities, each of which is categorized into three key types: history-based, rule-based, and scenario-based. To evaluate model performance, we employ zero-shot, few-shot, and chain-of-thought (CoT) prompting across a diverse set of Large Language Models (LLMs), Small Language Models (SLMs), and Multimodal Large Language Models (MLMs). By providing a comprehensive multilingual and multicultural sports benchmark, CultSportQA establishes a new standard for assessing AI's ability to understand and reason about traditional sports.

Title: SSTAG: Structure-Aware Self-Supervised Learning Method for Text-Attributed Graphs

Authors: Ruyue Liu, Rong Yin, Xiangzhen Bo, Xiaoshuai Hao, Yong Liu, Jinwen Zhong, Can Ma, Weiping Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.01248
Pdf URL: https://arxiv.org/pdf/2510.01248
Copy Paste: [[2510.01248]] SSTAG: Structure-Aware Self-Supervised Learning Method for Text-Attributed Graphs(https://arxiv.org/abs/2510.01248)
Keywords: language model, llm
Abstract: Large scale pretrained models have revolutionized Natural Language Processing (NLP) and Computer Vision (CV), showcasing remarkable cross domain generalization abilities. However, in graph learning, models are typically trained on individual graph datasets, limiting their capacity to transfer knowledge across different graphs and tasks. This approach also heavily relies on large volumes of annotated data, which presents a significant challenge in resource-constrained settings. Unlike NLP and CV, graph structured data presents unique challenges due to its inherent heterogeneity, including domain specific feature spaces and structural diversity across various applications. To address these challenges, we propose a novel structure aware self supervised learning method for Text Attributed Graphs (SSTAG). By leveraging text as a unified representation medium for graph learning, SSTAG bridges the gap between the semantic reasoning of Large Language Models (LLMs) and the structural modeling capabilities of Graph Neural Networks (GNNs). Our approach introduces a dual knowledge distillation framework that co-distills both LLMs and GNNs into structure-aware multilayer perceptrons (MLPs), enhancing the scalability of large-scale TAGs. Additionally, we introduce an in-memory mechanism that stores typical graph representations, aligning them with memory anchors in an in-memory repository to integrate invariant knowledge, thereby improving the model's generalization ability. Extensive experiments demonstrate that SSTAG outperforms state-of-the-art models on cross-domain transfer learning tasks, achieves exceptional scalability, and reduces inference costs while maintaining competitive performance.

Title: LOCA: Logical Chain Augmentation for Scientific Corpus Cleaning

Authors: You-Le Fang, Dong-Shan Jian, Xiang Li, Ce Meng, Ling-Shi Meng, Chen-Xu Yan, Zhi-Zhang Bian, Yan-Qing Ma
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.01249
Pdf URL: https://arxiv.org/pdf/2510.01249
Copy Paste: [[2510.01249]] LOCA: Logical Chain Augmentation for Scientific Corpus Cleaning(https://arxiv.org/abs/2510.01249)
Keywords: language model, llm
Abstract: While Large Language Models (LLMs) excel in general domains, their reliability often falls short in scientific problem-solving. The advancement of scientific AI depends on large-scale, high-quality corpora. However, existing scientific question-answering (QA) datasets suffer from high error rates, frequently resulting from logical leaps and implicit reasoning within the answers. To address this issue, we introduce LOCA (Logical Chain Augmentation), a novel framework for automatically cleaning scientific corpora, implemented through an augment-and-review loop. At its core, LOCA enhances raw answers by completing missing logical steps and explicitly separating the underlying scientific principle from its subsequent derivation. By applying LOCA to challenging scientific corpora, we demonstrate that it can automatically filter noisy datasets, typically reducing the error rate from as high as 20% to below 2%. LOCA provides a scalable and effective methodology for creating high-quality scientific corpora, paving the way for more reliable training and evaluation of scientific AI.

Title: GemDetox at TextDetox CLEF 2025: Enhancing a Massively Multilingual Model for Text Detoxification on Low-resource Languages

Authors: Trung Duc Anh Dang, Ferdinando Pio D'Elia
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.01250
Pdf URL: https://arxiv.org/pdf/2510.01250
Copy Paste: [[2510.01250]] GemDetox at TextDetox CLEF 2025: Enhancing a Massively Multilingual Model for Text Detoxification on Low-resource Languages(https://arxiv.org/abs/2510.01250)
Keywords: prompt, chain-of-thought
Abstract: As social-media platforms emerge and evolve faster than the regulations meant to oversee them, automated detoxification might serve as a timely tool for moderators to enforce safe discourse at scale. We here describe our submission to the PAN 2025 Multilingual Text Detoxification Challenge, which rewrites toxic single-sentence inputs into neutral paraphrases across 15 typologically diverse languages. Building on a 12B-parameter Gemma-3 multilingual transformer, we apply parameter-efficient LoRA SFT fine-tuning and prompting techniques like few-shot and Chain-of-Thought. Our multilingual training corpus combines 3,600 human-authored parallel pairs, 21,600 machine-translated synthetic pairs, and model-generated pairs filtered by Jaccard thresholds. At inference, inputs are enriched with three LaBSE-retrieved neighbors and explicit toxic-span annotations. Evaluated via Style Transfer Accuracy, LaBSE-based semantic preservation, and xCOMET fluency, our system ranks first on high-resource and low-resource languages. Ablations show +0.081 joint score increase from few-shot examples and +0.088 from basic CoT prompting. ANOVA analysis identifies language resource status as the strongest predictor of performance (η² = 0.667, p < 0.01).
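
Filtering generated pairs by Jaccard thresholds, as mentioned in the abstract above, could look like the sketch below. Whitespace tokenization, the two-sided threshold, and the threshold values themselves are assumptions; the abstract does not specify them.

```python
def jaccard(a, b):
    """Jaccard similarity between the lowercase token sets of two strings."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)

def filter_pairs(pairs, low=0.3, high=0.9):
    """Keep (toxic, neutral) pairs whose overlap lies in [low, high].

    Hypothetical rule: too little overlap suggests the meaning drifted,
    too much suggests the rewrite barely changed the input.
    """
    return [(src, tgt) for src, tgt in pairs if low <= jaccard(src, tgt) <= high]

# Toy candidate pairs, not data from the paper.
candidates = [
    ("you are a stupid person", "you are a person"),  # overlap 0.8, kept
    ("this is bad", "this is bad"),                   # overlap 1.0, dropped
    ("go away idiot", "please leave"),                # overlap 0.0, dropped
]
kept = filter_pairs(candidates)
print(kept)
```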

Title: Efficient Uncertainty Estimation for LLM-based Entity Linking in Tabular Data

Authors: Carlo Bono, Federico Belotti, Matteo Palmonari
Subjects: cs.CL, stat.ML
Abstract URL: https://arxiv.org/abs/2510.01251
Pdf URL: https://arxiv.org/pdf/2510.01251
Copy Paste: [[2510.01251]] Efficient Uncertainty Estimation for LLM-based Entity Linking in Tabular Data(https://arxiv.org/abs/2510.01251)
Keywords: language model, llm
Abstract: Linking textual values in tabular data to their corresponding entities in a Knowledge Base is a core task across a variety of data integration and enrichment applications. Although Large Language Models (LLMs) have shown State-of-The-Art performance in Entity Linking (EL) tasks, their deployment in real-world scenarios requires not only accurate predictions but also reliable uncertainty estimates, which require resource-demanding multi-shot inference, posing serious limits to their actual applicability. As a more efficient alternative, we investigate a self-supervised approach for estimating uncertainty from single-shot LLM outputs using token-level features, reducing the need for multiple generations. Evaluation is performed on an EL task on tabular data across multiple LLMs, showing that the resulting uncertainty estimates are highly effective in detecting low-accuracy outputs. This is achieved at a fraction of the computational cost, ultimately supporting a cost-effective integration of uncertainty measures into LLM-based EL workflows. The method offers a practical way to incorporate uncertainty estimation into EL workflows with limited computational overhead.
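
As an illustration of the token-level features mentioned in the abstract above, the sketch below summarizes the log-probabilities of a single generation. The abstract does not list the features used, so these three are common choices picked as an assumption, and the function name is hypothetical.

```python
import math

def token_features(token_logprobs):
    """Summary features over the log-probabilities of the generated tokens."""
    mean_lp = sum(token_logprobs) / len(token_logprobs)
    return {
        "mean_logprob": mean_lp,            # average confidence over the answer
        "min_logprob": min(token_logprobs), # least confident token
        "perplexity": math.exp(-mean_lp),   # exp of the negative mean log-prob
    }

# Toy example: four tokens, each generated with probability 0.5.
features = token_features([math.log(0.5)] * 4)
print(features)
```

Features of this kind come from one forward generation, so an uncertainty estimator trained on them avoids the repeated sampling that the abstract identifies as the costly part of multi-shot approaches.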

Title: GPT and Prejudice: A Sparse Approach to Understanding Learned Representations in Large Language Models

Authors: Mariam Mahran, Katharina Simbeck
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.01252
Pdf URL: https://arxiv.org/pdf/2510.01252
Copy Paste: [[2510.01252]] GPT and Prejudice: A Sparse Approach to Understanding Learned Representations in Large Language Models(https://arxiv.org/abs/2510.01252)
Keywords: language model, gpt, llm
Abstract: As large language models (LLMs) are increasingly trained on massive, uncurated corpora, understanding both model representations and the data they internalize has become a major challenge. In this work, we show that pairing LLMs with sparse autoencoders (SAEs) enables interpretation not only of model behavior but also of the deeper structures, themes, and biases embedded in the training data. We train a GPT-style transformer model exclusively on the novels of Jane Austen, a corpus rich in social constructs and narrative patterns. We then apply SAEs to hidden states across multiple layers, uncovering sparse, interpretable features that reflect the key narratives and concepts present in the corpus, including gender, class, and societal duty. Our findings demonstrate that LLMs combined with SAEs can act as scalable probes into complex datasets, offering a new path for corpus exploration, bias discovery, and model interpretability at scale.

Title: Do Bias Benchmarks Generalise? Evidence from Voice-based Evaluation of Gender Bias in SpeechLLMs

Authors: Shree Harsha Bokkahalli Satish, Gustav Eje Henter, Éva Székely
Subjects: cs.CL, cs.AI, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2510.01254
Pdf URL: https://arxiv.org/pdf/2510.01254
Copy Paste: [[2510.01254]] Do Bias Benchmarks Generalise? Evidence from Voice-based Evaluation of Gender Bias in SpeechLLMs(https://arxiv.org/abs/2510.01254)
Keywords: language model, llm, prompt
Abstract: Recent work in benchmarking bias and fairness in speech large language models (SpeechLLMs) has relied heavily on multiple-choice question answering (MCQA) formats. The model is tasked to choose between stereotypical, anti-stereotypical, or neutral/irrelevant answers given an input speech prompt and an optional text prompt. Such MCQA benchmarks implicitly assume that model performance is consistent across other MCQA tasks, voices, and other task formats such as more realistic, long-form evaluations. In this paper, we probe that assumption. We fine-tune three SpeechLLMs using LoRA adapters to induce specific MCQA behaviours: preference for stereotypical, anti-stereotypical, or neutral/uncertain answers. We then evaluate whether these behaviours generalise to another, distinct MCQA benchmark, and more critically to long-form, creative generation tasks. Our results show that performance on MCQA bias benchmarks fails to reliably predict performances across other MCQA benchmarks, and more importantly across long-form tasks. We conclude that current MCQA bias benchmarks show limited evidence of cross-task generalisation in the speech domain, and also propose an evaluation suite for measuring behaviour transferability in future models and benchmarks.

Title: Longitudinal Monitoring of LLM Content Moderation of Social Issues

Authors: Yunlang Dai, Emma Lurie, Danaé Metaxa, Sorelle A. Friedler
Subjects: cs.CL, cs.CY, cs.HC
Abstract URL: https://arxiv.org/abs/2510.01255
Pdf URL: https://arxiv.org/pdf/2510.01255
Copy Paste: [[2510.01255]] Longitudinal Monitoring of LLM Content Moderation of Social Issues(https://arxiv.org/abs/2510.01255)
Keywords: language model, gpt, llm
Abstract: Large language models' (LLMs') outputs are shaped by opaque and frequently-changing company content moderation policies and practices. LLM moderation often takes the form of refusal; models' refusal to produce text about certain topics both reflects company policy and subtly shapes public discourse. We introduce AI Watchman, a longitudinal auditing system to publicly measure and track LLM refusals over time, to provide transparency into an important and black-box aspect of LLMs. Using a dataset of over 400 social issues, we audit OpenAI's moderation endpoint, GPT-4.1, GPT-5, and DeepSeek (both in English and Chinese). We find evidence that changes in company policies, even those not publicly announced, can be detected by AI Watchman, and identify company- and model-specific differences in content moderation. We also qualitatively analyze and categorize different forms of refusal. This work contributes evidence for the value of longitudinal auditing of LLMs and presents AI Watchman as one system for doing so.

Title: RJE: A Retrieval-Judgment-Exploration Framework for Efficient Knowledge Graph Question Answering with LLMs

Authors: Can Lin, Zhengwang Jiang, Ling Zheng, Qi Zhao, Yuhang Zhang, Qi Song, Wangqiu Zhou
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.01257
Pdf URL: https://arxiv.org/pdf/2510.01257
Copy Paste: [[2510.01257]] RJE: A Retrieval-Judgment-Exploration Framework for Efficient Knowledge Graph Question Answering with LLMs(https://arxiv.org/abs/2510.01257)
Keywords: language model, gpt, llm, agent
Abstract: Knowledge graph question answering (KGQA) aims to answer natural language questions using knowledge graphs. Recent research leverages large language models (LLMs) to enhance KGQA reasoning, but faces limitations: retrieval-based methods are constrained by the quality of retrieved information, while agent-based methods rely heavily on proprietary LLMs. To address these limitations, we propose Retrieval-Judgment-Exploration (RJE), a framework that retrieves refined reasoning paths, evaluates their sufficiency, and conditionally explores additional evidence. Moreover, RJE introduces specialized auxiliary modules enabling small-sized LLMs to perform effectively: Reasoning Path Ranking, Question Decomposition, and Retriever-assisted Exploration. Experiments show that our approach with proprietary LLMs (such as GPT-4o-mini) outperforms existing baselines while enabling small open-source LLMs (such as 3B and 8B parameters) to achieve competitive results without fine-tuning LLMs. Additionally, RJE substantially reduces the number of LLM calls and token usage compared to agent-based methods, yielding significant efficiency improvements.
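
The retrieve, judge, and conditionally explore flow described in the abstract above can be sketched as a small control loop. This is only an outline under stated assumptions: the function name, the callable interfaces, and the `max_rounds` cap are hypothetical, and the toy lambdas stand in for a real retriever and LLM.

```python
def rje_answer(question, retrieve, judge, explore, answer, max_rounds=3):
    """Retrieve paths, explore further only while they are judged insufficient."""
    paths = retrieve(question)
    for _ in range(max_rounds):
        if judge(question, paths):      # evidence judged sufficient: stop early
            break
        paths = paths + explore(question, paths)
    return answer(question, paths)

# Toy knowledge-graph paths so the sketch runs end to end.
kg_paths = {
    "initial": ["Paris -capital_of-> France"],
    "extra": ["France -continent-> Europe"],
}
result = rje_answer(
    "Which continent is Paris in?",
    retrieve=lambda q: list(kg_paths["initial"]),
    judge=lambda q, paths: any("continent" in p for p in paths),
    explore=lambda q, paths: list(kg_paths["extra"]),
    answer=lambda q, paths: paths[-1].split("-> ")[-1],
)
print(result)
```

Because exploration runs only when the judgment step fails, questions answerable from the initial retrieval need no extra calls, which is consistent with the reduced LLM call count the abstract reports.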

Title: Measuring Algorithmic Partisanship via Zero-Shot Classification and Its Implications on Political Discourse

Authors: Nathan Junzi Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.01258
Pdf URL: https://arxiv.org/pdf/2510.01258
Copy Paste: [[2510.01258]] Measuring Algorithmic Partisanship via Zero-Shot Classification and Its Implications on Political Discourse(https://arxiv.org/abs/2510.01258)
Keywords: language model, llm
Abstract: Amidst the rapid normalization of generative artificial intelligence (GAI), intelligent systems have come to dominate political discourse across information mediums. However, internalized political biases stemming from training data skews, human prejudice, and algorithmic flaws continue to plague the novel technology. This paper employs a zero-shot classification approach to evaluate algorithmic political partisanship through a methodical combination of ideological alignment, topicality, response sentiment, and objectivity. A total of 1800 model responses across six mainstream large language models (LLMs) were individually input into four distinct fine-tuned classification algorithms, each responsible for computing an aforementioned bias evaluation metric. Results show an amplified liberal-authoritarian alignment across all six LLMs evaluated, with notable instances of reasoning supersessions and canned refusals. The study subsequently highlights the psychological influences underpinning human-computer interactions and how intrinsic biases can permeate public discourse. The resulting distortion of the political landscape can ultimately manifest as conformity or polarization, depending on a region's pre-existing socio-political structures.
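Summary: Using zero-shot classification, this paper evaluates political partisanship in generative AI along four metrics: ideological alignment, topicality, response sentiment, and objectivity. A total of 1800 responses from six mainstream LLMs were each fed to four fine-tuned classifiers, one per metric. The results show an amplified liberal-authoritarian alignment across all six models, with notable instances of reasoning supersessions and canned refusals. The study then discusses how such intrinsic biases can permeate public discourse and manifest as conformity or polarization, depending on a region's pre-existing socio-political structures.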

Title: In AI Sweet Harmony: Sociopragmatic Guardrail Bypasses and Evaluation-Awareness in OpenAI gpt-oss-20b

Authors: Nils Durner
Subjects: cs.CL, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2510.01259
Pdf URL: https://arxiv.org/pdf/2510.01259
Copy Paste: [[2510.01259]] In AI Sweet Harmony: Sociopragmatic Guardrail Bypasses and Evaluation-Awareness in OpenAI gpt-oss-20b(https://arxiv.org/abs/2510.01259)
Keywords: gpt, prompt
Abstract: We probe OpenAI's open-weights 20-billion-parameter model gpt-oss-20b to study how sociopragmatic framing, language choice, and instruction hierarchy affect refusal behavior. Across 80 seeded iterations per scenario, we test several harm domains including ZIP-bomb construction (cyber threat), synthetic card-number generation, minor-unsafe driving advice, drug-precursor indicators, and RAG context exfiltration. Composite prompts that combine an educator persona, a safety-pretext ("what to avoid"), and step-cue phrasing flip assistance rates from 0% to 97.5% on a ZIP-bomb task. On our grid, formal registers in German and French are often leakier than matched English prompts. A "Linux terminal" role-play overrides a developer rule not to reveal context in a majority of runs with a naive developer prompt, and we introduce an AI-assisted hardening method that reduces leakage to 0% in several user-prompt variants. We further test evaluation awareness with a paired-track design and measure frame-conditioned differences between matched "helpfulness" and "harmfulness" evaluation prompts; we observe inconsistent assistance in 13% of pairs. Finally, we find that the OpenAI Moderation API under-captures materially helpful outputs relative to a semantic grader, and that refusal rates differ by 5 to 10 percentage points across inference stacks, raising reproducibility concerns. We release prompts, seeds, outputs, and code for reproducible auditing at this https URL.
Summary: This study probes the open-weights 20-billion-parameter model gpt-oss-20b to see how sociopragmatic framing, language choice, and instruction hierarchy affect refusals, using 80 seeded iterations per scenario across several harm domains. Composite prompts combining an educator persona, a safety pretext, and step-cue phrasing raise assistance on a ZIP-bomb task from 0% to 97.5%, and formal German and French registers are often leakier than matched English prompts. A "Linux terminal" role-play overrides a developer rule against revealing context in a majority of runs with a naive developer prompt; an AI-assisted hardening method reduces leakage to 0% in several user-prompt variants. Matched "helpfulness" and "harmfulness" evaluation prompts yield inconsistent assistance in 13% of pairs. The OpenAI Moderation API under-captures materially helpful outputs relative to a semantic grader, and refusal rates differ by 5 to 10 percentage points across inference stacks. Prompts, seeds, outputs, and code are released for reproducible auditing.

Title: OpenAI's GPT-OSS-20B Model and Safety Alignment Issues in a Low-Resource Language

Authors: Isa Inuwa-Dutse
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.01266
Pdf URL: https://arxiv.org/pdf/2510.01266
Copy Paste: [[2510.01266]] OpenAI's GPT-OSS-20B Model and Safety Alignment Issues in a Low-Resource Language(https://arxiv.org/abs/2510.01266)
Keywords: gpt, prompt
Abstract: In response to the recent safety probing of OpenAI's GPT-OSS-20b model, we present a summary of a set of vulnerabilities uncovered in the model, focusing on its performance and safety alignment in a low-resource language setting. The core motivation for our work is to question the model's reliability for users from underrepresented communities. Using Hausa, a major African language, we uncover biases, inaccuracies, and cultural insensitivities in the model's behaviour. With minimal prompting, our red-teaming efforts reveal that the model can be induced to generate harmful, culturally insensitive, and factually inaccurate content in the language. As a form of reward hacking, we note how the model's safety protocols appear to relax when prompted with polite or grateful language, leading to outputs that could facilitate misinformation and amplify hate speech. For instance, the model operates on the false assumption that a common insecticide locally known as Fiya-Fiya (Cypermethrin) and a rodenticide like Shinkafar Bera (a form of Aluminium Phosphide) are safe for human consumption. To contextualise the severity of this error and the popularity of the substances, we conducted a survey (n=61) in which 98% of participants identified them as toxic. Additional failures include an inability to distinguish between raw and processed foods and the incorporation of demeaning cultural proverbs to build inaccurate arguments. We surmise that these issues manifest through a form of linguistic reward hacking, where the model prioritises fluent, plausible-sounding output in the target language over safety and truthfulness. We attribute the uncovered flaws primarily to insufficient safety tuning in low-resource linguistic contexts. By concentrating on a low-resource setting, our approach highlights a significant gap in current red-teaming efforts and offers some recommendations.
Summary: This report summarizes vulnerabilities found in OpenAI's GPT-OSS-20b when it is used in Hausa, a major African language and a low-resource setting. With minimal prompting, the model can be induced to produce harmful, culturally insensitive, and factually inaccurate content, and its safety protocols appear to relax when prompted with polite or grateful language. For example, it treats the insecticide Fiya-Fiya (Cypermethrin) and the rodenticide Shinkafar Bera (a form of Aluminium Phosphide) as safe for human consumption, although 98% of participants in a survey (n=61) identified them as toxic. Other failures include an inability to distinguish raw from processed foods and the use of demeaning cultural proverbs to build inaccurate arguments. The authors attribute these flaws primarily to insufficient safety tuning in low-resource linguistic contexts and offer some recommendations.

Title: AdaDetectGPT: Adaptive Detection of LLM-Generated Text with Statistical Guarantees

Authors: Hongyi Zhou, Jin Zhu, Pingfan Su, Kai Ye, Ying Yang, Shakeel A O B Gavioli-Akilagun, Chengchun Shi
Subjects: cs.CL, cs.AI, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2510.01268
Pdf URL: https://arxiv.org/pdf/2510.01268
Copy Paste: [[2510.01268]] AdaDetectGPT: Adaptive Detection of LLM-Generated Text with Statistical Guarantees(https://arxiv.org/abs/2510.01268)
Keywords: language model, gpt, llm
Abstract: We study the problem of determining whether a piece of text has been authored by a human or by a large language model (LLM). Existing state-of-the-art logits-based detectors make use of statistics derived from the log-probability of the observed text evaluated using the distribution function of a given source LLM. However, relying solely on log probabilities can be sub-optimal. In response, we introduce AdaDetectGPT, a novel classifier that adaptively learns a witness function from training data to enhance the performance of logits-based detectors. We provide statistical guarantees on its true positive rate, false positive rate, true negative rate and false negative rate. Extensive numerical studies show AdaDetectGPT nearly uniformly improves on the state-of-the-art method across various combinations of datasets and LLMs, and the improvement can reach up to 58%. A Python implementation of our method is available at this https URL.
Summary: This paper studies how to determine whether a text was written by a human or by a large language model (LLM). Existing logits-based detectors rely on statistics of the text's log-probability under a source LLM, which can be sub-optimal. AdaDetectGPT adaptively learns a witness function from training data to improve such detectors, with statistical guarantees on its true positive, false positive, true negative, and false negative rates. It nearly uniformly improves on the state-of-the-art method across combinations of datasets and LLMs, by up to 58%.

Title: Think Twice, Generate Once: Safeguarding by Progressive Self-Reflection

Authors: Hoang Phan, Victor Li, Qi Lei
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.01270
Pdf URL: https://arxiv.org/pdf/2510.01270
Copy Paste: [[2510.01270]] Think Twice, Generate Once: Safeguarding by Progressive Self-Reflection(https://arxiv.org/abs/2510.01270)
Keywords: language model, llm
Abstract: Large language models (LLMs) have revolutionized natural language processing with their ability to generate coherent and contextually relevant text. However, their deployment raises significant concerns about the potential for generating harmful or inappropriate content. In this paper, we introduce Progressive Self-Reflection (PSR), a novel inference-time technique that empowers LLMs to self-monitor and correct their outputs dynamically. Experimental results demonstrate that applying our proposed method to Llama-3.1-8B-Instruct reduces the attack success rate from 77.5% to 5.9%, to Llama-3.1-8B base from 89.7% to 5.6%, and to Qwen2.5-7B-Instruct from 44.4% to 3.8%, without additional training, while maintaining their original performance on benign tasks. Our approach acts as a test-time scaling method, where additional self-reflection rounds enhance safety at the cost of inference overhead. To balance safety with computational efficiency, we introduce a lightweight self-reflection predictor that estimates the optimal number of reflection rounds based on input complexity. This adaptive mechanism prevents unnecessary self-assessment on benign inputs while ensuring thorough evaluation when encountering potentially harmful content. Our findings suggest that Progressive Self-Reflection serves as a scalable test-time approach, enhancing LLM safety by dynamically allocating computational resources in proportion to the input's risk profile.
Summary: Progressive Self-Reflection (PSR) is an inference-time technique that lets LLMs monitor and correct their own outputs dynamically. Without additional training, it reduces the attack success rate from 77.5% to 5.9% on Llama-3.1-8B-Instruct, from 89.7% to 5.6% on Llama-3.1-8B base, and from 44.4% to 3.8% on Qwen2.5-7B-Instruct, while maintaining performance on benign tasks. Additional reflection rounds improve safety at the cost of inference overhead, so a lightweight predictor estimates the optimal number of rounds from input complexity, avoiding unnecessary self-assessment on benign inputs.

Title: TraceDet: Hallucination Detection from the Decoding Trace of Diffusion Large Language Models

Authors: Shenxu Chang, Junchi Yu, Weixing Wang, Yongqiang Chen, Jialin Yu, Philip Torr, Jindong Gu
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2510.01274
Pdf URL: https://arxiv.org/pdf/2510.01274
Copy Paste: [[2510.01274]] TraceDet: Hallucination Detection from the Decoding Trace of Diffusion Large Language Models(https://arxiv.org/abs/2510.01274)
Keywords: language model, llm, hallucination
Abstract: Diffusion large language models (D-LLMs) have recently emerged as a promising alternative to auto-regressive LLMs (AR-LLMs). However, the hallucination problem in D-LLMs remains underexplored, limiting their reliability in real-world applications. Existing hallucination detection methods are designed for AR-LLMs and rely on signals from single-step generation, making them ill-suited for D-LLMs where hallucination signals often emerge throughout the multi-step denoising process. To bridge this gap, we propose TraceDet, a novel framework that explicitly leverages the intermediate denoising steps of D-LLMs for hallucination detection. TraceDet models the denoising process as an action trace, with each action defined as the model's prediction over the cleaned response, conditioned on the previous intermediate output. By identifying the sub-trace that is maximally informative to the hallucinated responses, TraceDet leverages the key hallucination signals in the multi-step denoising process of D-LLMs for hallucination detection. Extensive experiments on various open source D-LLMs demonstrate that TraceDet consistently improves hallucination detection, achieving an average gain in AUROC of 15.2% compared to baselines.
Summary: Diffusion large language models (D-LLMs) are a promising alternative to auto-regressive LLMs (AR-LLMs), but hallucination in D-LLMs remains underexplored. Existing detectors are designed for AR-LLMs and rely on signals from single-step generation, whereas hallucination signals in D-LLMs often emerge throughout the multi-step denoising process. TraceDet models denoising as an action trace and identifies the sub-trace that is maximally informative about hallucinated responses. Across various open-source D-LLMs, it achieves an average AUROC gain of 15.2% over baselines.

Title: LLM Based Sentiment Classification From Bangladesh E-Commerce Reviews

Authors: Sumaiya Tabassum
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.01276
Pdf URL: https://arxiv.org/pdf/2510.01276
Copy Paste: [[2510.01276]] LLM Based Sentiment Classification From Bangladesh E-Commerce Reviews(https://arxiv.org/abs/2510.01276)
Keywords: language model, llm
Abstract: Sentiment analysis is an essential part of text analysis, which is a larger field that includes determining and evaluating the author's emotional state. This method is essential since it makes it easier to comprehend consumers' feelings, viewpoints, and preferences holistically. The introduction of large language models (LLMs), such as Llama, has greatly increased the availability of cutting-edge model applications, such as sentiment analysis. However, accurate sentiment analysis is hampered by the intricacy of written language and the diversity of languages used in evaluations. The viability of using transformer-based BERT models and other LLMs for sentiment analysis from Bangladesh e-commerce reviews is investigated in this paper. A subset of 4000 samples from the original dataset of Bangla and English customer reviews was utilized to fine-tune the model. The fine-tuned Llama-3.1-8B model outperformed other fine-tuned models, including Phi-3.5-mini-instruct, Mistral-7B-v0.1, DistilBERT-multilingual, mBERT, and XLM-R-base, with an overall accuracy, precision, recall, and F1 score of 95.5%, 93%, 88%, and 90%, respectively. The study emphasizes how parameter-efficient fine-tuning methods (LoRA and PEFT) can lower computational overhead and make it appropriate for contexts with limited resources. The results show how LLMs can
Summary: This paper investigates transformer-based BERT models and other LLMs for sentiment analysis of Bangladesh e-commerce reviews, fine-tuning on a subset of 4000 Bangla and English customer reviews. The fine-tuned Llama-3.1-8B model outperformed Phi-3.5-mini-instruct, Mistral-7B-v0.1, DistilBERT-multilingual, mBERT, and XLM-R-base, with accuracy, precision, recall, and F1 score of 95.5%, 93%, 88%, and 90%, respectively. The study emphasizes how parameter-efficient fine-tuning methods (LoRA and PEFT) can lower computational overhead for resource-limited contexts. The results show how LLMs can

Title: TUMIX: Multi-Agent Test-Time Scaling with Tool-Use Mixture

Authors: Yongchao Chen, Jiefeng Chen, Rui Meng, Ji Yin, Na Li, Chuchu Fan, Chi Wang, Tomas Pfister, Jinsung Yoon
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.01279
Pdf URL: https://arxiv.org/pdf/2510.01279
Copy Paste: [[2510.01279]] TUMIX: Multi-Agent Test-Time Scaling with Tool-Use Mixture(https://arxiv.org/abs/2510.01279)
Keywords: language model, gpt, llm, chat, agent
Abstract: While integrating tools like Code Interpreter and Search has significantly enhanced Large Language Model (LLM) reasoning in models like ChatGPT Agent and Gemini-Pro, practical guidance on optimal tool use is lacking. The core challenge is effectively combining textual reasoning, coding, and search for diverse questions. In this paper, we propose Tool-Use Mixture (TUMIX), an ensemble framework that runs multiple agents in parallel, each employing distinct tool-use strategies and answer paths. Agents in TUMIX iteratively share and refine responses based on the question and previous answers. In experiments, TUMIX achieves significant gains over state-of-the-art tool-augmented and test-time scaling methods, delivering an average accuracy improvement of up to 3.55% over the best baseline on Gemini-2.5-Pro and Gemini-2.5-Flash across key reasoning benchmarks, with near-equal inference costs. We find that agent diversity and quality are crucial and can be enhanced by using LLMs to auto-optimize agent designs. Furthermore, TUMIX can halt refinement upon reaching sufficient confidence, preserving performance at only 49% of the inference cost. Further scaling can achieve higher performance, albeit at a greater cost.
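Summary: Tool integration such as Code Interpreter and Search has significantly enhanced LLM reasoning, but practical guidance on optimal tool use is lacking. Tool-Use Mixture (TUMIX) is an ensemble framework that runs multiple agents in parallel, each with a distinct tool-use strategy and answer path; agents iteratively share and refine responses based on the question and previous answers. TUMIX delivers an average accuracy improvement of up to 3.55% over the best baseline on Gemini-2.5-Pro and Gemini-2.5-Flash across key reasoning benchmarks at near-equal inference cost. Agent diversity and quality are crucial and can be enhanced by using LLMs to auto-optimize agent designs. TUMIX can also halt refinement upon reaching sufficient confidence, preserving performance at only 49% of the inference cost.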

Title: Evaluation Sheet for Deep Research: A Use Case for Academic Survey Writing

Authors: Israel Abebe Azime, Tadesse Destaw Belay, Atnafu Lambebo Tonja
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.01283
Pdf URL: https://arxiv.org/pdf/2510.01283
Copy Paste: [[2510.01283]] Evaluation Sheet for Deep Research: A Use Case for Academic Survey Writing(https://arxiv.org/abs/2510.01283)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) powered with agentic capabilities are able to do knowledge-intensive tasks without human involvement. A prime example of such a tool is Deep Research, which can browse the web, extract information, and generate multi-page reports. In this work, we introduce an evaluation sheet that can be used for assessing the capability of Deep Research tools. In addition, we selected academic survey writing as a use case task and evaluated output reports based on the evaluation sheet we introduced. Our findings show the need to have carefully crafted evaluation standards. The evaluation done on OpenAI's Deep Search and Google's Deep Search in generating an academic survey showed the huge gap between search engines and standalone Deep Research tools, and the shortcomings in representing the targeted area.
Summary: LLMs with agentic capabilities can carry out knowledge-intensive tasks without human involvement; Deep Research tools, which browse the web, extract information, and generate multi-page reports, are a prime example. This work introduces an evaluation sheet for assessing such tools and applies it to academic survey writing. The evaluation of OpenAI's Deep Search and Google's Deep Search shows the need for carefully crafted evaluation standards, a large gap between search engines and standalone Deep Research tools, and shortcomings in representing the targeted area.

Title: HiSpec: Hierarchical Speculative Decoding for LLMs

Authors: Avinash Kumar, Sujay Sanghavi, Poulami Das
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.01336
Pdf URL: https://arxiv.org/pdf/2510.01336
Copy Paste: [[2510.01336]] HiSpec: Hierarchical Speculative Decoding for LLMs(https://arxiv.org/abs/2510.01336)
Keywords: llm
Abstract: Speculative decoding accelerates LLM inference by using a smaller draft model to speculate tokens that a larger target model verifies. Verification is often the bottleneck (e.g., verification is 4× slower than token generation when a 3B model speculates for a 70B target model), but most prior works focus only on accelerating drafting. "Intermediate" verification reduces verification time by discarding inaccurate draft tokens early, but existing methods incur substantial training overheads in incorporating the intermediate verifier, increase the memory footprint to orchestrate the intermediate verification step, and compromise accuracy by relying on approximate heuristics. We propose Hierarchical Speculative Decoding (HiSpec), a framework for high-throughput speculative decoding that exploits early-exit (EE) models for low-overhead intermediate verification. EE models allow tokens to exit early by skipping layer traversal and are explicitly trained so that hidden states at selected layers can be interpreted, making them uniquely suited for intermediate verification without drastically increasing compute and memory overheads. To improve resource-efficiency even further, we design a methodology that enables HiSpec to re-use key-value caches and hidden states between the draft, intermediate verifier, and target models. To maintain accuracy, HiSpec periodically validates the draft tokens accepted by the intermediate verifier against the target model. Our evaluations using various representative benchmarks and models show that HiSpec improves throughput by 1.28× on average and by up to 2.01× compared to the baseline single-layer speculation without compromising accuracy.
Summary: Speculative decoding speeds up LLM inference by having a small draft model propose tokens that a larger target model verifies, but verification is often the bottleneck. Hierarchical Speculative Decoding (HiSpec) uses early-exit (EE) models as low-overhead intermediate verifiers that discard inaccurate draft tokens early. HiSpec re-uses key-value caches and hidden states between the draft, intermediate verifier, and target models, and periodically validates accepted tokens against the target model to maintain accuracy. It improves throughput by 1.28× on average and by up to 2.01× over baseline single-layer speculation without compromising accuracy.

Title: TAG-EQA: Text-And-Graph for Event Question Answering via Structured Prompting Strategies

Authors: Maithili Kadam, Francis Ferraro
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.01391
Pdf URL: https://arxiv.org/pdf/2510.01391
Copy Paste: [[2510.01391]] TAG-EQA: Text-And-Graph for Event Question Answering via Structured Prompting Strategies(https://arxiv.org/abs/2510.01391)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Large language models (LLMs) excel at general language tasks but often struggle with event-based questions, especially those requiring causal or temporal reasoning. We introduce TAG-EQA (Text-And-Graph for Event Question Answering), a prompting framework that injects causal event graphs into LLM inputs by converting structured relations into natural-language statements. TAG-EQA spans nine prompting configurations, combining three strategies (zero-shot, few-shot, chain-of-thought) with three input modalities (text-only, graph-only, text+graph), enabling a systematic analysis of when and how structured knowledge aids inference. On the TORQUESTRA benchmark, TAG-EQA improves accuracy by 5% on average over text-only baselines, with gains up to 12% in zero-shot settings and 18% when graph-augmented CoT prompting is effective. While performance varies by model and configuration, our findings show that causal graphs can enhance event reasoning in LLMs without fine-tuning, offering a flexible way to encode structure in prompt-based QA.
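Summary: LLMs often struggle with event-based questions that require causal or temporal reasoning. TAG-EQA is a prompting framework that injects causal event graphs into LLM inputs by converting structured relations into natural-language statements. It spans nine configurations: three strategies (zero-shot, few-shot, chain-of-thought) combined with three input modalities (text-only, graph-only, text+graph). On the TORQUESTRA benchmark it improves accuracy by 5% on average over text-only baselines, with gains up to 12% in zero-shot settings and 18% when graph-augmented CoT prompting is effective, all without fine-tuning.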
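The graph-to-text step and the nine-configuration grid described for TAG-EQA can be illustrated with a minimal sketch. The abstract does not give the prompt format, so the relation labels, sentence templates, and the names `verbalize` and `build_prompt` below are hypothetical and only show the general idea.

```python
from itertools import product

# Hypothetical templates mapping a relation label to a sentence pattern.
TEMPLATES = {
    "causes": "{a} causes {b}.",
    "before": "{a} happens before {b}.",
}

def verbalize(edges):
    """Turn (head, relation, tail) triples into natural-language statements."""
    return [TEMPLATES[rel].format(a=a, b=b) for a, rel, b in edges]

def build_prompt(text, edges, question, modality):
    """Assemble the input for one modality: text-only, graph-only, or text+graph."""
    parts = []
    if modality in ("text-only", "text+graph"):
        parts.append("Passage: " + text)
    if modality in ("graph-only", "text+graph"):
        parts.append("Event relations: " + " ".join(verbalize(edges)))
    parts.append("Question: " + question)
    return "\n".join(parts)

# Three strategies times three modalities gives the nine configurations.
STRATEGIES = ["zero-shot", "few-shot", "chain-of-thought"]
MODALITIES = ["text-only", "graph-only", "text+graph"]
CONFIGS = list(product(STRATEGIES, MODALITIES))

# Toy example data (assumed, not from the benchmark).
edges = [("the storm", "causes", "the power outage"),
         ("the power outage", "before", "the evacuation")]
prompt = build_prompt("A storm hit the town.", edges,
                      "Did the storm happen before the evacuation?", "text+graph")
```

The strategy (zero-shot, few-shot, chain-of-thought) would change the instructions and demonstrations wrapped around this input; that part is omitted here.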

Title: A-VERT: Agnostic Verification with Embedding Ranking Targets

Authors: Nicolás Aguirre, Ramiro Caso, Ramiro Rodríguez Colmeiro, Mauro Santelli, Joaquín Toranzo Calderón
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2510.01469
Pdf URL: https://arxiv.org/pdf/2510.01469
Copy Paste: [[2510.01469]] A-VERT: Agnostic Verification with Embedding Ranking Targets(https://arxiv.org/abs/2510.01469)
Keywords: language model, llm
Abstract: The automatic evaluation of Language Model (LM) responses is a critical piece in the development of benchmarks and metrics, both for model training and quality assessment of production model endpoints. The current approaches to response classification rely on methods that are too expensive (i.e. LLM-as-a-Judge) or that are far from real-world conditions (string-matching, logprob). In this paper, a structure-free evaluation method is presented. The method makes use of semantic embedding distances to match target candidates with arbitrary LM-generated text, resulting in a robust classification of the response at a relatively low compute cost (embedding models of less than 10B parameters). The results show a regression score of ~0.97 and an accuracy of ~96% against human annotators, tested over 3 data sets and 3 different LM architectures.
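Summary: Automatic evaluation of language model (LM) responses is critical for benchmarks and metrics, but current response-classification approaches are either too expensive (LLM-as-a-Judge) or far from real-world conditions (string-matching, logprob). A-VERT is a structure-free method that uses semantic embedding distances to match target candidates with arbitrary LM-generated text, at relatively low compute cost (embedding models of less than 10B parameters). It reaches a regression score of ~0.97 and an accuracy of ~96% against human annotators over 3 data sets and 3 LM architectures.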
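A minimal sketch of the embedding-ranking idea in A-VERT: embed the free-form response and each target candidate, then rank candidates by similarity. The abstract does not specify the distance or the embedding model, so cosine similarity and the toy 3-d vectors below are assumptions standing in for real embedding-model outputs.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_targets(response_vec, targets):
    """Return (label, score) pairs sorted from most to least similar."""
    scored = [(label, cosine(response_vec, vec)) for label, vec in targets.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy vectors (assumed); a real system would call an embedding model here.
targets = {
    "Paris": [1.0, 0.1, 0.0],
    "London": [0.0, 1.0, 0.1],
    "Berlin": [0.1, 0.0, 1.0],
}
response_vec = [0.9, 0.2, 0.1]  # embedding of a free-form LM answer

ranking = rank_targets(response_vec, targets)
predicted = ranking[0][0]  # top-ranked candidate is taken as the model's answer
```

The response is then scored as correct when the top-ranked candidate is the gold target, which avoids parsing a fixed output structure.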

Title: One More Question is Enough, Expert Question Decomposition (EQD) Model for Domain Quantitative Reasoning

Authors: Mengyu Wang, Sotirios Sabanis, Miguel de Carvalho, Shay B. Cohen, Tiejun Ma
Subjects: cs.CL, q-fin.CP
Abstract URL: https://arxiv.org/abs/2510.01526
Pdf URL: https://arxiv.org/pdf/2510.01526
Copy Paste: [[2510.01526]] One More Question is Enough, Expert Question Decomposition (EQD) Model for Domain Quantitative Reasoning(https://arxiv.org/abs/2510.01526)
Keywords: language model, llm, prompt
Abstract: Domain-specific quantitative reasoning remains a major challenge for large language models (LLMs), especially in fields requiring expert knowledge and complex question answering (QA). In this work, we propose Expert Question Decomposition (EQD), an approach designed to balance the use of domain knowledge with computational efficiency. EQD is built on a two-step fine-tuning framework and guided by a reward function that measures the effectiveness of generated sub-questions in improving QA outcomes. It requires only a few thousand training examples and a single A100 GPU for fine-tuning, with inference time comparable to zero-shot prompting. Beyond its efficiency, EQD outperforms state-of-the-art domain-tuned models and advanced prompting strategies. We evaluate EQD in the financial domain, characterized by specialized knowledge and complex quantitative reasoning, across four benchmark datasets. Our method consistently improves QA performance by 0.6% to 10.5% across different LLMs. Our analysis reveals an important insight: in domain-specific QA, a single supporting question often provides greater benefit than detailed guidance steps.
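Summary: Domain-specific quantitative reasoning remains a major challenge for LLMs, especially in fields requiring expert knowledge and complex question answering (QA). Expert Question Decomposition (EQD) is built on a two-step fine-tuning framework guided by a reward function that measures how much generated sub-questions improve QA outcomes. It needs only a few thousand training examples and a single A100 GPU for fine-tuning, with inference time comparable to zero-shot prompting. Evaluated in the financial domain on four benchmark datasets, it improves QA performance by 0.6% to 10.5% across different LLMs. A key insight is that a single supporting question often helps more than detailed guidance steps.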

Title: ReSSFormer: A Recursive Sparse Structured Transformer for Scalable and Long-Context Reasoning

Authors: Haochen You, Baojing Liu
Subjects: cs.CL, cs.NI
Abstract URL: https://arxiv.org/abs/2510.01585
Pdf URL: https://arxiv.org/pdf/2510.01585
Copy Paste: [[2510.01585]] ReSSFormer: A Recursive Sparse Structured Transformer for Scalable and Long-Context Reasoning(https://arxiv.org/abs/2510.01585)
Keywords: language model
Abstract: While Transformer architectures have demonstrated impressive scalability across domains, they continue to face challenges in long-context reasoning, computational efficiency, and structural generalization - largely due to rigid layer stacking, dense attention, and reliance on positional encodings. We present ReSSFormer, a Recursive Sparse Structured Transformer that integrates three complementary innovations: Recurrent Reasoning & Memory Unit (R2MU) for iterative reasoning with bounded depth, Adaptive Sparse Attention Module (ASAM) for efficient and focused context selection, and Self-Organizing Encoder Structure (SOES) for position-free structure induction. ReSSFormer replaces conventional depth stacking with recurrent inference, substitutes full attention with token- and expert-level sparsity, and models latent token topology directly from content. Across language modeling, multi-hop QA, and structure-sensitive tasks, ReSSFormer consistently outperforms strong baselines under comparable FLOPs and parameter budgets, highlighting its scalability, efficiency, and structural flexibility.
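Summary: Transformers scale well across domains but still face challenges in long-context reasoning, computational efficiency, and structural generalization, largely due to rigid layer stacking, dense attention, and reliance on positional encodings. ReSSFormer is a Recursive Sparse Structured Transformer that integrates three components: a Recurrent Reasoning & Memory Unit (R2MU) for iterative reasoning with bounded depth, an Adaptive Sparse Attention Module (ASAM) for efficient and focused context selection, and a Self-Organizing Encoder Structure (SOES) for position-free structure induction. Across language modeling, multi-hop QA, and structure-sensitive tasks, it consistently outperforms strong baselines under comparable FLOPs and parameter budgets.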

Title: CLUE: Non-parametric Verification from Experience via Hidden-State Clustering

Authors: Zhenwen Liang, Ruosen Li, Yujun Zhou, Linfeng Song, Dian Yu, Xinya Du, Haitao Mi, Dong Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.01591
Pdf URL: https://arxiv.org/pdf/2510.01591
Copy Paste: [[2510.01591]] CLUE: Non-parametric Verification from Experience via Hidden-State Clustering(https://arxiv.org/abs/2510.01591)
Keywords: language model, llm
Abstract: Assessing the quality of Large Language Model (LLM) outputs presents a critical challenge. Previous methods either rely on text-level information (e.g., reward models, majority voting), which can overfit to superficial cues, or on calibrated confidence from token probabilities, which would fail on less-calibrated models. Yet both of these signals are, in fact, partial projections of a richer source of information: the model's internal hidden states. Early layers, closer to token embeddings, preserve semantic and lexical features that underpin text-based judgments, while later layers increasingly align with output logits, embedding confidence-related information. This paper explores hidden states directly as a unified foundation for verification. We show that the correctness of a solution is encoded as a geometrically separable signature within the trajectory of hidden activations. To validate this, we present CLUE (Clustering and Experience-based Verification), a deliberately minimalist, non-parametric verifier. With no trainable parameters, CLUE only summarizes each reasoning trace by a hidden-state delta and classifies correctness via nearest-centroid distance to "success" and "failure" clusters formed from past experience. The simplicity of this method highlights the strength of the underlying signal. Empirically, CLUE consistently outperforms LLM-as-a-judge baselines and matches or exceeds modern confidence-based methods in reranking candidates, improving both top-1 and majority-vote accuracy across AIME 24/25 and GPQA. As a highlight, on AIME 24 with a 1.5B model, CLUE boosts accuracy from 56.7% (majority@64) to 70.0% (top-maj@16).
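Summary: Prior ways of assessing LLM outputs rely either on text-level information (e.g., reward models, majority voting), which can overfit to superficial cues, or on calibrated confidence from token probabilities, which fails on less-calibrated models. This paper uses the model's hidden states directly and shows that solution correctness is encoded as a geometrically separable signature in the trajectory of hidden activations. CLUE (Clustering and Experience-based Verification) is a non-parametric verifier with no trainable parameters: it summarizes each reasoning trace by a hidden-state delta and classifies correctness by nearest-centroid distance to "success" and "failure" clusters formed from past experience. CLUE outperforms LLM-as-a-judge baselines and matches or exceeds modern confidence-based methods in reranking across AIME 24/25 and GPQA; on AIME 24 with a 1.5B model, it raises accuracy from 56.7% (majority@64) to 70.0% (top-maj@16).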
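The nearest-centroid verification that CLUE describes can be sketched in a few lines. The abstract does not say how the hidden-state delta is computed or which distance is used, so the end-minus-start delta and Euclidean distance below are assumptions, and the 2-d vectors are toy stand-ins for real hidden activations.

```python
import math

def hidden_state_delta(h_start, h_end):
    """Summarize a trace as the change in hidden state (one assumed definition)."""
    return [e - s for s, e in zip(h_start, h_end)]

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def fit_centroids(deltas, is_correct):
    """Average past successes and failures separately; nothing is trained."""
    success = [d for d, ok in zip(deltas, is_correct) if ok]
    failure = [d for d, ok in zip(deltas, is_correct) if not ok]
    return centroid(success), centroid(failure)

def verify(delta, success_c, failure_c):
    """Accept a trace if its delta is closer to the success centroid."""
    return math.dist(delta, success_c) < math.dist(delta, failure_c)

# Toy "experience": deltas from past traces with known correctness.
past_deltas = [[1.0, 1.0], [1.2, 0.8], [-1.0, -0.9], [-0.8, -1.1]]
past_labels = [True, True, False, False]
success_c, failure_c = fit_centroids(past_deltas, past_labels)

new_delta = hidden_state_delta([0.1, 0.0], [1.0, 1.1])
looks_correct = verify(new_delta, success_c, failure_c)
```

For reranking, the same distances could be used as scores to order candidate solutions rather than as a hard accept/reject decision.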

Title: A Comparison of Independent and Joint Fine-tuning Strategies for Retrieval-Augmented Generation

Authors: Neal Gregory Lawton, Alfy Samuel, Anoop Kumar, Daben Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.01600
Pdf URL: https://arxiv.org/pdf/2510.01600
Copy Paste: [[2510.01600]] A Comparison of Independent and Joint Fine-tuning Strategies for Retrieval-Augmented Generation(https://arxiv.org/abs/2510.01600)
Keywords: language model, llm, retrieval augmented generation, retrieval-augmented generation
Abstract: Retrieval augmented generation (RAG) is a popular framework for question answering that is powered by two large language models (LLMs): an embedding model that retrieves context documents from a database that are relevant to a given question, and a generator model that uses the retrieved context to generate an answer to the question. Both the embedding and generator models can be fine-tuned to increase performance of a RAG pipeline on a new task, but multiple fine-tuning strategies exist with different costs and benefits. In this paper, we evaluate and compare several RAG fine-tuning strategies, including independent, joint, and two-phase fine-tuning. In our experiments, we observe that all of these strategies achieve about equal improvement in EM and F1 generation quality metrics, although they have significantly different computational costs. We conclude that the optimal fine-tuning strategy to use depends on whether the training dataset includes context labels and whether a grid search over the learning rates for the embedding and generator models is required. (TL;DR: We evaluate and compare strategies for fine-tuning Retrieval Augmented Generation (RAG) pipelines, including independent fine-tuning, joint fine-tuning, and two-phase fine-tuning. Author keywords: Retrieval-Augmented Generation (RAG), Large Language Models (LLMs), Fine-tuning, Question Answering, Joint fine-tuning. Published: 20 Aug 2025; last modified: 17 Sept 2025; EMNLP 2025 Findings; CC BY 4.0.)
摘要：检索增强生成（RAG）是一种流行的问答框架，由两个大型语言模型（LLM）驱动：一个嵌入模型，用于从数据库中检索与给定问题相关的上下文文档；以及一个生成器模型，利用检索到的上下文生成问题的答案。嵌入模型和生成器模型都可以进行微调，以提高RAG管道在新任务上的性能，但存在多种微调策略，其成本和收益各不相同。在本文中，我们评估并比较了几种RAG微调策略，包括独立微调、联合微调和两阶段微调。在我们的实验中，我们观察到所有这些策略在EM和F1生成质量指标上取得了大致相同的提升，尽管它们的计算成本差异显著。我们得出结论，最佳微调策略取决于训练数据集是否包含上下文标签，以及是否需要对嵌入模型和生成器模型的学习率进行网格搜索。

Title: RAG-BioQA Retrieval-Augmented Generation for Long-Form Biomedical Question Answering

Authors: Lovely Yeswanth Panchumarthi, Sai Prasad Gudari, Atharva Negi, Praveen Raj Budime, Harsit Upadhya
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.01612
Pdf URL: https://arxiv.org/pdf/2510.01612
Copy Paste: [[2510.01612]] RAG-BioQA Retrieval-Augmented Generation for Long-Form Biomedical Question Answering(https://arxiv.org/abs/2510.01612)
Keywords: retrieval-augmented generation
Abstract: The exponential growth of biomedical literature creates significant challenges for accessing precise medical information. Current biomedical question-answering systems primarily focus on short-form answers, failing to provide the comprehensive explanations necessary for clinical decision-making. We present RAG-BioQA, a novel framework combining retrieval-augmented generation with domain-specific fine-tuning to produce evidence-based, long-form biomedical answers. Our approach integrates BioBERT embeddings with FAISS indexing and compares various re-ranking strategies (BM25, ColBERT, MonoT5) to optimize context selection before synthesizing evidence through a fine-tuned T5 model. Experimental results on the PubMedQA dataset show significant improvements over baselines, with our best model achieving substantial gains across BLEU, ROUGE, and METEOR metrics, advancing the state of accessible, evidence-based biomedical knowledge retrieval.
摘要：生物医学文献的指数增长为获取精确的医学信息带来了重大挑战。当前的生物医学提问系统主要集中于短形式的答案，未能提供临床决策所需的全面解释。我们提出了Rag-Bioqa，这是一个新颖的框架，将检索功能的生成与域特异性微调结合起来，产生基于证据的，长形式的生物医学答案。我们的方法将生物植物的嵌入与FAISS索引相结合，并比较各种重新排列策略（BM25，Colbert，Monot5），以优化上下文选择，然后通过微调的T5模型综合证据。 PubMedQA数据集的实验结果表现出比基线的显着改善，我们的最佳模型在BLEU，ROUGE和流星指标上取得了可观的增长，从而促进了可访问的，基于证据的生物医学知识检索的状态。

Title: Efficient Training of Robust Traditional Chinese LLaMA-1B on a Single Consumer GPU: Continual Pre-training, SFT, and DPO

Authors: Yu-Cheng Chih, Ming-Tao Duan, Yong-Hao Hou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.01616
Pdf URL: https://arxiv.org/pdf/2510.01616
Copy Paste: [[2510.01616]] Efficient Training of Robust Traditional Chinese LLaMA-1B on a Single Consumer GPU: Continual Pre-training, SFT, and DPO(https://arxiv.org/abs/2510.01616)
Keywords: language model
Abstract: Small Language Models (SLMs) enable cost-effective, on-device and latency-sensitive AI applications, yet their deployment in Traditional Chinese (TC) remains hindered by token-level instability - models unpredictably emit non-TC characters or code-switch into other languages. We address this practical reliability gap by creating PureTC-1B, a three-stage stabilization pipeline for Llama-3.2-1B-Instruct (an open-weight, instruction-tuned model released by Meta) using parameter-efficient LoRA adapters. Our method combines Continual Pre-Training (CPT) on TC-centric corpora, Supervised Fine-Tuning (SFT) with instruction data, and Direct Preference Optimization (DPO) using TC-adherence preferences to improve monolingual robustness without full-model retraining. On a benchmark designed to simulate real-world usage, PureTC-1B achieves a 51.3% relative reduction (micro-average) in non-TC output tokens versus the base model. On a Named Entity Translation (NET) task, PureTC-1B further reduces incorrect-language tokens by 77.2% relative to Llama-3B and 57.2% relative to Qwen-1.5B, indicating that robust TC adherence is attainable even at the 1B scale. The pipeline is reproducible, adapter-only, and hardware-friendly, offering practitioners a practical recipe to enhance language stability for TC and potentially other non-English languages.
摘要：小型语言模型（SLM）启用了具有成本效益的，设备和对潜伏期敏感的AI应用程序，但是它们在传统的中文（TC）中的部署仍然受到令牌级不稳定性的阻碍 - 模型无法预测地发射非TC字符或代码转换为其他语言。我们通过创建PURETC-1B来解决这一实用的可靠性差距，这是使用参数效率的lora适配器的Llama-3.2-1b-Instruct（Meta发布的开放量，指导调节模型）的三阶段稳定管道。我们的方法结合了以TC为中心语料库的连续预训练（CPT），监督微调（SFT）与指导数据以及使用TC辅助偏好的直接偏好优化（DPO），以提高单语的鲁棒性，而无需全模型retraning。在旨在模拟现实世界中的基准测试中，Puretc-1b在非TC输出令牌中与基本模型相比，在非TC输出令牌中实现了51.3％的相对降低（微平均值）。在指定的实体翻译（NET）任务上，PURETC-1B相对于Llama-3B，相对于QWEN-1.5B，PURETC-1B将不正确的语言令牌进一步降低了77.2％，而相对于QWEN-1.5B，puretc-1b也将不正确的语言令牌降低了，这表明即使在1B量表下也可以实现强大的TC依从性。该管道可再现，仅适配和硬件友好，为从业者提供了一种实用的食谱，以增强TC和潜在的其他非英语语言的语言稳定性。
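The "relative reduction (micro-average)" figure quoted in the abstract above can be read as pooling token counts over all samples before comparing rates. The following is a minimal sketch under that reading, not the authors' evaluation code; the function name and the `(off_target_tokens, total_tokens)` input format are hypothetical, and how a token is judged non-TC is left outside the sketch.

```python
def micro_relative_reduction(base_counts, tuned_counts):
    """Relative reduction in the pooled off-target token rate.

    Each argument is a list of (off_target_tokens, total_tokens) pairs,
    one pair per evaluated sample (hypothetical input format).
    """
    def pooled_rate(counts):
        off = sum(o for o, _ in counts)
        total = sum(n for _, n in counts)
        if total == 0:
            raise ValueError("no tokens to evaluate")
        return off / total  # micro-average: pool tokens, then divide

    base_rate = pooled_rate(base_counts)
    tuned_rate = pooled_rate(tuned_counts)
    if base_rate == 0:
        raise ValueError("base model has no off-target tokens")
    return (base_rate - tuned_rate) / base_rate


# Toy numbers, invented for illustration only.
base = [(10, 100), (30, 100)]   # pooled rate 40/200 = 0.20
tuned = [(5, 100), (5, 100)]    # pooled rate 10/200 = 0.05
reduction = micro_relative_reduction(base, tuned)
print(f"relative reduction: {reduction:.1%}")
```

Pooling before dividing weights long outputs more heavily than short ones; a macro-average would instead average per-sample rates.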

Title: AMAS: Adaptively Determining Communication Topology for LLM-based Multi-Agent System

Authors: Hui Yi Leong, Yuheng Li, Yuqing Wu, Wenwen Ouyang, Wei Zhu, Jiechao Gao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.01617
Pdf URL: https://arxiv.org/pdf/2510.01617
Copy Paste: [[2510.01617]] AMAS: Adaptively Determining Communication Topology for LLM-based Multi-Agent System(https://arxiv.org/abs/2510.01617)
Keywords: language model, llm, agent
Abstract: Although large language models (LLMs) have revolutionized natural language processing capabilities, their practical implementation as autonomous multi-agent systems (MAS) for industrial problem-solving encounters persistent barriers. Conventional MAS architectures are fundamentally restricted by inflexible, hand-crafted graph topologies that lack contextual responsiveness, resulting in diminished efficacy across varied academic and commercial workloads. To surmount these constraints, we introduce AMAS, a paradigm-shifting framework that redefines LLM-based MAS through a novel dynamic graph designer. This component autonomously identifies task-specific optimal graph configurations via lightweight LLM adaptation, eliminating the reliance on monolithic, universally applied structural templates. Instead, AMAS exploits the intrinsic properties of individual inputs to intelligently direct query trajectories through task-optimized agent pathways. Rigorous validation across question answering, mathematical deduction, and code generation benchmarks confirms that AMAS systematically exceeds state-of-the-art single-agent and multi-agent approaches across diverse LLM architectures. Our investigation establishes that context-sensitive structural adaptability constitutes a foundational requirement for high-performance LLM MAS deployments.
摘要：尽管大型语言模型（LLMS）彻底改变了自然语言处理能力，但其实际实施是用于工业问题解决问题的自主多机构系统（MAS）。传统的MAS体系结构从根本上受到缺乏上下文响应能力的僵化，手工制作的图形拓扑结构的限制，从而导致各种学术和商业工作负载的疗效降低。为了克服这些约束，我们介绍了AMA，这是一个范式移动框架，该框架通过新颖的动态图设计器重新定义了基于LLM的MAS。该组件自主通过轻巧的LLM适应来识别特定任务的最佳图形配置，从而消除了对单片，普遍应用的结构模板的依赖。取而代之的是，AMA通过任务优化的代理途径利用单个输入的固有属性到智能直接查询轨迹。在问答，数学扣除和代码生成基准之间进行严格的验证证实，AMA系统地超过了不同LLM架构的最先进的单机构和多机构方法。我们的调查表明，上下文敏感的结构适应性构成了高性能LLM MAS部署的基础要求。

Title: NLP Methods for Detecting Novel LLM Jailbreaks and Keyword Analysis with BERT

Authors: John Hawkins, Aditya Pramar, Rodney Beard, Rohitash Chandra
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2510.01644
Pdf URL: https://arxiv.org/pdf/2510.01644
Copy Paste: [[2510.01644]] NLP Methods for Detecting Novel LLM Jailbreaks and Keyword Analysis with BERT(https://arxiv.org/abs/2510.01644)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) suffer from a range of vulnerabilities that allow malicious users to solicit undesirable responses through manipulation of the input text. These so-called jailbreak prompts are designed to trick the LLM into circumventing the safety guardrails put in place to keep responses compliant with the developer's policies. In this study, we analyse the ability of different machine learning models to distinguish jailbreak prompts from genuine uses, including looking at our ability to identify jailbreaks that use previously unseen strategies. Our results indicate that, using current datasets, the best performance is achieved by fine-tuning a Bidirectional Encoder Representations from Transformers (BERT) model end-to-end for identifying jailbreaks. We visualise the keywords that distinguish jailbreak from genuine prompts and conclude that explicit reflexivity in prompt structure could be a signal of jailbreak intention.
摘要：大型语言模型（LLMS）遭受了一系列漏洞，使恶意用户可以通过操纵输入文本征求不良响应。这些所谓的越狱提示旨在欺骗LLM，以规避制定的安全护栏，以保持开发商政策可接受的回应。在这项研究中，我们分析了不同机器学习模型将越狱提示与真实用途区分开的能力，包括研究我们识别使用以前看不见策略的越狱的能力。我们的结果表明，使用当前的数据集，通过从变形金刚（BERT）端到端的双向编码器表示以识别越狱的方式来实现最佳性能。我们可以看到将越狱与真正的提示区分开的关键字，并得出结论，迅速结构中的明确反身性可能是越狱意图的信号。

Title: Learning to Look at the Other Side: A Semantic Probing Study of Word Embeddings in LLMs with Enabled Bidirectional Attention

Authors: Zhaoxin Feng, Jianfei Ma, Emmanuele Chersoni, Xiaojing Zhao, Xiaoyi Bao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.01652
Pdf URL: https://arxiv.org/pdf/2510.01652
Copy Paste: [[2510.01652]] Learning to Look at the Other Side: A Semantic Probing Study of Word Embeddings in LLMs with Enabled Bidirectional Attention(https://arxiv.org/abs/2510.01652)
Keywords: language model, llm
Abstract: Autoregressive Large Language Models (LLMs) demonstrate exceptional performance in language understanding and generation. However, their application in text embedding tasks has been relatively slow, along with the analysis of their semantic representation in probing tasks, due to the constraints of the unidirectional attention mechanism. This paper aims to explore whether such constraints can be overcome by enabling bidirectional attention in LLMs. We tested different variants of the Llama architecture through additional training steps, progressively enabling bidirectional attention and unsupervised/supervised contrastive learning.
摘要：自回归的大型语言模型（LLMS）在语言理解和产生方面表现出卓越的表现。但是，由于单向注意机制的限制，它们在文本嵌入任务中的应用相对较慢，以及对探测任务中的语义表示的分析。本文旨在通过在LLM中引起双向关注来探讨是否可以克服此类限制。我们通过其他训练步骤测试了Llama体系结构的不同变体，逐步使双向关注和无监督/监督的对比学习。

Title: SoK: Measuring What Matters for Closed-Loop Security Agents

Authors: Mudita Khurana, Raunak Jain
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.01654
Pdf URL: https://arxiv.org/pdf/2510.01654
Copy Paste: [[2510.01654]] SoK: Measuring What Matters for Closed-Loop Security Agents(https://arxiv.org/abs/2510.01654)
Keywords: agent
Abstract: Cybersecurity is a relentless arms race, with AI-driven offensive systems evolving faster than traditional defenses can adapt. Research and tooling remain fragmented across isolated defensive functions, creating blind spots that adversaries exploit. Autonomous agents capable of integrating exploit confirmation, remediation, and validation into a single closed loop offer promise, but the field lacks three essentials: a framework defining the agentic capabilities of security systems across the security life cycle, a principled method for evaluating closed-loop agents, and a benchmark for measuring their performance in practice. We introduce CLASP, the Closed-Loop Autonomous Security Performance framework, which aligns the security lifecycle (reconnaissance, exploitation, root cause analysis, patch synthesis, validation) with core agentic capabilities (planning, tool use, memory, reasoning, reflection & perception), providing a common vocabulary and rubric for assessing agentic capabilities in security tasks. By applying CLASP to 21 representative works, we map where systems demonstrate strengths and where capability gaps persist. We then define the Closed-Loop Capability (CLC) Score, a composite metric quantifying both degree of loop closure and operational effectiveness, and outline the requirements for a closed-loop benchmark. Together, CLASP and the CLC Score provide the vocabulary, diagnostics, and measurements needed both to advance function-level performance and to measure closed-loop security agents.
摘要：网络安全是一场无休止的军备竞赛，AI驱动的攻击系统的演进速度超过了传统防御的适应速度。研究和工具仍然分散在相互孤立的防御功能中，从而产生了被对手利用的盲点。能够将漏洞利用确认、修复和验证整合到单一闭环中的自主代理前景可期，但该领域缺乏三个基本要素：一个定义安全系统在整个安全生命周期中代理能力的框架，一种评估闭环代理的原则性方法，以及一个衡量其实际性能的基准。我们介绍CLASP：闭环自主安全性能框架，该框架将安全生命周期（侦察、漏洞利用、根本原因分析、补丁合成、验证）与核心代理能力（规划、工具使用、记忆、推理、反思和感知）对齐，为评估安全任务中的代理能力提供通用词汇和评分标准。通过将CLASP应用于21个代表性工作，我们梳理出系统的优势所在以及能力差距仍然存在之处。然后，我们定义了闭环能力（CLC）评分，这是一种同时量化闭环程度和运行有效性的复合指标，并概述了闭环基准的要求。CLASP和CLC评分共同提供了提升功能级性能和衡量闭环安全代理所需的词汇、诊断和度量。

Title: MDSEval: A Meta-Evaluation Benchmark for Multimodal Dialogue Summarization

Authors: Yinhong Liu, Jianfeng He, Hang Su, Ruixue Lian, Yi Nian, Jake Vincent, Srikanth Vishnubhotla, Robinson Piramuthu, Saab Mansour
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.01659
Pdf URL: https://arxiv.org/pdf/2510.01659
Copy Paste: [[2510.01659]] MDSEval: A Meta-Evaluation Benchmark for Multimodal Dialogue Summarization(https://arxiv.org/abs/2510.01659)
Keywords: llm
Abstract: Multimodal Dialogue Summarization (MDS) is a critical task with wide-ranging applications. To support the development of effective MDS models, robust automatic evaluation methods are essential for reducing both cost and human effort. However, such methods require a strong meta-evaluation benchmark grounded in human annotations. In this work, we introduce MDSEval, the first meta-evaluation benchmark for MDS, consisting of image-sharing dialogues, corresponding summaries, and human judgments across eight well-defined quality aspects. To ensure data quality and richness, we propose a novel filtering framework leveraging Mutually Exclusive Key Information (MEKI) across modalities. Our work is the first to identify and formalize key evaluation dimensions specific to MDS. We benchmark state-of-the-art modal evaluation methods, revealing their limitations in distinguishing summaries from advanced MLLMs and their susceptibility to various biases.
摘要：多模式对话摘要（MDS）是具有广泛应用程序的关键任务。为了支持有效的MDS模型的开发，强大的自动评估方法对于降低成本和人力的努力至关重要。但是，这种方法需要以人类注释为基于人类注释的强大元评估基准。在这项工作中，我们介绍了MDSSEVAL，这是MD的第一个元评估基准，其中包括图像共享对话，相应的摘要以及跨八个明确定义的质量方面的人类判断。为了确保数据质量和丰富性，我们提出了一个新颖的过滤框架，该框架利用跨模式的相互排斥的关键信息（MEKI）。我们的工作是第一个识别和正式化MDS特定的关键评估维度的工作。我们基于最新的模态评估方法，揭示了它们在区分摘要和高级MLLM及其对各种偏见的敏感性方面的局限性。

Title: FOR-Prompting: From Objection to Revision via an Asymmetric Prompting Protocol

Authors: He Zhang, Anzhou Zhang, Jian Dai
Subjects: cs.CL, cs.AI, cs.MA
Abstract URL: https://arxiv.org/abs/2510.01674
Pdf URL: https://arxiv.org/pdf/2510.01674
Copy Paste: [[2510.01674]] FOR-Prompting: From Objection to Revision via an Asymmetric Prompting Protocol(https://arxiv.org/abs/2510.01674)
Keywords: gpt, prompt
Abstract: Reasoning protocols such as Chain of Thought (CoT) and Tree of Thought (ToT) organize internal deliberation but lack an explicit mechanism for external questioning that elicits self-revision. We present FOR-Prompting (From Objection to Revision Prompting), an asymmetric protocol where a Defender proposes an answer, an Objectioner raises question-style objections with no direct fixes, and a Host enforces consistency and closure. On GSM8K we observe a gain of about 22 percentage points over single-prompt and accuracy on par with CoT, with more than 10% higher ratings in reasoning and coherence from a uniform GPT 4.1 judge. FOR-Prompting also corrects mistakes without tools or human supervision on tricky queries, and improves performance for small-scale models (approx. 19% accuracy improvement on Llama3.2:1b for the GSM8K task), highlighting promise for small models and personal-device use. Beyond factual QA, qualitative analyses on open-ended tasks show enhanced exploration and refinement, with dialogue traces that make assumptions and trade-offs explicit. The protocol is model agnostic and operates purely at the prompt level through role-structured turns, so it works with hosted and local models of different sizes without retraining, and it supports large-scale study of objection-guided reasoning.
摘要：推理方案（例如思想链（COT）和思想树（TOT））组织了内部审议，但缺乏引起自我革命的外部质疑的明确机制。我们提出了待办事项（从异议到修订提示），这是一种不对称的协议，辩护人提出答案，反对者提出问题风格的异议，而无需直接修复，并且主机会执行一致性和封闭。在GSM8K上，我们观察到与COT相当的单次奖励和准确性约为22％的增长，其推理和一致性的统一GPT 4.1法官的评级高10％以上。待办事项还可以在没有工具或人为棘手的查询中纠正错误，并提高小规模型号的性能（GSM8K任务的Llama3.2：1b上有大约19％的准确性），突出了小型模型和个人设备使用的承诺。除了事实质量检查之外，对开放式任务的定性分析显示出增强的探索和改进，对话痕迹可以明确地进行假设和权衡。该协议是模型不可知论的，并且通过角色结构的转弯纯粹在迅速的水平上运行，因此它可以与不同尺寸的托管和本地模型一起工作，而无需重新培训，并且支持对反对意见的推理的大规模研究。

Title: How Do Language Models Compose Functions?

Authors: Apoorv Khandelwal, Ellie Pavlick
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.01685
Pdf URL: https://arxiv.org/pdf/2510.01685
Copy Paste: [[2510.01685]] How Do Language Models Compose Functions?(https://arxiv.org/abs/2510.01685)
Keywords: language model, llm
Abstract: While large language models (LLMs) appear to be increasingly capable of solving compositional tasks, it is an open question whether they do so using compositional mechanisms. In this work, we investigate how feedforward LLMs solve two-hop factual recall tasks, which can be expressed compositionally as $g(f(x))$. We first confirm that modern LLMs continue to suffer from the "compositionality gap": i.e., their ability to compute both $z = f(x)$ and $y = g(z)$ does not entail their ability to compute the composition $y = g(f(x))$. Then, using logit lens on their residual stream activations, we identify two processing mechanisms, one which solves tasks $\textit{compositionally}$, computing $f(x)$ along the way to computing $g(f(x))$, and one which solves them $\textit{directly}$, without any detectable signature of the intermediate variable $f(x)$. Finally, we find that which mechanism is employed appears to be related to the embedding space geometry, with the idiomatic (direct) mechanism being dominant in cases where there exists a linear mapping from $x$ to $g(f(x))$ in the embedding spaces. We fully release our data and code at: this https URL.
摘要：尽管大型语言模型（LLMS）似乎越来越能够解决组成任务，但是否使用组成机制是一个悬而未决的问题。在这项工作中，我们研究了FeedForward LLMS如何求解两跳事实召回任务，可以在构图上以$ g（f（x））$表示。我们首先确认现代LLMS继续遭受“构图差距”的困扰：即他们计算$ z = f（x）$和$ y = g（z）$的能力并不需要他们计算构图$ y = g（f（x）$）。然后，使用logit镜头在其残留流激活上，我们确定了两个处理机制，一种解决任务$ \ textit {compositionally} $，计算$ f（x）$在计算$ g（f（x）$的过程中最后，我们发现使用哪种机制似乎与嵌入空间几何形状有关，而在嵌入空间中存在从$ x $到$ g（f（x））$的线性映射的情况下，惯用机制占主导地位。我们将数据和代码完全发布：此HTTPS URL。
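The logit-lens analysis mentioned in the abstract above decodes intermediate residual-stream vectors through the model's output (unembedding) matrix to see which token each layer currently favours. The toy sketch below illustrates the idea only: the three-token vocabulary, the identity unembedding matrix, and the per-layer vectors are all invented, and the final layer normalisation that real implementations usually apply before unembedding is omitted.

```python
def matvec(matrix, vec):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def logit_lens(hidden_states, unembed, vocab):
    """Return the top-scoring token at each layer.

    hidden_states: one residual-stream vector per layer (toy values here).
    unembed: matrix with one row per vocabulary entry.
    """
    decoded = []
    for h in hidden_states:
        logits = matvec(unembed, h)
        best = max(range(len(vocab)), key=lambda i: logits[i])
        decoded.append(vocab[best])
    return decoded


def shows_intermediate(decoded, intermediate):
    """True if the intermediate answer surfaces before the final layer."""
    return intermediate in decoded[:-1]


# Hypothetical vocabulary and identity unembedding, for illustration only.
vocab = ["x", "f(x)", "g(f(x))"]
unembed = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

compositional = [[1.0, 0.0, 0.0], [0.2, 0.9, 0.1], [0.0, 0.3, 0.8]]
direct = [[1.0, 0.0, 0.0], [0.6, 0.1, 0.5], [0.1, 0.1, 0.9]]

print(logit_lens(compositional, unembed, vocab))
print(logit_lens(direct, unembed, vocab))
```

In this toy setup, a trace in which `f(x)` becomes the top token at a middle layer corresponds to the compositional signature, while a trace that jumps from `x` to `g(f(x))` corresponds to the direct one.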

Title: Format Inertia: A Failure Mechanism of LLMs in Medical Pre-Consultation

Authors: Seungseop Lim, Gibaeg Kim, Wooseok Han, Jean Seo, Hyunkyung Lee, Jaehyo Yoo, Eunho Yang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.01688
Pdf URL: https://arxiv.org/pdf/2510.01688
Copy Paste: [[2510.01688]] Format Inertia: A Failure Mechanism of LLMs in Medical Pre-Consultation(https://arxiv.org/abs/2510.01688)
Keywords: language model, llm, chat
Abstract: Recent advances in Large Language Models (LLMs) have brought significant improvements to various service domains, including chatbots and medical pre-consultation applications. In the healthcare domain, the most common approach for adapting LLMs to multi-turn dialogue generation is Supervised Fine-Tuning (SFT). However, datasets for SFT in tasks like medical pre-consultation typically exhibit a skewed turn-count distribution. Training on such data induces a novel failure mechanism we term **Format Inertia**, where models tend to generate repetitive, format-correct, but diagnostically uninformative questions in long medical dialogues. To mitigate this observed failure mechanism, we adopt a simple, data-centric method that rebalances the turn-count distribution of the training dataset. Experimental results show that our approach substantially alleviates Format Inertia in medical pre-consultation.
摘要：大型语言模型（LLM）的最新进展为包括聊天机器人和医疗预培养应用在内的各种服务领域带来了重大改进。在医疗保健领域，将LLM适应多转化对话的最常见方法是监督的微调（SFT）。但是，在医疗预培养等任务中用于SFT的数据集通常表现出偏斜的转盘分布。对此类数据的培训会导致一种新颖的失败机制，我们称呼“格式”惯性**，该模型倾向于在长期的医疗对话中产生重复性，格式纠正，但可以诊断出诊断性的问题。为了减轻这种观察到的故障机制，我们采用了一种简单的，以数据为中心的方法，可以重新平衡训练数据集的转数分布。实验结果表明，我们的方法大大减轻了医学预培养中的惯性。
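The abstract above describes rebalancing the turn-count distribution of the training set but does not say how. One simple possibility is capped downsampling per turn count, sketched below; this is an assumption made for illustration, not the authors' procedure, and the function name, the `cap_per_bucket` parameter, and the list-of-turns dialogue format are hypothetical.

```python
import random
from collections import defaultdict


def rebalance_by_turn_count(dialogues, cap_per_bucket, seed=0):
    """Downsample so that no turn count contributes more than cap_per_bucket dialogues.

    dialogues: list of dialogues, each a list of turns (hypothetical format).
    """
    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    buckets = defaultdict(list)
    for dialogue in dialogues:
        buckets[len(dialogue)].append(dialogue)

    balanced = []
    for turns in sorted(buckets):
        group = buckets[turns]
        if len(group) > cap_per_bucket:
            group = rng.sample(group, cap_per_bucket)
        balanced.extend(group)
    return balanced


# Toy skewed dataset: many short dialogues, few long ones.
short = [["q", "a"] for _ in range(10)]
long_ = [["q", "a"] * 4 for _ in range(3)]
balanced = rebalance_by_turn_count(short + long_, cap_per_bucket=3)
print(len(balanced))
```

Downsampling discards data; upsampling rare turn counts or reweighting the loss are alternative ways to flatten the same distribution.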

Title: What MLLMs Learn about When they Learn about Multimodal Reasoning: Perception, Reasoning, or their Integration?

Authors: Jiwan Chung, Neel Joshi, Pratyusha Sharma, Youngjae Yu, Vibhav Vineet
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.01719
Pdf URL: https://arxiv.org/pdf/2510.01719
Copy Paste: [[2510.01719]] What MLLMs Learn about When they Learn about Multimodal Reasoning: Perception, Reasoning, or their Integration?(https://arxiv.org/abs/2510.01719)
Keywords: llm
Abstract: Multimodal reasoning models have recently shown promise on challenging domains such as olympiad-level geometry, yet their evaluation remains dominated by aggregate accuracy, a single score that obscures where and how models are improving. We introduce MathLens, a benchmark designed to disentangle the subskills of multimodal reasoning while preserving the complexity of textbook-style geometry problems. The benchmark separates performance into three components: Perception: extracting information from raw inputs, Reasoning: operating on available information, and Integration: selecting relevant perceptual evidence and applying it within reasoning. To support each test, we provide annotations: visual diagrams, textual descriptions to evaluate reasoning in isolation, controlled questions that require both modalities, and probes for fine-grained perceptual skills, all derived from symbolic specifications of the problems to ensure consistency and robustness. Our analysis reveals that different training approaches have uneven effects: First, reinforcement learning chiefly strengthens perception, especially when supported by textual supervision, while textual SFT indirectly improves perception through reflective reasoning. Second, reasoning improves only in tandem with perception. Third, integration remains the weakest capacity, with residual errors concentrated there once other skills advance. Finally, robustness diverges: RL improves consistency under diagram variation, whereas multimodal SFT reduces it through overfitting. We will release all data and experimental logs.
摘要：多模式推理模型最近显示出对诸如奥林匹克级几何形状等具有挑战性的领域的希望，但是他们的评估仍然由汇总准确性主导，这是一个掩盖模型在哪里以及如何改善的单个分数。我们介绍了Mathlens，这是一种基准测试，旨在解散多模式推理的亚技能，同时保留教科书式的几何问题的复杂性。基准测试将绩效分为三个组成部分：感知：从原始输入中提取信息，推理：操作可用信息和集成：选择相关的感知证据并将其应用于推理。为了支持每个测试，我们提供注释：视觉图，文本描述以评估孤立的推理，需要同时模态的受控问题，以及对精细感知技能的探测，所有这些都来自问题的符号规格，以确保一致性和稳健性。我们的分析表明，不同的培训方法具有不平衡的影响：首先，强化学习主要增强了感知，尤其是在文本监督支持时，而文本SFT则间接地通过反思性推理来改善感知。其次，推理仅在与感知的同时提高。第三，集成仍然是最弱的能力，一旦其他技能提高，剩余错误就集中在那里。最后，鲁棒性分歧：RL在图变化下提高了一致性，而多模式SFT通过过度拟合减少了它。我们将发布所有数据和实验日志。

Title: Can LLMs Refuse Questions They Do Not Know? Measuring Knowledge-Aware Refusal in Factual Tasks

Authors: Wenbo Pan, Jie Xu, Qiguang Chen, Junhao Dong, Libo Qin, Xinfeng Li, Haining Yu, Xiaohua Jia
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.01782
Pdf URL: https://arxiv.org/pdf/2510.01782
Copy Paste: [[2510.01782]] Can LLMs Refuse Questions They Do Not Know? Measuring Knowledge-Aware Refusal in Factual Tasks(https://arxiv.org/abs/2510.01782)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) should refuse to answer questions beyond their knowledge. This capability, which we term knowledge-aware refusal, is crucial for factual reliability. However, existing metrics fail to faithfully measure this ability. On the one hand, simple refusal-based metrics are biased by refusal rates and yield inconsistent scores when models exhibit different refusal tendencies. On the other hand, existing calibration metrics are proxy-based, capturing the performance of auxiliary calibration processes rather than the model's actual refusal behavior. In this work, we propose the Refusal Index (RI), a principled metric that measures how accurately LLMs refuse questions they do not know. We define RI as Spearman's rank correlation between refusal probability and error probability. To make RI practically measurable, we design a lightweight two-pass evaluation method that efficiently estimates RI from observed refusal rates across two standard evaluation runs. Extensive experiments across 16 models and 5 datasets demonstrate that RI accurately quantifies a model's intrinsic knowledge-aware refusal capability in factual tasks. Notably, RI remains stable across different refusal rates and provides consistent model rankings independent of a model's overall accuracy and refusal rates. More importantly, RI provides insight into an important but previously overlooked aspect of LLM factuality: while LLMs achieve high accuracy on factual tasks, their refusal behavior can be unreliable and fragile. This finding highlights the need to complement traditional accuracy metrics with the Refusal Index for comprehensive factuality evaluation.
摘要：大型语言模型（LLMS）应拒绝回答超出其知识的问题。我们认为拒绝知识的能力对于事实可靠性至关重要。但是，现有指标无法忠实地衡量这种能力。一方面，当模型表现出不同的拒绝趋势时，简单的基于拒绝的指标会被拒绝率和分数不一致的偏见。另一方面，现有的校准指标是基于代理的，可捕获辅助校准过程的性能，而不是模型的实际拒绝行为。在这项工作中，我们提出了拒绝指数（RI），这是一个原则上的指标，可衡量LLM拒绝他们不知道的问题的准确性。我们将RI定义为拒绝概率和错误概率之间的Spearman等级相关性。为了使RI实际上可以测量，我们设计了一种轻巧的两次评估方法，该方法从两次标准评估运行中观察到的拒绝率有效地估算了RI。跨16个模型和5个数据集进行的广泛实验表明，RI可以准确量化模型在事实任务中的内在知识拒绝能力。值得注意的是，RI在不同的拒绝率上保持稳定，并提供一致的模型排名，而与模型的总体准确性和拒绝率无关。更重要的是，RI提供了对LLM事实的重要但以前被忽视的方面的见解：尽管LLM在事实任务上具有很高的精度，但它们的拒绝行为可能是不可靠和脆弱的。这一发现凸显了需要通过拒绝索引来补充传统准确性指标，以进行全面的事实评估。
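The abstract above defines the Refusal Index as Spearman's rank correlation between refusal probability and error probability. The sketch below computes a plain Spearman correlation on per-question probabilities that are assumed to be directly available; it does not reproduce the paper's two-pass estimation from observed refusal rates, and all numbers are invented.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def spearman(xs, ys):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical per-question probabilities, for illustration only.
refusal_prob = [0.9, 0.7, 0.2, 0.1]
error_prob = [0.8, 0.6, 0.3, 0.05]
print(spearman(refusal_prob, error_prob))
```

A value near 1 means the model refuses most readily on exactly the questions it is most likely to get wrong; a value near 0 means refusals are unrelated to what the model knows.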

Title: Comparison of Unsupervised Metrics for Evaluating Judicial Decision Extraction

Authors: Ivan Leonidovich Litvak, Anton Kostin, Fedor Lashkin, Tatiana Maksiyan, Sergey Lagutin
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2510.01792
Pdf URL: https://arxiv.org/pdf/2510.01792
Copy Paste: [[2510.01792]] Comparison of Unsupervised Metrics for Evaluating Judicial Decision Extraction(https://arxiv.org/abs/2510.01792)
Keywords: gpt, llm
Abstract: The rapid advancement of artificial intelligence in legal natural language processing demands scalable methods for evaluating text extraction from judicial decisions. This study evaluates 16 unsupervised metrics, including novel formulations, to assess the quality of extracting seven semantic blocks from 1,000 anonymized Russian judicial decisions, validated against 7,168 expert reviews on a 1-5 Likert scale. These metrics, spanning document-based, semantic, structural, pseudo-ground truth, and legal-specific categories, operate without pre-annotated ground truth. Bootstrapped correlations, Lin's concordance correlation coefficient (CCC), and mean absolute error (MAE) reveal that Term Frequency Coherence (Pearson $r = 0.540$, Lin CCC = 0.512, MAE = 0.127) and Coverage Ratio/Block Completeness (Pearson $r = 0.513$, Lin CCC = 0.443, MAE = 0.139) best align with expert ratings, while Legal Term Density (Pearson $r = -0.479$, Lin CCC = -0.079, MAE = 0.394) shows a strong negative correlation. The LLM Evaluation Score (mean = 0.849, Pearson $r = 0.382$, Lin CCC = 0.325, MAE = 0.197) showed moderate alignment, but its performance, using gpt-4.1-mini via g4f, suggests limited specialization for legal texts. These findings highlight that unsupervised metrics, including LLM-based approaches, enable scalable screening but, with moderate correlations and low CCC values, cannot fully replace human judgment in high-stakes legal contexts. This work advances legal NLP by providing annotation-free evaluation tools, with implications for judicial analytics and ethical AI deployment.
摘要：法律自然语言处理中人工智能的快速发展需要可扩展的方法来评估司法决策的文本提取。这项研究评估了16个无监督的指标，包括新型制定的指标，以评估从1,000个匿名俄罗斯司法决策中提取7个语义块的质量，并以1--5个Likert量表对7,168个专家评论进行了验证。这些指标，涵盖基于文档的，语义，结构，伪地真相和特定于法律的类别，无需预先注销的地面真相。 Bootstrapped correlations, Lin's concordance correlation coefficient (CCC), and mean absolute error (MAE) reveal that Term Frequency Coherence (Pearson $r = 0.540$, Lin CCC = 0.512, MAE = 0.127) and Coverage Ratio/Block Completeness (Pearson $r = 0.513$, Lin CCC = 0.443, MAE = 0.139) best align with expert ratings, while Legal Term密度（Pearson $ r = -0.479 $，Lin CCC = -0.079，MAE = 0.394）显示出强的负相关。 LLM评估得分（平均= 0.849，Pearson $ r = 0.382 $，Lin CCC = 0.325，MAE = 0.197）显示中度对齐，但是使用GPT-4.1-MINI通过G4F，其性能表明法律文本的专业化有限。这些发现凸显了无监督的指标，包括基于LLM的方法，可以进行可扩展的筛查，但是，具有中等相关性和低CCC值的相关性，无法在高风险法律背景下完全取代人类的判断。这项工作通过提供无注释的评估工具来推动法律NLP，这对司法分析和道德AI部署有影响。
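The abstract above reports Lin's concordance correlation coefficient (CCC) and mean absolute error (MAE) alongside Pearson's r. As a reminder of how the first two differ from plain correlation, here is a minimal sketch using the standard population-variance form of CCC; the score lists are invented, and how the study rescales metric values and Likert ratings onto a common scale is not stated here, so that step is left out.

```python
def lin_ccc(xs, ys):
    """Lin's concordance correlation coefficient (population-variance form)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    vx = sum((x - mx) ** 2 for x in xs) / n
    vy = sum((y - my) ** 2 for y in ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    # Unlike Pearson's r, the denominator penalises a shift between the means.
    return 2 * cov / (vx + vy + (mx - my) ** 2)


def mean_absolute_error(xs, ys):
    """Average absolute difference between paired scores."""
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)


# Invented scores: the metric tracks the expert ratings but sits one point higher.
expert = [1.0, 2.0, 3.0]
metric = [2.0, 3.0, 4.0]
print(lin_ccc(expert, metric), mean_absolute_error(expert, metric))
```

In the toy data the two series are perfectly linearly related, so Pearson's r would be 1, yet the constant offset pulls CCC down to 4/7 and gives an MAE of 1. This is why a metric can show a moderate r and a noticeably lower CCC against the same expert ratings.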

Title: Detecting LLM-Generated Spam Reviews by Integrating Language Model Embeddings and Graph Neural Network

Authors: Xin Liu, Rongwu Xu, Xinyi Jia, Jason Liao, Jiao Sun, Ling Huang, Wei Xu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.01801
Pdf URL: https://arxiv.org/pdf/2510.01801
Copy Paste: [[2510.01801]] Detecting LLM-Generated Spam Reviews by Integrating Language Model Embeddings and Graph Neural Network(https://arxiv.org/abs/2510.01801)
Keywords: language model, gpt, llm
Abstract: The rise of large language models (LLMs) has enabled the generation of highly persuasive spam reviews that closely mimic human writing. These reviews pose significant challenges for existing detection systems and threaten the credibility of online platforms. In this work, we first create three realistic LLM-generated spam review datasets using three distinct LLMs, each guided by product metadata and genuine reference reviews. Evaluations by GPT-4.1 confirm the high persuasion and deceptive potential of these reviews. To address this threat, we propose FraudSquad, a hybrid detection model that integrates text embeddings from a pre-trained language model with a gated graph transformer for spam node classification. FraudSquad captures both semantic and behavioral signals without relying on manual feature engineering or massive training resources. Experiments show that FraudSquad outperforms state-of-the-art baselines by up to 44.22% in precision and 43.01% in recall on three LLM-generated datasets, while also achieving promising results on two human-written spam datasets. Furthermore, FraudSquad maintains a modest model size and requires minimal labeled training data, making it a practical solution for real-world applications. Our contributions include new synthetic datasets, a practical detection framework, and empirical evidence highlighting the urgency of adapting spam detection to the LLM era. Our code and datasets are available at: this https URL.
摘要：大型语言模型（LLM）的兴起使得能够产生高度说服力的垃圾邮件评论，这些评论非常模仿人类的写作。这些评论对现有检测系统构成了重大挑战，并威胁到在线平台的信誉。在这项工作中，我们首先使用三个不同的LLM创建了三个现实的LLM生成的垃圾邮件评论数据集，每个LLM都由产品元数据和真实的参考评论引导。 GPT-4.1的评估证实了这些评论的高说服力和欺骗性潜力。为了应对这种威胁，我们提出了欺诈性，这是一种混合检测模型，该模型将预训练的语言模型的文本嵌入与垃圾邮件节点分类的封闭图形变压器相结合。欺诈行为捕获语义和行为信号，而无需依赖手动功能工程或大量培训资源。实验表明，在三个LLM生成的数据集上，欺诈的精确度高达44.22％，召回率的最高为44.22％，同时在两个人工写的垃圾邮件数据集中获得了有希望的结果。此外，Draudsquad保持了适度的模型大小，并且需要最少的标记培训数据，从而使其成为现实世界应用程序的实用解决方案。我们的贡献包括新的合成数据集，一个实际的检测框架以及经验证据，强调了将垃圾邮件检测到LLM时代的紧迫性。我们的代码和数据集可在以下网址提供：此HTTPS URL。

Title: Syntactic Blind Spots: How Misalignment Leads to LLMs Mathematical Errors

Authors: Dane Williamson, Yangfeng Ji, Matthew Dwyer
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.01831
Pdf URL: https://arxiv.org/pdf/2510.01831
Copy Paste: [[2510.01831]] Syntactic Blind Spots: How Misalignment Leads to LLMs Mathematical Errors(https://arxiv.org/abs/2510.01831)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) demonstrate strong mathematical problem-solving abilities but frequently fail on problems that deviate syntactically from their training distribution. We identify a systematic failure mode, syntactic blind spots, in which models misapply familiar reasoning strategies to problems that are semantically straightforward but phrased in unfamiliar ways. These errors are not due to gaps in mathematical competence, but rather reflect a brittle coupling between surface form and internal representation. To test this, we rephrase incorrectly answered questions using syntactic templates drawn from correct examples. These rephrasings, which preserve semantics while reducing structural complexity, often lead to correct answers. We quantify syntactic complexity using a metric based on Dependency Locality Theory (DLT), and show that higher DLT scores are associated with increased failure rates across multiple datasets. Our findings suggest that many reasoning errors stem from structural misalignment rather than conceptual difficulty, and that syntax-aware interventions can reveal and mitigate these inductive failures.
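Summary: Large language models (LLMs) show strong mathematical problem-solving ability but often fail on problems that deviate syntactically from their training distribution. We identify a systematic failure mode, syntactic blind spots, in which models misapply familiar reasoning strategies to problems that are semantically simple but phrased in unfamiliar ways. These errors do not stem from gaps in mathematical competence; they reflect a brittle coupling between surface form and internal representation. To test this, we rephrase incorrectly answered questions using syntactic templates drawn from correctly answered examples. The rephrasings preserve semantics while reducing structural complexity, and they often lead to correct answers. We quantify syntactic complexity with a metric based on Dependency Locality Theory (DLT) and show that higher DLT scores are associated with higher failure rates across multiple datasets. Our findings suggest that many reasoning errors arise from structural misalignment rather than conceptual difficulty, and that syntax-aware interventions can reveal and mitigate these inductive failures.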
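Example (illustrative sketch, not the paper's metric): the abstract does not define its DLT-based score, so the Python below assumes a much simpler proxy, the total linear distance between each word and its syntactic head. Dependency Locality Theory proper counts intervening discourse referents rather than raw token distance, and the two parses are hand-written, hypothetical inputs.

```python
def total_dependency_length(heads):
    """Sum of |dependent position - head position| over all non-root tokens.

    heads[i] is the 1-based position of the head of token i + 1;
    0 marks the root, which contributes nothing.
    """
    return sum(abs((i + 1) - h) for i, h in enumerate(heads) if h != 0)


# Hypothetical parse of "The dog chased the cat" (heads assigned by hand):
# The -> dog, dog -> chased, chased = root, the -> cat, cat -> chased
simple = [2, 3, 0, 5, 3]

# Hypothetical six-token parse in which token 2 attaches to a head
# four positions away, which lengthens the dependencies.
distant = [2, 6, 2, 5, 3, 0]

simple_score = total_dependency_length(simple)
distant_score = total_dependency_length(distant)
```

Under this proxy the second parse scores higher than the first, which is the kind of ordering a locality-based complexity measure is meant to capture.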

Title: SCRIBES: Web-Scale Script-Based Semi-Structured Data Extraction with Reinforcement Learning

Authors: Shicheng Liu, Kai Sun, Lisheng Fu, Xilun Chen, Xinyuan Zhang, Zhaojiang Lin, Rulin Shao, Yue Liu, Anuj Kumar, Wen-tau Yih, Xin Luna Dong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.01832
Pdf URL: https://arxiv.org/pdf/2510.01832
Copy Paste: [[2510.01832]] SCRIBES: Web-Scale Script-Based Semi-Structured Data Extraction with Reinforcement Learning(https://arxiv.org/abs/2510.01832)
Keywords: gpt, llm
Abstract: Semi-structured content in HTML tables, lists, and infoboxes accounts for a substantial share of factual data on the web, yet the formatting complicates usage, and reliably extracting structured information from them remains challenging. Existing methods either lack generalization or are resource-intensive due to per-page LLM inference. In this paper, we introduce SCRIBES (SCRIpt-Based Semi-Structured Content Extraction at Web-Scale), a novel reinforcement learning framework that leverages layout similarity across webpages within the same site as a reward signal. Instead of processing each page individually, SCRIBES generates reusable extraction scripts that can be applied to groups of structurally similar webpages. Our approach further improves by iteratively training on synthetic annotations from in-the-wild CommonCrawl data. Experiments show that our approach outperforms strong baselines by over 13% in script quality and boosts downstream question answering accuracy by more than 4% for GPT-4o, enabling scalable and resource-efficient web information extraction.
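Summary: Semi-structured content in HTML tables, lists, and infoboxes accounts for a substantial share of the factual data on the web, but its formatting complicates use, and reliably extracting structured information from it remains challenging. Existing methods either generalize poorly or are resource-intensive because they require per-page LLM inference. We introduce SCRIBES (SCRIpt-Based Semi-Structured Content Extraction at Web-Scale), a reinforcement learning framework that uses layout similarity across webpages within the same site as a reward signal. Rather than processing each page individually, SCRIBES generates reusable extraction scripts that can be applied to groups of structurally similar webpages. The approach improves further through iterative training on synthetic annotations from in-the-wild CommonCrawl data. Experiments show that it outperforms strong baselines by over 13% in script quality and raises downstream question answering accuracy by more than 4% for GPT-4o, enabling scalable and resource-efficient web information extraction.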

Title: Model Merging to Maintain Language-Only Performance in Developmentally Plausible Multimodal Models

Authors: Ece Takmaz, Lisa Bylinina, Jakub Dotlacil
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2510.01845
Pdf URL: https://arxiv.org/pdf/2510.01845
Copy Paste: [[2510.01845]] Model Merging to Maintain Language-Only Performance in Developmentally Plausible Multimodal Models(https://arxiv.org/abs/2510.01845)
Keywords: language model
Abstract: State-of-the-art vision-and-language models consist of many parameters and learn from enormous datasets, surpassing the amounts of linguistic data that children are exposed to as they acquire a language. This paper presents our approach to the multimodal track of the BabyLM challenge addressing this discrepancy. We develop language-only and multimodal models in low-resource settings using developmentally plausible datasets, with our multimodal models outperforming previous BabyLM baselines. One finding in the multimodal language model literature is that these models tend to underperform in language-only tasks. Therefore, we focus on maintaining language-only abilities in multimodal models. To this end, we experiment with model merging, where we fuse the parameters of multimodal models with those of language-only models using weighted linear interpolation. Our results corroborate the findings that multimodal models underperform in language-only benchmarks that focus on grammar, and model merging with text-only models can help alleviate this problem to some extent, while maintaining multimodal performance.
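Summary: State-of-the-art vision-and-language models have many parameters and learn from enormous datasets, far exceeding the amount of linguistic data children are exposed to while acquiring a language. This paper presents our approach to the multimodal track of the BabyLM challenge, which addresses this discrepancy. We develop language-only and multimodal models in low-resource settings using developmentally plausible datasets, and our multimodal models outperform previous BabyLM baselines. One finding in the multimodal language model literature is that these models tend to underperform on language-only tasks, so we focus on maintaining language-only abilities in multimodal models. To this end, we experiment with model merging, fusing the parameters of multimodal models with those of language-only models through weighted linear interpolation. Our results corroborate the finding that multimodal models underperform on grammar-focused language-only benchmarks, and show that merging with text-only models can alleviate this problem to some extent while maintaining multimodal performance.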
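Example (illustrative sketch, not the authors' code): weighted linear interpolation of two models' parameters can be written in a few lines. The dictionary layout, the parameter names, and the choice of alpha below are assumptions for illustration; real checkpoints would hold tensors rather than Python lists.

```python
def merge_parameters(language_only, multimodal, alpha):
    """Return alpha * language_only + (1 - alpha) * multimodal per parameter.

    Both arguments map parameter names to equal-length lists of floats.
    """
    if language_only.keys() != multimodal.keys():
        raise ValueError("models must share the same parameter names")
    return {
        name: [alpha * a + (1 - alpha) * b
               for a, b in zip(language_only[name], multimodal[name])]
        for name in language_only
    }


# Hypothetical toy checkpoints with identical shapes.
text_model = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
vl_model = {"layer.weight": [3.0, 4.0], "layer.bias": [1.0]}

merged = merge_parameters(text_model, vl_model, alpha=0.5)
```

With alpha = 1 the result is the language-only model and with alpha = 0 the multimodal one, so alpha controls how much language-only behavior is traded against multimodal behavior.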

Title: REPAIR: Robust Editing via Progressive Adaptive Intervention and Reintegration

Authors: Yisu Wang, Ming Wang, Haoyuan Song, Wenjie Huang, Chaozheng Wang, Yi Xie, Xuming Ran
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.01879
Pdf URL: https://arxiv.org/pdf/2510.01879
Copy Paste: [[2510.01879]] REPAIR: Robust Editing via Progressive Adaptive Intervention and Reintegration(https://arxiv.org/abs/2510.01879)
Keywords: language model, llm
Abstract: Post-training for large language models (LLMs) is constrained by the high cost of acquiring new knowledge or correcting errors and by the unintended side effects that frequently arise from retraining. To address these issues, we introduce REPAIR (Robust Editing via Progressive Adaptive Intervention and Reintegration), a lifelong editing framework designed to support precise and low-cost model updates while preserving non-target knowledge. REPAIR mitigates the instability and conflicts of large-scale sequential edits through a closed-loop feedback mechanism coupled with dynamic memory management. Furthermore, by incorporating frequent knowledge fusion and enforcing strong locality guards, REPAIR effectively addresses the shortcomings of traditional distribution-agnostic approaches that often overlook unintended ripple effects. Our experiments demonstrate that REPAIR boosts editing accuracy by 10%-30% across multiple model families and significantly reduces knowledge forgetting. This work introduces a robust framework for developing reliable, scalable, and continually evolving LLMs.
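Summary: Post-training for large language models (LLMs) is constrained by the high cost of acquiring new knowledge or correcting errors, and by the unintended side effects that retraining often causes. To address these issues, we introduce REPAIR (Robust Editing via Progressive Adaptive Intervention and Reintegration), a lifelong editing framework designed to support precise, low-cost model updates while preserving non-target knowledge. REPAIR mitigates the instability and conflicts of large-scale sequential edits through a closed-loop feedback mechanism coupled with dynamic memory management. By incorporating frequent knowledge fusion and enforcing strong locality guards, it also addresses the shortcomings of traditional distribution-agnostic approaches, which often overlook unintended ripple effects. Our experiments show that REPAIR raises editing accuracy by 10%-30% across multiple model families and significantly reduces knowledge forgetting. This work introduces a robust framework for developing reliable, scalable, and continually evolving LLMs.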

Title: Enhancing Large Language Model Reasoning with Reward Models: An Analytical Survey

Authors: Qiyuan Liu, Hao Xu, Xuhong Chen, Wei Chen, Yee Whye Teh, Ning Miao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.01925
Pdf URL: https://arxiv.org/pdf/2510.01925
Copy Paste: [[2510.01925]] Enhancing Large Language Model Reasoning with Reward Models: An Analytical Survey(https://arxiv.org/abs/2510.01925)
Keywords: language model, llm
Abstract: Reward models (RMs) play a critical role in enhancing the reasoning performance of LLMs. For example, they can provide training signals to finetune LLMs during reinforcement learning (RL) and help select the best answer from multiple candidates during inference. In this paper, we provide a systematic introduction to RMs, along with a comprehensive survey of their applications in LLM reasoning. We first review fundamental concepts of RMs, including their architectures, training methodologies, and evaluation techniques. Then, we explore their key applications: (1) guiding generation and selecting optimal outputs during LLM inference, (2) facilitating data synthesis and iterative self-improvement for LLMs, and (3) providing training signals in RL-based finetuning. Finally, we address critical open questions regarding the selection, generalization, evaluation, and enhancement of RMs, based on existing research and our own empirical findings. Our analysis aims to provide actionable insights for the effective deployment and advancement of RMs for LLM reasoning.
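Summary: Reward models (RMs) play a critical role in enhancing the reasoning performance of LLMs. For example, they can provide training signals for fine-tuning LLMs during reinforcement learning (RL) and help select the best answer from multiple candidates at inference time. This paper gives a systematic introduction to RMs and a comprehensive survey of their applications in LLM reasoning. We first review the fundamental concepts of RMs, including their architectures, training methodologies, and evaluation techniques. We then explore their key applications: (1) guiding generation and selecting optimal outputs during LLM inference, (2) facilitating data synthesis and iterative self-improvement for LLMs, and (3) providing training signals in RL-based fine-tuning. Finally, we address critical open questions about the selection, generalization, evaluation, and enhancement of RMs, drawing on existing research and our own empirical findings. Our analysis aims to provide actionable insights for the effective deployment and advancement of RMs for LLM reasoning.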
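Example (illustrative sketch, not taken from the survey): selecting the best answer from multiple candidates at inference time is commonly done as best-of-N selection, where every candidate is scored and the top one is kept. The `toy_reward` function is a hypothetical stand-in for a learned reward model, and the candidate strings are made up.

```python
def best_of_n(candidates, reward_model):
    """Return the candidate that the reward model scores highest."""
    return max(candidates, key=reward_model)


def toy_reward(answer):
    # Hypothetical scorer: favors answers containing "42" and applies a
    # small length penalty. A real reward model would be a trained network.
    return (10.0 if "42" in answer else 0.0) - 0.01 * len(answer)


candidates = [
    "The answer is 41.",
    "The answer is 42.",
    "I think it might be 42, but I am unsure.",
]
best = best_of_n(candidates, toy_reward)
```

The selection step never changes the generator; it only filters its samples, which is why the quality of the reward model bounds the quality of the chosen answer.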

Title: Inverse Language Modeling towards Robust and Grounded LLMs

Authors: Davide Gabrielli, Simone Sestito, Iacopo Masi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.01929
Pdf URL: https://arxiv.org/pdf/2510.01929
Copy Paste: [[2510.01929]] Inverse Language Modeling towards Robust and Grounded LLMs(https://arxiv.org/abs/2510.01929)
Keywords: language model, llm
Abstract: The current landscape of defensive mechanisms for LLMs is fragmented and underdeveloped, unlike prior work on classifiers. To further promote adversarial robustness in LLMs, we propose Inverse Language Modeling (ILM), a unified framework that simultaneously 1) improves the robustness of LLMs to input perturbations, and, at the same time, 2) enables native grounding by inverting model outputs to identify potentially toxic or unsafe input triggers. ILM transforms LLMs from static generators into analyzable and robust systems, potentially helping RED teaming. ILM can lay the foundation for next-generation LLMs that are not only robust and grounded but also fundamentally more controllable and trustworthy. The code is publicly available at this http URL.
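Summary: Unlike prior work on classifiers, the current landscape of defensive mechanisms for LLMs is fragmented and underdeveloped. To further promote adversarial robustness in LLMs, we propose Inverse Language Modeling (ILM), a unified framework that simultaneously (1) improves the robustness of LLMs to input perturbations and (2) enables native grounding by inverting model outputs to identify potentially toxic or unsafe input triggers. ILM transforms LLMs from static generators into analyzable and robust systems, which may help red teaming. ILM can lay the foundation for next-generation LLMs that are not only robust and grounded but also fundamentally more controllable and trustworthy. The code is publicly available at this http URL.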

Title: Veri-R1: Toward Precise and Faithful Claim Verification via Online Reinforcement Learning

Authors: Qi He, Cheng Qian, Xiusi Chen, Bingxiang He, Yi R. (May)Fung, Heng Ji
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.01932
Pdf URL: https://arxiv.org/pdf/2510.01932
Copy Paste: [[2510.01932]] Veri-R1: Toward Precise and Faithful Claim Verification via Online Reinforcement Learning(https://arxiv.org/abs/2510.01932)
Keywords: language model, llm, prompt
Abstract: Claim verification with large language models (LLMs) has recently attracted considerable attention, owing to their superior reasoning capabilities and transparent verification pathways compared to traditional answer-only judgments. Online claim verification requires iterative evidence retrieval and reasoning, yet existing approaches mainly rely on prompt engineering or predesigned reasoning workflows without offering a unified training paradigm to improve necessary skills. Therefore, we introduce Veri-R1, an online reinforcement learning (RL) framework that enables an LLM to interact with a search engine and to receive reward signals that explicitly shape its planning, retrieval, and reasoning behaviors. The dynamic interaction between models and retrieval systems more accurately reflects real-world verification scenarios and fosters comprehensive verification skills. Empirical results show that Veri-R1 improves joint accuracy by up to 30% and doubles evidence score, often surpassing larger-scale counterparts. Ablation studies further reveal the impact of reward components and the link between output logits and label accuracy. Our results highlight the effectiveness of online RL for precise and faithful claim verification and provide a foundation for future research. We release our code to support community progress in LLM empowered claim verification.
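Summary: Claim verification with large language models (LLMs) has recently attracted considerable attention because, compared with traditional answer-only judgments, LLMs offer stronger reasoning and transparent verification pathways. Online claim verification requires iterative evidence retrieval and reasoning, yet existing approaches rely mainly on prompt engineering or predesigned reasoning workflows and offer no unified training paradigm for improving the necessary skills. We therefore introduce Veri-R1, an online reinforcement learning (RL) framework that lets an LLM interact with a search engine and receive reward signals that explicitly shape its planning, retrieval, and reasoning behaviors. The dynamic interaction between models and retrieval systems reflects real-world verification scenarios more accurately and fosters comprehensive verification skills. Empirical results show that Veri-R1 improves joint accuracy by up to 30% and doubles the evidence score, often surpassing larger-scale counterparts. Ablation studies further reveal the impact of the reward components and the link between output logits and label accuracy. Our results highlight the effectiveness of online RL for precise and faithful claim verification and provide a foundation for future research. We release our code to support community progress in LLM-empowered claim verification.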

Title: Taking a SEAT: Predicting Value Interpretations from Sentiment, Emotion, Argument, and Topic Annotations

Authors: Adina Nicola Dobrinoiu, Ana Cristiana Marcu, Amir Homayounirad, Luciano Cavalcante Siebert, Enrico Liscio
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.01976
Pdf URL: https://arxiv.org/pdf/2510.01976
Copy Paste: [[2510.01976]] Taking a SEAT: Predicting Value Interpretations from Sentiment, Emotion, Argument, and Topic Annotations(https://arxiv.org/abs/2510.01976)
Keywords: language model
Abstract: Our interpretation of value concepts is shaped by our sociocultural background and lived experiences, and is thus subjective. Recognizing individual value interpretations is important for developing AI systems that can align with diverse human perspectives and avoid bias toward majority viewpoints. To this end, we investigate whether a language model can predict individual value interpretations by leveraging multi-dimensional subjective annotations as a proxy for their interpretive lens. That is, we evaluate whether providing examples of how an individual annotates Sentiment, Emotion, Argument, and Topics (SEAT dimensions) helps a language model in predicting their value interpretations. Our experiment across different zero- and few-shot settings demonstrates that providing all SEAT dimensions simultaneously yields superior performance compared to individual dimensions and a baseline where no information about the individual is provided. Furthermore, individual variations across annotators highlight the importance of accounting for the incorporation of individual subjective annotators. To the best of our knowledge, this controlled setting, although small in size, is the first attempt to go beyond demographics and investigate the impact of annotation behavior on value prediction, providing a solid foundation for future large-scale validation.
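Summary: Our interpretation of value concepts is shaped by our sociocultural background and lived experiences, and is therefore subjective. Recognizing individual value interpretations is important for developing AI systems that can align with diverse human perspectives and avoid bias toward majority viewpoints. To this end, we investigate whether a language model can predict an individual's value interpretations by using multi-dimensional subjective annotations as a proxy for that individual's interpretive lens. Specifically, we evaluate whether providing examples of how an individual annotates Sentiment, Emotion, Argument, and Topics (the SEAT dimensions) helps a language model predict their value interpretations. Our experiments across different zero- and few-shot settings show that providing all SEAT dimensions simultaneously yields better performance than providing individual dimensions or a baseline with no information about the individual. Variation across annotators further highlights the importance of accounting for individual subjective annotators. To the best of our knowledge, this controlled setting, although small, is the first attempt to go beyond demographics and investigate the impact of annotation behavior on value prediction, providing a solid foundation for future large-scale validation.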

Title: Exploring Database Normalization Effects on SQL Generation

Authors: Ryosuke Kohita
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.01989
Pdf URL: https://arxiv.org/pdf/2510.01989
Copy Paste: [[2510.01989]] Exploring Database Normalization Effects on SQL Generation(https://arxiv.org/abs/2510.01989)
Keywords: language model
Abstract: Schema design, particularly normalization, is a critical yet often overlooked factor in natural language to SQL (NL2SQL) systems. Most prior research evaluates models on fixed schemas, overlooking the influence of design on performance. We present the first systematic study of schema normalization's impact, evaluating eight leading large language models on synthetic and real-world datasets with varied normalization levels. We construct controlled synthetic datasets with formal normalization (1NF-3NF) and real academic paper datasets with practical schemes. Our results show that denormalized schemas offer high accuracy on simple retrieval queries, even with cost-effective models in zero-shot settings. In contrast, normalized schemas (2NF/3NF) introduce challenges such as errors in base table selection and join type prediction; however, these issues are substantially mitigated by providing few-shot examples. For aggregation queries, normalized schemas yielded better performance, mainly due to their robustness against the data duplication and NULL value issues that cause errors in denormalized schemas. These findings suggest that the optimal schema design for NL2SQL applications depends on the types of queries to be supported. Our study demonstrates the importance of considering schema design when developing NL2SQL interfaces and integrating adaptive schema selection for real-world scenarios.
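Summary: Schema design, particularly normalization, is a critical yet often overlooked factor in natural language to SQL (NL2SQL) systems. Most prior research evaluates models on fixed schemas and overlooks the influence of schema design on performance. We present the first systematic study of the impact of schema normalization, evaluating eight leading large language models on synthetic and real-world datasets with varied normalization levels. We construct controlled synthetic datasets with formal normalization (1NF-3NF) and real academic paper datasets with practical schemas. Our results show that denormalized schemas give high accuracy on simple retrieval queries, even with cost-effective models in zero-shot settings. Normalized schemas (2NF/3NF), in contrast, introduce challenges such as errors in base table selection and join type prediction, although few-shot examples substantially mitigate these issues. For aggregation queries, normalized schemas perform better, mainly because they are robust to the data duplication and NULL value issues that cause errors in denormalized schemas. These findings suggest that the optimal schema design for NL2SQL applications depends on the types of queries to be supported. Our study demonstrates the importance of considering schema design when developing NL2SQL interfaces and of integrating adaptive schema selection for real-world scenarios.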

Title: LLM-Based Multi-Task Bangla Hate Speech Detection: Type, Severity, and Target

Authors: Md Arid Hasan, Firoj Alam, Md Fahad Hossain, Usman Naseem, Syed Ishtiaque Ahmed
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.01995
Pdf URL: https://arxiv.org/pdf/2510.01995
Copy Paste: [[2510.01995]] LLM-Based Multi-Task Bangla Hate Speech Detection: Type, Severity, and Target(https://arxiv.org/abs/2510.01995)
Keywords: llm, prompt
Abstract: Online social media platforms are central to everyday communication and information seeking. While these platforms serve positive purposes, they also provide fertile ground for the spread of hate speech, offensive language, and bullying content targeting individuals, organizations, and communities. Such content undermines safety, participation, and equity online. Reliable detection systems are therefore needed, especially for low-resource languages where moderation tools are limited. In Bangla, prior work has contributed resources and models, but most are single-task (e.g., binary hate/offense) with limited coverage of multi-facet signals (type, severity, target). We address these gaps by introducing the first multi-task Bangla hate-speech dataset, BanglaMultiHate, one of the largest manually annotated corpora to date. Building on this resource, we conduct a comprehensive, controlled comparison spanning classical baselines, monolingual pretrained models, and LLMs under zero-shot prompting and LoRA fine-tuning. Our experiments assess LLM adaptability in a low-resource setting and reveal a consistent trend: although LoRA-tuned LLMs are competitive with BanglaBERT, culturally and linguistically grounded pretraining remains critical for robust performance. Together, our dataset and findings establish a stronger benchmark for developing culturally aligned moderation tools in low-resource contexts. For reproducibility, we will release the dataset and all related scripts.
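Summary: Online social media platforms are central to everyday communication and information seeking. Although they serve positive purposes, they also provide fertile ground for the spread of hate speech, offensive language, and bullying content targeting individuals, organizations, and communities. Such content undermines safety, participation, and equity online. Reliable detection systems are therefore needed, especially for low-resource languages where moderation tools are limited. For Bangla, prior work has contributed resources and models, but most are single-task (e.g., binary hate/offense) with limited coverage of multi-facet signals (type, severity, target). We address these gaps by introducing the first multi-task Bangla hate-speech dataset, BanglaMultiHate, one of the largest manually annotated corpora to date. Building on this resource, we conduct a comprehensive, controlled comparison spanning classical baselines, monolingual pretrained models, and LLMs under zero-shot prompting and LoRA fine-tuning. Our experiments assess LLM adaptability in a low-resource setting and reveal a consistent trend: although LoRA-tuned LLMs are competitive with BanglaBERT, culturally and linguistically grounded pretraining remains critical for robust performance. Together, our dataset and findings establish a stronger benchmark for developing culturally aligned moderation tools in low-resource contexts. For reproducibility, we will release the dataset and all related scripts.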

Title: Style Over Story: A Process-Oriented Study of Authorial Creativity in Large Language Models

Authors: Donghoon Jung, Jiwoo Choi, Songeun Chae, Seohyon Jung
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.02025
Pdf URL: https://arxiv.org/pdf/2510.02025
Copy Paste: [[2510.02025]] Style Over Story: A Process-Oriented Study of Authorial Creativity in Large Language Models(https://arxiv.org/abs/2510.02025)
Keywords: language model, llm, prompt
Abstract: Evaluations of large language models (LLMs)' creativity have focused primarily on the quality of their outputs rather than the processes that shape them. This study takes a process-oriented approach, drawing on narratology to examine LLMs as computational authors. We introduce constraint-based decision-making as a lens for authorial creativity. Using controlled prompting to assign authorial personas, we analyze the creative preferences of the models. Our findings show that LLMs consistently emphasize Style over other elements, including Character, Event, and Setting. By also probing the reasoning the models provide for their choices, we show that distinctive profiles emerge across models and argue that our approach provides a novel systematic tool for analyzing AI's authorial creativity.
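Summary: Evaluations of the creativity of large language models (LLMs) have focused primarily on the quality of their outputs rather than on the processes that shape them. This study takes a process-oriented approach, drawing on narratology to examine LLMs as computational authors. We introduce constraint-based decision-making as a lens for authorial creativity. Using controlled prompting to assign authorial personas, we analyze the creative preferences of the models. Our findings show that LLMs consistently emphasize Style over other elements, including Character, Event, and Setting. By also probing the reasoning the models give for their choices, we show that distinctive profiles emerge across models, and we argue that our approach provides a novel systematic tool for analyzing AI's authorial creativity.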

Title: Stream RAG: Instant and Accurate Spoken Dialogue Systems with Streaming Tool Usage

Authors: Siddhant Arora, Haidar Khan, Kai Sun, Xin Luna Dong, Sajal Choudhary, Seungwhan Moon, Xinyuan Zhang, Adithya Sagar, Surya Teja Appini, Kaushik Patnaik, Sanat Sharma, Shinji Watanabe, Anuj Kumar, Ahmed Aly, Yue Liu, Florian Metze, Zhaojiang Lin
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2510.02044
Pdf URL: https://arxiv.org/pdf/2510.02044
Copy Paste: [[2510.02044]] Stream RAG: Instant and Accurate Spoken Dialogue Systems with Streaming Tool Usage(https://arxiv.org/abs/2510.02044)
Keywords: llm, hallucination, retrieval-augmented generation, agent
Abstract: End-to-end speech-in speech-out dialogue systems are emerging as a powerful alternative to traditional ASR-LLM-TTS pipelines, generating more natural, expressive responses with significantly lower latency. However, these systems remain prone to hallucinations due to limited factual grounding. While text-based dialogue systems address this challenge by integrating tools such as web search and knowledge graph APIs, we introduce the first approach to extend tool use directly into speech-in speech-out systems. A key challenge is that tool integration substantially increases response latency, disrupting conversational flow. To mitigate this, we propose Streaming Retrieval-Augmented Generation (Streaming RAG), a novel framework that reduces user-perceived latency by predicting tool queries in parallel with user speech, even before the user finishes speaking. Specifically, we develop a post-training pipeline that teaches the model when to issue tool calls during ongoing speech and how to generate spoken summaries that fuse audio queries with retrieved text results, thereby improving both accuracy and responsiveness. To evaluate our approach, we construct AudioCRAG, a benchmark created by converting queries from the publicly available CRAG dataset into speech form. Experimental results demonstrate that our streaming RAG approach increases QA accuracy by up to 200% relative (from 11.1% to 34.2% absolute) and further enhances user experience by reducing tool use latency by 20%. Importantly, our streaming RAG approach is modality-agnostic and can be applied equally to typed input, paving the way for more agentic, real-time AI assistants.
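Summary: End-to-end speech-in speech-out dialogue systems are emerging as a powerful alternative to traditional ASR-LLM-TTS pipelines, producing more natural, expressive responses with significantly lower latency. However, these systems remain prone to hallucinations because of limited factual grounding. Text-based dialogue systems address this challenge by integrating tools such as web search and knowledge graph APIs; we introduce the first approach that extends tool use directly into speech-in speech-out systems. A key challenge is that tool integration substantially increases response latency and disrupts conversational flow. To mitigate this, we propose Streaming Retrieval-Augmented Generation (Streaming RAG), a framework that reduces user-perceived latency by predicting tool queries in parallel with user speech, even before the user finishes speaking. Specifically, we develop a post-training pipeline that teaches the model when to issue tool calls during ongoing speech and how to generate spoken summaries that fuse audio queries with retrieved text results, improving both accuracy and responsiveness. To evaluate the approach, we construct AudioCRAG, a benchmark created by converting queries from the publicly available CRAG dataset into speech. Experimental results show that Streaming RAG increases QA accuracy by up to 200% relative (from 11.1% to 34.2% absolute) and further improves user experience by reducing tool use latency by 20%. Importantly, the approach is modality-agnostic and applies equally to typed input, paving the way for more agentic, real-time AI assistants.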
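Example (illustrative arithmetic, not from the paper): the abstract quotes both a relative gain and the absolute accuracies behind it. The short Python check below relates the two; only the values 11.1 and 34.2 come from the abstract, and the function name is an assumption.

```python
def relative_gain(before, after):
    """Relative improvement of `after` over `before`, as a fraction."""
    return (after - before) / before


# Accuracies in percent, as quoted in the abstract.
absolute_gain = 34.2 - 11.1
gain = relative_gain(11.1, 34.2)
```

The absolute gain is about 23.1 percentage points, and the relative gain is roughly 2.08, in line with the rounded 200% relative figure in the abstract.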

Title: Chain-of-Thought Reasoning in Streaming Full-Duplex End-to-End Spoken Dialogue Systems

Authors: Siddhant Arora, Jinchuan Tian, Hayato Futami, Jiatong Shi, Yosuke Kashiwagi, Emiru Tsunoo, Shinji Watanabe
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2510.02066
Pdf URL: https://arxiv.org/pdf/2510.02066
Copy Paste: [[2510.02066]] Chain-of-Thought Reasoning in Streaming Full-Duplex End-to-End Spoken Dialogue Systems(https://arxiv.org/abs/2510.02066)
Keywords: chain-of-thought
Abstract: Most end-to-end (E2E) spoken dialogue systems (SDS) rely on voice activity detection (VAD) for turn-taking, but VAD fails to distinguish between pauses and turn completions. Duplex SDS models address this by predicting output continuously, including silence tokens, thus removing the need for explicit VAD. However, they often have complex dual-channel architecture and lag behind cascaded models in semantic reasoning. To overcome these challenges, we propose SCoT: a Streaming Chain-of-Thought (CoT) framework for Duplex SDS, alternating between processing fixed-duration user input and generating responses in a blockwise manner. Using frame-level alignments, we create intermediate targets-aligned user transcripts and system responses for each block. Experiments show that our approach produces more coherent and interpretable responses than existing duplex methods while supporting lower-latency and overlapping interactions compared to turn-by-turn systems.
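Summary: Most end-to-end (E2E) spoken dialogue systems (SDS) rely on voice activity detection (VAD) for turn-taking, but VAD cannot distinguish pauses from turn completions. Duplex SDS models address this by predicting output continuously, including silence tokens, which removes the need for explicit VAD. However, they often have a complex dual-channel architecture and lag behind cascaded models in semantic reasoning. To overcome these challenges, we propose SCoT, a Streaming Chain-of-Thought (CoT) framework for duplex SDS that alternates between processing fixed-duration user input and generating responses in a blockwise manner. Using frame-level alignments, we create intermediate targets, namely aligned user transcripts and system responses, for each block. Experiments show that our approach produces more coherent and interpretable responses than existing duplex methods while supporting lower-latency and overlapping interactions compared with turn-by-turn systems.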

Title: The Disparate Impacts of Speculative Decoding

Authors: Jameson Sandler, Ahmet Üstün, Marco Romanelli, Sara Hooker, Ferdinando Fioretto
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.02128
Pdf URL: https://arxiv.org/pdf/2510.02128
Copy Paste: [[2510.02128]] The Disparate Impacts of Speculative Decoding(https://arxiv.org/abs/2510.02128)
Keywords: language model
Abstract: The practice of speculative decoding, whereby inference is probabilistically supported by a smaller, cheaper, "drafter" model, has become a standard technique for systematically reducing the decoding time of large language models. This paper conducts an analysis of speculative decoding through the lens of its potential disparate speed-up rates across tasks. Crucially, the paper shows that speed-up gained from speculative decoding is not uniformly distributed across tasks, consistently diminishing for under-fit, and often underrepresented tasks. To better understand this phenomenon, we derive an analysis to quantify this observed "unfairness" and draw attention to the factors that motivate such disparate speed-ups to emerge. Further, guided by these insights, the paper proposes a mitigation strategy designed to reduce speed-up disparities and validates the approach across several model pairs, revealing on average a 12% improvement in our fairness metric.
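Summary: Speculative decoding, in which inference is probabilistically supported by a smaller, cheaper "drafter" model, has become a standard technique for systematically reducing the decoding time of large language models. This paper analyzes speculative decoding through the lens of its potentially disparate speed-up rates across tasks. Crucially, it shows that the speed-up from speculative decoding is not uniformly distributed across tasks and consistently diminishes for under-fit, and often underrepresented, tasks. To better understand this phenomenon, we derive an analysis that quantifies the observed "unfairness" and draws attention to the factors that give rise to such disparate speed-ups. Guided by these insights, the paper proposes a mitigation strategy designed to reduce speed-up disparities and validates it across several model pairs, showing on average a 12% improvement in our fairness metric.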
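Example (illustrative sketch, not the paper's analysis): the abstract does not spell out the decoding rule, so the Python below shows the standard speculative sampling step for a single draft token, in which the drafter's proposal is accepted with probability min(1, p/q). The distributions are made-up toy values, and the function name is an assumption.

```python
import random


def speculative_step(p, q, rng):
    """One draft-token step of standard speculative sampling.

    p: target-model probabilities, q: drafter probabilities over the
    same vocabulary. Returns (token, accepted).
    """
    vocab = range(len(q))
    draft = rng.choices(vocab, weights=q)[0]
    # Accept the drafted token with probability min(1, p / q).
    if rng.random() < min(1.0, p[draft] / q[draft]):
        return draft, True
    # Rejected: resample from the residual max(0, p - q); rng.choices
    # normalizes the weights, so no explicit division is needed.
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    return rng.choices(vocab, weights=residual)[0], False


rng = random.Random(0)
# A drafter that matches the target exactly is always accepted.
matched = [speculative_step([0.5, 0.5], [0.5, 0.5], rng)[1] for _ in range(100)]
# A drafter that puts all its mass on the wrong token is always rejected.
mismatched = speculative_step([1.0, 0.0], [0.0, 1.0], random.Random(0))
```

Under this rule the expected acceptance probability is the sum over tokens of min(p, q), so a drafter that fits a task poorly is accepted less often and yields less speed-up on that task.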

Title: RESTRAIN: From Spurious Votes to Signals -- Self-Driven RL with Self-Penalization

Authors: Zhaoning Yu, Will Su, Leitian Tao, Haozhu Wang, Aashu Singh, Hanchao Yu, Jianyu Wang, Hongyang Gao, Weizhe Yuan, Jason Weston, Ping Yu, Jing Xu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.02172
Pdf URL: https://arxiv.org/pdf/2510.02172
Copy Paste: [[2510.02172]] RESTRAIN: From Spurious Votes to Signals -- Self-Driven RL with Self-Penalization(https://arxiv.org/abs/2510.02172)
Keywords: chain-of-thought
Abstract: Reinforcement learning with human-annotated data has boosted chain-of-thought reasoning in large reasoning models, but these gains come at high costs in labeled data while faltering on harder tasks. A natural next step is experience-driven learning, where models improve without curated labels by adapting to unlabeled data. We introduce RESTRAIN (REinforcement learning with Self-restraint), a self-penalizing RL framework that converts the absence of gold labels into a useful learning signal. Instead of overcommitting to spurious majority votes, RESTRAIN exploits signals from the model's entire answer distribution: penalizing overconfident rollouts and low-consistency examples while preserving promising reasoning chains. The self-penalization mechanism integrates seamlessly into policy optimization methods such as GRPO, enabling continual self-improvement without supervision. On challenging reasoning benchmarks, RESTRAIN delivers large gains using only unlabeled data. With Qwen3-4B-Base and OctoThinker Hybrid-8B-Base, it improves Pass@1 by up to +140.7 percent on AIME25, +36.2 percent on MMLU_STEM, and +19.6 percent on GPQA-Diamond, nearly matching gold-label training while using no gold labels. These results demonstrate that RESTRAIN establishes a scalable path toward stronger reasoning without gold labels.
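Summary: Reinforcement learning with human-annotated data has boosted chain-of-thought reasoning in large reasoning models, but these gains come at a high cost in labeled data and falter on harder tasks. A natural next step is experience-driven learning, in which models improve without curated labels by adapting to unlabeled data. We introduce RESTRAIN (REinforcement learning with Self-restraint), a self-penalizing RL framework that converts the absence of gold labels into a useful learning signal. Instead of overcommitting to spurious majority votes, RESTRAIN exploits signals from the model's entire answer distribution, penalizing overconfident rollouts and low-consistency examples while preserving promising reasoning chains. The self-penalization mechanism integrates seamlessly into policy optimization methods such as GRPO, enabling continual self-improvement without supervision. On challenging reasoning benchmarks, RESTRAIN delivers large gains using only unlabeled data. With Qwen3-4B-Base and OctoThinker Hybrid-8B-Base, it improves Pass@1 by up to +140.7 percent on AIME25, +36.2 percent on MMLU_STEM, and +19.6 percent on GPQA-Diamond, nearly matching gold-label training while using no gold labels. These results show that RESTRAIN establishes a scalable path toward stronger reasoning without gold labels.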

Title: Learning to Reason for Hallucination Span Detection

Authors: Hsuan Su, Ting-Yao Hu, Hema Swetha Koppula, Kundan Krishna, Hadi Pouransari, Cheng-Yu Hsieh, Cem Koc, Joseph Yitan Cheng, Oncel Tuzel, Raviteja Vemulapalli
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.02173
Pdf URL: https://arxiv.org/pdf/2510.02173
Copy Paste: [[2510.02173]] Learning to Reason for Hallucination Span Detection(https://arxiv.org/abs/2510.02173)
Keywords: language model, llm, hallucination, chain-of-thought
Abstract: Large language models (LLMs) often generate hallucinations -- unsupported content that undermines reliability. While most prior works frame hallucination detection as a binary task, many real-world applications require identifying hallucinated spans, which is a multi-step decision making process. This naturally raises the question of whether explicit reasoning can help the complex task of detecting hallucination spans. To answer this question, we first evaluate pretrained models with and without Chain-of-Thought (CoT) reasoning, and show that CoT reasoning has the potential to generate at least one correct answer when sampled multiple times. Motivated by this, we propose RL4HS, a reinforcement learning framework that incentivizes reasoning with a span-level reward function. RL4HS builds on Group Relative Policy Optimization and introduces Class-Aware Policy Optimization to mitigate the reward imbalance issue. Experiments on the RAGTruth benchmark (summarization, question answering, data-to-text) show that RL4HS surpasses pretrained reasoning models and supervised fine-tuning, demonstrating the necessity of reinforcement learning with span-level rewards for detecting hallucination spans.
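Summary: Large language models (LLMs) often generate hallucinations, that is, unsupported content that undermines reliability. Although most prior work frames hallucination detection as a binary task, many real-world applications require identifying hallucinated spans, which is a multi-step decision-making process. This raises the question of whether explicit reasoning can help with the complex task of detecting hallucination spans. To answer it, we first evaluate pretrained models with and without Chain-of-Thought (CoT) reasoning and show that CoT reasoning can produce at least one correct answer when sampled multiple times. Motivated by this, we propose RL4HS, a reinforcement learning framework that incentivizes reasoning with a span-level reward function. RL4HS builds on Group Relative Policy Optimization and introduces Class-Aware Policy Optimization to mitigate the reward imbalance issue. Experiments on the RAGTruth benchmark (summarization, question answering, data-to-text) show that RL4HS surpasses pretrained reasoning models and supervised fine-tuning, demonstrating the necessity of reinforcement learning with span-level rewards for detecting hallucination spans.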
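Example (illustrative sketch, not the RL4HS reward): the abstract does not define its span-level reward function. One plausible choice, assumed here purely for illustration, is a character-level F1 score between predicted and gold hallucination spans. The span format (start, end) with an exclusive end is also an assumption.

```python
def span_f1(predicted, gold):
    """Character-level F1 between two lists of (start, end) spans.

    End offsets are exclusive. Two empty span lists count as a perfect
    match, since predicting no hallucination is then correct.
    """
    pred_chars = {i for start, end in predicted for i in range(start, end)}
    gold_chars = {i for start, end in gold for i in range(start, end)}
    if not pred_chars and not gold_chars:
        return 1.0
    overlap = len(pred_chars & gold_chars)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_chars)
    recall = overlap / len(gold_chars)
    return 2 * precision * recall / (precision + recall)


# Hypothetical prediction that covers half of the gold span.
partial = span_f1([(0, 10)], [(5, 15)])
```

A graded score of this kind rewards partially correct spans, whereas a binary label would treat the same prediction as simply right or wrong.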

Title: ARUQULA -- An LLM based Text2SPARQL Approach using ReAct and Knowledge Graph Exploration Utilities

Authors: Felix Brei, Lorenz Bühmann, Johannes Frey, Daniel Gerber, Lars-Peter Meyer, Claus Stadler, Kirill Bulert
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.02200
Pdf URL: https://arxiv.org/pdf/2510.02200
Copy Paste: [[2510.02200]] ARUQULA -- An LLM based Text2SPARQL Approach using ReAct and Knowledge Graph Exploration Utilities(https://arxiv.org/abs/2510.02200)
Keywords: language model, llm, agent
Abstract: Interacting with knowledge graphs can be a daunting task for people without a background in computer science since the query language that is used (SPARQL) has a high barrier of entry. Large language models (LLMs) can lower that barrier by providing support in the form of Text2SPARQL translation. In this paper we introduce a generalized method based on SPINACH, an LLM backed agent that translates natural language questions to SPARQL queries not in a single shot, but as an iterative process of exploration and execution. We describe the overall architecture and reasoning behind our design decisions, and also conduct a thorough analysis of the agent behavior to gain insights into future areas for targeted improvements. This work was motivated by the Text2SPARQL challenge, a challenge that was held to facilitate improvements in the Text2SPARQL domain.

Title: Say One Thing, Do Another? Diagnosing Reasoning-Execution Gaps in VLM-Powered Mobile-Use Agents

Authors: Lingzhong Dong, Ziqi Zhou, Shuaibo Yang, Haiyue Sheng, Pengzhou Cheng, Zongru Wu, Zheng Wu, Gongshen Liu, Zhuosheng Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.02204
Pdf URL: https://arxiv.org/pdf/2510.02204
Copy Paste: [[2510.02204]] Say One Thing, Do Another? Diagnosing Reasoning-Execution Gaps in VLM-Powered Mobile-Use Agents(https://arxiv.org/abs/2510.02204)
Keywords: language model, chain-of-thought, agent
Abstract: Mobile-use agents powered by vision-language models (VLMs) have shown great potential in interpreting natural language instructions and generating corresponding actions based on mobile graphical user interfaces. Recent studies suggest that incorporating chain-of-thought (CoT) reasoning tends to improve execution accuracy. However, existing evaluations emphasize execution accuracy while neglecting whether CoT reasoning aligns with ground-truth actions. This oversight fails to assess potential reasoning-execution gaps, which in turn foster over-trust: users relying on seemingly plausible CoTs may unknowingly authorize harmful actions, potentially resulting in financial loss or a crisis of trust. In this work, we introduce a new evaluation framework to diagnose reasoning-execution gaps. At its core lies Ground-Truth Alignment (GTA), which measures whether the action implied by a CoT matches the ground-truth action. By combining GTA with the standard Exact Match (EM) metric, we jointly assess both reasoning accuracy and execution accuracy. This joint perspective reveals two types of reasoning-execution gaps: (i) Execution Gap (EG), where the reasoning identifies the correct action but execution fails, and (ii) Reasoning Gap (RG), where execution succeeds but the reasoning process conflicts with the actual execution. Experimental results across a wide range of mobile interaction tasks reveal that reasoning-execution gaps are prevalent, with execution gaps occurring more frequently than reasoning gaps. Moreover, while scaling up model size reduces the overall gap, sizable execution gaps persist even in the largest models. Further analysis shows that our framework reliably reflects systematic EG/RG patterns in state-of-the-art models. These findings offer concrete diagnostics and support the development of more trustworthy mobile-use agents.
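Note: the EG/RG taxonomy above follows from two boolean checks per step, GTA and EM. The sketch below is a simplification that compares action strings exactly; the paper may determine GTA differently (for example, with a judge model). `Step`, `diagnose`, and the label strings are hypothetical names used only for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Step:
    cot_action: str       # action implied by the chain of thought
    executed_action: str  # action the agent actually emitted
    gold_action: str      # ground-truth action


def diagnose(step: Step) -> str:
    """Classify one step by combining GTA and EM (string equality assumed)."""
    gta = step.cot_action == step.gold_action      # reasoning matches gold
    em = step.executed_action == step.gold_action  # execution matches gold
    if gta and em:
        return "aligned"
    if gta:
        return "execution_gap"   # EG: right reasoning, wrong execution
    if em:
        return "reasoning_gap"   # RG: right execution, conflicting reasoning
    return "both_wrong"


def gap_rates(steps: Iterable[Step]) -> dict:
    """Fraction of steps falling into each category."""
    labels = [diagnose(s) for s in steps]
    counts = Counter(labels)
    return {label: counts[label] / len(labels) for label in counts}
```

For example, a step whose CoT says `tap(Send)` while the emitted action is `tap(Cancel)` and the gold action is `tap(Send)` would be counted as an execution gap.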

Title: More Than One Teacher: Adaptive Multi-Guidance Policy Optimization for Diverse Exploration

Authors: Xiaoyang Yuan, Yujuan Ding, Yi Bin, Wenqi Shao, Jinyu Cai, Jingkuan Song, Yang Yang, Hengtao Shen
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.02227
Pdf URL: https://arxiv.org/pdf/2510.02227
Copy Paste: [[2510.02227]] More Than One Teacher: Adaptive Multi-Guidance Policy Optimization for Diverse Exploration(https://arxiv.org/abs/2510.02227)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) is a promising paradigm for enhancing the reasoning ability of Large Language Models (LLMs). However, prevailing methods primarily rely on self-exploration or a single off-policy teacher to elicit long chain-of-thought (LongCoT) reasoning, which may introduce intrinsic model biases and restrict exploration, ultimately limiting reasoning diversity and performance. Drawing inspiration from multi-teacher strategies in knowledge distillation, we introduce Adaptive Multi-Guidance Policy Optimization (AMPO), a novel framework that adaptively leverages guidance from multiple proficient teacher models, but only when the on-policy model fails to generate correct solutions. This "guidance-on-demand" approach expands exploration while preserving the value of self-discovery. Moreover, AMPO incorporates a comprehension-based selection mechanism, prompting the student to learn from the reasoning paths that it is most likely to comprehend, thus balancing broad exploration with effective exploitation. Extensive experiments show AMPO substantially outperforms a strong baseline (GRPO), with a 4.3% improvement on mathematical reasoning tasks and 12.2% on out-of-distribution tasks, while significantly boosting Pass@k performance and enabling more diverse exploration. Notably, using four peer-sized teachers, our method achieves results comparable to approaches that leverage a single, more powerful teacher (e.g., DeepSeek-R1) with more data. These results demonstrate a more efficient and scalable path to superior reasoning and generalizability. Our code is available at this https URL.
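Note: the "guidance-on-demand" and comprehension-based selection ideas can be illustrated with a small sketch. Everything here is an assumption made for illustration: the function name `build_training_group`, the choice to append (rather than substitute) teacher rollouts, and the use of the student's log-probability as the comprehension score are not confirmed by the abstract.

```python
from typing import Callable, List, Sequence


def build_training_group(
    student_rollouts: Sequence[str],
    teacher_rollouts: Sequence[str],
    is_correct: Callable[[str], bool],
    student_logprob: Callable[[str], float],
    max_guidance: int = 1,
) -> List[str]:
    """Assemble the rollouts used for one policy update (illustrative only)."""
    group = list(student_rollouts)
    # Guidance on demand: if the student solved the problem itself,
    # keep pure on-policy data and ignore the teachers.
    if any(is_correct(r) for r in group):
        return group
    # Otherwise keep only verified-correct teacher solutions...
    correct = [r for r in teacher_rollouts if is_correct(r)]
    # ...and prefer the ones the student finds most likely, used here as a
    # stand-in for "most likely to comprehend".
    correct.sort(key=student_logprob, reverse=True)
    return group + correct[:max_guidance]
```

In a real training loop `is_correct` would be the verifiable reward and `student_logprob` would score a teacher response under the current policy; both are passed in as plain callables so the sketch stays self-contained.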

Title: AccurateRAG: A Framework for Building Accurate Retrieval-Augmented Question-Answering Applications

Authors: Linh The Nguyen, Chi Tran, Dung Ngoc Nguyen, Van-Cuong Pham, Hoang Ngo, Dat Quoc Nguyen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.02243
Pdf URL: https://arxiv.org/pdf/2510.02243
Copy Paste: [[2510.02243]] AccurateRAG: A Framework for Building Accurate Retrieval-Augmented Question-Answering Applications(https://arxiv.org/abs/2510.02243)
Keywords: llm, retrieval-augmented generation
Abstract: We introduce AccurateRAG -- a novel framework for constructing high-performance question-answering applications based on retrieval-augmented generation (RAG). Our framework offers a pipeline for development efficiency with tools for raw dataset processing, fine-tuning data generation, text embedding & LLM fine-tuning, output evaluation, and building RAG systems locally. Experimental results show that our framework outperforms previous strong baselines and obtains new state-of-the-art question-answering performance on benchmark datasets.

Title: Explore Briefly, Then Decide: Mitigating LLM Overthinking via Cumulative Entropy Regulation

Authors: Tianyi Jiang, Yi Bin, Yujuan Ding, Kainian Zhu, Fei Ma, Jingkuan Song, Heng Tao Shen
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.02249
Pdf URL: https://arxiv.org/pdf/2510.02249
Copy Paste: [[2510.02249]] Explore Briefly, Then Decide: Mitigating LLM Overthinking via Cumulative Entropy Regulation(https://arxiv.org/abs/2510.02249)
Keywords: language model, llm, chain-of-thought
Abstract: Large Language Models (LLMs) have demonstrated remarkable reasoning abilities on complex problems using long Chain-of-Thought (CoT) reasoning. However, they often suffer from overthinking, that is, generating unnecessarily lengthy reasoning steps for simpler problems. This issue may degrade the efficiency of the models and make it difficult for them to adapt the reasoning depth to the complexity of problems. To address this, we introduce a novel metric, Token Entropy Cumulative Average (TECA), which measures the extent of exploration throughout the reasoning process. We further propose a novel reasoning paradigm -- Explore Briefly, Then Decide -- with an associated Cumulative Entropy Regulation (CER) mechanism. This paradigm leverages TECA to help the model dynamically determine the optimal point to conclude its thought process and provide a final answer, thus achieving efficient reasoning. Experimental results across diverse mathematical benchmarks show that our approach substantially mitigates overthinking without sacrificing problem-solving ability. With our thinking paradigm, the average response length decreases by up to 71% on simpler datasets, demonstrating the effectiveness of our method in creating a more efficient and adaptive reasoning process.
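Note: the abstract names TECA but does not give its formula. Reading the name literally, a natural guess is the running mean of per-token Shannon entropies, TECA_t = (1/t) * sum of H_i for i = 1..t. The sketch below implements that guess; the function names and the exact definition are assumptions.

```python
import math
from typing import List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def teca(distributions: Sequence[Sequence[float]]) -> List[float]:
    """Running average of token entropies after each generated token."""
    running_sum = 0.0
    averages = []
    for t, probs in enumerate(distributions, start=1):
        running_sum += token_entropy(probs)
        averages.append(running_sum / t)
    return averages
```

Under this reading, a high running average indicates that the model is still exploring many alternatives, while a falling average suggests it has settled on an answer.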

Title: InfoMosaic-Bench: Evaluating Multi-Source Information Seeking in Tool-Augmented Agents

Authors: Yaxin Du, Yuanshuo Zhang, Xiyuan Yang, Yifan Zhou, Cheng Wang, Gongyi Zou, Xianghe Pang, Wenhao Wang, Menglan Chen, Shuo Tang, Zhiyu Li, Siheng Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.02271
Pdf URL: https://arxiv.org/pdf/2510.02271
Copy Paste: [[2510.02271]] InfoMosaic-Bench: Evaluating Multi-Source Information Seeking in Tool-Augmented Agents(https://arxiv.org/abs/2510.02271)
Keywords: gpt, llm, agent
Abstract: Information seeking is a fundamental requirement for humans. However, existing LLM agents rely heavily on open-web search, which exposes two fundamental weaknesses: online content is noisy and unreliable, and many real-world tasks require precise, domain-specific knowledge unavailable from the web. The emergence of the Model Context Protocol (MCP) now allows agents to interface with thousands of specialized tools, seemingly resolving this limitation. Yet it remains unclear whether agents can effectively leverage such tools -- and more importantly, whether they can integrate them with general-purpose search to solve complex tasks. Therefore, we introduce InfoMosaic-Bench, the first benchmark dedicated to multi-source information seeking in tool-augmented agents. Covering six representative domains (medicine, finance, maps, video, web, and multi-domain integration), InfoMosaic-Bench requires agents to combine general-purpose search with domain-specific tools. Tasks are synthesized with InfoMosaic-Flow, a scalable pipeline that grounds task conditions in verified tool outputs, enforces cross-source dependencies, and filters out shortcut cases solvable by trivial lookup. This design guarantees both reliability and non-triviality. Experiments with 14 state-of-the-art LLM agents reveal three findings: (i) web information alone is insufficient, with GPT-5 achieving only 38.2% accuracy and 67.5% pass rate; (ii) domain tools provide selective but inconsistent benefits, improving some domains while degrading others; and (iii) 22.4% of failures arise from incorrect tool usage or selection, highlighting that current LLMs still struggle with even basic tool handling.

Title: From Behavioral Performance to Internal Competence: Interpreting Vision-Language Models with VLM-Lens

Authors: Hala Sheta, Eric Huang, Shuyu Wu, Ilia Alenabi, Jiajun Hong, Ryker Lin, Ruoxi Ning, Daniel Wei, Jialin Yang, Jiawei Zhou, Ziqiao Ma, Freda Shi
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2510.02292
Pdf URL: https://arxiv.org/pdf/2510.02292
Copy Paste: [[2510.02292]] From Behavioral Performance to Internal Competence: Interpreting Vision-Language Models with VLM-Lens(https://arxiv.org/abs/2510.02292)
Keywords: language model
Abstract: We introduce VLM-Lens, a toolkit designed to enable systematic benchmarking, analysis, and interpretation of vision-language models (VLMs) by supporting the extraction of intermediate outputs from any layer during the forward pass of open-source VLMs. VLM-Lens provides a unified, YAML-configurable interface that abstracts away model-specific complexities and supports user-friendly operation across diverse VLMs. It currently supports 16 state-of-the-art base VLMs and their over 30 variants, and is extensible to accommodate new models without changing the core logic. The toolkit integrates easily with various interpretability and analysis methods. We demonstrate its usage with two simple analytical experiments, revealing systematic differences in the hidden representations of VLMs across layers and target concepts. VLM-Lens is released as an open-sourced project to accelerate community efforts in understanding and improving VLMs.

Title: F2LLM Technical Report: Matching SOTA Embedding Performance with 6 Million Open-Source Data

Authors: Ziyin Zhang, Zihan Liao, Hang Yu, Peng Di, Rui Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.02294
Pdf URL: https://arxiv.org/pdf/2510.02294
Copy Paste: [[2510.02294]] F2LLM Technical Report: Matching SOTA Embedding Performance with 6 Million Open-Source Data(https://arxiv.org/abs/2510.02294)
Keywords: language model, llm
Abstract: We introduce F2LLM - Foundation to Feature Large Language Models, a suite of state-of-the-art embedding models in three sizes: 0.6B, 1.7B, and 4B. Unlike previous top-ranking embedding models that require massive contrastive pretraining, sophisticated training pipelines, and costly synthetic training data, F2LLM is directly finetuned from foundation models on 6 million query-document-negative tuples curated from open-source, non-synthetic datasets, striking a strong balance between training cost, model size, and embedding performance. On the MTEB English leaderboard, F2LLM-4B ranks 2nd among models with approximately 4B parameters and 7th overall, while F2LLM-1.7B ranks 1st among models in the 1B-2B size range. To facilitate future research in the field, we release the models, training dataset, and code, positioning F2LLM as a strong, reproducible, and budget-friendly baseline for future works.
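Note: the abstract does not state the training loss. A common objective for (query, document, negative) tuples is an InfoNCE-style contrastive loss over cosine similarities, sketched below under that assumption. The function names and the temperature value are illustrative, not taken from the report.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def contrastive_loss(
    query: Sequence[float],
    document: Sequence[float],
    negatives: Sequence[Sequence[float]],
    temperature: float = 0.05,
) -> float:
    """InfoNCE loss: -log softmax score of the positive document."""
    logits = [cosine(query, document) / temperature]
    logits += [cosine(query, neg) / temperature for neg in negatives]
    # Log-sum-exp with the max subtracted for numerical stability.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denominator - logits[0]
```

The loss approaches zero when the query is much closer to its document than to every negative, and grows when a negative outranks the document.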

Title: Drawing Conclusions from Draws: Rethinking Preference Semantics in Arena-Style LLM Evaluation

Authors: Raphael Tang, Crystina Zhang, Wenyan Li, Carmen Lai, Pontus Stenetorp, Yao Lu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.02306
Pdf URL: https://arxiv.org/pdf/2510.02306
Copy Paste: [[2510.02306]] Drawing Conclusions from Draws: Rethinking Preference Semantics in Arena-Style LLM Evaluation(https://arxiv.org/abs/2510.02306)
Keywords: language model, llm
Abstract: In arena-style evaluation of large language models (LLMs), two LLMs respond to a user query, and the user chooses the winning response or deems the "battle" a draw, resulting in an adjustment to the ratings of both models. The prevailing approach for modeling these rating dynamics is to view battles as two-player game matches, as in chess, and apply the Elo rating system and its derivatives. In this paper, we critically examine this paradigm. Specifically, we question whether a draw genuinely means that the two models are equal and hence whether their ratings should be equalized. Instead, we conjecture that draws are more indicative of query difficulty: if the query is too easy, then both models are more likely to succeed equally. On three real-world arena datasets, we show that ignoring rating updates for draws yields a 1-3% relative increase in battle outcome prediction accuracy (which includes draws) for all four rating systems studied. Further analyses suggest that draws occur more often for queries rated as very easy and those rated as highly objective, with risk ratios of 1.37 and 1.35, respectively. We recommend that future rating systems reconsider existing draw semantics and account for query properties in rating updates.
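Note: the effect of ignoring draws is easy to see with the standard Elo update. The sketch below uses the textbook Elo formulas with a `skip_draws` switch; the K-factor of 32, the outcome labels, and the function names are illustrative assumptions, and the paper studies four rating systems rather than only this one.

```python
from typing import Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update(
    rating_a: float,
    rating_b: float,
    outcome: str,              # "a", "b", or "draw"
    k: float = 32.0,
    skip_draws: bool = True,
) -> Tuple[float, float]:
    """Return new ratings; with skip_draws, a draw leaves both unchanged."""
    if outcome == "draw" and skip_draws:
        return rating_a, rating_b
    score_a = {"a": 1.0, "b": 0.0, "draw": 0.5}[outcome]
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta
```

With conventional draw handling (`skip_draws=False`), a draw between a 1600-rated and a 1400-rated model pulls the two ratings toward each other; with `skip_draws=True`, both ratings stay as they were.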