2025-10-23

Title: Contextual Augmentation for Entity Linking using Large Language Models

Authors: Daniel Vollmers, Hamada M. Zahera, Diego Moussallem, Axel-Cyrille Ngonga Ngomo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.18888
Pdf URL: https://arxiv.org/pdf/2510.18888
Copy Paste: [[2510.18888]] Contextual Augmentation for Entity Linking using Large Language Models(https://arxiv.org/abs/2510.18888)
Keywords: language model
Abstract: Entity Linking involves detecting and linking entity mentions in natural language texts to a knowledge graph. Traditional methods use a two-step process with separate models for entity recognition and disambiguation, which can be computationally intensive and less effective. We propose a fine-tuned model that jointly integrates entity recognition and disambiguation in a unified framework. Furthermore, our approach leverages large language models to enrich the context of entity mentions, yielding better performance in entity disambiguation. We evaluated our approach on benchmark datasets and compared with several baselines. The evaluation results show that our approach achieves state-of-the-art performance on out-of-domain datasets.

Title: Small Language Models Offer Significant Potential for Science Community

Authors: Jian Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.18890
Pdf URL: https://arxiv.org/pdf/2510.18890
Copy Paste: [[2510.18890]] Small Language Models Offer Significant Potential for Science Community(https://arxiv.org/abs/2510.18890)
Keywords: language model, gpt, llm, chat
Abstract: Recent advancements in natural language processing, particularly with large language models (LLMs), are transforming how scientists engage with the literature. While the adoption of LLMs is increasing, concerns remain regarding potential information biases and computational costs. Rather than relying on LLMs, I developed a framework to evaluate the feasibility of precise, rapid, and cost-effective information retrieval from extensive geoscience literature using freely available small language models (MiniLMs). A curated corpus of approximately 77 million high-quality sentences, extracted from 95 leading peer-reviewed geoscience journals such as Geophysical Research Letters and Earth and Planetary Science Letters published from 2000 to 2024, was constructed. MiniLMs enable a computationally efficient approach for extracting relevant domain-specific information from these corpora through semantic search techniques and sentence-level indexing. This approach, unlike LLMs such as ChatGPT-4, which often produce generalized responses, excels at identifying substantial amounts of expert-verified information with established, multi-disciplinary sources, especially for information with quantitative findings. Furthermore, by analyzing emotional tone via sentiment analysis and topical clusters through unsupervised clustering within sentences, MiniLM provides a powerful tool for tracking the evolution of conclusions, research priorities, advancements, and emerging questions within geoscience communities. Overall, MiniLM holds significant potential within the geoscience community for applications such as fact and image retrieval, trend analyses, contradiction analyses, and educational purposes.
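The sentence-level semantic search described in this abstract can be illustrated with a minimal sketch. It assumes each sentence has already been mapped to an embedding vector; the three-dimensional vectors, the placeholder sentence labels, and the function names `cosine` and `search` are hypothetical and are not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def search(query_vec, index, top_k=2):
    """Rank indexed sentences by similarity to the query vector."""
    scored = [(cosine(query_vec, vec), sentence) for sentence, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]

# Toy index: (sentence, embedding). Real embeddings would come from a
# sentence-embedding model; these vectors are made up for illustration.
INDEX = [
    ("sentence A", [0.9, 0.1, 0.0]),
    ("sentence B", [0.1, 0.9, 0.1]),
    ("sentence C", [0.8, 0.2, 0.1]),
]

if __name__ == "__main__":
    for score, sentence in search([1.0, 0.0, 0.0], INDEX):
        print(f"{score:.3f}  {sentence}")
```

A linear scan like this is only practical for small indexes; a corpus of tens of millions of sentences would normally need an approximate nearest-neighbour index, which is outside the scope of this sketch.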

Title: When Models Can't Follow: Testing Instruction Adherence Across 256 LLMs

Authors: Richard J. Young, Brandon Gillins, Alice M. Matthews
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2510.18892
Pdf URL: https://arxiv.org/pdf/2510.18892
Copy Paste: [[2510.18892]] When Models Can't Follow: Testing Instruction Adherence Across 256 LLMs(https://arxiv.org/abs/2510.18892)
Keywords: language model, llm, prompt
Abstract: Despite widespread deployment of Large Language Models, systematic evaluation of instruction-following capabilities remains challenging. While comprehensive benchmarks exist, focused assessments that quickly diagnose specific instruction adherence patterns are valuable. As newer models may be trained on existing benchmarks, novel evaluation approaches are needed to assess genuine capabilities rather than memorized performance. This paper presents a streamlined evaluation framework using twenty carefully designed prompts to assess LLM instruction-following across diverse task categories. We demonstrate this framework through a large-scale empirical study conducted on October 14, 2025, testing 256 verified working models from 331 available via OpenRouter. To ensure methodological rigor and prevent selection bias, we first verified each model's basic functionality before inclusion. Unlike large-scale benchmarks requiring extensive computational resources, our approach offers a practical diagnostic tool researchers and practitioners can readily apply. Our methodology builds upon verifiable instructions while introducing a compact test suite balancing comprehensiveness with efficiency. Each prompt targets distinct aspects of instruction following, including format compliance, content constraints, logical sequencing, and multi-step task execution. We evaluate models from major providers (OpenAI, Anthropic, Google, Meta, Mistral) and emerging implementations (Qwen, DeepSeek, community models), providing comparative performance analysis. Our findings reveal consistent failure modes and identify specific instruction types posing particular challenges. This work contributes both a practical evaluation tool and one of the most comprehensive empirical analyses of instruction-following capabilities across the contemporary LLM landscape.

Title: Transformer-Based Low-Resource Language Translation: A Study on Standard Bengali to Sylheti

Authors: Mangsura Kabir Oni, Tabia Tanzin Prama
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2510.18898
Pdf URL: https://arxiv.org/pdf/2510.18898
Copy Paste: [[2510.18898]] Transformer-Based Low-Resource Language Translation: A Study on Standard Bengali to Sylheti(https://arxiv.org/abs/2510.18898)
Keywords: language model, llm
Abstract: Machine Translation (MT) has advanced from rule-based and statistical methods to neural approaches based on the Transformer architecture. While these methods have achieved impressive results for high-resource languages, low-resource varieties such as Sylheti remain underexplored. In this work, we investigate Bengali-to-Sylheti translation by fine-tuning multilingual Transformer models and comparing them with zero-shot large language models (LLMs). Experimental results demonstrate that fine-tuned models significantly outperform LLMs, with mBART-50 achieving the highest translation adequacy and MarianMT showing the strongest character-level fidelity. These findings highlight the importance of task-specific adaptation for underrepresented languages and contribute to ongoing efforts toward inclusive language technologies.

Title: DuoLens: A Framework for Robust Detection of Machine-Generated Multilingual Text and Code

Authors: Shriyansh Agrawal, Aidan Lau, Sanyam Shah, Ahan M R, Kevin Zhu, Sunishchal Dev, Vasu Sharma
Subjects: cs.CL, cs.AI, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2510.18904
Pdf URL: https://arxiv.org/pdf/2510.18904
Copy Paste: [[2510.18904]] DuoLens: A Framework for Robust Detection of Machine-Generated Multilingual Text and Code(https://arxiv.org/abs/2510.18904)
Keywords: language model, gpt, llm
Abstract: The prevalence of Large Language Models (LLMs) for generating multilingual text and source code has only increased the imperative for machine-generated content detectors to be accurate and efficient across domains. Current detectors, predominantly utilizing zero-shot methods, such as Fast-DetectGPT or GPTZero, either incur high computational cost or lack sufficient accuracy, often with a trade-off between the two, leaving room for further improvement. To address these gaps, we propose the fine-tuning of encoder-only Small Language Models (SLMs), in particular, the pre-trained models of RoBERTa and CodeBERTa using specialized datasets on source code and other natural language to prove that for the task of binary classification, SLMs outperform LLMs by a huge margin whilst using a fraction of compute. Our encoders achieve AUROC $= 0.97$ to $0.99$ and macro-F1 $0.89$ to $0.94$ while reducing latency by $8$-$12\times$ and peak VRAM by $3$-$5\times$ at $512$-token inputs. Under cross-generator shifts and adversarial transformations (paraphrase, back-translation; code formatting/renaming), performance retains $\geq 92\%$ of clean AUROC. We release training and evaluation scripts with seeds and configs; a reproducibility checklist is also included.

Title: Improving Topic Modeling of Social Media Short Texts with Rephrasing: A Case Study of COVID-19 Related Tweets

Authors: Wangjiaxuan Xin, Shuhua Yin, Shi Chen, Yaorong Ge
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.18908
Pdf URL: https://arxiv.org/pdf/2510.18908
Copy Paste: [[2510.18908]] Improving Topic Modeling of Social Media Short Texts with Rephrasing: A Case Study of COVID-19 Related Tweets(https://arxiv.org/abs/2510.18908)
Keywords: language model, llm
Abstract: Social media platforms such as Twitter (now X) provide rich data for analyzing public discourse, especially during crises such as the COVID-19 pandemic. However, the brevity, informality, and noise of social media short texts often hinder the effectiveness of traditional topic modeling, producing incoherent or redundant topics that are often difficult to interpret. To address these challenges, we have developed TM-Rephrase, a model-agnostic framework that leverages large language models (LLMs) to rephrase raw tweets into more standardized and formal language prior to topic modeling. Using a dataset of 25,027 COVID-19-related Twitter posts, we investigate the effects of two rephrasing strategies, general- and colloquial-to-formal-rephrasing, on multiple topic modeling methods. Results demonstrate that TM-Rephrase improves three metrics measuring topic modeling performance (i.e., topic coherence, topic uniqueness, and topic diversity) while reducing topic redundancy of most topic modeling algorithms, with the colloquial-to-formal strategy yielding the greatest performance gains, especially for the Latent Dirichlet Allocation (LDA) algorithm. This study contributes to a model-agnostic approach to enhancing topic modeling in public-health-related social media analysis, with broad implications for improved understanding of public discourse in health crises as well as other important domains.

Title: Learning from the Best, Differently: A Diversity-Driven Rethinking on Data Selection

Authors: Hongyi He, Xiao Liu, Zhenghao Lin, Mingni Tang, Yi Cheng, Jintao Wang, Wenjie Li, Peng Cheng, Yeyun Gong
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.18909
Pdf URL: https://arxiv.org/pdf/2510.18909
Copy Paste: [[2510.18909]] Learning from the Best, Differently: A Diversity-Driven Rethinking on Data Selection(https://arxiv.org/abs/2510.18909)
Keywords: language model, llm
Abstract: High-quality pre-training data is crucial for large language models, where quality captures factual reliability and semantic value, and diversity ensures broad coverage and distributional heterogeneity. Existing approaches typically rely on single- or multi-dimensional score-based selection. However, directly selecting top-scored data often degrades performance, and sampling from a broader range is required to recover results. The above non-monotonicity between dataset scores and downstream benchmark results reveals a fundamental bias: score-based methods collapse correlated dimensions, causing top-scored data to appear high-quality while systematically overlooking diversity. We argue that ensuring diversity requires decomposing correlated metrics into orthogonal feature dimensions, from which the top-scored data can be directly selected. Therefore, we propose the Orthogonal Diversity-Aware Selection (ODiS) algorithm, which preserves both quality and diversity during data selection. First, ODiS evaluates data from multiple dimensions, covering language quality, knowledge quality, and comprehension difficulty. The multi-dimensional scores are then decorrelated via Principal Component Analysis (PCA), yielding orthogonal evaluation dimensions. For each dimension, a RoBERTa-based scorer is trained to regress the data onto PCA-projected scores, enabling scalable inference on large corpora. Finally, ODiS constructs the training dataset by selecting top-scored data within each orthogonal dimension, thereby ensuring both quality and diversity. Empirical results show that ODiS-selected data exhibit less than 2% inter-dimension overlap, confirming orthogonality between dimensions. More importantly, models trained with ODiS-selected data significantly outperform other baselines on downstream benchmarks, highlighting the necessity of orthogonal, diversity-aware data selection for LLMs.
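The two core steps named in this abstract, decorrelating multi-dimensional scores with PCA and then taking the top-scored items within each orthogonal dimension, can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the trained RoBERTa-based scorers are omitted, and because the sign of a principal component is arbitrary, a real pipeline would need a convention for which direction counts as "high".

```python
import numpy as np

def decorrelate_scores(scores):
    """Project raw scores (n_items x n_metrics) onto principal components."""
    centered = scores - scores.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]  # largest variance first
    return centered @ eigvecs[:, order]

def select_top_per_dimension(projected, k):
    """Union of the k highest-scoring item indices along each component."""
    selected = set()
    for dim in range(projected.shape[1]):
        top = np.argsort(projected[:, dim])[::-1][:k]
        selected.update(int(i) for i in top)
    return sorted(selected)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    shared = rng.normal(size=(200, 1))
    # Three synthetic, strongly correlated quality metrics.
    raw = np.hstack([shared + 0.1 * rng.normal(size=(200, 1)) for _ in range(3)])
    projected = decorrelate_scores(raw)
    print(len(select_top_per_dimension(projected, k=10)))
```

Selecting the top of each decorrelated dimension, rather than the top of a single combined score, is what lets the selection cover items that are strong in different respects.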

Title: Context-aware Fairness Evaluation and Mitigation in LLMs

Authors: Afrozah Nadeem, Mark Dras, Usman Naseem
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.18914
Pdf URL: https://arxiv.org/pdf/2510.18914
Copy Paste: [[2510.18914]] Context-aware Fairness Evaluation and Mitigation in LLMs(https://arxiv.org/abs/2510.18914)
Keywords: language model, llm
Abstract: Large language models often display undesirable behaviors embedded in their internal representations, undermining fairness and leading to inconsistency drift, amplification of harmful content, and the propagation of unwanted patterns during extended dialogue and conversations. Although training-time or data-centric methods attempt to reduce these effects, they are computationally expensive, irreversible once deployed, and slow to adapt to new conversational contexts. Pruning-based methods provide a flexible and transparent way to reduce bias by adjusting the neurons responsible for certain behaviors. However, most existing approaches are static; once a neuron is removed, the model loses the ability to adapt when the conversation or context changes. To address this, we propose a dynamic, reversible, pruning-based framework that detects context-aware neuron activations and applies adaptive masking to modulate their influence during generation. Our inference-time solution provides fine-grained, memory-aware mitigation with knowledge-preserved, more coherent behavior across multilingual single- and multi-turn dialogues, enabling dynamic fairness control in real-world conversational AI.

Title: Misinformation Detection using Large Language Models with Explainability

Authors: Jainee Patel, Chintan Bhatt, Himani Trivedi, Thanh Thi Nguyen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.18918
Pdf URL: https://arxiv.org/pdf/2510.18918
Copy Paste: [[2510.18918]] Misinformation Detection using Large Language Models with Explainability(https://arxiv.org/abs/2510.18918)
Keywords: language model
Abstract: The rapid spread of misinformation on online platforms undermines trust among individuals and hinders informed decision making. This paper presents an explainable and computationally efficient pipeline to detect misinformation using transformer-based pretrained language models (PLMs). We optimize both RoBERTa and DistilBERT using a two-step strategy: first, we freeze the backbone and train only the classification head; then, we progressively unfreeze the backbone layers while applying layer-wise learning rate decay. On two real-world benchmark datasets, COVID Fake News and FakeNewsNet GossipCop, we test the proposed approach with a unified protocol of preprocessing and stratified splits. To ensure transparency, we integrate the Local Interpretable Model-Agnostic Explanations (LIME) at the token level to present token-level rationales and SHapley Additive exPlanations (SHAP) at the global feature attribution level. The results show that DistilBERT achieves accuracy comparable to RoBERTa while requiring significantly fewer computational resources. This work makes two key contributions: (1) it quantitatively shows that a lightweight PLM can maintain task performance while substantially reducing computational cost, and (2) it presents an explainable pipeline that retrieves faithful local and global justifications without compromising performance. The results suggest that PLMs combined with principled fine-tuning and interpretability can be an effective framework for scalable, trustworthy misinformation detection.
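The two-step fine-tuning strategy in this abstract (train the head with a frozen backbone, then progressively unfreeze layers under layer-wise learning rate decay) can be sketched as a schedule. The abstract gives no hyperparameters, so the base learning rate, the decay factor, the stage numbering, and both function names below are assumptions chosen only for illustration.

```python
def layerwise_learning_rates(base_lr, num_layers, decay):
    """One learning rate per backbone layer; index 0 is closest to the input.

    The top layer keeps base_lr and each layer below it is scaled by
    another factor of `decay`.
    """
    return [base_lr * decay ** (num_layers - 1 - layer) for layer in range(num_layers)]

def unfrozen_layers(num_layers, stage):
    """Backbone layers that are trainable at a given stage.

    Stage 0 trains only the classification head (no backbone layers);
    stage s additionally unfreezes the top s backbone layers.
    """
    first = max(num_layers - stage, 0)
    return list(range(first, num_layers))

if __name__ == "__main__":
    # Hypothetical settings: a 6-layer backbone, base_lr 2e-5, decay 0.9.
    rates = layerwise_learning_rates(base_lr=2e-5, num_layers=6, decay=0.9)
    for stage in range(4):
        layers = unfrozen_layers(6, stage)
        print(stage, [(layer, f"{rates[layer]:.2e}") for layer in layers])
```

Giving lower layers smaller learning rates reflects the usual motivation for this technique: layers near the input tend to hold more general features, so they are updated more cautiously than the task-specific top layers.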

Title: Evaluating LLM Story Generation through Large-scale Network Analysis of Social Structures

Authors: Hiroshi Nonaka, K. E. Perry
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2510.18932
Pdf URL: https://arxiv.org/pdf/2510.18932
Copy Paste: [[2510.18932]] Evaluating LLM Story Generation through Large-scale Network Analysis of Social Structures(https://arxiv.org/abs/2510.18932)
Keywords: language model, gpt, llm
Abstract: Evaluating the creative capabilities of large language models (LLMs) in complex tasks often requires human assessments that are difficult to scale. We introduce a novel, scalable methodology for evaluating LLM story generation by analyzing underlying social structures in narratives as signed character networks. To demonstrate its effectiveness, we conduct a large-scale comparative analysis using networks from over 1,200 stories, generated by four leading LLMs (GPT-4o, GPT-4o mini, Gemini 1.5 Pro, and Gemini 1.5 Flash) and a human-written corpus. Our findings, based on network properties like density, clustering, and signed edge weights, show that LLM-generated stories consistently exhibit a strong bias toward tightly-knit, positive relationships, which aligns with findings from prior research using human assessment. Our proposed approach provides a valuable tool for evaluating limitations and tendencies in the creative storytelling of current and future LLMs.
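The network properties mentioned in this abstract (density, clustering, and signed edge weights) can be computed on a small signed character network as in the sketch below. The edge representation, the character labels, and the function name `network_stats` are hypothetical, and the clustering measure used here is global transitivity, which may differ from the measure used in the paper.

```python
from itertools import combinations

def network_stats(edges):
    """Summarize an undirected signed network.

    edges maps frozenset({a, b}) -> signed weight (positive = friendly,
    negative = hostile). Self-loops are not supported.
    """
    neighbors = {}
    for pair in edges:
        a, b = tuple(pair)
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    n = len(neighbors)
    possible = n * (n - 1) // 2
    density = len(edges) / possible if possible else 0.0
    positive = sum(1 for weight in edges.values() if weight > 0)
    positive_share = positive / len(edges) if edges else 0.0
    # Transitivity: share of connected triples that close into a triangle.
    triples = closed = 0
    for node, adjacent in neighbors.items():
        for u, v in combinations(sorted(adjacent), 2):
            triples += 1
            if v in neighbors[u]:
                closed += 1
    transitivity = closed / triples if triples else 0.0
    return {"density": density, "positive_share": positive_share,
            "transitivity": transitivity}

if __name__ == "__main__":
    story = {
        frozenset({"A", "B"}): 2,
        frozenset({"B", "C"}): 1,
        frozenset({"A", "C"}): 3,
        frozenset({"C", "D"}): -1,
    }
    print(network_stats(story))
```

In these terms, the bias the abstract reports would show up as high density, high transitivity, and a large share of positive edges in stories produced by LLMs.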

Title: Lost in the Maze: Overcoming Context Limitations in Long-Horizon Agentic Search

Authors: Howard Yen, Ashwin Paranjape, Mengzhou Xia, Thejas Venkatesh, Jack Hessel, Danqi Chen, Yuhao Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.18939
Pdf URL: https://arxiv.org/pdf/2510.18939
Copy Paste: [[2510.18939]] Lost in the Maze: Overcoming Context Limitations in Long-Horizon Agentic Search(https://arxiv.org/abs/2510.18939)
Keywords: hallucination, agent
Abstract: Long-horizon agentic search requires iteratively exploring the web over long trajectories and synthesizing information across many sources, and is the foundation for enabling powerful applications like deep research systems. In this work, we show that popular agentic search frameworks struggle to scale to long trajectories primarily due to context limitations: they accumulate long, noisy content, hit context window and tool budgets, or stop early. Then, we introduce SLIM (Simple Lightweight Information Management), a simple framework that separates retrieval into distinct search and browse tools, and periodically summarizes the trajectory, keeping context concise while enabling longer, more focused searches. On long-horizon tasks, SLIM achieves comparable performance at substantially lower cost and with far fewer tool calls than strong open-source baselines across multiple base models. Specifically, with o3 as the base model, SLIM achieves 56% on BrowseComp and 31% on HLE, outperforming all open-source frameworks by 8 and 4 absolute points, respectively, while incurring 4-6x fewer tool calls. Finally, we release an automated fine-grained trajectory analysis pipeline and error taxonomy for characterizing long-horizon agentic search frameworks; SLIM exhibits fewer hallucinations than prior systems. We hope our analysis framework and simple tool design inform future long-horizon agents.
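The idea of periodically summarizing the trajectory to keep context concise can be sketched as a simple loop. This is not SLIM's actual code: the function names, the fixed summarization interval, and the stand-in summarizer (a real agent would call a language model here) are all assumptions for illustration.

```python
def count_summary(items):
    """Stand-in summarizer; a real agent would call a language model."""
    return f"summary({len(items)})"

def run_with_periodic_summary(steps, summarize=count_summary, every=3):
    """Accumulate tool outputs, collapsing the context every `every` steps."""
    context = []
    for i, step in enumerate(steps, start=1):
        context.append(step)
        if i % every == 0:
            # Replace everything gathered so far with a single summary entry.
            context = [summarize(context)]
    return context

if __name__ == "__main__":
    observations = [f"obs{i}" for i in range(1, 8)]
    print(run_with_periodic_summary(observations))
```

With an interval of `every`, the context holds at most one summary plus `every - 1` raw observations between collapses, so its size stays bounded no matter how long the trajectory runs.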

Title: ProfBench: Multi-Domain Rubrics requiring Professional Knowledge to Answer and Judge

Authors: Zhilin Wang, Jaehun Jung, Ximing Lu, Shizhe Diao, Ellie Evans, Jiaqi Zeng, Pavlo Molchanov, Yejin Choi, Jan Kautz, Yi Dong
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.18941
Pdf URL: https://arxiv.org/pdf/2510.18941
Copy Paste: [[2510.18941]] ProfBench: Multi-Domain Rubrics requiring Professional Knowledge to Answer and Judge(https://arxiv.org/abs/2510.18941)
Keywords: language model, gpt, llm
Abstract: Evaluating progress in large language models (LLMs) is often constrained by the challenge of verifying responses, limiting assessments to tasks like mathematics, programming, and short-form question-answering. However, many real-world applications require evaluating LLMs in processing professional documents, synthesizing information, and generating comprehensive reports in response to user queries. We introduce ProfBench: a set of over 7000 response-criterion pairs evaluated by human experts with professional knowledge across Physics PhD, Chemistry PhD, Finance MBA and Consulting MBA. We build robust and affordable LLM-Judges to evaluate ProfBench rubrics, by mitigating self-enhancement bias and reducing the cost of evaluation by 2-3 orders of magnitude, to make it fair and accessible to the broader community. Our findings reveal that ProfBench poses significant challenges even for state-of-the-art LLMs, with top-performing models like GPT-5-high achieving only 65.9% overall performance. Furthermore, we identify notable performance disparities between proprietary and open-weight models and provide insights into the role that extended thinking plays in addressing complex, professional-domain tasks. Data: this https URL and Code: this https URL

Title: Dynamic Evaluation for Oversensitivity in LLMs

Authors: Sophia Xiao Pu, Sitao Cheng, Xin Eric Wang, William Yang Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.19005
Pdf URL: https://arxiv.org/pdf/2510.19005
Copy Paste: [[2510.19005]] Dynamic Evaluation for Oversensitivity in LLMs(https://arxiv.org/abs/2510.19005)
Keywords: language model, llm, prompt
Abstract: Oversensitivity occurs when language models defensively reject prompts that are actually benign. This behavior not only disrupts user interactions but also obscures the boundary between harmful and harmless content. Existing benchmarks rely on static datasets that degrade over time as models evolve, leading to data contamination and diminished evaluative power. To address this, we develop a framework that dynamically generates model-specific challenging datasets, capturing emerging defensive patterns and aligning with each model's unique behavior. Building on this approach, we construct OVERBENCH, a benchmark that aggregates these datasets across diverse LLM families, encompassing 450,000 samples from 25 models. OVERBENCH provides a dynamic and evolving perspective on oversensitivity, allowing for continuous monitoring of defensive triggers as models advance, highlighting vulnerabilities that static datasets overlook.

Title: Are they lovers or friends? Evaluating LLMs' Social Reasoning in English and Korean Dialogues

Authors: Eunsu Kim, Junyeong Park, Juhyun Oh, Kiwoong Park, Seyoung Song, A. Seza Dogruoz, Najoung Kim, Alice Oh
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.19028
Pdf URL: https://arxiv.org/pdf/2510.19028
Copy Paste: [[2510.19028]] Are they lovers or friends? Evaluating LLMs' Social Reasoning in English and Korean Dialogues(https://arxiv.org/abs/2510.19028)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: As large language models (LLMs) are increasingly used in human-AI interactions, their social reasoning capabilities in interpersonal contexts are critical. We introduce SCRIPTS, a 1k-dialogue dataset in English and Korean, sourced from movie scripts. The task involves evaluating models' social reasoning capability to infer the interpersonal relationships (e.g., friends, sisters, lovers) between speakers in each dialogue. Each dialogue is annotated with probabilistic relational labels (Highly Likely, Less Likely, Unlikely) by native (or equivalent) Korean and English speakers from Korea and the U.S. In our evaluation of nine models, current proprietary LLMs achieve around 75-80% on the English dataset, whereas their performance on Korean drops to 58-69%. More strikingly, models select Unlikely relationships in 10-25% of their responses. Furthermore, we find that thinking models and chain-of-thought prompting, effective for general reasoning, provide minimal benefits for social reasoning and occasionally amplify social biases. Our findings reveal significant limitations in current LLMs' social reasoning capabilities, highlighting the need for efforts to develop socially-aware language models.

Title: When Can We Trust LLMs in Mental Health? Large-Scale Benchmarks for Reliable LLM Evaluation

Authors: Abeer Badawi, Elahe Rahimi, Md Tahmid Rahman Laskar, Sheri Grach, Lindsay Bertrand, Lames Danok, Jimmy Huang, Frank Rudzicz, Elham Dolatabadi
Subjects: cs.CL, cs.CY, cs.HC
Abstract URL: https://arxiv.org/abs/2510.19032
Pdf URL: https://arxiv.org/pdf/2510.19032
Copy Paste: [[2510.19032]] When Can We Trust LLMs in Mental Health? Large-Scale Benchmarks for Reliable LLM Evaluation(https://arxiv.org/abs/2510.19032)
Keywords: language model, llm
Abstract: Evaluating Large Language Models (LLMs) for mental health support is challenging due to the emotionally and cognitively complex nature of therapeutic dialogue. Existing benchmarks are limited in scale and reliability, often rely on synthetic or social media data, and lack frameworks to assess when automated judges can be trusted. To address the need for large-scale dialogue datasets and judge reliability assessment, we introduce two benchmarks that provide a framework for generation and evaluation. MentalBench-100k consolidates 10,000 one-turn conversations from three real-scenario datasets, each paired with nine LLM-generated responses, yielding 100,000 response pairs. MentalAlign-70k reframes evaluation by comparing four high-performing LLM judges with human experts across 70,000 ratings on seven attributes, grouped into Cognitive Support Score (CSS) and Affective Resonance Score (ARS). We then employ the Affective Cognitive Agreement Framework, a statistical methodology using intraclass correlation coefficients (ICC) with confidence intervals to quantify agreement, consistency, and bias between LLM judges and human experts. Our analysis reveals systematic inflation by LLM judges, strong reliability for cognitive attributes such as guidance and informativeness, reduced precision for empathy, and some unreliability in safety and relevance. Our contributions establish new methodological and empirical foundations for reliable, large-scale evaluation of LLMs in mental health. We release the benchmarks and codes at: this https URL
摘要：由于治疗性对话在情感和认知上的复杂性，评估大型语言模型 (LLM) 的心理健康支持具有挑战性。现有的基准在规模和可靠性方面都受到限制，通常依赖于合成数据或社交媒体数据，并且缺乏评估自动评判者何时可信的框架。为了满足大规模对话数据集和评判可靠性评估的需求，我们引入了两个基准，为生成和评估提供了框架。 MentalBench-100k 整合了来自三个真实场景数据集的 10,000 个单轮对话，每个对话与 9 个 LLM 生成的响应配对，产生 100,000 个响应对。 MentalAlign-70k 通过将四个表现出色的 LLM 评判者与人类专家对七个属性的 70,000 个评分进行比较，重新构建评估，这些属性分为认知支持分数 (CSS) 和情感共鸣分数 (ARS)。然后，我们采用情感认知一致性框架，这是一种统计方法，使用类内相关系数 (ICC) 和置信区间来量化 LLM 评判者和人类专家之间的一致性、稳定性和偏差。我们的分析揭示了 LLM 评判者的系统性膨胀、认知属性（例如指导和信息性）的高度可靠性、同理心的精确度降低以及安全性和相关性方面的一些不可靠性。我们的贡献为心理健康领域 LLM 的可靠、大规模评估奠定了新的方法论和实证基础。我们在以下位置发布基准测试和代码：此 https URL
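
As a rough illustration of the agreement analysis described in this abstract, the sketch below computes a single-rater, absolute-agreement intraclass correlation, ICC(2,1), together with the mean judge-minus-human bias. The abstract does not say which ICC form the authors use, so the choice of ICC(2,1), the function names `icc2_1` and `mean_bias`, and the toy ratings are all assumptions made for illustration; confidence intervals are omitted.

```python
from statistics import mean


def icc2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    ratings: list of rows, one row per rated item, one score per rater.
    (Assumed ICC form; the abstract does not specify one.)
    """
    n, k = len(ratings), len(ratings[0])
    grand = mean(v for row in ratings for v in row)
    row_means = [mean(row) for row in ratings]
    col_means = [mean(row[j] for row in ratings) for j in range(k)]

    # Two-way ANOVA sums of squares: items (rows), raters (columns), residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in ratings for v in row)
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


def mean_bias(human, judge):
    """Average amount by which the judge scores above the human expert."""
    return mean(j - h for h, j in zip(human, judge))


if __name__ == "__main__":
    human = [1.0, 2.0, 3.0]   # hypothetical expert ratings
    judge = [2.0, 3.0, 4.0]   # hypothetical LLM-judge ratings, one point higher
    print(icc2_1([[h, j] for h, j in zip(human, judge)]))
    print(mean_bias(human, judge))
```

In this toy case the judge ranks the items exactly as the human does but scores each one a point higher, so the absolute-agreement ICC falls to about 0.67 while the bias is 1.0. That is the kind of systematic inflation the abstract reports.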

Title: From Memorization to Generalization: Fine-Tuning Large Language Models for Biomedical Term-to-Identifier Normalization

Authors: Suswitha Pericharla, Daniel B. Hier, Tayo Obafemi-Ajayi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.19036
Pdf URL: https://arxiv.org/pdf/2510.19036
Copy Paste: [[2510.19036]] From Memorization to Generalization: Fine-Tuning Large Language Models for Biomedical Term-to-Identifier Normalization(https://arxiv.org/abs/2510.19036)
Keywords: language model, gpt, llm
Abstract: Effective biomedical data integration depends on automated term normalization, the mapping of natural language biomedical terms to standardized identifiers. This linking of terms to identifiers is essential for semantic interoperability. Large language models (LLMs) show promise for this task but perform unevenly across terminologies. We evaluated both memorization (training-term performance) and generalization (validation-term performance) across multiple biomedical ontologies. Fine-tuning Llama 3.1 8B revealed marked differences by terminology. GO mappings showed strong memorization gains (up to 77% improvement in term-to-identifier accuracy), whereas HPO showed minimal improvement. Generalization occurred only for protein-gene (GENE) mappings (13.9% gain), while fine-tuning for HPO and GO yielded negligible transfer. Baseline accuracy varied by model scale, with GPT-4o outperforming both Llama variants for all terminologies. Embedding analyses showed tight semantic alignment between gene symbols and protein names but weak alignment between terms and identifiers for GO or HPO, consistent with limited lexicalization. Fine-tuning success depended on two interacting factors: identifier popularity and lexicalization. Popular identifiers were more likely encountered during pretraining, enhancing memorization. Lexicalized identifiers, such as gene symbols, enabled semantic generalization. By contrast, arbitrary identifiers in GO and HPO constrained models to rote learning. These findings provide a predictive framework for when fine-tuning enhances factual recall versus when it fails due to sparse or non-lexicalized identifiers.
摘要：有效的生物医学数据集成取决于自动术语标准化，即自然语言生物医学术语到标准化标识符的映射。这种术语与标识符的链接对于语义互操作性至关重要。大型语言模型 (LLM) 有望完成这项任务，但在不同术语方面的表现却参差不齐。我们评估了多个生物医学本体的记忆（训练项表现）和泛化（验证项表现）。对 Llama 3.1 8B 的微调揭示了术语的显着差异。 GO 映射显示出强大的记忆增益（术语到标识符的准确度提高了 77%），而 HPO 的提升幅度很小。泛化仅发生在蛋白质基因 (GENE) 映射上（增益 13.9%），而 HPO 和 GO 的微调产生的转移可以忽略不计。基线精度因模型规模而异，GPT-4o 在所有术语方面均优于两种 Llama 变体。嵌入分析显示基因符号和蛋白质名称之间的语义紧密对齐，但 GO 或 HPO 的术语和标识符之间的对齐较弱，这与有限的词汇化一致。微调的成功取决于两个相互作用的因素：标识符流行度和词汇化。在预训练期间更有可能遇到流行的标识符，从而增强记忆。词汇化标识符，例如基因符号，实现了语义泛化。相比之下，GO 和 HPO 中的任意标识符限制模型死记硬背。这些发现为微调何时增强事实回忆以及何时由于稀疏或非词汇化标识符而失败提供了一个预测框架。

Title: That's Deprecated! Understanding, Detecting, and Steering Knowledge Conflicts in Language Models for Code Generation

Authors: Jaesung Bae, Cameron Churchwell, Mitchell Hermon, Tsun-An Hsieh, Jocelyn Xu, Yekaterina Yegorova, Mark Hasegawa-Johnson, Heng Ji
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.19116
Pdf URL: https://arxiv.org/pdf/2510.19116
Copy Paste: [[2510.19116]] That's Deprecated! Understanding, Detecting, and Steering Knowledge Conflicts in Language Models for Code Generation(https://arxiv.org/abs/2510.19116)
Keywords: language model, llm, prompt
Abstract: This paper investigates how large language models (LLMs) behave when faced with discrepancies between their parametric knowledge and conflicting information contained in a prompt. Building on prior question-answering (QA) research, we extend the investigation of knowledge conflicts to the realm of code generation. We propose a domain-agnostic framework for constructing and interpreting such conflicts, along with a novel evaluation method and dataset tailored to code conflict scenarios. Our experiments indicate that sufficiently large LLMs encode the notion of a knowledge conflict in their parameters, enabling us to detect knowledge conflicts with up to 80.65% accuracy. Building on these insights, we show that activation-level steering can achieve up to a 12.6% improvement in steering success over a random baseline. However, effectiveness depends critically on balancing model size, task domain, and steering direction. The experiment code and data will be made publicly available after acceptance.
摘要：本文研究了大型语言模型 (LLM) 在面临参数知识与提示中包含的冲突信息之间的差异时如何表现。基于先前的问答（QA）研究，我们将知识冲突的调查扩展到代码生成领域。我们提出了一个与领域无关的框架来构建和解释此类冲突，以及针对代码冲突场景定制的新颖的评估方法和数据集。我们的实验表明，足够大的 LLM 在其参数中编码了知识冲突的概念，使我们能够以高达 80.65% 的准确度检测知识冲突。基于这些见解，我们表明激活级转向可以在随机基线的转向成功方面实现高达 12.6% 的改进。然而，有效性关键取决于平衡模型大小、任务域和指导方向。实验代码和数据将在验收后公开。

Title: A Graph Signal Processing Framework for Hallucination Detection in Large Language Models

Authors: Valentin Noël
Subjects: cs.CL, cs.LG, eess.SP, stat.ML
Abstract URL: https://arxiv.org/abs/2510.19117
Pdf URL: https://arxiv.org/pdf/2510.19117
Copy Paste: [[2510.19117]] A Graph Signal Processing Framework for Hallucination Detection in Large Language Models(https://arxiv.org/abs/2510.19117)
Keywords: language model, gpt, hallucination
Abstract: Large language models achieve impressive results but distinguishing factual reasoning from hallucinations remains challenging. We propose a spectral analysis framework that models transformer layers as dynamic graphs induced by attention, with token embeddings as signals on these graphs. Through graph signal processing, we define diagnostics including Dirichlet energy, spectral entropy, and high-frequency energy ratios, with theoretical connections to computational stability. Experiments across GPT architectures suggest universal spectral patterns: factual statements exhibit consistent "energy mountain" behavior with low-frequency convergence, while different hallucination types show distinct signatures. Logical contradictions destabilize spectra with large effect sizes ($g>1.0$), semantic errors remain stable but show connectivity drift, and substitution hallucinations display intermediate perturbations. A simple detector using spectral signatures achieves 88.75% accuracy versus 75% for perplexity-based baselines, demonstrating practical utility. These findings indicate that spectral geometry may capture reasoning patterns and error behaviors, potentially offering a framework for hallucination detection in large language models.
摘要：大型语言模型取得了令人印象深刻的结果，但区分事实推理和幻觉仍然具有挑战性。我们提出了一个谱分析框架，将变压器层建模为由注意力引起的动态图，并将令牌嵌入作为这些图上的信号。通过图信号处理，我们定义了诊断，包括狄利克雷能量、谱熵和高频能量比，并与计算稳定性建立了理论联系。跨 GPT 架构的实验表明了通用的频谱模式：事实陈述表现出一致的“能量山”行为和低频收敛，而不同的幻觉类型则显示出不同的特征。逻辑矛盾会破坏具有大效应量（$g>1.0$）的谱的稳定性，语义错误保持稳定但显示出连通性漂移，并且替代幻觉显示出中间扰动。使用光谱特征的简单检测器可实现 88.75% 的准确度，而基于困惑度的基线的准确度为 75%，证明了实用性。这些发现表明，谱几何可以捕获推理模式和错误行为，有可能为大型语言模型中的幻觉检测提供框架。
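
To make one of the diagnostics named in this abstract concrete, the sketch below computes the Dirichlet energy of token embeddings on a graph induced by a single attention matrix. It is a minimal illustration under stated assumptions, not the paper's implementation: symmetrizing the attention weights, using the unnormalized form E = 1/2 * sum_ij w_ij * ||x_i - x_j||^2, and the function name `dirichlet_energy` are all choices made here.

```python
def dirichlet_energy(attn, signals):
    """Dirichlet energy of `signals` on the graph whose weights come from `attn`.

    attn:    n x n attention matrix (list of lists) for one head.
    signals: n embedding vectors, one per token.
    The weights are symmetrized as (A + A^T) / 2, which is an assumption.
    """
    n = len(attn)
    energy = 0.0
    for i in range(n):
        for j in range(n):
            w = 0.5 * (attn[i][j] + attn[j][i])
            dist2 = sum((a - b) ** 2 for a, b in zip(signals[i], signals[j]))
            energy += 0.5 * w * dist2
    return energy


if __name__ == "__main__":
    attn = [[0.5, 0.5], [0.5, 0.5]]                     # toy attention weights
    print(dirichlet_energy(attn, [[1.0], [1.0]]))       # smooth signal
    print(dirichlet_energy(attn, [[0.0], [1.0]]))       # signal that varies across tokens
```

A signal that is constant across tokens has zero energy, while a signal that changes between strongly connected tokens has higher energy, which is why the quantity is commonly read as a smoothness measure.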

Title: Tibetan Language and AI: A Comprehensive Survey of Resources, Methods and Challenges

Authors: Cheng Huang, Nyima Tashi, Fan Gao, Yutong Liu, Jiahao Li, Hao Tian, Siyang Jiang, Thupten Tsering, Ban Ma-bao, Renzeg Duojie, Gadeng Luosang, Rinchen Dongrub, Dorje Tashi, Jin Zhang, Xiao Feng, Hao Wang, Jie Tang, Guojie Tang, Xiangxiang Wang, Jia Zhang, Tsengdar Lee, Yongbin Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.19144
Pdf URL: https://arxiv.org/pdf/2510.19144
Copy Paste: [[2510.19144]] Tibetan Language and AI: A Comprehensive Survey of Resources, Methods and Challenges(https://arxiv.org/abs/2510.19144)
Keywords: llm
Abstract: Tibetan, one of the major low-resource languages in Asia, presents unique linguistic and sociocultural characteristics that pose both challenges and opportunities for AI research. Despite increasing interest in developing AI systems for underrepresented languages, Tibetan has received limited attention due to a lack of accessible data resources, standardized benchmarks, and dedicated tools. This paper provides a comprehensive survey of the current state of Tibetan AI, covering textual and speech data resources, NLP tasks, machine translation, speech recognition, and recent developments in LLMs. We systematically categorize existing datasets and tools, evaluate methods used across different tasks, and compare performance where possible. We also identify persistent bottlenecks such as data sparsity, orthographic variation, and the lack of unified evaluation metrics. Additionally, we discuss the potential of cross-lingual transfer, multi-modal learning, and community-driven resource creation. This survey aims to serve as a foundational reference for future work on Tibetan AI research and encourages collaborative efforts to build an inclusive and sustainable AI ecosystem for low-resource languages.
摘要：藏语是亚洲主要的资源匮乏语言之一，具有独特的语言和社会文化特征，这给人工智能研究带来了挑战和机遇。尽管人们对开发针对代表性不足的语言的人工智能系统越来越感兴趣，但由于缺乏可访问的数据资源、标准化基准和专用工具，藏语受到的关注有限。本文全面调查了人工智能领域藏族人工智能的现状，涵盖文本和语音数据资源、NLP 任务、机器翻译、语音识别以及法学硕士的最新发展。我们系统地对现有数据集和工具进行分类，评估不同任务中使用的方法，并尽可能比较性能。我们还发现了持续存在的瓶颈，例如数据稀疏、拼写变化以及缺乏统一的评估指标。此外，我们还讨论了跨语言迁移、多模式学习和社区驱动的资源创建的潜力。这项调查旨在为未来藏语人工智能研究工作提供基础参考，并鼓励合作努力为资源匮乏的语言建立一个包容性和可持续的人工智能生态系统。

Title: "You Are Rejected!": An Empirical Study of Large Language Models Taking Hiring Evaluations

Authors: Dingjie Fu, Dianxing Shi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.19167
Pdf URL: https://arxiv.org/pdf/2510.19167
Copy Paste: [[2510.19167]] "You Are Rejected!": An Empirical Study of Large Language Models Taking Hiring Evaluations(https://arxiv.org/abs/2510.19167)
Keywords: language model, llm
Abstract: With the proliferation of the internet and the rapid advancement of Artificial Intelligence, leading technology companies face an urgent annual demand for a considerable number of software and algorithm engineers. To efficiently and effectively identify high-potential candidates from thousands of applicants, these firms have established a multi-stage selection process, which crucially includes a standardized hiring evaluation designed to assess job-specific competencies. Motivated by the demonstrated prowess of Large Language Models (LLMs) in coding and reasoning tasks, this paper investigates a critical question: Can LLMs successfully pass these hiring evaluations? To this end, we conduct a comprehensive examination of a widely used professional assessment questionnaire. We employ state-of-the-art LLMs to generate responses and subsequently evaluate their performance. Contrary to any prior expectation of LLMs being ideal engineers, our analysis reveals a significant inconsistency between the model-generated answers and the company-referenced solutions. Our empirical findings lead to a striking conclusion: All evaluated LLMs fail to pass the hiring evaluation.
摘要：随着互联网的普及和人工智能的快速进步，领先的科技公司每年都面临着对相当数量的软件和算法工程师的迫切需求。为了从数千名申请人中高效地识别高潜力候选人，这些公司建立了多阶段的选拔流程，其中至关重要的是旨在评估特定工作能力的标准化招聘评估。受大型语言模型 (LLM) 在编码和推理任务中所展示的实力的启发，本文研究了一个关键问题：LLM 能否成功通过这些招聘评估？为此，我们对广泛使用的专业评估问卷进行了全面审查。我们聘请最先进的法学硕士来生成回复并随后评估他们的表现。与之前对法学硕士成为理想工程师的期望相反，我们的分析揭示了模型生成的答案与公司参考的解决方案之间存在显着的不一致。我们的实证研究结果得出了一个惊人的结论：所有经过评估的法学硕士都未能通过招聘评估。

Title: Think Straight, Stop Smart: Structured Reasoning for Efficient Multi-Hop RAG

Authors: Jihwan Bang, Juntae Lee, Seunghan Yang, Sungha Choi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.19171
Pdf URL: https://arxiv.org/pdf/2510.19171
Copy Paste: [[2510.19171]] Think Straight, Stop Smart: Structured Reasoning for Efficient Multi-Hop RAG(https://arxiv.org/abs/2510.19171)
Keywords: prompt, retrieval-augmented generation
Abstract: Multi-hop retrieval-augmented generation (RAG) is a promising strategy for complex reasoning, yet existing iterative prompting approaches remain inefficient. They often regenerate predictable token sequences at every step and rely on stochastic stopping, leading to excessive token usage and unstable termination. We propose TSSS (Think Straight, Stop Smart), a structured multi-hop RAG framework designed for efficiency. TSSS introduces (i) template-based reasoning that caches recurring prefixes and anchors sub-queries to the main question, reducing token generation cost while promoting stable reasoning, and (ii) a retriever-based terminator, which deterministically halts reasoning once additional sub-queries collapse into repetition. This separation of structured reasoning and termination control enables both faster inference and more reliable answers. On HotpotQA, 2WikiMultiHop, and MuSiQue, TSSS achieves state-of-the-art accuracy and competitive efficiency among RAG-CoT approaches, highlighting its effectiveness in efficiency-constrained scenarios such as on-device inference.
摘要：多跳检索增强生成（RAG）是一种有前景的复杂推理策略，但现有的迭代提示方法仍然效率低下。它们通常在每一步重新生成可预测的令牌序列，并依赖于随机停止，导致过度的令牌使用和不稳定的终止。我们提出 TSSS（Think Straight，Stop Smart），这是一种专为提高效率而设计的结构化多跳 RAG 框架。 TSSS 引入了 (i) 基于模板的推理，可缓存重复出现的前缀并将子查询锚定到主要问题，从而降低标记生成成本，同时促进稳定的推理；(ii) 基于检索器的终止符，一旦其他子查询陷入重复，它就会确定性地停止推理。结构化推理和终止控制的这种分离可以实现更快的推理和更可靠的答案。在 HotpotQA、2WikiMultiHop 和 MuSiQue 上，TSSS 在 RAG-CoT 方法中实现了最先进的准确性和竞争效率，突出了其在设备上推理等效率受限场景中的有效性。
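
The retriever-based terminator in this abstract can be pictured with a small sketch. The abstract does not give the actual stopping criterion, so the rule used here (stop as soon as a sub-query retrieves only documents that earlier hops already returned) is an assumption, as are the toy word-overlap retriever and the names `retrieve`, `should_stop`, and `run_hops`.

```python
def retrieve(query, corpus, k=1):
    """Toy retriever: rank documents by word overlap with the query."""
    q = set(query.lower().split())
    ranked = sorted(
        corpus,
        key=lambda doc_id: (-len(q & set(corpus[doc_id].lower().split())), doc_id),
    )
    return ranked[:k]


def should_stop(new_ids, seen_ids):
    """Assumed rule: halt when a hop brings back nothing new."""
    return set(new_ids) <= seen_ids


def run_hops(sub_queries, corpus, k=1):
    """Process sub-queries in order until the terminator fires."""
    seen, hops = set(), 0
    for sub_query in sub_queries:
        ids = retrieve(sub_query, corpus, k)
        if should_stop(ids, seen):
            break                      # deterministic stop, no sampling involved
        seen |= set(ids)
        hops += 1
    return hops, seen


if __name__ == "__main__":
    corpus = {                         # hypothetical three-document corpus
        "d1": "Paris is the capital of France",
        "d2": "France is in Europe",
        "d3": "Berlin is the capital of Germany",
    }
    queries = ["capital of France", "where is France in Europe",
               "France capital", "capital of Germany"]
    print(run_hops(queries, corpus))
```

Because the decision depends only on retrieval results and not on sampled tokens, the same inputs always stop at the same hop; in the toy run the third sub-query repeats earlier evidence, so the fourth is never issued.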

Title: When Facts Change: Probing LLMs on Evolving Knowledge with evolveQA

Authors: Nishanth Sridhar Nakshatri, Shamik Roy, Manoj Ghuhan Arivazhagan, Hanhan Zhou, Vinayshekhar Bannihatti Kumar, Rashmi Gangadharaiah
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.19172
Pdf URL: https://arxiv.org/pdf/2510.19172
Copy Paste: [[2510.19172]] When Facts Change: Probing LLMs on Evolving Knowledge with evolveQA(https://arxiv.org/abs/2510.19172)
Keywords: llm
Abstract: LLMs often fail to handle temporal knowledge conflicts--contradictions arising when facts evolve over time within their training data. Existing studies evaluate this phenomenon through benchmarks built on structured knowledge bases like Wikidata, but they focus on widely-covered, easily-memorized popular entities and lack the dynamic structure needed to fairly evaluate LLMs with different knowledge cut-off dates. We introduce evolveQA, a benchmark specifically designed to evaluate LLMs on temporally evolving knowledge, constructed from 3 real-world, time-stamped corpora: AWS updates, Azure changes, and WHO disease outbreak reports. Our framework identifies naturally occurring knowledge evolution and generates questions with gold answers tailored to different LLM knowledge cut-off dates. Through extensive evaluation of 12 open and closed-source LLMs across 3 knowledge probing formats, we demonstrate significant performance drops of up to 31% on evolveQA compared to static knowledge questions.
摘要：法学硕士通常无法处理暂时的知识冲突——当事实在训练数据中随着时间的推移而变化时就会出现矛盾。现有的研究通过建立在维基数据等结构化知识库上的基准来评估这种现象，但它们关注的是广泛覆盖、易于记忆的流行实体，缺乏公平评估具有不同知识截止日期的法学硕士所需的动态结构。我们引入了evolveQA，这是一个专门设计用于评估法学硕士关于暂时演变的知识的基准，由 3 个真实世界的带时间戳的语料库构建：AWS 更新、Azure 更改和 WHO 疾病爆发报告。我们的框架识别自然发生的知识演变，并生成针对不同法学硕士知识截止日期量身定制的黄金答案的问题。通过对 3 种知识探索格式的 12 个开源和闭源法学硕士进行广泛评估，我们证明，与静态知识问题相比，evolveQA 的性能显着下降高达 31%。
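
One way to picture "gold answers tailored to different LLM knowledge cut-off dates" is a lookup over a time-stamped fact history. The sketch below is an assumption about how such a lookup could work, not the benchmark's actual construction: the function name `gold_answer`, the timeline format, and the placeholder values are hypothetical.

```python
from bisect import bisect_right
from datetime import date


def gold_answer(timeline, cutoff):
    """Return the latest value that was in effect on or before `cutoff`.

    timeline: list of (effective_date, value) pairs sorted by date.
    Returns None when the fact did not exist yet at the cutoff.
    """
    dates = [d for d, _ in timeline]
    idx = bisect_right(dates, cutoff)
    return timeline[idx - 1][1] if idx else None


if __name__ == "__main__":
    # Hypothetical history of one evolving fact (values are placeholders).
    timeline = [
        (date(2022, 1, 1), "v1"),
        (date(2023, 6, 1), "v2"),
        (date(2024, 9, 1), "v3"),
    ]
    print(gold_answer(timeline, date(2023, 12, 31)))   # model with a late-2023 cutoff
    print(gold_answer(timeline, date(2024, 12, 31)))   # model with a late-2024 cutoff
```

Under this reading, two models with different cutoffs can both be graded fairly on the same question, because each is compared against the value it could have seen.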

Title: Interpretable Question Answering with Knowledge Graphs

Authors: Kartikeya Aneja, Manasvi Srivastava, Subhayan Das, Nagender Aneja
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.19181
Pdf URL: https://arxiv.org/pdf/2510.19181
Copy Paste: [[2510.19181]] Interpretable Question Answering with Knowledge Graphs(https://arxiv.org/abs/2510.19181)
Keywords: language model, gpt, llm, retrieval augmented generation
Abstract: This paper presents a question answering system that operates exclusively on knowledge graph retrieval without relying on retrieval augmented generation (RAG) with large language models (LLMs). Instead, a small paraphraser model is used to paraphrase the entity relationship edges retrieved from querying the knowledge graph. The proposed pipeline is divided into two main stages. The first stage involves pre-processing a document to generate sets of question-answer (QA) pairs. The second stage converts these QAs into a knowledge graph from which graph-based retrieval is performed using embeddings and fuzzy techniques. The graph is queried, re-ranked, and paraphrased to generate a final answer. This work includes an evaluation using LLM-as-a-judge on the CRAG benchmark, which resulted in accuracies of 71.9% and 54.4% using LLAMA-3.2 and GPT-3.5-Turbo, respectively.
摘要：本文提出了一种问答系统，该系统仅在知识图谱检索上运行，而不依赖于具有大型语言模型（LLM）的检索增强生成（RAG）。相反，使用一个小型释义器模型来释义从查询知识图谱中检索到的实体关系边。拟议的管道分为两个主要阶段。第一阶段涉及预处理文档以生成问答 (QA) 对集。第二阶段将这些 QA 转换为知识图，使用嵌入和模糊技术执行基于图的检索。该图经过查询、重新排名和解释以生成最终答案。这项工作包括使用 LLM 作为评判者对 CRAG 基准进行评估，使用 LLAMA-3.2 和 GPT-3.5-Turbo 的准确率分别为 71.9% 和 54.4%。

Title: Multi-Faceted Evaluation of Tool-Augmented Dialogue Systems

Authors: Zhaoyi Joey Hou, Tanya Shourya, Yingfan Wang, Shamik Roy, Vinayshekhar Bannihatti Kumar, Rashmi Gangadharaiah
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.19186
Pdf URL: https://arxiv.org/pdf/2510.19186
Copy Paste: [[2510.19186]] Multi-Faceted Evaluation of Tool-Augmented Dialogue Systems(https://arxiv.org/abs/2510.19186)
Keywords: agent
Abstract: Evaluating conversational AI systems that use external tools is challenging, as errors can arise from complex interactions among user, agent, and tools. While existing evaluation methods assess either user satisfaction or agents' tool-calling capabilities, they fail to capture critical errors in multi-turn tool-augmented dialogues, such as when agents misinterpret tool results yet appear satisfactory to users. We introduce TRACE, a benchmark of systematically synthesized tool-augmented conversations covering diverse error cases, and SCOPE, an evaluation framework that automatically discovers diverse error patterns and evaluation rubrics in tool-augmented dialogues. Experiments show SCOPE significantly outperforms the baseline, particularly on challenging cases where user satisfaction signals are misleading.
摘要：评估使用外部工具的对话式人工智能系统具有挑战性，因为用户、代理和工具之间的复杂交互可能会产生错误。虽然现有的评估方法评估用户满意度或代理的工具调用能力，但它们无法捕获多轮工具增强对话中的关键错误，例如代理误解工具结果但又让用户满意的情况。我们引入了 TRACE（涵盖各种错误情况的系统综合工具增强对话的基准）和 SCOPE（一个评估框架，可自动发现工具增强对话中的各种错误模式和评估规则）。实验表明 SCOPE 显着优于基线，特别是在用户满意度信号具有误导性的具有挑战性的情况下。

Title: DiSRouter: Distributed Self-Routing for LLM Selections

Authors: Hang Zheng, Hongshen Xu, Yongkai Lin, Shuai Fan, Lu Chen, Kai Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.19208
Pdf URL: https://arxiv.org/pdf/2510.19208
Copy Paste: [[2510.19208]] DiSRouter: Distributed Self-Routing for LLM Selections(https://arxiv.org/abs/2510.19208)
Keywords: language model, llm, agent
Abstract: The proliferation of Large Language Models (LLMs) has created a diverse ecosystem of models with highly varying performance and costs, necessitating effective query routing to balance performance and expense. Current routing systems often rely on a centralized external router trained on a fixed set of LLMs, making them inflexible and prone to poor performance because a small router cannot fully understand the knowledge boundaries of different LLMs. We introduce DiSRouter (Distributed Self-Router), a novel paradigm that shifts from centralized control to distributed routing. In DiSRouter, a query traverses a network of LLM agents, each independently deciding whether to answer or route to other agents based on its self-awareness, that is, its ability to judge its own competence. This distributed design offers superior flexibility, scalability, and generalizability. To enable this, we propose a two-stage Self-Awareness Training pipeline that enhances each LLM's self-awareness. Extensive experiments demonstrate that DiSRouter significantly outperforms existing routing methods in utility across various scenarios, effectively distinguishes between easy and hard queries, and shows strong generalization to out-of-domain tasks. Our work validates that leveraging an LLM's intrinsic self-awareness is more effective than external assessment, paving the way for more modular and efficient multi-agent systems.
摘要：大型语言模型 (LLM) 的激增创建了一个多样化的模型生态系统，其性能和成本差异很大，因此需要有效的查询路由来平衡性能和费用。当前的路由系统通常依赖于在一组固定的LLM上训练的集中式外部路由器，这使得它们不灵活并且容易出现性能差，因为小型路由器无法完全理解不同LLM的知识边界。我们引入了 DiSRouter（分布式自路由器），这是一种从集中式控制转向分布式路由的新颖范例。在 DiSRouter 中，查询遍历 LLM 代理网络，每个代理根据自己的自我意识和判断其能力的能力独立决定是回答还是路由到其他代理。这种分布式设计提供了卓越的灵活性、可扩展性和通用性。为了实现这一目标，我们提出了一个两阶段的自我意识培训渠道，以增强每个法学硕士的自我意识。大量实验表明，DiSRouter 在各种场景下的实用性显着优于现有路由方法，有效区分简单查询和困难查询，并对域外任务表现出很强的泛化能力。我们的工作证明，利用法学硕士内在的自我意识比外部评估更有效，为更加模块化和高效的多智能体系统铺平了道路。
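
The distributed routing idea in this abstract can be sketched as a chain of agents that each decide locally whether to answer or pass the query on. Everything below is a hypothetical illustration rather than DiSRouter itself: the `Agent` class, the fixed cheapest-to-most-expensive chain, the confidence threshold, and the word-count stand-in for a learned self-assessment are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Agent:
    name: str
    cost: float
    # Stand-in for the model's own judgment of whether it can answer the query.
    self_confidence: Callable[[str], float]


def route(query: str, agents: List[Agent], threshold: float = 0.5) -> Tuple[str, List[str]]:
    """Walk the chain; the first sufficiently confident agent answers.

    The last agent acts as a fallback and always answers.
    Returns the answering agent's name and the path the query took.
    """
    path = []
    for agent in agents[:-1]:
        path.append(agent.name)
        if agent.self_confidence(query) >= threshold:
            return agent.name, path
    path.append(agents[-1].name)
    return agents[-1].name, path


if __name__ == "__main__":
    # Toy self-assessment: shorter queries are treated as easier (assumption).
    agents = [
        Agent("small", 1.0, lambda q: 1.0 if len(q.split()) <= 5 else 0.0),
        Agent("medium", 5.0, lambda q: 1.0 if len(q.split()) <= 12 else 0.0),
        Agent("large", 20.0, lambda q: 1.0),
    ]
    print(route("What is two plus two", agents))
    print(route("Explain why the sky appears blue at noon", agents))
```

No central router appears anywhere in this loop: adding or removing an agent only changes the list, which is the flexibility argument the abstract makes.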

Title: SheetBrain: A Neuro-Symbolic Agent for Accurate Reasoning over Complex and Large Spreadsheets

Authors: Ziwei Wang, Jiayuan Su, Mengyu Zhou, Huaxing Zeng, Mengni Jia, Xiao Lv, Haoyu Dong, Xiaojun Ma, Shi Han, Dongmei Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.19247
Pdf URL: https://arxiv.org/pdf/2510.19247
Copy Paste: [[2510.19247]] SheetBrain: A Neuro-Symbolic Agent for Accurate Reasoning over Complex and Large Spreadsheets(https://arxiv.org/abs/2510.19247)
Keywords: language model, llm, agent
Abstract: Understanding and reasoning over complex spreadsheets remain fundamental challenges for large language models (LLMs), which often struggle with accurately capturing the complex structure of tables and ensuring reasoning correctness. In this work, we propose SheetBrain, a neuro-symbolic dual workflow agent framework designed for accurate reasoning over tabular data, supporting both spreadsheet question answering and manipulation tasks. SheetBrain comprises three core modules: an understanding module, which produces a comprehensive overview of the spreadsheet, including a sheet summary and query-based problem insight, to guide reasoning; an execution module, which integrates a Python sandbox with preloaded table-processing libraries and an Excel helper toolkit for effective multi-turn reasoning; and a validation module, which verifies the correctness of reasoning and answers, triggering re-execution when necessary. We evaluate SheetBrain on multiple public tabular QA and manipulation benchmarks, and introduce SheetBench, a new benchmark targeting large, multi-table, and structurally complex spreadsheets. Experimental results show that SheetBrain significantly improves accuracy on both existing benchmarks and the more challenging scenarios presented in SheetBench. Our code is publicly available at this https URL.
摘要：对复杂电子表格的理解和推理仍然是大型语言模型 (LLM) 的基本挑战，大型语言模型通常难以准确捕获表格的复杂结构并确保推理的正确性。在这项工作中，我们提出了 SheetBrain，这是一种神经符号双工作流代理框架，旨在对表格数据进行精确推理，支持电子表格问答和操作任务。 SheetBrain 包含三个核心模块：理解模块，生成电子表格的全面概述 - 包括工作表摘要和基于查询的问题洞察以指导推理；执行模块，它将 Python 沙箱与预加载的表处理库和 Excel 帮助工具包集成在一起，以实现有效的多轮推理；验证模块，验证推理和答案的正确性，必要时触发重新执行。我们在多个公共表格 QA 和操作基准上评估 SheetBrain，并引入 SheetBench，这是一个针对大型、多表和结构复杂的电子表格的新基准。实验结果表明，SheetBrain 显着提高了现有基准测试和 SheetBench 中呈现的更具挑战性的场景的准确性。我们的代码可通过此 https URL 公开获取。

Title: Difficulty-Controllable Multiple-Choice Question Generation Using Large Language Models and Direct Preference Optimization

Authors: Yuto Tomikawa, Masaki Uto
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.19265
Pdf URL: https://arxiv.org/pdf/2510.19265
Copy Paste: [[2510.19265]] Difficulty-Controllable Multiple-Choice Question Generation Using Large Language Models and Direct Preference Optimization(https://arxiv.org/abs/2510.19265)
Keywords: language model
Abstract: Difficulty-controllable question generation for reading comprehension has gained significant attention in the field of education as a fundamental tool for adaptive learning support. Although several neural question generation methods have recently succeeded in controlling difficulty, conventional approaches still face two major limitations. First, they cannot directly generate multiple-choice questions, which are the most widely used question type in educational contexts. Second, they are not explicitly trained to optimize the accuracy of difficulty control, leaving room for further improvement in difficulty controllability. To address these limitations, this study proposes a novel difficulty-controllable multiple-choice question generation method for reading comprehension which leverages a large language model trained using a direct preference optimization technique to improve the accuracy of difficulty control.
摘要：用于阅读理解的难度可控问题生成作为自适应学习支持的基本工具在教育领域受到了广泛关注。尽管最近几种神经问题生成方法在控制难度方面取得了成功，但传统方法仍然面临两个主要限制。首先，它们不能直接生成多项选择题，这是教育环境中最广泛使用的问题类型。其次，他们没有经过明确的训练来优化难度控制的准确性，为难度可控性的进一步提高留下了空间。为了解决这些局限性，本研究提出了一种新颖的难度可控的阅读理解多项选择问题生成方法，该方法利用使用直接偏好优化技术训练的大型语言模型来提高难度控制的准确性。

Title: TheMCPCompany: Creating General-purpose Agents with Task-specific Tools

Authors: Reza Esfandiarpoor, Vishwas Suryanarayanan, Stephen H. Bach, Vishal Chowdhary, Anthony Aue
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.19286
Pdf URL: https://arxiv.org/pdf/2510.19286
Copy Paste: [[2510.19286]] TheMCPCompany: Creating General-purpose Agents with Task-specific Tools(https://arxiv.org/abs/2510.19286)
Keywords: language model, gpt, llm, agent
Abstract: Since the introduction of the Model Context Protocol (MCP), the number of available tools for Large Language Models (LLMs) has increased significantly. These task-specific tool sets offer an alternative to general-purpose tools such as web browsers, while being easier to develop and maintain than GUIs. However, current general-purpose agents predominantly rely on web browsers for interacting with the environment. Here, we introduce TheMCPCompany, a benchmark for evaluating tool-calling agents on tasks that involve interacting with various real-world services. We use the REST APIs of these services to create MCP servers, which include over 18,000 tools. We also provide manually annotated ground-truth tools for each task. In our experiments, we use the ground-truth tools to show the potential of tool-calling agents for both improving performance and reducing costs assuming perfect tool retrieval. Next, we explore agent performance using tool retrieval to study the real-world practicality of tool-based agents. While all models with tool retrieval perform similarly to or better than browser-based agents, smaller models cannot take full advantage of the available tools through retrieval. On the other hand, GPT-5's performance with tool retrieval is very close to its performance with ground-truth tools. Overall, our work shows that the most advanced reasoning models are effective at discovering tools in simpler environments, but seriously struggle with navigating complex enterprise environments. TheMCPCompany reveals that navigating tens of thousands of tools and combining them in non-trivial ways to solve complex problems is still a challenging task for current models and requires both better reasoning and better retrieval models.
摘要：自从引入模型上下文协议 (MCP) 以来，大型语言模型 (LLM) 的可用工具数量显着增加。这些特定于任务的工具集提供了 Web 浏览器等通用工具的替代方案，同时比 GUI 更易于开发和维护。然而，当前的通用代理主要依赖网络浏览器与环境交互。在这里，我们介绍 TheMCPCompany，这是一个用于评估工具调用代理执行涉及与各种现实世界服务交互的任务的基准。我们使用这些服务的 REST API 来创建 MCP 服务器，其中包括超过 18,000 个工具。我们还为每项任务提供手动注释的真实工具。在我们的实验中，我们使用真实工具来展示工具调用代理在假设完美的工具检索的情况下提高性能和降低成本的潜力。接下来，我们使用工具检索来探索代理性能，以研究基于工具的代理的现实实用性。虽然所有具有工具检索的模型的性能与基于浏览器的代理相似或更好，但较小的模型无法通过检索充分利用可用的工具。另一方面，GPT-5 的工具检索性能与其使用真实工具的性能非常接近。总的来说，我们的工作表明，最先进的推理模型可以有效地在更简单的环境中发现工具，但在复杂的企业环境中导航却遇到了严重的困难。 TheMCPCompany 透露，对于当前模型来说，使用数以万计的工具并以不平凡的方式组合它们来解决复杂问题仍然是一项具有挑战性的任务，需要更好的推理和更好的检索模型。

Title: JointCQ: Improving Factual Hallucination Detection with Joint Claim and Query Generation

Authors: Fan Xu, Huixuan Zhang, Zhenliang Zhang, Jiahao Wang, Xiaojun Wan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.19310
Pdf URL: https://arxiv.org/pdf/2510.19310
Copy Paste: [[2510.19310]] JointCQ: Improving Factual Hallucination Detection with Joint Claim and Query Generation(https://arxiv.org/abs/2510.19310)
Keywords: language model, llm, hallucination
Abstract: Current large language models (LLMs) often suffer from hallucination issues, i.e., generating content that appears factual but is actually unreliable. A typical hallucination detection pipeline involves response decomposition (i.e., claim extraction), query generation, evidence collection (i.e., search or retrieval), and claim verification. However, existing methods exhibit limitations in the first two stages, such as context loss during claim extraction and low specificity in query generation, resulting in degraded performance across the hallucination detection pipeline. In this work, we introduce JointCQ (this https URL), a joint claim-and-query generation framework designed to construct an effective and efficient claim-query generator. Our framework leverages elaborately designed evaluation criteria to filter synthesized training data, and finetunes a language model for joint claim extraction and query generation, providing reliable and informative inputs for downstream search and verification. Experimental results demonstrate that our method outperforms previous methods on multiple open-domain QA hallucination detection benchmarks, advancing the goal of more trustworthy and transparent language model systems.
摘要：当前的大型语言模型（LLM）经常遭受幻觉问题，即生成看似真实但实际上不可靠的内容。典型的幻觉检测流程涉及响应分解（即声明提取）、查询生成、证据收集（即搜索或检索）和声明验证。然而，现有方法在前两个阶段表现出局限性，例如声明提取期间的上下文丢失和查询生成的特异性较低，导致整个幻觉检测管道的性能下降。在这项工作中，我们引入了 JointCQ 这个 https URL，这是一个联合声明和查询生成框架，旨在构建有效且高效的声明查询生成器。我们的框架利用精心设计的评估标准来过滤合成的训练数据，并微调用于联合声明提取和查询生成的语言模型，为下游搜索和验证提供可靠且信息丰富的输入。实验结果表明，我们的方法在多个开放域 QA 幻觉检测基准上优于以前的方法，推进了更值得信赖和透明的语言模型系统的目标。

Title: HAD: HAllucination Detection Language Models Based on a Comprehensive Hallucination Taxonomy

Authors: Fan Xu, Xinyu Hu, Zhenghan Yu, Li Lin, Xu Zhang, Yang Zhang, Wei Zhou, Jinjie Gu, Xiaojun Wan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.19318
Pdf URL: https://arxiv.org/pdf/2510.19318
Copy Paste: [[2510.19318]] HAD: HAllucination Detection Language Models Based on a Comprehensive Hallucination Taxonomy(https://arxiv.org/abs/2510.19318)
Keywords: language model, hallucination
Abstract: The increasing reliance on natural language generation (NLG) models, particularly large language models, has raised concerns about the reliability and accuracy of their outputs. A key challenge is hallucination, where models produce plausible but incorrect information. As a result, hallucination detection has become a critical task. In this work, we introduce a comprehensive hallucination taxonomy with 11 categories across various NLG tasks and propose the HAllucination Detection (HAD) models (this https URL), which integrate hallucination detection, span-level identification, and correction into a single inference process. Trained on an elaborate synthetic dataset of about 90K samples, our HAD models are versatile and can be applied to various NLG tasks. We also carefully annotate a test set for hallucination detection, called HADTest, which contains 2,248 samples. Evaluations on in-domain and out-of-domain test sets show that our HAD models generally outperform the existing baselines, achieving state-of-the-art results on HaluEval, FactCHD, and FaithBench, confirming their robustness and versatility.
摘要：对自然语言生成（NLG）模型，特别是大型语言模型的日益依赖，引起了人们对其输出的可靠性和准确性的担忧。一个关键的挑战是幻觉，即模型产生看似合理但不正确的信息。因此，幻觉检测已成为一项关键任务。在这项工作中，我们引入了跨各种 NLG 任务的 11 个类别的综合幻觉分类法，并提出了 HAllucination 检测 (HAD) 模型这个 https URL，它将幻觉检测、跨级识别和纠正集成到单个推理过程中。我们的 HAD 模型在包含约 90K 样本的精心合成数据集上进行训练，用途广泛，可应用于各种 NLG 任务。我们还仔细注释了一个用于幻觉检测的测试集，称为 HADTest，其中包含 2,248 个样本。对域内和域外测试集的评估表明，我们的 HAD 模型总体上优于现有基线，在 HaluEval、FactCHD 和 FaithBench 上取得了最先进的结果，证实了它们的稳健性和多功能性。

Title: Balancing Rewards in Text Summarization: Multi-Objective Reinforcement Learning via HyperVolume Optimization

Authors: Junjie Song, Yiwen Liu, Dapeng Li, Yin Sun, Shukun Fu, Siqi Chen, Yuji Cao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.19325
Pdf URL: https://arxiv.org/pdf/2510.19325
Copy Paste: [[2510.19325]] Balancing Rewards in Text Summarization: Multi-Objective Reinforcement Learning via HyperVolume Optimization(https://arxiv.org/abs/2510.19325)
Keywords: language model, gpt, llm
Abstract: Text summarization is a crucial task that requires the simultaneous optimization of multiple objectives, including consistency, coherence, relevance, and fluency, which presents considerable challenges. Although large language models (LLMs) have demonstrated remarkable performance, enhanced by reinforcement learning (RL), few studies have focused on optimizing the multi-objective problem of summarization through RL based on LLMs. In this paper, we introduce hypervolume optimization (HVO), a novel optimization strategy that dynamically adjusts the scores between groups during the reward process in RL by using the hypervolume method. This method guides the model's optimization to progressively approximate the Pareto front, thereby generating balanced summaries across multiple objectives. Experimental results on several representative summarization datasets demonstrate that our method outperforms group relative policy optimization (GRPO) in overall scores and shows more balanced performance across different dimensions. Moreover, a 7B foundation model enhanced by HVO performs comparably to GPT-4 in the summarization task, while maintaining a shorter generation length. Our code is publicly available at this https URL.
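
As a rough illustration of why a hypervolume-style score favours balanced summaries (this is not the authors' HVO implementation, whose details are not given in the abstract), the sketch below computes the hypervolume dominated by a single reward vector relative to a reference point. For one point under maximization this is simply the product of the per-objective margins. The function name, the reference point, and the four scores (standing in for consistency, coherence, relevance, and fluency) are all hypothetical.

```python
from math import prod


def hypervolume_single(scores, reference):
    """Hypervolume dominated by one point w.r.t. a reference point (maximization)."""
    margins = [s - r for s, r in zip(scores, reference)]
    # A point that does not beat the reference on every objective dominates no volume.
    if any(m <= 0 for m in margins):
        return 0.0
    return prod(margins)


# Hypothetical reward vectors with the same sum (2.4) but different balance.
reference = (0.0, 0.0, 0.0, 0.0)
balanced = (0.6, 0.6, 0.6, 0.6)
lopsided = (0.9, 0.9, 0.3, 0.3)

hv_balanced = hypervolume_single(balanced, reference)
hv_lopsided = hypervolume_single(lopsided, reference)
print(hv_balanced, hv_lopsided)
```

Both vectors would tie under a plain sum of rewards, whereas the hypervolume is larger for the balanced one, which is the intuition behind using it to steer a policy toward the Pareto front.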

Title: Slot Filling as a Reasoning Task for SpeechLLMs

Authors: Kadri Hacioglu, Manjunath K E, Andreas Stolcke
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.19326
Pdf URL: https://arxiv.org/pdf/2510.19326
Copy Paste: [[2510.19326]] Slot Filling as a Reasoning Task for SpeechLLMs(https://arxiv.org/abs/2510.19326)
Keywords: language model, llm, chain-of-thought
Abstract: We propose integration of reasoning into speech large language models (speechLLMs) for the end-to-end slot-filling task. Inspired by the recent development of reasoning LLMs, we use a chain-of-thought framework to decompose the slot-filling task into multiple reasoning steps, create a reasoning dataset and apply the supervised fine-tuning strategy to a speechLLM. We distinguish between regular and reasoning speechLLMs and experiment with different types and sizes of LLMs as their text foundation models. We demonstrate performance improvements by introducing reasoning (intermediate) steps. However, we show that a reasoning textual LLM developed mainly for math, logic and coding domains might be inferior as a foundation model for a reasoning speechLLM. We further show that hybrid speechLLMs, built on a hybrid text foundation LLM and fine-tuned to preserve both direct and reasoning modes of operation, have better performance than those fine-tuned employing only one mode of operation.

Title: Algorithmic Fairness in NLP: Persona-Infused LLMs for Human-Centric Hate Speech Detection

Authors: Ewelina Gajewska, Arda Derbent, Jaroslaw A Chudziak, Katarzyna Budzynska
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2510.19331
Pdf URL: https://arxiv.org/pdf/2510.19331
Copy Paste: [[2510.19331]] Algorithmic Fairness in NLP: Persona-Infused LLMs for Human-Centric Hate Speech Detection(https://arxiv.org/abs/2510.19331)
Keywords: language model, gpt, llm, prompt, retrieval-augmented generation
Abstract: In this paper, we investigate how personalising Large Language Models (Persona-LLMs) with annotator personas affects their sensitivity to hate speech, particularly regarding biases linked to shared or differing identities between annotators and targets. To this end, we employ Google's Gemini and OpenAI's GPT-4.1-mini models and two persona-prompting methods: shallow persona prompting and a deeply contextualised persona development based on Retrieval-Augmented Generation (RAG) to incorporate richer persona profiles. We analyse the impact of using in-group and out-group annotator personas on the models' detection performance and fairness across diverse social groups. This work bridges psychological insights on group identity with advanced NLP techniques, demonstrating that incorporating socio-demographic attributes into LLMs can address bias in automated hate speech detection. Our results highlight both the potential and limitations of persona-based approaches in reducing bias, offering valuable insights for developing more equitable hate speech detection systems.

Title: Local Obfuscation by GLINER for Impartial Context Aware Lineage: Development and evaluation of PII Removal system

Authors: Prakrithi Shivaprakash, Lekhansh Shukla, Animesh Mukherjee, Prabhat Chand, Pratima Murthy
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.19346
Pdf URL: https://arxiv.org/pdf/2510.19346
Copy Paste: [[2510.19346]] Local Obfuscation by GLINER for Impartial Context Aware Lineage: Development and evaluation of PII Removal system(https://arxiv.org/abs/2510.19346)
Keywords: language model, llm, prompt
Abstract: Removing Personally Identifiable Information (PII) from clinical notes in Electronic Health Records (EHRs) is essential for research and AI development. While Large Language Models (LLMs) are powerful, their high computational costs and the data privacy risks of API-based services limit their use, especially in low-resource settings. To address this, we developed LOGICAL (Local Obfuscation by GLINER for Impartial Context-Aware Lineage), an efficient, locally deployable PII removal system built on a fine-tuned Generalist and Lightweight Named Entity Recognition (GLiNER) model. We used 1515 clinical documents from a psychiatric hospital's EHR system. We defined nine PII categories for removal. A modern-gliner-bi-large-v1.0 model was fine-tuned on 2849 text instances and evaluated on a test set of 376 instances using character-level precision, recall, and F1-score. We compared its performance against Microsoft Azure NER, Microsoft Presidio, and zero-shot prompting with Gemini-Pro-2.5 and Llama-3.3-70B-Instruct. The fine-tuned GLiNER model achieved superior performance, with an overall micro-average F1-score of 0.980, significantly outperforming Gemini-Pro-2.5 (F1-score: 0.845). LOGICAL correctly sanitised 95% of documents completely, compared to 64% for the next-best solution. The model operated efficiently on a standard laptop without a dedicated GPU. However, a 2% entity-level false negative rate underscores the need for human-in-the-loop validation across all tested systems. Fine-tuned, specialised transformer models like GLiNER offer an accurate, computationally efficient, and secure solution for PII removal from clinical notes. This "sanitisation at the source" approach is a practical alternative to resource-intensive LLMs, enabling the creation of de-identified datasets for research and AI development while preserving data privacy, particularly in resource-constrained environments.
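
The LOGICAL abstract reports character-level precision, recall, and F1. A minimal sketch of how such metrics can be computed from span annotations follows; it assumes spans are half-open `(start, end)` character offsets and that entity labels are ignored, neither of which is stated in the abstract. The function names and the example offsets are hypothetical.

```python
def char_set(spans):
    """Expand half-open (start, end) spans into the set of covered character offsets."""
    chars = set()
    for start, end in spans:
        chars.update(range(start, end))
    return chars


def char_prf(gold_spans, pred_spans):
    """Character-level precision, recall and F1 between gold and predicted spans."""
    gold, pred = char_set(gold_spans), char_set(pred_spans)
    true_pos = len(gold & pred)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical annotations: the second prediction is shifted by two characters.
gold_spans = [(0, 10), (20, 30)]
pred_spans = [(0, 10), (22, 32)]
precision, recall, f1 = char_prf(gold_spans, pred_spans)
print(precision, recall, f1)
```

Scoring at the character level gives partial credit for a span that overlaps the gold annotation but misses its exact boundaries, which entity-level exact matching would count as a full error.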

Title: M3-SLU: Evaluating Speaker-Attributed Reasoning in Multimodal Large Language Models

Authors: Yejin Kwon, Taewoo Kang, Hyunsoo Yoon, Changouk Kim
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.19358
Pdf URL: https://arxiv.org/pdf/2510.19358
Copy Paste: [[2510.19358]] M3-SLU: Evaluating Speaker-Attributed Reasoning in Multimodal Large Language Models(https://arxiv.org/abs/2510.19358)
Keywords: language model, llm
Abstract: We present M3-SLU, a new multimodal large language model (MLLM) benchmark for evaluating multi-speaker, multi-turn spoken language understanding. While recent models show strong performance in speech and text comprehension, they still struggle with speaker-attributed reasoning, the ability to understand who said what and when in natural conversations. M3-SLU is built from four open corpora (CHiME-6, MELD, MultiDialog, and AMI) and comprises over 12,000 validated instances with paired audio, transcripts, and metadata. It includes two tasks: (1) Speaker-Attributed Question Answering and (2) Speaker Attribution via Utterance Matching. We provide baseline results for both cascaded pipelines and end-to-end MLLMs, evaluated using an LLM-as-Judge and accuracy metrics. Results show that while models can capture what was said, they often fail to identify who said it, revealing a key gap in speaker-aware dialogue understanding. M3-SLU serves as a challenging benchmark to advance research in speaker-aware multimodal understanding.

Title: AgenticMath: Enhancing LLM Reasoning via Agentic-based Math Data Generation

Authors: Xianyang Liu, Yilin Liu, Shuai Wang, Hao Cheng, Andrew Estornell, Yuzhi Zhao, Jiaheng Wei
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.19361
Pdf URL: https://arxiv.org/pdf/2510.19361
Copy Paste: [[2510.19361]] AgenticMath: Enhancing LLM Reasoning via Agentic-based Math Data Generation(https://arxiv.org/abs/2510.19361)
Keywords: language model, llm, chain-of-thought, agent
Abstract: The creation of high-quality datasets to improve Large Language Model (LLM) reasoning remains a significant challenge, as current methods often suffer from generating low-quality/incorrect answers and limited information richness from available data sources. To address this, we propose AgenticMath, a novel agentic pipeline for generating high-quality mathematical question-answer pairs to enhance the supervised fine-tuning of LLMs. Our method operates through four stages: (1) a Seed Question Filter that selects questions with high information richness, complexity, and clarity; (2) an Agentic Question Rephrase step that employs a multi-agent system to generate diverse, logically consistent paraphrases; (3) an Answer Augment step that rewrites answers using chain-of-thought reasoning to enhance numerical and logical correctness, without reliance on human-provided labels; and (4) a final Question and Answer Evaluation that retains only the most superior pairs. Extensive experiments demonstrate that fine-tuning 3B-8B parameter LLMs on AgenticMath-generated datasets (comprising only 30-60K math samples) achieves competitive or superior performance on diverse in-domain and out-of-domain mathematical reasoning benchmarks compared to baselines trained on much more data (e.g., 400K or 2.3M samples). Our work demonstrates that targeted, high-quality data generation is a more efficient path to improving mathematical reasoning in LLMs than large-scale, low-quality alternatives.

Title: LoongRL: Reinforcement Learning for Advanced Reasoning over Long Contexts

Authors: Siyuan Wang, Gaokai Zhang, Li Lyna Zhang, Ning Shang, Fan Yang, Dongyao Chen, Mao Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.19363
Pdf URL: https://arxiv.org/pdf/2510.19363
Copy Paste: [[2510.19363]] LoongRL: Reinforcement Learning for Advanced Reasoning over Long Contexts(https://arxiv.org/abs/2510.19363)
Keywords: language model, long context, chain-of-thought
Abstract: Reasoning over long contexts is essential for large language models. While reinforcement learning (RL) enhances short-context reasoning by inducing "Aha" moments in chain-of-thought, the advanced thinking patterns required for long-context reasoning remain largely unexplored, and high-difficulty RL data are scarce. In this paper, we introduce LoongRL, a data-driven RL method for advanced long-context reasoning. Central to LoongRL is KeyChain, a synthesis approach that transforms short multi-hop QA into high-difficulty long-context tasks by inserting UUID chains that hide the true question among large collections of distracting documents. Solving these tasks requires the model to trace the correct chain step-by-step, identify the true question, retrieve relevant facts and reason over them to answer correctly. RL training on KeyChain data induces an emergent plan-retrieve-reason-recheck reasoning pattern that generalizes far beyond training length. Models trained at 16K effectively solve 128K tasks without prohibitive full-length RL rollout costs. On Qwen2.5-7B and 14B, LoongRL substantially improves long-context multi-hop QA accuracy, with absolute gains of +23.5% and +21.1%, respectively. The resulting LoongRL-14B reaches a score of 74.2, rivaling much larger frontier models such as o3-mini (74.5) and DeepSeek-R1 (74.9). It also improves long-context retrieval, passes all 128K needle-in-a-haystack stress tests, and preserves short-context reasoning capabilities.
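
To make the KeyChain idea in the LoongRL abstract more concrete, here is a small sketch of hiding a question behind a chain of UUID keys that is mixed with distractor chains. It is only an assumption about what such a construction could look like; the abstract does not describe the actual data format, and every name, chain length, and question string below is hypothetical.

```python
import random
import uuid


def new_key(rng):
    """UUID4-shaped key drawn from a seeded generator, so the example is reproducible."""
    return str(uuid.UUID(int=rng.getrandbits(128), version=4))


def make_chain(final_value, length, rng):
    """Return (start_key, table); following the keys in table ends at final_value."""
    keys = [new_key(rng) for _ in range(length)]
    table = {keys[i]: keys[i + 1] for i in range(length - 1)}
    table[keys[-1]] = final_value
    return keys[0], table


def follow_chain(start_key, table):
    """Trace key -> value links until the value is no longer a key."""
    value = table[start_key]
    while value in table:
        value = table[value]
    return value


rng = random.Random(0)
true_question = "Which city hosted the event?"  # hypothetical hidden question
start_key, table = make_chain(true_question, length=4, rng=rng)

# Mix in distractor chains that end in other questions, then shuffle the entries.
for i in range(3):
    _, distractor = make_chain(f"Distractor question {i}?", length=4, rng=rng)
    table.update(distractor)
items = list(table.items())
rng.shuffle(items)
haystack = dict(items)

recovered = follow_chain(start_key, haystack)
print(recovered)
```

In a real long-context task the key-value pairs would be scattered through many documents, so the model has to trace the chain step by step before it even knows which question to answer.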

Title: ToMMeR -- Efficient Entity Mention Detection from Large Language Models

Authors: Victor Morand, Nadi Tomeh, Josiane Mothe, Benjamin Piwowarski
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.19410
Pdf URL: https://arxiv.org/pdf/2510.19410
Copy Paste: [[2510.19410]] ToMMeR -- Efficient Entity Mention Detection from Large Language Models(https://arxiv.org/abs/2510.19410)
Keywords: language model, llm
Abstract: Identifying which text spans refer to entities -- mention detection -- is both foundational for information extraction and a known performance bottleneck. We introduce ToMMeR, a lightweight model (<300K parameters) probing mention detection capabilities from early LLM layers. Across 13 NER benchmarks, ToMMeR achieves 93% recall zero-shot, with over 90% precision using an LLM as a judge, showing that ToMMeR rarely produces spurious predictions despite high recall. Cross-model analysis reveals that diverse architectures (14M-15B parameters) converge on similar mention boundaries (DICE >75%), confirming that mention detection emerges naturally from language modeling. When extended with span classification heads, ToMMeR achieves near SOTA NER performance (80-87% F1 on standard benchmarks). Our work provides evidence that structured entity representations exist in early transformer layers and can be efficiently recovered with minimal parameters.
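
The ToMMeR abstract compares mention boundaries across models with a DICE score. The sketch below shows the standard Dice coefficient over two sets of predicted spans, assuming exact `(start, end)` matches; how the paper actually matches spans is not stated in the abstract, and the spans here are hypothetical.

```python
def dice(spans_a, spans_b):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) over two collections of (start, end) spans."""
    a, b = set(spans_a), set(spans_b)
    if not a and not b:
        return 1.0  # two empty predictions agree trivially
    return 2 * len(a & b) / (len(a) + len(b))


# Hypothetical mention spans predicted by two different models on the same text.
model_a = [(0, 5), (10, 14), (20, 28)]
model_b = [(0, 5), (10, 14), (30, 35)]
agreement = dice(model_a, model_b)
print(round(agreement, 4))
```

Two of the three spans coincide, so the score is 2 * 2 / (3 + 3), roughly 0.67.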

Title: BLiSS 1.0: Evaluating Bilingual Learner Competence in Second Language Small Language Models

Authors: Yuan Gao, Suchir Salhan, Andrew Caines, Paula Buttery, Weiwei Sun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.19419
Pdf URL: https://arxiv.org/pdf/2510.19419
Copy Paste: [[2510.19419]] BLiSS 1.0: Evaluating Bilingual Learner Competence in Second Language Small Language Models(https://arxiv.org/abs/2510.19419)
Keywords: language model
Abstract: To bridge the gap between performance-oriented benchmarks and the evaluation of cognitively inspired models, we introduce BLiSS 1.0, a Benchmark of Learner Interlingual Syntactic Structure. Our benchmark operationalizes a new paradigm of selective tolerance, testing whether a model finds a naturalistic learner error more plausible than a matched, artificial error within the same sentence. Constructed from over 2.8 million naturalistic learner sentences, BLiSS provides 136,867 controlled triplets (corrected, learner, artificial) for this purpose. Experiments on a diverse suite of models demonstrate that selective tolerance is a distinct capability from standard grammaticality, with performance clustering strongly by training paradigm. This validates BLiSS as a robust tool for measuring how different training objectives impact a model's alignment with the systematic patterns of human language acquisition.
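
The selective-tolerance test described in the BLiSS abstract can be sketched as a comparison of model scores within each triplet. The code below assumes a scoring function that returns a higher value for a more plausible sentence (for example, a sentence log-probability); the lookup table stands in for a real language model, and the sentences and scores are made up for illustration.

```python
def selective_tolerance(triplets, score):
    """Share of triplets where the learner error scores higher than the artificial error."""
    hits = sum(1 for t in triplets if score(t["learner"]) > score(t["artificial"]))
    return hits / len(triplets)


# Hypothetical triplets (corrected, learner, artificial) and toy log-probabilities.
triplets = [
    {"corrected": "She goes to school.",
     "learner": "She go to school.",
     "artificial": "She to goes school."},
    {"corrected": "I have two cats.",
     "learner": "I have two cat.",
     "artificial": "I have cats two."},
]
toy_logprob = {
    "She go to school.": -12.0,
    "She to goes school.": -20.0,
    "I have two cat.": -15.0,
    "I have cats two.": -14.0,
}

rate = selective_tolerance(triplets, toy_logprob.__getitem__)
print(rate)
```

With a real model, `score` would be replaced by a function that sums token log-probabilities for the sentence.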

Title: VideoAgentTrek: Computer Use Pretraining from Unlabeled Videos

Authors: Dunjie Lu, Yiheng Xu, Junli Wang, Haoyuan Wu, Xinyuan Wang, Zekun Wang, Junlin Yang, Hongjin Su, Jixuan Chen, Junda Chen, Yuchen Mao, Jingren Zhou, Junyang Lin, Binyuan Hui, Tao Yu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.19488
Pdf URL: https://arxiv.org/pdf/2510.19488
Copy Paste: [[2510.19488]] VideoAgentTrek: Computer Use Pretraining from Unlabeled Videos(https://arxiv.org/abs/2510.19488)
Keywords: agent
Abstract: Training computer-use agents requires massive amounts of GUI interaction data, but manually annotating action trajectories at scale is prohibitively expensive. We present VideoAgentTrek, a scalable pipeline that automatically mines training data from publicly available screen-recorded videos at web scale, eliminating the need for manual annotation. Our approach addresses a key challenge: raw videos contain implicit demonstrations but lack explicit action labels. To solve this, we develop Video2Action, an inverse dynamics module (IDM) with two components: (1) a video grounding model that detects and localizes GUI actions with precise temporal boundaries and context, and (2) an action-content recognizer that extracts structured parameters like click coordinates and typed text with high fidelity. Applied to 39,000 YouTube tutorial videos, our pipeline generates 1.52 million interaction steps automatically. We leverage this data through continued pretraining followed by supervised fine-tuning. On OSWorld-Verified, our approach improves task success rates from 9.3% (SFT-only baseline) to 15.8%, a 70% relative improvement. On AgentNetBench, step accuracy increases from 64.1% to 69.3%. Our results demonstrate that passive internet videos can be transformed into high-quality supervision for computer-use agents, providing a scalable alternative to expensive manual annotation.

Title: Machine Text Detectors are Membership Inference Attacks

Authors: Ryuto Koike, Liam Dugan, Masahiro Kaneko, Chris Callison-Burch, Naoaki Okazaki
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.19492
Pdf URL: https://arxiv.org/pdf/2510.19492
Copy Paste: [[2510.19492]] Machine Text Detectors are Membership Inference Attacks(https://arxiv.org/abs/2510.19492)
Keywords: language model
Abstract: Although membership inference attacks (MIAs) and machine-generated text detection target different goals, identifying training samples and synthetic texts, their methods often exploit similar signals based on a language model's probability distribution. Despite this shared methodological foundation, the two tasks have been independently studied, which may lead to conclusions that overlook stronger methods and valuable insights developed in the other task. In this work, we theoretically and empirically investigate the transferability, i.e., how well a method originally developed for one task performs on the other, between MIAs and machine text detection. For our theoretical contribution, we prove that the metric that achieves the asymptotically highest performance on both tasks is the same. We unify a large proportion of the existing literature in the context of this optimal metric and hypothesize that the accuracy with which a given method approximates this metric is directly correlated with its transferability. Our large-scale empirical experiments, including 7 state-of-the-art MIA methods and 5 state-of-the-art machine text detectors across 13 domains and 10 generators, demonstrate very strong rank correlation (rho > 0.6) in cross-task performance. We notably find that Binoculars, originally designed for machine text detection, achieves state-of-the-art performance on MIA benchmarks as well, demonstrating the practical impact of the transferability. Our findings highlight the need for greater cross-task awareness and collaboration between the two research communities. To facilitate cross-task developments and fair evaluations, we introduce MINT, a unified evaluation suite for MIAs and machine-generated text detection, with implementations of 15 recent methods from both tasks.
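
The cross-task claim in this abstract rests on a rank correlation (rho) between how methods perform on the two tasks. A minimal sketch of Spearman's rho for tie-free data is given below, using the closed form 1 - 6 * sum(d^2) / (n * (n^2 - 1)). The five pairs of scores are invented for illustration and are not taken from the paper.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n (largest); assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(xs, ys):
    """Spearman rank correlation for two equally long, tie-free sequences."""
    n = len(xs)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Hypothetical scores of five methods on each task (same method order in both lists).
mia_scores = [0.61, 0.70, 0.55, 0.80, 0.66]
detection_scores = [0.72, 0.90, 0.60, 0.85, 0.70]
rho = spearman_rho(mia_scores, detection_scores)
print(rho)
```

A rho near 1 means that methods which rank highly on one task also tend to rank highly on the other, which is what transferability would predict.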

Title: What is the Best Sequence Length for BABYLM?

Authors: Suchir Salhan, Richard Diehl Martinez, Zébulon Goriely, Paula Buttery
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.19493
Pdf URL: https://arxiv.org/pdf/2510.19493
Copy Paste: [[2510.19493]] What is the Best Sequence Length for BABYLM?(https://arxiv.org/abs/2510.19493)
Keywords: language model
Abstract: Transformer language models typically operate with a fixed-length context window, which has grown in step with large-scale pretraining datasets. In the BabyLM Challenge, however, many past submissions have defaulted to using much shorter sequence lengths. We examine the impact of sequence length on BabyLM pretraining, to answer the simple question: what sequence length should we be using when training Baby LMs? Using 100M-word training data and fixed compute budgets, we compare 125M-parameter Mamba and OPT models, finding that although longer is often better, the optimal length depends on both task and architecture. Shorter sequences are sufficient for grammatical generalization tasks whereas longer contexts benefit morphological analogical reasoning tasks.

Title: Lookahead Routing for Large Language Models

Authors: Canbin Huang, Tianyuan Shi, Yuhua Zhu, Ruijun Chen, Xiaojun Quan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.19506
Pdf URL: https://arxiv.org/pdf/2510.19506
Copy Paste: [[2510.19506]] Lookahead Routing for Large Language Models(https://arxiv.org/abs/2510.19506)
Keywords: language model, llm
Abstract: Large language model (LLM) routers improve the efficiency of multi-model systems by directing each query to the most appropriate model while leveraging the diverse strengths of heterogeneous LLMs. Most existing approaches frame routing as a classification problem based solely on the input query. While this reduces overhead by avoiding inference across all models, it overlooks valuable information that could be gleaned from potential outputs and fails to capture implicit intent or contextual nuances that often emerge only during response generation. These limitations can result in suboptimal routing decisions, particularly for complex or ambiguous queries that require deeper semantic understanding. To address this challenge, we propose Lookahead, a routing framework that "foresees" potential model outputs by predicting their latent representations and uses these predictions to guide model selection, thus enabling more informed routing without full inference. Within this framework, we implement two approaches based on causal and masked language models. Empirical evaluations across seven public benchmarks - spanning instruction following, mathematical reasoning, and code generation - show that Lookahead consistently outperforms existing routing baselines, achieving an average performance gain of 7.7% over the state-of-the-art. Our code is available at this https URL.

Title: Detecting Latin in Historical Books with Large Language Models: A Multimodal Benchmark

Authors: Yu Wu, Ke Shu, Jonas Fischer, Lidia Pivovarova, David Rosson, Eetu Mäkelä, Mikko Tolonen
Subjects: cs.CL, cs.AI, cs.CV, cs.DL
Abstract URL: https://arxiv.org/abs/2510.19585
Pdf URL: https://arxiv.org/pdf/2510.19585
Copy Paste: [[2510.19585]] Detecting Latin in Historical Books with Large Language Models: A Multimodal Benchmark(https://arxiv.org/abs/2510.19585)
Keywords: language model
Abstract: This paper presents a novel task of extracting Latin fragments from mixed-language historical documents with varied layouts. We benchmark and evaluate the performance of large foundation models against a multimodal dataset of 724 annotated pages. The results demonstrate that reliable Latin detection with contemporary models is achievable. Our study provides the first comprehensive analysis of these models' capabilities and limits for this task.

Title: PBBQ: A Persian Bias Benchmark Dataset Curated with Human-AI Collaboration for Large Language Models

Authors: Farhan Farsi, Shayan Bali, Fatemeh Valeh, Parsa Ghofrani, Alireza Pakniat, Kian Kashfipour, Amir H. Payberah
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.19616
Pdf URL: https://arxiv.org/pdf/2510.19616
Copy Paste: [[2510.19616]] PBBQ: A Persian Bias Benchmark Dataset Curated with Human-AI Collaboration for Large Language Models(https://arxiv.org/abs/2510.19616)
Keywords: language model, llm
Abstract: With the increasing adoption of large language models (LLMs), ensuring their alignment with social norms has become a critical concern. While prior research has examined bias detection in various languages, there remains a significant gap in resources addressing social biases within Persian cultural contexts. In this work, we introduce PBBQ, a comprehensive benchmark dataset designed to evaluate social biases in Persian LLMs. Our benchmark, which encompasses 16 cultural categories, was developed through questionnaires completed by 250 diverse individuals across multiple demographics, in close collaboration with social science experts to ensure its validity. The resulting PBBQ dataset contains over 37,000 carefully curated questions, providing a foundation for the evaluation and mitigation of bias in Persian language models. We benchmark several open-source LLMs, a closed-source model, and Persian-specific fine-tuned models on PBBQ. Our findings reveal that current LLMs exhibit significant social biases across Persian culture. Additionally, by comparing model outputs to human responses, we observe that LLMs often replicate human bias patterns, highlighting the complex interplay between learned representations and cultural [...]. Upon acceptance of the paper, our PBBQ dataset will be publicly available for use in future work. Content warning: This paper contains unsafe content.

Title: CrossNews-UA: A Cross-lingual News Semantic Similarity Benchmark for Ukrainian, Polish, Russian, and English

Authors: Daryna Dementieva, Evgeniya Sukhodolskaya, Alexander Fraser
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.19628
Pdf URL: https://arxiv.org/pdf/2510.19628
Copy Paste: [[2510.19628]] CrossNews-UA: A Cross-lingual News Semantic Similarity Benchmark for Ukrainian, Polish, Russian, and English(https://arxiv.org/abs/2510.19628)
Keywords: language model, llm
Abstract: In the era of social networks and rapid misinformation spread, news analysis remains a critical task. Detecting fake news across multiple languages, particularly beyond English, poses significant challenges. Cross-lingual news comparison offers a promising approach to verify information by leveraging external sources in different languages (Chen and Shu, 2024). However, existing datasets for cross-lingual news analysis (Chen et al., 2022a) were manually curated by journalists and experts, limiting their scalability and adaptability to new languages. In this work, we address this gap by introducing a scalable, explainable crowdsourcing pipeline for cross-lingual news similarity assessment. Using this pipeline, we collected a novel dataset, CrossNews-UA, of news pairs with Ukrainian as the central language, paired with linguistically and contextually relevant languages: Polish, Russian, and English. Each news pair is annotated for semantic similarity with detailed justifications based on the 4W criteria (Who, What, Where, When). We further tested a range of models, from traditional bag-of-words and Transformer-based architectures to large language models (LLMs). Our results highlight the challenges in multilingual news analysis and offer insights into model performance.

Title: Style Attack Disguise: When Fonts Become a Camouflage for Adversarial Intent

Authors: Yangshijie Zhang, Xinda Wang, Jialin Liu, Wenqiang Wang, Zhicong Ma, Xingxing Jia
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.19641
Pdf URL: https://arxiv.org/pdf/2510.19641
Copy Paste: [[2510.19641]] Style Attack Disguise: When Fonts Become a Camouflage for Adversarial Intent(https://arxiv.org/abs/2510.19641)
Keywords: llm
Abstract: With social media growth, users employ stylistic fonts and font-like emoji to express individuality, creating visually appealing text that remains human-readable. However, these fonts introduce hidden vulnerabilities in NLP models: while humans easily read stylistic text, models process these characters as distinct tokens, causing interference. We identify this human-model perception gap and propose a style-based attack, Style Attack Disguise (SAD). We design two sizes: light for query efficiency and strong for superior attack performance. Experiments on sentiment classification and machine translation across traditional models, LLMs, and commercial services demonstrate SAD's strong attack performance. We also show SAD's potential threats to multimodal tasks including text-to-image and text-to-speech generation.
摘要：随着社交媒体的发展，用户使用风格字体和类似字体的表情符号来表达个性，创建具有视觉吸引力且仍可供人类阅读的文本。然而，这些字体在 NLP 模型中引入了隐藏的漏洞：虽然人类很容易阅读风格文本，但模型将这些字符处理为不同的标记，从而造成干扰。我们识别了这种人类模型的感知差距，并提出了一种基于风格的攻击，即风格攻击伪装（SAD）。我们设计了两种大小：轻量级用于查询效率，强量级用于卓越的攻击性能。跨传统模型、LLM 和商业服务的情感分类和机器翻译实验证明了 SAD 强大的攻击性能。我们还展示了 SAD 对多模态任务（包括文本到图像和文本到语音生成）的潜在威胁。

Title: LLavaCode: Compressed Code Representations for Retrieval-Augmented Code Generation

Authors: Daria Cherniuk, Nikita Sukhorukov, Nikita Sushko, Daniil Gusak, Danil Sivtsov, Elena Tutubalina, Evgeny Frolov
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.19644
Pdf URL: https://arxiv.org/pdf/2510.19644
Copy Paste: [[2510.19644]] LLavaCode: Compressed Code Representations for Retrieval-Augmented Code Generation(https://arxiv.org/abs/2510.19644)
Keywords: llm, retrieval-augmented generation
Abstract: Retrieval-augmented generation has emerged as one of the most effective approaches for code completion, particularly when context from a surrounding repository is essential. However, incorporating context significantly extends sequence length, leading to slower inference, a critical limitation for interactive settings such as IDEs. In this work, we introduce LlavaCode, a framework that compresses code into compact, semantically rich representations interpretable by a code LLM, enhancing generation quality while reducing the retrieved context to only a few compressed single-token vectors. Using a small projector module, we can significantly increase the EM and ES metrics of a coding model with a negligible latency increase. Our experiments demonstrate that compressed context enables a 20-38% reduction in Time-to-First-Token (TTFT) on line completion tasks compared to full-RAG pipelines.
摘要：检索增强生成已成为代码完成的最有效方法之一，特别是当来自周围存储库的上下文至关重要时。然而，合并上下文会显着延长序列长度，导致推理速度变慢——这是 IDE 等交互式设置的一个关键限制。在这项工作中，我们引入了 LlavaCode，这是一个框架，可将代码压缩为可由代码 LLM 解释的紧凑、语义丰富的表示形式，从而提高生成质量，同时将检索到的上下文减少为仅几个压缩的单标记向量。使用小型投影仪模块，我们可以显着提高编码模型的 EM 和 ES 指标，而延迟的增加可以忽略不计。我们的实验表明，与完整的 RAG 管道相比，压缩上下文可以使在线完成任务的首次令牌时间 (TTFT) 减少 20-38%。

Title: Unraveling Emotions with Pre-Trained Models

Authors: Alejandro Pajón-Sanmartín, Francisco De Arriba-Pérez, Silvia García-Méndez, Fátima Leal, Benedita Malheiro, Juan Carlos Burguillo-Rial
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.19668
Pdf URL: https://arxiv.org/pdf/2510.19668
Copy Paste: [[2510.19668]] Unraveling Emotions with Pre-Trained Models(https://arxiv.org/abs/2510.19668)
Keywords: language model, llm, prompt
Abstract: Transformer models have significantly advanced the field of emotion recognition. However, there are still open challenges when exploring open-ended queries for Large Language Models (LLMs). Although current models offer good results, automatic emotion analysis in open texts presents significant challenges, such as contextual ambiguity, linguistic variability, and difficulty interpreting complex emotional expressions. These limitations make the direct application of generalist models difficult. Accordingly, this work compares the effectiveness of fine-tuning and prompt engineering in emotion detection in three distinct scenarios: (i) performance of fine-tuned pre-trained models and general-purpose LLMs using simple prompts; (ii) effectiveness of different emotion prompt designs with LLMs; and (iii) impact of emotion grouping techniques on these models. Experimental tests attain metrics above 70% with a fine-tuned pre-trained model for emotion recognition. Moreover, the findings highlight that LLMs require structured prompt engineering and emotion grouping to enhance their performance. These advancements improve sentiment analysis, human-computer interaction, and understanding of user behavior across various domains.
摘要：Transformer 模型极大地推进了情感识别领域的发展。然而，在探索大型语言模型 (LLM) 的开放式查询时仍然存在开放性挑战。尽管当前的模型提供了良好的结果，但开放文本中的自动情感分析提出了重大挑战，例如上下文歧义、语言变异性以及解释复杂情感表达的困难。这些限制使得通才模型的直接应用变得困难。因此，这项工作比较了三种不同场景下情绪检测中微调和提示工程的有效性：（i）使用简单提示的微调预训练模型和通用 LLM 的性能； (ii) 不同情绪提示设计在 LLM 上的有效性； (iii) 情绪分组技术对这些模型的影响。通过经过微调的预训练情绪识别模型，实验测试的指标达到了 70% 以上。此外，研究结果强调，LLM 需要结构化的提示工程和情感分组来提高其性能。这些进步改进了情感分析、人机交互以及对各个领域的用户行为的理解。

Title: DiffAdapt: Difficulty-Adaptive Reasoning for Token-Efficient LLM Inference

Authors: Xiang Liu, Xuming Hu, Xiaowen Chu, Eunsol Choi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.19669
Pdf URL: https://arxiv.org/pdf/2510.19669
Copy Paste: [[2510.19669]] DiffAdapt: Difficulty-Adaptive Reasoning for Token-Efficient LLM Inference(https://arxiv.org/abs/2510.19669)
Keywords: language model, llm, prompt
Abstract: Recent reasoning Large Language Models (LLMs) demonstrate remarkable problem-solving abilities but often generate long thinking traces whose utility is unclear. Our work aims to improve their efficiency, enabling them to reach high performance without overthinking. First, we analyze the entropy of token probabilities in reasoning traces. Across three models, we observe a consistent U-shaped entropy pattern: high entropy on easy problems despite high accuracy, low entropy on problems with medium difficulty, and high entropy on hard problems reflecting uncertainty. Specifically, we notice a 22-25% entropy reduction from easy to medium difficulty regions, suggesting an overthinking phenomenon on easy instances. Building on these insights, we introduce DiffAdapt, a lightweight framework that selects Easy/Normal/Hard inference strategies per question based on their difficulty and reasoning trace entropy. Each inference strategy consists of a fixed prompt, temperature, and maximum token length. In contrast to existing efficiency optimization methods, our approach does not fine-tune the base LLM; instead, it trains a small probe that classifies the LLM's final hidden state, allowing inexpensive adaptation. We comprehensively evaluate our method on five models and eight benchmarks. Our method achieves comparable or improved accuracy while reducing token usage by up to 22.4%, establishing a practical path toward compute-efficient reasoning.
摘要：最近的推理大型语言模型（LLM）表现出卓越的解决问题的能力，但经常产生长的思维痕迹，其效用尚不清楚。我们的工作旨在提高它们的效率，使它们能够在不过度思考的情况下达到高性能。首先，我们分析推理轨迹中标记概率的熵。在三个模型中，我们观察到一致的 U 形熵模式：尽管精度较高，但简单问题的熵较高，中等难度问题的熵较低，反映不确定性的困难问题的熵较高。具体来说，我们注意到从简单到中等难度区域熵减少了 22-25%，这表明在简单实例上存在过度思考现象。基于这些见解，我们引入了 DiffAdapt，这是一个轻量级框架，它根据每个问题的难度和推理轨迹熵选择 Easy/Normal/Hard 推理策略。每个推理策略由固定的提示、温度和最大标记长度组成。与现有的效率优化方法相比，我们的方法不会微调基础 LLM，而是训练一个对 LLM 最终隐藏状态进行分类的小型探针，从而允许廉价的适应。我们在五个模型和八个基准上全面评估我们的方法。我们的方法实现了相当或更高的准确性，同时将标记使用量减少了高达 22.4%，从而建立了一条实现计算高效推理的实用路径。
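
The entropy analysis described in the DiffAdapt abstract can be illustrated with a minimal Python sketch. It computes the Shannon entropy of per-step token distributions and averages it over a trace. The strategy table and the function names (`token_entropy`, `mean_trace_entropy`, `pick_strategy`) are assumptions made for illustration; the abstract does not give the actual prompts, temperatures, or token limits, and the paper's probe over hidden states is not reproduced here.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_trace_entropy(step_distributions):
    """Average token entropy across all generation steps of one trace."""
    entropies = [token_entropy(p) for p in step_distributions]
    return sum(entropies) / len(entropies)

# Hypothetical strategy table: the abstract only states that each strategy
# fixes a prompt, a temperature, and a maximum token length. These values
# are invented for the example.
STRATEGIES = {
    "easy": {"prompt": "Answer concisely.", "temperature": 0.2, "max_tokens": 512},
    "normal": {"prompt": "Think step by step.", "temperature": 0.6, "max_tokens": 2048},
    "hard": {"prompt": "Reason carefully and verify.", "temperature": 0.8, "max_tokens": 8192},
}

def pick_strategy(difficulty_label):
    """Look up inference settings for a label produced by some difficulty classifier."""
    return STRATEGIES[difficulty_label]

# Toy trace with three generation steps over a two-token vocabulary.
trace = [[0.5, 0.5], [0.9, 0.1], [1.0, 0.0]]
print(mean_trace_entropy(trace))
print(pick_strategy("easy"))
```

A uniform distribution gives the maximum entropy for its vocabulary size, while a one-hot distribution gives zero, so the trace average reflects how uncertain the model was across its reasoning steps.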

Title: CoSense-LLM: Semantics at the Edge with Cost- and Uncertainty-Aware Cloud-Edge Cooperation

Authors: Hasan Akgul, Mari Eplik, Javier Rojas, Aina Binti Abdullah, Pieter van der Merwe
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.19670
Pdf URL: https://arxiv.org/pdf/2510.19670
Copy Paste: [[2510.19670]] CoSense-LLM: Semantics at the Edge with Cost- and Uncertainty-Aware Cloud-Edge Cooperation(https://arxiv.org/abs/2510.19670)
Keywords: language model, llm, prompt
Abstract: We present CoSense-LLM, an edge-first framework that turns continuous multimodal sensor streams (for example, Wi-Fi CSI, IMU, audio, RFID, and lightweight vision) into compact, verifiable semantic tokens and coordinates with large language models under explicit latency, energy, bandwidth, and privacy constraints. CoSense-LLM has four parts: (i) SenseFusion, a lightweight encoder that aligns sensor embeddings with language and compresses them into short discrete code sequences; (ii) Edge-RAG, a local hybrid retrieval layer that grounds generation in site-specific policies and notes; (iii) PromptRouter, a cost- and uncertainty-aware policy that selects edge-only generation, edge plus retrieval, or compact cloud escalation; and (iv) Secure Execution, an auditable redaction path that enforces data minimization so raw waveforms never leave the device. The system works with modern serving optimizations, including paged or streaming KV caches, FlashAttention-style kernels, speculative decoding, and quantized LoRA adapters, and supports on-device personalization and federated updates under non-IID drift. Across home, office, and clinic deployments, CoSense-LLM delivers grounded explanations while meeting tight service-level objectives: it sustains sub-second (p95) end-to-end latency on edge-dominant paths, reduces inter-tier token and bandwidth costs by preferring local, retrieval-grounded responses, and preserves privacy by transmitting only discrete codes and redacted metadata. Ablations show that Edge-RAG improves factual consistency and reduces contradictions, calibrated uncertainty enables selective abstention and controlled escalations, and KV plus decoding accelerators lower energy per decision. The results support an edge-first design that treats semantics, privacy, and predictable latency as co-equal goals for large model deployments in interference-prone environments.
摘要：我们提出了 CoSense-LLM，这是一种边缘优先的框架，可将连续的多模式传感器流（例如 Wi-Fi CSI、IMU、音频、RFID 和轻量级视觉）转换为紧凑、可验证的语义标记，并在显式延迟、能量、带宽和隐私约束下与大型语言模型进行协调。 CoSense-LLM 有四个部分：(i) SenseFusion，一种轻量级编码器，可将传感器嵌入与语言对齐并将其压缩为短的离散代码序列； (ii) Edge-RAG，一个本地混合检索层，基于站点特定政策和注释的生成； (iii) PromptRouter，一种成本和不确定性感知策略，选择仅边缘生成、边缘加检索或紧凑云升级； (iv) 安全执行，这是一种可审计的编辑路径，可强制执行数据最小化，使原始波形永远不会离开设备。该系统可与现代服务优化配合使用，包括分页或流式 KV 缓存、FlashAttention 样式内核、推测性解码和量化 LoRA 适配器，并支持非 IID 漂移下的设备个性化和联合更新。在家庭、办公室和诊所部署中，CoSense-LLM 提供可靠的解释，同时满足严格的服务水平目标：它在边缘主导路径上维持亚秒级 (p95) 端到端延迟，通过首选本地检索接地响应来减少层间令牌和带宽成本，并通过仅传输离散代码和编辑元数据来保护隐私。消融表明，Edge-RAG 提高了事实一致性并减少了矛盾，校准的不确定性实现了选择性弃权和受控升级，KV 加解码加速器降低了每个决策的能量。结果支持边缘优先设计，将语义、隐私和可预测延迟视为在易受干扰的环境中部署大型模型的同等目标。

Title: Are Large Language Models Sensitive to the Motives Behind Communication?

Authors: Addison J. Wu, Ryan Liu, Kerem Oktar, Theodore R. Sumers, Thomas L. Griffiths
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.19687
Pdf URL: https://arxiv.org/pdf/2510.19687
Copy Paste: [[2510.19687]] Are Large Language Models Sensitive to the Motives Behind Communication?(https://arxiv.org/abs/2510.19687)
Keywords: language model, llm, agent
Abstract: Human communication is motivated: people speak, write, and create content with a particular communicative intent in mind. As a result, information that large language models (LLMs) and AI agents process is inherently framed by humans' intentions and incentives. People are adept at navigating such nuanced information: we routinely identify benevolent or self-serving motives in order to decide what statements to trust. For LLMs to be effective in the real world, they too must critically evaluate content by factoring in the motivations of the source -- for instance, weighing the credibility of claims made in a sales pitch. In this paper, we undertake a comprehensive study of whether LLMs have this capacity for motivational vigilance. We first employ controlled experiments from cognitive science to verify that LLMs' behavior is consistent with rational models of learning from motivated testimony, and find they successfully discount information from biased sources in a human-like manner. We then extend our evaluation to sponsored online adverts, a more naturalistic reflection of LLM agents' information ecosystems. In these settings, we find that LLMs' inferences do not track the rational models' predictions nearly as closely -- partly due to additional information that distracts them from vigilance-relevant considerations. However, a simple steering intervention that boosts the salience of intentions and incentives substantially increases the correspondence between LLMs and the rational model. These results suggest that LLMs possess a basic sensitivity to the motivations of others, but generalizing to novel real-world settings will require further improvements to these models.
摘要：人类交流是有动机的：人们在说话、写作和创造内容时都考虑到特定的交流意图。因此，大型语言模型 (LLM) 和人工智能代理处理的信息本质上是由人类的意图和动机决定的。人们善于浏览这些微妙的信息：我们通常会识别仁慈或自私的动机，以便决定信任哪些陈述。为了让 LLM 在现实世界中发挥作用，它们也必须通过考虑来源的动机来批判性地评估内容，例如，权衡推销中主张的可信度。在本文中，我们对 LLM 是否具有这种动机警惕的能力进行了全面研究。我们首先采用认知科学的对照实验来验证 LLM 的行为与从动机性证言中学习的理性模型是一致的，并发现它们成功地以类似人类的方式折扣来自有偏见的来源的信息。然后，我们将评估扩展到赞助在线广告，这是 LLM 智能体信息生态系统的更自然的反映。在这些情况下，我们发现 LLM 的推论并没有那么密切地跟踪理性模型的预测，部分原因是额外的信息分散了它们对警惕相关考虑的注意力。然而，一个简单的引导干预可以提高意图和激励的显着性，从而大大增加 LLM 和理性模型之间的对应性。这些结果表明 LLM 对其他人的动机具有基本的敏感性，但推广到新的现实世界环境将需要对这些模型进行进一步改进。

Title: Do Prompts Reshape Representations? An Empirical Study of Prompting Effects on Embeddings

Authors: Cesar Gonzalez-Gutierrez, Dirk Hovy
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.19694
Pdf URL: https://arxiv.org/pdf/2510.19694
Copy Paste: [[2510.19694]] Do Prompts Reshape Representations? An Empirical Study of Prompting Effects on Embeddings(https://arxiv.org/abs/2510.19694)
Keywords: prompt
Abstract: Prompting is a common approach for leveraging LMs in zero-shot settings. However, the underlying mechanisms that enable LMs to perform diverse tasks without task-specific supervision remain poorly understood. Studying the relationship between prompting and the quality of internal representations can shed light on how pre-trained embeddings may support in-context task solving. In this empirical study, we conduct a series of probing experiments on prompt embeddings, analyzing various combinations of prompt templates for zero-shot classification. Our findings show that while prompting affects the quality of representations, these changes do not consistently correlate with the relevance of the prompts to the target task. This result challenges the assumption that more relevant prompts necessarily lead to better representations. We further analyze potential factors that may contribute to this unexpected behavior.
摘要：提示是在零样本设置中利用 LM 的常见方法。然而，使 LM 在没有特定任务监督的情况下执行各种任务的基本机制仍然知之甚少。研究提示与内部表征质量之间的关系可以揭示预先训练的嵌入如何支持上下文中的任务解决。在这项实证研究中，我们对提示嵌入进行了一系列探测实验，分析了用于零样本分类的提示模板的各种组合。我们的研究结果表明，虽然提示会影响表征的质量，但这些变化并不总是与提示与目标任务的相关性相关。这一结果挑战了“更相关的提示必然会带来更好的表示”的假设。我们进一步分析可能导致这种意外行为的潜在因素。

Title: From Answers to Guidance: A Proactive Dialogue System for Legal Documents

Authors: Ashish Chouhan, Michael Gertz
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.19723
Pdf URL: https://arxiv.org/pdf/2510.19723
Copy Paste: [[2510.19723]] From Answers to Guidance: A Proactive Dialogue System for Legal Documents(https://arxiv.org/abs/2510.19723)
Keywords: retrieval-augmented generation
Abstract: The accessibility of legal information remains a constant challenge, particularly for laypersons seeking to understand and apply complex institutional texts. While the European Union provides open access to legislation, parliamentary responses, and regulatory documents, these resources can be challenging for laypeople to explore. In this paper, we introduce EUDial, a proactive multi-turn dialogue dataset constructed from 204 blogs curated by the Citizens' Enquiries Unit (AskEP) of the European Parliamentary Research Service. EUDial contains 880 dialogue turns (averaging 4.3 turns per dialogue), where each dialogue includes initial questions, structured answers, and follow-up questions. Beyond dataset construction, we propose the LexGuide framework that leverages retrieval-augmented generation with hierarchical topic organization to structure dialogue progression, ensuring both comprehensive coverage of legal aspects and coherence across conversational turns. The results demonstrate that proactive, structured navigation closes the gap between the availability of legal information and citizen comprehension, establishing EUDial and LexGuide as practical resources for advancing proactive legal dialogue systems.
摘要：法律信息的可获取性仍然是一个持续的挑战，特别是对于寻求理解和应用复杂制度文本的外行来说。虽然欧盟提供立法、议会回应和监管文件的开放获取途径，但这些资源对于外行来说探索起来可能具有挑战性。在本文中，我们介绍了 EUDial，这是一个主动多轮对话数据集，由欧洲议会研究服务公民查询部门 (AskEP) 策划的 204 个博客构建而成。 EUDial 包含 880 轮对话（平均每个对话 4.3 轮），其中每个对话包括初始问题、结构化答案和后续问题。除了数据集构建之外，我们还提出了 LexGuide 框架，该框架利用检索增强生成和分层主题组织来构建对话进程，确保法律方面的全面覆盖和对话轮次的连贯性。结果表明，主动、结构化的导航缩小了法律信息的可用性和公民理解之间的差距，将 EUDial 和 LexGuide 作为推进主动法律对话系统的实用资源。

Title: Zhyper: Factorized Hypernetworks for Conditioned LLM Fine-Tuning

Authors: M. H. I. Abdalla, Zhipin Wang, Christian Frey, Steffen Eger, Josif Grabocka
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2510.19733
Pdf URL: https://arxiv.org/pdf/2510.19733
Copy Paste: [[2510.19733]] Zhyper: Factorized Hypernetworks for Conditioned LLM Fine-Tuning(https://arxiv.org/abs/2510.19733)
Keywords: language model, llm, prompt
Abstract: Large Language Model (LLM) conditioning refers to instructing an LLM to generate content in accordance with the norms and values of a specific culture, beliefs of a particular political orientation, or any desired text-specified semantic conditioning. Unfortunately, prompt engineering does not ensure that LLMs behave in accordance with a desired conditioning due to the inductive bias of the pre-training and alignment datasets. Prior works have focused on fine-tuning LLMs by directly conditioning the LoRA weights; however, such methods introduce a large number of parameters. As a remedy, we propose Zhyper, a parameter-efficient factorized hypernetwork framework that generates context-aware LoRA adapters from textual descriptions. Experiments on multiple benchmarks show that Zhyper achieves competitive performance with up to 26x fewer parameters than the state-of-the-art baselines. Furthermore, we extend Zhyper to cultural alignment, demonstrating improved generalization to out-of-domain settings and a better capturing of fine-grained contextual values.
摘要：大语言模型 (LLM) 调节是指指示 LLM 根据特定文化的规范和价值观、特定政治取向的信仰或任何所需的文本指定的语义调节来生成内容。不幸的是，由于预训练和对齐数据集的归纳偏差，提示工程并不能确保 LLM 的行为符合所需的条件。之前的工作重点是通过直接调节 LoRA 权重来微调 LLM；然而，此类方法引入了大量参数。作为补救措施，我们提出了 Zhyper，一种参数高效的分解超网络框架，可根据文本描述生成上下文感知的 LoRA 适配器。多个基准测试的实验表明，Zhyper 的参数比最先进的基线最多少 26 倍，同时实现了具有竞争力的性能。此外，我们将 Zhyper 扩展到文化对齐，展示了对域外设置的改进泛化以及更好地捕捉细粒度的上下文价值。

Title: SmartSwitch: Advancing LLM Reasoning by Overcoming Underthinking via Promoting Deeper Thought Exploration

Authors: Xichen Zhang, Sitong Wu, Haoru Tan, Shaozuo Yu, Yinghao Zhu, Ziyi He, Jiaya Jia
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.19767
Pdf URL: https://arxiv.org/pdf/2510.19767
Copy Paste: [[2510.19767]] SmartSwitch: Advancing LLM Reasoning by Overcoming Underthinking via Promoting Deeper Thought Exploration(https://arxiv.org/abs/2510.19767)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: The long chain-of-thought (LongCoT) capability is central to the recent breakthroughs achieved by large language models in complex reasoning tasks. However, the accompanying issue of "underthinking", where models exhibit shallow reasoning by frequently switching thoughts without sufficient exploration, limits both performance and token efficiency. To address this problem, we propose a simple yet effective reasoning strategy: the SmartSwitch inference framework. This framework can be easily integrated into any large language model as a plug-and-play solution, continuously monitoring the model's reasoning process to detect underthinking and guide it toward deeper exploration of promising but overlooked thoughts. Specifically, the perception module identifies points where thoughts switch and evaluates the potential of the preceding thought using an off-the-shelf process reward model (PRM). If a high-potential thought is found to be prematurely abandoned, the intervention module interrupts the ongoing inference, backtracks to the point before the switch, and inserts a "deepening prompt" to encourage further exploration along that promising path. Extensive experiments on challenging mathematical reasoning benchmarks demonstrate that our method significantly enhances the performance of various large language models of different sizes.
摘要：长思维链（LongCoT）能力是大型语言模型最近在复杂推理任务中取得突破的核心。然而，随之而来的“思考不足”问题，即模型在没有充分探索的情况下频繁切换思想而表现出肤浅的推理，限制了性能和代币效率。为了解决这个问题，我们提出了一种简单而有效的推理策略：SmartSwitch 推理框架。该框架可以作为即插即用的解决方案轻松集成到任何大型语言模型中，持续监控模型的推理过程以检测不足的想法，并引导其更深入地探索有希望但被忽视的想法。具体来说，感知模块识别思想切换的点，并使用现成的过程奖励模型（PRM）评估先前思想的潜力。如果发现一个高潜力的想法被过早放弃，干预模块就会中断正在进行的推理，回溯到切换之前的点，并插入一个“深化提示”，以鼓励沿着这条有前途的道路进一步探索。对具有挑战性的数学推理基准的大量实验表明，我们的方法显着增强了不同规模的各种大型语言模型的性能。

Title: AdaSPEC: Selective Knowledge Distillation for Efficient Speculative Decoders

Authors: Yuezhou Hu, Jiaxin Guo, Xinyu Feng, Tuo Zhao
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.19779
Pdf URL: https://arxiv.org/pdf/2510.19779
Copy Paste: [[2510.19779]] AdaSPEC: Selective Knowledge Distillation for Efficient Speculative Decoders(https://arxiv.org/abs/2510.19779)
Keywords: language model
Abstract: Speculative Decoding (SD) accelerates large language model inference by employing a small draft model to generate predictions, which are then verified by a larger target model. The effectiveness of SD hinges on the alignment between these models, which is typically enhanced by Knowledge Distillation (KD). However, conventional KD methods aim to minimize the KL divergence between the draft and target models across all tokens, a goal that is misaligned with the true objective of SD, which is to maximize token acceptance rate. Therefore, draft models often struggle to fully assimilate the target model's knowledge due to capacity constraints, leading to suboptimal performance. To address this challenge, we propose AdaSPEC, a novel method that incorporates selective token filtering into the KD process. AdaSPEC utilizes a reference model to identify and filter out difficult-to-fit tokens, enabling the distillation of a draft model that better aligns with the target model on simpler tokens. This approach improves the overall token acceptance rate without compromising generation quality. We evaluate AdaSPEC across diverse tasks, including arithmetic reasoning, instruction-following, coding, and summarization, using model configurations of 31M/1.4B and 350M/2.7B parameters. Our results demonstrate that AdaSPEC consistently outperforms the state-of-the-art DistillSpec method, achieving higher acceptance rates across all tasks (up to 15%). The code is publicly available.
摘要：推测解码 (SD) 通过使用小型草稿模型生成预测来加速大型语言模型推理，然后由更大的目标模型进行验证。 SD 的有效性取决于这些模型之间的一致性，这通常通过知识蒸馏 (KD) 得到增强。然而，传统的 KD 方法旨在最小化所有代币的草稿模型和目标模型之间的 KL 差异，这一目标与 SD 的真正目标（即最大化代币接受率）不一致。因此，由于容量限制，草稿模型常常难以完全吸收目标模型的知识，从而导致性能不佳。为了应对这一挑战，我们提出了 AdaSPEC，这是一种将选择性令牌过滤纳入 KD 过程的新颖方法。 AdaSPEC 利用参考模型来识别和过滤掉难以拟合的标记，从而能够对草稿模型进行提炼，从而更好地与更简单的标记上的目标模型保持一致。这种方法在不影响生成质量的情况下提高了总体代币接受率。我们使用 31M/1.4B 和 350M/2.7B 参数的模型配置跨各种任务评估 AdaSPEC，包括算术推理、指令跟踪、编码和摘要。我们的结果表明，AdaSPEC 始终优于最先进的 DistillSpec 方法，在所有任务中实现了更高的接受率（高达 15%）。该代码可通过此 https URL 公开获取。
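
The idea of restricting distillation to a subset of tokens can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the selection rule used here (keep the tokens on which a reference model's KL divergence to the target is lowest) and the `keep_fraction` parameter are hypothetical stand-ins, not the criterion used by AdaSPEC, and the direction KL(target || draft) is likewise an assumption since the abstract does not specify it.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions; q must be positive where p is."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def selective_kd_loss(target_dists, draft_dists, ref_dists, keep_fraction=0.5):
    """Average KL(target || draft) over a filtered subset of token positions.

    Assumed filter: rank positions by how well a reference model fits the
    target, then keep the easiest `keep_fraction` of them.
    """
    ref_kl = [kl_divergence(t, r) for t, r in zip(target_dists, ref_dists)]
    n_keep = max(1, int(len(ref_kl) * keep_fraction))
    kept = sorted(range(len(ref_kl)), key=lambda i: ref_kl[i])[:n_keep]
    loss = sum(kl_divergence(target_dists[i], draft_dists[i]) for i in kept) / n_keep
    return loss, sorted(kept)

# Toy example: two token positions over a two-token vocabulary.
target = [[0.5, 0.5], [0.9, 0.1]]
reference = [[0.5, 0.5], [0.1, 0.9]]  # the reference fits position 0, not position 1
draft = [[0.5, 0.5], [0.5, 0.5]]

print(selective_kd_loss(target, draft, reference, keep_fraction=0.5))
print(selective_kd_loss(target, draft, reference, keep_fraction=1.0))
```

With `keep_fraction=1.0` the function reduces to conventional distillation over all tokens, which makes the effect of the filter easy to compare.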

Title: Adapting Multilingual Models to Code-Mixed Tasks via Model Merging

Authors: Prashant Kodali, Vaishnavi Shivkumar, Swarang Joshi, Monojit Choudhary, Ponnurangam Kumaraguru, Manish Shrivastava
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.19782
Pdf URL: https://arxiv.org/pdf/2510.19782
Copy Paste: [[2510.19782]] Adapting Multilingual Models to Code-Mixed Tasks via Model Merging(https://arxiv.org/abs/2510.19782)
Keywords: llm, prompt
Abstract: We study model merging as a practical alternative to conventional adaptation strategies for code-mixed NLP. Starting from a multilingual base model, we: (i) perform continued pre-training (CPT) on unlabeled code-mixed text to obtain an adapted checkpoint, (ii) merge the adapted checkpoint with the base model, and (iii) fine-tune (FT) on the downstream task data. We evaluate our approach on sentence classification tasks (sentiment and hate speech) in English-Hindi (En-Hi) and English-Spanish (En-Es) using XLM-R and Llama-3.2-1B models. Our results show that merged models consistently outperform full fine-tuning and CPT->FT. We observe gains of 2-5 points in F1 over full fine-tuning and about 1-2 points over CPT->FT, indicating that unlabeled data is leveraged more effectively via merging than via CPT alone. Zero-/few-shot prompting with larger LLMs (e.g., Llama-3.3-70B) lags behind fine-tuned and merged checkpoints, underscoring the limits of in-context learning for code-mixed inputs. We further test cross-pair transfer by training on En-Hi and evaluating on En-Ta and En-Ml: merged checkpoints transfer more strongly than monolingual-English baselines (e.g., TV/TIES variants reaching 0.65-0.68 F1 vs 0.61-0.63 for full fine-tuning), suggesting that code-mixed knowledge is a more reliable substrate for low-resource pairs. We conclude with adaptation recipes matched to common data regimes (labeled only; labeled+unlabeled; transfer-only) and discuss limitations and scaling considerations for broader tasks and larger models.
摘要：我们研究模型合并作为代码混合 NLP 传统适应策略的实用替代方案。从多语言基础模型开始，我们：（i）对未标记的代码混合文本执行持续预训练（CPT）以获得适应的检查点，（ii）将检查点与基础模型合并，以及（iii）对下游任务数据进行微调（FT）。我们使用 XLM-R 和 Llama-3.2-1B 模型评估英语-印地语 (En-Hi) 和英语-西班牙语 (En-Es) 句子分类（情感和仇恨言论）任务的方法。我们的结果表明，合并模型的性能始终优于完全微调和 CPT->FT。我们观察到 F1 比完全微调提高了 2--5 个点，比 CPT->FT 提高了约 1-2 个点，这表明通过合并比单独通过 CPT 更有效地利用未标记数据。较大的 LLM（例如 Llama-3.3-70B）的零/少样本提示落后于微调和合并的检查点，强调了代码混合输入的上下文学习的局限性。我们通过在 En-Hi 上进行训练并在 En-Ta 和 En-Ml 上进行评估来进一步测试跨对迁移：合并的检查点迁移比单语英语基线更强（例如，TV/TIES 变体达到 0.65-0.68 F1，而完全微调则达到 0.61-0.63），这表明代码混合知识对于低资源对来说是更可靠的基础。我们以与常见数据机制相匹配的适应方法（仅标记；标记+未标记；仅传输）作为结论，并讨论更广泛任务和更大模型的限制和扩展考虑因素。
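
Step (ii) above, merging an adapted checkpoint back into the base model, can be sketched with simple task-vector arithmetic. This is a minimal sketch, assuming checkpoints are plain dictionaries of weight lists and that the merge is a scaled addition of the difference between adapted and base weights. The scaling coefficient `alpha` and the function name `merge_task_vector` are hypothetical, and the TIES variant mentioned in the abstract (which also trims and resolves sign conflicts) is not shown.

```python
def merge_task_vector(base, adapted, alpha=1.0):
    """Return base + alpha * (adapted - base) for every named weight list."""
    merged = {}
    for name, base_w in base.items():
        # Task vector: what continued pre-training changed relative to the base.
        delta = [a - b for a, b in zip(adapted[name], base_w)]
        merged[name] = [b + alpha * d for b, d in zip(base_w, delta)]
    return merged

# Toy checkpoints with a single two-element weight.
base_ckpt = {"layer.weight": [1.0, 2.0]}
cpt_ckpt = {"layer.weight": [3.0, 2.0]}

print(merge_task_vector(base_ckpt, cpt_ckpt, alpha=0.5))
```

Setting `alpha=0.0` recovers the base model and `alpha=1.0` recovers the adapted checkpoint, so intermediate values interpolate between the two before downstream fine-tuning.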

Title: ToolDreamer: Instilling LLM Reasoning Into Tool Retrievers

Authors: Saptarshi Sengupta, Zhengyu Zhou, Jun Araki, Xingbo Wang, Bingqing Wang, Suhang Wang, Zhe Feng
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2510.19791
Pdf URL: https://arxiv.org/pdf/2510.19791
Copy Paste: [[2510.19791]] ToolDreamer: Instilling LLM Reasoning Into Tool Retrievers(https://arxiv.org/abs/2510.19791)
Keywords: language model, llm
Abstract: Tool calling has become increasingly popular for Large Language Models (LLMs). However, for large tool sets, the resulting tokens would exceed the LLM's context window limit, making it impossible to include every tool. Hence, an external retriever is used to provide LLMs with the most relevant tools for a query. Existing retrieval models rank tools based on the similarity between a user query and a tool description (TD). This leads to suboptimal retrieval as user requests are often poorly aligned with the language of TDs. To remedy the issue, we propose ToolDreamer, a framework to condition retriever models to fetch tools based on hypothetical (synthetic) TDs generated using an LLM, i.e., descriptions of tools that the LLM feels will be potentially useful for the query. The framework enables a more natural alignment between queries and tools within the language space of TDs. We apply ToolDreamer on the ToolRet dataset and show that our method improves the performance of sparse and dense retrievers with and without training, thus showcasing its flexibility. Through our proposed framework, our aim is to offload a portion of the reasoning burden to the retriever so that the LLM may effectively handle a large collection of tools without inundating its context window.
摘要：工具调用在大型语言模型 (LLM) 中变得越来越流行。然而，对于大型工具集，生成的标记将超出 LLM 的上下文窗口限制，从而不可能包含所有工具。因此，外部检索器用于为 LLM 提供与查询最相关的工具。现有的检索模型根据用户查询和工具描述（TD）之间的相似性对工具进行排名。这会导致检索效果不佳，因为用户请求通常与 TD 语言不一致。为了解决这个问题，我们提出了 ToolDreamer，这是一个框架，用于条件检索器模型来获取基于使用 LLM 生成的假设（合成）TD 的工具，即 LLM 认为对查询可能有用的工具的描述。该框架使 TD 语言空间内的查询和工具之间能够更加自然地对齐。我们在 ToolRet 数据集上应用 ToolDreamer，并表明我们的方法在训练和不训练的情况下都提高了稀疏和密集检索器的性能，从而展示了其灵活性。通过我们提出的框架，我们的目标是将部分推理负担转移给检索器，以便 LLM 可以有效地处理大量工具，而不会淹没其上下文窗口。
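
The mismatch between query wording and tool descriptions (TDs) can be illustrated with a toy sparse retriever in Python. This is a sketch under stated assumptions: the tool names, descriptions, query, and the hard-coded `hypothetical_td` string are all invented for the example, the hypothetical TD stands in for what an LLM would generate, and bag-of-words cosine similarity is a stand-in for whichever retriever is actually used.

```python
import math
from collections import Counter

def bow(text):
    """Lowercased whitespace tokens as a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def rank_tools(text, tools):
    """Tool names sorted by similarity between `text` and each tool description."""
    scores = {name: cosine(bow(text), bow(desc)) for name, desc in tools.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Invented tool set and query, for illustration only.
tools = {
    "weather_api": "returns current weather forecast for a city",
    "currency_converter": "converts an amount between two currencies",
}
query = "do I need an umbrella in Paris tomorrow"
# Stand-in for an LLM-generated description of a tool that would help.
hypothetical_td = "a tool that returns the weather forecast for a city"

print(rank_tools(query, tools))
print(rank_tools(hypothetical_td, tools))
```

In this toy case the raw query shares only the word "an" with the wrong tool, while the hypothetical TD overlaps heavily with the weather tool's description, which is the alignment effect the abstract describes.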

Title: The Art of Asking: Multilingual Prompt Optimization for Synthetic Data

Authors: David Mora, Viraat Aryabumi, Wei-Yin Ko, Sara Hooker, Julia Kreutzer, Marzieh Fadaee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.19806
Pdf URL: https://arxiv.org/pdf/2510.19806
Copy Paste: [[2510.19806]] The Art of Asking: Multilingual Prompt Optimization for Synthetic Data(https://arxiv.org/abs/2510.19806)
Keywords: language model, llm, prompt
Abstract: Synthetic data has become a cornerstone for scaling large language models, yet its multilingual use remains bottlenecked by translation-based prompts. This strategy inherits English-centric framing and style and neglects cultural dimensions, ultimately constraining model generalization. We argue that the overlooked prompt space (the very inputs that define training distributions) offers a more powerful lever for improving multilingual performance. We introduce a lightweight framework for prompt-space optimization, where translated prompts are systematically transformed for Naturalness, Cultural Adaptation, and Difficulty Enhancement. Using an off-the-shelf multilingual LLM, we apply these transformations to prompts for 12 languages spanning 7 families. Under identical data conditions, our approaches achieve substantial and consistent downstream improvements over the translation-only baseline: +4.7% on Global-MMLU accuracy, +2.4% on Flores XCometXL, and +35.3% wins in preferences on mArenaHard. We establish prompt-space optimization as a simple yet powerful paradigm for building multilingual LLMs that are more robust, culturally grounded, and globally capable.
摘要：合成数据已成为扩展大型语言模型的基石，但其多语言使用仍然受到基于翻译的提示的瓶颈。该策略继承了以英语为中心的框架和风格，而忽略了文化维度，最终限制了模型的泛化。我们认为，被忽视的提示空间（定义训练分布的输入）为提高多语言性能提供了更强大的杠杆。我们引入了一个用于提示空间优化的轻量级框架，其中翻译的提示被系统地转换以实现自然性、文化适应和难度增强。使用现成的多语言 LLM，我们将这些转换应用于跨越 7 个语系的 12 种语言的提示。在相同的数据条件下，我们的方法相对于仅翻译基线实现了实质性且一致的下游改进：Global-MMLU 准确率+4.7%，Flores XCometXL+2.4%，mArenaHard 偏好胜率+35.3%。我们建立提示空间优化作为一个简单而强大的范例，用于构建更强大、更具有文化基础和全球能力的多语言 LLM。

Title: Scaf-GRPO: Scaffolded Group Relative Policy Optimization for Enhancing LLM Reasoning

Authors: Xichen Zhang, Sitong Wu, Yinghao Zhu, Haoru Tan, Shaozuo Yu, Ziyi He, Jiaya Jia
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.19807
Pdf URL: https://arxiv.org/pdf/2510.19807
Copy Paste: [[2510.19807]] Scaf-GRPO: Scaffolded Group Relative Policy Optimization for Enhancing LLM Reasoning(https://arxiv.org/abs/2510.19807)
Keywords: language model, llm, prompt
Abstract: Reinforcement learning from verifiable rewards has emerged as a powerful technique for enhancing the complex reasoning abilities of Large Language Models (LLMs). However, these methods are fundamentally constrained by the "learning cliff" phenomenon: when faced with problems far beyond their current capabilities, models consistently fail, yielding a persistent zero-reward signal. In policy optimization algorithms like GRPO, this collapses the advantage calculation to zero, rendering these difficult problems invisible to the learning gradient and stalling progress. To overcome this, we introduce Scaf-GRPO (Scaffolded Group Relative Policy Optimization), a progressive training framework that strategically provides minimal guidance only when a model's independent learning has plateaued. The framework first diagnoses learning stagnation and then intervenes by injecting tiered in-prompt hints, ranging from abstract concepts to concrete steps, enabling the model to construct a valid solution by itself. Extensive experiments on challenging mathematics benchmarks demonstrate Scaf-GRPO's effectiveness, boosting the pass@1 score of the Qwen2.5-Math-7B model on the AIME24 benchmark by a relative 44.3% over a vanilla GRPO baseline. This result demonstrates that our framework provides a robust and effective methodology for unlocking a model's ability to solve problems previously beyond its reach, a critical step towards extending the frontier of autonomous reasoning in LLMs.
摘要：基于可验证奖励的强化学习已成为增强大型语言模型（LLM）复杂推理能力的强大技术。然而，这些方法从根本上受到“学习悬崖”现象的限制：当面临远远超出其当前能力的问题时，模型始终会失败，产生持续的零奖励信号。在 GRPO 等策略优化算法中，这会使优势计算坍缩为零，从而使这些困难问题对学习梯度不可见并导致进度停滞。为了克服这个问题，我们引入了 Scaf-GRPO（支架式组相对策略优化），这是一种渐进式训练框架，仅当模型的独立学习陷入停滞时才有策略地提供最低限度的指导。该框架首先诊断学习停滞，然后通过在提示中注入分层线索（从抽象概念到具体步骤）进行干预，使模型能够自行构建有效的解决方案。针对具有挑战性的数学基准的大量实验证明了 Scaf-GRPO 的有效性：在 AIME24 基准上，Qwen2.5-Math-7B 模型的 pass@1 分数比普通 GRPO 基线相对提高了 44.3%。这一结果表明，我们的框架提供了一种稳健而有效的方法来释放模型解决以前无法解决的问题的能力，这是扩展 LLM 自主推理前沿的关键一步。
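
The advantage collapse described in the abstract above can be illustrated with a minimal sketch. It assumes the commonly used group-relative normalization, (reward - group mean) / (group std + eps); the function name, the `eps` value, and the example reward lists are illustrative assumptions and are not taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group.

    Assumed form: (r - mean) / (std + eps). The eps term only guards
    against division by zero when every reward in the group is equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One rollout out of four succeeds: the group carries a learning signal.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 0.0])

# Every rollout fails: all advantages are zero, so the problem
# contributes nothing to the policy gradient.
all_failed = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

print(mixed)
print(all_failed)
```

Under this assumption, a problem that the model never solves yields identical rewards across the group, which is the situation the abstract says the scaffolding hints are meant to break.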

Title: Hubble: a Model Suite to Advance the Study of LLM Memorization

Authors: Johnny Tian-Zheng Wei, Ameya Godbole, Mohammad Aflah Khan, Ryan Wang, Xiaoyuan Zhu, James Flemings, Nitya Kashyap, Krishna P. Gummadi, Willie Neiswanger, Robin Jia
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2510.19811
Pdf URL: https://arxiv.org/pdf/2510.19811
Copy Paste: [[2510.19811]] Hubble: a Model Suite to Advance the Study of LLM Memorization(https://arxiv.org/abs/2510.19811)
Keywords: language model, llm
Abstract: We present Hubble, a suite of fully open-source large language models (LLMs) for the scientific study of LLM memorization. Hubble models come in standard and perturbed variants: standard models are pretrained on a large English corpus, and perturbed models are trained in the same way but with controlled insertion of text (e.g., book passages, biographies, and test sets) designed to emulate key memorization risks. Our core release includes 8 models -- standard and perturbed models with 1B or 8B parameters, pretrained on 100B or 500B tokens -- establishing that memorization risks are determined by the frequency of sensitive data relative to the size of the training corpus (i.e., a password appearing once in a smaller corpus is memorized better than the same password in a larger corpus). Our release also includes 6 perturbed models with text inserted at different pretraining phases, showing that sensitive data without continued exposure can be forgotten. These findings suggest two best practices for addressing memorization risks: to dilute sensitive data by increasing the size of the training corpus, and to order sensitive data to appear earlier in training. Beyond these general empirical findings, Hubble enables a broad range of memorization research; for example, analyzing the biographies reveals how readily different types of private information are memorized. We also demonstrate that the randomized insertions in Hubble make it an ideal testbed for membership inference and machine unlearning, and invite the community to further explore, benchmark, and build upon our work.
摘要：我们推出 Hubble，这是一套完全开源的大语言模型 (LLM)，用于 LLM 记忆的科学研究。Hubble 模型有标准和扰动两种变体：标准模型在大型英语语料库上进行预训练，扰动模型以相同的方式进行训练，但受控地插入了旨在模拟关键记忆风险的文本（例如书籍段落、传记和测试集）。我们的核心版本包括 8 个模型（具有 1B 或 8B 参数的标准模型和扰动模型，在 100B 或 500B token 上进行预训练），证实记忆风险取决于敏感数据相对于训练语料库规模的出现频率（即，在较小语料库中出现一次的密码比在较大语料库中出现的相同密码更容易被记住）。我们的版本还包括 6 个扰动模型，其文本在不同的预训练阶段插入，这表明没有持续暴露的敏感数据可能会被遗忘。这些发现提出了应对记忆风险的两种最佳实践：通过增加训练语料库的规模来稀释敏感数据，以及安排敏感数据在训练中较早出现。除了这些一般性的实证发现之外，Hubble 还支持广泛的记忆研究；例如，对传记的分析揭示了不同类型的私人信息被记住的容易程度。我们还证明，Hubble 中的随机插入使其成为成员推理和机器遗忘的理想测试平台，并邀请社区进一步探索、进行基准测试并在我们的工作基础上继续构建。
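
The dilution argument in the abstract above comes down to simple arithmetic, sketched here. The 100B and 500B corpus sizes are from the abstract; the single 20-token inserted string and the function name are hypothetical values chosen only for illustration.

```python
from fractions import Fraction


def relative_frequency(copies, tokens_per_copy, corpus_tokens):
    """Share of the training corpus taken up by the inserted text."""
    return Fraction(copies * tokens_per_copy, corpus_tokens)


# Hypothetical: one 20-token sensitive string inserted once.
small_corpus = relative_frequency(1, 20, 100 * 10**9)  # 100B-token run
large_corpus = relative_frequency(1, 20, 500 * 10**9)  # 500B-token run

# The same string is five times rarer, relative to corpus size,
# in the larger corpus.
dilution_factor = small_corpus / large_corpus
print(dilution_factor)
```

This only shows how relative frequency scales with corpus size; how strongly memorization tracks that frequency is an empirical question that the paper addresses.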