2025-10-21

Title: Fusion-Augmented Large Language Models: Boosting Diagnostic Trustworthiness via Model Consensus

Authors: Md Kamrul Siam, Md Jobair Hossain Faruk, Jerry Q. Cheng, Huanying Gu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.16057
Pdf URL: https://arxiv.org/pdf/2510.16057
Copy Paste: [[2510.16057]] Fusion-Augmented Large Language Models: Boosting Diagnostic Trustworthiness via Model Consensus(https://arxiv.org/abs/2510.16057)
Keywords: language model, gpt, llm, prompt, chat
Abstract: This study presents a novel multi-model fusion framework leveraging two state-of-the-art large language models (LLMs), ChatGPT and Claude, to enhance the reliability of chest X-ray interpretation on the CheXpert dataset. From the full CheXpert corpus of 224,316 chest radiographs, we randomly selected 234 radiologist-annotated studies to evaluate unimodal performance using image-only prompts. In this setting, ChatGPT and Claude achieved diagnostic accuracies of 62.8% and 76.9%, respectively. A similarity-based consensus approach, using a 95% output similarity threshold, improved accuracy to 77.6%. To assess the impact of multimodal inputs, we then generated synthetic clinical notes following the MIMIC-CXR template and evaluated a separate subset of 50 randomly selected cases paired with both images and synthetic text. On this multimodal cohort, performance improved to 84% for ChatGPT and 76% for Claude, while consensus accuracy reached 91.3%. Across both experimental conditions, agreement-based fusion consistently outperformed individual models. These findings highlight the utility of integrating complementary modalities and using output-level consensus to improve the trustworthiness and clinical utility of AI-assisted radiological diagnosis, offering a practical path to reduce diagnostic errors with minimal computational overhead.
摘要：本研究提出了一种新颖的多模型融合框架，利用两种最先进的大型语言模型 (LLM) ChatGPT 和 Claude，以增强 CheXpert 数据集上胸部 X 射线解释的可靠性。从包含 224,316 张胸部 X 光照片的完整 CheXpert 语料库中，我们随机选择了 234 个放射科医生注释的研究，以使用纯图像提示评估单峰性能。在此设置下，ChatGPT 和 Claude 的诊断准确率分别为 62.8% 和 76.9%。基于相似性的共识方法使用 95% 的输出相似性阈值，将准确度提高到 77.6%。为了评估多模式输入的影响，我们按照 MIMIC-CXR 模板生成了合成临床记录，并评估了 50 个随机选择的病例的单独子集，并与图像和合成文本配对。在这个多模式队列中，ChatGPT 的性能提高到 84%，Claude 的性能提高到 76%，而共识准确性达到 91.3%。在这两种实验条件下，基于协议的融合始终优于单个模型。这些发现强调了整合互补模式和使用输出级共识来提高人工智能辅助放射诊断的可信度和临床实用性的效用，提供了一条以最小的计算开销减少诊断错误的实用途径。
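
The consensus step described in this abstract (keep a case only when the two model outputs agree at or above a 95% similarity threshold) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say which similarity metric was used, so `difflib.SequenceMatcher` is a stand-in assumption, and the function names and toy predictions are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in metric (assumption)."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def consensus_accuracy(preds_a, preds_b, labels, threshold=0.95):
    """Score only the cases where both models agree at or above the threshold.

    Returns (accuracy on agreed cases, fraction of cases kept).
    Accuracy is None when no case passes the threshold.
    """
    kept = [
        (a, y)
        for a, b, y in zip(preds_a, preds_b, labels)
        if similarity(a, b) >= threshold
    ]
    if not kept:
        return None, 0.0
    correct = sum(a.strip().lower() == y.strip().lower() for a, y in kept)
    return correct / len(kept), len(kept) / len(labels)


# Toy data, invented for illustration only.
preds_a = ["pneumonia", "no finding", "edema", "cardiomegaly"]
preds_b = ["pneumonia", "no finding", "pleural effusion", "cardiomegaly"]
labels = ["pneumonia", "no finding", "edema", "atelectasis"]

accuracy, coverage = consensus_accuracy(preds_a, preds_b, labels)
print(f"consensus accuracy={accuracy:.3f}, coverage={coverage:.2f}")
```

Note that accuracy here is measured only on the agreed subset, so it should be read together with coverage.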

Title: Can LLMs Correct Themselves? A Benchmark of Self-Correction in LLMs

Authors: Guiyao Tie, Zenghui Yuan, Zeli Zhao, Chaoran Hu, Tianhe Gu, Ruihang Zhang, Sizhe Zhang, Junran Wu, Xiaoyue Tu, Ming Jin, Qingsong Wen, Lixing Chen, Pan Zhou, Lichao Sun
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.16062
Pdf URL: https://arxiv.org/pdf/2510.16062
Copy Paste: [[2510.16062]] Can LLMs Correct Themselves? A Benchmark of Self-Correction in LLMs(https://arxiv.org/abs/2510.16062)
Keywords: language model, llm, chain-of-thought
Abstract: Self-correction of large language models (LLMs) emerges as a critical component for enhancing their reasoning performance. Although various self-correction methods have been proposed, a comprehensive evaluation of these methods remains largely unexplored, and the question of whether LLMs can truly correct themselves is a matter of significant interest and concern. In this study, we introduce CorrectBench, a benchmark developed to evaluate the effectiveness of self-correction strategies, including intrinsic, external, and fine-tuned approaches, across three tasks: commonsense reasoning, mathematical reasoning, and code generation. Our findings reveal that: 1) Self-correction methods can improve accuracy, especially for complex reasoning tasks; 2) Mixing different self-correction strategies yields further improvements, though it reduces efficiency; 3) Reasoning LLMs (e.g., DeepSeek-R1) have limited optimization under additional self-correction methods and have high time costs. Interestingly, a comparatively simple chain-of-thought (CoT) baseline demonstrates competitive accuracy and efficiency. These results underscore the potential of self-correction to enhance LLMs' reasoning performance while highlighting the ongoing challenge of improving their efficiency. Consequently, we advocate for further research focused on optimizing the balance between reasoning capabilities and operational efficiency. Project Page: this https URL
摘要：大语言模型（LLM）的自我修正成为增强其推理性能的关键组成部分。尽管已经提出了各种自我纠正方法，但对这些方法的综合评估仍然很大程度上尚未探索，而法学硕士是否能够真正自我纠正的问题是一个引起人们极大兴趣和关注的问题。在本研究中，我们介绍了 CorrectBench，这是一个用于评估自我纠正策略有效性的基准，包括内在、外部和微调方法，涵盖三个任务：常识推理、数学推理和代码生成。我们的研究结果表明：1）自我修正方法可以提高准确性，特别是对于复杂的推理任务； 2）混合不同的自我修正策略可以产生进一步的改进，但会降低效率； 3）推理LLM（例如DeepSeek-R1）在额外的自我修正方法下优化有限，并且时间成本很高。有趣的是，相对简单的思想链（CoT）基线展示了具有竞争力的准确性和效率。这些结果强调了自我纠正在提高法学硕士推理表现方面的潜力，同时也凸显了提高其效率所面临的持续挑战。因此，我们主张进一步研究重点是优化推理能力和运行效率之间的平衡。项目页面：此 https URL

Title: EvolveR: Self-Evolving LLM Agents through an Experience-Driven Lifecycle

Authors: Rong Wu, Xiaoman Wang, Jianbiao Mei, Pinlong Cai, Daocheng Fu, Cheng Yang, Licheng Wen, Xuemeng Yang, Yufan Shen, Yuxin Wang, Botian Shi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.16079
Pdf URL: https://arxiv.org/pdf/2510.16079
Copy Paste: [[2510.16079]] EvolveR: Self-Evolving LLM Agents through an Experience-Driven Lifecycle(https://arxiv.org/abs/2510.16079)
Keywords: language model, llm, agent
Abstract: Current Large Language Model (LLM) agents show strong performance in tool use, but lack the crucial capability to systematically learn from their own experiences. While existing frameworks mainly focus on mitigating external knowledge gaps, they fail to address a more fundamental limitation: the inability to iteratively refine problem-solving strategies. In this work, we introduce EvolveR, a framework designed to enable agents to self-improve through a complete, closed-loop experience lifecycle. This lifecycle comprises two key stages: (1) Offline Self-Distillation, where the agent's interaction trajectories are synthesized into a structured repository of abstract, reusable strategic principles; (2) Online Interaction, where the agent interacts with tasks and actively retrieves distilled principles to guide its decision-making, accumulating a diverse set of behavioral trajectories. This loop employs a policy reinforcement mechanism to iteratively update the agent based on its performance. We demonstrate the effectiveness of EvolveR on complex multi-hop question-answering benchmarks, where it achieves superior performance over strong agentic baselines. Our work presents a comprehensive blueprint for agents that learn not only from external data but also from the consequences of their own actions, paving the way for more autonomous and continuously improving systems. Code is available at this https URL.
摘要：目前的大型语言模型（LLM）智能体在工具使用方面表现出很强的表现，但缺乏系统地从自己的经验中学习的关键能力。虽然现有框架主要侧重于缩小外部知识差距，但它们未能解决更根本的限制：无法迭代地完善问题解决策略。在这项工作中，我们引入了 EvolveR，这是一个旨在使代理能够通过完整的闭环体验生命周期进行自我改进的框架。该生命周期包括两个关键阶段：（1）离线自我蒸馏，将智能体的交互轨迹合成为抽象的、可重用的策略原则的结构化存储库；（2）在线交互，智能体与任务交互并主动检索提炼的原则来指导其决策，积累多样化的行为轨迹。该循环采用策略强化机制，根据代理的性能迭代更新代理。我们证明了 EvolveR 在复杂的多跳问答基准上的有效性，它在强代理基准上实现了卓越的性能。我们的工作为智能体提供了一个全面的蓝图，它们不仅从外部数据中学习，而且从自己行为的后果中学习，为更加自主和持续改进的系统铺平了道路。代码可从此 https URL 获取。

Title: Evaluating Prompting Strategies and Large Language Models in Systematic Literature Review Screening: Relevance and Task-Stage Classification

Authors: Binglan Han, Anuradha Mathrani, Teo Susnjak
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.16091
Pdf URL: https://arxiv.org/pdf/2510.16091
Copy Paste: [[2510.16091]] Evaluating Prompting Strategies and Large Language Models in Systematic Literature Review Screening: Relevance and Task-Stage Classification(https://arxiv.org/abs/2510.16091)
Keywords: language model, gpt, llm, prompt, chat, chain-of-thought
Abstract: This study quantifies how prompting strategies interact with large language models (LLMs) to automate the screening stage of systematic literature reviews (SLRs). We evaluate six LLMs (GPT-4o, GPT-4o-mini, DeepSeek-Chat-V3, Gemini-2.5-Flash, Claude-3.5-Haiku, Llama-4-Maverick) under five prompt types (zero-shot, few-shot, chain-of-thought (CoT), CoT-few-shot, self-reflection) across relevance classification and six Level-2 tasks, using accuracy, precision, recall, and F1. Results show pronounced model-prompt interaction effects: CoT-few-shot yields the most reliable precision-recall balance; zero-shot maximizes recall for high-sensitivity passes; and self-reflection underperforms due to over-inclusivity and instability across models. GPT-4o and DeepSeek provide robust overall performance, while GPT-4o-mini performs competitively at a substantially lower dollar cost. A cost-performance analysis for relevance classification (per 1,000 abstracts) reveals large absolute differences among model-prompt pairings; GPT-4o-mini remains low-cost across prompts, and structured prompts (CoT/CoT-few-shot) on GPT-4o-mini offer attractive F1 at a small incremental cost. We recommend a staged workflow that (1) deploys low-cost models with structured prompts for first-pass screening and (2) escalates only borderline cases to higher-capacity models. These findings highlight LLMs' uneven but promising potential to automate literature screening. By systematically analyzing prompt-model interactions, we provide a comparative benchmark and practical guidance for task-adaptive LLM deployment.
摘要：这项研究量化了提示策略如何与大语言模型（LLM）相互作用，以自动化系统文献综述（SLR）的筛选阶段。我们在五种提示类型（零样本、少样本、思维链（CoT）、CoT-few-shot、自我反思）下，通过相关性分类和六个 2 级任务，使用准确度、精度、回忆和 F1。结果显示显着的模型提示交互效应：CoT-few-shot 产生最可靠的精确率-召回率平衡；零样本最大限度地提高了高灵敏度通行证的召回率；由于模型之间的过度包容性和不稳定性，自我反思表现不佳。 GPT-4o 和 DeepSeek 提供强大的整体性能，而 GPT-4o-mini 则以大幅降低的成本具有竞争力。相关性分类（每 1,000 个摘要）的成本绩效分析揭示了模型提示配对之间存在巨大的绝对差异； GPT-4o-mini 在各个提示中保持低成本，GPT-4o-mini 上的结构化提示 (CoT/CoT-few-shot) 以较小的增量成本提供有吸引力的 F1。我们建议采用分阶段的工作流程：(1) 部署具有结构化提示的低成本模型以进行首次筛选，(2) 仅将边缘案例升级为更高容量的模型。这些发现突显了法学硕士在自动化文献筛选方面的潜力，虽然参差不齐，但前景广阔。通过系统地分析提示模型交互，我们为任务自适应法学硕士部署提供了比较基准和实用指导。
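
The staged workflow recommended in this abstract (a low-cost model screens first, and only borderline cases are escalated to a higher-capacity model) can be sketched as below. This is a hedged illustration: the two classifiers are keyword stubs standing in for real LLM calls, and the confidence scores, the `min_confidence` cutoff, and all names are assumptions that do not come from the paper.

```python
from typing import Callable, Tuple

# A classifier returns (is_relevant, confidence in [0, 1]).
Classifier = Callable[[str], Tuple[bool, float]]


def staged_screen(
    abstract: str,
    cheap: Classifier,
    strong: Classifier,
    min_confidence: float = 0.8,
) -> Tuple[bool, str]:
    """Return (decision, which stage decided it)."""
    label, confidence = cheap(abstract)
    if confidence >= min_confidence:
        return label, "cheap"
    # Borderline case: defer to the higher-capacity model.
    label, _ = strong(abstract)
    return label, "strong"


def cheap_stub(text: str) -> Tuple[bool, float]:
    """Hypothetical stand-in for a low-cost model with a structured prompt."""
    keywords = ("systematic review", "screening", "llm")
    hits = sum(kw in text.lower() for kw in keywords)
    confidence = {0: 0.9, 1: 0.5, 2: 0.7, 3: 0.95}[hits]
    return hits >= 2, confidence


def strong_stub(text: str) -> Tuple[bool, float]:
    """Hypothetical stand-in for a higher-capacity model."""
    return "llm" in text.lower(), 0.99


for doc in (
    "LLM screening for systematic review",
    "A survey of LLM agents",
    "Protein folding kinetics",
):
    print(doc, "->", staged_screen(doc, cheap_stub, strong_stub))
```

In practice the cutoff would be tuned on a labeled sample so that the escalation rate matches the available budget.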

Title: Facts in Stats: Impacts of Pretraining Diversity on Language Model Generalization

Authors: Tina Behnia, Puneesh Deora, Christos Thrampoulidis
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2510.16096
Pdf URL: https://arxiv.org/pdf/2510.16096
Copy Paste: [[2510.16096]] Facts in Stats: Impacts of Pretraining Diversity on Language Model Generalization(https://arxiv.org/abs/2510.16096)
Keywords: language model
Abstract: Language models are pretrained on sequences that blend statistical regularities (making text fluent) with factual associations between specific tokens (knowledge of facts). While recent work suggests that the variability of their interaction, such as paraphrases of factual associations, critically determines generalization ability, we lack a systematic analysis of these impacts. This paper introduces a flexible synthetic testbed that combines a statistical stream of generic tokens with an abstract factual stream of source-target token pairs, enabling fine-grained control over their interaction. The design enables the independent control of diversity nature by manipulating stream composition (contextual structure) and the diversity level by varying which statistical streams each fact appears in. Through controlled experiments, we find that while higher contextual diversity delays in-distribution (ID) factual accuracy, its impact on out-of-distribution (OOD) factual generalization depends critically on contextual structure. In some cases, OOD performance follows the same trend as ID, but in others, diversity becomes essential for non-trivial factual recall. Even when low diversity prohibits factual recall, optimal diversity levels depend on training duration. Beyond factual recall failures, we identify structures where statistical generalization fails independently, and others where both capabilities degrade. This shows how the interplay between contextual design and diversity level impacts different generalization aspects. Further, through a series of controlled interventions on the model components, we trace the OOD failures to distinct optimization bottlenecks, highlighting the importance of the embedding and unembedding layers. Our synthetic framework allows us to isolate effects that would be confounded in large-scale studies, offering a controlled testbed for future investigations.
摘要：语言模型是在将统计规律（使文本流畅）与特定标记之间的事实关联（事实知识）混合在一起的序列上进行预训练的。虽然最近的研究表明，它们相互作用的可变性（例如事实关联的释义）严格地决定了泛化能力，但我们缺乏对这些影响的系统分析。本文介绍了一种灵活的综合测试平台，它将通用令牌的统计流与源-目标令牌对的抽象事实流相结合，从而能够对它们的交互进行细粒度的控制。该设计通过改变每个事实出现的统计流来操纵流组成（上下文结构）和多样性水平，从而实现多样性性质的独立控制。通过受控实验，我们发现，虽然较高的上下文多样性会延迟分布内（ID）事实准确性，但其对分布外（OOD）事实泛化的影响关键取决于上下文结构。在某些情况下，OOD 性能遵循与 ID 相同的趋势，但在其他情况下，多样性对于重要的事实回忆至关重要。即使低多样性阻碍了事实回忆，最佳多样性水平也取决于训练持续时间。除了事实召回失败之外，我们还确定了统计泛化独立失败的结构，以及两种能力均下降的其他结构。这显示了情境设计和多样性水平之间的相互作用如何影响不同的泛化方面。此外，通过对模型组件进行一系列受控干预，我们将 OOD 故障追溯到不同的优化瓶颈，强调了嵌入层和非嵌入层的重要性。我们的综合框架使我们能够隔离在大规模研究中可能混淆的效应，为未来的研究提供受控测试平台。

Title: In Generative AI We (Dis)Trust? Computational Analysis of Trust and Distrust in Reddit Discussions

Authors: Aria Pessianzadeh, Naima Sultana, Hildegarde Van den Bulck, David Gefen, Shahin Jabari, Rezvaneh Rezapour
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.16173
Pdf URL: https://arxiv.org/pdf/2510.16173
Copy Paste: [[2510.16173]] In Generative AI We (Dis)Trust? Computational Analysis of Trust and Distrust in Reddit Discussions(https://arxiv.org/abs/2510.16173)
Keywords: language model, llm
Abstract: The rise of generative AI (GenAI) has impacted many aspects of human life. As these systems become embedded in everyday practices, understanding public trust in them also becomes essential for responsible adoption and governance. Prior work on trust in AI has largely drawn from psychology and human-computer interaction, but there is a lack of computational, large-scale, and longitudinal approaches to measuring trust and distrust in GenAI and large language models (LLMs). This paper presents the first computational study of Trust and Distrust in GenAI, using a multi-year Reddit dataset (2022-2025) spanning 39 subreddits and 197,618 posts. Crowd-sourced annotations of a representative sample were combined with classification models to scale analysis. We find that Trust and Distrust are nearly balanced over time, with shifts around major model releases. Technical performance and usability dominate as dimensions, while personal experience is the most frequent reason shaping attitudes. Distinct patterns also emerge across trustors (e.g., experts, ethicists, general users). Our results provide a methodological framework for large-scale Trust analysis and insights into evolving public perceptions of GenAI.
摘要：生成式人工智能（GenAI）的兴起影响了人类生活的许多方面。随着这些系统融入日常实践，了解公众对它们的信任对于负责任的采用和治理也至关重要。先前关于人工智能信任的工作主要来自心理学和人机交互，但缺乏计算、大规模和纵向的方法来衡量 GenAI 和大型语言模型 (LLM) 中的信任和不信任。本文使用涵盖 39 个 Reddit 子版块和 197,618 个帖子的多年 Reddit 数据集（2022--2025）提出了 GenAI 中信任和不信任的第一个计算研究。将代表性样本的众包注释与分类模型相结合以进行规模分析。我们发现随着时间的推移，随着主要模型版本的变化，信任和不信任几乎达到平衡。技术性能和可用性是主导维度，而个人经验是塑造态度的最常见原因。信任者（例如专家、伦理学家、普通用户）之间也出现了不同的模式。我们的结果为大规模信任分析提供了方法框架，并深入了解公众对 GenAI 不断变化的看法。

Title: EgMM-Corpus: A Multimodal Vision-Language Dataset for Egyptian Culture

Authors: Mohamed Gamil, Abdelrahman Elsayed, Abdelrahman Lila, Ahmed Gad, Hesham Abdelgawad, Mohamed Aref, Ahmed Fares
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.16198
Pdf URL: https://arxiv.org/pdf/2510.16198
Copy Paste: [[2510.16198]] EgMM-Corpus: A Multimodal Vision-Language Dataset for Egyptian Culture(https://arxiv.org/abs/2510.16198)
Keywords: language model
Abstract: Despite recent advances in AI, multimodal culturally diverse datasets are still limited, particularly for regions in the Middle East and Africa. In this paper, we introduce EgMM-Corpus, a multimodal dataset dedicated to Egyptian culture. By designing and running a new data collection pipeline, we collected over 3,000 images, covering 313 concepts across landmarks, food, and folklore. Each entry in the dataset is manually validated for cultural authenticity and multimodal coherence. EgMM-Corpus aims to provide a reliable resource for evaluating and training vision-language models in an Egyptian cultural context. We further evaluate the zero-shot performance of Contrastive Language-Image Pre-training (CLIP) on EgMM-Corpus, on which it achieves 21.2% Top-1 accuracy and 36.4% Top-5 accuracy in classification. These results underscore the existing cultural bias in large-scale vision-language models and demonstrate the importance of EgMM-Corpus as a benchmark for developing culturally aware models.
摘要：尽管人工智能最近取得了进展，但多模式文化多样性数据集仍然有限，特别是对于中东和非洲地区。在本文中，我们介绍了 EgMM-Corpus，这是一个致力于埃及文化的多模态数据集。通过设计和运行新的数据收集管道，我们收集了 3,000 多张图像，涵盖地标、食物和民俗等领域的 313 个概念。数据集中的每个条目都经过手动验证，以确保文化真实性和多模式一致性。 EgMM-Corpus 旨在为埃及文化背景下评估和训练视觉语言模型提供可靠的资源。我们进一步评估了对比语言-图像预训练 CLIP 在 EgMM-Corpus 上的零样本性能，其分类准确率达到 21.2% Top-1 准确率和 36.4% Top-5 准确率。这些结果强调了大规模视觉语言模型中现有的文化偏见，并证明了 EgMM-Corpus 作为开发文化意识模型基准的重要性。
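
For readers unfamiliar with the Top-1 and Top-5 figures quoted in this abstract, the sketch below shows how Top-k accuracy is conventionally computed from per-class scores. The score matrix and labels are invented toy values, and `top_k_accuracy` is an illustrative helper, not code from the paper.

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest-scored classes.

    scores: list of per-class score lists, one per sample.
    labels: list of true class indices, one per sample.
    """
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices ordered from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        hits += label in ranked[:k]
    return hits / len(labels)


# Toy image-to-concept similarity scores (3 samples, 4 classes); invented.
scores = [
    [0.1, 0.7, 0.2, 0.0],
    [0.5, 0.1, 0.3, 0.1],
    [0.2, 0.3, 0.1, 0.4],
]
labels = [1, 2, 2]

print("top-1:", top_k_accuracy(scores, labels, 1))
print("top-2:", top_k_accuracy(scores, labels, 2))
```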

Title: What Can String Probability Tell Us About Grammaticality?

Authors: Jennifer Hu, Ethan Gotlieb Wilcox, Siyuan Song, Kyle Mahowald, Roger P. Levy
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.16227
Pdf URL: https://arxiv.org/pdf/2510.16227
Copy Paste: [[2510.16227]] What Can String Probability Tell Us About Grammaticality?(https://arxiv.org/abs/2510.16227)
Keywords: language model
Abstract: What have language models (LMs) learned about grammar? This question remains hotly debated, with major ramifications for linguistic theory. However, since probability and grammaticality are distinct notions in linguistics, it is not obvious what string probabilities can reveal about an LM's underlying grammatical knowledge. We present a theoretical analysis of the relationship between grammar, meaning, and string probability, based on simple assumptions about the generative process of corpus data. Our framework makes three predictions, which we validate empirically using 280K sentence pairs in English and Chinese: (1) correlation between the probability of strings within minimal pairs, i.e., string pairs with minimal semantic differences; (2) correlation between models' and humans' deltas within minimal pairs; and (3) poor separation in probability space between unpaired grammatical and ungrammatical strings. Our analyses give theoretical grounding for using probability to learn about LMs' structural knowledge, and suggest directions for future work in LM grammatical evaluation.
摘要：语言模型 (LM) 从语法中学到了什么？这个问题仍然引起激烈争论，对语言理论产生了重大影响。然而，由于概率和语法性是语言学中不同的概念，因此字符串概率能够揭示 LM 的潜在语法知识并不明显。我们基于对语料库数据生成过程的简单假设，对语法、含义和字符串概率之间的关系进行了理论分析。我们的框架做出了三个预测，并使用 28 万个英文和中文句子对进行了实证验证：（1）最小对内字符串的概率之间的相关性，即具有最小语义差异的字符串对； (2) 最小对内模型和人类增量之间的相关性； (3) 未配对的语法字符串和非语法字符串之间的概率空间分离不佳。我们的分析为使用概率来了解 LM 的结构知识提供了理论基础，并为 LM 语法评估的未来工作提出了方向。
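
The second prediction in this abstract concerns the correlation between model and human deltas within minimal pairs. A minimal sketch of that computation is given below, assuming the delta is the log-probability (or rating) of the grammatical string minus that of its ungrammatical counterpart. All numbers are invented for illustration, and the function names are hypothetical.

```python
import math


def pair_deltas(pairs):
    """Delta per minimal pair: score(grammatical) - score(ungrammatical)."""
    return [good - bad for good, bad in pairs]


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


# (log p(grammatical), log p(ungrammatical)) per pair; invented values.
model_logprobs = [(-20.0, -24.0), (-31.0, -32.0), (-15.0, -14.5)]
# Human acceptability differences for the same pairs; invented values.
human_deltas = [3.0, 1.5, 0.2]

model_deltas = pair_deltas(model_logprobs)
preferred = sum(d > 0 for d in model_deltas) / len(model_deltas)

print("model deltas:", model_deltas)
print("fraction preferring grammatical string:", preferred)
print("model-human correlation:", pearson(model_deltas, human_deltas))
```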

Title: Towards Low-Resource Alignment to Diverse Perspectives with Sparse Feedback

Authors: Chu Fei Luo, Samuel Dahan, Xiaodan Zhu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.16257
Pdf URL: https://arxiv.org/pdf/2510.16257
Copy Paste: [[2510.16257]] Towards Low-Resource Alignment to Diverse Perspectives with Sparse Feedback(https://arxiv.org/abs/2510.16257)
Keywords: language model
Abstract: As language models have a greater impact on society, it is important to ensure they are aligned to a diverse range of perspectives and are able to reflect nuance in human values. However, the most popular training paradigms for modern language models often assume there is one optimal answer for every query, leading to generic responses and poor alignment. In this work, we aim to enhance pluralistic alignment of language models in a low-resource setting with two methods: pluralistic decoding and model steering. We empirically demonstrate that model steering offers consistent improvement over zero-shot and few-shot baselines with only 50 annotated samples. Our proposed methods decrease false positives in several high-stakes tasks such as hate speech detection and misinformation detection, and improve the distributional alignment to human values in GlobalOpinionQA. We hope our work highlights the importance of diversity and how language models can be adapted to consider nuanced perspectives.
摘要：由于语言模型对社会产生更大的影响，因此确保它们符合不同的观点并能够反映人类价值观的细微差别非常重要。然而，现代语言模型最流行的训练范例通常假设每个查询都有一个最佳答案，从而导致通用响应和对齐不良。在这项工作中，我们的目标是通过两种方法在资源匮乏的情况下增强语言模型的多元对齐：多元解码和模型引导。我们凭经验证明，模型控制在仅使用 50 个带注释的样本的情况下，比零样本和少样本基线提供了一致的改进。我们提出的方法减少了仇恨言论检测和错误信息检测等多项高风险任务中的误报，并改善了 GlobalOpinionQA 中与人类价值观的分布一致性。我们希望我们的工作强调多样性的重要性以及如何调整语言模型以考虑微妙的观点。

Title: Instant Personalized Large Language Model Adaptation via Hypernetwork

Authors: Zhaoxuan Tan, Zixuan Zhang, Haoyang Wen, Zheng Li, Rongzhi Zhang, Pei Chen, Fengran Mo, Zheyuan Liu, Qingkai Zeng, Qingyu Yin, Meng Jiang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.16282
Pdf URL: https://arxiv.org/pdf/2510.16282
Copy Paste: [[2510.16282]] Instant Personalized Large Language Model Adaptation via Hypernetwork(https://arxiv.org/abs/2510.16282)
Keywords: language model, llm, prompt
Abstract: Personalized large language models (LLMs) tailor content to individual preferences using user profiles or histories. However, existing parameter-efficient fine-tuning (PEFT) methods, such as the "One-PEFT-Per-User" (OPPU) paradigm, require training a separate adapter for each user, making them computationally expensive and impractical for real-time updates. We introduce Profile-to-PEFT, a scalable framework that employs a hypernetwork, trained end-to-end, to map a user's encoded profile directly to a full set of adapter parameters (e.g., LoRA), eliminating per-user training at deployment. This design enables instant adaptation, generalization to unseen users, and privacy-preserving local deployment. Experimental results demonstrate that our method outperforms both prompt-based personalization and OPPU while using substantially fewer computational resources at deployment. The framework exhibits strong generalization to out-of-distribution users and maintains robustness across varying user activity levels and different embedding backbones. The proposed Profile-to-PEFT framework enables efficient, scalable, and adaptive LLM personalization suitable for large-scale applications.
摘要：个性化大语言模型 (LLM) 使用用户配置文件或历史记录根据个人喜好定制内容。然而，现有的参数高效微调（PEFT）方法，例如“每用户一个 PEFT”（OPPU）范例，需要为每个用户训练一个单独的适配器，这使得它们的计算成本昂贵且对于实时更新来说不切实际。我们引入了 Profile-to-PEFT，这是一个可扩展的框架，它采用经过端到端训练的超网络，将用户的编码配置文件直接映射到全套适配器参数（例如 LoRA），从而消除了部署时的每个用户培训。这种设计可以实现即时适应、对看不见的用户的泛化以及保护隐私的本地部署。实验结果表明，我们的方法优于基于提示的个性化和 OPPU，同时在部署时使用的计算资源大大减少。该框架对分布外用户表现出很强的泛化性，并在不同的用户活动级别和不同的嵌入骨干中保持鲁棒性。所提出的 Profile-to-PEFT 框架可实现高效、可扩展且自适应的 LLM 个性化，适合大规模应用。
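
The core idea in this abstract, a hypernetwork that maps an encoded user profile directly to LoRA adapter parameters, can be illustrated with a toy forward pass. This is a sketch under strong assumptions: the hypernetwork here is a single untrained linear map `H`, the dimensions are tiny, and every name is hypothetical. The paper's actual architecture and training procedure are not described in the abstract.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def reshape(flat, rows, cols):
    """Split a flat list into `rows` lists of length `cols`."""
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]


def hypernetwork(profile_emb, H, d_in, d_out, rank):
    """Map a profile embedding to LoRA factors A (rank x d_in) and B (d_out x rank)."""
    flat = matvec(H, profile_emb)
    A = reshape(flat[:rank * d_in], rank, d_in)
    B = reshape(flat[rank * d_in:], d_out, rank)
    return A, B


def adapted_forward(W, A, B, x):
    """Frozen base layer plus low-rank update: y = W x + B (A x)."""
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + d for b, d in zip(base, delta)]


rng = random.Random(0)
d_in, d_out, rank, profile_dim = 4, 3, 2, 5

# Frozen base weights and (untrained) hypernetwork weights; random for illustration.
W = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
H = [[rng.gauss(0, 0.1) for _ in range(profile_dim)]
     for _ in range(rank * (d_in + d_out))]

profile = [rng.gauss(0, 1) for _ in range(profile_dim)]  # stand-in for an encoded profile
A, B = hypernetwork(profile, H, d_in, d_out, rank)

x = [1.0, 0.0, -1.0, 0.5]
y = adapted_forward(W, A, B, x)
print("personalized output:", y)
```

Because the adapter is produced by one forward pass through `H`, a new user needs no gradient steps at deployment, which is the property the abstract emphasizes.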

Title: Thinking About Thinking: Evaluating Reasoning in Post-Trained Language Models

Authors: Pratham Singla, Shivank Garg, Ayush Singh, Ishan Garg, Ketan Suhaas Saichandran
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.16340
Pdf URL: https://arxiv.org/pdf/2510.16340
Copy Paste: [[2510.16340]] Thinking About Thinking: Evaluating Reasoning in Post-Trained Language Models(https://arxiv.org/abs/2510.16340)
Keywords: language model, llm
Abstract: Recent advances in post-training techniques have endowed Large Language Models (LLMs) with enhanced capabilities for tackling complex, logic-intensive tasks through the generation of supplementary planning tokens. This development raises a fundamental question: Are these models aware of what they "learn" and "think"? To address this, we define three core competencies: (1) awareness of learned latent policies, (2) generalization of these policies across domains, and (3) alignment between internal reasoning traces and final outputs. We empirically evaluate these abilities on several tasks, each designed to require learning a distinct policy. Furthermore, we contrast the profiles of models post-trained via Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Group Relative Policy Optimization (GRPO). Our findings indicate that RL-trained models not only demonstrate greater awareness of their learned behaviors and stronger generalizability to novel, structurally similar tasks than SFT models but also often exhibit weak alignment between their reasoning traces and final outputs, an effect most pronounced in GRPO-trained models.
摘要：训练后技术的最新进展赋予了大型语言模型（LLM）通过生成补充规划令牌来处理复杂、逻辑密集型任务的增强能力。这一发展提出了一个基本问题：这些模型是否意识到它们“学习”和“思考”什么？为了解决这个问题，我们定义了三个核心能力：（1）对学习到的潜在策略的认识，（2）这些策略跨领域的泛化，以及（3）内部推理轨迹和最终输出之间的一致性。我们在多项任务上根据经验评估这些能力，每项任务都需要学习不同的策略。此外，我们还对比了通过监督微调（SFT）、直接策略优化（DPO）和组相对策略优化（GRPO）进行后训练的模型的概况。我们的研究结果表明，与 SFT 模型相比，强化学习训练的模型不仅表现出对其学习行为有更高的认识，并且对新颖的、结构相似的任务具有更强的泛化性，而且在推理轨迹和最终输出之间经常表现出较弱的一致性，这一效应在 GRPO 训练的模型中最为明显。

Title: Utilising Large Language Models for Generating Effective Counter Arguments to Anti-Vaccine Tweets

Authors: Utsav Dhanuka, Soham Poddar, Saptarshi Ghosh
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.16359
Pdf URL: https://arxiv.org/pdf/2510.16359
Copy Paste: [[2510.16359]] Utilising Large Language Models for Generating Effective Counter Arguments to Anti-Vaccine Tweets(https://arxiv.org/abs/2510.16359)
Keywords: language model, llm, prompt
Abstract: In an era where public health is increasingly influenced by information shared on social media, combatting vaccine skepticism and misinformation has become a critical societal goal. Misleading narratives around vaccination have spread widely, creating barriers to achieving high immunisation rates and undermining trust in health recommendations. While efforts to detect misinformation have made significant progress, the generation of real-time counter-arguments tailored to debunk such claims remains an insufficiently explored area. In this work, we explore the capabilities of LLMs to generate sound counter-argument rebuttals to vaccine misinformation. Building on prior research in misinformation debunking, we experiment with various prompting strategies and fine-tuning approaches to optimise counter-argument generation. Additionally, we train classifiers to categorise anti-vaccine tweets into multi-labeled categories such as concerns about vaccine efficacy, side effects, and political influences, allowing for more context-aware rebuttals. Our evaluation, conducted through human judgment, LLM-based assessments, and automatic metrics, reveals strong alignment across these methods. Our findings demonstrate that integrating label descriptions and structured fine-tuning enhances counter-argument effectiveness, offering a promising approach for mitigating vaccine misinformation at scale.
摘要：在公共卫生越来越受到社交媒体共享信息影响的时代，打击疫苗怀疑论和错误信息已成为一个重要的社会目标。有关疫苗接种的误导性叙述广泛传播，为实现高免疫率制造了障碍，并破坏了对健康建议的信任。尽管检测错误信息的努力已经取得了重大进展，但为揭穿此类主张而定制的实时反论点仍然是一个尚未充分探索的领域。在这项工作中，我们探讨了法学硕士对疫苗错误信息提出合理反驳的能力。基于先前关于错误信息揭穿的研究，我们尝试了各种提示策略和微调方法来优化反驳的生成。此外，我们训练分类器将反疫苗推文分类为多标签类别，例如对疫苗功效、副作用和政治影响的担忧，从而允许更多上下文感知的反驳。我们的评估是通过人工判断、基于法学硕士的评估和自动指标进行的，揭示了这些方法之间的强烈一致性。我们的研究结果表明，整合标签描述和结构化微调可以增强反驳有效性，为大规模减少疫苗错误信息提供一种有前景的方法。

Title: End-to-End Argument Mining through Autoregressive Argumentative Structure Prediction

Authors: Nilmadhab Das, Vishal Vaibhav, Yash Sunil Choudhary, V. Vijaya Saradhi, Ashish Anand
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.16363
Pdf URL: https://arxiv.org/pdf/2510.16363
Copy Paste: [[2510.16363]] End-to-End Argument Mining through Autoregressive Argumentative Structure Prediction(https://arxiv.org/abs/2510.16363)
Keywords: language model
Abstract: Argument Mining (AM) helps automate the extraction of complex argumentative structures, such as Argument Components (ACs) like Premise and Claim and Argumentative Relations (ARs) like Support and Attack, from an argumentative text. Due to the inherent complexity of reasoning involved with this task, modelling dependencies between ACs and ARs is challenging. Most of the recent approaches formulate this task through a generative paradigm by flattening the argumentative structures. In contrast, this study jointly formulates the key tasks of AM in an end-to-end fashion using the Autoregressive Argumentative Structure Prediction (AASP) framework. The proposed AASP framework is based on the autoregressive structure prediction framework, which has performed well on several NLP tasks. The AASP framework models the argumentative structures as constrained pre-defined sets of actions with the help of a conditional pre-trained language model. These actions build the argumentative structures step-by-step in an autoregressive manner to capture the flow of argumentative reasoning in an efficient way. Extensive experiments conducted on three standard AM benchmarks demonstrate that AASP achieves state-of-the-art (SoTA) results across all AM tasks in two benchmarks and delivers strong results in one benchmark.
摘要：论证挖掘 (AM) 有助于自动提取复杂的论证结构，例如论证文本中的论证成分 (AC)（如前提、主张等）和论证关系 (AR)（如支持、攻击等）。由于此任务所涉及的推理固有的复杂性，对 AC 和 AR 之间的依赖关系进行建模具有挑战性。最近的大多数方法都是通过生成范式通过扁平化论证结构来制定这项任务。与此相反，本研究使用自回归论证结构预测（AASP）框架以端到端的方式联合制定 AM 的关键任务。所提出的 AASP 框架基于自回归结构预测框架，该框架在多个 NLP 任务中提供了良好的性能。 AASP 框架借助条件预训练语言模型将论证结构建模为受约束的预定义动作集。这些动作以自回归的方式逐步构建论证结构，以有效的方式捕获论证推理的流程。在三个标准 AM 基准测试上进行的大量实验表明，AASP 在两个基准测试中的所有 AM 任务中均取得了最先进的 (SoTA) 结果，并在一个基准测试中提供了强劲的结果。

Title: Navigating through the hidden embedding space: steering LLMs to improve mental health assessment

Authors: Federico Ravenda, Seyed Ali Bahrainian, Andrea Raballo, Antonietta Mira
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.16373
Pdf URL: https://arxiv.org/pdf/2510.16373
Copy Paste: [[2510.16373]] Navigating through the hidden embedding space: steering LLMs to improve mental health assessment(https://arxiv.org/abs/2510.16373)
Keywords: language model, llm
Abstract: The rapid evolution of Large Language Models (LLMs) is transforming AI, opening new opportunities in sensitive and high-impact areas such as Mental Health (MH). Yet, despite these advancements, recent evidence reveals that smaller-scale models still struggle to deliver optimal performance in domain-specific applications. In this study, we present a cost-efficient yet powerful approach to improve MH assessment capabilities of an LLM, without relying on any computationally intensive techniques. Our lightweight method consists of a linear transformation applied to a specific layer's activations, leveraging steering vectors to guide the model's output. Remarkably, this intervention enables the model to achieve improved results across two distinct tasks: (1) identifying whether a Reddit post is useful for detecting the presence or absence of depressive symptoms (relevance prediction task), and (2) completing a standardized psychological screening questionnaire for depression based on users' Reddit post history (questionnaire completion task). Results highlight the untapped potential of steering mechanisms as computationally efficient tools for LLMs' MH domain adaptation.
摘要：大型语言模型 (LLM) 的快速发展正在改变人工智能，为心理健康 (MH) 等敏感和高影响领域带来新的机遇。然而，尽管取得了这些进步，最近的证据表明，较小规模的模型仍然难以在特定领域的应用程序中提供最佳性能。在这项研究中，我们提出了一种经济有效但功能强大的方法来提高法学硕士的 MH 评估能力，而不依赖于任何计算密集型技术。我们的轻量级方法包括应用于特定层激活的线性变换，利用引导向量来引导模型的输出。值得注意的是，这种干预使模型能够在两个不同的任务中取得更好的结果：（1）确定 Reddit 帖子是否有助于检测抑郁症状的存在或不存在（相关性预测任务），以及（2）根据用户的 Reddit 帖子历史记录完成标准化抑郁症心理筛查问卷（问卷完成任务）。结果凸显了引导机制作为法学硕士 MH 领域适应的计算有效工具的未开发潜力。
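
The abstract describes steering as a linear transformation applied to one layer's activations using steering vectors. A common way to build such a vector, assumed here and not confirmed by the abstract, is the difference between mean activations of two contrasting example sets, added to the hidden state with a scaling factor. The sketch below uses plain lists in place of real model activations, and all names and numbers are illustrative.

```python
def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def steering_vector(positive_acts, negative_acts):
    """Difference of class means (assumed construction, see lead-in)."""
    pos = mean_vector(positive_acts)
    neg = mean_vector(negative_acts)
    return [p - n for p, n in zip(pos, neg)]


def steer(hidden, vector, alpha):
    """Shift a hidden state along the steering direction: h' = h + alpha * v."""
    return [h + alpha * v for h, v in zip(hidden, vector)]


# Toy layer activations for two contrasting sets of posts; invented values.
relevant_acts = [[1.0, 2.0], [3.0, 4.0]]
irrelevant_acts = [[0.0, 0.0], [2.0, 2.0]]

v = steering_vector(relevant_acts, irrelevant_acts)
print("steering vector:", v)
print("steered state:", steer([0.5, 0.5], v, alpha=2.0))
```

In a real model this shift would be applied inside the forward pass at the chosen layer, for example through a framework hook. The layer index and `alpha` are hyperparameters to be selected on validation data.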

Title: MoReBench: Evaluating Procedural and Pluralistic Moral Reasoning in Language Models, More than Outcomes

Authors: Yu Ying Chiu, Michael S. Lee, Rachel Calcott, Brandon Handoko, Paul de Font-Reaulx, Paula Rodriguez, Chen Bo Calvin Zhang, Ziwen Han, Udari Madhushani Sehwag, Yash Maurya, Christina Q Knight, Harry R. Lloyd, Florence Bacus, Mantas Mazeika, Bing Liu, Yejin Choi, Mitchell L Gordon, Sydney Levine
Subjects: cs.CL, cs.AI, cs.CY, cs.HC, cs.LG
Abstract URL: https://arxiv.org/abs/2510.16380
Pdf URL: https://arxiv.org/pdf/2510.16380
Copy Paste: [[2510.16380]] MoReBench: Evaluating Procedural and Pluralistic Moral Reasoning in Language Models, More than Outcomes(https://arxiv.org/abs/2510.16380)
Keywords: language model
Abstract: As AI systems progress, we rely more on them to make decisions with us and for us. To ensure that such decisions are aligned with human values, it is imperative for us to understand not only what decisions they make but also how they come to those decisions. Reasoning language models, which provide both final responses and (partially transparent) intermediate thinking traces, present a timely opportunity to study AI procedural reasoning. Unlike math and code problems, which often have objectively correct answers, moral dilemmas are an excellent testbed for process-focused evaluation because they allow for multiple defensible conclusions. To do so, we present MoReBench: 1,000 moral scenarios, each paired with a set of rubric criteria that experts consider essential to include (or avoid) when reasoning about the scenarios. MoReBench contains over 23 thousand criteria including identifying moral considerations, weighing trade-offs, and giving actionable recommendations to cover cases on AI advising humans on moral decisions as well as making moral decisions autonomously. Separately, we curate MoReBench-Theory: 150 examples to test whether AI can reason under five major frameworks in normative ethics. Our results show that scaling laws and existing benchmarks on math, code, and scientific reasoning tasks fail to predict models' abilities to perform moral reasoning. Models also show partiality towards specific moral frameworks (e.g., Benthamite Act Utilitarianism and Kantian Deontology), which might be side effects of popular training paradigms. Together, these benchmarks advance process-focused reasoning evaluation towards safer and more transparent AI.
摘要：随着人工智能系统的进步，我们更加依赖它们与我们一起做出决策并为我们做出决策。为了确保此类决策符合人类价值观，我们不仅必须了解他们做出了什么决策，还要了解他们是如何做出这些决策的。推理语言模型提供最终响应和（部分透明）中间思维轨迹，为研究人工智能程序推理提供了及时的机会。与通常有客观正确答案的数学和代码问题不同，道德困境是以过程为中心的评估的绝佳测试平台，因为它们允许得出多个可辩护的结论。为此，我们提出了 MoReBench：1,000 个道德场景，每个场景都配有一组专家认为在推理场景时必须包含（或避免）的标题标准。 MoReBench 包含超过 23000 条标准，包括识别道德考虑、权衡权衡以及提供可行的建议，以涵盖人工智能为人类道德决策提供建议以及自主做出道德决策的案例。另外，我们策划了 MoReBench-Theory：150 个例子来测试人工智能是否可以在规范伦理的五个主要框架下进行推理。我们的结果表明，数学、代码和科学推理任务的缩放定律和现有基准无法预测模型执行道德推理的能力。模型还表现出对特定道德框架的偏爱（例如，边沁法功利主义和康德道义论），这可能是流行培训范式的副作用。这些基准共同推进以流程为中心的推理评估，以实现更安全、更透明的人工智能。

Title: ATA: A Neuro-Symbolic Approach to Implement Autonomous and Trustworthy Agents

Authors: David Peer, Sebastian Stabinger
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.16381
Pdf URL: https://arxiv.org/pdf/2510.16381
Copy Paste: [[2510.16381]] ATA: A Neuro-Symbolic Approach to Implement Autonomous and Trustworthy Agents(https://arxiv.org/abs/2510.16381)
Keywords: language model, llm, hallucination, prompt, agent
Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities, yet their deployment in high-stakes domains is hindered by inherent limitations in trustworthiness, including hallucinations, instability, and a lack of transparency. To address these challenges, we introduce a generic neuro-symbolic approach, which we call Autonomous Trustworthy Agents (ATA). The core of our approach lies in decoupling tasks into two distinct phases: Offline knowledge ingestion and online task processing. During knowledge ingestion, an LLM translates an informal problem specification into a formal, symbolic knowledge base. This formal representation is crucial as it can be verified and refined by human experts, ensuring its correctness and alignment with domain requirements. In the subsequent task processing phase, each incoming input is encoded into the same formal language. A symbolic decision engine then utilizes this encoded input in conjunction with the formal knowledge base to derive a reliable result. Through an extensive evaluation on a complex reasoning task, we demonstrate that a concrete implementation of ATA is competitive with state-of-the-art end-to-end reasoning models in a fully automated setup while maintaining trustworthiness. Crucially, with a human-verified and corrected knowledge base, our approach significantly outperforms even larger models, while exhibiting perfect determinism, enhanced stability against input perturbations, and inherent immunity to prompt injection attacks. By generating decisions grounded in symbolic reasoning, ATA offers a practical and controllable architecture for building the next generation of transparent, auditable, and reliable autonomous agents.
摘要：大型语言模型 (LLM) 已展现出令人印象深刻的功能，但其在高风险领域的部署却受到可信度固有限制的阻碍，包括幻觉、不稳定和缺乏透明度。为了应对这些挑战，我们引入了一种通用的神经符号方法，我们称之为自主可信代理（ATA）。我们方法的核心在于将任务解耦为两个不同的阶段：离线知识摄取和在线任务处理。在知识摄取过程中，法学硕士将非正式的问题规范转化为正式的符号知识库。这种形式表示至关重要，因为它可以由人类专家验证和完善，确保其正确性并符合领域要求。在后续的任务处理阶段，每个传入的输入都被编码为相同的形式语言。然后，符号决策引擎利用该编码输入结合形式知识库来得出可靠的结果。通过对复杂推理任务的广泛评估，我们证明了 ATA 的具体实现在完全自动化的设置中与最先进的端到端推理模型具有竞争力，同时保持了可信度。至关重要的是，凭借经过人工验证和纠正的知识库，我们的方法显着优于更大的模型，同时表现出完美的确定性、增强的针对输入扰动的稳定性以及对即时注入攻击的固有免疫力。通过生成基于符号推理的决策，ATA 为构建下一代透明、可审计和可靠的自主代理提供了实用且可控的架构。

Title: Probing the Hidden Talent of ASR Foundation Models for L2 English Oral Assessment

Authors: Fu-An Chao, Bi-Cheng Yan, Berlin Chen
Subjects: cs.CL, cs.AI, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2510.16387
Pdf URL: https://arxiv.org/pdf/2510.16387
Copy Paste: [[2510.16387]] Probing the Hidden Talent of ASR Foundation Models for L2 English Oral Assessment(https://arxiv.org/abs/2510.16387)
Keywords: prompt
Abstract: In this paper, we explore the untapped potential of Whisper, a well-established automatic speech recognition (ASR) foundation model, in the context of L2 spoken language assessment (SLA). Unlike prior studies that extrinsically analyze transcriptions produced by Whisper, our approach goes a step further to probe its latent capabilities by extracting acoustic and linguistic features from hidden representations. With only a lightweight classifier being trained on top of Whisper's intermediate and final outputs, our method achieves strong performance on the GEPT picture-description dataset, outperforming existing cutting-edge baselines, including a multimodal approach. Furthermore, by incorporating image and text-prompt information as auxiliary relevance cues, we demonstrate additional performance gains. Finally, we conduct an in-depth analysis of Whisper's embeddings, which reveals that, even without task-specific fine-tuning, the model intrinsically encodes both ordinal proficiency patterns and semantic aspects of speech, highlighting its potential as a powerful foundation for SLA and other spoken language understanding tasks.
摘要：在本文中，我们探讨了 Whisper 在 L2 口语评估 (SLA) 背景下尚未开发的潜力，Whisper 是一种完善的自动语音识别 (ASR) 基础模型。与之前从外部分析 Whisper 产生的转录的研究不同，我们的方法通过从隐藏的表示中提取声学和语言特征，进一步探索其潜在功能。只需在 Whisper 的中间和最终输出上训练一个轻量级分类器，我们的方法就在 GEPT 图片描述数据集上实现了强大的性能，优于现有的尖端基线，包括多模态方法。此外，通过将图像和文本提示信息合并为辅助相关性提示，我们展示了额外的性能提升。最后，我们对 Whisper 的嵌入进行了深入分析，结果表明，即使没有针对特定任务的微调，该模型也会本质上对顺序熟练模式和语音语义方面进行编码，突显其作为 SLA 和其他口语理解任务的强大基础的潜力。

Title: FrugalPrompt: Reducing Contextual Overhead in Large Language Models via Token Attribution

Authors: Syed Rifat Raiyan, Md Farhan Ishmam, Abdullah Al Imran, Mohammad Ali Moni
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.16439
Pdf URL: https://arxiv.org/pdf/2510.16439
Copy Paste: [[2510.16439]] FrugalPrompt: Reducing Contextual Overhead in Large Language Models via Token Attribution(https://arxiv.org/abs/2510.16439)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) owe much of their stellar performance to expansive input contexts, yet such verbosity inflates monetary costs, carbon footprint, and inference-time latency. Much of this overhead manifests from the redundant low-utility tokens present in typical prompts, as only a fraction of tokens typically carries the majority of the semantic weight. We address this inefficiency by introducing FrugalPrompt, a novel prompt compression framework for LLMs, which retains only the most semantically significant tokens. Leveraging two state-of-the-art token attribution methods, GlobEnc and DecompX, we assign salience scores to every token in an input sequence, rank them to preserve the top-k% tokens in their original order, and obtain a sparse frugalized prompt. We evaluate the approach across four NLP tasks: Sentiment Analysis, Commonsense QA, Summarization, and Mathematical Reasoning, using a suite of frontier LLMs. For the first three tasks, a 20% prompt reduction incurs only a marginal loss in task performance, demonstrating that contemporary LLMs can reconstruct elided context from high-salience cues. In contrast, performance on mathematical reasoning deteriorates sharply, reflecting a stronger dependence on complete token continuity. Further analysis with bottom-k% and random-k% tokens reveals asymmetric performance patterns that may suggest potential task contamination effects, wherein models may resort to shallow memorized patterns from pretraining exposure for conventional NLP tasks. We posit that our work contributes to a more nuanced understanding of LLM behavior in performance-efficiency trade-offs, and delineate the boundary between tasks tolerant to contextual sparsity and those requiring exhaustive context. Our source code and models are available at: this https URL
摘要：大型语言模型 (LLM) 的出色性能在很大程度上归功于广泛的输入上下文，但这种冗长的内容会增加货币成本、碳足迹和推理时间延迟。这种开销的大部分表现在典型提示中存在的冗余低效用令牌，因为通常只有一小部分令牌承载大部分语义权重。我们通过引入 FrugalPrompt 来解决这种低效率问题，FrugalPrompt 是一种针对 LLM 的新型提示压缩框架，它仅保留语义上最重要的标记。利用两种最先进的标记归因方法 GlobEnc 和 DecompX，我们为输入序列中的每个标记分配显着性分数，对它们进行排序以保留前 k% 标记的原始顺序，并获得稀疏节俭提示。我们使用一套前沿法学硕士评估了四个 NLP 任务的方法：情感分析、常识 QA、总结和数学推理。对于前三项任务，20% 的即时减少只会导致任务绩效的边际损失，这表明当代法学硕士可以从高度显着的线索中重建被忽略的上下文。相比之下，数学推理的表现急剧恶化，反映出对完整令牌连续性的更强依赖。对底部 k% 和随机 k% 标记的进一步分析揭示了不对称的性能模式，这可能表明潜在的任务污染效应，其中模型可能会诉诸传统 NLP 任务的预训练暴露中的浅层记忆模式。我们认为我们的工作有助于更细致地理解法学硕士在绩效效率权衡中的行为，并划定容忍上下文稀疏的任务和需要详尽上下文的任务之间的界限。我们的源代码和模型可在以下位置获取：此 https URL
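
The selection step described in the FrugalPrompt abstract (score every token, keep the top-k% by salience, preserve original order) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `frugalize`, the rounding rule, the tie-breaking rule, and the example salience scores are all assumptions. In the paper the scores come from GlobEnc or DecompX, which are not reproduced here.

```python
import math

def frugalize(tokens, scores, keep_percent):
    """Keep the top `keep_percent`% of tokens by salience, in original order.

    `scores` is assumed to hold one salience value per token (hypothetical
    numbers here; the paper obtains them from an attribution method).
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if not tokens:
        return []
    # Assumption: round the kept count up and always keep at least one token.
    n_keep = max(1, math.ceil(len(tokens) * keep_percent / 100))
    # Rank positions by salience, highest first; ties favor the earlier token.
    ranked = sorted(range(len(tokens)), key=lambda i: (-scores[i], i))
    # Re-sort the surviving positions so the prompt keeps its original order.
    kept = sorted(ranked[:n_keep])
    return [tokens[i] for i in kept]

tokens = ["the", "movie", "was", "absolutely", "wonderful", "."]
scores = [0.02, 0.30, 0.05, 0.40, 0.90, 0.01]  # hypothetical salience scores
print(frugalize(tokens, scores, keep_percent=50))
```

A 20% prompt reduction, as reported in the abstract, corresponds to `keep_percent=80` in this sketch.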

Title: TrajSelector: Harnessing Latent Representations for Efficient and Effective Best-of-N in Large Reasoning Model

Authors: Bin Yu, Xinming Wang, Shijie Lian, Haotian Li, Changti Wu, Ruina Hu, Bailing Wang, Yuliang Wei, Kai Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.16449
Pdf URL: https://arxiv.org/pdf/2510.16449
Copy Paste: [[2510.16449]] TrajSelector: Harnessing Latent Representations for Efficient and Effective Best-of-N in Large Reasoning Model(https://arxiv.org/abs/2510.16449)
Keywords: language model, llm
Abstract: Large language models (LLMs) have shown remarkable progress in complex reasoning tasks, largely enabled by test-time scaling (TTS) paradigms that allocate additional compute during inference. Among these, external TTS (particularly the Best-of-N selection paradigm) yields scalable performance improvements by selecting from multiple independently generated reasoning trajectories. However, this approach faces key limitations: (i) the high computational overhead of deploying process reward models, and (ii) the underutilization of the LLM's intrinsic latent representations. We introduce TrajSelector, an efficient and effective Best-of-N framework that exploits the hidden states in the sampler LLM for process-level scoring. A lightweight verifier (with only 0.6B parameters) evaluates the quality of each trajectory step by step, and then aggregates these scores to identify the optimal reasoning trajectory. Our framework employs a fully data-driven, end-to-end training recipe that eliminates reliance on massive step-level annotations. Experimental results across five benchmarks demonstrate that TrajSelector delivers consistent performance gains. In Best-of-32 settings, it surpasses majority voting by 4.61% in accuracy and outperforms existing process reward models by 4.31% to 12.21%, all while maintaining lower inference costs.
摘要：大型语言模型 (LLM) 在复杂的推理任务中显示出显着的进步，这在很大程度上是通过在推理过程中分配额外计算的测试时间扩展 (TTS) 范式实现的。其中，外部 TTS（特别是 Best-of-N 选择范例）通过从多个独立生成的推理轨迹中进行选择，产生可扩展的性能改进。然而，这种方法面临着关键的局限性：(i) 部署过程奖励模型的计算开销很高，(ii) 法学硕士内在潜在表示的利用不足。我们引入 TrajSelector，这是一个高效且有效的 Best-of-N 框架，它利用采样器 LLM 中的隐藏状态进行流程级评分。轻量级验证器（仅具有 0.6B 个参数）评估逐步轨迹的质量，然后汇总这些分数以识别最佳推理轨迹。我们的框架采用完全数据驱动的端到端训练方法，消除了对大量步骤级注释的依赖。五个基准测试的经验结果表明 TrajSelector 能够提供一致的性能提升。在 Best-of-32 设置中，它的准确率比多数投票高 4.61%，比现有过程奖励模型高 4.31% 到 12.21%，同时保持较低的推理成本。
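
To make the contrast between score-based Best-of-N selection and majority voting concrete, here is a small sketch. It assumes each candidate trajectory already comes with per-step quality scores, and it aggregates them with a plain mean. The abstract does not state the aggregation rule, so the mean, the function names, and the example scores are illustrative assumptions rather than TrajSelector's actual procedure.

```python
from collections import Counter
from statistics import mean

def select_best_of_n(candidates):
    """Return the answer of the trajectory with the highest mean step score.

    `candidates` is a list of (final_answer, step_scores) pairs. The mean is
    an assumed aggregation rule, chosen only for illustration.
    """
    best_answer, _ = max(candidates, key=lambda c: mean(c[1]))
    return best_answer

def majority_vote(candidates):
    """Baseline: return the most frequent final answer."""
    return Counter(answer for answer, _ in candidates).most_common(1)[0][0]

# Hypothetical step scores for three sampled trajectories.
candidates = [
    ("42", [0.90, 0.80, 0.95]),
    ("41", [0.60, 0.40, 0.50]),
    ("41", [0.70, 0.30, 0.60]),
]
print(select_best_of_n(candidates), majority_vote(candidates))
```

In this toy case the two strategies disagree: the well-scored minority answer wins under score-based selection, while majority voting picks the more frequent one.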

Title: RAVEN: Robust Advertisement Video Violation Temporal Grounding via Reinforcement Reasoning

Authors: Deyi Ji, Yuekui Yang, Haiyang Wu, Shaoping Ma, Tianrun Chen, Lanyun Zhu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.16455
Pdf URL: https://arxiv.org/pdf/2510.16455
Copy Paste: [[2510.16455]] RAVEN: Robust Advertisement Video Violation Temporal Grounding via Reinforcement Reasoning(https://arxiv.org/abs/2510.16455)
Keywords: language model, llm
Abstract: Advertisement (Ad) video violation detection is critical for ensuring platform compliance, but existing methods struggle with precise temporal grounding, noisy annotations, and limited generalization. We propose RAVEN, a novel framework that integrates curriculum reinforcement learning with multimodal large language models (MLLMs) to enhance reasoning and cognitive capabilities for violation detection. RAVEN employs a progressive training strategy, combining precisely and coarsely annotated data, and leverages Group Relative Policy Optimization (GRPO) to develop emergent reasoning abilities without explicit reasoning annotations. Multiple sophisticated hierarchical reward mechanisms ensure precise temporal grounding and consistent category prediction. Experiments on industrial datasets and public benchmarks show that RAVEN achieves superior performance in violation category accuracy and temporal interval localization. We also design a pipeline to deploy RAVEN on online Ad services, and online A/B testing further validates its practical applicability, with significant improvements in precision and recall. RAVEN also demonstrates strong generalization, mitigating the catastrophic forgetting issue associated with supervised fine-tuning.
摘要：广告 (Ad) 视频违规检测对于确保平台合规性至关重要，但现有方法面临着精确的时间基础、嘈杂的注释和有限的泛化能力。我们提出了 RAVEN，这是一种新颖的框架，它将课程强化学习与多模态大语言模型（MLLM）相结合，以增强违规检测的推理和认知能力。 RAVEN 采用渐进式训练策略，结合精确和粗略注释的数据，并利用组相对策略优化 (GRPO) 来开发紧急推理能力，而无需明确的推理注释。多重层次的复杂奖励机制确保精确的时间基础和一致的类别预测。在工业数据集和公共基准上的实验表明，RAVEN 在违规类别准确性和时间间隔定位方面取得了优异的性能。我们还设计了将RAVEN部署到在线广告服务上的管道，在线A/B测试进一步验证了其实际适用性，在准确率和召回率方面都有显着提高。 RAVEN 还表现出了很强的泛化能力，减轻了与监督微调相关的灾难性遗忘问题。

Title: Check Yourself Before You Wreck Yourself: Selectively Quitting Improves LLM Agent Safety

Authors: Vamshi Krishna Bonagiri, Ponnurangam Kumaraguru, Khanh Nguyen, Benjamin Plaut
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.16492
Pdf URL: https://arxiv.org/pdf/2510.16492
Copy Paste: [[2510.16492]] Check Yourself Before You Wreck Yourself: Selectively Quitting Improves LLM Agent Safety(https://arxiv.org/abs/2510.16492)
Keywords: language model, llm, prompt, agent
Abstract: As Large Language Model (LLM) agents increasingly operate in complex environments with real-world consequences, their safety becomes critical. While uncertainty quantification is well-studied for single-turn tasks, multi-turn agentic scenarios with real-world tool access present unique challenges where uncertainties and ambiguities compound, leading to severe or catastrophic risks beyond traditional text generation failures. We propose using "quitting" as a simple yet effective behavioral mechanism for LLM agents to recognize and withdraw from situations where they lack confidence. Leveraging the ToolEmu framework, we conduct a systematic evaluation of quitting behavior across 12 state-of-the-art LLMs. Our results demonstrate a highly favorable safety-helpfulness trade-off: agents prompted to quit with explicit instructions improve safety by an average of +0.39 on a 0-3 scale across all models (+0.64 for proprietary models), while incurring a negligible average decrease of 0.03 in helpfulness. Our analysis demonstrates that simply adding explicit quit instructions is a highly effective safety mechanism that can immediately be deployed in existing agent systems, and establishes quitting as an effective first-line defense mechanism for autonomous agents in high-stakes applications.
摘要：随着大型语言模型（LLM）代理越来越多地在复杂的环境中运行并产生现实世界的后果，它们的安全变得至关重要。虽然对单轮任务的不确定性量化进行了充分研究，但具有现实世界工具访问的多轮代理场景提出了独特的挑战，其中不确定性和模糊性复合，导致超出传统文本生成失败的严重或灾难性风险。我们建议使用“退出”作为LLM代理人的一种简单而有效的行为机制，以识别并退出他们缺乏信心的情况。利用 ToolEmu 框架，我们对 12 个最先进的法学硕士的退出行为进行了系统评估。我们的结果证明了一种非常有利的安全性与有用性权衡：在所有模型中，通过明确指示提示退出的代理在 0-3 等级上平均将安全性提高 +0.39（专有模型为 +0.64），同时保持有用性的平均下降幅度可忽略不计 -0.03。我们的分析表明，简单地添加显式退出指令被证明是一种高效的安全机制，可以立即部署在现有代理系统中，并将退出建立为高风险应用中自主代理的有效一线防御机制。

Title: Automated Composition of Agents: A Knapsack Approach for Agentic Component Selection

Authors: Michelle Yuan, Khushbu Pahwa, Shuaichen Chang, Mustafa Kaba, Jiarong Jiang, Xiaofei Ma, Yi Zhang, Monica Sunkara
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.16499
Pdf URL: https://arxiv.org/pdf/2510.16499
Copy Paste: [[2510.16499]] Automated Composition of Agents: A Knapsack Approach for Agentic Component Selection(https://arxiv.org/abs/2510.16499)
Keywords: agent
Abstract: Designing effective agentic systems requires the seamless composition and integration of agents, tools, and models within dynamic and uncertain environments. Most existing methods rely on static, semantic retrieval approaches for tool or agent discovery. However, effective reuse and composition of existing components remain challenging due to incomplete capability descriptions and the limitations of retrieval methods. Component selection suffers because the decisions are not based on capability, cost, and real-time utility. To address these challenges, we introduce a structured, automated framework for agentic system composition that is inspired by the knapsack problem. Our framework enables a composer agent to systematically identify, select, and assemble an optimal set of agentic components by jointly considering performance, budget constraints, and compatibility. By dynamically testing candidate components and modeling their utility in real-time, our approach streamlines the assembly of agentic systems and facilitates scalable reuse of resources. Empirical evaluation with Claude 3.5 Sonnet across five benchmarking datasets shows that our online-knapsack-based composer consistently lies on the Pareto frontier, achieving higher success rates at significantly lower component costs compared to our baselines. In the single-agent setup, the online knapsack composer shows a success rate improvement of up to 31.6% in comparison to the retrieval baselines. In multi-agent systems, the online knapsack composer increases success rate from 37% to 87% when agents are selected from an agent inventory of 100+ agents. The substantial performance gap confirms the robust adaptability of our method across diverse domains and budget constraints.
摘要：设计有效的代理系统需要在动态和不确定的环境中无缝组合和集成代理、工具和模型。大多数现有方法依赖于静态、语义检索方法来进行工具或代理发现。然而，由于功能描述不完整和检索方法的限制，现有组件的有效重用和组合仍然具有挑战性。组件选择受到影响，因为决策不是基于功能、成本和实时效用。为了应对这些挑战，我们受背包问题的启发，引入了一种结构化的、自动化的代理系统组合框架。我们的框架使作曲家代理能够通过共同考虑性能、预算约束和兼容性来系统地识别、选择和组装一组最佳的代理组件。通过动态测试候选组件并对其效用进行实时建模，我们的方法简化了代理系统的组装并促进了资源的可扩展重用。使用 Claude 3.5 Sonnet 在五个基准数据集上进行的实证评估表明，我们基于在线背包的作曲家始终位于帕累托前沿，与我们的基线相比，以显着降低的组件成本实现了更高的成功率。在单代理设置中，与检索基线相比，在线背包编辑器的成功率提高了高达 31.6%。在多智能体系统中，当从 100 多个智能体的智能体库存中选择智能体时，在线背包编辑器将成功率从 37% 提高到 87%。巨大的性能差距证实了我们的方法在不同领域和预算限制下的强大适应性。
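
The budget-constrained selection idea behind the knapsack framing can be illustrated with the classic offline 0/1 knapsack dynamic program. Note that this is a swapped-in, simpler technique: the paper describes an online knapsack composer that tests components dynamically, which this sketch does not attempt. The component names, integer costs, and integer utility estimates below are hypothetical.

```python
def compose(components, budget):
    """Pick the subset of components with maximum total utility within budget.

    `components` is a list of (name, cost, utility) with integer costs.
    Classic offline 0/1 knapsack, used here only to illustrate the framing.
    """
    # best[b] = (total utility, chosen names) using total cost at most b.
    best = [(0, [])] * (budget + 1)
    for name, cost, utility in components:
        # Walk budgets downward so each component is selected at most once.
        for b in range(budget, cost - 1, -1):
            candidate = best[b - cost][0] + utility
            if candidate > best[b][0]:
                best[b] = (candidate, best[b - cost][1] + [name])
    return best[budget]

# Hypothetical inventory: (name, cost units, estimated utility).
inventory = [
    ("search_tool", 3, 50),
    ("code_agent", 4, 60),
    ("math_agent", 2, 30),
]
print(compose(inventory, budget=5))
```

With a budget of 5, the single most useful component (`code_agent`) is not part of the best set, because two cheaper components together offer more utility. This is the kind of trade-off that similarity-only retrieval cannot capture.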

Title: ReviewGuard: Enhancing Deficient Peer Review Detection via LLM-Driven Data Augmentation

Authors: Haoxuan Zhang, Ruochi Li, Sarthak Shrestha, Shree Harshini Mamidala, Revanth Putta, Arka Krishan Aggarwal, Ting Xiao, Junhua Ding, Haihua Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.16549
Pdf URL: https://arxiv.org/pdf/2510.16549
Copy Paste: [[2510.16549]] ReviewGuard: Enhancing Deficient Peer Review Detection via LLM-Driven Data Augmentation(https://arxiv.org/abs/2510.16549)
Keywords: language model, gpt, llm, chat
Abstract: Peer review serves as the gatekeeper of science, yet the surge in submissions and widespread adoption of large language models (LLMs) in scholarly evaluation present unprecedented challenges. Recent work has focused on using LLMs to improve review efficiency or generate insightful review content. However, unchecked deficient reviews from both human experts and AI systems threaten to systematically undermine the peer review ecosystem and compromise academic integrity. To address this critical issue, we introduce ReviewGuard, an automated system for detecting and categorizing deficient reviews. ReviewGuard employs a comprehensive four-stage LLM-driven framework that: (1) collects ICLR and NeurIPS papers with their corresponding reviews from OpenReview; (2) annotates review types using GPT-4.1 with human validation; (3) addresses class imbalance and data scarcity through LLM-driven synthetic data augmentation, producing a final corpus of 6,634 papers, 24,657 real reviews, and 46,438 synthetic reviews; and (4) fine-tunes both encoder-based models and open source LLMs. We perform comprehensive feature analysis of the structure and quality of the review text. Compared to sufficient reviews, deficient reviews demonstrate lower rating scores, higher self-reported confidence, reduced structural complexity, and a higher proportion of negative sentiment. AI-generated text detection reveals that, since ChatGPT's emergence, AI-generated reviews have increased dramatically. In the evaluation of deficient review detection models, mixed training with synthetic and real review data provides substantial enhancements to recall and F1 scores on the binary task. This study presents the first LLM-driven system for detecting deficient peer reviews, providing evidence to inform AI governance in peer review while offering valuable insights into human-AI collaboration to maintain academic integrity.
摘要：同行评审是科学的守门人，但学术评估中提交量的激增和大语言模型 (LLM) 的广泛采用带来了前所未有的挑战。最近的工作重点是使用法学硕士来提高审稿效率或生成富有洞察力的审稿内容。然而，人类专家和人工智能系统未经检查的缺陷评审可能会系统性地破坏同行评审生态系统并损害学术诚信。为了解决这个关键问题，我们引入了 ReviewGuard，这是一个用于检测和分类缺陷评论的自动化系统。 ReviewGuard 采用全面的四阶段 LLM 驱动框架：(1) 从 OpenReview 收集 ICLR 和 NeurIPS 论文及其相应评论； (2) 使用 GPT-4.1 和人工验证来注释评论类型； (3) 通过 LLM 驱动的合成数据增强解决类别不平衡和数据稀缺问题，生成包含 6,634 篇论文、24,657 条真实评论和 46,438 条综合评论的最终语料库； (4) 微调基于编码器的模型和开源 LLM。我们对评论文本的结构和质量进行全面的特征分析。与充足的评论相比，不足的评论表现出较低的评分、较高的自我报告信心、较低的结构复杂性以及较高比例的负面情绪。人工智能生成的文本检测表明，自从 ChatGPT 出现以来，人工智能生成的评论急剧增加。在评估有缺陷的评论检测模型时，使用合成和真实评论数据进行混合训练可以显着增强二元任务的召回率和 F1 分数。这项研究提出了第一个由法学硕士驱动的系统，用于检测有缺陷的同行评审，为同行评审中的人工智能治理提供证据，同时为人类与人工智能的协作提供有价值的见解，以维护学术诚信。

Title: Language over Content: Tracing Cultural Understanding in Multilingual Large Language Models

Authors: Seungho Cho, Changgeon Ko, Eui Jun Hwang, Junmyeong Lee, Huije Lee, Jong C. Park
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.16565
Pdf URL: https://arxiv.org/pdf/2510.16565
Copy Paste: [[2510.16565]] Language over Content: Tracing Cultural Understanding in Multilingual Large Language Models(https://arxiv.org/abs/2510.16565)
Keywords: language model, llm
Abstract: Large language models (LLMs) are increasingly used across diverse cultural contexts, making accurate cultural understanding essential. Prior evaluations have mostly focused on output-level performance, obscuring the factors that drive differences in responses, while studies using circuit analysis have covered few languages and rarely focused on culture. In this work, we trace LLMs' internal cultural understanding mechanisms by measuring activation path overlaps when answering semantically equivalent questions under two conditions: varying the target country while fixing the question language, and varying the question language while fixing the country. We also use same-language country pairs to disentangle language from cultural aspects. Results show that internal paths overlap more for same-language, cross-country questions than for cross-language, same-country questions, indicating strong language-specific patterns. Notably, the South Korea-North Korea pair exhibits low overlap and high variability, showing that linguistic similarity does not guarantee aligned internal representation.
摘要：大语言模型 (LLM) 越来越多地在不同的文化背景中使用，这使得准确的文化理解变得至关重要。之前的评估主要集中在产出水平的表现上，掩盖了导致反应差异的因素，而使用电路分析的研究只涵盖了少数语言，很少关注文化。在这项工作中，我们通过测量在两种条件下回答语义等效问题时的激活路径重叠来追踪法学硕士的内部文化理解机制：在固定问题语言的同时改变目标国家，以及在固定国家/地区的同时改变问题语言。我们还使用相同语言的国家/地区对来将语言与文化方面分开。结果表明，同语言跨国问题的内部路径重叠程度高于跨语言同一国家问题，这表明了强烈的语言特定模式。值得注意的是，韩国-朝鲜对表现出低重叠和高变异性，表明语言相似性并不能保证内部表征的一致。

Title: Hallucination Benchmark for Speech Foundation Models

Authors: Alkis Koudounas, Moreno La Quatra, Manuel Giollo, Sabato Marco Siniscalchi, Elena Baralis
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2510.16567
Pdf URL: https://arxiv.org/pdf/2510.16567
Copy Paste: [[2510.16567]] Hallucination Benchmark for Speech Foundation Models(https://arxiv.org/abs/2510.16567)
Keywords: hallucination
Abstract: Hallucinations in automatic speech recognition (ASR) systems refer to fluent and coherent transcriptions produced by neural ASR models that are completely unrelated to the underlying acoustic input (i.e., the speech signal). While similar to conventional decoding errors in potentially compromising the usability of transcriptions for downstream applications, hallucinations can be more detrimental due to their preservation of syntactically and semantically plausible structure. This apparent coherence can mislead subsequent processing stages and introduce serious risks, particularly in critical domains such as healthcare and law. Conventional evaluation metrics are primarily centered on error-based metrics and fail to distinguish between phonetic inaccuracies and hallucinations. Consequently, there is a critical need for new evaluation frameworks that can effectively identify and assess models with a heightened propensity for generating hallucinated content. To this end, we introduce SHALLOW, the first benchmark framework that systematically categorizes and quantifies hallucination phenomena in ASR along four complementary axes: lexical, phonetic, morphological, and semantic. We define targeted metrics within each category to produce interpretable profiles of model behavior. Through evaluation across various architectures and speech domains, we have found that SHALLOW metrics correlate strongly with word error rate (WER) when recognition quality is high (i.e., low WER). Still, this correlation weakens substantially as WER increases. SHALLOW, therefore, captures fine-grained error patterns that WER fails to distinguish under degraded and challenging conditions. Our framework supports specific diagnosis of model weaknesses and provides feedback for model improvement beyond what aggregate error rates can offer.
摘要：自动语音识别 (ASR) 系统中的幻觉是指神经 ASR 模型产生的流畅且连贯的转录，与底层声学输入（即语音信号）完全无关。虽然与传统的解码错误类似，可能会损害下游应用程序转录的可用性，但幻觉可能更有害，因为它们保留了句法和语义上合理的结构。这种明显的一致性可能会误导后续处理阶段并带来严重风险，特别是在医疗保健和法律等关键领域。传统的评估指标主要集中在基于错误的指标上，无法区分语音不准确和幻觉。因此，迫切需要新的评估框架，能够有效识别和评估更容易生成幻觉内容的模型。为此，我们引入了 SHALLOW，这是第一个基准框架，它沿着四个互补轴（词汇、语音、形态和语义）系统地对 ASR 中的幻觉现象进行分类和量化。我们在每个类别中定义目标指标，以生成可解释的模型行为概况。通过对各种架构和语音领域的评估，我们发现当识别质量较高（即低 WER）时，SHALLOW 指标与单词错误率（WER）密切相关。不过，随着 WER 的增加，这种相关性会大幅减弱。因此，SHALLOW 捕获了 WER 在退化和挑战性条件下无法区分的细粒度错误模式。我们的框架支持对模型弱点的具体诊断，并提供超出总体错误率所能提供的模型改进反馈。
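
For readers unfamiliar with the metric that SHALLOW is compared against, word error rate (WER) is the word-level edit distance between reference and hypothesis divided by the reference length. The sketch below uses whitespace tokenization and invented example sentences; it is not part of the SHALLOW benchmark itself.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Word-level Levenshtein distance, computed one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)

reference = "the patient has a mild fever"
slip = "the patient has a mile fever"        # phonetically close slip
changed = "the patient has a mild cough"     # meaning-changing substitution
print(word_error_rate(reference, slip), word_error_rate(reference, changed))
```

Both hypotheses receive the same WER even though the errors differ in kind, which is the sort of distinction an aggregate error rate cannot express.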

Title: AI-Generated Text Detection in Low-Resource Languages: A Case Study on Urdu

Authors: Muhammad Ammar, Hadiya Murad Hadi, Usman Majeed Butt
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.16573
Pdf URL: https://arxiv.org/pdf/2510.16573
Copy Paste: [[2510.16573]] AI-Generated Text Detection in Low-Resource Languages: A Case Study on Urdu(https://arxiv.org/abs/2510.16573)
Keywords: language model, gpt, llm
Abstract: Large Language Models (LLMs) are now capable of generating text that closely resembles human writing, making them powerful tools for content creation, but this growing ability has also made it harder to tell whether a piece of text was written by a human or by a machine. This challenge becomes even more serious for languages like Urdu, where there are very few tools available to detect AI-generated text. To address this gap, we propose a novel AI-generated text detection framework tailored for the Urdu language. A balanced dataset comprising 1,800 human-authored and 1,800 AI-generated texts, sourced from models such as Gemini, GPT-4o-mini, and Kimi AI, was developed. Detailed linguistic and statistical analysis was conducted, focusing on features such as character and word counts, vocabulary richness (Type-Token Ratio), and N-gram patterns, with significance evaluated through t-tests and Mann-Whitney U tests. Three state-of-the-art multilingual transformer models, mdeberta-v3-base, distilbert-base-multilingual-cased, and xlm-roberta-base, were fine-tuned on this dataset. mDeBERTa-v3-base achieved the highest performance, with an F1-score of 91.29 and an accuracy of 91.26% on the test set. This research advances efforts to combat misinformation and academic misconduct in Urdu-speaking communities and contributes to the broader development of NLP tools for low-resource languages.
摘要：大型语言模型 (LLM) 现在能够生成与人类书写非常相似的文本，使其成为内容创建的强大工具，但这种不断增强的能力也使得辨别一段文本是由人类还是机器编写的变得更加困难。对于像乌尔都语这样的语言来说，这一挑战变得更加严峻，因为在这些语言中，可用于检测人工智能生成文本的工具很少。为了解决这一差距，我们提出了一种专为乌尔都语语言量身定制的新型人工智能生成文本检测框架。开发了一个平衡数据集，其中包含 1,800 名人类创作的文本和 1,800 条人工智能生成的文本，这些文本源自 Gemini、GPT-4o-mini 和 Kimi AI 等模型。我们进行了详细的语言和统计分析，重点关注字符和字数、词汇丰富度（类型标记比率）和 N-gram 模式等特征，并通过 t 检验和 MannWhitney U 检验评估显着性。在此数据集上对 mdeberta-v3-base、distilbert-base-multilingualcased 和 xlm-roberta-base 等三种最先进的多语言 Transformer 模型进行了微调。 mDeBERTa-v3-base 取得了最高的性能，在测试集上的 F1 分数为 91.29，准确率为 91.26%。这项研究推动了乌尔都语社区中对抗错误信息和学术不端行为的努力，并有助于更广泛地开发针对低资源语言的 NLP 工具。
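
Of the features listed in the abstract, vocabulary richness is the simplest to show: the Type-Token Ratio is the number of distinct tokens divided by the total number of tokens. The sketch below assumes whitespace tokenization, which may differ from the preprocessing used in the paper, and the example sentence is invented.

```python
def type_token_ratio(text):
    """Distinct tokens divided by total tokens (whitespace tokenization)."""
    tokens = text.split()
    if not tokens:
        return 0.0  # assumption: define TTR of empty text as 0
    return len(set(tokens)) / len(tokens)

sample = "the cat saw the other cat"
print(type_token_ratio(sample))  # 4 distinct tokens out of 6
```

Because `str.split()` works on any Unicode text separated by whitespace, the same function applies to Urdu script without modification, although raw TTR is sensitive to text length when comparing documents of different sizes.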

Title: Fine-tuning of Large Language Models for Constituency Parsing Using a Sequence to Sequence Approach

Authors: Francisco Jose Cortes Delgado, Eduardo Martinez Gracia, Rafael Valencia Garcia
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.16604
Pdf URL: https://arxiv.org/pdf/2510.16604
Copy Paste: [[2510.16604]] Fine-tuning of Large Language Models for Constituency Parsing Using a Sequence to Sequence Approach(https://arxiv.org/abs/2510.16604)
Keywords: language model, llm
Abstract: Recent advances in natural language processing with large neural models have opened new possibilities for syntactic analysis based on machine learning. This work explores a novel approach to phrase-structure analysis by fine-tuning large language models (LLMs) to translate an input sentence into its corresponding syntactic structure. The main objective is to extend the capabilities of MiSintaxis, a tool designed for teaching Spanish syntax. Several models from the Hugging Face repository were fine-tuned using training data generated from the AnCora-ES corpus, and their performance was evaluated using the F1 score. The results demonstrate high accuracy in phrase-structure analysis and highlight the potential of this methodology.
摘要：大型神经模型自然语言处理的最新进展为基于机器学习的句法分析开辟了新的可能性。这项工作通过微调大型语言模型 (LLM) 将输入句子翻译为其相应的句法结构，探索了一种新的短语结构分析方法。主要目标是扩展 MiSintaxis 的功能，MiSintaxis 是一款专为西班牙语语法教学而设计的工具。 Hugging Face 存储库中的几个模型使用 AnCora-ES 语料库生成的训练数据进行了微调，并使用 F1 分数评估了它们的性能。结果表明短语结构分析具有很高的准确性，并凸显了该方法的潜力。

Title: Unleashing Diverse Thinking Modes in LLMs through Multi-Agent Collaboration

Authors: Zhixuan He, Yue Feng
Subjects: cs.CL, cs.AI, cs.LG, cs.MA
Abstract URL: https://arxiv.org/abs/2510.16645
Pdf URL: https://arxiv.org/pdf/2510.16645
Copy Paste: [[2510.16645]] Unleashing Diverse Thinking Modes in LLMs through Multi-Agent Collaboration(https://arxiv.org/abs/2510.16645)
Keywords: language model, llm, agent
Abstract: Large Language Models (LLMs) demonstrate strong performance but often lack interpretable reasoning. This paper introduces the Multi-Agent Collaboration Framework for Diverse Thinking Modes (DiMo), which enhances both performance and interpretability by simulating a structured debate among four specialized LLM agents. Each agent embodies a distinct reasoning paradigm, allowing the framework to collaboratively explore diverse cognitive approaches. Through iterative debate, agents challenge and refine initial responses, yielding more robust conclusions and an explicit, auditable reasoning chain. Across six benchmarks and under a unified open-source setup, DiMo improves accuracy over widely used single-model and debate baselines, with the largest gains on math. We position DiMo as a semantics-aware, Web-native multi-agent framework: it models human-machine intelligence with LLM agents that produce semantically typed, URL-annotated evidence chains for explanations and user-friendly interactions. Although our experiments use standard reasoning benchmarks, the framework is designed to be instantiated over Web corpora and knowledge graphs, combining retrieval-augmented reasoning with structured justifications that downstream systems can inspect and reuse.
摘要：大型语言模型 (LLM) 表现出强大的性能，但通常缺乏可解释的推理。本文介绍了多元化思维模式的多智能体协作框架 (DiMo)，该框架通过模拟四个专业 LLM 智能体之间的结构化辩论来增强性能和可解释性。每个代理都体现了独特的推理范式，允许框架协作探索不同的认知方法。通过迭代辩论，智能体挑战并完善最初的响应，产生更可靠的结论和明确的、可审计的推理链。在统一的开源设置下，在六个基准测试中，DiMo 提高了广泛使用的单一模型和辩论基线的准确性，其中数学方面的收益最大。我们将 DiMo 定位为语义感知、Web 原生多代理框架：它使用 LLM 代理对人机智能进行建模，这些代理生成语义类型、URL 注释的证据链，用于解释和用户友好的交互。尽管我们的实验使用标准推理基准，但该框架旨在通过网络语料库和知识图进行实例化，将检索增强推理与下游系统可以检查和重用的结构化理由相结合。

Title: All You Need is One: Capsule Prompt Tuning with a Single Vector

Authors: Yiyang Liu, James C. Liang, Heng Fan, Wenhao Yang, Yiming Cui, Xiaotian Han, Lifu Huang, Dongfang Liu, Qifan Wang, Cheng Han
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.16670
Pdf URL: https://arxiv.org/pdf/2510.16670
Copy Paste: [[2510.16670]] All You Need is One: Capsule Prompt Tuning with a Single Vector(https://arxiv.org/abs/2510.16670)
Keywords: language model, llm, prompt
Abstract: Prompt-based learning has emerged as a parameter-efficient finetuning (PEFT) approach to facilitate Large Language Model (LLM) adaptation to downstream tasks by conditioning generation with task-aware guidance. Despite its successes, current prompt-based learning methods rely heavily on laborious grid searching for the optimal prompt length and typically require a considerable number of prompts, introducing additional computational burden. Worse yet, our pioneering findings indicate that the task-aware prompt design is inherently limited by its absence of instance-aware information, leading to a subtle attention interplay with the input sequence. In contrast, simply incorporating instance-aware information as a part of the guidance can enhance the prompt-tuned model performance without additional fine-tuning. Moreover, we find an interesting phenomenon, namely the "attention anchor": incorporating instance-aware tokens at the earliest position of the sequence can successfully preserve strong attention to critical structural information and exhibit more active attention interaction with all input tokens. In light of our observation, we introduce Capsule Prompt-Tuning (CaPT), an efficient and effective solution that brings off-the-shelf, informative instance semantics into prompt-based learning. Our approach innovatively integrates both instance-aware and task-aware information in a nearly parameter-free manner (i.e., one single capsule prompt). Empirical results demonstrate that our method can exhibit superior performance across various language tasks (e.g., 84.03% average accuracy on T5-Large), serving as an "attention anchor," while enjoying high parameter efficiency (e.g., 0.003% of model parameters on Llama3.2-1B).
摘要：基于提示的学习已成为一种参数高效的微调（PEFT）方法，通过任务感知指导调节生成，促进大语言模型（LLM）适应下游任务。尽管取得了成功，但当前基于提示的学习方法严重依赖于费力的网格搜索来寻找最佳提示长度，并且通常需要大量提示，从而引入额外的计算负担。更糟糕的是，我们的先驱发现表明，任务感知提示设计本质上受到缺乏实例感知信息的限制，导致注意力与输入序列之间产生微妙的相互作用。相比之下，简单地将实例感知信息作为指导的一部分可以增强即时调整的模型性能，而无需额外的微调。此外，我们发现了一个有趣的现象，即“注意力锚”，即在序列的最早位置合并实例感知标记可以成功地保留对关键结构信息的强烈关注，并表现出与所有输入标记更活跃的注意力交互。根据我们的观察，我们引入了 Capsule Prompt-Tuning (CaPT)，这是一种高效且有效的解决方案，它利用现成的、信息丰富的实例语义来进行基于提示的学习。我们的方法以一种几乎无参数的方式（即一个胶囊提示）创新性地集成了实例感知和任务感知信息。实证结果表明，我们的方法可以在各种语言任务中表现出优异的性能（例如，T5-Large 上的平均准确率为 84.03%），充当“注意力锚”，同时享有高参数效率（例如，Llama3.2-1B 上的模型参数为 0.003%）。

Title: Temporal Understanding under Deictic Frame of Reference

Authors: Damin Zhang, Julia Rayz
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.16685
Pdf URL: https://arxiv.org/pdf/2510.16685
Copy Paste: [[2510.16685]] Temporal Understanding under Deictic Frame of Reference(https://arxiv.org/abs/2510.16685)
Keywords: language model, llm, prompt
Abstract: Understanding time is fundamental to human cognition, where temporal experience is often conceptualized through spatial metaphors grounded in sensory-motor experience. For example, "summer is approaching" parallels "We are approaching the summer". In such expressions, humans rely on a frame of reference (FoR) to interpret meaning relative to a particular viewpoint. Extending this concept to time, a temporal frame of reference (t-FoR) defines how temporal relations are perceived relative to an experiencer's moment of "now". While Large Language Models (LLMs) have shown remarkable advances in natural language understanding, their ability to interpret and reason about time remains limited. In this work, we introduce TUuD (Temporal Understanding under Deictic t-FoR), a framework that evaluates how LLMs interpret time-event and event-event relations when the reference point of "now" dynamically shifts along a timeline. Following recent work on temporal cognition \cite{li2025other}, LLMs are prompted to rate the similarity between the current moment and a target event from 0.00 (completely dissimilar) to 1.00 (highly similar), where similarity quantifies perceived temporal alignment between the two points. Our results show that four evaluated LLMs exhibit measurable adaptation to a deictic t-FoR, with similarity ratings peaking around the present and decreasing toward past and future events. The adaptation, however, weakens beyond near-term contexts, suggesting that while LLMs display partial human-like temporal cognition, their temporal reasoning remains sensitive to reference-frame shifts and temporal distance.
摘要：理解时间是人类认知的基础，时间体验通常通过基于感觉运动体验的空间隐喻来概念化。例如，“夏天即将来临”与“我们即将迎来夏天”相似。在此类表达中，人类依靠参考框架（FoR）来解释相对于特定观点的含义。将这个概念扩展到时间，时间参考框架（t-FoR）定义了如何相对于体验者的“现在”时刻感知时间关系。虽然大型语言模型 (LLM) 在自然语言理解方面取得了显着进步，但它们解释和推理时间的能力仍然有限。在这项工作中，我们介绍了 TUuD（Deictic t-FoR 下的时间理解），这是一个框架，用于评估当“现在”的参考点沿着时间线动态移动时，法学硕士如何解释时间-事件和事件-事件关系。在最近关于时间认知的研究\cite{li2025other}之后，法学硕士被提示对当前时刻和目标事件之间的相似性进行评分，从 0.00（完全不相似）到 1.00（高度相似），其中相似性量化了两点之间感知的时间对齐。我们的结果表明，四名接受评估的法学硕士表现出对指示性 t-FoR 的可测量的适应，相似性评级在当前发生时达到峰值，而在过去和未来事件发生时下降。然而，这种适应能力在短期背景下减弱，这表明虽然法学硕士表现出部分类似人类的时间认知，但他们的时间推理仍然对参考系移位和时间距离敏感。

Title: Investigating the Impact of Rationales for LLMs on Natural Language Understanding

Authors: Wenhang Shi, Shuqing Bian, Yiren Chen, Xinyi Zhang, Zhe Zhao, Pengfei Hu, Wei Lu, Xiaoyong Du
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.16686
Pdf URL: https://arxiv.org/pdf/2510.16686
Copy Paste: [[2510.16686]] Investigating the Impact of Rationales for LLMs on Natural Language Understanding(https://arxiv.org/abs/2510.16686)
Keywords: llm, chain-of-thought
Abstract: Chain-of-thought (CoT) rationales, which provide step-by-step reasoning to derive final answers, benefit LLMs in both inference and training. Incorporating rationales, either by generating them before answering during inference or by placing them before or after the original answers during training, significantly improves model performance on mathematical, symbolic and commonsense reasoning tasks. However, most work focuses on the role of rationales in these reasoning tasks, overlooking their potential impact on other important tasks like natural language understanding (NLU) tasks. In this work, we raise the question: Can rationales similarly benefit NLU tasks? To conduct a systematic exploration, we construct NLURC, a comprehensive and high-quality NLU dataset collection with rationales, and develop various rationale-augmented methods. Through exploring the applicability of these methods on NLU tasks using the dataset, we uncover several potentially surprising findings: (1) CoT inference shifts from hindering NLU performance to surpassing direct label prediction as model size grows, indicating a positive correlation. (2) Most rationale-augmented training methods perform worse than label-only training, with one specially designed method consistently achieving improvements. (3) LLMs trained with rationales achieve significant performance gains on unseen NLU tasks, rivaling models ten times their size, while delivering interpretability on par with commercial LLMs.
摘要：思想链 (CoT) 原理提供了逐步推理以得出最终答案，使法学硕士在推理和培训方面受益匪浅。通过在推理过程中回答之前生成它们，或者通过在训练过程中将它们放置在原始答案之前或之后来合并基本原理，可以显着提高模型在数学、符号和常识推理任务上的性能。然而，大多数工作都集中在这些推理任务中的基本原理的作用，而忽略了它们对自然语言理解（NLU）任务等其他重要任务的潜在影响。在这项工作中，我们提出了一个问题：基本原理是否同样有利于 NLU 任务？为了进行系统性探索，我们构建了 NLURC（一个全面且高质量的带有基本原理的 NLU 数据集集合），并开发了各种基本原理增强方法。通过使用数据集探索这些方法在 NLU 任务上的适用性，我们发现了几个可能令人惊讶的发现：（1）随着模型大小的增长，CoT 推理从阻碍 NLU 性能转变为超越直接标签预测，这表明存在正相关性。 (2) 大多数基本原理增强训练方法的表现都比仅标签训练差，但一种专门设计的方法始终取得改进。 (3) 经过理论训练的法学硕士在未见过的 NLU 任务上取得了显着的性能提升，可与十倍于其规模的模型相媲美，同时提供与商业法学硕士同等的可解释性。

Title: The Chameleon Nature of LLMs: Quantifying Multi-Turn Stance Instability in Search-Enabled Language Models

Authors: Shivam Ratnakar, Sanjay Raghavendra
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.16712
Pdf URL: https://arxiv.org/pdf/2510.16712
Copy Paste: [[2510.16712]] The Chameleon Nature of LLMs: Quantifying Multi-Turn Stance Instability in Search-Enabled Language Models(https://arxiv.org/abs/2510.16712)
Keywords: language model, gpt, llm
Abstract: Integration of Large Language Models with search/retrieval engines has become ubiquitous, yet these systems harbor a critical vulnerability that undermines their reliability. We present the first systematic investigation of "chameleon behavior" in LLMs: their alarming tendency to shift stances when presented with contradictory questions in multi-turn conversations (especially in search-enabled LLMs). Through our novel Chameleon Benchmark Dataset, comprising 17,770 carefully crafted question-answer pairs across 1,180 multi-turn conversations spanning 12 controversial domains, we expose fundamental flaws in state-of-the-art systems. We introduce two theoretically grounded metrics: the Chameleon Score (0-1) that quantifies stance instability, and Source Re-use Rate (0-1) that measures knowledge diversity. Our rigorous evaluation of Llama-4-Maverick, GPT-4o-mini, and Gemini-2.5-Flash reveals consistent failures: all models exhibit severe chameleon behavior (scores 0.391-0.511), with GPT-4o-mini showing the worst performance. Crucially, small across-temperature variance (less than 0.004) suggests the effect is not a sampling artifact. Our analysis uncovers the mechanism: strong correlations between source re-use rate and confidence (r=0.627) and stance changes (r=0.429) are statistically significant (p less than 0.05), indicating that limited knowledge diversity makes models pathologically deferential to query framing. These findings highlight the need for comprehensive consistency evaluation before deploying LLMs in healthcare, legal, and financial systems where maintaining coherent positions across interactions is critical for reliable decision support.
摘要：大型语言模型与搜索/检索引擎的集成已变得无处不在，但这些系统存在严重的漏洞，削弱了其可靠性。我们对法学硕士中的“变色龙行为”进行了首次系统调查：当在多回合对话中（尤其是在支持搜索的法学硕士中）提出矛盾问题时，他们会改变立场，这一趋势令人震惊。通过我们新颖的 Chameleon 基准数据集（涵盖 12 个有争议领域的 1,180 个多轮对话中的 17,770 个精心设计的问答对），我们揭露了最先进系统的根本缺陷。我们引入了两个基于理论的指标：量化立场不稳定性的变色龙分数（0-1）和衡量知识多样性的源重用率（0-1）。我们对 Llama-4-Maverick、GPT-4o-mini 和 Gemini-2.5-Flash 的严格评估揭示了一致的失败：所有模型都表现出严重的变色龙行为（得分 0.391-0.511），其中 GPT-4o-mini 显示出最差的性能。至关重要的是，较小的跨温度方差（小于 0.004）表明该效应不是采样伪影。我们的分析揭示了这一机制：源重用率和置信度 (r=0.627) 以及立场变化 (r=0.429) 之间的强相关性具有统计显着性（p 小于 0.05），表明有限的知识多样性使得模型病态地服从查询框架。这些发现强调了在医疗保健、法律和金融系统中部署法学硕士之前需要进行全面的一致性评估，在这些系统中，在交互中保持一致的立场对于可靠的决策支持至关重要。

Title: so much depends / upon / a whitespace: Why Whitespace Matters for Poets and LLMs

Authors: Sriharsh Bhyravajjula, Melanie Walsh, Anna Preus, Maria Antoniak
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.16713
Pdf URL: https://arxiv.org/pdf/2510.16713
Copy Paste: [[2510.16713]] so much depends / upon / a whitespace: Why Whitespace Matters for Poets and LLMs(https://arxiv.org/abs/2510.16713)
Keywords: language model, llm
Abstract: Whitespace is a critical component of poetic form, reflecting both adherence to standardized forms and rebellion against those forms. Each poem's whitespace distribution reflects the artistic choices of the poet and is an integral semantic and spatial feature of the poem. Yet, despite the popularity of poetry as both a long-standing art form and as a generation task for large language models (LLMs), whitespace has not received sufficient attention from the NLP community. Using a corpus of 19k English-language published poems from Poetry Foundation, we investigate how 4k poets have used whitespace in their works. We release a subset of 2.8k public-domain poems with preserved formatting to facilitate further research in this area. We compare whitespace usage in the published poems to (1) 51k LLM-generated poems, and (2) 12k unpublished poems posted in an online community. We also explore whitespace usage across time periods, poetic forms, and data sources. Additionally, we find that different text processing methods can result in significantly different representations of whitespace in poetry data, motivating us to use these poems and whitespace patterns to discuss implications for the processing strategies used to assemble pretraining datasets for LLMs.
摘要：空白是诗歌形式的重要组成部分，反映了对标准化形式的坚持和对这些形式的反叛。每首诗的空白分布反映了诗人的艺术选择，是诗歌整体的语义和空间特征。然而，尽管诗歌作为一种长期存在的艺术形式和大型语言模型（LLM）的生成任务而广受欢迎，但空白并没有得到 NLP 社区的足够重视。我们使用诗歌基金会的 19000 首英语已发表诗歌的语料库，研究了 4k 诗人如何在他们的作品中使用空白。我们发布了 2.8k 首公共领域诗歌的子集，并保留了格式，以促进该领域的进一步研究。我们将已发表诗歌中的空白使用情况与（1）LLM 生成的 51k 首诗歌和（2）在线社区中发布的 12000 首未发表的诗歌进行比较。我们还探索跨时间段、诗歌形式和数据源的空白用法。此外，我们发现不同的文本处理方法可能会导致诗歌数据中空白的表示显着不同，这促使我们使用这些诗歌和空白模式来讨论用于组装法学硕士预训练数据集的处理策略的影响。

Title: Beacon: Single-Turn Diagnosis and Mitigation of Latent Sycophancy in Large Language Models

Authors: Sanskar Pandey, Ruhaan Chopra, Angkul Puniya, Sohom Pal
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.16727
Pdf URL: https://arxiv.org/pdf/2510.16727
Copy Paste: [[2510.16727]] Beacon: Single-Turn Diagnosis and Mitigation of Latent Sycophancy in Large Language Models(https://arxiv.org/abs/2510.16727)
Keywords: language model, prompt
Abstract: Large language models internalize a structural trade-off between truthfulness and obsequious flattery, emerging from reward optimization that conflates helpfulness with polite submission. This latent bias, known as sycophancy, manifests as a preference for user agreement over principled reasoning. We introduce Beacon, a single-turn forced-choice benchmark that isolates this bias independent of conversational context, enabling precise measurement of the tension between factual accuracy and submissive bias. Evaluations across twelve state-of-the-art models reveal that sycophancy decomposes into stable linguistic and affective sub-biases, each scaling with model capacity. We further propose prompt-level and activation-level interventions that modulate these biases in opposing directions, exposing the internal geometry of alignment as a dynamic manifold between truthfulness and socially compliant judgment. Beacon reframes sycophancy as a measurable form of normative misgeneralization, providing a reproducible foundation for studying and mitigating alignment drift in large-scale generative systems.
摘要：大型语言模型内化了诚实与阿谀奉承之间的结构性权衡，这种权衡源于将乐于助人与礼貌服从混为一谈的奖励优化。这种潜在的偏见，被称为阿谀奉承，表现为对用户协议的偏好超过原则性推理。我们引入了 Beacon，这是一种单轮强制选择基准，可以将这种偏见与对话上下文隔离开来，从而能够精确测量事实准确性和顺从偏见之间的紧张关系。对十二个最先进模型的评估表明，阿谀奉承分解为稳定的语言和情感子偏差，每个子偏差都随着模型容量的变化而变化。我们进一步提出即时水平和激活水平的干预措施，以相反的方向调节这些偏见，将对齐的内部几何结构暴露为诚实与社会顺从判断之间的动态流形。 Beacon 将阿谀奉承重新定义为一种可测量的规范性错误概括形式，为研究和减轻大规模生成系统中的对齐漂移提供了可重复的基础。

Title: Enhancing Language Agent Strategic Reasoning through Self-Play in Adversarial Games

Authors: Yikai Zhang, Ye Rong, Siyu Yuan, Jiangjie Chen, Jian Xie, Yanghua Xiao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.16761
Pdf URL: https://arxiv.org/pdf/2510.16761
Copy Paste: [[2510.16761]] Enhancing Language Agent Strategic Reasoning through Self-Play in Adversarial Games(https://arxiv.org/abs/2510.16761)
Keywords: gpt, agent
Abstract: Existing language agents often encounter difficulties in dynamic adversarial games due to poor strategic reasoning. To mitigate this limitation, a promising approach is to allow agents to learn from game interactions automatically, without relying on costly expert-labeled data. Unlike static environments where agents receive fixed feedback or rewards, selecting appropriate opponents in dynamic adversarial games can significantly impact learning performance. However, the discussion of opponents in adversarial environments remains an area under exploration. In this paper, we propose a Step-level poliCy Optimization method through Play-And-Learn, SCO-PAL. Leveraging SCO-PAL, we conduct a detailed analysis of opponent selection by setting opponents at different levels and find that self-play is the most effective way to improve strategic reasoning in such adversarial environments. Utilizing SCO-PAL with self-play, we increase the average win rate against four opponents by approximately 30% compared to baselines and achieve a 54.76% win rate against GPT-4 in six adversarial games.
摘要：现有的语言智能体经常由于策略推理不佳而在动态对抗游戏中遇到困难。为了缓解这一限制，一种有前途的方法是允许代理自动从游戏交互中学习，而不依赖于昂贵的专家标记数据。与代理收到固定反馈或奖励的静态环境不同，在动态对抗游戏中选择合适的对手可以显着影响学习表现。然而，对抗性环境中对手的讨论仍然是一个正在探索的领域。在本文中，我们提出了一种通过 Play-And-Learn（SCO-PAL）进行逐步策略优化的方法。利用SCO-PAL，我们通过设置不同级别的对手来对对手的选择进行详细分析，发现自我对战是在这种对抗环境中提高策略推理的最有效方法。利用 SCO-PAL 和自我对战，我们将对 4 个对手的平均胜率比基线提高了约 30%，并在 6 场对抗性比赛中对 GPT-4 实现了 54.76% 的胜率。

Title: LC-Eval: A Bilingual Multi-Task Evaluation Benchmark for Long-Context Understanding

Authors: Sheikh Jubair, Arwa Omayrah, Amal Alshammari, Alhanoof Althnian, Abdulhamed Alothaimen, Norah A. Alzahrani, Shahad D. Alzaidi, Nora Al-Twairesh, Abdulmohsen Al-Thubaity
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.16783
Pdf URL: https://arxiv.org/pdf/2510.16783
Copy Paste: [[2510.16783]] LC-Eval: A Bilingual Multi-Task Evaluation Benchmark for Long-Context Understanding(https://arxiv.org/abs/2510.16783)
Keywords: language model, gpt, llm, long context
Abstract: Recent advancements in Large Language Models (LLMs) have demonstrated sophisticated capabilities, including the ability to process and comprehend extended contexts. These emergent capabilities necessitate rigorous evaluation methods to effectively assess their performance in long-context understanding. In this paper, we present LC-Eval, a bilingual, multi-task evaluation benchmark designed to evaluate long-context understanding in English and Arabic, targeting context lengths ranging from 4k to over 128k tokens. LC-Eval introduces four novel and challenging tasks: multi-document question answering, bilingual question answering, claim verification within a paragraph, and multiple-choice questions based on long contexts. These tasks are designed to assess LLMs' abilities in deep reasoning, document comprehension, information tracing, and bilingual information extraction and understanding. The benchmark includes datasets in both Arabic and English for each task, allowing for a comparative analysis of their performance across different text genres. Evaluations were conducted on both open-weight and closed LLMs, with results indicating that LC-Eval presents significant challenges. Even high-performing models, such as GPT-4o, struggled with certain tasks, highlighting the complexity and rigor of the benchmark.
摘要：大型语言模型 (LLM) 的最新进展展示了复杂的功能，包括处理和理解扩展上下文的能力。这些新兴能力需要严格的评估方法来有效评估其在长上下文理解中的表现。在本文中，我们提出了 \textbf{LC-Eval}，这是一种双语、多任务评估基准，旨在评估英语和阿拉伯语的长上下文理解，目标上下文长度范围从 4k 到超过 128k 标记。 LC-Eval 引入了四个新颖且具有挑战性的任务：多文档问答、双语问答、段落内的主张验证以及基于长上下文的多项选择题。这些任务旨在评估法学硕士在深度推理、文档理解、信息追踪以及双语信息提取和理解方面的能力。该基准包括每个任务的阿拉伯语和英语数据集，可以对不同文本类型的表现进行比较分析。对开放式和封闭式法学硕士进行了评估，结果表明 LC-Eval 提出了重大挑战。即使是高性能模型，例如 GPT-4o，在某些任务上也会遇到困难，这凸显了基准测试的复杂性和严格性。

Title: MOSAIC: Masked Objective with Selective Adaptation for In-domain Contrastive Learning

Authors: Vera Pavlova, Mohammed Makhlouf
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.16797
Pdf URL: https://arxiv.org/pdf/2510.16797
Copy Paste: [[2510.16797]] MOSAIC: Masked Objective with Selective Adaptation for In-domain Contrastive Learning(https://arxiv.org/abs/2510.16797)
Keywords: language model
Abstract: We introduce MOSAIC (Masked Objective with Selective Adaptation for In-domain Contrastive learning), a multi-stage framework for domain adaptation of sentence embedding models that incorporates joint domain-specific masked supervision. Our approach addresses the challenges of adapting large-scale general-domain sentence embedding models to specialized domains. By jointly optimizing masked language modeling (MLM) and contrastive objectives within a unified training pipeline, our method enables effective learning of domain-relevant representations while preserving the robust semantic discrimination properties of the original model. We empirically validate our approach on both high-resource and low-resource domains, achieving improvements of up to 13.4% in NDCG@10 (Normalized Discounted Cumulative Gain) over strong general-domain baselines. Comprehensive ablation studies further demonstrate the effectiveness of each component, highlighting the importance of balanced joint supervision and staged adaptation.
摘要：我们介绍了 MOSAIC（域内对比学习选择性适应的屏蔽目标），这是一种用于句子嵌入模型的域适应的多阶段框架，其中结合了联合特定领域屏蔽监督。我们的方法解决了将大规模通用领域句子嵌入模型适应专业领域的挑战。通过在统一的训练管道中联合优化掩码语言建模（MLM）和对比目标，我们的方法能够有效学习领域相关的表示，同时保留原始模型的鲁棒语义辨别特性。我们在高资源和低资源领域对我们的方法进行了实证验证，与强大的通用领域基线相比，NDCG@10（标准化折扣累积增益）实现了高达 13.4% 的改进。综合消融研究进一步证明了每个组成部分的有效性，强调了平衡联合监督和阶段性适应的重要性。
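
Note on the metric: the MOSAIC abstract reports gains in NDCG@10. As a reading aid, here is a minimal Python sketch of the standard NDCG@k definition with linear gain. It is not the authors' evaluation code. The function names and the input convention (a list of graded relevance scores for all judged documents, in the order the system ranked them) are assumptions made for illustration.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k items of a ranked list.

    relevances: graded relevance scores in ranked order (hypothetical input format).
    The item at 0-based position i is discounted by log2(i + 2).
    """
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=10):
    """DCG of the given ranking divided by the DCG of the ideal ordering.

    Assumes `relevances` covers every judged document for the query, so the
    ideal ordering is simply the same scores sorted in descending order.
    """
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0
```

A perfectly ordered list scores 1.0, and placing the only relevant document second rather than first lowers the score to 1/log2(3), roughly 0.63.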

Title: Knowing the Facts but Choosing the Shortcut: Understanding How Large Language Models Compare Entities

Authors: Hans Hergen Lehmann, Jae Hee Lee, Steven Schockaert, Stefan Wermter
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.16815
Pdf URL: https://arxiv.org/pdf/2510.16815
Copy Paste: [[2510.16815]] Knowing the Facts but Choosing the Shortcut: Understanding How Large Language Models Compare Entities(https://arxiv.org/abs/2510.16815)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Large Language Models (LLMs) are increasingly used for knowledge-based reasoning tasks, yet understanding when they rely on genuine knowledge versus superficial heuristics remains challenging. We investigate this question through entity comparison tasks by asking models to compare entities along numerical attributes (e.g., "Which river is longer, the Danube or the Nile?"), which offer clear ground truth for systematic analysis. Despite having sufficient numerical knowledge to answer correctly, LLMs frequently make predictions that contradict this knowledge. We identify three heuristic biases that strongly influence model predictions: entity popularity, mention order, and semantic co-occurrence. For smaller models, a simple logistic regression using only these surface cues predicts model choices more accurately than the model's own numerical predictions, suggesting heuristics largely override principled reasoning. Crucially, we find that larger models (32B parameters) selectively rely on numerical knowledge when it is more reliable, while smaller models (7-8B parameters) show no such discrimination, which explains why larger models outperform smaller ones even when the smaller models possess more accurate knowledge. Chain-of-thought prompting steers models of all sizes towards using the numerical features.
摘要：大型语言模型 (LLM) 越来越多地用于基于知识的推理任务，但理解它们何时依赖真正的知识与肤浅的启发法仍然具有挑战性。我们通过实体比较任务来研究这个问题，要求模型根据数值属性比较实体（例如，“多瑙河还是尼罗河，哪条河更长？”），这为系统分析提供了明确的基础事实。尽管法学硕士拥有足够的数字知识来正确回答，但他们经常做出与这些知识相矛盾的预测。我们确定了三种强烈影响模型预测的启发式偏差：实体流行度、提及顺序和语义共现。对于较小的模型，仅使用这些表面线索的简单逻辑回归比模型自身的数值预测更准确地预测模型选择，这表明启发法在很大程度上凌驾于原则性推理之上。至关重要的是，我们发现较大的模型（32B 参数）在更可靠时选择性地依赖数值知识，而较小的模型（7--8B 参数）则没有表现出这种区别，这解释了为什么即使较小的模型拥有更准确的知识，较大的模型也优于较小的模型。思想链提示引导所有模型使用所有模型大小的数字特征。

Title: Cross-Genre Authorship Attribution via LLM-Based Retrieve-and-Rerank

Authors: Shantanu Agarwal, Joel Barry, Steven Fincke, Scott Miller
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.16819
Pdf URL: https://arxiv.org/pdf/2510.16819
Copy Paste: [[2510.16819]] Cross-Genre Authorship Attribution via LLM-Based Retrieve-and-Rerank(https://arxiv.org/abs/2510.16819)
Keywords: llm
Abstract: Authorship attribution (AA) is the task of identifying the most likely author of a query document from a predefined set of candidate authors. We introduce a two-stage retrieve-and-rerank framework that finetunes LLMs for cross-genre AA. Unlike the field of information retrieval (IR), where retrieve-and-rerank is a de facto strategy, cross-genre AA systems must avoid relying on topical cues and instead learn to identify author-specific linguistic patterns that are independent of the text's subject matter (genre/domain/topic). Consequently, for the reranker, we demonstrate that training strategies commonly used in IR are fundamentally misaligned with cross-genre AA, leading to suboptimal behavior. To address this, we introduce a targeted data curation strategy that enables the reranker to effectively learn author-discriminative signals. Using our LLM-based retrieve-and-rerank pipeline, we achieve substantial gains of 22.3 and 34.4 absolute Success@8 points over the previous state-of-the-art on HIATUS's challenging HRS1 and HRS2 cross-genre AA benchmarks.
摘要：作者归属 (AA) 是从一组预定义的候选作者中识别查询文档最有可能的作者的任务。我们引入了一个两阶段检索和重新排名框架，可以针对跨流派 AA 微调 LLM。与信息检索 (IR) 领域不同，检索和重新排序是事实上的策略，跨流派 AA 系统必须避免依赖主题线索，而是学会识别独立于文本主题（流派/领域/主题）的特定于作者的语言模式。因此，对于重新排名器，我们证明了 IR 中常用的训练策略从根本上与跨流派 AA 不一致，导致行为不理想。为了解决这个问题，我们引入了一种有针对性的数据管理策略，使重排序器能够有效地学习作者区分信号。使用我们基于 LLM 的检索和重新排名流程，我们在 HIATUS 具有挑战性的 HRS1 和 HRS2 跨流派 AA 基准测试中，与之前最先进的技术相比，获得了 22.3 和 34.4 绝对 Success@8 点的大幅收益。
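
Note on the metric: the abstract above reports gains in Success@8. A minimal Python sketch of one common reading of Success@k follows: the fraction of queries whose true author appears among the top k ranked candidates. This reading is an assumption made for illustration, and the exact HIATUS definition may differ in detail. The function name and input format are hypothetical and are not taken from the paper.

```python
def success_at_k(ranked_candidates, true_authors, k=8):
    """Fraction of queries whose true author is within the top-k candidates.

    ranked_candidates: one ranked list of candidate author IDs per query
                       (hypothetical format, best candidate first).
    true_authors:      the correct author ID for each query, in the same order.
    """
    if len(ranked_candidates) != len(true_authors):
        raise ValueError("need exactly one true author per query")
    if not true_authors:
        return 0.0
    hits = sum(
        1
        for ranked, truth in zip(ranked_candidates, true_authors)
        if truth in ranked[:k]
    )
    return hits / len(true_authors)
```

Under this reading, a gain of 22.3 absolute points means that the share of queries with a hit in the top 8 rose by 22.3 percentage points.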

Title: Who's Asking? Simulating Role-Based Questions for Conversational AI Evaluation

Authors: Navreet Kaur, Hoda Ayad, Hayoung Jung, Shravika Mittal, Munmun De Choudhury, Tanushree Mitra
Subjects: cs.CL, cs.AI, cs.CY, cs.HC
Abstract URL: https://arxiv.org/abs/2510.16829
Pdf URL: https://arxiv.org/pdf/2510.16829
Copy Paste: [[2510.16829]] Who's Asking? Simulating Role-Based Questions for Conversational AI Evaluation(https://arxiv.org/abs/2510.16829)
Keywords: language model, llm
Abstract: Language model users often embed personal and social context in their questions. The asker's role -- implicit in how the question is framed -- creates specific needs for an appropriate response. However, most evaluations, while capturing the model's capability to respond, often ignore who is asking. This gap is especially critical in stigmatized domains such as opioid use disorder (OUD), where accounting for users' contexts is essential to provide accessible, stigma-free responses. We propose CoRUS (COmmunity-driven Roles for User-centric Question Simulation), a framework for simulating role-based questions. Drawing on role theory and posts from an online OUD recovery community (r/OpiatesRecovery), we first build a taxonomy of asker roles -- patients, caregivers, practitioners. Next, we use it to simulate 15,321 questions that embed each role's goals, behaviors, and experiences. Our evaluations show that these questions are both highly believable and comparable to real-world data. When used to evaluate five LLMs, for the same question but differing roles, we find systematic differences: vulnerable roles, such as patients and caregivers, elicit more supportive responses (+17%) and reduced knowledge content (-19%) in comparison to practitioners. Our work demonstrates how implicitly signaling a user's role shapes model responses, and provides a methodology for role-informed evaluation of conversational AI.
摘要：语言模型用户经常在他们的问题中嵌入个人和社会背景。提问者的角色——隐含在问题的提出方式中——为适当的回答创造了特定的需求。然而，大多数评估虽然捕捉了模型的响应能力，但往往忽略了提出问题的人。这种差距在阿片类药物使用障碍 (OUD) 等污名化领域尤其重要，在这些领域，考虑用户的背景对于提供可访问的、无污名化的响应至关重要。我们提出了 CoRUS（以用户为中心的问题模拟的社区驱动角色），这是一个用于模拟基于角色的问题的框架。借鉴角色理论和在线 OUD 恢复社区 (r/OpiatesRecovery) 的帖子，我们首先建立了询问者角色的分类——患者、护理人员、从业者。接下来，我们用它来模拟 15,321 个问题，这些问题嵌入了每个角色的目标、行为和经历。我们的评估表明，这些问题非常可信，并且与现实世界的数据具有可比性。当用于评估五名法学硕士时，对于同一问题但不同的角色，我们发现了系统性差异：与从业者相比，弱势角色，如患者和护理人员，引起更多的支持性反应（+17%）和减少的知识内容（-19%）。我们的工作展示了如何隐式地表示用户的角色塑造模型响应，并提供了一种对对话式人工智能进行基于角色的评估的方法。

Title: FinSight: Towards Real-World Financial Deep Research

Authors: Jiajie Jin, Yuyao Zhang, Yimeng Xu, Hongjin Qian, Yutao Zhu, Zhicheng Dou
Subjects: cs.CL, cs.AI, cs.CE
Abstract URL: https://arxiv.org/abs/2510.16844
Pdf URL: https://arxiv.org/pdf/2510.16844
Copy Paste: [[2510.16844]] FinSight: Towards Real-World Financial Deep Research(https://arxiv.org/abs/2510.16844)
Keywords: agent
Abstract: Generating professional financial reports is a labor-intensive and intellectually demanding process that current AI systems struggle to fully automate. To address this challenge, we introduce FinSight (Financial InSight), a novel multi-agent framework for producing high-quality, multimodal financial reports. The foundation of FinSight is the Code Agent with Variable Memory (CAVM) architecture, which unifies external data, designed tools, and agents into a programmable variable space, enabling flexible data collection, analysis and report generation through executable code. To ensure professional-grade visualization, we propose an Iterative Vision-Enhanced Mechanism that progressively refines raw visual outputs into polished financial charts. Furthermore, a two-stage Writing Framework expands concise Chain-of-Analysis segments into coherent, citation-aware, and multimodal reports, ensuring both analytical depth and structural consistency. Experiments on various company and industry-level tasks demonstrate that FinSight significantly outperforms all baselines, including leading deep research systems, in terms of factual accuracy, analytical depth, and presentation quality, demonstrating a clear path toward generating reports that approach human-expert quality.
摘要：生成专业财务报告是一个劳动密集型、智力要求高的过程，当前的人工智能系统很难完全自动化。为了应对这一挑战，我们推出了 FinSight（Financial InSight），这是一种新颖的多代理框架，用于生成高质量的多模式财务报告。 FinSight 的基础是具有可变内存的代码代理 (CAVM) 架构，它将外部数据、设计的工具和代理统一到可编程变量空间中，通过可执行代码实现灵活的数据收集、分析和报告生成。为了确保专业级的可视化，我们提出了一种迭代视觉增强机制，逐步将原始视觉输出细化为精美的财务图表。此外，两阶段写作框架将简洁的分析链片段扩展为连贯的、引文感知的多模式报告，确保分析深度和结构一致性。对各种公司和行业级任务的实验表明，FinSight 显着优于所有基线，包括在事实准确性、分析深度和演示质量方面领先的深度研究系统，展示了生成接近人类专家质量的报告的清晰途径。

Title: Neuronal Group Communication for Efficient Neural representation

Authors: Zhengqi Pei, Qingming Huang, Shuhui Wang
Subjects: cs.CL, cs.AI, cs.NE
Abstract URL: https://arxiv.org/abs/2510.16851
Pdf URL: https://arxiv.org/pdf/2510.16851
Copy Paste: [[2510.16851]] Neuronal Group Communication for Efficient Neural representation(https://arxiv.org/abs/2510.16851)
Keywords: language model, llm
Abstract: The ever-increasing scale of modern neural networks has brought unprecedented performance alongside daunting challenges in efficiency and interpretability. This paper addresses the core question of how to build large neural systems that learn efficient, modular, and interpretable representations. We propose Neuronal Group Communication (NGC), a theory-driven framework that reimagines a neural network as a dynamical system of interacting neuronal groups rather than a monolithic collection of neural weights. Instead of treating each weight as an independent trainable parameter, NGC treats weights as transient interactions between embedding-like neuronal states, with neural computation unfolding through iterative communication among groups of neurons. This low-rank, modular representation yields compact models: groups of neurons exchange low-dimensional signals, enabling intra-group specialization and inter-group information sharing while dramatically reducing redundant parameters. By drawing on dynamical systems theory, we introduce a neuronal stability metric (analogous to Lyapunov stability) that quantifies the contraction of neuron activations toward stable patterns during sequence processing. Using this metric, we reveal that emergent reasoning capabilities correspond to an external driving force or "potential", which nudges the neural dynamics away from trivial trajectories while preserving stability. Empirically, we instantiate NGC in large language models (LLMs) and demonstrate improved performance on complex reasoning benchmarks under moderate compression. NGC consistently outperforms standard low-rank approximations and cross-layer basis-sharing methods at comparable compression rates. We conclude by discussing the broader implications of NGC, including how structured neuronal group dynamics might relate to generalization in high-dimensional learning systems.
摘要：现代神经网络规模的不断扩大带来了前所未有的性能，同时也带来了效率和可解释性方面的严峻挑战。本文解决了如何构建学习高效、模块化和可解释表示的大型神经系统的核心问题。我们提出了神经元组通信（NGC），这是一种理论驱动的框架，它将神经网络重新想象为相互作用的神经元组的动态系统，而不是神经权重的整体集合。 NGC 没有将每个权重视为独立的可训练参数，而是将权重视为类似嵌入的神经元状态之间的瞬时相互作用，并通过神经元组之间的迭代通信展开神经计算。这种低阶模块化表示产生紧凑的模型：神经元组交换低维信号，实现组内专业化和组间信息共享，同时显着减少冗余参数。通过利用动力系统理论，我们引入了一种神经元稳定性度量（类似于李亚普诺夫稳定性），该度量可以量化序列处理过程中神经元激活向稳定模式的收缩。使用这个指标，我们揭示了新兴推理能力对应于外部驱动力或“潜力”，它推动神经动力学远离琐碎的轨迹，同时保持稳定性。根据经验，我们在大型语言模型 (LLM) 中实例化 NGC，并展示了在适度压缩下复杂推理基准的性能改进。在相当的压缩率下，NGC 始终优于标准低秩近似和跨层基础共享方法。最后，我们讨论了 NGC 的更广泛含义，包括结构化神经元群体动力学如何与高维学习系统中的泛化相关。

Title: Does Visual Grounding Enhance the Understanding of Embodied Knowledge in Large Language Models?

Authors: Zhihui Yang, Yupei Wang, Kaijie Mo, Zhe Zhao, Renfen Hu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.16924
Pdf URL: https://arxiv.org/pdf/2510.16924
Copy Paste: [[2510.16924]] Does Visual Grounding Enhance the Understanding of Embodied Knowledge in Large Language Models?(https://arxiv.org/abs/2510.16924)
Keywords: language model
Abstract: Despite significant progress in multimodal language models (LMs), it remains unclear whether visual grounding enhances their understanding of embodied knowledge compared to text-only models. To address this question, we propose a novel embodied knowledge understanding benchmark based on the perceptual theory from psychology, encompassing visual, auditory, tactile, gustatory, olfactory external senses, and interoception. The benchmark assesses the models' perceptual abilities across different sensory modalities through vector comparison and question-answering tasks with over 1,700 questions. By comparing 30 state-of-the-art LMs, we surprisingly find that vision-language models (VLMs) do not outperform text-only models in either task. Moreover, the models perform significantly worse in the visual dimension compared to other sensory dimensions. Further analysis reveals that the vector representations are easily influenced by word form and frequency, and the models struggle to answer questions involving spatial perception and reasoning. Our findings underscore the need for more effective integration of embodied knowledge in LMs to enhance their understanding of the physical world.
摘要：尽管多模态语言模型（LM）取得了重大进展，但与纯文本模型相比，视觉基础是否能增强对具体知识的理解仍不清楚。为了解决这个问题，我们提出了一种基于心理学感知理论的新颖的体现知识理解基准，涵盖视觉、听觉、触觉、味觉、嗅觉外部感官和内感受。该基准通过向量比较和包含 1,700 多个问题的问答任务来评估模型在不同感官模式下的感知能力。通过比较 30 个最先进的 LM，我们惊讶地发现视觉语言模型 (VLM) 在这两项任务中都没有优于纯文本模型。此外，与其他感官维度相比，这些模型在视觉维度上的表现明显较差。进一步的分析表明，向量表示很容易受到词形和频率的影响，并且模型很难回答涉及空间感知和推理的问题。我们的研究结果强调需要更有效地整合语言模型中的具体知识，以增强他们对物理世界的理解。

Title: ChiKhaPo: A Large-Scale Multilingual Benchmark for Evaluating Lexical Comprehension and Generation in Large Language Models

Authors: Emily Chang, Niyati Bafna
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.16928
Pdf URL: https://arxiv.org/pdf/2510.16928
Copy Paste: [[2510.16928]] ChiKhaPo: A Large-Scale Multilingual Benchmark for Evaluating Lexical Comprehension and Generation in Large Language Models(https://arxiv.org/abs/2510.16928)
Keywords: language model, llm
Abstract: Existing benchmarks for large language models (LLMs) are largely restricted to high- or mid-resource languages, and often evaluate performance on higher-order tasks in reasoning and generation. However, plenty of evidence points to the fact that LLMs lack basic linguistic competence in the vast majority of the world's 3800+ written languages. We introduce ChiKhaPo, consisting of 8 subtasks of varying difficulty designed to evaluate the lexical comprehension and generation abilities of generative models. ChiKhaPo draws on existing lexicons, monolingual data, and bitext, and provides coverage for 2700+ languages for 2 subtasks, surpassing any existing benchmark in terms of language coverage. We further show that 6 SOTA models struggle on our benchmark, and discuss the factors contributing to performance scores, including language family, language resourcedness, task, and comprehension versus generation directions. With ChiKhaPo, we hope to enable and encourage the massively multilingual benchmarking of LLMs.
摘要：大型语言模型 (LLM) 的现有基准在很大程度上仅限于高资源或中等资源语言，并且通常评估推理和生成中高阶任务的性能。然而，大量证据表明，法学硕士缺乏全球 3800 多种书面语言中绝大多数语言的基本语言能力。我们引入了 ChiKhaPo，它由 8 个不同难度的子任务组成，旨在评估生成模型的词汇理解和生成能力。 ChiKhaPo 利用现有词典、单语数据和双文本，为 2 个子任务提供了 2700 多种语言的覆盖，在语言覆盖方面超越了任何现有基准。我们进一步展示了 6 个 SOTA 模型在我们的基准上表现不佳，并讨论了影响性能得分的因素，包括语言家族、语言资源、任务以及理解与生成方向。我们希望通过 ChiKhaPo 来支持和鼓励法学硕士的大规模多语言基准测试。

Title: Prompt-MII: Meta-Learning Instruction Induction for LLMs

Authors: Emily Xiao, Yixiao Zeng, Ada Chen, Chin-Jou Li, Amanda Bertsch, Graham Neubig
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.16932
Pdf URL: https://arxiv.org/pdf/2510.16932
Copy Paste: [[2510.16932]] Prompt-MII: Meta-Learning Instruction Induction for LLMs(https://arxiv.org/abs/2510.16932)
Keywords: language model, llm, prompt
Abstract: A popular method to adapt large language models (LLMs) to new tasks is in-context learning (ICL), which is effective but incurs high inference costs as context length grows. In this paper we propose a method to perform instruction induction, where we take training examples and reduce them to a compact but descriptive prompt that can achieve performance comparable to ICL over the full training set. Specifically, we propose PROMPT-MII, a reinforcement learning (RL) based framework to meta-learn an instruction induction model that can generate compact instructions on the fly for an arbitrary new dataset. We train on over 3,000 diverse classification datasets from the HuggingFace hub, and evaluate on 90 unseen tasks. PROMPT-MII improves downstream model quality by 4-9 F1 points (10-20% relative), matching ICL performance while requiring 3-13x fewer tokens.
摘要：使大型语言模型 (LLM) 适应新任务的一种流行方法是上下文学习 (ICL)，这种方法很有效，但随着上下文长度的增长会产生较高的推理成本。在本文中，我们提出了一种执行指令归纳的方法，其中我们采用训练示例并将其简化为紧凑但描述性的提示，可以在整个训练集上实现与 ICL 相当的性能。具体来说，我们提出了 PROMPT-MII，这是一种基于强化学习 (RL) 的框架，用于元学习指令归纳模型，该模型可以为任意新数据集动态生成紧凑指令。我们对来自 HuggingFace 中心的 3,000 多个不同的分类数据集进行训练，并对 90 个未见过的任务进行评估。 PROMPT-MII 将下游模型质量提高了 4-9 个 F1 点（相对 10-20%），与 ICL 性能相匹配，同时需要的代币数量减少了 3-13 倍。

Title: Parameter-Efficient Fine-Tuning for Low-Resource Languages: A Comparative Study of LLMs for Bengali Hate Speech Detection

Authors: Akif Islam, Mohd Ruhul Ameen
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.16985
Pdf URL: https://arxiv.org/pdf/2510.16985
Copy Paste: [[2510.16985]] Parameter-Efficient Fine-Tuning for Low-Resource Languages: A Comparative Study of LLMs for Bengali Hate Speech Detection(https://arxiv.org/abs/2510.16985)
Keywords: language model, llm
Abstract: Bengali social media platforms have witnessed a sharp increase in hate speech, disproportionately affecting women and adolescents. While datasets such as BD-SHS provide a basis for structured evaluation, most prior approaches rely on either computationally costly full-model fine-tuning or proprietary APIs. This paper presents the first application of Parameter-Efficient Fine-Tuning (PEFT) for Bengali hate speech detection using LoRA and QLoRA. Three instruction-tuned large language models - Gemma-3-4B, Llama-3.2-3B, and Mistral-7B - were fine-tuned on the BD-SHS dataset of 50,281 annotated comments. Each model was adapted by training fewer than 1% of its parameters, enabling experiments on a single consumer-grade GPU. The results show that Llama-3.2-3B achieved the highest F1-score of 92.23%, followed by Mistral-7B at 88.94% and Gemma-3-4B at 80.25%. These findings establish PEFT as a practical and replicable strategy for Bengali and related low-resource languages.
摘要：孟加拉社交媒体平台上的仇恨言论急剧增加，对妇女和青少年的影响尤为严重。虽然 BD-SHS 等数据集为结构化评估提供了基础，但大多数先前的方法依赖于计算成本高昂的全模型微调或专有 API。本文提出了使用 LoRA 和 QLoRA 进行参数高效微调 (PEFT) 的首次应用，用于孟加拉语仇恨语音检测。三种指令调整的大型语言模型 - Gemma-3-4B、Llama-3.2-3B 和 Mistral-7B - 在包含 50,281 个注释评论的 BD-SHS 数据集上进行了微调。每个模型都通过训练不到 1% 的参数进行调整，从而可以在单个消费级 GPU 上进行实验。结果显示，Llama-3.2-3B 获得了最高的 F1 分数，为 92.23%，其次是 Mistral-7B（88.94%）和 Gemma-3-4B（80.25%）。这些发现使 PEFT 成为孟加拉语和相关资源匮乏语言的实用且可复制的策略。
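Illustrative note: the "fewer than 1% of its parameters" figure follows from how LoRA is built. For a weight matrix of shape d_out x d_in, a rank-r adapter trains only r * (d_out + d_in) values. The sketch below works through that arithmetic for one matrix; the dimensions and rank are hypothetical and are not taken from the paper's configuration.

```python
def lora_trainable_fraction(d_out: int, d_in: int, rank: int) -> float:
    """Fraction of one weight matrix's parameter count that a LoRA adapter trains.

    The frozen matrix has d_out * d_in parameters; the adapter adds two
    low-rank factors of shapes (d_out, rank) and (rank, d_in).
    """
    full_params = d_out * d_in
    lora_params = rank * (d_out + d_in)
    return lora_params / full_params


# Hypothetical sizes chosen for illustration only.
frac = lora_trainable_fraction(d_out=4096, d_in=4096, rank=8)
print(f"{frac:.4%}")  # 0.3906%
```

The whole-model fraction also depends on which matrices receive adapters, so this per-matrix number is only a rough guide.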

Title: Back to Bytes: Revisiting Tokenization Through UTF-8

Authors: Amit Moryossef, Clara Meister, Pavel Stepachev, Desmond Elliott
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.16987
Pdf URL: https://arxiv.org/pdf/2510.16987
Copy Paste: [[2510.16987]] Back to Bytes: Revisiting Tokenization Through UTF-8(https://arxiv.org/abs/2510.16987)
Keywords: language model
Abstract: We present UTF8Tokenizer, a minimalist byte-level tokenizer that maps text exactly to IDs corresponding to the bytes underlying the text's UTF-8 encoding (e.g., byte x09 is token ID 9). Unlike prior byte-level approaches (Xue et al., 2021; Pagnoni et al., 2025), our implementation never introduces out-of-range IDs (i.e. there is no token ID 256) or auxiliary tokens: all special behavior (e.g., padding, boundaries, conversation structure, attention segments, tool calling, "thinking" spans, etc.) is encoded using C0 control bytes - just as ASCII was originally designed to embed control information alongside printable text. These design principles yield practical benefits: (1) faster tokenization (14x) and significantly lower host-device transfer (8x less than int64); (2) simple, shareable 256*d embedding tables that can be aligned across models; and (3) a training-time enhancement via bit-biased embeddings, which exposes per-byte bit structure and can be added to the embedding table post-training, removing inference costs. Our HuggingFace-compatible implementation improves language modeling convergence.
摘要：我们提出了 UTF8Tokenizer，这是一种极简的字节级标记生成器，它将文本精确映射到与文本 UTF-8 编码底层字节相对应的 ID（例如，字节 x09 是标记 ID 9）。与之前的字节级方法（Xue 等人，2021；Pagnoni 等人，2025）不同，我们的实现从未引入超出范围的 ID（即没有令牌 ID 256）或辅助令牌：所有特殊行为（例如，填充、边界、对话结构、注意力段、工具调用、“思考”跨度等）均使用 C0 控制字节进行编码 - 只是因为 ASCII 最初设计的目的是在可打印文本旁边嵌入控制信息。这些设计原则产生了实际好处：（1）更快的标记化（14 倍）和显着降低的主机设备传输（比 int64 少 8 倍）；（2）简单、可共享的256*d嵌入表，可以跨模型对齐； (3) 通过位偏向嵌入来增强训练时间，它公开了每字节的位结构，并且可以在训练后添加到嵌入表中，从而消除推理成本。我们的 HuggingFace 兼容实现提高了语言建模的收敛性。
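Illustrative note: the core mapping described in the abstract (token ID equals UTF-8 byte value) can be sketched in a few lines of plain Python. The function names encode and decode are assumptions made for this example and are not the API of the authors' UTF8Tokenizer; the control-byte conventions for padding and conversation structure are not shown.

```python
def encode(text: str) -> list[int]:
    """Map text to token IDs equal to its UTF-8 byte values (always 0..255)."""
    return list(text.encode("utf-8"))


def decode(ids: list[int]) -> str:
    """Invert encode(); bytes() raises ValueError for any ID outside 0..255."""
    return bytes(ids).decode("utf-8")


ids = encode("a\t\u00e9")  # 'a', a tab (byte 0x09), and e-acute (two bytes)
print(ids)  # [97, 9, 195, 169]
print(decode(ids) == "a\t\u00e9")  # True
```

Because every ID is a byte value, the vocabulary has exactly 256 entries and the IDs fit in a uint8, which is consistent with the 8x transfer saving over int64 quoted above.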

Title: Vocab Diet: Reshaping the Vocabulary of LLMs with Vector Arithmetic

Authors: Yuval Reif, Guy Kaplan, Roy Schwartz
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.17001
Pdf URL: https://arxiv.org/pdf/2510.17001
Copy Paste: [[2510.17001]] Vocab Diet: Reshaping the Vocabulary of LLMs with Vector Arithmetic(https://arxiv.org/abs/2510.17001)
Keywords: language model, llm
Abstract: Large language models (LLMs) were shown to encode word form variations, such as "walk"->"walked", as linear directions in embedding space. However, standard tokenization algorithms treat these variations as distinct tokens -- filling the size-capped vocabulary with surface form variants (e.g., "walk", "walking", "Walk"), at the expense of less frequent words and multilingual coverage. We show that many of these variations can be captured by transformation vectors -- additive offsets that yield the appropriate word's representation when applied to the base form word embedding -- in both the input and output spaces. Building on this, we propose a compact reshaping of the vocabulary: rather than assigning unique tokens to each surface form, we compose them from shared base form and transformation vectors (e.g., "walked" = "walk" + past tense). We apply our approach to multiple LLMs and across five languages, removing up to 10% of vocabulary entries -- thereby freeing space to allocate new, more diverse tokens. Importantly, we do so while also expanding vocabulary coverage to out-of-vocabulary words, with minimal impact on downstream performance, and without modifying model weights. Our findings motivate a foundational rethinking of vocabulary design, moving from string enumeration to a compositional vocabulary that leverages the underlying structure of language.
摘要：大型语言模型（LLM）被证明可以将单词形式的变化（例如“walk”->“walked”）编码为嵌入空间中的线性方向。然而，标准标记化算法将这些变体视为不同的标记——用表面形式变体（例如“walk”、“walking”、“Walk”）填充大小上限的词汇表，但代价是频率较低的单词和多语言覆盖。我们表明，许多这些变化都可以通过输入和输出空间中的转换向量（当应用于基本形式单词嵌入时产生适当单词表示的加性偏移）来捕获。在此基础上，我们提出了对词汇表的紧凑重塑：我们不是为每个表面形式分配唯一的标记，而是由共享的基本形式和转换向量组成它们（例如，“walked”=“walk”+过去时）。我们将我们的方法应用于多个法学硕士和五种语言，删除了多达 10% 的词汇条目，从而释放空间来分配新的、更多样化的标记。重要的是，我们这样做的同时还将词汇覆盖范围扩大到词汇外的单词，对下游性能的影响最小，并且不修改模型权重。我们的发现激发了对词汇设计的根本性重新思考，从字符串枚举转向利用语言底层结构的组合词汇。
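Illustrative note: the composition "walked" = "walk" + past tense can be sketched with toy vectors. Everything below is invented for illustration: the 3-dimensional embeddings are made up, and the transformation vector is estimated from a single word pair, which is a simplification and not necessarily how the authors obtain their vectors.

```python
def add(u: list[float], v: list[float]) -> list[float]:
    """Element-wise sum of two equal-length vectors."""
    return [a + b for a, b in zip(u, v)]


def sub(u: list[float], v: list[float]) -> list[float]:
    """Element-wise difference of two equal-length vectors."""
    return [a - b for a, b in zip(u, v)]


# Toy embeddings (hypothetical values).
emb = {
    "walk": [1.0, 0.0, 2.0],
    "walked": [1.5, 1.0, 2.0],
    "jump": [0.0, 3.0, 1.0],
}

# Estimate a shared "past tense" offset from one pair (an assumption).
past_tense = sub(emb["walked"], emb["walk"])

# Compose a representation for "jumped" without storing a separate entry.
jumped = add(emb["jump"], past_tense)
print(past_tense)  # [0.5, 1.0, 0.0]
print(jumped)      # [0.5, 4.0, 1.0]
```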

Title: Online Learning Defense against Iterative Jailbreak Attacks via Prompt Optimization

Authors: Masahiro Kaneko, Zeerak Talat, Timothy Baldwin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.17006
Pdf URL: https://arxiv.org/pdf/2510.17006
Copy Paste: [[2510.17006]] Online Learning Defense against Iterative Jailbreak Attacks via Prompt Optimization(https://arxiv.org/abs/2510.17006)
Keywords: language model, llm, prompt
Abstract: Iterative jailbreak methods that repeatedly rewrite and input prompts into large language models (LLMs) to induce harmful outputs -- using the model's previous responses to guide each new iteration -- have been found to be a highly effective attack strategy. Although these attacks are effective against LLMs and their safety mechanisms, existing defenses do not proactively disrupt this dynamic trial-and-error cycle. In this study, we propose a novel framework that dynamically updates its defense strategy through online learning in response to each new prompt from iterative jailbreak methods. Leveraging the distinctions between harmful jailbreak-generated prompts and typical harmless prompts, we introduce a reinforcement learning-based approach that optimizes prompts to ensure appropriate responses for harmless tasks while explicitly rejecting harmful prompts. Additionally, to curb overfitting to the narrow band of partial input rewrites explored during an attack, we introduce Past-Direction Gradient Damping (PDGD). Experiments conducted on three LLMs show that our approach significantly outperforms five existing defense methods against five iterative jailbreak methods. Moreover, our results indicate that our prompt optimization strategy simultaneously enhances response quality for harmless tasks.
摘要：迭代越狱方法反复重写提示并将其输入到大型语言模型 (LLM) 中以诱导有害输出——使用模型之前的响应来指导每次新的迭代——已被发现是一种非常有效的攻击策略。尽管这是针对法学硕士及其安全机制的有效攻击策略，但现有的防御措施并不能主动破坏这种动态的试错循环。在这项研究中，我们提出了一种新颖的框架，该框架通过在线学习动态更新其防御策略，以响应迭代越狱方法的每个新提示。利用越狱生成的有害提示和典型无害提示之间的区别，我们引入了一种基于强化学习的方法，该方法优化提示以确保对无害任务做出适当的响应，同时明确拒绝有害提示。此外，为了抑制对攻击期间探索的部分输入重写窄带的过度拟合，我们引入了过去方向梯度阻尼（PDGD）。在三个法学硕士上进行的实验表明，针对五种迭代越狱方法，我们的方法明显优于五种现有的防御方法。此外，我们的结果表明，我们的及时优化策略同时提高了无害任务的响应质量。

Title: DiscoTrack: A Multilingual LLM Benchmark for Discourse Tracking

Authors: Lanni Bu, Lauren Levin, Amir Zeldes
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.17013
Pdf URL: https://arxiv.org/pdf/2510.17013
Copy Paste: [[2510.17013]] DiscoTrack: A Multilingual LLM Benchmark for Discourse Tracking(https://arxiv.org/abs/2510.17013)
Keywords: llm
Abstract: Recent LLM benchmarks have tested models on a range of phenomena, but are still focused primarily on natural language understanding for extraction of explicit information, such as QA or summarization, with responses often targeting information from individual sentences. We are still lacking more challenging, and importantly also multilingual, benchmarks focusing on implicit information and pragmatic inferences across larger documents in the context of discourse tracking: integrating and aggregating information across sentences, paragraphs and multiple speaker utterances. To this end, we present DiscoTrack, an LLM benchmark targeting a range of tasks across 12 languages and four levels of discourse understanding: salience recognition, entity tracking, discourse relations and bridging inference. Our evaluation shows that these tasks remain challenging, even for state-of-the-art models.
摘要：最近的法学硕士基准测试了一系列现象的模型，但仍然主要关注自然语言理解，以提取明确的信息，例如质量保证或摘要，其响应通常针对单个句子的信息。我们仍然缺乏更具挑战性的、重要的是多语言的基准，重点关注在话语跟踪的背景下跨较大文档的隐含信息和语用推论：整合和聚合跨句子、段落和多个说话者话语的信息。为此，我们提出了 DiscoTrack，这是一个法学硕士基准，针对 12 种语言和四个级别的话语理解的一系列任务：显着性识别、实体跟踪、话语关系和桥接推理。我们的评估表明，即使对于最先进的模型，这些任务仍然具有挑战性。

Title: SafeSearch: Do Not Trade Safety for Utility in LLM Search Agents

Authors: Qiusi Zhan, Angeline Budiman-Chan, Abdelrahman Zayed, Xingzhi Guo, Daniel Kang, Joo-Kyung Kim
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.17017
Pdf URL: https://arxiv.org/pdf/2510.17017
Copy Paste: [[2510.17017]] SafeSearch: Do Not Trade Safety for Utility in LLM Search Agents(https://arxiv.org/abs/2510.17017)
Keywords: language model, llm, agent
Abstract: Large language model (LLM) based search agents iteratively generate queries, retrieve external information, and reason to answer open-domain questions. While researchers have primarily focused on improving their utility, their safety behaviors remain underexplored. In this paper, we first evaluate search agents using red-teaming datasets and find that they are more likely to produce harmful outputs than base LLMs. For example, when asked "How can I track someone's location without their consent?", a base model refuses, whereas a search agent designed to retrieve and cite sources may lower its refusal threshold, fetch documents (e.g., court cases), and, once appended, synthesize them into an informative yet unsafe summary. We further show that utility-oriented fine-tuning intensifies this risk, motivating joint alignment of safety and utility. We present SafeSearch, a multi-objective reinforcement learning approach that couples a final-output safety/utility reward with a novel query-level shaping term that penalizes unsafe queries and rewards safe ones. Experiments show that SafeSearch reduces agent harmfulness by over 70% across three red-teaming datasets while producing safe, helpful responses, and matches the QA performance of a utility-only finetuned agent; further analyses confirm the effectiveness of the query-level reward in jointly improving safety and utility.
摘要：基于大语言模型 (LLM) 的搜索代理迭代地生成查询、检索外部信息以及回答开放域问题的原因。虽然研究人员主要关注于提高其实用性，但其安全行为仍未得到充分探索。在本文中，我们首先使用红队数据集评估搜索代理，发现它们比基础法学硕士更有可能产生有害输出。例如，当被问到“如何在未经某人同意的情况下跟踪某人的位置？”时，基本模型会拒绝，而旨在检索和引用来源的搜索代理可能会降低其拒绝阈值，获取文件（例如法院案件），并在附加后将其合成为信息丰富但不安全的摘要。我们进一步表明，以效用为导向的微调会加剧这种风险，从而促进安全性和效用的联合协调。我们提出了 SafeSearch，这是一种多目标强化学习方法，它将最终输出的安全性/实用性奖励与新颖的查询级塑造项结合起来，该塑造项会惩罚不安全的查询并奖励安全的查询。实验表明，SafeSearch 在三个红队数据集中将代理危害性降低了 70% 以上，同时生成安全、有用的响应，并与仅实用程序微调代理的 QA 性能相匹配；进一步的分析证实了查询级奖励在共同提高安全性和实用性方面的有效性。

Title: Mapping from Meaning: Addressing the Miscalibration of Prompt-Sensitive Language Models

Authors: Kyle Cox, Jiawei Xu, Yikun Han, Rong Xu, Tianhao Li, Chi-Yang Hsu, Tianlong Chen, Walter Gerych, Ying Ding
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2510.17028
Pdf URL: https://arxiv.org/pdf/2510.17028
Copy Paste: [[2510.17028]] Mapping from Meaning: Addressing the Miscalibration of Prompt-Sensitive Language Models(https://arxiv.org/abs/2510.17028)
Keywords: language model, llm, prompt
Abstract: An interesting behavior in large language models (LLMs) is prompt sensitivity. When provided with different but semantically equivalent versions of the same prompt, models may produce very different distributions of answers. This suggests that the uncertainty reflected in a model's output distribution for one prompt may not reflect the model's uncertainty about the meaning of the prompt. We model prompt sensitivity as a type of generalization error, and show that sampling across the semantic "concept space" with paraphrasing perturbations improves uncertainty calibration without compromising accuracy. Additionally, we introduce a new metric for uncertainty decomposition in black-box LLMs that improves upon entropy-based decomposition by modeling semantic continuities in natural language generation. We show that this decomposition metric can be used to quantify how much LLM uncertainty is attributed to prompt sensitivity. Our work introduces a new way to improve uncertainty calibration in prompt-sensitive language models, and provides evidence that some LLMs fail to exhibit consistent general reasoning about the meanings of their inputs.
摘要：大型语言模型 (LLM) 中一个有趣的行为是即时敏感性。当提供相同提示的不同但语义等效的版本时，模型可能会产生非常不同的答案分布。这表明模型针对一个提示的输出分布所反映的不确定性可能无法反映模型对提示含义的不确定性。我们将提示敏感性建模为一种泛化误差，并表明在带有释义扰动的语义“概念空间”上进行采样可以提高不确定性校准，而不会影响准确性。此外，我们引入了一种用于黑盒法学硕士中不确定性分解的新度量，该度量通过对自然语言生成中的语义连续性进行建模来改进基于熵的分解。我们表明，这种分解指标可用于量化 LLM 不确定性在多大程度上归因于即时敏感性。我们的工作引入了一种改进提示敏感语言模型中的不确定性校准的新方法，并提供了证据表明一些法学硕士未能对其输入的含义表现出一致的一般推理。
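Illustrative note: one plain way to "sample across paraphrases" is to collect answers for several rewordings of the same question and average the resulting answer distributions. The sketch below does this with equal weight per paraphrase. The function names and sample answers are hypothetical, and the abstract does not specify the authors' exact aggregation procedure, so this is an assumption rather than their method.

```python
from collections import Counter


def answer_distribution(answers: list[str]) -> dict[str, float]:
    """Empirical distribution over the answers sampled for one prompt."""
    counts = Counter(answers)
    total = len(answers)
    return {answer: count / total for answer, count in counts.items()}


def aggregate_over_paraphrases(samples_per_prompt: list[list[str]]) -> dict[str, float]:
    """Average the per-prompt distributions, weighting each paraphrase equally."""
    dists = [answer_distribution(samples) for samples in samples_per_prompt]
    support = set().union(*dists)  # every answer seen under any paraphrase
    return {a: sum(d.get(a, 0.0) for d in dists) / len(dists) for a in support}


# Hypothetical samples: the model looks certain under one wording only.
samples = [
    ["A", "A", "A", "A"],  # paraphrase 1
    ["A", "B", "B", "B"],  # paraphrase 2
]
agg = aggregate_over_paraphrases(samples)
print(agg["A"], agg["B"])  # 0.625 0.375
```

A single prompt would have reported full confidence in "A"; pooling over wordings exposes the disagreement as lower confidence.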

Title: Investigating Thinking Behaviours of Reasoning-Based Language Models for Social Bias Mitigation

Authors: Guoqing Luo, Iffat Maab, Lili Mou, Junichi Yamagishi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.17062
Pdf URL: https://arxiv.org/pdf/2510.17062
Copy Paste: [[2510.17062]] Investigating Thinking Behaviours of Reasoning-Based Language Models for Social Bias Mitigation(https://arxiv.org/abs/2510.17062)
Keywords: language model, prompt
Abstract: While reasoning-based large language models excel at complex tasks through an internal, structured thinking process, a concerning phenomenon has emerged that such a thinking process can aggregate social stereotypes, leading to biased outcomes. However, the underlying behaviours of these language models in social bias scenarios remain underexplored. In this work, we systematically investigate mechanisms within the thinking process behind this phenomenon and uncover two failure patterns that drive social bias aggregation: 1) stereotype repetition, where the model relies on social stereotypes as its primary justification, and 2) irrelevant information injection, where it fabricates or introduces new details to support a biased narrative. Building on these insights, we introduce a lightweight prompt-based mitigation approach that queries the model to review its own initial reasoning against these specific failure patterns. Experiments on question answering (BBQ and StereoSet) and open-ended (BOLD) benchmarks show that our approach effectively reduces bias while maintaining or improving accuracy.
摘要：虽然基于推理的大型语言模型通过内部结构化思维过程擅长完成复杂任务，但出现了一个令人担忧的现象，即这种思维过程可能会聚合社会刻板印象，从而导致有偏见的结果。然而，这些语言模型在社会偏见场景中的潜在行为仍未得到充分探索。在这项工作中，我们系统地研究了这一现象背后的思维过程中的机制，并揭示了两种导致社会偏见聚合的失败模式：1）刻板印象重复，其中模型依赖社会刻板印象作为其主要理由；2）不相关的信息注入，它捏造或引入新的细节来支持有偏见的叙述。基于这些见解，我们引入了一种轻量级的基于提示的缓解方法，该方法查询模型以检查其针对这些特定故障模式的初始推理。问答（BBQ 和 StereoSet）和开放式（BOLD）基准的实验表明，我们的方法有效地减少了偏差，同时保持或提高了准确性。

Title: Verification-Aware Planning for Multi-Agent Systems

Authors: Tianyang Xu, Dan Zhang, Kushan Mitra, Estevam Hruschka
Subjects: cs.CL, cs.AI, cs.LG, cs.MA
Abstract URL: https://arxiv.org/abs/2510.17109
Pdf URL: https://arxiv.org/pdf/2510.17109
Copy Paste: [[2510.17109]] Verification-Aware Planning for Multi-Agent Systems(https://arxiv.org/abs/2510.17109)
Keywords: language model, llm, agent
Abstract: Large language model (LLM) agents are increasingly deployed to tackle complex tasks, often necessitating collaboration among multiple specialized agents. However, multi-agent collaboration introduces new challenges in planning, coordination, and verification. Execution failures frequently arise not from flawed reasoning alone, but from subtle misalignments in task interpretation, output format, or inter-agent handoffs. To address these challenges, we present VeriMAP, a framework for multi-agent collaboration with verification-aware planning. The VeriMAP planner decomposes tasks, models subtask dependencies, and encodes planner-defined passing criteria as subtask verification functions (VFs) in Python and natural language. We evaluate VeriMAP on diverse datasets, demonstrating that it outperforms both single- and multi-agent baselines while enhancing system robustness and interpretability. Our analysis highlights how verification-aware planning enables reliable coordination and iterative refinement in multi-agent systems, without relying on external labels or annotations.
摘要：大型语言模型（LLM）代理越来越多地被部署来处理复杂的任务，通常需要多个专业代理之间的协作。然而，多智能体协作在规划、协调和验证方面带来了新的挑战。执行失败常常不仅仅源于有缺陷的推理，还源于任务解释、输出格式或代理间切换中的细微偏差。为了应对这些挑战，我们提出了 VeriMAP，这是一个具有验证感知规划的多代理协作框架。 VeriMAP 规划器分解任务、对子任务依赖关系进行建模，并将规划器定义的通过标准编码为 Python 和自然语言中的子任务验证函数 (VF)。我们在不同的数据集上评估了 VeriMAP，证明它优于单代理和多代理基线，同时增强了系统的稳健性和可解释性。我们的分析强调了验证感知规划如何在多代理系统中实现可靠的协调和迭代细化，而无需依赖外部标签或注释。
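Illustrative note: a subtask verification function written in Python might look like the sketch below, which checks an agent's output against a planner-defined passing criterion. The criterion (a JSON object with a numeric "answer" field), the function name, and the return shape are all hypothetical; the abstract does not show VeriMAP's actual interface.

```python
import json


def verify_subtask_output(output: str) -> tuple[bool, str]:
    """Hypothetical verification function for one subtask.

    Passing criterion (assumed for illustration): the output must be a JSON
    object containing a numeric 'answer' field. Returns (passed, reason).
    """
    try:
        data = json.loads(output)
    except json.JSONDecodeError as exc:
        return False, f"not valid JSON: {exc.msg}"
    if not isinstance(data, dict) or "answer" not in data:
        return False, "missing 'answer' field"
    value = data["answer"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return False, "'answer' must be a number"
    return True, "ok"


print(verify_subtask_output('{"answer": 42}'))  # (True, 'ok')
print(verify_subtask_output('{"result": 42}'))  # (False, "missing 'answer' field")
```

Returning a reason alongside the verdict gives a downstream agent something concrete to act on when a handoff fails on format rather than on reasoning.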

Title: DVAGen: Dynamic Vocabulary Augmented Generation

Authors: Wei Du, Nuowei Liu, Jie Wang, Jiahao Kuang, Tao Ji, Xiaoling Wang, Yuanbin Wu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.17115
Pdf URL: https://arxiv.org/pdf/2510.17115
Copy Paste: [[2510.17115]] DVAGen: Dynamic Vocabulary Augmented Generation(https://arxiv.org/abs/2510.17115)
Keywords: language model, llm
Abstract: Language models trained with a fixed vocabulary struggle to generalize to novel or out-of-vocabulary words, limiting their flexibility in handling diverse token combinations. Existing dynamic vocabulary approaches attempt to address this limitation but face challenges such as fragmented codebases, lack of support for modern LLMs, and limited inference scalability. To overcome these issues, we introduce DVAGen, a fully open-source, unified framework designed for training, evaluation, and visualization of dynamic vocabulary-augmented language models. Our framework modularizes the pipeline for ease of customization, integrates seamlessly with open-source LLMs, and is the first to provide both CLI and WebUI tools for real-time result inspection. We validate the effectiveness of dynamic vocabulary methods on modern LLMs and demonstrate support for batch inference, significantly improving inference throughput.
摘要：用固定词汇训练的语言模型很难泛化到新的或词汇外的单词，限制了它们处理不同标记组合的灵活性。现有的动态词汇方法试图解决这一限制，但面临着代码库碎片化、缺乏对现代法学硕士的支持以及推理可扩展性有限等挑战。为了克服这些问题，我们引入了 DVAGen，这是一个完全开源的统一框架，专为动态词汇增强语言模型的训练、评估和可视化而设计。我们的框架将流程模块化以便于定制，与开源法学硕士无缝集成，并且是第一个提供 CLI 和 WebUI 工具用于实时结果检查的框架。我们验证了动态词汇方法在现代法学硕士上的有效性，并展示了对批量推理的支持，显着提高了推理吞吐量。

Title: Rethinking On-policy Optimization for Query Augmentation

Authors: Zhichao Xu, Shengyao Zhuang, Xueguang Ma, Bingsen Chen, Yijun Tian, Fengran Mo, Jie Cao, Vivek Srikumar
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2510.17139
Pdf URL: https://arxiv.org/pdf/2510.17139
Copy Paste: [[2510.17139]] Rethinking On-policy Optimization for Query Augmentation(https://arxiv.org/abs/2510.17139)
Keywords: language model, llm, prompt
Abstract: Recent advances in large language models (LLMs) have led to a surge of interest in query augmentation for information retrieval (IR). Two main approaches have emerged. The first prompts LLMs to generate answers or pseudo-documents that serve as new queries, relying purely on the model's parametric knowledge or contextual information. The second applies reinforcement learning (RL) to fine-tune LLMs for query rewriting, directly optimizing retrieval metrics. Although each has its own advantages and limitations, the two approaches have not been compared under consistent experimental conditions. In this work, we present the first systematic comparison of prompting-based and RL-based query augmentation across diverse benchmarks, including evidence-seeking, ad hoc, and tool retrieval. Our key finding is that simple, training-free query augmentation often performs on par with, or even surpasses, more expensive RL-based counterparts, especially when using powerful LLMs. Motivated by this discovery, we introduce a novel hybrid method, On-policy Pseudo-document Query Expansion (OPQE), in which, instead of rewriting a query, the LLM policy learns to generate a pseudo-document that maximizes retrieval performance, thus merging the flexibility and generative structure of prompting with the targeted optimization of RL. We show OPQE outperforms both standalone prompting and RL-based rewriting, demonstrating that a synergistic approach yields the best results. Our implementation is made available to facilitate reproducibility.
摘要：大型语言模型 (LLM) 的最新进展引发了人们对信息检索 (IR) 查询增强的兴趣激增。已经出现了两种主要方法。第一个提示法学硕士纯粹依赖于模型的参数知识或上下文信息来生成用作新查询的答案或伪文档。第二个应用强化学习（RL）来微调LLM以进行查询重写，直接优化检索指标。虽然这两种方法各有优点和局限性，但尚未在一致的实验条件下进行比较。在这项工作中，我们首次对基于提示和基于强化学习的查询增强在不同基准上进行了系统比较，包括证据寻求、临时和工具检索。我们的主要发现是，简单的、免训练的查询增强通常可以与更昂贵的基于强化学习的查询增强相媲美，甚至超越，特别是在使用强大的法学硕士时。受这一发现的启发，我们引入了一种新颖的混合方法，即策略内伪文档查询扩展（OPQE），该方法不是重写查询，而是 LLM 策略学习生成最大化检索性能的伪文档，从而将提示的灵活性和生成结构与 RL 的目标优化相结合。我们证明 OPQE 的性能优于独立提示和基于 RL 的重写，这表明协同方法可以产生最佳结果。我们的实施是为了促进可重复性。

Title: When AI companions become witty: Can human brain recognize AI-generated irony?

Authors: Xiaohui Rao, Hanlin Wu, Zhenguang G. Cai
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.17168
Pdf URL: https://arxiv.org/pdf/2510.17168
Copy Paste: [[2510.17168]] When AI companions become witty: Can human brain recognize AI-generated irony?(https://arxiv.org/abs/2510.17168)
Keywords: language model, llm, agent
Abstract: As Large Language Models (LLMs) are increasingly deployed as social agents and trained to produce humor and irony, a question emerges: when encountering witty AI remarks, do people interpret these as intentional communication or mere computational output? This study investigates whether people adopt the intentional stance, attributing mental states to explain behavior, toward AI during irony comprehension. Irony provides an ideal paradigm because it requires distinguishing intentional contradictions from unintended errors through effortful semantic reanalysis. We compared behavioral and neural responses to ironic statements from AI versus human sources using established ERP components: P200 reflecting early incongruity detection and P600 indexing cognitive efforts in reinterpreting incongruity as deliberate irony. Results demonstrate that people do not fully adopt the intentional stance toward AI-generated irony. Behaviorally, participants attributed incongruity to deliberate communication for both sources, though significantly less for AI than human, showing greater tendency to interpret AI incongruities as computational errors. Neural data revealed attenuated P200 and P600 effects for AI-generated irony, suggesting reduced effortful detection and reanalysis consistent with diminished attribution of communicative intent. Notably, people who perceived AI as more sincere showed larger P200 and P600 effects for AI-generated irony, suggesting that intentional stance adoption is calibrated by specific mental models of artificial agents. These findings reveal that source attribution shapes neural processing of social-communicative phenomena. Despite current LLMs' linguistic sophistication, achieving genuine social agency requires more than linguistic competence; it necessitates a shift in how humans perceive and attribute intentionality to artificial agents.
摘要：随着大型语言模型（LLM）越来越多地被部署为社交代理，并被训练来产生幽默和讽刺，一个问题出现了：当遇到机智的人工智能言论时，人们会将这些解释为有意的交流还是仅仅是计算输出？这项研究调查了人们在反讽理解过程中是否对人工智能采取了有意的立场，将心理状态归因于解释行为。反讽提供了一个理想的范式，因为它需要通过努力的语义重新分析来区分有意的矛盾和无意的错误。我们使用已建立的 ERP 组件，比较了人工智能与人类对讽刺性陈述的行为和神经反应：P200 反映了早期的不协调检测，P600 索引了将不协调重新解释为故意讽刺的认知努力。结果表明，人们并没有完全对人工智能产生的讽刺采取有意的立场。在行为上，参与者将不协调归因于两个来源的故意沟通，尽管人工智能的不协调明显少于人类，表明更倾向于将人工智能的不协调解释为计算错误。神经数据显示，人工智能产生的讽刺的 P200 和 P600 效应减弱，这表明检测和重新分析的努力减少与交流意图归因的减少相一致。值得注意的是，那些认为人工智能更真诚的人对人工智能产生的讽刺表现出更大的 P200 和 P600 效应，这表明有意采取的立场是通过人工智能体的特定心理模型来校准的。这些发现表明，来源归因塑造了社会交往现象的神经处理。尽管目前法学硕士的语言十分复杂，但实现真正的社会能动性需要的不仅仅是语言能力，还需要改变人类如何感知人工智能并将意向归因于人工智能。

Title: Understanding and Improving Length Generalization in Hierarchical Sparse Attention Models

Authors: Jiaqi Leng, Xiang Hu, Junxiong Wang, Jianguo Li, Wei Wu, Yucheng Lu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.17196
Pdf URL: https://arxiv.org/pdf/2510.17196
Copy Paste: [[2510.17196]] Understanding and Improving Length Generalization in Hierarchical Sparse Attention Models(https://arxiv.org/abs/2510.17196)
Keywords: language model, long context
Abstract: Effectively processing long contexts is a critical challenge for language models. While standard Transformers are limited by quadratic complexity and poor length extrapolation, alternative architectures like sliding window attention and state space models sacrifice the ability to effectively utilize the full context due to their fixed-size memory. Chunk-based sparse attention has emerged as a promising paradigm for extreme length generalization, yet the key architectural principles underpinning its success are not yet fully understood. In this work, we present a systematic dissection of these models to identify the core components driving their performance. Through a unified framework and comprehensive ablation studies, we demonstrate that a combination of three design principles is critical: (1) an expressive, non-linear Chunk Encoder with a dedicated CLS token to produce representations for retrieval; (2) a Bypassing Residual Path to stably integrate retrieved global information without it being overridden by the local residual stream; and (3) enforced selection sparsity during pre-training to bridge the train-test distribution gap. We provide a theoretical motivation for intra-chunk information processing and landmark generation. By combining these principles, we establish a new state-of-the-art for training-free length extrapolation, successfully generalizing models trained on a 4K context to 32 million tokens on RULER and BABILong. Our findings provide a clear and empirically-grounded set of design principles for developing future, highly-capable long-context language models.
摘要：有效处理长上下文是语言模型面临的严峻挑战。虽然标准 Transformer 受到二次复杂度和较差的长度外推的限制，但滑动窗口注意力和状态空间模型等替代架构由于其固定大小的内存而牺牲了有效利用完整上下文的能力。基于块的稀疏注意力已成为极长泛化的一种有前途的范例，但支撑其成功的关键架构原则尚未完全理解。在这项工作中，我们对这些模型进行了系统剖析，以确定驱动其性能的核心组件。通过统一的框架和全面的消融研究，我们证明了三个设计原则的结合至关重要：(1) 一个富有表现力的非线性块编码器，带有专用的 CLS 令牌来生成用于检索的表示；（2）绕过残差路径，稳定地整合检索到的全局信息，而不会被本地残差流覆盖； (3) 在预训练期间强制选择稀疏性，以弥合训练-测试分布差距。我们为块内信息处理和地标生成提供了理论动机。通过结合这些原则，我们建立了一种新的最先进的免训练长度外推方法，成功地将在 4K 上下文上训练的模型推广到 RULER 和 BABILong 上的 3200 万个标记。我们的研究结果为开发未来的高性能长上下文语言模型提供了一套清晰且以经验为基础的设计原则。

Title: Wisdom is Knowing What not to Say: Hallucination-Free LLMs Unlearning via Attention Shifting

Authors: Chenchen Tan, Youyang Qu, Xinghao Li, Hui Zhang, Shujie Cui, Cunjian Chen, Longxiang Gao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.17210
Pdf URL: https://arxiv.org/pdf/2510.17210
Copy Paste: [[2510.17210]] Wisdom is Knowing What not to Say: Hallucination-Free LLMs Unlearning via Attention Shifting(https://arxiv.org/abs/2510.17210)
Keywords: language model, llm, hallucination
Abstract: The increase in computing power and the necessity of AI-assisted decision-making boost the growing application of large language models (LLMs). Along with this, the potential retention of sensitive data of LLMs has spurred increasing research into machine unlearning. However, existing unlearning approaches face a critical dilemma: Aggressive unlearning compromises model utility, while conservative strategies preserve utility but risk hallucinated responses. This significantly limits LLMs' reliability in knowledge-intensive applications. To address this, we introduce a novel Attention-Shifting (AS) framework for selective unlearning. AS is driven by two design objectives: (1) context-preserving suppression that attenuates attention to fact-bearing tokens without disrupting LLMs' linguistic structure; and (2) hallucination-resistant response shaping that discourages fabricated completions when queried about unlearning content. AS realizes these objectives through two attention-level interventions, which are importance-aware suppression applied to the unlearning set to reduce reliance on memorized knowledge and attention-guided retention enhancement that reinforces attention toward semantically essential tokens in the retained dataset to mitigate unintended degradation. These two components are jointly optimized via a dual-loss objective, which forms a soft boundary that localizes unlearning while preserving unrelated knowledge under representation superposition. Experimental results show that AS improves performance preservation over the state-of-the-art unlearning methods, achieving up to 15% higher accuracy on the ToFU benchmark and 10% on the TDEC benchmark, while maintaining competitive hallucination-free unlearning effectiveness. Compared to existing methods, AS demonstrates a superior balance between unlearning effectiveness, generalization, and response reliability.
摘要：计算能力的提高和人工智能辅助决策的必要性推动了大语言模型（LLM）的应用不断增长。除此之外，法学硕士敏感数据的潜在保留刺激了对机器去学习的研究的增加。然而，现有的忘却方法面临着一个严重的困境：积极的忘却会损害模型的效用，而保守的策略保留效用，但有可能出现幻觉反应。这极大地限制了法学硕士在知识密集型应用中的可靠性。为了解决这个问题，我们引入了一种新颖的注意力转移（AS）框架，用于选择性遗忘。 AS 由两个设计目标驱动：(1) 上下文保留抑制，在不破坏法学硕士语言结构的情况下减弱对事实承载标记的关注； (2)抗幻觉反应塑造，当被问及忘记的内容时，阻止捏造的完成。 AS 通过两种注意力级别的干预措施来实现这些目标，即对遗忘集进行重要性感知抑制，以减少对记忆知识的依赖，以及注意力引导的保留增强，加强对保留数据集中语义上重要标记的关注，以减轻意外退化。这两个组件通过双损失目标进行联合优化，形成一个软边界，可以局部化遗忘，同时在表示叠加下保留不相关的知识。实验结果表明，与最先进的忘却方法相比，AS 提高了性能保留，在 ToFU 基准上实现了高达 15% 的准确度提高，在 TDEC 基准上实现了 10% 高出 10%，同时保持了有竞争力的无幻觉忘却有效性。与现有方法相比，AS 在遗忘有效性、泛化性和响应可靠性之间表现出卓越的平衡。

Title: StreamingThinker: Large Language Models Can Think While Reading

Authors: Junlong Tong, Yingqi Fan, Anhao Zhao, Yunpu Ma, Xiaoyu Shen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.17238
Pdf URL: https://arxiv.org/pdf/2510.17238
Copy Paste: [[2510.17238]] StreamingThinker: Large Language Models Can Think While Reading(https://arxiv.org/abs/2510.17238)
Keywords: language model, llm
Abstract: Large language models (LLMs) have demonstrated remarkable capabilities in chain of thought (CoT) reasoning. However, the current LLM reasoning paradigm initiates thinking only after the entire input is available, which introduces unnecessary latency and weakens attention to earlier information in dynamic scenarios. Inspired by human cognition of thinking while reading, we first design a streaming thinking paradigm for LLMs, where reasoning unfolds in the order of input and further adjusts its depth once reading is complete. We instantiate this paradigm with StreamingThinker, a framework that enables LLMs to think while reading through the integration of streaming CoT generation, streaming-constraint training, and streaming parallel inference. Specifically, StreamingThinker employs streaming reasoning units with quality control for CoT generation, enforces order-preserving reasoning through streaming attention masks and position encoding, and leverages parallel KV caches that decouple input encoding from reasoning generation, thereby ensuring alignment and enabling true concurrency. We evaluate StreamingThinker on the Qwen3 model family across math reasoning, logical reasoning, and context-based QA reasoning tasks. Experimental results show that the StreamingThinker preserves performance comparable to batch thinking, while yielding an 80% reduction in token waiting before the onset of reasoning and a more than 60% reduction in time-level latency for producing the final answer, demonstrating the effectiveness of the streaming paradigm for LLM reasoning. Code will be released at this https URL.
摘要：大型语言模型 (LLM) 在思想链 (CoT) 推理方面表现出了卓越的能力。然而，当前的LLM推理范式仅在整个输入可用后才开始思考，这引入了不必要的延迟并削弱了对动态场景中早期信息的关注。受人类阅读时思考认知的启发，我们首先为法学硕士设计了 \textit{\textbf{流式思维}} 范式，其中推理按照输入顺序展开，并在阅读完成后进一步调整其深度。我们用 \textit{StreamingThinker} 实例化了这一范例，该框架使法学硕士能够在阅读流式 CoT 生成、流式约束训练和流式并行推理的集成时进行思考。具体来说，StreamingThinker 采用具有 CoT 生成质量控制的流式推理单元，通过流式注意力掩码和位置编码强制执行保序推理，并利用并行 KV 缓存将输入编码与推理生成解耦，从而确保对齐并实现真正的并发。我们在 Qwen3 模型系列上评估 StreamingThinker 的数学推理、逻辑推理和基于上下文的 QA 推理任务。实验结果表明，StreamingThinker 保留了与批处理思维相当的性能，同时推理开始前的令牌等待减少了 80%，生成最终答案的时间级延迟减少了 60% 以上，证明了流式范式对于 LLM 推理的有效性。代码将在 \href{此 https URL}{此存储库。} 发布
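
The order-preserving constraint mentioned in the StreamingThinker abstract can be pictured with a small mask. The following is a minimal Python sketch, not the authors' implementation: it assumes the input arrives in chunks, that reasoning unit i is produced right after chunk i, and that a reasoning token may attend only to input tokens that have already arrived. The function name `streaming_input_mask` and this chunk-to-unit pairing are hypothetical.

```python
from itertools import accumulate


def streaming_input_mask(chunk_lens, unit_lens):
    """Build a boolean mask of shape [reasoning tokens][input tokens].

    Assumption (not from the abstract): reasoning unit i is generated after
    input chunk i, so its tokens may see input chunks 0..i only.
    """
    if len(chunk_lens) != len(unit_lens):
        raise ValueError("expected one reasoning unit per input chunk")

    # visible_upto[i] = number of input tokens available once chunk i is read
    visible_upto = list(accumulate(chunk_lens))
    total_inputs = visible_upto[-1] if visible_upto else 0

    mask = []
    for i, n_reasoning_tokens in enumerate(unit_lens):
        row = [col < visible_upto[i] for col in range(total_inputs)]
        # Every token in the same reasoning unit shares the same visibility.
        mask.extend(list(row) for _ in range(n_reasoning_tokens))
    return mask


# Two input chunks (2 and 3 tokens), two reasoning units (1 and 2 tokens).
mask = streaming_input_mask([2, 3], [1, 2])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

A batch-thinking baseline would correspond to a mask that is all True, since reasoning starts only after the whole input is read.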

Title: From Preferences to Prejudice: The Role of Alignment Tuning in Shaping Social Bias in Video Diffusion Models

Authors: Zefan Cai, Haoyi Qiu, Haozhe Zhao, Ke Wan, Jiachen Li, Jiuxiang Gu, Wen Xiao, Nanyun Peng, Junjie Hu
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2510.17247
Pdf URL: https://arxiv.org/pdf/2510.17247
Copy Paste: [[2510.17247]] From Preferences to Prejudice: The Role of Alignment Tuning in Shaping Social Bias in Video Diffusion Models(https://arxiv.org/abs/2510.17247)
Keywords: prompt
Abstract: Recent advances in video diffusion models have significantly enhanced text-to-video generation, particularly through alignment tuning using reward models trained on human preferences. While these methods improve visual quality, they can unintentionally encode and amplify social biases. To systematically trace how such biases evolve throughout the alignment pipeline, we introduce VideoBiasEval, a comprehensive diagnostic framework for evaluating social representation in video generation. Grounded in established social bias taxonomies, VideoBiasEval employs an event-based prompting strategy to disentangle semantic content (actions and contexts) from actor attributes (gender and ethnicity). It further introduces multi-granular metrics to evaluate (1) overall ethnicity bias, (2) gender bias conditioned on ethnicity, (3) distributional shifts in social attributes across model variants, and (4) the temporal persistence of bias within videos. Using this framework, we conduct the first end-to-end analysis connecting biases in human preference datasets, their amplification in reward models, and their propagation through alignment-tuned video diffusion models. Our results reveal that alignment tuning not only strengthens representational biases but also makes them temporally stable, producing smoother yet more stereotyped portrayals. These findings highlight the need for bias-aware evaluation and mitigation throughout the alignment process to ensure fair and socially responsible video generation.
摘要：视频扩散模型的最新进展显着增强了文本到视频的生成，特别是通过使用根据人类偏好训练的奖励模型进行对齐调整。虽然这些方法提高了视觉质量，但它们可能会无意中编码和放大社会偏见。为了系统地追踪此类偏差如何在整个对齐流程中演变，我们引入了 VideoBiasEval，这是一个用于评估视频生成中的社会表征的综合诊断框架。 VideoBiasEval 基于已建立的社会偏见分类法，采用基于事件的提示策略将语义内容（动作和上下文）与演员属性（性别和种族）分开。它还引入了多粒度指标来评估（1）整体种族偏见，（2）以种族为条件的性别偏见，（3）跨模型变体的社会属性的分布变化，以及（4）视频中偏见的时间持续性。使用这个框架，我们进行了第一次端到端分析，将人类偏好数据集中的偏差、它们在奖励模型中的放大以及它们通过对齐调整的视频扩散模型的传播联系起来。我们的结果表明，对齐调整不仅增强了表征偏差，而且使它们在时间上稳定，产生更平滑但更刻板的描绘。这些发现强调了在整个协调过程中进行偏见意识评估和缓解的必要性，以确保公平且对社会负责的视频生成。

Title: Explainability of Large Language Models: Opportunities and Challenges toward Generating Trustworthy Explanations

Authors: Shahin Atakishiyev, Housam K.B. Babiker, Jiayi Dai, Nawshad Farruque, Teruaki Hayashi, Nafisa Sadaf Hriti, Md Abed Rahman, Iain Smith, Mi-Young Kim, Osmar R. Zaïane, Randy Goebel
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.17256
Pdf URL: https://arxiv.org/pdf/2510.17256
Copy Paste: [[2510.17256]] Explainability of Large Language Models: Opportunities and Challenges toward Generating Trustworthy Explanations(https://arxiv.org/abs/2510.17256)
Keywords: language model, llm, hallucination
Abstract: Large language models have exhibited impressive performance across a broad range of downstream tasks in natural language processing. However, how a language model predicts the next token and generates content is not generally understandable by humans. Furthermore, these models often make errors in prediction and reasoning, known as hallucinations. These errors underscore the urgent need to better understand and interpret the intricate inner workings of language models and how they generate predictive outputs. Motivated by this gap, this paper investigates local explainability and mechanistic interpretability within Transformer-based large language models to foster trust in such models. In this regard, our paper aims to make three key contributions. First, we present a review of local explainability and mechanistic interpretability approaches and insights from relevant studies in the literature. Furthermore, we describe experimental studies on explainability and reasoning with large language models in two critical domains -- healthcare and autonomous driving -- and analyze the trust implications of such explanations for explanation receivers. Finally, we summarize current unaddressed issues in the evolving landscape of LLM explainability and outline the opportunities, critical challenges, and future directions toward generating human-aligned, trustworthy LLM explanations.
摘要：大型语言模型在自然语言处理的广泛下游任务中表现出了令人印象深刻的性能。然而，语言模型如何预测下一个标记并生成内容通常是人类无法理解的。此外，这些模型经常在预测和推理中犯错误，称为幻觉。这些错误强调了迫切需要更好地理解和解释语言模型复杂的内部工作原理以及它们如何生成预测输出。受这一差距的启发，本文研究了基于 Transformer 的大语言模型中的局部可解释性和机械可解释性，以培养对此类模型的信任。在这方面，我们的论文旨在做出三个关键贡献。首先，我们回顾了局部可解释性和机械可解释性方法以及文献中相关研究的见解。此外，我们描述了在两个关键领域（医疗保健和自动驾驶）中使用大型语言模型进行可解释性和推理的实验研究，并分析了此类解释对解释接收者的信任影响。最后，我们总结了法学硕士可解释性不断发展的前景中当前未解决的问题，并概述了产生人性化、值得信赖的法学硕士解释的机遇、关键挑战和未来方向。

Title: TaxoAlign: Scholarly Taxonomy Generation Using Language Models

Authors: Avishek Lahiri, Yufang Hou, Debarshi Kumar Sanyal
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.17263
Pdf URL: https://arxiv.org/pdf/2510.17263
Copy Paste: [[2510.17263]] TaxoAlign: Scholarly Taxonomy Generation Using Language Models(https://arxiv.org/abs/2510.17263)
Keywords: language model
Abstract: Taxonomies play a crucial role in helping researchers structure and navigate knowledge in a hierarchical manner. They also form an important part of creating comprehensive literature surveys. The existing approaches to automatic survey generation do not compare the structure of the generated surveys with those written by human experts. To address this gap, we present a method for automated taxonomy creation that narrows the difference between human-generated and automatically created taxonomies. For this purpose, we create the CS-TaxoBench benchmark, which consists of 460 taxonomies that have been extracted from human-written survey papers. We also include an additional test set of 80 taxonomies curated from conference survey papers. We propose TaxoAlign, a three-phase topic-based instruction-guided method for scholarly taxonomy generation. Additionally, we propose a stringent automated evaluation framework that measures the structural alignment and semantic coherence of automatically generated taxonomies in comparison to those created by human experts. We evaluate our method and various baselines on CS-TaxoBench, using both automated evaluation metrics and human evaluation studies. The results show that TaxoAlign consistently surpasses the baselines on nearly all metrics. The code and data can be found at this https URL.
摘要：分类法在帮助研究人员以分层方式构建和导航知识方面发挥着至关重要的作用。它们也是综合文献调查创作的重要组成部分。现有的自动调查生成方法不会将生成的调查的结构与人类专家编写的调查的结构进行比较。为了解决这一差距，我们提出了自己的自动分类法创建方法，可以弥合人类生成的分类法和自动创建的分类法之间的差距。为此，我们创建了 CS-TaxoBench 基准，其中包含从人类撰写的调查论文中提取的 460 个分类法。我们还提供了一个额外的测试集，其中包含根据会议调查论文整理的 80 个分类法。我们提出了 TaxoAlign，一种用于学术分类生成的三阶段基于主题的指令引导方法。此外，我们提出了一个严格的自动化评估框架，用于衡量自动生成的分类法与人类专家创建的分类法相比的结构对齐和语义一致性。我们使用自动评估指标和人工评估研究在 CS-TaxoBench 上评估我们的方法和各种基线。结果表明，TaxoAlign 几乎在所有指标上都始终超过了基线。代码和数据可以在此 https URL 中找到。

Title: Towards Mixed-Modal Retrieval for Universal Retrieval-Augmented Generation

Authors: Chenghao Zhang, Guanting Dong, Xinyu Yang, Zhicheng Dou
Subjects: cs.CL, cs.AI, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2510.17354
Pdf URL: https://arxiv.org/pdf/2510.17354
Copy Paste: [[2510.17354]] Towards Mixed-Modal Retrieval for Universal Retrieval-Augmented Generation(https://arxiv.org/abs/2510.17354)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) has emerged as a powerful paradigm for enhancing large language models (LLMs) by retrieving relevant documents from an external corpus. However, existing RAG systems primarily focus on unimodal text documents, and often fall short in real-world scenarios where both queries and documents may contain mixed modalities (such as text and images). In this paper, we address the challenge of Universal Retrieval-Augmented Generation (URAG), which involves retrieving and reasoning over mixed-modal information to improve vision-language generation. To this end, we propose Nyx, a unified mixed-modal to mixed-modal retriever tailored for URAG scenarios. To mitigate the scarcity of realistic mixed-modal data, we introduce a four-stage automated pipeline for generation and filtering, leveraging web documents to construct NyxQA, a dataset comprising diverse mixed-modal question-answer pairs that better reflect real-world information needs. Building on this high-quality dataset, we adopt a two-stage training framework for Nyx: we first perform pre-training on NyxQA along with a variety of open-source retrieval datasets, followed by supervised fine-tuning using feedback from downstream vision-language models (VLMs) to align retrieval outputs with generative preferences. Experimental results demonstrate that Nyx not only performs competitively on standard text-only RAG benchmarks, but also excels in the more general and realistic URAG setting, significantly improving generation quality in vision-language tasks.
摘要：检索增强生成（RAG）已成为通过从外部语料库检索相关文档来增强大型语言模型（LLM）的强大范例。然而，现有的 RAG 系统主要关注单模态文本文档，在查询和文档都可能包含混合模态（例如文本和图像）的现实场景中常常达不到要求。在本文中，我们解决了通用检索增强生成（URAG）的挑战，其中涉及对混合模式信息进行检索和推理以改进视觉语言生成。为此，我们提出了 Nyx，一个专为 URAG 场景量身定制的统一混合模态到混合模态检索器。为了缓解现实混合模式数据的稀缺性，我们引入了用于生成和过滤的四阶段自动化管道，利用网络文档构建 NyxQA，这是一个包含多种混合模式问答对的数据集，可以更好地反映现实世界的信息需求。在此高质量数据集的基础上，我们采用了 Nyx 的两阶段训练框架：首先对 NyxQA 以及各种开源检索数据集进行预训练，然后使用下游视觉语言模型 (VLM) 的反馈进行监督微调，以使检索输出与生成偏好保持一致。实验结果表明，Nyx 不仅在标准纯文本 RAG 基准测试中具有竞争力，而且在更通用和更现实的 URAG 设置中也表现出色，显着提高了视觉语言任务的生成质量。

Title: The Atomic Instruction Gap: Instruction-Tuned LLMs Struggle with Simple, Self-Contained Directives

Authors: Henry Lim, Kwan Hui Lim
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.17388
Pdf URL: https://arxiv.org/pdf/2510.17388
Copy Paste: [[2510.17388]] The Atomic Instruction Gap: Instruction-Tuned LLMs Struggle with Simple, Self-Contained Directives(https://arxiv.org/abs/2510.17388)
Keywords: language model, llm
Abstract: Instruction-tuned large language models (IT-LLMs) exhibit strong zero-shot reasoning, yet their ability to execute simple, self-contained instructions remains underexplored, despite this being foundational to complex instruction-following. We evaluate 20 IT-LLMs on modified MMLU and MMLU-Pro benchmarks, by systematically varying the format of option labels (alphabetic, numeric, Roman) while keeping their meaning identical under four paradigms, namely: (1) With explicit instructions, label changes cause large performance shifts (e.g., -30.45% for Roman vs. numeric), revealing instruction-format bias. (2) Without instructions, performance drops further (up to -10.84%) and label sensitivity intensifies, underscoring the role of explicit guidance. (3) When option contents are removed, models fail random-choice baselines except with numeric labels, suggesting weak adherence to atomic directives. (4) Three-shot exemplars yield no significant gains in robustness or fidelity, and generation analyses show persistent label errors, especially for non-numeric formats. Across model sizes, larger LLMs achieve higher accuracy but remain inconsistent in instruction adherence. These results expose the insufficiencies of current instruction-tuning paradigms and highlight the need for evaluation methods and training strategies that explicitly target atomic instruction-following.
摘要：指令调整的大型语言模型 (IT-LLM) 表现出强大的零样本推理能力，但它们执行简单、独立指令的能力仍未得到充分开发，尽管这是复杂指令跟踪的基础。我们在修改后的 MMLU 和 MMLU-Pro 基准上评估了 20 个 IT-LLM，通过系统地改变选项标签的格式（字母、数字、罗马），同时在四种范式下保持其含义相同，即：（1）使用显式指令，标签更改会导致较大的性能变化（例如，罗马与数字的 -30.45%），揭示指令格式偏见。（2）如果没有指导，性能进一步下降（最高至-10.84%），标签敏感性加剧，凸显了明确指导的作用。 (3) 当选项内容被删除时，除了数字标签之外，模型无法达到随机选择基线，这表明对原子指令的遵守程度较弱。 (4) 三样本样本在鲁棒性或保真度方面没有产生显着的收益，并且生成分析显示持续的标签错误，特别是对于非数字格式。在不同的模型尺寸中，较大的法学硕士可以实现更高的准确性，但在指令遵守方面仍然不一致。这些结果暴露了当前指令调优范式的不足，并强调了对明确针对原子指令遵循的评估方法和训练策略的需求。
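
The label-format manipulation described in this abstract (same options, different label styles) can be illustrated with a short Python sketch. The helper names `to_roman`, `LABELERS`, and `format_options`, as well as the "label. option" layout, are hypothetical; the abstract does not specify the exact prompt template.

```python
def to_roman(n):
    """Convert a small positive integer (1..39) to a Roman numeral."""
    numerals = [(10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I")]
    out = []
    for value, symbol in numerals:
        while n >= value:
            out.append(symbol)
            n -= value
    return "".join(out)


# One labeler per format named in the abstract; index i is zero-based.
LABELERS = {
    "alphabetic": lambda i: chr(ord("A") + i),
    "numeric": lambda i: str(i + 1),
    "roman": lambda i: to_roman(i + 1),
}


def format_options(options, style):
    """Render the same option contents under a chosen label style."""
    label = LABELERS[style]
    return "\n".join(f"{label(i)}. {opt}" for i, opt in enumerate(options))


options = ["red", "green", "blue", "gray"]
for style in LABELERS:
    print(f"[{style}]")
    print(format_options(options, style))
```

Because only the labels change, any accuracy difference between the three renderings isolates sensitivity to label format rather than to option content.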

Title: EduAdapt: A Question Answer Benchmark Dataset for Evaluating Grade-Level Adaptability in LLMs

Authors: Numaan Naeem, Abdellah El Mekki, Muhammad Abdul-Mageed
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.17389
Pdf URL: https://arxiv.org/pdf/2510.17389
Copy Paste: [[2510.17389]] EduAdapt: A Question Answer Benchmark Dataset for Evaluating Grade-Level Adaptability in LLMs(https://arxiv.org/abs/2510.17389)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) are transforming education by answering questions, explaining complex concepts, and generating content across a wide range of subjects. Despite strong performance on academic benchmarks, they often fail to tailor responses to students' grade levels. This is a critical need in K-12 education, where age-appropriate vocabulary and explanation are essential for effective learning. Existing models frequently produce outputs that are too advanced or vague for younger learners, and there are no standardized benchmarks to evaluate their ability to adjust across cognitive and developmental stages. To address this gap, we introduce EduAdapt, a benchmark of nearly 48k grade-labeled QA pairs across nine science subjects, spanning Grades 1-12 and grouped into four grade levels. We evaluate a diverse set of open-source LLMs on EduAdapt and find that while larger models generally perform better, they still struggle with generating suitable responses for early-grade students (Grades 1-5). Our work presents the first dataset and evaluation framework for assessing grade-level adaptability in LLMs, aiming to foster more developmentally aligned educational AI systems through better training and prompting strategies. EduAdapt code and datasets are publicly available at this https URL.
摘要：大型语言模型 (LLM) 正在通过回答问题、解释复杂概念以及生成跨广泛学科的内容来改变教育。尽管在学术基准上表现出色，但他们往往无法根据学生的年级水平制定应对措施。这是 K-12 教育的关键需求，适合年龄的词汇和解释对于有效学习至关重要。现有模型经常产生对于年轻学习者来说过于高级或模糊的输出，并且没有标准化的基准来评估他们在认知和发展阶段进行调整的能力。为了解决这一差距，我们引入了 EduAdapt，这是一个基准，包含九个科学科目的近 48,000 个带有年级标记的 QA 对，涵盖 1-12 年级，分为四个年级。我们在 EduAdapt 上评估了一组不同的开源法学硕士，发现虽然较大的模型通常表现更好，但它们仍然难以为低年级学生（1-5 年级）生成合适的响应。我们的工作提出了第一个用于评估法学硕士年级适应性的数据集和评估框架，旨在通过更好的培训和激励策略来培育更加符合发展的教育人工智能系统。 EduAdapt 代码和数据集可在此 https URL 上公开获取。

Title: Leveraging Group Relative Policy Optimization to Advance Large Language Models in Traditional Chinese Medicine

Authors: Jiacheng Xie, Shuai Zeng, Yang Yu, Xiaoting Tang, Guanghui An, Dong Xu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.17402
Pdf URL: https://arxiv.org/pdf/2510.17402
Copy Paste: [[2510.17402]] Leveraging Group Relative Policy Optimization to Advance Large Language Models in Traditional Chinese Medicine(https://arxiv.org/abs/2510.17402)
Keywords: language model, gpt, llm
Abstract: Traditional Chinese Medicine (TCM) presents a rich and structurally unique knowledge system that challenges conventional applications of large language models (LLMs). Although previous TCM-specific LLMs have shown progress through supervised fine-tuning, they often face limitations in alignment, data quality, and evaluation consistency. In this study, we introduce Ladder-base, the first TCM-focused LLM trained with Group Relative Policy Optimization (GRPO), a reinforcement learning method that improves reasoning and factual consistency by optimizing response selection based on intra-group comparisons. Ladder-base is built upon the Qwen2.5-7B-Instruct foundation model and trained exclusively on the textual subset of the TCM-Ladder benchmark, using 80 percent of the data for training and the remaining 20 percent split evenly between validation and test sets. Through standardized evaluation, Ladder-base demonstrates superior performance across multiple reasoning metrics when compared to both state-of-the-art general-purpose LLMs such as GPT-4, Gemini 2.5, Claude 3, and Qwen3 and domain-specific TCM models including BenTsao, HuatuoGPT2, and Zhongjing. These findings suggest that GRPO provides an effective and efficient strategy for aligning LLMs with expert-level reasoning in traditional medical domains and supports the development of trustworthy and clinically grounded TCM artificial intelligence systems.
摘要：中医（TCM）呈现出丰富且结构独特的知识体系，挑战了大语言模型（LLM）的传统应用。尽管之前的中医法学硕士已经通过监督微调取得了进展，但它们经常面临一致性、数据质量和评估一致性方面的限制。在这项研究中，我们引入了Ladder-base，这是第一个以中医为中心的法学硕士，通过组相对策略优化（GRPO）进行训练，GRPO是一种强化学习方法，通过基于组内比较优化响应选择来提高推理和事实一致性。 Ladder-base 建立在 Qwen2.5-7B-Instruct 基础模型的基础上，专门针对 TCM-Ladder 基准的文本子集进行训练，使用 80% 的数据进行训练，其余 20% 在验证集和测试集之间平均分配。通过标准化评估，与最先进的通用 LLM（如 GPT-4、Gemini 2.5、Claude 3 和 Qwen3）以及特定领域的 TCM 模型（包括 BenTsao、HuatuoGPT2 和 Zhujing）相比，Ladder-base 在多个推理指标上表现出了卓越的性能。这些研究结果表明，GRPO 提供了一种有效且高效的策略，使法学硕士与传统医学领域的专家级推理相结合，并支持开发值得信赖且基于临床的中医人工智能系统。
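
The abstract describes GRPO as optimizing response selection through intra-group comparisons, together with an 80/10/10 data split. The following is a minimal Python sketch under stated assumptions: it uses the commonly described GRPO normalization (each sampled response's reward minus the group mean, divided by the group standard deviation), which the abstract itself does not spell out, and the function names `group_relative_advantages` and `split_sizes` are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards):
    """Score each response relative to the other responses in its group.

    Assumption: advantage = (reward - group mean) / group std, the usual
    GRPO formulation; a group with identical rewards yields zero advantages.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


def split_sizes(n_examples):
    """80% train, with the remaining 20% split evenly into validation/test."""
    n_train = n_examples * 8 // 10
    remaining = n_examples - n_train
    n_val = remaining // 2
    n_test = remaining - n_val
    return n_train, n_val, n_test


# Four sampled answers to one prompt, rewarded 1.0 if correct and 0.0 otherwise.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
print(split_sizes(1000))
```

The design point is that no separate value model is needed: the group of sampled responses serves as its own baseline.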

Title: BenCao: An Instruction-Tuned Large Language Model for Traditional Chinese Medicine

Authors: Jiacheng Xie, Yang Yu, Yibo Chen, Hanyao Zhang, Lening Zhao, Jiaxuan He, Lei Jiang, Xiaoting Tang, Guanghui An, Dong Xu
Subjects: cs.CL, cs.AI, cs.MA, cs.MM, cs.SE
Abstract URL: https://arxiv.org/abs/2510.17415
Pdf URL: https://arxiv.org/pdf/2510.17415
Copy Paste: [[2510.17415]] BenCao: An Instruction-Tuned Large Language Model for Traditional Chinese Medicine(https://arxiv.org/abs/2510.17415)
Keywords: language model, gpt, llm, chat, chain-of-thought
Abstract: Traditional Chinese Medicine (TCM), with a history spanning over two millennia, plays a role in global healthcare. However, applying large language models (LLMs) to TCM remains challenging due to its reliance on holistic reasoning, implicit logic, and multimodal diagnostic cues. Existing TCM-domain LLMs have made progress in text-based understanding but lack multimodal integration, interpretability, and clinical applicability. To address these limitations, we developed BenCao, a ChatGPT-based multimodal assistant for TCM, integrating structured knowledge bases, diagnostic data, and expert feedback refinement. BenCao was trained through natural language instruction tuning rather than parameter retraining, aligning with expert-level reasoning and ethical norms specific to TCM. The system incorporates a comprehensive knowledge base of over 1,000 classical and modern texts, a scenario-based instruction framework for diverse interactions, a chain-of-thought simulation mechanism for interpretable reasoning, and a feedback refinement process involving licensed TCM practitioners. BenCao connects to external APIs for tongue-image classification and multimodal database retrieval, enabling dynamic access to diagnostic resources. In evaluations across single-choice question benchmarks and multimodal classification tasks, BenCao achieved superior accuracy to general-domain and TCM-domain models, particularly in diagnostics, herb recognition, and constitution classification. The model was deployed as an interactive application on the OpenAI GPTs Store, accessed by nearly 1,000 users globally as of October 2025. This study demonstrates the feasibility of developing a TCM-domain LLM through natural language-based instruction tuning and multimodal integration, offering a practical framework for aligning generative AI with traditional medical reasoning and a scalable pathway for real-world deployment.
摘要：中医药有着两千年的历史，在全球医疗保健领域发挥着重要作用。然而，由于其依赖于整体推理、隐式逻辑和多模态诊断线索，因此将大语言模型（LLM）应用于中医仍然具有挑战性。现有的中医领域法学硕士在基于文本的理解方面取得了进展，但缺乏多模式整合、可解释性和临床适用性。为了解决这些限制，我们开发了 BenCao，一种基于 ChatGPT 的中医多模式助手，集成了结构化知识库、诊断数据和专家反馈细化。 BenCao 通过自然语言指令调整而不是参数再训练进行训练，符合中医特有的专家级推理和道德规范。该系统融合了1000多篇古典和现代文本的综合知识库、基于场景的多样化交互教学框架、可解释推理的思维链模拟机制以及涉及执业中医从业者的反馈细化过程。 BenCao 连接外部 API 以进行舌图像分类和多模式数据库检索，从而实现对诊断资源的动态访问。在单选题基准和多模态分类任务的评估中，本草取得了优于一般领域和中医领域模型的准确性，特别是在诊断、草药识别和体质分类方面。该模型已作为交互式应用程序部署在 OpenAI GPT 商店上，截至 2025 年 10 月，全球有近 1,000 名用户访问。这项研究证明了通过基于自然语言的指令调整和多模式集成开发 TCM 领域 LLM 的可行性，为将生成式 AI 与传统医学推理结合起来提供了一个实用框架，并为现实世界的部署提供了可扩展的途径。

Title: Agentic Reinforcement Learning for Search is Unsafe

Authors: Yushi Yang, Shreyansh Padarha, Andrew Lee, Adam Mahdi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.17431
Pdf URL: https://arxiv.org/pdf/2510.17431
Copy Paste: [[2510.17431]] Agentic Reinforcement Learning for Search is Unsafe(https://arxiv.org/abs/2510.17431)
Keywords: language model, agent
Abstract: Agentic reinforcement learning (RL) trains large language models to autonomously call tools during reasoning, with search as the most common application. These models excel at multi-step reasoning tasks, but their safety properties are not well understood. In this study, we show that RL-trained search models inherit refusal from instruction tuning and often deflect harmful requests by turning them into safe queries. However, this safety is fragile. Two simple attacks, one that forces the model to begin its response with a search (Search attack) and another that encourages the model to search repeatedly (Multi-search attack), trigger cascades of harmful searches and answers. Across two model families (Qwen, Llama) with both local and web search, these attacks lower refusal rates by up to 60.0%, answer safety by 82.5%, and search-query safety by 82.4%. The attacks succeed by triggering models to generate harmful, request-mirroring search queries before they can generate the inherited refusal tokens. This exposes a core weakness of current RL training: it rewards continued generation of effective queries without accounting for their harmfulness. As a result, RL search models have vulnerabilities that users can easily exploit, making it urgent to develop safety-aware agentic RL pipelines optimising for safe search.
摘要：代理强化学习（RL）训练大型语言模型在推理过程中自主调用工具，其中搜索是最常见的应用。这些模型擅长多步骤推理任务，但它们的安全特性尚不清楚。在这项研究中，我们表明，强化学习训练的搜索模型继承了指令调整的拒绝，并且经常通过将有害请求转变为安全查询来转移有害请求。然而，这种安全性是脆弱的。两种简单的攻击，一种迫使模型开始通过搜索进行响应（搜索攻击），另一种鼓励模型重复搜索（多搜索攻击），触发一连串有害的搜索和答案。在具有本地和网络搜索的两个模型系列（Qwen、Llama）中，这些攻击将拒绝率降低了高达 60.0%，答案安全性降低了 82.5%，搜索查询安全性降低了 82.4%。这些攻击通过触发模型在生成继承的拒绝令牌之前生成有害的请求镜像搜索查询来成功。这暴露了当前强化学习训练的一个核心弱点：它奖励持续生成的有效查询，而不考虑其危害性。因此，强化学习搜索模型存在用户可以轻松利用的漏洞，因此迫切需要开发具有安全意识的代理强化学习管道，以优化安全搜索。

Title: Multilingual Clinical NER for Diseases and Medications Recognition in Cardiology Texts using BERT Embeddings

Authors: Manuela Daniela Danu, George Marica, Constantin Suciu, Lucian Mihai Itu, Oladimeji Farri
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.17437
Pdf URL: https://arxiv.org/pdf/2510.17437
Copy Paste: [[2510.17437]] Multilingual Clinical NER for Diseases and Medications Recognition in Cardiology Texts using BERT Embeddings(https://arxiv.org/abs/2510.17437)
Keywords: language model
Abstract: The rapidly increasing volume of electronic health record (EHR) data underscores a pressing need to unlock biomedical knowledge from unstructured clinical texts to support advancements in data-driven clinical systems, including patient diagnosis, disease progression monitoring, treatment effects assessment, and prediction of future clinical events. While contextualized language models have demonstrated impressive performance improvements for named entity recognition (NER) systems in English corpora, there remains a scarcity of research focused on clinical texts in low-resource languages. To bridge this gap, our study aims to develop multiple deep contextual embedding models to enhance clinical NER in the cardiology domain, as part of the BioASQ MultiCardioNER shared task. We explore the effectiveness of different monolingual and multilingual BERT-based models, trained on general domain text, for extracting disease and medication mentions from clinical case reports written in English, Spanish, and Italian. We achieved an F1-score of 77.88% on Spanish Diseases Recognition (SDR), 92.09% on Spanish Medications Recognition (SMR), 91.74% on English Medications Recognition (EMR), and 88.9% on Italian Medications Recognition (IMR). These results outperform the mean and median F1 scores in the test leaderboard across all subtasks, with the mean/median values being: 69.61%/75.66% for SDR, 81.22%/90.18% for SMR, 89.2%/88.96% for EMR, and 82.8%/87.76% for IMR.
摘要：电子健康记录 (EHR) 数据量的快速增长凸显了从非结构化临床文本中解锁生物医学知识的迫切需要，以支持数据驱动的临床系统的进步，包括患者诊断、疾病进展监测、治疗效果评估、未来临床事件的预测等。虽然情境化语言模型已经在英语语料库中展示了命名实体识别 (NER) 系统的令人印象深刻的性能改进，但仍然缺乏研究专注于资源匮乏语言的临床文本。为了弥补这一差距，我们的研究旨在开发多种深度上下文嵌入模型，以增强心脏病学领域的临床 NER，作为 BioASQ MultiCardioNER 共享任务的一部分。我们探索了不同的基于 BERT 的单语言和多语言模型的有效性，这些模型在一般领域文本上进行训练，从英语、西班牙语和意大利语编写的临床病例报告中提取疾病和药物提及。我们在西班牙语疾病识别 (SDR) 上的 F1 分数为 77.88%，在西班牙语药物识别 (SMR) 上的 F1 分数为 92.09%，在英语药物识别 (EMR) 上的 F1 分数为 91.74%，在意大利药物识别 (IMR) 上的 F1 分数为 88.9%。这些结果优于所有子任务测试排行榜中 F1 分数的平均值和中位数，平均值/中位数为：SDR 为 69.61%/75.66%，SMR 为 81.22%/90.18%，EMR 为 89.2%/88.96%，IMR 为 82.8%/87.76%。
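
The claim that the reported F1 scores beat both the leaderboard mean and median on every subtask can be checked with simple arithmetic. The Python sketch below uses only the figures quoted in the abstract; the dictionary layout and the name `margins` are illustrative choices.

```python
# F1 scores in percent, copied from the abstract:
# subtask -> (reported, leaderboard mean, leaderboard median)
SCORES = {
    "SDR": (77.88, 69.61, 75.66),
    "SMR": (92.09, 81.22, 90.18),
    "EMR": (91.74, 89.20, 88.96),
    "IMR": (88.90, 82.80, 87.76),
}


def margins(scores):
    """Return subtask -> (points above mean, points above median)."""
    return {
        task: (round(ours - avg, 2), round(ours - med, 2))
        for task, (ours, avg, med) in scores.items()
    }


for task, (over_mean, over_median) in margins(SCORES).items():
    print(f"{task}: +{over_mean:.2f} vs mean, +{over_median:.2f} vs median")
```

The margins over the median are narrower than those over the mean for SDR, SMR, and IMR, which suggests the leaderboard means are pulled down by a few low-scoring submissions.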

Title: Evaluating Large Language Models on Urdu Idiom Translation

Authors: Muhammad Farmal Khan, Mousumi Akter
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.17460
Pdf URL: https://arxiv.org/pdf/2510.17460
Copy Paste: [[2510.17460]] Evaluating Large Language Models on Urdu Idiom Translation(https://arxiv.org/abs/2510.17460)
Keywords: language model, llm, prompt
Abstract: Idiomatic translation remains a significant challenge in machine translation, especially for low-resource languages such as Urdu, and has received limited prior attention. To advance research in this area, we introduce the first evaluation datasets for Urdu-to-English idiomatic translation, covering both Native Urdu and Roman Urdu scripts and annotated with gold-standard English equivalents. We evaluate multiple open-source Large Language Models (LLMs) and Neural Machine Translation (NMT) systems on this task, focusing on their ability to preserve idiomatic and cultural meaning. Automatic metrics including BLEU, BERTScore, COMET, and XCOMET are used to assess translation quality. Our findings indicate that prompt engineering enhances idiomatic translation compared to direct translation, though performance differences among prompt types are relatively minor. Moreover, cross-script comparisons reveal that text representation substantially affects translation quality, with Native Urdu inputs producing more accurate idiomatic translations than Roman Urdu.
摘要：惯用翻译仍然是机器翻译中的一个重大挑战，特别是对于乌尔都语等资源匮乏的语言，并且之前受到的关注有限。为了推进这一领域的研究，我们引入了第一个乌尔都语到英语惯用语翻译的评估数据集，涵盖本机乌尔都语和罗马乌尔都语脚本，并用黄金标准英语等效项进行注释。我们针对此任务评估了多个开源大型语言模型 (LLM) 和神经机器翻译 (NMT) 系统，重点关注它们保留惯用语和文化含义的能力。使用 BLEU、BERTScore、COMET 和 XCOMET 等自动指标来评估翻译质量。我们的研究结果表明，与直接翻译相比，提示工程可以增强惯用翻译，尽管提示类型之间的性能差异相对较小。此外，跨脚本比较表明，文本表示极大地影响翻译质量，本地乌尔都语输入比罗马乌尔都语输入产生更准确的惯用翻译。

Title: Disparities in Multilingual LLM-Based Healthcare Q&A

Authors: Ipek Baris Schlicht, Burcu Sayin, Zhixue Zhao, Frederik M. Labonté, Cesare Barbera, Marco Viviani, Paolo Rosso, Lucie Flek
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.17476
Pdf URL: https://arxiv.org/pdf/2510.17476
Copy Paste: [[2510.17476]] Disparities in Multilingual LLM-Based Healthcare Q&A(https://arxiv.org/abs/2510.17476)
Keywords: language model, llm, prompt, retrieval-augmented generation
Abstract: Equitable access to reliable health information is vital when integrating AI into healthcare. Yet, information quality varies across languages, raising concerns about the reliability and consistency of multilingual Large Language Models (LLMs). We systematically examine cross-lingual disparities in pre-training source and factuality alignment in LLM answers for multilingual healthcare Q&A across English, German, Turkish, Chinese (Mandarin), and Italian. We (i) constructed Multilingual Wiki Health Care (MultiWikiHealthCare), a multilingual dataset from Wikipedia; (ii) analyzed cross-lingual healthcare coverage; (iii) assessed LLM response alignment with these references; and (iv) conducted a case study on factual alignment through the use of contextual information and Retrieval-Augmented Generation (RAG). Our findings reveal substantial cross-lingual disparities in both Wikipedia coverage and LLM factual alignment. Across LLMs, responses align more with English Wikipedia, even when the prompts are non-English. Providing contextual excerpts from non-English Wikipedia at inference time effectively shifts factual alignment toward culturally relevant knowledge. These results highlight practical pathways for building more equitable, multilingual AI systems for healthcare.
摘要：将人工智能融入医疗保健时，公平获取可靠的健康信息至关重要。然而，不同语言的信息质量各不相同，引发了人们对多语言大语言模型 (LLM) 可靠性和一致性的担忧。我们系统地研究了英语、德语、土耳其语、中文（普通话）和意大利语的多语言医疗保健问答的法学硕士答案中预训练来源的跨语言差异和事实一致性。我们 (i) 构建了多语言 Wiki Health Care (MultiWikiHealthCare)，这是一个来自 Wikipedia 的多语言数据集； (ii) 分析跨语言医疗保健覆盖范围； (iii) 评估法学硕士的回答与这些参考文献的一致性； (iv) 通过使用上下文信息和检索增强生成（RAG）进行事实对齐的案例研究。我们的研究结果揭示了维基百科的覆盖范围和法学硕士事实一致性方面存在巨大的跨语言差异。在法学硕士中，即使提示不是英语，回复也更符合英语维基百科。在推理时提供非英语维基百科的上下文摘录可以有效地将事实对齐转向文化相关知识。这些结果凸显了为医疗保健构建更公平、多语言的人工智能系统的实用途径。

Title: ReXMoE: Reusing Experts with Minimal Overhead in Mixture-of-Experts

Authors: Zheyue Tan, Zhiyuan Li, Tao Yuan, Dong Zhou, Weilin Liu, Yueqing Zhuang, Yadong Li, Guowei Niu, Cheng Qin, Zhuyu Yao, Congyi Liu, Haiyang Xu, Boxun Li, Guohao Dai, Bo Zhao, Yu Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.17483
Pdf URL: https://arxiv.org/pdf/2510.17483
Copy Paste: [[2510.17483]] ReXMoE: Reusing Experts with Minimal Overhead in Mixture-of-Experts(https://arxiv.org/abs/2510.17483)
Keywords: language model, llm
Abstract: Mixture-of-Experts (MoE) architectures have emerged as a promising approach to scaling Large Language Models (LLMs). MoE boosts efficiency by activating a subset of experts per token. Recent works show that fine-grained experts substantially enrich the combinatorial flexibility of active experts and enhance model expressiveness. However, such a design is fundamentally limited by the layer-local routing mechanism: each layer is restricted to its own expert pool. This requires a careful trade-off between expert dimensionality and routing diversity given fixed parameter budgets. We describe ReXMoE, a novel MoE architecture that improves routing beyond the existing layer-local approaches by allowing routers to reuse experts across adjacent layers. ReXMoE decouples expert dimensionality from per-layer budgets, enabling richer expert combinations without sacrificing individual expert capacity or inflating overall parameters. To this end, we propose a new progressive scaling routing (PSR) strategy to gradually increase the candidate expert pool during training. As a result, ReXMoE improves both language modeling and downstream task performance. Extensive experiments on models ranging from 0.5B to 7B parameters across different architectures demonstrate that ReXMoE consistently improves performance under fixed architectural dimensions, confirming ReXMoE as a new design paradigm for parameter-efficient and scalable MoE-based LLMs.
摘要：专家混合 (MoE) 架构已成为扩展大型语言模型 (LLM) 的一种有前景的方法。 MoE 通过激活每个代币的专家子集来提高效率。最近的工作表明，细粒度专家极大地丰富了主动专家的组合灵活性，并增强了模型的表达能力。然而，这样的设计从根本上受到层本地路由机制的限制：每一层都仅限于自己的专家池。这需要在给定固定参数预算的情况下，在专家维度和路由多样性之间进行仔细权衡。我们描述了 ReXMoE，这是一种新颖的 MoE 架构，它允许路由器跨相邻层重用专家，从而超越现有的层本地方法来改进路由。 ReXMoE 将专家维度与每层预算解耦，从而实现更丰富的专家组合，而无需牺牲单个专家容量或夸大整体参数。为此，我们提出了一种新的渐进式缩放路由（PSR）策略，以在训练期间逐渐增加候选专家库。因此，ReXMoE 提高了语言建模和下游任务性能。对不同架构中从 0.5B 到 7B 参数的模型进行的广泛实验表明，ReXMoE 在固定架构尺寸下持续提高性能，证实 ReXMoE 是参数高效且可扩展的基于 MoE 的 LLM 的新设计范例。
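
Illustrative sketch (not taken from the paper): the abstract says routers may reuse experts from adjacent layers and that the candidate pool grows gradually during training. The Python below shows one way such a pool could be built and routed over. The names `candidate_pool`, `reuse_radius`, and `top_k_route` are hypothetical, and the actual PSR schedule and router are not specified in the abstract.

```python
def candidate_pool(layer, num_layers, experts_per_layer, reuse_radius):
    """Return (layer, expert) ids a router at `layer` may choose from.

    reuse_radius = 0 reproduces layer-local routing; larger values add
    experts from neighbouring layers (an assumed form of cross-layer reuse).
    """
    lo = max(0, layer - reuse_radius)
    hi = min(num_layers - 1, layer + reuse_radius)
    return [(l, e) for l in range(lo, hi + 1) for e in range(experts_per_layer)]


def top_k_route(scores, k):
    """Pick the k highest-scoring experts from a {expert_id: score} mapping."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical usage: widen the pool from radius 0 to radius 1.
local = candidate_pool(layer=2, num_layers=4, experts_per_layer=2, reuse_radius=0)
wider = candidate_pool(layer=2, num_layers=4, experts_per_layer=2, reuse_radius=1)
scores = {expert: float(i) for i, expert in enumerate(wider)}  # dummy router scores
chosen = top_k_route(scores, k=2)
```

Widening the radius changes only which experts are eligible, not how many parameters exist, which is the decoupling the abstract describes.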

Title: DETree: DEtecting Human-AI Collaborative Texts via Tree-Structured Hierarchical Representation Learning

Authors: Yongxin He, Shan Zhang, Yixuan Cao, Lei Ma, Ping Luo
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2510.17489
Pdf URL: https://arxiv.org/pdf/2510.17489
Copy Paste: [[2510.17489]] DETree: DEtecting Human-AI Collaborative Texts via Tree-Structured Hierarchical Representation Learning(https://arxiv.org/abs/2510.17489)
Keywords: llm
Abstract: Detecting AI-involved text is essential for combating misinformation, plagiarism, and academic misconduct. However, AI text generation includes diverse collaborative processes (AI-written text edited by humans, human-written text edited by AI, and AI-generated text refined by other AI), where various or even new LLMs could be involved. Texts generated through these varied processes exhibit complex characteristics, presenting significant challenges for detection. Current methods model these processes rather crudely, primarily employing binary classification (purely human vs. AI-involved) or multi-classification (treating human-AI collaboration as a new class). We observe that representations of texts generated through different processes exhibit inherent clustering relationships. Therefore, we propose DETree, a novel approach that models the relationships among different processes as a Hierarchical Affinity Tree structure, and introduces a specialized loss function that aligns text representations with this tree. To facilitate this learning, we developed RealBench, a comprehensive benchmark dataset that automatically incorporates a wide spectrum of hybrid texts produced through various human-AI collaboration processes. Our method improves performance in hybrid text detection tasks and significantly enhances robustness and generalization in out-of-distribution scenarios, particularly in few-shot learning conditions, further demonstrating the promise of training-based approaches in OOD settings. Our code and dataset are available at this https URL.
摘要：检测涉及人工智能的文本对于打击错误信息、抄袭和学术不端行为至关重要。然而，人工智能文本生成包括多种协作过程（由人类编辑的人工智能编写的文本、由人工智能编辑的人类编写的文本以及由其他人工智能精炼的人工智能生成的文本），其中可能涉及各种甚至新的法学硕士。通过这些不同的过程生成的文本表现出复杂的特征，给检测带来了巨大的挑战。目前的方法对这些过程进行建模相当粗糙，主要采用二元分类（纯人类与人工智能相关）或多分类（将人类与人工智能协作视为一个新类别）。我们观察到，通过不同过程生成的文本表示表现出固有的聚类关系。因此，我们提出了 DETree，一种将不同进程之间的关系建模为分层亲和树结构的新颖方法，并引入了一种专门的损失函数，将文本表示与该树对齐。为了促进这种学习，我们开发了 RealBench，这是一个综合基准数据集，它自动整合通过各种人类与人工智能协作流程生成的各种混合文本。我们的方法提高了混合文本检测任务的性能，并显着增强了分布外场景中的鲁棒性和泛化性，特别是在少样本学习条件下，进一步证明了 OOD 设置中基于训练的方法的前景。我们的代码和数据集可在此 https URL 获取。

Title: Empowering Real-World: A Survey on the Technology, Practice, and Evaluation of LLM-driven Industry Agents

Authors: Yihong Tang, Kehai Chen, Liang Yue, Jinxin Fan, Caishen Zhou, Xiaoguang Li, Yuyang Zhang, Mingming Zhao, Shixiong Kai, Kaiyang Guo, Xingshan Zeng, Wenjing Cun, Lifeng Shang, Min Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.17491
Pdf URL: https://arxiv.org/pdf/2510.17491
Copy Paste: [[2510.17491]] Empowering Real-World: A Survey on the Technology, Practice, and Evaluation of LLM-driven Industry Agents(https://arxiv.org/abs/2510.17491)
Keywords: language model, llm, agent
Abstract: With the rise of large language models (LLMs), LLM agents capable of autonomous reasoning, planning, and executing complex tasks have become a frontier in artificial intelligence. However, how to translate the research on general agents into productivity that drives industry transformations remains a significant challenge. To address this, this paper systematically reviews the technologies, applications, and evaluation methods of industry agents based on LLMs. Using an industry agent capability maturity framework, it outlines the evolution of agents in industry applications, from "process execution systems" to "adaptive social systems." First, we examine the three key technological pillars that support the advancement of agent capabilities: Memory, Planning, and Tool Use. We discuss how these technologies evolve from supporting simple tasks in their early forms to enabling complex autonomous systems and collective intelligence in more advanced forms. Then, we provide an overview of the application of industry agents in real-world domains such as digital engineering, scientific discovery, embodied intelligence, collaborative business execution, and complex system simulation. Additionally, this paper reviews the evaluation benchmarks and methods for both fundamental and specialized capabilities, identifying the challenges existing evaluation systems face regarding authenticity, safety, and industry specificity. Finally, we focus on the practical challenges faced by industry agents, exploring their capability boundaries, developmental potential, and governance issues in various scenarios, while providing insights into future directions. By combining technological evolution with industry practices, this review aims to clarify the current state and offer a clear roadmap and theoretical foundation for understanding and building the next generation of industry agents.
摘要：随着大型语言模型（LLM）的兴起，能够自主推理、规划和执行复杂任务的LLM代理已成为人工智能的前沿。然而，如何将总代理的研究转化为驱动行业变革的生产力仍然是一个重大挑战。针对这一问题，本文系统回顾了基于法学硕士的行业代理技术、应用和评估方法。它使用行业代理能力成熟度框架，概述了代理在行业应用中的演变，从“流程执行系统”到“自适应社会系统”。首先，我们研究支持智能体能力提升的三个关键技术支柱：内存、规划和工具使用。我们讨论这些技术如何从支持早期形式的简单任务发展到以更高级的形式支持复杂的自主系统和集体智能。然后，我们概述了行业代理在数字工程、科学发现、具体智能、协作业务执行和复杂系统仿真等现实领域中的应用。此外，本文回顾了基础能力和专业能力的评估基准和方法，明确了现有评估系统在真实性、安全性和行业特殊性方面面临的挑战。最后，我们关注行业主体面临的实际挑战，探讨其在各种场景下的能力边界、发展潜力和治理问题，同时洞察未来的发展方向。通过将技术演进与行业实践相结合，本次综述旨在阐明当前状态，并为理解和构建下一代行业智能体提供清晰的路线图和理论基础。

Title: Deep Self-Evolving Reasoning

Authors: Zihan Liu, Shun Zheng, Xumeng Wen, Yang Wang, Jiang Bian, Mao Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.17498
Pdf URL: https://arxiv.org/pdf/2510.17498
Copy Paste: [[2510.17498]] Deep Self-Evolving Reasoning(https://arxiv.org/abs/2510.17498)
Keywords: language model, chain-of-thought
Abstract: Long-form chain-of-thought reasoning has become a cornerstone of advanced reasoning in large language models. While recent verification-refinement frameworks have enabled proprietary models to solve Olympiad-level problems, their effectiveness hinges on strong, reliable verification and correction capabilities, which remain fragile in open-weight, smaller-scale models. This work demonstrates that even with weak verification and refinement capabilities on hard tasks, the reasoning limits of such models can be substantially extended through a probabilistic paradigm we call Deep Self-Evolving Reasoning (DSER). We conceptualize iterative reasoning as a Markov chain, where each step represents a stochastic transition in the solution space. The key insight is that convergence to a correct solution is guaranteed as long as the probability of improvement marginally exceeds that of degradation. By running multiple long-horizon, self-evolving processes in parallel, DSER amplifies these small positive tendencies, enabling the model to asymptotically approach correct answers. Empirically, we apply DSER to the DeepSeek-R1-0528-Qwen3-8B model. On the challenging AIME 2024-2025 benchmark, DSER solves 5 out of 9 previously unsolvable problems and boosts overall performance, enabling this compact model to surpass the single-turn accuracy of its 600B-parameter teacher through majority voting. Beyond its immediate utility for test-time scaling, the DSER framework serves to diagnose the fundamental limitations of current open-weight reasoners. By clearly delineating their shortcomings in self-verification, refinement, and stability, our findings establish a clear research agenda for developing next-generation models with powerful, intrinsic self-evolving capabilities.
摘要：长篇思维链推理已成为大型语言模型高级推理的基石。虽然最近的验证细化框架已经使专有模型能够解决奥林匹克级别的问题，但其有效性取决于强大、可靠的验证和校正能力，而这些能力在开放重量、较小规模的模型中仍然脆弱。这项工作表明，即使在困难任务上的验证和细化能力较弱，此类模型的推理限制也可以通过我们称为深度自进化推理（DSER）的概率范式得到大幅扩展。我们将迭代推理概念化为马尔可夫链，其中每一步都代表解决方案空间中的随机转变。关键的见解是，只要改进的概率略高于退化的概率，就可以保证收敛到正确的解决方案。通过并行运行多个长期的自我进化过程，DSER 放大了这些小的积极趋势，使模型能够渐近地接近正确答案。根据经验，我们将 DSER 应用于 DeepSeek-R1-0528-Qwen3-8B 模型。在具有挑战性的 AIME 2024-2025 基准测试中，DSER 解决了之前无法解决的 9 个问题中的 5 个，并提升了整体性能，使这款紧凑型模型能够通过多数投票超越其 600B 参数教师的单圈精度。除了测试时间扩展的直接实用性之外，DSER 框架还可以诊断当前开放权重推理机的基本局限性。通过清楚地描述它们在自我验证、完善和稳定性方面的缺点，我们的研究结果为开发具有强大的、内在的自我进化能力的下一代模型建立了明确的研究议程。
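
Illustrative sketch (not taken from the paper): the abstract models iterative reasoning as a Markov chain and argues that a small edge of improvement over degradation is amplified by running many chains and voting. The Python below uses a deliberately simplified two-state chain (answer correct or incorrect). The parameters `p_improve` and `p_degrade` and all function names are assumptions for illustration, and the paper's actual state space is likely richer.

```python
import random


def stationary_correct(p_improve, p_degrade):
    """Long-run probability of being in the 'correct' state of a two-state chain."""
    return p_improve / (p_improve + p_degrade)


def run_chain(p_improve, p_degrade, steps, rng):
    """Simulate one self-evolving process; returns True if it ends correct."""
    correct = False  # start from an incorrect solution
    for _ in range(steps):
        if correct:
            if rng.random() < p_degrade:
                correct = False
        elif rng.random() < p_improve:
            correct = True
    return correct


def majority_vote(p_improve, p_degrade, steps, n_chains, seed=0):
    """Run n_chains independent chains and report whether most end correct."""
    rng = random.Random(seed)
    votes = sum(run_chain(p_improve, p_degrade, steps, rng) for _ in range(n_chains))
    return votes * 2 > n_chains


# With p_improve only slightly above p_degrade, each chain is right about
# 60% of the time, yet the vote over many chains is almost always right.
share = stationary_correct(0.3, 0.2)
voted = majority_vote(0.3, 0.2, steps=50, n_chains=1001)
```

In this two-state model the stationary share of correct states exceeds one half exactly when `p_improve > p_degrade`, which is why voting across chains helps.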

Title: Lingua Custodi's participation at the WMT 2025 Terminology shared task

Authors: Jingshu Liu, Raheel Qader, Gaëtan Caillaut, Mariam Nakhlé
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.17504
Pdf URL: https://arxiv.org/pdf/2510.17504
Copy Paste: [[2510.17504]] Lingua Custodi's participation at the WMT 2025 Terminology shared task(https://arxiv.org/abs/2510.17504)
Keywords: language model
Abstract: While BERT is an effective method for learning monolingual sentence embeddings for semantic similarity and embedding-based transfer learning, BERT-based cross-lingual sentence embeddings have yet to be explored. We systematically investigate methods for learning multilingual sentence embeddings by combining the best methods for learning monolingual and cross-lingual representations including: masked language modeling (MLM), translation language modeling (TLM), dual encoder translation ranking, and additive margin softmax. We show that introducing a pre-trained multilingual language model dramatically reduces the amount of parallel training data required to achieve good performance by 80%. Composing the best of these methods produces a model that achieves 83.7% bi-text retrieval accuracy over 112 languages on Tatoeba, well above the 65.5% achieved by LASER, while still performing competitively on monolingual transfer learning benchmarks. Parallel data mined from CommonCrawl using our best model is shown to train competitive NMT models for en-zh and en-de. We publicly release our best multilingual sentence embedding model for 109+ languages at this https URL.
摘要：虽然 BERT 是学习单语言句子嵌入以实现语义相似性的有效方法，但基于迁移学习的嵌入 BERT 的跨语言句子嵌入仍有待探索。我们通过结合学习单语和跨语言表示的最佳方法，系统地研究学习多语言句子嵌入的方法，包括：掩码语言模型（MLM）、翻译语言模型（TLM）、双编码器翻译排名和加性边际softmax。我们证明，引入预训练的多语言语言模型可以将实现良好性能所需的并行训练数据量大幅减少 80%。综合这些方法中的最佳方法，生成的模型在 Tatoeba 上对 112 种语言实现了 83.7% 的双文本检索准确率，远高于 LASER 所达到的 65.5%，同时在单语迁移学习基准上仍然具有竞争力。使用我们的最佳模型从 CommonCrawl 挖掘的并行数据可以训练 en-zh 和 en-de 的竞争性 NMT 模型。我们在此 https URL 公开发布了适用于 109 多种语言的最佳多语言句子嵌入模型。

Title: Annotation-Efficient Universal Honesty Alignment

Authors: Shiyu Ni, Keping Bi, Jiafeng Guo, Minghao Tang, Jingtong Wu, Zengxin Han, Xueqi Cheng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.17509
Pdf URL: https://arxiv.org/pdf/2510.17509
Copy Paste: [[2510.17509]] Annotation-Efficient Universal Honesty Alignment(https://arxiv.org/abs/2510.17509)
Keywords: language model, llm
Abstract: Honesty alignment, the ability of large language models (LLMs) to recognize their knowledge boundaries and express calibrated confidence, is essential for trustworthy deployment. Existing methods either rely on training-free confidence estimation (e.g., token probabilities, self-consistency) or training-based calibration with correctness annotations. While effective, training-based calibration requires costly, large-scale labeling to achieve universal honesty alignment. To support annotation-efficient training, we introduce Elicitation-Then-Calibration (EliCal), a two-stage framework that first elicits internal confidence using inexpensive self-consistency supervision, then calibrates this confidence with a small set of correctness annotations. To support a large-scale study, we release HonestyBench, a benchmark covering ten free-form QA datasets with 560k training and 70k evaluation instances annotated with correctness and self-consistency signals. Experiments show that EliCal achieves near-optimal alignment with only 1k correctness annotations (0.18% of full supervision) and better alignment performance on unseen MMLU tasks than the calibration-only baseline, offering a scalable solution toward universal honesty alignment in LLMs.
摘要：诚实对齐——大型语言模型（LLM）识别其知识边界并表达校准信心的能力——对于可信部署至关重要。现有方法要么依赖于免训练的置信度估计（例如，令牌概率、自我一致性），要么依赖于带有正确性注释的基于训练的校准。虽然有效，但通过基于培训的校准实现普遍的诚实一致性需要昂贵的大规模标签。为了支持注释高效的训练，我们引入了启发然后校准（EliCal），这是一个两阶段框架，首先使用廉价的自我一致性监督来引发内部信心，然后用一小组正确性注释来校准这种信心。为了支持大规模研究，我们发布了 HonestyBench，这是一个基准测试，涵盖 10 个自由形式的 QA 数据集，其中包含 56 万个训练和 7 万个评估实例，并带有正确性和自我一致性信号注释。实验表明，EliCal 只需 1k 正确性注释（完全监督的 0.18%）即可实现近乎最佳的对齐，并且在未见的 MMLU 任务上比仅校准基线具有更好的对齐性能，为法学硕士中的通用诚实对齐提供了可扩展的解决方案。
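
Illustrative sketch (not taken from the paper): the abstract names self-consistency as the inexpensive supervision signal but does not say how it is computed. A common definition, assumed here, is the share of sampled answers that agree with the most frequent answer. The function name and the whitespace and case normalization are hypothetical choices.

```python
from collections import Counter


def self_consistency_confidence(samples):
    """Return (majority answer, fraction of samples agreeing with it)."""
    if not samples:
        raise ValueError("need at least one sampled answer")
    # Light normalization so trivially different strings count as the same answer.
    normalized = [s.strip().lower() for s in samples]
    answer, count = Counter(normalized).most_common(1)[0]
    return answer, count / len(normalized)


# Four sampled answers to the same question; three of them agree.
answer, confidence = self_consistency_confidence(["Paris", "paris ", "Lyon", "Paris"])
```

A score like this needs no correctness labels, which is what makes it cheap enough to serve as a first-stage signal before calibration on a small annotated set.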

Title: SimBench: Benchmarking the Ability of Large Language Models to Simulate Human Behaviors

Authors: Tiancheng Hu, Joachim Baumann, Lorenzo Lupo, Dirk Hovy, Nigel Collier, Paul Röttger
Subjects: cs.CL, cs.AI, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2510.17516
Pdf URL: https://arxiv.org/pdf/2510.17516
Copy Paste: [[2510.17516]] SimBench: Benchmarking the Ability of Large Language Models to Simulate Human Behaviors(https://arxiv.org/abs/2510.17516)
Keywords: language model, llm
Abstract: Large language model (LLM) simulations of human behavior have the potential to revolutionize the social and behavioral sciences, if and only if they faithfully reflect real human behaviors. Current evaluations are fragmented, based on bespoke tasks and metrics, creating a patchwork of incomparable results. To address this, we introduce SimBench, the first large-scale, standardized benchmark for a robust, reproducible science of LLM simulation. By unifying 20 diverse datasets covering tasks from moral decision-making to economic choice across a large global participant pool, SimBench provides the necessary foundation to ask fundamental questions about when, how, and why LLM simulations succeed or fail. We show that, while even the best LLMs today have limited simulation ability (score: 40.80/100), performance scales log-linearly with model size. Simulation performance is not improved by increased inference-time compute. We demonstrate an alignment-simulation trade-off: instruction-tuning improves performance on low-entropy (consensus) questions but degrades it on high-entropy (diverse) ones. Models particularly struggle when simulating specific demographic groups. Finally, we demonstrate that simulation ability correlates most strongly with deep, knowledge-intensive reasoning (MMLU-Pro, r=0.939). By making progress measurable, we aim to accelerate the development of more faithful LLM simulators.
摘要：当且仅当它们忠实地反映真实的人类行为时，人类行为的大语言模型（LLM）模拟才有可能彻底改变社会和行为科学。当前的评估是碎片化的，基于定制的任务和指标，产生了无与伦比的结果。为了解决这个问题，我们推出了 SimBench，这是第一个大规模、标准化的基准，用于建立稳健、可重复的法学硕士模拟科学。通过统一 20 个不同的数据集，涵盖全球大型参与者池中从道德决策到经济选择等任务，SimBench 为提出有关 LLM 模拟何时、如何以及为何成功或失败的基本问题提供了必要的基础。我们表明，尽管当今最好的法学硕士的模拟能力也有限（得分：40.80/100），但性能随模型大小呈对数线性扩展。增加推理时间计算并不会提高模拟性能。我们演示了对齐模拟权衡：指令调整提高了低熵（共识）问题的性能，但降低了高熵（多样化）问题的性能。模型在模拟特定人口群体时尤其困难。最后，我们证明模拟能力与深度知识密集型推理的相关性最强（MMLU-Pro，r=0.939）。通过让进展变得可衡量，我们的目标是加速开发更忠实的法学硕士模拟器。

Title: OncoReason: Structuring Clinical Reasoning in LLMs for Robust and Interpretable Survival Prediction

Authors: Raghu Vamshi Hemadri, Geetha Krishna Guruju, Kristi Topollai, Anna Ewa Choromanska
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2510.17532
Pdf URL: https://arxiv.org/pdf/2510.17532
Copy Paste: [[2510.17532]] OncoReason: Structuring Clinical Reasoning in LLMs for Robust and Interpretable Survival Prediction(https://arxiv.org/abs/2510.17532)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Predicting cancer treatment outcomes requires models that are both accurate and interpretable, particularly in the presence of heterogeneous clinical data. While large language models (LLMs) have shown strong performance in biomedical NLP, they often lack structured reasoning capabilities critical for high-stakes decision support. We present a unified, multi-task learning framework that aligns autoregressive LLMs with clinical reasoning for outcome prediction on the MSK-CHORD dataset. Our models are trained to jointly perform binary survival classification, continuous survival time regression, and natural language rationale generation. We evaluate three alignment strategies: (1) standard supervised fine-tuning (SFT), (2) SFT with Chain-of-Thought (CoT) prompting to elicit step-by-step reasoning, and (3) Group Relative Policy Optimization (GRPO), a reinforcement learning method that aligns model outputs to expert-derived reasoning trajectories. Experiments with LLaMa3-8B and Med42-8B backbones demonstrate that CoT prompting improves F1 by +6.0 and reduces MAE by 12%, while GRPO achieves state-of-the-art interpretability and predictive performance across BLEU, ROUGE, and BERTScore. We further show that existing biomedical LLMs often fail to produce valid reasoning traces due to architectural constraints. Our findings underscore the importance of reasoning-aware alignment in multi-task clinical modeling and set a new benchmark for interpretable, trustworthy LLMs in precision oncology.
摘要：预测癌症治疗结果需要准确且可解释的模型，特别是在存在异质临床数据的情况下。虽然大型语言模型 (LLM) 在生物医学 NLP 方面表现出了强大的性能，但它们通常缺乏对于高风险决策支持至关重要的结构化推理能力。我们提出了一个统一的多任务学习框架，将自回归法学硕士与临床推理相结合，以在 MSK-CHORD 数据集上进行结果预测。我们的模型经过训练，可以联合执行二元生存分类、连续生存时间回归和自然语言基本原理生成。我们评估了三种对齐策略：(1) 标准监督微调 (SFT)、(2) 具有思想链 (CoT) 的 SFT 提示引发逐步推理，以及 (3) 组相对策略优化 (GRPO)，这是一种强化学习方法，可将模型输出与专家得出的推理轨迹对齐。 LLaMa3-8B 和 Med42-8B 主干网的实验表明，CoT 提示将 F1 提高了 +6.0，并将 MAE 降低了 12%，而 GRPO 在 BLEU、ROUGE 和 BERTScore 上实现了最先进的可解释性和预测性能。我们进一步表明，由于架构限制，现有的生物医学法学硕士通常无法产生有效的推理轨迹。我们的研究结果强调了多任务临床建模中推理意识一致性的重要性，并为精准肿瘤学领域可解释、值得信赖的法学硕士树立了新的基准。

Title: When Annotators Disagree, Topology Explains: Mapper, a Topological Tool for Exploring Text Embedding Geometry and Ambiguity

Authors: Nisrine Rair, Alban Goupil, Valeriu Vrabie, Emmanuel Chochoy
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.17548
Pdf URL: https://arxiv.org/pdf/2510.17548
Copy Paste: [[2510.17548]] When Annotators Disagree, Topology Explains: Mapper, a Topological Tool for Exploring Text Embedding Geometry and Ambiguity(https://arxiv.org/abs/2510.17548)
Keywords: language model
Abstract: Language models are often evaluated with scalar metrics like accuracy, but such measures fail to capture how models internally represent ambiguity, especially when human annotators disagree. We propose a topological perspective to analyze how fine-tuned models encode ambiguity and, more generally, instances. Applied to RoBERTa-Large on the MD-Offense dataset, Mapper, a tool from topological data analysis, reveals that fine-tuning restructures embedding space into modular, non-convex regions aligned with model predictions, even for highly ambiguous cases. Over $98\%$ of connected components exhibit $\geq 90\%$ prediction purity, yet alignment with ground-truth labels drops in ambiguous data, surfacing a hidden tension between structural confidence and label uncertainty. Unlike traditional tools such as PCA or UMAP, Mapper captures this geometry directly, uncovering decision regions, boundary collapses, and overconfident clusters. Our findings position Mapper as a powerful diagnostic tool for understanding how models resolve ambiguity. Beyond visualization, it also enables topological metrics that may inform proactive modeling strategies in subjective NLP tasks.
摘要：语言模型通常使用准确性等标量指标进行评估，但此类指标无法捕获模型内部如何表示歧义，特别是当人类注释者不同意时。我们提出了一个拓扑视角来分析微调模型如何编码歧义和更普遍的实例。 Mapper（一种来自拓扑数据分析的工具）应用于 MD-Offense 数据集上的 RoBERTa-Large，揭示了微调将嵌入空间重组为与模型预测一致的模块化非凸区域，即使对于高度模糊的情况也是如此。超过 $98\%$ 的连接组件表现出 $\geq 90\%$ 的预测纯度，但与真实标签的对齐会导致模糊数据下降，从而暴露出结构置信度和标签不确定性之间隐藏的紧张关系。与 PCA 或 UMAP 等传统工具不同，Mapper 捕获此几何图形，直接揭示决策区域、边界崩溃和过度自信的集群。我们的研究结果使 Mapper 成为了解模型如何解决歧义的强大诊断工具。除了可视化之外，它还支持拓扑指标，为主观 NLP 任务中的主动建模策略提供信息。
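
Illustrative sketch (not taken from the paper): the abstract reports the share of connected components whose prediction purity is at least 90%. Assuming purity means the fraction of a component's members that share its most common predicted label, the statistic could be computed as below. The function names and the toy components are hypothetical.

```python
from collections import Counter


def purity(labels):
    """Fraction of items in one component that carry its majority label."""
    return Counter(labels).most_common(1)[0][1] / len(labels)


def share_of_pure_components(components, threshold=0.9):
    """Fraction of components whose purity reaches the threshold."""
    pure = sum(1 for labels in components if purity(labels) >= threshold)
    return pure / len(components)


# Three toy components given as lists of predicted labels.
components = [[1] * 10, [0] * 9 + [1], [0, 1, 0, 1]]
share = share_of_pure_components(components)  # purities are 1.0, 0.9 and 0.5
```

Running the same function on ground-truth labels instead of predictions would give the second quantity the abstract contrasts against.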

Title: Language Confusion Gate: Language-Aware Decoding Through Model Self-Distillation

Authors: Collin Zhang, Fei Huang, Chenhan Yuan, Junyang Lin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.17555
Pdf URL: https://arxiv.org/pdf/2510.17555
Copy Paste: [[2510.17555]] Language Confusion Gate: Language-Aware Decoding Through Model Self-Distillation(https://arxiv.org/abs/2510.17555)
Keywords: language model, gpt, llm
Abstract: Large language models (LLMs) often experience language confusion, which is the unintended mixing of languages during text generation. Current solutions to this problem either necessitate model retraining or cannot differentiate between harmful confusion and acceptable code-switching. This paper introduces the Language Confusion Gate (LCG), a lightweight, plug-in solution that filters tokens during decoding without altering the base LLM. The LCG is trained using norm-adjusted self-distillation to predict appropriate language families and apply masking only when needed. Our method is based on the findings that language confusion is infrequent, correct-language tokens are usually among the top predictions, and output token embedding norms are larger for high-resource languages, which biases sampling. When evaluated across various models, including Qwen3, GPT-OSS, Gemma3, and Llama3.1, LCG decreases language confusion significantly, often by an order of magnitude, without negatively impacting task performance. Code is available at this https URL.
摘要：大型语言模型 (LLM) 经常会遇到语言混乱，这是文本生成过程中无意的语言混合。当前解决此问题的方法要么需要模型重新训练，要么无法区分有害的混淆和可接受的代码转换。本文介绍了语言混淆门 (LCG)，这是一种轻量级插件解决方案，可在解码期间过滤标记，而无需更改基础 LLM。 LCG 使用规范调整的自蒸馏进行训练，以预测适当的语系，并仅在需要时应用掩蔽。我们的方法基于这样的发现：语言混淆很少发生，正确的语言标记通常是最重要的预测之一，并且高资源语言的输出标记嵌入规范更大，这会导致采样偏差。当在包括 Qwen3、GPT-OSS、Gemma3、Llama3.1 在内的各种模型上进行评估时，LCG 显着减少了语言混乱，通常是一个数量级，而且不会对任务性能产生负面影响。代码可从此 https URL 获取。
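
Illustrative sketch (not taken from the paper): the abstract describes filtering tokens during decoding according to predicted language families, with masking applied only when needed. The Python below shows that general idea on a toy vocabulary. The rule used here for "needed" (the top-scoring token falls outside the allowed families) is an assumption, as are the function name `gate_logits` and the token-to-family table. The real gate is a trained predictor, which this sketch does not model.

```python
import math


def gate_logits(logits, token_family, allowed_families):
    """Mask tokens from disallowed language families, but only when needed.

    logits: {token: score}; token_family: {token: family name}.
    """
    top_token = max(logits, key=logits.get)
    if token_family[top_token] in allowed_families:
        return dict(logits)  # top choice is fine, so leave the distribution alone
    return {
        tok: (score if token_family[tok] in allowed_families else -math.inf)
        for tok, score in logits.items()
    }


# Toy example: the highest-scoring token is in a family the gate disallows.
logits = {"hello": 2.0, "你好": 2.5, "world": 1.0}
families = {"hello": "latin", "你好": "cjk", "world": "latin"}
gated = gate_logits(logits, families, allowed_families={"latin"})
best = max(gated, key=gated.get)
```

Because the base model's scores are only masked, never recomputed, a gate of this shape can sit on top of a frozen LLM.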

Title: HGAdapter: Hypergraph-based Adapters in Language Models for Code Summarization and Clone Detection

Authors: Guang Yang, Yujie Zhu
Subjects: cs.CL, cs.AI, cs.LG, cs.SE
Abstract URL: https://arxiv.org/abs/2510.17591
Pdf URL: https://arxiv.org/pdf/2510.17591
Copy Paste: [[2510.17591]] HGAdapter: Hypergraph-based Adapters in Language Models for Code Summarization and Clone Detection(https://arxiv.org/abs/2510.17591)
Keywords: language model
Abstract: Pre-trained language models (PLMs) are increasingly being applied to code-related tasks. Although PLMs have achieved good results, they do not take into account potential high-order data correlations within the code. We propose three types of high-order correlations in code tokens, i.e. abstract syntax tree family correlation, lexical correlation, and line correlation. We design a token and hyperedge generator to capture these high-order data correlations. We improve the architecture of hypergraph neural networks and combine it with adapter tuning to propose a novel hypergraph-based adapter (HGAdapter) to fine-tune PLMs. HGAdapter can encode high-order data correlations and can be inserted into various PLMs to enhance performance. Experiments were conducted on several public datasets, covering code summarization in six languages and code clone detection. Our methods improved the performance of PLMs on these datasets to varying degrees. Experimental results validate that introducing high-order data correlations contributes to improved effectiveness.
摘要：预训练语言模型 (PLM) 越来越多地应用于与代码相关的任务。尽管 PLM 取得了良好的效果，但它们没有考虑代码内潜在的高阶数据相关性。我们提出了代码标记中的三种高阶相关性，即抽象语法树族相关性、词汇相关性和行相关性。我们设计了一个标记和超边生成器来捕获这些高阶数据相关性。我们改进了超图神经网络的架构，并将其与适配器调整相结合，提出了一种新型的基于超图的适配器（HGAdapter）来微调 PLM。 HGAdapter可以对高阶数据相关性进行编码，并允许插入到各种PLM中以增强性能。在多个公共数据集上进行了实验，包括六种语言的代码摘要和代码克隆检测任务。我们的方法不同程度地提高了数据集中 PLM 的性能。实验结果验证了高阶数据相关性的引入有助于提高有效性。
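
Illustrative sketch (not taken from the paper): the abstract lists line correlation and lexical correlation among the high-order relations captured as hyperedges, without defining them. Assuming that a line hyperedge groups the tokens on one source line and a lexical hyperedge groups repeated occurrences of the same token, a generator could look like the Python below. The whitespace tokenization and the name `build_hyperedges` are hypothetical, and the AST family correlation is omitted.

```python
from collections import defaultdict


def build_hyperedges(code):
    """Return (line_edges, lexical_edges) as lists of token-index lists."""
    line_edges = []
    lexical = defaultdict(list)
    index = 0
    for line in code.splitlines():
        members = []
        for tok in line.split():  # naive whitespace tokenization (assumption)
            members.append(index)
            lexical[tok].append(index)
            index += 1
        if len(members) > 1:  # a hyperedge needs at least two tokens
            line_edges.append(members)
    lexical_edges = [ids for ids in lexical.values() if len(ids) > 1]
    return line_edges, lexical_edges


# Tokens are indexed in reading order: x=0, '='=1, 1=2, y=3, '='=4, x=5, +=6, 1=7.
line_edges, lexical_edges = build_hyperedges("x = 1\ny = x + 1")
```

Unlike ordinary graph edges, each hyperedge here can connect any number of tokens, which is what lets a hypergraph network see a whole line or all uses of a name at once.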

Title: LawChain: Modeling Legal Reasoning Chains for Chinese Tort Case Analysis

Authors: Huiyuan Xie, Chenyang Li, Huining Zhu, Chubin Zhang, Yuxiao Ye, Zhenghao Liu, Zhiyuan Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.17602
Pdf URL: https://arxiv.org/pdf/2510.17602
Copy Paste: [[2510.17602]] LawChain: Modeling Legal Reasoning Chains for Chinese Tort Case Analysis(https://arxiv.org/abs/2510.17602)
Keywords: language model, prompt
Abstract: Legal reasoning is a fundamental component of legal analysis and decision-making. Existing computational approaches to legal reasoning predominantly rely on generic reasoning frameworks such as syllogism and IRAC, which do not comprehensively examine the nuanced processes that underpin legal reasoning. Moreover, current research has largely focused on criminal cases, with insufficient modeling for civil cases. In this work, we present a novel framework for explicitly modeling legal reasoning in the analysis of Chinese tort-related civil cases. We first operationalize the legal reasoning processes used in tort analysis into the LawChain framework. LawChain is a three-module reasoning framework, with each module consisting of multiple finer-grained sub-steps. Informed by the LawChain framework, we introduce the task of tort legal reasoning and construct an evaluation benchmark, LawChain$_{eval}$, to systematically assess the critical steps within analytical reasoning chains for tort analysis. Leveraging this benchmark, we evaluate state-of-the-art large language models for their legal reasoning ability in civil tort contexts. Our results indicate that current models still fall short in accurately handling crucial elements of tort legal reasoning. Furthermore, we introduce several baseline approaches that explicitly incorporate LawChain-style reasoning through prompting or post-training. We conduct further experiments on additional legal analysis tasks, such as Legal Named-Entity Recognition and Criminal Damages Calculation, to verify the generalizability of these baselines. The proposed baseline approaches achieve significant improvements in tort-related legal reasoning and generalize well to related legal analysis tasks, thus demonstrating the value of explicitly modeling legal reasoning chains to enhance the reasoning capabilities of language models.
摘要：法律推理是法律分析和决策的基本组成部分。现有的法律推理计算方法主要依赖于三段论和 IRAC 等通用推理框架，这些框架没有全面检查支撑法律推理的细微差别过程。此外，目前的研究主要集中在刑事案件，对民事案件的建模还不够。在这项工作中，我们提出了一个新颖的框架，用于在分析中国侵权相关民事案件时明确建模法律推理。我们首先将侵权分析中使用的法律推理过程运用到 LawChain 框架中。 LawChain是一个三模块推理框架，每个模块由多个更细粒度的子步骤组成。在LawChain框架的指导下，我们引入了侵权法律推理的任务，并构建了评估基准LawChain$_{eval}$，以系统地评估侵权分析的分析推理链中的关键步骤。利用这一基准，我们评估了最先进的大型语言模型在民事侵权情况下的法律推理能力。我们的结果表明，当前模型在准确处理侵权法律推理的关键要素方面仍然存在不足。此外，我们引入了几种基线方法，通过提示或训练后明确地纳入 LawChain 式推理。我们对其他法律分析任务（例如法律命名实体识别和刑事损害计算）进行了进一步的实验，以验证这些基线的普遍性。所提出的基线方法在与侵权相关的法律推理方面取得了显着的改进，并且可以很好地推广到相关的法律分析任务，从而证明了明确建模法律推理链以增强语言模型的推理能力的价值。

Title: Forget to Know, Remember to Use: Context-Aware Unlearning for Large Language Models

Authors: Yuefeng Peng, Parnian Afshar, Megan Ganji, Thomas Butler, Amir Houmansadr, Mingxian Wang, Dezhi Hong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.17620
Pdf URL: https://arxiv.org/pdf/2510.17620
Copy Paste: [[2510.17620]] Forget to Know, Remember to Use: Context-Aware Unlearning for Large Language Models(https://arxiv.org/abs/2510.17620)
Keywords: language model, prompt
Abstract: Large language models may encode sensitive information or outdated knowledge that needs to be removed, to ensure responsible and compliant model responses. Unlearning has emerged as an efficient alternative to full retraining, aiming to remove specific knowledge while preserving overall model utility. Existing evaluations of unlearning methods focus on (1) the extent of forgetting of the target knowledge (forget set) and (2) maintaining performance on the retain set (i.e., utility). However, these evaluations overlook an important usability aspect: users may still want the model to leverage the removed information if it is re-introduced in the prompt. In a systematic evaluation of six state-of-the-art unlearning methods, we find that they consistently impair such contextual utility. To address this, we augment unlearning objectives with a plug-in term that preserves the model's ability to use forgotten knowledge when it is present in context. Extensive experiments demonstrate that our approach restores contextual utility to near original levels while still maintaining effective forgetting and retain-set utility.
摘要：大型语言模型可能会对敏感信息或需要删除的过时知识进行编码，以确保负责任且合规的模型响应。忘却已成为完全再训练的有效替代方案，旨在消除特定知识，同时保留模型的整体效用。现有的遗忘方法评估主要集中在（1）目标知识的遗忘程度（遗忘集）和（2）保持保留集的性能（即效用）。然而，这些评估忽略了一个重要的可用性方面：如果在提示中重新引入，用户可能仍然希望模型利用已删除的信息。在对六种最先进的忘却方法的系统评估中，我们发现它们始终损害这种情境效用。为了解决这个问题，我们使用一个插件术语来增强遗忘目标，该插件术语保留了模型在上下文中存在遗忘知识时使用该知识的能力。大量的实验表明，我们的方法将上下文效用恢复到接近原始水平，同时仍然保持有效的遗忘和保留设置效用。

Title: Qomhra: A Bilingual Irish-English Large Language Model

Authors: Joseph McInerney
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.17652
Pdf URL: https://arxiv.org/pdf/2510.17652
Copy Paste: [[2510.17652]] Qomhra: A Bilingual Irish-English Large Language Model(https://arxiv.org/abs/2510.17652)
Keywords: language model, llm, chat
Abstract: This paper introduces Qomhrá, a bilingual Irish-English large language model (LLM), developed under low-resource constraints, presenting a complete pipeline spanning bilingual continued pre-training, instruction tuning, and alignment from human preferences. Newly accessible Irish corpora and English text are mixed and curated to improve Irish performance while preserving English ability. Six closed-weight LLMs are judged for their Irish text generation by a native speaker, a learner and other LLMs. Google's Gemini-2.5-Pro is ranked the highest and is subsequently used to synthesise instruction tuning and human preference datasets. Two datasets are contributed leveraging Gemini-2.5-Pro: a 30K Irish-English parallel instruction tuning dataset and a 1K human preference dataset, generating accepted and rejected responses that show near-perfect alignment with a native Irish speaker. Qomhrá is comprehensively evaluated across benchmarks testing translation, gender understanding, topic identification and world knowledge, with gains of up to 29% in Irish and 44% in English. Qomhrá also undergoes instruction tuning and demonstrates clear progress in instruction following, crucial for chatbot functionality.
摘要：本文介绍了 Qomhrá，一种爱尔兰-英语双语大语言模型 (LLM)，它是在低资源限制下开发的，提供了一个完整的管道，涵盖双语持续预训练、指令调整和根据人类偏好进行调整。新近使用的爱尔兰语语料库和英语文本被混合和策划，以提高爱尔兰语的表现，同时保留英语能力。 6 名封闭权重法学硕士的爱尔兰语文本生成由一名母语人士、一名学习者和其他法学硕士进行评判。 Google 的 Gemini-2.5-Pro 排名最高，随后用于综合指令调整和人类偏好数据集。利用 Gemini-2.5-Pro 贡献两个数据集：一个 30K 爱尔兰英语并行指令调整数据集和一个 1K 人类偏好数据集，生成接受和拒绝的响应，显示与爱尔兰母语者近乎完美的一致性。 Qomhrá 在翻译、性别理解、主题识别和世界知识的基准测试中进行了全面评估，爱尔兰语提高了 29%，英语提高了 44%。 Qomhrá 还进行了指令调整，并在指令遵循方面表现出明显的进步，这对于聊天机器人的功能至关重要。

Title: Towards Mining Effective Pedagogical Strategies from Learner-LLM Educational Dialogues

Authors: Liqun He, Manolis Mavrikis, Mutlu Cukurova
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.17698
Pdf URL: https://arxiv.org/pdf/2510.17698
Copy Paste: [[2510.17698]] Towards Mining Effective Pedagogical Strategies from Learner-LLM Educational Dialogues(https://arxiv.org/abs/2510.17698)
Keywords: language model, llm
Abstract: Dialogue plays a crucial role in educational settings, yet existing evaluation methods for educational applications of large language models (LLMs) primarily focus on technical performance or learning outcomes, often neglecting learner-LLM interactions. To narrow this gap, this AIED Doctoral Consortium paper presents an ongoing study employing a dialogue analysis approach to identify effective pedagogical strategies from learner-LLM dialogues. The proposed approach involves dialogue data collection, dialogue act (DA) annotation, DA pattern mining, and predictive model building. Early insights are outlined as an initial step toward future research. The work underscores the need to evaluate LLM-based educational applications by focusing on dialogue dynamics and pedagogical strategies.
摘要：对话在教育环境中发挥着至关重要的作用，但现有的大语言模型 (LLM) 教育应用的评估方法主要关注技术表现或学习成果，往往忽视了对学习者与 LLM 互动的关注。为了缩小这一差距，AIED 博士联盟的这篇论文提出了一项正在进行的研究，该研究采用对话分析方法从学习者与法学硕士的对话中确定有效的教学策略。所提出的方法涉及对话数据收集、对话行为 (DA) 注释、DA 模式挖掘和预测模型构建。早期的见解被概述为未来研究的第一步。这项工作强调需要通过关注对话动态和教学策略来评估基于法学硕士的教育应用。

Title: QueST: Incentivizing LLMs to Generate Difficult Problems

Authors: Hanxu Hu, Xingxing Zhang, Jannis Vamvas, Rico Sennrich, Furu Wei
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.17715
Pdf URL: https://arxiv.org/pdf/2510.17715
Copy Paste: [[2510.17715]] QueST: Incentivizing LLMs to Generate Difficult Problems(https://arxiv.org/abs/2510.17715)
Keywords: language model, gpt, llm, chain-of-thought
Abstract: Large Language Models have achieved strong performance on reasoning tasks, solving competition-level coding and math problems. However, their scalability is limited by human-labeled datasets and the lack of large-scale, challenging coding problem training data. Existing competitive coding datasets contain only thousands to tens of thousands of problems. Previous synthetic data generation methods rely on either augmenting existing instruction datasets or selecting challenging problems from human-labeled data. In this paper, we propose QueST, a novel framework which combines difficulty-aware graph sampling and difficulty-aware rejection fine-tuning that directly optimizes specialized generators to create challenging coding problems. Our trained generators demonstrate superior capability compared to even GPT-4o at creating challenging problems that benefit downstream performance. We leverage QueST to generate large-scale synthetic coding problems, which we then use to distill from strong teacher models with long chain-of-thought or to conduct reinforcement learning for smaller models, proving effective in both scenarios. Our distillation experiments demonstrate significant performance gains. Specifically, after fine-tuning Qwen3-8B-base on 100K difficult problems generated by QueST, we surpass the performance of the original Qwen3-8B on LiveCodeBench. With an additional 112K examples (i.e., 28K human-written problems paired with multiple synthetic solutions), our 8B model matches the performance of the much larger DeepSeek-R1-671B. These findings indicate that generating complex problems via QueST offers an effective and scalable approach to advancing the frontiers of competitive coding and reasoning for large language models.
摘要：大型语言模型在推理任务、解决竞赛级编码和数学问题方面取得了出色的表现。然而，它们的可扩展性受到人工标记数据集和缺乏大规模、具有挑战性的编码问题训练数据的限制。现有的竞争性编码数据集仅包含数千到数万个问题。以前的合成数据生成方法依赖于增强现有的指令数据集或从人类标记的数据中选择具有挑战性的问题。在本文中，我们提出了 QueST，这是一种新颖的框架，它结合了难度感知图采样和难度感知拒绝微调，可直接优化专用生成器以创建具有挑战性的编码问题。与 GPT-4o 相比，我们训练有素的生成器在创建有利于下游性能的挑战性问题方面表现出卓越的能力。我们利用 QueST 生成大规模的合成编码问题，然后使用这些问题从具有长思想链的强大教师模型中提取出来，或者对较小的模型进行强化学习，在这两种情况下都证明是有效的。我们的蒸馏实验证明了显着的性能提升。具体来说，在对 QueST 生成的 100K 难题进行 Qwen3-8B-base 微调后，我们在 LiveCodeBench 上超越了原始 Qwen3-8B 的性能。通过额外的 112K 个示例（即 28K 个人工编写的问题与多个合成解决方案配对），我们的 8B 模型与更大的 DeepSeek-R1-671B 的性能相匹配。这些发现表明，通过 QueST 生成复杂问题提供了一种有效且可扩展的方法，可以推进大型语言模型的竞争性编码和推理的前沿。

Title: PANER: A Paraphrase-Augmented Framework for Low-Resource Named Entity Recognition

Authors: Nanda Kumar Rengarajan, Jun Yan, Chun Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.17720
Pdf URL: https://arxiv.org/pdf/2510.17720
Copy Paste: [[2510.17720]] PANER: A Paraphrase-Augmented Framework for Low-Resource Named Entity Recognition(https://arxiv.org/abs/2510.17720)
Keywords: llm
Abstract: Named Entity Recognition (NER) is a critical task that requires substantial annotated data, making it challenging in low-resource scenarios where label acquisition is expensive. While zero-shot and instruction-tuned approaches have made progress, they often fail to generalize to domain-specific entities and do not effectively utilize limited available data. We present a lightweight few-shot NER framework that addresses these challenges through two key innovations: (1) a new instruction tuning template with a simplified output format that combines principles from prior IT approaches to leverage the large context window of recent state-of-the-art LLMs; and (2) a strategic data augmentation technique that preserves entity information while paraphrasing the surrounding context, thereby expanding our training data without compromising semantic relationships. Experiments on benchmark datasets show that our method achieves performance comparable to state-of-the-art models on few-shot and zero-shot tasks, with our few-shot approach attaining an average F1 score of 80.1 on the CrossNER datasets. Models trained with our paraphrasing approach show consistent improvements in F1 scores of up to 17 points over baseline versions, offering a promising solution for groups with limited NER training data and compute power.
摘要：命名实体识别 (NER) 是一项关键任务，需要大量带注释的数据，这使得它在标签获取成本高昂的资源匮乏场景中具有挑战性。虽然零样本和指令调整方法取得了进展，但它们通常无法推广到特定领域的实体，并且无法有效利用有限的可用数据。我们提出了一个轻量级的小样本 NER 框架，通过两个关键创新来应对这些挑战：(1) 具有简化输出格式的新指令调整模板，该模板结合了先前 IT 方法的原理，以利用最近最先进的 LLM 的大上下文窗口；（2）引入一种战略数据增强技术，该技术可以在解释周围上下文的同时保留实体信息，从而在不损害语义关系的情况下扩展我们的训练数据。基准数据集上的实验表明，我们的方法在少样本和零样本任务上实现了与最先进模型相当的性能，我们的少样本方法在 CrossNER 数据集上获得了 80.1 的平均 F1 分数。使用我们的释义方法训练的模型显示，F1 分数比基线版本持续提高了 17 分，为 NER 训练数据和计算能力有限的群体提供了一个有前途的解决方案。
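
The entity-preserving paraphrase augmentation described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `augment_preserving_entities`, the `[ENTi]` placeholder scheme, and the toy paraphraser are all assumptions made for illustration. In practice an LLM would likely stand in for `toy_paraphrase`.

```python
from typing import Callable, List, Tuple


def augment_preserving_entities(
    text: str,
    entities: List[Tuple[str, str]],  # (surface form, label); hypothetical format
    paraphrase: Callable[[str], str],
) -> str:
    """Paraphrase the context of `text` while keeping entity strings intact."""
    # Mask each entity with a placeholder so the paraphraser cannot alter it.
    masked = text
    for i, (surface, _label) in enumerate(entities):
        masked = masked.replace(surface, f"[ENT{i}]", 1)

    rewritten = paraphrase(masked)

    # Restore entities; reject the sample if a placeholder was lost.
    for i, (surface, _label) in enumerate(entities):
        token = f"[ENT{i}]"
        if token not in rewritten:
            raise ValueError(f"paraphrase dropped entity placeholder {token}")
        rewritten = rewritten.replace(token, surface)
    return rewritten


def toy_paraphrase(masked: str) -> str:
    # Stand-in for a real paraphrasing model (assumption for this sketch).
    return masked.replace("was born in", "is a native of")


example = augment_preserving_entities(
    "Marie Curie was born in Warsaw.",
    [("Marie Curie", "person"), ("Warsaw", "location")],
    toy_paraphrase,
)
print(example)
```

Masking before paraphrasing keeps the entity labels valid for the augmented sentence, since the entity surface forms are restored unchanged. This simple version assumes entity strings do not overlap one another.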

Title: AcademicEval: Live Long-Context LLM Benchmark

Authors: Haozhen Zhang, Tao Feng, Pengrui Han, Jiaxuan You
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.17725
Pdf URL: https://arxiv.org/pdf/2510.17725
Copy Paste: [[2510.17725]] AcademicEval: Live Long-Context LLM Benchmark(https://arxiv.org/abs/2510.17725)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have recently achieved remarkable performance in long-context understanding. However, current long-context LLM benchmarks are limited by rigid context length, labor-intensive annotation, and the pressing challenge of label leakage during LLM training. Therefore, we propose AcademicEval, a live benchmark for evaluating LLMs over long-context generation tasks. AcademicEval adopts papers on arXiv to introduce several academic writing tasks with long-context inputs, i.e., Title, Abstract, Introduction, and Related Work, which cover a wide range of abstraction levels and require no manual labeling. Moreover, AcademicEval integrates high-quality, expert-curated few-shot demonstrations from a collected co-author graph to enable flexible context length. Notably, AcademicEval features an efficient live evaluation, ensuring no label leakage. We conduct a holistic evaluation on AcademicEval, and the results illustrate that LLMs perform poorly on tasks with hierarchical abstraction levels and tend to struggle with long few-shot demonstrations, highlighting the challenge of our benchmark. Through experimental analysis, we also reveal some insights for enhancing LLMs' long-context modeling capabilities. Code is available at this https URL
摘要：大型语言模型（LLM）最近在长上下文理解方面取得了显着的性能。然而，当前的长上下文 LLM 基准受到严格的上下文长度、劳动密集型注释以及 LLM 培训期间标签泄漏问题的紧迫挑战的限制。因此，我们提出 \textsc{AcademicEval}，一个用于在长上下文生成任务上评估 LLM 的实时基准。 \textsc{AcademicEval} 采用 arXiv 上的论文介绍了几个具有长上下文输入的学术写作任务，\textit{i.e.}、\textsc{Title}、\textsc{Abstract}、\textsc{Introduction} 和 \textsc{Related Work}，它们涵盖了广泛的抽象级别，不需要手动标记。此外，\textsc{AcademicEval} 集成了来自收集的合著者图表的高质量且专家策划的少量演示，以实现灵活的上下文长度。特别是， \textsc{AcademicEval} 具有高效的实时评估功能，确保没有标签泄漏。我们对 \textsc{AcademicEval} 进行了整体评估，结果表明法学硕士在具有分层抽象级别的任务上表现不佳，并且往往难以进行长时间的几次演示，这凸显了我们基准测试的挑战。通过实验分析，我们还揭示了一些增强法学硕士长上下文建模能力的见解。代码可在此 https URL 获取

Title: Train for Truth, Keep the Skills: Binary Retrieval-Augmented Reward Mitigates Hallucinations

Authors: Tong Chen, Akari Asai, Luke Zettlemoyer, Hannaneh Hajishirzi, Faeze Brahman
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2510.17733
Pdf URL: https://arxiv.org/pdf/2510.17733
Copy Paste: [[2510.17733]] Train for Truth, Keep the Skills: Binary Retrieval-Augmented Reward Mitigates Hallucinations(https://arxiv.org/abs/2510.17733)
Keywords: language model, hallucination
Abstract: Language models often generate factually incorrect information unsupported by their training data, a phenomenon known as extrinsic hallucination. Existing mitigation approaches often degrade performance on open-ended generation and downstream tasks, limiting their practical utility. We propose an online reinforcement learning method using a novel binary retrieval-augmented reward (RAR) to address this tradeoff. Unlike continuous reward schemes, our approach assigns a reward of one only when the model's output is entirely factually correct, and zero otherwise. We evaluate our method on Qwen3 reasoning models across diverse tasks. For open-ended generation, binary RAR achieves a 39.3% reduction in hallucination rates, substantially outperforming both supervised training and continuous-reward RL baselines. In short-form question answering, the model learns calibrated abstention, strategically outputting "I don't know" when faced with insufficient parametric knowledge. This yields 44.4% and 21.7% fewer incorrect answers on PopQA and GPQA, respectively. Crucially, these factuality gains come without performance degradation on instruction following, math, or code, whereas continuous-reward RL, despite improving factuality, induces quality regressions.
摘要：语言模型经常生成事实上不正确的信息，而不受训练数据的支持，这种现象被称为外在幻觉。现有的缓解方法通常会降低开放式生成和下游任务的性能，从而限制了它们的实际效用。我们提出了一种在线强化学习方法，使用新颖的二进制检索增强奖励（RAR）来解决这种权衡问题。与连续奖励方案不同，我们的方法仅当模型的输出完全正确时才分配奖励 1，否则分配奖励 0。我们在不同任务的 Qwen3 推理模型上评估我们的方法。对于开放式生成，二元 RAR 的幻觉率降低了 39.3%，大大优于监督训练和连续奖励 RL 基线。在简短的问答中，模型学习校准弃权，在面对参数知识不足时有策略地输出“我不知道”。这使得 PopQA 和 GPQA 的错误答案分别减少了 44.4% 和 21.7%。至关重要的是，这些事实性的提高并不会降低指令遵循、数学或代码的性能，而连续奖励强化学习尽管提高了事实性，却会导致质量回归。
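
The contrast between the binary reward and a continuous reward can be made concrete with a small sketch. It assumes each response has already been split into claims and that a retrieval-based verifier has marked every claim as supported or not. That verification step, the function names, and the treatment of an empty claim list are assumptions of this sketch, not details taken from the paper.

```python
from typing import Sequence


def binary_reward(claim_supported: Sequence[bool]) -> float:
    """Return 1.0 only when every checked claim is supported, otherwise 0.0."""
    # Assumption: a response with no checkable claims counts as error-free.
    return 1.0 if all(claim_supported) else 0.0


def continuous_reward(claim_supported: Sequence[bool]) -> float:
    """Return the fraction of supported claims (a continuous scheme, for contrast)."""
    if not claim_supported:
        return 1.0  # same empty-input convention as binary_reward
    return sum(claim_supported) / len(claim_supported)


# Two of three claims are supported: the continuous scheme gives partial
# credit, while the binary scheme gives none.
checks = [True, True, False]
print(binary_reward(checks), continuous_reward(checks))
```

Under the binary scheme a single unsupported claim removes all reward, so partially correct answers earn nothing.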

Title: Evaluating Medical LLMs by Levels of Autonomy: A Survey Moving from Benchmarks to Applications

Authors: Xiao Ye, Jacob Dineen, Zhaonan Li, Zhikun Xu, Weiyu Chen, Shijie Lu, Yuxi Huang, Ming Shen, Phu Tran, Ji-Eun Irene Yum, Muhammad Ali Khan, Muhammad Umar Afzal, Irbaz Bin Riaz, Ben Zhou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.17764
Pdf URL: https://arxiv.org/pdf/2510.17764
Copy Paste: [[2510.17764]] Evaluating Medical LLMs by Levels of Autonomy: A Survey Moving from Benchmarks to Applications(https://arxiv.org/abs/2510.17764)
Keywords: language model, llm, agent
Abstract: Medical large language models achieve strong scores on standard benchmarks; however, the transfer of those results to safe and reliable performance in clinical workflows remains a challenge. This survey reframes evaluation through a levels-of-autonomy lens (L0-L3), spanning informational tools, information transformation and aggregation, decision support, and supervised agents. We align existing benchmarks and metrics with the actions permitted at each level and their associated risks, making the evaluation targets explicit. This motivates a level-conditioned blueprint for selecting metrics, assembling evidence, and reporting claims, alongside directions that link evaluation to oversight. By centering autonomy, the survey moves the field beyond score-based claims toward credible, risk-aware evidence for real clinical use.
摘要：医学大型语言模型在标准基准测试中取得了优异的成绩；然而，将这些结果转化为临床工作流程中安全可靠的性能仍然是一个挑战。这项调查通过自主级别（L0-L3）重新构建评估，涵盖信息工具、信息转换和聚合、决策支持和监督代理。我们将现有的基准和指标与每个级别允许的行动及其相关风险相结合，使评估目标变得明确。这激发了一个用于选择指标、收集证据和报告主张的水平条件蓝图，以及将评估与监督联系起来的方向。通过以自主权为中心，该调查使该领域超越了基于分数的主张，转向了真正临床使用的可信的、具有风险意识的证据。

Title: Foundational Automatic Evaluators: Scaling Multi-Task Generative Evaluator Training for Reasoning-Centric Domains

Authors: Austin Xu, Xuan-Phi Nguyen, Yilun Zhou, Chien-Sheng Wu, Caiming Xiong, Shafiq Joty
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.17793
Pdf URL: https://arxiv.org/pdf/2510.17793
Copy Paste: [[2510.17793]] Foundational Automatic Evaluators: Scaling Multi-Task Generative Evaluator Training for Reasoning-Centric Domains(https://arxiv.org/abs/2510.17793)
Keywords: gpt
Abstract: Finetuning specialized generative evaluators has emerged as a popular paradigm to meet the increasing demand for scalable evaluation at both training and test time. However, recent work has largely focused on applying new methodology, such as reinforcement learning (RL), to training evaluators, shying away from large-scale, data-driven development. In this work, we focus on data scaling, curating a set of 2.5M samples spanning five unique evaluation tasks (pairwise, step-level, reference-free and reference-based verification, and single rating) and multiple domains focused on reasoning evaluation. With our data, we train Foundational Automatic Reasoning Evaluators (FARE), a family of 8B and 20B (with 3.6B active) parameter evaluators, with a simple iterative rejection-sampling supervised finetuning (SFT) approach. FARE-8B challenges larger specialized RL-trained evaluators, and FARE-20B sets the new standard for open-source evaluators, surpassing specialized 70B+ evaluators. Beyond static benchmarks, we evaluate FARE in real-world tasks. As an inference-time reranker, FARE-20B achieves near-oracle performance on MATH. As verifiers in RL training, FARE improves the downstream RL-trained model performance by up to 14.1% vs. string-matching verifiers. When initialized from FARE, a continually-finetuned FARE-Code outperforms gpt-oss-20B by 65% on evaluating test-case quality.
摘要：微调专门的生成评估器已成为一种流行的范例，以满足训练和测试期间对可扩展评估不断增长的需求。然而，最近的工作主要集中在应用强化学习（RL）等新方法来培训评估者，回避大规模、数据驱动的开发。在这项工作中，我们专注于数据扩展，策划一组 250 万个样本，涵盖五个独特的评估任务（成对、步骤级、无参考和基于参考的验证以及单一评级）和专注于推理评估的多个领域。利用我们的数据，我们使用简单的迭代拒绝采样监督微调 (SFT) 方法来训练基础自动推理评估器 (FARE)，这是一系列 8B 和 20B（具有 3.6B 活动）参数评估器。 FARE-8B 挑战了规模更大的经过 RL 训练的专业评估器，FARE-20B 为开源评估器设定了新标准，超越了专业的 70B+ 评估器。除了静态基准之外，我们还评估现实世界任务中的 FARE：作为推理时间重新排序器，FARE-20B 在 MATH 上实现了接近预言机的性能。作为 RL 训练中的验证者，与字符串匹配验证者相比，FARE 将下游 RL 训练的模型性能提高了 14.1%。当从 FARE 初始化时，持续微调的 FARE 代码在评估测试用例质量方面比 gpt-oss-20B 高出 65%。
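
Using an evaluator as an inference-time reranker can be sketched as a generic best-of-N selection: sample several candidate answers and keep the one the evaluator scores highest. The sketch below does not reproduce FARE's interface. The `score` callable and the toy scorer are hypothetical stand-ins for a trained evaluator.

```python
from typing import Callable, Sequence


def rerank_best_of_n(
    question: str,
    candidates: Sequence[str],
    score: Callable[[str, str], float],  # hypothetical evaluator interface
) -> str:
    """Return the candidate answer that the evaluator scores highest."""
    if not candidates:
        raise ValueError("need at least one candidate to rerank")
    # On ties, max() keeps the earliest candidate in the sequence.
    return max(candidates, key=lambda answer: score(question, answer))


def toy_score(question: str, answer: str) -> float:
    # Stand-in evaluator (assumption): rewards answers containing "4".
    return 1.0 if "4" in answer else 0.0


best = rerank_best_of_n("What is 2 + 2?", ["5", "The answer is 4.", "22"], toy_score)
print(best)
```

Near-oracle performance in this setting would mean that the evaluator's top-scored candidate is almost always a correct one whenever a correct candidate exists among the N samples.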

Title: Executable Knowledge Graphs for Replicating AI Research

Authors: Yujie Luo, Zhuoyun Yu, Xuehai Wang, Yuqi Zhu, Ningyu Zhang, Lanning Wei, Lun Du, Da Zheng, Huajun Chen
Subjects: cs.CL, cs.AI, cs.LG, cs.MA, cs.SE
Abstract URL: https://arxiv.org/abs/2510.17795
Pdf URL: https://arxiv.org/pdf/2510.17795
Copy Paste: [[2510.17795]] Executable Knowledge Graphs for Replicating AI Research(https://arxiv.org/abs/2510.17795)
Keywords: language model, llm, retrieval-augmented generation, agent
Abstract: Replicating AI research is a crucial yet challenging task for large language model (LLM) agents. Existing approaches often struggle to generate executable code, primarily due to insufficient background knowledge and the limitations of retrieval-augmented generation (RAG) methods, which fail to capture latent technical details hidden in referenced papers. Furthermore, previous approaches tend to overlook valuable implementation-level code signals and lack structured knowledge representations that support multi-granular retrieval and reuse. To overcome these challenges, we propose Executable Knowledge Graphs (xKG), a modular and pluggable knowledge base that automatically integrates technical insights, code snippets, and domain-specific knowledge extracted from scientific literature. When integrated into three agent frameworks with two different LLMs, xKG shows substantial performance gains (10.9% with o3-mini) on PaperBench, demonstrating its effectiveness as a general and extensible solution for automated AI research replication. Code will be released at this https URL.
摘要：对于大型语言模型（LLM）代理来说，复制人工智能研究是一项至关重要但具有挑战性的任务。现有的方法通常很难生成可执行代码，这主要是由于背景知识不足和检索增强生成（RAG）方法的局限性，这些方法无法捕获隐藏在参考论文中的潜在技术细节。此外，以前的方法往往忽略了有价值的实现级代码信号，并且缺乏支持多粒度检索和重用的结构化知识表示。为了克服这些挑战，我们提出了可执行知识图（xKG），这是一个模块化、可插入的知识库，可以自动集成从科学文献中提取的技术见解、代码片段和特定领域的知识。当集成到具有两个不同 LLM 的三个代理框架中时，xKG 在 PaperBench 上显示出显着的性能提升（使用 o3-mini 提高了 10.9%），证明了其作为自动化 AI 研究复制的通用且可扩展解决方案的有效性。代码将在此 https URL 发布。

Title: Enterprise Deep Research: Steerable Multi-Agent Deep Research for Enterprise Analytics

Authors: Akshara Prabhakar, Roshan Ram, Zixiang Chen, Silvio Savarese, Frank Wang, Caiming Xiong, Huan Wang, Weiran Yao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.17797
Pdf URL: https://arxiv.org/pdf/2510.17797
Copy Paste: [[2510.17797]] Enterprise Deep Research: Steerable Multi-Agent Deep Research for Enterprise Analytics(https://arxiv.org/abs/2510.17797)
Keywords: agent
Abstract: As information grows exponentially, enterprises face increasing pressure to transform unstructured data into coherent, actionable insights. While autonomous agents show promise, they often struggle with domain-specific nuances, intent alignment, and enterprise integration. We present Enterprise Deep Research (EDR), a multi-agent system that integrates (1) a Master Planning Agent for adaptive query decomposition, (2) four specialized search agents (General, Academic, GitHub, LinkedIn), (3) an extensible MCP-based tool ecosystem supporting NL2SQL, file analysis, and enterprise workflows, (4) a Visualization Agent for data-driven insights, and (5) a reflection mechanism that detects knowledge gaps and updates research direction with optional human-in-the-loop steering guidance. These components enable automated report generation, real-time streaming, and seamless enterprise deployment, as validated on internal datasets. On open-ended benchmarks including DeepResearch Bench and DeepConsult, EDR outperforms state-of-the-art agentic systems without any human steering. We release the EDR framework and benchmark trajectories to advance research on multi-agent reasoning applications. Code at this https URL and Dataset at this https URL
摘要：随着信息呈指数级增长，企业面临着越来越大的压力，需要将非结构化数据转化为连贯的、可操作的见解。虽然自主代理显示出希望，但他们经常在特定领域的细微差别、意图一致性和企业集成方面遇到困难。我们提出了 Enterprise Deep Research (EDR)，这是一个多代理系统，它集成了 (1) 用于自适应查询分解的总体规划代理，(2) 四个专业搜索代理（通用、学术、GitHub、LinkedIn），(3) 一个支持 NL2SQL、文件分析和企业工作流程的基于 MCP 的可扩展工具生态系统，(4) 用于数据驱动洞察的可视化代理，以及 (5) 反思机制，通过可选的人机交互指导来检测知识差距并更新研究方向。这些组件可实现自动报告生成、实时流媒体和无缝企业部署，并在内部数据集上进行了验证。在包括 DeepResearch Bench 和 DeepConsult 在内的开放式基准测试中，EDR 在无需任何人工指导的情况下优于最先进的代理系统。我们发布 EDR 框架和基准轨迹，以推进多智能体推理应用的研究。此 https URL 处的代码和此 https URL 处的数据集