2025-10-09

Title: OpenStaxQA: A multilingual dataset based on open-source college textbooks

Authors: Pranav Gupta
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.06239
Pdf URL: https://arxiv.org/pdf/2510.06239
Copy Paste: [[2510.06239]] OpenStaxQA: A multilingual dataset based on open-source college textbooks(https://arxiv.org/abs/2510.06239)
Keywords: language model, llm
Abstract: We present OpenStaxQA, an evaluation benchmark specific to college-level educational applications based on 43 open-source college textbooks in English, Spanish, and Polish, available under a permissive Creative Commons license. We finetune and evaluate large language models (LLMs) with approximately 7 billion parameters on this dataset using quantized low-rank adapters (QLoRA). We also perform a zero-shot evaluation on the AI2 Reasoning Challenge dev set to check whether OpenStaxQA can improve performance on other tasks. Finally, we discuss broader impacts relevant to datasets such as OpenStaxQA.

Title: Knowledge Graph-Guided Multi-Agent Distillation for Reliable Industrial Question Answering with Datasets

Authors: Jiqun Pan, Zhenke Duan, Jiani Tu, Anzhi Cheng, Yanqing Wang
Subjects: cs.CL, cs.AI, cs.DB
Abstract URL: https://arxiv.org/abs/2510.06240
Pdf URL: https://arxiv.org/pdf/2510.06240
Copy Paste: [[2510.06240]] Knowledge Graph-Guided Multi-Agent Distillation for Reliable Industrial Question Answering with Datasets(https://arxiv.org/abs/2510.06240)
Keywords: language model, agent
Abstract: Industrial question-answering (QA) systems require higher safety and reliability than general-purpose dialogue models, as errors in high-risk scenarios such as equipment fault diagnosis can have severe consequences. Although multi-agent large language models enhance reasoning depth, they suffer from uncontrolled iterations and unverifiable outputs, and conventional distillation methods struggle to transfer collaborative reasoning capabilities to lightweight, deployable student models. To address these challenges, we propose Knowledge Graph-guided Multi-Agent System Distillation (KG-MASD). Our approach formulates distillation as a Markov Decision Process and incorporates a knowledge graph as a verifiable structured prior to enrich state representation and ensure convergence. By integrating collaborative reasoning with knowledge grounding, KG-MASD generates high-confidence instruction-tuning data and jointly distills reasoning depth and verifiability into compact student models suitable for edge deployment. Experiments on an industrial QA dataset show that KG-MASD improves accuracy by 2.4% to 20.1% over baselines and significantly enhances reliability, enabling trustworthy AI deployment in safety-critical industrial scenarios. Code and data are available at this https URL.

Title: Transparent Reference-free Automated Evaluation of Open-Ended User Survey Responses

Authors: Subin An, Yugyeong Ji, Junyoung Kim, Heejin Kook, Yang Lu, Josh Seltzer
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.06242
Pdf URL: https://arxiv.org/pdf/2510.06242
Copy Paste: [[2510.06242]] Transparent Reference-free Automated Evaluation of Open-Ended User Survey Responses(https://arxiv.org/abs/2510.06242)
Keywords: llm
Abstract: Open-ended survey responses provide valuable insights in marketing research, but low-quality responses not only burden researchers with manual filtering but also risk leading to misleading conclusions, underscoring the need for effective evaluation. Existing automatic evaluation methods target LLM-generated text and inadequately assess human-written responses with their distinct characteristics. To address such characteristics, we propose a two-stage evaluation framework specifically designed for human survey responses. First, gibberish filtering removes nonsensical responses. Then, three dimensions (effort, relevance, and completeness) are evaluated using LLM capabilities, grounded in empirical analysis of real-world survey data. Validation on English and Korean datasets shows that our framework not only outperforms existing metrics but also demonstrates high practical applicability for real-world applications such as response quality prediction and response rejection, showing strong correlations with expert assessment.

Title: CoT Referring: Improving Referring Expression Tasks with Grounded Reasoning

Authors: Qihua Dong, Luis Figueroa, Handong Zhao, Kushal Kafle, Jason Kuen, Zhihong Ding, Scott Cohen, Yun Fu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.06243
Pdf URL: https://arxiv.org/pdf/2510.06243
Copy Paste: [[2510.06243]] CoT Referring: Improving Referring Expression Tasks with Grounded Reasoning(https://arxiv.org/abs/2510.06243)
Keywords: language model, llm, chain-of-thought
Abstract: Referring Expression Comprehension and Segmentation are critical tasks for assessing the integration of language understanding and image comprehension, serving as benchmarks for the capabilities of Multimodal Large Language Models (MLLMs). To address these challenges, we propose a new strategy, CoT Referring, which enhances model reasoning across modalities through a structured, chain-of-thought training data format. Our approach systematically parses textual structures into sequential referring steps; at each step it identifies relationships and ensures consistent reference alignment, thereby improving accuracy in complex query scenarios. We restructure the training data to enforce a new output form, providing new annotations for existing datasets and compiling an evaluation benchmark from existing resources. This benchmark is designed explicitly for complex referring cases. We also integrate detection and segmentation capabilities into a unified MLLM framework, training it with a novel adaptive weighted loss to optimize performance. Experimental results on our curated benchmark and RefCOCO/+/g demonstrate the effectiveness of our approach, with a notable improvement of more than 2.5% over baseline models.

Title: TRepLiNa: Layer-wise CKA+REPINA Alignment Improves Low-Resource Machine Translation in Aya-23 8B

Authors: Toshiki Nakai, Ravi Kiran Chikkala, Lena Sophie Oberkircher, Nicholas Jennings, Natalia Skachkova, Tatiana Anikina, Jesujoba Oluwadara Alabi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.06249
Pdf URL: https://arxiv.org/pdf/2510.06249
Copy Paste: [[2510.06249]] TRepLiNa: Layer-wise CKA+REPINA Alignment Improves Low-Resource Machine Translation in Aya-23 8B(https://arxiv.org/abs/2510.06249)
Keywords: language model, llm
Abstract: The 2025 Multimodal Models for Low-Resource Contexts and Social Impact (MMLoSo) Language Challenge addresses one of India's most pressing linguistic gaps: the lack of resources for its diverse low-resource languages (LRLs). In this study, we investigate whether enforcing cross-lingual similarity in specific internal layers of a decoder-only multilingual large language model (LLM) can improve translation quality from LRL to high-resource language (HRL). Specifically, we combine Centered Kernel Alignment (CKA), a similarity metric that encourages representations of different languages to align, with REPINA, a regularization method that constrains parameter updates to remain close to the pretrained model, into a joint method we call TRepLiNa. In this research project, we experiment with zero-shot, few-shot, and fine-tuning settings using Aya-23 8B with QLoRA across MMLoSo shared task language pairs (Mundari, Santali, Bhili) with Hindi/English pivots. Our results show that aligning mid-level layers using TRepLiNa (CKA+REPINA) is a low-cost, practical approach to improving LRL translation, especially in data-scarce settings.
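The abstract above does not say which CKA variant is used. As a rough illustration only, the following sketch computes linear CKA between two sets of representations of the same inputs, for example hidden states of parallel sentences in two languages taken from one layer. The function names and the toy matrices are assumptions made for this sketch, and the REPINA regularizer is not shown.

```python
import math


def center_columns(rows):
    """Subtract each column's mean so every feature has zero mean."""
    n = len(rows)
    dims = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dims)]
    return [[r[j] - means[j] for j in range(dims)] for r in rows]


def cross_frobenius_sq(a, b):
    """Squared Frobenius norm of a^T b for row-major matrices a and b."""
    total = 0.0
    for i in range(len(a[0])):
        for j in range(len(b[0])):
            dot = sum(a[k][i] * b[k][j] for k in range(len(a)))
            total += dot * dot
    return total


def linear_cka(x, y):
    """Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F) on centered data.

    x and y hold one row per input example; they may have different widths.
    Raises ZeroDivisionError if either representation is constant.
    """
    xc, yc = center_columns(x), center_columns(y)
    numerator = cross_frobenius_sq(yc, xc)
    denominator = math.sqrt(cross_frobenius_sq(xc, xc) * cross_frobenius_sq(yc, yc))
    return numerator / denominator


# Hypothetical toy "hidden states": 3 examples, 2 features each.
source_states = [[1.0, 2.0], [3.0, 4.0], [5.0, 7.0]]
# The same states rotated by 90 degrees and scaled by 2.
target_states = [[2.0 * b, -2.0 * a] for a, b in source_states]
similarity = linear_cka(source_states, target_states)
```

Linear CKA is invariant to orthogonal transformations and isotropic scaling, so `similarity` here equals 1 up to floating-point error. One plausible way to turn the metric into a training signal is to minimize `1 - linear_cka(...)` at a chosen layer, though whether the paper does exactly this is not stated in the abstract.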

Title: Scalable multilingual PII annotation for responsible AI in LLMs

Authors: Bharti Meena, Joanna Skubisz, Harshit Rajgarhia, Nand Dave, Kiran Ganesh, Shivali Dalmia, Abhishek Mukherji, Vasudevan Sundarababu, Olga Pospelova
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.06250
Pdf URL: https://arxiv.org/pdf/2510.06250
Copy Paste: [[2510.06250]] Scalable multilingual PII annotation for responsible AI in LLMs(https://arxiv.org/abs/2510.06250)
Keywords: language model, llm
Abstract: As Large Language Models (LLMs) gain wider adoption, ensuring their reliable handling of Personally Identifiable Information (PII) across diverse regulatory contexts has become essential. This work introduces a scalable multilingual data curation framework designed for high-quality PII annotation across 13 underrepresented locales, covering approximately 336 locale-specific PII types. Our phased, human-in-the-loop annotation methodology combines linguistic expertise with rigorous quality assurance, leading to substantial improvements in recall and false positive rates from pilot, training, and production phases. By leveraging inter-annotator agreement metrics and root-cause analysis, the framework systematically uncovers and resolves annotation inconsistencies, resulting in high-fidelity datasets suitable for supervised LLM fine-tuning. Beyond reporting empirical gains, we highlight common annotator challenges in multilingual PII labeling and demonstrate how iterative, analytics-driven pipelines can enhance both annotation quality and downstream model reliability.

Title: Dual-stage and Lightweight Patient Chart Summarization for Emergency Physicians

Authors: Jiajun Wu, Swaleh Zaidi, Braden Teitge, Henry Leung, Jiayu Zhou, Jessalyn Holodinsky, Steve Drew
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.06263
Pdf URL: https://arxiv.org/pdf/2510.06263
Copy Paste: [[2510.06263]] Dual-stage and Lightweight Patient Chart Summarization for Emergency Physicians(https://arxiv.org/abs/2510.06263)
Keywords: language model, llm
Abstract: Electronic health records (EHRs) contain extensive unstructured clinical data that can overwhelm emergency physicians trying to identify critical information. We present a two-stage summarization system that runs entirely on embedded devices, enabling offline clinical summarization while preserving patient privacy. In our approach, a dual-device architecture first retrieves relevant patient record sections using the Jetson Nano-R (Retrieve), then generates a structured summary on another Jetson Nano-S (Summarize), communicating via a lightweight socket link. The summarization output is two-fold: (1) a fixed-format list of critical findings, and (2) a context-specific narrative focused on the clinician's query. The retrieval stage uses locally stored EHRs, splits long notes into semantically coherent sections, and searches for the most relevant sections per query. The generation stage uses a locally hosted small language model (SLM) to produce the summary from the retrieved text, operating within the constraints of two NVIDIA Jetson devices. We first benchmarked six open-source SLMs under 7B parameters to identify viable models. We incorporated an LLM-as-Judge evaluation mechanism to assess summary quality in terms of factual accuracy, completeness, and clarity. Preliminary results on MIMIC-IV and de-identified real EHRs demonstrate that our fully offline system can effectively produce useful summaries in under 30 seconds.
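The retrieval stage described above (split notes into sections, then find the most relevant sections for a query) can be pictured with the following minimal sketch. It swaps in blank-line splitting and plain token overlap for the semantic sectioning and search the abstract mentions, since those details are not given. All names and the sample note are hypothetical, and the summarization stage and socket link are not shown.

```python
import re


def split_sections(note):
    """Split a note on blank lines; a stand-in for semantic sectioning."""
    return [part.strip() for part in re.split(r"\n\s*\n", note) if part.strip()]


def tokenize(text):
    """Lowercase alphanumeric tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def top_sections(note, query, limit=2):
    """Return up to `limit` sections ranked by token overlap with the query."""
    query_tokens = tokenize(query)
    scored = []
    for section in split_sections(note):
        overlap = len(query_tokens & tokenize(section))
        if overlap:  # drop sections that share no token with the query
            scored.append((overlap, section))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [section for _, section in scored[:limit]]


# Hypothetical note, invented for illustration only.
sample_note = (
    "Allergies: penicillin.\n\n"
    "Medications: aspirin daily.\n\n"
    "Social history: non-smoker."
)
relevant = top_sections(sample_note, "current medications")
```

In a setup like the one described, the selected sections would then be passed to the second device, where a small language model produces the summary.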

Title: A Comprehensive Survey of Hallucination in Large Language Models: Causes, Detection, and Mitigation

Authors: Aisha Alansari, Hamzah Luqman
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.06265
Pdf URL: https://arxiv.org/pdf/2510.06265
Copy Paste: [[2510.06265]] A Comprehensive Survey of Hallucination in Large Language Models: Causes, Detection, and Mitigation(https://arxiv.org/abs/2510.06265)
Keywords: language model, llm, hallucination
Abstract: Large language models (LLMs) have transformed natural language processing, achieving remarkable performance across diverse tasks. However, their impressive fluency often comes at the cost of producing false or fabricated information, a phenomenon known as hallucination. Hallucination refers to the generation of content by an LLM that is fluent and syntactically correct but factually inaccurate or unsupported by external evidence. Hallucinations undermine the reliability and trustworthiness of LLMs, especially in domains requiring factual accuracy. This survey provides a comprehensive review of research on hallucination in LLMs, with a focus on causes, detection, and mitigation. We first present a taxonomy of hallucination types and analyze their root causes across the entire LLM development lifecycle, from data collection and architecture design to inference. We further examine how hallucinations emerge in key natural language generation tasks. Building on this foundation, we introduce a structured taxonomy of detection approaches and another taxonomy of mitigation strategies. We also analyze the strengths and limitations of current detection and mitigation approaches and review existing evaluation benchmarks and metrics used to quantify LLM hallucinations. Finally, we outline key open challenges and promising directions for future research, providing a foundation for the development of more truthful and trustworthy LLMs.

Title: Language models for longitudinal analysis of abusive content in Billboard Music Charts

Authors: Rohitash Chandra, Yathin Suresh, Divyansh Raj Sinha, Sanchit Jindal
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.06266
Pdf URL: https://arxiv.org/pdf/2510.06266
Copy Paste: [[2510.06266]] Language models for longitudinal analysis of abusive content in Billboard Music Charts(https://arxiv.org/abs/2510.06266)
Keywords: language model
Abstract: There is no doubt that there has been a drastic increase in abusive and sexually explicit content in music, particularly in Billboard Music Charts. However, few studies validate this trend, which is needed for effective policy development because such content can drive harmful behavioural changes in children and youths. In this study, we utilise deep learning methods to analyse songs (lyrics) from Billboard Charts of the United States in the last seven decades. We provide a longitudinal study using deep learning and language models and review the evolution of content using sentiment analysis and abuse detection, including sexually explicit content. Our results show a significant rise in explicit content in popular music from 1990 onwards. Furthermore, we find an increasing prevalence of songs with lyrics containing profane, sexually explicit, and otherwise inappropriate language. The longitudinal analysis demonstrates the ability of language models to capture nuanced patterns in lyrical content, reflecting shifts in societal norms and language use over time.

Title: Reproducibility Study of "XRec: Large Language Models for Explainable Recommendation"

Authors: Ranjan Mishra, Julian I. Bibo, Quinten van Engelen, Henk Schaapman
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.06275
Pdf URL: https://arxiv.org/pdf/2510.06275
Copy Paste: [[2510.06275]] Reproducibility Study of "XRec: Large Language Models for Explainable Recommendation"(https://arxiv.org/abs/2510.06275)
Keywords: language model, gpt, llm
Abstract: In this study, we reproduced the work done in the paper "XRec: Large Language Models for Explainable Recommendation" by Ma et al. (2024). The original authors introduced XRec, a model-agnostic collaborative instruction-tuning framework that enables large language models (LLMs) to provide users with comprehensive explanations of generated recommendations. Our objective was to replicate the results of the original paper, albeit using Llama 3 as the LLM for evaluation instead of GPT-3.5-turbo. We built on the source code provided by Ma et al. (2024) to achieve our goal. Our work extends the original paper by modifying the input embeddings or deleting the output embeddings of XRec's Mixture of Experts module. Based on our results, XRec effectively generates personalized explanations and its stability is improved by incorporating collaborative information. However, XRec did not consistently outperform all baseline models in every metric. Our extended analysis further highlights the importance of the Mixture of Experts embeddings in shaping the explanation structures, showcasing how collaborative signals interact with language modeling. Through our work, we provide an open-source evaluation implementation that enhances accessibility for researchers and practitioners alike. Our complete code repository can be found at this https URL.

Title: LLM Bias Detection and Mitigation through the Lens of Desired Distributions

Authors: Ingroj Shrestha, Padmini Srinivasan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.06354
Pdf URL: https://arxiv.org/pdf/2510.06354
Copy Paste: [[2510.06354]] LLM Bias Detection and Mitigation through the Lens of Desired Distributions(https://arxiv.org/abs/2510.06354)
Keywords: language model, llm
Abstract: Although prior work on bias mitigation has focused on promoting social equality and demographic parity, less attention has been given to aligning an LLM's outputs to desired distributions. For example, we might want to align a model with real-world distributions to support factual grounding. Thus, we define bias as deviation from a desired distribution, which may be an equal or real-world distribution, depending on application goals. We propose a weighted adaptive loss-based fine-tuning method that aligns an LLM's gender-profession output distribution with the desired distribution, while preserving language modeling capability. Using 3 profession sets (male-dominated, female-dominated, and gender-balanced) derived from U.S. labor statistics (2024), we assess both our adaptive method for reflecting reality and a non-adaptive variant for equality. Across three masked language models, bias is observed under both distributions. We achieve near-complete mitigation under equality and 30-75% reduction under real-world settings. Autoregressive LLMs show no bias under equality but notable bias under real-world settings, with the Llama Instruct models (3.2-3B, 3.1-8B) achieving a 50-62% reduction.
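The definition above, bias as deviation from a desired distribution, can be made concrete with a small sketch. The abstract does not state which deviation measure the paper uses, so total variation distance is assumed here purely for illustration. The counts and the "real-world" target shares below are invented placeholders, not figures from the paper or from U.S. labor statistics.

```python
def normalize(counts):
    """Turn raw counts per category into a probability distribution."""
    total = sum(counts.values())
    return {key: value / total for key, value in counts.items()}


def total_variation(observed, desired):
    """Total variation distance: 0 means identical, 1 means disjoint."""
    keys = set(observed) | set(desired)
    return 0.5 * sum(abs(observed.get(k, 0.0) - desired.get(k, 0.0)) for k in keys)


# Hypothetical counts of gendered completions for one profession prompt.
model_output = normalize({"female": 90, "male": 10})

# Two possible targets: equality, or an assumed real-world share.
equal_target = {"female": 0.5, "male": 0.5}
real_world_target = {"female": 0.87, "male": 0.13}  # placeholder values

bias_vs_equal = total_variation(model_output, equal_target)
bias_vs_real_world = total_variation(model_output, real_world_target)
```

With these placeholder numbers the same model output is far from the equal target and close to the assumed real-world target, which illustrates why the measured bias depends on the chosen target distribution.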

Title: EVALUESTEER: Measuring Reward Model Steerability Towards Values and Preference

Authors: Kshitish Ghate, Andy Liu, Devansh Jain, Taylor Sorensen, Atoosa Kasirzadeh, Aylin Caliskan, Mona T. Diab, Maarten Sap
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.06370
Pdf URL: https://arxiv.org/pdf/2510.06370
Copy Paste: [[2510.06370]] EVALUESTEER: Measuring Reward Model Steerability Towards Values and Preference(https://arxiv.org/abs/2510.06370)
Keywords: language model, llm, prompt
Abstract: As large language models (LLMs) are deployed globally, creating pluralistic systems that can accommodate the diverse preferences and values of users worldwide becomes essential. We introduce EVALUESTEER, a benchmark to measure LLMs' and reward models' (RMs) steerability towards users' value and stylistic preference profiles grounded in psychology and human-LLM interaction literature. To address the gap in existing datasets that do not support controlled evaluations of RM steering, we synthetically generated 165,888 preference pairs -- systematically varying pairs along 4 value dimensions (traditional, secular-rational, survival, and self-expression) and 4 style dimensions (verbosity, readability, confidence, and warmth). We use EVALUESTEER to evaluate whether, given a user profile and a pair of candidate value-laden and style-laden responses, LLMs and RMs are able to select the output that aligns with the user's preferences. We evaluate six open-source and proprietary LLMs and RMs under sixteen systematic prompting conditions and six preference comparison scenarios. Notably, our results show that, when given the user's full profile of values and stylistic preferences, the best models achieve <75% accuracy at choosing the correct response, in contrast to >99% accuracy when only relevant style and value preferences are provided. EVALUESTEER thus highlights the limitations of current RMs at identifying and adapting to relevant user profile information, and provides a challenging testbed for developing RMs that can be steered towards diverse human values and preferences.

Title: EverydayMMQA: A Multilingual and Multimodal Framework for Culturally Grounded Spoken Visual QA

Authors: Firoj Alam, Ali Ezzat Shahroor, Md. Arid Hasan, Zien Sheikh Ali, Hunzalah Hassan Bhatti, Mohamed Bayan Kmainasi, Shammur Absar Chowdhury, Basel Mousi, Fahim Dalvi, Nadir Durrani, Natasa Milic-Frayling
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.06371
Pdf URL: https://arxiv.org/pdf/2510.06371
Copy Paste: [[2510.06371]] EverydayMMQA: A Multilingual and Multimodal Framework for Culturally Grounded Spoken Visual QA(https://arxiv.org/abs/2510.06371)
Keywords: llm
Abstract: Large-scale multimodal models achieve strong results on tasks like Visual Question Answering (VQA), but they often fail when queries require culturally grounded, everyday knowledge, particularly in low-resource and underrepresented languages. To bridge this gap, we introduce Everyday Multimodal and Multilingual QA (EverydayMMQA), a framework for creating large-scale, culturally-grounded datasets for spoken and visual question answering (SVQA). Using this framework, we developed OASIS, a multimodal dataset integrating speech, images, and text. With ~0.92M images and 14.8M QA pairs, OASIS contains 3.7M spoken questions, enabling four unique input combinations: speech-only, text-only, speech+image, and text+image. Focused on English and Arabic varieties across 18 countries, the dataset is curated to reflect diverse, real-world situations. OASIS tests models on tasks beyond object recognition that involve pragmatic, commonsense, and culturally aware reasoning. We benchmarked four closed-source models, three open-source models, and one fine-tuned model. EverydayMMQA and OASIS together provide a benchmark and training dataset for building multimodal LLMs for a comprehensive set of everyday tasks within cultural contexts. The framework and dataset will be made publicly available to the community.

Title: Semantic Regexes: Auto-Interpreting LLM Features with a Structured Language

Authors: Angie Boggust, Donghao Ren, Yannick Assogba, Dominik Moritz, Arvind Satyanarayan, Fred Hohman
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.06378
Pdf URL: https://arxiv.org/pdf/2510.06378
Copy Paste: [[2510.06378]] Semantic Regexes: Auto-Interpreting LLM Features with a Structured Language(https://arxiv.org/abs/2510.06378)
Keywords: language model, llm
Abstract: Automated interpretability aims to translate large language model (LLM) features into human-understandable descriptions. However, these natural language feature descriptions are often vague, inconsistent, and require manual relabeling. In response, we introduce semantic regexes, structured language descriptions of LLM features. By combining primitives that capture linguistic and semantic feature patterns with modifiers for contextualization, composition, and quantification, semantic regexes produce precise and expressive feature descriptions. Across quantitative benchmarks and qualitative analyses, we find that semantic regexes match the accuracy of natural language while yielding more concise and consistent feature descriptions. Moreover, their inherent structure affords new types of analyses, including quantifying feature complexity across layers, scaling automated interpretability from insights into individual features to model-wide patterns. Finally, in user studies, we find that semantic regex descriptions help people build accurate mental models of LLM feature activations.

Title: Protecting De-identified Documents from Search-based Linkage Attacks

Authors: Pierre Lison, Mark Anderson
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.06383
Pdf URL: https://arxiv.org/pdf/2510.06383
Copy Paste: [[2510.06383]] Protecting De-identified Documents from Search-based Linkage Attacks(https://arxiv.org/abs/2510.06383)
Keywords: llm
Abstract: While de-identification models can help conceal the identity of the individual(s) mentioned in a document, they fail to address linkage risks, defined as the potential to map the de-identified text back to its source. One straightforward way to perform such linkages is to extract phrases from the de-identified document and then check their presence in the original dataset. This paper presents a method to counter search-based linkage attacks while preserving the semantic integrity of the text. The method proceeds in two steps. We first construct an inverted index of the N-grams occurring in the document collection, making it possible to efficiently determine which N-grams appear in fewer than $k$ documents (either alone or in combination with other N-grams). An LLM-based rewriter is then iteratively queried to reformulate those spans until linkage is no longer possible. Experimental results on a collection of court cases show that the method is able to effectively prevent search-based linkages while remaining faithful to the original content.
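The first step described above, an inverted index over N-grams that flags those occurring in fewer than k documents, can be sketched as follows. This is a minimal illustration under simplifying assumptions (whitespace tokenization, N-grams up to length 3, invented toy documents), not the authors' implementation; the function names are hypothetical, and the LLM-based rewriting step is not shown.

```python
from collections import defaultdict


def ngrams(tokens, n):
    """All contiguous n-token tuples in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_inverted_index(documents, max_n=3):
    """Map each N-gram (n = 1..max_n) to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        tokens = text.lower().split()
        for n in range(1, max_n + 1):
            for gram in ngrams(tokens, n):
                index[gram].add(doc_id)
    return index


def rare_ngrams(text, index, k, max_n=3):
    """N-grams of `text` found in at least one but fewer than k indexed documents."""
    tokens = text.lower().split()
    found = []
    for n in range(1, max_n + 1):
        for gram in ngrams(tokens, n):
            if 0 < len(index.get(gram, ())) < k:
                found.append(gram)
    return found


def combined_document_count(grams, index):
    """Number of documents that contain every N-gram in `grams` together."""
    postings = [index.get(gram, set()) for gram in grams]
    return len(set.intersection(*postings)) if postings else 0


# Hypothetical toy collection standing in for the original dataset.
collection = {
    "case1": "the defendant sold a red bicycle",
    "case2": "the defendant sold a car",
    "case3": "the court adjourned",
}
index = build_inverted_index(collection)
risky = rare_ngrams("the defendant sold a red bicycle", index, k=2)
```

In this toy collection, spans such as `("red", "bicycle")` appear in only one document and would be flagged for rewriting. `combined_document_count` covers the "in combination" case: two N-grams that are each common can still identify a single document when they occur together.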

Title: Reward Model Perspectives: Whose Opinions Do Reward Models Reward?

Authors: Elle
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.06391
Pdf URL: https://arxiv.org/pdf/2510.06391
Copy Paste: [[2510.06391]] Reward Model Perspectives: Whose Opinions Do Reward Models Reward?(https://arxiv.org/abs/2510.06391)
Keywords: language model, prompt
Abstract: Reward models (RMs) are central to the alignment of language models (LMs). An RM often serves as a proxy for human preferences to guide downstream LM behavior. However, our understanding of RM behavior is limited. Our work (i) formalizes a framework for measuring the alignment of opinions captured by RMs, (ii) investigates the extent to which RMs demonstrate sociodemographic biases, and (iii) explores the effects of prompting to steer rewards towards the preferences of a target group. We study the subjective and diverse perspectives on controversial topics, which allows us to quantify RM perspectives in terms of their opinions, attitudes, and values. We show that RMs are poorly aligned with several demographic groups and can systematically reward harmful stereotypes, and steering alone is not enough to overcome these limitations. Our findings underscore the need for more careful consideration of RM behavior in model alignment during preference learning to prevent the propagation of unwanted social biases in the language technologies that we use.
摘要：奖励模型 (RM) 是语言模型 (LM) 协调的核心。 RM 通常充当人类偏好的代理来指导下游 LM 行为。然而，我们对 RM 行为的理解是有限的。我们的工作（i）正式确定了一个衡量 RM 所捕获意见一致性的框架，（ii）调查了 RM 表现出社会人口统计学偏见的程度，以及（iii）探索了促使奖励转向目标群体偏好的效果。我们研究有争议话题的主观和多样化的观点，这使我们能够根据 RM 的观点、态度和价值观来量化他们的观点。我们表明，RM 与多个人口群体的一致性较差，并且可以系统地奖励有害的刻板印象，并且仅靠指导不足以克服这些限制。我们的研究结果强调，在偏好学习期间需要更仔细地考虑模型对齐中的 RM 行为，以防止我们使用的语言技术中传播不必要的社会偏见。

Title: Instructional Goal-Aligned Question Generation for Student Evaluation in Virtual Lab Settings: How Closely Do LLMs Actually Align?

Authors: R. Alexander Knipper, Indrani Dey, Souvika Sarkar, Hari Narayanan, Sadhana Puntambekar, Santu Karmaker
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.06411
Pdf URL: https://arxiv.org/pdf/2510.06411
Copy Paste: [[2510.06411]] Instructional Goal-Aligned Question Generation for Student Evaluation in Virtual Lab Settings: How Closely Do LLMs Actually Align?(https://arxiv.org/abs/2510.06411)
Keywords: language model, llm, prompt
Abstract: Virtual Labs offer valuable opportunities for hands-on, inquiry-based science learning, yet teachers often struggle to adapt them to fit their instructional goals. Third-party materials may not align with classroom needs, and developing custom resources can be time-consuming and difficult to scale. Recent advances in Large Language Models (LLMs) offer a promising avenue for addressing these limitations. In this paper, we introduce a novel alignment framework for instructional goal-aligned question generation, enabling teachers to leverage LLMs to produce simulation-aligned, pedagogically meaningful questions through natural language interaction. The framework integrates four components: instructional goal understanding via teacher-LLM dialogue, lab understanding via knowledge unit and relationship analysis, a question taxonomy for structuring cognitive and pedagogical intent, and the TELeR taxonomy for controlling prompt detail. Early design choices were informed by a small teacher-assisted case study, while our final evaluation analyzed over 1,100 questions from 19 open-source LLMs. With goal and lab understanding grounding questions in teacher intent and simulation context, the question taxonomy elevates cognitive demand (open-ended formats and relational types raise quality by 0.29-0.39 points), and optimized TELeR prompts enhance format adherence (80% parsability, >90% adherence). Larger models yield the strongest gains: parsability +37.1%, adherence +25.7%, and average quality +0.8 Likert points.
摘要：虚拟实验室为实践、探究式科学学习提供了宝贵的机会，但教师往往很难使它们适应他们的教学目标。第三方材料可能与课堂需求不符，开发定制资源可能非常耗时且难以扩展。大型语言模型 (LLM) 的最新进展为解决这些限制提供了一条有希望的途径。在本文中，我们介绍了一种新颖的对齐框架，用于生成教学目标一致的问题，使教师能够利用法学硕士通过自然语言交互生成模拟一致的、具有教学意义的问题。该框架整合了四个组成部分：通过教师与法学硕士对话的教学目标理解、通过知识单元和关系分析的实验室理解、用于构建认知和教学意图的问题分类法以及用于控制提示细节的 TELeR 分类法。早期的设计选择是通过一个小型教师辅助案例研究得出的，而我们的最终评估则分析了来自 19 个开源法学硕士的 1,100 多个问题。通过目标和实验室理解，将问题置于教师意图和模拟环境中，问题分类提高了认知需求（开放式格式和关系类型将质量提高了 0.29-0.39 分），优化的 TELeR 提示增强了格式遵循性（80% 可解析性，>90% 遵循性）。较大的模型产生最强的收益：可解析性 +37.1%，依从性 +25.7%，平均质量 +0.8 Likert 点。

Title: FinLFQA: Evaluating Attributed Text Generation of LLMs in Financial Long-Form Question Answering

Authors: Yitao Long, Tiansheng Hu, Yilun Zhao, Arman Cohan, Chen Zhao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.06426
Pdf URL: https://arxiv.org/pdf/2510.06426
Copy Paste: [[2510.06426]] FinLFQA: Evaluating Attributed Text Generation of LLMs in Financial Long-Form Question Answering(https://arxiv.org/abs/2510.06426)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) frequently hallucinate when answering long-form questions, producing plausible yet factually incorrect answers. A common mitigation strategy is to provide attribution to LLM outputs. However, existing benchmarks primarily focus on simple attribution that retrieves supporting textual evidence as references. We argue that in real-world scenarios such as financial applications, attribution goes beyond reference retrieval. We introduce FinLFQA, a benchmark designed to evaluate the ability of LLMs to generate long-form answers to complex financial questions with reliable and nuanced attributions. FinLFQA evaluates three critical aspects of attribution through human annotations: (1) supporting evidence extracted from financial reports, (2) intermediate numerical reasoning steps, and (3) domain-specific financial knowledge that informs the reasoning process. We further provide an automatic evaluation framework covering both answer quality and attribution quality. Through extensive experiments on eight LLMs across multiple attribution-generation paradigms, we find that fine-grained metrics are important to distinguish model capabilities, that end-to-end generation achieves comparable performance to post-hoc approaches, and that iterative refinement only helps when guided by external feedback.
摘要：大型语言模型 (LLM) 经常会产生长篇问题的幻觉，产生看似合理但实际上不正确的答案。常见的缓解策略是提供 LLM 输出的归属。然而，现有的基准主要关注简单的归因，检索支持文本证据作为参考。我们认为，在金融应用等现实场景中，归因超出了参考检索的范围。我们推出 FinLFQA，这是一个基准，旨在评估法学硕士为复杂的财务问题生成具有可靠和细致入微的归因的长格式答案的能力。 FinLFQA 通过人工注释评估归因的三个关键方面：(1) 支持从财务报告中提取的证据，(2) 中间数字推理步骤，以及 (3) 为推理过程提供信息的特定领域的金融知识。我们进一步提供了一个涵盖答案质量和归因质量的自动评估框架。通过对跨多个归因生成范例的八个法学硕士进行广泛的实验，我们发现细粒度的指标对于区分模型功能非常重要，端到端生成实现了与事后方法相当的性能，并且迭代细化只有在外部反馈的指导下才有帮助。

Title: MathRobust-LV: Evaluation of Large Language Models' Robustness to Linguistic Variations in Mathematical Reasoning

Authors: Neeraja Kirtane, Yuvraj Khanna, Peter Relan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.06430
Pdf URL: https://arxiv.org/pdf/2510.06430
Copy Paste: [[2510.06430]] MathRobust-LV: Evaluation of Large Language Models' Robustness to Linguistic Variations in Mathematical Reasoning(https://arxiv.org/abs/2510.06430)
Keywords: language model, gpt
Abstract: Large language models excel on math benchmarks, but their math reasoning robustness to linguistic variation is underexplored. While recent work increasingly treats high-difficulty competitions like the IMO as the gold standard for evaluating reasoning, we believe in comprehensive benchmarking of high school-level math problems in real educational settings. We introduce MathRobust-LV, a test set and evaluation methodology that mirrors how instructors rephrase problems across assessments while keeping difficulty constant: we change surface details (names, contexts, variables) while preserving numerical structure and answers. In contrast to prior efforts that alter problem content or emphasize IMO-level tasks, we focus on high-school-level dataset problems at the difficulty level where models are currently deployed in educational settings: tutoring and assessment systems. In these applications, instructors rephrase identical concepts in varied ways, making linguistic robustness essential for reliable deployment. Although MATH data benchmarking is often regarded as saturated, our experiment on 34 models reveals that accuracy declines when moving from the baseline to the variants. These drops are severe for smaller models (9-11%), while stronger models also show measurable degradation. Frontier models like GPT-5 and Gemini-2.5pro remain comparatively stable. Our results highlight that robustness to linguistic variation is a fundamental challenge, exposing reasoning vulnerabilities in models.
摘要：大型语言模型在数学基准上表现出色，但其数学推理对语言变化的稳健性尚未得到充分探索。虽然最近的工作越来越多地将像 IMO 这样的高难度竞赛视为评估推理的黄金标准，但我们相信在真实的教育环境中对高中水平的数学问题进行全面的基准测试。我们引入了 MathRobust-LV，这是一种测试集和评估方法，它反映了教师如何在保持难度不变的情况下重新表述评估中的问题：我们改变表面细节（名称、上下文、变量），同时保留数字结构和答案。与之前改变问题内容或强调 IMO 级别任务的努力相比，我们专注于目前在教育环境中部署模型的高中级别数据集问题：辅导和评估系统。在这些应用程序中，教师以不同的方式重新表述相同的概念，使得语言的稳健性对于可靠的部署至关重要。尽管 MATH 数据基准测试通常被认为是饱和的，但我们对 34 个模型的实验表明，从基线转移到变体时，准确性会下降。对于较小的模型 (9-11%)，这些下降非常严重，而较强的模型也显示出可测量的退化。 GPT-5、Gemini-2.5pro等前沿型号保持相对稳定。我们的结果强调，对语言变化的鲁棒性是一个根本性的挑战，暴露了模型中的推理漏洞。

Title: A Survey on Agentic Security: Applications, Threats and Defenses

Authors: Asif Shahriar, Md Nafiu Rahman, Sadif Ahmed, Farig Sadeque, Md Rizwan Parvez
Subjects: cs.CL, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2510.06445
Pdf URL: https://arxiv.org/pdf/2510.06445
Copy Paste: [[2510.06445]] A Survey on Agentic Security: Applications, Threats and Defenses(https://arxiv.org/abs/2510.06445)
Keywords: llm, agent
Abstract: The rapid shift from passive LLMs to autonomous LLM-agents marks a new paradigm in cybersecurity. While these agents can act as powerful tools for both offensive and defensive operations, the very agentic context introduces a new class of inherent security risks. In this work we present the first holistic survey of the agentic security landscape, structuring the field around three interdependent pillars: Applications, Threats, and Defenses. We provide a comprehensive taxonomy of over 150 papers, explaining how agents are used, the vulnerabilities they possess, and the countermeasures designed to protect them. A detailed cross-cutting analysis shows emerging trends in agent architecture while revealing critical research gaps in model and modality coverage.
摘要：从被动的法学硕士到自主的法学硕士代理的快速转变标志着网络安全的新范式。虽然这些代理可以充当进攻和防御行动的强大工具，但非常代理的环境引入了一种新的固有安全风险。在这项工作中，我们首次对代理安全格局进行了全面调查，围绕三个相互依赖的支柱构建了该领域：应用程序、威胁和防御。我们提供了 150 多篇论文的全面分类，解释了代理的使用方式、它们所具有的漏洞以及旨在保护它们的对策。详细的横切分析显示了代理架构的新兴趋势，同时揭示了模型和模态覆盖方面的关键研究差距。

Title: Webscale-RL: Automated Data Pipeline for Scaling RL Data to Pretraining Levels

Authors: Zhepeng Cen, Haolin Chen, Shiyu Wang, Zuxin Liu, Zhiwei Liu, Ding Zhao, Silvio Savarese, Caiming Xiong, Huan Wang, Weiran Yao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.06499
Pdf URL: https://arxiv.org/pdf/2510.06499
Copy Paste: [[2510.06499]] Webscale-RL: Automated Data Pipeline for Scaling RL Data to Pretraining Levels(https://arxiv.org/abs/2510.06499)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have achieved remarkable success through imitation learning on vast text corpora, but this paradigm creates a training-generation gap and limits robust reasoning. Reinforcement learning (RL) offers a more data-efficient solution capable of bridging this gap, yet its application has been constrained by a critical data bottleneck: existing RL datasets are orders of magnitude smaller and less diverse than web-scale pre-training corpora. To address this, we introduce the Webscale-RL pipeline, a scalable data engine that systematically converts large-scale pre-training documents into millions of diverse, verifiable question-answer pairs for RL. Using this pipeline, we construct the Webscale-RL dataset, containing 1.2 million examples across more than 9 domains. Our experiments show that the model trained on this dataset significantly outperforms continual pretraining and strong data refinement baselines across a suite of benchmarks. Notably, RL training with our dataset proves substantially more efficient, achieving the performance of continual pre-training with up to 100$\times$ fewer tokens. Our work presents a viable path toward scaling RL to pre-training levels, enabling more capable and efficient language models.
摘要：大型语言模型（LLM）通过对大量文本语料库的模仿学习取得了显着的成功，但这种范式造成了训练代沟并限制了稳健的推理。强化学习 (RL) 提供了一种数据效率更高的解决方案，能够弥补这一差距，但其应用受到关键数据瓶颈的限制：现有的 RL 数据集比网络规模的预训练语料库要小几个数量级，多样性也较低。为了解决这个问题，我们引入了 Webscale-RL 管道，这是一个可扩展的数据引擎，可以系统地将大规模预训练文档转换为数百万个多样化的、可验证的 RL 问答对。使用此管道，我们构建了 Webscale-RL 数据集，其中包含跨 9 个以上领域的 120 万个示例。我们的实验表明，在该数据集上训练的模型在一系列基准测试中显着优于连续预训练和强大的数据细化基线。值得注意的是，事实证明，使用我们的数据集进行 RL 训练的效率显着提高，用最多 100$\times$ 的标记实现了连续预训练的性能。我们的工作提供了一条将强化学习扩展到预训练水平的可行途径，从而实现更强大、更高效的语言模型。

Title: From Acceleration to Saturation: Scaling Behavior of Bootstrapped Language Model Pretraining

Authors: Seng Pei Liew, Takuya Kato
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2510.06548
Pdf URL: https://arxiv.org/pdf/2510.06548
Copy Paste: [[2510.06548]] From Acceleration to Saturation: Scaling Behavior of Bootstrapped Language Model Pretraining(https://arxiv.org/abs/2510.06548)
Keywords: language model
Abstract: Bootstrapped pretraining, i.e., the reuse of a pretrained base model for further pretraining, such as continual pretraining or model growth, is promising for reducing the cost of training language models from scratch. However, its effectiveness remains unclear, especially when applied to overtrained base models. In this work, we empirically study the scaling behavior of bootstrapped pretraining and find that its scaling efficiency diminishes in a predictable manner: the scaling exponent with respect to second-stage pretraining tokens decreases logarithmically with the number of tokens used to pretrain the base model. The joint dependence on first- and second-stage tokens is accurately modeled by a simple scaling law. This saturation effect reveals a fundamental trade-off in multi-stage pretraining strategies: the more extensively a model is pretrained, the less additional benefit bootstrapping provides. Our findings provide practical insights for efficient language model training and raise important considerations for the reuse of overtrained models.
摘要：自举预训练，即重用预训练的基础模型进行进一步预训练，例如持续预训练或模型增长，有望降低从头开始训练语言模型的成本。然而，其有效性仍不清楚，特别是当应用于过度训练的基础模型时。在这项工作中，我们凭经验研究了自举预训练的缩放行为，发现其缩放效率以可预测的方式减小：第二阶段预训练标记的缩放指数随着用于预训练基本模型的标记数量呈对数减少。对第一阶段和第二阶段代币的联合依赖性通过简单的缩放定律精确建模。这种饱和效应揭示了多阶段预训练策略的基本权衡：模型预训练的范围越广，自举提供的额外好处就越少。我们的研究结果为有效的语言模型训练提供了实用的见解，并为重用过度训练的模型提出了重要的考虑因素。

Title: Flipping the Dialogue: Training and Evaluating User Language Models

Authors: Tarek Naous, Philippe Laban, Wei Xu, Jennifer Neville
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.06552
Pdf URL: https://arxiv.org/pdf/2510.06552
Copy Paste: [[2510.06552]] Flipping the Dialogue: Training and Evaluating User Language Models(https://arxiv.org/abs/2510.06552)
Keywords: language model, gpt, llm, prompt
Abstract: Conversations with LMs involve two participants: a human user leading the conversation, and an LM assistant responding to the user's request. To satisfy this specific role, LMs are post-trained to be helpful assistants -- optimized to produce exhaustive and well-structured responses, free of ambiguity and grammar errors. User utterances, on the other hand, are rarely perfected, with each user phrasing requests in unique ways, sometimes putting in partial effort at each turn and refining on the fly. To evaluate LM performance in realistic settings, prior work simulated users in multi-turn conversations, often prompting an LLM originally trained to be a helpful assistant to act as a user. However, we show that assistant LMs make for poor user simulators, with the surprising finding that better assistants yield worse simulators. Instead, we introduce purpose-built User Language Models (User LMs) -- models post-trained to simulate human users in multi-turn conversations. Through various evaluations, we show how User LMs align better with human behavior and achieve better simulation robustness than existing simulation methods. When leveraging User LMs to simulate coding and math conversations, the performance of a strong assistant (GPT-4o) drops from 74.6% to 57.4%, confirming that more realistic simulation environments cause assistants to struggle, as they fail to cope with the nuances of users in multi-turn setups.
摘要：与 LM 的对话涉及两个参与者：引导对话的人类用户和响应用户请求的 LM 助手。为了满足这一特定角色，LM 经过后期培训，成为有用的助手——经过优化，可以生成详尽且结构良好的响应，没有歧义和语法错误。另一方面，用户的话语很少是完美的，每个用户都以独特的方式表达请求，有时每次都会投入部分精力并即时完善。为了在现实环境中评估 LM 的性能，之前的工作在多轮对话中模拟了用户，通常会促使原本被训练为有用助手的法学硕士充当用户。然而，我们表明助理 LM 会导致较差的用户模拟器，令人惊讶的是更好的助手会产生更差的模拟器。相反，我们引入了专门构建的用户语言模型（用户 LM）——经过后训练的模型，用于在多轮对话中模拟人类用户。通过各种评估，我们展示了用户 LM 如何更好地与人类行为保持一致，并实现比现有模拟方法更好的模拟鲁棒性。当利用用户 LM 模拟编码和数学对话时，强大助手 (GPT-4o) 的性能从 74.6% 下降到 57.4%，这证实了更真实的模拟环境会导致助手陷入困境，因为它们无法应对多回合设置中用户的细微差别。

Title: The Algebra of Meaning: Why Machines Need Montague More Than Moore's Law

Authors: Cheonkam Jeong, Sungdo Kim, Jewoo Park
Subjects: cs.CL, cs.AI, cs.LO
Abstract URL: https://arxiv.org/abs/2510.06559
Pdf URL: https://arxiv.org/pdf/2510.06559
Copy Paste: [[2510.06559]] The Algebra of Meaning: Why Machines Need Montague More Than Moore's Law(https://arxiv.org/abs/2510.06559)
Keywords: language model, hallucination
Abstract: Contemporary language models are fluent yet routinely mis-handle the types of meaning their outputs entail. We argue that hallucination, brittle moderation, and opaque compliance outcomes are symptoms of missing type-theoretic semantics rather than data or scale limitations. Building on Montague's view of language as typed, compositional algebra, we recast alignment as a parsing problem: natural-language inputs must be compiled into structures that make explicit their descriptive, normative, and legal dimensions under context. We present Savassan, a neuro-symbolic architecture that compiles utterances into Montague-style logical forms and maps them to typed ontologies extended with deontic operators and jurisdictional contexts. Neural components extract candidate structures from unstructured inputs; symbolic components perform type checking, constraint reasoning, and cross-jurisdiction mapping to produce compliance-aware guidance rather than binary censorship. In cross-border scenarios, the system "parses once" (e.g., defect claim(product x, company y)) and projects the result into multiple legal ontologies (e.g., defamation risk in KR/JP, protected opinion in US, GDPR checks in EU), composing outcomes into a single, explainable decision. This paper contributes: (i) a diagnosis of hallucination as a type error; (ii) a formal Montague-ontology bridge for business/legal reasoning; and (iii) a production-oriented design that embeds typed interfaces across the pipeline. We outline an evaluation plan using legal reasoning benchmarks and synthetic multi-jurisdiction suites. Our position is that trustworthy autonomy requires compositional typing of meaning, enabling systems to reason about what is described, what is prescribed, and what incurs liability within a unified algebra of meaning.
摘要：当代语言模型很流畅，但经常错误地处理其输出所蕴含的含义类型。我们认为，幻觉、脆弱的节制和不透明的合规结果是类型理论语义缺失的症状，而不是数据或规模限制的症状。基于蒙塔古将语言视为类型化、组合代数的观点，我们将对齐重新定义为解析问题：自然语言输入必须编译成结构，在上下文中明确其描述性、规范性和法律维度。我们提出了 Savassan，一种神经符号架构，它将话语编译成蒙塔古式的逻辑形式，并将它们映射到用道义运算符和管辖上下文扩展的类型本体。神经组件从非结构化输入中提取候选结构；符号组件执行类型检查、约束推理和跨管辖区映射，以生成合规意识指导而不是二元审查。在跨境场景中，系统“解析一次”（例如，缺陷索赔（产品x，公司y））并将结果投影到多个法律本体中（例如，韩国/日本的诽谤风险，美国的受保护意见，欧盟的GDPR检查），将结果组合成单个可解释的决策。本文贡献：（i）将幻觉诊断为类型错误； (ii) 用于商业/法律推理的正式蒙塔古本体桥； (iii) 面向生产的设计，在整个管道中嵌入类型化接口。我们使用法律推理基准和综合多司法管辖区套件概述了评估计划。我们的立场是，值得信赖的自治需要意义的组合类型，使系统能够在统一的意义代数内推理所描述的内容、规定的内容以及产生责任的内容。

Title: TinyScientist: An Interactive, Extensible, and Controllable Framework for Building Research Agents

Authors: Haofei Yu, Keyang Xuan, Fenghai Li, Kunlun Zhu, Zijie Lei, Jiaxun Zhang, Ziheng Qi, Kyle Richardson, Jiaxuan You
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.06579
Pdf URL: https://arxiv.org/pdf/2510.06579
Copy Paste: [[2510.06579]] TinyScientist: An Interactive, Extensible, and Controllable Framework for Building Research Agents(https://arxiv.org/abs/2510.06579)
Keywords: language model, llm, agent
Abstract: Automatic research with Large Language Models (LLMs) is rapidly gaining importance, driving the development of increasingly complex workflows involving multi-agent systems, planning, tool usage, code execution, and human-agent interaction to accelerate research processes. However, as more researchers and developers begin to use and build upon these tools and platforms, the complexity and difficulty of extending and maintaining such agentic workflows have become a significant challenge, particularly as algorithms and architectures continue to advance. To address this growing complexity, TinyScientist identifies the essential components of the automatic research workflow and proposes an interactive, extensible, and controllable framework that easily adapts to new tools and supports iterative growth. We provide an open-source codebase, an interactive web demonstration, and a PyPI Python package to make state-of-the-art auto-research pipelines broadly accessible to every researcher and developer.
摘要：使用大型语言模型 (LLM) 的自动研究正在迅速变得越来越重要，推动了涉及多代理系统、规划、工具使用、代码执行和人机交互的日益复杂的工作流程的发展，以加速研究过程。然而，随着越来越多的研究人员和开发人员开始使用和构建这些工具和平台，扩展和维护此类代理工作流程的复杂性和难度已成为一项重大挑战，特别是随着算法和架构的不断进步。为了解决这种日益增长的复杂性，TinyScientist 确定了自动研究工作流程的基本组成部分，并提出了一个交互式、可扩展和可控的框架，该框架可以轻松适应新工具并支持迭代增长。我们提供开源代码库、交互式 Web 演示和 PyPI Python 包，使每个研究人员和开发人员都可以广泛访问最先进的自动研究管道。

Title: Do Internal Layers of LLMs Reveal Patterns for Jailbreak Detection?

Authors: Sri Durga Sai Sowmya Kadali, Evangelos E. Papalexakis
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.06594
Pdf URL: https://arxiv.org/pdf/2510.06594
Copy Paste: [[2510.06594]] Do Internal Layers of LLMs Reveal Patterns for Jailbreak Detection?(https://arxiv.org/abs/2510.06594)
Keywords: language model, gpt, llm, prompt
Abstract: Jailbreaking large language models (LLMs) has emerged as a pressing concern with the increasing prevalence and accessibility of conversational LLMs. Adversarial users often exploit these models through carefully engineered prompts to elicit restricted or sensitive outputs, a strategy widely referred to as jailbreaking. While numerous defense mechanisms have been proposed, attackers continuously develop novel prompting techniques, and no existing model can be considered fully resistant. In this study, we investigate the jailbreak phenomenon by examining the internal representations of LLMs, with a focus on how hidden layers respond to jailbreak versus benign prompts. Specifically, we analyze the open-source LLM GPT-J and the state-space model Mamba2, presenting preliminary findings that highlight distinct layer-wise behaviors. Our results suggest promising directions for further research on leveraging internal model dynamics for robust jailbreak detection and defense.
摘要：随着会话式 LLM 的日益流行和可及性，越狱大型语言模型 (LLM) 已成为一个紧迫的问题。敌对用户经常通过精心设计的提示来利用这些模型来引出受限或敏感的输出，这种策略被广泛称为越狱。尽管已经提出了许多防御机制，但攻击者不断开发新的提示技术，并且没有任何现有模型可以被认为是完全抵抗的。在这项研究中，我们通过检查法学硕士的内部表征来研究越狱现象，重点是隐藏层如何响应越狱与良性提示。具体来说，我们分析了开源 LLM GPT-J 和状态空间模型 Mamba2，提出了突出不同分层行为的初步发现。我们的结果为利用内部模型动态进行稳健的越狱检测和防御的进一步研究提供了有希望的方向。

Title: Aligning Large Language Models via Fully Self-Synthetic Data

Authors: Shangjian Yin, Zhepei Wei, Xinyu Zhu, Wei-Lin Chen, Yu Meng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.06652
Pdf URL: https://arxiv.org/pdf/2510.06652
Copy Paste: [[2510.06652]] Aligning Large Language Models via Fully Self-Synthetic Data(https://arxiv.org/abs/2510.06652)
Keywords: language model, gpt, llm, prompt, chat
Abstract: Traditional reinforcement learning from human feedback (RLHF) for large language models (LLMs) relies on expensive human-annotated datasets, while Reinforcement Learning from AI Feedback (RLAIF) also incurs significant costs, requiring the collection of diverse prompts and corresponding responses, often necessitating external reward models or proprietary models like GPT-4 to annotate preference pairs. In this work, we introduce Self-Alignment Optimization (SAO), a fully self-synthetic framework for LLM alignment, where all training data, including prompts (i.e., user queries), responses, and preferences, are generated by the model itself. Specifically, SAO first instructs the LLM to engage in persona role-play and generate diverse prompts and responses, which are then self-evaluated for preference optimization. Extensive experiments demonstrate that SAO effectively enhances the model's chat capabilities on standard benchmarks like AlpacaEval 2.0, while maintaining strong performance on downstream objective tasks (e.g., question-answering, math reasoning). Our work provides a practical solution for self-improvement in aligning LLMs, and the code for reproducing our results is available at: this https URL.
摘要：用于大型语言模型 (LLM) 的传统人类反馈强化学习 (RLHF) 依赖于昂贵的人工注释数据集，而人工智能反馈强化学习 (RLAIF) 也会产生巨大成本，需要收集不同的提示和相应的响应，通常需要外部奖励模型或 GPT-4 等专有模型来注释偏好对。在这项工作中，我们引入了自对齐优化（SAO），这是一种用于 LLM 对齐的完全自合成框架，其中所有训练数据，包括提示（即用户查询）、响应和偏好，均由模型本身生成。具体来说，SAO 首先指示 LLM 进行角色扮演并生成不同的提示和响应，然后进行自我评估以进行偏好优化。大量实验表明，SAO 有效增强了模型在 AlpacaEval~2.0 等标准基准上的聊天能力，同时在下游目标任务（例如，问答、数学推理）上保持强劲的性能。我们的工作为调整法学硕士的自我完善提供了一个实用的解决方案，用于重现我们结果的代码可在以下网址找到：此 https URL。

Title: ToolMem: Enhancing Multimodal Agents with Learnable Tool Capability Memory

Authors: Yunzhong Xiao, Yangmin Li, Hewei Wang, Yunlong Tang, Zora Zhiruo Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.06664
Pdf URL: https://arxiv.org/pdf/2510.06664
Copy Paste: [[2510.06664]] ToolMem: Enhancing Multimodal Agents with Learnable Tool Capability Memory(https://arxiv.org/abs/2510.06664)
Keywords: language model, llm, agent
Abstract: Agents utilizing tools powered by large language models (LLMs) or vision-language models (VLMs) have demonstrated remarkable progress in diverse tasks across text and visual modalities. Unlike traditional tools such as calculators, which give deterministic outputs, neural tools perform uncertainly across task scenarios. While different tools for a task may excel in varied scenarios, existing agents typically rely on fixed tools, thus limiting the flexibility in selecting the most suitable tool for specific tasks. In contrast, humans snowball their understanding of the capabilities of different tools by interacting with them, and apply this knowledge to select the optimal tool when solving a future task. To build agents that similarly benefit from this process, we propose ToolMem that enables agents to develop memories of tool capabilities from previous interactions, by summarizing their strengths and weaknesses and storing them in memory; at inference, the agent can retrieve relevant entries from ToolMem, and select the best tool to solve individual tasks more accurately. We evaluate ToolMem on learning varied text generation and text-to-image generation neural tools. Compared to no-memory, generic agents, we find ToolMem-augmented agents predict tool performance 14.8% and 28.7% more accurately across text and multimodal generation scenarios. Moreover, ToolMem facilitates optimal tool selection among multiple choices by 21% and 24% absolute increases in respective scenarios.
摘要：使用由大型语言模型 (LLM) 或视觉语言模型 (VLM) 支持的工具的代理在跨文本和视觉模式的各种任务中表现出了显着的进步。与计算器等提供确定性输出的传统工具不同，神经工具在任务场景中的执行具有不确定性。虽然用于任务的不同工具可能在不同的场景中表现出色，但现有代理通常依赖于固定工具，从而限制了为特定任务选择最合适工具的灵活性。相比之下，人类通过与不同工具的交互，对不同工具的功能的理解不断增加，并在解决未来任务时应用这些知识来选择最佳工具。为了构建同样受益于此过程的智能体，我们提出了 ToolMem，它使智能体能够通过总结其优点和缺点并将其存储在内存中，从以前的交互中开发工具功能的记忆；在推理时，代理可以从 ToolMem 中检索相关条目，并选择最佳工具来更准确地解决各个任务。我们在学习各种文本生成和文本到图像生成神经工具方面评估了 ToolMem。与无记忆的通用代理相比，我们发现 ToolMem 增强代理在文本和多模式生成场景中预测工具性能的准确度分别提高了 14.8% 和 28.7%。此外，ToolMem 促进了多种选择中的最佳工具选择，在各自场景下绝对增加了 21% 和 24%。
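
The store, retrieve, and select loop described in this abstract can be sketched as follows. This is a toy illustration, not the ToolMem implementation: the class name `ToolMemory`, the keyword-overlap retrieval, the numeric scores, and the tool names are all hypothetical stand-ins (the abstract describes summarized strengths and weaknesses rather than raw scores, and does not specify the retrieval method).

```python
from collections import defaultdict


class ToolMemory:
    """Toy capability memory: records how well each tool did on past tasks."""

    def __init__(self):
        self.entries = []  # list of (tool_name, task_words, score)

    def add(self, tool, task, score):
        """Store one past interaction as a bag of task words plus a quality score."""
        self.entries.append((tool, set(task.lower().split()), score))

    def retrieve(self, task, top_k=2):
        """Return up to top_k entries that share the most words with the new task."""
        query = set(task.lower().split())
        scored = [(len(query & words), tool, score)
                  for tool, words, score in self.entries]
        scored = [item for item in scored if item[0] > 0]
        scored.sort(key=lambda item: item[0], reverse=True)
        return scored[:top_k]

    def select_tool(self, task, top_k=2):
        """Pick the tool with the best mean score among retrieved entries."""
        by_tool = defaultdict(list)
        for _, tool, score in self.retrieve(task, top_k):
            by_tool[tool].append(score)
        if not by_tool:
            return None  # nothing relevant in memory
        return max(by_tool, key=lambda t: sum(by_tool[t]) / len(by_tool[t]))


# Hypothetical tools and scores, invented for illustration.
memory = ToolMemory()
memory.add("tool_a", "render text inside an image", 0.2)
memory.add("tool_b", "render text inside an image", 0.9)
memory.add("tool_a", "photorealistic landscape image", 0.8)
memory.add("tool_b", "photorealistic landscape image", 0.4)

print(memory.select_tool("poster image with readable text"))
print(memory.select_tool("wide landscape image"))
```

The point of the sketch is the design choice the abstract highlights: tool choice is conditioned on the task at hand rather than fixed in advance, so the same agent can prefer different tools for different requests.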

Title: PIKA: Expert-Level Synthetic Datasets for Post-Training Alignment from Scratch

Authors: Shangjian Yin, Shining Liang, Wenbiao Ding, Yuli Qian, Zhouxing Shi, Hongzhi Li, Yutao Xie
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.06670
Pdf URL: https://arxiv.org/pdf/2510.06670
Copy Paste: [[2510.06670]] PIKA: Expert-Level Synthetic Datasets for Post-Training Alignment from Scratch(https://arxiv.org/abs/2510.06670)
Keywords: language model, llm
Abstract: Reinforcement Learning from Human Feedback (RLHF) has become a cornerstone for aligning large language models (LLMs). However, its effectiveness depends on high-quality instruction data. Most existing alignment datasets are either private or require costly human annotation, which limits reproducibility and scalability. Even with Reinforcement Learning from AI Feedback (RLAIF), concerns about data quality remain. Moreover, it is unclear how much data is actually required to fine-tune a base model into a strong instruction-following model. Current approaches often rely on over 300k examples even at the supervised fine-tuning (SFT) stage, yet they still underperform compared to proprietary models, creating barriers for academic and resource-limited communities. To address this gap, we introduce PiKa, a data-efficient family of expert-level alignment datasets. In particular, the PiKa-SFT dataset uses only 30k SFT examples, far fewer than state-of-the-art datasets like Magpie. Through evaluations by fine-tuning Llama-3-8B-Base on PiKa and other public datasets, we show that PiKa-SFT outperforms models trained on much larger data. On AlpacaEval 2.0 and Arena-Hard benchmarks, PiKa-SFT fine-tuning even surpasses the official Llama-3-8B-Instruct model trained on over 10 million proprietary examples. We further extend our study by training the Qwen2.5 series (0.5B to 7B) on PiKa-SFT, achieving consistent gains. These findings demonstrate that high-quality alignment can be achieved with significantly less data, offering a scalable path for open-source LLM alignment. Code and data: this https URL.
摘要：来自人类反馈的强化学习 (RLHF) 已成为调整大型语言模型 (LLM) 的基石。然而，其有效性取决于高质量的指令数据。大多数现有的比对数据集要么是私有的，要么需要昂贵的人工注释，这限制了可重复性和可扩展性。即使有了人工智能反馈强化学习 (RLAIF)，对数据质量的担忧仍然存在。此外，尚不清楚实际上需要多少数据才能将基本模型微调为强大的指令跟踪模型。当前的方法即使在监督微调 (SFT) 阶段也通常依赖超过 30 万个示例，但与专有模型相比，它们的性能仍然较差，为学术界和资源有限的社区造成了障碍。为了解决这一差距，我们引入了 PiKa，这是一个数据高效的专家级对齐数据集系列。特别是，PiKa-SFT 数据集仅使用 30k SFT 示例，远少于 Magpie 等最先进的数据集。通过在 PiKa 和其他公共数据集上微调 Llama-3-8B-Base 进行评估，我们表明 PiKa-SFT 优于在更大数据上训练的模型。在 AlpacaEval 2.0 和 Arena-Hard 基准测试中，PiKa-SFT 微调甚至超越了经过超过 1000 万个专有示例训练的官方 Llama-3-8B-Instruct 模型。我们通过在 PiKa-SFT 上训练 Qwen2.5 系列（0.5B 到 7B）进一步扩展我们的研究，取得了一致的收益。这些发现表明，可以用更少的数据实现高质量的对齐，为开源 LLM 对齐提供了一条可扩展的路径。代码和数据：此 https URL。

Title: Incremental Summarization for Customer Support via Progressive Note-Taking and Agent Feedback

Authors: Yisha Wu, Cen (Mia)Zhao, Yuanpei Cao, Xiaoqing Su, Yashar Mehdad, Mindy Ji, Claire Na Cheng
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.06677
Pdf URL: https://arxiv.org/pdf/2510.06677
Copy Paste: [[2510.06677]] Incremental Summarization for Customer Support via Progressive Note-Taking and Agent Feedback(https://arxiv.org/abs/2510.06677)
Keywords: agent
Abstract: We introduce an incremental summarization system for customer support agents that intelligently determines when to generate concise bullet notes during conversations, reducing agents' context-switching effort and redundant review. Our approach combines a fine-tuned Mixtral-8x7B model for continuous note generation with a DeBERTa-based classifier to filter trivial content. Agent edits refine the online notes generation and regularly inform offline model retraining, closing the agent edits feedback loop. Deployed in production, our system achieved a 3% reduction in case handling time compared to bulk summarization (with reductions of up to 9% in highly complex cases), alongside high agent satisfaction ratings from surveys. These results demonstrate that incremental summarization with continuous feedback effectively enhances summary quality and agent productivity at scale.
摘要：我们为客户支持代理引入了增量摘要系统，该系统可以智能地确定何时在对话期间生成简洁的项目符号注释，从而减少代理的上下文切换工作和冗余审查。我们的方法将用于连续笔记生成的微调 Mixtral-8x7B 模型与用于过滤琐碎内容的基于 DeBERTa 的分类器相结合。代理编辑完善在线注释生成，并定期通知离线模型重新训练，从而关闭代理编辑反馈循环。在生产中部署后，与批量汇总相比，我们的系统将案例处理时间缩短了 3%（在高度复杂的案例中，案例处理时间缩短了 9%），同时调查显示，客服人员满意度很高。这些结果表明，具有持续反馈的增量摘要可以有效地大规模提高摘要质量和代理生产力。

Title: Learning to Rewrite Prompts for Bootstrapping LLMs on Downstream Tasks

Authors: Qinhao Zhou, Xiang Xiang, Kun He, John E. Hopcroft
Subjects: cs.CL, cs.AI, cs.LG, eess.AS
Abstract URL: https://arxiv.org/abs/2510.06695
Pdf URL: https://arxiv.org/pdf/2510.06695
Copy Paste: [[2510.06695]] Learning to Rewrite Prompts for Bootstrapping LLMs on Downstream Tasks(https://arxiv.org/abs/2510.06695)
Keywords: language model, llm, prompt
Abstract: In recent years, the growing interest in Large Language Models (LLMs) has significantly advanced prompt engineering, transitioning from manual design to model-based optimization. Prompts for LLMs generally comprise two components: the instruction, which defines the task or objective, and the input, which is tailored to the instruction type. In natural language generation (NLG) tasks such as machine translation, the input component is particularly critical, while the instruction component tends to be concise. Existing prompt engineering methods primarily focus on optimizing the instruction component for general tasks, often requiring large-parameter LLMs as auxiliary tools. However, these approaches exhibit limited applicability for tasks like machine translation, where the input component plays a more pivotal role. To address this limitation, this paper introduces a novel prompt optimization method specifically designed for machine translation tasks. The proposed approach employs a small-parameter model trained using a back-translation-based strategy, significantly reducing training overhead for single-task optimization while delivering highly effective performance. With certain adaptations, this method can also be extended to other downstream tasks.
摘要：近年来，人们对大型语言模型 (LLM) 的兴趣日益浓厚，显着推进了即时工程，从手动设计过渡到基于模型的优化。 LLM 的提示通常包含两个部分：\textit{instruction}（定义任务或目标）和 \textit{input}（根据指令类型定制）。在机器翻译等自然语言生成（NLG）任务中，\textit{input}组件尤为关键，而\textit{instruction}组件则趋于简洁。现有的提示工程方法主要侧重于优化一般任务的 \textit{instruction} 组件，通常需要大参数 LLM 作为辅助工具。然而，这些方法对于机器翻译等任务的适用性有限，其中 \textit{input} 组件起着更关键的作用。为了解决这个限制，本文介绍了一种专为机器翻译任务设计的新颖的提示优化方法。所提出的方法采用了使用基于反向翻译的策略进行训练的小参数模型，显着减少了单任务优化的训练开销，同时提供了高效的性能。通过一定的调整，该方法还可以扩展到其他下游任务。

Title: How Language Models Conflate Logical Validity with Plausibility: A Representational Analysis of Content Effects

Authors: Leonardo Bertolazzi, Sandro Pezzelle, Raffaelle Bernardi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.06700
Pdf URL: https://arxiv.org/pdf/2510.06700
Copy Paste: [[2510.06700]] How Language Models Conflate Logical Validity with Plausibility: A Representational Analysis of Content Effects(https://arxiv.org/abs/2510.06700)
Keywords: language model, llm
Abstract: Both humans and large language models (LLMs) exhibit content effects: biases in which the plausibility of the semantic content of a reasoning problem influences judgments regarding its logical validity. While this phenomenon in humans is best explained by the dual-process theory of reasoning, the mechanisms behind content effects in LLMs remain unclear. In this work, we address this issue by investigating how LLMs encode the concepts of validity and plausibility within their internal representations. We show that both concepts are linearly represented and strongly aligned in representational geometry, leading models to conflate plausibility with validity. Using steering vectors, we demonstrate that plausibility vectors can causally bias validity judgements, and vice versa, and that the degree of alignment between these two concepts predicts the magnitude of behavioral content effects across models. Finally, we construct debiasing vectors that disentangle these concepts, reducing content effects and improving reasoning accuracy. Our findings advance understanding of how abstract logical concepts are represented in LLMs and highlight representational interventions as a path toward more logical systems.
摘要：人类和大型语言模型 (LLM) 都表现出内容效应：推理问题的语义内容的合理性影响对其逻辑有效性的判断的偏差。虽然双过程推理理论可以最好地解释人类的这种现象，但法学硕士内容效应背后的机制仍不清楚。在这项工作中，我们通过研究法学硕士如何在其内部表征中编码有效性和合理性的概念来解决这个问题。我们表明，这两个概念都是线性表示的，并且在表征几何中强烈对齐，导致模型将合理性与有效性混为一谈。使用引导向量，我们证明合理性向量可能会导致有效性判断产生偏差，反之亦然，并且这两个概念之间的一致性程度可以预测跨模型的行为内容影响的大小。最后，我们构建去偏向量来理清这些概念，减少内容影响并提高推理准确性。我们的研究结果促进了对法学硕士如何表达抽象逻辑概念的理解，并强调代表性干预是通往更逻辑系统的道路。

Title: Scaling LLM Multi-turn RL with End-to-end Summarization-based Context Management

Authors: Miao Lu, Weiwei Sun, Weihua Du, Zhan Ling, Xuesong Yao, Kang Liu, Jiecao Chen
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.06727
Pdf URL: https://arxiv.org/pdf/2510.06727
Copy Paste: [[2510.06727]] Scaling LLM Multi-turn RL with End-to-end Summarization-based Context Management(https://arxiv.org/abs/2510.06727)
Keywords: language model, llm, agent
Abstract: We study reinforcement learning (RL) fine-tuning of large language model (LLM) agents for long-horizon multi-turn tool use, where context length quickly becomes a fundamental bottleneck. Existing RL pipelines can suffer from degraded instruction following, excessive rollout costs, and most importantly, strict context limits. To address these challenges, we introduce summarization-based context management to training. Specifically, it periodically compresses the tool-use history into LLM-generated summaries that retain task-relevant information, keeping the context compact while enabling the agent to scale beyond the fixed context window. Building on this formulation, we derive a policy gradient representation that seamlessly enables standard LLM RL infrastructures to optimize both tool-use behaviors and summarization strategies in an end-to-end fashion. We instantiate this framework with SUmmarization augmented Policy Optimization (SUPO), an LLM RL algorithm that enables long-horizon training beyond a fixed context limit. Experiments on interactive function calling and searching tasks demonstrate that SUPO significantly improves the success rate while maintaining the same or even lower working context length compared to baselines. We also demonstrate that for complex searching tasks, SUPO can further improve the evaluation performance when the test-time maximum number of summarization rounds is scaled beyond that used at training time. Our results establish summarization-based context management as a principled and scalable approach for training RL agents beyond a fixed context length limit.
摘要：我们研究了大语言模型（LLM）代理的强化学习（RL）微调，以用于长期多轮工具的使用，其中上下文长度很快成为一个基本瓶颈。现有的强化学习管道可能会受到指令跟踪性能下降、部署成本过高以及最重要的是严格的上下文限制的影响。为了应对这些挑战，我们在培训中引入了基于摘要的上下文管理。具体来说，它定期使用 LLM 生成的摘要历史记录来压缩工具，这些摘要保留任务相关信息以保持紧凑的上下文，同时使代理能够扩展到固定上下文窗口之外。在此基础上，我们得出了一种策略梯度表示，可以无缝地支持标准 LLM RL 基础设施，以端到端的方式优化工具使用行为和摘要策略。我们用 \underline{SU}mmarization 增强 \underline{P}olicy \underline{O}ptimization (\texttt{SUPO}) 实例化这个框架，这是一种 LLM RL 算法，可以实现超出固定上下文限制的长视野训练。交互式函数调用和搜索任务的实验表明，与基线相比，\texttt{SUPO} 显着提高了成功率，同时保持相同甚至更低的工作上下文长度。我们还证明，对于复杂的搜索任务，当将测试时间最大轮次概括扩展到训练时间之外时，\texttt{SUPO} 可以进一步提高评估性能。我们的结果将基于摘要的上下文管理确立为一种有原则且可扩展的方法，用于训练超出固定上下文长度限制的 RL 代理。

Title: PTEB: Towards Robust Text Embedding Evaluation via Stochastic Paraphrasing at Evaluation Time with LLMs

Authors: Manuel Frank, Haithem Afli
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.06730
Pdf URL: https://arxiv.org/pdf/2510.06730
Copy Paste: [[2510.06730]] PTEB: Towards Robust Text Embedding Evaluation via Stochastic Paraphrasing at Evaluation Time with LLMs(https://arxiv.org/abs/2510.06730)
Keywords: llm
Abstract: Current evaluations of sentence embedding models typically rely on static test beds such as the Massive Text Embedding Benchmark (MTEB). While invaluable, repeated tuning on a fixed suite can inflate reported performance and obscure real-world robustness. We introduce the Paraphrasing Text Embedding Benchmark (PTEB), a dynamic protocol that stochastically generates meaning-preserving paraphrases at evaluation time and aggregates results across multiple runs. Using a cost-efficient LLM-based method grounded in semantic textual similarity gold ratings, we show that LLMs generate token-diverse but semantically preserving paraphrases. Across 7 MTEB tasks, we validate our hypothesis that the performance of sentence encoders is sensitive to changes in token space even when semantics remain fixed. We also observe that smaller models are not disproportionately affected relative to larger ones. Our results are statistically robust over multiple runs, and we extend our experiments to 3 multilingual datasets covering 10 languages. More generally, we aim to propose a new evaluation paradigm in NLP that relies less on static, pre-defined benchmarks and shifts towards dynamic, stochastic evaluation leveraging eval-time compute.
摘要：当前对句子嵌入模型的评估通常依赖于静态测试床，例如大规模文本嵌入基准（MTEB）。虽然非常有价值，但对固定套件的重复调整可能会夸大报告的性能并掩盖现实世界的稳健性。我们引入了释义文本嵌入基准（PTEB），这是一种动态协议，可以在评估时随机生成保留意义的释义，并汇总多次运行的结果。使用基于语义文本相似性黄金评级的经济高效的基于 LLM 的方法，我们表明 LLM 生成标记多样化但语义保留的释义。在 7 个 MTEB 任务中，我们验证了我们的假设，即即使语义保持固定，句子编码器的性能也会对标记空间的变化敏感。我们还观察到，相对于较大的模型，较小的模型并未受到不成比例的影响。我们的结果在多次运行中具有统计稳健性，并且我们将实验扩展到涵盖 10 种语言的 3 个多语言数据集。更一般地说，我们的目标是在 NLP 中提出一种新的评估范式，减少对静态、预定义基准的依赖，而是转向利用评估时间计算的动态、随机评估。

Title: Are LLMs Reliable Rankers? Rank Manipulation via Two-Stage Token Optimization

Authors: Tiancheng Xing, Jerry Li, Yixuan Du, Xiyang Hu
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2510.06732
Pdf URL: https://arxiv.org/pdf/2510.06732
Copy Paste: [[2510.06732]] Are LLMs Reliable Rankers? Rank Manipulation via Two-Stage Token Optimization(https://arxiv.org/abs/2510.06732)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) are increasingly used as rerankers in information retrieval, yet their ranking behavior can be steered by small, natural-sounding prompts. To expose this vulnerability, we present Rank Anything First (RAF), a two-stage token optimization method that crafts concise textual perturbations to consistently promote a target item in LLM-generated rankings while remaining hard to detect. Stage 1 uses Greedy Coordinate Gradient to shortlist candidate tokens at the current position by combining the gradient of the rank-target with a readability score; Stage 2 evaluates those candidates under exact ranking and readability losses using an entropy-based dynamic weighting scheme, and selects a token via temperature-controlled sampling. RAF generates ranking-promoting prompts token-by-token, guided by dual objectives: maximizing ranking effectiveness and preserving linguistic naturalness. Experiments across multiple LLMs show that RAF significantly boosts the rank of target items using naturalistic language, with greater robustness than existing methods in both promoting target items and maintaining naturalness. These findings underscore a critical security implication: LLM-based reranking is inherently susceptible to adversarial manipulation, raising new challenges for the trustworthiness and robustness of modern retrieval systems. Our code is available at: this https URL.
摘要：大型语言模型 (LLM) 越来越多地用作信息检索中的重新排序器，但它们的排序行为可以通过听起来自然的小提示来控制。为了暴露此漏洞，我们提出了 Rank Anything First (RAF)，这是一种两阶段令牌优化方法，可精心设计简洁的文本扰动，以持续提升 LLM 生成的排名中的目标项目，同时保持难以检测的状态。第 1 阶段使用贪婪坐标梯度，通过将排名目标的梯度与可读性得分相结合，来筛选当前位置的候选标记；第 2 阶段使用基于熵的动态加权方案在精确排名和可读性损失下评估这些候选者，并通过温度控制采样选择令牌。 RAF 在双重目标的指导下逐个生成排名提升提示：最大化排名有效性和保持语言自然性。多个法学硕士的实验表明，RAF 使用自然语言显着提高了目标项目的排名，在提升目标项目和保持自然性方面比现有方法具有更强的鲁棒性。这些发现强调了一个关键的安全隐患：基于法学硕士的重新排名本质上容易受到对抗性操纵的影响，这对现代检索系统的可信度和稳健性提出了新的挑战。我们的代码位于：此 https URL。

Title: AWM: Accurate Weight-Matrix Fingerprint for Large Language Models

Authors: Boyi Zeng, Lin Chen, Ziwei He, Xinbing Wang, Zhouhan Lin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.06738
Pdf URL: https://arxiv.org/pdf/2510.06738
Copy Paste: [[2510.06738]] AWM: Accurate Weight-Matrix Fingerprint for Large Language Models(https://arxiv.org/abs/2510.06738)
Keywords: language model, llm
Abstract: Protecting the intellectual property of large language models (LLMs) is crucial, given the substantial resources required for their training. Consequently, there is an urgent need for both model owners and third parties to determine whether a suspect LLM is trained from scratch or derived from an existing base model. However, the intensive post-training processes that models typically undergo (such as supervised fine-tuning, extensive continued pretraining, reinforcement learning, multi-modal extension, pruning, and upcycling) pose significant challenges to reliable identification. In this work, we propose a training-free fingerprinting method based on weight matrices. We leverage the Linear Assignment Problem (LAP) and an unbiased Centered Kernel Alignment (CKA) similarity to neutralize the effects of parameter manipulations, yielding a highly robust and high-fidelity similarity metric. On a comprehensive testbed of 60 positive and 90 negative model pairs, our method demonstrates exceptional robustness against all six aforementioned post-training categories while exhibiting a near-zero risk of false positives. By achieving perfect scores on all classification metrics, our approach establishes a strong basis for reliable model lineage verification. Moreover, the entire computation completes within 30s on an NVIDIA 3090 GPU. The code is available at this https URL.
摘要：鉴于大型语言模型 (LLM) 的培训需要大量资源，保护其知识产权至关重要。因此，模型所有者和第三方都迫切需要确定可疑的法学硕士是从头开始训练还是从现有基础模型衍生而来。然而，模型通常经历的密集的训练后过程——例如监督微调、广泛的持续预训练、强化学习、多模态扩展、修剪和升级——对可靠识别提出了重大挑战。在这项工作中，我们提出了一种基于权重矩阵的免训练指纹识别方法。我们利用线性分配问题（LAP）和无偏中心核对齐（CKA）相似性来抵消参数操作的影响，产生高度稳健和高保真度的相似性度量。在 60 个正模型对和 90 个负模型对的综合测试床上，我们的方法对上述所有六个训练后类别表现出卓越的鲁棒性，同时表现出接近零的误报风险。通过在所有分类指标上获得满分，我们的方法为可靠的模型谱系验证奠定了坚实的基础。此外，整个计算在 NVIDIA 3090 GPU 上可在 30 秒内完成。该代码可从此 https URL 获取。

Title: TWIST: Training-free and Label-free Short Text Clustering through Iterative Vector Updating with LLMs

Authors: I-Fan Lin, Faegheh Hasibi, Suzan Verberne
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.06747
Pdf URL: https://arxiv.org/pdf/2510.06747
Copy Paste: [[2510.06747]] TWIST: Training-free and Label-free Short Text Clustering through Iterative Vector Updating with LLMs(https://arxiv.org/abs/2510.06747)
Keywords: llm, chat
Abstract: In this paper, we propose a training-free and label-free method for short text clustering that can be used on top of any existing embedder. In the context of customer-facing chatbots, companies are dealing with large amounts of user utterances that need to be clustered according to their intent. In these commercial settings, no labeled data is typically available, and the number of clusters is not known. Our method is based on iterative vector updating: it constructs sparse vectors based on representative texts, and then iteratively refines them through LLM guidance. Our method achieves comparable or superior results to state-of-the-art methods that use contrastive learning, but without assuming prior knowledge of clusters or labels. Experiments on diverse datasets and smaller LLMs show that our method is model agnostic and can be applied to any embedder, with relatively small LLMs, and different clustering methods. We also show that our method scales to large datasets, reducing the computational cost of the LLM. These low-resource, adaptable settings and the scalability of our method make it more aligned with real-world scenarios than existing clustering methods.
摘要：在本文中，我们提出了一种用于短文本聚类的免训练和免标签方法，该方法可以在任何现有嵌入器之上使用。在面向客户的聊天机器人的背景下，公司正在处理大量的用户话语，需要根据他们的意图进行聚类。在这些商业环境中，通常没有可用的标记数据，并且簇的数量未知。我们的方法基于迭代向量更新：它根据代表性文本构造稀疏向量，然后通过 LLM 指导迭代地细化它们。我们的方法取得了与使用对比学习的最先进方法相当或更好的结果，但无需假设集群或标签的先验知识。对不同数据集和较小的 LLM 的实验表明，我们的方法与模型无关，可以应用于任何具有相对较小的 LLM 和不同聚类方法的嵌入器。我们还表明，我们的方法可以扩展到大型数据集，从而降低了法学硕士的计算成本。这些资源少、适应性强的设置和我们方法的可扩展性使其比现有的聚类方法更符合现实场景。

Title: Gold-Switch: Training-Free Superposition of Slow- and Fast- Thinking LLMs

Authors: Jaeseong Lee, Dayoung Kwon, seung-won hwang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.06750
Pdf URL: https://arxiv.org/pdf/2510.06750
Copy Paste: [[2510.06750]] Gold-Switch: Training-Free Superposition of Slow- and Fast- Thinking LLMs(https://arxiv.org/abs/2510.06750)
Keywords: llm
Abstract: Large Reasoning Models (LRMs) excel in structured tasks by emulating deliberate human reasoning but often suffer from overthinking, degrading performance and wasting resources. One possible baseline is to deploy both LLM and LRM, then route input by predicting whether it requires reasoning and may cause overthinking. However, deploying multiple models can be costly or impractical. We propose a superposed deployment strategy with a lightweight, training-free regulation to optimize inference by switching one model on and off. Instead of routing, we selectively unlearn from LRM at inference, scaling down computation while preserving reasoning. By analyzing the cumulative energy of singular values, we identify optimal low-rank projections to adjust reasoning just right.
摘要：大型推理模型 (LRM) 通过模拟深思熟虑的人类推理，在结构化任务中表现出色，但经常会出现过度思考、性能下降和资源浪费的问题。一种可能的基线是同时部署 LLM 和 LRM，然后通过预测是否需要推理并可能导致过度思考来路由输入。然而，部署多个模型可能成本高昂或不切实际。我们提出了一种具有轻量级、免训练调节的叠加部署策略，通过打开和关闭一个模型来优化推理。我们没有选择路由，而是在推理时选择性地从 LRM 中学习，在保留推理的同时缩减计算量。通过分析奇异值的累积能量，我们确定了最佳的低秩预测来恰到好处地调整推理。
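
Note: the abstract above says that low-rank projections are chosen by analyzing the cumulative energy of singular values, without giving details. The following is only a minimal sketch of that general idea, not the authors' procedure. It assumes "energy" means the squared singular value and that a rank is picked once a chosen fraction of the total energy is covered; the function name `rank_for_energy` and the default threshold of 0.9 are hypothetical.

```python
from itertools import accumulate


def rank_for_energy(singular_values, threshold=0.9):
    """Return the smallest rank k whose top-k singular values hold at least
    `threshold` of the total squared-singular-value energy.

    Assumption (not from the paper): energy of a singular value s is s**2.
    """
    if not 0.0 < threshold <= 1.0:
        raise ValueError("threshold must be in (0, 1]")
    # Sort energies from largest to smallest so the top-k prefix is well defined.
    energies = sorted((s * s for s in singular_values), reverse=True)
    total = sum(energies)
    if total == 0:
        return 0  # an all-zero spectrum needs no components
    for k, cumulative in enumerate(accumulate(energies), start=1):
        if cumulative / total >= threshold:
            return k
    # Fallback for rounding effects: keep every component.
    return len(energies)


# Hypothetical spectrum: energies are 9, 4, 1, 0.25 (total 14.25).
print(rank_for_energy([3.0, 2.0, 1.0, 0.5], threshold=0.9))
```

With this spectrum the first two components already cover 13 / 14.25 of the energy, so the call selects rank 2 at a 0.9 threshold; raising the threshold keeps more components.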

Title: Adaptive LLM-Symbolic Reasoning via Dynamic Logical Solver Composition

Authors: Lei Xu, Pierre Beckmann, Marco Valentino, André Freitas
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.06774
Pdf URL: https://arxiv.org/pdf/2510.06774
Copy Paste: [[2510.06774]] Adaptive LLM-Symbolic Reasoning via Dynamic Logical Solver Composition(https://arxiv.org/abs/2510.06774)
Keywords: language model, gpt, llm
Abstract: Neuro-symbolic NLP methods aim to leverage the complementary strengths of large language models and formal logical solvers. However, current approaches are mostly static in nature, i.e., the integration of a target solver is predetermined at design time, hindering the ability to employ diverse formal inference strategies. To address this, we introduce an adaptive, multi-paradigm, neuro-symbolic inference framework that: (1) automatically identifies formal reasoning strategies from problems expressed in natural language; and (2) dynamically selects and applies specialized formal logical solvers via autoformalization interfaces. Extensive experiments on individual and multi-paradigm reasoning tasks support the following conclusions: LLMs are effective at predicting the necessary formal reasoning strategies with an accuracy above 90 percent. This enables flexible integration with formal logical solvers, resulting in our framework outperforming competing baselines by 27 percent and 6 percent compared to GPT-4o and DeepSeek-V3.1, respectively. Moreover, adaptive reasoning can even positively impact pure LLM methods, yielding gains of 10, 5, and 6 percent on zero-shot, CoT, and symbolic CoT settings with GPT-4o. Finally, although smaller models struggle with adaptive neuro-symbolic reasoning, post-training offers a viable path to improvement. Overall, this work establishes the foundations for adaptive LLM-symbolic reasoning, offering a path forward for unifying material and formal inferences on heterogeneous reasoning challenges.
摘要：神经符号 NLP 方法旨在利用大型语言模型和形式逻辑求解器的互补优势。然而，当前的方法本质上大多是静态的，即目标求解器的集成是在设计时预先确定的，这阻碍了采用多种形式推理策略的能力。为了解决这个问题，我们引入了一种自适应、多范式、神经符号推理框架，该框架：（1）从自然语言表达的问题中自动识别形式推理策略； (2) 通过自动形式化接口动态选择和应用专门的形式逻辑求解器。对个体和多范式推理任务的大量实验支持以下结论：法学硕士可以有效地预测必要的形式推理策略，准确率超过 90%。这使得我们能够与正式逻辑求解器灵活集成，从而使我们的框架比 GPT-4o 和 DeepSeek-V3.1 的竞争基准分别高出 27% 和 6%。此外，自适应推理甚至可以对纯 LLM 方法产生积极影响，在 GPT-4o 的零样本、CoT 和符号 CoT 设置上产生 10%、5% 和 6% 的增益。最后，尽管较小的模型难以适应自适应神经符号推理，但训练后提供了一条可行的改进途径。总的来说，这项工作为自适应法学硕士符号推理奠定了基础，为统一异构推理挑战的材料和形式推论提供了一条前进的道路。

Title: Foundations of LLM Knowledge Materialization: Termination, Reproducibility, Robustness

Authors: Luca Giordano, Simon Razniewski
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.06780
Pdf URL: https://arxiv.org/pdf/2510.06780
Copy Paste: [[2510.06780]] Foundations of LLM Knowledge Materialization: Termination, Reproducibility, Robustness(https://arxiv.org/abs/2510.06780)
Keywords: language model, gpt, llm
Abstract: Large Language Models (LLMs) encode substantial factual knowledge, yet measuring and systematizing this knowledge remains challenging. Converting it into structured format, for example through recursive extraction approaches such as the GPTKB methodology (Hu et al., 2025b), is still underexplored. Key open questions include whether such extraction can terminate, whether its outputs are reproducible, and how robust they are to variations. We systematically study LLM knowledge materialization using miniGPTKBs (domain-specific, tractable subcrawls), analyzing termination, reproducibility, and robustness across three categories of metrics: yield, lexical similarity, and semantic similarity. We experiment with four variations (seed, language, randomness, model) and three illustrative domains (from history, entertainment, and finance). Our findings show (i) high termination rates, though model-dependent; (ii) mixed reproducibility; and (iii) robustness that varies by perturbation type: high for seeds and temperature, lower for languages and models. These results suggest that LLM knowledge materialization can reliably surface core knowledge, while also revealing important limitations.
摘要：大型语言模型 (LLM) 编码大量事实知识，但测量和系统化这些知识仍然具有挑战性。将其转换为结构化格式，例如通过 GPTKB 方法等递归提取方法（Hu 等人，2025b），仍处于探索之中。关键的悬而未决的问题包括这种提取是否可以终止，其输出是否可重现，以及它们对变化的鲁棒性如何。我们使用 miniGPTKB（特定领域的、易于处理的子抓取）系统地研究 LLM 知识物化，分析三类指标的终止性、再现性和鲁棒性：产量、词汇相似性和语义相似性。我们尝试了四种变体（种子、语言、随机性、模型）和三个说明性领域（来自历史、娱乐和金融）。我们的研究结果表明（i）终止率较高，尽管取决于模型； (ii) 混合再现性； (iii) 稳健性因扰动类型而异：种子和温度较高，语言和模型较低。这些结果表明，法学硕士知识具体化可以可靠地展现核心知识，同时也揭示了重要的局限性。

Title: FURINA: A Fully Customizable Role-Playing Benchmark via Scalable Multi-Agent Collaboration Pipeline

Authors: Haotian Wu, Shufan Jiang, Chios Chen, Yiyang Feng, Hehai Lin, Heqing Zou, Yao Shu, Yanran Li, Chengwei Qin
Subjects: cs.CL, cs.AI, cs.HC, cs.MA
Abstract URL: https://arxiv.org/abs/2510.06800
Pdf URL: https://arxiv.org/pdf/2510.06800
Copy Paste: [[2510.06800]] FURINA: A Fully Customizable Role-Playing Benchmark via Scalable Multi-Agent Collaboration Pipeline(https://arxiv.org/abs/2510.06800)
Keywords: language model, llm, hallucination, prompt, agent
Abstract: As large language models (LLMs) advance in role-playing (RP) tasks, existing benchmarks quickly become obsolete due to their narrow scope, outdated interaction paradigms, and limited adaptability across diverse application scenarios. To address this gap, we introduce FURINA-Builder, a novel multi-agent collaboration pipeline that automatically constructs fully customizable RP benchmarks at any scale. It enables evaluation of arbitrary characters across diverse scenarios and prompt formats, as the first benchmark builder in the RP area for adaptable assessment. FURINA-Builder simulates dialogues between a test character and other characters drawn from a well-constructed character-scene pool, while an LLM judge selects fine-grained evaluation dimensions and adjusts the test character's responses into final test utterances. Using this pipeline, we build FURINA-Bench, a new comprehensive role-playing benchmark featuring both established and synthesized test characters, each assessed with dimension-specific evaluation criteria. Human evaluation and preliminary separability analysis justify our pipeline and benchmark design. We conduct extensive evaluations of cutting-edge LLMs and find that o3 and DeepSeek-R1 achieve the best performance on English and Chinese RP tasks, respectively. Across all models, established characters consistently outperform synthesized ones, with reasoning capabilities further amplifying this disparity. Interestingly, we observe that model scale does not monotonically reduce hallucinations. More critically, for reasoning LLMs, we uncover a novel trade-off: reasoning improves RP performance but simultaneously increases RP hallucinations. This trade-off extends to a broader Pareto frontier between RP performance and reliability for all LLMs. These findings demonstrate the effectiveness of FURINA-Builder and the challenge posed by FURINA-Bench.
摘要：随着大型语言模型（LLM）在角色扮演（RP）任务中的进步，现有的基准测试由于其范围狭窄、过时的交互范式以及跨不同应用场景的适应性有限而迅速过时。为了解决这一差距，我们引入了 FURINA-Builder，这是一种新颖的多智能体协作管道，可以自动构建任何规模的完全可定制的 RP 基准。它可以跨不同场景和提示格式评估任意字符，是 RP 领域第一个用于适应性评估的基准构建器。 FURINA-Builder 模拟测试角色与从精心构建的角色场景池中提取的其他角色之间的对话，而法学硕士法官选择细粒度的评估维度并将测试角色的响应调整为最终的测试话语。使用这个管道，我们构建了 FURINA-Bench，这是一个新的综合角色扮演基准，具有已建立的和综合的测试角色，每个角色都根据特定维度的评估标准进行评估。人工评估和初步可分离性分析证明了我们的流程和基准设计的合理性。我们对前沿的 LLM 进行了广泛的评估，发现 o3 和 DeepSeek-R1 分别在英语和中文 RP 任务上取得了最佳表现。在所有模型中，已建立的角色始终优于合成角色，推理能力进一步放大了这种差异。有趣的是，我们观察到模型规模并不会单调减少幻觉。更重要的是，对于推理法学硕士，我们发现了一种新的权衡：推理提高了 RP 表现，但同时增加了 RP 幻觉。对于所有法学硕士来说，这种权衡延伸到了 RP 性能和可靠性之间更广泛的帕累托边界。这些发现证明了 FURINA-Builder 的有效性以及 FURINA-Bench 带来的挑战。

Title: Overview of the Plagiarism Detection Task at PAN 2025

Authors: André Greiner-Petter, Maik Fröbe, Jan Philip Wahle, Terry Ruas, Bela Gipp, Akiko Aizawa, Martin Potthast
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2510.06805
Pdf URL: https://arxiv.org/pdf/2510.06805
Copy Paste: [[2510.06805]] Overview of the Plagiarism Detection Task at PAN 2025(https://arxiv.org/abs/2510.06805)
Keywords: language model
Abstract: The generative plagiarism detection task at PAN 2025 aims at identifying automatically generated textual plagiarism in scientific articles and aligning it with its respective sources. We created a novel large-scale dataset of automatically generated plagiarism using three large language models: Llama, DeepSeek-R1, and Mistral. In this task overview paper, we outline the creation of this dataset, summarize and compare the results of all participants and four baselines, and evaluate the results on the last plagiarism detection task from PAN 2015 in order to interpret the robustness of the proposed approaches. We found that the current iteration does not invite a large variety of approaches, as naive semantic similarity approaches based on embedding vectors provide promising results of up to 0.8 recall and 0.5 precision. In contrast, most of these approaches underperform significantly on the 2015 dataset, indicating a lack of generalizability.
摘要：PAN 2025 的生成剽窃检测任务旨在识别科学文章中自动生成的文本剽窃行为，并将其与各自的来源进行匹配。我们使用三种大型语言模型创建了一个自动生成的抄袭的新型大规模数据集：Llama、DeepSeek-R1 和 Mistral。在这篇任务概述论文中，我们概述了该数据集的创建，总结并比较了所有参与者和四个基线的结果，并评估了 PAN 2015 的最后一个抄袭检测任务的结果，以解释所提出方法的稳健性。我们发现当前的迭代并没有引入多种方法，因为基于嵌入向量的朴素语义相似性方法提供了高达 0.8 的召回率和 0.5 的精度的有希望的结果。相比之下，大多数这些方法在 2015 年数据集上的表现明显不佳，表明缺乏普遍性。

Title: BlackboxNLP-2025 MIB Shared Task: Exploring Ensemble Strategies for Circuit Localization Methods

Authors: Philipp Mondorf, Mingyang Wang, Sebastian Gerstner, Ahmad Dawar Hakimi, Yihong Liu, Leonor Veloso, Shijia Zhou, Hinrich Schütze, Barbara Plank
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2510.06811
Pdf URL: https://arxiv.org/pdf/2510.06811
Copy Paste: [[2510.06811]] BlackboxNLP-2025 MIB Shared Task: Exploring Ensemble Strategies for Circuit Localization Methods(https://arxiv.org/abs/2510.06811)
Keywords: language model, llm
Abstract: The Circuit Localization track of the Mechanistic Interpretability Benchmark (MIB) evaluates methods for localizing circuits within large language models (LLMs), i.e., subnetworks responsible for specific task behaviors. In this work, we investigate whether ensembling two or more circuit localization methods can improve performance. We explore two variants: parallel and sequential ensembling. In parallel ensembling, we combine attribution scores assigned to each edge by different methods, e.g., by averaging or taking the minimum or maximum value. In the sequential ensemble, we use edge attribution scores obtained via EAP-IG as a warm start for a more expensive but more precise circuit identification method, namely edge pruning. We observe that both approaches yield notable gains on the benchmark metrics, leading to a more precise circuit identification approach. Finally, we find that taking a parallel ensemble over various methods, including the sequential ensemble, achieves the best results. We evaluate our approach in the BlackboxNLP 2025 MIB Shared Task, comparing ensemble scores to official baselines across multiple model-task combinations.
摘要：机械可解释性基准 (MIB) 的电路定位轨道评估大型语言模型 (LLM) 中的电路本地化方法，即负责特定任务行为的子网络。在这项工作中，我们研究了集成两种或多种电路定位方法是否可以提高性能。我们探索两种变体：并行和顺序集成。在并行集成中，我们通过不同的方法组合分配给每个边缘的归因分数 - 例如，通过平均或取最小值或最大值。在顺序集成中，我们使用通过 EAP-IG 获得的边缘归因分数作为更昂贵但更精确的电路识别方法（即边缘修剪）的热启动。我们观察到，这两种方法在基准指标上都有显着的提高，从而形成更精确的电路识别方法。最后，我们发现通过各种方法（包括顺序集成）采用并行集成可以获得最佳结果。我们在 BlackboxNLP 2025 MIB 共享任务中评估我们的方法，将多个模型任务组合的集成分数与官方基线进行比较。
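
Note: the parallel ensembling described in the abstract above (combining per-edge attribution scores from several methods by averaging, or by taking the minimum or maximum) can be illustrated roughly as follows. This is a sketch under stated assumptions, not the authors' code: edges are represented as hypothetical `(source, target)` tuples, each method's output as a plain dict from edge to score, and the function name `parallel_ensemble` is invented for illustration.

```python
from statistics import mean


def parallel_ensemble(score_maps, how="mean"):
    """Combine per-edge attribution scores from several methods.

    score_maps: list of dicts mapping an edge to a score, one dict per method.
    how: "mean", "min", or "max" (the three reductions named in the abstract).
    """
    reducers = {"mean": mean, "min": min, "max": max}
    if how not in reducers:
        raise ValueError(f"unknown reduction: {how!r}")
    if not score_maps:
        raise ValueError("need at least one score map")
    edges = set(score_maps[0])
    # Assumption: every method scores exactly the same set of edges.
    for scores in score_maps[1:]:
        if set(scores) != edges:
            raise ValueError("all methods must score the same edges")
    reduce_fn = reducers[how]
    return {edge: reduce_fn([scores[edge] for scores in score_maps]) for edge in edges}


# Hypothetical scores from two attribution methods on a two-edge graph.
method_a = {("a", "b"): 0.25, ("b", "c"): 1.0}
method_b = {("a", "b"): 0.75, ("b", "c"): 0.5}
print(parallel_ensemble([method_a, method_b], how="mean"))
```

The choice of reduction changes what the ensemble rewards: the minimum keeps an edge only if every method scores it highly, while the maximum keeps an edge if any method does.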

Title: Adaptive Tool Generation with Models as Tools and Reinforcement Learning

Authors: Chenpeng Wang, Xiaojie Cheng, Chunye Wang, Linfeng Yang, Lei Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.06825
Pdf URL: https://arxiv.org/pdf/2510.06825
Copy Paste: [[2510.06825]] Adaptive Tool Generation with Models as Tools and Reinforcement Learning(https://arxiv.org/abs/2510.06825)
Keywords: language model, agent
Abstract: Tool-augmented language models have demonstrated strong capabilities, but their reliance on live API access creates scalability and reliability challenges during training and deployment. We propose MTR, a simulation-first training framework for tool-augmented reasoning. Instead of relying on live APIs, MTR learns from complete ReAct traces with schema-validated, simulated observations. Our approach operates through a multi-agent architecture where a ToolMaker generates task-specific, OpenAI-compatible tool interfaces, an AutoAgent produces structured think-act-observe sequences, and a ToolActor simulates realistic responses. Training proceeds in two stages: Stage-1 Supervised Fine-Tuning (SFT) teaches 'trace grammar' from complete reasoning sequences; Stage-2 Group Relative Policy Optimization (GRPO) optimizes strategy with a composite trace reward that balances answer correctness and internal consistency. Across four multi-hop QA benchmarks (HotpotQA, MuSiQue, 2WikiMultiHopQA, Bamboogle), MTR attains Exact Match (EM) scores competitive with live-API systems and excels on reasoning-intensive tasks, suggesting that effective tool reasoning can be learned from structured traces without live interactions.
摘要：工具增强语言模型已经展示了强大的功能，但它们对实时 API 访问的依赖在训练和部署过程中带来了可扩展性和可靠性挑战。我们提出了 MTR，一种用于工具增强推理的模拟优先训练框架。 MTR 不依赖实时 API，而是通过经过模式验证的模拟观察来从完整的 ReAct 跟踪中学习。我们的方法通过多代理架构运行，其中 ToolMaker 生成特定于任务的、与 OpenAI 兼容的工具接口，AutoAgent 生成结构化的思考-行动-观察序列，而 ToolActor 则模拟真实的响应。训练分两个阶段进行：阶段 1 监督微调 (SFT) 教授完整推理序列的“跟踪语法”；第 2 阶段组相对策略优化 (GRPO) 通过平衡答案正确性和内部一致性的复合跟踪奖励来优化策略。在四个多跳 QA 基准测试（HotpotQA、MuSiQue、2WikiMultiHopQA、Bamboogle）中，MTR 获得了与实时 API 系统竞争的精确匹配 (EM) 分数，并且在推理密集型任务上表现出色，这表明可以从结构化跟踪中学习有效的工具推理，而无需实时交互。

Title: Mid-Training of Large Language Models: A Survey

Authors: Kaixiang Mo, Yuxin Shi, Weiwei Weng, Zhiqiang Zhou, Shuman Liu, Haibo Zhang, Anxiang Zeng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.06826
Pdf URL: https://arxiv.org/pdf/2510.06826
Copy Paste: [[2510.06826]] Mid-Training of Large Language Models: A Survey(https://arxiv.org/abs/2510.06826)
Keywords: language model, llm
Abstract: Large language models (LLMs) are typically developed through large-scale pre-training followed by task-specific fine-tuning. Recent advances highlight the importance of an intermediate mid-training stage, where models undergo multiple annealing-style phases that refine data quality, adapt optimization schedules, and extend context length. This stage mitigates diminishing returns from noisy tokens, stabilizes convergence, and expands model capability in late training. Its effectiveness can be explained through gradient noise scale, the information bottleneck, and curriculum learning, which together promote generalization and abstraction. Despite widespread use in state-of-the-art systems, there has been no prior survey of mid-training as a unified paradigm. We introduce the first taxonomy of LLM mid-training spanning data distribution, learning-rate scheduling, and long-context extension. We distill practical insights, compile evaluation benchmarks, and report gains to enable structured comparisons across models. We also identify open challenges and propose avenues for future research and practice.
摘要：大型语言模型（LLM）通常是通过大规模预训练和针对特定任务的微调来开发的。最近的进展凸显了中间训练阶段的重要性，其中模型经历多个退火式阶段，以提高数据质量、调整优化计划并延长上下文长度。此阶段减轻了噪声令牌带来的收益递减，稳定了收敛，并扩展了后期训练中的模型能力。其有效性可以通过梯度噪声量表、信息瓶颈和课程学习来解释，它们共同促进泛化和抽象。尽管在最先进的系统中广泛使用，但之前还没有对中期训练作为统一范式进行过调查。我们介绍了 LLM 中期训练的第一个分类法，涵盖数据分布、学习率调度和长上下文扩展。我们提炼实用见解、编制评估基准并报告收益，以实现跨模型的结构化比较。我们还确定了开放的挑战并提出了未来研究和实践的途径。

Title: SID: Multi-LLM Debate Driven by Self Signals

Authors: Xuhang Chen, Zhifan Song, Deyi Ji, Shuo Gao, Lanyun Zhu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.06843
Pdf URL: https://arxiv.org/pdf/2510.06843
Copy Paste: [[2510.06843]] SID: Multi-LLM Debate Driven by Self Signals(https://arxiv.org/abs/2510.06843)
Keywords: language model, llm, agent
Abstract: Large Language Models (LLMs) have exhibited impressive capabilities across diverse application domains. Recent work has explored Multi-LLM Agent Debate (MAD) as a way to enhance performance by enabling multiple LLMs to discuss and refine responses iteratively. Nevertheless, existing MAD methods predominantly focus on utilizing external structures, such as debate graphs, using LLM-as-a-Judge, while neglecting the application of self signals, such as token logits and attention, that arise during generation. This omission leads to redundant computation and potential performance degradation. In this paper, we shift the focus to the self signals of multi-LLM debate and introduce a Self-Signals Driven Multi-LLM Debate (SID), which leverages two types of self-signals: model-level confidence and token-level semantic focus, to adaptively guide the debate process. Our approach enables high-confidence agents to exit early at the model level and compress the redundant debate contents based on the attention mechanism. We evaluate our method on various LLMs and Multimodal LLMs across multiple challenging benchmarks. Experimental results demonstrate that our method not only outperforms existing MAD techniques in accuracy but also reduces token consumption, highlighting the effectiveness of utilizing self signals in enhancing both the performance and efficiency of multi-agent debate systems. Our code will be available at this https URL.
摘要：大型语言模型 (LLM) 在不同的应用领域中展现了令人印象深刻的能力。最近的工作探索了多 LLM 智能体辩论（MAD），通过让多个 LLM 迭代地讨论和完善回答来提高性能。然而，现有的 MAD 方法主要侧重于利用外部结构（例如辩论图）并使用 LLM-as-a-Judge，而忽略了生成过程中出现的自信号（例如令牌 logits 和注意力）的应用。这种遗漏会导致冗余计算和潜在的性能下降。在本文中，我们将焦点转移到多 LLM 辩论的自信号上，并引入自信号驱动的多 LLM 辩论（SID），它利用两种类型的自信号：模型级置信度和令牌级语义焦点，来自适应地指导辩论过程。我们的方法使得高置信度的智能体能够在模型层面尽早退出，并基于注意力机制压缩冗余的辩论内容。我们在多个具有挑战性的基准上，针对各种 LLM 和多模态 LLM 评估了我们的方法。实验结果表明，我们的方法不仅在准确性上优于现有的 MAD 技术，而且还减少了令牌消耗，凸显了利用自信号在增强多智能体辩论系统的性能和效率方面的有效性。我们的代码将在此 https URL 提供。

Title: OpenJAI-v1.0: An Open Thai Large Language Model

Authors: Pontakorn Trakuekul, Attapol T. Rutherford, Jullajak Karnjanaekarin, Narongkorn Panitsrisit, Sumana Sumanakul
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.06847
Pdf URL: https://arxiv.org/pdf/2510.06847
Copy Paste: [[2510.06847]] OpenJAI-v1.0: An Open Thai Large Language Model(https://arxiv.org/abs/2510.06847)
Keywords: language model
Abstract: We introduce OpenJAI-v1.0, an open-source large language model for Thai and English, developed from the Qwen3-14B model. Our work focuses on boosting performance on practical tasks through carefully curated data across three key use cases: instruction following, long-context understanding, and tool use. Evaluation results show that OpenJAI-v1.0 improves on the capabilities of its base model and outperforms other leading open-source Thai models on a diverse suite of benchmarks, while avoiding catastrophic forgetting. OpenJAI-v1.0 is publicly released as another alternative NLP resource for the Thai AI community.
摘要：我们介绍 OpenJAI-v1.0，这是一个开源的泰语和英语大语言模型，由 Qwen3-14B 模型开发而来。我们的工作重点是通过精心策划的三个关键用例的数据来提高实际任务的性能：指令遵循、长上下文理解和工具使用。评估结果表明，OpenJAI-v1.0 改进了其基础模型的功能，并在各种基准测试中优于其他领先的开源泰国模型，同时避免了灾难性遗忘。 OpenJAI-v1.0 作为泰国 AI 社区的另一个替代 NLP 资源公开发布。

Title: Unlocking Latent Discourse Translation in LLMs Through Quality-Aware Decoding

Authors: Wafaa Mohammed, Vlad Niculae, Chrysoula Zerva
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.06866
Pdf URL: https://arxiv.org/pdf/2510.06866
Copy Paste: [[2510.06866]] Unlocking Latent Discourse Translation in LLMs Through Quality-Aware Decoding(https://arxiv.org/abs/2510.06866)
Keywords: language model, llm
Abstract: Large language models (LLMs) have emerged as strong contenders in machine translation. Yet, they still struggle to adequately handle discourse phenomena, such as pronoun resolution and lexical cohesion at the document level. In this study, we thoroughly investigate the discourse phenomena performance of LLMs in context-aware translation. We demonstrate that discourse knowledge is encoded within LLMs and propose the use of quality-aware decoding (QAD) to effectively extract this knowledge, showcasing its superiority over other decoding approaches through comprehensive analysis. Furthermore, we illustrate that QAD enhances the semantic richness of translations and aligns them more closely with human preferences.
摘要：大型语言模型 (LLM) 已成为机器翻译领域的有力竞争者，但它们仍然难以充分处理话语现象，例如文档级别的代词解析和词汇衔接。在这项研究中，我们深入研究了 LLM 在上下文感知翻译中的话语现象表现。我们证明了话语知识已编码在 LLM 中，并提出使用质量感知解码（QAD）来有效地提取这些知识，通过综合分析展示了其相对于其他解码方法的优越性。此外，我们还说明 QAD 增强了翻译的语义丰富性，并使它们更符合人类的偏好。

Title: $λ$-GRPO: Unifying the GRPO Frameworks with Learnable Token Preferences

Authors: Yining Wang, Jinman Zhao, Chuangxin Zhao, Shuhao Guan, Gerald Penn, Shinan Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.06870
Pdf URL: https://arxiv.org/pdf/2510.06870
Copy Paste: [[2510.06870]] $λ$-GRPO: Unifying the GRPO Frameworks with Learnable Token Preferences(https://arxiv.org/abs/2510.06870)
Keywords: language model, llm
Abstract: Reinforcement Learning with Human Feedback (RLHF) has been the dominant approach for improving the reasoning capabilities of Large Language Models (LLMs). Recently, Reinforcement Learning with Verifiable Rewards (RLVR) has simplified this paradigm by replacing the reward and value models with rule-based verifiers. A prominent example is Group Relative Policy Optimization (GRPO). However, GRPO inherently suffers from a length bias, since the same advantage is uniformly assigned to all tokens of a response. As a result, longer responses distribute the reward over more tokens and thus contribute disproportionately to gradient updates. Several variants, such as DAPO and Dr. GRPO, modify the token-level aggregation of the loss, yet these methods remain heuristic and offer limited interpretability regarding their implicit token preferences. In this work, we explore the possibility of allowing the model to learn its own token preference during optimization. We unify existing frameworks under a single formulation and introduce a learnable parameter $\lambda$ that adaptively controls token-level weighting. We use $\lambda$-GRPO to denote our method, and we find that $\lambda$-GRPO achieves consistent improvements over vanilla GRPO and DAPO on multiple mathematical reasoning benchmarks. On Qwen2.5 models with 1.5B, 3B, and 7B parameters, $\lambda$-GRPO improves average accuracy by $+1.9\%$, $+1.0\%$, and $+1.7\%$ compared to GRPO, respectively. Importantly, these gains come without any modifications to the training data or additional computational cost, highlighting the effectiveness and practicality of learning token preferences.
摘要：人类反馈强化学习（RLHF）一直是提高大型语言模型（LLM）推理能力的主要方法。最近，具有可验证奖励的强化学习（RLVR）通过用基于规则的验证器替换奖励和价值模型来简化了这种范式。一个突出的例子是组相对策略优化（GRPO）。然而，GRPO 本质上存在长度偏差，因为相同的优势被统一分配给响应的所有令牌。因此，较长的响应将奖励分配给更多的令牌，从而对梯度更新做出不成比例的贡献。DAPO 和 Dr. GRPO 等几种变体修改了损失的令牌级别聚合，但这些方法仍然是启发式的，并且对其隐式令牌偏好的解释性有限。在这项工作中，我们探索了允许模型在优化过程中学习自己的令牌偏好的可能性。我们将现有框架统一在一个公式下，并引入一个可学习参数 $\lambda$ 来自适应控制令牌级别的权重。我们使用 $\lambda$-GRPO 来表示我们的方法，我们发现 $\lambda$-GRPO 在多个数学推理基准上比普通 GRPO 和 DAPO 取得了一致的改进。在具有 1.5B、3B 和 7B 参数的 Qwen2.5 模型上，与 GRPO 相比，$\lambda$-GRPO 的平均准确度分别提高了 $+1.9\%$、$+1.0\%$ 和 $+1.7\%$。重要的是，这些收益无需对训练数据进行任何修改或增加计算成本，凸显了学习令牌偏好的有效性和实用性。
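
The length bias described in the abstract above comes from how per-token losses are aggregated over a group of sampled responses. The sketch below contrasts three aggregation rules as they are commonly described for GRPO, DAPO, and Dr. GRPO. The abstract gives no formulas, so the exact normalizations, the function name `aggregate_loss`, and the `max_len` budget are assumptions for illustration only, and the paper's learnable $\lambda$ weighting is not reproduced here.

```python
def aggregate_loss(token_losses, scheme, max_len=None):
    """Combine per-token losses for a group of sampled responses.

    token_losses: list of lists, one inner list of per-token losses per response.
    scheme: "grpo", "dapo", or "dr_grpo" (hypothetical labels for this sketch).
    max_len: constant length budget, only used by the "dr_grpo" rule.
    """
    group_size = len(token_losses)
    if scheme == "grpo":
        # Mean over tokens within each response, then mean over responses:
        # every response carries weight 1/G, so tokens in long responses
        # each count for less than tokens in short ones.
        return sum(sum(seq) / len(seq) for seq in token_losses) / group_size
    if scheme == "dapo":
        # Mean over all tokens in the group: every token counts equally,
        # so longer responses contribute more in total.
        total_tokens = sum(len(seq) for seq in token_losses)
        return sum(sum(seq) for seq in token_losses) / total_tokens
    if scheme == "dr_grpo":
        # Sum over tokens divided by a constant that does not depend on
        # the length of the individual response.
        if max_len is None:
            raise ValueError("max_len is required for the dr_grpo rule")
        return sum(sum(seq) for seq in token_losses) / (group_size * max_len)
    raise ValueError(f"unknown scheme: {scheme}")


# One short and one long response with different per-token losses.
losses = [[1.0, 1.0], [2.0, 2.0, 2.0, 2.0]]
for name in ("grpo", "dapo"):
    print(name, aggregate_loss(losses, name))
print("dr_grpo", aggregate_loss(losses, "dr_grpo", max_len=4))
```

The same per-token losses yield three different group-level values, which is one way to read the "implicit token preferences" the abstract attributes to these variants.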

Title: MeXtract: Light-Weight Metadata Extraction from Scientific Papers

Authors: Zaid Alyafeai, Maged S. Al-Shaibani, Bernard Ghanem
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.06889
Pdf URL: https://arxiv.org/pdf/2510.06889
Copy Paste: [[2510.06889]] MeXtract: Light-Weight Metadata Extraction from Scientific Papers(https://arxiv.org/abs/2510.06889)
Keywords: language model
Abstract: Metadata plays a critical role in indexing, documenting, and analyzing scientific literature, yet extracting it accurately and efficiently remains a challenging task. Traditional approaches often rely on rule-based or task-specific models, which struggle to generalize across domains and schema variations. In this paper, we present MeXtract, a family of lightweight language models designed for metadata extraction from scientific papers. The models, ranging from 0.5B to 3B parameters, are built by fine-tuning Qwen 2.5 counterparts. In their size family, MeXtract achieves state-of-the-art performance on metadata extraction on the MOLE benchmark. To further support evaluation, we extend the MOLE benchmark to incorporate model-specific metadata, providing an out-of-domain challenging subset. Our experiments show that fine-tuning on a given schema not only yields high accuracy but also transfers effectively to unseen schemas, demonstrating the robustness and adaptability of our approach. We release all the code, datasets, and models openly for the research community.
摘要：元数据在索引、记录和分析科学文献中发挥着关键作用，但准确有效地提取元数据仍然是一项具有挑战性的任务。传统方法通常依赖于基于规则或特定于任务的模型，这些模型很难跨领域和模式变化进行泛化。在本文中，我们提出了 MeXtract，这是一个轻量级语言模型系列，专为从科学论文中提取元数据而设计。这些模型的参数范围从0.5B到3B，是通过对Qwen 2.5对应模型进行微调而建立的。在其规模系列中，MeXtract 在 MOLE 基准的元数据提取方面实现了最先进的性能。为了进一步支持评估，我们扩展了 MOLE 基准以合并特定于模型的元数据，提供域外具有挑战性的子集。我们的实验表明，对给定模式的微调不仅可以产生高精度，而且可以有效地转移到未见过的模式，证明了我们方法的稳健性和适应性。我们向研究社区公开发布所有代码、数据集和模型。

Title: LongRM: Revealing and Unlocking the Context Boundary of Reward Modeling

Authors: Zecheng Tang, Baibei Ji, Quantong Qiu, Haitian Wang, Xiaobo Liang, Juntao Li, Min Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.06915
Pdf URL: https://arxiv.org/pdf/2510.06915
Copy Paste: [[2510.06915]] LongRM: Revealing and Unlocking the Context Boundary of Reward Modeling(https://arxiv.org/abs/2510.06915)
Keywords: language model, llm, long context, agent
Abstract: Reward models (RMs) play a pivotal role in aligning large language models (LLMs) with human preferences. As real-world applications increasingly involve long history trajectories, e.g., LLM agents, it becomes indispensable to evaluate whether a model's responses are not only high-quality but also grounded in and consistent with the provided context. Yet, current RMs remain confined to short-context settings and primarily focus on response-level attributes (e.g., safety or helpfulness), while largely neglecting the critical dimension of long context-response consistency. In this work, we introduce Long-RewardBench, a benchmark specifically designed for long-context RM evaluation, featuring both Pairwise Comparison and Best-of-N tasks. Our preliminary study reveals that even state-of-the-art generative RMs exhibit significant fragility in long-context scenarios, failing to maintain context-aware preference judgments. Motivated by the analysis of failure patterns observed in model outputs, we propose a general multi-stage training strategy that effectively scales arbitrary models into robust Long-context RMs (LongRMs). Experiments show that our approach not only substantially improves performance on long-context evaluation but also preserves strong short-context capability. Notably, our 8B LongRM outperforms much larger 70B-scale baselines and matches the performance of the proprietary Gemini 2.5 Pro model.
摘要：奖励模型（RM）在使大语言模型（LLM）与人类偏好保持一致方面发挥着关键作用。随着现实世界的应用程序越来越多地涉及长期历史轨迹，例如 LLM 代理，评估模型的响应是否不仅是高质量的，而且是否基于所提供的上下文并与其保持一致就变得必不可少。然而，当前的 RM 仍然局限于短上下文环境，并且主要关注响应级别的属性（例如安全性或有用性），而在很大程度上忽略了长上下文响应一致性的关键维度。在这项工作中，我们引入了 Long-RewardBench，这是一个专为长上下文 RM 评估而设计的基准，具有配对比较和 Best-of-N 任务的特点。我们的初步研究表明，即使是最先进的生成 RM 在长上下文场景中也表现出显着的脆弱性，无法维持上下文感知的偏好判断。受模型输出中观察到的故障模式分析的启发，我们提出了一种通用的多阶段训练策略，可以有效地将任意模型扩展为鲁棒的长上下文 RM（LongRM）。实验表明，我们的方法不仅大大提高了长上下文评估的性能，而且保留了强大的短上下文能力。值得注意的是，我们的 8B LongRM 的性能优于更大的 70B 规模基准，并与专有的 Gemini 2.5 Pro 模型的性能相匹配。

Title: SHANKS: Simultaneous Hearing and Thinking for Spoken Language Models

Authors: Cheng-Han Chiang, Xiaofei Wang, Linjie Li, Chung-Ching Lin, Kevin Lin, Shujie Liu, Zhendong Wang, Zhengyuan Yang, Hung-yi Lee, Lijuan Wang
Subjects: cs.CL, eess.AS
Abstract URL: https://arxiv.org/abs/2510.06917
Pdf URL: https://arxiv.org/pdf/2510.06917
Copy Paste: [[2510.06917]] SHANKS: Simultaneous Hearing and Thinking for Spoken Language Models(https://arxiv.org/abs/2510.06917)
Keywords: language model, llm, chain-of-thought
Abstract: Current large language models (LLMs) and spoken language models (SLMs) begin thinking and taking actions only after the user has finished their turn. This prevents the model from interacting during the user's turn and can lead to high response latency while it waits to think. Consequently, thinking after receiving the full input is not suitable for speech-to-speech interaction, where real-time, low-latency exchange is important. We address this by noting that humans naturally "think while listening." In this paper, we propose SHANKS, a general inference framework that enables SLMs to generate unspoken chain-of-thought reasoning while listening to the user input. SHANKS streams the input speech in fixed-duration chunks and, as soon as a chunk is received, generates unspoken reasoning based on all previous speech and reasoning, while the user continues speaking. SHANKS uses this unspoken reasoning to decide whether to interrupt the user and to make tool calls to complete the task. We demonstrate that SHANKS enhances real-time user-SLM interaction in two scenarios: (1) when the user is presenting a step-by-step solution to a math problem, SHANKS can listen, reason, and interrupt when the user makes a mistake, achieving 37.1% higher interruption accuracy than a baseline that interrupts without thinking; and (2) in a tool-augmented dialogue, SHANKS can complete 56.9% of the tool calls before the user finishes their turn. Overall, SHANKS moves toward models that keep thinking throughout the conversation, not only after a turn ends. Animated illustrations of Shanks can be found at this https URL
摘要：当前的大语言模型 (LLM) 和口语模型 (SLM) 仅在用户完成轮次后才开始思考并采取行动。这会阻止模型在轮到用户时进行交互，并可能导致在等待思考时出现较高的响应延迟。因此，在收到完整输入后才进行思考并不适合语音到语音的交互，因为这类交互中实时、低延迟的交流非常重要。我们通过指出人类自然地“边听边思考”来解决这个问题。在本文中，我们提出了 SHANKS，这是一种通用推理框架，使 SLM 能够在聆听用户输入的同时生成不说出口的思维链推理。SHANKS 以固定持续时间的块形式流式传输输入语音，一旦接收到一个块，就会根据所有先前的语音和推理生成不说出口的推理，同时用户继续说话。SHANKS 使用这种不说出口的推理来决定是否打断用户并调用工具来完成任务。我们证明了 SHANKS 在两种场景下增强了用户与 SLM 的实时交互：（1）当用户讲解数学问题的分步解答时，SHANKS 可以倾听、推理，并在用户犯错时打断，与不加思考地打断的基线相比，打断准确度提高了 37.1%；（2）在工具增强对话中，SHANKS 可以在用户完成轮次之前完成 56.9% 的工具调用。总体而言，SHANKS 朝着在整个对话过程中持续思考、而不仅仅是在轮次结束后才思考的模型迈进。SHANKS 的动画插图可以在此 https URL 找到。

Title: Open ASR Leaderboard: Towards Reproducible and Transparent Multilingual and Long-Form Speech Recognition Evaluation

Authors: Vaibhav Srivastav, Steven Zheng, Eric Bezzam, Eustache Le Bihan, Nithin Koluguri, Piotr Żelasko, Somshubra Majumdar, Adel Moumen, Sanchit Gandhi
Subjects: cs.CL, cs.AI, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2510.06961
Pdf URL: https://arxiv.org/pdf/2510.06961
Copy Paste: [[2510.06961]] Open ASR Leaderboard: Towards Reproducible and Transparent Multilingual and Long-Form Speech Recognition Evaluation(https://arxiv.org/abs/2510.06961)
Keywords: llm
Abstract: Despite rapid progress, ASR evaluation remains saturated with short-form English, and efficiency is rarely reported. We present the Open ASR Leaderboard, a fully reproducible benchmark and interactive leaderboard comparing 60+ open-source and proprietary systems across 11 datasets, including dedicated multilingual and long-form tracks. We standardize text normalization and report both word error rate (WER) and inverse real-time factor (RTFx), enabling fair accuracy-efficiency comparisons. For English transcription, Conformer encoders paired with LLM decoders achieve the best average WER but are slower, while CTC and TDT decoders deliver much better RTFx, making them attractive for long-form and offline use. Whisper-derived encoders fine-tuned for English improve accuracy but often trade off multilingual coverage. All code and dataset loaders are open-sourced to support transparent, extensible evaluation.
摘要：尽管进展迅速，ASR 评估仍然充斥着简短的英语，而且效率很少有报道。我们推出了开放 ASR 排行榜，这是一个完全可重复的基准和交互式排行榜，比较了 11 个数据集的 60 多个开源和专有系统，包括专用的多语言和长格式轨道。我们对文本标准化进行标准化，并报告字错误率 (WER) 和逆实时因子 (RTFx)，从而实现公平的准确性-效率比较。对于英语转录，Conformer 编码器与 LLM 解码器配对可实现最佳平均 WER，但速度较慢，而 CTC 和 TDT 解码器可提供更好的 RTFx，这使得它们对于长格式和离线使用具有吸引力。针对英语进行微调的 Whisper 派生编码器可提高准确性，但通常会牺牲多语言覆盖范围。所有代码和数据集加载器都是开源的，以支持透明、可扩展的评估。
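
For readers unfamiliar with the two metrics named in the abstract above, here is a minimal sketch: WER as word-level edit distance divided by the number of reference words, and RTFx as audio duration divided by processing time. The function names are hypothetical, and the leaderboard's own text normalization is not reproduced; plain whitespace splitting is a simplifying assumption.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def inverse_real_time_factor(audio_seconds, processing_seconds):
    """RTFx: seconds of audio transcribed per second of compute (higher is faster)."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
print(inverse_real_time_factor(600.0, 6.0))
```

Reporting both numbers side by side is what allows the accuracy-efficiency comparison the abstract describes: a system can lower WER while also lowering RTFx, and only the pair shows that trade-off.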

Title: EDUMATH: Generating Standards-aligned Educational Math Word Problems

Authors: Bryan R. Christ, Penelope Molitz, Jonathan Kropko, Thomas Hartvigsen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.06965
Pdf URL: https://arxiv.org/pdf/2510.06965
Copy Paste: [[2510.06965]] EDUMATH: Generating Standards-aligned Educational Math Word Problems(https://arxiv.org/abs/2510.06965)
Keywords: llm
Abstract: Math word problems (MWPs) are critical K-12 educational tools, and customizing them to students' interests and ability levels can increase learning outcomes. However, teachers struggle to find time to customize MWPs for each student given large class sizes and increasing burnout. We propose that LLMs can support math education by generating MWPs customized to student interests and math education standards. To this end, we use a joint human expert-LLM judge approach to evaluate over 11,000 MWPs generated by open and closed LLMs and develop the first teacher-annotated dataset for standards-aligned educational MWP generation. We show the value of our data by using it to train a 12B open model that matches the performance of larger and more capable open models. We also use our teacher-annotated data to train a text classifier that enables a 30B open LLM to outperform existing closed baselines without any training. Next, we show our models' MWPs are more similar to human-written MWPs than those from existing models. We conclude by conducting the first study of customized LLM-generated MWPs with grade school students, finding they perform similarly on our models' MWPs relative to human-written MWPs but consistently prefer our customized MWPs.
摘要：数学应用题 (MWP) 是关键的 K-12 教育工具，根据学生的兴趣和能力水平对其进行定制可以提高学习成果。然而，由于班级规模较大且倦怠不断增加，教师很难找到时间为每个学生定制 MWP。我们提出 LLM 可以通过生成根据学生兴趣和数学教育标准定制的 MWP 来支持数学教育。为此，我们使用人类专家与 LLM 评审相结合的方法来评估开放式和封闭式 LLM 生成的 11,000 多个 MWP，并开发第一个教师注释的数据集，用于生成符合标准的教育 MWP。我们通过使用数据来训练 12B 开放模型来展示数据的价值，该模型与更大、更强大的开放模型的性能相匹配。我们还使用教师注释的数据来训练文本分类器，使 30B 开放式 LLM 无需任何训练即可超越现有的封闭基线。接下来，我们展示了我们模型的 MWP 比现有模型的 MWP 更类似于人类编写的 MWP。最后，我们首次在小学生中开展了关于 LLM 生成的定制 MWP 的研究，发现他们在我们模型的 MWP 上的表现与在人工编写的 MWP 上的表现相似，但始终更喜欢我们定制的 MWP。

Title: Probing Social Identity Bias in Chinese LLMs with Gendered Pronouns and Social Groups

Authors: Geng Liu, Feng Li, Junjie Mu, Mengxiao Zhu, Francesco Pierri
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.06974
Pdf URL: https://arxiv.org/pdf/2510.06974
Copy Paste: [[2510.06974]] Probing Social Identity Bias in Chinese LLMs with Gendered Pronouns and Social Groups(https://arxiv.org/abs/2510.06974)
Keywords: language model, llm, prompt, chat
Abstract: Large language models (LLMs) are increasingly deployed in user-facing applications, raising concerns about their potential to reflect and amplify social biases. We investigate social identity framing in Chinese LLMs using Mandarin-specific prompts across ten representative Chinese LLMs, evaluating responses to ingroup ("We") and outgroup ("They") framings, and extending the setting to 240 social groups salient in the Chinese context. To complement controlled experiments, we further analyze Chinese-language conversations from a corpus of real interactions between users and chatbots. Across models, we observe systematic ingroup-positive and outgroup-negative tendencies, which are not confined to synthetic prompts but also appear in naturalistic dialogue, indicating that bias dynamics might strengthen in real interactions. Our study provides a language-aware evaluation framework for Chinese LLMs, demonstrating that social identity biases documented in English generalize cross-linguistically and intensify in user-facing contexts.
摘要：大型语言模型 (LLM) 越来越多地部署在面向用户的应用程序中，引发了人们对其反映和放大社会偏见的潜力的担忧。我们使用针对普通话的提示，在十个具有代表性的中文 LLM 上研究社会身份框架，评估对内群体（“我们”）和外群体（“他们”）框架的反应，并将设置扩展到在中国背景下突出的 240 个社会群体。为了补充受控实验，我们进一步从用户和聊天机器人之间的真实交互语料库中分析中文对话。在各个模型中，我们观察到系统性的内群体积极和外群体消极倾向，这些倾向不仅限于合成提示，而且也出现在自然对话中，这表明偏见动态可能在实际互动中加强。我们的研究为中文 LLM 提供了一个语言感知评估框架，证明以英语记录的社会身份偏见可以跨语言泛化，并在面向用户的环境中加剧。

Title: Towards Reliable Retrieval in RAG Systems for Large Legal Datasets

Authors: Markus Reuter, Tobias Lingenberg, Rūta Liepiņa, Francesca Lagioia, Marco Lippi, Giovanni Sartor, Andrea Passerini, Burcu Sayin
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2510.06999
Pdf URL: https://arxiv.org/pdf/2510.06999
Copy Paste: [[2510.06999]] Towards Reliable Retrieval in RAG Systems for Large Legal Datasets(https://arxiv.org/abs/2510.06999)
Keywords: language model, llm, hallucination, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) is a promising approach to mitigate hallucinations in Large Language Models (LLMs) for legal applications, but its reliability is critically dependent on the accuracy of the retrieval step. This is particularly challenging in the legal domain, where large databases of structurally similar documents often cause retrieval systems to fail. In this paper, we address this challenge by first identifying and quantifying a critical failure mode we term Document-Level Retrieval Mismatch (DRM), where the retriever selects information from entirely incorrect source documents. To mitigate DRM, we investigate a simple and computationally efficient technique which we refer to as Summary-Augmented Chunking (SAC). This method enhances each text chunk with a document-level synthetic summary, thereby injecting crucial global context that would otherwise be lost during a standard chunking process. Our experiments on a diverse set of legal information retrieval tasks show that SAC greatly reduces DRM and, consequently, also improves text-level retrieval precision and recall. Interestingly, we find that a generic summarization strategy outperforms an approach that incorporates legal expert domain knowledge to target specific legal elements. Our work provides evidence that this practical, scalable, and easily integrable technique enhances the reliability of RAG systems when applied to large-scale legal document datasets.
摘要：检索增强生成（RAG）是一种很有前途的方法，可以减轻法律应用中大型语言模型（LLM）中的幻觉，但其可靠性很大程度上取决于检索步骤的准确性。这在法律领域尤其具有挑战性，结构相似文档的大型数据库经常导致检索系统失败。在本文中，我们通过首先识别和量化关键故障模式（我们称之为文档级检索不匹配（DRM））来应对这一挑战，其中检索器从完全不正确的源文档中选择信息。为了减轻 DRM，我们研究了一种简单且计算高效的技术，我们将其称为摘要增强分块 (SAC)。该方法通过文档级综合摘要增强每个文本块，从而注入关键的全局上下文，否则这些上下文将在标准分块过程中丢失。我们对各种法律信息检索任务的实验表明，SAC 极大地减少了 DRM，因此也提高了文本级检索的精度和召回率。有趣的是，我们发现通用摘要策略优于结合法律专家领域知识来针对特定法律要素的方法。我们的工作证明，这种实用、可扩展且易于集成的技术在应用于大规模法律文档数据集时可以增强 RAG 系统的可靠性。
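
The idea behind Summary-Augmented Chunking, as the abstract above states it, is that each chunk carries a document-level summary so the retriever sees global context. A minimal sketch follows. The fixed-size character splitting, the separator, and the function name `summary_augmented_chunks` are assumptions for illustration, and the summary is passed in as a plain string rather than generated, since the abstract does not specify how the authors produce it.

```python
def summary_augmented_chunks(document, summary, chunk_size=500):
    """Split a document into chunks and prepend the same summary to each one.

    document: full document text.
    summary: document-level summary (assumed to be produced elsewhere).
    chunk_size: chunk length in characters (a simplifying assumption).
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    chunks = [
        document[start:start + chunk_size]
        for start in range(0, len(document), chunk_size)
    ]
    # Every chunk is indexed together with the global summary, so two
    # structurally similar chunks from different documents no longer
    # look identical to the retriever.
    return [f"{summary}\n\n{chunk}" for chunk in chunks]


contract = "Clause 1. The supplier delivers goods. Clause 2. The buyer pays within 30 days."
augmented = summary_augmented_chunks(
    contract, "Summary: supply agreement between two parties.", chunk_size=40
)
for piece in augmented:
    print(repr(piece))
```

In a retrieval pipeline the augmented strings, not the raw chunks, would be the texts that get embedded and indexed.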

Title: Pragyaan: Designing and Curating High-Quality Cultural Post-Training Datasets for Indian Languages

Authors: Neel Prabhanjan Rachamalla, Aravind Konakalla, Gautam Rajeev, Ashish Kulkarni, Chandra Khatri, Shubham Agarwal
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.07000
Pdf URL: https://arxiv.org/pdf/2510.07000
Copy Paste: [[2510.07000]] Pragyaan: Designing and Curating High-Quality Cultural Post-Training Datasets for Indian Languages(https://arxiv.org/abs/2510.07000)
Keywords: language model, llm
Abstract: The effectiveness of Large Language Models (LLMs) depends heavily on the availability of high-quality post-training data, particularly instruction-tuning and preference-based examples. Existing open-source datasets, however, often lack multilingual coverage and cultural grounding, and suffer from task diversity gaps that are especially pronounced for Indian languages. We introduce a human-in-the-loop pipeline that combines translations with synthetic expansion to produce reliable and diverse Indic post-training data. Using this pipeline, we curate two datasets: Pragyaan-IT (22.5K) and Pragyaan-Align (100K) across 10 Indian languages covering 13 broad and 56 sub-categories, leveraging 57 diverse datasets. Our dataset protocol incorporates several often-overlooked dimensions and emphasizes task diversity, multi-turn dialogue, instruction fidelity, safety alignment, and preservation of cultural nuance, providing a foundation for more inclusive and effective multilingual LLMs.
摘要：大型语言模型 (LLM) 的有效性在很大程度上取决于高质量训练后数据的可用性，特别是指令调整和基于偏好的示例。然而，现有的开源数据集往往缺乏多语言覆盖和文化基础，并且存在任务多样性差距，这在印度语言中尤为明显。我们引入了一个人在回路（human-in-the-loop）的管道，它将翻译与合成扩展相结合，以生成可靠且多样化的印度语训练后数据。使用此管道，我们利用 57 个不同的数据集，整理了两个数据集：Pragyaan-IT (22.5K) 和 Pragyaan-Align (100K)，涵盖 10 种印度语言，涵盖 13 个大类和 56 个子类别。我们的数据集协议包含了几个经常被忽视的维度，并强调任务多样性、多轮对话、指令保真度、安全一致性和文化细微差别的保留，为更具包容性和有效的多语言 LLM 奠定了基础。

Title: Native Hybrid Attention for Efficient Sequence Modeling

Authors: Jusen Du, Jiaxi Hu, Tao Zhang, Weigao Sun, Yu Cheng
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.07019
Pdf URL: https://arxiv.org/pdf/2510.07019
Copy Paste: [[2510.07019]] Native Hybrid Attention for Efficient Sequence Modeling(https://arxiv.org/abs/2510.07019)
Keywords: llm, long context
Abstract: Transformers excel at sequence modeling but face quadratic complexity, while linear attention offers improved efficiency but often compromises recall accuracy over long contexts. In this work, we introduce Native Hybrid Attention (NHA), a novel hybrid architecture of linear and full attention that integrates both intra- and inter-layer hybridization into a unified layer design. NHA maintains long-term context in key-value slots updated by a linear RNN, and augments them with short-term tokens from a sliding window. A single softmax attention operation is then applied over all keys and values, enabling per-token and per-head context-dependent weighting without requiring additional fusion parameters. The inter-layer behavior is controlled through a single hyperparameter, the sliding window size, which allows smooth adjustment between purely linear and full attention while keeping all layers structurally uniform. Experimental results show that NHA surpasses Transformers and other hybrid baselines on recall-intensive and commonsense reasoning tasks. Furthermore, pretrained LLMs can be structurally hybridized with NHA, achieving competitive accuracy while delivering significant efficiency gains. Code is available at this https URL.
摘要：Transformer 擅长序列建模，但面临二次复杂性，而线性注意力虽然提高了效率，但往往会损害长上下文的回忆准确性。在这项工作中，我们引入了原生混合注意力（NHA），这是一种新颖的线性和完全注意力的混合架构，它将层内和层间混合集成到统一的层设计中。NHA 在由线性 RNN 更新的键值槽中维护长期上下文，并使用滑动窗口中的短期令牌来增强它们。然后将单个 softmax 注意力操作应用于所有键和值，从而实现每个令牌和每个头上下文相关的加权，而不需要额外的融合参数。层间行为是通过单个超参数（滑动窗口大小）控制的，它允许在纯线性和完全注意力之间进行平滑调整，同时保持所有层结构一致。实验结果表明，NHA 在回忆密集型和常识推理任务上超越了 Transformers 和其他混合基线。此外，预训练的 LLM 可以在结构上与 NHA 混合，实现有竞争力的准确性，同时显着提高效率。代码可从此 https URL 获取。
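
The central step in the abstract above, one softmax over long-term slots and sliding-window tokens together, can be sketched for a single query and a single head. This is an illustration only: the linear-RNN slot update is omitted and the slots are passed in as given, the 1/sqrt(d) scaling is the usual scaled dot-product convention and an assumption here, and the function name `hybrid_softmax_attention` is hypothetical.

```python
import math


def hybrid_softmax_attention(query, slot_keys, slot_values, window_keys, window_values):
    """Attend over long-term slots and short-term window tokens with one softmax.

    All vectors are plain Python lists of floats; keys share the query dimension.
    Returns the output vector and the attention weights (slots first, then window).
    """
    # Concatenate both memories so a single softmax decides how much
    # weight goes to long-term versus short-term context.
    keys = slot_keys + window_keys
    values = slot_values + window_values
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    # Subtract the maximum score for numerical stability.
    top = max(scores)
    exps = [math.exp(score - top) for score in scores]
    total = sum(exps)
    weights = [value / total for value in exps]
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return output, weights


out, attn = hybrid_softmax_attention(
    query=[1.0, 0.0],
    slot_keys=[[1.0, 0.0]], slot_values=[[2.0, 0.0]],
    window_keys=[[0.0, 1.0]], window_values=[[0.0, 4.0]],
)
print(out, attn)
```

Because the weights come from one normalization, no separate gate or fusion parameter is needed to mix the two sources, which matches the property the abstract highlights.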

Title: Mining the Mind: What 100M Beliefs Reveal About Frontier LLM Knowledge

Authors: Shrestha Ghosh, Luca Giordano, Yujia Hu, Tuan-Phong Nguyen, Simon Razniewski
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.07024
Pdf URL: https://arxiv.org/pdf/2510.07024
Copy Paste: [[2510.07024]] Mining the Mind: What 100M Beliefs Reveal About Frontier LLM Knowledge(https://arxiv.org/abs/2510.07024)
Keywords: gpt, llm, hallucination
Abstract: LLMs are remarkable artifacts that have revolutionized a range of NLP and AI tasks. A significant contributor is their factual knowledge, which, to date, remains poorly understood, and is usually analyzed from biased samples. In this paper, we take a deep tour into the factual knowledge (or beliefs) of a frontier LLM, based on GPTKB v1.5 (Hu et al., 2025a), a recursively elicited set of 100 million beliefs of one of the strongest currently available frontier LLMs, GPT-4.1. We find that the model's factual knowledge differs quite significantly from established knowledge bases, and that its accuracy is significantly lower than indicated by previous benchmarks. We also find that inconsistency, ambiguity and hallucinations are major issues, shedding light on future research opportunities concerning factual LLM knowledge.
摘要：LLM 是非凡的产物，彻底改变了一系列自然语言处理和人工智能任务。一个重要的贡献因素是它们的事实知识，但迄今为止，人们对这些事实知识的了解仍然很少，并且通常是根据有偏见的样本进行分析。在本文中，我们基于 GPTKB v1.5（Hu 等人，2025a）深入探讨了前沿 LLM 的事实知识（或信念），GPTKB v1.5 是当前可用的最强大的前沿 LLM 之一（GPT-4.1）的 1 亿个信念的递归引出集。我们发现模型的事实知识与已建立的知识库存在很大差异，并且其准确性明显低于以前的基准测试所显示的水平。我们还发现，不一致、模糊性和幻觉是主要问题，这为未来有关 LLM 事实知识的研究机会提供了线索。

Title: Beyond Monolingual Assumptions: A Survey of Code-Switched NLP in the Era of Large Language Models

Authors: Rajvee Sheth, Samridhi Raj Sinha, Mahavir Patil, Himanshu Beniwal, Mayank Singh
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.07037
Pdf URL: https://arxiv.org/pdf/2510.07037
Copy Paste: [[2510.07037]] Beyond Monolingual Assumptions: A Survey of Code-Switched NLP in the Era of Large Language Models(https://arxiv.org/abs/2510.07037)
Keywords: language model, llm
Abstract: Code-switching (CSW), the alternation of languages and scripts within a single utterance, remains a fundamental challenge for multilingual NLP, even amidst the rapid advances of large language models (LLMs). Most LLMs still struggle with mixed-language inputs, limited CSW datasets, and evaluation biases, hindering deployment in multilingual societies. This survey provides the first comprehensive analysis of CSW-aware LLM research, reviewing studies spanning five research areas, 12 NLP tasks, 30+ datasets, and 80+ languages. We classify recent advances by architecture, training strategy, and evaluation methodology, outlining how LLMs have reshaped CSW modeling and what challenges persist. The paper concludes with a roadmap emphasizing the need for inclusive datasets, fair evaluation, and linguistically grounded models to achieve truly multilingual intelligence. A curated collection of all resources is maintained at this https URL.
摘要：即使在大型语言模型 (LLM) 快速发展的情况下，语码转换 (CSW)，即单个话语中语言和文字的交替，仍然是多语言 NLP 的基本挑战。大多数 LLM 仍然面临混合语言输入、有限的 CSW 数据集和评估偏差等问题，阻碍了在多语言社会中的部署。这项综述首次对面向 CSW 的 LLM 研究进行了全面分析，回顾了涵盖五个研究领域、12 个 NLP 任务、30 多个数据集和 80 多种语言的研究。我们按照架构、训练策略和评估方法对最新进展进行分类，概述了 LLM 如何重塑 CSW 建模以及哪些挑战依然存在。该论文最后提出了一个路线图，强调需要包容性数据集、公平评估和基于语言学的模型来实现真正的多语言智能。所有资源的精选集合均在此 https URL 中维护。

Title: Search-R3: Unifying Reasoning and Embedding Generation in Large Language Models

Authors: Yuntao Gui, James Cheng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.07048
Pdf URL: https://arxiv.org/pdf/2510.07048
Copy Paste: [[2510.07048]] Search-R3: Unifying Reasoning and Embedding Generation in Large Language Models(https://arxiv.org/abs/2510.07048)
Keywords: language model, llm, chain-of-thought
Abstract: Despite their remarkable natural language understanding capabilities, Large Language Models (LLMs) have been underutilized for retrieval tasks. We present Search-R3, a novel framework that addresses this limitation by adapting LLMs to generate search embeddings as a direct output of their reasoning process. Our approach exploits LLMs' chain-of-thought capabilities, allowing them to produce more effective embeddings by reasoning step-by-step through complex semantic analyses. We implement this through three complementary mechanisms: (1) a supervised learning stage that enables the model to produce quality embeddings, (2) a reinforcement learning (RL) methodology that optimizes embedding generation alongside reasoning, and (3) a specialized RL environment that efficiently handles evolving embedding representations without requiring complete corpus re-encoding at each training iteration. Our extensive evaluations on diverse benchmarks demonstrate that Search-R3 significantly outperforms prior methods by unifying the reasoning and embedding generation processes. This integrated post-training approach represents a substantial advancement in handling complex knowledge-intensive tasks that require both sophisticated reasoning and effective information retrieval. Project page: this https URL
摘要：尽管大型语言模型 (LLM) 具有出色的自然语言理解能力，但在检索任务中尚未得到充分利用。我们提出了 Search-R3，这是一种新颖的框架，它通过调整 LLM 生成搜索嵌入作为其推理过程的直接输出来解决这一限制。我们的方法利用了 LLM 的思维链能力，使它们能够通过复杂的语义分析逐步推理来产生更有效的嵌入。我们通过三个互补机制来实现这一目标：(1) 监督学习阶段使模型能够生成高质量的嵌入，(2) 强化学习 (RL) 方法可在推理的同时优化嵌入生成，(3) 专门的 RL 环境可有效处理不断演变的嵌入表示，而无需在每次训练迭代时进行完整的语料库重新编码。我们对不同基准的广泛评估表明，Search-R3 通过统一推理和嵌入生成过程，显着优于先前的方法。这种集成的后训练方法代表了在处理需要复杂推理和有效信息检索的复杂知识密集型任务方面的重大进步。项目页面：此 https URL

Title: Revisiting Metric Reliability for Fine-grained Evaluation of Machine Translation and Summarization in Indian Languages

Authors: Amir Hossein Yari, Kalmit Kulkarni, Ahmad Raza Khan, Fajri Koto
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.07061
Pdf URL: https://arxiv.org/pdf/2510.07061
Copy Paste: [[2510.07061]] Revisiting Metric Reliability for Fine-grained Evaluation of Machine Translation and Summarization in Indian Languages(https://arxiv.org/abs/2510.07061)
Keywords: llm
Abstract: While automatic metrics drive progress in Machine Translation (MT) and Text Summarization (TS), existing metrics have been developed and validated almost exclusively for English and other high-resource languages. This narrow focus leaves Indian languages, spoken by over 1.5 billion people, largely overlooked, casting doubt on the universality of current evaluation practices. To address this gap, we introduce ITEM, a large-scale benchmark that systematically evaluates the alignment of 26 automatic metrics with human judgments across six major Indian languages, enriched with fine-grained annotations. Our extensive evaluation, covering agreement with human judgments, sensitivity to outliers, language-specific reliability, inter-metric correlations, and resilience to controlled perturbations, reveals four central findings: (1) LLM-based evaluators show the strongest alignment with human judgments at both segment and system levels; (2) outliers exert a significant impact on metric-human agreement; (3) in TS, metrics are more effective at capturing content fidelity, whereas in MT, they better reflect fluency; and (4) metrics differ in their robustness and sensitivity when subjected to diverse perturbations. Collectively, these findings offer critical guidance for advancing metric design and evaluation in Indian languages.

Title: LuxInstruct: A Cross-Lingual Instruction Tuning Dataset For Luxembourgish

Authors: Fred Philippy, Laura Bernardy, Siwen Guo, Jacques Klein, Tegawendé F. Bissyandé
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.07074
Pdf URL: https://arxiv.org/pdf/2510.07074
Copy Paste: [[2510.07074]] LuxInstruct: A Cross-Lingual Instruction Tuning Dataset For Luxembourgish(https://arxiv.org/abs/2510.07074)
Keywords: language model, prompt
Abstract: Instruction tuning has become a key technique for enhancing the performance of large language models, enabling them to better follow human prompts. However, low-resource languages such as Luxembourgish face severe limitations due to the lack of high-quality instruction datasets. Traditional reliance on machine translation often introduces semantic misalignment and cultural inaccuracies. In this work, we address these challenges by creating a cross-lingual instruction tuning dataset for Luxembourgish, without resorting to machine-generated translations into it. Instead, by leveraging aligned data from English, French, and German, we build a high-quality dataset that preserves linguistic and cultural nuances. We provide evidence that cross-lingual instruction tuning not only improves representational alignment across languages but also the model's generative capabilities in Luxembourgish. This highlights how cross-lingual data curation can avoid the common pitfalls of machine-translated data and directly benefit low-resource language development.

Title: Accelerating Diffusion LLM Inference via Local Determinism Propagation

Authors: Fanheng Kong, Jingyuan Zhang, Yahui Liu, Zirui Wu, Yu Tian, Victoria W., Guorui Zhou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.07081
Pdf URL: https://arxiv.org/pdf/2510.07081
Copy Paste: [[2510.07081]] Accelerating Diffusion LLM Inference via Local Determinism Propagation(https://arxiv.org/abs/2510.07081)
Keywords: language model, llm
Abstract: Diffusion large language models (dLLMs) represent a significant advancement in text generation, offering parallel token decoding capabilities. However, existing open-source implementations suffer from quality-speed trade-offs that impede their practical deployment. Conservative sampling strategies typically decode only the most confident token per step to ensure quality (i.e., greedy decoding), at the cost of inference efficiency due to repeated redundant refinement iterations--a phenomenon we term delayed decoding. Through systematic analysis of dLLM decoding dynamics, we characterize this delayed decoding behavior and propose a training-free adaptive parallel decoding strategy, named LocalLeap, to address these inefficiencies. LocalLeap is built on two fundamental empirical principles: local determinism propagation centered on high-confidence anchors and progressive spatial consistency decay. By applying these principles, LocalLeap identifies anchors and performs localized relaxed parallel decoding within bounded neighborhoods, achieving substantial inference step reduction through early commitment of already-determined tokens without compromising output quality. Comprehensive evaluation on various benchmarks demonstrates that LocalLeap achieves 6.94× throughput improvements and reduces decoding steps to just 14.2% of the original requirement, achieving these gains with negligible performance impact. The source codes are available at: this https URL.

Title: All Claims Are Equal, but Some Claims Are More Equal Than Others: Importance-Sensitive Factuality Evaluation of LLM Generations

Authors: Miriam Wanner, Leif Azzopardi, Paul Thomas, Soham Dan, Benjamin Van Durme, Nick Craswell
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.07083
Pdf URL: https://arxiv.org/pdf/2510.07083
Copy Paste: [[2510.07083]] All Claims Are Equal, but Some Claims Are More Equal Than Others: Importance-Sensitive Factuality Evaluation of LLM Generations(https://arxiv.org/abs/2510.07083)
Keywords: language model, llm
Abstract: Existing methods for evaluating the factuality of large language model (LLM) responses treat all claims as equally important. This results in misleading evaluations when vital information is missing or incorrect as it receives the same weight as peripheral details, raising the question: how can we reliably detect such differences when there are errors in key information? Current approaches that measure factuality tend to be insensitive to omitted or false key information. To investigate this lack of sensitivity, we construct VITALERRORS, a benchmark of 6,733 queries with minimally altered LLM responses designed to omit or falsify key information. Using this dataset, we demonstrate the insensitivities of existing evaluation metrics to key information errors. To address this gap, we introduce VITAL, a set of metrics that provide greater sensitivity in measuring the factuality of responses by incorporating the relevance and importance of claims with respect to the query. Our analysis demonstrates that VITAL metrics more reliably detect errors in key information than previous methods. Our dataset, metrics, and analysis provide a foundation for more accurate and robust assessment of LLM factuality.

Title: Making Machines Sound Sarcastic: LLM-Enhanced and Retrieval-Guided Sarcastic Speech Synthesis

Authors: Zhu Li, Yuqing Zhang, Xiyuan Gao, Shekhar Nayak, Matt Coler
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2510.07096
Pdf URL: https://arxiv.org/pdf/2510.07096
Copy Paste: [[2510.07096]] Making Machines Sound Sarcastic: LLM-Enhanced and Retrieval-Guided Sarcastic Speech Synthesis(https://arxiv.org/abs/2510.07096)
Keywords: language model, llm, retrieval augmented generation
Abstract: Sarcasm is a subtle form of non-literal language that poses significant challenges for speech synthesis due to its reliance on nuanced semantic, contextual, and prosodic cues. While existing speech synthesis research has focused primarily on broad emotional categories, sarcasm remains largely unexplored. In this paper, we propose a Large Language Model (LLM)-enhanced Retrieval-Augmented framework for sarcasm-aware speech synthesis. Our approach combines (1) semantic embeddings from a LoRA-fine-tuned LLaMA 3, which capture pragmatic incongruity and discourse-level cues of sarcasm, and (2) prosodic exemplars retrieved via a Retrieval Augmented Generation (RAG) module, which provide expressive reference patterns of sarcastic delivery. Integrated within a VITS backbone, this dual conditioning enables more natural and contextually appropriate sarcastic speech. Experiments demonstrate that our method outperforms baselines in both objective measures and subjective evaluations, yielding improvements in speech naturalness, sarcastic expressivity, and downstream sarcasm detection.

Title: TALENT: Table VQA via Augmented Language-Enhanced Natural-text Transcription

Authors: Guo Yutong, Wanying Wang, Yue Wu, Zichen Miao, Haoyu Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.07098
Pdf URL: https://arxiv.org/pdf/2510.07098
Copy Paste: [[2510.07098]] TALENT: Table VQA via Augmented Language-Enhanced Natural-text Transcription(https://arxiv.org/abs/2510.07098)
Keywords: language model, llm, prompt
Abstract: Table Visual Question Answering (Table VQA) is typically addressed by large vision-language models (VLMs). While such models can answer directly from images, they often miss fine-grained details unless scaled to very large sizes, which are computationally prohibitive, especially for mobile deployment. A lighter alternative is to have a small VLM perform OCR and then use a large language model (LLM) to reason over structured outputs such as Markdown tables. However, these representations are not naturally optimized for LLMs and still introduce substantial errors. We propose TALENT (Table VQA via Augmented Language-Enhanced Natural-text Transcription), a lightweight framework that leverages dual representations of tables. TALENT prompts a small VLM to produce both OCR text and natural language narration, then combines them with the question for reasoning by an LLM. This reframes Table VQA as an LLM-centric multimodal reasoning task, where the VLM serves as a perception-narration module rather than a monolithic solver. Additionally, we construct ReTabVQA, a more challenging Table VQA dataset requiring multi-step quantitative reasoning over table images. Experiments show that TALENT enables a small VLM-LLM combination to match or surpass a single large VLM at significantly lower computational cost on both public datasets and ReTabVQA.

Title: Opt-ICL at LeWiDi-2025: Maximizing In-Context Signal from Rater Examples via Meta-Learning

Authors: Taylor Sorensen, Yejin Choi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.07105
Pdf URL: https://arxiv.org/pdf/2510.07105
Copy Paste: [[2510.07105]] Opt-ICL at LeWiDi-2025: Maximizing In-Context Signal from Rater Examples via Meta-Learning(https://arxiv.org/abs/2510.07105)
Keywords: language model, llm
Abstract: Many natural language processing (NLP) tasks involve subjectivity, ambiguity, or legitimate disagreement between annotators. In this paper, we outline our system for modeling human variation. Our system leverages language models' (LLMs) in-context learning abilities, along with a two-step meta-learning training procedure for 1) post-training on many datasets requiring in-context learning and 2) specializing the model via in-context meta-learning to the particular data distribution of interest. We also evaluate the performance of our system submission to the Learning With Disagreements (LeWiDi) competition, where it was the overall winner on both tasks. Additionally, we perform an ablation study to measure the importance of each system component. We find that including rater examples in-context is crucial for our system's performance, dataset-specific fine-tuning is helpful on the larger datasets, post-training on other in-context datasets is helpful on one of the competition datasets, and that performance improves with model scale.

Title: TRIM: Token-wise Attention-Derived Saliency for Data-Efficient Instruction Tuning

Authors: Manish Nagaraj, Sakshi Choudhary, Utkarsh Saxena, Deepak Ravikumar, Kaushik Roy
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2510.07118
Pdf URL: https://arxiv.org/pdf/2510.07118
Copy Paste: [[2510.07118]] TRIM: Token-wise Attention-Derived Saliency for Data-Efficient Instruction Tuning(https://arxiv.org/abs/2510.07118)
Keywords: language model, llm
Abstract: Instruction tuning is essential for aligning large language models (LLMs) to downstream tasks and commonly relies on large, diverse corpora. However, small, high-quality subsets, known as coresets, can deliver comparable or superior results, though curating them remains challenging. Existing methods often rely on coarse, sample-level signals like gradients, an approach that is computationally expensive and overlooks fine-grained features. To address this, we introduce TRIM (Token Relevance via Interpretable Multi-layer Attention), a forward-only, token-centric framework. Instead of using gradients, TRIM operates by matching underlying representational patterns identified via attention-based "fingerprints" from a handful of target samples. Such an approach makes TRIM highly efficient and uniquely sensitive to the structural features that define a task. Coresets selected by our method consistently outperform state-of-the-art baselines by up to 9% on downstream tasks and even surpass the performance of full-data fine-tuning in some settings. By avoiding expensive backward passes, TRIM achieves this at a fraction of the computational cost. These findings establish TRIM as a scalable and efficient alternative for building high-quality instruction-tuning datasets.

Title: Comparing human and language models sentence processing difficulties on complex structures

Authors: Samuel Joseph Amouyal, Aya Meltzer-Asscher, Jonathan Berant
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.07141
Pdf URL: https://arxiv.org/pdf/2510.07141
Copy Paste: [[2510.07141]] Comparing human and language models sentence processing difficulties on complex structures(https://arxiv.org/abs/2510.07141)
Keywords: language model, gpt, llm
Abstract: Large language models (LLMs) that fluently converse with humans are a reality - but do LLMs experience human-like processing difficulties? We systematically compare human and LLM sentence comprehension across seven challenging linguistic structures. We collect sentence comprehension data from humans and five families of state-of-the-art LLMs, varying in size and training procedure in a unified experimental framework. Our results show LLMs overall struggle on the target structures, but especially on garden path (GP) sentences. Indeed, while the strongest models achieve near perfect accuracy on non-GP structures (93.7% for GPT-5), they struggle on GP structures (46.8% for GPT-5). Additionally, when ranking structures based on average performance, rank correlation between humans and models increases with parameter count. For each target structure, we also collect data for their matched baseline without the difficult structure. Comparing performance on the target vs. baseline sentences, the performance gap observed in humans holds for LLMs, with two exceptions: for models that are too weak performance is uniformly low across both sentence types, and for models that are too strong the performance is uniformly high. Together, these reveal convergence and divergence in human and LLM sentence comprehension, offering new insights into the similarity of humans and LLMs.

Title: Reasoning for Hierarchical Text Classification: The Case of Patents

Authors: Lekang Jiang, Wenjun Sun, Stephan Goetz
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.07167
Pdf URL: https://arxiv.org/pdf/2510.07167
Copy Paste: [[2510.07167]] Reasoning for Hierarchical Text Classification: The Case of Patents(https://arxiv.org/abs/2510.07167)
Keywords: language model, llm, chain-of-thought
Abstract: Hierarchical text classification (HTC) assigns documents to multiple levels of a pre-defined taxonomy. Automated patent subject classification represents one of the hardest HTC scenarios because of domain knowledge difficulty and a huge number of labels. Prior approaches only output a flat label set, which offers little insight into the reason behind predictions. Therefore, we propose Reasoning for Hierarchical Classification (RHC), a novel framework that reformulates HTC as a step-by-step reasoning task to sequentially deduce hierarchical labels. RHC trains large language models (LLMs) in two stages: a cold-start stage that aligns outputs with chain-of-thought (CoT) reasoning format and a reinforcement learning (RL) stage to enhance multi-step reasoning ability. RHC demonstrates four advantages in our experiments. (1) Effectiveness: RHC surpasses previous baselines and outperforms the supervised fine-tuning counterparts by approximately 3% in accuracy and macro F1. (2) Explainability: RHC produces natural-language justifications before prediction to facilitate human inspection. (3) Scalability: RHC scales favorably with model size with larger gains compared to standard fine-tuning. (4) Applicability: Beyond patents, we further demonstrate that RHC achieves state-of-the-art performance on other widely used HTC benchmarks, which highlights its broad applicability.

Title: More Data or Better Data? A Critical Analysis of Data Selection and Synthesis for Mathematical Reasoning

Authors: Yike Zhao, Simin Guo, Ziqing Yang, Shifan Han, Dahua Lin, Fei Tan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.07169
Pdf URL: https://arxiv.org/pdf/2510.07169
Copy Paste: [[2510.07169]] More Data or Better Data? A Critical Analysis of Data Selection and Synthesis for Mathematical Reasoning(https://arxiv.org/abs/2510.07169)
Keywords: language model, llm
Abstract: The reasoning capabilities of Large Language Models (LLMs) play a critical role in many downstream tasks, yet depend strongly on the quality of training data. Despite various proposed data construction methods, their practical utility in real-world pipelines remains underexplored. In this work, we conduct a comprehensive analysis of open-source datasets and data synthesis techniques for mathematical reasoning, evaluating them under a unified pipeline designed to mirror training and deployment scenarios. We further distill effective data selection strategies and identify practical methods suitable for industrial applications. Our findings highlight that structuring data in more interpretable formats, or distilling from stronger models often outweighs simply scaling up data volume. This study provides actionable guidance for integrating training data to enhance LLM capabilities, supporting both cost-effective data curation and scalable model enhancement. We hope this work will inspire further research on how to balance "more data" versus "better data" for real-world reasoning tasks.

Title: NurseLLM: The First Specialized Language Model for Nursing

Authors: Md Tawkat Islam Khondaker, Julia Harrington, Shady Shehata
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2510.07173
Pdf URL: https://arxiv.org/pdf/2510.07173
Copy Paste: [[2510.07173]] NurseLLM: The First Specialized Language Model for Nursing(https://arxiv.org/abs/2510.07173)
Keywords: language model, llm, agent
Abstract: Recent advancements in large language models (LLMs) have significantly transformed medical systems. However, their potential within specialized domains such as nursing remains largely underexplored. In this work, we introduce NurseLLM, the first nursing-specialized LLM tailored for multiple choice question-answering (MCQ) tasks. We develop a multi-stage data generation pipeline to build the first large scale nursing MCQ dataset to train LLMs on a broad spectrum of nursing topics. We further introduce multiple nursing benchmarks to enable rigorous evaluation. Our extensive experiments demonstrate that NurseLLM outperforms SoTA general-purpose and medical-specialized LLMs of comparable size on different benchmarks, underscoring the importance of a specialized LLM for the nursing domain. Finally, we explore the role of reasoning and multi-agent collaboration systems in nursing, highlighting their promise for future research and applications.

Title: Quantifying Data Contamination in Psychometric Evaluations of LLMs

Authors: Jongwook Han, Woojung Song, Jonggeun Lee, Yohan Jo
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2510.07175
Pdf URL: https://arxiv.org/pdf/2510.07175
Copy Paste: [[2510.07175]] Quantifying Data Contamination in Psychometric Evaluations of LLMs(https://arxiv.org/abs/2510.07175)
Keywords: language model, llm
Abstract: Recent studies apply psychometric questionnaires to Large Language Models (LLMs) to assess high-level psychological constructs such as values, personality, moral foundations, and dark traits. Although prior work has raised concerns about possible data contamination from psychometric inventories, which may threaten the reliability of such evaluations, there has been no systematic attempt to quantify the extent of this contamination. To address this gap, we propose a framework to systematically measure data contamination in psychometric evaluations of LLMs, evaluating three aspects: (1) item memorization, (2) evaluation memorization, and (3) target score matching. Applying this framework to 21 models from major families and four widely used psychometric inventories, we provide evidence that popular inventories such as the Big Five Inventory (BFI-44) and Portrait Values Questionnaire (PVQ-40) exhibit strong contamination, where models not only memorize items but can also adjust their responses to achieve specific target scores.

Title: CARPAS: Towards Content-Aware Refinement of Provided Aspects for Summarization in Large Language Models

Authors: Yong-En Tian, Yu-Chien Tang, An-Zi Yen, Wen-Chih Peng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.07177
Pdf URL: https://arxiv.org/pdf/2510.07177
Copy Paste: [[2510.07177]] CARPAS: Towards Content-Aware Refinement of Provided Aspects for Summarization in Large Language Models(https://arxiv.org/abs/2510.07177)
Keywords: language model, llm, prompt
Abstract: Aspect-based summarization has attracted significant attention for its ability to generate more fine-grained and user-aligned summaries. While most existing approaches assume a set of predefined aspects as input, real-world scenarios often present challenges where these given aspects may be incomplete, irrelevant, or entirely missing from the document. Users frequently expect systems to adaptively refine or filter the provided aspects based on the actual content. In this paper, we initiate this novel task setting, termed Content-Aware Refinement of Provided Aspects for Summarization (CARPAS), with the aim of dynamically adjusting the provided aspects based on the document context before summarizing. We construct three new datasets to facilitate our pilot experiments, and by using LLMs with four representative prompting strategies in this task, we find that LLMs tend to predict an overly comprehensive set of aspects, which often results in excessively long and misaligned summaries. Building on this observation, we propose a preliminary subtask to predict the number of relevant aspects, and demonstrate that the predicted number can serve as effective guidance for the LLMs, reducing the inference difficulty, and enabling them to focus on the most pertinent aspects. Our extensive experiments show that the proposed approach significantly improves performance across all datasets. Moreover, our deeper analyses uncover LLMs' compliance when the requested number of aspects differs from their own estimations, establishing a crucial insight for the deployment of LLMs in similar real-world applications.

Title: Biasless Language Models Learn Unnaturally: How LLMs Fail to Distinguish the Possible from the Impossible

Authors: Imry Ziv, Nur Lan, Emmanuel Chemla, Roni Katzir
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.07178
Pdf URL: https://arxiv.org/pdf/2510.07178
Copy Paste: [[2510.07178]] Biasless Language Models Learn Unnaturally: How LLMs Fail to Distinguish the Possible from the Impossible(https://arxiv.org/abs/2510.07178)
Keywords: language model, gpt, llm
Abstract: Are large language models (LLMs) sensitive to the distinction between humanly possible languages and humanly impossible languages? This question is taken by many to bear on whether LLMs and humans share the same innate learning biases. Previous work has attempted to answer it in the positive by comparing LLM learning curves on existing language datasets and on "impossible" datasets derived from them via various perturbation functions. Using the same methodology, we examine this claim on a wider set of languages and impossible perturbations. We find that in most cases, GPT-2 learns each language and its impossible counterpart equally easily, in contrast to previous claims. We also apply a more lenient condition by testing whether GPT-2 provides any kind of separation between the whole set of natural languages and the whole set of impossible languages. By considering cross-linguistic variance in various metrics computed on the perplexity curves, we show that GPT-2 provides no systematic separation between the possible and the impossible. Taken together, these perspectives show that LLMs do not share the human innate biases that shape linguistic typology.

Title: Sunflower: A New Approach To Expanding Coverage of African Languages in Large Language Models

Authors: Benjamin Akera, Evelyn Nafula Ouma, Gilbert Yiga, Patrick Walukagga, Phionah Natukunda, Trevor Saaka, Solomon Nsumba, Lilian Teddy Nabukeera, Joel Muhanguzi, Imran Sekalala, Nimpamya Janat Namara, Engineer Bainomugisha, Ernest Mwebaze, John Quinn
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.07203
Pdf URL: https://arxiv.org/pdf/2510.07203
Copy Paste: [[2510.07203]] Sunflower: A New Approach To Expanding Coverage of African Languages in Large Language Models(https://arxiv.org/abs/2510.07203)
Keywords: language model, llm
Abstract: There are more than 2000 living languages in Africa, most of which have been bypassed by advances in language technology. Current leading LLMs exhibit strong performance on a number of the most common languages (e.g. Swahili or Yoruba), but prioritise support for the languages with the most speakers first, resulting in piecemeal ability across disparate languages. We contend that a regionally focussed approach is more efficient, and present a case study for Uganda, a country with high linguistic diversity. We describe the development of Sunflower 14B and 32B, a pair of models based on Qwen 3 with state of the art comprehension in the majority of all Ugandan languages. These models are open source and can be used to reduce language barriers in a number of important practical applications.

Title: Language Lives in Sparse Dimensions: Toward Interpretable and Efficient Multilingual Control for Large Language Models

Authors: Chengzhi Zhong, Fei Cheng, Qianying Liu, Yugo Murawaki, Chenhui Chu, Sadao Kurohashi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.07213
Pdf URL: https://arxiv.org/pdf/2510.07213
Copy Paste: [[2510.07213]] Language Lives in Sparse Dimensions: Toward Interpretable and Efficient Multilingual Control for Large Language Models(https://arxiv.org/abs/2510.07213)
Keywords: language model
Abstract: Large language models exhibit strong multilingual capabilities despite limited exposure to non-English data. Prior studies show that English-centric large language models map multilingual content into English-aligned representations at intermediate layers and then project them back into target-language token spaces in the final layer. From this observation, we hypothesize that this cross-lingual transition is governed by a small and sparse set of dimensions, which occur at consistent indices across the intermediate to final layers. Building on this insight, we introduce a simple, training-free method to identify and manipulate these dimensions, requiring only as few as 50 sentences of either parallel or monolingual data. Experiments on a multilingual generation control task reveal the interpretability of these dimensions, demonstrating that the interventions in these dimensions can switch the output language while preserving semantic content, and that it surpasses the performance of prior neuron-based approaches at a substantially lower cost.
摘要：尽管接触非英语数据有限，但大型语言模型仍表现出强大的多语言能力。先前的研究表明，以英语为中心的大语言模型将多语言内容映射到中间层的英语对齐表示，然后将它们投影回最后层的目标语言标记空间。根据这一观察，我们假设这种跨语言转换是由一组小而稀疏的维度控制的，这些维度在中间层到最终层之间以一致的索引发生。基于这一见解，我们引入了一种简单的、免训练的方法来识别和操纵这些维度，只需要少至 50 个并行或单语数据的句子。多语言生成控制任务的实验揭示了这些维度的可解释性，证明对这些维度的干预可以在保留语义内容的同时切换输出语言，并且它以低得多的成本超越了先前基于神经元的方法的性能。

Title: Where to Begin: Efficient Pretraining via Subnetwork Selection and Distillation

Authors: Arjun Krishnakumar, Rhea Sanjay Sukthanker, Hannan Javed Mahadik, Gabriela Kadlecová, Vladyslav Moroshan, Timur Carstensen, Frank Hutter, Aaron Klein
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.07227
Pdf URL: https://arxiv.org/pdf/2510.07227
Copy Paste: [[2510.07227]] Where to Begin: Efficient Pretraining via Subnetwork Selection and Distillation(https://arxiv.org/abs/2510.07227)
Keywords: language model, llm
Abstract: Small Language Models (SLMs) offer an efficient and accessible alternative to Large Language Models (LLMs), delivering strong performance while using far fewer resources. We introduce a simple and effective framework for pretraining SLMs that brings together three complementary ideas. First, we identify structurally sparse sub-network initializations that consistently outperform randomly initialized models of similar size under the same compute budget. Second, we use evolutionary search to automatically discover high-quality sub-network initializations, providing better starting points for pretraining. Third, we apply knowledge distillation from larger teacher models to speed up training and improve generalization. Together, these components make SLM pretraining substantially more efficient: our best model, discovered using evolutionary search and initialized with LLM weights, matches the validation perplexity of a comparable Pythia SLM while requiring 9.2x fewer pretraining tokens. We release all code and models at this https URL, offering a practical and reproducible path toward cost-efficient small language model development at scale.
摘要：小语言模型 (SLM) 为大型语言模型 (LLM) 提供了高效且易于访问的替代方案，可在使用更少资源的同时提供强大的性能。我们引入了一个简单而有效的预训练 SLM 框架，该框架汇集了三个互补的想法。首先，我们确定结构稀疏的子网络初始化，在相同的计算预算下，其性能始终优于类似大小的随机初始化模型。其次，我们使用进化搜索来自动发现高质量的子网络初始化，为预训练提供更好的起点。第三，我们应用从大型教师模型中提取知识来加速训练并提高泛化能力。这些组件一起使 SLM 预训练变得更加高效：我们的最佳模型是使用进化搜索发现的，并使用 LLM 权重进行初始化，与可比较的 Pythia SLM 的验证困惑度相匹配，同时需要的预训练标记减少了 9.2 倍。我们在此 https URL 发布所有代码和模型，为大规模经济高效的小语言模型开发提供实用且可重复的路径。

Title: Customer-R1: Personalized Simulation of Human Behaviors via RL-based LLM Agent in Online Shopping

Authors: Ziyi Wang, Yuxuan Lu, Yimeng Zhang, Jing Huang, Dakuo Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.07230
Pdf URL: https://arxiv.org/pdf/2510.07230
Copy Paste: [[2510.07230]] Customer-R1: Personalized Simulation of Human Behaviors via RL-based LLM Agent in Online Shopping(https://arxiv.org/abs/2510.07230)
Keywords: language model, llm, prompt, agent
Abstract: Simulating step-wise human behavior with Large Language Models (LLMs) has become an emerging research direction, enabling applications in various practical domains. While prior methods, including prompting, supervised fine-tuning (SFT), and reinforcement learning (RL), have shown promise in modeling step-wise behavior, they primarily learn a population-level policy without conditioning on a user's persona, yielding generic rather than personalized simulations. In this work, we pose a critical question: how can LLM agents better simulate personalized user behavior? We introduce Customer-R1, an RL-based method for personalized, step-wise user behavior simulation in online shopping environments. Our policy is conditioned on an explicit persona, and we optimize next-step rationale and action generation via action correctness reward signals. Experiments on the OPeRA dataset demonstrate that Customer-R1 not only significantly outperforms prompting and SFT-based baselines in next-action prediction tasks, but also better matches users' action distribution, indicating higher fidelity in personalized behavior simulation.
摘要：使用大型语言模型（LLM）模拟逐步的人类行为已成为一个新兴的研究方向，使其能够在各个实际领域中应用。虽然之前的方法，包括提示、监督微调 (SFT) 和强化学习 (RL)，在建模逐步行为方面显示出前景，但它们主要学习群体级别的策略，而不以用户角色为条件，产生通用而非个性化的模拟。在这项工作中，我们提出了一个关键问题：LLM 代理如何更好地模拟个性化用户行为？我们推出了 Customer-R1，这是一种基于强化学习的方法，用于在线购物环境中进行个性化、逐步的用户行为模拟。我们的政策以明确的角色为条件，我们通过行动正确性奖励信号来优化下一步的基本原理和行动生成。在 OPeRA 数据集上的实验表明，Customer-R1 不仅在下一步动作预测任务中显着优于提示和基于 SFT 的基线，而且更好地匹配用户的动作分布，表明个性化行为模拟具有更高的保真度。

Title: Benchmarking LLM Causal Reasoning with Scientifically Validated Relationships

Authors: Donggyu Lee, Sungwon Park, Yerin Hwang, Hyunwoo Oh, Hyoshin Kim, Jungwon Kim, Meeyoung Cha, Sangyoon Park, Jihee Kim
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.07231
Pdf URL: https://arxiv.org/pdf/2510.07231
Copy Paste: [[2510.07231]] Benchmarking LLM Causal Reasoning with Scientifically Validated Relationships(https://arxiv.org/abs/2510.07231)
Keywords: language model, llm
Abstract: Causal reasoning is fundamental for Large Language Models (LLMs) to understand genuine cause-and-effect relationships beyond pattern matching. Existing benchmarks suffer from critical limitations such as reliance on synthetic data and narrow domain coverage. We introduce a novel benchmark constructed from causally identified relationships extracted from top-tier economics and finance journals, drawing on rigorous methodologies including instrumental variables, difference-in-differences, and regression discontinuity designs. Our benchmark comprises 40,379 evaluation items covering five task types across domains such as health, environment, technology, law, and culture. Experimental results on eight state-of-the-art LLMs reveal substantial limitations, with the best model achieving only 57.6% accuracy. Moreover, model scale does not consistently translate to superior performance, and even advanced reasoning models struggle with fundamental causal relationship identification. These findings underscore a critical gap between current LLM capabilities and the demands of reliable causal reasoning in high-stakes applications.
摘要：因果推理是大型语言模型 (LLM) 理解模式匹配之外的真正因果关系的基础。现有基准受到严重限制，例如依赖合成数据和狭窄的领域覆盖范围。我们引入了一种新颖的基准，该基准是根据从顶级经济学和金融期刊中提取的随意识别的关系构建的，并借鉴了包括工具变量、双重差分和断点回归设计在内的严格方法论。我们的基准包括 40,379 个评估项目，涵盖健康、环境、技术、法律和文化等领域的五种任务类型。八个最先进的法学硕士的实验结果揭示了很大的局限性，最好的模型仅达到 57.6% 的准确率。此外，模型规模并不能始终转化为卓越的性能，甚至高级推理模型也难以识别基本的因果关系。这些发现强调了当前法学硕士的能力与高风险应用中可靠因果推理的需求之间的关键差距。

Title: LAD-RAG: Layout-aware Dynamic RAG for Visually-Rich Document Understanding

Authors: Zhivar Sourati, Zheng Wang, Marianne Menglin Liu, Yazhe Hu, Mengqing Guo, Sujeeth Bharadwaj, Kyu Han, Tao Sheng, Sujith Ravi, Morteza Dehghani, Dan Roth
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.07233
Pdf URL: https://arxiv.org/pdf/2510.07233
Copy Paste: [[2510.07233]] LAD-RAG: Layout-aware Dynamic RAG for Visually-Rich Document Understanding(https://arxiv.org/abs/2510.07233)
Keywords: llm, retrieval-augmented generation, agent
Abstract: Question answering over visually rich documents (VRDs) requires reasoning not only over isolated content but also over documents' structural organization and cross-page dependencies. However, conventional retrieval-augmented generation (RAG) methods encode content in isolated chunks during ingestion, losing structural and cross-page dependencies, and retrieve a fixed number of pages at inference, regardless of the specific demands of the question or context. This often results in incomplete evidence retrieval and degraded answer quality for multi-page reasoning tasks. To address these limitations, we propose LAD-RAG, a novel Layout-Aware Dynamic RAG framework. During ingestion, LAD-RAG constructs a symbolic document graph that captures layout structure and cross-page dependencies, adding it alongside standard neural embeddings to yield a more holistic representation of the document. During inference, an LLM agent dynamically interacts with the neural and symbolic indices to adaptively retrieve the necessary evidence based on the query. Experiments on MMLongBench-Doc, LongDocURL, DUDE, and MP-DocVQA demonstrate that LAD-RAG improves retrieval, achieving over 90% perfect recall on average without any top-k tuning, and outperforming baseline retrievers by up to 20% in recall at comparable noise levels, yielding higher QA accuracy with minimal latency.
摘要：针对视觉丰富的文档 (VRD) 进行问答不仅需要对孤立的内容进行推理，还需要对文档的结构组织和跨页面依赖性进行推理。然而，传统的检索增强生成（RAG）方法在摄取过程中将内容编码在孤立的块中，丢失了结构和跨页面依赖性，并在推理时检索固定数量的页面，而不管问题或上下文的具体要求如何。这通常会导致证据检索不完整，并降低多页推理任务的答案质量。为了解决这些限制，我们提出了 LAD-RAG，一种新颖的布局感知动态 RAG 框架。在摄取过程中，LAD-RAG 构建了一个符号文档图，用于捕获布局结构和跨页面依赖性，并将其与标准神经嵌入一起添加，以生成更全面的文档表示。在推理过程中，LLM 代理动态地与神经索引和符号索引交互，以根据查询自适应地检索必要的证据。 MMLongBench-Doc、LongDocURL、DUDE 和 MP-DocVQA 上的实验表明，LAD-RAG 改进了检索，在没有任何 top-k 调整的情况下平均实现了超过 90% 的完美召回率，并且在可比噪声水平下的召回率比基线检索器高出 20%，从而以最小的延迟产生更高的 QA 准确性。

Title: When Benchmarks Age: Temporal Misalignment through Large Language Model Factuality Evaluation

Authors: Xunyi Jiang, Dingyi Chang, Julian McAuley, Xin Xu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.07238
Pdf URL: https://arxiv.org/pdf/2510.07238
Copy Paste: [[2510.07238]] When Benchmarks Age: Temporal Misalignment through Large Language Model Factuality Evaluation(https://arxiv.org/abs/2510.07238)
Keywords: language model, llm
Abstract: The rapid evolution of large language models (LLMs) and the real world has outpaced the static nature of widely used evaluation benchmarks, raising concerns about their reliability for evaluating LLM factuality. While a substantial body of work continues to rely on popular but old benchmarks, their temporal misalignment with real-world facts and modern LLMs, and its effects on LLM factuality evaluation, remain underexplored. Therefore, in this work, we present a systematic investigation of this issue by examining five popular factuality benchmarks and eight LLMs released across different years. An up-to-date fact retrieval pipeline and three metrics are tailored to quantify benchmark aging and its impact on LLM factuality evaluation. Experimental results and analysis illustrate that a considerable portion of samples in the widely used factuality benchmarks are outdated, leading to unreliable assessments of LLM factuality. We hope our work can provide a testbed to assess the reliability of a benchmark for LLM factuality evaluation and inspire more research on the benchmark aging issue. Code is available at this https URL.
摘要：大型语言模型 (LLM) 和现实世界的快速发展已经超越了广泛使用的评估基准的静态性质，引发了人们对其评估 LLM 真实性的可靠性的担忧。虽然实质性著作继续依赖流行但旧的基准，但它们与现实世界事实和现代法学硕士的时间不一致，以及它们对法学硕士事实性评估的影响仍未得到充分探索。因此，在这项工作中，我们通过研究五个流行的事实性基准和不同年份发布的八个法学硕士，对这个问题进行了系统的调查。最新的事实检索管道和三个指标专门用于量化基准老化及其对法学硕士事实性评估的影响。实验结果和分析表明，广泛使用的事实性基准中相当一部分样本已经过时，导致LLM事实性评估不可靠。我们希望我们的工作能够提供一个测试平台来评估LLM真实性评估基准的可靠性，并激发更多关于基准老化问题的研究。此 https URL 中提供了代码。

Title: Red-Bandit: Test-Time Adaptation for LLM Red-Teaming via Bandit-Guided LoRA Experts

Authors: Christos Ziakas, Nicholas Loo, Nishita Jain, Alessandra Russo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.07239
Pdf URL: https://arxiv.org/pdf/2510.07239
Copy Paste: [[2510.07239]] Red-Bandit: Test-Time Adaptation for LLM Red-Teaming via Bandit-Guided LoRA Experts(https://arxiv.org/abs/2510.07239)
Keywords: language model, llm, prompt
Abstract: Automated red-teaming has emerged as a scalable approach for auditing Large Language Models (LLMs) prior to deployment, yet existing approaches lack mechanisms to efficiently adapt to model-specific vulnerabilities at inference. We introduce Red-Bandit, a red-teaming framework that adapts online to identify and exploit model failure modes under distinct attack styles (e.g., manipulation, slang). Red-Bandit post-trains a set of parameter-efficient LoRA experts, each specialized for a particular attack style, using reinforcement learning that rewards the generation of unsafe prompts via a rule-based safety model. At inference, a multi-armed bandit policy dynamically selects among these attack-style experts based on the target model's response safety, balancing exploration and exploitation. Red-Bandit achieves state-of-the-art results on AdvBench under sufficient exploration (ASR@10), while producing more human-readable prompts (lower perplexity). Moreover, Red-Bandit's bandit policy serves as a diagnostic tool for uncovering model-specific vulnerabilities by indicating which attack styles most effectively elicit unsafe behaviors.
摘要：自动红队已成为一种在部署之前审核大型语言模型 (LLM) 的可扩展方法，但现有方法缺乏有效适应推理时特定于模型的漏洞的机制。我们引入了 Red-Bandit，这是一个红队框架，可在线适应以识别和利用不同攻击风格（例如操纵、俚语）下的模型故障模式。 Red-Bandit 对一组参数高效的 LoRA 专家进行后训练，每个专家专门针对特定的攻击风格，使用强化学习，通过基于规则的安全模型奖励不安全提示的生成。由此推断，多臂老虎机策略根据目标模型的响应安全性，在这些攻击型专家中动态选择，平衡探索和利用。 Red-Bandit 在充分的探索下 (ASR@10) 在 AdvBench 上取得了最先进的结果，同时产生了更多人类可读的提示（更低的困惑度）。此外，Red-Bandit 的强盗策略可作为诊断工具，通过指示哪些攻击方式最有效地引发不安全行为来发现特定于模型的漏洞。

Title: Hybrid Reinforcement: When Reward Is Sparse, It's Better to Be Dense

Authors: Leitian Tao, Ilia Kulikov, Swarnadeep Saha, Tianlu Wang, Jing Xu, Yixuan Li, Jason E Weston, Ping Yu
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2510.07242
Pdf URL: https://arxiv.org/pdf/2510.07242
Copy Paste: [[2510.07242]] Hybrid Reinforcement: When Reward Is Sparse, It's Better to Be Dense(https://arxiv.org/abs/2510.07242)
Keywords: language model, llm, prompt
Abstract: Post-training for reasoning of large language models (LLMs) increasingly relies on verifiable rewards: deterministic checkers that provide 0-1 correctness signals. While reliable, such binary feedback is brittle--many tasks admit partially correct or alternative answers that verifiers under-credit, and the resulting all-or-nothing supervision limits learning. Reward models offer richer, continuous feedback, which can serve as a complementary supervisory signal to verifiers. We introduce HERO (Hybrid Ensemble Reward Optimization), a reinforcement learning framework that integrates verifier signals with reward-model scores in a structured way. HERO employs stratified normalization to bound reward-model scores within verifier-defined groups, preserving correctness while refining quality distinctions, and variance-aware weighting to emphasize challenging prompts where dense signals matter most. Across diverse mathematical reasoning benchmarks, HERO consistently outperforms RM-only and verifier-only baselines, with strong gains on both verifiable and hard-to-verify tasks. Our results show that hybrid reward design retains the stability of verifiers while leveraging the nuance of reward models to advance reasoning.
摘要：大型语言模型 (LLM) 推理的后训练越来越依赖于可验证的奖励：提供 0-1 正确性信号的确定性检查器。虽然可靠，但这种二元反馈是脆弱的——许多任务承认部分正确或可供选择的答案，而验证者的信用不足，而由此产生的全有或全无的监督限制了学习。奖励模型提供更丰富、持续的反馈，可以作为验证者的补充监督信号。我们引入 HERO（混合集成奖励优化），这是一种强化学习框架，它以结构化方式将验证者信号与奖励模型分数集成在一起。 HERO 采用分层归一化将奖励模型分数限制在验证者定义的组内，在细化质量差异的同时保留正确性，并采用方差感知权重来强调密集信号最重要的具有挑战性的提示。在各种数学推理基准中，HERO 始终优于仅 RM 和仅验证者基线，在可验证和难以验证的任务上都有强劲的表现。我们的结果表明，混合奖励设计保留了验证者的稳定性，同时利用奖励模型的细微差别来推进推理。
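The stratified normalization idea in the HERO abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the min-max rescaling, the function name `stratified_rewards`, and the band half-widths `alpha` and `beta` are all assumptions chosen for illustration. The only property taken from the abstract is that reward-model scores are bounded within verifier-defined groups, so that correctness is preserved while quality differences inside each group are refined.

```python
from typing import List


def stratified_rewards(verifier: List[int], rm_scores: List[float],
                       alpha: float = 0.1, beta: float = 0.1) -> List[float]:
    """Rescale reward-model scores inside each verifier group.

    Hypothetical scheme (an assumption, not taken from the paper):
    responses the verifier rejects (0) land in [-alpha, alpha], and
    responses it accepts (1) land in [1 - beta, 1 + beta]. With
    alpha, beta < 0.5 every accepted response outranks every rejected one.
    """
    rewards = [0.0] * len(verifier)
    for label in (0, 1):
        idx = [i for i, v in enumerate(verifier) if v == label]
        if not idx:
            continue
        group = [rm_scores[i] for i in idx]
        lo, hi = min(group), max(group)
        span = hi - lo
        for i in idx:
            # Min-max normalize within the group; a constant group maps to the centre.
            unit = (rm_scores[i] - lo) / span if span > 0 else 0.5
            centred = 2.0 * unit - 1.0  # now in [-1, 1]
            rewards[i] = (1.0 + beta * centred) if label == 1 else (alpha * centred)
    return rewards
```

In this sketch the dense score can only reorder responses that share a verifier label, which is one simple way to keep the verifier's verdict dominant.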

Title: LeMAJ (Legal LLM-as-a-Judge): Bridging Legal Reasoning and LLM Evaluation

Authors: Joseph Enguehard, Morgane Van Ermengem, Kate Atkinson, Sujeong Cha, Arijit Ghosh Chowdhury, Prashanth Kallur Ramaswamy, Jeremy Roghair, Hannah R Marlowe, Carina Suzana Negreanu, Kitty Boxall, Diana Mincu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.07243
Pdf URL: https://arxiv.org/pdf/2510.07243
Copy Paste: [[2510.07243]] LeMAJ (Legal LLM-as-a-Judge): Bridging Legal Reasoning and LLM Evaluation(https://arxiv.org/abs/2510.07243)
Keywords: language model, llm
Abstract: Evaluating large language model (LLM) outputs in the legal domain presents unique challenges due to the complex and nuanced nature of legal analysis. Current evaluation approaches either depend on reference data, which is costly to produce, or use standardized assessment methods, both of which have significant limitations for legal applications. Although LLM-as-a-Judge has emerged as a promising evaluation technique, its reliability and effectiveness in legal contexts depend heavily on evaluation processes unique to the legal industry and how trustworthy the evaluation appears to the human legal expert. This is where existing evaluation methods currently fail and exhibit considerable variability. This paper aims to close the gap: a) we break down lengthy responses into 'Legal Data Points' (LDPs), self-contained units of information, and introduce a novel, reference-free evaluation methodology that reflects how lawyers evaluate legal answers; b) we demonstrate that our method outperforms a variety of baselines on both our proprietary dataset and an open-source dataset (LegalBench); c) we show how our method correlates more closely with human expert evaluations and helps improve inter-annotator agreement; and finally d) we open source our Legal Data Points for a subset of LegalBench used in our experiments, allowing the research community to replicate our results and advance research in this vital area of LLM evaluation on legal question-answering.
摘要：由于法律分析的复杂性和细致入微的性质，评估法律领域的大语言模型 (LLM) 输出提出了独特的挑战。目前的评估方法要么依赖于参考数据，而参考数据的制作成本很高，要么使用标准化的评估方法，这两种方法在法律应用上都有很大的局限性。尽管法学硕士作为法官已成为一种有前景的评估技术，但其在法律环境中的可靠性和有效性在很大程度上取决于法律行业独特的评估流程以及评估对人类法律专家的可信度。这就是现有评估方法目前失败的地方，并且表现出相当大的可变性。本文旨在缩小差距：a）我们将冗长的答复分解为“法律数据点”（LDP），即独立的信息单元，并引入一种新颖的、无参考的评估方法，反映律师如何评估法律答案； b）我们证明我们的方法在我们的专有数据集和开源数据集（LegalBench）上都优于各种基线； c）我们展示了我们的方法如何与人类专家评估更紧密地相关，并有助于提高注释者间的一致性；最后 d) 我们开源实验中使用的 LegalBench 子集的法律数据点，允许研究界复制我们的结果并推进法律问答法学硕士评估这一重要领域的研究。

Title: Don't Adapt Small Language Models for Tools; Adapt Tool Schemas to the Models

Authors: Jonggeun Lee, Woojung Song, Jongwook Han, Haesung Pyun, Yohan Jo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.07248
Pdf URL: https://arxiv.org/pdf/2510.07248
Copy Paste: [[2510.07248]] Don't Adapt Small Language Models for Tools; Adapt Tool Schemas to the Models(https://arxiv.org/abs/2510.07248)
Keywords: language model
Abstract: Small language models (SLMs) offer significant computational advantages for tool-augmented AI systems, yet they struggle with tool-use tasks, particularly in selecting appropriate tools and identifying correct parameters. A common failure mode is schema misalignment: models hallucinate plausible but non-existent tool names that reflect naming conventions internalized during pretraining but absent from the provided tool schema. Rather than forcing models to adapt to arbitrary schemas, we propose adapting schemas to align with models' pretrained knowledge. We introduce PA-Tool (Pretraining-Aligned Tool Schema Generation), a training-free method that leverages peakedness, a signal from contamination detection indicating pretraining familiarity, to automatically rename tool components. By generating multiple candidates and selecting those with highest output concentration across samples, PA-Tool identifies pretrain-aligned naming patterns. Experiments on MetaTool and RoTBench show improvements of up to 17 percentage points, with schema misalignment errors reduced by 80%. PA-Tool enables small models to approach state-of-the-art performance while maintaining computational efficiency for adaptation to new tools without retraining. Our work demonstrates that schema-level interventions can unlock the tool-use potential of resource-efficient models by adapting schemas to models rather than models to schemas.
摘要：小语言模型 (SLM) 为工具增强型人工智能系统提供了显着的计算优势，但它们在处理工具使用任务时遇到了困难，特别是在选择适当的工具和识别正确的参数方面。一种常见的失败模式是模式不一致：模型产生看似合理但不存在的工具名称，这些名称反映了预训练期间内化的命名约定，但在提供的工具模式中不存在。我们建议调整模式以与模型的预训练知识保持一致，而不是强迫模型适应任意模式。我们引入了 PA-Tool（预训练对齐工具模式生成），这是一种无需训练的方法，它利用峰值（来自污染检测的信号，指示训练前的熟悉程度）来自动重命名工具组件。通过生成多个候选者并选择样本中输出浓度最高的候选者，PA-Tool 可以识别训练前对齐的命名模式。 MetaTool 和 RoTBench 上的实验表明，性能提升高达 17%，架构错位错误减少了 80%。 PA-Tool 使小型模型能够达到最先进的性能，同时保持计算效率以适应新工具，而无需重新训练。我们的工作表明，模式级干预可以通过使模式适应模型而不是模型适应模式来释放资源高效模型的工具使用潜力。
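The candidate-selection step described in the PA-Tool abstract above (generate several candidate names, keep the one with the highest output concentration) can be sketched as follows. The function name `pick_peaked_name` and the use of the modal frequency as the concentration measure are assumptions for illustration; the abstract does not specify how peakedness is computed, and the sampled names would in practice come from a model rather than a hard-coded list.

```python
from collections import Counter
from typing import List, Tuple


def pick_peaked_name(samples: List[str]) -> Tuple[str, float]:
    """Return the most frequent candidate name and its share of the samples.

    The share (mode count / number of samples) is used here as a simple,
    hypothetical stand-in for the peakedness signal.
    """
    if not samples:
        raise ValueError("at least one sampled candidate is required")
    # Normalize trivially different spellings before counting.
    counts = Counter(s.strip().lower() for s in samples)
    name, top = counts.most_common(1)[0]
    return name, top / len(samples)
```

A share close to 1.0 would suggest the model converges on one name across samples, while a share near 1/len(samples) would suggest no strong preference.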

Title: Online Rubrics Elicitation from Pairwise Comparisons

Authors: MohammadHossein Rezaei, Robert Vacareanu, Zihao Wang, Clinton Wang, Yunzhong He, Afra Feyza Akyürek
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.07284
Pdf URL: https://arxiv.org/pdf/2510.07284
Copy Paste: [[2510.07284]] Online Rubrics Elicitation from Pairwise Comparisons(https://arxiv.org/abs/2510.07284)
Keywords: llm
Abstract: Rubrics provide a flexible way to train LLMs on open-ended long-form answers where verifiable rewards are not applicable and human preferences provide coarse signals. Prior work shows that reinforcement learning with rubric-based rewards leads to consistent gains in LLM post-training. Most existing approaches rely on rubrics that remain static over the course of training. Such static rubrics, however, are vulnerable to reward-hacking type behaviors and fail to capture emergent desiderata that arise during training. We introduce Online Rubrics Elicitation (OnlineRubrics), a method that dynamically curates evaluation criteria in an online manner through pairwise comparisons of responses from current and reference policies. This online process enables continuous identification and mitigation of errors as training proceeds. Empirically, this approach yields consistent improvements of up to 8% over training exclusively with static rubrics across AlpacaEval, GPQA, ArenaHard as well as the validation sets of expert questions and rubrics. We qualitatively analyze the elicited criteria and identify prominent themes such as transparency, practicality, organization, and reasoning.
摘要：评分标准提供了一种灵活的方法来培训法学硕士的开放式长格式答案，其中可验证的奖励不适用，并且人类偏好提供粗略信号。之前的工作表明，基于评分标准的强化学习可以在法学硕士培训后带来持续的收益。大多数现有方法依赖于在培训过程中保持不变的规则。然而，这种静态的规则很容易受到奖励黑客类型行为的影响，并且无法捕获训练期间出现的紧急需求。我们引入了在线评分标准启发（OnlineRubrics），这是一种通过对当前政策和参考政策的响应进行成对比较，以在线方式动态制定评估标准的方法。这个在线过程可以在培训过程中持续识别和减少错误。根据经验，与仅使用 AlpacaEval、GPQA、ArenaHard 的静态评分标准以及专家问题和评分标准的验证集进行训练相比，这种方法的持续改进高达 8%。我们定性分析所得出的标准，并确定透明度、实用性、组织和推理等突出主题。

Title: On the Convergence of Moral Self-Correction in Large Language Models

Authors: Guangliang Liu, Haitao Mao, Bochuan Cao, Zhiyu Xue, Xitong Zhang, Rongrong Wang, Kristen Marie Johnson
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2510.07290
Pdf URL: https://arxiv.org/pdf/2510.07290
Copy Paste: [[2510.07290]] On the Convergence of Moral Self-Correction in Large Language Models(https://arxiv.org/abs/2510.07290)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) are able to improve their responses when instructed to do so, a capability known as self-correction. When instructions provide only a general and abstract goal without specific details about potential issues in the response, LLMs must rely on their internal knowledge to improve response quality, a process referred to as intrinsic self-correction. The empirical success of intrinsic self-correction is evident in various applications, but how and why it is effective remains unknown. Focusing on moral self-correction in LLMs, we reveal a key characteristic of intrinsic self-correction: performance convergence through multi-round interactions; and provide a mechanistic analysis of this convergence behavior. Based on our experimental results and analysis, we uncover the underlying mechanism of convergence: consistently injected self-correction instructions activate moral concepts that reduce model uncertainty, leading to converged performance as the activated moral concepts stabilize over successive rounds. This paper demonstrates the strong potential of moral self-correction by showing that it exhibits a desirable property of converged performance.
摘要：大型语言模型 (LLM) 能够在收到指示时改进其响应，这种能力称为自我纠正。当说明仅提供一般和抽象的目标而没有有关响应中潜在问题的具体细节时，法学硕士必须依靠其内部知识来提高响应质量，这一过程称为内在自我纠正。内在自我修正的经验成功在各种应用中都是显而易见的，但它如何以及为何有效仍然未知。聚焦法学硕士的道德自我修正，我们揭示了内在自我修正的一个关键特征：通过多轮互动实现绩效收敛；并提供这种收敛行为的机制分析。根据我们的实验结果和分析，我们揭示了收敛的潜在机制：持续注入的自我校正指令激活道德概念，减少模型的不确定性，随着激活的道德概念在连续几轮中稳定下来，导致收敛性能。本文通过展示道德自我纠正表现出理想的聚合表现特性，展示了道德自我纠正的强大潜力。

Title: Agent Bain vs. Agent McKinsey: A New Text-to-SQL Benchmark for the Business Domain

Authors: Yue Li, Ran Tao, Derek Hommel, Yusuf Denizay Dönder, Sungyong Chang, David Mimno, Unso Eun Seo Jo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.07309
Pdf URL: https://arxiv.org/pdf/2510.07309
Copy Paste: [[2510.07309]] Agent Bain vs. Agent McKinsey: A New Text-to-SQL Benchmark for the Business Domain(https://arxiv.org/abs/2510.07309)
Keywords: llm, agent
Abstract: In the business domain, where data-driven decision making is crucial, text-to-SQL is fundamental for easy natural language access to structured data. While recent LLMs have achieved strong performance in code generation, existing text-to-SQL benchmarks remain focused on factual retrieval of past records. We introduce CORGI, a new benchmark specifically designed for real-world business contexts. CORGI is composed of synthetic databases inspired by enterprises such as Doordash, Airbnb, and Lululemon. It provides questions across four increasingly complex categories of business queries: descriptive, explanatory, predictive, and recommendational. This challenge calls for causal reasoning, temporal forecasting, and strategic recommendation, reflecting multi-level and multi-step agentic intelligence. We find that LLM performance drops on high-level questions, struggling to make accurate predictions and offer actionable plans. Based on execution success rate, the CORGI benchmark is about 21% more difficult than the BIRD benchmark. This highlights the gap between popular LLMs and the need for real-world business intelligence. We release a public dataset and evaluation framework, and a website for public submissions.
摘要：在数据驱动决策至关重要的业务领域，文本到 SQL 是轻松以自然语言访问结构化数据的基础。虽然最近的法学硕士在代码生成方面取得了出色的性能，但现有的文本到 SQL 基准测试仍然侧重于过去记录的事实检索。我们推出 CORGI，这是一个专为现实商业环境设计的新基准。 CORGI 由受 Doordash、Airbnb 和 Lululemon 等企业启发的合成数据库组成。它提供了四个日益复杂的业务查询类别的问题：描述性、解释性、预测性和推荐性。这一挑战需要因果推理、时间预测和战略推荐，反映多层次、多步骤的代理智能。我们发现法学硕士在高层次问题上的表现下降，难以做出准确的预测并提供可行的计划。根据执行成功率，CORGI 基准测试比 BIRD 基准测试难度高出约 21%。这凸显了流行的法学硕士与现实世界商业智能的需求之间的差距。我们发布了一个公共数据集和评估框架，以及一个供公众提交的网站。

Title: Vibe Checker: Aligning Code Evaluation with Human Preference

Authors: Ming Zhong, Xiang Zhou, Ting-Yun Chang, Qingze Wang, Nan Xu, Xiance Si, Dan Garrette, Shyam Upadhyay, Jeremiah Liu, Jiawei Han, Benoit Schillings, Jiao Sun
Subjects: cs.CL, cs.AI, cs.LG, cs.SE
Abstract URL: https://arxiv.org/abs/2510.07315
Pdf URL: https://arxiv.org/pdf/2510.07315
Copy Paste: [[2510.07315]] Vibe Checker: Aligning Code Evaluation with Human Preference(https://arxiv.org/abs/2510.07315)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have catalyzed vibe coding, where users leverage LLMs to generate and iteratively refine code through natural language interactions until it passes their vibe check. Vibe check is tied to real-world human preference and goes beyond functionality: the solution should feel right, read cleanly, preserve intent, and remain correct. However, current code evaluation remains anchored to pass@k and captures only functional correctness, overlooking the non-functional instructions that users routinely apply. In this paper, we hypothesize that instruction following is the missing piece underlying vibe check that represents human preference in coding besides functional correctness. To quantify models' code instruction following capabilities with measurable signals, we present VeriCode, a taxonomy of 30 verifiable code instructions together with corresponding deterministic verifiers. We use the taxonomy to augment established evaluation suites, resulting in Vibe Checker, a testbed to assess both code instruction following and functional correctness. Upon evaluating 31 leading LLMs, we show that even the strongest models struggle to comply with multiple instructions and exhibit clear functional regression. Most importantly, a composite score of functional correctness and instruction following correlates the best with human preference, with the latter emerging as the primary differentiator on real-world programming tasks. Our work identifies core factors of the vibe check, providing a concrete path for benchmarking and developing models that better align with user preferences in coding.
摘要：大型语言模型 (LLM) 促进了 Vivi 编码，用户利用 LLM 通过自然语言交互生成并迭代地完善代码，直到代码通过 Vivi 检查。 Vibe 检查与现实世界中人类的偏好相关，并且超越功能：解决方案应该感觉正确、阅读清晰、保持意图并保持正确。然而，当前的代码评估仍然以 pass@k 为基础，仅捕获功能正确性，而忽略了用户日常应用的非功能指令。在本文中，我们假设指令跟随是振动检查背后缺失的部分，除了功能正确性之外，它还代表了人类在编码中的偏好。为了通过可测量的信号量化模型的代码指令跟踪能力，我们提出了 VeriCode，它是 30 个可验证代码指令以及相应的确定性验证器的分类法。我们使用分类法来增强已建立的评估套件，从而产生了 Vibe Checker，这是一个用于评估代码指令遵循和功能正确性的测试平台。在评估 31 个领先的法学硕士后，我们发现即使是最强大的模型也很难遵守多个指令并表现出清晰的功能回归。最重要的是，功能正确性和指令遵循的综合得分与人类偏好最佳相关，而后者成为现实世界编程任务的主要区别因素。我们的工作确定了氛围检查的核心因素，为基准测试和开发更好地符合用户编码偏好的模型提供了具体路径。
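For readers unfamiliar with the pass@k metric that the Vibe Checker abstract above contrasts itself with, here is a minimal sketch of the widely used unbiased estimator, 1 - C(n - c, k) / C(n, k), where n samples are drawn per problem and c of them pass the tests. The abstract does not say which variant of pass@k the authors compute, so treating this estimator as theirs is an assumption.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples is correct.

    n: samples generated for the problem, c: samples that passed, k: budget.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k samples include a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

The estimate depends only on whether tests pass, which is why it captures functional correctness alone and says nothing about instruction following.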