2025-10-24

Title: DeBERTa-KC: A Transformer-Based Classifier for Knowledge Construction in Online Learning Discourse

Authors: Jindi Wang, Yidi Zhang, Zhaoxing Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.19858
Pdf URL: https://arxiv.org/pdf/2510.19858
Copy Paste: [[2510.19858]] DeBERTa-KC: A Transformer-Based Classifier for Knowledge Construction in Online Learning Discourse(https://arxiv.org/abs/2510.19858)
Keywords: language model
Abstract: This study presents DeBERTa-KC, a transformer-based model for automatic classification of knowledge construction (KC) levels in online science learning discourse. Using comments collected from four popular YouTube science channels (2022--2024), a balanced corpus of 20,000 manually annotated samples was created across four KC categories: \textit{nonKC}, \textit{Share}, \textit{Explore}, and \textit{Negotiate}. The proposed model extends DeBERTa-v3 with Focal Loss, Label Smoothing, and R-Drop regularization to address class imbalance and enhance generalization. A reproducible end-to-end pipeline was implemented, encompassing data extraction, annotation, preprocessing, training, and evaluation. Across 10-fold stratified cross-validation, DeBERTa-KC achieved a macro-F1 of $0.836 \pm 0.008$, significantly outperforming both classical and transformer baselines ($p<0.01$). Per-category results indicate strong sensitivity to higher-order epistemic engagement, particularly in \textit{Explore} and \textit{Negotiate} discourse. These findings demonstrate that large language models can effectively capture nuanced indicators of knowledge construction in informal digital learning environments, offering scalable, theory-informed approaches to discourse analysis and the development of automated tools for assessing epistemic engagement.
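
The abstract above names Focal Loss and Label Smoothing as two of the training components. As a minimal sketch of how the two ideas can be combined for a single sample, the function below applies the focal factor to a label-smoothed target. The way they are combined here, the defaults `gamma=2.0` and `smoothing=0.1`, and the function names are all assumptions for illustration; the abstract does not state the authors' formulation or hyperparameters, and R-Drop is not shown.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def smoothed_focal_loss(logits, target, gamma=2.0, smoothing=0.1):
    """Focal loss against a label-smoothed target, for one sample.

    gamma and smoothing are illustrative defaults, not values from the paper.
    """
    probs = softmax(logits)
    k = len(logits)
    # Label smoothing: spread `smoothing` of the target mass uniformly over all classes.
    soft = [smoothing / k + (1.0 - smoothing) * (1.0 if i == target else 0.0)
            for i in range(k)]
    # Focal factor (1 - p)^gamma shrinks the loss of terms predicted with high probability.
    return -sum(q * (1.0 - p) ** gamma * math.log(p) for q, p in zip(soft, probs))
```

With `gamma=0.0` and `smoothing=0.0` the function reduces to ordinary cross-entropy, which is a convenient sanity check.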

Title: An Evaluation of the Pedagogical Soundness and Usability of AI-Generated Lesson Plans Across Different Models and Prompt Frameworks in High-School Physics

Authors: Xincheng Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.19866
Pdf URL: https://arxiv.org/pdf/2510.19866
Copy Paste: [[2510.19866]] An Evaluation of the Pedagogical Soundness and Usability of AI-Generated Lesson Plans Across Different Models and Prompt Frameworks in High-School Physics(https://arxiv.org/abs/2510.19866)
Keywords: language model, gpt, hallucination, prompt, chat
Abstract: This study evaluates the pedagogical soundness and usability of AI-generated lesson plans across five leading large language models: ChatGPT (GPT-5), Claude Sonnet 4.5, Gemini 2.5 Flash, DeepSeek V3.2, and Grok 4. Beyond model choice, three structured prompt frameworks were tested: TAG (Task, Audience, Goal), RACE (Role, Audience, Context, Execution), and COSTAR (Context, Objective, Style, Tone, Audience, Response Format). Fifteen lesson plans were generated for a single high-school physics topic, The Electromagnetic Spectrum. The lesson plans were analyzed through four automated computational metrics: (1) readability and linguistic complexity, (2) factual accuracy and hallucination detection, (3) standards and curriculum alignment, and (4) cognitive demand of learning objectives. Results indicate that model selection exerted the strongest influence on linguistic accessibility, with DeepSeek producing the most readable teaching plan (FKGL = 8.64) and Claude generating the densest language (FKGL = 19.89). The prompt framework structure most strongly affected the factual accuracy and pedagogical completeness, with the RACE framework yielding the lowest hallucination index and the highest incidental alignment with NGSS curriculum standards. Across all models, the learning objectives in the fifteen lesson plans clustered at the Remember and Understand tiers of Bloom's taxonomy. There were limited higher-order verbs in the learning objectives extracted. Overall, the findings suggest that readability is significantly governed by model design, while instructional reliability and curricular alignment depend more on the prompt framework. The most effective configuration for lesson plans identified in the results was to combine a readability-optimized model with the RACE framework and an explicit checklist of physics concepts, curriculum standards, and higher-order objectives.
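
The FKGL values quoted in the abstract above refer to the Flesch-Kincaid Grade Level. A small sketch of the standard formula, computed from raw counts, is given below. How words, sentences, and syllables are counted is left to the caller, because the abstract does not say which tool the author used; the function name is hypothetical.

```python
def fkgl(words, sentences, syllables):
    """Flesch-Kincaid Grade Level from word, sentence, and syllable counts."""
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    # Standard formula: 0.39 * (words per sentence) + 11.8 * (syllables per word) - 15.59
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
```

For instance, a passage with 100 words, 5 sentences, and 150 syllables scores about 9.91, roughly a ninth- to tenth-grade reading level.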

Title: From Denoising to Refining: A Corrective Framework for Vision-Language Diffusion Model

Authors: Yatai Ji, Teng Wang, Yuying Ge, Zhiheng Liu, Sidi Yang, Ying Shan, Ping Luo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.19871
Pdf URL: https://arxiv.org/pdf/2510.19871
Copy Paste: [[2510.19871]] From Denoising to Refining: A Corrective Framework for Vision-Language Diffusion Model(https://arxiv.org/abs/2510.19871)
Keywords: hallucination
Abstract: Discrete diffusion models have emerged as a promising direction for vision-language tasks, offering bidirectional context modeling and theoretical parallelization. However, their practical application is severely hindered by a train-inference discrepancy, which leads to catastrophic error cascades: initial token errors during parallel decoding pollute the generation context, triggering a chain reaction of compounding errors and leading to syntactic errors and semantic hallucinations. To address this fundamental challenge, we reframe the generation process from passive denoising to active refining. We introduce ReDiff, a refining-enhanced diffusion framework that teaches the model to identify and correct its own errors. Our approach features a two-stage training process: first, we instill a foundational revision capability by training the model to revise synthetic errors; second, we implement a novel online self-correction loop where the model is explicitly trained to revise its own flawed drafts by learning from an expert's corrections. This mistake-driven learning endows the model with the crucial ability to revisit and refine its already generated output, effectively breaking the error cascade. Extensive experiments demonstrate that ReDiff significantly improves the coherence and factual accuracy of generated content, enabling stable and efficient parallel generation far superior to traditional denoising methods. Our codes and models are available at this https URL.

Title: Stream: Scaling up Mechanistic Interpretability to Long Context in LLMs via Sparse Attention

Authors: J Rosser, José Luis Redondo García, Gustavo Penha, Konstantina Palla, Hugues Bouchard
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.19875
Pdf URL: https://arxiv.org/pdf/2510.19875
Copy Paste: [[2510.19875]] Stream: Scaling up Mechanistic Interpretability to Long Context in LLMs via Sparse Attention(https://arxiv.org/abs/2510.19875)
Keywords: language model, llm, long context, chain-of-thought
Abstract: As Large Language Models (LLMs) scale to million-token contexts, traditional Mechanistic Interpretability techniques for analyzing attention scale quadratically with context length, demanding terabytes of memory beyond 100,000 tokens. We introduce Sparse Tracing, a novel technique that leverages dynamic sparse attention to efficiently analyze long context attention patterns. We present Stream, a compilable hierarchical pruning algorithm that estimates per-head sparse attention masks in near-linear time $O(T \log T)$ and linear space $O(T)$, enabling one-pass interpretability at scale. Stream performs a binary-search-style refinement to retain only the top-$k$ key blocks per query while preserving the model's next-token behavior. We apply Stream to long chain-of-thought reasoning traces and identify thought anchors while pruning 97-99\% of token interactions. On the RULER benchmark, Stream preserves critical retrieval paths while discarding 90-96\% of interactions and exposes layer-wise routes from the needle to output. Our method offers a practical drop-in tool for analyzing attention patterns and tracing information flow without terabytes of caches. By making long context interpretability feasible on consumer GPUs, Sparse Tracing helps democratize chain-of-thought monitoring. Code is available at this https URL.
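
The abstract above describes retaining only the top-$k$ key blocks per query. The sketch below illustrates that selection step alone, for one query, by scoring every key and ranking blocks by their peak score. It is not the Stream algorithm: it materializes all scores instead of using the hierarchical, near-linear search the authors describe, and the peak-score ranking rule and the function names are assumptions.

```python
def top_k_key_blocks(scores, block_size, k):
    """Indices of the k key blocks with the highest peak score for one query."""
    if block_size <= 0 or k <= 0:
        raise ValueError("block_size and k must be positive")
    n_blocks = (len(scores) + block_size - 1) // block_size
    # Peak attention score inside each contiguous block of keys.
    peaks = [max(scores[b * block_size:(b + 1) * block_size]) for b in range(n_blocks)]
    ranked = sorted(range(n_blocks), key=lambda b: peaks[b], reverse=True)
    return sorted(ranked[:k])


def block_mask(num_keys, block_size, kept_blocks):
    """Boolean mask over keys: True where the key's block was retained."""
    kept = set(kept_blocks)
    return [(i // block_size) in kept for i in range(num_keys)]
```

Keeping 2 of 4 blocks in this way discards half of the query-key interactions for that query; the pruning rates reported in the abstract correspond to keeping a far smaller fraction.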

Title: Automated HIV Screening on Dutch EHR with Large Language Models

Authors: Lang Zhou, Amrish Jhingoer, Yinghao Luo, Klaske Vliegenthart--Jongbloed, Carlijn Jordans, Ben Werkhoven, Tom Seinen, Erik van Mulligen, Casper Rokx, Yunlei Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.19879
Pdf URL: https://arxiv.org/pdf/2510.19879
Copy Paste: [[2510.19879]] Automated HIV Screening on Dutch EHR with Large Language Models(https://arxiv.org/abs/2510.19879)
Keywords: language model, llm
Abstract: Efficient screening and early diagnosis of HIV are critical for reducing onward transmission. Although large scale laboratory testing is not feasible, the widespread adoption of Electronic Health Records (EHRs) offers new opportunities to address this challenge. Existing research primarily focuses on applying machine learning methods to structured data, such as patient demographics, for improving HIV diagnosis. However, these approaches often overlook unstructured text data such as clinical notes, which potentially contain valuable information relevant to HIV risk. In this study, we propose a novel pipeline that leverages a Large Language Model (LLM) to analyze unstructured EHR text and determine a patient's eligibility for further HIV testing. Experimental results on clinical data from Erasmus University Medical Center Rotterdam demonstrate that our pipeline achieved high accuracy while maintaining a low false negative rate.

Title: An Expert-grounded benchmark of General Purpose LLMs in LCA

Authors: Artur Donaldson, Bharathan Balaji, Cajetan Oriekezie, Manish Kumar, Laure Patouillard
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.19886
Pdf URL: https://arxiv.org/pdf/2510.19886
Copy Paste: [[2510.19886]] An Expert-grounded benchmark of General Purpose LLMs in LCA(https://arxiv.org/abs/2510.19886)
Keywords: language model, llm, hallucination
Abstract: Purpose: Artificial intelligence (AI), and in particular large language models (LLMs), are increasingly being explored as tools to support life cycle assessment (LCA). While demonstrations exist across environmental and social domains, systematic evidence on their reliability, robustness, and usability remains limited. This study provides the first expert-grounded benchmark of LLMs in LCA, addressing the absence of standardized evaluation frameworks in a field where no clear ground truth or consensus protocols exist. Methods: We evaluated eleven general-purpose LLMs, spanning both commercial and open-source families, across 22 LCA-related tasks. Seventeen experienced practitioners reviewed model outputs against criteria directly relevant to LCA practice, including scientific accuracy, explanation quality, robustness, verifiability, and adherence to instructions. We collected 168 expert reviews. Results: Experts judged 37% of responses to contain inaccurate or misleading information. Accuracy and quality of explanation were generally rated average or good for many models, including smaller ones, and format adherence was generally rated favourably. Hallucination rates varied significantly, with some models producing hallucinated citations at rates of up to 40%. There was no clear-cut distinction between ratings on open-weight versus closed-weight LLMs, with open-weight models outperforming or competing on par with closed-weight models on criteria such as accuracy and quality of explanation. Conclusion: These findings highlight the risks of applying LLMs naïvely in LCA, such as when LLMs are treated as free-form oracles, while also showing benefits especially around quality of explanation and alleviating labour intensiveness of simple tasks. The use of general-purpose LLMs without grounding mechanisms presents ...

Title: Can They Dixit? Yes they Can! Dixit as a Playground for Multimodal Language Model Capabilities

Authors: Nishant Balepur, Dang Nguyen, Dayeon Ki
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.19892
Pdf URL: https://arxiv.org/pdf/2510.19892
Copy Paste: [[2510.19892]] Can They Dixit? Yes they Can! Dixit as a Playground for Multimodal Language Model Capabilities(https://arxiv.org/abs/2510.19892)
Keywords: language model, agent
Abstract: Multi-modal large language models (MLMs) are often assessed on static, individual benchmarks -- which cannot jointly assess MLM capabilities in a single task -- or through human or model pairwise comparisons -- which are highly subjective and expensive, and allow models to exploit superficial shortcuts (e.g., verbosity) to inflate their win-rates. To overcome these issues, we propose game-based evaluations to holistically assess MLM capabilities. Games require multiple abilities for players to win, are inherently competitive, are governed by fixed, objective rules, and make evaluation more engaging, providing a robust framework to address the aforementioned challenges. We manifest this evaluation specifically through Dixit, a fantasy card game where players must generate captions for a card that trick some, but not all, players into selecting the played card. Our quantitative experiments with five MLMs show Dixit win-rate rankings are perfectly correlated with those on popular MLM benchmarks, while games between human and MLM players in Dixit reveal several differences between agent strategies and areas of improvement for MLM reasoning.

Title: Large Language Model enabled Mathematical Modeling

Authors: Guoyun Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.19895
Pdf URL: https://arxiv.org/pdf/2510.19895
Copy Paste: [[2510.19895]] Large Language Model enabled Mathematical Modeling(https://arxiv.org/abs/2510.19895)
Keywords: language model, gpt, llm, hallucination, agent
Abstract: The integration of Large Language Models (LLMs) with optimization modeling offers a promising avenue for advancing decision-making in operations research (OR). Traditional optimization methods, such as linear programming, mixed integer programming, and simulation, depend heavily on domain expertise to translate real-world problems into solvable mathematical models. While solvers like Gurobi and COPT are powerful, expert input remains essential for defining objectives, constraints, and variables. This research investigates the potential of LLMs, specifically the DeepSeek-R1 model, to bridge this formulation gap using natural language understanding and code generation. Although prior models like GPT-4, Claude, and Bard have shown strong performance in NLP and reasoning tasks, their high token costs and tendency toward hallucinations limit real-world applicability in supply chain contexts. In contrast, DeepSeek-R1, a cost-efficient and high-performing model trained with reinforcement learning, presents a viable alternative. Despite its success in benchmarks such as LiveCodeBench and Math-500, its effectiveness in applied OR scenarios remains underexplored. This study systematically evaluates DeepSeek-R1 across four key OR benchmarks: NL4OPT, IndustryOR, EasyLP, and ComplexOR. Our methodology includes baseline assessments, the development of a hallucination taxonomy, and the application of mitigation strategies like LLM-as-a-Judge, Few-shot Learning (FSL), Tool Calling, and a Multi-agent Framework. These techniques aim to reduce hallucinations, enhance formulation accuracy, and better align model outputs with user intent.

Title: Learning from Supervision with Semantic and Episodic Memory: A Reflective Approach to Agent Adaptation

Authors: Jackson Hassell, Dan Zhang, Hannah Kim, Tom Mitchell, Estevam Hruschka
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.19897
Pdf URL: https://arxiv.org/pdf/2510.19897
Copy Paste: [[2510.19897]] Learning from Supervision with Semantic and Episodic Memory: A Reflective Approach to Agent Adaptation(https://arxiv.org/abs/2510.19897)
Keywords: language model, llm, agent
Abstract: We investigate how agents built on pretrained large language models can learn target classification functions from labeled examples without parameter updates. While conventional approaches like fine-tuning are often costly, inflexible, and opaque, we propose a memory-augmented framework that leverages both labeled data and LLM-generated critiques. Our framework uses episodic memory to store instance-level critiques (capturing specific past experiences) and semantic memory to distill these into reusable, task-level guidance. Across a diverse set of tasks, incorporating critiques yields up to a 24.8 percent accuracy improvement over retrieval-based (RAG-style) baselines that rely only on labels. Through extensive empirical evaluation, we uncover distinct behavioral differences between OpenAI and open-source models, particularly in how they handle fact-oriented versus preference-based data. To interpret how models respond to different representations of supervision encoded in memory, we introduce a novel metric, suggestibility. This helps explain observed behaviors and illuminates how model characteristics and memory strategies jointly shape learning dynamics. Our findings highlight the promise of memory-driven, reflective learning for building more adaptive and interpretable LLM agents.

Title: LLM-Augmented Symbolic NLU System for More Reliable Continuous Causal Statement Interpretation

Authors: Xin Lian, Kenneth D. Forbus
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.19988
Pdf URL: https://arxiv.org/pdf/2510.19988
Copy Paste: [[2510.19988]] LLM-Augmented Symbolic NLU System for More Reliable Continuous Causal Statement Interpretation(https://arxiv.org/abs/2510.19988)
Keywords: language model, llm, hallucination
Abstract: Despite the broad applicability of large language models (LLMs), their reliance on probabilistic inference makes them vulnerable to errors such as hallucination in generated facts and inconsistent output structure in natural language understanding (NLU) tasks. By contrast, symbolic NLU systems provide interpretable understanding grounded in curated lexicons, semantic resources, and syntactic & semantic interpretation rules. They produce relational representations that can be used for accurate reasoning and planning, as well as incremental debuggable learning. However, symbolic NLU systems tend to be more limited in coverage than LLMs and require scarce knowledge representation and linguistics skills to extend and maintain. This paper explores a hybrid approach that integrates the broad-coverage language processing of LLMs with the symbolic NLU capabilities of producing structured relational representations to hopefully get the best of both approaches. We use LLMs for rephrasing and text simplification, to provide broad coverage, and as a source of information to fill in knowledge gaps more automatically. We use symbolic NLU to produce representations that can be used for reasoning and for incremental learning. We evaluate this approach on the task of extracting and interpreting quantities and causal laws from commonsense science texts, along with symbolic- and LLM-only pipelines. Our results suggest that our hybrid method works significantly better than the symbolic-only pipeline.

Title: Beyond MedQA: Towards Real-world Clinical Decision Making in the Era of LLMs

Authors: Yunpeng Xiao, Carl Yang, Mark Mai, Xiao Hu, Kai Shu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.20001
Pdf URL: https://arxiv.org/pdf/2510.20001
Copy Paste: [[2510.20001]] Beyond MedQA: Towards Real-world Clinical Decision Making in the Era of LLMs(https://arxiv.org/abs/2510.20001)
Keywords: language model, llm
Abstract: Large language models (LLMs) show promise for clinical use. They are often evaluated using datasets such as MedQA. However, many medical datasets, such as MedQA, rely on simplified Question-Answering (Q&A) that underrepresents real-world clinical decision-making. Based on this, we propose a unifying paradigm that characterizes clinical decision-making tasks along two dimensions: Clinical Backgrounds and Clinical Questions. As the background and questions approach the real clinical environment, the difficulty increases. We summarize the settings of existing datasets and benchmarks along these two dimensions. Then we review methods to address clinical decision-making, including training-time and test-time techniques, and summarize when they help. Next, we extend evaluation beyond accuracy to include efficiency and explainability. Finally, we highlight open challenges. Our paradigm clarifies assumptions, standardizes comparisons, and guides the development of clinically meaningful LLMs.

Title: Improving Transfer Learning for Sequence Labeling Tasks by Adapting Pre-trained Neural Language Models

Authors: David Dukić
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.20033
Pdf URL: https://arxiv.org/pdf/2510.20033
Copy Paste: [[2510.20033]] Improving Transfer Learning for Sequence Labeling Tasks by Adapting Pre-trained Neural Language Models(https://arxiv.org/abs/2510.20033)
Keywords: language model
Abstract: This doctoral thesis improves the transfer learning for sequence labeling tasks by adapting pre-trained neural language models. The proposed improvements in transfer learning involve introducing a multi-task model that incorporates an additional signal, a method based on architectural modifications in autoregressive large language models, and a sequence labeling framework for autoregressive large language models utilizing supervised in-context fine-tuning combined with response-oriented adaptation strategies. The first improvement is given in the context of domain transfer for the event trigger detection task. The domain transfer of the event trigger detection task can be improved by incorporating an additional signal obtained from a domain-independent text processing system into a multi-task model. The second improvement involves modifying the model's architecture. For that purpose, a method is proposed to enable bidirectional information flow across layers of autoregressive large language models. The third improvement utilizes autoregressive large language models as text generators through a generative supervised in-context fine-tuning framework. The proposed model, method, and framework demonstrate that pre-trained neural language models achieve their best performance on sequence labeling tasks when adapted through targeted transfer learning paradigms.

Title: ToolScope: Enhancing LLM Agent Tool Use through Tool Merging and Context-Aware Filtering

Authors: Marianne Menglin Liu, Daniel Garcia, Fjona Parllaku, Vikas Upadhyay, Syed Fahad Allam Shah, Dan Roth
Subjects: cs.CL, cs.SE
Abstract URL: https://arxiv.org/abs/2510.20036
Pdf URL: https://arxiv.org/pdf/2510.20036
Copy Paste: [[2510.20036]] ToolScope: Enhancing LLM Agent Tool Use through Tool Merging and Context-Aware Filtering(https://arxiv.org/abs/2510.20036)
Keywords: language model, llm, agent
Abstract: Large language model (LLM) agents rely on external tools to solve complex tasks, but real-world toolsets often contain redundant tools with overlapping names and descriptions, introducing ambiguity and reducing selection accuracy. LLMs also face strict input context limits, preventing efficient consideration of large toolsets. To address these challenges, we propose ToolScope, which includes: (1) ToolScopeMerger with Auto-Correction to automatically audit and fix tool merges, reducing redundancy, and (2) ToolScopeRetriever to rank and select only the most relevant tools for each query, compressing toolsets to fit within context limits without sacrificing accuracy. Evaluations on three state-of-the-art LLMs and three open-source tool-use benchmarks show gains of 8.38% to 38.6% in tool selection accuracy, demonstrating ToolScope's effectiveness in enhancing LLM tool use.

Title: From Facts to Folklore: Evaluating Large Language Models on Bengali Cultural Knowledge

Authors: Nafis Chowdhury, Moinul Haque, Anika Ahmed, Nazia Tasnim, Md. Istiak Hossain Shihab, Sajjadur Rahman, Farig Sadeque
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2510.20043
Pdf URL: https://arxiv.org/pdf/2510.20043
Copy Paste: [[2510.20043]] From Facts to Folklore: Evaluating Large Language Models on Bengali Cultural Knowledge(https://arxiv.org/abs/2510.20043)
Keywords: language model, llm
Abstract: Recent progress in NLP research has demonstrated remarkable capabilities of large language models (LLMs) across a wide range of tasks. While recent multilingual benchmarks have advanced cultural evaluation for LLMs, critical gaps remain in capturing the nuances of low-resource cultures. Our work addresses these limitations through a Bengali Language Cultural Knowledge (BLanCK) dataset including folk traditions, culinary arts, and regional dialects. Our investigation of several multilingual language models shows that while these models perform well in non-cultural categories, they struggle significantly with cultural knowledge. Performance improves substantially across all models when context is provided, emphasizing the need for context-aware architectures and culturally curated training data.

Title: Enhancing Reasoning Skills in Small Persian Medical Language Models Can Outperform Large-Scale Data Training

Authors: Mehrdad Ghassabi, Sadra Hakim, Hamidreza Baradaran Kashani, Pedram Rostami
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.20059
Pdf URL: https://arxiv.org/pdf/2510.20059
Copy Paste: [[2510.20059]] Enhancing Reasoning Skills in Small Persian Medical Language Models Can Outperform Large-Scale Data Training(https://arxiv.org/abs/2510.20059)
Keywords: language model, prompt, chain-of-thought
Abstract: Enhancing reasoning capabilities in small language models is critical for specialized applications such as medical question answering, particularly in underrepresented languages like Persian. In this study, we employ Reinforcement Learning with AI Feedback (RLAIF) and Direct Preference Optimization (DPO) to improve the reasoning skills of a general-purpose Persian language model. To achieve this, we translated a multiple-choice medical question-answering dataset into Persian and used RLAIF to generate rejected-preferred answer pairs, which are essential for DPO training. By prompting both teacher and student models to produce Chain-of-Thought (CoT) reasoning responses, we compiled a dataset containing correct and incorrect reasoning trajectories. This dataset, comprising 2 million tokens in preferred answers and 2.5 million tokens in rejected ones, was used to train a baseline model, significantly enhancing its medical reasoning capabilities in Persian. Remarkably, the resulting model outperformed its predecessor, gaokerena-V, which was trained on approximately 57 million tokens, despite leveraging a much smaller dataset. These results highlight the efficiency and effectiveness of reasoning-focused training approaches in developing domain-specific language models with limited data availability.
摘要：增强小语言模型的推理能力对于医学问答等专业应用至关重要，特别是在波斯语等代表性不足的语言中。在这项研究中，我们采用人工智能反馈强化学习（RLAIF）和直接偏好优化（DPO）来提高通用波斯语模型的推理技能。为了实现这一目标，我们将一个多项选择医学问答数据集翻译成波斯语，并使用 RLAIF 生成“被拒绝/被偏好”答案对，这对于 DPO 训练至关重要。通过提示教师和学生模型产生思维链（CoT）推理响应，我们编译了一个包含正确和错误推理轨迹的数据集。该数据集的偏好答案包含 200 万个词元，被拒绝答案包含 250 万个词元，用于训练基线模型，显著增强了其波斯语医学推理能力。值得注意的是，尽管所用数据集小得多，所得模型的性能仍优于其前身 gaokerena-V，后者在大约 5700 万个词元上进行了训练。这些结果凸显了以推理为中心的训练方法在数据有限的情况下开发特定领域语言模型的效率和有效性。
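The preferred/rejected answer pairs described above are the input that DPO training consumes. As a minimal sketch of the standard per-pair DPO objective (not the authors' training code), the function below takes summed log-probabilities of each answer under the policy and under a frozen reference model; the function name, argument names, and the default `beta` are illustrative assumptions.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (preferred, rejected) answer pair.

    All inputs are sequence log-probabilities (floats). `beta` is a
    hypothetical default; the abstract does not state the value used.
    """
    # Implicit rewards: how much more the policy likes each answer
    # than the reference model does.
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_reward - rejected_reward)
    # -log(sigmoid(x)) = log(1 + exp(-x)), written in a numerically stable form.
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))
```

When the policy and the reference agree on both answers the loss equals log 2, and it falls below that as the policy shifts probability toward the preferred answer.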

Title: CreativityPrism: A Holistic Benchmark for Large Language Model Creativity

Authors: Zhaoyi Joey Hou, Bowei Alvin Zhang, Yining Lu, Bhiman Kumar Baghel, Anneliese Brei, Ximing Lu, Meng Jiang, Faeze Brahman, Snigdha Chaturvedi, Haw-Shiuan Chang, Daniel Khashabi, Xiang Lorraine Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.20091
Pdf URL: https://arxiv.org/pdf/2510.20091
Copy Paste: [[2510.20091]] CreativityPrism: A Holistic Benchmark for Large Language Model Creativity(https://arxiv.org/abs/2510.20091)
Keywords: language model, llm
Abstract: Creativity is often seen as a hallmark of human intelligence. While large language models (LLMs) are increasingly perceived as producing creative text, there is still no holistic framework to evaluate their creativity across diverse scenarios. Existing evaluation methods remain fragmented, with dramatic variation across domains and tasks, largely due to differing definitions and measurements of creativity. Inspired by the hypothesis that creativity is not one fixed idea, we propose CreativityPrism, an evaluation analysis framework that decomposes creativity into three dimensions: quality, novelty, and diversity. CreativityPrism incorporates nine tasks, three domains, i.e., divergent thinking, creative writing, and logical reasoning, and twenty evaluation metrics, which measure each dimension in task-specific, unique ways. We evaluate 17 state-of-the-art (SoTA) proprietary and open-sourced LLMs on CreativityPrism and analyze the performance correlations among different metrics and task domains. Our results reveal a notable gap between proprietary and open-source models. Overall, model performance tends to be highly correlated across tasks within the same domain and less so across different domains. Among evaluation dimensions, diversity and quality metrics show strong correlations - models that perform well on one often excel on the other - whereas novelty exhibits much weaker correlation with either. These findings support our hypothesis that strong performance in one creativity task or dimension does not necessarily generalize to others, underscoring the need for a holistic evaluation of LLM creativity.
摘要：创造力通常被视为人类智力的标志。尽管大型语言模型（LLM）越来越被认为可以产生创造性文本，但仍然没有整体框架来评估其在不同场景下的创造力。现有的评估方法仍然分散，不同领域和任务之间存在巨大差异，这主要是由于创造力的定义和衡量标准不同。受创造力不是一个固定观念这一假设的启发，我们提出了 CreativityPrism，一种评估分析框架，将创造力分解为三个维度：质量、新颖性和多样性。CreativityPrism 包含九个任务、三个领域（即发散思维、创意写作和逻辑推理）以及二十个评估指标，以特定于任务的独特方式衡量每个维度。我们在 CreativityPrism 上评估了 17 个最先进 (SoTA) 的专有和开源 LLM，并分析了不同指标和任务领域之间的性能相关性。我们的结果揭示了专有模型和开源模型之间的显著差距。总体而言，模型性能往往在同一领域内的任务之间高度相关，而在不同领域之间则不太相关。在评估维度中，多样性和质量指标显示出很强的相关性（在一个维度上表现良好的模型通常在另一个维度上也表现出色），而新颖性与其中任何一个的相关性都弱得多。这些发现支持了我们的假设，即在一个创造力任务或维度上的出色表现并不一定会推广到其他任务或维度，这强调了对 LLM 创造力进行全面评估的必要性。

Title: Leveraging the Power of Large Language Models in Entity Linking via Adaptive Routing and Targeted Reasoning

Authors: Yajie Li, Albert Galimov, Mitra Datta Ganapaneni, Pujitha Thejaswi, De Meng, Priyanshu Kumar, Saloni Potdar
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.20098
Pdf URL: https://arxiv.org/pdf/2510.20098
Copy Paste: [[2510.20098]] Leveraging the Power of Large Language Models in Entity Linking via Adaptive Routing and Targeted Reasoning(https://arxiv.org/abs/2510.20098)
Keywords: language model, llm, prompt
Abstract: Entity Linking (EL) has traditionally relied on large annotated datasets and extensive model fine-tuning. While recent few-shot methods leverage large language models (LLMs) through prompting to reduce training requirements, they often suffer from inefficiencies due to expensive LLM-based reasoning. ARTER (Adaptive Routing and Targeted Entity Reasoning) presents a structured pipeline that achieves high performance without deep fine-tuning by strategically combining candidate generation, context-based scoring, adaptive routing, and selective reasoning. ARTER computes a small set of complementary signals (both embedding and LLM-based) over the retrieved candidates to categorize contextual mentions into easy and hard cases. The cases are then handled by a low-computational entity linker (e.g. ReFinED) and more expensive targeted LLM-based reasoning, respectively. On standard benchmarks, ARTER outperforms ReFinED by up to +4.47%, with an average gain of +2.53% on 5 out of 6 datasets, and performs comparably to pipelines using LLM-based reasoning for all mentions, while being twice as efficient in terms of the number of LLM tokens.
摘要：实体链接 (EL) 传统上依赖于大型注释数据集和广泛的模型微调。虽然最近的小样本方法通过提示来利用大型语言模型 (LLM) 以减少训练要求，但由于基于 LLM 的推理代价高昂，它们常常效率低下。ARTER（自适应路由和目标实体推理）提出了一种结构化管道，通过策略性地结合候选生成、基于上下文的评分、自适应路由和选择性推理，无需深度微调即可实现高性能。ARTER 在检索到的候选实体上计算一小组互补信号（包括基于嵌入的信号和基于 LLM 的信号），以将上下文提及分为简单情况和困难情况。然后，这两类情况分别由低计算量的实体链接器（例如 ReFinED）和更昂贵的基于 LLM 的目标推理来处理。在标准基准测试中，ARTER 的性能比 ReFinED 最多高出 +4.47%，在 6 个数据集中的 5 个上平均增益为 +2.53%，并且与对所有提及都使用基于 LLM 的推理的管道性能相当，而按 LLM 词元数量衡量，其效率是后者的两倍。
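The easy/hard routing step can be pictured with a small sketch. The abstract does not specify which signals ARTER computes or how they are combined, so the example below is purely an assumption: it uses a single hypothetical signal (the score margin between the top two candidates) and a hypothetical threshold to decide which mentions would go to the cheap linker and which to LLM-based reasoning.

```python
def route_mentions(mentions, margin_threshold=0.2):
    """Split mentions into easy and hard cases.

    `mentions` maps a mention string to {candidate_entity: score}.
    The margin signal and `margin_threshold` are illustrative
    assumptions, not ARTER's actual signals.
    """
    easy, hard = [], []
    for mention, scores in mentions.items():
        ranked = sorted(scores.values(), reverse=True)
        if not ranked:
            # No candidates retrieved: treat as a hard case.
            hard.append(mention)
            continue
        # A lone candidate counts as unambiguous.
        margin = ranked[0] - ranked[1] if len(ranked) > 1 else 1.0
        (easy if margin >= margin_threshold else hard).append(mention)
    return easy, hard


example = {
    "Paris": {"Paris_France": 0.9, "Paris_Texas": 0.3},
    "Jordan": {"Michael_Jordan": 0.55, "Jordan_country": 0.5},
}
easy_cases, hard_cases = route_mentions(example)
```

In this toy input, "Paris" has a clear winner and would be left to the low-cost linker, while "Jordan" is ambiguous and would be escalated.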

Title: BoundRL: Efficient Structured Text Segmentation through Reinforced Boundary Generation

Authors: Haoyuan Li, Zhengyuan Shen, Sullam Jeoung, Yueyan Chen, Jiayu Li, Qi Zhu, Shuai Wang, Vassilis Ioannidis, Huzefa Rangwala
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.20151
Pdf URL: https://arxiv.org/pdf/2510.20151
Copy Paste: [[2510.20151]] BoundRL: Efficient Structured Text Segmentation through Reinforced Boundary Generation(https://arxiv.org/abs/2510.20151)
Keywords: language model, llm, hallucination, prompt
Abstract: As structured texts become increasingly complex across diverse domains -- from technical reports to generative AI prompts -- the need for text segmentation into semantically meaningful components becomes critical. Such texts often contain elements beyond plain language, including tables, code snippets, and placeholders, which conventional sentence- or paragraph-level segmentation methods cannot handle effectively. To address this challenge, we propose BoundRL, a novel and efficient approach that jointly performs token-level text segmentation and label prediction for long structured texts. Instead of generating complete contents for each segment, it generates only a sequence of starting tokens and reconstructs the complete contents by locating these tokens within the original texts, thereby reducing inference costs by orders of magnitude and minimizing hallucination. To adapt the model for the output format, BoundRL performs reinforcement learning with verifiable rewards (RLVR) with a specifically designed reward that jointly optimizes document reconstruction fidelity and semantic alignment. To mitigate entropy collapse, it further constructs intermediate candidates by systematically perturbing a fraction of generated sequences of segments to create stepping stones toward higher-quality solutions. To demonstrate BoundRL's effectiveness on particularly challenging structured texts, we focus evaluation on complex prompts used for LLM applications. Experiments show that BoundRL enables small language models (1.7B parameters) to outperform few-shot prompting of much larger models. Moreover, RLVR with our designed reward yields significant improvements over supervised fine-tuning, and incorporating intermediate candidates further improves both performance and generalization.
摘要：随着结构化文本在不同领域（从技术报告到生成式人工智能提示）变得越来越复杂，将文本分割成具有语义意义的组件的需求变得至关重要。此类文本通常包含超出普通语言的元素，包括表格、代码片段和占位符，传统的句子或段落级分割方法无法有效处理这些元素。为了应对这一挑战，我们提出了 BoundRL，这是一种新颖且高效的方法，可以对长结构化文本联合执行词元级文本分割和标签预测。它不是为每个片段生成完整的内容，而是仅生成一系列起始词元，并通过在原始文本中定位这些词元来重建完整的内容，从而将推理成本降低几个数量级并最大限度地减少幻觉。为了使模型适应输出格式，BoundRL 使用可验证奖励强化学习（RLVR），并使用专门设计的奖励来共同优化文档重建保真度和语义对齐。为了减轻熵崩溃，它通过系统地扰动一部分已生成的片段序列来进一步构造中间候选，以创建迈向更高质量解决方案的垫脚石。为了证明 BoundRL 在特别具有挑战性的结构化文本上的有效性，我们将评估重点放在用于 LLM 应用的复杂提示上。实验表明，BoundRL 使小型语言模型（1.7B 参数）的性能优于规模大得多的模型的小样本提示。此外，采用我们设计的奖励的 RLVR 比监督微调产生了显著的改进，并且合并中间候选进一步提高了性能和泛化能力。
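The reconstruction idea, generating only each segment's starting tokens and recovering the full contents by locating them in the original text, can be sketched as follows. This is an illustrative approximation rather than BoundRL's implementation: it matches plain substrings instead of model tokens, and the function name, the `(label, start_snippet)` format, and the policy of skipping unmatched snippets are all assumptions.

```python
def reconstruct_segments(text, boundaries):
    """Rebuild labeled segments from predicted segment starts.

    `boundaries` is a list of (label, start_snippet) pairs in document
    order. Each snippet is searched for after the previous match, and a
    segment runs from its start to the start of the next segment.
    Text before the first matched snippet is not assigned to any segment.
    """
    positions = []
    cursor = 0
    for label, snippet in boundaries:
        idx = text.find(snippet, cursor)
        if idx == -1:
            # Hypothetical handling: ignore snippets not found verbatim.
            continue
        positions.append((label, idx))
        cursor = idx + len(snippet)

    segments = []
    for i, (label, start) in enumerate(positions):
        end = positions[i + 1][1] if i + 1 < len(positions) else len(text)
        segments.append((label, text[start:end]))
    return segments


prompt = "You are a helpful bot.\nRules: be brief.\nInput: {query}"
predicted = [("role", "You are"), ("rules", "Rules:"), ("input", "Input:")]
segments = reconstruct_segments(prompt, predicted)
```

Because every segment is a slice of the original string, the output cannot contain text that was not in the input, which is the property the abstract credits with minimizing hallucination.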

Title: Are Stereotypes Leading LLMs' Zero-Shot Stance Detection ?

Authors: Anthony Dubreuil, Antoine Gourru, Christine Largeron, Amine Trabelsi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.20154
Pdf URL: https://arxiv.org/pdf/2510.20154
Copy Paste: [[2510.20154]] Are Stereotypes Leading LLMs' Zero-Shot Stance Detection ?(https://arxiv.org/abs/2510.20154)
Keywords: language model, llm
Abstract: Large Language Models inherit stereotypes from their pretraining data, leading to biased behavior toward certain social groups in many Natural Language Processing tasks, such as hateful speech detection or sentiment analysis. Surprisingly, the evaluation of this kind of bias in stance detection methods has been largely overlooked by the community. Stance Detection involves labeling a statement as being against, in favor, or neutral towards a specific target and is among the most sensitive NLP tasks, as it often relates to political leanings. In this paper, we focus on the bias of Large Language Models when performing stance detection in a zero-shot setting. We automatically annotate posts in pre-existing stance detection datasets with two attributes: dialect or vernacular of a specific group and text complexity/readability, to investigate whether these attributes influence the model's stance detection decisions. Our results show that LLMs exhibit significant stereotypes in stance detection tasks, such as incorrectly associating pro-marijuana views with low text complexity and African American dialect with opposition to Donald Trump.
摘要：大型语言模型继承了预训练数据中的刻板印象，导致在许多自然语言处理任务（例如仇恨言论检测或情感分析）中对某些社会群体产生偏见行为。令人惊讶的是，对立场检测方法中这种偏见的评估在很大程度上被社区忽视了。立场检测涉及将某个陈述标记为对特定目标持反对、支持或中立态度，并且是最敏感的 NLP 任务之一，因为它通常与政治倾向相关。在本文中，我们重点关注大型语言模型在零样本设置中执行立场检测时的偏见。我们使用两个属性自动标注已有立场检测数据集中的帖子：特定群体的方言或白话，以及文本复杂性/可读性，以调查这些属性是否影响模型的立场检测决策。我们的结果表明，LLM 在立场检测任务中表现出明显的刻板印象，例如错误地将支持大麻的观点与低文本复杂性联系起来，以及将非裔美国人方言与反对唐纳德·特朗普联系起来。

Title: DeepWideSearch: Benchmarking Depth and Width in Agentic Information Seeking

Authors: Tian Lan, Bin Zhu, Qianghuai Jia, Junyang Ren, Haijun Li, Longyue Wang, Zhao Xu, Weihua Luo, Kaifu Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.20168
Pdf URL: https://arxiv.org/pdf/2510.20168
Copy Paste: [[2510.20168]] DeepWideSearch: Benchmarking Depth and Width in Agentic Information Seeking(https://arxiv.org/abs/2510.20168)
Keywords: agent
Abstract: Current search agents fundamentally lack the ability to simultaneously perform deep reasoning over multi-hop retrieval and wide-scale information collection, a critical deficiency for real-world applications like comprehensive market analysis and business development. To bridge this gap, we introduce DeepWideSearch, the first benchmark explicitly designed to evaluate how well agents integrate depth and width in information seeking. In DeepWideSearch, agents must process a large volume of data, each requiring deep reasoning over multi-hop retrieval paths. Specifically, we propose two methods to convert established datasets, resulting in a curated collection of 220 questions spanning 15 diverse domains. Extensive experiments demonstrate that even state-of-the-art agents achieve only 2.39% average success rate on DeepWideSearch, highlighting the substantial challenge of integrating depth and width search in information-seeking tasks. Furthermore, our error analysis reveals four failure modes: lack of reflection, overreliance on internal knowledge, insufficient retrieval, and context overflow, exposing key limitations in current agent architectures. We publicly release DeepWideSearch to catalyze future research on more capable and robust information-seeking agents.
摘要：当前的搜索智能体从根本上缺乏同时对多跳检索进行深度推理和大范围信息收集的能力，这对于综合市场分析和业务开发等实际应用来说是一个关键缺陷。为了弥补这一差距，我们引入了 DeepWideSearch，这是第一个明确设计用于评估智能体在信息搜索中整合深度和宽度的能力的基准。在 DeepWideSearch 中，智能体必须处理大量数据，每条数据都需要对多跳检索路径进行深度推理。具体来说，我们提出了两种方法来转换已有的数据集，从而生成涵盖 15 个不同领域的 220 个问题的精选集合。大量实验表明，即使是最先进的智能体在 DeepWideSearch 上的平均成功率也仅为 2.39%，这突显了在信息搜索任务中集成深度和宽度搜索的巨大挑战。此外，我们的错误分析揭示了四种失败模式：缺乏反思、过度依赖内部知识、检索不足以及上下文溢出，这些模式暴露了当前智能体架构中的关键限制。我们公开发布 DeepWideSearch，以促进未来对能力更强、更稳健的信息搜索智能体的研究。

Title: Mixture-of-Minds: Multi-Agent Reinforcement Learning for Table Understanding

Authors: Yuhang Zhou, Mingrui Zhang, Ke Li, Mingyi Wang, Qiao Liu, Qifei wang, Jiayi Liu, Fei Liu, Serena Li, Weiwi Li, Mingze Gao, Abhishek Kumar, Xiangjun Fan, Zhuokai Zhao, Lizhu Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.20176
Pdf URL: https://arxiv.org/pdf/2510.20176
Copy Paste: [[2510.20176]] Mixture-of-Minds: Multi-Agent Reinforcement Learning for Table Understanding(https://arxiv.org/abs/2510.20176)
Keywords: language model, llm, hallucination, agent
Abstract: Understanding and reasoning over tables is a critical capability for many real-world applications. Large language models (LLMs) have shown promise on this task, but current approaches remain limited. Fine-tuning based methods strengthen language reasoning; yet they are prone to arithmetic errors and hallucination. In contrast, tool-based methods enable precise table manipulation but rely on rigid schemas and lack semantic understanding. These complementary drawbacks highlight the need for approaches that integrate robust reasoning with reliable table processing. In this work, we propose Mixture-of-Minds, a multi-agent framework that decomposes table reasoning into three specialized roles: planning, coding, and answering. This design enables each agent to focus on a specific aspect of the task while leveraging code execution for precise table manipulation. Building on this workflow, we introduce a self-improvement training framework that employs Monte Carlo Tree Search (MCTS) rollouts to generate pseudo-gold trajectories and optimize agents with reinforcement learning (RL). Extensive experiments show that Mixture-of-Minds delivers substantial gains, reaching 62.13% on TableBench and surpassing OpenAI-o4-mini-high. These results demonstrate the promise of combining structured multi-agent workflows with RL to advance table understanding.
摘要：对表的理解和推理是许多实际应用程序的关键能力。大型语言模型（LLM）在这项任务上显示出了希望，但目前的方法仍然有限。基于微调的方法强化语言推理；然而他们很容易出现算术错误和幻觉。相比之下，基于工具的方法可以实现精确的表操作，但依赖于严格的模式并且缺乏语义理解。这些互补的缺点突出表明需要将稳健的推理与可靠的表处理相结合的方法。在这项工作中，我们提出了 Mixture-of-Minds，这是一个多智能体框架，它将表格推理分解为三个专门的角色：规划、编码和回答。这种设计使每个代理能够专注于任务的特定方面，同时利用代码执行来进行精确的表操作。在此工作流程的基础上，我们引入了一个自我改进训练框架，该框架采用蒙特卡罗树搜索（MCTS）推出来生成伪黄金轨迹并通过强化学习（RL）优化代理。大量实验表明，Mixture-of-Minds 带来了可观的收益，在 TableBench 上达到 62.13%，超过了 OpenAI-o4-mini-high。这些结果证明了将结构化多智能体工作流程与 RL 相结合以促进表理解的前景。

Title: Stuck in the Matrix: Probing Spatial Reasoning in Large Language Models

Authors: Maggie Bai, Ava Kim Cohen, Eleanor Koss, Charlie Lichtenbaum
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.20198
Pdf URL: https://arxiv.org/pdf/2510.20198
Copy Paste: [[2510.20198]] Stuck in the Matrix: Probing Spatial Reasoning in Large Language Models(https://arxiv.org/abs/2510.20198)
Keywords: language model, llm
Abstract: This paper explores the spatial reasoning capability of large language models (LLMs) over textual input through a suite of five tasks aimed at probing their spatial understanding and computational abilities. The models were tested on both fundamental spatial reasoning and multi-step problem-solving within structured grid-based environments using tasks such as quadrant identification, geometric transformations, distance evaluation, word searches, and tile sliding. Each task was scaled in complexity through increasing grid dimensions, requiring models to extend beyond simple pattern recognition into abstract spatial reasoning. Our results reveal that while LLMs demonstrate moderate success in all tasks with small complexity and size, performance drops off rapidly as scale increases, with an average loss in accuracy of 42.7%, and reaching as high as 84%. Every test that began with over 50% accuracy showed a loss of at least 48%, illustrating the consistent nature of the deterioration. Furthermore, their struggles with scaling complexity hint at a lack of robust spatial representations in their underlying architectures. This paper underscores the gap between linguistic and spatial reasoning in LLMs, offering insights into their current limitations, and laying the groundwork for future integrative benchmarks at the intersection of language and geometry.
摘要：本文通过一组旨在探查空间理解和计算能力的五项任务，探索了大型语言模型（LLM）对文本输入的空间推理能力。这些模型在基于结构化网格的环境中，使用象限识别、几何变换、距离评估、单词搜索和滑块移动等任务，在基本空间推理和多步骤问题解决上进行了测试。每项任务的复杂性都通过增加网格维度来扩展，要求模型从简单的模式识别扩展到抽象的空间推理。我们的结果表明，虽然 LLM 在所有复杂性和规模较小的任务中都取得了一定的成功，但随着规模的增加，性能迅速下降，平均准确率损失为 42.7%，最高可达 84%。每项以超过 50% 准确率开始的测试都显示出至少 48% 的损失，说明了性能恶化的一致性。此外，它们在应对不断增长的复杂性方面的困难暗示着其底层架构缺乏稳健的空间表示。本文强调了 LLM 中语言推理和空间推理之间的差距，深入分析了它们当前的局限性，并为未来语言和几何交叉领域的综合基准奠定了基础。

Title: Decoding-Free Sampling Strategies for LLM Marginalization

Authors: David Pohl, Marco Cognetta, Junyoung Lee, Naoaki Okazaki
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.20208
Pdf URL: https://arxiv.org/pdf/2510.20208
Copy Paste: [[2510.20208]] Decoding-Free Sampling Strategies for LLM Marginalization(https://arxiv.org/abs/2510.20208)
Keywords: language model, llm
Abstract: Modern language models operate on subword-tokenized text in order to make a trade-off between model size, inference speed, and vocabulary coverage. A side effect of this is that, during inference, models are evaluated by measuring the probability of only the specific tokenization produced as the output, despite there being many possible ways to represent the same text with a subword vocabulary. Recent studies have argued instead for evaluating LLMs by marginalization - the probability mass of all tokenizations of a given text. Marginalization is difficult due to the number of possible tokenizations of a text, so often approximate marginalization is done via sampling. However, a downside of sampling is that an expensive generation step must be performed by the LLM for each sample, which limits the number of samples that can be acquired given a runtime budget, and therefore also the accuracy of the approximation. Since computing the probability of a sequence given the tokenization is relatively cheap compared to actually generating it, we investigate sampling strategies that are decoding-free - they require no generation from the LLM, instead relying entirely on extremely cheap sampling strategies that are model and tokenizer agnostic. We investigate the approximation quality and speed of decoding-free sampling strategies for a number of open models to find that they provide sufficiently accurate marginal estimates at a small fraction of the runtime cost and demonstrate its use on a set of downstream inference tasks.
摘要：现代语言模型对子词标记化文本进行操作，以便在模型大小、推理速度和词汇覆盖率之间进行权衡。这样做的一个副作用是，在推理过程中，模型是通过仅测量作为输出生成的特定标记化的概率来评估的，尽管有许多可能的方法可以用子词词汇表表示相同的文本。最近的研究主张转而通过边缘化（给定文本的所有标记化的概率质量）来评估 LLM。由于文本可能的标记化数量众多，边缘化很困难，因此通常通过采样来完成近似边缘化。然而，采样的缺点是 LLM 必须对每个样本执行昂贵的生成步骤，这限制了给定运行时预算下可以获取的样本数量，因此也限制了近似的准确性。由于与实际生成序列相比，计算给定标记化的序列概率相对便宜，因此我们研究了免解码的采样策略 - 它们不需要从 LLM 生成，而是完全依赖于与模型和标记器无关的极其便宜的采样策略。我们研究了许多开放模型的免解码采样策略的近似质量和速度，发现它们以运行时成本的一小部分提供了足够准确的边际估计，并演示了其在一组下游推理任务中的使用。
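To make the marginalized quantity concrete, the sketch below enumerates every tokenization of a short string under a toy vocabulary and sums their probabilities. It illustrates the quantity being approximated, not the paper's decoding-free sampling strategies: the vocabulary is invented, and the context-free per-token scoring stands in for a real language model, which would condition each token on its prefix.

```python
import math


def tokenizations(text, vocab):
    """Return every way to split `text` into pieces drawn from `vocab`."""
    if not text:
        return [[]]
    results = []
    for i in range(1, len(text) + 1):
        piece = text[:i]
        if piece in vocab:
            for rest in tokenizations(text[i:], vocab):
                results.append([piece] + rest)
    return results


def marginal_log_prob(text, token_log_probs):
    """Log of the summed probability over all tokenizations of `text`.

    `token_log_probs` maps each vocabulary item to a log-probability;
    this unigram scorer is a stand-in assumption for an actual LLM.
    """
    scores = [sum(token_log_probs[t] for t in toks)
              for toks in tokenizations(text, token_log_probs)]
    if not scores:
        raise ValueError("text cannot be tokenized with this vocabulary")
    top = max(scores)
    # log-sum-exp keeps the sum stable for very small probabilities.
    return top + math.log(sum(math.exp(s - top) for s in scores))


toy_vocab = {"a": math.log(0.5), "b": math.log(0.25), "ab": math.log(0.25)}
marginal = math.exp(marginal_log_prob("ab", toy_vocab))
```

Exhaustive enumeration grows exponentially with text length, which is why the abstract turns to sampling-based approximation.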

Title: Context-level Language Modeling by Learning Predictive Context Embeddings

Authors: Beiya Dai, Yuliang Liu, Daozheng Xue, Qipeng Guo, Kai Chen, Xinbing Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.20280
Pdf URL: https://arxiv.org/pdf/2510.20280
Copy Paste: [[2510.20280]] Context-level Language Modeling by Learning Predictive Context Embeddings(https://arxiv.org/abs/2510.20280)
Keywords: language model, gpt, llm
Abstract: Next-token prediction (NTP) is the cornerstone of modern large language model (LLM) pretraining, driving their unprecedented capabilities in text generation, reasoning, and instruction following. However, the token-level prediction limits the model's capacity to capture higher-level semantic structures and long-range contextual relationships. To overcome this limitation, we introduce ContextLM, a framework that augments standard pretraining with an inherent next-context prediction objective. This mechanism trains the model to learn predictive representations of multi-token contexts, leveraging error signals derived from future token chunks. Crucially, ContextLM achieves this enhancement while remaining fully compatible with the standard autoregressive, token-by-token evaluation paradigm (e.g., perplexity). Extensive experiments on the GPT2 and Pythia model families, scaled up to 1.5B parameters, show that ContextLM delivers consistent improvements in both perplexity and downstream task performance. Our analysis indicates that next-context prediction provides a scalable and efficient pathway to stronger language modeling, yielding better long-range coherence and more effective attention allocation with minimal computational overhead.
摘要：下一词元预测 (NTP) 是现代大型语言模型 (LLM) 预训练的基石，推动其在文本生成、推理和指令遵循方面发挥前所未有的能力。然而，词元级别的预测限制了模型捕获更高级别语义结构和长程上下文关系的能力。为了克服这一限制，我们引入了 ContextLM，这是一个通过内在的下一上下文预测（next-context prediction）目标增强标准预训练的框架。该机制训练模型学习多词元上下文的预测表示，利用从未来词元块派生的误差信号。至关重要的是，ContextLM 实现了这种增强，同时保持与标准的自回归、逐词元评估范式（例如困惑度）完全兼容。在 GPT2 和 Pythia 模型系列上进行的广泛实验（扩展到 1.5B 参数）表明，ContextLM 在困惑度和下游任务性能方面都提供了一致的改进。我们的分析表明，下一上下文预测为更强的语言建模提供了一条可扩展且高效的途径，以最小的计算开销产生更好的长程连贯性和更有效的注意力分配。

Title: Citation Failure: Definition, Analysis and Efficient Mitigation

Authors: Jan Buchmann, Iryna Gurevych
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.20303
Pdf URL: https://arxiv.org/pdf/2510.20303
Copy Paste: [[2510.20303]] Citation Failure: Definition, Analysis and Efficient Mitigation(https://arxiv.org/abs/2510.20303)
Keywords: llm
Abstract: Citations from LLM-based RAG systems are supposed to simplify response verification. However, this does not hold for citation failure, when a model generates a helpful response, but fails to cite complete evidence. In contrast to previous work, we propose to disentangle this from response failure, where the response itself is flawed, and citing complete evidence is impossible. To address citation failure, this work follows a two-step approach: (1) We study when citation failure occurs and (2) how it can be mitigated. For step 1, we extend prior work by investigating how the relation between response and evidence affects citation quality. We introduce CITECONTROL, a benchmark that systematically varies this relation to analyze failure modes. Experiments show that failures increase with relational complexity and suggest that combining citation methods could improve performance, motivating step 2. To improve LLM citation efficiently, we propose CITENTION, a framework integrating generative, attention-based, and retrieval-based methods. Results demonstrate substantial citation improvements on CITECONTROL and in transfer settings. We make our data and code publicly available.
摘要：基于 LLM 的 RAG 系统的引用应该可以简化回复验证。然而，当模型生成有用的响应但未能引用完整的证据时，这不适用于引用失败。与之前的工作相比，我们建议将其与响应失败区分开来，即响应本身存在缺陷，并且不可能引用完整的证据。为了解决引用失败的问题，这项工作遵循两步方法：（1）我们研究引用失败何时发生，以及（2）如何缓解它。对于步骤 1，我们通过调查响应和证据之间的关系如何影响引文质量来扩展之前的工作。我们引入了 CITECONTROL，这是一个系统地改变这种关系以分析故障模式的基准。实验表明，失败随着关系复杂性的增加而增加，并表明结合引用方法可以提高性能，从而激励步骤 2。为了有效地提高 LLM 引用，我们提出了 CITENTION，一个集成生成、基于注意力和基于检索的方法的框架。结果表明 CITECONTROL 和转移环境中的引用量得到了显着改善。我们公开我们的数据和代码。

Title: Exploring Generative Process Reward Modeling for Semi-Structured Data: A Case Study of Table Question Answering

Authors: Lei Tang, Wei Zhou, Mohsen Mesgar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.20304
Pdf URL: https://arxiv.org/pdf/2510.20304
Copy Paste: [[2510.20304]] Exploring Generative Process Reward Modeling for Semi-Structured Data: A Case Study of Table Question Answering(https://arxiv.org/abs/2510.20304)
Keywords: language model, llm
Abstract: Process reward models (PRMs) improve complex reasoning in large language models (LLMs) by grading candidate solutions step-by-step and selecting answers via aggregated step scores. While effective in domains such as mathematics, their applicability to tasks involving semi-structured data, like table question answering (TQA) remains unexplored. TQA poses unique challenges for PRMs, including abundant irrelevant information, loosely connected reasoning steps, and domain-specific reasoning. This work presents the first systematic study of PRMs for TQA. We evaluate state-of-the-art generative PRMs on TQA from both answer and step perspectives. Results show that PRMs that combine textual and code verification can aid solution selection but struggle to generalize to out-of-domain data. Analysis reveals a weak correlation between performance in step-level verification and answer accuracy, possibly stemming from weak step dependencies and loose causal links. Our findings highlight limitations of current PRMs on TQA and offer valuable insights for building more robust, process-aware verifiers.
摘要：过程奖励模型 (PRM) 通过逐步对候选解决方案进行评分并通过聚合步骤分数选择答案来改进大型语言模型 (LLM) 中的复杂推理。虽然它们在数学等领域有效，但它们对涉及半结构化数据的任务（例如表格问答（TQA））的适用性仍有待探索。 TQA 给 PRM 带来了独特的挑战，包括大量的不相关信息、松散连接的推理步骤以及特定领域的推理。这项工作首次对 TQA 的 PRM 进行了系统研究。我们从答案和步骤的角度评估 TQA 上最先进的生成 PRM。结果表明，结合文本和代码验证的 PRM 可以帮助选择解决方案，但很难推广到域外数据。分析表明，步骤级验证的性能与答案准确性之间的相关性较弱，这可能源于较弱的步骤依赖性和松散的因果关系。我们的研究结果强调了当前 PRM 在 TQA 方面的局限性，并为构建更强大、流程感知的验证程序提供了宝贵的见解。

Title: Teaching Language Models to Reason with Tools

Authors: Chengpeng Li, Zhengyang Tang, Ziniu Li, Mingfeng Xue, Keqin Bao, Tian Ding, Ruoyu Sun, Benyou Wang, Xiang Wang, Junyang Lin, Dayiheng Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.20342
Pdf URL: https://arxiv.org/pdf/2510.20342
Copy Paste: [[2510.20342]] Teaching Language Models to Reason with Tools(https://arxiv.org/abs/2510.20342)
Keywords: language model
Abstract: Large reasoning models (LRMs) like OpenAI-o1 have shown impressive capabilities in natural language reasoning. However, these models frequently demonstrate inefficiencies or inaccuracies when tackling complex mathematical operations. While integrating computational tools such as Code Interpreters (CIs) offers a promising solution, it introduces a critical challenge: a conflict between the model's internal, probabilistic reasoning and the external, deterministic knowledge provided by the CI, which often leads models to unproductive deliberation. To overcome this, we introduce CoRT (Code-Optimized Reasoning Training), a post-training framework designed to teach LRMs to effectively utilize CIs. We propose Hint-Engineering, a new data synthesis strategy that strategically injects diverse hints at optimal points within reasoning paths. This approach generates high-quality, code-integrated reasoning data specifically tailored to optimize LRM-CI interaction. Using this method, we have synthesized 30 high-quality samples to post-train models ranging from 1.5B to 32B parameters through supervised fine-tuning. CoRT further refines the multi-round interleaving of external CI usage and internal thinking by employing rejection sampling and reinforcement learning. Our experimental evaluations demonstrate CoRT's effectiveness, yielding absolute improvements of 4% and 8% on DeepSeek-R1-Distill-Qwen-32B and DeepSeek-R1-Distill-Qwen-1.5B, respectively, across five challenging mathematical reasoning datasets. Moreover, CoRT significantly enhances efficiency, reducing token usage by approximately 30% for the 32B model and 50% for the 1.5B model compared to pure natural language reasoning baselines. The models and code are available at: this https URL.
摘要：像 OpenAI-o1 这样的大型推理模型 (LRM) 在自然语言推理方面表现出了令人印象深刻的能力。然而，这些模型在处理复杂的数学运算时经常表现出效率低下或不准确的情况。虽然集成代码解释器 (CI) 等计算工具提供了一个有前途的解决方案，但它也带来了一个关键的挑战：模型的内部概率推理与 CI 提供的外部确定性知识之间的冲突，这通常会导致模型进行无效的反复思考。为了克服这个问题，我们引入了 CoRT（代码优化推理训练），这是一种后训练框架，旨在教会 LRM 有效地利用 CI。我们提出了 Hint-Engineering，一种新的数据合成策略，可以策略性地在推理路径中的最佳点注入不同的提示。这种方法生成专门为优化 LRM-CI 交互而定制的高质量、代码集成推理数据。使用这种方法，我们合成了 30 个高质量样本，通过监督微调对参数规模从 1.5B 到 32B 的模型进行后训练。CoRT 通过采用拒绝采样和强化学习，进一步细化外部 CI 使用和内部思考的多轮交织。我们的实验评估证明了 CoRT 的有效性，在五个具有挑战性的数学推理数据集上，DeepSeek-R1-Distill-Qwen-32B 和 DeepSeek-R1-Distill-Qwen-1.5B 分别获得了 4% 和 8% 的绝对改进。此外，CoRT 显著提高了效率，与纯自然语言推理基线相比，32B 模型的词元使用量减少了约 30%，1.5B 模型的词元使用量减少了约 50%。模型和代码可在以下位置获取：this https URL。

Title: Evaluating Latent Knowledge of Public Tabular Datasets in Large Language Models

Authors: Matteo Silvestri, Flavio Giorgi, Fabrizio Silvestri, Gabriele Tolomei
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.20351
Pdf URL: https://arxiv.org/pdf/2510.20351
Copy Paste: [[2510.20351]] Evaluating Latent Knowledge of Public Tabular Datasets in Large Language Models(https://arxiv.org/abs/2510.20351)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) are increasingly evaluated on their ability to reason over structured data, yet such assessments often overlook a crucial confound: dataset contamination. In this work, we investigate whether LLMs exhibit prior knowledge of widely used tabular benchmarks such as Adult Income, Titanic, and others. Through a series of controlled probing experiments, we reveal that contamination effects emerge exclusively for datasets containing strong semantic cues, for instance, meaningful column names or interpretable value categories. In contrast, when such cues are removed or randomized, performance sharply declines to near-random levels. These findings suggest that LLMs' apparent competence on tabular reasoning tasks may, in part, reflect memorization of publicly available datasets rather than genuine generalization. We discuss implications for evaluation protocols and propose strategies to disentangle semantic leakage from authentic reasoning ability in future LLM assessments.
摘要：大型语言模型 (LLM) 越来越多地根据其对结构化数据进行推理的能力进行评估，但此类评估往往忽略了一个关键的混杂因素：数据集污染。在这项工作中，我们调查 LLM 是否表现出对广泛使用的表格基准（例如 Adult Income、Titanic 等）的先验知识。通过一系列受控探测实验，我们揭示了污染效应只出现在包含强语义线索的数据集上，例如有意义的列名称或可解释的值类别。相反，当这些线索被删除或随机化时，性能急剧下降到接近随机的水平。这些发现表明，LLM 在表格推理任务上的表面能力可能部分反映了对公开数据集的记忆，而不是真正的泛化。我们讨论了这对评估协议的影响，并提出了在未来 LLM 评估中将语义泄漏与真实推理能力区分开的策略。

Title: FreeChunker: A Cross-Granularity Chunking Framework

Authors: Wenxuan Zhang, Yuan-Hao Jiang, Yonghe Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.20356
Pdf URL: https://arxiv.org/pdf/2510.20356
Copy Paste: [[2510.20356]] FreeChunker: A Cross-Granularity Chunking Framework(https://arxiv.org/abs/2510.20356)
Keywords: retrieval-augmented generation
Abstract: Chunking strategies significantly impact the effectiveness of Retrieval-Augmented Generation (RAG) systems. Existing methods operate within fixed-granularity paradigms that rely on static boundary identification, limiting their adaptability to diverse query requirements. This paper presents FreeChunker, a Cross-Granularity Encoding Framework that fundamentally transforms the traditional chunking paradigm: the framework treats sentences as atomic units and shifts from static chunk segmentation to flexible retrieval supporting arbitrary sentence combinations. This paradigm shift not only significantly reduces the computational overhead required for semantic boundary detection but also enhances adaptability to complex queries. Experimental evaluation on LongBench V2 demonstrates that FreeChunker achieves superior retrieval performance compared to traditional chunking methods, while significantly outperforming existing approaches in computational efficiency.
摘要：分块策略显着影响检索增强生成（RAG）系统的有效性。现有方法在依赖于静态边界识别的固定粒度范例中运行，限制了它们对不同查询需求的适应性。本文提出了 FreeChunker，一种跨粒度编码框架，从根本上改变了传统的分块范式：该框架将句子视为原子单元，并从静态块分割转变为支持任意句子组合的灵活检索。这种范式转变不仅显着减少了语义边界检测所需的计算开销，而且增强了对复杂查询的适应性。 LongBench V2 上的实验评估表明，与传统分块方法相比，FreeChunker 实现了卓越的检索性能，同时在计算效率方面显着优于现有方法。

Title: Dialogue Is Not Enough to Make a Communicative BabyLM (But Neither Is Developmentally Inspired Reinforcement Learning)

Authors: Francesca Padovani, Bastian Bunzeck, Manar Ali, Omar Momen, Arianna Bisazza, Hendrik Buschmeier, Sina Zarrieß
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.20358
Pdf URL: https://arxiv.org/pdf/2510.20358
Copy Paste: [[2510.20358]] Dialogue Is Not Enough to Make a Communicative BabyLM (But Neither Is Developmentally Inspired Reinforcement Learning)(https://arxiv.org/abs/2510.20358)
Keywords: language model
Abstract: We investigate whether pre-training exclusively on dialogue data results in formally and functionally apt small language models. Based on this pre-trained llamalogue model, we employ a variety of fine-tuning strategies to enforce "more communicative" text generations by our models. Although our models underperform on most standard BabyLM benchmarks, they excel at dialogue continuation prediction in a minimal pair setting. While PPO fine-tuning has mixed to adversarial effects on our models, DPO fine-tuning further improves their performance on our custom dialogue benchmark.

Title: The Impact of Negated Text on Hallucination with Large Language Models

Authors: Jaehyung Seo, Hyeonseok Moon, Heuiseok Lim
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.20375
Pdf URL: https://arxiv.org/pdf/2510.20375
Copy Paste: [[2510.20375]] The Impact of Negated Text on Hallucination with Large Language Models(https://arxiv.org/abs/2510.20375)
Keywords: language model, llm, hallucination
Abstract: Recent studies on hallucination in large language models (LLMs) have been actively progressing in natural language processing. However, the impact of negated text on hallucination with LLMs remains largely unexplored. In this paper, we set three important yet unanswered research questions and aim to address them. To derive the answers, we investigate whether LLMs can recognize contextual shifts caused by negation and still reliably distinguish hallucinations comparable to affirmative cases. We also design the NegHalu dataset by reconstructing existing hallucination detection datasets with negated expressions. Our experiments demonstrate that LLMs struggle to detect hallucinations in negated text effectively, often producing logically inconsistent or unfaithful judgments. Moreover, we trace the internal state of LLMs as they process negated inputs at the token level and reveal the challenges of mitigating their unintended effects.

Title: Teacher Demonstrations in a BabyLM's Zone of Proximal Development for Contingent Multi-Turn Interaction

Authors: Suchir Salhan, Hongyi Gu, Donya Rooein, Diana Galvan-Sosa, Gabrielle Gaudeau, Andrew Caines, Zheng Yuan, Paula Buttery
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.20411
Pdf URL: https://arxiv.org/pdf/2510.20411
Copy Paste: [[2510.20411]] Teacher Demonstrations in a BabyLM's Zone of Proximal Development for Contingent Multi-Turn Interaction(https://arxiv.org/abs/2510.20411)
Keywords: prompt, chat
Abstract: Multi-turn dialogues between a child and a caregiver are characterized by a property called contingency - that is, prompt, direct, and meaningful exchanges between interlocutors. We introduce ContingentChat, a teacher-student framework that benchmarks and improves multi-turn contingency in a BabyLM trained on 100M words. Using a novel alignment dataset for post-training, BabyLM generates responses that are more grammatical and cohesive. Experiments with adaptive teacher decoding strategies show limited additional gains. ContingentChat demonstrates the benefits of targeted post-training for dialogue quality and indicates that contingency remains a challenging goal for BabyLMs.

Title: LM-mixup: Text Data Augmentation via Language Model based Mixup

Authors: Zhijie Deng, Zhouan Shen, Ling Li, Yao Zhou, Zhaowei Zhu, Yanji He, Wei Wang, Jiaheng Wei
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.20449
Pdf URL: https://arxiv.org/pdf/2510.20449
Copy Paste: [[2510.20449]] LM-mixup: Text Data Augmentation via Language Model based Mixup(https://arxiv.org/abs/2510.20449)
Keywords: language model, llm
Abstract: Instruction tuning is crucial for aligning Large Language Models (LLMs), yet the quality of instruction-following data varies significantly. While high-quality data is paramount, it is often scarce; conversely, abundant low-quality data is frequently discarded, leading to substantial information loss. Existing data augmentation methods struggle to augment this low-quality data effectively, and the evaluation of such techniques remains poorly defined. To address this, we formally define the task of Instruction Distillation: distilling multiple low-quality and redundant inputs into high-quality and coherent instruction-output pairs. Specifically, we introduce a comprehensive data construction pipeline to create MIXTURE, a 144K-sample dataset pairing low-quality or semantically redundant imperfect instruction clusters with their high-quality distillations. We then introduce LM-Mixup, by first performing supervised fine-tuning on MIXTURE and then optimizing it with reinforcement learning. This process uses three complementary reward signals: quality, semantic alignment, and format compliance, via Group Relative Policy Optimization (GRPO). We demonstrate that LM-Mixup effectively augments imperfect datasets: fine-tuning LLMs on its distilled data, which accounts for only about 3% of the entire dataset, not only surpasses full-dataset training but also competes with state-of-the-art high-quality data selection methods across multiple benchmarks. Our work establishes that low-quality data is a valuable resource when properly distilled and augmented with LM-Mixup, significantly enhancing the efficiency and performance of instruction-tuned LLMs.

Title: Systematic Evaluation of Uncertainty Estimation Methods in Large Language Models

Authors: Christian Hobelsberger, Theresa Winner, Andreas Nawroth, Oliver Mitevski, Anna-Carolina Haensch
Subjects: cs.CL, stat.AP, stat.ME
Abstract URL: https://arxiv.org/abs/2510.20460
Pdf URL: https://arxiv.org/pdf/2510.20460
Copy Paste: [[2510.20460]] Systematic Evaluation of Uncertainty Estimation Methods in Large Language Models(https://arxiv.org/abs/2510.20460)
Keywords: language model, llm
Abstract: Large language models (LLMs) produce outputs with varying levels of uncertainty and, just as often, varying levels of correctness, making their practical reliability far from guaranteed. To quantify this uncertainty, we systematically evaluate four approaches for confidence estimation in LLM outputs: VCE, MSP, Sample Consistency, and CoCoA (Vashurin et al., 2025). For the evaluation of the approaches, we conduct experiments on four question-answering tasks using a state-of-the-art open-source LLM. Our results show that each uncertainty metric captures a different facet of model confidence and that the hybrid CoCoA approach yields the best reliability overall, improving both calibration and discrimination of correct answers. We discuss the trade-offs of each method and provide recommendations for selecting uncertainty measures in LLM applications.

Title: Mask and You Shall Receive: Optimizing Masked Language Modeling For Pretraining BabyLMs

Authors: Lukas Edman, Alexander Fraser
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.20475
Pdf URL: https://arxiv.org/pdf/2510.20475
Copy Paste: [[2510.20475]] Mask and You Shall Receive: Optimizing Masked Language Modeling For Pretraining BabyLMs(https://arxiv.org/abs/2510.20475)
Keywords: language model
Abstract: We describe our strategy for the 2025 edition of the BabyLM Challenge. Our main contribution is that of an improved form of Masked Language Modeling (MLM), which adapts the probabilities of the tokens masked according to the model's ability to predict them. The results show a substantial increase in performance on (Super)GLUE tasks over the standard MLM. We also incorporate sub-token embeddings, finding that this increases the model's morphological generalization capabilities. Our submission beats the baseline in the strict-small track.
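
The adaptive masking idea in this abstract can be illustrated with a minimal sketch. The abstract does not give the update rule, so everything here is an assumption: the function name `masking_probabilities`, the use of per-token loss as the measure of how well the model predicts a token, and the choice to mask harder tokens more often.

```python
from typing import List


def masking_probabilities(token_losses: List[float], base_rate: float = 0.15) -> List[float]:
    """Scale each token's masking probability by its loss relative to the mean.

    Assumption (not from the abstract): tokens with a higher loss are masked
    more often. Before clipping, the average probability equals base_rate.
    """
    if not token_losses:
        return []
    mean_loss = sum(token_losses) / len(token_losses)
    if mean_loss == 0:
        # No signal about difficulty: fall back to uniform masking.
        return [base_rate] * len(token_losses)
    return [min(1.0, base_rate * loss / mean_loss) for loss in token_losses]


# A token with three times the loss of another gets three times the masking probability.
probs = masking_probabilities([0.5, 1.5])
print(probs)
```

With equal losses the sketch reduces to standard uniform masking at `base_rate`, which makes it easy to compare against the standard MLM baseline the abstract mentions.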

Title: RECALL: REpresentation-aligned Catastrophic-forgetting ALLeviation via Hierarchical Model Merging

Authors: Bowen Wang, Haiyuan Wan, Liwen Shi, Chen Yang, Peng He, Yue Ma, Haochen Han, Wenhao Li, Tiao Tan, Yongjian Li, Fangming Liu, Yifan Gong, Sheng Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.20479
Pdf URL: https://arxiv.org/pdf/2510.20479
Copy Paste: [[2510.20479]] RECALL: REpresentation-aligned Catastrophic-forgetting ALLeviation via Hierarchical Model Merging(https://arxiv.org/abs/2510.20479)
Keywords: language model, llm
Abstract: We unveil that internal representations in large language models (LLMs) serve as reliable proxies of learned knowledge, and propose RECALL, a novel representation-aware model merging framework for continual learning without access to historical data. RECALL computes inter-model similarity from layer-wise hidden representations over clustered typical samples, and performs adaptive, hierarchical parameter fusion to align knowledge across models. This design enables the preservation of domain-general features in shallow layers while allowing task-specific adaptation in deeper layers. Unlike prior methods that require task labels or incur performance trade-offs, RECALL achieves seamless multi-domain integration and strong resistance to catastrophic forgetting. Extensive experiments across five NLP tasks and multiple continual learning scenarios show that RECALL outperforms baselines in both knowledge retention and generalization, providing a scalable and data-free solution for evolving LLMs.

Title: Steering Evaluation-Aware Language Models To Act Like They Are Deployed

Authors: Tim Tian Hua, Andrew Qin, Samuel Marks, Neel Nanda
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.20487
Pdf URL: https://arxiv.org/pdf/2510.20487
Copy Paste: [[2510.20487]] Steering Evaluation-Aware Language Models To Act Like They Are Deployed(https://arxiv.org/abs/2510.20487)
Keywords: language model, llm
Abstract: Large language models (LLMs) can sometimes detect when they are being evaluated and adjust their behavior to appear more aligned, compromising the reliability of safety evaluations. In this paper, we show that adding a steering vector to an LLM's activations can suppress evaluation-awareness and make the model act like it is deployed during evaluation. To study our steering technique, we train an LLM to exhibit evaluation-aware behavior using a two-step training process designed to mimic how this behavior could emerge naturally. First, we perform continued pretraining on documents with factual descriptions of the model (1) using Python type hints during evaluation but not during deployment and (2) recognizing that the presence of a certain evaluation cue always means that it is being tested. Then, we train the model with expert iteration to use Python type hints in evaluation settings. The resulting model is evaluation-aware: it writes type hints in evaluation contexts more than deployment contexts. However, this gap can only be observed by removing the evaluation cue. We find that activation steering can suppress evaluation awareness and make the model act like it is deployed even when the cue is present. Importantly, we constructed our steering vector using the original model before our additional training. Our results suggest that AI evaluators could improve the reliability of safety evaluations by steering models to act like they are deployed.
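
As a rough illustration of what "adding a steering vector to an LLM's activations" can look like, the sketch below builds a vector as a difference of mean activations and adds it to a hidden state. The abstract does not say how the vector is constructed, so the difference-of-means recipe, the helper names, and the toy two-dimensional activations are all assumptions.

```python
from typing import List

Vector = List[float]


def mean_vector(rows: List[Vector]) -> Vector:
    """Column-wise mean of a list of equally sized activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_vector(deploy_acts: List[Vector], eval_acts: List[Vector]) -> Vector:
    """Assumed recipe: mean deployment activation minus mean evaluation activation."""
    deploy_mean = mean_vector(deploy_acts)
    eval_mean = mean_vector(eval_acts)
    return [d - e for d, e in zip(deploy_mean, eval_mean)]


def steer(hidden: Vector, vector: Vector, scale: float = 1.0) -> Vector:
    """Add the scaled steering vector to one hidden state."""
    return [h + scale * v for h, v in zip(hidden, vector)]


# Toy activations standing in for a real model's residual stream.
vec = steering_vector(
    deploy_acts=[[1.0, 0.0], [3.0, 2.0]],
    eval_acts=[[0.0, 1.0], [2.0, 1.0]],
)
print(vec)                               # [1.0, 0.0]
print(steer([0.5, 0.5], vec, scale=2.0))  # [2.5, 0.5]
```

In a real setting the activations would come from a chosen layer of the model, and `scale` would be tuned; both choices are outside what the abstract specifies.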

Title: Robust Preference Alignment via Directional Neighborhood Consensus

Authors: Ruochen Mao, Yuling Shi, Xiaodong Gu, Jiaheng Wei
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.20498
Pdf URL: https://arxiv.org/pdf/2510.20498
Copy Paste: [[2510.20498]] Robust Preference Alignment via Directional Neighborhood Consensus(https://arxiv.org/abs/2510.20498)
Keywords: language model, llm
Abstract: Aligning large language models with human preferences is critical for creating reliable and controllable AI systems. A human preference can be visualized as a high-dimensional vector where different directions represent trade-offs between desired attributes (e.g., helpfulness vs. verbosity). Yet, because the training data often reflects dominant, average preferences, LLMs tend to perform well on common requests but fall short in specific, individual needs. This mismatch creates a preference coverage gap. Existing methods often address this through costly retraining, which may not generalize to the full spectrum of diverse preferences. This brittleness means that when a user's request reflects a nuanced preference deviating from the training data's central tendency, model performance can degrade unpredictably. To address this challenge, we introduce Robust Preference Selection (RPS), a post-hoc, training-free method that leverages directional neighborhood consensus. Instead of forcing a model to generate a response from a single, highly specific preference, RPS samples multiple responses from a local neighborhood of related preferences to create a superior candidate pool. It then selects the response that best aligns with the user's original intent. We provide a theoretical framework showing our neighborhood generation strategy is provably superior to a strong baseline that also samples multiple candidates. Comprehensive experiments across three distinct alignment paradigms (DPA, DPO, and SFT) demonstrate that RPS consistently improves robustness against this baseline, achieving win rates of up to 69% on challenging preferences from under-represented regions of the space without any model retraining. Our work presents a practical, theoretically-grounded solution for enhancing the reliability of preference-aligned models.
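
The sample-then-select loop described in this abstract can be sketched in a few lines. This is only an illustration under stated assumptions: preferences are reduced to two-dimensional unit directions, the neighborhood is a uniform angular window, and `generate` and `score` are hypothetical callables standing in for the language model and the alignment scorer, none of which the abstract specifies.

```python
import math
import random
from typing import Callable, List, Tuple, TypeVar

Direction = Tuple[float, float]
Response = TypeVar("Response")


def neighborhood(target: Direction, k: int, max_angle: float,
                 rng: random.Random) -> List[Direction]:
    """Sample k unit directions within max_angle radians of the target direction."""
    base = math.atan2(target[1], target[0])
    dirs = []
    for _ in range(k):
        theta = base + rng.uniform(-max_angle, max_angle)
        dirs.append((math.cos(theta), math.sin(theta)))
    return dirs


def robust_select(target: Direction,
                  generate: Callable[[Direction], Response],
                  score: Callable[[Response, Direction], float],
                  k: int = 4, max_angle: float = 0.3, seed: int = 0) -> Response:
    """Generate one response per neighboring preference, then keep the one
    that scores best against the ORIGINAL target preference."""
    rng = random.Random(seed)
    candidates = [generate(d) for d in neighborhood(target, k, max_angle, rng)]
    return max(candidates, key=lambda r: score(r, target))


# Toy stand-ins: the "response" is just the direction it was generated from,
# and the score is its dot product with the target preference.
best = robust_select(
    target=(1.0, 0.0),
    generate=lambda d: d,
    score=lambda r, t: r[0] * t[0] + r[1] * t[1],
)
print(best)
```

The key design point carried over from the abstract is that candidates are produced from nearby preferences but always ranked against the user's original one.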

Title: Hierarchical Sequence Iteration for Heterogeneous Question Answering

Authors: Ruiyi Yang, Hao Xue, Imran Razzak, Hakim Hacid, Flora D. Salim
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.20505
Pdf URL: https://arxiv.org/pdf/2510.20505
Copy Paste: [[2510.20505]] Hierarchical Sequence Iteration for Heterogeneous Question Answering(https://arxiv.org/abs/2510.20505)
Keywords: retrieval-augmented generation, agent
Abstract: Retrieval-augmented generation (RAG) remains brittle on multi-step questions and heterogeneous evidence sources, trading accuracy against latency and token/tool budgets. This paper introduces Hierarchical Sequence (HSEQ) Iteration for Heterogeneous Question Answering, a unified framework that (i) linearizes documents, tables, and knowledge graphs into a reversible hierarchical sequence with lightweight structural tags, and (ii) performs structure-aware iteration to collect just-enough evidence before answer synthesis. A Head Agent provides guidance that leads retrieval, while an Iteration Agent selects and expands HSeq via structure-respecting actions (e.g., parent/child hops, table row/column neighbors, KG relations); finally, the Head Agent composes canonicalized evidence to generate the final answer, with an optional refinement loop to resolve detected contradictions. Experiments on HotpotQA (text), HybridQA/TAT-QA (table+text), and MetaQA (KG) show consistent EM/F1 gains over strong single-pass, multi-hop, and agentic RAG baselines with high efficiency. Besides, HSEQ exhibits three key advantages: (1) a format-agnostic unification that enables a single policy to operate across text, tables, and KGs without per-dataset specialization; (2) guided, budget-aware iteration that reduces unnecessary hops, tool calls, and tokens while preserving accuracy; and (3) evidence canonicalization for reliable QA, improving answer consistency and auditability.

Title: Assessing the Political Fairness of Multilingual LLMs: A Case Study based on a 21-way Multiparallel EuroParl Dataset

Authors: Paul Lerner, François Yvon
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.20508
Pdf URL: https://arxiv.org/pdf/2510.20508
Copy Paste: [[2510.20508]] Assessing the Political Fairness of Multilingual LLMs: A Case Study based on a 21-way Multiparallel EuroParl Dataset(https://arxiv.org/abs/2510.20508)
Keywords: language model, llm
Abstract: The political biases of Large Language Models (LLMs) are usually assessed by simulating their answers to English surveys. In this work, we propose an alternative framing of political biases, relying on principles of fairness in multilingual translation. We systematically compare the translation quality of speeches in the European Parliament (EP), observing systematic differences with majority parties from left, center, and right being better translated than outsider parties. This study is made possible by a new, 21-way multiparallel version of EuroParl, the parliamentary proceedings of the EP, which includes the political affiliations of each speaker. The dataset consists of 1.5M sentences for a total of 40M words and 249M characters. It covers three years, 1000+ speakers, 7 countries, 12 EU parties, 25 EU committees, and hundreds of national parties.

Title: ARC-Encoder: learning compressed text representations for large language models

Authors: Hippolyte Pilchen, Edouard Grave, Patrick Pérez
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.20535
Pdf URL: https://arxiv.org/pdf/2510.20535
Copy Paste: [[2510.20535]] ARC-Encoder: learning compressed text representations for large language models(https://arxiv.org/abs/2510.20535)
Keywords: language model, llm, retrieval-augmented generation, chain-of-thought
Abstract: Recent techniques such as retrieval-augmented generation or chain-of-thought reasoning have led to longer contexts and increased inference costs. Context compression techniques can reduce these costs, but the most effective approaches require fine-tuning the target model or even modifying its architecture. This can degrade its general abilities when not used for this specific purpose. Here we explore an alternative approach: an encoder that compresses the context into continuous representations which replace token embeddings in decoder LLMs. First, we perform a systematic study of training strategies and architecture choices for the encoder. Our findings led to the design of an Adaptable text Representations Compressor, named ARC-Encoder, which outputs $x$-times fewer continuous representations (typically $x\!\in\!\{4,8\}$) than text tokens. We evaluate ARC-Encoder across a variety of LLM usage scenarios, ranging from in-context learning to context window extension, on both instruct and base decoders. Results show that ARC-Encoder achieves state-of-the-art performance on several benchmarks while improving computational efficiency at inference. Finally, we demonstrate that our models can be adapted to multiple decoders simultaneously, allowing a single encoder to generalize across different decoder LLMs. This makes ARC-Encoder a flexible and efficient solution for portable encoders that work seamlessly with multiple LLMs. We release training code at this https URL; the fine-tuning dataset and pretrained models are available at this https URL.

Title: The Dog the Cat Chased Stumped the Model: Measuring When Language Models Abandon Structure for Shortcuts

Authors: Sangmitra Madhusudan, Kaige Chen, Ali Emami
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.20543
Pdf URL: https://arxiv.org/pdf/2510.20543
Copy Paste: [[2510.20543]] The Dog the Cat Chased Stumped the Model: Measuring When Language Models Abandon Structure for Shortcuts(https://arxiv.org/abs/2510.20543)
Keywords: language model
Abstract: When language models correctly parse "The cat that the dog chased meowed," are they analyzing syntax or simply familiar with dogs chasing cats? Despite extensive benchmarking, we lack methods to distinguish structural understanding from semantic pattern matching. We introduce CenterBench, a dataset of 9,720 comprehension questions on center-embedded sentences (like "The cat [that the dog chased] meowed") where relative clauses nest recursively, creating processing demands from simple to deeply nested structures. Each sentence has a syntactically identical but semantically implausible counterpart (e.g., mailmen prescribe medicine, doctors deliver mail) and six comprehension questions testing surface understanding, syntactic dependencies, and causal reasoning. Testing six models reveals that performance gaps between plausible and implausible sentences widen systematically with complexity, with models showing median gaps up to 26.8 percentage points, quantifying when they abandon structural analysis for semantic associations. Notably, semantic plausibility harms performance on questions about resulting actions, where following causal relationships matters more than semantic coherence. Reasoning models improve accuracy but their traces show semantic shortcuts, overthinking, and answer refusal. Unlike models whose plausibility advantage systematically widens with complexity, humans show variable semantic effects. CenterBench provides the first framework to identify when models shift from structural analysis to pattern matching.

Title: GlobalRAG: Enhancing Global Reasoning in Multi-hop Question Answering via Reinforcement Learning

Authors: Jinchang Luo, Mingquan Cheng, Fan Wan, Ni Li, Xiaoling Xia, Shuangshuang Tian, Tingcheng Bian, Haiwei Wang, Haohuan Fu, Yan Tao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.20548
Pdf URL: https://arxiv.org/pdf/2510.20548
Copy Paste: [[2510.20548]] GlobalRAG: Enhancing Global Reasoning in Multi-hop Question Answering via Reinforcement Learning(https://arxiv.org/abs/2510.20548)
Keywords: retrieval-augmented generation
Abstract: Reinforcement learning has recently shown promise in improving retrieval-augmented generation (RAG). Despite these advances, its effectiveness in multi-hop question answering (QA) remains constrained by two fundamental limitations: (i) the absence of global planning to structure multi-step reasoning, and (ii) unfaithful execution, which hinders effective query formulation and consistent use of retrieved evidence. We propose GlobalRAG, a reinforcement learning framework designed to enhance global reasoning in multi-hop QA. GlobalRAG decomposes questions into subgoals, coordinates retrieval with reasoning, and refines evidence iteratively. To guide this process, we introduce Planning Quality Reward and SubGoal Completion Reward, which encourage coherent planning and reliable subgoal execution. In addition, a progressive weight annealing strategy balances process-oriented and outcome-based objectives. Extensive experiments on both in-domain and out-of-domain benchmarks demonstrate that GlobalRAG significantly outperforms strong baselines while using only 8k training data (42% of the training data used by strong baselines), achieving average improvements of 14.2% in both EM and F1.
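
One way to picture the "progressive weight annealing" mentioned in this abstract is a schedule that shifts weight from process rewards to the outcome reward over training. The abstract does not describe the schedule, so the linear shape, the start and end weights, the direction of the shift, and the function names below are all assumptions.

```python
def process_weight(step: int, total_steps: int,
                   start: float = 0.8, end: float = 0.2) -> float:
    """Linearly anneal the process-reward weight from start to end (assumed schedule)."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * frac


def combined_reward(process_reward: float, outcome_reward: float,
                    step: int, total_steps: int) -> float:
    """Convex combination of a process-oriented and an outcome-based reward."""
    w = process_weight(step, total_steps)
    return w * process_reward + (1.0 - w) * outcome_reward


# Early in training the process signal dominates; later the outcome signal does.
for step in (0, 50, 100):
    print(step, round(process_weight(step, 100), 3))
```

Here `process_reward` would stand for something like the planning-quality and subgoal-completion signals, and `outcome_reward` for final-answer correctness; how GlobalRAG actually combines them is not stated in the abstract.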

Title: Beyond Retrieval-Ranking: A Multi-Agent Cognitive Decision Framework for E-Commerce Search

Authors: Zhouwei Zhai, Mengxiang Chen, Haoyun Xia, Jin Li, Renquan Zhou, Min Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.20567
Pdf URL: https://arxiv.org/pdf/2510.20567
Copy Paste: [[2510.20567]] Beyond Retrieval-Ranking: A Multi-Agent Cognitive Decision Framework for E-Commerce Search(https://arxiv.org/abs/2510.20567)
Keywords: agent
Abstract: The retrieval-ranking paradigm has long dominated e-commerce search, but its reliance on query-item matching fundamentally misaligns with multi-stage cognitive decision processes of platform users. This misalignment introduces critical limitations: semantic gaps in complex queries, high decision costs due to cross-platform information foraging, and the absence of professional shopping guidance. To address these issues, we propose a Multi-Agent Cognitive Decision Framework (MACDF), which shifts the paradigm from passive retrieval to proactive decision support. Extensive offline evaluations demonstrate MACDF's significant improvements in recommendation accuracy and user satisfaction, particularly for complex queries involving negation, multi-constraint, or reasoning demands. Online A/B testing on JD search platform confirms its practical efficacy. This work highlights the transformative potential of multi-agent cognitive systems in redefining e-commerce search.

Title: Can ChatGPT Code Communication Data Fairly?: Empirical Evidence from Multiple Collaborative Tasks

Authors: Jiangang Hao, Wenju Cui, Patrick Kyllonen, Emily Kerzabi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.20584
Pdf URL: https://arxiv.org/pdf/2510.20584
Copy Paste: [[2510.20584]] Can ChatGPT Code Communication Data Fairly?: Empirical Evidence from Multiple Collaborative Tasks(https://arxiv.org/abs/2510.20584)
Keywords: gpt, chat
Abstract: Assessing communication and collaboration at scale depends on the labor-intensive task of coding communication data into categories according to different frameworks. Prior research has established that ChatGPT can be directly instructed with coding rubrics to code the communication data and achieves accuracy comparable to human raters. However, whether the coding from ChatGPT or similar AI technology exhibits bias against different demographic groups, such as gender and race, remains unclear. To fill this gap, this paper investigates ChatGPT-based automated coding of communication data using a typical coding framework for collaborative problem solving, examining differences across gender and racial groups. The analysis draws on data from three types of collaborative tasks: negotiation, problem solving, and decision making. Our results show that ChatGPT-based coding exhibits no significant bias across gender and racial groups, paving the road for its adoption in large-scale assessment of collaboration and communication.

Title: Why Did Apple Fall To The Ground: Evaluating Curiosity In Large Language Model

Authors: Haoyu Wang, Sihang Jiang, Yuyan Chen, Yitong Wang, Yanghua Xiao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.20635
Pdf URL: https://arxiv.org/pdf/2510.20635
Copy Paste: [[2510.20635]] Why Did Apple Fall To The Ground: Evaluating Curiosity In Large Language Model(https://arxiv.org/abs/2510.20635)
Keywords: language model, llm
Abstract: Curiosity serves as a pivotal conduit for human beings to discover and learn new knowledge. Recent advances in large language models (LLMs) for natural language processing have sparked discussions regarding whether these models possess a capability for curiosity-driven learning akin to humans. In this paper, starting from the human curiosity assessment questionnaire Five-Dimensional Curiosity scale Revised (5DCR), we design a comprehensive evaluation framework that covers dimensions such as Information Seeking, Thrill Seeking, and Social Curiosity to assess the extent of curiosity exhibited by LLMs. The results demonstrate that LLMs exhibit a stronger thirst for knowledge than humans but still tend to make conservative choices when faced with uncertain environments. We further investigated the relationship between curiosity and thinking of LLMs, confirming that curious behaviors can enhance the model's reasoning and active learning abilities. These findings suggest that LLMs have the potential to exhibit curiosity similar to that of humans, providing experimental support for the future development of learning capabilities and innovative research in LLMs.
摘要：好奇心是人类发现和学习新知识的关键渠道。自然语言处理方面大型语言模型（LLM）的最新进展引发了关于这些模型是否具有类似于人类的好奇心驱动学习能力的讨论。本文从人类好奇心评估问卷五维好奇心量表修订版（5DCR）出发，设计了涵盖信息寻求、刺激寻求和社交好奇心等维度的综合评估框架，以评估法学硕士表现出的好奇心程度。结果表明，法学硕士表现出比人类更强烈的求知欲，但在面对不确定的环境时仍然倾向于做出保守的选择。我们进一步研究了法学硕士的好奇心与思维之间的关系，证实好奇行为可以增强模型的推理和主动学习能力。这些发现表明法学硕士有潜力表现出与人类类似的好奇心，为法学硕士未来学习能力的发展和创新研究提供实验支持。

Title: Neural Diversity Regularizes Hallucinations in Small Models

Authors: Kushal Chakrabarti, Nirmal Balachundhar
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.20690
Pdf URL: https://arxiv.org/pdf/2510.20690
Copy Paste: [[2510.20690]] Neural Diversity Regularizes Hallucinations in Small Models(https://arxiv.org/abs/2510.20690)
Keywords: language model, hallucination
Abstract: Language models continue to hallucinate despite increases in parameters, compute, and data. We propose neural diversity -- decorrelated parallel representations -- as a principled mechanism that reduces hallucination rates at fixed parameter and data budgets. Inspired by portfolio theory, where uncorrelated assets reduce risk by $\sqrt{P}$, we prove hallucination probability is bounded by representational correlation: $P(H) \leq f(\sigma^2((1-\rho(P))/P + \rho(P)), \mu^2)$, which predicts that language models need an optimal amount of neurodiversity. To validate this, we introduce ND-LoRA (Neural Diversity Low-Rank Adaptation), combining parallel LoRA adapters with Barlow Twins regularization, and demonstrate that ND-LoRA reduces hallucinations by up to 25.6% (and 14.6% on average) without degrading general accuracy. Ablations show LoRA adapters and regularization act synergistically, causal interventions prove neurodiversity as the mediating factor and correlational analyses indicate scale: a 0.1% neural correlation increase is associated with a 3.8% hallucination increase. Finally, task-dependent optimality emerges: different tasks require different amounts of optimal neurodiversity. Together, our results highlight neural diversity as a third axis of scaling -- orthogonal to parameters and data -- to improve the reliability of language models at fixed budgets.
摘要：尽管参数、计算和数据不断增加，语言模型仍然会产生幻觉。我们提出神经多样性——去相关的并行表示——作为一种原则机制，可以在固定参数和数据预算下降低幻觉率。受投资组合理论的启发，不相关的资产通过 $\sqrt{P}$ 降低风险，我们证明幻觉概率受到表征相关性的限制：$P(H) \leq f(\sigma^2((1-\rho(P))/P + \rho(P)), \mu^2)$，它预测语言模型需要最优数量的神经多样性。为了验证这一点，我们引入了 ND-LoRA（神经多样性低阶适应），将并行 LoRA 适配器与 Barlow Twins 正则化相结合，并证明 ND-LoRA 可将幻觉减少高达 25.6%（平均 14.6%），而不会降低总体准确性。消融显示 LoRA 适配器和正则化协同作用，因果干预证明神经多样性是中介因素，相关分析表明规模：0.1% 的神经相关性增加与 3.8% 的幻觉增加相关。最后，任务相关的最优性出现：不同的任务需要不同数量的最优神经多样性。总之，我们的结果强调了神经多样性作为扩展的第三个轴（与参数和数据正交），以提高固定预算下语言模型的可靠性。
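
Note: the term $\sigma^2((1-\rho)/P + \rho)$ inside the bound above has the same form as the variance of the average of $P$ equally correlated quantities, each with variance $\sigma^2$ and pairwise correlation $\rho$. The sketch below checks that identity numerically. It is only an illustration under stated assumptions: $\rho$ is treated as a constant rather than a function of $P$, the samples are Gaussian, and the function names are hypothetical. It does not reproduce the paper's proof or ND-LoRA.

```python
import numpy as np


def mean_variance(sigma2: float, rho: float, P: int) -> float:
    """Closed-form variance of the average of P equicorrelated variables."""
    return sigma2 * ((1.0 - rho) / P + rho)


def empirical_mean_variance(sigma2: float, rho: float, P: int,
                            n_samples: int = 200_000, seed: int = 0) -> float:
    """Monte Carlo estimate of the same quantity (Gaussian assumption)."""
    rng = np.random.default_rng(seed)
    # Equicorrelated covariance: sigma2 on the diagonal, rho * sigma2 elsewhere.
    cov = sigma2 * ((1.0 - rho) * np.eye(P) + rho * np.ones((P, P)))
    samples = rng.multivariate_normal(np.zeros(P), cov, size=n_samples)
    # Average the P correlated streams, then measure the spread of that average.
    return float(samples.mean(axis=1).var())


if __name__ == "__main__":
    for P in (1, 2, 4, 8):
        print(P, mean_variance(1.0, 0.2, P), empirical_mean_variance(1.0, 0.2, P))
```

With $\rho = 0$ the variance shrinks as $1/P$ (a $\sqrt{P}$ reduction in standard deviation), while for $\rho > 0$ it cannot fall below $\rho\sigma^2$ no matter how many streams are added.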

Title: Structure-Conditional Minimum Bayes Risk Decoding

Authors: Bryan Eikema, Anna Rutkiewicz, Mario Giulianelli
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.20700
Pdf URL: https://arxiv.org/pdf/2510.20700
Copy Paste: [[2510.20700]] Structure-Conditional Minimum Bayes Risk Decoding(https://arxiv.org/abs/2510.20700)
Keywords: language model
Abstract: Minimum Bayes Risk (MBR) decoding has seen renewed interest as an alternative to traditional generation strategies. While MBR has proven effective in machine translation, where the variability of a language model's outcome space is naturally constrained, it may face challenges in more open-ended tasks such as dialogue or instruction-following. We hypothesise that in such settings, applying MBR with standard similarity-based utility functions may result in selecting responses that are broadly representative of the model's distribution, yet sub-optimal with respect to any particular grouping of generations that share an underlying latent structure. In this work, we introduce three lightweight adaptations to the utility function, designed to make MBR more sensitive to structural variability in the outcome space. To test our hypothesis, we curate a dataset capturing three representative types of latent structure: dialogue act, emotion, and response structure (e.g., a sentence, a paragraph, or a list). We further propose two metrics to evaluate the structural optimality of MBR. Our analysis demonstrates that common similarity-based utility functions fall short by these metrics. In contrast, our proposed adaptations considerably improve structural optimality. Finally, we evaluate our approaches on real-world instruction-following benchmarks, AlpacaEval and MT-Bench, and show that increased structural sensitivity improves generation quality by up to 13.7 percentage points in win rate.
摘要：最小贝叶斯风险 (MBR) 解码作为传统生成策略的替代方案重新引起了人们的兴趣。虽然 MBR 已被证明在机器翻译中有效，其中语言模型结果空间的可变性自然受到限制，但它可能在对话或指令遵循等更开放的任务中面临挑战。我们假设，在这种情况下，将 MBR 与标准的基于相似性的效用函数一起应用可能会导致选择广泛代表模型分布的响应，但对于共享潜在潜在结构的任何特定代组而言，这些响应不是最佳的。在这项工作中，我们引入了对效用函数的三种轻量级调整，旨在使 MBR 对结果空间中的结构变化更加敏感。为了检验我们的假设，我们整理了一个数据集，捕获三种代表性类型的潜在结构：对话行为、情感和响应结构（例如句子、段落或列表）。我们进一步提出了两个指标来评估 MBR 的结构最优性。我们的分析表明，常见的基于相似性的效用函数达不到这些指标。相比之下，我们提出的调整大大提高了结构的优化性。最后，我们在现实世界的指令跟踪基准 AlpacaEval 和 MT-Bench 上评估了我们的方法，结果表明，结构敏感性的提高可将生成质量提高 13.7 个百分点。
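
Note: for readers unfamiliar with the baseline, standard MBR decoding picks the candidate with the highest average utility against the other sampled candidates. The sketch below shows that baseline only, using a simple unigram-F1 overlap as the similarity-based utility. The utility choice and all function names are assumptions made for illustration, and the three structure-aware adaptations proposed in the paper are not reproduced here.

```python
from collections import Counter
from typing import Callable, Sequence


def unigram_f1(hyp: str, ref: str) -> float:
    """Token-overlap F1 between two strings (a stand-in similarity utility)."""
    h, r = hyp.lower().split(), ref.lower().split()
    if not h or not r:
        return 0.0
    overlap = sum((Counter(h) & Counter(r)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(h), overlap / len(r)
    return 2 * precision * recall / (precision + recall)


def mbr_decode(candidates: Sequence[str],
               utility: Callable[[str, str], float] = unigram_f1) -> str:
    """Return the candidate with the highest mean utility against the others."""
    if not candidates:
        raise ValueError("need at least one candidate")

    def expected_utility(i: int) -> float:
        # The remaining samples act as pseudo-references for candidate i.
        others = [c for j, c in enumerate(candidates) if j != i]
        if not others:
            return 0.0
        return sum(utility(candidates[i], o) for o in others) / len(others)

    best = max(range(len(candidates)), key=expected_utility)
    return candidates[best]


if __name__ == "__main__":
    print(mbr_decode(["a b c", "a b d", "a b", "x y z"]))
```

Because the score is an average over the whole sample, the winner tends to be the response most similar to the bulk of the candidates, which is the behaviour the abstract describes as broadly representative of the model's distribution.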

Title: User Perceptions of Privacy and Helpfulness in LLM Responses to Privacy-Sensitive Scenarios

Authors: Xiaoyuan Wu, Roshni Kaushik, Wenkai Li, Lujo Bauer, Koichi Onoue
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2510.20721
Pdf URL: https://arxiv.org/pdf/2510.20721
Copy Paste: [[2510.20721]] User Perceptions of Privacy and Helpfulness in LLM Responses to Privacy-Sensitive Scenarios(https://arxiv.org/abs/2510.20721)
Keywords: language model, llm
Abstract: Large language models (LLMs) have seen rapid adoption for tasks such as drafting emails, summarizing meetings, and answering health questions. In such uses, users may need to share private information (e.g., health records, contact details). To evaluate LLMs' ability to identify and redact such private information, prior work developed benchmarks (e.g., ConfAIde, PrivacyLens) with real-life scenarios. Using these benchmarks, researchers have found that LLMs sometimes fail to keep secrets private when responding to complex tasks (e.g., leaking employee salaries in meeting summaries). However, these evaluations rely on LLMs (proxy LLMs) to gauge compliance with privacy norms, overlooking real users' perceptions. Moreover, prior work primarily focused on the privacy-preservation quality of responses, without investigating nuanced differences in helpfulness. To understand how users perceive the privacy-preservation quality and helpfulness of LLM responses to privacy-sensitive scenarios, we conducted a user study with 94 participants using 90 scenarios from PrivacyLens. We found that, when evaluating identical responses to the same scenario, users showed low agreement with each other on the privacy-preservation quality and helpfulness of the LLM response. Further, we found high agreement among five proxy LLMs, while each individual LLM had low correlation with users' evaluations. These results indicate that the privacy and helpfulness of LLM responses are often specific to individuals, and proxy LLMs are poor estimates of how real users would perceive these responses in privacy-sensitive scenarios. Our results suggest the need to conduct user-centered studies on measuring LLMs' ability to help users while preserving privacy. Additionally, future research could investigate ways to improve the alignment between proxy LLMs and users for better estimation of users' perceived privacy and utility.
摘要：大型语言模型 (LLM) 已迅速应用于起草电子邮件、总结会议和回答健康问题等任务。在此类用途中，用户可能需要共享私人信息（例如健康记录、联系方式）。为了评估法学硕士识别和编辑此类私人信息的能力，之前的工作根据现实生活场景制定了基准（例如 ConfAIde、PrivacyLens）。使用这些基准，研究人员发现法学硕士在应对复杂任务时有时无法保守秘密（例如，在会议摘要中泄露员工工资）。然而，这些评估依赖于法学硕士（代理法学硕士）来衡量对隐私规范的遵守情况，忽视了真实用户的看法。此外，之前的工作主要关注回复的隐私保护质量，而没有调查有用性的细微差别。为了了解用户如何看待 LLM 对隐私敏感场景的隐私保护质量和帮助，我们使用 PrivacyLens 的 90 个场景对 94 名参与者进行了一项用户研究。我们发现，在评估对同一场景的相同回答时，用户对法学硕士回答的隐私保护质量和有用性表现出较低的一致性。此外，我们发现五个代理法学硕士之间高度一致，而每个单独的法学硕士与用户评价的相关性较低。这些结果表明，LLM 回复的隐私性和有用性通常是针对个人的，代理 LLM 无法很好地估计真实用户在隐私敏感场景中如何看待这些回复。我们的结果表明，有必要进行以用户为中心的研究，以衡量法学硕士在保护隐私的同时帮助用户的能力。此外，未来的研究可以调查提高代理法学硕士和用户之间一致性的方法，以便更好地估计用户感知的隐私和效用。

Title: Automated Extraction of Fluoropyrimidine Treatment and Treatment-Related Toxicities from Clinical Notes Using Natural Language Processing

Authors: Xizhi Wu, Madeline S. Kreider, Philip E. Empey, Chenyu Li, Yanshan Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.20727
Pdf URL: https://arxiv.org/pdf/2510.20727
Copy Paste: [[2510.20727]] Automated Extraction of Fluoropyrimidine Treatment and Treatment-Related Toxicities from Clinical Notes Using Natural Language Processing(https://arxiv.org/abs/2510.20727)
Keywords: language model, llm, prompt
Abstract: Objective: Fluoropyrimidines are widely prescribed for colorectal and breast cancers, but are associated with toxicities such as hand-foot syndrome and cardiotoxicity. Since toxicity documentation is often embedded in clinical notes, we aimed to develop and evaluate natural language processing (NLP) methods to extract treatment and toxicity information. Materials and Methods: We constructed a gold-standard dataset of 236 clinical notes from 204,165 adult oncology patients. Domain experts annotated categories related to treatment regimens and toxicities. We developed rule-based, machine learning-based (Random Forest, Support Vector Machine [SVM], Logistic Regression [LR]), deep learning-based (BERT, ClinicalBERT), and large language model (LLM)-based NLP approaches (zero-shot and error-analysis prompting). Models used an 80:20 train-test split. Results: Sufficient data existed to train and evaluate 5 annotated categories. Error-analysis prompting achieved optimal precision, recall, and F1 scores (F1=1.000) for treatment and toxicities extraction, whereas zero-shot prompting reached F1=1.000 for treatment and F1=0.876 for toxicities this http URL and SVM ranked second for toxicities (F1=0.937). Deep learning underperformed, with BERT (F1=0.873 treatment; F1=0.839 toxicities) and ClinicalBERT (F1=0.873 treatment; F1=0.886 toxicities). Rule-based methods served as our baseline with F1 scores of 0.857 in treatment and 0.858 in toxicities. Discussion: LLM-based approaches outperformed all others, followed by machine learning methods. Machine and deep learning approaches were limited by small training data and showed limited generalizability, particularly for rare categories. Conclusion: LLM-based NLP most effectively extracted fluoropyrimidine treatment and toxicity information from clinical notes, and has strong potential to support oncology research and pharmacovigilance.
摘要：目的：氟嘧啶广泛用于治疗结直肠癌和乳腺癌，但与手足综合征和心脏毒性等毒性有关。由于毒性文档通常嵌入在临床记录中，我们的目标是开发和评估自然语言处理（NLP）方法来提取治疗和毒性信息。材料和方法：我们构建了一个黄金标准数据集，其中包含来自 204,165 名成年肿瘤患者的 236 份临床记录。领域专家注释了与治疗方案和毒性相关的类别。我们开发了基于规则、基于机器学习（随机森林、支持向量机 [SVM]、逻辑回归 [LR]）、基于深度学习（BERT、ClinicalBERT）和基于大语言模型 (LLM) 的 NLP 方法（零样本和错误分析提示）。模型使用 80:20 的训练测试比例。结果：有足够的数据来训练和评估 5 个带注释的类别。错误分析提示在治疗和毒性提取方面实现了最佳精度、召回率和 F1 分数 (F1=1.000)，而零样本提示在治疗方面达到了 F1=1.000，在毒性方面达到了 F1=0.876，该 http URL 和 SVM 在毒性方面排名第二 (F1=0.937)。深度学习表现不佳，BERT（F1=0.873 治疗；F1= 0.839 毒性）和 ClinicalBERT（F1=0.873 治疗；F1=0.886 毒性）。基于规则的方法作为我们的基线，治疗的 F1 分数为 0.857，毒性的 F1 分数为 0.858。讨论：基于 LMM 的方法优于所有其他方法，其次是机器学习方法。机器和深度学习方法受到小训练数据的限制，并且普遍性有限，特别是对于稀有类别。结论：基于法学硕士的 NLP 最有效地从临床记录中提取氟嘧啶治疗和毒性信息，并且具有支持肿瘤学研究和药物警戒的强大潜力。

Title: A Use-Case Specific Dataset for Measuring Dimensions of Responsible Performance in LLM-generated Text

Authors: Alicia Sagae, Chia-Jung Lee, Sandeep Avula, Brandon Dang, Vanessa Murdock
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.20782
Pdf URL: https://arxiv.org/pdf/2510.20782
Copy Paste: [[2510.20782]] A Use-Case Specific Dataset for Measuring Dimensions of Responsible Performance in LLM-generated Text(https://arxiv.org/abs/2510.20782)
Keywords: language model, llm, prompt
Abstract: Current methods for evaluating large language models (LLMs) typically focus on high-level tasks such as text generation, without targeting a particular AI application. This approach is not sufficient for evaluating LLMs for Responsible AI dimensions like fairness, since protected attributes that are highly relevant in one application may be less relevant in another. In this work, we construct a dataset that is driven by a real-world application (generate a plain-text product description, given a list of product features), parameterized by fairness attributes intersected with gendered adjectives and product categories, yielding a rich set of labeled prompts. We show how to use the data to identify quality, veracity, safety, and fairness gaps in LLMs, contributing a proposal for LLM evaluation paired with a concrete resource for the research community.
摘要：当前评估大型语言模型 (LLM) 的方法通常侧重于文本生成等高级任务，而不针对特定的人工智能应用。这种方法不足以评估法学硕士的公平性等负责任的人工智能维度，因为在一个应用程序中高度相关的受保护属性可能在另一个应用程序中不太相关。在这项工作中，我们构建了一个由现实世界应用程序驱动的数据集（生成纯文本产品描述，给定产品功能列表），通过与性别形容词和产品类别相交的公平属性进行参数化，产生一组丰富的标记提示。我们展示了如何使用数据来确定法学硕士的质量、准确性、安全性和公平性差距，为法学硕士评估提出建议，并为研究界提供具体资源。

Title: Simple Context Compression: Mean-Pooling and Multi-Ratio Training

Authors: Yair Feldman, Yoav Artzi
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.20797
Pdf URL: https://arxiv.org/pdf/2510.20797
Copy Paste: [[2510.20797]] Simple Context Compression: Mean-Pooling and Multi-Ratio Training(https://arxiv.org/abs/2510.20797)
Keywords: language model, llm, long context, retrieval-augmented generation
Abstract: A common strategy to reduce the computational costs of using long contexts in retrieval-augmented generation (RAG) with large language models (LLMs) is soft context compression, where the input sequence is transformed into a shorter continuous representation. We develop a lightweight and simple mean-pooling approach that consistently outperforms the widely used compression-tokens architecture, and study training the same compressor to output multiple compression ratios. We conduct extensive experiments across in-domain and out-of-domain QA datasets, as well as across model families, scales, and compression ratios. Overall, our simple mean-pooling approach achieves the strongest performance, with a relatively small drop when training for multiple compression ratios. More broadly though, across architectures and training regimes the trade-offs are more nuanced, illustrating the complex landscape of compression methods.
摘要：降低在大型语言模型 (LLM) 的检索增强生成 (RAG) 中使用长上下文的计算成本的常见策略是软上下文压缩，其中输入序列被转换为较短的连续表示。我们开发了一种轻量级且简单的均值池方法，该方法始终优于广泛使用的压缩令牌架构，并研究训练相同的压缩器以输出多个压缩比。我们在域内和域外 QA 数据集以及模型系列、规模和压缩比方面进行了广泛的实验。总体而言，我们简单的均值池方法实现了最强的性能，在针对多个压缩比进行训练时，性能下降相对较小。但更广泛地说，跨架构和训练制度的权衡更加微妙，说明了压缩方法的复杂情况。
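
Note: the core idea of mean-pooling compression, replacing each run of consecutive token representations with their average, can be sketched in a few lines. The version below is a minimal illustration under assumptions not stated in the abstract: it pools non-overlapping windows of `ratio` tokens over a `(tokens, dim)` array, lets the last window be shorter, and uses a hypothetical function name. The paper's actual encoder, training procedure, and multi-ratio setup are not shown.

```python
import numpy as np


def mean_pool_compress(hidden: np.ndarray, ratio: int) -> np.ndarray:
    """Average non-overlapping windows of `ratio` token vectors.

    hidden: array of shape (T, d) holding T token representations.
    Returns an array of shape (ceil(T / ratio), d).
    """
    if ratio < 1:
        raise ValueError("ratio must be >= 1")
    if hidden.ndim != 2 or hidden.shape[0] == 0:
        raise ValueError("hidden must have shape (T, d) with T >= 1")
    T = hidden.shape[0]
    starts = np.arange(0, T, ratio)                   # first index of each window
    sums = np.add.reduceat(hidden, starts, axis=0)    # per-window sums
    counts = np.diff(np.append(starts, T))            # window sizes (last may be short)
    return sums / counts[:, None]


if __name__ == "__main__":
    x = np.arange(10, dtype=float).reshape(5, 2)      # 5 tokens, 2 dimensions
    for ratio in (1, 2, 4):                           # one pooling routine, several ratios
        print(ratio, mean_pool_compress(x, ratio).shape)
```

Since the pooling step itself has no parameters tied to a particular ratio, the same routine can be called with different ratios, which is one reason a single compressor serving multiple compression ratios is plausible.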

Title: On the Detectability of LLM-Generated Text: What Exactly Is LLM-Generated Text?

Authors: Mingmeng Geng, Thierry Poibeau
Subjects: cs.CL, cs.AI, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2510.20810
Pdf URL: https://arxiv.org/pdf/2510.20810
Copy Paste: [[2510.20810]] On the Detectability of LLM-Generated Text: What Exactly Is LLM-Generated Text?(https://arxiv.org/abs/2510.20810)
Keywords: language model, llm
Abstract: With the widespread use of large language models (LLMs), many researchers have turned their attention to detecting text generated by them. However, there is no consistent or precise definition of their target, namely "LLM-generated text". Differences in usage scenarios and the diversity of LLMs further increase the difficulty of detection. What is commonly regarded as the detecting target usually represents only a subset of the text that LLMs can potentially produce. Human edits to LLM outputs, together with the subtle influences that LLMs exert on their users, are blurring the line between LLM-generated and human-written text. Existing benchmarks and evaluation approaches do not adequately address the various conditions in real-world detector applications. Hence, the numerical results of detectors are often misunderstood, and their significance is diminishing. Therefore, detectors remain useful under specific conditions, but their results should be interpreted only as references rather than decisive indicators.
摘要：随着大型语言模型（LLM）的广泛使用，许多研究人员将注意力转向检测由它们生成的文本。然而，他们的目标，即“LLM生成的文本”，并没有一致或精确的定义。使用场景的差异和LLM的多样性进一步增加了检测难度。通常被视为检测目标的内容通常仅代表法学硕士可能产生的文本的子集。对法学硕士输出的人工编辑，以及法学硕士对其用户施加的微妙影响，正在模糊法学硕士生成的文本和人工编写的文本之间的界限。现有的基准和评估方法不能充分解决现实探测器应用中的各种条件。因此，探测器的数值结果经常被误解，其意义正在减弱。因此，探测器在特定条件下仍然有用，但其结果应仅解释为参考而不是决定性指标。