2025-10-27

Title: Shoot First, Ask Questions Later? Building Rational Agents that Explore and Act Like People

Authors: Gabriel Grand, Valerio Pepe, Jacob Andreas, Joshua B. Tenenbaum
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.20886
Pdf URL: https://arxiv.org/pdf/2510.20886
Copy Paste: [[2510.20886]] Shoot First, Ask Questions Later? Building Rational Agents that Explore and Act Like People(https://arxiv.org/abs/2510.20886)
Keywords: language model, gpt, agent
Abstract: Many high-stakes applications of AI require forming data-driven hypotheses and making targeted guesses; e.g., in scientific and diagnostic settings. Given limited resources, to what extent do agents based on language models (LMs) act rationally? We develop methods to benchmark and enhance agentic information-seeking, drawing on insights from human behavior. First, we introduce a strategic decision-oriented dialogue task called Collaborative Battleship, in which a partially-informed Captain must balance exploration (asking questions) and action (taking shots), while a fully-informed Spotter must provide accurate answers under an information bottleneck. Compared to human players (N=42), we find that LM agents struggle to ground answers in context, generate informative questions, and select high-value actions. Next, to address these gaps, we develop novel Monte Carlo inference strategies for LMs based on principles from Bayesian Experimental Design (BED). For Spotter agents, our approach boosts accuracy by up to 14.7% absolute over LM-only baselines; for Captain agents, it raises expected information gain (EIG) by up to 0.227 bits (94.2% of the achievable noise ceiling). Combined, these components yield sharper targeting (+0.303-0.374 F1), and enable weaker LMs, such as Llama-4-Scout, to outperform both humans (8% -> 82% win rate) and frontier models (0% -> 67% win rate vs. GPT-5) at ~1% of GPT-5's cost. We replicate these findings on Guess Who? where our methods significantly boost accuracy (+28.3-42.4 p.p.), demonstrating their general applicability for building rational information-seeking agents.
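The expected information gain (EIG) reported in this abstract can be illustrated with a minimal sketch. It assumes a noiseless answerer and equally weighted Monte Carlo samples of hypotheses, in which case the EIG of a yes/no question reduces to the entropy of the answer distribution. The function names and the toy boards are hypothetical and are not taken from the paper.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) variable."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def eig_yes_no(hypotheses, question):
    """EIG in bits of a noiseless yes/no question over equally weighted samples.

    Each hypothesis gives a deterministic answer, so the EIG equals the
    entropy of the fraction of samples that answer "yes".
    """
    p_yes = sum(1 for h in hypotheses if question(h)) / len(hypotheses)
    return binary_entropy(p_yes)

# Toy data (assumption): each hypothesis is a set of occupied (row, col) cells.
samples = [
    {(0, 0), (0, 1)},
    {(1, 0), (1, 1)},
    {(0, 1), (1, 1)},
    {(0, 0), (1, 0)},
]

def any_ship_in_row_0(board):
    return any(row == 0 for row, _ in board)

def cell_0_0_occupied(board):
    return (0, 0) in board

print(eig_yes_no(samples, any_ship_in_row_0))
print(eig_yes_no(samples, cell_0_0_occupied))
```

Under these assumptions, a question that splits the samples evenly is worth one full bit, while a lopsided question is worth less. A noisy answerer would lower these values.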

Title: Code-enabled language models can outperform reasoning models on diverse tasks

Authors: Cedegao E. Zhang, Cédric Colas, Gabriel Poesia, Joshua B. Tenenbaum, Jacob Andreas
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.20909
Pdf URL: https://arxiv.org/pdf/2510.20909
Copy Paste: [[2510.20909]] Code-enabled language models can outperform reasoning models on diverse tasks(https://arxiv.org/abs/2510.20909)
Keywords: language model
Abstract: Reasoning models (RMs), language models (LMs) trained with reinforcement learning to produce long-form natural language reasoning, have been remarkably successful, but they still require large amounts of computation and data to train, and can be slow and expensive to run. In this paper, we show that standard instruct LMs can already be elicited to be strong reasoners at a level comparable to or even surpassing their corresponding RMs (e.g., DeepSeek V3 vs R1) without finetuning, across diverse domains from instruction following and creative generation to mathematical reasoning. This is achieved by CodeAdapt, our simple recipe that combines the CodeAct framework, where LMs interleave natural language reasoning with code execution in a multi-step fashion, with few-shot bootstrap in-context learning from as few as five training problems. Analyzing four matched pairs of LMs and RMs, we find that CodeAdapt enables three LMs to outperform the corresponding RMs on average over eight tasks (up to 22.9%) while being 10-81% more token efficient, and delivers superior performance on six tasks when averaged over the four models (up to 35.7%). Furthermore, the code-augmented reasoning traces display rich and varied problem-solving strategies. Our findings support that (1) CodeAdapt-style learning and reasoning may be robust and domain general and (2) code-enabled LMs are cognitively grounded and powerful systems, potentially providing a strong foundation for in-weight reinforcement learning.

Title: FicSim: A Dataset for Multi-Faceted Semantic Similarity in Long-Form Fiction

Authors: Natasha Johnson, Amanda Bertsch, Maria-Emil Deal, Emma Strubell
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.20926
Pdf URL: https://arxiv.org/pdf/2510.20926
Copy Paste: [[2510.20926]] FicSim: A Dataset for Multi-Faceted Semantic Similarity in Long-Form Fiction(https://arxiv.org/abs/2510.20926)
Keywords: language model
Abstract: As language models become capable of processing increasingly long and complex texts, there has been growing interest in their application within computational literary studies. However, evaluating the usefulness of these models for such tasks remains challenging due to the cost of fine-grained annotation for long-form texts and the data contamination concerns inherent in using public-domain literature. Current embedding similarity datasets are not suitable for evaluating literary-domain tasks because of a focus on coarse-grained similarity and primarily on very short text. We assemble and release FICSIM, a dataset of long-form, recently written fiction, including scores along 12 axes of similarity informed by author-produced metadata and validated by digital humanities scholars. We evaluate a suite of embedding models on this task, demonstrating a tendency across models to focus on surface-level features over semantic categories that would be useful for computational literary studies tasks. Throughout our data-collection process, we prioritize author agency and rely on continual, informed author consent.

Title: Do LLMs Truly Understand When a Precedent Is Overruled?

Authors: Li Zhang, Jaromir Savelka, Kevin Ashley
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.20941
Pdf URL: https://arxiv.org/pdf/2510.20941
Copy Paste: [[2510.20941]] Do LLMs Truly Understand When a Precedent Is Overruled?(https://arxiv.org/abs/2510.20941)
Keywords: language model, llm
Abstract: Large language models (LLMs) with extended context windows show promise for complex legal reasoning tasks, yet their ability to understand long legal documents remains insufficiently evaluated. Developing long-context benchmarks that capture realistic, high-stakes tasks remains a significant challenge in the field, as most existing evaluations rely on simplified synthetic tasks that fail to represent the complexity of real-world document understanding. Overruling relationships are foundational to common-law doctrine and commonly found in judicial opinions. They provide a focused and important testbed for long-document legal understanding that closely resembles what legal professionals actually do. We present an assessment of state-of-the-art LLMs on identifying overruling relationships from U.S. Supreme Court cases using a dataset of 236 case pairs. Our evaluation reveals three critical limitations: (1) era sensitivity -- the models show degraded performance on historical cases compared to modern ones, revealing fundamental temporal bias in their training; (2) shallow reasoning -- models rely on shallow logical heuristics rather than deep legal comprehension; and (3) context-dependent reasoning failures -- models produce temporally impossible relationships in complex open-ended tasks despite maintaining basic temporal awareness in simple contexts. Our work contributes a benchmark that addresses the critical gap in realistic long-context evaluation, providing an environment that mirrors the complexity and stakes of actual legal reasoning tasks.

Title: Irish-BLiMP: A Linguistic Benchmark for Evaluating Human and Language Model Performance in a Low-Resource Setting

Authors: Josh McGiff, Khanh-Tung Tran, William Mulcahy, Dáibhidh Ó Luinín, Jake Dalzell, Róisín Ní Bhroin, Adam Burke, Barry O'Sullivan, Hoang D. Nguyen, Nikola S. Nikolov
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.20957
Pdf URL: https://arxiv.org/pdf/2510.20957
Copy Paste: [[2510.20957]] Irish-BLiMP: A Linguistic Benchmark for Evaluating Human and Language Model Performance in a Low-Resource Setting(https://arxiv.org/abs/2510.20957)
Keywords: language model, gpt, llm
Abstract: We present Irish-BLiMP (Irish Benchmark of Linguistic Minimal Pairs), the first dataset and framework designed for fine-grained evaluation of linguistic competence in the Irish language, an endangered language. Drawing on a variety of linguistic literature and grammar reference works, we manually constructed and reviewed 1020 minimal pairs across a taxonomy of 11 linguistic features, through a team of fluent Irish speakers. We evaluate both existing Large Language Models (LLMs) and fluent human participants on their syntactic knowledge of Irish. Our findings show that humans outperform all models across all linguistic features, achieving 16.6% higher accuracy on average. Moreover, a substantial performance gap of 18.1% persists between open- and closed-source LLMs, with even the strongest model (gpt-5) reaching only 73.5% accuracy compared to 90.1% for humans. Interestingly, human participants and models struggle on different aspects of Irish grammar, highlighting a difference in the representations learned by the models. Overall, Irish-BLiMP provides the first systematic framework for evaluating the grammatical competence of LLMs in Irish and offers a valuable benchmark for advancing research on linguistic understanding in low-resource languages.

Title: Can Confidence Estimates Decide When Chain-of-thought is Necessary for LLMs?

Authors: Samuel Lewis-Lim, Xingwei Tan, Zhixue Zhao, Nikolaos Aletras
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.21007
Pdf URL: https://arxiv.org/pdf/2510.21007
Copy Paste: [[2510.21007]] Can Confidence Estimates Decide When Chain-of-thought is Necessary for LLMs?(https://arxiv.org/abs/2510.21007)
Keywords: language model, gpt, llm, prompt, chain-of-thought
Abstract: Chain-of-thought (CoT) prompting has emerged as a common technique for enhancing the reasoning abilities of large language models (LLMs). While extended reasoning can boost accuracy on complex tasks, it is often unnecessary and substantially increases token usage, limiting the practicality of reasoning models in many scenarios. Recent models, such as GPT-OSS and Qwen3, expose controls that enable users to adjust the length of CoT or determine whether it is used at all. Yet, it remains unclear when CoT should be used: on some tasks it improves performance, while on others it provides little benefit or even harms performance. We address this challenge with confidence-gated CoT, where a model invokes reasoning only when confidence in its direct answer is low. To this end, we present the first systematic study of training-free confidence estimation methods for CoT gating. Specifically, we evaluate four training-free confidence estimation methods and compare them to a random baseline and an oracle that always knows when CoT is needed. Through extensive experiments, we show that existing training-free confidence measures can reduce redundant CoT and outperform randomly invoked CoT. However, the utility of individual confidence measures is inconsistent, varying with both the dataset and the model, underscoring the difficulty of deploying confidence-gated CoT in practice. By analysing both strengths and failure modes, our study highlights the potential and limitations of current methods and paves the way toward more reliable adaptive gating of CoT.
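The gating rule described in this abstract (invoke reasoning only when confidence in the direct answer is low) can be sketched as follows. This is an illustration under assumptions: the threshold value, the function names, and the stand-in models are hypothetical, and the paper's four confidence estimators are not reproduced here.

```python
from typing import Callable, Tuple

def confidence_gated_answer(
    question: str,
    direct_answer: Callable[[str], Tuple[str, float]],
    cot_answer: Callable[[str], str],
    threshold: float = 0.8,  # assumed value, not from the paper
) -> Tuple[str, bool]:
    """Return (answer, used_cot); CoT runs only when confidence is below threshold."""
    answer, confidence = direct_answer(question)
    if confidence >= threshold:
        return answer, False
    return cot_answer(question), True

# Stand-in models for demonstration only.
def toy_direct(question: str) -> Tuple[str, float]:
    # Pretend the model is confident on one easy question and unsure otherwise.
    return ("4", 0.95) if question == "2+2?" else ("unsure", 0.30)

def toy_cot(question: str) -> str:
    return "reasoned answer"

print(confidence_gated_answer("2+2?", toy_direct, toy_cot))
print(confidence_gated_answer("a harder question", toy_direct, toy_cot))
```

Raising the threshold sends more questions through CoT (higher token cost), while lowering it keeps more direct answers.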

Title: Input Matters: Evaluating Input Structure's Impact on LLM Summaries of Sports Play-by-Play

Authors: Barkavi Sundararajan, Somayajulu Sripada, Ehud Reiter
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.21034
Pdf URL: https://arxiv.org/pdf/2510.21034
Copy Paste: [[2510.21034]] Input Matters: Evaluating Input Structure's Impact on LLM Summaries of Sports Play-by-Play(https://arxiv.org/abs/2510.21034)
Keywords: llm, hallucination
Abstract: A major concern when deploying LLMs in accuracy-critical domains such as sports reporting is that the generated text may not faithfully reflect the input data. We quantify how input structure affects hallucinations and other factual errors in LLM-generated summaries of NBA play-by-play data, across three formats: row-structured, JSON and unstructured. We manually annotated 3,312 factual errors across 180 game summaries produced by two models, Llama-3.1-70B and Qwen2.5-72B. Input structure has a strong effect: JSON input reduces error rates by 69% for Llama and 65% for Qwen compared to unstructured input, while row-structured input reduces errors by 54% for Llama and 51% for Qwen. A two-way repeated measures ANOVA shows that input structure accounts for over 80% of the variance in error rates, with Tukey HSD post hoc tests confirming statistically significant differences between all input formats.

Title: Reasoning's Razor: Reasoning Improves Accuracy but Can Hurt Recall at Critical Operating Points in Safety and Hallucination Detection

Authors: Atoosa Chegini, Hamid Kazemi, Garrett Souza, Maria Safi, Yang Song, Samy Bengio, Sinead Williamson, Mehrdad Farajtabar
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.21049
Pdf URL: https://arxiv.org/pdf/2510.21049
Copy Paste: [[2510.21049]] Reasoning's Razor: Reasoning Improves Accuracy but Can Hurt Recall at Critical Operating Points in Safety and Hallucination Detection(https://arxiv.org/abs/2510.21049)
Keywords: language model, llm, hallucination
Abstract: Reasoning has become a central paradigm for large language models (LLMs), consistently boosting accuracy across diverse benchmarks. Yet its suitability for precision-sensitive tasks remains unclear. We present the first systematic study of reasoning for classification tasks under strict low false positive rate (FPR) regimes. Our analysis covers two tasks--safety detection and hallucination detection--evaluated in both fine-tuned and zero-shot settings, using standard LLMs and Large Reasoning Models (LRMs). Our results reveal a clear trade-off: Think On (reasoning-augmented) generation improves overall accuracy, but underperforms at the low-FPR thresholds essential for practical use. In contrast, Think Off (no reasoning during inference) dominates in these precision-sensitive regimes, with Think On surpassing only when higher FPRs are acceptable. In addition, we find token-based scoring substantially outperforms self-verbalized confidence for precision-sensitive deployments. Finally, a simple ensemble of the two modes recovers the strengths of each. Taken together, our findings position reasoning as a double-edged tool: beneficial for average accuracy, but often ill-suited for applications requiring strict precision.
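To make the "low-FPR operating point" in this abstract concrete, here is a minimal sketch of measuring recall subject to a false positive rate cap. It assumes a detector that outputs one score per example and that both classes are present; the function name and toy data are hypothetical and do not come from the paper.

```python
def recall_at_fpr(scores, labels, max_fpr):
    """Highest recall achievable by a score threshold whose FPR is <= max_fpr.

    scores: detector scores, higher means "more likely positive".
    labels: 1 for positive (e.g. unsafe or hallucinated), 0 for negative.
    Assumes at least one positive and one negative example.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    best_recall = 0.0
    for threshold in sorted(set(scores), reverse=True):
        predicted = [s >= threshold for s in scores]
        tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
        fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
        if fp / negatives <= max_fpr:
            best_recall = max(best_recall, tp / positives)
    return best_recall

# Toy data (assumption): six examples with three positives and three negatives.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]

print(recall_at_fpr(scores, labels, max_fpr=0.0))
print(recall_at_fpr(scores, labels, max_fpr=0.34))
```

Two detectors with the same overall accuracy can differ sharply on this metric, which is the kind of gap the abstract describes between the Think On and Think Off modes.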

Title: Dynamic Retriever for In-Context Knowledge Editing via Policy Optimization

Authors: Mahmud Wasif Nafee, Maiqi Jiang, Haipeng Chen, Yanfu Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.21059
Pdf URL: https://arxiv.org/pdf/2510.21059
Copy Paste: [[2510.21059]] Dynamic Retriever for In-Context Knowledge Editing via Policy Optimization(https://arxiv.org/abs/2510.21059)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) excel at factual recall yet still propagate stale or incorrect knowledge. In-context knowledge editing offers a gradient-free remedy suitable for black-box APIs, but current editors rely on static demonstration sets chosen by surface-level similarity, leading to two persistent obstacles: (i) a quantity-quality trade-off, and (ii) lack of adaptivity to task difficulty. We address these issues by dynamically selecting supporting demonstrations according to their utility for the edit. We propose Dynamic Retriever for In-Context Knowledge Editing (DR-IKE), a lightweight framework that (1) trains a BERT retriever with REINFORCE to rank demonstrations by editing reward, and (2) employs a learnable threshold to prune low-value examples, shortening the prompt when the edit is easy and expanding it when the task is hard. DR-IKE performs editing without modifying model weights, relying solely on forward passes for compatibility with black-box LLMs. On the COUNTERFACT benchmark, it improves edit success by up to 17.1%, reduces latency by 41.6%, and preserves accuracy on unrelated queries, demonstrating scalable and adaptive knowledge editing. The code is available at this https URL .

Title: Bridging Language Gaps with Adaptive RAG: Improving Indonesian Language Question Answering

Authors: William Christian, Daniel Adamlu, Adrian Yu, Derwin Suhartono
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.21068
Pdf URL: https://arxiv.org/pdf/2510.21068
Copy Paste: [[2510.21068]] Bridging Language Gaps with Adaptive RAG: Improving Indonesian Language Question Answering(https://arxiv.org/abs/2510.21068)
Keywords: retrieval-augmented generation
Abstract: Question Answering (QA) has seen significant improvements with the advancement of machine learning models. Further studies enhanced these question answering systems by retrieving external information, an approach called Retrieval-Augmented Generation (RAG), to produce more accurate and informative answers. However, this state-of-the-art performance is predominantly achieved in English. To address this gap, we apply an Adaptive RAG system to the Indonesian language. The Adaptive RAG system integrates a classifier that assesses question complexity, which in turn determines the strategy for answering the question. To overcome the limited availability of Indonesian-language datasets, our study employs machine translation as a data augmentation approach. Experiments show that the question complexity classifier is reliable; however, we observed significant inconsistencies in the multi-retrieval answering strategy, which negatively impacted the overall evaluation when this strategy was applied. These findings highlight both the promise and the challenges of question answering in a low-resource language and suggest directions for future improvement.

Title: CDrugRed: A Chinese Drug Recommendation Dataset for Discharge Medications in Metabolic Diseases

Authors: Juntao Li, Haobin Yuan, Ling Luo, Yan Jiang, Fan Wang, Ping Zhang, Huiyi Lv, Jian Wang, Yuanyuan Sun, Hongfei Lin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.21084
Pdf URL: https://arxiv.org/pdf/2510.21084
Copy Paste: [[2510.21084]] CDrugRed: A Chinese Drug Recommendation Dataset for Discharge Medications in Metabolic Diseases(https://arxiv.org/abs/2510.21084)
Keywords: language model, llm
Abstract: Intelligent drug recommendation based on Electronic Health Records (EHRs) is critical for improving the quality and efficiency of clinical decision-making. By leveraging large-scale patient data, drug recommendation systems can assist physicians in selecting the most appropriate medications according to a patient's medical history, diagnoses, laboratory results, and comorbidities. However, the advancement of such systems is significantly hampered by the scarcity of publicly available, real-world EHR datasets, particularly in languages other than English. In this work, we present CDrugRed, the first publicly available Chinese drug recommendation dataset focused on discharge medications for metabolic diseases. The dataset includes 5,894 de-identified records from 3,190 patients, containing comprehensive information such as patient demographics, medical history, clinical course, and discharge diagnoses. We assess the utility of CDrugRed by benchmarking several state-of-the-art large language models (LLMs) on the discharge medication recommendation task. Experimental results show that while supervised fine-tuning improves model performance, there remains substantial room for improvement, with the best model achieving an F1 score of 0.5648 and a Jaccard score of 0.4477. This result highlights the complexity of the clinical drug recommendation task and establishes CDrugRed as a challenging and valuable resource for developing more robust and accurate drug recommendation systems. The dataset is publicly available to the research community under the data usage agreements at this https URL.
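As a reading aid for the two reported metrics, the sketch below computes set-based F1 and Jaccard scores for a single record. It is an assumption that the scores are computed over sets of drug names per record; how the paper averages across records is not stated in the abstract. The function name and the example medication lists are hypothetical.

```python
def set_f1_and_jaccard(predicted, reference):
    """Return (F1, Jaccard) between a predicted and a reference medication set."""
    pred, ref = set(predicted), set(reference)
    overlap = len(pred & ref)
    precision = overlap / len(pred) if pred else 0.0
    recall = overlap / len(ref) if ref else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    union = len(pred | ref)
    jaccard = overlap / union if union else 0.0
    return f1, jaccard

# Hypothetical single-record example.
predicted = ["metformin", "insulin", "atorvastatin"]
reference = ["metformin", "atorvastatin", "aspirin", "amlodipine"]

f1, jaccard = set_f1_and_jaccard(predicted, reference)
print(f"F1={f1:.4f} Jaccard={jaccard:.4f}")
```

For any single pair of sets, Jaccard is never larger than F1, which is consistent with the ordering of the two scores reported above.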

Title: Self-Rewarding PPO: Aligning Large Language Models with Demonstrations Only

Authors: Qingru Zhang, Liang Qiu, Ilgee Hong, Zhenghao Xu, Tianyi Liu, Shiyang Li, Rongzhi Zhang, Zheng Li, Lihong Li, Bing Yin, Chao Zhang, Jianshu Chen, Haoming Jiang, Tuo Zhao
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.21090
Pdf URL: https://arxiv.org/pdf/2510.21090
Copy Paste: [[2510.21090]] Self-Rewarding PPO: Aligning Large Language Models with Demonstrations Only(https://arxiv.org/abs/2510.21090)
Keywords: language model, llm
Abstract: Supervised fine-tuning (SFT) has emerged as a crucial method for aligning large language models (LLMs) with human-annotated demonstrations. However, SFT, being an off-policy approach similar to behavior cloning, often struggles with overfitting and poor out-of-domain generalization, especially in limited-data scenarios. To address these limitations, we propose Self-Rewarding PPO, a novel fine-tuning method that leverages on-policy techniques to enhance generalization performance. Our approach combines the strengths of SFT and proximal policy optimization (PPO) to achieve more effective alignment from demonstration data. At its core is a reward function designed as the log policy ratio between the SFT model and the pretrained base model. This function serves as an implicit reward signal, using the pretrained policy as a baseline and the SFT policy as a target. By doing so, it enables on-policy fine-tuning without relying on human preference annotations. The integration of this self-rewarding mechanism with PPO addresses key limitations of SFT, improving generalization, data efficiency, and robustness. Our empirical evaluation across a range of natural language processing tasks demonstrates that Self-Rewarding PPO consistently outperforms traditional SFT methods. The results highlight the effectiveness of our approach in aligning LLMs using demonstration data, particularly in scenarios where high-quality annotated data is scarce.
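The reward described in this abstract, the log policy ratio between the SFT model and the pretrained base model, can be sketched for one response as below. It assumes per-token log-probabilities of the same response are already available from both models and that the reward is taken at the sequence level; whether the paper normalizes or applies it per token is not stated in the abstract. The function name and the numbers are hypothetical.

```python
def log_ratio_reward(sft_token_logprobs, base_token_logprobs):
    """Sequence-level reward: log pi_SFT(y|x) - log pi_base(y|x).

    Both arguments are per-token log-probabilities of the same response y
    given the same prompt x, scored by the SFT and base models respectively.
    """
    if len(sft_token_logprobs) != len(base_token_logprobs):
        raise ValueError("both models must score the same token sequence")
    return sum(sft_token_logprobs) - sum(base_token_logprobs)

# Hypothetical log-probabilities for a three-token response.
sft_logprobs = [-0.1, -0.2, -0.3]
base_logprobs = [-0.5, -0.7, -0.4]

print(log_ratio_reward(sft_logprobs, base_logprobs))
```

A positive value means the SFT model finds the response more likely than the base model does, so responses that look like the demonstrations are rewarded without any human preference labels.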

Title: The Gray Zone of Faithfulness: Taming Ambiguity in Unfaithfulness Detection

Authors: Qiang Ding, Lvzhou Luo, Yixuan Cao, Ping Luo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.21118
Pdf URL: https://arxiv.org/pdf/2510.21118
Copy Paste: [[2510.21118]] The Gray Zone of Faithfulness: Taming Ambiguity in Unfaithfulness Detection(https://arxiv.org/abs/2510.21118)
Keywords: language model, gpt, llm, hallucination
Abstract: Ensuring that Large Language Models (LLMs) generate summaries faithful to a given source document is essential for real-world applications. While prior research has explored LLM faithfulness, existing benchmarks suffer from annotation ambiguity, primarily due to the ill-defined boundary of permissible external knowledge in generated outputs. For instance, common sense is often incorporated into responses and labeled as "faithful", yet the acceptable extent of such knowledge remains unspecified, leading to inconsistent annotations. To address this issue, we propose a novel faithfulness annotation framework, which introduces an intermediate category, Out-Dependent, to classify cases where external knowledge is required for verification. Using this framework, we construct VeriGray (Verification with the Gray Zone) -- a new unfaithfulness detection benchmark in summarization. Statistics reveal that even SOTA LLMs, such as GPT-5, exhibit hallucinations (~6% of sentences) in summarization tasks. Moreover, a substantial proportion (~8% on average across models) of generated sentences fall into the Out-Dependent category, underscoring the importance of resolving annotation ambiguity in unfaithfulness detection benchmarks. Experiments demonstrate that our benchmark poses significant challenges to multiple baseline methods, indicating considerable room for future improvement.

Title: Large Language Models Meet Text-Attributed Graphs: A Survey of Integration Frameworks and Applications

Authors: Guangxin Su, Hanchen Wang, Jianwei Wang, Wenjie Zhang, Ying Zhang, Jian Pei
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.21131
Pdf URL: https://arxiv.org/pdf/2510.21131
Copy Paste: [[2510.21131]] Large Language Models Meet Text-Attributed Graphs: A Survey of Integration Frameworks and Applications(https://arxiv.org/abs/2510.21131)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) have achieved remarkable success in natural language processing through strong semantic understanding and generation. However, their black-box nature limits structured and multi-hop reasoning. In contrast, Text-Attributed Graphs (TAGs) provide explicit relational structures enriched with textual context, yet often lack semantic depth. Recent research shows that combining LLMs and TAGs yields complementary benefits: enhancing TAG representation learning and improving the reasoning and interpretability of LLMs. This survey provides the first systematic review of LLM--TAG integration from an orchestration perspective. We introduce a novel taxonomy covering two fundamental directions: LLM for TAG, where LLMs enrich graph-based tasks, and TAG for LLM, where structured graphs improve LLM reasoning. We categorize orchestration strategies into sequential, parallel, and multi-module frameworks, and discuss advances in TAG-specific pretraining, prompting, and parameter-efficient fine-tuning. Beyond methodology, we summarize empirical insights, curate available datasets, and highlight diverse applications across recommendation systems, biomedical analysis, and knowledge-intensive question answering. Finally, we outline open challenges and promising research directions, aiming to guide future work at the intersection of language and graph learning.

Title: Social Simulations with Large Language Model Risk Utopian Illusion

Authors: Ning Bian, Xianpei Han, Hongyu Lin, Baolei Wu, Jun Wang
Subjects: cs.CL, cs.SI
Abstract URL: https://arxiv.org/abs/2510.21180
Pdf URL: https://arxiv.org/pdf/2510.21180
Copy Paste: [[2510.21180]] Social Simulations with Large Language Model Risk Utopian Illusion(https://arxiv.org/abs/2510.21180)
Keywords: language model, llm, chat, agent
Abstract: Reliable simulation of human behavior is essential for explaining, predicting, and intervening in our society. Recent advances in large language models (LLMs) have shown promise in emulating human behaviors, interactions, and decision-making, offering a powerful new lens for social science studies. However, the extent to which LLMs diverge from authentic human behavior in social contexts remains underexplored, posing risks of misinterpretation in scientific studies and unintended consequences in real-world applications. Here, we introduce a systematic framework for analyzing LLMs' behavior in social simulation. Our approach simulates multi-agent interactions through chatroom-style conversations and analyzes them across five linguistic dimensions, providing a simple yet effective method to examine emergent social cognitive biases. We conduct extensive experiments involving eight representative LLMs across three families. Our findings reveal that LLMs do not faithfully reproduce genuine human behavior but instead reflect overly idealized versions of it, shaped by the social desirability bias. In particular, LLMs show social role bias, primacy effect, and positivity bias, resulting in "Utopian" societies that lack the complexity and variability of real human interactions. These findings call for more socially grounded LLMs that capture the diversity of human social behavior.

Title: Estonian Native Large Language Model Benchmark

Authors: Helena Grete Lillepalu, Tanel Alumäe
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.21193
Pdf URL: https://arxiv.org/pdf/2510.21193
Copy Paste: [[2510.21193]] Estonian Native Large Language Model Benchmark(https://arxiv.org/abs/2510.21193)
Keywords: language model, llm
Abstract: The availability of LLM benchmarks for the Estonian language is limited, and a comprehensive evaluation comparing the performance of different LLMs on Estonian tasks has yet to be conducted. We introduce a new benchmark for evaluating LLMs in Estonian, based on seven diverse datasets. These datasets assess general and domain-specific knowledge, understanding of Estonian grammar and vocabulary, summarization abilities, contextual comprehension, and more. The datasets are all generated from native Estonian sources without using machine translation. We compare the performance of base models, instruction-tuned open-source models, and commercial models. Our evaluation includes 6 base models and 26 instruction-tuned models. To assess the results, we employ both human evaluation and LLM-as-a-judge methods. Human evaluation scores showed moderate to high correlation with benchmark evaluations, depending on the dataset. Claude 3.7 Sonnet, used as an LLM judge, demonstrated strong alignment with human ratings, indicating that top-performing LLMs can effectively support the evaluation of Estonian-language models.

Title: DispatchMAS: Fusing taxonomy and artificial intelligence agents for emergency medical services

Authors: Xiang Li, Huizi Yu, Wenkong Wang, Yiran Wu, Jiayan Zhou, Wenyue Hua, Xinxin Lin, Wenjia Tan, Lexuan Zhu, Bingyi Chen, Guang Chen, Ming-Li Chen, Yang Zhou, Zhao Li, Themistocles L. Assimes, Yongfeng Zhang, Qingyun Wu, Xin Ma, Lingyao Li, Lizhou Fan
Subjects: cs.CL, cs.HC
Abstract URL: https://arxiv.org/abs/2510.21228
Pdf URL: https://arxiv.org/pdf/2510.21228
Copy Paste: [[2510.21228]] DispatchMAS: Fusing taxonomy and artificial intelligence agents for emergency medical services(https://arxiv.org/abs/2510.21228)
Keywords: language model, llm, agent
Abstract: Objective: Emergency medical dispatch (EMD) is a high-stakes process challenged by caller distress, ambiguity, and cognitive load. Large Language Models (LLMs) and Multi-Agent Systems (MAS) offer opportunities to augment dispatchers. This study aimed to develop and evaluate a taxonomy-grounded, LLM-powered multi-agent system for simulating realistic EMD scenarios. Methods: We constructed a clinical taxonomy (32 chief complaints, 6 caller identities from MIMIC-III) and a six-phase call protocol. Using this framework, we developed an AutoGen-based MAS with Caller and Dispatcher Agents. The system grounds interactions in a fact commons to ensure clinical plausibility and mitigate misinformation. We used a hybrid evaluation framework: four physicians assessed 100 simulated cases for "Guidance Efficacy" and "Dispatch Effectiveness," supplemented by automated linguistic analysis (sentiment, readability, politeness). Results: Human evaluation, with substantial inter-rater agreement (Gwet's AC1 > 0.70), confirmed the system's high performance. It demonstrated excellent Dispatch Effectiveness (e.g., 94% contacting the correct potential other agents) and Guidance Efficacy (advice provided in 91% of cases), both rated highly by physicians. Algorithmic metrics corroborated these findings, indicating a predominantly neutral affective profile (73.7% neutral sentiment; 90.4% neutral emotion), high readability (Flesch 80.9), and a consistently polite style (60.0% polite; 0% impolite). Conclusion: Our taxonomy-grounded MAS simulates diverse, clinically plausible dispatch scenarios with high fidelity. Findings support its use for dispatcher training, protocol evaluation, and as a foundation for real-time decision support. This work outlines a pathway for safely integrating advanced AI agents into emergency response workflows.
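For readers unfamiliar with the readability figure quoted above, the following is a minimal sketch of the standard Flesch Reading Ease formula. It assumes the word, sentence, and syllable counts are already available; the abstract does not say which tool or syllable counter the authors used, so the function name and the example counts here are purely illustrative.

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Standard Flesch Reading Ease score; higher values mean easier text."""
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    words_per_sentence = words / sentences
    syllables_per_word = syllables / words
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


# Illustrative counts only (not taken from the paper).
example_score = flesch_reading_ease(words=100, sentences=5, syllables=130)
print(round(example_score, 1))
```

Scores in the 80s, such as the 80.9 reported in the abstract, are conventionally read as easy conversational English.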

Title: Correlation Dimension of Auto-Regressive Large Language Models

Authors: Xin Du, Kumiko Tanaka-Ishii
Subjects: cs.CL, cs.AI, nlin.CD
Abstract URL: https://arxiv.org/abs/2510.21258
Pdf URL: https://arxiv.org/pdf/2510.21258
Copy Paste: [[2510.21258]] Correlation Dimension of Auto-Regressive Large Language Models(https://arxiv.org/abs/2510.21258)
Keywords: language model, llm, hallucination
Abstract: Large language models (LLMs) have achieved remarkable progress in natural language generation, yet they continue to display puzzling behaviors -- such as repetition and incoherence -- even when exhibiting low perplexity. This highlights a key limitation of conventional evaluation metrics, which emphasize local prediction accuracy while overlooking long-range structural complexity. We introduce correlation dimension, a fractal-geometric measure of self-similarity, to quantify the epistemological complexity of text as perceived by a language model. This measure captures the hierarchical recurrence structure of language, bridging local and global properties in a unified framework. Through extensive experiments, we show that correlation dimension (1) reveals three distinct phases during pretraining, (2) reflects context-dependent complexity, (3) indicates a model's tendency toward hallucination, and (4) reliably detects multiple forms of degeneration in generated text. The method is computationally efficient, robust to model quantization (down to 4-bit precision), broadly applicable across autoregressive architectures (e.g., Transformer and Mamba), and provides fresh insight into the generative dynamics of LLMs.
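As background on the measure named above, here is a minimal sketch of the classical Grassberger-Procaccia estimate of correlation dimension on a toy point set. It is a generic illustration under stated assumptions: the abstract does not describe how points are derived from a language model, so the point set, the two radii, and the function names are all hypothetical.

```python
import math
from itertools import combinations


def correlation_sum(points, r):
    """Fraction of distinct point pairs whose Euclidean distance is below r."""
    pairs = list(combinations(points, 2))
    close = sum(1 for p, q in pairs if math.dist(p, q) < r)
    return close / len(pairs)


def correlation_dimension(points, r_small, r_large):
    """Slope of log C(r) against log r, estimated from two radii."""
    c_small = correlation_sum(points, r_small)
    c_large = correlation_sum(points, r_large)
    return (math.log(c_large) - math.log(c_small)) / (
        math.log(r_large) - math.log(r_small)
    )


# Toy data: evenly spaced points on a straight line embedded in 2-D.
line = [(float(i), 0.0) for i in range(200)]
dim_line = correlation_dimension(line, 9.5, 19.5)
print(dim_line)  # close to 1, since a line is one-dimensional
```

In practice the slope is usually fitted over a range of radii rather than two values; two radii keep the sketch short.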

Title: Sparser Block-Sparse Attention via Token Permutation

Authors: Xinghao Wang, Pengyu Wang, Dong Zhang, Chenkun Tan, Shaojun Zhou, Zhaoxiang Liu, Shiguo Lian, Fangxu Liu, Kai Song, Xipeng Qiu
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2510.21270
Pdf URL: https://arxiv.org/pdf/2510.21270
Copy Paste: [[2510.21270]] Sparser Block-Sparse Attention via Token Permutation(https://arxiv.org/abs/2510.21270)
Keywords: language model, llm
Abstract: Scaling the context length of large language models (LLMs) offers significant benefits but is computationally expensive. This expense stems primarily from the self-attention mechanism, whose $O(N^2)$ complexity with respect to sequence length presents a major bottleneck for both memory and latency. Fortunately, the attention matrix is often sparse, particularly for long sequences, suggesting an opportunity for optimization. Block-sparse attention has emerged as a promising solution that partitions sequences into blocks and skips computation for a subset of these blocks. However, the effectiveness of this method is highly dependent on the underlying attention patterns, which can lead to sub-optimal block-level sparsity. For instance, important key tokens for queries within a single block may be scattered across numerous other blocks, leading to computational redundancy. In this work, we propose Permuted Block-Sparse Attention (PBS-Attn), a plug-and-play method that leverages the permutation properties of attention to increase block-level sparsity and enhance the computational efficiency of LLM prefilling. We conduct comprehensive experiments on challenging real-world long-context datasets, demonstrating that PBS-Attn consistently outperforms existing block-sparse attention methods in model accuracy and closely matches the full attention baseline. Powered by our custom permuted-FlashAttention kernels, PBS-Attn achieves an end-to-end speedup of up to $2.75\times$ in long-context prefilling, confirming its practical viability. Code available at this https URL
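The "permutation properties of attention" mentioned above can be illustrated with a small sketch: for unmasked softmax attention, reordering the keys and values together leaves the output for a query unchanged, which is what makes regrouping keys into fewer blocks possible in principle. This is not the PBS-Attn algorithm, whose details (including how causal masking is handled) the abstract does not give; all names and sizes below are illustrative assumptions.

```python
import math
import random


def attention(query, keys, values):
    """Single-query softmax attention over lists of key and value vectors."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


rng = random.Random(0)
n, d = 8, 4
query = [rng.gauss(0, 1) for _ in range(d)]
keys = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
values = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]

# Apply the same random permutation to keys and values.
perm = list(range(n))
rng.shuffle(perm)
out_original = attention(query, keys, values)
out_permuted = attention(query, [keys[i] for i in perm], [values[i] for i in perm])

max_diff = max(abs(a - b) for a, b in zip(out_original, out_permuted))
print(max_diff)  # differs only by floating-point rounding
```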

Title: PARL: Prompt-based Agents for Reinforcement Learning

Authors: Yarik Menchaca Resendiz, Roman Klinger
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.21306
Pdf URL: https://arxiv.org/pdf/2510.21306
Copy Paste: [[2510.21306]] PARL: Prompt-based Agents for Reinforcement Learning(https://arxiv.org/abs/2510.21306)
Keywords: language model, llm, prompt, agent
Abstract: Large language models (LLMs) have demonstrated high performance on tasks expressed in natural language, particularly in zero- or few-shot settings. These are typically framed as supervised (e.g., classification) or unsupervised (e.g., clustering) problems. However, limited work evaluates LLMs as agents in reinforcement learning (RL) tasks (e.g., playing games), where learning occurs through interaction with an environment and a reward system. While prior work focused on representing tasks that rely on a language representation, we study structured, non-linguistic reasoning - such as interpreting positions in a grid world. We therefore introduce PARL (Prompt-based Agent for Reinforcement Learning), a method that uses LLMs as RL agents through prompting, without any fine-tuning. PARL encodes actions, states, and rewards in the prompt, enabling the model to learn through trial-and-error interaction. We evaluate PARL on three standard RL tasks that do not entirely rely on natural language. We show that it can match or outperform traditional RL agents in simple environments by leveraging pretrained knowledge. However, we identify performance limitations in tasks that require complex mathematical operations or decoding states and actions.

Title: Efficient semantic uncertainty quantification in language models via diversity-steered sampling

Authors: Ji Won Park, Kyunghyun Cho
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.21310
Pdf URL: https://arxiv.org/pdf/2510.21310
Copy Paste: [[2510.21310]] Efficient semantic uncertainty quantification in language models via diversity-steered sampling(https://arxiv.org/abs/2510.21310)
Keywords: language model, llm
Abstract: Accurately estimating semantic aleatoric and epistemic uncertainties in large language models (LLMs) is particularly challenging in free-form question answering (QA), where obtaining stable estimates often requires many expensive generations. We introduce a diversity-steered sampler that discourages semantically redundant outputs during decoding, covers both autoregressive and masked diffusion paradigms, and yields substantial sample-efficiency gains. The key idea is to inject a continuous semantic-similarity penalty into the model's proposal distribution using a natural language inference (NLI) model lightly finetuned on partial prefixes or intermediate diffusion states. We debias downstream uncertainty estimates with importance reweighting and shrink their variance with control variates. Across four QA benchmarks, our method matches or surpasses baselines while covering more semantic clusters with the same number of samples. Being modular and requiring no gradient access to the base LLM, the framework promises to serve as a drop-in enhancement for uncertainty estimation in risk-sensitive model deployments.
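To make the "importance reweighting" step above concrete, here is a minimal sketch of self-normalized importance sampling on a toy discrete distribution. It assumes samples are drawn from a steered proposal and reweighted by the ratio of target to proposal probability; the distributions, scores, and function names are invented for illustration and do not reproduce the paper's estimator or its control variates.

```python
import random

# Hypothetical target (model) and proposal (steered sampler) distributions.
target = {"a": 0.5, "b": 0.3, "c": 0.2}
proposal = {"a": 0.2, "b": 0.3, "c": 0.5}
score = {"a": 1.0, "b": 2.0, "c": 3.0}  # quantity whose expectation we want


def self_normalized_estimate(n_samples, seed=0):
    """Estimate E_target[score] from samples drawn under the proposal."""
    rng = random.Random(seed)
    outcomes = list(proposal)
    draws = rng.choices(outcomes, weights=[proposal[o] for o in outcomes], k=n_samples)
    weights = [target[x] / proposal[x] for x in draws]
    return sum(w * score[x] for w, x in zip(weights, draws)) / sum(weights)


exact = sum(target[x] * score[x] for x in target)
estimate = self_normalized_estimate(20000)
print(exact, estimate)
```

Without the weights, the sample mean would converge to the expectation under the proposal (2.3 here) rather than under the target (1.7), which is the bias the reweighting removes.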

Title: Typoglycemia under the Hood: Investigating Language Models' Understanding of Scrambled Words

Authors: Gianluca Sperduti, Alejandro Moreo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.21326
Pdf URL: https://arxiv.org/pdf/2510.21326
Copy Paste: [[2510.21326]] Typoglycemia under the Hood: Investigating Language Models' Understanding of Scrambled Words(https://arxiv.org/abs/2510.21326)
Keywords: language model
Abstract: Research in linguistics has shown that humans can read words with internally scrambled letters, a phenomenon recently dubbed typoglycemia. Some specific NLP models have recently been proposed that similarly demonstrate robustness to such distortions by ignoring the internal order of characters by design. This raises a fundamental question: how can models perform well when many distinct words (e.g., form and from) collapse into identical representations under typoglycemia? Our work, focusing exclusively on the English language, seeks to shed light on the underlying aspects responsible for this robustness. We hypothesize that the main reasons have to do with the fact that (i) relatively few English words collapse under typoglycemia, and that (ii) collapsed words tend to occur in contexts so distinct that disambiguation becomes trivial. In our analysis, we (i) analyze the British National Corpus to quantify word collapse and ambiguity under typoglycemia, (ii) evaluate BERT's ability to disambiguate collapsing forms, and (iii) conduct a probing experiment by comparing variants of BERT trained from scratch on clean versus typoglycemic Wikipedia text; our results reveal that the performance degradation caused by scrambling is smaller than expected.
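The notion of words "collapsing" under typoglycemia can be sketched as follows, assuming a representation that keeps the first and last letters and ignores the order of the interior ones. The helper names and the tiny vocabulary are illustrative assumptions; the paper's own analysis uses the British National Corpus.

```python
from collections import defaultdict


def collapse(word):
    """Keep the first and last letters; sort the interior so its order is ignored."""
    if len(word) <= 3:
        return word
    return word[0] + "".join(sorted(word[1:-1])) + word[-1]


def collisions(vocabulary):
    """Group distinct words that share the same collapsed representation."""
    groups = defaultdict(set)
    for word in vocabulary:
        groups[collapse(word.lower())].add(word.lower())
    return {key: sorted(words) for key, words in groups.items() if len(words) > 1}


vocab = ["form", "from", "salt", "slat", "model", "words", "trial", "trail"]
print(collisions(vocab))
```

Counting how many vocabulary entries fall into such collision groups gives a rough sense of hypothesis (i), that relatively few English words collapse.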

Title: TripTide: A Benchmark for Adaptive Travel Planning under Disruptions

Authors: Priyanshu Karmakar (1), Soumyabrata Chaudhuri (1), Shubhojit Mallick (2), Manish Gupta (2), Abhik Jana (1), Shreya Ghosh (1) ((1) School of Electrical and Computer Sciences, IIT Bhubaneswar, India, (2) Microsoft, India)
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.21329
Pdf URL: https://arxiv.org/pdf/2510.21329
Copy Paste: [[2510.21329]] TripTide: A Benchmark for Adaptive Travel Planning under Disruptions(https://arxiv.org/abs/2510.21329)
Keywords: language model, llm, prompt
Abstract: Recent efforts like TripCraft and TravelPlanner have advanced the use of Large Language Models (LLMs) for personalized, constraint-aware travel itinerary generation. Yet, real travel often faces disruptions. To address this, we present TripTide, the first benchmark evaluating LLMs' ability to revise itineraries under realistic disruptions. TripTide models key dimensions such as disruption severity and traveler tolerance, enabling nuanced assessment of LLM adaptability to events like flight cancellations, weather closures, or overbooked attractions. We conduct a threefold evaluation. First, we introduce automatic metrics including Preservation of Intent (how well the revised plan maintains feasibility and goals), Responsiveness (promptness and appropriateness of disruption handling), and Adaptability (semantic, spatial, and sequential divergence between original and revised plans). Second, we apply an LLM-as-a-judge approach to automatically assess revision quality. Third, we perform manual expert evaluation to verify whether revisions preserve semantic, spatial, sequential, and responsive aspects. Our experiments show that LLMs maintain strong sequential consistency and semantic stability, while spatial deviations are larger for shorter trips but decrease with longer ones, indicating that extended plans encourage better geographic coherence. However, disruption-handling ability declines as plan length increases, highlighting limits in LLM robustness. TripTide establishes a benchmark for evaluating adaptability, personalization, and resilience in LLM-based travel planning under real-world uncertainty.

Title: Multi-turn Training with Basic Human Feedback Helps Little on LLM Reasoning

Authors: Qiang Liu, Wuganjing Song, Zhenzhou Lin, Feifan Chen, Qiaolong Cai, Chen Li, Yongduo Sui
Subjects: cs.CL, cs.IT, cs.LG
Abstract URL: https://arxiv.org/abs/2510.21339
Pdf URL: https://arxiv.org/pdf/2510.21339
Copy Paste: [[2510.21339]] Multi-turn Training with Basic Human Feedback Helps Little on LLM Reasoning(https://arxiv.org/abs/2510.21339)
Keywords: language model, llm
Abstract: The reasoning capabilities of Large Language Models (LLMs) are typically developed through single-turn reinforcement learning, whereas real-world applications often involve multi-turn interactions with human feedback, leading to a potential mismatch between training and deployment conditions. In this work, we study whether multi-turn training with human feedback is necessary for reasoning tasks. We compare conventional single-turn training with three multi-turn strategies and reach conclusions contrary to previous research. We find that models trained in a single-turn setting generalize effectively to both single- and multi-turn evaluations, while models trained with multi-turn strategies exhibit a significant degradation in single-turn reasoning performance. These results suggest that for tasks with complete information, robust single-turn training remains more effective and reliable, as multi-turn training with basic feedback provides limited benefits and can even degrade reasoning capabilities.

Title: SindBERT, the Sailor: Charting the Seas of Turkish NLP

Authors: Raphael Scheible-Schmitt, Stefan Schweter
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.21364
Pdf URL: https://arxiv.org/pdf/2510.21364
Copy Paste: [[2510.21364]] SindBERT, the Sailor: Charting the Seas of Turkish NLP(https://arxiv.org/abs/2510.21364)
Keywords: language model
Abstract: Transformer models have revolutionized NLP, yet many morphologically rich languages remain underrepresented in large-scale pre-training efforts. With SindBERT, we set out to chart the seas of Turkish NLP, providing the first large-scale RoBERTa-based encoder for Turkish. Trained from scratch on 312 GB of Turkish text (mC4, OSCAR23, Wikipedia), SindBERT is released in both base and large configurations, representing the first large-scale encoder-only language model available for Turkish. We evaluate SindBERT on part-of-speech tagging, named entity recognition, offensive language detection, and the TurBLiMP linguistic acceptability benchmark. Our results show that SindBERT performs competitively with existing Turkish and multilingual models, with the large variant achieving the best scores in two of four tasks but showing no consistent scaling advantage overall. This flat scaling trend, also observed for XLM-R and EuroBERT, suggests that current Turkish benchmarks may already be saturated. At the same time, comparisons with smaller but more curated models such as BERTurk highlight that corpus quality and diversity can outweigh sheer data volume. Taken together, SindBERT contributes both as an openly released resource for Turkish NLP and as an empirical case study on the limits of scaling and the central role of corpus composition in morphologically rich languages. The SindBERT models are released under the MIT license and made available in both fairseq and Huggingface formats.

Title: Vision Language Models for Dynamic Human Activity Recognition in Healthcare Settings

Authors: Abderrazek Abid, Thanh-Cong Ho, Fakhri Karray
Subjects: cs.CL, cs.AI, cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2510.21424
Pdf URL: https://arxiv.org/pdf/2510.21424
Copy Paste: [[2510.21424]] Vision Language Models for Dynamic Human Activity Recognition in Healthcare Settings(https://arxiv.org/abs/2510.21424)
Keywords: language model
Abstract: As generative AI continues to evolve, Vision Language Models (VLMs) have emerged as promising tools in various healthcare applications. One area that remains relatively underexplored is their use in human activity recognition (HAR) for remote health monitoring. VLMs offer notable strengths, including greater flexibility and the ability to overcome some of the constraints of traditional deep learning models. However, a key challenge in applying VLMs to HAR lies in the difficulty of evaluating their dynamic and often non-deterministic outputs. To address this gap, we introduce a descriptive caption data set and propose comprehensive evaluation methods to evaluate VLMs in HAR. Through comparative experiments with state-of-the-art deep learning models, our findings demonstrate that VLMs achieve comparable performance and, in some cases, even surpass conventional approaches in terms of accuracy. This work contributes a strong benchmark and opens new possibilities for the integration of VLMs into intelligent healthcare systems.

Title: Redefining Retrieval Evaluation in the Era of LLMs

Authors: Giovanni Trappolini, Florin Cuconasu, Simone Filice, Yoelle Maarek, Fabrizio Silvestri
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2510.21440
Pdf URL: https://arxiv.org/pdf/2510.21440
Copy Paste: [[2510.21440]] Redefining Retrieval Evaluation in the Era of LLMs(https://arxiv.org/abs/2510.21440)
Keywords: language model, llm, retrieval augmented generation
Abstract: Traditional Information Retrieval (IR) metrics, such as nDCG, MAP, and MRR, assume that human users sequentially examine documents with diminishing attention to lower ranks. This assumption breaks down in Retrieval Augmented Generation (RAG) systems, where search results are consumed by Large Language Models (LLMs), which, unlike humans, process all retrieved documents as a whole rather than sequentially. Additionally, traditional IR metrics do not account for related but irrelevant documents that actively degrade generation quality, rather than merely being ignored. Due to these two major misalignments, namely human vs. machine position discount and human relevance vs. machine utility, classical IR metrics do not accurately predict RAG performance. We introduce a utility-based annotation schema that quantifies both the positive contribution of relevant passages and the negative impact of distracting ones. Building on this foundation, we propose UDCG (Utility and Distraction-aware Cumulative Gain), a metric using an LLM-oriented positional discount to directly optimize the correlation with the end-to-end answer accuracy. Experiments on five datasets and six LLMs demonstrate that UDCG improves correlation by up to 36% compared to traditional metrics. Our work provides a critical step toward aligning IR evaluation with LLM consumers and enables more reliable assessment of RAG components.
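The human-oriented position discount that the abstract argues against is easiest to see in classical nDCG, sketched below with the common logarithmic discount and linear gain. This shows only the traditional baseline metric; the UDCG formula is not given in the abstract and is not attempted here, and the graded relevance lists are made-up examples.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain: each gain is divided by log2(rank + 1)."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances, start=1))


def ndcg(relevances):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


# Same documents, two orderings: the metric rewards placing relevant ones first.
print(ndcg([3, 2, 1]))
print(ndcg([1, 2, 3]))
```

Note that every gain here is non-negative, so a distracting passage can at worst contribute zero, which is the second misalignment the abstract points out.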

Title: REMONI: An Autonomous System Integrating Wearables and Multimodal Large Language Models for Enhanced Remote Health Monitoring

Authors: Thanh Cong Ho, Farah Kharrat, Abderrazek Abid, Fakhri Karray
Subjects: cs.CL, cs.AI, cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2510.21445
Pdf URL: https://arxiv.org/pdf/2510.21445
Copy Paste: [[2510.21445]] REMONI: An Autonomous System Integrating Wearables and Multimodal Large Language Models for Enhanced Remote Health Monitoring(https://arxiv.org/abs/2510.21445)
Keywords: language model, llm, prompt, agent
Abstract: With the widespread adoption of wearable devices in our daily lives, the demand and appeal for remote patient monitoring have significantly increased. Most research in this field has concentrated on collecting sensor data, visualizing it, and analyzing it to detect anomalies in specific diseases such as diabetes, heart disease and depression. However, this domain has a notable gap in the aspect of human-machine interaction. This paper proposes REMONI, an autonomous REmote health MONItoring system that integrates multimodal large language models (MLLMs), the Internet of Things (IoT), and wearable devices. The system automatically and continuously collects vital signs, accelerometer data from a special wearable (such as a smartwatch), and visual data in patient video clips collected from cameras. This data is processed by an anomaly detection module, which includes a fall detection model and algorithms to identify and alert caregivers of the patient's emergency conditions. A distinctive feature of our proposed system is the natural language processing component, developed with MLLMs capable of detecting and recognizing a patient's activity and emotion while responding to healthcare workers' inquiries. Additionally, prompt engineering is employed to integrate all patient information seamlessly. As a result, doctors and nurses can access real-time vital signs and the patient's current state and mood by interacting with an intelligent agent through a user-friendly web application. Our experiments demonstrate that our system is implementable and scalable for real-life scenarios, potentially reducing the workload of medical professionals and healthcare costs. A full-fledged prototype illustrating the functionalities of the system has been developed and is being tested to demonstrate the robustness of its various capabilities.

Title: MRO: Enhancing Reasoning in Diffusion Language Models via Multi-Reward Optimization

Authors: Chenglong Wang, Yang Gan, Hang Zhou, Chi Hu, Yongyu Mu, Kai Song, Murun Yang, Bei Li, Chunliang Zhang, Tongran Liu, Jingbo Zhu, Zhengtao Yu, Tong Xiao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.21473
Pdf URL: https://arxiv.org/pdf/2510.21473
Copy Paste: [[2510.21473]] MRO: Enhancing Reasoning in Diffusion Language Models via Multi-Reward Optimization(https://arxiv.org/abs/2510.21473)
Keywords: language model, llm
Abstract: Recent advances in diffusion language models (DLMs) have presented a promising alternative to traditional autoregressive large language models (LLMs). However, DLMs still lag behind LLMs in reasoning performance, especially as the number of denoising steps decreases. Our analysis reveals that this shortcoming arises primarily from the independent generation of masked tokens across denoising steps, which fails to capture token correlation. In this paper, we define two types of token correlation, intra-sequence correlation and inter-sequence correlation, and demonstrate that enhancing these correlations improves reasoning performance. To this end, we propose a Multi-Reward Optimization (MRO) approach, which encourages DLMs to consider token correlation during the denoising process. More specifically, our MRO approach leverages test-time scaling, rejection sampling, and reinforcement learning to directly optimize token correlation with multiple elaborate rewards. Additionally, we introduce group step and importance sampling strategies to mitigate reward variance and enhance sampling efficiency. Through extensive experiments, we demonstrate that MRO not only improves reasoning performance but also achieves significant sampling speedups while maintaining high performance on reasoning benchmarks.

Title: Brain-tuning Improves Generalizability and Efficiency of Brain Alignment in Speech Models

Authors: Omer Moussa, Mariya Toneva
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.21520
Pdf URL: https://arxiv.org/pdf/2510.21520
Copy Paste: [[2510.21520]] Brain-tuning Improves Generalizability and Efficiency of Brain Alignment in Speech Models(https://arxiv.org/abs/2510.21520)
Keywords: language model
Abstract: Pretrained language models are remarkably effective in aligning with human brain responses elicited by natural language stimuli, positioning them as promising model organisms for studying language processing in the brain. However, existing approaches for both estimating and improving this brain alignment are participant-dependent and highly affected by the amount of data available per participant, hindering both generalization to new participants and population-level analyses. In this work, we address these limitations by introducing a scalable, generalizable brain-tuning method, in which we fine-tune pretrained speech language models to jointly predict fMRI responses from multiple participants. We demonstrate that the resulting brain-tuned models exhibit strong individual brain alignment while generalizing across participants. Specifically, our method leads to 1) a 5-fold decrease in the amount of fMRI data needed to predict brain data from new participants, 2) up to a 50% increase in the overall brain alignment, and 3) strong generalization to new unseen datasets. Furthermore, this multi-participant brain-tuning additionally improves downstream performance on semantic tasks, suggesting that training using brain data from multiple participants leads to more generalizable semantic representations. Taken together, these findings demonstrate a bidirectional benefit between neuroscience and AI, helping bridge the gap between the two fields. We make our code and models publicly available at this https URL.

Title: InterpDetect: Interpretable Signals for Detecting Hallucinations in Retrieval-Augmented Generation

Authors: Likun Tan, Kuan-Wei Huang, Joy Shi, Kevin Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.21538
Pdf URL: https://arxiv.org/pdf/2510.21538
Copy Paste: [[2510.21538]] InterpDetect: Interpretable Signals for Detecting Hallucinations in Retrieval-Augmented Generation(https://arxiv.org/abs/2510.21538)
Keywords: gpt, llm, hallucination, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) integrates external knowledge to mitigate hallucinations, yet models often generate outputs inconsistent with retrieved content. Accurate hallucination detection requires disentangling the contributions of external context and parametric knowledge, which prior methods typically conflate. We investigate the mechanisms underlying RAG hallucinations and find they arise when later-layer FFN modules disproportionately inject parametric knowledge into the residual stream. To address this, we explore a mechanistic detection approach based on external context scores and parametric knowledge scores. Using Qwen3-0.6b, we compute these scores across layers and attention heads and train regression-based classifiers to predict hallucinations. Our method is evaluated against state-of-the-art LLMs (GPT-5, GPT-4.1) and detection baselines (RAGAS, TruLens, RefChecker). Furthermore, classifiers trained on Qwen3-0.6b signals generalize to GPT-4.1-mini responses, demonstrating the potential of proxy-model evaluation. Our results highlight mechanistic signals as efficient, generalizable predictors for hallucination detection in RAG systems.

Title: Are the LLMs Capable of Maintaining at Least the Language Genus?

Authors: Sandra Mitrović, David Kletz, Ljiljana Dolamic, Fabio Rinaldi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.21561
Pdf URL: https://arxiv.org/pdf/2510.21561
Copy Paste: [[2510.21561]] Are the LLMs Capable of Maintaining at Least the Language Genus?(https://arxiv.org/abs/2510.21561)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) display notable variation in multilingual behavior, yet the role of genealogical language structure in shaping this variation remains underexplored. In this paper, we investigate whether LLMs exhibit sensitivity to linguistic genera by extending prior analyses on the MultiQ dataset. We first check whether models prefer to switch to genealogically related languages when prompt language fidelity is not maintained. Next, we investigate whether knowledge consistency is better preserved within than across genera. We show that genus-level effects are present but strongly conditioned by training resource availability. We further observe distinct multilingual strategies across LLM families. Our findings suggest that LLMs encode aspects of genus-level structure, but training data imbalances remain the primary factor shaping their multilingual performance.

Title: From Polyester Girlfriends to Blind Mice: Creating the First Pragmatics Understanding Benchmarks for Slovene

Authors: Mojca Brglez, Špela Vintar
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.21575
Pdf URL: https://arxiv.org/pdf/2510.21575
Copy Paste: [[2510.21575]] From Polyester Girlfriends to Blind Mice: Creating the First Pragmatics Understanding Benchmarks for Slovene(https://arxiv.org/abs/2510.21575)
Keywords: language model, llm
Abstract: Large language models are demonstrating increasing capabilities, excelling at benchmarks once considered very difficult. As their capabilities grow, there is a need for more challenging evaluations that go beyond surface-level linguistic competence. Namely, language competence involves not only syntax and semantics but also pragmatics, i.e., understanding situational meaning as shaped by context as well as linguistic and cultural norms. To contribute to this line of research, we introduce SloPragEval and SloPragMega, the first pragmatics understanding benchmarks for Slovene, which together contain 405 multiple-choice questions. We discuss the difficulties of translation, describe the campaign to establish a human baseline, and report pilot evaluations with LLMs. Our results indicate that current models have greatly improved in understanding nuanced language but may still fail to infer implied speaker meaning in non-literal utterances, especially those that are culture-specific. We also observe a significant gap between proprietary and open-source models. Finally, we argue that benchmarks targeting nuanced language understanding and knowledge of the target culture must be designed with care, preferably constructed from native data, and validated with human responses.

Title: RETuning: Upgrading Inference-Time Scaling for Stock Movement Prediction with Large Language Models

Authors: Xueyuan Lin, Cehao Yang, Ye Ma, Ming Li, Rongjunchen Zhang, Yang Ni, Xiaojun Wu, Chengjin Xu, Jian Guo, Hui Xiong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.21604
Pdf URL: https://arxiv.org/pdf/2510.21604
Copy Paste: [[2510.21604]] RETuning: Upgrading Inference-Time Scaling for Stock Movement Prediction with Large Language Models(https://arxiv.org/abs/2510.21604)
Keywords: language model, llm, long context
Abstract: Recently, large language models (LLMs) have demonstrated outstanding reasoning capabilities on mathematical and coding tasks. However, their application to financial tasks, especially the most fundamental task of stock movement prediction, remains underexplored. We study a three-class classification problem (up, hold, down) and, by analyzing existing reasoning responses, observe that (1) LLMs follow analysts' opinions rather than exhibiting systematic, independent analytical logic in their chains of thought (CoTs), and (2) LLMs list summaries from different sources without weighing adversarial evidence, yet such counterevidence is crucial for reliable prediction. This shows that the models do not make good use of their reasoning ability to complete the task. To address this, we propose Reflective Evidence Tuning (RETuning), a cold-start method prior to reinforcement learning, to enhance prediction ability. While generating CoT, RETuning encourages dynamically constructing an analytical framework from diverse information sources, organizing and scoring evidence for a price rise or fall based on that framework (rather than on contextual viewpoints), and finally reflecting to derive the prediction. This approach maximally aligns the model with its learned analytical framework, ensuring independent logical reasoning and reducing undue influence from context. We also build a large-scale dataset spanning all of 2024 for 5,123 A-share stocks, with long contexts (32K tokens) and over 200K samples. In addition to price and news, it incorporates analysts' opinions, quantitative reports, fundamental data, macroeconomic indicators, and similar stocks. Experiments show that RETuning successfully unlocks the model's reasoning ability in the financial domain. Inference-time scaling still works even after 6 months or on out-of-distribution stocks, since the models gain valuable insights about stock movement prediction.

Title: The Universal Landscape of Human Reasoning

Authors: Qiguang Chen, Jinhao Liu, Libo Qin, Yimeng Zhang, Yihao Liang, Shangxu Ren, Chengyu Luan, Dengyun Peng, Hanjing Li, Jiannan Guan, Zheng Yan, Jiaqi Wang, Mengkang Hu, Yantao Du, Zhi Chen, Xie Chen, Wanxiang Che
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.21623
Pdf URL: https://arxiv.org/pdf/2510.21623
Copy Paste: [[2510.21623]] The Universal Landscape of Human Reasoning(https://arxiv.org/abs/2510.21623)
Keywords: language model, llm
Abstract: Understanding how information is dynamically accumulated and transformed in human reasoning has long challenged cognitive psychology, philosophy, and artificial intelligence. Existing accounts, from classical logic to probabilistic models, illuminate aspects of output or individual modelling, but do not offer a unified, quantitative description of general human reasoning dynamics. To solve this, we introduce Information Flow Tracking (IF-Track), which uses large language models (LLMs) as a probabilistic encoder to quantify information entropy and gain at each reasoning step. Through fine-grained analyses across diverse tasks, our method is the first to successfully model the universal landscape of human reasoning behaviors within a single metric space. We show that IF-Track captures essential reasoning features, identifies systematic error patterns, and characterizes individual differences. Applying IF-Track to advanced psychological theory, we first reconcile single- versus dual-process theories, and then discover the alignment of artificial and human cognition and how LLMs are reshaping the human reasoning process. This approach establishes a quantitative bridge between theory and measurement, offering mechanistic insights into the architecture of reasoning.