2025-10-15

Title: PHANTOM RECALL: When Familiar Puzzles Fool Smart Models

Authors: Souradeep Mukhopadhyay, Rishabh Baral, Nimeesh Mahajan, Samhitha Harish, Aswin RRV, Mihir Parmar, Mutsumi Nakamura, Chitta Baral
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.11812
Pdf URL: https://arxiv.org/pdf/2510.11812
Copy Paste: [[2510.11812]] PHANTOM RECALL: When Familiar Puzzles Fool Smart Models(https://arxiv.org/abs/2510.11812)
Keywords: language model, gpt, llm, prompt
Abstract: Large language models (LLMs) such as GPT, Gemini, and Claude often appear adept at solving classic logic puzzles--but how much genuine reasoning underlies their answers? Recent evidence suggests that these models frequently rely on memorized templates rather than reasoning from first principles. When puzzles are slightly modified, their performance collapses, revealing a striking fragility. In particular, we asked: Have LLMs addressed these issues? To what extent? How about perturbations to other puzzles? Is there a general way of reformulating the prompt so that the models do better? To examine these things systematically, we introduce PHANTOM RECALL, a benchmark comprising 25 well-known logic puzzles and 149 carefully designed perturbations that preserve reasoning structure but alter superficial details and solutions. We evaluate eleven leading LLMs and identify a recurring failure mode--phantom recall--where models confidently reproduce memorized solutions or spurious rationales that no longer fit the altered scenario. To probe and mitigate this issue, we contribute three tools: (i) an automated logical-equivalence judge to detect reasoning mismatches, (ii) a taxonomy of fine-grained reasoning error categories, and (iii) a prompting-based mitigation framework guided by these categories. Despite near-perfect accuracy on unmodified puzzles, models significantly underperform humans on perturbed ones, exhibiting both phantom recall and over-elaboration. Our findings reveal a crucial limitation: LLMs often fail to re-reason when contextual cues shift--highlighting the gap between linguistic fluency and logical understanding.

Title: R-WoM: Retrieval-augmented World Model For Computer-use Agents

Authors: Kai Mei, Jiang Guo, Shuaichen Chang, Mingwen Dong, Dongkyu Lee, Xing Niu, Jiarong Jiang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.11892
Pdf URL: https://arxiv.org/pdf/2510.11892
Copy Paste: [[2510.11892]] R-WoM: Retrieval-augmented World Model For Computer-use Agents(https://arxiv.org/abs/2510.11892)
Keywords: language model, llm, hallucination, agent
Abstract: Large Language Models (LLMs) can serve as world models to enhance agent decision-making in digital environments by simulating future states and predicting action outcomes, potentially eliminating costly trial-and-error exploration. However, this capability is fundamentally limited by LLMs' tendency toward hallucination and their reliance on static training knowledge, which can lead to compounding errors that inhibit long-horizon simulations. To systematically investigate whether LLMs are appropriate for world modeling, we probe two core capabilities of world models--future state prediction and reward estimation--through three tasks: next-state identification, full-procedure planning alignment, and milestone transition recognition. Our analysis shows that while LLMs effectively capture immediate next states and identify meaningful state transitions, their performance rapidly degrades in full-procedure planning. This highlights LLMs' limitations in reliably modeling environment dynamics over long horizons. To address these limitations, we propose the Retrieval-augmented World Model (R-WoM), which grounds LLM simulations by incorporating factual, up-to-date knowledge retrieved from external tutorials. Experiments show that R-WoM achieves substantial improvements of up to 25.3% (OSWorld) and 18.1% (WebArena) compared to baselines, with particular advantages in longer-horizon simulations.

Title: LLM Knowledge is Brittle: Truthfulness Representations Rely on Superficial Resemblance

Authors: Patrick Haller, Mark Ibrahim, Polina Kirichenko, Levent Sagun, Samuel J. Bell
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2510.11905
Pdf URL: https://arxiv.org/pdf/2510.11905
Copy Paste: [[2510.11905]] LLM Knowledge is Brittle: Truthfulness Representations Rely on Superficial Resemblance(https://arxiv.org/abs/2510.11905)
Keywords: language model, llm
Abstract: For Large Language Models (LLMs) to be reliable, they must learn robust knowledge that can be generally applied in diverse settings -- often unlike those seen during training. Yet, extensive research has shown that LLM performance can be brittle, with models exhibiting excessive sensitivity to trivial input variations. In this work, we explore whether this brittleness is a direct result of unstable internal knowledge representations. To explore this question, we build on previous work showing that LLM representations encode statement truthfulness -- i.e., true, factual statements can be easily separated from false, inaccurate ones. Specifically, we test the robustness of learned knowledge by evaluating representation separability on samples that have undergone superficial transformations to drive them out-of-distribution (OOD), such as typos or reformulations. By applying semantically-preserving perturbations, we study how separability degrades as statements become more OOD, across four LLM families, five evaluation datasets, and three knowledge probing methods. Our results reveal that internal representations of statement truthfulness collapse as the samples' presentations become less similar to those seen during pre-training. While LLMs can often distinguish between true and false statements when they closely resemble the pre-training data, this ability is highly dependent on the statement's exact surface form. These findings offer a possible explanation for brittle benchmark performance: LLMs may learn shallow, non-robust knowledge representations that allow for only limited generalizability. Our work presents a fundamental challenge for the utility of truthfulness probes, and more broadly, calls for further research on improving the robustness of learned knowledge representations.

Title: LLM Reasoning for Machine Translation: Synthetic Data Generation over Thinking Tokens

Authors: Armel Zebaze, Rachel Bawden, Benoît Sagot
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.11919
Pdf URL: https://arxiv.org/pdf/2510.11919
Copy Paste: [[2510.11919]] LLM Reasoning for Machine Translation: Synthetic Data Generation over Thinking Tokens(https://arxiv.org/abs/2510.11919)
Keywords: llm, prompt
Abstract: Large reasoning models (LRMs) have led to new possibilities in terms of problem-solving, through the devising of a natural language thought process prior to answering a query. While their capabilities are well known across mathematics and coding tasks, their impact on the task of machine translation (MT) remains underexplored. In this work, we explore the benefits of the generation of intermediate tokens when performing MT across multiple language pairs of different levels of resourcedness and multiple setups. We find that "thinking tokens" do not help LRMs better perform MT. This result generalizes to models fine-tuned to reason before translating using distilled chain of thought (CoT) inspired by human translators' practices. Specifically, fine-tuning a model with synthetic CoT explanations detailing how to translate step-by-step does not outperform standard input-output fine-tuning. However, constructing the intermediate tokens by combining the outputs of modular translation-specific prompting strategies results in improvements. Our findings underscore that the contribution of intermediate tokens during fine-tuning highly depends on the presence of translation attempts within them. More broadly, our results suggest that using a teacher to refine target translations or to expand parallel corpora is more impactful than distilling their CoT explanations into "thinking" MT models.

Title: TopoAlign: A Framework for Aligning Code to Math via Topological Decomposition

Authors: Yupei Li, Philipp Borchert, Gerasimos Lampouras
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.11944
Pdf URL: https://arxiv.org/pdf/2510.11944
Copy Paste: [[2510.11944]] TopoAlign: A Framework for Aligning Code to Math via Topological Decomposition(https://arxiv.org/abs/2510.11944)
Keywords: language model, llm, hallucination
Abstract: Large Language Models (LLMs) excel at both informal and formal (e.g. Lean 4) mathematical reasoning but still struggle with autoformalisation, the task of transforming informal into formal mathematical statements. Autoformalisation helps pair the informal reasoning of LLMs with formal proof assistants which enable machine-verifiable generation and mitigate hallucinations. Yet, the performance of current Math LLMs is constrained by the scarcity of large-scale corpora, particularly those containing pairs of informal and formal statements. Although current models are trained to generate code from natural language instructions, structural and syntactic differences between these and formal mathematics limit effective transfer learning. We propose TopoAlign, a framework that unlocks widely available code repositories as training resources for Math LLMs. TopoAlign decomposes code into docstrings, main functions, and dependency functions, and reassembles these components into analogues that structurally mirror formal statements. This produces structurally aligned code data that can be used for training Math LLMs without requiring additional human annotation. We train two state-of-the-art models, DeepSeek-Math and Herald, and evaluate them on the minif2f, Putnam, and ProofNet benchmarks. TopoAlign provides substantial gains for DeepSeek-Math, improving performance by 17.77% on BEq@10 and 68.82% on typecheck@10. Despite introducing no new mathematical knowledge, our framework achieves gains of 0.12% and 1.09% for Herald on BEq@10 and typecheck@10, respectively, demonstrating that training on aligned code data is beneficial even for specialized models.
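A rough illustration of the decomposition step only: the abstract says code is split into docstrings, main functions, and dependency functions, but gives no implementation details. The sketch below uses Python's standard `ast` module; the `decompose` helper, the sample source, and the returned dictionary layout are all hypothetical and are not taken from TopoAlign. How the components are reassembled into analogues of formal statements is not sketched here.

```python
import ast

# Hypothetical sample module: one helper and one "main" function that calls it.
SOURCE = '''
def square(x):
    return x * x

def sum_of_squares(xs):
    """Return the sum of the squares of xs."""
    return sum(square(x) for x in xs)
'''

def decompose(source, main_name):
    """Split a module into the main function's docstring, its name,
    and the names of same-module functions it calls (its dependencies)."""
    tree = ast.parse(source)
    # Top-level function definitions, indexed by name.
    funcs = {n.name: n for n in tree.body if isinstance(n, ast.FunctionDef)}
    main = funcs[main_name]
    # Names of everything called directly by name inside the main function.
    called = {
        n.func.id
        for n in ast.walk(main)
        if isinstance(n, ast.Call) and isinstance(n.func, ast.Name)
    }
    # Keep only calls that resolve to functions defined in this module.
    deps = sorted(name for name in called if name in funcs and name != main_name)
    return {
        "docstring": ast.get_docstring(main),
        "main": main.name,
        "dependencies": deps,
    }

parts = decompose(SOURCE, "sum_of_squares")
```

Built-ins such as `sum` are filtered out because they are not defined in the module, so only `square` is reported as a dependency.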

Title: GRAVITY: A Framework for Personalized Text Generation via Profile-Grounded Synthetic Preferences

Authors: Priyanka Dey, Daniele Rosa, Wenqing Zheng, Daniel Barcklow, Jieyu Zhao, Emilio Ferrara
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.11952
Pdf URL: https://arxiv.org/pdf/2510.11952
Copy Paste: [[2510.11952]] GRAVITY: A Framework for Personalized Text Generation via Profile-Grounded Synthetic Preferences(https://arxiv.org/abs/2510.11952)
Keywords: llm, prompt
Abstract: Personalization in LLMs often relies on costly human feedback or interaction logs, limiting scalability and neglecting deeper user attributes. To reduce the reliance on human annotations, we introduce GRAVITY (Generative Response with Aligned Values, Interests, and Traits of You), a framework for generating synthetic, profile-grounded preference data that captures users' interests, values, beliefs, and personality traits. By integrating demographic, cultural, and psychological frameworks -- including Hofstede's cultural dimensions, Schwartz's basic values, the World Values Survey, and Big Five OCEAN traits -- GRAVITY synthesizes preference pairs to guide personalized content generation. We evaluate GRAVITY on book descriptions for 400 Amazon users, comparing it to prompt-based conditioning, standard fine-tuning, and naive synthetic pair generation. Profile-grounded synthetic data consistently improves generation, especially across multiple cultures (USA, Brazil, Japan, India), achieving over 4% higher preference gains across baselines, with user studies showing that GRAVITY outputs are preferred over 86% of the time. Our results show that scenario-grounded synthetic data can capture richer user variation, reduce reliance on costly annotation, and produce more engaging, user-centered content, offering a scalable path for LLM personalization.

Title: Evaluating Retrieval-Augmented Generation Systems on Unanswerable, Uncheatable, Realistic, Multi-hop Queries

Authors: Gabrielle Kaili-May Liu, Bryan Li, Arman Cohan, William Gantt Walden, Eugene Yang
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2510.11956
Pdf URL: https://arxiv.org/pdf/2510.11956
Copy Paste: [[2510.11956]] Evaluating Retrieval-Augmented Generation Systems on Unanswerable, Uncheatable, Realistic, Multi-hop Queries(https://arxiv.org/abs/2510.11956)
Keywords: llm, retrieval-augmented generation
Abstract: Real-world use cases often present RAG systems with complex queries for which relevant information is missing from the corpus or is incomplete. In these settings, RAG systems must be able to reject unanswerable, out-of-scope queries and identify failures of retrieval and multi-hop reasoning. Despite this, existing RAG benchmarks rarely reflect realistic task complexity for multi-hop or out-of-scope questions, which often can be cheated via disconnected reasoning (i.e., solved without genuine multi-hop inference) or require only simple factual recall. This limits the ability of such benchmarks to uncover limitations of existing RAG systems. To address this gap, we present the first pipeline for automatic, difficulty-controlled creation of uncheatable, realistic, unanswerable, and multi-hop queries (CRUMQs), adaptable to any corpus and domain. We use our pipeline to create CRUMQs over two popular RAG datasets and demonstrate its effectiveness via benchmark experiments on leading retrieval-augmented LLMs. Results show that compared to prior RAG benchmarks, CRUMQs are highly challenging for RAG systems and achieve up to 81.0% reduction in cheatability scores. More broadly, our pipeline offers a simple way to enhance benchmark difficulty and realism and drive development of more capable RAG systems.

Title: Direct Multi-Token Decoding

Authors: Xuan Luo, Weizhi Wang, Xifeng Yan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.11958
Pdf URL: https://arxiv.org/pdf/2510.11958
Copy Paste: [[2510.11958]] Direct Multi-Token Decoding(https://arxiv.org/abs/2510.11958)
Keywords: language model, llm
Abstract: Decoder-only transformers have become the standard architecture for large language models (LLMs) due to their strong performance. Recent studies suggest that, in pre-trained LLMs, early, middle, and late layers may serve distinct roles: Early layers focus on understanding the input context, middle layers handle task-specific processing, and late layers convert abstract representations into output tokens. We hypothesize that once representations have been processed by the early and middle layers, the resulting hidden states may encapsulate sufficient information to support the generation of multiple tokens using only the late layers, eliminating the need to repeatedly traverse the early and middle layers. We refer to this inference paradigm as Direct Multi-Token Decoding (DMTD). Unlike speculative decoding, our method introduces no additional parameters, auxiliary routines, or post-generation verification. Despite being trained on a limited dataset, a fine-tuned DMTD Qwen3-4B model has already demonstrated promising results, achieving up to a 2x speedup with only minor performance loss. Moreover, as shown in our scaling analysis, its performance is expected to further improve with larger training datasets.

Title: Scaling Long-Horizon LLM Agent via Context-Folding

Authors: Weiwei Sun, Miao Lu, Zhan Ling, Kang Liu, Xuesong Yao, Yiming Yang, Jiecao Chen
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2510.11967
Pdf URL: https://arxiv.org/pdf/2510.11967
Copy Paste: [[2510.11967]] Scaling Long-Horizon LLM Agent via Context-Folding(https://arxiv.org/abs/2510.11967)
Keywords: language model, llm, agent
Abstract: Large language model (LLM) agents are fundamentally constrained by context length on long-horizon tasks. We introduce Context-Folding, a framework that empowers agents to actively manage their working context. An agent can procedurally branch into a sub-trajectory to handle a subtask and then fold it upon completion, collapsing the intermediate steps while retaining a concise summary of the outcome. To make this behavior learnable, we develop an end-to-end reinforcement learning framework FoldGRPO with specific process rewards to encourage effective task decomposition and context management. On complex long-horizon tasks (Deep Research and SWE), our folding agent matches or outperforms the ReAct baselines while using an active context 10× smaller and significantly outperforms models that rely on summarization-based context management.
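The branch-and-fold behavior described in this abstract can be pictured with a small data structure. This is a minimal sketch under stated assumptions, not the authors' implementation: the `FoldingContext` class, its method names, and the `[branch]`/`[folded]` markers are hypothetical, and the reinforcement learning side (FoldGRPO and its process rewards) is not modeled.

```python
from dataclasses import dataclass, field

@dataclass
class FoldingContext:
    """Toy working context: a flat list of messages plus a stack of branch starts."""
    messages: list = field(default_factory=list)
    _branch_starts: list = field(default_factory=list)

    def add(self, msg):
        """Append one step to the active context."""
        self.messages.append(msg)

    def branch(self, subtask):
        """Open a sub-trajectory and remember where it begins."""
        self.messages.append(f"[branch] {subtask}")
        self._branch_starts.append(len(self.messages) - 1)

    def fold(self, summary):
        """Collapse the most recent sub-trajectory into a one-line summary."""
        start = self._branch_starts.pop()
        del self.messages[start:]  # drop the branch marker and its intermediate steps
        self.messages.append(f"[folded] {summary}")

ctx = FoldingContext()
ctx.add("task: fix failing test")
ctx.branch("locate the bug")
ctx.add("open file a.py")
ctx.add("grep for parse()")
ctx.fold("bug is in parse() in a.py")
```

After the fold, the active context holds only the original task and the summary, which is the sense in which intermediate steps are collapsed while the outcome is retained.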

Title: Conjecturing: An Overlooked Step in Formal Mathematical Reasoning

Authors: Jasivan Alex Sivakumar, Philipp Borchert, Ronald Cardenas, Gerasimos Lampouras
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.11986
Pdf URL: https://arxiv.org/pdf/2510.11986
Copy Paste: [[2510.11986]] Conjecturing: An Overlooked Step in Formal Mathematical Reasoning(https://arxiv.org/abs/2510.11986)
Keywords: language model, gpt, llm
Abstract: Autoformalisation, the task of expressing informal mathematical statements in formal language, is often viewed as a direct translation process. This, however, disregards a critical preceding step: conjecturing. Many mathematical problems cannot be formalised directly without first conjecturing a conclusion such as an explicit answer, or a specific bound. Since Large Language Models (LLMs) already struggle with autoformalisation, and the evaluation of their conjecturing ability is limited and often entangled within autoformalisation or proof, it is particularly challenging to understand its effect. To address this gap, we augment existing datasets to create ConjectureBench, and redesign the evaluation framework and metric specifically to measure the conjecturing capabilities of LLMs both as a distinct task and within the autoformalisation pipeline. Our evaluation of foundational models, including GPT-4.1 and DeepSeek-V3.1, reveals that their autoformalisation performance is substantially overestimated when the conjecture is accounted for during evaluation. However, the conjecture should not be assumed to be provided. We design an inference-time method, Lean-FIRe, to improve conjecturing and autoformalisation, which, to the best of our knowledge, achieves the first successful end-to-end autoformalisation of 13 PutnamBench problems with GPT-4.1 and 7 with DeepSeek-V3.1. We demonstrate that while LLMs possess the requisite knowledge to generate accurate conjectures, improving autoformalisation performance requires treating conjecturing as an independent task, and investigating further how to correctly integrate it within autoformalisation. Finally, we provide forward-looking guidance to steer future research toward improving conjecturing, an overlooked step of formal mathematical reasoning.

Title: SAGE: A Top-Down Bottom-Up Knowledge-Grounded User Simulator for Multi-turn AGent Evaluation

Authors: Ryan Shea, Yunan Lu, Liang Qiu, Zhou Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.11997
Pdf URL: https://arxiv.org/pdf/2510.11997
Copy Paste: [[2510.11997]] SAGE: A Top-Down Bottom-Up Knowledge-Grounded User Simulator for Multi-turn AGent Evaluation(https://arxiv.org/abs/2510.11997)
Keywords: agent
Abstract: Evaluating multi-turn interactive agents is challenging due to the need for human assessment. Evaluation with simulated users has been introduced as an alternative; however, existing approaches typically model generic users and overlook the domain-specific principles required to capture realistic behavior. We propose SAGE, a novel user Simulation framework for multi-turn AGent Evaluation that integrates knowledge from business contexts. SAGE incorporates top-down knowledge rooted in business logic, such as ideal customer profiles, grounding user behavior in realistic customer personas. We further integrate bottom-up knowledge taken from business agent infrastructure (e.g., product catalogs, FAQs, and knowledge bases), allowing the simulator to generate interactions that reflect users' information needs and expectations in a company's target market. Through empirical evaluation, we find that this approach produces interactions that are more realistic and diverse, while also identifying up to 33% more agent errors, highlighting its effectiveness as an evaluation tool to support bug-finding and iterative agent improvement.

Title: Generate Logical Equivalence Questions

Authors: Xinyu Wang, Haoming Yu, Yicheng Yang, Zhiyuan Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.12001
Pdf URL: https://arxiv.org/pdf/2510.12001
Copy Paste: [[2510.12001]] Generate Logical Equivalence Questions(https://arxiv.org/abs/2510.12001)
Keywords: language model
Abstract: Academic dishonesty is met with zero tolerance in higher education, yet plagiarism has become increasingly prevalent in the era of online teaching and learning. Automatic Question Generation (AQG) presents a potential solution to mitigate copying by creating unique questions for each student. Additionally, AQG can provide a vast array of practice questions. Our AQG focuses on generating logical equivalence questions for Discrete Mathematics, a foundational course for first-year computer science students. A literature review reveals that existing AQGs for this type of question generate all propositions that meet user-defined constraints, resulting in inefficiencies and a lack of uniform question difficulty. To address this, we propose a new approach that defines logical equivalence questions using a formal language, translates this language into two sets of generation rules, and develops a linear-time algorithm for question generation. We evaluated our AQG through two experiments. The first involved a group of students completing questions generated by our system. Statistical analysis shows that the accuracy of these questions is comparable to that of textbook questions. The second experiment assessed the number of steps required to solve our generated questions, textbook questions, and those generated by multiple large language models. The results indicated that the difficulty of our questions was similar to that of textbook questions, confirming the quality of our AQG.
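For readers unfamiliar with the question type, the sketch below shows what it means for two propositions to be logically equivalent: they agree under every truth assignment. This is a brute-force truth-table check written for illustration, exponential in the number of variables; it is not the paper's formal language, generation rules, or linear-time generation algorithm, and the `equivalent` helper is a hypothetical name.

```python
from itertools import product

def equivalent(f, g, n_vars):
    """Return True if f and g agree on every truth assignment of n_vars variables."""
    return all(
        f(*vals) == g(*vals)
        for vals in product([False, True], repeat=n_vars)
    )

# De Morgan's law: not (p and q) versus (not p) or (not q).
lhs = lambda p, q: not (p and q)
rhs = lambda p, q: (not p) or (not q)

# An implication p -> q, written as (not p) or q, versus its converse q -> p.
implies = lambda p, q: (not p) or q
converse = lambda p, q: (not q) or p

de_morgan_holds = equivalent(lhs, rhs, 2)
converse_holds = equivalent(implies, converse, 2)
```

The first pair is equivalent; the second is not, since the assignment p = True, q = False makes the implication false and its converse true.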

Title: Information Extraction from Conversation Transcripts: Neuro-Symbolic vs. LLM

Authors: Alice Saebom Kwak, Maria Alexeeva, Gus Hahn-Powell, Keith Alcock, Kevin McLaughlin, Doug McCorkle, Gabe McNunn, Mihai Surdeanu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.12023
Pdf URL: https://arxiv.org/pdf/2510.12023
Copy Paste: [[2510.12023]] Information Extraction from Conversation Transcripts: Neuro-Symbolic vs. LLM(https://arxiv.org/abs/2510.12023)
Keywords: language model, llm, hallucination
Abstract: The current trend in information extraction (IE) is to rely extensively on large language models, effectively discarding decades of experience in building symbolic or statistical IE systems. This paper compares a neuro-symbolic (NS) and an LLM-based IE system in the agricultural domain, evaluating them on nine interviews across pork, dairy, and crop subdomains. The LLM-based system outperforms the NS one (F1 total: 69.4 vs. 52.7; core: 63.0 vs. 47.2), where total includes all extracted information and core focuses on essential details. However, each system has trade-offs: the NS approach offers faster runtime, greater control, and high accuracy in context-free tasks but lacks generalizability, struggles with contextual nuances, and requires significant resources to develop and maintain. The LLM-based system achieves higher performance, faster deployment, and easier maintenance but has slower runtime, limited control, model dependency and hallucination risks. Our findings highlight the "hidden cost" of deploying NLP systems in real-world applications, emphasizing the need to balance performance, efficiency, and control.

Title: CPR: Mitigating Large Language Model Hallucinations with Curative Prompt Refinement

Authors: Jung-Woo Shim, Yeong-Joon Ju, Ji-Hoon Park, Seong-Whan Lee
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.12029
Pdf URL: https://arxiv.org/pdf/2510.12029
Copy Paste: [[2510.12029]] CPR: Mitigating Large Language Model Hallucinations with Curative Prompt Refinement(https://arxiv.org/abs/2510.12029)
Keywords: language model, llm, hallucination, prompt
Abstract: Recent advancements in large language models (LLMs) highlight their fluency in generating responses to diverse prompts. However, these models sometimes generate plausible yet incorrect "hallucinated" facts, undermining trust. A frequent but often overlooked cause of such errors is the use of poorly structured or vague prompts by users, leading LLMs to base responses on assumed rather than actual intentions. To mitigate hallucinations induced by these ill-formed prompts, we introduce Curative Prompt Refinement (CPR), a plug-and-play framework that 1) cleans ill-formed prompts, and 2) generates additional informative task descriptions to align the intention of the user and the prompt using a fine-tuned small language model. When applied to language models, we discover that CPR significantly increases the quality of generation while also mitigating hallucination. Empirical studies show that prompts with CPR applied achieve a win rate of over 90% against the original prompts without any external knowledge.

Title: Multi-stage Prompt Refinement for Mitigating Hallucinations in Large Language Models

Authors: Jung-Woo Shim, Yeong-Joon Ju, Ji-Hoon Park, Seong-Whan Lee
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.12032
Pdf URL: https://arxiv.org/pdf/2510.12032
Copy Paste: [[2510.12032]] Multi-stage Prompt Refinement for Mitigating Hallucinations in Large Language Models(https://arxiv.org/abs/2510.12032)
Keywords: language model, llm, hallucination, prompt
Abstract: Recent advancements in large language models (LLMs) have shown strong performance in natural language understanding and generation tasks. However, LLMs continue to encounter challenges with hallucinations, where models generate plausible but incorrect information. While several factors contribute to hallucinations, the impact of ill-formed prompts (prompts with ambiguous wording, incorrect grammar, or incomplete information) remains relatively underexplored. To address this, we introduce Multi-stage Prompt Refinement (MPR), a framework designed to systematically improve these ill-formed prompts across multiple stages. Each stage addresses specific errors such as punctuation, typographical mistakes, and misuse of key terms, using small language models (SLMs) fine-tuned for these tasks. MPR iteratively enhances the clarity of prompts with additional context and employs a self-reflection mechanism with ranking to prioritize the most relevant input. Experimental results on hallucination benchmarks show that prompts refined by MPR achieve over an 85% win rate compared to their original forms, demonstrating its effectiveness in reducing hallucinations and improving LLM output accuracy. Interestingly, we reveal that MPR can be combined with existing post-hoc hallucination mitigation frameworks, further enhancing its versatility. MPR provides a lightweight and adaptable solution for enhancing LLM reliability across various domains.

Title: Uncertainty Quantification for Hallucination Detection in Large Language Models: Foundations, Methodology, and Future Directions

Authors: Sungmin Kang, Yavuz Faruk Bakman, Duygu Nur Yaldiz, Baturalp Buyukates, Salman Avestimehr
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.12040
Pdf URL: https://arxiv.org/pdf/2510.12040
Copy Paste: [[2510.12040]] Uncertainty Quantification for Hallucination Detection in Large Language Models: Foundations, Methodology, and Future Directions(https://arxiv.org/abs/2510.12040)
Keywords: language model, llm, hallucination
Abstract: The rapid advancement of large language models (LLMs) has transformed the landscape of natural language processing, enabling breakthroughs across a wide range of areas including question answering, machine translation, and text summarization. Yet, their deployment in real-world applications has raised concerns over reliability and trustworthiness, as LLMs remain prone to hallucinations that produce plausible but factually incorrect outputs. Uncertainty quantification (UQ) has emerged as a central research direction to address this issue, offering principled measures for assessing the trustworthiness of model generations. We begin by introducing the foundations of UQ, from its formal definition to the traditional distinction between epistemic and aleatoric uncertainty, and then highlight how these concepts have been adapted to the context of LLMs. Building on this, we examine the role of UQ in hallucination detection, where quantifying uncertainty provides a mechanism for identifying unreliable generations and improving reliability. We systematically categorize a wide spectrum of existing methods along multiple dimensions and present empirical results for several representative approaches. Finally, we discuss current limitations and outline promising future research directions, providing a clearer picture of the current landscape of LLM UQ for hallucination detection.

Title: Improving Text-to-Image Generation with Input-Side Inference-Time Scaling

Authors: Ruibo Chen, Jiacheng Pan, Heng Huang, Zhenheng Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.12041
Pdf URL: https://arxiv.org/pdf/2510.12041
Copy Paste: [[2510.12041]] Improving Text-to-Image Generation with Input-Side Inference-Time Scaling(https://arxiv.org/abs/2510.12041)
Keywords: language model, llm, prompt
Abstract: Recent advances in text-to-image (T2I) generation have achieved impressive results, yet existing models often struggle with simple or underspecified prompts, leading to suboptimal image-text alignment, aesthetics, and quality. We propose a prompt rewriting framework that leverages large language models (LLMs) to refine user inputs before feeding them into T2I backbones. Our approach introduces a carefully designed reward system and an iterative direct preference optimization (DPO) training pipeline, enabling the rewriter to enhance prompts without requiring supervised fine-tuning data. We evaluate our method across diverse T2I models and benchmarks. Results show that our prompt rewriter consistently improves image-text alignment, visual quality, and aesthetics, outperforming strong baselines. Furthermore, we demonstrate strong transferability by showing that a prompt rewriter trained on one T2I backbone generalizes effectively to others without needing to be retrained. We also systematically study scalability, evaluating how performance gains scale with the capacity of the LLM used as the rewriter. These findings highlight that prompt rewriting is an effective, scalable, and practical model-agnostic strategy for improving T2I systems. We plan to release the code and trained prompt rewriters soon.

Title: Hierarchical Alignment: Surgical Fine-Tuning via Functional Layer Specialization in Large Language Models

Authors: Yukun Zhang, Qi Dong
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.12044
Pdf URL: https://arxiv.org/pdf/2510.12044
Copy Paste: [[2510.12044]] Hierarchical Alignment: Surgical Fine-Tuning via Functional Layer Specialization in Large Language Models(https://arxiv.org/abs/2510.12044)
Keywords: language model, llm
Abstract: Existing alignment techniques for Large Language Models (LLMs), such as Direct Preference Optimization (DPO), typically treat the model as a monolithic entity, applying uniform optimization pressure across all layers. This approach overlooks the functional specialization within the Transformer architecture, where different layers are known to handle distinct tasks from syntax to abstract reasoning. In this paper, we challenge this one-size-fits-all paradigm by introducing Hierarchical Alignment, a novel method that applies targeted DPO to distinct functional blocks of a model's layers: local (syntax), intermediate (logic), and global (factuality). Through a series of controlled experiments on state-of-the-art models like Llama-3.1-8B and Qwen1.5-7B using LoRA for surgical fine-tuning, our results, evaluated by a powerful LLM-as-Judge, demonstrate significant and predictable improvements. Specifically, aligning the local layers (Local-Align) enhances grammatical fluency. More importantly, aligning the global layers (Global-Align) not only improves factual consistency as hypothesized but also proves to be the most effective strategy for enhancing logical coherence, outperforming all baselines. Critically, all hierarchical strategies successfully avoid the "alignment tax" observed in standard DPO, where gains in fluency come at the cost of degraded logical reasoning. These findings establish a more resource-efficient, controllable, and interpretable path for model alignment, highlighting the immense potential of shifting from monolithic optimization to structure-aware surgical fine-tuning to build more advanced and reliable LLMs.

Title: APCE: Adaptive Progressive Context Expansion for Long Context Processing

Authors: Baisub Lee, Sanghyun Byun, Mohanad Odema, Jung Guack, Jacob Song, Woo Seong Chung
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.12051
Pdf URL: https://arxiv.org/pdf/2510.12051
Copy Paste: [[2510.12051]] APCE: Adaptive Progressive Context Expansion for Long Context Processing(https://arxiv.org/abs/2510.12051)
Keywords: long context
Abstract: Deploying useful Long-Context Transformer Models (LCTMs) requires addressing two key challenges: (1) a growing memory footprint due to quadratic self-attention and linear KV-cache scaling in memory as sequence length increases; (2) the ContextRot phenomenon, where empirical evidence suggests that a transformer architecture's performance degrades with increasing context length. Given the shared dependency on the input, a natural question arises: Can we surgically select the most important input chunks for processing to synergistically (a) reduce the memory footprint, and (b) mitigate the ContextRot effects? In this paper, we answer this question in the affirmative for long-context summarization tasks. We propose APCE as a context-aware solution to select the most important input chunks through low-dimensional semantic similarity matching with the current query. By directly operating on the input, APCE decouples from strict dependency on underlying hardware or CUDA environments, promising a compatible solution scalable to different deployment systems. Our empirical evaluations have demonstrated superior or on-par summarization performance for APCE compared to the full dense baseline while using a fraction (50%-70%) of the input sequence, resulting in KV-cache and self-attention memory efficiency improvements. We hope our findings inspire further research on context-aware efficiency solutions for LCTMs geared towards other relevant long-context tasks.
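Note: the chunk-selection idea described in this abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the use of cosine similarity, the toy 2-D embeddings, and the fixed `keep_fraction`. The abstract does not specify how APCE embeds chunks or how its adaptive, progressive expansion works, so this is not the authors' method.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_chunks(chunk_embeddings: list[list[float]],
                  query_embedding: list[float],
                  keep_fraction: float = 0.5) -> list[int]:
    """Return indices of the chunks most similar to the query.

    Hypothetical helper: keeps ceil(keep_fraction * n) chunks and returns
    their indices in original document order.
    """
    n = len(chunk_embeddings)
    k = max(1, math.ceil(n * keep_fraction))
    ranked = sorted(
        range(n),
        key=lambda i: cosine(chunk_embeddings[i], query_embedding),
        reverse=True,
    )
    return sorted(ranked[:k])


# Toy embeddings (made up for illustration).
chunks = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
query = [1.0, 0.0]
kept = select_chunks(chunks, query, keep_fraction=0.5)
print(kept)
```

Only the kept chunks would then be passed to the model, which is where the memory saving in the abstract would come from.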

Title: An AI-Based Behavioral Health Safety Filter and Dataset for Identifying Mental Health Crises in Text-Based Conversations

Authors: Benjamin W. Nelson, Celeste Wong, Matthew T. Silvestrini, Sooyoon Shin, Alanna Robinson, Jessica Lee, Eric Yang, John Torous, Andrew Trister
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.12083
Pdf URL: https://arxiv.org/pdf/2510.12083
Copy Paste: [[2510.12083]] An AI-Based Behavioral Health Safety Filter and Dataset for Identifying Mental Health Crises in Text-Based Conversations(https://arxiv.org/abs/2510.12083)
Keywords: language model
Abstract: Large language models often mishandle psychiatric emergencies, offering harmful or inappropriate advice and enabling destructive behaviors. This study evaluated the Verily behavioral health safety filter (VBHSF) on two datasets: the Verily Mental Health Crisis Dataset containing 1,800 simulated messages and the NVIDIA Aegis AI Content Safety Dataset subsetted to 794 mental health-related messages. The two datasets were clinician-labelled and we evaluated performance using the clinician labels. Additionally, we carried out comparative performance analyses against two open-source content moderation guardrails: OpenAI Omni Moderation Latest and NVIDIA NeMo Guardrails. The VBHSF demonstrated well-balanced performance on the Verily Mental Health Crisis Dataset v1.0, achieving high sensitivity (0.990) and specificity (0.992) in detecting any mental health crises. It achieved an F1-score of 0.939, sensitivity ranged from 0.917 to 0.992, and specificity was >= 0.978 in identifying specific crisis categories. When evaluated against the NVIDIA Aegis AI Content Safety Dataset 2.0, the VBHSF retained high sensitivity (0.982) and accuracy (0.921), with reduced specificity (0.859). When compared with the NVIDIA NeMo and OpenAI Omni Moderation Latest guardrails, the VBHSF demonstrated superior performance metrics across both datasets, achieving significantly higher sensitivity in all cases (all p < 0.001) and higher specificity relative to NVIDIA NeMo (p < 0.001), but not to OpenAI Omni Moderation Latest (p = 0.094). NVIDIA NeMo and OpenAI Omni Moderation Latest exhibited inconsistent performance across specific crisis types, with sensitivity for some categories falling below 0.10. Overall, the VBHSF demonstrated robust, generalizable performance that prioritizes sensitivity to minimize missed crises, a crucial feature for healthcare applications.
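Note: for readers less familiar with the metrics quoted in this abstract, the sketch below shows how sensitivity, specificity, accuracy, and F1 follow from confusion-matrix counts. The counts are hypothetical and chosen only to make the arithmetic easy to check; they are not taken from the paper.

```python
def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Standard binary-classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)          # recall: share of true crises caught
    specificity = tn / (tn + fp)          # share of non-crises correctly passed
    precision = tp / (tp + fp)            # share of flagged messages that are crises
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "f1": f1,
    }


# Hypothetical counts for illustration only.
m = binary_metrics(tp=90, fp=20, tn=180, fn=10)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

A filter that prioritizes sensitivity, as the abstract describes, accepts more false positives (lower specificity) in exchange for fewer missed crises (lower `fn`).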

Title: Deep Associations, High Creativity: A Simple yet Effective Metric for Evaluating Large Language Models

Authors: Ziliang Qiu, Renfen Hu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.12110
Pdf URL: https://arxiv.org/pdf/2510.12110
Copy Paste: [[2510.12110]] Deep Associations, High Creativity: A Simple yet Effective Metric for Evaluating Large Language Models(https://arxiv.org/abs/2510.12110)
Keywords: language model, llm, chat
Abstract: The evaluation of LLMs' creativity represents a crucial research domain, though challenges such as data contamination and costly human assessments often impede progress. Drawing inspiration from human creativity assessment, we propose PACE, asking LLMs to generate Parallel Association Chains to Evaluate their creativity. PACE minimizes the risk of data contamination and offers a straightforward, highly efficient evaluation, as evidenced by its strong correlation with Chatbot Arena Creative Writing rankings (Spearman's $\rho = 0.739$, $p < 0.001$) across various proprietary and open-source models. A comparative analysis of associative creativity between LLMs and humans reveals that while high-performing LLMs achieve scores comparable to average human performance, professional humans consistently outperform LLMs. Furthermore, linguistic analysis reveals that both humans and LLMs exhibit a trend of decreasing concreteness in their associations, with humans demonstrating a greater diversity of associative patterns.

Title: Tracing Multilingual Knowledge Acquisition Dynamics in Domain Adaptation: A Case Study of English-Japanese Biomedical Adaptation

Authors: Xin Zhao, Naoki Yoshinaga, Yuma Tsuta, Akiko Aizawa
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.12115
Pdf URL: https://arxiv.org/pdf/2510.12115
Copy Paste: [[2510.12115]] Tracing Multilingual Knowledge Acquisition Dynamics in Domain Adaptation: A Case Study of English-Japanese Biomedical Adaptation(https://arxiv.org/abs/2510.12115)
Keywords: language model, llm
Abstract: Multilingual domain adaptation (ML-DA) is widely used to teach large language models (LLMs) new domain knowledge across languages. Although many methods have been proposed to improve domain adaptation, the mechanisms of multilingual knowledge acquisition, that is, how domain knowledge is learned within a language and transferred across languages, remain underexplored. This gap leads to suboptimal performance, particularly in low-resource settings. This work examines the learning dynamics of LLMs during ML-DA. Because prior ML-DA studies often train and evaluate on datasets with mismatched knowledge coverage, we propose AdaXEval, an adaptive evaluation method that builds multiple-choice QA datasets from the same bilingual domain corpus used for training, thereby directly studying multilingual knowledge acquisition. Through continual training of LLMs with diverse data recipes, we track how LLMs acquire domain facts and pinpoint the mechanism behind the transformation process from domain training data to knowledge. Our experiments on a 13B English-Japanese bilingual LLM reveal that cross-lingual transfer remains challenging despite a high-quality bilingual corpus. The code has been released.

Title: Understanding the Modality Gap: An Empirical Study on the Speech-Text Alignment Mechanism of Large Speech Language Models

Authors: Bajian Xiang, Shuaijiang Zhao, Tingwei Guo, Wei Zou
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.12116
Pdf URL: https://arxiv.org/pdf/2510.12116
Copy Paste: [[2510.12116]] Understanding the Modality Gap: An Empirical Study on the Speech-Text Alignment Mechanism of Large Speech Language Models(https://arxiv.org/abs/2510.12116)
Keywords: language model
Abstract: End-to-end Large Speech Language Models (LSLMs) have demonstrated impressive conversational generation abilities, yet consistently fall short of traditional pipeline systems on semantic understanding benchmarks. In this work, we reveal through systematic experimentation that although LSLMs lose some text input performance after speech-text alignment training, the performance gap between speech and text inputs is more pronounced, which we refer to as the modality gap. To understand this gap, we analyze both coarse- and fine-grained text and speech representations. At the coarse-grained level, representations of speech and text in deeper layers are found to be increasingly aligned in direction (cosine similarity), while concurrently diverging in magnitude (Euclidean distance). We further find that representation similarity is strongly correlated with the modality gap. At the fine-grained level, a spontaneous token-level alignment pattern between text and speech representations is observed. Based on this, we introduce the Alignment Path Score to quantify token-level alignment quality, which exhibits stronger correlation with the modality gap. Building on these insights, we design targeted interventions on critical tokens through angle projection and length normalization. These strategies demonstrate the potential to improve correctness for speech inputs. Our study provides the first systematic empirical analysis of the modality gap and alignment mechanisms in LSLMs, offering both theoretical and methodological guidance for future optimization.
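Note: the coarse-grained finding in this abstract, representations aligning in direction while diverging in magnitude, can look paradoxical at first. The toy sketch below (vectors made up for illustration, not data from the paper) shows that cosine similarity and Euclidean distance measure different things, so both trends can hold at once.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Direction agreement: 1.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a: list[float], b: list[float]) -> float:
    """Straight-line distance: sensitive to differences in vector length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical "text" and "speech" representations: same direction,
# but the speech vector has three times the norm.
text_vec = [1.0, 2.0, 2.0]
speech_vec = [3.0, 6.0, 6.0]

cos = cosine_similarity(text_vec, speech_vec)
dist = euclidean_distance(text_vec, speech_vec)
print(cos, dist)
```

This is also why an intervention such as length normalization, mentioned in the abstract, could plausibly reduce the gap: it removes the magnitude difference while leaving direction untouched.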

Title: SafeMT: Multi-turn Safety for Multimodal Language Models

Authors: Han Zhu, Juntao Dai, Jiaming Ji, Haoran Li, Chengkun Cai, Pengcheng Wen, Chi-Min Chan, Boyuan Chen, Yaodong Yang, Sirui Han, Yike Guo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.12133
Pdf URL: https://arxiv.org/pdf/2510.12133
Copy Paste: [[2510.12133]] SafeMT: Multi-turn Safety for Multimodal Language Models(https://arxiv.org/abs/2510.12133)
Keywords: language model, llm, prompt
Abstract: With the widespread use of multi-modal large language models (MLLMs), safety issues have become a growing concern. Multi-turn dialogues, which are more common in everyday interactions, pose a greater risk than single prompts; however, existing benchmarks do not adequately consider this situation. To encourage the community to focus on the safety issues of these models in multi-turn dialogues, we introduce SafeMT, a benchmark that features dialogues of varying lengths generated from harmful queries accompanied by images. This benchmark consists of 10,000 samples in total, encompassing 17 different scenarios and four jailbreak methods. Additionally, we propose the Safety Index (SI) to evaluate the general safety of MLLMs during conversations. We assess the safety of 17 models using this benchmark and discover that the risk of successful attacks on these models increases as the number of turns in harmful dialogues rises. This observation indicates that the safety mechanisms of these models are inadequate for recognizing the hazard in dialogue interactions. We propose a dialogue safety moderator capable of detecting malicious intent concealed within conversations and providing MLLMs with relevant safety policies. Experimental results from several open-source models indicate that this moderator is more effective in reducing multi-turn ASR than existing guard models.

Title: Credal Transformer: A Principled Approach for Quantifying and Mitigating Hallucinations in Large Language Models

Authors: Shihao Ji, Zihui Song, Jiajie Huang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.12137
Pdf URL: https://arxiv.org/pdf/2510.12137
Copy Paste: [[2510.12137]] Credal Transformer: A Principled Approach for Quantifying and Mitigating Hallucinations in Large Language Models(https://arxiv.org/abs/2510.12137)
Keywords: language model, llm, hallucination
Abstract: Large Language Models (LLMs) hallucinate, generating factually incorrect yet confident assertions. We argue this stems from the Transformer's Softmax function, which creates "Artificial Certainty" by collapsing ambiguous attention scores into a single probability distribution, discarding uncertainty information at each layer. To fix this, we introduce the Credal Transformer, which replaces standard attention with a Credal Attention Mechanism (CAM) based on evidential theory. CAM produces a "credal set" (a set of distributions) instead of a single attention vector, with the set's size directly measuring model uncertainty. We implement this by re-conceptualizing attention scores as evidence masses for a Dirichlet distribution: sufficient evidence recovers standard attention, while insufficient evidence yields a diffuse distribution, representing ambiguity. Empirically, the Credal Transformer identifies out-of-distribution inputs, quantifies ambiguity, and significantly reduces confident errors on unanswerable questions by abstaining. Our contribution is a new architecture to mitigate hallucinations and a design paradigm that integrates uncertainty quantification directly into the model, providing a foundation for more reliable AI.
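Note: the abstract describes treating attention scores as evidence for a Dirichlet distribution but gives no formulas. The sketch below uses a common convention from evidential learning as an assumption (evidence = exp(score), alpha = evidence + 1, uncertainty = K / sum(alpha)); it is not the paper's Credal Attention Mechanism, and the function name is hypothetical. It only illustrates the qualitative behavior the abstract states: strong evidence gives a peaked, softmax-like distribution, while weak evidence gives a diffuse one with high uncertainty.

```python
import math


def dirichlet_attention(scores: list[float]) -> tuple[list[float], float]:
    """Map raw scores to expected attention weights plus an uncertainty value.

    Assumed convention (not from the paper):
      evidence_k = exp(score_k), alpha_k = evidence_k + 1,
      expected weight = alpha_k / sum(alpha), uncertainty = K / sum(alpha).
    """
    evidence = [math.exp(s) for s in scores]
    alpha = [e + 1.0 for e in evidence]
    total = sum(alpha)
    expected = [a / total for a in alpha]
    uncertainty = len(scores) / total  # near 1 when evidence is near zero
    return expected, uncertainty


# Strong evidence for the first position: peaked weights, low uncertainty.
strong, u_strong = dirichlet_attention([8.0, 1.0, 0.0])
# Almost no evidence anywhere: near-uniform weights, high uncertainty.
weak, u_weak = dirichlet_attention([-6.0, -6.0, -6.0])

print(strong, u_strong)
print(weak, u_weak)
```

Plain softmax would return exactly uniform weights for the second input with no signal that the scores carried little evidence; the extra uncertainty value is what makes abstention possible.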

Title: A Survey on Parallel Reasoning

Authors: Ziqi Wang, Boye Niu, Zipeng Gao, Zhi Zheng, Tong Xu, Linghui Meng, Zhongli Li, Jing Liu, Yilong Chen, Chen Zhu, Hua Wu, Haifeng Wang, Enhong Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.12164
Pdf URL: https://arxiv.org/pdf/2510.12164
Copy Paste: [[2510.12164]] A Survey on Parallel Reasoning(https://arxiv.org/abs/2510.12164)
Keywords: language model, llm, chain-of-thought
Abstract: With the increasing capabilities of Large Language Models (LLMs), parallel reasoning has emerged as a new inference paradigm that enhances reasoning robustness by concurrently exploring multiple lines of thought before converging on a final answer. It has become a significant trend to explore parallel reasoning to overcome the fragility of standard sequential methods and improve practical performance. In this paper, we aim to survey and summarize the progress and challenges of parallel reasoning. We first present a formal definition of parallel reasoning and clarify its distinction from related concepts like Chain-of-Thought. Then, we organize and discuss advanced techniques based on a novel taxonomy, including non-interactive reasoning, interactive reasoning, and efficiency-focused decoding strategies. Additionally, we explore various application scenarios, such as solving complex problems and enhancing the reliability of LLM this http URL, we highlight the core challenges of parallel reasoning and suggest potential directions for future research. We hope that our work can provide a useful roadmap for beginners and encourage more research on improving parallel reasoning methods. Related resources are available at this https URL.

Title: Towards Inference-time Scaling for Continuous Space Reasoning

Authors: Minghan Wang, Thuy-Trang Vu, Ehsan Shareghi, Gholamreza Haffari
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.12167
Pdf URL: https://arxiv.org/pdf/2510.12167
Copy Paste: [[2510.12167]] Towards Inference-time Scaling for Continuous Space Reasoning(https://arxiv.org/abs/2510.12167)
Keywords: language model
Abstract: Inference-time scaling through multiple sample generation in combination with Process- or Outcome-Reward Model (PRM or ORM) re-ranking has proven effective for text-based reasoning in large language models. This paper investigates whether such established techniques can be successfully adapted to reasoning in the continuous space, using the COCONUT (Hao et al. 2024) continuous space reasoning LM as the backbone. We demonstrate the feasibility of generating diverse reasoning paths through dropout-based sampling. Our Pass@N analysis on the generated samples reveals the potential for a significant gain in performance akin to the gain observed in the discrete space. However, we highlight unique challenges faced in materializing this gain in the continuous thought space. In particular, working recipes for data generation and for training PRM and ORM models in the discrete space unlock only marginal improvements in the continuous space. Through probing various aspects including geometric properties and trajectory dynamics, we identify the underlying reasons that prevent effective discrimination between correct and incorrect reasoning (essential for the functioning of PRM and ORM). Our findings reveal that current limitations stem from the absence of key inductive biases in continuous thought representations. We argue that training frameworks for continuous reasoning LMs must not only optimize for accuracy but also explicitly incorporate inductive biases that can be utilized at inference time to discriminate between correct and incorrect thoughts. Our code and data will be publicly available.
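Note: the abstract relies on a Pass@N analysis without defining it. A common reading, assumed here, is the fraction of problems for which at least one of the first N sampled answers is correct. The sketch below computes that quantity on made-up correctness flags; the function name and data are hypothetical and the paper's exact protocol may differ.

```python
def pass_at_n(correct_flags: list[list[bool]], n: int) -> float:
    """Fraction of problems solved by at least one of the first n samples.

    correct_flags[i][j] is True if sample j for problem i is correct.
    """
    solved = sum(1 for flags in correct_flags if any(flags[:n]))
    return solved / len(correct_flags)


# Hypothetical results: 4 problems, 3 sampled reasoning paths each.
results = [
    [False, True, False],
    [False, False, False],
    [True, True, False],
    [False, False, True],
]

p1 = pass_at_n(results, 1)
p3 = pass_at_n(results, 3)
print(p1, p3)
```

The gap between Pass@1 and Pass@N is an upper bound on what a perfect re-ranker could recover, which is why the abstract treats a large Pass@N as "potential" that PRM or ORM re-ranking still has to realize.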

Title: From Knowledge to Treatment: Large Language Model Assisted Biomedical Concept Representation for Drug Repurposing

Authors: Chengrui Xiang, Tengfei Ma, Xiangzheng Fu, Yiping Liu, Bosheng Song, Xiangxiang Zeng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.12181
Pdf URL: https://arxiv.org/pdf/2510.12181
Copy Paste: [[2510.12181]] From Knowledge to Treatment: Large Language Model Assisted Biomedical Concept Representation for Drug Repurposing(https://arxiv.org/abs/2510.12181)
Keywords: language model, llm
Abstract: Drug repurposing plays a critical role in accelerating treatment discovery, especially for complex and rare diseases. Biomedical knowledge graphs (KGs), which encode rich clinical associations, have been widely adopted to support this task. However, existing methods largely overlook common-sense biomedical concept knowledge in real-world labs, such as mechanistic priors indicating that certain drugs are fundamentally incompatible with specific treatments. To address this gap, we propose LLaDR, a Large Language Model-assisted framework for Drug Repurposing, which improves the representation of biomedical concepts within KGs. Specifically, we extract semantically enriched treatment-related textual representations of biomedical entities from large language models (LLMs) and use them to fine-tune knowledge graph embedding (KGE) models. By injecting treatment-relevant knowledge into KGE, LLaDR largely improves the representation of biomedical concepts, enhancing semantic understanding of under-studied or complex indications. Experiments based on benchmarks demonstrate that LLaDR achieves state-of-the-art performance across different scenarios, with case studies on Alzheimer's disease further confirming its robustness and effectiveness. Code is available at this https URL.

Title: Not in Sync: Unveiling Temporal Bias in Audio Chat Models

Authors: Jiayu Yao, Shenghua Liu, Yiwei Wang, Rundong Cheng, Lingrui Mei, Baolong Bi, Zhen Xiong, Xueqi Cheng
Subjects: cs.CL, cs.SD
Abstract URL: https://arxiv.org/abs/2510.12185
Pdf URL: https://arxiv.org/pdf/2510.12185
Copy Paste: [[2510.12185]] Not in Sync: Unveiling Temporal Bias in Audio Chat Models(https://arxiv.org/abs/2510.12185)
Keywords: language model, chat
Abstract: Large Audio Language Models (LALMs) are increasingly applied to audio understanding and multimodal reasoning, yet their ability to locate when events occur remains underexplored. We present the first systematic study of temporal bias in LALMs, revealing a key limitation in their timestamp prediction. For example, when asked "At which second does the lecturer introduce the key formula?", models often predict timestamps that are consistently earlier or later than the ground truth. Through controlled experiments on timestamped datasets, we find that temporal bias (i) is prevalent across datasets and models, (ii) increases with audio length, even accumulating to tens of seconds in extended recordings, and (iii) varies across event types and positions. We quantify this effect with the Temporal Bias Index (TBI), measuring systematic misalignment in predicted event timings, and complement it with a visualization framework. Our findings highlight a fundamental limitation in current LALMs and call for the development of temporally robust architectures.

Title: DPO-Tuned Large Language Models for Segmentation in Simultaneous Speech Translation

Authors: Zeyu Yang, Satoshi Nakamura
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.12195
Pdf URL: https://arxiv.org/pdf/2510.12195
Copy Paste: [[2510.12195]] DPO-Tuned Large Language Models for Segmentation in Simultaneous Speech Translation(https://arxiv.org/abs/2510.12195)
Keywords: language model, llm
Abstract: Simultaneous speech translation requires accurate segmentation to balance translation quality and latency. Recent studies such as SHAS have introduced pretrained segmentation models, achieving stronger performance than heuristic rules. However, such models are still constrained by supervised learning objectives and do not incorporate human preference alignment, which is crucial for natural real-time interpretation. In this work, we propose a segmentation framework based on large language models (LLMs) trained with Direct Preference Optimization (DPO). By leveraging preference alignment, our method enables LLMs to predict natural segmentation points that better meet the demands of real-time translation. We evaluate the system on the ACL 60/60 corpus across three language pairs (English paired with Japanese, Chinese, and German), using SeamlessM4T v2 as the translation backbone. Experimental results show that our DPO-tuned LLM achieves higher segmentation accuracy than SHAS and yields consistent improvements in translation quality (BLEU, COMET) as well as latency (Average Lagging). Furthermore, our system benefits from IWSLT baselines for direct comparison. These findings highlight the potential of preference-tuned LLMs to surpass existing pretrained segmentation models and advance adaptive, human-aligned simultaneous interpretation.

Title: HALF: Harm-Aware LLM Fairness Evaluation Aligned with Deployment

Authors: Ali Mekky, Omar El Herraoui, Preslav Nakov, Yuxia Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.12217
Pdf URL: https://arxiv.org/pdf/2510.12217
Copy Paste: [[2510.12217]] HALF: Harm-Aware LLM Fairness Evaluation Aligned with Deployment(https://arxiv.org/abs/2510.12217)
Keywords: language model, llm
Abstract: Large language models (LLMs) are increasingly deployed across high-impact domains, from clinical decision support and legal analysis to hiring and education, making fairness and bias evaluation before deployment critical. However, existing evaluations lack grounding in real-world scenarios and do not account for differences in harm severity, e.g., a biased decision in surgery should not be weighed the same as a stylistic bias in text summarization. To address this gap, we introduce HALF (Harm-Aware LLM Fairness), a deployment-aligned framework that assesses model bias in realistic applications and weighs the outcomes by harm severity. HALF organizes nine application domains into three tiers (Severe, Moderate, Mild) using a five-stage pipeline. Our evaluation results across eight LLMs show that (1) LLMs are not consistently fair across domains, (2) neither model size nor performance guarantees fairness, and (3) reasoning models perform better in medical decision support but worse in education. We conclude that HALF exposes a clear gap between previous benchmarking success and deployment readiness.

Title: Analysing Moral Bias in Finetuned LLMs through Mechanistic Interpretability

Authors: Bianca Raimondi, Daniela Dalbagno, Maurizio Gabbrielli
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.12229
Pdf URL: https://arxiv.org/pdf/2510.12229
Copy Paste: [[2510.12229]] Analysing Moral Bias in Finetuned LLMs through Mechanistic Interpretability(https://arxiv.org/abs/2510.12229)
Keywords: language model, llm
Abstract: Large language models (LLMs) have been shown to internalize human-like biases during finetuning, yet the mechanisms by which these biases manifest remain unclear. In this work, we investigated whether the well-known Knobe effect, a moral bias in intentionality judgements, emerges in finetuned LLMs and whether it can be traced back to specific components of the model. We conducted a Layer-Patching analysis across 3 open-weights LLMs and demonstrated that the bias is not only learned during finetuning but also localized in a specific set of layers. Surprisingly, we found that patching activations from the corresponding pretrained model into just a few critical layers is sufficient to eliminate the effect. Our findings offer new evidence that social biases in LLMs can be interpreted, localized, and mitigated through targeted interventions, without the need for model retraining.

Title: DSAS: A Universal Plug-and-Play Framework for Attention Optimization in Multi-Document Question Answering

Authors: Jiakai Li, Rongzheng Wang, Yizhuo Ma, Shuang Liang, Guangchun Luo, Ke Qin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.12251
Pdf URL: https://arxiv.org/pdf/2510.12251
Copy Paste: [[2510.12251]] DSAS: A Universal Plug-and-Play Framework for Attention Optimization in Multi-Document Question Answering(https://arxiv.org/abs/2510.12251)
Keywords: language model, llm
Abstract: While large language models (LLMs) show considerable promise across various fields, they have notable limitations in handling multi-document question answering (Multi-doc QA) tasks. The first challenge is long-range dependency modeling, where LLMs struggle to focus on key information in long texts, which weakens important semantic connections. Second, most LLMs suffer from the "lost-in-the-middle" issue, where they have difficulty processing information in the middle of long inputs. Current solutions either truncate global dependencies or demand costly finetuning, ultimately lacking a universal and simple solution for these challenges. To resolve these limitations, we propose Dual-Stage Adaptive Sharpening (DSAS) containing two modules. (i) The Contextual Gate Weighting (CGW) module alleviates "lost-in-the-middle" by assessing paragraph relevance through layer-wise attention tracking and position-aware weighting. (ii) The Reciprocal Attention Suppression (RAS) module enhances focus on critical paragraphs by suppressing information exchange between key and irrelevant texts, thus mitigating the limitations in long-range dependency modeling. Notably, DSAS functions as a plug-and-play solution requiring no architectural modifications or extra training parameters. Extensive experiments on four benchmarks demonstrate DSAS's efficacy across mainstream LLMs (Llama, Qwen, Mistral, and Deepseek), with an average F1-score improvement of 4.2% in Multi-doc QA tasks on Llama-3.1-8B-Instruct and Qwen2.5-14B-Instruct. Ablation studies confirm the essential contributions of both the CGW and RAS modules. In addition, detailed discussions in the Appendix further validate the robustness and scalability of DSAS.

Title: Shallow Robustness, Deep Vulnerabilities: Multi-Turn Evaluation of Medical LLMs

Authors: Blazej Manczak, Eric Lin, Francisco Eiras, James O'Neill, Vaikkunth Mugunthan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.12255
Pdf URL: https://arxiv.org/pdf/2510.12255
Copy Paste: [[2510.12255]] Shallow Robustness, Deep Vulnerabilities: Multi-Turn Evaluation of Medical LLMs(https://arxiv.org/abs/2510.12255)
Keywords: language model, llm
Abstract: Large language models (LLMs) are rapidly transitioning into medical clinical use, yet their reliability under realistic, multi-turn interactions remains poorly understood. Existing evaluation frameworks typically assess single-turn question answering under idealized conditions, overlooking the complexities of medical consultations where conflicting input, misleading context, and authority influence are common. We introduce MedQA-Followup, a framework for systematically evaluating multi-turn robustness in medical question answering. Our approach distinguishes between shallow robustness (resisting misleading initial context) and deep robustness (maintaining accuracy when answers are challenged across turns), while also introducing an indirect-direct axis that separates contextual framing (indirect) from explicit suggestion (direct). Using controlled interventions on the MedQA dataset, we evaluate five state-of-the-art LLMs and find that while models perform reasonably well under shallow perturbations, they exhibit severe vulnerabilities in multi-turn settings, with accuracy dropping from 91.2% to as low as 13.5% for Claude Sonnet 4. Counterintuitively, indirect, context-based interventions are often more harmful than direct suggestions, yielding larger accuracy drops across models and exposing a significant vulnerability for clinical deployment. Further compounding analyses reveal model differences, with some models showing additional performance drops under repeated interventions while others partially recover or even improve. These findings highlight multi-turn robustness as a critical but underexplored dimension for safe and reliable deployment of medical LLMs.

Title: A large-scale, unsupervised pipeline for automatic corpus annotation using LLMs: variation and change in the English consider construction

Authors: Cameron Morin, Matti Marttinen Larsson
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.12306
Pdf URL: https://arxiv.org/pdf/2510.12306
Copy Paste: [[2510.12306]] A large-scale, unsupervised pipeline for automatic corpus annotation using LLMs: variation and change in the English consider construction(https://arxiv.org/abs/2510.12306)
Keywords: language model, gpt, llm, prompt
Abstract: As natural language corpora expand at an unprecedented rate, manual annotation remains a significant methodological bottleneck in corpus linguistic work. We address this challenge by presenting a scalable, unsupervised pipeline for automating grammatical annotation in voluminous corpora using large language models (LLMs). Unlike previous supervised and iterative approaches, our method employs a four-phase workflow: prompt engineering, pre-hoc evaluation, automated batch processing, and post-hoc validation. We demonstrate the pipeline's accessibility and effectiveness through a diachronic case study of variation in the English consider construction. Using GPT-5 through the OpenAI API, we annotate 143,933 sentences from the Corpus of Historical American English (COHA) in under 60 hours, achieving 98%+ accuracy on two sophisticated annotation procedures. Our results suggest that LLMs can perform a range of data preparation tasks at scale with minimal human intervention, opening new possibilities for corpus-based research, though implementation requires attention to costs, licensing, and other ethical considerations.

Title: Beating Harmful Stereotypes Through Facts: RAG-based Counter-speech Generation

Authors: Greta Damo, Elena Cabrio, Serena Villata
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.12316
Pdf URL: https://arxiv.org/pdf/2510.12316
Copy Paste: [[2510.12316]] Beating Harmful Stereotypes Through Facts: RAG-based Counter-speech Generation(https://arxiv.org/abs/2510.12316)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Counter-speech generation is at the core of many expert activities, such as fact-checking and hate speech, to counter harmful content. Yet, existing work treats counter-speech generation as a pure text generation task, relying mainly on Large Language Models or NGO experts. These approaches show severe drawbacks due to the limited reliability and coherence of the generated countering text, and limited scalability, respectively. To close this gap, we introduce a novel framework to model counter-speech generation as a knowledge-wise text generation process. Our framework integrates advanced Retrieval-Augmented Generation (RAG) pipelines to ensure the generation of trustworthy counter-speech for 8 main target groups identified in the hate speech literature, including women, people of colour, persons with disabilities, migrants, Muslims, Jews, LGBT persons, and others. We built a knowledge base over the United Nations Digital Library, EUR-Lex and the EU Agency for Fundamental Rights, comprising a total of 32,792 texts. We use the MultiTarget-CONAN dataset to empirically assess the quality of the generated counter-speech, both through standard metrics (i.e., JudgeLM) and a human evaluation. Results show that our framework outperforms standard LLM baselines and competitive approaches on both assessments. The resulting framework and the knowledge base pave the way for studying trustworthy and sound counter-speech generation, in hate speech and beyond.

Title: Fine-grained Analysis of Brain-LLM Alignment through Input Attribution

Authors: Michela Proietti, Roberto Capobianco, Mariya Toneva
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.12355
Pdf URL: https://arxiv.org/pdf/2510.12355
Copy Paste: [[2510.12355]] Fine-grained Analysis of Brain-LLM Alignment through Input Attribution(https://arxiv.org/abs/2510.12355)
Keywords: language model, llm
Abstract: Understanding the alignment between large language models (LLMs) and human brain activity can reveal computational principles underlying language processing. We introduce a fine-grained input attribution method to identify the specific words most important for brain-LLM alignment, and leverage it to study a contentious research question about brain-LLM alignment: the relationship between brain alignment (BA) and next-word prediction (NWP). Our findings reveal that BA and NWP rely on largely distinct word subsets: NWP exhibits recency and primacy biases with a focus on syntax, while BA prioritizes semantic and discourse-level information with a more targeted recency effect. This work advances our understanding of how LLMs relate to human language processing and highlights differences in feature reliance between BA and NWP. Beyond this study, our attribution method can be broadly applied to explore the cognitive relevance of model predictions in diverse language processing tasks.

Title: LLM-REVal: Can We Trust LLM Reviewers Yet?

Authors: Rui Li, Jia-Chen Gu, Po-Nien Kung, Heming Xia, Junfeng Liu, Xiangwen Kong, Zhifang Sui, Nanyun Peng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.12367
Pdf URL: https://arxiv.org/pdf/2510.12367
Copy Paste: [[2510.12367]] LLM-REVal: Can We Trust LLM Reviewers Yet?(https://arxiv.org/abs/2510.12367)
Keywords: language model, llm, agent
Abstract: The rapid advancement of large language models (LLMs) has inspired researchers to integrate them extensively into the academic workflow, potentially reshaping how research is practiced and reviewed. While previous studies highlight the potential of LLMs in supporting research and peer review, their dual roles in the academic workflow and the complex interplay between research and review bring new risks that remain largely underexplored. In this study, we focus on how the deep integration of LLMs into both peer-review and research processes may influence scholarly fairness, examining the potential risks of using LLMs as reviewers by simulation. This simulation incorporates a research agent, which generates and revises papers, alongside a review agent, which assesses the submissions. Based on the simulation results, we conduct human annotations and identify pronounced misalignment between LLM-based reviews and human judgments: (1) LLM reviewers systematically inflate scores for LLM-authored papers, assigning them markedly higher scores than human-authored ones; (2) LLM reviewers persistently underrate human-authored papers with critical statements (e.g., risk, fairness), even after multiple revisions. Our analysis reveals that these stem from two primary biases in LLM reviewers: a linguistic feature bias favoring LLM-generated writing styles, and an aversion toward critical statements. These results highlight the risks and equity concerns posed to human authors and academic research if LLMs are deployed in the peer review cycle without adequate caution. On the other hand, revisions guided by LLM reviews yield quality gains in both LLM-based and human evaluations, illustrating the potential of LLMs-as-reviewers to support early-stage researchers and improve low-quality papers.

Title: Tokenization Disparities as Infrastructure Bias: How Subword Systems Create Inequities in LLM Access and Efficiency

Authors: Hailay Kidu Teklehaymanot, Wolfgang Nejdl
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.12389
Pdf URL: https://arxiv.org/pdf/2510.12389
Copy Paste: [[2510.12389]] Tokenization Disparities as Infrastructure Bias: How Subword Systems Create Inequities in LLM Access and Efficiency(https://arxiv.org/abs/2510.12389)
Keywords: language model, llm
Abstract: Tokenization disparities pose a significant barrier to achieving equitable access to artificial intelligence across linguistically diverse populations. This study conducts a large-scale cross-linguistic evaluation of tokenization efficiency in over 200 languages to systematically quantify computational inequities in large language models (LLMs). Using a standardized experimental framework, we applied consistent preprocessing and normalization protocols, followed by uniform tokenization through the tiktoken library across all language samples. Comprehensive tokenization statistics were collected using established evaluation metrics, including Tokens Per Sentence (TPS) and Relative Tokenization Cost (RTC), benchmarked against English baselines. Our cross-linguistic analysis reveals substantial and systematic disparities: Latin-script languages consistently exhibit higher tokenization efficiency, while non-Latin and morphologically complex languages incur significantly greater token inflation, with RTC ratios often 3-5 times higher. These inefficiencies translate into increased computational costs and reduced effective context utilization for underrepresented languages. Overall, the findings highlight structural inequities in current AI systems, where speakers of low-resource and non-Latin languages face disproportionate computational disadvantages. Future research should prioritize the development of linguistically informed tokenization strategies and adaptive vocabulary construction methods that incorporate typological diversity, ensuring more inclusive and computationally equitable multilingual AI systems.

Title: PRoH: Dynamic Planning and Reasoning over Knowledge Hypergraphs for Retrieval-Augmented Generation

Authors: Xiangjun Zai, Xingyu Tan, Xiaoyang Wang, Qing Liu, Xiwei Xu, Wenjie Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.12434
Pdf URL: https://arxiv.org/pdf/2510.12434
Copy Paste: [[2510.12434]] PRoH: Dynamic Planning and Reasoning over Knowledge Hypergraphs for Retrieval-Augmented Generation(https://arxiv.org/abs/2510.12434)
Keywords: retrieval-augmented generation
Abstract: Knowledge Hypergraphs (KHs) have recently emerged as a knowledge representation for retrieval-augmented generation (RAG), offering a paradigm to model multi-entity relations into a structured form. However, existing KH-based RAG methods suffer from three major limitations: static retrieval planning, non-adaptive retrieval execution, and superficial use of KH structure and semantics, which constrain their ability to perform effective multi-hop question answering. To overcome these limitations, we propose PRoH, a dynamic Planning and Reasoning over Knowledge Hypergraphs framework. PRoH incorporates three core innovations: (i) a context-aware planning module that sketches the local KH neighborhood to guide structurally grounded reasoning plan generation; (ii) a structured question decomposition process that organizes subquestions as a dynamically evolving Directed Acyclic Graph (DAG) to enable adaptive, multi-trajectory exploration; and (iii) an Entity-Weighted Overlap (EWO)-guided reasoning path retrieval algorithm that prioritizes semantically coherent hyperedge traversals. Experiments across multiple domains demonstrate that PRoH achieves state-of-the-art performance, surpassing the prior SOTA model HyperGraphRAG by an average of 19.73% in F1 and 8.41% in Generation Evaluation (G-E) score, while maintaining strong robustness in long-range multi-hop reasoning tasks.

Title: Probing Latent Knowledge Conflict for Faithful Retrieval-Augmented Generation

Authors: Linfeng Gao, Baolong Bi, Zheng Yuan, Le Wang, Zerui Chen, Zhimin Wei, Shenghua Liu, Qinggang Zhang, Jinsong Su
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.12460
Pdf URL: https://arxiv.org/pdf/2510.12460
Copy Paste: [[2510.12460]] Probing Latent Knowledge Conflict for Faithful Retrieval-Augmented Generation(https://arxiv.org/abs/2510.12460)
Keywords: language model, llm, prompt, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) has emerged as a powerful paradigm to enhance the factuality of Large Language Models (LLMs). However, existing RAG systems often suffer from an unfaithfulness issue, where the model's response contradicts evidence from the retrieved context. Existing approaches to improving contextual faithfulness largely rely on external interventions, such as prompt engineering, decoding constraints, or reward-based fine-tuning. These works treat the LLM as a black box and overlook a crucial question: how does the LLM internally integrate retrieved evidence with its parametric memory, particularly under knowledge conflicts? To address this gap, we conduct a probing-based analysis of hidden-state representations in LLMs and observe three findings: knowledge integration occurs hierarchically, conflicts manifest as latent signals at the sentence level, and irrelevant context is often amplified when aligned with parametric knowledge. Building on these findings, we propose CLEAR (Conflict-Localized and Enhanced Attention for RAG), a framework that (i) decomposes context into fine-grained sentence-level knowledge, (ii) employs hidden-state probing to localize conflicting knowledge, and (iii) introduces conflict-aware fine-tuning to guide the model to accurately integrate retrieved evidence. Extensive experiments across three benchmarks demonstrate that CLEAR substantially improves both accuracy and contextual faithfulness, consistently outperforming strong baselines under diverse conflict conditions. The related resources are available at this https URL.

Title: Resource-sensitive but language-blind: Community size and not grammatical complexity better predicts the accuracy of Large Language Models in a novel Wug Test

Authors: Nikoleta Pantelidou, Evelina Leivada, Paolo Morosi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.12463
Pdf URL: https://arxiv.org/pdf/2510.12463
Copy Paste: [[2510.12463]] Resource-sensitive but language-blind: Community size and not grammatical complexity better predicts the accuracy of Large Language Models in a novel Wug Test(https://arxiv.org/abs/2510.12463)
Keywords: language model
Abstract: The linguistic abilities of Large Language Models are a matter of ongoing debate. This study contributes to this discussion by investigating model performance in a morphological generalization task that involves novel words. Using a multilingual adaptation of the Wug Test, six models were tested across four partially unrelated languages (Catalan, English, Greek, and Spanish) and compared with human speakers. The aim is to determine whether model accuracy approximates human competence and whether it is shaped primarily by linguistic complexity or by the quantity of available training data. Consistent with previous research, the results show that the models are able to generalize morphological processes to unseen words with human-like accuracy. However, accuracy patterns align more closely with community size and data availability than with structural complexity, refining earlier claims in the literature. In particular, languages with larger speaker communities and stronger digital representation, such as Spanish and English, revealed higher accuracy than less-resourced ones like Catalan and Greek. Overall, our findings suggest that model behavior is mainly driven by the richness of linguistic resources rather than by sensitivity to grammatical complexity, reflecting a form of performance that resembles human linguistic competence only superficially.

Title: SMEC: Rethinking Matryoshka Representation Learning for Retrieval Embedding Compression

Authors: Biao Zhang, Lixin Chen, Tong Liu, Bo Zheng
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2510.12474
Pdf URL: https://arxiv.org/pdf/2510.12474
Copy Paste: [[2510.12474]] SMEC: Rethinking Matryoshka Representation Learning for Retrieval Embedding Compression(https://arxiv.org/abs/2510.12474)
Keywords: language model, llm
Abstract: Large language models (LLMs) generate high-dimensional embeddings that capture rich semantic and syntactic information. However, high-dimensional embeddings exacerbate computational complexity and storage requirements, thereby hindering practical deployment. To address these challenges, we propose a novel training framework named Sequential Matryoshka Embedding Compression (SMEC). This framework introduces the Sequential Matryoshka Representation Learning (SMRL) method to mitigate gradient variance during training, the Adaptive Dimension Selection (ADS) module to reduce information degradation during dimension pruning, and the Selectable Cross-batch Memory (S-XBM) module to enhance unsupervised learning between high- and low-dimensional embeddings. Experiments on image, text, and multimodal datasets demonstrate that SMEC achieves significant dimensionality reduction while maintaining performance. For instance, on the BEIR dataset, our approach improves the performance of compressed LLM2Vec embeddings (256 dimensions) by 1.1 points and 2.7 points compared to the Matryoshka-Adaptor and Search-Adaptor models, respectively.
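As background for the abstract above, here is a minimal sketch of plain Matryoshka-style truncation: keep the first k dimensions of an embedding and re-normalize, so a shorter prefix can stand in for the full vector. This is not SMEC itself; the abstract's ADS module targets the information lost during this kind of pruning. The function names and toy vectors are assumptions made for illustration.

```python
import math

def truncate_and_normalize(vec, dim):
    """Keep the first `dim` coordinates and rescale to unit length."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm > 0 else head

def cosine_unit(a, b):
    """Cosine similarity of two vectors that are already unit length."""
    return sum(x * y for x, y in zip(a, b))

# Toy 4-dimensional embeddings (illustrative values, not real model output)
query = [3.0, 4.0, 0.0, 0.0]
doc = [3.0, 4.0, 1.0, 2.0]

full_sim = cosine_unit(truncate_and_normalize(query, 4),
                       truncate_and_normalize(doc, 4))
short_sim = cosine_unit(truncate_and_normalize(query, 2),
                        truncate_and_normalize(doc, 2))
```

Comparing `full_sim` with `short_sim` shows how similarity scores can shift once trailing dimensions are dropped, which is the degradation that compression methods try to limit.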

Title: When Personalization Tricks Detectors: The Feature-Inversion Trap in Machine-Generated Text Detection

Authors: Lang Gao, Xuhui Li, Chenxi Wang, Mingzhe Li, Wei Liu, Zirui Song, Jinghui Zhang, Rui Yan, Preslav Nakov, Xiuying Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.12476
Pdf URL: https://arxiv.org/pdf/2510.12476
Copy Paste: [[2510.12476]] When Personalization Tricks Detectors: The Feature-Inversion Trap in Machine-Generated Text Detection(https://arxiv.org/abs/2510.12476)
Keywords: language model, llm
Abstract: Large language models (LLMs) have grown more powerful in language generation, producing fluent text and even imitating personal style. Yet, this ability also heightens the risk of identity impersonation. To the best of our knowledge, no prior work has examined personalized machine-generated text (MGT) detection. In this paper, we introduce \dataset, the first benchmark for evaluating detector robustness in personalized settings, built from literary and blog texts paired with their LLM-generated imitations. Our experimental results demonstrate large performance gaps across detectors in personalized settings: some state-of-the-art models suffer significant drops. We attribute this limitation to the feature-inversion trap, where features that are discriminative in general domains become inverted and misleading when applied to personalized text. Based on this finding, we propose \method, a simple and reliable way to predict detector performance changes in personalized settings. \method identifies latent directions corresponding to inverted features and constructs probe datasets that differ primarily along these features to evaluate detector dependence. Our experiments show that \method can accurately predict both the direction and the magnitude of post-transfer changes, showing 85% correlation with the actual performance gaps. We hope that this work will encourage further research on personalized text detection.

Title: BoN Appetit Team at LeWiDi-2025: Best-of-N Test-time Scaling Can Not Stomach Annotation Disagreements (Yet)

Authors: Tomas Ruiz, Siyao Peng, Barbara Plank, Carsten Schwemmer
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.12516
Pdf URL: https://arxiv.org/pdf/2510.12516
Copy Paste: [[2510.12516]] BoN Appetit Team at LeWiDi-2025: Best-of-N Test-time Scaling Can Not Stomach Annotation Disagreements (Yet)(https://arxiv.org/abs/2510.12516)
Keywords: llm
Abstract: Test-time scaling is a family of techniques to improve LLM outputs at inference time by performing extra computation. To the best of our knowledge, test-time scaling has been limited to domains with verifiably correct answers, like mathematics and coding. We transfer test-time scaling to the LeWiDi-2025 tasks to evaluate annotation disagreements. We experiment with three test-time scaling methods: two benchmark algorithms (Model Averaging and Majority Voting), and a Best-of-N sampling method. The two benchmark methods improve LLM performance consistently on the LeWiDi tasks, but the Best-of-N method does not. Our experiments suggest that the Best-of-N method does not currently transfer from mathematics to LeWiDi tasks, and we analyze potential reasons for this gap.
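The three aggregation strategies named in the abstract above can be sketched as follows. This is a minimal illustration, assuming each sampled LLM output is a probability distribution over labels; the sample values and the confidence-based scorer are hypothetical and are not taken from the paper.

```python
from collections import Counter

def model_average(distributions):
    """Average several predicted label distributions element-wise."""
    n = len(distributions)
    labels = distributions[0].keys()
    return {label: sum(d[label] for d in distributions) / n for label in labels}

def majority_vote(distributions):
    """Return the label that is ranked first in the most samples."""
    top_labels = [max(d, key=d.get) for d in distributions]
    return Counter(top_labels).most_common(1)[0][0]

def best_of_n(distributions, score):
    """Keep the single sample that the scoring function rates highest."""
    return max(distributions, key=score)

# Three hypothetical samples for one annotation item
samples = [
    {"agree": 0.7, "disagree": 0.3},
    {"agree": 0.6, "disagree": 0.4},
    {"agree": 0.2, "disagree": 0.8},
]

averaged = model_average(samples)
voted = majority_vote(samples)
# Hypothetical scorer: prefer the most confident sample
picked = best_of_n(samples, score=lambda d: max(d.values()))
```

Averaging keeps the spread of opinions in the output distribution, whereas Best-of-N commits to a single sample, which is one plausible reason the two behave differently on disagreement-focused tasks.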

Title: VISaGE: Understanding Visual Generics and Exceptions

Authors: Stella Frank, Emily Allaway
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2510.12548
Pdf URL: https://arxiv.org/pdf/2510.12548
Copy Paste: [[2510.12548]] VISaGE: Understanding Visual Generics and Exceptions(https://arxiv.org/abs/2510.12548)
Keywords: language model
Abstract: While Vision Language Models (VLMs) learn conceptual representations, in the form of generalized knowledge, during training, they are typically used to analyze individual instances. When evaluation instances are atypical, this paradigm results in tension between two priors in the model. The first is a pragmatic prior that the textual and visual input are both relevant, arising from VLM finetuning on congruent inputs; the second is a semantic prior that the conceptual representation is generally true for instances of the category. In order to understand how VLMs trade off these priors, we introduce a new evaluation dataset, VISaGE, consisting of both typical and exceptional images. In carefully balanced experiments, we show that conceptual understanding degrades when the assumption of congruency underlying the pragmatic prior is violated with incongruent images. This effect is stronger than the effect of the semantic prior when querying about individual instances.

Title: Teaching Language Models to Faithfully Express their Uncertainty

Authors: Bryan Eikema, Evgenia Ilia, José G. C. de Souza, Chrysoula Zerva, Wilker Aziz
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.12587
Pdf URL: https://arxiv.org/pdf/2510.12587
Copy Paste: [[2510.12587]] Teaching Language Models to Faithfully Express their Uncertainty(https://arxiv.org/abs/2510.12587)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) often miscommunicate their uncertainty: repeated queries can produce divergent answers, yet generated responses are typically unhedged or hedged in ways that do not reflect this variability. This conveys unfaithful information about the uncertain state of the LLMs' knowledge, creating a faithfulness gap that affects even strong LLMs. We introduce Faithful Uncertainty Tuning (FUT): a fine-tuning approach that teaches instruction-tuned LLMs to express uncertainty faithfully without altering their underlying answer distribution. We construct training data by augmenting model samples with uncertainty hedges (i.e. verbal cues such as 'possibly' or 'likely') aligned with sample consistency, requiring no supervision beyond the model and a set of prompts. We evaluate FUT on open-domain question answering (QA) across multiple models and datasets. Our results show that FUT substantially reduces the faithfulness gap, while preserving QA accuracy and introducing minimal semantic distribution shift. Further analyses demonstrate robustness across decoding strategies, choice of hedgers, and other forms of uncertainty expression (i.e. numerical). These findings establish FUT as a simple and effective way to teach LLMs to communicate uncertainty faithfully.
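The data-construction idea in the abstract above (hedges aligned with sample consistency) might look roughly like this. It is a sketch under stated assumptions, not the authors' pipeline: the thresholds, hedge phrases, and function names are hypothetical, and consistency is approximated by exact-match frequency among the sampled answers.

```python
from collections import Counter

# Hypothetical (threshold, hedge phrase) pairs, highest threshold first
HEDGES = [
    (0.9, "Almost certainly"),
    (0.6, "Likely"),
    (0.3, "Possibly"),
    (0.0, "It is unlikely, but perhaps"),
]

def hedge_for(consistency):
    """Map an agreement rate in [0, 1] to a verbal hedge."""
    for threshold, phrase in HEDGES:
        if consistency >= threshold:
            return phrase
    return HEDGES[-1][1]

def build_training_pairs(prompt, sampled_answers):
    """Prefix each sampled answer with a hedge that reflects its frequency.

    Every sample is kept (no deduplication), so the answer
    distribution of the original samples is left unchanged.
    """
    counts = Counter(sampled_answers)
    total = len(sampled_answers)
    pairs = []
    for answer in sampled_answers:
        consistency = counts[answer] / total
        pairs.append((prompt, f"{hedge_for(consistency)}: {answer}"))
    return pairs

# Five hypothetical samples for one prompt
pairs = build_training_pairs(
    "What is the capital of France?",
    ["Paris", "Paris", "Paris", "Paris", "Lyon"],
)
```

Here the answer given in four of five samples receives a stronger hedge than the one given once, so the wording tracks how often the model actually produces each answer.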

Title: StyleDecipher: Robust and Explainable Detection of LLM-Generated Texts with Stylistic Analysis

Authors: Siyuan Li, Aodu Wulianghai, Xi Lin, Guangyan Li, Xiang Chen, Jun Wu, Jianhua Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.12608
Pdf URL: https://arxiv.org/pdf/2510.12608
Copy Paste: [[2510.12608]] StyleDecipher: Robust and Explainable Detection of LLM-Generated Texts with Stylistic Analysis(https://arxiv.org/abs/2510.12608)
Keywords: language model, llm
Abstract: With the increasing integration of large language models (LLMs) into open-domain writing, detecting machine-generated text has become a critical task for ensuring content authenticity and trust. Existing approaches rely on statistical discrepancies or model-specific heuristics to distinguish between LLM-generated and human-written text. However, these methods struggle in real-world scenarios due to limited generalization, vulnerability to paraphrasing, and lack of explainability, particularly when facing stylistic diversity or hybrid human-AI authorship. In this work, we propose StyleDecipher, a robust and explainable detection framework that revisits LLM-generated text detection using combined feature extractors to quantify stylistic differences. By jointly modeling discrete stylistic indicators and continuous stylistic representations derived from semantic embeddings, StyleDecipher captures distinctive style-level divergences between human and LLM outputs within a unified representation space. This framework enables accurate, explainable, and domain-agnostic detection without requiring access to model internals or labeled segments. Extensive experiments across five diverse domains, including news, code, essays, reviews, and academic abstracts, demonstrate that StyleDecipher consistently achieves state-of-the-art in-domain accuracy. Moreover, in cross-domain evaluations, it surpasses existing baselines by up to 36.30%, while maintaining robustness against adversarial perturbations and mixed human-AI content. Further qualitative and quantitative analysis confirms that stylistic signals provide explainable evidence for distinguishing machine-generated text. Our source code can be accessed at this https URL.

Title: ACADATA: Parallel Dataset of Academic Data for Machine Translation

Authors: Iñaki Lacunza, Javier Garcia Gilabert, Francesca De Luca Fornaciari, Javier Aula-Blasco, Aitor Gonzalez-Agirre, Maite Melero, Marta Villegas
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.12621
Pdf URL: https://arxiv.org/pdf/2510.12621
Copy Paste: [[2510.12621]] ACADATA: Parallel Dataset of Academic Data for Machine Translation(https://arxiv.org/abs/2510.12621)
Keywords: language model, llm
Abstract: We present ACADATA, a high-quality parallel dataset for academic translation that consists of two subsets: ACAD-TRAIN, which contains approximately 1.5 million author-generated paragraph pairs across 96 language directions, and ACAD-BENCH, a curated evaluation set of almost 6,000 translations covering 12 directions. To validate its utility, we fine-tune two Large Language Models (LLMs) on ACAD-TRAIN and benchmark them on ACAD-BENCH against specialized machine-translation systems, general-purpose, open-weight LLMs, and several large-scale proprietary models. Experimental results demonstrate that fine-tuning on ACAD-TRAIN leads to improvements in academic translation quality by +6.1 and +12.4 d-BLEU points on average for 7B and 2B models respectively, while also improving long-context translation in a general domain by up to 24.9% when translating out of English. The fine-tuned top-performing model surpasses the best proprietary and open-weight models in the academic translation domain. By releasing ACAD-TRAIN, ACAD-BENCH and the fine-tuned models, we provide the community with a valuable resource to advance research in academic domain and long-context translation.

Title: COSTAR-A: A prompting framework for enhancing Large Language Model performance on Point-of-View questions

Authors: Nzubechukwu C. Ohalete (1), Kevin B. Gittner (1), Lauren M. Matheny (1) ((1) School of Data Science and Analytics, Kennesaw State University, GA, USA)
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.12637
Pdf URL: https://arxiv.org/pdf/2510.12637
Copy Paste: [[2510.12637]] COSTAR-A: A prompting framework for enhancing Large Language Model performance on Point-of-View questions(https://arxiv.org/abs/2510.12637)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) are highly sensitive to prompt design, and making optimized prompting techniques is crucial for generating consistent, high-quality outputs. In this study, we introduce COSTAR-A, a novel prompt engineering framework that enhances the existing COSTAR method, which stands for Context, Objective, Style, Tone, Audience, and Response, by adding the 'Answer' component at the end. We demonstrate that while the original COSTAR framework improves prompt clarity and aligns outputs for larger LLMs, its performance is less consistent with smaller, locally optimized models, particularly in tasks that require more directive or constrained outputs. Through a series of controlled prompt-output assessments with smaller (at most 8 billion parameters), fine-tuned models, we found that COSTAR-A can enhance the output structure and decisiveness of localized LLMs for certain tasks, although its effectiveness varies across models and use cases. Notably, the Llama 3.1-8B model exhibited performance improvements when prompted with COSTAR-A compared to COSTAR alone. These findings emphasize the adaptability and scalability of COSTAR-A as a prompting framework, particularly in computationally efficient AI deployments on resource-constrained hardware.

Title: Reasoning Pattern Matters: Learning to Reason without Human Rationales

Authors: Chaoxu Pang, Yixuan Cao, Ping Luo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.12643
Pdf URL: https://arxiv.org/pdf/2510.12643
Copy Paste: [[2510.12643]] Reasoning Pattern Matters: Learning to Reason without Human Rationales(https://arxiv.org/abs/2510.12643)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have demonstrated remarkable reasoning capabilities under the widely adopted SFT+RLVR paradigm, which first performs Supervised Fine-Tuning (SFT) on human-annotated reasoning trajectories (rationales) to establish initial reasoning behaviors, then applies Reinforcement Learning with Verifiable Rewards (RLVR) to optimize the model using verifiable signals without golden rationales. However, annotating high-quality rationales for the SFT stage remains prohibitively expensive. This paper investigates when and how rationale annotation costs can be substantially reduced without compromising reasoning performance. We identify a broad class of problems, termed patterned reasoning tasks, where reasoning follows a fixed, procedural strategy consistent across instances. Although instances vary in content such as domain knowledge, factual information, or numeric values, the solution derives from applying a shared reasoning pattern. We argue that the success of SFT+RLVR on such tasks primarily stems from its ability to enable models to internalize these reasoning patterns. Using numerical semantic matching as a representative task, we provide both causal and behavioral evidence showing that reasoning patterns rather than the quantity or quality of rationales are the key determinant of performance. Building on these insights, we propose Pattern-Aware LLMs as Rationale AnnOtators (PARO), a simple yet effective framework that enables LLMs to generate rationales aligned with task-specific reasoning patterns without requiring human rationale annotations. Experiments show that PARO-generated rationales achieve comparable SFT+RLVR performance to human rationales that are 10 times larger. These results suggest that large-scale human rationale annotations can be replaced with LLM-based automatic annotations requiring only limited human supervision over reasoning patterns.

Title: Generation Space Size: Understanding and Calibrating Open-Endedness of LLM Generations

Authors: Sunny Yu, Ahmad Jabbar, Robert Hawkins, Dan Jurafsky, Myra Cheng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.12699
Pdf URL: https://arxiv.org/pdf/2510.12699
Copy Paste: [[2510.12699]] Generation Space Size: Understanding and Calibrating Open-Endedness of LLM Generations(https://arxiv.org/abs/2510.12699)
Keywords: llm, hallucination, prompt
Abstract: Different open-ended generation tasks require different degrees of output diversity. However, current LLMs are often miscalibrated. They collapse to overly homogeneous outputs for creative tasks and hallucinate diverse but incorrect responses for factual tasks. We argue that these two failure modes are unified by, and can both be addressed by, the notion of effective generation space size (GSS) -- the set of semantically distinct outputs a model considers for a prompt. We present GSSBench, a task suite of prompt pairs with ground-truth GSS relationships to assess different metrics and understand where models diverge from desired behavior. We find that hallucination detection metrics, particularly EigenScore, consistently outperform standard diversity and uncertainty quantification metrics, while using only model internals, providing interpretable insights into a model's internal task representations. We demonstrate three applications of GSS: (1) detecting prompt ambiguity and predicting clarification questions for better grounding, (2) interpreting overthinking and underthinking in reasoning models, and (3) steering models to expand their generation space to yield high-quality and diverse outputs.

Title: Omni-Captioner: Data Pipeline, Models, and Benchmark for Omni Detailed Perception

Authors: Ziyang Ma, Ruiyang Xu, Zhenghao Xing, Yunfei Chu, Yuxuan Wang, Jinzheng He, Jin Xu, Pheng-Ann Heng, Kai Yu, Junyang Lin, Eng Siong Chng, Xie Chen
Subjects: cs.CL, cs.CV, cs.MM, cs.SD
Abstract URL: https://arxiv.org/abs/2510.12720
Pdf URL: https://arxiv.org/pdf/2510.12720
Copy Paste: [[2510.12720]] Omni-Captioner: Data Pipeline, Models, and Benchmark for Omni Detailed Perception(https://arxiv.org/abs/2510.12720)
Keywords: language model, hallucination, agent
Abstract: Fine-grained perception of multimodal information is critical for advancing human-AI interaction. With recent progress in audio-visual technologies, Omni Language Models (OLMs), capable of processing audio and video signals in parallel, have emerged as a promising paradigm for achieving richer understanding and reasoning. However, their capacity to capture and describe fine-grained details remains underexplored. In this work, we present a systematic and comprehensive investigation of omni detailed perception from the perspectives of the data pipeline, models, and benchmark. We first identify an inherent "co-growth" between detail and hallucination in current OLMs. To address this, we propose Omni-Detective, an agentic data generation pipeline integrating tool-calling, to autonomously produce highly detailed yet minimally hallucinatory multimodal data. Based on the data generated with Omni-Detective, we train two captioning models: Audio-Captioner for audio-only detailed perception, and Omni-Captioner for audio-visual detailed perception. Under the cascade evaluation protocol, Audio-Captioner achieves the best performance on MMAU and MMAR among all open-source models, surpassing Gemini 2.5 Flash and delivering performance comparable to Gemini 2.5 Pro. On existing detailed captioning benchmarks, Omni-Captioner sets a new state-of-the-art on VDC and achieves the best trade-off between detail and hallucination on the video-SALMONN 2 testset. Given the absence of a dedicated benchmark for omni detailed perception, we design Omni-Cloze, a novel cloze-style evaluation for detailed audio, visual, and audio-visual captioning that ensures stable, efficient, and reliable assessment. Experimental results and analysis demonstrate the effectiveness of Omni-Detective in generating high-quality detailed captions, as well as the superiority of Omni-Cloze in evaluating such detailed captions.

Title: Which Word Orders Facilitate Length Generalization in LMs? An Investigation with GCG-Based Artificial Languages

Authors: Nadine El-Naggar, Tatsuki Kuribayashi, Ted Briscoe
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.12722
Pdf URL: https://arxiv.org/pdf/2510.12722
Copy Paste: [[2510.12722]] Which Word Orders Facilitate Length Generalization in LMs? An Investigation with GCG-Based Artificial Languages(https://arxiv.org/abs/2510.12722)
Keywords: language model
Abstract: Whether language models (LMs) have inductive biases that favor typologically frequent grammatical properties over rare, implausible ones has been investigated, typically using artificial languages (ALs) (White and Cotterell, 2021; Kuribayashi et al., 2024). In this paper, we extend these works from two perspectives. First, we extend their context-free AL formalization by adopting Generalized Categorial Grammar (GCG) (Wood, 2014), which allows ALs to cover attested but previously overlooked constructions, such as unbounded dependency and mildly context-sensitive structures. Second, our evaluation focuses more on the generalization ability of LMs to process unseen longer test sentences. Thus, our ALs better capture features of natural languages and our experimental paradigm leads to clearer conclusions -- typologically plausible word orders tend to be easier for LMs to productively generalize.

Title: Hey, wait a minute: on at-issue sensitivity in Language Models

Authors: Sanghee J. Kim, Kanishka Misra
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.12740
Pdf URL: https://arxiv.org/pdf/2510.12740
Copy Paste: [[2510.12740]] Hey, wait a minute: on at-issue sensitivity in Language Models(https://arxiv.org/abs/2510.12740)
Keywords: language model, llm, prompt
Abstract: Evaluating the naturalness of dialogue in language models (LMs) is not trivial: notions of 'naturalness' vary, and scalable quantitative metrics remain limited. This study leverages the linguistic notion of 'at-issueness' to assess dialogue naturalness and introduces a new method: Divide, Generate, Recombine, and Compare (DGRC). DGRC (i) divides a dialogue as a prompt, (ii) generates continuations for subparts using LMs, (iii) recombines the dialogue and continuations, and (iv) compares the likelihoods of the recombined sequences. This approach mitigates bias in linguistic analyses of LMs and enables systematic testing of discourse-sensitive behavior. Applying DGRC, we find that LMs prefer to continue dialogue on at-issue content, with this effect enhanced in instruct-tuned models. They also reduce their at-issue preference when relevant cues (e.g., "Hey, wait a minute") are present. Although instruct-tuning does not further amplify this modulation, the pattern reflects a hallmark of successful dialogue dynamics.

Title: Language Models Model Language

Authors: Łukasz Borchmann
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.12766
Pdf URL: https://arxiv.org/pdf/2510.12766
Copy Paste: [[2510.12766]] Language Models Model Language(https://arxiv.org/abs/2510.12766)
Keywords: language model, llm
Abstract: Linguistic commentary on LLMs, heavily influenced by the theoretical frameworks of de Saussure and Chomsky, is often speculative and unproductive. Critics challenge whether LLMs can legitimately model language, citing the need for "deep structure" or "grounding" to achieve an idealized linguistic "competence." We argue for a radical shift in perspective towards the empiricist principles of Witold Mańczak, a prominent general and historical linguist. He defines language not as a "system of signs" or a "computational system of the brain" but as the totality of all that is said and written. Above all, he identifies frequency of use of particular language elements as language's primary governing principle. Using his framework, we challenge prior critiques of LLMs and provide a constructive guide for designing, evaluating, and interpreting language models.

Title: Dr.LLM: Dynamic Layer Routing in LLMs

Authors: Ahmed Heakl, Martin Gubri, Salman Khan, Sangdoo Yun, Seong Joon Oh
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.12773
Pdf URL: https://arxiv.org/pdf/2510.12773
Copy Paste: [[2510.12773]] Dr.LLM: Dynamic Layer Routing in LLMs(https://arxiv.org/abs/2510.12773)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) process every token through all layers of a transformer stack, causing wasted computation on simple queries and insufficient flexibility for harder ones that need deeper reasoning. Adaptive-depth methods can improve efficiency, but prior approaches rely on costly inference-time search, architectural changes, or large-scale retraining, and in practice often degrade accuracy despite efficiency gains. We introduce Dr.LLM, Dynamic routing of Layers for LLMs, a retrofittable framework that equips pretrained models with lightweight per-layer routers deciding to skip, execute, or repeat a block. Routers are trained with explicit supervision: using Monte Carlo Tree Search (MCTS), we derive high-quality layer configurations that preserve or improve accuracy under a compute budget. Our design, windowed pooling for stable routing, focal loss with class balancing, and bottleneck MLP routers, ensures robustness under class imbalance and long sequences. On ARC (logic) and DART (math), Dr.LLM improves accuracy by up to +3.4%p while saving 5 layers per example on average. Routers generalize to out-of-domain tasks (MMLU, GSM8k, AIME, TruthfulQA, SQuADv2, GPQA, PIQA, AGIEval) with only 0.85% accuracy drop while retaining efficiency, and outperform prior routing methods by up to +7.7%p. Overall, Dr.LLM shows that explicitly supervised routers retrofit frozen LLMs for budget-aware, accuracy-driven inference without altering base weights.