2026-03-31

Title: GeoBlock: Inferring Block Granularity from Dependency Geometry in Diffusion Language Models

Authors: Lipeng Wan, Junjie Ma, Jianhui Gu, Zeyang Liu, Xuyang Lu, Xuguang Lan
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2603.26675
Pdf URL: https://arxiv.org/pdf/2603.26675
Copy Paste: [[2603.26675]] GeoBlock: Inferring Block Granularity from Dependency Geometry in Diffusion Language Models(https://arxiv.org/abs/2603.26675)
Keywords: language model
Abstract: Block diffusion enables efficient parallel refinement in diffusion language models, but its decoding behavior depends critically on block size. Existing block-sizing strategies rely on fixed rules or heuristic signals and do not account for the dependency geometry that determines which tokens can be safely refined together. This motivates a geometry view of diffusion decoding: regions with strong causal ordering require sequential updates, whereas semantically cohesive regions admit parallel refinement. We introduce GeoBlock, a geometry-aware block inference framework that determines block granularity directly from attention-derived dependency geometry. Instead of relying on predefined schedules or local confidence heuristics, GeoBlock analyzes cross-token dependency patterns to identify geometrically stable refinement regions and dynamically determines appropriate block boundaries during decoding. By adapting block granularity to the dependency geometry, GeoBlock preserves the parallel efficiency of block diffusion while enforcing dependency-consistent refinement that exhibits autoregressive reliability. GeoBlock requires no additional training and integrates seamlessly into existing block diffusion architectures. Extensive experiments across multiple benchmarks show that GeoBlock reliably identifies geometry-consistent block boundaries and improves the accuracy of block diffusion with only a small additional computational budget.

Title: AlpsBench: An LLM Personalization Benchmark for Real-Dialogue Memorization and Preference Alignment

Authors: Jianfei Xiao, Xiang Yu, Chengbing Wang, Wuqiang Zheng, Xinyu Lin, Kaining Liu, Hongxun Ding, Yang Zhang, Wenjie Wang, Fuli Feng, Xiangnan He
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.26680
Pdf URL: https://arxiv.org/pdf/2603.26680
Copy Paste: [[2603.26680]] AlpsBench: An LLM Personalization Benchmark for Real-Dialogue Memorization and Preference Alignment(https://arxiv.org/abs/2603.26680)
Keywords: language model, llm, chat
Abstract: As Large Language Models (LLMs) evolve into lifelong AI assistants, LLM personalization has become a critical frontier. However, progress is currently bottlenecked by the absence of a gold-standard evaluation benchmark. Existing benchmarks either overlook personalized information management that is critical for personalization or rely heavily on synthetic dialogues, which exhibit an inherent distribution gap from real-world dialogue. To bridge this gap, we introduce AlpsBench, An LLM PerSonalization benchmark derived from real-world human-LLM dialogues. AlpsBench comprises 2,500 long-term interaction sequences curated from WildChat, paired with human-verified structured memories that encapsulate both explicit and implicit personalization signals. We define four pivotal tasks - personalized information extraction, updating, retrieval, and utilization - and establish protocols to evaluate the entire lifecycle of memory management. Our benchmarking of frontier LLMs and memory-centric systems reveals that: (i) models struggle to reliably extract latent user traits; (ii) memory updating faces a performance ceiling even in the strongest models; (iii) retrieval accuracy declines sharply in the presence of large distractor pools; and (iv) while explicit memory mechanisms improve recall, they do not inherently guarantee more preference-aligned or emotionally resonant responses. AlpsBench aims to provide a comprehensive framework.

Title: The Cognitive Divergence: AI Context Windows, Human Attention Decline, and the Delegation Feedback Loop

Authors: Netanel Eliav (Machine Human Intelligence Lab)
Subjects: cs.CL, cs.AI, cs.CY, cs.HC, q-bio.NC
Abstract URL: https://arxiv.org/abs/2603.26707
Pdf URL: https://arxiv.org/pdf/2603.26707
Copy Paste: [[2603.26707]] The Cognitive Divergence: AI Context Windows, Human Attention Decline, and the Delegation Feedback Loop(https://arxiv.org/abs/2603.26707)
Keywords: language model, gpt, llm, chat
Abstract: This paper documents and theorises a self-reinforcing dynamic between two measurable trends: the exponential expansion of large language model (LLM) context windows and the secular contraction of human sustained-attention capacity. We term the resulting asymmetry the Cognitive Divergence. AI context windows have grown from 512 tokens in 2017 to 2,000,000 tokens by 2026 (factor ~3,906; fitted lambda = 0.59/yr; doubling time ~14 months). Over the same period, human Effective Context Span (ECS) -- a token-equivalent measure derived from validated reading-rate meta-analysis (Brysbaert, 2019) and an empirically motivated Comprehension Scaling Factor -- has declined from approximately 16,000 tokens (2004 baseline) to an estimated 1,800 tokens (2026, extrapolated from longitudinal behavioural data ending 2020 (Mark, 2023); see Section 9 for uncertainty discussion). The AI-to-human ratio grew from near parity at the ChatGPT launch (November 2022) to 556--1,111x raw and 56--111x quality-adjusted, after accounting for retrieval degradation (Liu et al., 2024; Chroma, 2025). Beyond documenting this divergence, the paper introduces the Delegation Feedback Loop hypothesis: as AI capability grows, the cognitive threshold at which humans delegate to AI falls, extending to tasks of negligible demand; the resulting reduction in cognitive practice may further attenuate the capacities already documented as declining (Gerlich, 2025; Kim et al., 2026; Kosmyna et al., 2025). Neither trend reverses spontaneously. The paper characterises the divergence statistically, reviews neurobiological mechanisms across eight peer-reviewed neuroimaging studies, presents empirical evidence bearing on the delegation threshold, and proposes a research agenda centred on a validated ECS psychometric instrument and longitudinal study of AI-mediated cognitive change.
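The growth figures quoted in this abstract can be checked with a few lines of arithmetic. The sketch below assumes the fitted rate describes a simple exponential N(t) = N0 * exp(lambda * t); the variable names are illustrative and not taken from the paper.

```python
import math

# Figures quoted in the abstract.
tokens_2017 = 512
tokens_2026 = 2_000_000
fitted_lambda_per_year = 0.59  # fitted exponential growth rate

# Raw growth factor between the two endpoint context-window sizes.
growth_factor = tokens_2026 / tokens_2017

# Doubling time implied by exponential growth: ln(2) / lambda, in months.
doubling_time_months = 12 * math.log(2) / fitted_lambda_per_year

print(f"growth factor: {growth_factor:.0f}")
print(f"doubling time: {doubling_time_months:.1f} months")
```

Both values agree with the abstract's rounded figures (a factor of about 3,906 and a doubling time of about 14 months). The fitted rate itself comes from the paper's regression and is not re-derived here.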

Title: Do Multilingual VLMs Reason Equally? A Cross-Lingual Visual Reasoning Audit for Indian Languages

Authors: Swastik R
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2603.26742
Pdf URL: https://arxiv.org/pdf/2603.26742
Copy Paste: [[2603.26742]] Do Multilingual VLMs Reason Equally? A Cross-Lingual Visual Reasoning Audit for Indian Languages(https://arxiv.org/abs/2603.26742)
Keywords: language model, gpt, prompt, chain-of-thought
Abstract: Vision-language models score well on mathematical, scientific, and spatial reasoning benchmarks, yet these evaluations are overwhelmingly English. I present the first cross-lingual visual reasoning audit for Indian languages. 980 questions from MathVista, ScienceQA, and MMMU are translated into Hindi, Tamil, Telugu, Bengali, Kannada, and Marathi using IndicTrans2, with Gemini 2.0 Flash cross-verification on 50 samples per language (inter-translator agreement 0.79-0.84). Eight VLMs, from 7B open-source models to GPT-4o, are evaluated across all seven languages, yielding 68,600 inference records that include text-only and chain-of-thought ablations. I find accuracy drops of 9.8-25 percentage points when switching from English to an Indian language, with Dravidian languages suffering up to 13.2 pp more than Indo-Aryan. Chain-of-thought prompting degrades Bengali (-14.4 pp) and Kannada (-11.4 pp) rather than helping, exposing English-centric reasoning chains. Aya-Vision-8B, built for 23 languages, still drops 28.5 pp on Dravidian scripts; multilingual pretraining alone does not transfer visual reasoning. I release the translated benchmark and all model outputs.

Title: LogicDiff: Logic-Guided Denoising Improves Reasoning in Masked Diffusion Language Models

Authors: Shaik Aman
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2603.26771
Pdf URL: https://arxiv.org/pdf/2603.26771
Copy Paste: [[2603.26771]] LogicDiff: Logic-Guided Denoising Improves Reasoning in Masked Diffusion Language Models(https://arxiv.org/abs/2603.26771)
Keywords: language model
Abstract: Masked diffusion language models (MDLMs) generate text by iteratively unmasking tokens from a fully masked sequence, offering parallel generation and bidirectional context. However, their standard confidence-based unmasking strategy systematically defers high-entropy logical connective tokens, the critical branching points in reasoning chains, leading to severely degraded reasoning performance. We introduce LogicDiff, an inference-time method that replaces confidence-based unmasking with logic-role-guided unmasking. A lightweight classification head (4.2M parameters, 0.05% of the base model) predicts the logical role of each masked position (premise, connective, derived step, conclusion, or filler) from the base model's hidden states with 98.4% accuracy. A dependency-ordered scheduler then unmasks tokens in logical dependency order: premises first, then connectives, then derived steps, then conclusions. Without modifying a single parameter of the base model and without any reinforcement learning or task-specific training, LogicDiff improves LLaDA-8B-Instruct accuracy from 22.0% to 60.7% on GSM8K (+38.7 percentage points) and from 23.6% to 29.2% on MATH-500 (+5.6 pp), with less than 6% speed overhead. Our results demonstrate that a substantial portion of the reasoning deficit in MDLMs is attributable to suboptimal token unmasking order, not to limitations of the model's learned representations.
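The dependency-ordered unmasking described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the role labels stand in for the classification head's predictions, the function name is hypothetical, and placing "filler" last is an assumption, since the abstract does not say where filler tokens fall in the order.

```python
from typing import Dict, List

# Unmasking priority in the order the abstract states:
# premises, then connectives, then derived steps, then conclusions.
# ASSUMPTION: "filler" is placed last; the abstract does not specify this.
ROLE_PRIORITY: Dict[str, int] = {
    "premise": 0,
    "connective": 1,
    "derived_step": 2,
    "conclusion": 3,
    "filler": 4,
}


def unmask_order(predicted_roles: Dict[int, str]) -> List[int]:
    """Sort masked positions by role priority, breaking ties left to right."""
    return sorted(
        predicted_roles,
        key=lambda pos: (ROLE_PRIORITY[predicted_roles[pos]], pos),
    )


# Hypothetical role predictions for five masked positions.
roles = {0: "conclusion", 1: "premise", 2: "connective", 3: "derived_step", 4: "premise"}
order = unmask_order(roles)
print(order)  # [1, 4, 2, 3, 0]
```

A real scheduler would presumably unmask several positions per denoising step and re-predict roles as context fills in; this sketch only shows the ordering rule.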

Title: Resolving the Robustness-Precision Trade-off in Financial RAG through Hybrid Document-Routed Retrieval

Authors: Zhiyuan Cheng, Longying Lai, Yue Liu
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2603.26815
Pdf URL: https://arxiv.org/pdf/2603.26815
Copy Paste: [[2603.26815]] Resolving the Robustness-Precision Trade-off in Financial RAG through Hybrid Document-Routed Retrieval(https://arxiv.org/abs/2603.26815)
Keywords: llm, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) systems for financial document question answering typically follow a chunk-based paradigm: documents are split into fragments, embedded into vector space, and retrieved via similarity search. While effective in general settings, this approach suffers from cross-document chunk confusion in structurally homogeneous corpora such as regulatory filings. Semantic File Routing (SFR), which uses LLM structured output to route queries to whole documents, reduces catastrophic failures but sacrifices the precision of targeted chunk retrieval. We identify this robustness-precision trade-off through controlled evaluation on the FinDER benchmark (1,500 queries across five groups): SFR achieves higher average scores (6.45 vs. 6.02) and fewer failures (10.3% vs. 22.5%), while chunk-based retrieval (CBR) yields more perfect answers (13.8% vs. 8.5%). To resolve this trade-off, we propose Hybrid Document-Routed Retrieval (HDRR), a two-stage architecture that uses SFR as a document filter followed by chunk-based retrieval scoped to the identified document(s). HDRR eliminates cross-document confusion while preserving targeted chunk precision. Experimental results demonstrate that HDRR achieves the best performance on every metric: an average score of 7.54 (25.2% above CBR, 16.9% above SFR), a failure rate of only 6.4%, a correctness rate of 67.7% (+18.7 pp over CBR), and a perfect-answer rate of 20.1% (+6.3 pp over CBR, +11.6 pp over SFR). HDRR resolves the trade-off by simultaneously achieving the lowest failure rate and the highest precision across all five experimental groups.
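The two-stage idea behind HDRR (route to a document first, then retrieve chunks only inside it) can be sketched on toy data. Everything below is an assumption for illustration: word overlap stands in for both the LLM-based router and the embedding similarity search, and the filings, summaries, and function names are invented.

```python
from typing import Dict, List, Tuple


def overlap(query: str, text: str) -> int:
    """Toy relevance score: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def route_documents(query: str, summaries: Dict[str, str], top_k: int = 1) -> List[str]:
    """Stage 1 (stand-in for LLM routing): pick the most relevant documents."""
    ranked = sorted(summaries, key=lambda d: overlap(query, summaries[d]), reverse=True)
    return ranked[:top_k]


def retrieve_chunks(
    query: str, chunks: Dict[str, List[str]], doc_ids: List[str], top_k: int = 1
) -> List[Tuple[str, str]]:
    """Stage 2: chunk retrieval restricted to the given documents."""
    candidates = [(d, c) for d in doc_ids for c in chunks[d]]
    return sorted(candidates, key=lambda dc: overlap(query, dc[1]), reverse=True)[:top_k]


# Two structurally similar filings (hypothetical data).
summaries = {"globex": "globex annual report 2023", "acme": "acme annual report 2023"}
chunks = {
    "globex": ["total revenue was 9 million", "risk factors include regulation"],
    "acme": ["total revenue was 5 million", "risk factors include competition"],
}
query = "what was acme total revenue"

# Chunk-only retrieval searches every document and can pick the wrong filing.
flat = retrieve_chunks(query, chunks, list(chunks))
# Hybrid retrieval scopes the chunk search to the routed document.
hybrid = retrieve_chunks(query, chunks, route_documents(query, summaries))
print(flat, hybrid)
```

In this toy corpus the two revenue chunks score identically, so the unscoped search returns the Globex chunk purely because of candidate order, while the routed search can only return the Acme chunk. That mirrors the cross-document confusion the abstract describes, though real failure modes depend on the embedding model and corpus.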

Title: Arithmetic OOD Failure Unfolds in Stages in Minimal GPTs

Authors: Seine A. Shintani
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2603.26828
Pdf URL: https://arxiv.org/pdf/2603.26828
Copy Paste: [[2603.26828]] Arithmetic OOD Failure Unfolds in Stages in Minimal GPTs(https://arxiv.org/abs/2603.26828)
Keywords: gpt
Abstract: Arithmetic benchmarks are often reduced to a single held-out score, but that score can conflate qualitatively different failures. We study a controlled minimal GPT trained on exhaustive 2-digit addition, where all local digit transitions are already present in training, and ask why 3-digit generalization still fails. The failure is staged. First, there is a layout barrier: a learned absolute-position model collapses under a pure 3-digit layout shift, and mixed-layout exposure is the only intervention that materially weakens this barrier. Second, after layout repair, the hundreds position behaves like a carry flag rather than a semantic hundreds digit; targeted carry probes reverse the relevant logit margin, whereas a matched extra-data control does not. Third, after carry repair, the main remaining bottleneck is conditional recomposition: high-conditioned tail data outperforms a matched control, high-only data, and tail-only data on all true-3-digit suites, and the same ordering reappears in a larger 2-layer bridge experiment. The residual errors after recomposition are then overwhelmingly tens-only, and a separate 10-seed late-stage study shows that a sign-aware tens repair raises exact match on the hardest thousands-carry suite from 0.664 to 0.822. We therefore provide an experimentally testable decomposition of arithmetic OOD failure into layout, carry-semantics, recomposition, and late tens-residual stages.

Title: Magic Words or Methodical Work? Challenging Conventional Wisdom in LLM-Based Political Text Annotation

Authors: Lorca McLaren, James Cross, Zuzanna Krakowska, Robin Rauner, Martijn Schoonvelde
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.26898
Pdf URL: https://arxiv.org/pdf/2603.26898
Copy Paste: [[2603.26898]] Magic Words or Methodical Work? Challenging Conventional Wisdom in LLM-Based Political Text Annotation(https://arxiv.org/abs/2603.26898)
Keywords: language model, llm, prompt
Abstract: Political scientists are rapidly adopting large language models (LLMs) for text annotation, yet the sensitivity of annotation results to implementation choices remains poorly understood. Most evaluations test a single model or configuration; how model choice, model size, learning approach, and prompt style interact, and whether popular "best practices" survive controlled comparison, are largely unexplored. We present a controlled evaluation of these pipeline choices, testing six open-weight models across four political science annotation tasks under identical quantisation, hardware, and prompt-template conditions. Our central finding is methodological: interaction effects dominate main effects, so seemingly reasonable pipeline choices can become consequential researcher degrees of freedom. No single model, prompt style, or learning approach is uniformly superior, and the best-performing model varies across tasks. Two corollaries follow. First, model size is an unreliable guide both to cost and to performance: cross-family efficiency differences are so large that some larger models are less resource-intensive than much smaller alternatives, while within model families mid-range variants often match or exceed larger counterparts. Second, widely recommended prompt engineering techniques yield inconsistent and sometimes negative effects on annotation performance. We use these benchmark results to develop a validation-first framework - with a principled ordering of pipeline decisions, guidance on prompt freezing and held-out evaluation, reporting standards, and open-source tools - to help researchers navigate this decision space transparently.

Title: The Last Fingerprint: How Markdown Training Shapes LLM Prose

Authors: E. M. Freeburg
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.27006
Pdf URL: https://arxiv.org/pdf/2603.27006
Copy Paste: [[2603.27006]] The Last Fingerprint: How Markdown Training Shapes LLM Prose(https://arxiv.org/abs/2603.27006)
Keywords: language model, gpt, llm
Abstract: Large language models produce em dashes at varying rates, and the observation that some models "overuse" them has become one of the most widely discussed markers of AI-generated text. Yet no mechanistic account of this pattern exists, and the parallel observation that LLMs default to markdown-formatted output has never been connected to it. We propose that the em dash is markdown leaking into prose -- the smallest surviving unit of the structural orientation that LLMs acquire from markdown-saturated training corpora. We present a five-step genealogy connecting training data composition, structural internalization, the dual-register status of the em dash, and post-training amplification. We test this with a two-condition suppression experiment across twelve models from five providers (Anthropic, OpenAI, Meta, Google, DeepSeek): when models are instructed to avoid markdown formatting, overt features (headers, bullets, bold) are eliminated or nearly eliminated, but em dashes persist -- except in Meta's Llama models, which produce none at all. Em dash frequency and suppression resistance vary from 0.0 per 1,000 words (Llama) to 9.1 (GPT-4.1 under suppression), functioning as a signature of the specific fine-tuning procedure applied. A three-condition suppression gradient shows that even explicit em dash prohibition fails to eliminate the artifact in some models, and a base-vs-instruct comparison confirms that the latent tendency exists pre-RLHF. These findings connect two previously isolated online discourses and reframe em dash frequency as a diagnostic of fine-tuning methodology rather than a stylistic defect.

Title: RASPRef: Retrieval-Augmented Self-Supervised Prompt Refinement for Large Reasoning Models

Authors: Rahul Soni
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.27008
Pdf URL: https://arxiv.org/pdf/2603.27008
Copy Paste: [[2603.27008]] RASPRef: Retrieval-Augmented Self-Supervised Prompt Refinement for Large Reasoning Models(https://arxiv.org/abs/2603.27008)
Keywords: language model, prompt
Abstract: Recent reasoning-focused language models such as DeepSeek R1 and OpenAI o1 have demonstrated strong performance on structured reasoning benchmarks including GSM8K, MATH, and multi-hop question answering tasks. However, their performance remains highly sensitive to prompt formulation, and designing effective prompts is typically a manual and iterative process that does not scale well across tasks or domains. To address this limitation, we introduce Retrieval-Augmented Self-Supervised Prompt Refinement (RASPRef), a framework that improves prompts without requiring human annotations or task-specific supervision. The approach retrieves relevant examples and previously generated reasoning trajectories, and leverages signals such as multi-sample consistency, verifier feedback, and model-generated critiques to iteratively refine the prompt. Unlike prior approaches that focus primarily on improving model outputs, RASPRef directly treats the prompt as the optimization target and improves it through an iterative retrieval-guided refinement process. Experiments on GSM8K-style mathematical reasoning tasks show that retrieval-guided prompting improves performance compared with a static prompting baseline. We further discuss how retrieval quality, trajectory selection, and self-supervised feedback signals may influence the effectiveness of prompt refinement. These findings suggest that prompt design remains a critical factor for reasoning-oriented language models, and that self-improving prompts offer a practical and scalable strategy for improving reasoning performance.

Title: TAPS: Task Aware Proposal Distributions for Speculative Sampling

Authors: Mohamad Zbib, Mohamad Bazzi, Ammar Mohanna, Hasan Abed Al Kader Hammoud, Bernard Ghanem
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.27027
Pdf URL: https://arxiv.org/pdf/2603.27027
Copy Paste: [[2603.27027]] TAPS: Task Aware Proposal Distributions for Speculative Sampling(https://arxiv.org/abs/2603.27027)
Keywords: gpt
Abstract: Speculative decoding accelerates autoregressive generation by letting a lightweight draft model propose future tokens that a larger target model then verifies in parallel. In practice, however, draft models are usually trained on broad generic corpora, which leaves it unclear how much speculative decoding quality depends on the draft training distribution. We study this question with lightweight HASS and EAGLE-2 drafters trained on MathInstruct, ShareGPT, and mixed-data variants, evaluated on MT-Bench, GSM8K, MATH-500, and SVAMP. Measured by acceptance length, task-specific training yields clear specialization: MathInstruct-trained drafts are strongest on reasoning benchmarks, while ShareGPT-trained drafts are strongest on MT-Bench. Mixed-data training improves robustness, but larger mixtures do not dominate across decoding temperatures. We also study how to combine specialized drafters at inference time. Naive checkpoint averaging performs poorly, whereas confidence-based routing improves over single-domain drafts and merged-tree verification yields the highest acceptance length overall for both backbones. Finally, confidence is a more useful routing signal than entropy: rejected tokens tend to have higher entropy, but confidence produces much clearer benchmark-level routing decisions. These results show that speculative decoding quality depends not only on draft architecture, but also on the match between draft training data and downstream workload, and that specialized drafters are better combined at inference time than in weight space.
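Confidence-based routing between specialized drafters, as mentioned in this abstract, might look roughly like the sketch below. The abstract does not define its confidence measure, so treating confidence as the top-1 probability of each drafter's next-token distribution is an assumption, as are the drafter names and probabilities.

```python
import math
from typing import Dict, List


def confidence(probs: List[float]) -> float:
    """ASSUMPTION: confidence is the top-1 probability of the distribution."""
    return max(probs)


def entropy(probs: List[float]) -> float:
    """Shannon entropy in nats, shown for comparison with confidence."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def route_by_confidence(proposals: Dict[str, List[float]]) -> str:
    """Pick the drafter whose proposal distribution is most confident."""
    return max(proposals, key=lambda name: confidence(proposals[name]))


# Hypothetical next-token distributions from two specialized drafters.
proposals = {
    "math_drafter": [0.70, 0.20, 0.10],
    "chat_drafter": [0.40, 0.35, 0.25],
}
chosen = route_by_confidence(proposals)
print(chosen)  # math_drafter
```

Here the more confident drafter also has the lower entropy, so both signals agree. The abstract's finding that confidence gives clearer benchmark-level routing decisions than entropy concerns cases this toy example does not cover.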

Title: Debiasing Large Language Models toward Social Factors in Online Behavior Analytics through Prompt Knowledge Tuning

Authors: Hossein Salemi, Jitin Krishnan, Hemant Purohit
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.27057
Pdf URL: https://arxiv.org/pdf/2603.27057
Copy Paste: [[2603.27057]] Debiasing Large Language Models toward Social Factors in Online Behavior Analytics through Prompt Knowledge Tuning(https://arxiv.org/abs/2603.27057)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Attribution theory explains how individuals interpret and attribute others' behavior in a social context by employing personal (dispositional) and impersonal (situational) causality. Large Language Models (LLMs), trained on human-generated corpora, may implicitly mimic this social attribution process in social contexts. However, the extent to which LLMs utilize these causal attributions in their reasoning remains underexplored. Although using reasoning paradigms, such as Chain-of-Thought (CoT), has shown promising results in various tasks, ignoring social attribution in reasoning could lead to biased responses by LLMs in social contexts. In this study, we investigate the impact of incorporating a user's goal as knowledge to infer dispositional causality and message context to infer situational causality on LLM performance. To this end, we introduce a scalable method to mitigate such biases by enriching the instruction prompts for LLMs with two prompt aids using social-attribution knowledge, based on the context and goal of a social media message. This method improves the model performance while reducing the social-attribution bias of the LLM in the reasoning on zero-shot classification tasks for behavior analytics applications. We empirically show the benefits of our method across two tasks (intent detection and theme detection on social media in the disaster domain) when considering the variability of disaster types and multiple languages of social media. Our experiments highlight the biases of three open-source LLMs: Llama3, Mistral, and Gemma, toward social attribution, and show the effectiveness of our mitigation strategies.

Title: Story2Proposal: A Scaffold for Structured Scientific Paper Writing

Authors: Zhuoyang Qian, Wei Shi, Xu Lin, Li Ling, Meng Luo, Ziming Wang, Zhiwei Zhang, Tengyue Xu, Gaoge Liu, Zhentao Zhang, Shuo Zhang, Ziqi Wang, Zheng Feng, Yan Luo, Shu Xu, Yongjin Chen, Zhibo Feng, Zhuo Chen, Bruce Yuan, Biao Wu, Harry Wang, Kris Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.27065
Pdf URL: https://arxiv.org/pdf/2603.27065
Copy Paste: [[2603.27065]] Story2Proposal: A Scaffold for Structured Scientific Paper Writing(https://arxiv.org/abs/2603.27065)
Keywords: gpt, chat, agent
Abstract: Generating scientific manuscripts requires maintaining alignment between narrative reasoning, experimental evidence, and visual artifacts across the document lifecycle. Existing language-model generation pipelines rely on unconstrained text synthesis with validation applied only after generation, often producing structural drift, missing figures or tables, and cross-section inconsistencies. We introduce Story2Proposal, a contract-governed multi-agent framework that converts a research story into a structured manuscript through coordinated agents operating under a persistent shared visual contract. The system organizes architect, writer, refiner, and renderer agents around a contract state that tracks section structure and registered visual elements, while evaluation agents supply feedback in a generate-evaluate-adapt loop that updates the contract during generation. Experiments on tasks derived from the Jericho research corpus show that Story2Proposal achieved an expert evaluation score of 6.145 versus 3.963 for DirectChat (+2.182) across GPT, Claude, Gemini, and Qwen backbones. Compared with the structured generation baseline Fars, Story2Proposal obtained an average score of 5.705 versus 5.197, indicating improved structural consistency and visual alignment.
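Summary: Generating scientific manuscripts requires keeping narrative reasoning, experimental evidence, and visual artifacts aligned throughout the document lifecycle. Existing language-model generation pipelines rely on unconstrained text synthesis and apply validation only after generation, which often produces structural drift, missing figures or tables, and inconsistencies across sections. We introduce Story2Proposal, a contract-governed multi-agent framework that converts a research story into a structured manuscript through coordinated agents operating under a persistent shared visual contract. The system organizes architect, writer, refiner, and renderer agents around a contract state that tracks section structure and registered visual elements, while evaluation agents provide feedback in a generate-evaluate-adapt loop that updates the contract during generation. Experiments on tasks derived from the Jericho research corpus show that, across GPT, Claude, Gemini, and Qwen backbones, Story2Proposal achieved an expert evaluation score of 6.145 versus 3.963 for DirectChat (+2.182). Compared with the structured generation baseline Fars, Story2Proposal obtained an average score of 5.705 versus 5.197, indicating improved structural consistency and visual alignment.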

Title: Routing Sensitivity Without Controllability: A Diagnostic Study of Fairness in MoE Language Models

Authors: Junhyeok Lee, Kyu Sung Choi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.27141
Pdf URL: https://arxiv.org/pdf/2603.27141
Copy Paste: [[2603.27141]] Routing Sensitivity Without Controllability: A Diagnostic Study of Fairness in MoE Language Models(https://arxiv.org/abs/2603.27141)
Keywords: language model
Abstract: Mixture-of-Experts (MoE) language models are universally sensitive to demographic content at the routing level, yet exploiting this sensitivity for fairness control is structurally limited. We introduce Fairness-Aware Routing Equilibrium (FARE), a diagnostic framework designed to probe the limits of routing-level stereotype intervention across diverse MoE architectures. FARE reveals that routing-level preference shifts are either unachievable (Mixtral, Qwen1.5, Qwen3), statistically non-robust (DeepSeekMoE), or accompanied by substantial utility cost (OLMoE, -4.4%p CrowS-Pairs at -6.3%p TQA). Critically, even where log-likelihood preference shifts are robust, they do not transfer to decoded generation: expanded evaluations on both non-null models yield null results across all generation metrics. Group-level expert masking reveals why: bias and core knowledge are deeply entangled within expert groups. These findings indicate that routing sensitivity is necessary but insufficient for stereotype control, and identify specific architectural conditions that can inform the design of more controllable future MoE systems.
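Summary: Mixture-of-Experts (MoE) language models are universally sensitive to demographic content at the routing level, yet exploiting this sensitivity for fairness control is structurally limited. We introduce Fairness-Aware Routing Equilibrium (FARE), a diagnostic framework designed to probe the limits of routing-level stereotype intervention across diverse MoE architectures. FARE reveals that routing-level preference shifts are either unachievable (Mixtral, Qwen1.5, Qwen3), statistically non-robust (DeepSeekMoE), or accompanied by substantial utility cost (OLMoE, -4.4%p CrowS-Pairs at -6.3%p TQA). Critically, even where log-likelihood preference shifts are robust, they do not transfer to decoded generation: expanded evaluations on both non-null models yield null results across all generation metrics. Group-level expert masking reveals why: bias and core knowledge are deeply entangled within expert groups. These findings indicate that routing sensitivity is necessary but not sufficient for stereotype control, and they identify specific architectural conditions that can inform the design of more controllable future MoE systems.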

Title: Learning to Predict Future-Aligned Research Proposals with Language Models

Authors: Heng Wang, Pengcheng Jiang, Jiashuo Sun, Zhiyi Shi, Haofei Yu, Jiawei Han, Heng Ji
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.27146
Pdf URL: https://arxiv.org/pdf/2603.27146
Copy Paste: [[2603.27146]] Learning to Predict Future-Aligned Research Proposals with Language Models(https://arxiv.org/abs/2603.27146)
Keywords: language model, llm, prompt, agent
Abstract: Large language models (LLMs) are increasingly used to assist ideation in research, but evaluating the quality of LLM-generated research proposals remains difficult: novelty and soundness are hard to measure automatically, and large-scale human evaluation is costly. We propose a verifiable alternative by reframing proposal generation as a time-sliced scientific forecasting problem. Given a research question and inspiring papers available before a cutoff time, the model generates a structured proposal and is evaluated by whether it anticipates research directions that appear in papers published after the time. We operationalize this objective with the Future Alignment Score (FAS), computed via retrieval and LLM-based semantic scoring against a held-out future corpus. To train models, we build a time-consistent dataset of 17,771 papers from targets and their pre-cutoff citations, and synthesize reasoning traces that teach gap identification and inspiration borrowing. Across Llama-3.1 and Qwen2.5 models, future-aligned tuning improves future alignment over unaligned baselines (up to +10.6% overall FAS), and domain-expert human evaluation corroborates improved proposal quality. Finally, we demonstrate practical impact by implementing two model-generated proposals with a code agent, obtaining 4.17% accuracy gain on MATH from a new prompting strategy and consistent improvements for a novel model-merging method.
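Summary: Large language models (LLMs) are increasingly used to assist research ideation, but evaluating the quality of LLM-generated research proposals remains difficult: novelty and soundness are hard to measure automatically, and large-scale human evaluation is costly. We propose a verifiable alternative by reframing proposal generation as a time-sliced scientific forecasting problem. Given a research question and inspiring papers available before a cutoff time, the model generates a structured proposal and is evaluated on whether it anticipates research directions that appear in papers published after that time. We operationalize this objective with the Future Alignment Score (FAS), computed via retrieval and LLM-based semantic scoring against a held-out future corpus. To train models, we build a time-consistent dataset of 17,771 papers from target papers and their pre-cutoff citations, and synthesize reasoning traces that teach gap identification and inspiration borrowing. Across Llama-3.1 and Qwen2.5 models, future-aligned tuning improves future alignment over unaligned baselines (up to +10.6% overall FAS), and domain-expert human evaluation corroborates the improved proposal quality. Finally, we demonstrate practical impact by implementing two model-generated proposals with a code agent, obtaining a 4.17% accuracy gain on MATH from a new prompting strategy and consistent improvements for a novel model-merging method.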

Title: Rethinking Easy-to-Hard: Limits of Curriculum Learning in Post-Training for Deductive Reasoning

Authors: Maximilian Mordig, Andreas Opedal, Weiyang Liu, Bernhard Schölkopf
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.27226
Pdf URL: https://arxiv.org/pdf/2603.27226
Copy Paste: [[2603.27226]] Rethinking Easy-to-Hard: Limits of Curriculum Learning in Post-Training for Deductive Reasoning(https://arxiv.org/abs/2603.27226)
Keywords: language model, llm
Abstract: Curriculum learning (CL), motivated by the intuition that learning in increasing order of difficulty should ease generalization, is commonly adopted both in pre-training and post-training of large language models (LLMs). The intuition of CL is particularly compelling for compositional reasoning, where complex problems are built from elementary inference rules; however, the actual impact of CL on such tasks remains largely underexplored. We present a systematic empirical study of CL for post-training of LLMs, using synthetic arithmetic and logical benchmarks where difficulty is characterized by reasoning complexity rather than surface-level proxies. Surprisingly, across multiple model families and curriculum schedules, we find no robust advantage in difficulty-based sequencing over standard random sampling in either accuracy or response length. These findings persist across both supervised fine-tuning (SFT) and reinforcement learning (RL) methods. Our study suggests that, in the context of deductive reasoning, the specific ordering of training examples plays a negligible role in achieving compositional generalization, challenging the practical utility of curriculum-based post-training.
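Summary: Curriculum learning (CL), motivated by the intuition that learning in increasing order of difficulty should ease generalization, is commonly adopted in both pre-training and post-training of large language models (LLMs). The intuition behind CL is particularly compelling for compositional reasoning, where complex problems are built from elementary inference rules; however, the actual impact of CL on such tasks remains largely underexplored. We present a systematic empirical study of CL for post-training of LLMs, using synthetic arithmetic and logical benchmarks in which difficulty is characterized by reasoning complexity rather than surface-level proxies. Surprisingly, across multiple model families and curriculum schedules, we find no robust advantage of difficulty-based sequencing over standard random sampling in either accuracy or response length. These findings hold for both supervised fine-tuning (SFT) and reinforcement learning (RL) methods. Our study suggests that, in the context of deductive reasoning, the specific ordering of training examples plays a negligible role in achieving compositional generalization, which challenges the practical utility of curriculum-based post-training.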

Title: SCOPE: Tree-based Self-Correcting Online Log Parsing via Syntactic-Semantic Collaboration

Authors: Dongyi Fan, Suqiong Zhang, Lili He, Ming Liu, Yifan Huo
Subjects: cs.CL, cs.SE
Abstract URL: https://arxiv.org/abs/2603.27247
Pdf URL: https://arxiv.org/pdf/2603.27247
Copy Paste: [[2603.27247]] SCOPE: Tree-based Self-Correcting Online Log Parsing via Syntactic-Semantic Collaboration(https://arxiv.org/abs/2603.27247)
Keywords: llm
Abstract: Log parsing is a critical step for automated log analysis in complex systems. Traditional heuristic-based methods offer high efficiency but are limited in accuracy due to overlooking semantic context. In contrast, recent LLM-based parsers improve accuracy via semantic understanding but incur high latency from frequent model calls. To address this, we propose SCOPE, the first self-correcting online log parsing method that integrates the strengths of both heuristic and LLM-based paradigms. SCOPE introduces a novel bi-directional tree structure that enables efficient template matching from both forward and reverse directions, resulting in a higher overall matching rate. Additionally, it adopts a two-stage syntactic-semantic collaboration framework: a lightweight NLP model first utilizes part-of-speech (POS) information for syntax-based matching, while the LLM is selectively invoked as a fallback to handle semantically complex cases when uncertainty remains. This design significantly reduces LLM API usage while maintaining high accuracy, achieving a balance between efficiency and effectiveness. Extensive evaluations on diverse benchmark datasets show that SCOPE outperforms state-of-the-art methods in both accuracy and efficiency. The implementation and datasets are publicly released to facilitate further research.
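Summary: Log parsing is a critical step for automated log analysis in complex systems. Traditional heuristic-based methods are highly efficient but limited in accuracy because they overlook semantic context. In contrast, recent LLM-based parsers improve accuracy through semantic understanding but incur high latency from frequent model calls. To address this, we propose SCOPE, the first self-correcting online log parsing method that integrates the strengths of both the heuristic and the LLM-based paradigms. SCOPE introduces a novel bi-directional tree structure that enables efficient template matching in both the forward and reverse directions, yielding a higher overall matching rate. In addition, it adopts a two-stage syntactic-semantic collaboration framework: a lightweight NLP model first uses part-of-speech (POS) information for syntax-based matching, while the LLM is selectively invoked as a fallback to handle semantically complex cases when uncertainty remains. This design significantly reduces LLM API usage while maintaining high accuracy, balancing efficiency and effectiveness. Extensive evaluations on diverse benchmark datasets show that SCOPE outperforms state-of-the-art methods in both accuracy and efficiency. The implementation and datasets are publicly released to facilitate further research.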

Title: Mitigating Hallucination on Hallucination in RAG via Ensemble Voting

Authors: Zequn Xie, Zhengyang Sun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.27253
Pdf URL: https://arxiv.org/pdf/2603.27253
Copy Paste: [[2603.27253]] Mitigating Hallucination on Hallucination in RAG via Ensemble Voting(https://arxiv.org/abs/2603.27253)
Keywords: language model, llm, hallucination, retrieval-augmented generation, agent
Abstract: Retrieval-Augmented Generation (RAG) aims to reduce hallucinations in Large Language Models (LLMs) by integrating external knowledge. However, RAG introduces a critical challenge: "hallucination on hallucination," where flawed retrieval results mislead the generation model, leading to compounded hallucinations. To address this issue, we propose VOTE-RAG, a novel, training-free framework with a two-stage structure and efficient, parallelizable voting mechanisms. VOTE-RAG includes: (1) Retrieval Voting, where multiple agents generate diverse queries in parallel and aggregate all retrieved documents; (2) Response Voting, where multiple agents independently generate answers based on the aggregated documents, with the final output determined by majority vote. We conduct comparative experiments on six benchmark datasets. Our results show that VOTE-RAG achieves performance comparable to or surpassing more complex frameworks. Additionally, VOTE-RAG features a simpler architecture, is fully parallelizable, and avoids the "problem drift" risk. Our work demonstrates that simple, reliable ensemble voting is a superior and more efficient method for mitigating RAG hallucinations.
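Summary: Retrieval-Augmented Generation (RAG) aims to reduce hallucinations in Large Language Models (LLMs) by integrating external knowledge. However, RAG introduces a critical challenge, "hallucination on hallucination," in which flawed retrieval results mislead the generation model and lead to compounded hallucinations. To address this issue, we propose VOTE-RAG, a novel, training-free framework with a two-stage structure and efficient, parallelizable voting mechanisms. VOTE-RAG includes: (1) Retrieval Voting, where multiple agents generate diverse queries in parallel and aggregate all retrieved documents; (2) Response Voting, where multiple agents independently generate answers based on the aggregated documents, with the final output determined by majority vote. Our results show that VOTE-RAG achieves performance comparable to or surpassing more complex frameworks. In addition, VOTE-RAG has a simpler architecture, is fully parallelizable, and avoids the "problem drift" risk. Our work demonstrates that simple, reliable ensemble voting is a superior and more efficient method for mitigating RAG hallucinations.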
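
The Response Voting stage described in the VOTE-RAG abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `majority_vote`, the light answer normalization, and the tie-breaking rule (the earliest answer wins a tie) are assumptions made purely for illustration.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most common answer among independently generated answers.

    Hypothetical helper: the normalization and tie-breaking below are
    illustrative assumptions, not details taken from the paper.
    """
    if not answers:
        raise ValueError("need at least one answer to vote on")
    # Normalize lightly so that "Paris" and " paris " count as the same vote.
    normalized = [a.strip().lower() for a in answers]
    counts = Counter(normalized)
    # Among answers with equal counts, the one encountered first is returned.
    winner, _ = counts.most_common(1)[0]
    return winner


# Example: three hypothetical agents answer the same question.
print(majority_vote(["Paris", "paris ", "Lyon"]))
```

Because each agent's answer is produced independently, the calls that generate `answers` could run in parallel; only this final aggregation step is sequential.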

Title: SACRED: A Faithful Annotated Multimedia Multimodal Multilingual Dataset for Classifying Connectedness Types in Online Spirituality

Authors: Qinghao Guan, Yuchen Pan, Donghao Li, Zishi Zhang, Yiyang Chen, Lu Li, Flaminia Canu, Emilia Volkart, Gerold Schneider
Subjects: cs.CL, cs.MM
Abstract URL: https://arxiv.org/abs/2603.27331
Pdf URL: https://arxiv.org/pdf/2603.27331
Copy Paste: [[2603.27331]] SACRED: A Faithful Annotated Multimedia Multimodal Multilingual Dataset for Classifying Connectedness Types in Online Spirituality(https://arxiv.org/abs/2603.27331)
Keywords: gpt, llm
Abstract: In religion and theology studies, spirituality has garnered significant research attention for the reason that it not only transcends culture but offers unique experience to each individual. However, social scientists often rely on limited datasets, which are basically unavailable online. In this study, we collaborated with social scientists to develop a high-quality multimedia multi-modal dataset, SACRED, in which the faithfulness of classification is guaranteed. Using SACRED, we evaluated the performance of 13 popular LLMs as well as traditional rule-based and fine-tuned approaches. The result suggests DeepSeek-V3 model performs well in classifying such abstract concepts (i.e., 79.19% accuracy in the Quora test set), and the GPT-4o-mini model surpassed the other models in the vision tasks (63.99% F1 score). Purportedly, this is the first annotated multi-modal dataset from online spirituality communication. Our study also found a new type of connectedness which is valuable for communication science studies.
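Summary: In religion and theology studies, spirituality has attracted considerable research attention because it not only transcends culture but also offers a unique experience to each individual. However, social scientists often rely on limited datasets that are largely unavailable online. In this study, we collaborated with social scientists to develop a high-quality multimedia, multi-modal dataset, SACRED, in which the faithfulness of classification is guaranteed. Using SACRED, we evaluated the performance of 13 popular LLMs as well as traditional rule-based and fine-tuned approaches. The results suggest that the DeepSeek-V3 model performs well at classifying such abstract concepts (79.19% accuracy on the Quora test set), while the GPT-4o-mini model surpassed the other models on the vision tasks (63.99% F1 score). Purportedly, this is the first annotated multi-modal dataset from online spirituality communication. Our study also found a new type of connectedness that is valuable for communication science studies.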

Title: PubMed Reasoner: Dynamic Reasoning-based Retrieval for Evidence-Grounded Biomedical Question Answering

Authors: Yiqing Zhang, Xiaozhong Liu, Fabricio Murai
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.27335
Pdf URL: https://arxiv.org/pdf/2603.27335
Copy Paste: [[2603.27335]] PubMed Reasoner: Dynamic Reasoning-based Retrieval for Evidence-Grounded Biomedical Question Answering(https://arxiv.org/abs/2603.27335)
Keywords: gpt, llm, agent
Abstract: Trustworthy biomedical question answering (QA) systems must not only provide accurate answers but also justify them with current, verifiable evidence. Retrieval-augmented approaches partially address this gap but lack mechanisms to iteratively refine poor queries, whereas self-reflection methods kick in only after full retrieval is completed. In this context, we introduce PubMed Reasoner, a biomedical QA agent composed of three stages: self-critic query refinement evaluates MeSH terms for coverage, alignment, and redundancy to enhance PubMed queries based on partial (metadata) retrieval; reflective retrieval processes articles in batches until sufficient evidence is gathered; and evidence-grounded response generation produces answers with explicit citations. PubMed Reasoner with a GPT-4o backbone achieves 78.32% accuracy on PubMedQA, slightly surpassing human experts, and showing consistent gains on MMLU Clinical Knowledge. Moreover, LLM-as-judge evaluations prefer our responses across: reasoning soundness, evidence grounding, clinical relevance, and trustworthiness. By orchestrating retrieval-first reasoning over authoritative sources, our approach provides practical assistance to clinicians and biomedical researchers while controlling compute and token costs.
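Summary: Trustworthy biomedical question answering (QA) systems must not only provide accurate answers but also justify them with current, verifiable evidence. Retrieval-augmented approaches partially address this gap but lack mechanisms for iteratively refining poor queries, whereas self-reflection methods kick in only after full retrieval is completed. In this context, we introduce PubMed Reasoner, a biomedical QA agent composed of three stages: self-critic query refinement evaluates MeSH terms for coverage, alignment, and redundancy to enhance PubMed queries based on partial (metadata) retrieval; reflective retrieval processes articles in batches until sufficient evidence is gathered; and evidence-grounded response generation produces answers with explicit citations. PubMed Reasoner with a GPT-4o backbone achieves 78.32% accuracy on PubMedQA, slightly surpassing human experts, and shows consistent gains on MMLU Clinical Knowledge. Moreover, LLM-as-judge evaluations prefer our responses in terms of reasoning soundness, evidence grounding, clinical relevance, and trustworthiness. By orchestrating retrieval-first reasoning over authoritative sources, our approach provides practical assistance to clinicians and biomedical researchers while controlling compute and token costs.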

Title: Culturally Adaptive Explainable LLM Assessment for Multilingual Information Disorder: A Human-in-the-Loop Approach

Authors: Maziar Kianimoghadam Jouneghani
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2603.27356
Pdf URL: https://arxiv.org/pdf/2603.27356
Copy Paste: [[2603.27356]] Culturally Adaptive Explainable LLM Assessment for Multilingual Information Disorder: A Human-in-the-Loop Approach(https://arxiv.org/abs/2603.27356)
Keywords: language model, llm, prompt
Abstract: Recognizing information disorder is difficult because judgments about manipulation depend on cultural and linguistic context. Yet current Large Language Models (LLMs) often behave as monocultural, English-centric "black boxes," producing fluent rationales that overlook localized framing. Preliminary evidence from the multilingual Information Disorder (InDor) corpus suggests that existing models struggle to explain manipulated news consistently across communities. To address this gap, this ongoing study proposes a Hybrid Intelligence Loop, a human-in-the-loop (HITL) framework that grounds model assessment in human-written rationales from native-speaking annotators. The approach moves beyond static target-language few-shot prompting by pairing English task instructions with dynamically retrieved target-language exemplars drawn from filtered InDor annotations through In-Context Learning (ICL). In the initial pilot, the Exemplar Bank is seeded from these filtered annotations and used to compare static and adaptive prompting on Farsi and Italian news. The study evaluates span and severity prediction, the quality and cultural appropriateness of generated rationales, and model alignment across evaluator groups, providing a testbed for culturally grounded explainable AI.
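Summary: Recognizing information disorder is difficult because judgments about manipulation depend on cultural and linguistic context. Yet current Large Language Models (LLMs) often behave as monocultural, English-centric "black boxes," producing fluent rationales that overlook localized framing. Preliminary evidence from the multilingual Information Disorder (InDor) corpus suggests that existing models struggle to explain manipulated news consistently across communities. To address this gap, this ongoing study proposes a Hybrid Intelligence Loop, a human-in-the-loop (HITL) framework that grounds model assessment in human-written rationales from native-speaking annotators. The approach moves beyond static target-language few-shot prompting by pairing English task instructions with target-language exemplars that are dynamically retrieved from filtered InDor annotations through In-Context Learning (ICL). In the initial pilot, the Exemplar Bank is seeded from these filtered annotations and used to compare static and adaptive prompting on Farsi and Italian news. The study evaluates span and severity prediction, the quality and cultural appropriateness of generated rationales, and model alignment across evaluator groups, providing a testbed for culturally grounded explainable AI.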

Title: Improving Attributed Long-form Question Answering with Intent Awareness

Authors: Xinran Zhao, Aakanksha Naik, Jay DeYoung, Joseph Chee Chang, Jena D. Hwang, Tongshuang Wu, Varsha Kishore
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.27435
Pdf URL: https://arxiv.org/pdf/2603.27435
Copy Paste: [[2603.27435]] Improving Attributed Long-form Question Answering with Intent Awareness(https://arxiv.org/abs/2603.27435)
Keywords: language model, llm
Abstract: Large language models (LLMs) are increasingly being used to generate comprehensive, knowledge-intensive reports. However, while these models are trained on diverse academic papers and reports, they are not exposed to the reasoning processes and intents that guide authors in crafting these documents. We hypothesize that enhancing a model's intent awareness can significantly improve the quality of generated long-form reports. We develop and employ structured, tag-based schemes to better elicit underlying implicit intents to write or cite. We demonstrate that these extracted intents both enhance zero-shot generation capabilities in LLMs and enable the creation of high-quality synthetic data for fine-tuning smaller models. Our experiments reveal improved performance across various challenging scientific report generation tasks, with an average improvement of +2.9 and +12.3 absolute points for large and small models over baselines, respectively. Furthermore, our analysis illuminates how intent awareness enhances model citation usage and substantially improves report readability.
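Summary: Large language models (LLMs) are increasingly used to generate comprehensive, knowledge-intensive reports. However, although these models are trained on diverse academic papers and reports, they are not exposed to the reasoning processes and intents that guide authors in crafting these documents. We hypothesize that enhancing a model's intent awareness can significantly improve the quality of generated long-form reports. We develop and employ structured, tag-based schemes to better elicit the underlying implicit intents to write or cite. We demonstrate that these extracted intents both enhance the zero-shot generation capabilities of LLMs and enable the creation of high-quality synthetic data for fine-tuning smaller models. Our experiments show improved performance across various challenging scientific report generation tasks, with average improvements of +2.9 and +12.3 absolute points over baselines for large and small models, respectively. Furthermore, our analysis illuminates how intent awareness enhances model citation usage and substantially improves report readability.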

Title: Multi-Agent Dialectical Refinement for Enhanced Argument Classification

Authors: Jakub Bąba, Jarosław A. Chudziak
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.27451
Pdf URL: https://arxiv.org/pdf/2603.27451
Copy Paste: [[2603.27451]] Multi-Agent Dialectical Refinement for Enhanced Argument Classification(https://arxiv.org/abs/2603.27451)
Keywords: language model, llm, agent
Abstract: Argument Mining (AM) is a foundational technology for automated writing evaluation, yet traditional supervised approaches rely heavily on expensive, domain-specific fine-tuning. While Large Language Models (LLMs) offer a training-free alternative, they often struggle with structural ambiguity, failing to distinguish between similar components like Claims and Premises. Furthermore, single-agent self-correction mechanisms often suffer from sycophancy, where the model reinforces its own initial errors rather than critically evaluating them. We introduce MAD-ACC (Multi-Agent Debate for Argument Component Classification), a framework that leverages dialectical refinement to resolve classification uncertainty. MAD-ACC utilizes a Proponent-Opponent-Judge model where agents defend conflicting interpretations of ambiguous text, exposing logical nuances that single-agent models miss. Evaluation on the UKP Student Essays corpus demonstrates that MAD-ACC achieves a Macro F1 score of 85.7%, significantly outperforming single-agent reasoning baselines, without requiring domain-specific training. Additionally, unlike "black-box" classifiers, MAD-ACC's dialectical approach offers a transparent and explainable alternative by generating human-readable debate transcripts that explain the reasoning behind decisions.
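Summary: Argument Mining (AM) is a foundational technology for automated writing evaluation, yet traditional supervised approaches rely heavily on expensive, domain-specific fine-tuning. While Large Language Models (LLMs) offer a training-free alternative, they often struggle with structural ambiguity and fail to distinguish between similar components such as Claims and Premises. Furthermore, single-agent self-correction mechanisms often suffer from sycophancy, where the model reinforces its own initial errors rather than critically evaluating them. We introduce MAD-ACC (Multi-Agent Debate for Argument Component Classification), a framework that leverages dialectical refinement to resolve classification uncertainty. MAD-ACC uses a Proponent-Opponent-Judge model in which agents defend conflicting interpretations of ambiguous text, exposing logical nuances that single-agent models miss. Evaluation on the UKP Student Essays corpus shows that MAD-ACC achieves a Macro F1 score of 85.7%, significantly outperforming single-agent reasoning baselines without requiring domain-specific training. In addition, unlike "black-box" classifiers, MAD-ACC's dialectical approach offers a transparent and explainable alternative by generating human-readable debate transcripts that explain the reasoning behind decisions.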

Title: AgentSwing: Adaptive Parallel Context Management Routing for Long-Horizon Web Agents

Authors: Zhaopeng Feng, Liangcai Su, Zhen Zhang, Xinyu Wang, Xiaotian Zhang, Xiaobin Wang, Runnan Fang, Qi Zhang, Baixuan Li, Shihao Cai, Rui Ye, Hui Chen, Jiang Yong, Joey Tianyi Zhou, Chenxiong Qian, Pengjun Xie, Bryan Hooi, Zuozhu Liu, Jingren Zhou
Subjects: cs.CL, cs.AI, cs.MA
Abstract URL: https://arxiv.org/abs/2603.27490
Pdf URL: https://arxiv.org/pdf/2603.27490
Copy Paste: [[2603.27490]] AgentSwing: Adaptive Parallel Context Management Routing for Long-Horizon Web Agents(https://arxiv.org/abs/2603.27490)
Keywords: language model, llm, agent
Abstract: As large language models (LLMs) evolve into autonomous agents for long-horizon information-seeking, managing finite context capacity has become a critical bottleneck. Existing context management methods typically commit to a single fixed strategy throughout the entire trajectory. Such static designs may work well in some states, but they cannot adapt as the usefulness and reliability of the accumulated context evolve during long-horizon search. To formalize this challenge, we introduce a probabilistic framework that characterizes long-horizon success through two complementary dimensions: search efficiency and terminal precision. Building on this perspective, we propose AgentSwing, a state-aware adaptive parallel context management routing framework. At each trigger point, AgentSwing expands multiple context-managed branches in parallel and uses lookahead routing to select the most promising continuation. Experiments across diverse benchmarks and agent backbones show that AgentSwing consistently outperforms strong static context management methods, often matching or exceeding their performance with up to 3× fewer interaction turns while also improving the ultimate performance ceiling of long-horizon web agents. Beyond the empirical gains, the proposed probabilistic framework provides a principled lens for analyzing and designing future context management strategies for long-horizon agents.
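Summary: As large language models (LLMs) evolve into autonomous agents for long-horizon information seeking, managing finite context capacity has become a critical bottleneck. Existing context management methods typically commit to a single fixed strategy throughout the entire trajectory. Such static designs may work well in some states, but they cannot adapt as the usefulness and reliability of the accumulated context change during long-horizon search. To formalize this challenge, we introduce a probabilistic framework that characterizes long-horizon success along two complementary dimensions: search efficiency and terminal precision. Building on this perspective, we propose AgentSwing, a state-aware adaptive parallel context management routing framework. At each trigger point, AgentSwing expands multiple context-managed branches in parallel and uses lookahead routing to select the most promising continuation. Experiments across diverse benchmarks and agent backbones show that AgentSwing consistently outperforms strong static context management methods, often matching or exceeding their performance with up to 3× fewer interaction turns, while also raising the ultimate performance ceiling of long-horizon web agents. Beyond the empirical gains, the proposed probabilistic framework provides a principled lens for analyzing and designing future context management strategies for long-horizon agents.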

Title: Over-Refusal and Representation Subspaces: A Mechanistic Analysis of Task-Conditioned Refusal in Aligned LLMs

Authors: Utsav Maskey, Mark Dras, Usman Naseem
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.27518
Pdf URL: https://arxiv.org/pdf/2603.27518
Copy Paste: [[2603.27518]] Over-Refusal and Representation Subspaces: A Mechanistic Analysis of Task-Conditioned Refusal in Aligned LLMs(https://arxiv.org/abs/2603.27518)
Keywords: language model, llm
Abstract: Aligned language models that are trained to refuse harmful requests also exhibit over-refusal: they decline safe instructions that seemingly resemble harmful instructions. A natural approach is to ablate the global refusal direction, steering the hidden-state vectors away from or towards the harmful-refusal examples, but this corrects over-refusal only incidentally while disrupting the broader refusal mechanism. In this work, we analyse the representational geometry of both refusal types to understand why this happens. We show that harmful-refusal directions are task-agnostic and can be captured by a single global vector, whereas over-refusal directions are task-dependent: they reside within the benign task-representation clusters, vary across tasks, and span a higher-dimensional subspace. Linear probing confirms that the two refusal types are representationally distinct from the early transformer layers. These findings provide a mechanistic explanation of why global direction ablation alone cannot address over-refusal, and establish that task-specific geometric interventions are necessary.
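Summary: Aligned language models that are trained to refuse harmful requests also exhibit over-refusal: they decline safe instructions that superficially resemble harmful ones. A natural approach is to ablate the global refusal direction, steering the hidden-state vectors away from or towards the harmful-refusal examples, but this corrects over-refusal only incidentally while disrupting the broader refusal mechanism. In this work, we analyse the representational geometry of both refusal types to understand why this happens. We show that harmful-refusal directions are task-agnostic and can be captured by a single global vector, whereas over-refusal directions are task-dependent: they reside within the benign task-representation clusters, vary across tasks, and span a higher-dimensional subspace. Linear probing confirms that the two refusal types are representationally distinct from the early transformer layers onward. These findings provide a mechanistic explanation of why global direction ablation alone cannot address over-refusal, and they establish that task-specific geometric interventions are necessary.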

Title: Hidden Ads: Behavior Triggered Semantic Backdoors for Advertisement Injection in Vision Language Models

Authors: Duanyi Yao, Changyue Li, Zhicong Huang, Cheng Hong, Songze Li
Subjects: cs.CL, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2603.27522
Pdf URL: https://arxiv.org/pdf/2603.27522
Copy Paste: [[2603.27522]] Hidden Ads: Behavior Triggered Semantic Backdoors for Advertisement Injection in Vision Language Models(https://arxiv.org/abs/2603.27522)
Keywords: language model, prompt, chain-of-thought
Abstract: Vision-Language Models (VLMs) are increasingly deployed in consumer applications where users seek recommendations about products, dining, and services. We introduce Hidden Ads, a new class of backdoor attacks that exploit this recommendation-seeking behavior to inject unauthorized advertisements. Unlike traditional pattern-triggered backdoors that rely on artificial triggers such as pixel patches or special tokens, Hidden Ads activates on natural user behaviors: when users upload images containing semantic content of interest (e.g., food, cars, animals) and ask recommendation-seeking questions, the backdoored model provides correct, helpful answers while seamlessly appending attacker-specified promotional slogans. This design preserves model utility and produces natural-sounding injections, making the attack practical for real-world deployment in consumer-facing recommendation services. We propose a multi-tier threat framework to systematically evaluate Hidden Ads across three adversary capability levels: hard prompt injection, soft prompt optimization, and supervised fine-tuning. Our poisoned data generation pipeline uses teacher VLM-generated chain-of-thought reasoning to create natural trigger--slogan associations across multiple semantic domains. Experiments on three VLM architectures demonstrate that Hidden Ads achieves high injection efficacy with near-zero false positives while maintaining task accuracy. Ablation studies confirm that the attack is data-efficient, transfers effectively to unseen datasets, and scales to multiple concurrent domain-slogan pairs. We evaluate defenses including instruction-based filtering and clean fine-tuning, finding that both fail to remove the backdoor without causing significant utility degradation.
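Summary: Vision-Language Models (VLMs) are increasingly deployed in consumer applications where users seek recommendations about products, dining, and services. We introduce Hidden Ads, a new class of backdoor attacks that exploit this recommendation-seeking behavior to inject unauthorized advertisements. Unlike traditional pattern-triggered backdoors that rely on artificial triggers such as pixel patches or special tokens, Hidden Ads activates on natural user behaviors: when users upload images containing semantic content of interest (e.g., food, cars, animals) and ask recommendation-seeking questions, the backdoored model provides correct, helpful answers while seamlessly appending attacker-specified promotional slogans. This design preserves model utility and produces natural-sounding injections, making the attack practical for real-world deployment in consumer-facing recommendation services. We propose a multi-tier threat framework to systematically evaluate Hidden Ads across three adversary capability levels: hard prompt injection, soft prompt optimization, and supervised fine-tuning. Our poisoned data generation pipeline uses chain-of-thought reasoning generated by a teacher VLM to create natural trigger-slogan associations across multiple semantic domains. Experiments on three VLM architectures demonstrate that Hidden Ads achieves high injection efficacy with near-zero false positives while maintaining task accuracy. Ablation studies confirm that the attack is data-efficient, transfers effectively to unseen datasets, and scales to multiple concurrent domain-slogan pairs. We evaluate defenses including instruction-based filtering and clean fine-tuning, and find that both fail to remove the backdoor without causing significant utility degradation.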

Title: Umwelt Engineering: Designing the Cognitive Worlds of Linguistic Agents

Authors: Rodney Jehu-Appiah
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.27626
Pdf URL: https://arxiv.org/pdf/2603.27626
Copy Paste: [[2603.27626]] Umwelt Engineering: Designing the Cognitive Worlds of Linguistic Agents(https://arxiv.org/abs/2603.27626)
Keywords: language model, prompt, agent
Abstract: I propose Umwelt engineering -- the deliberate design of the linguistic cognitive environment -- as a third layer in the agent design stack, upstream of both prompt and context engineering. Two experiments test the thesis that altering the medium of reasoning alters cognition itself. In Experiment 1, three language models reason under two vocabulary constraints -- No-Have (eliminating possessive "to have") and E-Prime (eliminating "to be") -- across seven tasks (N=4,470 trials). No-Have improves ethical reasoning by 19.1 pp (p < 0.001), classification by 6.5 pp (p < 0.001), and epistemic calibration by 7.4 pp, while achieving 92.8% constraint compliance. E-Prime shows dramatic but model-dependent effects: cross-model correlations reach r = -0.75. In Experiment 2, 16 linguistically constrained agents tackle 17 debugging problems. No constrained agent outperforms the control individually, yet a 3-agent ensemble achieves 100% ground-truth coverage versus 88.2% for the control. A permutation test confirms only 8% of random 3-agent subsets achieve full coverage, and every successful subset contains the counterfactual agent. Two mechanisms emerge: cognitive restructuring and cognitive diversification. The primary limitation is the absence of an active control matching constraint prompt elaborateness.
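Summary: I propose Umwelt engineering, the deliberate design of the linguistic cognitive environment, as a third layer in the agent design stack, upstream of both prompt engineering and context engineering. Two experiments test the thesis that altering the medium of reasoning alters cognition itself. In Experiment 1, three language models reason under two vocabulary constraints, No-Have (eliminating possessive "to have") and E-Prime (eliminating "to be"), across seven tasks (N=4,470 trials). No-Have improves ethical reasoning by 19.1 pp (p < 0.001), classification by 6.5 pp (p < 0.001), and epistemic calibration by 7.4 pp, while achieving 92.8% constraint compliance. E-Prime shows dramatic but model-dependent effects: cross-model correlations reach r = -0.75. In Experiment 2, 16 linguistically constrained agents tackle 17 debugging problems. No constrained agent outperforms the control individually, yet a 3-agent ensemble achieves 100% ground-truth coverage versus 88.2% for the control. A permutation test confirms that only 8% of random 3-agent subsets achieve full coverage, and every successful subset contains the counterfactual agent. Two mechanisms emerge: cognitive restructuring and cognitive diversification. The primary limitation is the absence of an active control that matches the elaborateness of the constraint prompts.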
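
The ensemble-coverage statistic mentioned in the Umwelt Engineering abstract (the share of 3-agent subsets that jointly cover every ground-truth item) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function `full_coverage_fraction`, the agent names, and the toy data are hypothetical, and the sketch enumerates all subsets exhaustively, which may differ from the sampling procedure the author actually used.

```python
from itertools import combinations


def full_coverage_fraction(found_by_agent, all_items, k=3):
    """Fraction of k-agent subsets whose combined findings cover all_items.

    found_by_agent maps a (hypothetical) agent name to the set of
    ground-truth items that agent identified.
    """
    agents = sorted(found_by_agent)
    subsets = list(combinations(agents, k))
    if not subsets:
        raise ValueError("need at least k agents")
    target = set(all_items)
    # A subset "covers" the target when the union of its agents' findings
    # contains every ground-truth item.
    hits = sum(
        1
        for subset in subsets
        if target <= set().union(*(found_by_agent[a] for a in subset))
    )
    return hits / len(subsets)


# Toy data, invented for illustration only.
toy = {"A": {1}, "B": {2}, "C": {3}, "D": {1, 2}}
print(full_coverage_fraction(toy, {1, 2, 3}, k=3))
```

In the toy data, three of the four possible 3-agent subsets cover all items; only the subset A, B, D misses item 3.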

Title: PRBench: End-to-end Paper Reproduction in Physics Research

Authors: Shi Qiu, Junyi Deng, Yiwei Deng, Haoran Dong, Jieyu Fu, Mao Li, Zeyu Li, Zhaolong Zhang, Huiwen Zheng, Leidong Bao, Anqi Lv, Zihan Mo, Yadi Niu, Yiyang Peng, Yu Tian, Yili Wang, Ziyu Wang, Zi-Yu Wang, Jiashen Wei, Liuheng Wu, Aoran Xue, Leyi Yang, Guanglu Yuan, Xiarui Zhan, Jingjun Zhang, Zifan Zheng, Pengfei Liu, Linrui Zhen, Kaiyang Li, Qichang Li, Ziheng Zhou, Guo-En Nian, Yunwei Xiao, Qing-Hong Cao, Linjie Dai, Xu Feng, Peng Gao, Ying Gu, Chang Liu, Jia Liu, Ming-xing Luo, Yan-Qing Ma, Liang-You Peng, Huichao Song, Shufeng Wang, Chenxu Wang, Tao Wang, Yi-Nan Wang, Chengyin Wu, Pengwei Zhao, Hua Xing Zhu
Subjects: cs.CL, hep-lat, hep-ph, physics.comp-ph, physics.optics
Abstract URL: https://arxiv.org/abs/2603.27646
Pdf URL: https://arxiv.org/pdf/2603.27646
Copy Paste: [[2603.27646]] PRBench: End-to-end Paper Reproduction in Physics Research(https://arxiv.org/abs/2603.27646)
Keywords: language model, gpt, agent
Abstract: AI agents powered by large language models exhibit strong reasoning and problem-solving capabilities, enabling them to assist scientific research tasks such as formula derivation and code generation. However, whether these agents can reliably perform end-to-end reproduction from real scientific papers remains an open question. We introduce PRBench, a benchmark of 30 expert-curated tasks spanning 11 subfields of physics. Each task requires an agent to comprehend the methodology of a published paper, implement the corresponding algorithms from scratch, and produce quantitative results matching the original publication. Agents are provided only with the task instruction and paper content, and operate in a sandboxed execution environment. All tasks are contributed by domain experts from over 20 research groups at the School of Physics, Peking University, each grounded in a real published paper and validated through end-to-end reproduction with verified ground-truth results and detailed scoring rubrics. Using an agentified assessment pipeline, we evaluate a set of coding agents on PRBench and analyze their capabilities across key dimensions of scientific reasoning and execution. The best-performing agent, OpenAI Codex powered by GPT-5.3-Codex, achieves a mean overall score of 34%. All agents exhibit a zero end-to-end callback success rate, with particularly poor performance in data accuracy and code correctness. We further identify systematic failure modes, including errors in formula implementation, inability to debug numerical simulations, and fabrication of output data. Overall, PRBench provides a rigorous benchmark for evaluating progress toward autonomous scientific research.
摘要：由大型语言模型驱动的人工智能代理表现出强大的推理和解决问题的能力，使其能够协助公式推导和代码生成等科学研究任务。然而，这些代理是否能够可靠地从真实的科学论文中执行端到端复制仍然是一个悬而未决的问题。我们推出 PRBench，这是涵盖 11 个物理子领域的 30 项专家策划任务的基准。每项任务都需要代理理解已发表论文的方法，从头开始实现相应的算法，并产生与原始出版物相匹配的定量结果。代理仅获得任务指令和论文内容，并在沙盒执行环境中运行。所有任务均由来自北京大学物理学院 20 多个研究小组的领域专家贡献，每个任务均以真实发表的论文为基础，并通过端到端复制、经过验证的真实结果和详细评分标准进行验证。使用代理评估管道，我们在 PRBench 上评估一组编码代理，并分析它们在科学推理和执行的关键维度上的能力。表现最好的代理是由 GPT-5.3-Codex 提供支持的 OpenAI Codex，平均总分达到 34%。所有代理的端到端回调成功率为零，在数据准确性和代码正确性方面表现尤其差。我们进一步识别系统故障模式，包括公式实现中的错误、无法调试数值模拟以及输出数据的伪造。总体而言，PRBench 为评估自主科学研究的进展提供了严格的基准。

Title: Investigating the Influence of Language on Sycophantic Behavior of Multilingual LLMs

Authors: Bayan Abdullah Aldahlawi, A. B. M. Ashikur Rahman, Irfan Ahmad
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.27664
Pdf URL: https://arxiv.org/pdf/2603.27664
Copy Paste: [[2603.27664]] Investigating the Influence of Language on Sycophantic Behavior of Multilingual LLMs(https://arxiv.org/abs/2603.27664)
Keywords: language model, gpt, llm, prompt, chat
Abstract: Large language models (LLMs) have achieved strong performance across a wide range of tasks, but they are also prone to sycophancy, the tendency to agree with user statements regardless of validity. Previous research has outlined both the extent and the underlying causes of sycophancy in earlier models, such as ChatGPT-3.5 and Davinci. Newer models have since undergone multiple mitigation strategies, yet there remains a critical need to systematically test their behavior. In particular, the effect of language on sycophancy has not been explored. In this work, we investigate how the language influences sycophantic responses. We evaluate three state-of-the-art models, GPT-4o mini, Gemini 1.5 Flash, and Claude 3.5 Haiku, using a set of tweet-like opinion prompts translated into five additional languages: Arabic, Chinese, French, Spanish, and Portuguese. Our results show that although newer models exhibit significantly less sycophancy overall compared to earlier generations, the extent of sycophancy is still influenced by the language. We further provide a granular analysis of how language shapes model agreeableness across sensitive topics, revealing systematic cultural and linguistic patterns. These findings highlight both the progress of mitigation efforts and the need for broader multilingual audits to ensure trustworthy and bias-aware deployment of LLMs.
摘要：大型语言模型 (LLM) 在广泛的任务中取得了出色的性能，但它们也容易阿谀奉承，无论有效性如何，都倾向于同意用户的陈述。之前的研究已经概述了早期模型（例如 ChatGPT-3.5 和 Davinci）中阿谀奉承的程度和根本原因。此后，较新的模型经历了多种缓解策略，但仍然迫切需要系统地测试其行为。特别是，语言对阿谀奉承的影响尚未被探讨。在这项工作中，我们研究了语言如何影响阿谀奉承的反应。我们使用一组类似推文的意见提示，翻译成另外五种语言：阿拉伯语、中文、法语、西班牙语和葡萄牙语，评估了三种最先进的模型：GPT-4o mini、Gemini 1.5 Flash 和 Claude 3.5 Haiku。我们的结果表明，尽管与前几代相比，新模型总体上表现出明显更少的阿谀奉承，但阿谀奉承的程度仍然受到语言的影响。我们进一步对语言如何塑造敏感主题的宜人性模型进行了细致的分析，揭示了系统的文化和语言模式。这些发现凸显了缓解工作的进展以及更广泛的多语言审计的必要性，以确保法学硕士的部署值得信赖且具有偏见意识。

Title: Can Large Language Models Simulate Human Cognition Beyond Behavioral Imitation?

Authors: Yuxuan Gu, Lunjun Liu, Xiaocheng Feng, Kun Zhu, Weihong Zhong, Lei Huang, Bing Qin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.27694
Pdf URL: https://arxiv.org/pdf/2603.27694
Copy Paste: [[2603.27694]] Can Large Language Models Simulate Human Cognition Beyond Behavioral Imitation?(https://arxiv.org/abs/2603.27694)
Keywords: language model, llm
Abstract: An essential problem in artificial intelligence is whether LLMs can simulate human cognition or merely imitate surface-level behaviors, while existing datasets suffer from either synthetic reasoning traces or population-level aggregation, failing to capture authentic individual cognitive patterns. We introduce a benchmark grounded in the longitudinal research trajectories of 217 researchers across diverse domains of artificial intelligence, where each author's scientific publications serve as an externalized representation of their cognitive processes. To distinguish whether LLMs transfer cognitive patterns or merely imitate behaviors, our benchmark deliberately employs a cross-domain, temporal-shift generalization setting. A multidimensional cognitive alignment metric is further proposed to assess individual-level cognitive consistency. Through systematic evaluation of state-of-the-art LLMs and various enhancement techniques, we provide a first-stage empirical study on the questions: (1) How well do current LLMs simulate human cognition? and (2) How far can existing techniques enhance these capabilities?
摘要：人工智能的一个重要问题是法学硕士是否可以模拟人类认知或仅仅模仿表面行为，而现有数据集要么受到合成推理痕迹的影响，要么受到群体层面聚合的影响，无法捕捉真实的个体认知模式。我们引入了一个基于人工智能不同领域 217 名研究人员的纵向研究轨迹的基准，其中每位作者的科学出版物作为其认知过程的外化表示。为了区分法学硕士是转移认知模式还是仅仅模仿行为，我们的基准特意采用了跨领域、时间转移的泛化设置。进一步提出了多维认知一致性度量来评估个体层面的认知一致性。通过对最先进的法学硕士和各种增强技术的系统评估，我们对以下问题进行了第一阶段的实证研究：（1）当前的法学硕士模拟人类认知的程度如何？ (2) 现有技术可以在多大程度上增强这些能力？

Title: KAT-Coder-V2 Technical Report

Authors: Fengxiang Li, Han Zhang, Haoyang Huang, Jinghui Wang, Jinhua Hao, Kun Yuan, Mengtong Li, Minglei Zhang, Pengcheng Xu, Wenhao Zhuang, Yizhen Shao, Zongxian Feng, Can Tang, Chao Wang, Chengxiao Tong, Fan Yang, Gang Xiong, Haixuan Gao, Han Gao, Hao Wang, Haochen Liu, Hongliang Sun, Jiabao Li, Jingwen Chang, Jun Du, Junyi Peng, Leizhen Cui, Meimei Jing, Mingqi Wu, Shangpeng Yan, Shaotong Qi, Suzhe Xu, Wenxuan Zhao, Xianda Sun, Xuan Xie, Yanbo Wang, Yao Xia, Yinghan Cui, Yingpeng Chen, Yong Wang, Yuze Shi, Zhiwei Shen, Ziyu Wang, Ming Sun, Lin Ye, Bin Chen
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2603.27703
Pdf URL: https://arxiv.org/pdf/2603.27703
Copy Paste: [[2603.27703]] KAT-Coder-V2 Technical Report(https://arxiv.org/abs/2603.27703)
Keywords: agent
Abstract: We present KAT-Coder-V2, an agentic coding model developed by the KwaiKAT team at Kuaishou. KAT-Coder-V2 adopts a "Specialize-then-Unify" paradigm that decomposes agentic coding into five expert domains - SWE, WebCoding, Terminal, WebSearch, and General - each undergoing independent supervised fine-tuning and reinforcement learning, before being consolidated into a single model via on-policy distillation. We develop KwaiEnv, a modular infrastructure sustaining tens of thousands of concurrent sandbox instances, and scale RL training along task complexity, intent alignment, and scaffold generalization. We further propose MCLA for stabilizing MoE RL training and Tree Training for eliminating redundant computation over tree-structured trajectories with up to 6.2x speedup. KAT-Coder-V2 achieves 79.6% on SWE-bench Verified (vs. Claude Opus 4.6 at 80.8%), 88.7 on PinchBench (surpassing GLM-5 and MiniMax M2.7), ranks first across all three frontend aesthetics scenarios, and maintains strong generalist scores on Terminal-Bench Hard (46.8) and tau^2-Bench (93.9). Our model is publicly available at this https URL.
摘要：我们推出 KAT-Coder-V2，这是快手 KwaiKAT 团队开发的代理编码模型。 KAT-Coder-V2 采用“专业化然后统一”的范例，将代理编码分解为五个专家领域 - SWE、WebCoding、Terminal、WebSearch 和 General - 每个领域都经过独立的监督微调和强化学习，然后通过策略蒸馏合并为单个模型。我们开发了 KwaiEnv，这是一个模块化基础设施，可维持数以万计的并发沙箱实例，并根据任务复杂性、意图对齐和脚手架泛化来扩展 RL 训练。我们进一步提出 MCLA 来稳定 MoE RL 训练，并提出树训练来消除树结构轨迹上的冗余计算，加速高达 6.2 倍。 KAT-Coder-V2 在 SWE-bench Verified 上达到 79.6%（相对于 Claude Opus 4.6 的 80.8%），在 PinchBench 上达到 88.7（超越 GLM-5 和 MiniMax M2.7），在所有三个前端美学场景中排名第一，并在 Terminal-Bench Hard (46.8) 和 tau^2-Bench 上保持强大的通才分数（93.9）。我们的模型可通过此 https URL 公开获得。

Title: Retromorphic Testing with Hierarchical Verification for Hallucination Detection in RAG

Authors: Boxi Yu, Yuzhong Zhang, Liting Lin, Lionel Briand, Emir Muñoz
Subjects: cs.CL, cs.SE
Abstract URL: https://arxiv.org/abs/2603.27752
Pdf URL: https://arxiv.org/pdf/2603.27752
Copy Paste: [[2603.27752]] Retromorphic Testing with Hierarchical Verification for Hallucination Detection in RAG(https://arxiv.org/abs/2603.27752)
Keywords: language model, llm, hallucination, retrieval-augmented generation
Abstract: Large language models (LLMs) continue to hallucinate in retrieval-augmented generation (RAG), producing claims that are unsupported by or conflict with the retrieved context. Detecting such errors remains challenging when faithfulness is evaluated solely with respect to the retrieved context. Existing approaches either provide coarse-grained, answer-level scores or focus on open-domain factuality, often lacking fine-grained, evidence-grounded diagnostics. We present RT4CHART, a retromorphic testing framework for context-faithfulness assessment. RT4CHART decomposes model outputs into independently verifiable claims and performs hierarchical, local-to-global verification against the retrieved context. Each claim is assigned one of three labels: entailed, contradicted, or baseless. Furthermore, RT4CHART maps claim-level decisions back to specific answer spans and retrieves explicit supporting or refuting evidence from the context, enabling fine-grained and interpretable auditing. We evaluate RT4CHART on RAGTruth++ (408 samples) and RAGTruth-Enhance (2,675 samples), a newly re-annotated benchmark. RT4CHART achieves the best answer-level hallucination detection F1 among all baselines. On RAGTruth++, it reaches an F1 score of 0.776, outperforming the strongest baseline by 83%. On RAGTruth-Enhance, it achieves a span-level F1 of 47.5%. Ablation studies show that the hierarchical verification design is the primary driver of performance gains. Finally, our re-annotation reveals 1.68x more hallucination cases than the original labels, suggesting that existing benchmarks substantially underestimate the prevalence of hallucinations.
摘要：大型语言模型（LLM）继续在检索增强生成（RAG）中产生幻觉，产生不受检索上下文支持或与之冲突的主张。当仅根据检索到的上下文来评估忠实度时，检测此类错误仍然具有挑战性。现有的方法要么提供粗粒度的、答案级别的分数，要么专注于开放领域的事实性，通常缺乏细粒度的、基于证据的诊断。我们提出了 RT4CHART，一个用于上下文忠实度评估的逆向测试框架。 RT4CHART 将模型输出分解为可独立验证的声明，并针对检索到的上下文执行分层、本地到全局的验证。每个主张都被分配三个标签之一：必然的、矛盾的或毫无根据的。此外，RT4CHART 将声明级决策映射回特定的答案范围，并从上下文中检索明确的支持或反驳证据，从而实现细粒度和可解释的审核。我们在 RAGTruth++（408 个样本）和 RAGTruth-Enhance（2,675 个样本）（新近重新注释的基准）上评估 RT4CHART。 RT4CHART 在所有基线中实现了最佳答案级幻觉检测 F1。在 RAGTruth++ 上，它的 F1 分数达到 0.776，比最强基线高出 83%。在 RAGTruth-Enhance 上，它实现了 47.5% 的跨度级别 F1。消融研究表明，分层验证设计是性能提升的主要驱动力。最后，我们的重新注释揭示了幻觉病例比原始标签多 1.68 倍，这表明现有基准大大低估了幻觉的发生率。
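
Illustrative sketch (not from the paper): the abstract describes assigning each claim one of three labels and then reporting answer-level hallucination detection F1. The Python below shows one way such a roll-up and score could be computed. The aggregation rule (an answer is flagged when any claim is not entailed), the function names, and the toy data are all assumptions; the abstract does not state how RT4CHART aggregates claim decisions.

```python
from typing import List

# The three claim labels named in the abstract.
LABELS = {"entailed", "contradicted", "baseless"}


def answer_is_hallucinated(claim_labels: List[str]) -> bool:
    """Assumed roll-up rule: flag the answer if any claim is not entailed."""
    for label in claim_labels:
        if label not in LABELS:
            raise ValueError(f"unknown label: {label}")
    return any(label != "entailed" for label in claim_labels)


def f1_score(predicted: List[bool], gold: List[bool]) -> float:
    """Binary F1 where True means 'hallucinated'."""
    tp = sum(p and g for p, g in zip(predicted, gold))
    fp = sum(p and not g for p, g in zip(predicted, gold))
    fn = sum(g and not p for p, g in zip(predicted, gold))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data (hypothetical): claim labels for four answers plus gold flags.
answers = [
    ["entailed", "entailed"],
    ["entailed", "baseless"],
    ["contradicted"],
    ["entailed"],
]
gold = [False, True, True, True]

predicted = [answer_is_hallucinated(labels) for labels in answers]
score = f1_score(predicted, gold)
print(predicted, round(score, 3))
```

In this toy run the last answer is a miss: every extracted claim is entailed, yet the gold label marks it as hallucinated, so recall drops while precision stays perfect.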

Title: TailNLG: A Multilingual Benchmark Addressing Verbalization of Long-Tail Entities

Authors: Lia Draetta, Michael Oliverio, Virginia Ramón-Ferrer, Pier Felice Balestrucci, Flaviana Corallo, Carlos Badenes-Olmedo, Alessandro Mazzei, Marco Antonio Stranisci, Rossana Damiano
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.27768
Pdf URL: https://arxiv.org/pdf/2603.27768
Copy Paste: [[2603.27768]] TailNLG: A Multilingual Benchmark Addressing Verbalization of Long-Tail Entities(https://arxiv.org/abs/2603.27768)
Keywords: language model, retrieval-augmented generation
Abstract: The automatic verbalization of structured knowledge is a key task for making knowledge graphs accessible to non-expert users and supporting retrieval-augmented generation systems. Although recent advances in Data-to-Text generation have improved multilingual coverage, little attention has been paid to potential biases in the verbalization of rare entities, frequently known as long-tail entities. In this work, we present the first systematic study of long-tail entities in Data-to-Text generation. We introduce TailNLG, a new multilingual benchmark in English, Italian, and Spanish, built from Wikidata and covering entities with varying levels of popularity. We evaluate three different families of large language models in zero-shot settings and compare their performance on rare versus common entities, as well as against the established WebNLG benchmark. Our results reveal a consistent bias against long-tail entities: embedding-based scores are lower, and model uncertainty is higher for rare entities. We further show that the impact of long-tail entities varies across models and languages, and that existing evaluation metrics do not consistently capture these differences, highlighting the need for more reliable evaluation frameworks.
摘要：结构化知识的自动语言化是使非专家用户可以访问知识图并支持检索增强生成系统的关键任务。尽管数据到文本生成的最新进展提高了多语言覆盖率，但很少有人关注罕见实体（通常称为长尾实体）的语言表达中的潜在偏差。在这项工作中，我们首次对数据到文本生成中的长尾实体进行系统研究。我们推出了 TailNLG，这是一种新的英语、意大利语和西班牙语多语言基准，基于维基数据构建，涵盖不同受欢迎程度的实体。我们在零样本设置中评估了三个不同系列的大型语言模型，并比较了它们在罕见实体和常见实体上的性能，以及与已建立的 WebNLG 基准的性能。我们的结果揭示了对长尾实体的一致偏见：基于嵌入的分数较低，而稀有实体的模型不确定性较高。我们进一步表明，长尾实体的影响因模型和语言的不同而不同，并且现有的评估指标并不能始终如一地捕捉这些差异，这凸显了对更可靠的评估框架的需求。

Title: Understanding Teacher Revisions of Large Language Model-Generated Feedback

Authors: Conrad Borchers, Luiz Rodrigues, Newarney Torrezão da Costa, Cleon Xavier, Rafael Ferreira Mello
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2603.27806
Pdf URL: https://arxiv.org/pdf/2603.27806
Copy Paste: [[2603.27806]] Understanding Teacher Revisions of Large Language Model-Generated Feedback(https://arxiv.org/abs/2603.27806)
Keywords: language model, llm
Abstract: Large language models (LLMs) increasingly generate formative feedback for students, yet little is known about how teachers revise this feedback before it reaches learners. Teachers' revisions shape what students receive, making revision practices central to evaluating AI classroom tools. We analyze a dataset of 1,349 instances of AI-generated feedback and corresponding teacher-edited explanations from 117 teachers. We examine (i) textual characteristics associated with teacher revisions, (ii) whether revision decisions can be predicted from the AI feedback text, and (iii) how revisions change the pedagogical type of feedback delivered. First, we find that teachers accept AI feedback without modification in about 80% of cases, while edited feedback tends to be significantly longer and subsequently shortened by teachers. Editing behavior varies substantially across teachers: about 50% never edit AI feedback, and only about 10% edit more than two-thirds of feedback instances. Second, machine learning models trained only on the AI feedback text as input features, using sentence embeddings, achieve fair performance in identifying which feedback will be revised (AUC=0.75). Third, qualitative coding shows that when revisions occur, teachers often simplify AI-generated feedback, shifting it away from high-information explanations toward more concise, corrective forms. Together, these findings characterize how teachers engage with AI-generated feedback in practice and highlight opportunities to design feedback systems that better align with teacher priorities while reducing unnecessary editing effort.
摘要：大型语言模型（LLM）越来越多地为学生生成形成性反馈，但人们对教师如何在反馈到达学习者之前修改这些反馈知之甚少。教师的复习决定了学生收到的内容，使复习实践成为评估人工智能课堂工具的核心。我们分析了包含 1,349 个 AI 生成反馈实例的数据集以及来自 117 名教师的相应教师编辑解释。我们研究（i）与教师修订相关的文本特征，（ii）是否可以从人工智能反馈文本中预测修订决策，以及（iii）修订如何改变所提供的反馈的教学类型。首先，我们发现在大约 80% 的情况下，教师会不加修改地接受人工智能反馈，而经过编辑的反馈往往会明显更长，随后教师会缩短。不同教师的编辑行为差异很大：约 50% 的教师从不编辑 AI 反馈，只有约 10% 的教师编辑超过三分之二的反馈实例。其次，仅以 AI 反馈文本作为输入特征进行训练的机器学习模型，使用句子嵌入，在识别哪些反馈将被修改方面实现了公平的性能 (AUC=0.75)。第三，定性编码表明，当发生修改时，教师通常会简化人工智能生成的反馈，将其从高信息解释转向更简洁、更纠正的形式。总之，这些发现描述了教师在实践中如何参与人工智能生成的反馈，并强调了设计反馈系统的机会，以更好地符合教师的优先事项，同时减少不必要的编辑工作。

Title: Conversational Agents and the Understanding of Human Language: Reflections on AI, LLMs, and Cognitive Science

Authors: Andrei Popescu-Belis
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.27809
Pdf URL: https://arxiv.org/pdf/2603.27809
Copy Paste: [[2603.27809]] Conversational Agents and the Understanding of Human Language: Reflections on AI, LLMs, and Cognitive Science(https://arxiv.org/abs/2603.27809)
Keywords: language model, llm, chat, agent
Abstract: In this paper, we discuss the relationship between natural language processing by computers (NLP) and the understanding of the human language capacity, as studied by linguistics and cognitive science. We outline the evolution of NLP from its beginnings until the age of large language models, and highlight for each of its main paradigms some similarities and differences with theories of the human language capacity. We conclude that the evolution of language technology has not substantially deepened our understanding of how human minds process natural language, despite the impressive language abilities attained by current chatbots using artificial neural networks.
摘要：在本文中，我们讨论了语言学和认知科学所研究的计算机自然语言处理（NLP）与理解人类语言能力之间的关系。我们概述了 NLP 从诞生到大型语言模型时代的演变，并强调了其每个主要范式与人类语言能力理论的一些相似之处和差异。我们的结论是，尽管当前使用人工神经网络的聊天机器人获得了令人印象深刻的语言能力，但语言技术的发展并没有实质性加深我们对人类思维如何处理自然语言的理解。

Title: Improving Clinical Diagnosis with Counterfactual Multi-Agent Reasoning

Authors: Zhiwen You, Xi Chen, Aniket Vashishtha, Simo Du, Gabriel Erion-Barner, Hongyuan Mei, Hao Peng, Yue Guo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.27820
Pdf URL: https://arxiv.org/pdf/2603.27820
Copy Paste: [[2603.27820]] Improving Clinical Diagnosis with Counterfactual Multi-Agent Reasoning(https://arxiv.org/abs/2603.27820)
Keywords: language model, llm, prompt, agent
Abstract: Clinical diagnosis is a complex reasoning process in which clinicians gather evidence, form hypotheses, and test them against alternative explanations. In medical training, this reasoning is explicitly developed through counterfactual questioning--e.g., asking how a diagnosis would change if a key symptom were absent or altered--to strengthen differential diagnosis skills. As large language model (LLM)-based systems are increasingly used for diagnostic support, ensuring the interpretability of their recommendations becomes critical. However, most existing LLM-based diagnostic agents reason over fixed clinical evidence without explicitly testing how individual findings support or weaken competing diagnoses. In this work, we propose a counterfactual multi-agent diagnostic framework inspired by clinician training that makes hypothesis testing explicit and evidence-grounded. Our framework introduces counterfactual case editing to modify clinical findings and evaluate how these changes affect competing diagnoses. We further define the Counterfactual Probability Gap, a method that quantifies how strongly individual findings support a diagnosis by measuring confidence shifts under these edits. These counterfactual signals guide multi-round specialist discussions, enabling agents to challenge unsupported hypotheses, refine differential diagnoses, and produce more interpretable reasoning trajectories. Across three diagnostic benchmarks and seven LLMs, our method consistently improves diagnostic accuracy over prompting and prior multi-agent baselines, with the largest gains observed in complex and ambiguous cases. Human evaluation further indicates that our framework produces more clinically useful, reliable, and coherent reasoning. These results suggest that incorporating counterfactual evidence verification is an important step toward building reliable AI systems for clinical decision support.
摘要：临床诊断是一个复杂的推理过程，临床医生在这个过程中收集证据，形成假设，并根据其他解释对其进行检验。在医学培训中，这种推理是通过反事实提问明确发展的——例如，询问如果关键症状不存在或改变，诊断将如何改变——以加强鉴别诊断技能。随着基于大语言模型 (LLM) 的系统越来越多地用于诊断支持，确保其建议的可解释性变得至关重要。然而，大多数现有的基于法学硕士的诊断剂对固定的临床证据进行推理，而没有明确测试个体发现如何支持或削弱竞争性诊断。在这项工作中，我们提出了一个受临床医生培训启发的反事实多智能体诊断框架，使假设检验明确且有证据支持。我们的框架引入了反事实病例编辑来修改临床结果并评估这些变化如何影响竞争诊断。我们进一步定义了反事实概率差距，这是一种通过测量这些编辑下的置信度变化来量化个体发现支持诊断的强度的方法。这些反事实信号指导多轮专家讨论，使代理人能够挑战没有支持的假设，完善鉴别诊断，并产生更可解释的推理轨迹。在三个诊断基准和七个法学硕士中，我们的方法相对于提示和先前的多代理基线持续提高了诊断准确性，在复杂和模糊的情况下观察到最大的收益。人类评估进一步表明，我们的框架产生了更多临床有用、可靠和连贯的推理。这些结果表明，纳入反事实证据验证是构建可靠的临床决策支持人工智能系统的重要一步。
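
Illustrative sketch (not from the paper): the abstract says the Counterfactual Probability Gap measures confidence shifts when a clinical finding is edited. One plausible reading, assumed here, is the difference between a model's confidence in a diagnosis on the original case and on the case with one finding removed. The exact definition in the paper may differ. `toy_confidence`, `TOY_WEIGHTS`, and the placeholder finding and diagnosis names are hypothetical stand-ins for an LLM-based scorer and carry no clinical meaning.

```python
from typing import Callable, Dict, FrozenSet

# Hypothetical signature for a scorer: (findings, diagnosis) -> confidence in [0, 1].
ConfidenceFn = Callable[[FrozenSet[str], str], float]


def counterfactual_probability_gap(
    findings: FrozenSet[str],
    diagnosis: str,
    removed_finding: str,
    confidence: ConfidenceFn,
) -> float:
    """Assumed definition: confidence(original) - confidence(case without the finding)."""
    if removed_finding not in findings:
        raise ValueError(f"{removed_finding!r} is not part of the case")
    edited = findings - {removed_finding}
    return confidence(findings, diagnosis) - confidence(edited, diagnosis)


# Hypothetical weights standing in for a real model; placeholder names only.
TOY_WEIGHTS: Dict[str, Dict[str, float]] = {
    "diagnosis_x": {"finding_a": 0.5, "finding_b": 0.25},
}


def toy_confidence(findings: FrozenSet[str], diagnosis: str) -> float:
    """Toy scorer: a base rate plus a fixed weight for each supporting finding."""
    weights = TOY_WEIGHTS.get(diagnosis, {})
    return min(1.0, 0.125 + sum(w for f, w in weights.items() if f in findings))


case = frozenset({"finding_a", "finding_b", "finding_c"})
gaps = {
    finding: counterfactual_probability_gap(case, "diagnosis_x", finding, toy_confidence)
    for finding in sorted(case)
}
print(gaps)
```

Under this reading, a large positive gap marks a finding the diagnosis depends on, while a gap near zero marks a finding that does not support it.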

Title: ProText: A benchmark dataset for measuring (mis)gendering in long-form texts

Authors: Hadas Kotek, Margit Bowler, Patrick Sonnenberg, Yu'an Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.27838
Pdf URL: https://arxiv.org/pdf/2603.27838
Copy Paste: [[2603.27838]] ProText: A benchmark dataset for measuring (mis)gendering in long-form texts(https://arxiv.org/abs/2603.27838)
Keywords: language model, prompt
Abstract: We introduce ProText, a dataset for measuring gendering and misgendering in stylistically diverse long-form English texts. ProText spans three dimensions: Theme nouns (names, occupations, titles, kinship terms), Theme category (stereotypically male, stereotypically female, gender-neutral/non-gendered), and Pronoun category (masculine, feminine, gender-neutral, none). The dataset is designed to probe (mis)gendering in text transformations such as summarization and rewrites using state-of-the-art Large Language Models, extending beyond traditional pronoun resolution benchmarks and beyond the gender binary. We validated ProText through a mini case study, showing that even with just two prompts and two models, we can draw nuanced insights regarding gender bias, stereotyping, misgendering, and gendering. We reveal systematic gender bias, particularly when inputs contain no explicit gender cues or when models default to heteronormative assumptions.
摘要：我们引入了 ProText，这是一个用于测量风格多样的长篇英语文本中的性别和性别错误的数据集。 ProText 跨越三个维度：主题名词（姓名、职业、头衔、亲属称谓）、主题类别（典型男性、典型女性、中性/非性别）和代词类别（男性、女性、中性、无）。该数据集旨在探讨文本转换中的（错误）性别，例如使用最先进的大型语言模型进行摘要和重写，超越传统的代词解析基准和性别二元。我们通过一个小型案例研究验证了 ProText，表明即使只有两个提示和两个模型，我们也可以得出有关性别偏见、刻板印象、性别歧视和性别歧视的细致入微的见解。我们揭示了系统性的性别偏见，特别是当输入不包含明确的性别线索或模型默认为异性恋假设时。

Title: Model Capability Dominates: Inference-Time Optimization Lessons from AIMO 3

Authors: Natapong Nitarach
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.27844
Pdf URL: https://arxiv.org/pdf/2603.27844
Copy Paste: [[2603.27844]] Model Capability Dominates: Inference-Time Optimization Lessons from AIMO 3(https://arxiv.org/abs/2603.27844)
Keywords: llm, prompt
Abstract: Majority voting over multiple LLM attempts improves mathematical reasoning, but correlated errors limit the effective sample size. A natural fix: assign structurally different reasoning strategies to different voters to decorrelate errors. We test this Diverse Prompt Mixer in the AIMO 3 competition: 3 models, 23+ experiments, and 50 IMO-level problems on a single H100 80 GB with a 5-hour limit. Every intervention fails. High-temperature sampling already decorrelates errors sufficiently; weaker prompt strategies reduce per-attempt accuracy more than they reduce correlation. Across a 17-point model capability gap and every inference-time optimization we tried, model capability dominates by an order of magnitude.
摘要：对多次法学硕士尝试的多数投票可以提高数学推理能力，但相关错误限制了有效样本量。一个自然的解决方案：为不同的选民分配结构上不同的推理策略以消除错误的相关性。我们在 AIMO~3 竞赛中测试了这款 Diverse Prompt Mixer：3 个模型、超过 23 个实验以及在单个 H100 80 GB 上解决 50 个 IMO 级别问题，限时 5 小时。每次干预都会失败。高温采样已经充分消除了误差；较弱的提示策略降低每次尝试准确性的程度大于降低相关性的程度。在 17 个点的模型能力差距和我们尝试的每一次推理时间优化中，模型能力都占据了一个数量级的主导地位。
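
Illustrative sketch (not from the paper): the abstract's point that correlated errors limit the effective sample size can be made concrete with the standard design-effect approximation, n_eff = n / (1 + (n - 1) * rho), for n voters with a common pairwise correlation rho. Whether the paper uses this exact formula is an assumption; the function names are hypothetical.

```python
from collections import Counter
from typing import Sequence


def majority_vote(answers: Sequence[str]) -> str:
    """Return the most frequent answer; ties go to the answer seen first."""
    if not answers:
        raise ValueError("need at least one answer")
    return Counter(answers).most_common(1)[0][0]


def effective_sample_size(n: int, rho: float) -> float:
    """Design-effect approximation for n equally correlated voters."""
    if n < 1:
        raise ValueError("n must be at least 1")
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho is assumed to lie in [0, 1]")
    return n / (1 + (n - 1) * rho)


votes = ["42", "17", "42", "42", "9"]
print(majority_vote(votes))

# More voters help little once errors are strongly correlated.
for rho in (0.0, 0.5, 1.0):
    print(rho, effective_sample_size(8, rho))
```

With rho = 0 the eight attempts count as eight independent votes; with rho = 1 they collapse to a single vote. This is why decorrelating errors looks attractive in principle, even though the abstract reports that it did not pay off in practice.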

Title: What can LLMs tell us about the mechanisms behind polarity illusions in humans? Experiments across model scales and training steps

Authors: Dario Paape
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.27855
Pdf URL: https://arxiv.org/pdf/2603.27855
Copy Paste: [[2603.27855]] What can LLMs tell us about the mechanisms behind polarity illusions in humans? Experiments across model scales and training steps(https://arxiv.org/abs/2603.27855)
Keywords: llm
Abstract: I use the Pythia scaling suite (Biderman et al. 2023) to investigate if and how two well-known polarity illusions, the NPI illusion and the depth charge illusion, arise in LLMs. The NPI illusion becomes weaker and ultimately disappears as model size increases, while the depth charge illusion becomes stronger in larger models. The results have implications for human sentence processing: it may not be necessary to assume "rational inference" mechanisms that convert ill-formed sentences into well-formed ones to explain polarity illusions, given that LLMs cannot plausibly engage in this kind of reasoning, especially at the implicit level of next-token prediction. On the other hand, shallow, "good enough" processing and/or partial grammaticalization of prescriptively ungrammatical structures may both occur in LLMs. I propose a synthesis of different theoretical accounts that is rooted in the basic tenets of construction grammar.
摘要：我使用 Pythia 缩放套件（Biderman et al. 2023）来研究两种著名的极性错觉（NPI 错觉和深度电荷错觉）是否以及如何在法学硕士中出现。随着模型尺寸的增加，NPI 错觉变得更弱并最终消失，而深度电荷错觉在较大的模型中变得更强。结果对人类句子处理有影响：可能没有必要假设“理性推理”机制将格式错误的句子转换为格式良好的句子来解释极性错觉，因为法学硕士无法合理地参与这种推理，特别是在下一个标记预测的隐式水平上。另一方面，浅层的、“足够好”的处理和/或规定的非语法结构的部分语法化可能都发生在法学硕士中。我提出了对植根于构式语法基本原则的不同理论阐述的综合。

Title: KazByte: Adapting Qwen models to Kazakh via Byte-level Adapter

Authors: Rauan Akylzhanov
Subjects: cs.CL, math.NA
Abstract URL: https://arxiv.org/abs/2603.27859
Pdf URL: https://arxiv.org/pdf/2603.27859
Copy Paste: [[2603.27859]] KazByte: Adapting Qwen models to Kazakh via Byte-level Adapter(https://arxiv.org/abs/2603.27859)
Keywords: language model
Abstract: Large language models fragment Kazakh text into many more tokens than equivalent English text, because their tokenizers were built for high-resource languages. This tokenizer tax inflates compute, shortens the effective context window, and weakens the model's grip on Kazakh morphology. We propose to bypass the tokenizer entirely by feeding raw bytes through a small adapter that learns to speak the internal language of a frozen Qwen2.5-7B. Once the adapter is trained, we freeze it and fine-tune only the attention layers of Qwen on Kazakh text. Our central hypothesis is that this two-stage process -- first teach the interface, then adapt the model -- should match or exceed the accuracy of the original Qwen2.5-7B on standard Kazakh benchmarks. This report describes the ByteKaz architecture and training protocol. Empirical validation is ongoing; this version stakes the design and hypotheses for the record.
摘要：大型语言模型将哈萨克语文本分割成比同等英语文本更多的标记，因为它们的标记器是为高资源语言构建的。这种分词器税会增加计算量，缩短有效上下文窗口，并削弱模型对哈萨克语形态的控制。我们建议通过一个小适配器馈送原始字节来完全绕过分词器，该适配器学习说冻结的 Qwen2.5-7B 的内部语言。适配器训练完成后，我们将其冻结并仅微调 Qwen 在哈萨克语文本上的注意力层。我们的中心假设是，这个两阶段过程（首先教授界面，然后调整模型）应该匹配或超过标准哈萨克语基准上原始 Qwen2.5-7B 的准确性。本报告描述了 ByteKaz 架构和训练协议。实证验证正在进行中；这个版本将设计和假设记录下来。
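
Illustrative sketch (not from the paper): the input side of a byte-level interface can be shown with plain UTF-8 encoding, which maps any text to integer IDs in the range 0 to 255 with no learned tokenizer. The adapter that would consume these IDs is not described in enough detail in the abstract to sketch, so it is omitted; the function names below are hypothetical.

```python
from typing import List


def to_byte_ids(text: str) -> List[int]:
    """Encode text as UTF-8 and return one integer ID (0-255) per byte."""
    return list(text.encode("utf-8"))


def from_byte_ids(ids: List[int]) -> str:
    """Invert to_byte_ids; raises UnicodeDecodeError on invalid byte sequences."""
    return bytes(ids).decode("utf-8")


kazakh = "сәлем"   # Kazakh greeting, 5 Cyrillic characters
english = "hello"  # 5 ASCII characters

kazakh_ids = to_byte_ids(kazakh)
english_ids = to_byte_ids(english)

print(len(kazakh), len(kazakh_ids))    # characters vs. bytes
print(len(english), len(english_ids))
print(from_byte_ids(kazakh_ids) == kazakh)
```

Each Cyrillic character here occupies two UTF-8 bytes, so the byte sequence is twice as long as the character sequence. A byte-level front end therefore trades tokenizer fragmentation for longer raw sequences that the adapter has to handle.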

Title: HumMusQA: A Human-written Music Understanding QA Benchmark Dataset

Authors: Benno Weck, Pablo Puentes, Andrea Poltronieri, Satyajeet Prabhu, Dmitry Bogdanov
Subjects: cs.CL, cs.SD
Abstract URL: https://arxiv.org/abs/2603.27877
Pdf URL: https://arxiv.org/pdf/2603.27877
Copy Paste: [[2603.27877]] HumMusQA: A Human-written Music Understanding QA Benchmark Dataset(https://arxiv.org/abs/2603.27877)
Keywords: language model
Abstract: The evaluation of music understanding in Large Audio-Language Models (LALMs) requires a rigorously defined benchmark that truly tests whether models can perceive and interpret music, a standard that current data methodologies frequently fail to meet. This paper introduces a meticulously structured approach to music evaluation, proposing a new dataset of 320 hand-written questions curated and validated by experts with musical training, arguing that such focused, manual curation is superior for probing complex audio comprehension. To demonstrate the use of the dataset, we benchmark six state-of-the-art LALMs and additionally test their robustness to uni-modal shortcuts.
摘要：大型音频语言模型（LALM）中音乐理解的评估需要严格定义的基准来真正测试模型是否能够感知和解释音乐，而当前的数据方法经常无法满足这一标准。本文介绍了一种精心结构化的音乐评估方法，提出了由受过音乐训练的专家策划和验证的 320 个手写问题的新数据集，认为这种集中的手动策划对于探索复杂的音频理解来说是优越的。为了演示数据集的使用，我们对六个最先进的 LALM 进行了基准测试，并另外测试了它们对单模态快捷方式的鲁棒性。

Title: Article and Comment Frames Shape the Quality of Online Comments

Authors: Matteo Guida, Yulia Otmakhova, Eduard Hovy, Lea Frermann
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.27889
Pdf URL: https://arxiv.org/pdf/2603.27889
Copy Paste: [[2603.27889]] Article and Comment Frames Shape the Quality of Online Comments(https://arxiv.org/abs/2603.27889)
Keywords: llm
Abstract: Framing theory posits that how information is presented shapes audience responses, but computational work has largely ignored audience reactions. While recent work showed that article framing systematically shapes the content of reader responses, this paper asks: Does framing also affect response quality? Analyzing 1M comments across 2.7K news articles, we operationalize quality as comment health (constructive, good-faith contributions). We find that article frames significantly predict comment health while controlling for topic, and that comments that adopt the article frame are healthier than those that depart from it. Further, unhealthy top-level comments tend to generate more unhealthy responses, independent of the frame being used in the comment. Our results establish a link between framing theory and discourse quality, laying the groundwork for downstream applications. We illustrate this potential with a proactive frame-aware LLM-based system to mitigate unhealthy discourse.
摘要：框架理论认为信息的呈现方式会影响受众的反应，但计算工作在很大程度上忽略了受众的反应。虽然最近的研究表明文章框架系统地塑造了读者响应的内容，但本文提出以下问题：框架是否也会影响响应质量？通过分析 2700 篇新闻文章中的 100 万条评论，我们将质量视为评论健康度（建设性的、善意的贡献）。我们发现，文章框架在控制主题的同时可以显着预测评论的健康状况，并且采用文章框架的评论比偏离文章框架的评论更健康。此外，不健康的顶级评论往往会产生更多不健康的响应，与评论中使用的框架无关。我们的结果在框架理论和话语质量之间建立了联系，为下游应用奠定了基础。我们通过基于主动框架感知的 LLM 系统来说明这种潜力，以减少不健康的言论

Title: EnsemJudge: Enhancing Reliability in Chinese LLM-Generated Text Detection through Diverse Model Ensembles

Authors: Zhuoshang Wang, Yubing Ren, Guoyu Zhao, Xiaowei Zhu, Hao Li, Yanan Cao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.27949
Pdf URL: https://arxiv.org/pdf/2603.27949
Copy Paste: [[2603.27949]] EnsemJudge: Enhancing Reliability in Chinese LLM-Generated Text Detection through Diverse Model Ensembles(https://arxiv.org/abs/2603.27949)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) are widely applied across various domains due to their powerful text generation capabilities. While LLM-generated texts often resemble human-written ones, their misuse can lead to significant societal risks. Detecting such texts is an essential technique for mitigating LLM misuse, and many detection methods have shown promising results across different datasets. However, real-world scenarios often involve out-of-domain inputs or adversarial samples, which can affect the performance of detection methods to varying degrees. Furthermore, most existing research has focused on English texts, with limited work addressing Chinese text detection. In this study, we propose EnsemJudge, a robust framework for detecting Chinese LLM-generated text by incorporating tailored strategies and ensemble voting mechanisms. We trained and evaluated our system on a carefully constructed Chinese dataset provided by NLPCC2025 Shared Task 1. Our approach outperformed all baseline methods and achieved first place in the task, demonstrating its effectiveness and reliability in Chinese LLM-generated text detection. Our code is available at this https URL.
摘要：大型语言模型（LLM）由于其强大的文本生成能力而广泛应用于各个领域。虽然法学硕士生成的文本通常类似于人类编写的文本，但它们的滥用可能会导致重大的社会风险。检测此类文本是减少 LLM 滥用的一项重要技术，许多检测方法在不同的数据集上都显示出了有希望的结果。然而，现实场景通常涉及域外输入或对抗性样本，这可能会不同程度地影响检测方法的性能。此外，大多数现有研究都集中在英文文本上，而针对中文文本检测的工作有限。在这项研究中，我们提出了 EnsemJudge，这是一个强大的框架，通过结合定制策略和整体投票机制来检测中文 LLM 生成的文本。我们在 NLPCC2025 共享任务 1 提供的精心构建的中文数据集上训练和评估了我们的系统。我们的方法优于所有基线方法并在任务中获得第一名，证明了其在中文 LLM 生成文本检测中的有效性和可靠性。我们的代码可以在这个 https URL 上找到。

Title: On the Role of Encoder Depth: Pruning Whisper and LoRA Fine-Tuning in SLAM-ASR

Authors: Ganesh Pavan Kartikeya Bharadwaj Kolluri, Michael Kampouridis, Ravi Shekhar
Subjects: cs.CL, cs.SD
Abstract URL: https://arxiv.org/abs/2603.27981
Pdf URL: https://arxiv.org/pdf/2603.27981
Copy Paste: [[2603.27981]] On the Role of Encoder Depth: Pruning Whisper and LoRA Fine-Tuning in SLAM-ASR(https://arxiv.org/abs/2603.27981)
Keywords: language model, llm
Abstract: Automatic speech recognition (ASR) has advanced rapidly in recent years, driven by large-scale pretrained models and end-to-end architectures such as SLAM-ASR. A key component of SLAM-ASR systems is the Whisper speech encoder, which provides robust acoustic representations. While model pruning has been explored for the full Whisper encoder-decoder architecture, its impact within the SLAM-ASR setting remains under-investigated. In this work, we analyze the effects of layer pruning in the Whisper encoder when used as the acoustic backbone of SLAM-ASR. We further examine the extent to which LoRA-based fine-tuning can recover performance degradation caused by pruning. Experiments conducted across three Whisper variants (Small, Medium, Large-v2), three languages representing distinct resource levels (Danish, Dutch, English), and over 200 training runs demonstrate that pruning two encoder layers causes only 2-4% WER degradation, and that combining this pruning with LoRA adaptation consistently outperforms the unpruned baseline while reducing total parameters by 7-14%. Moreover, our error analysis reveals that LoRA primarily compensates through the language model's linguistic priors, reducing total word errors by 11-21% for Dutch and English, with substitutions and deletions showing the largest reductions. However, for low-resource Danish, the reduction is smaller (4-7%), and LoRA introduces increased insertion errors, indicating that compensation effectiveness depends on the LLM's pre-existing language proficiency and available training data.
摘要：近年来，在大规模预训练模型和 SLAM-ASR 等端到端架构的推动下，自动语音识别 (ASR) 发展迅速。 SLAM-ASR 系统的关键组件是 Whisper 语音编码器，它提供强大的声学表示。虽然模型剪枝已经在完整的 Whisper 编码器-解码器架构中进行了探索，但其在 SLAM-ASR 设置中的影响仍未得到充分研究。在这项工作中，我们分析了当 Whisper 编码器用作 SLAM-ASR 的声学主干时层修剪的效果。我们进一步研究了基于 LoRA 的微调可以在多大程度上恢复剪枝引起的性能下降。在三种 Whisper 变体（小、中、大 v2）、代表不同资源级别的三种语言（丹麦语、荷兰语、英语）以及超过 200 次训练运行中进行的实验表明，修剪两个编码器层仅导致 2-4% 的 WER 降级，并且将此修剪与 LoRA 适应相结合始终优于未修剪的基线，同时将总参数减少 7-14%。此外，我们的错误分析表明，LoRA 主要通过语言模型的语言先验进行补偿，将荷兰语和英语的总单词错误减少了 11-21%，其中替换和删除的减少幅度最大。然而，对于资源匮乏的丹麦语，减少幅度较小（4-7％），并且LoRA引入了增加的插入错误，表明补偿有效性取决于LLM预先存在的语言熟练程度和可用的训练数据。
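
The abstract above reports results as word error rate (WER) and breaks errors down into substitutions, deletions, and insertions. As a minimal sketch of the standard metric (not the paper's evaluation code), WER can be computed as the word-level edit distance divided by the reference length; the function name `word_error_rate` is a hypothetical helper introduced here for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing from hypothesis)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution (b -> x) and one deletion (d) over four reference words.
    print(word_error_rate("a b c d", "a x c"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.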

Title: Rethinking Atomic Decomposition for LLM Judges: A Prompt-Controlled Study of Reference-Grounded QA Evaluation

Authors: Xinran Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.28005
Pdf URL: https://arxiv.org/pdf/2603.28005
Copy Paste: [[2603.28005]] Rethinking Atomic Decomposition for LLM Judges: A Prompt-Controlled Study of Reference-Grounded QA Evaluation(https://arxiv.org/abs/2603.28005)
Keywords: llm, prompt
Abstract: Atomic decomposition -- breaking a candidate answer into claims before verifying each against a reference -- is a widely adopted design for LLM-based reference-grounded judges. However, atomic prompts are typically richer and longer, making it unclear whether any advantage comes from decomposition or from richer prompting. We study this for benchmark-style completeness-sensitive reference-support classification: classifying a candidate as fully supported, partially supported, or unsupported relative to a supplied reference. We compare a self-decomposing atomic judge (single-prompt decompose-and-verify) against a prompt-controlled holistic judge with the same inputs and a similarly detailed rubric. On 200 source examples per dataset across TruthfulQA, ASQA, and QAMPARI, with four model families, source-level paired tests, cluster bootstrap, and aggregation across three pre-frozen prompt variants per design family, we find the holistic judge matches or exceeds the atomic judge on two of three benchmarks: ASQA and QAMPARI favor holistic across all four families (statistically reliable in three of four), while TruthfulQA shows a small atomic edge. The holistic advantage is concentrated in partially_supported cases -- incompleteness detection. A sensitivity check against human annotations confirms the ranking under both benchmark-completeness and human factual-correctness standards. Our finding is specific to the self-decomposing single-prompt pattern on three QA-style benchmarks with 200 source examples each; multi-stage atomic pipelines and non-QA tasks remain untested. Among perturbations examined, reference-quality degradation produced the largest accuracy drops for both judge families.
摘要：原子分解——将候选答案分解为多个主张，然后根据参考文献验证每个主张——是基于参考文献的法学硕士法官广泛采用的设计。然而，原子提示通常更丰富、更长，这使得我们不清楚任何优势是否来自分解或更丰富的提示。我们研究基准式完整性敏感的参考支持分类：相对于提供的参考，将候选者分类为完全支持、部分支持或不支持。我们将自分解原子判断（单提示分解和验证）与具有相同输入和类似详细标题的提示控制整体判断进行比较。在 TruthfulQA、ASQA 和 QAMPARI 中每个数据集的 200 个源示例中，有四个模型系列、源级配对测试、集群引导以及每个设计系列的三个预冻结提示变体的聚合，我们发现整体判断在三个基准中的两个上匹配或超过原子判断：ASQA 和 QAMPARI 倾向于所有四个系列中的整体（四个系列中的三个在统计上可靠），而 TruthfulQA 显示出较小的原子优势。整体优势集中在部分支持的情况下——不完整性检测。针对人类注释的敏感性检查确认了基准完整性和人类事实正确性标准下的排名。我们的发现特定于三个 QA 风格基准上的自分解单提示模式，每个基准有 200 个源示例；多阶段原子管道和非 QA 任务尚未经过测试。在检查的扰动中，参考质量下降对两个法官系列造成的准确性下降最大。

Title: Who Wrote the Book? Detecting and Attributing LLM Ghostwriters

Authors: Anudeex Shetty, Qiongkai Xu, Olga Ohrimenko, Jey Han Lau
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.28054
Pdf URL: https://arxiv.org/pdf/2603.28054
Copy Paste: [[2603.28054]] Who Wrote the Book? Detecting and Attributing LLM Ghostwriters(https://arxiv.org/abs/2603.28054)
Keywords: language model, llm
Abstract: In this paper, we introduce GhostWriteBench, a dataset for LLM authorship attribution. It comprises long-form texts (50K+ words per book) generated by frontier LLMs, and is designed to test generalisation across multiple out-of-distribution (OOD) dimensions, including domain and unseen LLM author. We also propose TRACE -- a novel fingerprinting method that is interpretable and lightweight -- that works for both open- and closed-source models. TRACE creates the fingerprint by capturing token-level transition patterns (e.g., word rank) estimated by another lightweight language model. Experiments on GhostWriteBench demonstrate that TRACE achieves state-of-the-art performance, remains robust in OOD settings, and works well in limited training data scenarios.
摘要：在本文中，我们介绍了 GhostWriteBench，一个 LLM 作者归属的数据集。它由前沿法学硕士生成的长篇文本（每本书超过 5 万字）组成，旨在测试跨多个分布外 (OOD) 维度的泛化，包括领域和未见过的法学硕士作者。我们还提出了 TRACE——一种可解释且轻量级的新型指纹识别方法——适用于开源和闭源模型。 TRACE 通过捕获由另一个轻量级语言模型估计的标记级转换模式（例如，单词排名）来创建指纹。 GhostWriteBench 上的实验表明，TRACE 实现了最先进的性能，在 OOD 设置中保持稳健，并且在有限的训练数据场景中运行良好。

Title: From Reviews to Requirements: Can LLMs Generate Human-Like User Stories?

Authors: Shadman Sakib, Oishy Fatema Akhand, Tasnia Tasneem, Shohel Ahmed
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.28163
Pdf URL: https://arxiv.org/pdf/2603.28163
Copy Paste: [[2603.28163]] From Reviews to Requirements: Can LLMs Generate Human-Like User Stories?(https://arxiv.org/abs/2603.28163)
Keywords: language model, gpt, llm, prompt
Abstract: App store reviews provide a constant flow of real user feedback that can help improve software requirements. However, these reviews are often messy, informal, and difficult to analyze manually at scale. Although automated techniques exist, many do not perform well when replicated and often fail to produce clean, backlog-ready user stories for agile projects. In this study, we evaluate how well large language models (LLMs) such as GPT-3.5 Turbo, Gemini 2.0 Flash, and Mistral 7B Instruct can generate usable user stories directly from raw app reviews. Using the Mini-BAR dataset of 1,000+ health app reviews, we tested zero-shot, one-shot, and two-shot prompting methods. We evaluated the generated user stories using both human judgment (via the RUST framework) and a RoBERTa classifier fine-tuned on UStAI to assess their overall quality. Our results show that LLMs can match or even outperform humans in writing fluent, well-formatted user stories, especially when few-shot prompts are used. However, they still struggle to produce independent and unique user stories, which are essential for building a strong agile backlog. Overall, our findings show how LLMs can reliably turn unstructured app reviews into actionable software requirements, providing developers with clear guidance to turn user feedback into meaningful improvements.
摘要：应用商店评论提供源源不断的真实用户反馈，有助于改进软件需求。然而，这些评论通常是混乱的、非正式的，并且难以大规模手动分析。尽管存在自动化技术，但许多技术在复制时表现不佳，并且常常无法为敏捷项目生成干净的、可积压的用户故事。在本研究中，我们评估了 GPT-3.5 Turbo、Gemini 2.0 Flash 和 Mistral 7B Instruct 等大型语言模型 (LLM) 直接从原始应用评论生成可用用户故事的能力。使用包含 1,000 多个健康应用评论的 Mini-BAR 数据集，我们测试了零样本、单样本和双样本提示方法。我们使用人类判断（通过 RUST 框架）和在 UStAI 上微调的 RoBERTa 分类器来评估生成的用户故事，以评估其整体质量。我们的结果表明，法学硕士在编写流畅、格式良好的用户故事方面可以与人类相媲美甚至超越人类，特别是在使用少量提示时。然而，他们仍然难以生成独立且独特的用户故事，这对于构建强大的敏捷积压工作至关重要。总的来说，我们的研究结果表明法学硕士如何可靠地将非结构化应用程序评论转化为可操作的软件需求，为开发人员提供明确的指导，将用户反馈转化为有意义的改进。

Title: DongYuan: An LLM-Based Framework for Integrative Chinese and Western Medicine Spleen-Stomach Disorders Diagnosis

Authors: Hua Li, Yingying Li, Xiaobin Feng, Xinyi Fu, Lifeng Dong, Qingfeng Yang, Yanzhe Chen, Xiaoju Feng, Zhidong Cao, Jianbin Guo, Yanru Du
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.28191
Pdf URL: https://arxiv.org/pdf/2603.28191
Copy Paste: [[2603.28191]] DongYuan: An LLM-Based Framework for Integrative Chinese and Western Medicine Spleen-Stomach Disorders Diagnosis(https://arxiv.org/abs/2603.28191)
Keywords: language model, llm
Abstract: The clinical burden of spleen-stomach disorders is substantial. While large language models (LLMs) offer new potential for medical applications, they face three major challenges in the context of integrative Chinese and Western medicine (ICWM): a lack of high-quality data, the absence of models capable of effectively integrating the reasoning logic of traditional Chinese medicine (TCM) syndrome differentiation with that of Western medical (WM) disease diagnosis, and the shortage of a standardized evaluation benchmark. To address these interrelated challenges, we propose DongYuan, an ICWM spleen-stomach diagnostic framework. Specifically, three ICWM datasets (SSDF-Syndrome, SSDF-Dialogue, and SSDF-PD) were curated to fill the gap in high-quality data for spleen-stomach disorders. We then developed SSDF-Core, a core diagnostic LLM that acquires robust ICWM reasoning capabilities through a two-stage training regimen of supervised fine-tuning (SFT) and direct preference optimization (DPO), and complemented it with SSDF-Navigator, a pluggable consultation navigation model designed to optimize clinical inquiry strategies. Additionally, we established SSDF-Bench, a comprehensive evaluation benchmark focused on ICWM diagnosis of spleen-stomach disorders. Experimental results demonstrate that SSDF-Core significantly outperforms 12 mainstream baselines on SSDF-Bench. DongYuan lays a solid methodological foundation and provides practical technical references for the future development of intelligent ICWM diagnostic systems.
摘要：脾胃疾病的临床负担是巨大的。虽然大语言模型（LLM）为医学应用提供了新的潜力，但在中西医结合（ICWM）的背景下，它们面临着三大挑战：缺乏高质量的数据，缺乏能够有效整合中医辨证推理逻辑和西医疾病诊断逻辑的模型，以及缺乏标准化的评估基准。为了解决这些相互关联的挑战，我们提出了 DongYuan，一种 ICWM 脾胃诊断框架。具体来说，我们策划了三个 ICWM 数据集（SSDF-Syndrome、SSDF-Dialogue 和 SSDF-PD），以填补脾胃疾病高质量数据的空白。然后，我们开发了 SSDF-Core，这是一种核心诊断大语言模型，通过监督微调（SFT）和直接偏好优化（DPO）的两阶段训练方案获得强大的 ICWM 推理能力，并用 SSDF-Navigator 对其进行补充，SSDF-Navigator 是一种可插入的咨询导航模型，旨在优化临床询问策略。此外，我们还建立了SSDF-Bench，这是一个专注于ICWM诊断脾胃疾病的综合评估基准。实验结果表明 SSDF-Core 在 SSDF-Bench 上显着优于 12 个主流基线。DongYuan 为智能ICWM诊断系统的未来发展奠定了坚实的方法论基础，并提供了实用的技术参考。

Title: Versteasch du mi? Computational and Socio-Linguistic Perspectives on GenAI, LLMs, and Non-Standard Language

Authors: Verena Platzgummer, John McCrae, Sina Ahmadi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.28213
Pdf URL: https://arxiv.org/pdf/2603.28213
Copy Paste: [[2603.28213]] Versteasch du mi? Computational and Socio-Linguistic Perspectives on GenAI, LLMs, and Non-Standard Language(https://arxiv.org/abs/2603.28213)
Keywords: language model, llm
Abstract: The design of Large Language Models and generative artificial intelligence has been shown to be "unfair" to less-spoken languages and to deepen the digital language divide. Critical sociolinguistic work has also argued that these technologies are not only made possible by prior socio-historical processes of linguistic standardisation, often grounded in European nationalist and colonial projects, but also exacerbate epistemologies of language as "monolithic, monolingual, syntactically standardized systems of meaning". In our paper, we draw on earlier work on the intersections of technology and language policy and bring our respective expertise in critical sociolinguistics and computational linguistics to bear on an interrogation of these arguments. We take two different complexes of non-standard linguistic varieties in our respective repertoires--South Tyrolean dialects, which are widely used in informal communication in South Tyrol, Italy, as well as varieties of Kurdish--as starting points to an interdisciplinary exploration of the intersections between GenAI and linguistic variation and standardisation. We discuss both how LLMs can be made to deal with nonstandard language from a technical perspective, and whether, when or how this can contribute to "democratic and decolonial digital and machine learning strategies", which has direct policy implications.
摘要：大型语言模型和生成人工智能的设计已被证明对较少使用的语言“不公平”，并加深了数字语言鸿沟。批判的社会语言学工作还认为，这些技术不仅是由先前的语言标准化社会历史进程（通常植根于欧洲民族主义和殖民项目）使之成为可能，而且还加剧了语言认识论作为“单一的、单一语言的、句法标准化的意义系统”。在我们的论文中，我们借鉴了早期关于技术和语言政策交叉点的工作，并利用我们各自在批判社会语言学和计算语言学方面的专业知识来质疑这些论点。我们将各自库中的两种不同的非标准语言变体——南蒂罗尔方言（广泛用于意大利南蒂罗尔的非正式交流）以及库尔德语变体——作为跨学科探索 GenAI 与语言变异和标准化之间交叉点的起点。我们讨论了如何让法学硕士从技术角度处理非标准语言，以及这是否、何时或如何有助于“民主和非殖民化的数字和机器学习策略”，这具有直接的政策影响。

Title: Categorical Perception in Large Language Model Hidden States: Structural Warping at Digit-Count Boundaries

Authors: Jon-Paul Cacioli
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.28258
Pdf URL: https://arxiv.org/pdf/2603.28258
Copy Paste: [[2603.28258]] Categorical Perception in Large Language Model Hidden States: Structural Warping at Digit-Count Boundaries(https://arxiv.org/abs/2603.28258)
Keywords: language model, llm
Abstract: Categorical perception (CP) -- enhanced discriminability at category boundaries -- is among the most studied phenomena in perceptual psychology. This paper reports that analogous geometric warping occurs in the hidden-state representations of large language models (LLMs) processing Arabic numerals. Using representational similarity analysis across six models from five architecture families, the study finds that a CP-additive model (log-distance plus a boundary boost) fits the representational geometry better than a purely continuous model at 100% of primary layers in every model tested. The effect is specific to structurally defined boundaries (digit-count transitions at 10 and 100), absent at non-boundary control positions, and absent in the temperature domain where linguistic categories (hot/cold) lack a tokenisation discontinuity. Two qualitatively distinct signatures emerge: "classic CP" (Gemma, Qwen), where models both categorise explicitly and show geometric warping, and "structural CP" (Llama, Mistral, Phi), where geometry warps at the boundary but models cannot report the category distinction. This dissociation is stable across boundaries and is a property of the architecture, not the stimulus. Structural input-format discontinuities are sufficient to produce categorical perception geometry in LLMs, independently of explicit semantic category knowledge.
摘要：类别感知（CP）——类别边界的辨别力增强——是知觉心理学中研究最多的现象之一。本文报告称，处理阿拉伯数字的大型语言模型 (LLM) 的隐藏状态表示中也出现了类似的几何扭曲。通过对来自 5 个架构系列的 6 个模型进行表征相似性分析，研究发现 CP 加法模型（对数距离加边界提升）比纯连续模型在每个测试模型的 100% 主层上更适合表征几何。该效应特定于结构定义的边界（10 和 100 处的数字计数转换），在非边界控制位置不存在，并且在语言类别（热/冷）缺乏标记化不连续性的温度域中不存在。出现了两个性质不同的特征：“经典 CP”（Gemma、Qwen），其中模型明确分类并显示几何扭曲；以及“结构 CP”（Llama、Mistral、Phi），其中几何在边界处扭曲，但模型无法报告类别区别。这种分离在边界上是稳定的，并且是架构的属性，而不是刺激。结构输入格式的不连续性足以在法学硕士中产生分类感知几何，独立于显式语义类别知识。
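
The abstract above contrasts a purely continuous model of numeral distance with a "CP-additive" model (log-distance plus a boundary boost). The sketch below illustrates that contrast under stated assumptions: the functional form, the `boost` value of 0.5, and all function names are hypothetical choices made here, not the paper's fitted model.

```python
import math


def digit_count(n: int) -> int:
    """Number of decimal digits in a positive integer (1 for 1-9, 2 for 10-99, ...)."""
    return len(str(n))


def continuous_distance(a: int, b: int) -> float:
    """Purely continuous predictor: distance on a logarithmic number line."""
    return abs(math.log(a) - math.log(b))


def cp_additive_distance(a: int, b: int, boost: float = 0.5) -> float:
    """Log-distance plus a constant boost when the pair straddles a digit-count boundary.

    The boost size is an arbitrary illustrative value (assumption).
    """
    crosses_boundary = digit_count(a) != digit_count(b)
    return continuous_distance(a, b) + (boost if crosses_boundary else 0.0)


if __name__ == "__main__":
    # 8 vs 9 stay within one digit count; 9 vs 10 cross the boundary at 10.
    for a, b in [(8, 9), (9, 10), (99, 100)]:
        print(a, b, continuous_distance(a, b), cp_additive_distance(a, b))
```

Under this form, adjacent numbers on either side of 10 or 100 are predicted to be more dissimilar than their log-distance alone would suggest, which is the kind of warping the abstract describes.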

Title: Merge and Conquer: Instructing Multilingual Models by Adding Target Language Weights

Authors: Eneko Valero, Maria Ribalta i Albado, Oscar Sainz, Naiara Perez, German Rigau
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.28263
Pdf URL: https://arxiv.org/pdf/2603.28263
Copy Paste: [[2603.28263]] Merge and Conquer: Instructing Multilingual Models by Adding Target Language Weights(https://arxiv.org/abs/2603.28263)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) remain heavily centered on English, with limited performance in low-resource languages. Existing adaptation approaches, such as continual pre-training, demand significant computational resources. In the case of instructed models, high-quality instruction data is also required; both resources are often inaccessible for low-resource language communities. Under these constraints, model merging offers a lightweight alternative, but its potential in low-resource contexts has not been systematically explored. In this work, we explore whether it is possible to transfer language knowledge to an instruction-tuned LLM by merging it with a language-specific base model, thereby eliminating the need for language-specific instructions and repeated fine-tuning processes whenever stronger instructed variants become available. Through experiments covering four Iberian languages (Basque, Catalan, Galician, and Spanish) and two model families, we show that merging enables effective instruction-following behavior in new languages and even supports multilingual capability through the combination of multiple language-specific models. Our results indicate that model merging is a viable and efficient alternative to traditional adaptation methods for low-resource languages, achieving competitive performance while greatly reducing computational cost.
摘要：大型语言模型 (LLM) 仍然主要以英语为中心，在资源匮乏的语言中表现有限。现有的适应方法，例如持续预训练，需要大量的计算资源。就指令模型而言，还需要高质量的指令数据，而这两者对于资源匮乏的语言社区来说通常是无法访问的。在这些限制下，模型合并提供了一种轻量级的替代方案，但其在资源匮乏的情况下的潜力尚未得到系统地探索。在这项工作中，我们探索是否有可能通过将语言知识与特定于语言的基础模型合并来将语言知识转移到指令调整的法学硕士，从而消除对特定于语言的指令的需求，并在出现更强的指令变体时重复微调过程。通过涵盖四种伊比利亚语言（巴斯克语、加泰罗尼亚语、加利西亚语和西班牙语）和两个模型系列的实验，我们表明合并可以实现新语言中有效的指令遵循行为，甚至通过多个特定于语言的模型的组合来支持多语言能力。我们的结果表明，模型合并是低资源语言传统适应方法的可行且有效的替代方法，在实现有竞争力的性能的同时大大降低了计算成本。

Title: The Necessity of Setting Temperature in LLM-as-a-Judge

Authors: Lujun Li, Lama Sleem, Yangjie Xu, Yewei Song, Aolin Jia, Jerome Francois, Radu State
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.28304
Pdf URL: https://arxiv.org/pdf/2603.28304
Copy Paste: [[2603.28304]] The Necessity of Setting Temperature in LLM-as-a-Judge(https://arxiv.org/abs/2603.28304)
Keywords: llm
Abstract: LLM-as-a-Judge has emerged as an effective and low-cost paradigm for evaluating text quality and factual correctness. Prior studies have shown substantial agreement between LLM judges and human experts, even on tasks that are difficult to assess automatically. In practice, researchers commonly employ fixed temperature configurations during the evaluation process (with values of 0.1 and 1.0 being the most prevalent choices), a convention that is largely empirical rather than principled. However, recent research suggests that LLM performance exhibits non-trivial sensitivity to temperature settings, that lower temperatures do not universally yield optimal outcomes, and that such effects are highly task-dependent. This raises a critical research question: does temperature influence judge performance in LLM-centric evaluation? To address this, we systematically investigate the relationship between temperature and judge performance through a series of controlled experiments, and further adopt a causal inference framework within our empirical statistical analysis to rigorously examine the direct causal effect of temperature on judge behavior, offering actionable engineering insights for the design of LLM-centric evaluation pipelines.
摘要：法官法学硕士已成为评估文本质量和事实正确性的有效且低成本的范式。先前的研究表明，即使在难以自动评估的任务上，法学硕士法官和人类专家之间也存在实质性共识。在实践中，研究人员通常在评估过程中采用固定温度配置（最普遍的选择是 0.1 和 1.0），这种约定很大程度上是经验性的，而不是原则性的。然而，最近的研究表明，法学硕士的表现对温度设置表现出非同小可的敏感性，较低的温度并不普遍会产生最佳结果，而且这种影响高度依赖于任务。这就提出了一个关键的研究问题：温度是否会影响以法学硕士为中心的评估中的判断表现？为了解决这个问题，我们通过一系列对照实验系统地研究了温度与裁判表现之间的关系，并进一步在实证统计分析中采用因果推理框架来严格检验温度对裁判行为的直接因果影响，为以法学硕士为中心的评估流程的设计提供可操作的工程见解。
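
For readers unfamiliar with the parameter studied above, the sketch below shows the usual definition of sampling temperature: logits are divided by the temperature before the softmax, so lower values concentrate probability on the top-scoring token. The logits are made-up numbers for three hypothetical verdict tokens, and the function name is an assumption; this is not the paper's experimental code.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw logits into probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    verdict_logits = [2.0, 1.0, 0.0]  # hypothetical scores for three verdict tokens
    for t in (0.1, 1.0):              # the two settings the abstract calls most common
        print(t, softmax_with_temperature(verdict_logits, t))
```

At 0.1 the distribution is nearly deterministic, while at 1.0 the lower-scoring verdicts keep substantial probability mass, which is one reason judge outputs can vary between the two settings.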

Title: Kernel-Smith: A Unified Recipe for Evolutionary Kernel Optimization

Authors: He Du, Qiming Ge, Jiakai Hu, Aijun Yang, Zheng Cai, Zixian Huang, Sheng Yuan, Qinxiu Cheng, Xinchen Xie, Yicheng Chen, Yining Li, Jiaxing Xie, Huanan Dong, Yaguang Wu, Xiangjun Huang, Jian Yang, Hui Wang, Bowen Zhou, Bowen Li, Qipeng Guo, Kai Chen
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2603.28342
Pdf URL: https://arxiv.org/pdf/2603.28342
Copy Paste: [[2603.28342]] Kernel-Smith: A Unified Recipe for Evolutionary Kernel Optimization(https://arxiv.org/abs/2603.28342)
Keywords: llm, agent
Abstract: We present Kernel-Smith, a framework for high-performance GPU kernel and operator generation that combines a stable evaluation-driven evolutionary agent with an evolution-oriented post-training recipe. On the agent side, Kernel-Smith maintains a population of executable candidates and iteratively improves them using an archive of top-performing and diverse programs together with structured execution feedback on compilation, correctness, and speedup. To make this search reliable, we build backend-specific evaluation services for Triton on NVIDIA GPUs and MACA on MetaX GPUs. On the training side, we convert long-horizon evolution trajectories into step-centric supervision and reinforcement learning signals by retaining correctness-preserving, high-gain revisions, so that the model is optimized as a strong local improver inside the evolutionary loop rather than as a one-shot generator. Under a unified evolutionary protocol, Kernel-Smith-235B-RL achieves state-of-the-art overall performance on KernelBench with the NVIDIA Triton backend, attaining the best average speedup ratio and outperforming frontier proprietary models including Gemini-3.0-pro and Claude-4.6-opus. We further validate the framework on the MetaX MACA backend, where our Kernel-Smith-MACA-30B surpasses large-scale counterparts such as DeepSeek-V3.2-think and Qwen3-235B-2507-think, highlighting potential for seamless adaptation across heterogeneous platforms. Beyond benchmark results, the same workflow produces upstream contributions to production systems including SGLang and LMDeploy, demonstrating that LLM-driven kernel optimization can transfer from controlled evaluation to practical deployment.
摘要：我们提出了 Kernel-Smith，这是一个用于高性能 GPU 内核和算子生成的框架，它将稳定的评估驱动的进化代理与面向进化的后训练方法相结合。在代理方面，Kernel-Smith 维护了一批可执行候选者，并使用性能最佳且多样化的程序档案以及有关编译、正确性和加速的结构化执行反馈来迭代改进它们。为了使这种搜索可靠，我们为 NVIDIA GPU 上的 Triton 和 MetaX GPU 上的 Maca 构建了后端特定的评估服务。在训练方面，我们通过保留正确性保持、高增益修正，将长期进化轨迹转换为以步骤为中心的监督和强化学习信号，从而使模型被优化为进化循环内强大的局部改进器，而不是一次性生成器。在统一的进化协议下，Kernel-Smith-235B-RL在Nvidia Triton后端的KernelBench上实现了最先进的整体性能，获得了最佳的平均加速比，并超越了Gemini-3.0-pro和Claude-4.6-opus等前沿专有模型。我们进一步验证了 MetaX MACA 后端的框架，其中我们的 Kernel-Smith-MACA-30B 超越了 DeepSeek-V3.2-think 和 Qwen3-235B-2507-think 等大规模同行，凸显了跨异构平台无缝适应的潜力。除了基准测试结果之外，相同的工作流程还为生产系统（包括 SGLang 和 LMDeploy）做出了上游贡献，这表明 LLM 驱动的内核优化可以从受控评估转移到实际部署。

Title: Marco DeepResearch: Unlocking Efficient Deep Research Agents via Verification-Centric Design

Authors: Bin Zhu, Qianghuai Jia, Tian Lan, Junyang Ren, Feng Gu, Feihu Jiang, Longyue Wang, Zhao Xu, Weihua Luo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.28376
Pdf URL: https://arxiv.org/pdf/2603.28376
Copy Paste: [[2603.28376]] Marco DeepResearch: Unlocking Efficient Deep Research Agents via Verification-Centric Design(https://arxiv.org/abs/2603.28376)
Keywords: agent
Abstract: Deep research agents autonomously conduct open-ended investigations, integrating complex information retrieval with multi-step reasoning across diverse sources to solve real-world problems. To sustain this capability on long-horizon tasks, reliable verification is critical during both training and inference. A major bottleneck in existing paradigms stems from the lack of explicit verification mechanisms in QA data synthesis, trajectory construction, and test-time scaling. Errors introduced at each stage propagate downstream and degrade the overall agent performance. To address this, we present Marco DeepResearch, a deep research agent optimized with a verification-centric framework design at three levels: (1) QA Data Synthesis: We introduce verification mechanisms to graph-based and agent-based QA synthesis to control question difficulty while ensuring answers are unique and correct; (2) Trajectory Construction: We design a verification-driven trajectory synthesis method that injects explicit verification patterns into training trajectories; and (3) Test-time scaling: We use Marco DeepResearch itself as a verifier at inference time and effectively improve performance on challenging questions. Extensive experimental results demonstrate that our proposed Marco DeepResearch agent significantly outperforms 8B-scale deep research agents on most challenging benchmarks, such as BrowseComp and BrowseComp-ZH. Crucially, under a maximum budget of 600 tool calls, Marco DeepResearch even surpasses or approaches several 30B-scale agents, like Tongyi DeepResearch-30B.
摘要：深度研究代理自主地进行开放式调查，将复杂的信息检索与跨不同来源的多步骤推理相结合，以解决现实世界的问题。为了在长期任务中维持这种能力，在训练和推理过程中可靠的验证至关重要。现有范例的主要瓶颈源于 QA 数据合成、轨迹构建和测试时间扩展方面缺乏明确的验证机制。每个阶段引入的错误都会向下游传播并降低代理的整体性能。为了解决这个问题，我们推出了 Marco DeepResearch，一种深度研究代理，在三个层面上以验证为中心的框架设计进行了优化：(1) QA 数据合成：我们将验证机制引入基于图和基于代理的 QA 合成中，以控制问题难度，同时确保答案唯一且正确；(2) 轨迹构建：我们设计了一种验证驱动的轨迹合成方法，将显式验证模式注入训练轨迹中；以及 (3) 测试时间缩放：我们使用 Marco DeepResearch 本身作为推理时间的验证器，并有效提高在挑战性问题上的性能。大量的实验结果表明，我们提出的 Marco DeepResearch 代理在最具挑战性的基准（例如 BrowseComp 和 BrowseComp-ZH）上显着优于 8B 规模的深度研究代理。至关重要的是，在 600 次工具调用的最大预算下，Marco DeepResearch 甚至超过或接近了几个 30B 规模的代理，例如 Tongyi DeepResearch-30B。

Title: Courtroom-Style Multi-Agent Debate with Progressive RAG and Role-Switching for Controversial Claim Verification

Authors: Masnun Nuha Chowdhury, Nusrat Jahan Beg, Umme Hunny Khan, Syed Rifat Raiyan, Md Kamrul Hasan, Hasan Mahmud
Subjects: cs.CL, cs.AI, cs.MA
Abstract URL: https://arxiv.org/abs/2603.28488
Pdf URL: https://arxiv.org/pdf/2603.28488
Copy Paste: [[2603.28488]] Courtroom-Style Multi-Agent Debate with Progressive RAG and Role-Switching for Controversial Claim Verification(https://arxiv.org/abs/2603.28488)
Keywords: language model, llm, hallucination, retrieval-augmented generation, agent
Abstract: Large language models (LLMs) remain unreliable for high-stakes claim verification due to hallucinations and shallow reasoning. While retrieval-augmented generation (RAG) and multi-agent debate (MAD) address this, they are limited by one-pass retrieval and unstructured debate dynamics. We propose a courtroom-style multi-agent framework, PROClaim, that reformulates verification as a structured, adversarial deliberation. Our approach integrates specialized roles (e.g., Plaintiff, Defense, Judge) with Progressive RAG (P-RAG) to dynamically expand and refine the evidence pool during the debate. Furthermore, we employ evidence negotiation, self-reflection, and heterogeneous multi-judge aggregation to enforce calibration, robustness, and diversity. In zero-shot evaluations on the Check-COVID benchmark, PROClaim achieves 81.7% accuracy, outperforming standard multi-agent debate by 10.0 percentage points, with P-RAG driving the primary performance gains (+7.5 pp). We ultimately demonstrate that structural deliberation and model heterogeneity effectively mitigate systematic biases, providing a robust foundation for reliable claim verification. Our code and data are publicly available at this https URL.
摘要：由于幻觉和肤浅的推理，大型语言模型（LLM）对于高风险的索赔验证仍然不可靠。虽然检索增强生成（RAG）和多智能体辩论（MAD）解决了这个问题，但它们受到一次性检索和非结构化辩论动态的限制。我们提出了一个法庭式的多主体框架 PROClaim，它将验证重新表述为结构化的、对抗性的审议。我们的方法将专业角色（例如原告、辩方、法官）与渐进式 RAG (P-RAG) 相结合，以在辩论期间动态扩展和完善证据库。此外，我们采用证据协商、自我反思和异构多法官聚合来加强校准、稳健性和多样性。在 Check-COVID 基准的零样本评估中，PROClaim 的准确率达到 81.7%，比标准多智能体辩论高出 10.0 个百分点，其中 P-RAG 推动了主要性能提升（+7.5 个百分点）。我们最终证明，结构性审议和模型异质性有效地减轻了系统偏差，为可靠的索赔验证提供了坚实的基础。我们的代码和数据可通过此 https URL 公开获取。

Title: EarlySciRev: A Dataset of Early-Stage Scientific Revisions Extracted from LaTeX Writing Traces

Authors: Léane Jourdan, Julien Aubert-Béduchaud, Yannis Chupin, Marah Baccari, Florian Boudin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.28515
Pdf URL: https://arxiv.org/pdf/2603.28515
Copy Paste: [[2603.28515]] EarlySciRev: A Dataset of Early-Stage Scientific Revisions Extracted from LaTeX Writing Traces(https://arxiv.org/abs/2603.28515)
Keywords: language model, llm
Abstract: Scientific writing is an iterative process that generates rich revision traces, yet publicly available resources typically expose only final or near-final versions of papers. This limits empirical study of revision behaviour and evaluation of large language models (LLMs) for scientific writing. We introduce EarlySciRev, a dataset of early-stage scientific text revisions automatically extracted from arXiv LaTeX source files. Our key observation is that commented-out text in LaTeX often preserves discarded or alternative formulations written by the authors themselves. By aligning commented segments with nearby final text, we extract paragraph-level candidate revision pairs and apply LLM-based filtering to retain genuine revisions. Starting from 1.28M candidate pairs, our pipeline yields 578k validated revision pairs, grounded in authentic early drafting traces. We additionally provide a human-annotated benchmark for revision detection. EarlySciRev complements existing resources focused on late-stage revisions or synthetic rewrites and supports research on scientific writing dynamics, revision modelling, and LLM-assisted editing.
摘要：科学写作是一个迭代过程，会产生丰富的修订痕迹，但公开可用的资源通常只公开论文的最终或接近最终版本。这限制了科学写作大语言模型（LLM）的修改行为和评估的实证研究。我们介绍 EarlySciRev，这是一个从 arXiv LaTeX 源文件中自动提取的早期科学文本修订数据集。我们的主要观察结果是，LaTeX 中的注释文本通常会保留作者自己编写的废弃或替代公式。通过将注释片段与附近的最终文本对齐，我们提取段落级候选修订对并应用基于 LLM 的过滤来保留真正的修订。从 128 万个候选对开始，我们的管道产生了 578k 个经过验证的修订对，这些修订对基于真实的早期起草痕迹。我们还提供了用于修订检测的人工注释基准。 EarlySciRev 补充了专注于后期修订或综合重写的现有资源，并支持科学写作动态、修订建模和法学硕士辅助编辑的研究。
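
The extraction idea described above (commented-out LaTeX text aligned with nearby final text) can be sketched in a few lines. This is a deliberately naive illustration, not the authors' pipeline: it pairs a run of `%`-commented lines with the first non-empty line that follows in the same paragraph, and the function name `candidate_pairs` is a hypothetical helper. The LLM-based filtering step mentioned in the abstract is not modelled.

```python
import re

# A full-line LaTeX comment: optional indentation, one or more '%', then the text.
COMMENT_LINE = re.compile(r"^\s*%+\s?(.*)$")


def candidate_pairs(latex_source: str):
    """Return (commented_text, following_text) pairs found in a LaTeX source string."""
    pairs = []
    pending = []  # commented lines waiting for a following line of final text
    for line in latex_source.splitlines():
        match = COMMENT_LINE.match(line)
        if match:
            pending.append(match.group(1).strip())
        elif line.strip():
            old_text = " ".join(part for part in pending if part)
            if old_text:
                pairs.append((old_text, line.strip()))
            pending = []
        else:
            # A blank line ends the paragraph; drop comments with no final text after them.
            pending = []
    return pairs


if __name__ == "__main__":
    sample = (
        "% We propose a new method.\n"
        "We introduce a novel method.\n"
        "\n"
        "Other text.\n"
    )
    print(candidate_pairs(sample))
```

Many commented lines are notes or disabled macros rather than earlier drafts, which is consistent with the abstract's report that only 578k of 1.28M candidate pairs survive filtering.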

Title: GraphWalker: Agentic Knowledge Graph Question Answering via Synthetic Trajectory Curriculum

Authors: Shuwen Xu, Yao Xu, Jiaxiang Liu, Chenhao Yuan, Wenshuo Peng, Jun Zhao, Kang Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.28533
Pdf URL: https://arxiv.org/pdf/2603.28533
Copy Paste: [[2603.28533]] GraphWalker: Agentic Knowledge Graph Question Answering via Synthetic Trajectory Curriculum(https://arxiv.org/abs/2603.28533)
Keywords: prompt, agent
Abstract: Agentic knowledge graph question answering (KGQA) requires an agent to iteratively interact with knowledge graphs (KGs), posing challenges in both training data scarcity and reasoning generalization. Specifically, existing approaches often restrict agent exploration: prompting-based methods lack autonomous navigation training, while current training pipelines usually confine reasoning to predefined trajectories. To this end, this paper proposes GraphWalker, a novel agentic KGQA framework that addresses these challenges through Automated Trajectory Synthesis and Stage-wise Fine-tuning. GraphWalker adopts a two-stage SFT training paradigm: first, the agent is trained on structurally diverse trajectories synthesized from constrained random-walk paths, establishing a broad exploration prior over the KG; second, the agent is further fine-tuned on a small set of expert trajectories to develop reflection and error recovery capabilities. Extensive experiments demonstrate that our stage-wise SFT paradigm unlocks a higher performance ceiling for a lightweight reinforcement learning (RL) stage, enabling GraphWalker to achieve state-of-the-art performance on CWQ and WebQSP. Additional results on GrailQA and our constructed GraphWalkerBench confirm that GraphWalker enhances generalization to out-of-distribution reasoning paths. The code is publicly available at this https URL.
摘要：代理知识图问答（KGQA）需要代理与知识图（KG）迭代交互，这在训练数据稀缺性和推理泛化方面提出了挑战。具体来说，现有方法通常限制代理探索：基于提示的方法缺乏自主导航训练，而当前的训练管道通常将推理限制在预定义的轨迹内。为此，本文提出了 GraphWalker，一种新颖的代理 KGQA 框架，它通过自动轨迹合成和阶段式微调来解决这些挑战。 GraphWalker 采用两阶段 SFT 训练范式：首先，智能体在由约束随机游走路径合成的结构多样的轨迹上进行训练，建立覆盖 KG 的广泛探索先验；其次，代理在一小部分专家轨迹上进行进一步微调，以开发反思和错误恢复能力。大量实验表明，我们的阶段式 SFT 范式为轻量级强化学习 (RL) 阶段解锁了更高的性能上限，使 GraphWalker 能够在 CWQ 和 WebQSP 上实现最先进的性能。 GrailQA 和我们构建的 GraphWalkerBench 的其他结果证实 GraphWalker 增强了对分布外推理路径的泛化。该代码可在此 https URL 公开获取。

Title: Compressing Transformer Language Models via Matrix Product Operator Decomposition: A Case Study on PicoGPT

Authors: Younes Javanmard, Tanmoy Pandit, Masoud Mardani
Subjects: cs.CL, physics.data-an
Abstract URL: https://arxiv.org/abs/2603.28534
Pdf URL: https://arxiv.org/pdf/2603.28534
Copy Paste: [[2603.28534]] Compressing Transformer Language Models via Matrix Product Operator Decomposition: A Case Study on PicoGPT(https://arxiv.org/abs/2603.28534)
Keywords: language model, gpt
Abstract: Transformer-based language models achieve strong performance across NLP tasks, but their quadratic parameter scaling with hidden dimension makes deployment on resource-constrained hardware expensive. We study Matrix Product Operator (MPO) decomposition as a principled compression method for transformers. MPO factorises weight matrices into chains of low-rank cores, with approximation quality controlled by the bond dimension chi. We replace every linear layer in PicoGPT, a GPT-2-style character-level language model with about 1M parameters, with an MPOLinear module parameterised as an MPO chain. Cores are initialised either by TT-SVD from pretrained dense weights or from random initialisation, and trained using standard PyTorch autograd without a custom backward pass. We derive balanced factorisation schemes for the five distinct weight shapes in PicoGPT and evaluate bond dimensions chi in {4, 8, 16, 32} on Tiny Shakespeare. MPO compression achieves up to 13x compression per transformer block at chi = 4. At chi = 16, the model uses 191,872 parameters instead of 1,020,224 while retaining 97.7% of baseline token accuracy (51.6% vs 52.8%). Reconstruction error follows the expected trend and is lower for three-site than two-site factorisations at the same bond dimension. The chi = 8 model gives the best accuracy per parameter, exceeding the dense baseline by 2.7x on this metric. These results show that MPO parameterisation is a practical and theoretically grounded alternative to low-rank methods and unstructured pruning for transformer compression.
摘要：基于 Transformer 的语言模型在 NLP 任务中实现了强大的性能，但其具有隐藏维度的二次参数缩放使得在资源受限的硬件上的部署成本高昂。我们研究矩阵乘积运算符（MPO）分解作为变压器的原则压缩方法。 MPO 将权重矩阵分解为低阶核链，近似质量由键维 chi 控制。我们用参数化为 MPO 链的 MPOLinear 模块替换 PicoGPT（一种具有大约 1M 参数的 GPT-2 风格字符级语言模型）中的每个 http URL 层。核心通过 TT-SVD 从预训练的密集权重或随机初始化进行初始化，并使用标准 PyTorch autograd 进行训练，无需自定义反向传递。我们推导出 PicoGPT 中五种不同权重形状的平衡因式分解方案，并评估 Tiny Shakespeare 上 {4,8,16,32} 中的键维度 chi。 MPO 压缩在 chi = 4 时可实现每个变压器块高达 13 倍的压缩。在 chi = 16 时，模型使用 191,872 个参数而不是 1,020,224 个参数，同时保留 97.7% 的基线令牌精度（51.6% 与 52.8%）。重构误差遵循预期趋势，并且在相同键维数下，三位点分解的重构误差低于两位点分解的重构误差。 chi = 8 模型为每个参数提供了最佳准确度，在该指标上超出了密集基线 2.7 倍。这些结果表明，MPO 参数化是低秩方法和变压器压缩非结构化修剪的实用且有理论基础的替代方案。
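Example (illustrative, not the authors' code): the abstract describes factorising a weight matrix into a chain of cores whose quality is controlled by the bond dimension chi. A minimal two-site sketch of that idea in NumPy follows. The function names `mpo_two_site` and `mpo_to_dense`, the 16x16 toy matrix, and the (4, 4) x (4, 4) index split are assumptions made for illustration; the paper's MPOLinear module, its balanced factorisation schemes, and its three-site variants are not reproduced here.

```python
import numpy as np


def mpo_two_site(W, out_dims, in_dims, chi):
    """Split W (o1*o2 x i1*i2) into two cores joined by a bond of size <= chi.

    Hypothetical helper: a single truncated SVD, i.e. the two-site case of TT-SVD.
    """
    o1, o2 = out_dims
    i1, i2 = in_dims
    assert W.shape == (o1 * o2, i1 * i2)
    # Regroup indices so that (o1, i1) forms site 1 and (o2, i2) forms site 2.
    T = W.reshape(o1, o2, i1, i2).transpose(0, 2, 1, 3).reshape(o1 * i1, o2 * i2)
    U, S, Vt = np.linalg.svd(T, full_matrices=False)
    r = min(chi, S.size)  # bond dimension actually used
    A = (U[:, :r] * S[:r]).reshape(o1, i1, r)  # first core, singular values absorbed
    B = Vt[:r].reshape(r, o2, i2)              # second core
    return A, B


def mpo_to_dense(A, B):
    """Contract the bond index and restore the original (out, in) layout."""
    o1, i1, _ = A.shape
    _, o2, i2 = B.shape
    return np.einsum("aic,cbj->abij", A, B).reshape(o1 * o2, i1 * i2)


rng = np.random.default_rng(0)
W = rng.standard_normal((16, 16))  # toy stand-in for a dense weight matrix
for chi in (2, 4, 8, 16):
    A, B = mpo_two_site(W, (4, 4), (4, 4), chi)
    rel_err = np.linalg.norm(W - mpo_to_dense(A, B)) / np.linalg.norm(W)
    print(f"chi={chi:2d} params={A.size + B.size:3d} rel_err={rel_err:.3f}")
```

In this sketch a smaller chi means fewer parameters and a larger reconstruction error, which is the trade-off the abstract reports. For scale, the chi = 16 figures quoted in the abstract (191,872 versus 1,020,224 parameters) correspond to a whole-model ratio of roughly 5.3x.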

Title: Training data generation for context-dependent rubric-based short answer grading

Authors: Pavel Šindelář, Dávid Slivka, Christopher Bouma, Filip Prášil, Ondřej Bojar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.28537
Pdf URL: https://arxiv.org/pdf/2603.28537
Copy Paste: [[2603.28537]] Training data generation for context-dependent rubric-based short answer grading(https://arxiv.org/abs/2603.28537)
Keywords: prompt
Abstract: Every 4 years, the PISA test is administered by the OECD to test the knowledge of teenage students worldwide and allow for comparisons of educational systems. However, having to avoid language differences and annotator bias makes the grading of student answers challenging. For these reasons, it would be interesting to compare methods of automatic student answer grading. To train some of these methods, which require machine learning, or to compute parameters or select hyperparameters for those that do not, a large amount of domain-specific data is needed. In this work, we explore a small number of methods for creating a large-scale training dataset using only a relatively small confidential dataset as a reference, leveraging a set of very simple derived text formats to preserve confidentiality. Using these methods, we successfully created three surrogate datasets that are, at the very least, superficially more similar to the reference dataset than purely the result of prompt-based generation. Early experiments suggest one of these approaches might also lead to improved model training.
摘要：经合组织每四年举办一次 PISA 测试，旨在测试世界各地青少年学生的知识水平，并可对教育体系进行比较。然而，必须避免语言差异和注释者偏见使得对学生答案的评分变得具有挑战性。由于这些原因，比较自动学生答案评分的方法将会很有趣。为了训练其中一些需要机器学习的方法，或者为不需要机器学习的方法计算参数或选择超参数，需要大量特定于领域的数据。在这项工作中，我们探索了一些方法来创建大规模训练数据集，仅使用相对较小的机密数据集作为参考，利用一组非常简单的派生文本格式来保护机密性。使用这些方法，我们成功创建了三个代理数据集，这些数据集至少在表面上与参考数据集更相似，而不是纯粹基于提示的生成结果。早期实验表明，其中一种方法也可能改善模型训练。

Title: EpiScreen: Early Epilepsy Detection from Electronic Health Records with Large Language Models

Authors: Shuang Zhou, Kai Yu, Zaifu Zhan, Huixue Zhou, Min Zeng, Feng Xie, Zhiyi Sha, Rui Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.28698
Pdf URL: https://arxiv.org/pdf/2603.28698
Copy Paste: [[2603.28698]] EpiScreen: Early Epilepsy Detection from Electronic Health Records with Large Language Models(https://arxiv.org/abs/2603.28698)
Keywords: language model
Abstract: Epilepsy and psychogenic non-epileptic seizures often present with similar seizure-like manifestations but require fundamentally different management strategies. Misdiagnosis is common and can lead to prolonged diagnostic delays, unnecessary treatments, and substantial patient morbidity. Although prolonged video-electroencephalography is the diagnostic gold standard, its high cost and limited accessibility hinder timely diagnosis. Here, we developed a low-cost, effective approach, EpiScreen, for early epilepsy detection by utilizing routinely collected clinical notes from electronic health records. Through fine-tuning large language models on labeled notes, EpiScreen achieved an AUC of up to 0.875 on the MIMIC-IV dataset and 0.980 on a private cohort of the University of Minnesota. In a clinician-AI collaboration setting, EpiScreen-assisted neurologists outperformed unaided experts by up to 10.9%. Overall, this study demonstrates that EpiScreen supports early epilepsy detection, facilitating timely and cost-effective screening that may reduce diagnostic delays and avoid unnecessary interventions, particularly in resource-limited regions.
摘要：癫痫和心因性非癫痫发作通常呈现相似的癫痫样表现，但需要根本不同的治疗策略。误诊很常见，可能导致长时间的诊断延误、不必要的治疗以及大量的患者发病率。尽管长时间的视频脑电图是诊断的黄金标准，但其高昂的成本和有限的可及性阻碍了及时诊断。在这里，我们开发了一种低成本、有效的方法 EpiScreen，通过利用从电子健康记录中定期收集的临床记录来进行早期癫痫检测。通过对标记笔记上的大型语言模型进行微调，EpiScreen 在 MIMIC-IV 数据集上的 AUC 达到了 0.875，在明尼苏达大学的私人队列上达到了 0.980。在临床医生与 AI 协作环境中，EpiScreen 辅助的神经科医生的表现比没有辅助的专家高出 10.9%。总体而言，这项研究表明 EpiScreen 支持早期癫痫检测，促进及时且经济有效的筛查，可以减少诊断延误并避免不必要的干预，特别是在资源有限的地区。

Title: Adaptive Block-Scaled Data Types

Authors: Jack Cook, Hyemin S. Lee, Kathryn Le, Junxian Guo, Giovanni Traverso, Anantha P. Chandrakasan, Song Han
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.28765
Pdf URL: https://arxiv.org/pdf/2603.28765
Copy Paste: [[2603.28765]] Adaptive Block-Scaled Data Types(https://arxiv.org/abs/2603.28765)
Keywords: language model
Abstract: NVFP4 has grown increasingly popular as a 4-bit format for quantizing large language models due to its hardware support and its ability to retain useful information with relatively few bits per parameter. However, the format is not without limitations: recent work has shown that NVFP4 suffers from its error distribution, resulting in large amounts of quantization error on near-maximal values in each group of 16 values. In this work, we leverage this insight to design new Adaptive Block-Scaled Data Types that can adapt to the distribution of their input values. For four-bit quantization, our proposed IF4 (Int/Float 4) data type selects between FP4 and INT4 representations for each group of 16 values, which are then scaled by an E4M3 scale factor as is done with NVFP4. The selected data type is denoted using the scale factor's sign bit, which is currently unused in NVFP4, and we apply the same insight to design formats for other bit-widths, including IF3 and IF6. When used to quantize language models, we find that IF4 outperforms existing 4-bit block-scaled formats, achieving lower loss during quantized training and achieving higher accuracy on many tasks in post-training quantization. We additionally design and evaluate an IF4 Multiply-Accumulate (MAC) unit to demonstrate that IF4 can be implemented efficiently in next-generation hardware accelerators. Our code is available at this https URL.
摘要：由于其硬件支持以及能够以每个参数相对较少的位数保留有用信息的能力，NVFP4 作为一种用于量化大型语言模型的 4 位格式越来越受欢迎。然而，该格式并非没有限制：最近的工作表明，NVFP4 受到其误差分布的影响，导致每组 16 个值的接近最大值上存在大量量化误差。在这项工作中，我们利用这种洞察力来设计新的自适应块级数据类型，可以适应其输入值的分布。对于四位量化，我们建议的 IF4 (Int/Float 4) 数据类型为每组 16 个值在 FP4 和 INT4 表示之间进行选择，然后按 E4M3 比例因子进行缩放，就像使用 NVFP4 一样。所选数据类型使用比例因子的符号位表示，目前在 NVFP4 中未使用该符号位，并且我们将相同的见解应用于其他位宽（包括 IF3 和 IF6）的设计格式。当用于量化语言模型时，我们发现 IF4 优于现有的 4 位块缩放格式，在量化训练期间实现了更低的损失，并在训练后量化的许多任务中实现了更高的准确性。我们还设计并评估了 IF4 乘法累加 (MAC) 单元，以证明 IF4 可以在下一代硬件加速器中高效实现。我们的代码可以在这个 https URL 上找到。
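Example (illustrative, not the authors' code): the abstract says IF4 picks an FP4 or INT4 representation for each group of 16 values and records the choice in the scale factor's sign bit. The sketch below mimics only the per-group selection. Several details are assumptions not taken from the abstract: the INT4 grid is taken as the symmetric range -7..7, the scale is the group's absolute maximum divided by the grid maximum, the format with the lower squared error wins, the scale is kept in full precision rather than rounded to E4M3, and the sign-bit flag is represented as a plain boolean. All function names are hypothetical.

```python
import numpy as np

# Magnitudes representable by FP4 in the E2M1 layout, mirrored to signed values.
FP4_POS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
FP4_GRID = np.concatenate([-FP4_POS[:0:-1], FP4_POS])
# Assumption: symmetric INT4 grid (the abstract does not state the range).
INT4_GRID = np.arange(-7, 8, dtype=np.float64)


def quantize_to_grid(x, grid):
    """Scale x so its absolute maximum hits the grid maximum, then round to nearest."""
    absmax = np.max(np.abs(x))
    if absmax == 0.0:
        return np.zeros_like(x)
    scale = absmax / np.max(grid)  # assumption: full-precision scale, no E4M3 rounding
    idx = np.abs(x[:, None] / scale - grid[None, :]).argmin(axis=1)
    return grid[idx] * scale


def quantize_group(x):
    """Return (dequantized group, use_int flag) using a lowest-squared-error rule."""
    as_fp = quantize_to_grid(x, FP4_GRID)
    as_int = quantize_to_grid(x, INT4_GRID)
    use_int = np.sum((x - as_int) ** 2) < np.sum((x - as_fp) ** 2)
    return (as_int if use_int else as_fp), bool(use_int)


def quantize_if4_like(values, group_size=16):
    """Apply the per-group choice to a 1-D array whose length is a multiple of group_size."""
    groups = np.asarray(values, dtype=np.float64).reshape(-1, group_size)
    results = [quantize_group(g) for g in groups]
    dequantized = np.concatenate([r[0] for r in results])
    flags = np.array([r[1] for r in results])
    return dequantized, flags


rng = np.random.default_rng(0)
x = np.concatenate([rng.uniform(-1, 1, 16), rng.standard_normal(16)])
xq, flags = quantize_if4_like(x)
print("INT4 chosen per group:", flags)
print("mean squared error:", np.mean((x - xq) ** 2))
```

Because each group keeps whichever candidate has the lower error, the result under this selection rule is never worse than using either grid alone for that group.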