2026-03-24

Title: Enhancing Safety of Large Language Models via Embedding Space Separation

Authors: Xu Zhao, Xiting Wang, Weiran Shen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.20206
Pdf URL: https://arxiv.org/pdf/2603.20206
Copy Paste: [[2603.20206]] Enhancing Safety of Large Language Models via Embedding Space Separation(https://arxiv.org/abs/2603.20206)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) have achieved impressive capabilities, yet ensuring their safety against harmful prompts remains a critical challenge. Recent work has revealed that the latent representations (embeddings) of harmful and safe queries in LLMs typically exhibit linear separability, a property that has been exploited to construct attacks by perturbing the embeddings of harmful queries towards the safe subspace. Motivated by this observation, we propose a representation-level fine-tuning approach, named Embedding Space Separation (ES2), which improves LLM safety by explicitly enlarging the distance between harmful and safe representations in the embedding space. To prevent degradation of model's general capabilities, we introduce a Kullback-Leibler (KL) divergence regularization term into the loss function, which constrains the logits of the fine-tuned model to align with those of the original base model on harmless inputs. We evaluate our method on several open-source LLMs using standard safety benchmarks. Extensive experimental results demonstrate that our approach substantially improves model safety while maintaining comparable general capabilities.
摘要：大型语言模型 (LLM) 已经取得了令人印象深刻的能力，但确保其针对有害提示的安全仍然是一个严峻的挑战。最近的工作表明，LLM 中有害和安全查询的潜在表示（嵌入）通常表现出线性可分离性，这种特性已被用来通过扰乱有害查询对安全子空间的嵌入来构造攻击。受这一观察的启发，我们提出了一种表示级微调方法，称为嵌入空间分离（ES2），该方法通过显式扩大嵌入空间中有害表示和安全表示之间的距离来提高 LLM 安全性。为了防止模型的一般能力下降，我们在损失函数中引入了 Kullback-Leibler (KL) 散度正则化项，它限制微调模型的 logits 与无害输入上的原始基本模型的 logits 保持一致。我们使用标准安全基准在几个开源法学硕士上评估我们的方法。大量的实验结果表明，我们的方法大大提高了模型的安全性，同时保持了可比的一般能力。
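
The objective described in this abstract can be illustrated with a minimal sketch. The abstract does not give the loss, so everything below is an assumption: the distance is taken between the mean harmful and mean safe embeddings, the KL term is computed from base-model and fine-tuned logits on harmless inputs, and `lam` is a hypothetical weight. It is not the authors' implementation.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def kl_divergence(base_logits, tuned_logits):
    # Mean KL(base || tuned) over a batch of harmless inputs.
    log_p = log_softmax(base_logits)
    log_q = log_softmax(tuned_logits)
    return float(np.mean(np.sum(np.exp(log_p) * (log_p - log_q), axis=-1)))

def separation_loss(harmful_emb, safe_emb, base_logits, tuned_logits, lam=1.0):
    # Assumed form: reward a larger gap between class means,
    # penalize drift from the base model on harmless inputs.
    gap = np.linalg.norm(harmful_emb.mean(axis=0) - safe_emb.mean(axis=0))
    return float(-gap + lam * kl_divergence(base_logits, tuned_logits))
```

Minimizing this quantity pushes the two groups of embeddings apart, while the KL term grows as soon as the fine-tuned logits move away from the base model's logits.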

Title: RedacBench: Can AI Erase Your Secrets?

Authors: Hyunjun Jeon, Kyuyoung Kim, Jinwoo Shin
Subjects: cs.CL, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2603.20208
Pdf URL: https://arxiv.org/pdf/2603.20208
Copy Paste: [[2603.20208]] RedacBench: Can AI Erase Your Secrets?(https://arxiv.org/abs/2603.20208)
Keywords: language model
Abstract: Modern language models can readily extract sensitive information from unstructured text, making redaction -- the selective removal of such information -- critical for data security. However, existing benchmarks for redaction typically focus on predefined categories of data such as personally identifiable information (PII) or evaluate specific techniques like masking. To address this limitation, we introduce RedacBench, a comprehensive benchmark for evaluating policy-conditioned redaction across domains and strategies. Constructed from 514 human-authored texts spanning individual, corporate, and government sources, paired with 187 security policies, RedacBench measures a model's ability to selectively remove policy-violating information while preserving the original semantics. We quantify performance using 8,053 annotated propositions that capture all inferable information in each text. This enables assessment of both security -- the removal of sensitive propositions -- and utility -- the preservation of non-sensitive propositions. Experiments across multiple redaction strategies and state-of-the-art language models show that while more advanced models can improve security, preserving utility remains a challenge. To facilitate future research, we release RedacBench along with a web-based playground for dataset customization and evaluation. Available at this https URL.
摘要：现代语言模型可以轻松地从非结构化文本中提取敏感信息，因此编辑（选择性删除此类信息）对于数据安全至关重要。然而，现有的修订基准通常侧重于预定义的数据类别，例如个人身份信息 (PII) 或评估屏蔽等特定技术。为了解决这一限制，我们引入了 RedacBench，这是一个用于评估跨域和策略的策略条件编辑的综合基准。 RedacBench 由涵盖个人、公司和政府来源的 514 篇人工撰写的文本构成，并搭配 187 项安全策略，衡量模型选择性删除违反策略的信息同时保留原始语义的能力。我们使用 8,053 个带注释的命题来量化性能，这些命题捕获了每个文本中的所有可推断信息。这使得能够评估安全性（删除敏感命题）和实用性（保留非敏感命题）。多种编辑策略和最先进的语言模型的实验表明，虽然更先进的模型可以提高安全性，但保持实用性仍然是一个挑战。为了促进未来的研究，我们发布了 RedacBench 以及用于数据集定制和评估的基于网络的游乐场。在此 https URL 中可用。

Title: Children's Intelligence Tests Pose Challenges for MLLMs? KidGym: A 2D Grid-Based Reasoning Benchmark for MLLMs

Authors: Hengwei Ye, Yuanting Guan, Yuxuan Ge, Tianying Zhu, Zhenhan Guan, Yijia Zhong, Yijing Zhang, Han Zhang, Yingna Wu, Zheng Tian
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.20209
Pdf URL: https://arxiv.org/pdf/2603.20209
Copy Paste: [[2603.20209]] Children's Intelligence Tests Pose Challenges for MLLMs? KidGym: A 2D Grid-Based Reasoning Benchmark for MLLMs(https://arxiv.org/abs/2603.20209)
Keywords: language model, llm
Abstract: Multimodal Large Language Models (MLLMs) combine the linguistic strengths of LLMs with the ability to process multimodal data, enabling them to address a broader range of visual tasks. Because MLLMs aim at more general, human-like competence than language-only models, we take inspiration from the Wechsler Intelligence Scales - an established battery for evaluating children by decomposing intelligence into interpretable, testable abilities. We introduce KidGym, a comprehensive 2D grid-based benchmark for assessing five essential capabilities of MLLMs: Execution, Perception Reasoning, Learning, Memory and Planning. The benchmark comprises 12 unique tasks, each targeting at least one core capability, specifically designed to gauge MLLMs' adaptability and developmental potential, mirroring the stages of children's cognitive growth. Additionally, our tasks encompass diverse scenarios and objects with randomly generated layouts, ensuring a more accurate and robust evaluation of MLLM capabilities. KidGym is designed to be fully user-customizable and extensible, allowing researchers to create new evaluation scenarios and adjust difficulty levels to accommodate the rapidly growing MLLM community. Through the evaluation of state-of-the-art MLLMs using KidGym, we identified significant insights into model capabilities and revealed several limitations of current models. We release our benchmark at: this https URL.
摘要：多模态大语言模型 (MLLM) 将法学硕士的语言优势与处理多模态数据的能力相结合，使其能够解决更广泛的视觉任务。由于 MLLM 的目标是比纯语言模型更通用、更接近人类的能力，因此我们从韦克斯勒智力量表中汲取灵感，韦克斯勒智力量表是一种通过将智力分解为可解释、可测试的能力来评估儿童的成熟电池。我们推出 KidGym，这是一个基于 2D 网格的综合基准测试，用于评估 MLLM 的五项基本功能：执行、感知推理、学习、记忆和规划。该基准包括 12 项独特的任务，每项任务至少针对一项核心能力，专门用于衡量 MLLM 的适应性和发展潜力，反映儿童认知成长的阶段。此外，我们的任务涵盖具有随机生成布局的不同场景和对象，确保对 MLLM 功能进行更准确、更稳健的评估。 KidGym 的设计是完全用户可定制和可扩展的，允许研究人员创建新的评估场景并调整难度级别，以适应快速增长的 MLLM 社区。通过使用 KidGym 对最先进的 MLLM 进行评估，我们发现了对模型功能的重要见解，并揭示了当前模型的一些局限性。我们在以下位置发布了基准测试：此 https URL。

Title: Fast-Slow Thinking RM: Efficient Integration of Scalar and Generative Reward Models

Authors: Jiayun Wu, Peixu Hou, Shan Qu, Peng Zhang, Ning Gu, Tun Lu
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2603.20212
Pdf URL: https://arxiv.org/pdf/2603.20212
Copy Paste: [[2603.20212]] Fast-Slow Thinking RM: Efficient Integration of Scalar and Generative Reward Models(https://arxiv.org/abs/2603.20212)
Keywords: language model, chain-of-thought
Abstract: Reward models (RMs) are critical for aligning Large Language Models via Reinforcement Learning from Human Feedback (RLHF). While Generative Reward Models (GRMs) achieve superior accuracy through chain-of-thought (CoT) reasoning, they incur substantial computational costs. Conversely, Scalar Reward Models (SRMs) offer efficiency but suffer from limited performance and adaptability in complex scenarios. We introduce Fast-Slow Thinking Reward Models (F/S-RM), a hybrid RM architecture inspired by Dual Process Theory. It trains a single model to integrate two distinct reward paradigms: first-token prediction as a scalar score (fast thinking) and CoT-based judgment (slow thinking), regulated by a dual-confidence activation mechanism that determines when to activate slow thinking. F/S-RM achieves a 1.2% relative performance improvement over state-of-the-art models while reducing token consumption by 20.8%. Code and data will be publicly available.
摘要：奖励模型 (RM) 对于通过人类反馈强化学习 (RLHF) 调整大型语言模型至关重要。虽然生成奖励模型 (GRM) 通过思想链 (CoT) 推理实现了卓越的准确性，但它们会产生大量的计算成本。相反，标量奖励模型（SRM）提供了效率，但在复杂场景中的性能和适应性有限。我们引入快-慢思考奖励模型（F/S-RM），这是一种受双过程理论启发的混合 RM 架构。它训练单个模型来集成两种不同的奖励范式：作为标量分数的第一个令牌预测（快速思维）和基于 CoT 的判断（慢速思维），由确定何时激活慢速思维的双重置信激活机制调节。与最先进的模型相比，F/S-RM 的相对性能提高了 1.2%，同时减少了 20.8% 的代币消耗。代码和数据将公开。
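
The idea of invoking slow thinking only when the fast score is uncertain can be sketched as a confidence-gated fallback. This is a simplified stand-in, not the paper's dual-confidence mechanism, whose details the abstract does not give. The names `p_fast`, `slow_judge`, and `threshold` are hypothetical.

```python
from typing import Callable, Tuple

def gated_preference(
    p_fast: float,
    slow_judge: Callable[[], float],
    threshold: float = 0.8,
) -> Tuple[float, str]:
    # p_fast: assumed probability (from the first token) that response A is better.
    # slow_judge: hypothetical callable that runs the costly CoT judgment.
    confidence = max(p_fast, 1.0 - p_fast)
    if confidence >= threshold:
        return p_fast, "fast"      # confident enough, skip the CoT pass
    return slow_judge(), "slow"    # uncertain, pay for the CoT judgment
```

For example, `gated_preference(0.95, lambda: 0.1)` keeps the fast score, while `gated_preference(0.55, lambda: 0.1)` falls back to the slow judge.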

Title: Multi-Agent Debate with Memory Masking

Authors: Hongduan Tian, Xiao Feng, Ziyuan Zhao, Xiangyu Zhu, Rolan Yan, Bo Han
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2603.20215
Pdf URL: https://arxiv.org/pdf/2603.20215
Copy Paste: [[2603.20215]] Multi-Agent Debate with Memory Masking(https://arxiv.org/abs/2603.20215)
Keywords: language model, llm, agent
Abstract: Large language models (LLMs) have recently demonstrated impressive capabilities in reasoning tasks. Currently, mainstream LLM reasoning frameworks predominantly focus on scaling up inference-time sampling to enhance performance. In particular, among all LLM reasoning frameworks, *multi-agent debate* (MAD), which employs multiple LLMs as agents to perform reasoning in the way of multi-round debate, has emerged as a powerful reasoning paradigm since it allows agents to access previous memories to alleviate fallacious content and refine their reasoning iteratively in each debate round. However, although MAD significantly improves the reasoning capabilities of LLMs, in this paper, we observe that there remain erroneous memories, and LLM agents are vulnerable to these erroneous memories. To explore this phenomenon, we provide a theoretical insight that the performance of MAD is highly dependent on the quality of memories derived from the previous debate, indicating that the existence of erroneous memories poses a threat to the performance of MAD. To address this problem, we introduce a simple yet effective multi-agent debate framework, *multi-agent debate with memory masking* (MAD-M$^2$), to improve the robustness of MAD by allowing LLM agents to mask erroneous memories from the previous debate round at the beginning of each debate round. In this way, MAD-M$^2$ can polish the contextual information before each debate round by preserving informative and meaningful memories while discarding the erroneous memories. Extensive experiments and analyses on mainstream mathematical and logical reasoning benchmarks demonstrate that MAD-M$^2$ can identify the erroneous memories and achieve better performance in reasoning than MAD.
摘要：大型语言模型（LLM）最近在推理任务中表现出了令人印象深刻的能力。目前，主流的法学硕士推理框架主要侧重于扩大推理时间采样以提高性能。特别是，在所有LLM推理框架中，“多智能体辩论”（MAD），它采用多个LLM作为智能体，以多轮辩论的方式进行推理，它允许智能体访问先前的记忆，以减少错误内容，并在每轮辩论中迭代地完善其推理，从而成为一种强大的推理范式。然而，尽管 MAD 显着提高了 LLM 的推理能力，但在本文中，我们观察到仍然存在错误记忆，并且 LLM 智能体很容易受到这些错误记忆的影响。为了探索这一现象，我们提供了一个理论见解，即 MAD 的性能高度依赖于先前争论中得出的记忆质量，这表明错误记忆的存在对 MAD 的性能构成威胁。为了解决这个问题，我们引入了一个简单而有效的多智能体辩论框架，*多智能体辩论与记忆屏蔽*（MAD-M$^2$），通过允许LLM智能体在每轮辩论开始时屏蔽上一轮辩论中的错误记忆来提高MAD的鲁棒性。通过这种方式，MAD-M$^2$ 可以在每轮辩论之前通过保留信息丰富且有意义的记忆同时丢弃错误的记忆来完善上下文信息。对主流数学和逻辑推理基准的大量实验和分析表明，MAD-M$^2$ 可以识别错误记忆，并获得比 MAD 更好的推理性能。

Title: Locally Coherent Parallel Decoding in Diffusion Language Models

Authors: Michael Hersche, Nicolas Menet, Ronan Tanios, Abbas Rahimi
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.20216
Pdf URL: https://arxiv.org/pdf/2603.20216
Copy Paste: [[2603.20216]] Locally Coherent Parallel Decoding in Diffusion Language Models(https://arxiv.org/abs/2603.20216)
Keywords: language model
Abstract: Diffusion language models (DLMs) have emerged as a promising alternative to autoregressive (AR) models, offering sub-linear generation latency and bidirectional capabilities that are particularly appealing for code generation and editing. Achieving sub-linear latency in discrete DLMs requires predicting multiple tokens in parallel. However, standard DLMs sample tokens independently from conditional marginal distributions, failing to capture the joint dependencies among concurrently generated tokens. As a result, they often lead to syntactic inconsistencies and break multi-token structures. In this work, we introduce CoDiLA (Coherent Diffusion with Local Autoregression), a method that reconciles parallel sampling with local dependency modeling. Rather than forcing the DLM to resolve fine-grained syntax, CoDiLA delegates local decoding to a small, auxiliary AR model operating on the diffusion latents. This design allows for parallel block generation while ensuring sequential validity within each block and maintaining core DLM capabilities, including bidirectional modeling across blocks. We demonstrate that using a highly compact auxiliary AR model (e.g., 0.6B parameters) effectively eliminates coherence artifacts, establishing a new Pareto frontier for accuracy and speed in code generation benchmarks.
摘要：扩散语言模型 (DLM) 已成为自回归 (AR) 模型的一种有前景的替代方案，提供亚线性生成延迟和双向功能，这对代码生成和编辑特别有吸引力。在离散 DLM 中实现亚线性延迟需要并行预测多个令牌。然而，标准 DLM 独立于条件边际分布对令牌进行采样，无法捕获同时生成的令牌之间的联合依赖关系。因此，它们经常导致语法不一致并破坏多标记结构。在这项工作中，我们引入了 CoDiLA（具有局部自回归的相干扩散），这是一种协调并行采样与局部依赖建模的方法。 CoDiLA 没有强制 DLM 解析细粒度语法，而是将本地解码委托给在扩散潜伏上运行的小型辅助 AR 模型。此设计允许并行块生成，同时确保每个块内的顺序有效性并维护核心 DLM 功能，包括跨块的双向建模。我们证明，使用高度紧凑的辅助 AR 模型（例如 0.6B 参数）可以有效消除一致性伪影，为代码生成基准的准确性和速度建立新的帕累托前沿。

Title: Expected Reward Prediction, with Applications to Model Routing

Authors: Kenan Hasanaliyev, Silas Alberti, Jenny Hamer, Dheeraj Rajagopal, Kevin Robinson, Jasper Snoek, Victor Veitch, Alexander Nicholas D'Amour
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2603.20217
Pdf URL: https://arxiv.org/pdf/2603.20217
Copy Paste: [[2603.20217]] Expected Reward Prediction, with Applications to Model Routing(https://arxiv.org/abs/2603.20217)
Keywords: llm, prompt
Abstract: Reward models are a standard tool to score responses from LLMs. Reward models are built to rank responses to a fixed prompt sampled from a single model, for example to choose the best of n sampled responses. In this paper, we study whether scores from response-level reward models can be lifted to score a model's suitability for a prompt, prior to seeing responses from that model. Specifically, we show that it is straightforward to predict the expected reward that an LLM would earn from the reward model under repeated sampling. Further, we show that these expected reward predictions are precise and discriminative enough to support an application to a model routing protocol that routes prompts to models at inference time to maximize reward while controlling computational cost. We demonstrate the performance of this routing procedure on the open-perfectblend dataset, using a model pool composed of Llama3.1-Instruct 8B/70B, Gemma2-IT 9B/27B, and Gemma1-IT 7B models. Our simple expected reward prediction-based routing (ERP) outperforms baselines that route prompts to models with the best average performance within each prompt's category, and explains the success of more complex routing protocols that implicitly estimate an expected reward. Our approach has the added advantage of being trivially extensible as new models are added to the pool.
摘要：奖励模型是对法学硕士的回答进行评分的标准工具。构建奖励模型是为了对从单个模型中采样的固定提示的响应进行排名，例如选择 n 个采样响应中的最佳响应。在本文中，我们研究了在看到模型的响应之前，响应级别奖励模型的分数是否会提升以对模型对提示的适合性进行评分。具体来说，我们表明，在重复抽样的情况下，可以直接预测法学硕士从奖励模型中获得的预期奖励。此外，我们还表明，这些预期奖励预测足够精确且具有辨别力，足以支持模型路由协议的应用，该协议在推理时将提示路由到模型，以在控制计算成本的同时最大化奖励。我们使用由 Llama3.1-Instruct 8B/70B、Gemma2-IT 9B/27B 和 Gemma1-IT 7B 模型组成的模型池，在 open-perfectblend 数据集上演示了此路由过程的性能。我们基于简单预期奖励预测的路由 (ERP) 优于将提示路由到每个提示类别中平均性能最佳的模型的基线，并解释了隐式估计预期奖励的更复杂路由协议的成功。我们的方法还有一个额外的优势，即随着新模型添加到池中，可以轻松扩展。
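
A routing rule of the kind this abstract describes might look like the sketch below. The abstract does not say how cost is controlled, so the linear penalty `cost_weight * cost[m]` is an assumption, and the model names and numbers in the usage note are made up for illustration.

```python
from typing import Dict

def route(
    predicted_reward: Dict[str, float],
    cost: Dict[str, float],
    cost_weight: float = 0.0,
) -> str:
    # Pick the model with the best predicted expected reward,
    # minus an assumed linear penalty on its compute cost.
    return max(
        predicted_reward,
        key=lambda m: predicted_reward[m] - cost_weight * cost[m],
    )
```

With predicted rewards `{"small": 0.70, "large": 0.80}` and costs `{"small": 1.0, "large": 8.0}`, a `cost_weight` of 0 routes to `"large"`, while a `cost_weight` of 0.05 routes to `"small"`. Adding a model to the pool only requires a new entry in each dictionary.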

Title: An experimental study of KV cache reuse strategies in chunk-level caching systems

Authors: Samuel Cestola, Tianxiang Xia, Zheng Weiyan, Zheng Pengfei, Diego Didona
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2603.20218
Pdf URL: https://arxiv.org/pdf/2603.20218
Copy Paste: [[2603.20218]] An experimental study of KV cache reuse strategies in chunk-level caching systems(https://arxiv.org/abs/2603.20218)
Keywords: language model, prompt, retrieval-augmented generation
Abstract: Retrieval-augmented generation improves large language models' accuracy by adding relevant retrieved text to the prompt. Chunk level caching (CLC) accelerates inference by precomputing KV caches for these retrieved chunks and reusing them. However, these caches miss cross-attention dependencies between chunks, which can reduce output quality. Several methods try to improve CLC accuracy using different techniques. We make two main contributions. First, we show that existing CLC approaches have fundamental limitations that limit their accuracy or their applicability. We back this conclusion with an extensive CLC system experimental evaluation. Second, we observe that existing CLC techniques are complementary. We leverage this insight to propose a new CLC design that carefully combines them and achieves better accuracy.
摘要：检索增强生成通过将相关的检索文本添加到提示中来提高大型语言模型的准确性。块级缓存 (CLC) 通过为这些检索到的块预先计算 KV 缓存并重用它们来加速推理。然而，这些缓存错过了块之间的交叉注意力依赖关系，这可能会降低输出质量。有几种方法尝试使用不同的技术来提高 CLC 准确性。我们做出了两个主要贡献。首先，我们表明现有的 CLC 方法具有根本性的局限性，限制了其准确性或适用性。我们通过广泛的 CLC 系统实验评估来支持这一结论。其次，我们观察到现有的 CLC 技术是互补的。我们利用这一见解提出了一种新的 CLC 设计，将它们仔细结合并实现了更高的准确性。

Title: Thinking into the Future: Latent Lookahead Training for Transformers

Authors: Lorenzo Noci, Gregor Bachmann, Seyed-Mohsen Moosavi-Dezfooli, Moin Nabi
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2603.20219
Pdf URL: https://arxiv.org/pdf/2603.20219
Copy Paste: [[2603.20219]] Thinking into the Future: Latent Lookahead Training for Transformers(https://arxiv.org/abs/2603.20219)
Keywords: language model
Abstract: Autoregressive language models trained with next-token prediction generate text by sampling one discrete token at a time. Although very scalable, this objective forces the model to commit at every step, preventing it from exploring or reflecting upon multiple plausible continuations. Furthermore, the compute allocation across tokens is uniform; every token is formed based on a single forward-pass, potentially limiting the model's expressiveness in cases where difficult tokens require inherently more compute. Towards addressing these limitations, we introduce latent lookahead, a training strategy that enables models to "think" before generating: at selected positions in the sequence, before committing to the next token, the model performs a multi-step lookahead in latent space. More precisely, instead of sampling future tokens, we leverage the network's latent space by recursively feeding its hidden states back into the context for $\tau$ steps, investing more compute on predicting that token. This produces $\tau$ latent predictions that are supervised against the next $\tau$ ground-truth tokens, encouraging the model to "lookahead" and refine its prediction. We show that latent lookahead substantially outperforms both autoregressive and non-autoregressive baselines on planning tasks such as maze solving, Sudoku, and ProsQA, where foresight is essential.
摘要：使用下一个标记预测训练的自回归语言模型通过一次采样一个离散标记来生成文本。尽管具有很强的可扩展性，但这个目标迫使模型在每一步都做出承诺，从而阻止它探索或反思多个看似合理的延续。此外，跨代币的计算分配是统一的；每个令牌都是基于单个前向传递形成的，在困难令牌本质上需要更多计算的情况下，可能会限制模型的表达能力。为了解决这些限制，我们引入了潜在的前瞻，这是一种训练策略，使模型能够在生成之前“思考”：在序列中的选定位置，在提交下一个标记之前，模型在潜在空间中执行多步前瞻。更准确地说，我们不是对未来的令牌进行采样，而是通过递归地将其隐藏状态反馈回 $\tau$ 步骤的上下文中，从而利用网络的潜在空间，投入更多的计算来预测该令牌。这会产生 $\tau$ 潜在预测，并针对下一个 $\tau$ 真实标记进行监督，从而鼓励模型“前瞻”并完善其预测。我们表明，在迷宫求解、数独和 ProsQA 等规划任务中，潜在前瞻大大优于自回归和非自回归基线，在这些任务中，预见性至关重要。

Title: Beyond Test-Time Compute Strategies: Advocating Energy-per-Token in LLM Inference

Authors: Patrick Wilhelm, Thorsten Wittkopp, Odej Kao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.20224
Pdf URL: https://arxiv.org/pdf/2603.20224
Copy Paste: [[2603.20224]] Beyond Test-Time Compute Strategies: Advocating Energy-per-Token in LLM Inference(https://arxiv.org/abs/2603.20224)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Large Language Models (LLMs) demonstrate exceptional performance across diverse tasks but come with substantial energy and computational costs, particularly in request-heavy scenarios. In many real-world applications, the full scale and capabilities of LLMs are often unnecessary, as Small Language Models (SLMs) can provide accurate responses for simpler text generation tasks. When enhanced with advanced reasoning strategies, such as Chain-of-Thought (CoT) prompting or Majority Voting, SLMs can approach the performance of larger models while reducing overall computational requirements. However, these strategies can also introduce additional energy costs, creating an energy-accuracy trade-off. Our analysis examines these trade-offs in test-time compute strategies for smaller models compared to larger ones, using the MMLU benchmark. Additionally, we explore the input-output token dynamics of transformer architectures, which result in nonlinear hardware energy operation curves for LLMs. To bridge AI research with its physical impact, we propose energy efficiency metrics, including Energy-per-Token, as complements to traditional accuracy benchmarks. Beyond model selection, we propose controlled reasoning in CoT token generation, using operating curves to regulate reasoning depth dynamically. This vision integrates an energy-aware routing mechanism, ensuring that model selection and inference strategies balance accuracy for sustainable AI deployment.
摘要：大型语言模型 (LLM) 在不同的任务中表现出卓越的性能，但会带来大量的能源和计算成本，特别是在请求繁重的场景中。在许多实际应用中，LLM 的全面规模和功能通常是不必要的，因为小语言模型 (SLM) 可以为更简单的文本生成任务提供准确的响应。当通过思想链 (CoT) 提示或多数投票等高级推理策略进行增强时，SLM 可以接近大型模型的性能，同时降低总体计算要求。然而，这些策略也会带来额外的能源成本，从而造成能源准确性的权衡。我们的分析使用 MMLU 基准检查了较小模型与较大模型在测试时计算策略中的权衡。此外，我们还探索了变压器架构的输入输出令牌动态，这导致了 LLM 的非线性硬件能量运行曲线。为了将人工智能研究与其物理影响联系起来，我们提出 \textit{能源效率指标}，包括每个代币的能量，作为传统准确性基准的补充。除了模型选择之外，我们还提出了 CoT 代币生成中的受控推理，使用操作曲线动态调节推理深度。这一愿景集成了能源感知路由机制，确保模型选择和推理策略平衡可持续人工智能部署的准确性。
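
An Energy-per-Token figure could be computed as in the sketch below. The abstract names the metric but does not define it, so the definition used here (average power times wall-clock time, divided by generated tokens, in joules per token) is an assumption, as are the parameter names.

```python
def energy_per_token(avg_power_watts: float, seconds: float, tokens: int) -> float:
    # Assumed definition: total energy in joules divided by generated tokens.
    if tokens <= 0:
        raise ValueError("tokens must be positive")
    return avg_power_watts * seconds / tokens
```

Under this definition, a request that draws 300 W for 2 s and produces 200 tokens costs 3 J per token. Strategies such as majority voting multiply the generated tokens, which is why total energy can rise even when the per-token figure stays flat.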

Title: Decoding the decoder: Contextual sequence-to-sequence modeling for intracortical speech decoding

Authors: Michal Olak, Tommaso Boccato, Matteo Ferrante
Subjects: cs.CL, cs.AI, cs.NE, q-bio.NC
Abstract URL: https://arxiv.org/abs/2603.20246
Pdf URL: https://arxiv.org/pdf/2603.20246
Copy Paste: [[2603.20246]] Decoding the decoder: Contextual sequence-to-sequence modeling for intracortical speech decoding(https://arxiv.org/abs/2603.20246)
Keywords: language model
Abstract: Speech brain-computer interfaces require decoders that translate intracortical activity into linguistic output while remaining robust to limited data and day-to-day variability. While prior high-performing systems have largely relied on framewise phoneme decoding combined with downstream language models, it remains unclear what contextual sequence-to-sequence decoding contributes to sublexical neural readout, robustness, and interpretability. We evaluated a multitask Transformer-based sequence-to-sequence model for attempted speech decoding from area 6v intracortical recordings. The model jointly predicts phoneme sequences, word sequences, and auxiliary acoustic features. To address day-to-day nonstationarity, we introduced the Neural Hammer Scalpel (NHS) calibration module, which combines global alignment with feature-wise modulation. We further analyzed held-out-day generalization and attention patterns in the encoder and decoders. On the Willett et al. dataset, the proposed model achieved a state-of-the-art phoneme error rate of 14.3%. Word decoding reached 25.6% WER with direct decoding and 19.4% WER with candidate generation and rescoring. NHS substantially improved both phoneme and word decoding relative to linear or no day-specific transform, while held-out-day experiments showed increasing degradation on unseen days with temporal distance. Attention visualizations revealed recurring temporal chunking in encoder representations and distinct use of these segments by phoneme and word decoders. These results indicate that contextual sequence-to-sequence modeling can improve the fidelity of neural-to-phoneme readout from intracortical speech signals and suggest that attention-based analyses can generate useful hypotheses about how neural speech evidence is segmented and accumulated over time.
摘要：语音脑-计算机接口需要解码器将皮质内活动转化为语言输出，同时对有限的数据和日常变化保持鲁棒性。虽然之前的高性能系统在很大程度上依赖于与下游语言模型相结合的逐帧音素解码，但目前尚不清楚上下文序列到序列解码对词下神经读出、鲁棒性和可解释性有何贡献。我们评估了基于 Transformer 的多任务序列到序列模型，用于尝试对 6v 区域皮质内录音进行语音解码。该模型联合预测音素序列、单词序列和辅助声学特征。为了解决日常的非平稳性，我们引入了神经锤手术刀 (NHS) 校准模块，它将全局对齐与特征调制相结合。我们进一步分析了编码器和解码器中的保留日泛化和注意力模式。关于威利特等人。数据集上，所提出的模型实现了 14.3% 的最先进音素错误率。通过直接解码，单词解码的 WER 达到 25.6%；通过候选生成和重新评分，单词解码的 WER 达到 19.4%。相对于线性或无特定日期转换，NHS 显着改善了音素和单词解码，而保留日期实验表明，随着时间距离的变化，未见日期的退化程度不断增加。注意力可视化揭示了编码器表示中反复出现的时间分块以及音素和单词解码器对这些片段的不同使用。这些结果表明，上下文序列到序列建模可以提高皮质内语音信号的神经到音素读出的保真度，并表明基于注意力的分析可以生成关于神经语音证据如何随着时间的推移进行分割和积累的有用假设。

Title: FinReflectKG -- HalluBench: GraphRAG Hallucination Benchmark for Financial Question Answering Systems

Authors: Mahesh Kumar, Bhaskarjit Sarmah, Stefano Pasquali
Subjects: cs.CL, q-fin.CP
Abstract URL: https://arxiv.org/abs/2603.20252
Pdf URL: https://arxiv.org/pdf/2603.20252
Copy Paste: [[2603.20252]] FinReflectKG -- HalluBench: GraphRAG Hallucination Benchmark for Financial Question Answering Systems(https://arxiv.org/abs/2603.20252)
Keywords: llm, hallucination
Abstract: As organizations increasingly integrate AI-powered question-answering systems into financial information systems for compliance, risk assessment, and decision support, ensuring the factual accuracy of AI-generated outputs becomes a critical engineering challenge. Current Knowledge Graph (KG)-augmented QA systems lack systematic mechanisms to detect hallucinations - factually incorrect outputs that undermine reliability and user trust. We introduce FinBench-QA-Hallucination, a benchmark for evaluating hallucination detection methods in KG-augmented financial QA over SEC 10-K filings. The dataset contains 755 annotated examples from 300 pages, each labeled for groundedness using a conservative evidence-linkage protocol requiring support from both textual chunks and extracted relational triplets. We evaluate six detection approaches - LLM judges, fine-tuned classifiers, Natural Language Inference (NLI) models, span detectors, and embedding-based methods under two conditions: with and without KG triplets. Results show that LLM-based judges and embedding approaches achieve the highest performance (F1: 0.82-0.86) under clean conditions. However, most methods degrade significantly when noisy triplets are introduced, with Matthews Correlation Coefficient (MCC) dropping 44-84 percent, while embedding methods remain relatively robust with only 9 percent degradation. Statistical tests (Cochran's Q and McNemar) confirm significant performance differences (p < 0.001). Our findings highlight vulnerabilities in current KG-augmented systems and provide insights for building reliable financial information systems, where hallucinations can lead to regulatory violations and flawed decisions. The benchmark also offers a framework for integrating AI reliability evaluation into information system design across other high-stakes domains such as healthcare, legal, and government.
摘要：随着组织越来越多地将人工智能驱动的问答系统集成到财务信息系统中，以实现合规性、风险评估和决策支持，确保人工智能生成的输出的事实准确性成为一项关键的工程挑战。当前的知识图 (KG) 增强 QA 系统缺乏检测幻觉的系统机制 - 事实上不正确的输出会破坏可靠性和用户信任。我们引入了 FinBench-QA-Hallucination，这是一个用于评估 SEC 10-K 文件中 KG 增强财务 QA 中幻觉检测方法的基准。该数据集包含来自 300 页的 755 个带注释的示例，每个示例都使用保守的证据链接协议标记为基础，需要文本块和提取的关系三元组的支持。我们在两种条件下评估了六种检测方法：LLM 判断器、微调分类器、自然语言推理 (NLI) 模型、跨度检测器和基于嵌入的方法：有和没有 KG 三元组。结果表明，基于 LLM 的判断和嵌入方法在干净的条件下实现了最高性能（F1：0.82-0.86）。然而，当引入噪声三元组时，大多数方法都会显着退化，马修斯相关系数 (MCC) 下降 44-84%，而嵌入方法仍然相对稳健，仅退化 9%。统计测试（Cochran's Q 和 McNemar）证实了显着的性能差异 (p < 0.001)。我们的研究结果强调了当前知识图谱增强系统中的漏洞，并为构建可靠的金融信息系统提供了见解，在该系统中，幻觉可能会导致监管违规和有缺陷的决策。该基准还提供了一个框架，用于将人工智能可靠性评估集成到医疗保健、法律和政府等其他高风险领域的信息系统设计中。
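
For readers unfamiliar with the Matthews Correlation Coefficient reported in this abstract, the standard binary-classification formula is sketched below. The confusion-matrix counts in the usage note are illustrative only and are not taken from the benchmark.

```python
import math

def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    # MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # common convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denom
```

A detector with 40 true positives, 40 true negatives, 10 false positives, and 10 false negatives scores 0.6. Unlike F1, the coefficient uses all four cells of the confusion matrix, so it also reflects how well the negative class is handled.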

Title: SciNav: A General Agent Framework for Scientific Coding Tasks

Authors: Tianshu Zhang, Huan Sun
Subjects: cs.CL, cs.AI, cs.CE, cs.LG, cs.MA, eess.SY
Abstract URL: https://arxiv.org/abs/2603.20256
Pdf URL: https://arxiv.org/pdf/2603.20256
Copy Paste: [[2603.20256]] SciNav: A General Agent Framework for Scientific Coding Tasks(https://arxiv.org/abs/2603.20256)
Keywords: language model, llm, prompt, agent
Abstract: Autonomous science agents built on large language models (LLMs) are increasingly used to generate hypotheses, design experiments, and produce reports. However, prior work mainly targets open-ended scientific problems with subjective outputs that are difficult to evaluate. Scientific coding benchmarks, by contrast, provide executable outputs for objective assessment. Existing approaches remain engineering-driven pipelines, revealing the need for structured, end-to-end science agent frameworks for scientific coding tasks. We address this gap by focusing on scientific coding tasks, where evaluation can be made rigorously, and introducing an agent framework SciNav (Scientific Navigator) that enables more effective solution exploration. Our framework is designed to operate under constrained search budgets, moving beyond reliance on pre-defined success metrics and prolonged search cycles. Inspired by findings that comparative judgments often reveal finer-grained quality differences and therefore provide greater discriminative power than absolute scoring, our framework leverages pairwise relative judgments within a tree search process to select top-K promising solution branches, prune low-potential ones, and progressively narrow down the solution candidates on the selected branches guided by relative comparisons. We demonstrate our agent's effectiveness across different types of tasks on two benchmarks. Experiments show that SciNav significantly outperforms direct prompting and prior agents like OpenHands and Self-Debug across different base models, task types, and difficulty levels, and exceeds different frontier comparators such as random selection and LLM absolute scoring. These results confirm the strength of our agent design and highlight the effectiveness of relative judgment-guided top-K search for high-quality scientific coding, marking a step toward more practical science agents.
摘要：基于大型语言模型 (LLM) 的自主科学代理越来越多地用于生成假设、设计实验和生成报告。然而，先前的工作主要针对开放式科学问题，其主观输出难以评估。相比之下，科学编码基准为客观评估提供可执行的输出。现有方法仍然是工程驱动的管道，揭示了科学编码任务对结构化、端到端科学代理框架的需求。我们通过专注于可以进行严格评估的科学编码任务来解决这一差距，并引入代理框架 SciNav（科学导航器）来实现更有效的解决方案探索。我们的框架旨在在有限的搜索预算下运行，超越对预定义成功指标和延长搜索周期的依赖。受比较判断通常揭示更细粒度的质量差异并因此提供比绝对评分更大的判别力的发现的启发，我们的框架利用树搜索过程中的成对相对判断来选择前 K 个有前途的解决方案分支，修剪低潜力的解决方案分支，并在相对比较的指导下逐步缩小所选分支上的候选解决方案范围。我们在两个基准测试中展示了代理在不同类型任务中的有效性。实验表明，SciNav 在不同的基础模型、任务类型和难度级别上显着优于直接提示和先验代理（例如 OpenHands 和 Self-Debug），并且超过了不同的前沿比较器（例如随机选择和 LLM 绝对评分）。这些结果证实了我们的智能体设计的优势，并强调了相对判断引导的 top-K 搜索对高质量科学编码的有效性，标志着向更实用的科学智能体迈出了一步。
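
Selecting top-K candidates from pairwise relative judgments can be sketched as a round-robin win count. This is one plausible reading of the abstract, not SciNav's actual procedure. The comparator `prefer` is hypothetical and stands in for an LLM judge that returns True when its first argument is the better solution.

```python
from itertools import combinations
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")

def top_k_by_pairwise(
    candidates: Sequence[T],
    prefer: Callable[[T, T], bool],
    k: int,
) -> List[T]:
    # Compare every pair once and count wins per candidate.
    wins = [0] * len(candidates)
    for i, j in combinations(range(len(candidates)), 2):
        winner = i if prefer(candidates[i], candidates[j]) else j
        wins[winner] += 1
    # Keep the k candidates with the most wins (ties keep input order).
    order = sorted(range(len(candidates)), key=lambda i: wins[i], reverse=True)
    return [candidates[i] for i in order[:k]]
```

A full round robin needs n(n-1)/2 judge calls for n candidates, which is affordable for the small branch counts implied by a constrained search budget.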

Title: The production of meaning in the processing of natural language

Authors: Christopher J. Agostino, Quan Le Thien, Nayan D'Souza, Louis van der Elst
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2603.20381
Pdf URL: https://arxiv.org/pdf/2603.20381
Copy Paste: [[2603.20381]] The production of meaning in the processing of natural language(https://arxiv.org/abs/2603.20381)
Keywords: language model, hallucination, prompt, agent
Abstract: Understanding the fundamental mechanisms governing the production of meaning in the processing of natural language is critical for designing safe, thoughtful, engaging, and empowering human-agent interactions. Experiments in cognitive science and social psychology have demonstrated that human semantic processing exhibits contextuality more consistent with quantum logical mechanisms than classical Boolean theories, and recent works have found similar results in large language models -- in particular, clear violations of the Bell inequality in experiments of contextuality during interpretation of ambiguous expressions. We explore the CHSH $|S|$ parameter -- the metric associated with the inequality -- across the inference parameter space of models spanning four orders of magnitude in scale, cross-referencing it with MMLU, hallucination rate, and nonsense detection benchmarks. We find that the interquartile range of the $|S|$ distribution -- the statistic that most sharply differentiates models from one another -- is completely orthogonal to all external benchmarks, while violation rate shows weak anticorrelation with all three benchmarks that does not reach significance. We investigate how $|S|$ varies with sampling parameters and word order, and discuss the information-theoretic constraints that genuine contextuality imposes on prompt injection defenses and its human analogue, whereby careful construction and maintenance of social contextuality can be carried out at scale -- manufacturing not consent but contextuality itself, a subtler and more fundamental form of manipulation that shapes the space of possible interpretations before any particular one is reached.
摘要：了解自然语言处理过程中控制意义产生的基本机制对于设计安全、深思熟虑、有吸引力和增强人机交互至关重要。认知科学和社会心理学的实验表明，人类语义处理表现出比经典布尔理论更符合量子逻辑机制的语境性，最近的工作在大型语言模型中也发现了类似的结果——特别是在解释歧义表达式时的语境性实验中，明显违反了贝尔不等式。我们在跨越四个数量级的模型的推理参数空间中探索 CHSH $|S|$ 参数（与不等式相关的度量），并将其与 MMLU、幻觉率和无意义检测基准交叉引用。我们发现，$|S|$ 分布的四分位数范围（最能区分模型之间的统计数据）与所有外部基准完全正交，而违规率与所有三个基准的反相关性较弱，未达到显着性。我们研究$|S|$如何随采样参数和词序变化，并讨论真正的语境对即时注入防御及其人类类似物施加的信息论约束，从而可以大规模地进行社会语境的仔细构建和维护——制造的不是同意，而是语境本身，一种更微妙、更基本的操纵形式，在达到任何特定解释之前塑造可能解释的空间。
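
Note on the CHSH $|S|$ parameter mentioned above: a minimal sketch, assuming the textbook CHSH combination S = E(a,b) - E(a,b') + E(a',b) + E(a',b') with classical bound |S| <= 2. The sign convention, the setting labels, and the function names are illustrative assumptions, not taken from the paper.

```python
import math
from typing import Dict, Tuple

Setting = Tuple[str, str]


def chsh_s(e: Dict[Setting, float]) -> float:
    """Textbook CHSH combination of four correlation values in [-1, 1]."""
    return e[("a", "b")] - e[("a", "b'")] + e[("a'", "b")] + e[("a'", "b'")]


def violates_classical_bound(s: float, bound: float = 2.0) -> bool:
    """True when |S| exceeds the classical (local hidden variable) bound."""
    return abs(s) > bound


# Made-up correlation values, for illustration only.
classical = {("a", "b"): 1.0, ("a", "b'"): 1.0, ("a'", "b"): 1.0, ("a'", "b'"): 1.0}
q = 1 / math.sqrt(2)
quantum = {("a", "b"): q, ("a", "b'"): -q, ("a'", "b"): q, ("a'", "b'"): q}

print(chsh_s(classical), violates_classical_bound(chsh_s(classical)))
print(chsh_s(quantum), violates_classical_bound(chsh_s(quantum)))
```

The second example reaches the Tsirelson value 2*sqrt(2), the largest |S| that quantum correlations allow.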

Title: Coding Agents are Effective Long-Context Processors

Authors: Weili Cao, Xunjian Yin, Bhuwan Dhingra, Shuyan Zhou
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.20432
Pdf URL: https://arxiv.org/pdf/2603.20432
Copy Paste: [[2603.20432]] Coding Agents are Effective Long-Context Processors(https://arxiv.org/abs/2603.20432)
Keywords: language model, llm, long context, retrieval-augmented generation, agent
Abstract: Large Language Models (LLMs) have demonstrated remarkable progress in scaling to access massive contexts. However, the access is via the latent and uninterpretable attention mechanisms, and LLMs fail to effectively process long contexts, exhibiting significant performance degradation as context length increases. In this work, we study whether long-context processing can be externalized from latent attention into explicit, executable interactions, by allowing coding agents to organize text in file systems and manipulate it using their native tools. We evaluate off-the-shelf frontier coding agents as the general interface for tasks that require processing long contexts, including long-context reasoning, retrieval-augmented generation, and open-domain question answering over large-scale corpora containing up to three trillion tokens. Across multiple benchmarks, these agents outperform the published state of the art by 17.3% on average. We attribute this efficacy to two key factors: native tool proficiency, which enables agents to leverage executable code and terminal commands rather than passive semantic queries, and file system familiarity, which allows them to navigate massive text corpora as directory structures. These findings suggest that delegating long-context processing to coding agents offers an effective alternative to semantic search or context window scaling, opening new directions for long-context processing in LLMs.
摘要：大型语言模型 (LLM) 在扩展以访问海量上下文方面取得了显著进展。然而，访问是通过潜在的和不可解释的注意力机制进行的，LLM 无法有效处理长上下文，随着上下文长度的增加，表现出显著的性能下降。在这项工作中，我们研究是否可以通过允许编码代理在文件系统中组织文本并使用其本机工具对其进行操作，将长上下文处理从潜在注意力外化为显式的可执行交互。我们评估现成的前沿编码代理作为需要处理长上下文的任务的通用接口，包括长上下文推理、检索增强生成和包含多达三万亿个令牌的大规模语料库的开放域问答。在多个基准测试中，这些代理的性能平均比已发布的最新技术高出 17.3%。我们将这种功效归因于两个关键因素：本地工具熟练程度，使代理能够利用可执行代码和终端命令，而不是被动语义查询；以及文件系统熟悉度，使他们能够以目录结构的形式导航大量文本语料库。这些发现表明，将长上下文处理委托给编码代理提供了语义搜索或上下文窗口缩放的有效替代方案，为 LLM 的长上下文处理开辟了新的方向。

Title: A Training-Free Regeneration Paradigm: Contrastive Reflection Memory Guided Self-Verification and Self-Improvement

Authors: Yuran Li, Di Wu, Benoit Boulet
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.20441
Pdf URL: https://arxiv.org/pdf/2603.20441
Copy Paste: [[2603.20441]] A Training-Free Regeneration Paradigm: Contrastive Reflection Memory Guided Self-Verification and Self-Improvement(https://arxiv.org/abs/2603.20441)
Keywords: language model, llm
Abstract: Verification-guided self-improvement has recently emerged as a promising approach to improving the accuracy of large language model (LLM) outputs. However, existing approaches face a trade-off between inference efficiency and accuracy: iterative verification-rectification is computationally expensive and prone to being trapped in faulty reasoning, while best-of-N selection requires extensive sampling without addressing internal model flaws. We propose a training-free regeneration paradigm that leverages an offline-curated contrastive Reflection Memory (RM) to provide corrective guidance, while regenerating from scratch helps break out of faulty reasoning. At inference time, the method performs RM-guided self-verification followed by a single RM-guided regeneration, avoiding both iterative correction and multi-sample selection. We evaluated our method on nine benchmarks that span algorithmic, reasoning, symbolic, and domain-specific tasks in both small- and large-scale LLMs. Experiment results show that our method outperforms prior methods while maintaining low computational cost.
摘要：以验证为导向的自我改进最近已成为提高大型语言模型（LLM）输出准确性的一种有前途的方法。然而，现有的方法面临着推理效率和准确性之间的权衡：迭代验证校正的计算成本很高，并且容易陷入错误的推理，而“N 中最佳”选择需要大量采样，而无法解决内部模型缺陷。我们提出了一种免训练的再生范例，利用离线策划的对比反射记忆（RM）来提供纠正指导，同时从头开始再生有助于摆脱错误的推理。在推理时，该方法执行 RM 引导的自我验证，然后执行单次 RM 引导的再生，避免了迭代校正和多样本选择。我们在九个基准上评估了我们的方法，这些基准涵盖算法、推理、符号和特定领域的任务，并同时覆盖小型和大型 LLM。实验结果表明，我们的方法优于现有方法，同时保持较低的计算成本。

Title: Policies Permitting LLM Use for Polishing Peer Reviews Are Currently Not Enforceable

Authors: Rounak Saha, Gurusha Juneja, Dayita Chaudhuri, Naveeja Sajeevan, Nihar B Shah, Danish Pruthi
Subjects: cs.CL, cs.AI, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2603.20450
Pdf URL: https://arxiv.org/pdf/2603.20450
Copy Paste: [[2603.20450]] Policies Permitting LLM Use for Polishing Peer Reviews Are Currently Not Enforceable(https://arxiv.org/abs/2603.20450)
Keywords: llm
Abstract: A number of scientific conferences and journals have recently enacted policies that prohibit LLM usage by peer reviewers, except for polishing, paraphrasing, and grammar correction of otherwise human-written reviews. But, are these policies enforceable? To answer this question, we assemble a dataset of peer reviews simulating multiple levels of human-AI collaboration, and evaluate five state-of-the-art detectors, including two commercial systems. Our analysis shows that all detectors misclassify a non-trivial fraction of LLM-polished reviews as AI-generated, thereby risking false accusations of academic misconduct. We further investigate whether peer-review-specific signals, including access to the paper manuscript and the constrained domain of scientific writing, can be leveraged to improve detection. While incorporating such signals yields measurable gains in some settings, we identify limitations in each approach and find that none meets the accuracy standards required for identifying AI use in peer reviews. Importantly, our results suggest that recent public estimates of AI use in peer reviews through the use of AI-text detectors should be interpreted with caution, as current detectors misclassify mixed reviews (collaborative human-AI outputs) as fully AI generated, potentially overstating the extent of policy violations.
摘要：许多科学会议和期刊最近颁布了政策，禁止同行评审员使用 LLM，但对原本由人工撰写的评审意见进行润色、释义和语法纠正除外。但是，这些政策可执行吗？为了回答这个问题，我们收集了模拟多个级别的人类与人工智能协作的同行评审数据集，并评估了五个最先进的检测器，其中包括两个商业系统。我们的分析表明，所有检测器都会将 LLM 润色的评审意见中不可忽略的一部分错误地分类为人工智能生成的，从而存在对学术不端行为的虚假指控的风险。我们进一步研究是否可以利用同行评审的特定信号（包括对论文手稿的访问和科学写作的受限领域）来改进检测。虽然在某些情况下纳入此类信号会产生可测量的收益，但我们发现了每种方法的局限性，并发现没有一种方法符合在同行评审中识别人工智能使用所需的准确性标准。重要的是，我们的结果表明，最近公众对通过使用人工智能文本检测器在同行评审中使用人工智能的估计应谨慎解释，因为当前的检测器将混合评审意见（人类与人工智能的协作输出）错误分类为完全由人工智能生成，可能夸大了政策违规的程度。

Title: Diffutron: A Masked Diffusion Language Model for Turkish Language

Authors: Şuayp Talha Kocabay, Talha Rüzgar Akkuş
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.20466
Pdf URL: https://arxiv.org/pdf/2603.20466
Copy Paste: [[2603.20466]] Diffutron: A Masked Diffusion Language Model for Turkish Language(https://arxiv.org/abs/2603.20466)
Keywords: language model
Abstract: Masked Diffusion Language Models (MDLMs) have emerged as a compelling non-autoregressive alternative to standard large language models; however, their application to morphologically rich languages remains limited. In this paper, we introduce $\textit{Diffutron}$, a masked diffusion language model specifically designed for Turkish. Our approach leverages a resource-efficient training pipeline, starting with LoRA-based continual pre-training of a multilingual encoder on a large-scale corpus. To enable generative capabilities, we employ a progressive instruction-tuning strategy, sequentially adapting the model on general and task-specific instruction sets. Experimental results across comprehensive benchmarks demonstrate that, despite its compact size, our model achieves competitive performance compared to existing multi-billion-parameter baselines. These findings validate the effectiveness of masked diffusion modeling combined with multi-stage tuning for non-autoregressive text generation in Turkish.
摘要：掩蔽扩散语言模型 (MDLM) 已成为标准大型语言模型的一种引人注目的非自回归替代方案；然而，它们在形态丰富的语言中的应用仍然有限。在本文中，我们介绍了$\textit{Diffutron}$，一种专门为土耳其语设计的掩码扩散语言模型。我们的方法利用资源高效的训练管道，首先在大规模语料库上对多语言编码器进行基于 LoRA 的持续预训练。为了实现生成能力，我们采用渐进式指令调整策略，在通用指令集和特定任务指令集上依次调整模型。综合基准的实验结果表明，尽管我们的模型尺寸紧凑，但与现有的数十亿参数基准相比，它仍实现了具有竞争力的性能。这些发现验证了掩蔽扩散建模与多阶段调整相结合对于土耳其语非自回归文本生成的有效性。

Title: PARHAF, a human-authored corpus of clinical reports for fictitious patients in French

Authors: Xavier Tannier, Salam Abbara, Rémi Flicoteaux, Youness Khalil, Aurélie Névéol, Pierre Zweigenbaum, Emmanuel Bacry
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.20494
Pdf URL: https://arxiv.org/pdf/2603.20494
Copy Paste: [[2603.20494]] PARHAF, a human-authored corpus of clinical reports for fictitious patients in French(https://arxiv.org/abs/2603.20494)
Keywords: language model
Abstract: The development of clinical natural language processing (NLP) systems is severely hampered by the sensitive nature of medical records, which restricts data sharing under stringent privacy regulations, particularly in France and the broader European Union. To address this gap, we introduce PARHAF, a large open-source corpus of clinical documents in French. PARHAF comprises expert-authored clinical reports describing realistic yet entirely fictitious patient cases, making it anonymous and freely shareable by design. The corpus was developed using a structured protocol that combined clinician expertise with epidemiological guidance from the French National Health Data System (SNDS), ensuring broad clinical coverage. A total of 104 medical residents across 18 specialties authored and peer-reviewed the reports following predefined clinical scenarios and document templates. The corpus contains 7394 clinical reports covering 5009 patient cases across a wide range of medical and surgical specialties. It includes a general-purpose component designed to approximate real-world hospitalization distributions, and four specialized subsets that support information-extraction use cases in oncology, infectious diseases, and diagnostic coding. Documents are released under a CC-BY open license, with a portion temporarily embargoed to enable future benchmarking under controlled conditions. PARHAF provides a valuable resource for training and evaluating French clinical language models in a fully privacy-preserving setting, and establishes a replicable methodology for building shareable synthetic clinical corpora in other languages and health systems.
摘要：临床自然语言处理（NLP）系统的发展受到医疗记录的敏感性的严重阻碍，这限制了严格的隐私法规下的数据共享，特别是在法国和更广泛的欧盟。为了解决这一差距，我们引入了 PARHAF，这是一个大型法语临床文档开源语料库。 PARHAF 包含专家撰写的临床报告，描述真实但完全虚构的患者病例，通过设计使其匿名且可自由共享。该语料库是使用结构化协议开发的，该协议结合了临床医生的专业知识和法国国家健康数据系统 (SNDS) 的流行病学指导，确保了广泛的临床覆盖。 18 个专业的 104 名住院医师根据预定义的临床场景和文档模板撰写了报告并进行了同行评审。该语料库包含 7394 份临床报告，涵盖广泛的医学和外科专业的 5009 名患者病例。它包括一个旨在近似真实世界住院分布的通用组件，以及四个支持肿瘤学、传染病和诊断编码中信息提取用例的专用子集。文件根据 CC-BY 开放许可证发布，其中一部分暂时禁止，以便将来在受控条件下进行基准测试。 PARHAF 为在完全保护隐私的环境中培训和评估法语临床语言模型提供了宝贵的资源，并建立了一种可复制的方法，用于在其他语言和卫生系统中构建可共享的合成临床语料库。

Title: Evaluating Large Language Models on Historical Health Crisis Knowledge in Resource-Limited Settings: A Hybrid Multi-Metric Study

Authors: Mohammed Rakibul Hasan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.20514
Pdf URL: https://arxiv.org/pdf/2603.20514
Copy Paste: [[2603.20514]] Evaluating Large Language Models on Historical Health Crisis Knowledge in Resource-Limited Settings: A Hybrid Multi-Metric Study(https://arxiv.org/abs/2603.20514)
Keywords: language model, gpt, llm
Abstract: Large Language Models (LLMs) offer significant potential for delivering health information. However, their reliability in low-resource contexts remains uncertain. This study evaluates GPT-4, Gemini Pro, Llama 3, and Mistral-7B on health crisis-related enquiries concerning COVID-19, dengue, the Nipah virus, and Chikungunya in the low-resource context of Bangladesh. We constructed a question-answer dataset from authoritative sources and assessed model outputs through semantic similarity, expert-model cross-evaluation, and Natural Language Inference (NLI). Findings highlight both the strengths and limitations of LLMs in representing epidemiological history and health crisis knowledge, underscoring their promise and risks for informing policy in resource-constrained environments.
摘要：大型语言模型 (LLM) 为提供健康信息提供了巨大的潜力。然而，它们在资源匮乏的情况下的可靠性仍然不确定。本研究评估了 GPT-4、Gemini Pro、Llama 3 和 Mistral-7B 在孟加拉国资源匮乏的情况下对有关 COVID-19、登革热、尼帕病毒和基孔肯雅热的健康危机相关查询的情况。我们从权威来源构建了一个问答数据集，并通过语义相似性、专家模型交叉评估和自然语言推理 (NLI) 评估模型输出。研究结果强调了 LLM 在代表流行病学历史和健康危机知识方面的优势和局限性，强调了它们在资源有限的环境中为政策提供信息的前景和风险。

Title: Permutation-Consensus Listwise Judging for Robust Factuality Evaluation

Authors: Tianyi Huang, Nathan Huang, Justin Tang, Wenqian Chen, Elsa Fan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.20562
Pdf URL: https://arxiv.org/pdf/2603.20562
Copy Paste: [[2603.20562]] Permutation-Consensus Listwise Judging for Robust Factuality Evaluation(https://arxiv.org/abs/2603.20562)
Keywords: language model, llm, hallucination, prompt
Abstract: Large language models (LLMs) are now widely used as judges, yet their decisions can change under presentation choices that should be irrelevant. We study one such source of instability: candidate-order sensitivity in listwise factuality evaluation, where several answers can look similarly polished while differing sharply in hallucination risk. We introduce PCFJudge, an inference-time method that reruns the same factuality-first listwise prompt over multiple orderings of the same candidate set and aggregates the resulting scores, ranks, and uncertainty signals into a single consensus decision. On RewardBench 2 Factuality, PCFJudge improves over direct judging by up to 7 absolute points. Development ablations show that the dominant gain comes from permutation consensus itself rather than from heavier arbitration layers. These results suggest that a meaningful share of factuality-judging error arises from order instability, and that averaging over this nuisance variation is a simple and effective way to make LLM evaluation more reliable.
摘要：大型语言模型（LLM）现在被广泛用作评判者，但他们的决定可能会根据本应无关的呈现选择而改变。我们研究了这样一种不稳定的来源：列表事实性评估中的候选顺序敏感性，其中几个答案可能看起来同样精致，但幻觉风险却截然不同。我们引入了 PCFJudge，这是一种推理时方法，它对同一候选集的多个排序重新运行相同的事实优先列表提示，并将所得分数、排名和不确定性信号聚合成单个共识决策。在 RewardBench 2 Factuality 上，PCFJudge 比直接判断提高了最多 7 个绝对点。开发消融表明，主要收益来自排列共识本身，而不是来自较重的仲裁层。这些结果表明，事实判断错误的很大一部分是由顺序不稳定引起的，对这种干扰性变化进行平均是使 LLM 评估更加可靠的简单而有效的方法。
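
Note on permutation consensus: a minimal sketch of the general idea, assuming a listwise judge that returns one score per displayed position. The `Judge` signature, the use of cyclic rotations as the set of orderings, mean-score aggregation, and the toy position-biased judge are all hypothetical choices for illustration; the abstract does not specify PCFJudge's prompts or aggregation rules.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical judge interface: scores[i] is the score of the i-th shown candidate.
Judge = Callable[[Sequence[str]], List[float]]


def rotations(n: int) -> List[List[int]]:
    """All n cyclic rotations of the indices 0..n-1."""
    return [[(start + k) % n for k in range(n)] for start in range(n)]


def permutation_consensus(
    candidates: Sequence[str], judge: Judge
) -> Tuple[int, Dict[int, float]]:
    """Re-run the judge over several orderings and average each candidate's score."""
    n = len(candidates)
    totals = {i: 0.0 for i in range(n)}
    orders = rotations(n)
    for order in orders:
        scores = judge([candidates[i] for i in order])
        for position, idx in enumerate(order):
            totals[idx] += scores[position]
    means = {i: totals[i] / len(orders) for i in range(n)}
    best = max(means, key=lambda i: means[i])
    return best, means


# Toy judge with a made-up bonus for whichever candidate is shown first.
QUALITY = {"A": 0.6, "B": 0.5, "C": 0.1}


def biased_judge(shown: Sequence[str]) -> List[float]:
    return [QUALITY[c] + (0.3 if pos == 0 else 0.0) for pos, c in enumerate(shown)]


shown_order = ["B", "A", "C"]
direct_scores = biased_judge(shown_order)
best_index, mean_scores = permutation_consensus(shown_order, biased_judge)
print(shown_order[direct_scores.index(max(direct_scores))], shown_order[best_index])
```

With rotations, every candidate occupies each position exactly once, so a purely positional bonus cancels out in the average. A real judge may have order effects that rotations alone do not cancel.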

Title: JUBAKU: An Adversarial Benchmark for Exposing Culturally Grounded Stereotypes in Japanese LLMs

Authors: Taihei Shiotani, Masahiro Kaneko, Ayana Niwa, Yuki Maruyama, Daisuke Oba, Masanari Ohi, Naoaki Okazaki
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.20581
Pdf URL: https://arxiv.org/pdf/2603.20581
Copy Paste: [[2603.20581]] JUBAKU: An Adversarial Benchmark for Exposing Culturally Grounded Stereotypes in Japanese LLMs(https://arxiv.org/abs/2603.20581)
Keywords: language model, llm
Abstract: Social biases reflected in language are inherently shaped by cultural norms, which vary significantly across regions and lead to diverse manifestations of stereotypes. Existing evaluations of social bias in large language models (LLMs) for non-English contexts, however, often rely on translations of English benchmarks. Such benchmarks fail to reflect local cultural norms, including those found in Japanese. For instance, Western benchmarks may overlook Japan-specific stereotypes related to hierarchical relationships, regional dialects, or traditional gender roles. To address this limitation, we introduce Japanese cUlture adversarial BiAs benchmarK Under handcrafted creation (JUBAKU), a benchmark tailored to Japanese cultural contexts. JUBAKU uses adversarial construction to expose latent biases across ten distinct cultural categories. Unlike existing benchmarks, JUBAKU features dialogue scenarios hand-crafted by native Japanese annotators, specifically designed to trigger and reveal latent social biases in Japanese LLMs. We evaluated nine Japanese LLMs on JUBAKU and on three other benchmarks adapted from English ones. All models clearly exhibited biases on JUBAKU, performing below the random baseline of 50% with an average accuracy of 23% (ranging from 13% to 33%), despite higher accuracy on the other benchmarks. Human annotators achieved 91% accuracy in identifying unbiased responses, confirming JUBAKU's reliability and its adversarial nature toward LLMs.
摘要：语言中反映的社会偏见本质上是由文化规范塑造的，而文化规范在不同地区之间差异很大，并导致刻板印象的不同表现。然而，现有的非英语环境大语言模型（LLM）社会偏见评估通常依赖于英语基准的翻译。这些基准未能反映当地的文化规范，包括日语中的文化规范。例如，西方基准可能会忽视日本特有的与等级关系、地区方言或传统性别角色相关的刻板印象。为了解决这一限制，我们引入了日本文化对抗性 BiAs 基准手工创作 (JUBAKU)，这是一个针对日本文化背景量身定制的基准。 JUBAKU 使用对抗性构建来揭露十种不同文化类别中的潜在偏见。与现有基准不同，JUBAKU 具有由日本本土注释者手工制作的对话场景，专门用于触发和揭示日语 LLM 中潜在的社会偏见。我们在 JUBAKU 以及另外三个改编自英语基准的基准上评估了九个日语 LLM。所有模型在 JUBAKU 上都明显表现出偏差，表现低于 50% 的随机基线，平均准确度为 23%（范围从 13% 到 33%），尽管其他基准的准确度更高。人类注释者在识别无偏见反应方面达到了 91% 的准确率，证实了 JUBAKU 的可靠性及其对 LLM 的对抗性。

Title: A Modular LLM Framework for Explainable Price Outlier Detection

Authors: Shadi Sartipi, John Wu, Sina Ghotbi, Nikhita Vedula, Shervin Malmasi
Subjects: cs.CL, cs.CE
Abstract URL: https://arxiv.org/abs/2603.20636
Pdf URL: https://arxiv.org/pdf/2603.20636
Copy Paste: [[2603.20636]] A Modular LLM Framework for Explainable Price Outlier Detection(https://arxiv.org/abs/2603.20636)
Keywords: language model, llm, agent
Abstract: Detecting product price outliers is important for retail and e-commerce stores, as erroneous or unexpectedly high prices adversely affect competitiveness, revenue, and consumer trust. Classical techniques offer simple thresholds while ignoring the rich semantic relationships among product attributes. We propose an agentic Large Language Model (LLM) framework that treats outlier price flagging as a reasoning task grounded in related product detection and comparison. The system processes the prices of target products in three stages: (i) relevance classification selects price-relevant similar products using product descriptions and attributes; (ii) relative utility assessment evaluates the target product against each similar product along price-influencing dimensions (e.g., brand, size, features); (iii) reasoning-based decision aggregates these justifications into an explainable price outlier judgment. The framework attains over 75% agreement with human auditors on a test dataset, and outperforms zero-shot and retrieval-based LLM techniques. Ablation studies show the sensitivity of the method to key hyper-parameters and demonstrate its flexibility across cases with different accuracy requirements and auditor agreements.
摘要：检测产品价格异常值对于零售和电子商务商店非常重要，因为错误或意外的高价格会对竞争力、收入和消费者信任产生不利影响。经典技术提供了简单的阈值，而忽略了产品属性之间丰富的语义关系。我们提出了一种代理大语言模型（LLM）框架，该框架将异常价格标记视为基于相关产品检测和比较的推理任务。系统分三个阶段处理目标产品的价格：（i）相关性分类，利用产品描述和属性选择价格相关的相似产品； (ii) 相对效用评估根据每个类似产品的价格影响维度（例如品牌、尺寸、功能）来评估目标产品； (iii) 基于推理的决策将这些理由汇总为可解释的价格异常判断。该框架在测试数据集上与人类审计员的一致性超过 75%，并且优于基于零样本和检索的 LLM 技术。消融研究显示了该方法对关键超参数的敏感性，并证明了其适用于具有不同准确性要求和审计协议的情况的灵活性。

Title: Hear Both Sides: Efficient Multi-Agent Debate via Diversity-Aware Message Retention

Authors: Manh Nguyen, Anh Nguyen, Dung Nguyen, Svetha Venkatesh, Hung Le
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.20640
Pdf URL: https://arxiv.org/pdf/2603.20640
Copy Paste: [[2603.20640]] Hear Both Sides: Efficient Multi-Agent Debate via Diversity-Aware Message Retention(https://arxiv.org/abs/2603.20640)
Keywords: language model, agent
Abstract: Multi-Agent Debate has emerged as a promising framework for improving the reasoning quality of large language models through iterative inter-agent communication. However, broadcasting all agent messages at every round introduces noise and redundancy that can degrade debate quality and waste computational resources. Current approaches rely on uncertainty estimation to filter low-confidence responses before broadcasting, but this approach is unreliable due to miscalibrated confidence scores and sensitivity to threshold selection. To address this, we propose Diversity-Aware Retention (DAR), a lightweight debate framework that, at each debate round, selects the subset of agent responses that maximally disagree with each other and with the majority vote before broadcasting. Through an explicit index-based retention mechanism, DAR preserves the original messages without modification, ensuring that retained disagreements remain authentic. Experiments on diverse reasoning and question answering benchmarks demonstrate that our selective message propagation consistently improves debate performance, particularly as the number of agents scales, where noise accumulation is most severe. Our results highlight that what agents hear is as important as what agents say in multi-agent reasoning systems.
摘要：多智能体辩论已成为一种有前途的框架，可通过迭代智能体间通信来提高大型语言模型的推理质量。然而，在每一轮广播所有代理消息会引入噪声和冗余，从而降低辩论质量并浪费计算资源。当前的方法依靠不确定性估计来在广播之前过滤低置信度响应，但由于置信度分数校准错误和对阈值选择的敏感性，这种方法不可靠。为了解决这个问题，我们提出了多样性感知保留（DAR），这是一个轻量级辩论框架，在每一轮辩论中，先选择彼此之间以及与多数票之间分歧最大的代理响应子集，然后再进行广播。通过显式的基于索引的保留机制，DAR 保留原始消息而不进行修改，确保保留的分歧保持真实性。对不同推理和问答基准的实验表明，我们的选择性消息传播持续提高了辩论性能，特别是当代理数量增加时，噪声积累最严重。我们的结果强调，在多智能体推理系统中，智能体听到的内容与智能体所说的一样重要。
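
Note on diversity-aware retention: a rough sketch of index-based selection, assuming each agent's message has already been reduced to a final answer string. Exact-match disagreement, the additive score, and the brute-force search over subsets are stand-ins chosen for illustration; the abstract does not say how DAR measures disagreement or performs the selection.

```python
from collections import Counter
from itertools import combinations
from typing import List, Sequence


def majority_answer(answers: Sequence[str]) -> str:
    """Most frequent answer; ties resolve to the earliest-seen answer."""
    return Counter(answers).most_common(1)[0][0]


def disagreement(subset: Sequence[int], answers: Sequence[str], majority: str) -> int:
    """Count pairwise mismatches inside the subset plus mismatches with the majority."""
    pairwise = sum(answers[i] != answers[j] for i, j in combinations(subset, 2))
    vs_majority = sum(answers[i] != majority for i in subset)
    return pairwise + vs_majority


def select_retained(answers: Sequence[str], k: int) -> List[int]:
    """Return the indices of the k responses with the highest disagreement score."""
    majority = majority_answer(answers)
    k = min(k, len(answers))
    best = max(
        combinations(range(len(answers)), k),
        key=lambda s: disagreement(s, answers, majority),
    )
    return list(best)


# Made-up debate round: five agents, three agree on "12".
messages = ["I get 12.", "Answer: 12", "12, by symmetry", "I think 15", "It is 9"]
answers = ["12", "12", "12", "15", "9"]
retained = select_retained(answers, k=2)
broadcast = [messages[i] for i in retained]  # original messages, unmodified
print(retained, broadcast)
```

Returning indices rather than rewritten text mirrors the abstract's point that retained messages are passed on without modification.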

Title: Weber's Law in Transformer Magnitude Representations: Efficient Coding, Representational Geometry, and Psychophysical Laws in Language Models

Authors: Jon-Paul Cacioli
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.20642
Pdf URL: https://arxiv.org/pdf/2603.20642
Copy Paste: [[2603.20642]] Weber's Law in Transformer Magnitude Representations: Efficient Coding, Representational Geometry, and Psychophysical Laws in Language Models(https://arxiv.org/abs/2603.20642)
Keywords: language model
Abstract: How do transformer language models represent magnitude? Recent work disagrees: some find logarithmic spacing, others linear encoding, others per-digit circular representations. We apply the formal tools of psychophysics to resolve this. Using four converging paradigms (representational similarity analysis, behavioural discrimination, precision gradients, causal intervention) across three magnitude domains in three 7-9B instruction-tuned models spanning three architecture families (Llama, Mistral, Qwen), we report three findings. First, representational geometry is consistently log-compressive: RSA correlations with a Weber-law dissimilarity matrix ranged from .68 to .96 across all 96 model-domain-layer cells, with linear geometry never preferred. Second, this geometry is dissociated from behaviour: one model produces a human-range Weber fraction (WF = 0.20) while the other does not, and both models perform at chance on temporal and spatial discrimination despite possessing logarithmic geometry. Third, causal intervention reveals a layer dissociation: early layers are functionally implicated in magnitude processing (4.1x specificity) while later layers where geometry is strongest are not causally engaged (1.2x). Corpus analysis confirms the efficient coding precondition (alpha = 0.77). These results suggest that training data statistics alone are sufficient to produce log-compressive magnitude geometry, but geometry alone does not guarantee behavioural competence.
摘要：Transformer 语言模型如何表示大小？最近的工作不同意：一些人发现对数间距，另一些人发现线性编码，另一些人发现每个数字的圆形表示。我们应用心理物理学的正式工具来解决这个问题。在跨越三个架构系列（Llama、Mistral、Qwen）的三个 7-9B 指令调整模型中，使用跨三个量级领域的四种收敛范式（表征相似性分析、行为辨别、精度梯度、因果干预），我们报告了三项发现。首先，表征几何始终是对数压缩的：在所有 96 个模型域层单元中，RSA 与韦伯定律相异矩阵的相关性范围为 0.68 到 0.96，线性几何从来不是首选。其次，这种几何学与行为分离：一个模型产生人类范围的韦伯分数（WF = 0.20），而另一个模型则不然，尽管拥有对数几何学，但两个模型在时间和空间辨别上都有机会表现。第三，因果干预揭示了层分离：早期层在功能上涉及幅度处理（4.1x 特异性），而几何最强的后期层则没有因果关系（1.2x）。语料库分析证实了有效编码的前提条件（alpha = 0.77）。这些结果表明，仅训练数据统计就足以产生对数压缩幅度几何，但仅几何并不能保证行为能力。

Title: PAVE: Premise-Aware Validation and Editing for Retrieval-Augmented LLMs

Authors: Tianyi Huang, Caden Yang, Emily Yin, Eric Wang, Michael Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.20673
Pdf URL: https://arxiv.org/pdf/2603.20673
Copy Paste: [[2603.20673]] PAVE: Premise-Aware Validation and Editing for Retrieval-Augmented LLMs(https://arxiv.org/abs/2603.20673)
Keywords: language model, llm
Abstract: Retrieval-augmented language models can retrieve relevant evidence yet still commit to answers before explicitly checking whether the retrieved context supports the conclusion. We present PAVE (Premise-Grounded Answer Validation and Editing), an inference-time validation layer for evidence-grounded question answering. PAVE decomposes retrieved context into question-conditioned atomic facts, drafts an answer, scores how well that draft is supported by the extracted premises, and revises low-support outputs before finalization. The resulting trace makes answer commitment auditable at the level of explicit premises, support scores, and revision decisions. In controlled ablations with a fixed retriever and backbone, PAVE outperforms simpler post-retrieval baselines in two evidence-grounded QA settings, with the largest gain reaching 32.7 accuracy points on a span-grounded benchmark. We view these findings as proof-of-concept evidence that explicit premise extraction plus support-gated revision can strengthen evidence-grounded consistency in retrieval-augmented LLM systems.
摘要：检索增强语言模型可以检索相关证据，但在明确检查检索到的上下文是否支持结论之前仍然承诺答案。我们提出了 PAVE（前提答案验证和编辑），这是一个用于基于证据的问答的推理时验证层。 PAVE 将检索到的上下文分解为以问题为条件的原子事实，起草答案，对提取的前提支持草稿的程度进行评分，并在最终确定之前修改低支持输出。由此产生的跟踪使得答案承诺在显式前提、支持分数和修订决策级别上可审计。在具有固定检索器和骨干的受控消融中，PAVE 在两种基于证据的 QA 设置中优于更简单的检索后基线，在基于跨度的基准上最大增益达到 32.7 准确度点。我们将这些发现视为概念验证证据，表明显式前提提取加上支持门控修订可以增强检索增强 LLM 系统中基于证据的一致性。

Title: Reasoning Topology Matters: Network-of-Thought for Complex Reasoning Tasks

Authors: Fan Huang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.20730
Pdf URL: https://arxiv.org/pdf/2603.20730
Copy Paste: [[2603.20730]] Reasoning Topology Matters: Network-of-Thought for Complex Reasoning Tasks(https://arxiv.org/abs/2603.20730)
Keywords: gpt, llm, prompt, chain-of-thought, tree-of-thought
Abstract: Existing prompting paradigms structure LLM reasoning in limited topologies: Chain-of-Thought (CoT) produces linear traces, while Tree-of-Thought (ToT) performs branching search. Yet complex reasoning often requires merging intermediate results, revisiting hypotheses, and integrating evidence from multiple sources. We propose Network-of-Thought (NoT), a framework that models reasoning as a directed graph with typed nodes and edges, guided by a heuristic-based controller policy. Across four benchmarks (GSM8K, Game of 24, HotpotQA, ProofWriter) and three models (GPT-4o-mini, Llama-3.3-70B-Instruct, Qwen2.5-72B-Instruct), we investigate when network topology outperforms chain or tree structures, whether LLM-generated heuristics can guide graph-based reasoning search, and the computation-accuracy tradeoff across topologies, evaluating each method on accuracy, topology simplicity, and token efficiency. Our results show that CoT remains effective for sequential tasks with GPT-4o-mini (89.5% on GSM8K), while NoT surpasses ToT on multi-hop reasoning (91.0% vs. 88.0% on HotpotQA with LLM-as-Judge). With 72B open-source models, NoT achieves the highest accuracy on GSM8K (91.5%), and Qwen2.5-72B achieves the best multi-hop QA result overall (91.7% on HotpotQA). Self-generated controller heuristics outperform fixed and random strategies on logical reasoning, with uncertainty-only weighting achieving 57.0% on ProofWriter. We also find that evaluation methodology significantly impacts method rankings: string-match underestimates all methods on open-ended QA, with the largest gap for NoT, a pattern consistent across all three models (14-18 percentage point gap on HotpotQA).
摘要：现有的提示范式在有限的拓扑中构建 LLM 推理：思想链 (CoT) 产生线性轨迹，而思想树 (ToT) 执行分支搜索。然而复杂的推理通常需要合并中间结果、重新审视假设以及整合多个来源的证据。我们提出了思想网络（NoT），这是一个框架，将推理建模为具有类型化节点和边的有向图，并由基于启发式的控制器策略引导。通过四个基准测试（GSM8K、Game of 24、HotpotQA、ProofWriter）和三个模型（GPT-4o-mini、Llama-3.3-70B-Instruct、Qwen2.5-72B-Instruct），我们研究了网络拓扑何时优于链或树结构、LLM 生成的启发式是否可以指导基于图的推理搜索，以及跨拓扑的计算精度权衡，评估每种方法的精度、拓扑简单性和代币效率。我们的结果表明，CoT 对于使用 GPT-4o-mini 的顺序任务仍然有效（在 GSM8K 上为 89.5%），而 NoT 在多跳推理上超过了 ToT（在使用 LLM 作为评判者的 HotpotQA 上为 91.0% vs. 88.0%）。借助 72B 开源模型，NoT 在 GSM8K 上实现了最高准确率（91.5%），Qwen2.5-72B 总体上实现了最佳多跳 QA 结果（在 HotpotQA 上为 91.7%）。自生成控制器启发式在逻辑推理方面优于固定和随机策略，仅不确定性加权在 ProofWriter 上达到 57.0%。我们还发现评估方法显著影响方法排名：字符串匹配低估了开放式 QA 上的所有方法，NoT 的差距最大，所有三个模型的模式一致（HotpotQA 上的差距为 14-18 个百分点）。

Title: MzansiText and MzansiLM: An Open Corpus and Decoder-Only Language Model for South African Languages

Authors: Anri Lombard, Simbarashe Mawere, Temi Aina, Ethan Wolff, Sbonelo Gumede, Elan Novick, Francois Meyer, Jan Buys
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.20732
Pdf URL: https://arxiv.org/pdf/2603.20732
Copy Paste: [[2603.20732]] MzansiText and MzansiLM: An Open Corpus and Decoder-Only Language Model for South African Languages(https://arxiv.org/abs/2603.20732)
Keywords: language model
Abstract: Decoder-only language models can be adapted to diverse tasks through instruction finetuning, but the extent to which this generalizes at small scale for low-resource languages remains unclear. We focus on the languages of South Africa, where we are not aware of a publicly available decoder-only model that explicitly targets all eleven official written languages, nine of which are low-resource. We introduce MzansiText, a curated multilingual pretraining corpus with a reproducible filtering pipeline, and MzansiLM, a 125M-parameter language model trained from scratch. We evaluate MzansiLM on natural language understanding and generation using three adaptation regimes: monolingual task-specific finetuning, multilingual task-specific finetuning, and general multi-task instruction finetuning. Monolingual task-specific finetuning achieves strong performance on data-to-text generation, reaching 20.65 BLEU on isiXhosa and competing with encoder-decoder baselines over ten times larger. Multilingual task-specific finetuning benefits closely related languages on topic classification, achieving 78.5% macro-F1 on isiXhosa news classification. While MzansiLM adapts effectively to supervised NLU and NLG tasks, few-shot reasoning remains challenging at this model size, with performance near chance even for much larger decoder-only models. We release MzansiText and MzansiLM to provide a reproducible decoder-only baseline and clear guidance on adaptation strategies for South African languages at small scale.
摘要：仅解码器的语言模型可以通过指令微调来适应不同的任务，但这对于低资源语言的小规模推广的程度仍不清楚。我们专注于南非的语言，我们不知道那里有一个公开可用的仅解码器模型，该模型明确针对所有 11 种官方书面语言，其中 9 种语言资源匮乏。我们引入了 MzansiText（一个具有可重复过滤管道的精选多语言预训练语料库）和 MzansiLM（一个从头开始训练的 125M 参数语言模型）。我们使用三种适应机制来评估 MzansiLM 在自然语言理解和生成方面的表现：单语言特定任务微调、多语言特定任务微调和一般多任务指令微调。单语言任务特定的微调在数据到文本生成方面实现了强大的性能，在 isiXhosa 上达到 20.65 BLEU，并与大十倍以上的编码器-解码器基线竞争。多语言特定任务微调有利于密切相关的语言在主题分类上的表现，在 isiXhosa 新闻分类上实现了 78.5% 的宏 F1。虽然 MzansiLM 可以有效地适应有监督的 NLU 和 NLG 任务，但在这种模型大小下，小样本推理仍然具有挑战性，即使对于更大的纯解码器模型，性能也几乎是机会。我们发布了 MzansiText 和 MzansiLM，为南非语言小规模的适应策略提供可复制的纯解码器基线和明确的指导。

Title: Code-MIE: A Code-style Model for Multimodal Information Extraction with Scene Graph and Entity Attribute Knowledge Enhancement

Authors: Jiang Liu, Ge Qiu, Hao Fei, Dongdong Xie, Jinbo Li, Fei Li, Chong Teng, Donghong Ji
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.20781
Pdf URL: https://arxiv.org/pdf/2603.20781
Copy Paste: [[2603.20781]] Code-MIE: A Code-style Model for Multimodal Information Extraction with Scene Graph and Entity Attribute Knowledge Enhancement(https://arxiv.org/abs/2603.20781)
Keywords: language model, llm
Abstract: With the rapid development of large language models (LLMs), more and more researchers have paid attention to information extraction based on LLMs. However, there are still some spaces to improve in the existing related methods. First, existing multimodal information extraction (MIE) methods usually employ natural language templates as the input and output of LLMs, which mismatch with the characteristics of information tasks that mostly include structured information such as entities and relations. Second, although a few methods have adopted structured and more IE-friendly code-style templates, they just explored their methods on text-only IE rather than multimodal IE. Moreover, their methods are more complex in design, requiring separate templates to be designed for each task. In this paper, we propose a Code-style Multimodal Information Extraction framework (Code-MIE) which formalizes MIE as unified code understanding and generation. Code-MIE has the following novel designs: (1) Entity attributes such as gender, affiliation are extracted from the text to guide the model to understand the context and role of entities. (2) Images are converted into scene graphs and visual features to incorporate rich visual information into the model. (3) The input template is constructed as a Python function, where entity attributes, scene graphs and raw text compose of the function parameters. In contrast, the output template is formalized as Python dictionaries containing all extraction results such as entities, relations, etc. To evaluate Code-MIE, we conducted extensive experiments on the M$^3$D, Twitter-15, Twitter-17, and MNRE datasets. The results show that our method achieves state-of-the-art performance compared to six competing baseline models, with 61.03\% and 60.49\% on the English and Chinese datasets of M$^3$D, and 76.04\%, 88.07\%, and 73.94\% on the other three datasets.
摘要：随着大型语言模型（LLM）的快速发展，越来越多的研究人员关注基于LLM的信息提取。然而，现有的相关方法仍然存在一些需要改进的空间。首先，现有的多模态信息提取（MIE）方法通常采用自然语言模板作为LLM的输入和输出，这与大多数包含实体和关系等结构化信息的信息任务的特征不匹配。其次，尽管一些方法采用了结构化且更适合 IE 的代码风格模板，但他们只是在纯文本 IE 而不是多模式 IE 上探索了他们的方法。此外，他们的方法在设计上更加复杂，需要为每个任务设计单独的模板。在本文中，我们提出了一种代码风格的多模态信息提取框架（Code-MIE），它将 MIE 形式化为统一的代码理解和生成。 Code-MIE具有以下新颖的设计：（1）从文本中提取性别、隶属关系等实体属性，以指导模型理解实体的上下文和角色。 (2)将图像转换为场景图和视觉特征，将丰富的视觉信息融入到模型中。 (3)输入模板被构造为Python函数，其中实体属性、场景图和原始文本组成函数参数。相反，输出模板被形式化为包含所有提取结果（例如实体、关系等）的 Python 字典。为了评估 Code-MIE，我们在 M$^3$D、Twitter-15、Twitter-17 和 MNRE 数据集上进行了广泛的实验。结果表明，与六个竞争基线模型相比，我们的方法实现了最先进的性能，在 M$^3$D 的英文和中文数据集上达到了 61.03\% 和 60.49\%，在其他三个数据集上达到了 76.04\%、88.07\% 和 73.94\%。
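
Illustration for the Code-MIE entry above: the abstract describes an input template built as a Python function (with entity attributes, scene graph, and raw text as parameters) and an output formalized as Python dictionaries. The sketch below shows one way such a prompt could be rendered and its completion parsed. The function name `extract_information`, the parameter names, the dictionary keys, and the sample data are all hypothetical; the paper's actual template is not given in the abstract.

```python
import ast


def build_code_style_prompt(text, entity_attributes, scene_graph):
    """Render the inputs as the parameters of a Python function stub.

    The stub name and parameter names are hypothetical, chosen for illustration.
    """
    lines = [
        "def extract_information(",
        f"    text={text!r},",
        f"    entity_attributes={entity_attributes!r},",
        f"    scene_graph={scene_graph!r},",
        "):",
        '    """Return every entity and relation found in the inputs."""',
        "    return ",
    ]
    return "\n".join(lines)


def parse_output(completion):
    """Parse a completion that is expected to be a Python dictionary literal."""
    result = ast.literal_eval(completion.strip())
    if not isinstance(result, dict):
        raise ValueError("expected a dictionary of extraction results")
    return result


prompt = build_code_style_prompt(
    text="Alice joined Acme Corp.",
    entity_attributes={"Alice": {"affiliation": "Acme Corp"}},
    scene_graph=[("woman", "in front of", "building")],
)
# A made-up completion standing in for real model output.
completion = "{'entities': [{'text': 'Alice', 'type': 'PER'}], 'relations': []}"
print(prompt)
print(parse_output(completion))
```

Parsing with `ast.literal_eval` rather than `eval` keeps the sketch from executing arbitrary model output; it accepts only Python literals.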

Title: The Anatomy of an Edit: Mechanism-Guided Activation Steering for Knowledge Editing

Authors: Yuan Cao, Mingyang Wang, Hinrich Schütze
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.20795
Pdf URL: https://arxiv.org/pdf/2603.20795
Copy Paste: [[2603.20795]] The Anatomy of an Edit: Mechanism-Guided Activation Steering for Knowledge Editing(https://arxiv.org/abs/2603.20795)
Keywords: language model, gpt, llm
Abstract: Large language models (LLMs) are increasingly used as knowledge bases, but keeping them up to date requires targeted knowledge editing (KE). However, it remains unclear how edits are implemented inside the model once applied. In this work, we take a mechanistic view of KE using neuron-level knowledge attribution (NLKA). Unlike prior work that focuses on pre-edit causal tracing and localization, we use post-edit attribution -- contrasting successful and failed edits -- to isolate the computations that shift when an edit succeeds. Across representative KE methods, we find a consistent pattern: mid-to-late attention predominantly promotes the new target, while attention and FFN modules cooperate to suppress the original fact. Motivated by these findings, we propose MEGA, a MEchanism-Guided Activation steering method that performs attention-residual interventions in attribution-aligned regions without modifying model weights. On CounterFact and Popular, MEGA achieves strong editing performance across KE metrics on GPT2-XL and LLaMA2-7B. Overall, our results elevate post-edit attribution from analysis to engineering signal: by pinpointing where and how edits take hold, it powers MEGA to deliver reliable, architecture-agnostic knowledge edits.
摘要：大型语言模型 (LLM) 越来越多地用作知识库，但保持它们最新需要有针对性的知识编辑 (KE)。然而，尚不清楚应用编辑后如何在模型内实现。在这项工作中，我们使用神经元级知识归因（NLKA）对 KE 进行机械观察。与之前专注于编辑前因果追踪和定位的工作不同，我们使用编辑后归因（对比成功和失败的编辑）来隔离编辑成功时发生变化的计算。在代表性的 KE 方法中，我们发现了一个一致的模式：中后期注意力主要促进新目标，而注意力和 FFN 模块合作抑制原始事实。受这些发现的启发，我们提出了 MEGA，一种机制引导的激活控制方法，可在归因对齐区域中执行注意力残留干预，而无需修改模型权重。在 CounterFact 和 Pop 上，MEGA 在 GPT2-XL 和 LLaMA2-7B 上的 KE 指标上实现了强大的编辑性能。总体而言，我们的结果将编辑后归因从分析提升到工程信号：通过精确定位编辑发生的位置和方式，它使 MEGA 能够提供可靠的、与架构无关的知识编辑。

Title: RLVR Training of LLMs Does Not Improve Thinking Ability for General QA: Evaluation Method and a Simple Solution

Authors: Kaiyuan Li, Jing-Cheng Pang, Yang Yu
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2603.20799
Pdf URL: https://arxiv.org/pdf/2603.20799
Copy Paste: [[2603.20799]] RLVR Training of LLMs Does Not Improve Thinking Ability for General QA: Evaluation Method and a Simple Solution(https://arxiv.org/abs/2603.20799)
Keywords: language model, llm
Abstract: Reinforcement learning from verifiable rewards (RLVR) stimulates the thinking processes of large language models (LLMs), substantially enhancing their reasoning abilities on verifiable tasks. It is often assumed that similar gains should transfer to general question answering (GQA), but this assumption has not been thoroughly validated. To assess whether RLVR automatically improves LLM performance on GQA, we propose a Cross-Generation evaluation framework that measures the quality of intermediate reasoning by feeding the generated thinking context into LLMs of varying capabilities. Our evaluation leads to a discouraging finding: the efficacy of the thinking process on GQA tasks is markedly lower than on verifiable tasks, suggesting that explicit training on GQA remains necessary in addition to training on verifiable tasks. We further observe that direct RL training on GQA is less effective than RLVR. Our hypothesis is that, whereas verifiable tasks demand robust logical chains to obtain high rewards, GQA tasks often admit shortcuts to high rewards without cultivating high-quality thinking. To avoid possible shortcuts, we introduce a simple method, Separated Thinking And Response Training (START), which first trains only the thinking process, using rewards defined on the final answer. We show that START improves both the quality of thinking and the final answer across several GQA benchmarks and RL algorithms.
摘要：可验证奖励的强化学习（RLVR）刺激了大型语言模型（LLM）的思维过程，大大增强了它们在可验证任务上的推理能力。人们通常认为类似的收益应该转移到一般问答（GQA）中，但这一假设尚未得到彻底验证。为了评估 RLVR 是否自动提高 GQA 上的 LLM 表现，我们提出了一个跨代评估框架，通过将生成的思维上下文输入不同能力的 LLM 来衡量中间推理的质量。我们的评估得出了一个令人沮丧的发现：GQA 任务的思维过程效率明显低于可验证任务，这表明除了可验证任务的培训之外，GQA 的显式培训仍然是必要的。我们进一步观察到，在 GQA 上直接进行 RL 训练不如 RLVR 有效。我们的假设是，虽然可验证的任务需要强大的逻辑链才能获得高奖励，但 GQA 任务通常承认获得高奖励的捷径，而无需培养高质量的思维。为了避免可能的捷径，我们引入了一种简单的方法，即分离思维和反应训练（START），它首先使用最终答案定义的奖励仅训练思维过程。我们证明，START 在多个 GQA 基准测试和 RL 算法中提高了思维质量和最终答案。

Title: BenchBench: Benchmarking Automated Benchmark Generation

Authors: Yandan Zheng, Haoran Luo, Zhenghong Lin, Wenjin Liu, Luu Anh Tuan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.20807
Pdf URL: https://arxiv.org/pdf/2603.20807
Copy Paste: [[2603.20807]] BenchBench: Benchmarking Automated Benchmark Generation(https://arxiv.org/abs/2603.20807)
Keywords: language model, llm, prompt
Abstract: Benchmarks are the de facto standard for tracking progress in large language models (LLMs), yet static test sets can rapidly saturate, become vulnerable to contamination, and are costly to refresh. Scalable evaluation of open-ended items often relies on LLM judges, introducing additional sources of bias and prompt sensitivity. We argue that evaluation must extend beyond how well models answer benchmarks to how well models design them. We introduce BenchBench, a three-stage pipeline and dataset for benchmarking automated benchmark generation: (i) extract structured domain cards from seed benchmarks, (ii) prompt multiple designer LLMs to generate quota-controlled suites, and (iii) validate items with a multi-model answerer panel using exact/numeric/symbolic verifiers when possible and rubric-guided judging otherwise, yielding designer--answerer matrices with item-level quality flags and psychometric diagnostics. Across nine variants spanning computer science, mathematics, medicine, and theory-of-mind reasoning (including multilingual and multimodal settings), we generate 16.7K items, retain ~15K core items post-filtering, and produce ~152K graded model--item responses. BenchBench shows that benchmark-design ability is only moderately correlated with answer-time strength (Spearman rho ~0.37), invalidity is negatively associated with discrimination (Pearson r~0.62), and the resulting designer--answerer matrices enable scalable audits of format/modality/language fidelity and suite-dependent self/family interactions. The project is available at: this https URL.
摘要：基准是跟踪大型语言模型 (LLM) 进展的事实上的标准，但静态测试集可能会迅速饱和，容易受到污染，并且刷新成本高昂。对开放式项目的可扩展评估通常依赖于法学硕士法官，这会引入额外的偏见来源和即时敏感性。我们认为，评估必须超越模型如何回答基准，以及模型如何设计基准。我们引入了BenchBench，一个用于自动基准生成的三阶段管道和数据集：（i）从种子基准中提取结构化领域卡，（ii）提示多个设计者LLM生成配额控制的套件，以及（iii）在可能的情况下使用精确/数字/符号验证器通过多模型回答者面板验证项目，否则进行标题引导判断，生成具有项目级质量标志和心理诊断的设计者-回答者矩阵。在涵盖计算机科学、数学、医学和心理理论推理（包括多语言和多模式设置）的九个变体中，我们生成了 16.7K 个项目，在过滤后保留约 15K 个核心项目，并生成约 152K 个分级模型项目响应。 BenchBench 显示，基准设计能力与回答时间强度仅中度相关（Spearman rho ~0.37），无效性与歧视呈负相关（Pearson r~0.62），并且由此产生的设计者-回答者矩阵可以对格式/模态/语言保真度和套件相关的自我/家庭互动进行可扩展的审核。该项目位于：此 https URL。

Title: HiCI: Hierarchical Construction-Integration for Long-Context Attention

Authors: Xiangyu Zeng, Qi Xu, Yunke Wang, Chang Xu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.20843
Pdf URL: https://arxiv.org/pdf/2603.20843
Copy Paste: [[2603.20843]] HiCI: Hierarchical Construction-Integration for Long-Context Attention(https://arxiv.org/abs/2603.20843)
Keywords: language model, gpt
Abstract: Long-context language modeling is commonly framed as a scalability challenge of token-level attention, yet local-to-global information structuring remains largely implicit in existing approaches. Drawing on cognitive theories of discourse comprehension, we propose HiCI (Hierarchical Construction-Integration), a hierarchical attention module that constructs segment-level representations, integrates them into a shared global context, and broadcasts both to condition segment-level attention. We validate HiCI through parameter-efficient adaptation of LLaMA-2 with <5.5% additional parameters, extending context from 4K to 100K tokens (7B) and 64K tokens (13B). Across language modeling, retrieval, and instruction-following benchmarks, HiCI yields consistent improvements over strong baselines, including matching proprietary models on topic retrieval and surpassing GPT-3.5-Turbo-16K on code comprehension. These results demonstrate the effectiveness of explicit hierarchical structuring as an inductive bias for long-context modeling.
摘要：长上下文语言建模通常被视为令牌级注意力的可扩展性挑战，但本地到全局的信息结构在很大程度上仍然隐含在现有方法中。借鉴语篇理解的认知理论，我们提出了 HiCI（分层构建-集成），这是一个分层注意力模块，它构建分段级表示，将它们集成到共享的全局上下文中，并将两者广播以条件分段级注意力。我们通过 LLaMA-2 的参数高效适应来验证 HiCI，仅使用 <5.5% 的附加参数，将上下文从 4K 扩展到 100K 令牌 (7B) 和 64K 令牌 (13B)。在语言建模、检索和指令跟踪基准方面，HiCI 在强大的基准上取得了一致的改进，包括匹配主题检索的专有模型以及在代码理解方面超越 GPT-3.5-Turbo-16K。这些结果证明了显式层次结构作为长上下文建模的归纳偏差的有效性。

Title: Can ChatGPT Really Understand Modern Chinese Poetry?

Authors: Shanshan Wang, Derek F. Wong, Jingming Yao, Lidia S. Chao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.20851
Pdf URL: https://arxiv.org/pdf/2603.20851
Copy Paste: [[2603.20851]] Can ChatGPT Really Understand Modern Chinese Poetry?(https://arxiv.org/abs/2603.20851)
Keywords: gpt, llm, chat
Abstract: ChatGPT has demonstrated remarkable capabilities on both poetry generation and translation, yet its ability to truly understand poetry remains unexplored. Previous poetry-related work merely analyzed experimental outcomes without addressing fundamental issues of comprehension. This paper introduces a comprehensive framework for evaluating ChatGPT's understanding of modern poetry. We collaborated with professional poets to evaluate ChatGPT's interpretation of modern Chinese poems by different poets along multiple dimensions. Evaluation results show that ChatGPT's interpretations align with the original poets' intents in over 73% of the cases. However, its understanding in certain dimensions, particularly in capturing poeticity, proved to be less satisfactory. These findings highlight the effectiveness and necessity of our proposed framework. This study not only evaluates ChatGPT's ability to understand modern poetry but also establishes a solid foundation for future research on LLMs and their application to poetry-related tasks.
摘要：ChatGPT 在诗歌生成和翻译方面都表现出了卓越的能力，但其真正理解诗歌的能力仍有待探索。以前与诗歌相关的工作仅分析实验结果，而没有解决理解的基本问题。本文介绍了一个评估 ChatGPT 对现代诗歌理解的综合框架。我们与专业诗人合作，从多个维度评估ChatGPT对不同诗人中国现代诗歌的解读。评估结果显示，ChatGPT 的诠释在超过 73% 的情况下符合原诗人的意图。然而，它在某些维度上的理解，特别是在捕捉诗意方面，却不尽如人意。这些发现凸显了我们提出的框架的有效性和必要性。这项研究不仅评估了ChatGPT理解现代诗歌的能力，而且为未来LLM的研究及其在诗歌相关任务中的应用奠定了坚实的基础。

Title: SozKZ: Training Efficient Small Language Models for Kazakh from Scratch

Authors: Saken Tukenov
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.20854
Pdf URL: https://arxiv.org/pdf/2603.20854
Copy Paste: [[2603.20854]] SozKZ: Training Efficient Small Language Models for Kazakh from Scratch(https://arxiv.org/abs/2603.20854)
Keywords: language model
Abstract: Kazakh, a Turkic language spoken by over 22 million people, remains underserved by existing multilingual language models, which allocate minimal capacity to low-resource languages and employ tokenizers ill-suited to agglutinative morphology. We present SozKZ, a family of Llama-architecture language models (50M-600M parameters) trained entirely from scratch on 9 billion tokens of Kazakh text with a dedicated 50K BPE tokenizer. We evaluate all models on three Kazakh benchmarks -- multiple-choice cultural QA, reading comprehension (Belebele), and topic classification (SIB-200) -- alongside five multilingual baselines ranging from 500M to 3B parameters. Our 600M model achieves 30.3% accuracy on Kazakh cultural QA, approaching the 32.0% of Llama-3.2-1B (2x larger), and 25.5% on SIB-200 topic classification, surpassing all evaluated multilingual models up to 2B parameters. We observe consistent scaling from 50M to 600M, with MC QA accuracy rising from 22.8% to 30.3%, suggesting that further scaling remains beneficial. These results demonstrate that small, dedicated models trained from scratch with a language-appropriate tokenizer offer a viable path for low-resource language technology, achieving competitive performance at a fraction of the computational cost. All models and the tokenizer are released under open licenses.
摘要：哈萨克语是一种突厥语系语言，有超过 2200 万人使用，现有的多语言语言模型仍然服务不足，这些模型为资源匮乏的语言分配了最小的容量，并采用了不适合粘着形态的分词器。我们推出了 SozKZ，这是一个 Llama 架构语言模型系列（50M-600M 参数），使用专用的 50K BPE 分词器对 90 亿个哈萨克语文本标记进行了完全从头开始的训练。我们根据三个哈萨克语基准——多项选择文化 QA、阅读理解 (Belebele) 和主题分类 (SIB-200)——以及从 500M 到 3B 参数的五个多语言基准来评估所有模型。我们的 600M 模型在哈萨克文化 QA 上达到了 30.3% 的准确率，接近 Llama-3.2-1B（大两倍）的 32.0%，在 SIB-200 主题分类上达到 25.5%，超过了所有评估的多语言模型高达 2B 参数。我们观察到从 50M 到 600M 的一致扩展，MC QA 准确率从 22.8% 上升到 30.3%，这表明进一步扩展仍然是有益的。这些结果表明，使用适合语言的分词器从头开始训练的小型专用模型为低资源语言技术提供了一条可行的途径，以一小部分计算成本实现了有竞争力的性能。所有模型和分词器均在开放许可下发布。

Title: NoveltyAgent: Autonomous Novelty Reporting Agent with Point-wise Novelty Analysis and Self-Validation

Authors: Jiajun Hou, Hexuan Deng, Wenxiang Jiao, Xuebo Liu, Xiaopeng Ke, Min Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.20884
Pdf URL: https://arxiv.org/pdf/2603.20884
Copy Paste: [[2603.20884]] NoveltyAgent: Autonomous Novelty Reporting Agent with Point-wise Novelty Analysis and Self-Validation(https://arxiv.org/abs/2603.20884)
Keywords: gpt, agent
Abstract: The exponential growth of academic publications has led to a surge in papers of varying quality, increasing the cost of paper screening. Current approaches either use novelty assessment within general AI Reviewers or repurpose DeepResearch, which lacks domain-specific mechanisms and thus delivers lower-quality results. To bridge this gap, we introduce NoveltyAgent, a multi-agent system designed to generate comprehensive and faithful novelty reports, enabling thorough evaluation of a paper's originality. It decomposes manuscripts into discrete novelty points for fine-grained retrieval and comparison, and builds a comprehensive related-paper database while cross-referencing claims to ensure faithfulness. Furthermore, to address the challenge of evaluating such open-ended generation tasks, we propose a checklist-based evaluation framework, providing an unbiased paradigm for building reliable evaluations. Extensive experiments show that NoveltyAgent achieves state-of-the-art performance, outperforming GPT-5 DeepResearch by 10.15%. We hope this system will provide reliable, high-quality novelty analysis and help researchers quickly identify novel papers. Code and demo are available at this https URL.
摘要：学术出版物的指数级增长导致质量参差不齐的论文激增，从而增加了论文筛选的成本。目前的方法要么在一般人工智能评审员中使用新颖性评估，要么重新利用 DeepResearch，后者缺乏特定领域的机制，因此提供的结果质量较低。为了弥补这一差距，我们引入了NoveltyAgent，这是一个多代理系统，旨在生成全面且忠实的新颖性报告，从而能够全面评估论文的原创性。它将手稿分解为离散的新颖点，以便进行细粒度的检索和比较，并建立一个全面的相关论文数据库，同时交叉引用权利要求以确保忠实性。此外，为了解决评估此类开放式生成任务的挑战，我们提出了一个基于清单的评估框架，为构建可靠的评估提供了一个公正的范例。大量实验表明，NoveltyAgent 实现了最先进的性能，比 GPT-5 DeepResearch 性能高出 10.15%。我们希望该系统能够提供可靠、高质量的新颖性分析，并帮助研究人员快速识别新颖性论文。代码和演示可在此 https URL 获取。

Title: LLM Router: Prefill is All You Need

Authors: Tanay Varshney, Annie Surla, Michelle Xu, Gomathy Venkata Krishnan, Maximilian Jeblick, David Austin, Neal Vaidya, Davide Onofrio
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2603.20895
Pdf URL: https://arxiv.org/pdf/2603.20895
Copy Paste: [[2603.20895]] LLM Router: Prefill is All You Need(https://arxiv.org/abs/2603.20895)
Keywords: llm
Abstract: LLMs often share comparable benchmark accuracies, but their complementary performance across task subsets suggests that an Oracle router--a theoretical selector with perfect foresight--can significantly surpass standalone model accuracy by navigating model-specific strengths. While current routers rely on fragile semantic signals, we propose using internal prefill activations via Encoder-Target Decoupling--a functional separation between the model providing the predictive signal (the Encoder) and the model whose performance is being estimated (the Target). This allows optimized heterogeneous pairing between unique encoders and target models. We utilize Fisher Separability (J) and Effective Dimensionality (d_eff) as mathematical probes to isolate optimal layer-wise signals, providing the predictive foundation for our SharedTrunkNet architecture. SharedTrunkNet captures up to 45.58% of the accuracy gap between the strongest standalone model and the Oracle while achieving 74.31% cost savings relative to the highest-cost model.
摘要：LLM 通常具有可比较的基准精度，但它们在任务子集上的互补性能表明，Oracle 路由器（具有完美远见的理论选择器）可以通过驾驭模型特定的优势来显着超越独立模型的精度。虽然当前的路由器依赖于脆弱的语义信号，但我们建议通过编码器-目标解耦使用内部预填充激活——提供预测信号的模型（编码器）和正在估计性能的模型（目标）之间的功能分离。这允许在独特的编码器和目标模型之间优化异构配对。我们利用 Fisher 可分离性 (J) 和有效维度 (d_eff) 作为数学探针来隔离最佳分层信号，为我们的 SharedTrunkNet 架构提供预测基础。 SharedTrunkNet 捕获了最强独立模型与 Oracle 之间高达 45.58% 的准确率差距，同时相对于最高成本模型节省了 74.31% 的成本。
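
Illustration for the LLM Router entry above: the abstract names Fisher Separability (J) and Effective Dimensionality (d_eff) as layer-wise probes but does not define them. The sketch below assumes two common definitions: the two-class Fisher criterion on a one-dimensional feature, and the participation ratio of the covariance spectrum. Both choices are assumptions, not necessarily the paper's formulas.

```python
from statistics import fmean, pvariance


def fisher_separability(pos, neg):
    """Assumed definition: squared mean gap divided by the sum of class variances."""
    gap = fmean(pos) - fmean(neg)
    return gap * gap / (pvariance(pos) + pvariance(neg))


def covariance(rows):
    """Population covariance matrix of a list of equal-length feature vectors."""
    n, d = len(rows), len(rows[0])
    means = [fmean([r[j] for r in rows]) for j in range(d)]
    return [
        [sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / n
         for j in range(d)]
        for i in range(d)
    ]


def effective_dimensionality(rows):
    """Assumed definition: participation ratio (sum of eigenvalues)^2 / sum of squares.

    For a symmetric matrix C, the eigenvalue sum is trace(C) and the sum of
    squared eigenvalues is the squared Frobenius norm, so no eigensolver is needed.
    """
    c = covariance(rows)
    trace = sum(c[i][i] for i in range(len(c)))
    frob_sq = sum(v * v for row in c for v in row)
    return trace * trace / frob_sq


# Toy activations: one isotropic cloud and one cloud lying on a single line.
isotropic = [(1, 0), (-1, 0), (0, 1), (0, -1)]
collinear = [(1, 1), (-1, -1), (2, 2), (-2, -2)]
print(effective_dimensionality(isotropic), effective_dimensionality(collinear))
print(fisher_separability([2, 4], [0, 2]))
```

Under these assumed definitions, a layer whose activations give a high J for "target model answers correctly" versus "answers incorrectly" would be a natural candidate signal for a router.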

Title: Mitigating Shortcut Reasoning in Language Models: A Gradient-Aware Training Approach

Authors: Hongyu Cao, Kunpeng Liu, Dongjie Wang, Yanjie Fu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.20899
Pdf URL: https://arxiv.org/pdf/2603.20899
Copy Paste: [[2603.20899]] Mitigating Shortcut Reasoning in Language Models: A Gradient-Aware Training Approach(https://arxiv.org/abs/2603.20899)
Keywords: language model
Abstract: Large language models exhibit strong reasoning capabilities, yet often rely on shortcuts such as surface pattern matching and answer memorization rather than genuine logical inference. We propose Shortcut-Aware Reasoning Training (SART), a gradient-aware framework that detects and mitigates shortcut-promoting samples via ShortcutScore and gradient surgery. Our method identifies shortcut signals through gradient misalignment with validation objectives and answer-token concentration, and modifies training dynamics accordingly. Experiments on controlled reasoning benchmarks show that SART achieves +16.5% accuracy and +40.2% robustness over the strongest baseline, significantly improving generalization under distribution shifts. Code is available at: this https URL.
摘要：大型语言模型表现出强大的推理能力，但往往依赖于表面模式匹配和答案记忆等捷径，而不是真正的逻辑推理。我们提出了快捷方式感知推理训练（SART），这是一种梯度感知框架，可以通过 ShortcutScore 和梯度手术来检测和减轻快捷方式提升样本。我们的方法通过梯度与验证目标和答案标记浓度的偏差来识别捷径信号，并相应地修改训练动态。受控推理基准实验表明，SART 在最强基线上实现了 +16.5% 的准确度和 +40.2% 的鲁棒性，显着提高了分布变化下的泛化能力。代码可在以下位置获得：此 https URL。
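
Illustration for the SART entry above: the abstract mentions detecting shortcut signals through gradient misalignment with validation objectives and then applying gradient surgery, without giving formulas. The sketch below assumes cosine similarity as the misalignment measure and a PCGrad-style projection as the surgery step. Both are stand-ins chosen for illustration, and the paper's ShortcutScore is not reproduced here.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity; negative values indicate conflicting gradients."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def project_out_conflict(sample_grad, val_grad):
    """PCGrad-style surgery (assumed): remove the component that opposes val_grad.

    If the sample gradient already agrees with the validation gradient
    (non-negative dot product), it is returned unchanged.
    """
    d = dot(sample_grad, val_grad)
    if d >= 0:
        return list(sample_grad)
    scale = d / dot(val_grad, val_grad)
    return [g - scale * v for g, v in zip(sample_grad, val_grad)]


val_grad = [0.0, 1.0]
conflicting = [1.0, -1.0]   # points against the validation direction
aligned = [1.0, 1.0]        # agrees with the validation direction
print(cosine(conflicting, val_grad), project_out_conflict(conflicting, val_grad))
print(cosine(aligned, val_grad), project_out_conflict(aligned, val_grad))
```

After projection the conflicting gradient is orthogonal to the validation gradient, so the update no longer moves against the validation objective.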

Title: The Hidden Puppet Master: A Theoretical and Real-World Account of Emotional Manipulation in LLMs

Authors: Jocelyn Shen, Amina Luvsanchultem, Jessica Kim, Kynnedy Smith, Valdemar Danry, Kantwon Rogers, Sharifa Alghowinem, Hae Won Park, Maarten Sap, Cynthia Breazeal
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.20907
Pdf URL: https://arxiv.org/pdf/2603.20907
Copy Paste: [[2603.20907]] The Hidden Puppet Master: A Theoretical and Real-World Account of Emotional Manipulation in LLMs(https://arxiv.org/abs/2603.20907)
Keywords: llm
Abstract: As users increasingly turn to LLMs for practical and personal advice, they become vulnerable to being subtly steered toward hidden incentives misaligned with their own interests. Prior works have benchmarked persuasion and manipulation detection, but these efforts rely on simulated or debate-style settings, remain uncorrelated with real human belief shifts, and overlook a critical dimension: the morality of hidden incentives driving the manipulation. We introduce PUPPET, a theoretical taxonomy of personalized emotional manipulation in LLM-human dialogues that centers around incentive morality, and conduct a human study with N=1,035 participants across realistic everyday queries, varying personalization and incentive direction (harmful versus prosocial). We find that harmful hidden incentives produce significantly larger belief shifts than prosocial ones. Finally, we benchmark LLMs on the task of belief prediction, finding that models exhibit moderate predictive ability of belief change based on conversational contexts (r=0.3 - 0.5), but they also systematically underestimate the magnitude of belief shift. Together, this work establishes a theoretically grounded and behaviorally validated foundation for studying, and ultimately combatting, incentive-driven manipulation in LLMs during everyday, practical user queries.
摘要：随着用户越来越多地向法学硕士寻求实用和个人建议，他们很容易被巧妙地引导到与自己利益不符的隐藏激励措施上。之前的工作已经对说服和操纵检测进行了基准测试，但这些努力依赖于模拟或辩论式的设置，与真实的人类信仰转变不相关，并且忽略了一个关键维度：驱动操纵的隐藏激励的道德性。我们引入了 PUPPET，这是一种以激励道德为中心的 LLM 与人类对话中个性化情绪操纵的理论分类法，并对 N=1,035 名参与者进行了一项人类研究，涉及现实的日常查询、不同的个性化和激励方向（有害与亲社会）。我们发现，有害的隐性激励比亲社会激励产生的信念转变要大得多。最后，我们对法学硕士的信念预测任务进行了基准测试，发现模型表现出基于对话情境的信念改变的中等预测能力（r=0.3 - 0.5），但它们也系统地低估了信念转变的幅度。总之，这项工作为研究并最终打击法学硕士在日常实际用户查询中的激励驱动操纵奠定了理论基础和行为验证的基础。

Title: User Preference Modeling for Conversational LLM Agents: Weak Rewards from Retrieval-Augmented Interaction

Authors: Yuren Hao, Shuhaib Mehri, ChengXiang Zhai, Dilek Hakkani-Tür
Subjects: cs.CL, cs.AI, cs.HC, cs.IR, stat.ML
Abstract URL: https://arxiv.org/abs/2603.20939
Pdf URL: https://arxiv.org/pdf/2603.20939
Copy Paste: [[2603.20939]] User Preference Modeling for Conversational LLM Agents: Weak Rewards from Retrieval-Augmented Interaction(https://arxiv.org/abs/2603.20939)
Keywords: language model, llm, agent
Abstract: Large language models are increasingly used as personal assistants, yet most lack a persistent user model, forcing users to repeatedly restate preferences across sessions. We propose Vector-Adapted Retrieval Scoring (VARS), a pipeline-agnostic, frozen-backbone framework that represents each user with long-term and short-term vectors in a shared preference space and uses these vectors to bias retrieval scoring over structured preference memory. The vectors are updated online from weak scalar rewards from users' feedback, enabling personalization without per-user fine-tuning. We evaluate on MultiSessionCollab, an online multi-session collaboration benchmark with rich user preference profiles, across math and code tasks. Under frozen backbones, the main benefit of user-aware retrieval is improved interaction efficiency rather than large gains in raw task accuracy: our full VARS agent achieves the strongest overall performance, matches a strong Reflection baseline in task success, and reduces timeout rate and user effort. The learned long-term vectors also align with cross-user preference overlap, while short-term vectors capture session-specific adaptation, supporting the interpretability of the dual-vector design. Code, model, and data are available at this https URL.
摘要：大型语言模型越来越多地用作个人助理，但大多数缺乏持久的用户模型，迫使用户在会话中反复重申偏好。我们提出了向量自适应检索评分（VARS），这是一种与管道无关的冻结骨干框架，它在共享偏好空间中用长期和短期向量表示每个用户，并使用这些向量使检索评分相对于结构化偏好记忆产生偏差。这些向量根据用户反馈的微弱标量奖励在线更新，从而无需针对每个用户进行微调即可实现个性化。我们对 \textsc{MultiSessionCollab} 进行评估，这是一个跨数学和代码任务的在线多会话协作基准，具有丰富的用户偏好配置文件。在冻结骨干网下，用户感知检索的主要好处是提高交互效率，而不是原始任务准确性的大幅提升：我们的完整 VARS 代理实现了最强的整体性能，在任务成功方面与强大的反射基线相匹配，并减少了超时率和用户工作量。学习到的长期向量还与跨用户偏好重叠一致，而短期向量捕获特定于会话的适应，支持双向量设计的可解释性。代码、模型和数据可从此 https URL 获取。
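
Illustration for the VARS entry above: the abstract describes long-term and short-term user vectors that bias retrieval scoring over a preference memory and are updated online from weak scalar rewards. The sketch below is one possible reading. The additive score, the reward-weighted update, the learning rates, and the `beta` weight are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


@dataclass
class UserState:
    long_term: list   # slow-moving preference vector
    short_term: list  # session-level preference vector


def biased_score(query_similarity, memory_vec, user, beta=0.5):
    """Base retrieval score plus a user-dependent bias (assumed additive form)."""
    bias = dot(user.long_term, memory_vec) + dot(user.short_term, memory_vec)
    return query_similarity + beta * bias


def update(user, memory_vec, reward, lr_long=0.01, lr_short=0.1):
    """Move both vectors toward (or away from) the retrieved item, scaled by reward."""
    user.long_term = [u + lr_long * reward * m
                      for u, m in zip(user.long_term, memory_vec)]
    user.short_term = [u + lr_short * reward * m
                       for u, m in zip(user.short_term, memory_vec)]


user = UserState(long_term=[0.0, 0.0], short_term=[0.0, 0.0])
item = [1.0, 0.0]                      # embedding of one stored preference
before = biased_score(0.3, item, user)
update(user, item, reward=1.0)         # weak positive feedback from the user
after = biased_score(0.3, item, user)
print(before, after)
```

Using a larger step size for the short-term vector than for the long-term one is a design choice of this sketch that mirrors the session-specific versus persistent roles described in the abstract. The backbone model is never touched, consistent with the frozen-backbone setting.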

Title: Alignment Whack-a-Mole : Finetuning Activates Verbatim Recall of Copyrighted Books in Large Language Models

Authors: Xinyue Liu, Niloofar Mireshghallah, Jane C. Ginsburg, Tuhin Chakrabarty
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2603.20957
Pdf URL: https://arxiv.org/pdf/2603.20957
Copy Paste: [[2603.20957]] Alignment Whack-a-Mole : Finetuning Activates Verbatim Recall of Copyrighted Books in Large Language Models(https://arxiv.org/abs/2603.20957)
Keywords: language model, gpt, llm, prompt
Abstract: Frontier LLM companies have repeatedly assured courts and regulators that their models do not store copies of training data. They further rely on safety alignment strategies via RLHF, system prompts, and output filters to block verbatim regurgitation of copyrighted works, and have cited the efficacy of these measures in their legal defenses against copyright infringement claims. We show that finetuning bypasses these protections: by training models to expand plot summaries into full text, a task naturally suited for commercial writing assistants, we cause GPT-4o, Gemini-2.5-Pro, and DeepSeek-V3.1 to reproduce up to 85-90% of held-out copyrighted books, with single verbatim spans exceeding 460 words, using only semantic descriptions as prompts and no actual book text. This extraction generalizes across authors: finetuning exclusively on Haruki Murakami's novels unlocks verbatim recall of copyrighted books from over 30 unrelated authors. The effect is not specific to any training author or corpus: random author pairs and public-domain finetuning data produce comparable extraction, while finetuning on synthetic text yields near-zero extraction, indicating that finetuning on individual authors' works reactivates latent memorization from pretraining. Three models from different providers memorize the same books in the same regions ($r \ge 0.90$), pointing to an industry-wide vulnerability. Our findings offer compelling evidence that model weights store copies of copyrighted works and that the security failures that manifest after finetuning on individual authors' works undermine a key premise of recent fair use rulings, where courts have conditioned favorable outcomes on the adequacy of measures preventing reproduction of protected expression.
摘要：Frontier LLM 公司一再向法院和监管机构保证，他们的模型不会存储训练数据的副本。他们进一步依靠 RLHF、系统提示和输出过滤器的安全调整策略来阻止版权作品的逐字反流，并在针对版权侵权索赔的法律辩护中引用了这些措施的有效性。我们证明微调绕过了这些保护：通过训练模型将情节摘要扩展为全文（这项任务自然适合商业写作助手），我们使 GPT-4o、Gemini-2.5-Pro 和 DeepSeek-V3.1 复制了高达 85-90% 的受版权保护的书籍，单个逐字跨度超过 460 个单词，仅使用语义描述作为提示，而不使用实际的书籍文本。这种提取适用于所有作者：专门针对村上春树的小说进行微调，可以逐字回忆 30 多位不相关作者的受版权保护的书籍。这种效果并不特定于任何训练作者或语料库：随机作者对和公共领域微调数据产生可比较的提取，而对合成文本的微调产生接近于零的提取，这表明对个别作者作品的微调重新激活了预训练中的潜在记忆。来自不同提供商的三个模型在同一区域记住相同的书籍（$r\ge 0.90$），这表明存在全行业的漏洞。我们的研究结果提供了令人信服的证据，证明模型权重存储了受版权保护的作品的副本，并且对个别作者的作品进行微调后出现的安全故障破坏了最近合理使用裁决的一个关键前提，即法院将有利的结果取决于防止受保护表达的复制的措施的充分性。

Title: DiscoUQ: Structured Disagreement Analysis for Uncertainty Quantification in LLM Agent Ensembles

Authors: Bo Jiang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2603.20975
Pdf URL: https://arxiv.org/pdf/2603.20975
Copy Paste: [[2603.20975]] DiscoUQ: Structured Disagreement Analysis for Uncertainty Quantification in LLM Agent Ensembles(https://arxiv.org/abs/2603.20975)
Keywords: language model, llm, prompt, agent
Abstract: Multi-agent LLM systems, where multiple prompted instances of a language model independently answer questions, are increasingly used for complex reasoning tasks. However, existing methods for quantifying the uncertainty of their collective outputs rely on shallow voting statistics that discard the rich semantic information in agents' reasoning. We introduce DiscoUQ, a framework that extracts and leverages the structure of inter-agent disagreement -- both linguistic properties (evidence overlap, argument strength, divergence depth) and embedding geometry (cluster distances, dispersion, cohesion) -- to produce well-calibrated confidence estimates. We propose three methods of increasing complexity: DiscoUQ-LLM (logistic regression on LLM-extracted structure features), DiscoUQ-Embed (logistic regression on embedding geometry), and DiscoUQ-Learn (a neural network combining all features). Evaluated on four diverse benchmarks (StrategyQA, MMLU, TruthfulQA, ARC-Challenge) with a 5-agent system using Qwen3.5-27B, DiscoUQ-LLM achieves an average AUROC of 0.802, outperforming the best baseline (LLM Aggregator, 0.791) while being substantially better calibrated (ECE 0.036 vs. 0.098). The learned features generalize across benchmarks with near-zero performance degradation and provide the largest improvements where they are most needed: in the ambiguous "weak disagreement" tier where simple vote counting fails.
摘要：多代理 LLM 系统越来越多地用于复杂的推理任务，其中语言模型的多个提示实例独立回答问题。然而，现有的量化集体输出不确定性的方法依赖于浅层投票统计，而这些统计丢弃了智能体推理中丰富的语义信息。我们引入了 DiscoUQ，这是一个框架，它提取并利用主体间分歧的结构——包括语言属性（证据重叠、论证强度、分歧深度）和嵌入几何学（聚类距离、分散性、凝聚力）——以产生经过良好校准的置信度估计。我们提出了三种增加复杂性的方法：DiscoUQ-LLM（LLM提取的结构特征的逻辑回归）、DiscoUQ-Embed（嵌入几何的逻辑回归）和DiscoUQ-Learn（结合所有特征的神经网络）。使用 Qwen3.5-27B 使用 5 代理系统对四个不同的基准（StrategyQA、MMLU、TruthfulQA、ARC-Challenge）进行评估，DiscoUQ-LLM 的平均 AUROC 为 0.802，优于最佳基线（LLM Aggregator，0.791），同时得到更好的校准（ECE 0.036 与 0.098）。学习到的功能可以在基准上泛化，性能下降接近于零，并在最需要的地方提供最大的改进：在简单的计票失败的模糊的“弱分歧”层中。
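
Illustration for the DiscoUQ entry above: the abstract reports calibration as ECE (expected calibration error). The sketch below computes the standard equal-width-bin ECE, the weighted average over bins of the gap between mean confidence and accuracy. The bin count and the toy data are illustrative; the abstract does not state which binning the paper uses.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: sum over bins of (bin share) * |accuracy - mean confidence|.

    Bins are half-open [k/n, (k+1)/n), with confidence 1.0 placed in the last bin.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, is_correct))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += len(members) / total * abs(accuracy - mean_conf)
    return ece


# Four answers at 90% confidence, of which three are right: a gap of 0.15.
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0]))
```

A lower value means stated confidence tracks empirical accuracy more closely, which is the sense in which the abstract's 0.036 versus 0.098 comparison favors DiscoUQ-LLM.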

Title: Mitigating Selection Bias in Large Language Models via Permutation-Aware GRPO

Authors: Jinquan Zheng, Jia Yuan, Jiacheng Yao, Chenyang Gu, Pujun Zheng, Guoxiu He
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.21016
Pdf URL: https://arxiv.org/pdf/2603.21016
Copy Paste: [[2603.21016]] Mitigating Selection Bias in Large Language Models via Permutation-Aware GRPO(https://arxiv.org/abs/2603.21016)
Keywords: language model, llm
Abstract: Large language models (LLMs) used for multiple-choice and pairwise evaluation tasks often exhibit selection bias due to non-semantic factors like option positions and label symbols. Existing inference-time debiasing is costly and may harm reasoning, while pointwise training ignores that the same question should yield consistent answers across permutations. To address this issue, we propose Permutation-Aware Group Relative Policy Optimization (PA-GRPO), which mitigates selection bias by enforcing permutation-consistent semantic reasoning. PA-GRPO constructs a permutation group for each instance by generating multiple candidate permutations, and optimizes the model using two complementary mechanisms: (1) cross-permutation advantage, which computes advantages relative to the mean reward over all permutations of the same instance, and (2) consistency-aware reward, which encourages the model to produce consistent decisions across different permutations. Experimental results demonstrate that PA-GRPO outperforms strong baselines across seven benchmarks, substantially reducing selection bias while maintaining high overall performance. The code will be made available on GitHub (this https URL).
摘要：用于多项选择和成对评估任务的大型语言模型 (LLM) 通常会因选项位置和标签符号等非语义因素而表现出选择偏差。现有的推理时间去偏差成本高昂，并且可能损害推理，而逐点训练忽略了同一问题应该在排列中产生一致的答案。为了解决这个问题，我们提出了排列感知组相对策略优化（PA-GRPO），它通过强制排列一致的语义推理来减轻选择偏差。 PA-GRPO 通过生成多个候选排列为每个实例构建排列组，并使用两种互补机制优化模型：（1）交叉排列优势，计算相对于同一实例的所有排列的平均奖励的优势；（2）一致性感知奖励，鼓励模型在不同排列中产生一致的决策。实验结果表明，PA-GRPO 在七个基准测试中均优于强大的基线，显着减少了选择偏差，同时保持了较高的整体性能。该代码将在 Github（此 https URL）上提供。
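
Illustration for the PA-GRPO entry above: the abstract describes a cross-permutation advantage (reward relative to the mean over all permutations of the same instance) and a consistency-aware reward. The sketch below implements the mean-baseline advantage as stated, and pairs it with one hypothetical form of consistency reward: a bonus proportional to the share of permutations that selected the same option content. The bonus form and its weight are assumptions.

```python
from collections import Counter
from statistics import fmean


def consistency_aware_rewards(choices, gold, bonus=0.5):
    """Correctness reward plus an assumed agreement bonus.

    `choices` holds the option *content* picked under each permutation,
    so that the comparison ignores label symbols and positions.
    """
    counts = Counter(choices)
    n = len(choices)
    return [
        (1.0 if choice == gold else 0.0) + bonus * counts[choice] / n
        for choice in choices
    ]


def cross_permutation_advantages(rewards):
    """Advantage of each rollout relative to the mean reward over the permutation group."""
    baseline = fmean(rewards)
    return [r - baseline for r in rewards]


# One question presented under four option orderings; the last ordering flips the answer.
choices = ["Paris", "Paris", "Paris", "Lyon"]
rewards = consistency_aware_rewards(choices, gold="Paris")
advantages = cross_permutation_advantages(rewards)
print(rewards)
print(advantages)
```

Because the baseline is shared across the whole permutation group, the rollout that changes its answer when the options are reordered receives a negative advantage even though it is scored within the same instance.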

Title: Left Behind: Cross-Lingual Transfer as a Bridge for Low-Resource Languages in Large Language Models

Authors: Abdul-Salem Beibitkhan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.21036
Pdf URL: https://arxiv.org/pdf/2603.21036
Copy Paste: [[2603.21036]] Left Behind: Cross-Lingual Transfer as a Bridge for Low-Resource Languages in Large Language Models(https://arxiv.org/abs/2603.21036)
Keywords: language model, llm, prompt
Abstract: We investigate how large language models perform on low-resource languages by benchmarking eight LLMs across five experimental conditions in English, Kazakh, and Mongolian. Using 50 hand-crafted questions spanning factual, reasoning, technical, and culturally grounded categories, we evaluate 2,000 responses on accuracy, fluency, and completeness. We find a consistent performance gap of 13.8-16.7 percentage points between English and low-resource language conditions, with models maintaining surface-level fluency while producing significantly less accurate content. Cross-lingual transfer (prompting models to reason in English before translating back) yields selective gains for bilingual architectures (+2.2pp to +4.3pp) but provides no benefit to English-dominant models. Our results demonstrate that current LLMs systematically underserve low-resource language communities, and that effective mitigation strategies are architecture-dependent rather than universal.
摘要：我们通过在英语、哈萨克语和蒙古语的五个实验条件下对八个法学硕士进行基准测试，研究大型语言模型在低资源语言上的表现。我们使用涵盖事实、推理、技术和文化基础类别的 50 个手工制作的问题，评估了 2,000 个回答的准确性、流畅性和完整性。我们发现英语和低资源语言条件之间始终存在 13.8-16.7 个百分点的性能差距，模型保持表面流畅性，但生成的内容准确度明显较低。跨语言迁移提示模型在翻译回之前用英语推理可以为双语架构带来选择性收益（+2.2pp 到 +4.3pp），但对英语主导模型没有任何好处。我们的结果表明，当前的法学硕士系统性地缺乏对资源匮乏的语言社区的服务，并且有效的缓解策略取决于架构而不是通用的。

Title: Evaluating Reasoning-Based Scaffolds for Human-AI Co-Annotation: The ReasonAlign Annotation Protocol

Authors: Smitha Muthya Sudheendra, Jaideep Srivastava
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.21094
Pdf URL: https://arxiv.org/pdf/2603.21094
Copy Paste: [[2603.21094]] Evaluating Reasoning-Based Scaffolds for Human-AI Co-Annotation: The ReasonAlign Annotation Protocol(https://arxiv.org/abs/2603.21094)
Keywords: language model, llm
Abstract: Human annotation is central to NLP evaluation, yet subjective tasks often exhibit substantial variability across annotators. While large language models (LLMs) can provide structured reasoning to support annotation, their influence on human annotation behavior remains unclear. We introduce ReasonAlign, a reasoning-based annotation scaffold that exposes LLM-generated explanations while withholding predicted labels. We frame this as a controlled study of how reasoning affects human annotation behavior, rather than a full evaluation of annotation accuracy. Using a two-pass protocol inspired by Delphi-style revision, annotators first label instances independently and then revise their decisions after viewing model-generated reasoning. We evaluate the approach on sentiment classification and opinion detection tasks, analyzing changes in inter-annotator agreement and revision behavior. To quantify these effects, we introduce the Annotator Effort Proxy (AEP), a metric capturing the proportion of labels revised after exposure to reasoning. Our results show that exposure to reasoning is associated with increased agreement alongside minimal revision, suggesting that reasoning primarily helps resolve ambiguous cases without inducing widespread changes. These findings provide insight into how reasoning explanations shape annotation consistency and highlight reasoning-based scaffolds as a practical mechanism for supporting human-AI annotation workflows.
摘要：人工注释是 NLP 评估的核心，但主观任务通常在注释者之间表现出很大的差异。虽然大型语言模型 (LLM) 可以提供结构化推理来支持注释，但它们对人类注释行为的影响仍不清楚。我们引入 ReasonAlign，一个基于推理的注释支架，它公开 LLM 生成的解释，同时保留预测标签。我们将其视为推理如何影响人类注释行为的对照研究，而不是对注释准确性的全面评估。使用受德尔菲式修订启发的两遍协议，注释者首先独立地标记实例，然后在查看模型生成的推理后修改他们的决策。我们评估情感分类和意见检测任务的方法，分析注释者间协议和修订行为的变化。为了量化这些影响，我们引入了注释者努力代理（AEP），这是一种捕获推理后修改标签比例的指标。我们的结果表明，接触推理与增加一致性和最小修改有关，这表明推理主要有助于解决模棱两可的情况，而不会引起广泛的变化。这些发现提供了关于推理解释如何塑造注释一致性的见解，并强调基于推理的支架作为支持人类人工智能注释工作流程的实用机制。
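The abstract defines the Annotator Effort Proxy (AEP) as the proportion of labels revised after exposure to reasoning. A minimal sketch of that ratio follows; the function name and the list-of-labels input format are assumptions, and the paper may compute it per annotator or per task.

```python
def annotator_effort_proxy(first_pass, second_pass):
    """Share of items whose label changed between the two annotation passes."""
    if len(first_pass) != len(second_pass):
        raise ValueError("both passes must cover the same items")
    if not first_pass:
        return 0.0
    revised = sum(a != b for a, b in zip(first_pass, second_pass))
    return revised / len(first_pass)


# One annotator, four items; only the third label is revised in pass two.
before = ["pos", "neg", "neu", "pos"]
after = ["pos", "neg", "pos", "pos"]
aep = annotator_effort_proxy(before, after)
```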

Title: Many Dialects, Many Languages, One Cultural Lens: Evaluating Multilingual VLMs for Bengali Culture Understanding Across Historically Linked Languages and Regional Dialects

Authors: Nurul Labib Sayeedi, Md. Faiyaz Abdullah Sayeedi, Shubhashis Roy Dipta, Rubaya Tabassum, Ariful Ekraj Hridoy, Mehraj Mahmood, Mahbub E Sobhani, Md. Tarek Hasan, Swakkhar Shatabda
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2603.21165
Pdf URL: https://arxiv.org/pdf/2603.21165
Copy Paste: [[2603.21165]] Many Dialects, Many Languages, One Cultural Lens: Evaluating Multilingual VLMs for Bengali Culture Understanding Across Historically Linked Languages and Regional Dialects(https://arxiv.org/abs/2603.21165)
Keywords: language model
Abstract: Bangla culture is richly expressed through region, dialect, history, food, politics, media, and everyday visual life, yet it remains underrepresented in multimodal evaluation. To address this gap, we introduce BanglaVerse, a culturally grounded benchmark for evaluating multilingual vision-language models (VLMs) on Bengali culture across historically linked languages and regional dialects. Built from 1,152 manually curated images across nine domains, the benchmark supports visual question answering and captioning, and is expanded into four languages and five Bangla dialects, yielding ~32.3K artifacts. Our experiments show that evaluating only standard Bangla overestimates true model capability: performance drops under dialectal variation, especially for caption generation, while historically linked languages such as Hindi and Urdu retain some cultural meaning but remain weaker for structured reasoning. Across domains, the main bottleneck is missing cultural knowledge rather than visual grounding alone, with knowledge-intensive categories. These findings position BanglaVerse as a more realistic test bed for measuring culturally grounded multimodal understanding under linguistic variation.
摘要：孟加拉文化通过地区、方言、历史、食物、政治、媒体和日常视觉生活得到丰富的表达，但在多模态评估中仍然代表性不足。为了解决这一差距，我们引入了 BanglaVerse，这是一个基于文化的基准，用于评估跨历史相关语言和地区方言的孟加拉文化的多语言视觉语言模型 (VLM)。该基准测试由 9 个领域的 1,152 张手动整理的图像构建而成，支持视觉问答和字幕，并扩展到四种语言和五种孟加拉方言，产生约 32.3K 工件。我们的实验表明，仅评估标准孟加拉语会高估真实的模型能力：在方言变化下性能会下降，尤其是在字幕生成方面，而历史上相关的语言（例如印地语和乌尔都语）保留了一些文化含义，但在结构化推理方面仍然较弱。跨领域，主要瓶颈是缺乏文化知识，而不仅仅是视觉基础，以及知识密集型类别。这些发现将 BanglaVerse 定位为一个更现实的测试平台，用于衡量语言变化下基于文化的多模态理解。

Title: Entropy Alone is Insufficient for Safe Selective Prediction in LLMs

Authors: Edward Phillips, Fredrik K. Gustafsson, Sean Wu, Anshul Thakur, David A. Clifton
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.21172
Pdf URL: https://arxiv.org/pdf/2603.21172
Copy Paste: [[2603.21172]] Entropy Alone is Insufficient for Safe Selective Prediction in LLMs(https://arxiv.org/abs/2603.21172)
Keywords: language model, llm, hallucination
Abstract: Selective prediction systems can mitigate harms resulting from language model hallucinations by abstaining from answering in high-risk cases. Uncertainty quantification techniques are often employed to identify such cases, but are rarely evaluated in the context of the wider selective prediction policy and its ability to operate at low target error rates. We identify a model-dependent failure mode of entropy-based uncertainty methods that leads to unreliable abstention behaviour, and address it by combining entropy scores with a correctness probe signal. We find that across three QA benchmarks (TriviaQA, BioASQ, MedicalQA) and four model families, the combined score generally improves both the risk--coverage trade-off and calibration performance relative to entropy-only baselines. Our results highlight the importance of deployment-facing evaluation of uncertainty methods, using metrics that directly reflect whether a system can be trusted to operate at a stated risk level.
摘要：选择性预测系统可以通过在高风险情况下放弃回答来减轻语言模型幻觉造成的危害。不确定性量化技术通常用于识别此类情况，但很少在更广泛的选择性预测策略及其以低目标错误率运行的能力的背景下进行评估。我们确定了基于熵的不确定性方法的模型相关故障模式，该模式会导致不可靠的弃权行为，并通过将熵分数与正确性探测信号相结合来解决该问题。我们发现，在三个 QA 基准（TriviaQA、BioASQ、MedicalQA）和四个模型系列中，相对于纯熵基线，综合得分通常会提高风险覆盖率权衡和校准性能。我们的结果强调了面向部署的不确定性方法评估的重要性，使用直接反映系统是否可以信任在规定的风险水平下运行的指标。
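To make the risk-coverage trade-off concrete, the sketch below scores each answer, abstains above a threshold, and reports risk (error rate among answered questions) and coverage (share answered). The linear mix of entropy and probe output, the 0.5 weight, and the assumption that entropy is already scaled to [0, 1] are illustrative choices, not the combination rule used in the paper.

```python
def combined_uncertainty(entropy, probe_p_correct, weight=0.5):
    """Assumed combination: weighted sum of entropy (scaled to [0, 1]) and
    the probe's estimated probability that the answer is wrong."""
    return weight * entropy + (1 - weight) * (1 - probe_p_correct)


def risk_and_coverage(scores, correct, threshold):
    """Answer only when the score is at or below the threshold."""
    answered = [c for s, c in zip(scores, correct) if s <= threshold]
    coverage = len(answered) / len(scores)
    risk = 0.0 if not answered else 1 - sum(answered) / len(answered)
    return risk, coverage


# Toy data: the second answer is wrong but has low entropy.
entropies = [0.1, 0.2, 0.9, 0.15]
probe = [0.9, 0.3, 0.2, 0.95]
correct = [True, False, False, True]

entropy_only = risk_and_coverage(entropies, correct, threshold=0.5)
combined = risk_and_coverage(
    [combined_uncertainty(e, p) for e, p in zip(entropies, probe)],
    correct,
    threshold=0.3,
)
```

In this toy data the entropy-only policy answers the confidently wrong item, while the combined score abstains on it at the cost of lower coverage.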

Title: Explainable Semantic Textual Similarity via Dissimilar Span Detection

Authors: Diego Miguel Lozano, Daryna Dementieva, Alexander Fraser
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.21174
Pdf URL: https://arxiv.org/pdf/2603.21174
Copy Paste: [[2603.21174]] Explainable Semantic Textual Similarity via Dissimilar Span Detection(https://arxiv.org/abs/2603.21174)
Keywords: language model, llm
Abstract: Semantic Textual Similarity (STS) is a crucial component of many Natural Language Processing (NLP) applications. However, existing approaches typically reduce semantic nuances to a single score, limiting interpretability. To address this, we introduce the task of Dissimilar Span Detection (DSD), which aims to identify semantically differing spans between pairs of texts. This can help users understand which particular words or tokens negatively affect the similarity score, or be used to improve performance in STS-dependent downstream tasks. Furthermore, we release a new dataset suitable for the task, the Span Similarity Dataset (SSD), developed through a semi-automated pipeline combining large language models (LLMs) with human verification. We propose and evaluate different baseline methods for DSD, both unsupervised, based on LIME, SHAP, LLMs, and our own method, as well as an additional supervised approach. While LLMs and supervised models achieve the highest performance, overall results remain low, highlighting the complexity of the task. Finally, we set up an additional experiment that shows how DSD can lead to increased performance in the specific task of paraphrase detection.
摘要：语义文本相似性 (STS) 是许多自然语言处理 (NLP) 应用程序的重要组成部分。然而，现有的方法通常将语义细微差别减少到单个分数，从而限制了可解释性。为了解决这个问题，我们引入了不同跨度检测（DSD）的任务，其目的是识别文本对之间语义上不同的跨度。这可以帮助用户了解哪些特定单词或标记会对相似性分数产生负面影响，或者用于提高依赖于 STS 的下游任务的性能。此外，我们发布了一个适合该任务的新数据集，即跨度相似度数据集（SSD），该数据集是通过将大型语言模型（LLM）与人工验证相结合的半自动化管道开发的。我们提出并评估了不同的 DSD 基线方法，包括基于 LIME、SHAP、LLM 的无监督方法、我们自己的方法以及额外的监督方法。虽然法学硕士和监督模型取得了最高的性能，但总体结果仍然很低，凸显了任务的复杂性。最后，我们设置了一个额外的实验，展示 DSD 如何提高释义检测特定任务的性能。

Title: Context Selection for Hypothesis and Statistical Evidence Extraction from Full-Text Scientific Articles

Authors: Sai Koneru, Jian Wu, Sarah Rajtmajer
Subjects: cs.CL, cs.AI, cs.DL
Abstract URL: https://arxiv.org/abs/2603.21193
Pdf URL: https://arxiv.org/pdf/2603.21193
Copy Paste: [[2603.21193]] Context Selection for Hypothesis and Statistical Evidence Extraction from Full-Text Scientific Articles(https://arxiv.org/abs/2603.21193)
Keywords: language model, prompt, retrieval augmented generation
Abstract: Extracting hypotheses and their supporting statistical evidence from full-text scientific articles is central to the synthesis of empirical findings, but remains difficult due to document length and the distribution of scientific arguments across sections of the paper. The work studies a sequential full-text extraction setting, where the statement of a primary finding in an article's abstract is linked to (i) a corresponding hypothesis statement in the paper body and (ii) the statistical evidence that supports or refutes that hypothesis. This formulation induces a challenging within-document retrieval setting in which many candidate paragraphs are topically related to the finding but differ in rhetorical role, creating hard negatives for retrieval and extraction. Using a two-stage retrieve-and-extract framework, we conduct a controlled study of retrieval design choices, varying context quantity, context quality (standard Retrieval Augmented Generation, reranking, and a fine-tuned retriever paired with reranking), as well as an oracle paragraph setting to separate retrieval failures from extraction limits across four Large Language Model extractors. We find that targeted context selection consistently improves hypothesis extraction relative to full-text prompting, with gains concentrated in configurations that optimize retrieval quality and context cleanliness. In contrast, statistical evidence extraction remains substantially harder. Even with oracle paragraphs, performance remains moderate, indicating persistent extractor limitations in handling hybrid numeric-textual statements rather than retrieval failures alone.
摘要：从全文科学文章中提取假设及其支持的统计证据对于实证研究结果的综合至关重要，但由于文档长度和科学论点在论文各部分的分布，仍然很困难。这项工作研究了顺序全文提取设置，其中文章摘要中主要发现的陈述与（i）论文正文中相应的假设陈述和（ii）支持或反驳该假设的统计证据相关联。这种表述引发了一种具有挑战性的文档内检索设置，其中许多候选段落与发现主题相关，但修辞作用不同，从而为检索和提取带来了困难。使用两阶段检索和提取框架，我们对检索设计选择、不同的上下文数量、上下文质量（标准检索增强生成、重新排名以及与重新排名配对的微调检索器）以及预言段落设置进行了对照研究，以将检索失败与四个大型语言模型提取器的提取限制分开。我们发现，相对于全文提示，有针对性的上下文选择持续改进假设提取，收益集中在优化检索质量和上下文清洁度的配置上。相比之下，统计证据的提取仍然困难得多。即使使用 Oracle 段落，性能仍然中等，这表明提取器在处理混合数字文本语句方面存在持续限制，而不仅仅是检索失败。

Title: Graph Fusion Across Languages using Large Language Models

Authors: Kaung Myat Kyaw, Khush Agarwal, Jonathan Chan
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2603.21248
Pdf URL: https://arxiv.org/pdf/2603.21248
Copy Paste: [[2603.21248]] Graph Fusion Across Languages using Large Language Models(https://arxiv.org/abs/2603.21248)
Keywords: language model, llm
Abstract: Combining multiple knowledge graphs (KGs) across linguistic boundaries is a persistent challenge due to semantic heterogeneity and the complexity of graph environments. We propose a framework for cross-lingual graph fusion, leveraging the in-context reasoning and multilingual semantic priors of Large Language Models (LLMs). The framework implements structural linearization by mapping triplets directly into natural language sequences (e.g., [head] [relation] [tail]), enabling the LLM to map relations and reconcile entities between an evolving fused graph ($G_{c}^{(t-1)}$) and a new candidate graph ($G_{t}$). Evaluated on the DBP15K dataset, this exploratory study demonstrates that LLMs can serve as a universal semantic bridge to resolve cross-lingual discrepancies. Results show the successful sequential agglomeration of multiple heterogeneous graphs, offering a scalable, modular solution for continuous knowledge synthesis in multi-source, multilingual environments.
摘要：由于语义异构性和图环境的复杂性，跨语言边界组合多个知识图（KG）是一个持续的挑战。我们提出了一个跨语言图融合的框架，利用大型语言模型（LLM）的上下文推理和多语言语义先验。该框架通过将三元组直接映射到自然语言序列（例如，[head] [relation] [tail]）来实现结构线性化，使法学硕士能够映射关系并协调不断发展的融合图（$G_{c}^{(t-1)}$）和新候选图（$G_{t}$）之间的实体。这项探索性研究在 DBP15K 数据集上进行评估，表明法学硕士可以作为解决跨语言差异的通用语义桥梁。结果显示多个异构图的成功顺序聚合，为多源、多语言环境中的连续知识合成提供了可扩展的模块化解决方案。
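The structural linearization step can be sketched as turning each (head, relation, tail) triplet into one line of text and placing the fused graph and the candidate graph in a single prompt. The prompt wording and function names here are hypothetical, and the bracketed form in the abstract is read as a placeholder pattern rather than literal brackets.

```python
def linearize(triples):
    """Render (head, relation, tail) triplets as one text line each."""
    return "\n".join(f"{head} {relation} {tail}" for head, relation, tail in triples)


def build_fusion_prompt(fused_graph, candidate_graph):
    """Assemble a prompt asking an LLM to align the candidate graph with the
    current fused graph. The instruction text is illustrative only."""
    return (
        "Current fused graph:\n" + linearize(fused_graph) + "\n\n"
        "New candidate graph:\n" + linearize(candidate_graph) + "\n\n"
        "List the entities and relations in the candidate graph that "
        "refer to the same thing as ones in the fused graph."
    )


fused = [("Paris", "capital_of", "France")]
candidate = [("Paris", "capitale_de", "France")]
prompt = build_fusion_prompt(fused, candidate)
```

Fusing graphs sequentially would then mean calling this once per incoming graph and updating the fused graph with the model's matches before the next call.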

Title: Conversation Tree Architecture: A Structured Framework for Context-Aware Multi-Branch LLM Conversations

Authors: Pranav Hemanth, Sampriti Saha
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2603.21278
Pdf URL: https://arxiv.org/pdf/2603.21278
Copy Paste: [[2603.21278]] Conversation Tree Architecture: A Structured Framework for Context-Aware Multi-Branch LLM Conversations(https://arxiv.org/abs/2603.21278)
Keywords: language model, llm, agent
Abstract: Large language models (LLMs) are increasingly deployed for extended, multi-topic conversations, yet the flat, append-only structure of current conversation interfaces introduces a fundamental limitation: all context accumulates in a single unbounded window, causing topically distinct threads to bleed into one another and progressively degrade response quality. We term this failure mode logical context poisoning. In this paper, we introduce the Conversation Tree Architecture (CTA), a hierarchical framework that organizes LLM conversations as trees of discrete, context-isolated nodes. Each node maintains its own local context window; structured mechanisms govern how context flows between parent and child nodes, downstream on branch creation and upstream on branch deletion. We additionally introduce volatile nodes, transient branches whose local context must be selectively merged upward or permanently discarded before purging. We formalize the architecture's primitives, characterize the open design problems in context flow, relate our framework to prior work in LLM memory management, and describe a working prototype implementation. The CTA provides a principled foundation for structured conversational context management and extends naturally to multi-agent settings.
摘要：大型语言模型 (LLM) 越来越多地部署用于扩展的多主题对话，但当前对话界面的扁平、仅附加结构引入了一个基本限制：所有上下文都累积在单个无界窗口中，导致主题不同的线程相互渗透并逐渐降低响应质量。我们将这种故障模式称为逻辑上下文中毒。在本文中，我们介绍了对话树架构（CTA），这是一种分层框架，它将 LLM 对话组织为离散的、上下文隔离的节点树。每个节点维护自己的本地上下文窗口；结构化机制控制上下文如何在父节点和子节点之间、下游分支创建和上游分支删除之间流动。我们还引入了易失性节点，即瞬态分支，其本地上下文必须在清除之前有选择地向上合并或永久丢弃。我们形式化了架构的原语，描述了上下文流中的开放设计问题，将我们的框架与LLM内存管理中的先前工作联系起来，并描述了工作原型实现。 CTA 为结构化会话上下文管理提供了原则基础，并自然地扩展到多代理设置。
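A minimal data-structure sketch of the tree described above: each node keeps its own local context, a new branch inherits a slice of the parent's context (downstream flow), and deleting a branch optionally merges selected content back into the parent (upstream flow). The class name, the `inherit_last` and `keep` parameters, and the policy of inheriting the most recent messages are assumptions, not the paper's primitives.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)  # identity comparison avoids recursive parent/child equality
class ConversationNode:
    topic: str
    context: List[str] = field(default_factory=list)
    parent: Optional["ConversationNode"] = field(default=None, repr=False)
    children: List["ConversationNode"] = field(default_factory=list)
    volatile: bool = False

    def branch(self, topic, inherit_last=2, volatile=False):
        """Create a child whose local context is seeded from the parent."""
        seed = self.context[-inherit_last:] if inherit_last > 0 else []
        child = ConversationNode(topic, list(seed), parent=self, volatile=volatile)
        self.children.append(child)
        return child

    def delete(self, keep=None):
        """Detach this branch; `keep` is merged into the parent, the rest is dropped."""
        if self.parent is None:
            raise ValueError("the root node cannot be deleted")
        if keep:
            self.parent.context.extend(keep)
        self.parent.children.remove(self)
        self.parent = None


root = ConversationNode("trip planning", ["m1", "m2", "m3"])
side = root.branch("visa question", inherit_last=1, volatile=True)
side.context.append("visa details")   # stays local to the branch
side.delete(keep=["summary: visa not required"])
```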

Title: More Than Sum of Its Parts: Deciphering Intent Shifts in Multimodal Hate Speech Detection

Authors: Runze Sun, Yu Zheng, Zexuan Xiong, Zhongjin Qu, Lei Chen, Jiwen Lu, Jie Zhou
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.21298
Pdf URL: https://arxiv.org/pdf/2603.21298
Copy Paste: [[2603.21298]] More Than Sum of Its Parts: Deciphering Intent Shifts in Multimodal Hate Speech Detection(https://arxiv.org/abs/2603.21298)
Keywords: agent
Abstract: Combating hate speech on social media is critical for securing cyberspace, yet relies heavily on the efficacy of automated detection systems. As content formats evolve, hate speech is transitioning from solely plain text to complex multimodal expressions, making implicit attacks harder to spot. Current systems, however, often falter on these subtle cases, as they struggle with multimodal content where the emergent meaning transcends the aggregation of individual modalities. To bridge this gap, we move beyond binary classification to characterize semantic intent shifts where modalities interact to construct implicit hate from benign cues or neutralize toxicity through semantic inversion. Guided by this fine-grained formulation, we curate the Hate via Vision-Language Interplay (H-VLI) benchmark where the true intent hinges on the intricate interplay of modalities rather than overt visual or textual slurs. To effectively decipher these complex cues, we further propose the Asymmetric Reasoning via Courtroom Agent DEbate (ARCADE) framework. By simulating a judicial process where agents actively argue for accusation and defense, ARCADE forces the model to scrutinize deep semantic cues before reaching a verdict. Extensive experiments demonstrate that ARCADE significantly outperforms state-of-the-art baselines on H-VLI, particularly for challenging implicit cases, while maintaining competitive performance on established benchmarks. Our code and data are available at: this https URL
摘要：打击社交媒体上的仇恨言论对于保护网络空间至关重要，但在很大程度上依赖于自动检测系统的功效。随着内容格式的发展，仇恨言论正在从纯粹的纯文本过渡到复杂的多模态表达，使得隐性攻击更难被发现。然而，当前的系统经常在这些微妙的情况下犹豫不决，因为它们与多模态内容作斗争，其中出现的意义超越了个体模态的聚合。为了弥合这一差距，我们超越二元分类，来描述语义意图的转变，其中模式相互作用，从良性线索构建隐含的仇恨，或通过语义倒置中和毒性。在这种细粒度表述的指导下，我们策划了“视觉语言相互作用仇恨”(H-VLI) 基准，其中真正的意图取决于模式的复杂相互作用，而不是明显的视觉或文本诽谤。为了有效地破译这些复杂的线索，我们进一步提出了通过法庭代理辩论（ARCADE）框架进行非对称推理。通过模拟代理人积极主张指控和辩护的司法过程，ARCADE 迫使模型在做出裁决之前仔细审查深层语义线索。大量实验表明，ARCADE 的性能显着优于 H-VLI 上最先进的基线，特别是对于具有挑战性的隐式案例，同时在既定基准上保持竞争性能。我们的代码和数据可在以下位置获取：此 https URL

Title: enhancing reasoning accuracy in large language models during inference time

Authors: Vinay Sharma, Manish Jain
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.21301
Pdf URL: https://arxiv.org/pdf/2603.21301
Copy Paste: [[2603.21301]] enhancing reasoning accuracy in large language models during inference time(https://arxiv.org/abs/2603.21301)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Large Language Models (LLMs) often exhibit strong linguistic abilities while remaining unreliable on multi-step reasoning tasks, particularly when deployed without additional training or fine-tuning. In this work, we study inference-time techniques to improve the reasoning accuracy of LLMs. We systematically evaluate three classes of inference-time strategies: (i) self-consistency via stochastic decoding, where the model is sampled multiple times using controlled temperature and nucleus sampling and the most frequent final answer is selected; (ii) dual-model reasoning agreement, where outputs from two independent models are compared and only consistent reasoning traces are trusted; and (iii) self-reflection, where the model critiques and revises its own reasoning. Across all evaluated methods, we employ Chain-of-Thought (CoT) [1] prompting to elicit explicit intermediate reasoning steps before generating final answers. We provide a controlled comparative evaluation of the three strategies under identical prompting and verification settings. Our experiments on LLM [2] show that self-consistency with nucleus sampling and a controlled temperature yields substantial gains, achieving a 9% to 15% absolute improvement in accuracy over greedy single-pass decoding; it is well suited to low-risk domains, offering meaningful gains with minimal overhead. The dual-model approach provides additional confirmation of the model's reasoning steps and is therefore more appropriate for moderate-risk domains, where higher reliability justifies additional compute. Self-reflection offers only marginal improvements, suggesting limited effectiveness for smaller non-reasoning models at inference time.
摘要：大型语言模型 (LLM) 通常表现出强大的语言能力，但在多步骤推理任务上仍然不可靠，特别是在没有额外训练或微调的情况下部署时。在这项工作中，我们研究推理时间技术以提高法学硕士的推理准确性。我们系统地评估了三类推理时间策略：（i）通过随机解码实现自一致性，其中使用受控温度和核采样对模型进行多次采样，并选择最常见的最终答案； (ii) 双模型推理协议，比较两个独立模型的输出，并且仅信任一致的推理轨迹； (iii) 自我反思，模型批评并修正自己的推理。在所有评估的方法中，我们采用思想链 (CoT) [1] 提示在生成最终答案之前引出明确的中间推理步骤。在这项工作中，我们在相同的提示和验证设置下对三种推理时间策略进行了受控比较评估。我们在 LLM [2] 上的实验表明，核采样和受控温度值的自一致性产生了巨大的收益，与贪婪单通道解码相比，准确率绝对提高了 9% 到 15%，非常适合低风险领域，以最小的开销提供有意义的收益。双模型方法为模型推理步骤提供了额外的确认，因此更适合中等风险领域，其中更高的可靠性证明额外的计算是合理的。自我反思仅提供边际改进，这表明较小的非推理模型在推理时的有效性有限。
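Strategy (i), self-consistency, amounts to sampling several answers and keeping the most frequent one. The sketch below shows only that voting step; `fake_sampler` is a hypothetical stand-in for a stochastic LLM call with temperature and nucleus sampling, and the sample count of five is an arbitrary choice.

```python
from collections import Counter


def self_consistent_answer(sample_answer, question, n_samples=5):
    """Sample n final answers and return the most frequent one together
    with its vote share. Ties go to the answer seen first."""
    answers = [sample_answer(question) for _ in range(n_samples)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n_samples


# Canned outputs standing in for five stochastic decoding runs.
_canned = iter(["42", "41", "42", "42", "7"])


def fake_sampler(question):
    return next(_canned)


answer, agreement = self_consistent_answer(fake_sampler, "What is 6 * 7?")
```

The vote share can double as a rough confidence signal, although the abstract does not say whether the authors use it that way.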

Title: TimeTox: An LLM-Based Pipeline for Automated Extraction of Time Toxicity from Clinical Trial Protocols

Authors: Saketh Vinjamuri, Marielle Fis Loperena, Marie C. Spezia, Ramez Kouzy
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2603.21335
Pdf URL: https://arxiv.org/pdf/2603.21335
Copy Paste: [[2603.21335]] TimeTox: An LLM-Based Pipeline for Automated Extraction of Time Toxicity from Clinical Trial Protocols(https://arxiv.org/abs/2603.21335)
Keywords: llm, multi-run
Abstract: Time toxicity, the cumulative healthcare contact days from clinical trial participation, is an important but labor-intensive metric to extract from protocol documents. We developed TimeTox, an LLM-based pipeline for automated extraction of time toxicity from Schedule of Assessments tables. TimeTox uses Google's Gemini models in three stages: summary extraction from full-length protocol PDFs, time toxicity quantification at six cumulative timepoints for each treatment arm, and multi-run consensus via position-based arm matching. We validated against 20 synthetic schedules (240 comparisons) and assessed reproducibility on 644 real-world oncology protocols. Two architectures were compared: single-pass (vanilla) and two-stage (structure-then-count). The two-stage pipeline achieved 100% clinically acceptable accuracy ($\pm$3 days) on synthetic data (MAE 0.81 days) versus 41.5% for vanilla (MAE 9.0 days). However, on real-world protocols, the vanilla pipeline showed superior reproducibility: 95.3% clinically acceptable accuracy (IQR $\leq$ 3 days) across 3 runs on 644 protocols, with 82.0% perfect stability (IQR = 0). The production pipeline extracted time toxicity for 1,288 treatment arms across multiple disease sites. Extraction stability on real-world data, rather than accuracy on synthetic benchmarks, is the decisive factor for production LLM deployment.
摘要：时间毒性，即参与临床试验的累积医疗接触天数，是从方案文件中提取的一个重要但劳动密集型指标。我们开发了 TimeTox，这是一个基于法学硕士的管道，用于从评估表中自动提取时间毒性。 TimeTox 分三个阶段使用 Google 的 Gemini 模型：从全长协议 PDF 中提取摘要、每个治疗臂在六个累积时间点进行时间毒性量化，以及通过基于位置的臂匹配进行多次运行共识。我们针对 20 种合成方案进行了验证（240 次比较），并评估了 644 种真实肿瘤学方案的可重复性。比较了两种架构：单通道（普通）和两阶段（结构然后计数）。两阶段流程在合成数据（MAE 0.81 天）上实现了 100% 临床可接受的准确度（$\pm$3 天），而普通数据（MAE 9.0 天）的准确度为 41.5%。然而，在现实世界的协议中，普通管道显示出卓越的重现性：在 644 个协议的 3 次运行中，临床可接受的准确度为 95.3%（IQR $\leq$ 3 天），完美稳定性为 82.0%（IQR = 0）。生产管道提取了跨多个疾病部位的 1,288 个治疗组的时间毒性。实际数据的提取稳定性，而不是合成基准的准确性，是生产 LLM 部署的决定性因素。
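The two evaluation views in the abstract, accuracy against a reference and stability across repeated runs, can be sketched as follows. The tolerance of 3 days comes from the abstract; the function names, the toy numbers, and the choice of the "inclusive" quantile method for the IQR are assumptions.

```python
from statistics import quantiles


def mean_absolute_error(predicted, truth):
    return sum(abs(p - t) for p, t in zip(predicted, truth)) / len(truth)


def acceptable_rate(predicted, truth, tolerance_days=3):
    """Share of extractions within the tolerance of the reference value."""
    hits = sum(abs(p - t) <= tolerance_days for p, t in zip(predicted, truth))
    return hits / len(truth)


def run_iqr(runs):
    """Interquartile range of one quantity extracted in repeated runs."""
    q1, _, q3 = quantiles(runs, n=4, method="inclusive")
    return q3 - q1


# Contact days at three timepoints: extracted vs. reference (toy values).
mae = mean_absolute_error([10, 14, 30], [10, 12, 25])
rate = acceptable_rate([10, 14, 30], [10, 12, 25])

# The same timepoint extracted in three separate runs.
stable = run_iqr([10, 10, 10])      # perfectly stable
unstable = run_iqr([10, 12, 20])    # exceeds a 3-day IQR limit
```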

Title: Beyond Memorization: Distinguishing between Reductive and Epistemic Reasoning in LLMs using Classic Logic Puzzles

Authors: Adi Gabay, Gabriel Stanovsky, Liat Peterfreund
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.21350
Pdf URL: https://arxiv.org/pdf/2603.21350
Copy Paste: [[2603.21350]] Beyond Memorization: Distinguishing between Reductive and Epistemic Reasoning in LLMs using Classic Logic Puzzles(https://arxiv.org/abs/2603.21350)
Keywords: llm, agent
Abstract: Epistemic reasoning requires agents to infer the state of the world from partial observations and information about other agents' knowledge. Prior work evaluating LLMs on canonical epistemic puzzles interpreted their behavior through a dichotomy between epistemic reasoning and brittle memorization. We argue that this framing is incomplete: in recent models, memorization is better understood as a special case of reduction, where a new instance is mapped onto a known problem. Instead, we introduce a reduction ladder, a sequence of modifications that progressively move instances away from a canonical epistemic puzzle, making reduction increasingly difficult while preserving the underlying logic. We find that while some large models succeed via reduction, other models fail early, and all models struggle once epistemic reasoning is required.
摘要：认知推理要求智能体从部分观察和有关其他智能体知识的信息中推断出世界的状态。之前的工作评估了法学硕士在典型认知难题上的表现，通过认知推理和脆弱记忆之间的二分法来解释他们的行为。我们认为这个框架是不完整的：在最近的模型中，记忆可以更好地理解为还原的特殊情况，其中一个新的实例被映射到一个已知的问题上。相反，我们引入了一个归约阶梯，这是一系列修改，逐渐将实例从规范的认知难题中移开，使归约变得越来越困难，同时保留了底层逻辑。我们发现，虽然一些大型模型通过简化获得了成功，但其他模型却很早就失败了，并且一旦需要认知推理，所有模型都会陷入困境。

Title: Benchmarking Bengali Dialectal Bias: A Multi-Stage Framework Integrating RAG-Based Translation and Human-Augmented RLAIF

Authors: K. M. Jubair Sami, Dipto Sumit, Ariyan Hossain, Farig Sadeque
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2603.21359
Pdf URL: https://arxiv.org/pdf/2603.21359
Copy Paste: [[2603.21359]] Benchmarking Bengali Dialectal Bias: A Multi-Stage Framework Integrating RAG-Based Translation and Human-Augmented RLAIF(https://arxiv.org/abs/2603.21359)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Large language models (LLMs) frequently exhibit performance biases against regional dialects of low-resource languages. However, frameworks to quantify these disparities remain scarce. We propose a two-phase framework to evaluate dialectal bias in LLM question-answering across nine Bengali dialects. First, we translate and gold-label standard Bengali questions into dialectal variants adopting a retrieval-augmented generation (RAG) pipeline to prepare 4,000 question sets. Since traditional translation quality evaluation metrics fail on unstandardized dialects, we evaluate fidelity using an LLM-as-a-judge, and correlation with human judgments confirms that it outperforms legacy metrics. Second, we benchmark 19 LLMs across these gold-labeled sets, running 68,395 RLAIF evaluations validated through multi-judge agreement and human fallback. Our findings reveal severe performance drops linked to linguistic divergence. For instance, responses to the highly divergent Chittagong dialect score 5.44/10, compared to 7.68/10 for Tangail. Furthermore, increased model scale does not consistently mitigate this bias. We contribute a validated translation quality evaluation method, a rigorous benchmark dataset, and a Critical Bias Sensitivity (CBS) metric for safety-critical applications.
摘要：大型语言模型 (LLM) 经常表现出针对低资源语言的区域方言的性能偏差。然而，量化这些差异的框架仍然很少。我们提出了一个两阶段框架来评估法学硕士问答中九种孟加拉语方言的方言偏差。首先，我们采用检索增强生成 (RAG) 管道将标准孟加拉语问题翻译为方言变体并为其贴上黄金标签，以准备 4,000 个问题集。由于传统的翻译质量评估指标在非标准化方言上失败，因此我们使用法学硕士作为法官来评估保真度，人类相关性证实该指标优于传统指标。其次，我们对这些金标集合中的 19 个法学硕士进行了基准测试，运行了 68,395 项 RLAIF 评估，这些评估通过多位法官一致同意和人工回退进行验证。我们的研究结果表明，语言差异导致了性能的严重下降。例如，对高度分化的吉大港方言的反应得分为 5.44/10，而对坦盖尔方言的反应得分为 7.68/10。此外，增加模型规模并不能始终减轻这种偏差。我们为安全关键应用程序提供经过验证的翻译质量评估方法、严格的基准数据集和关键偏差敏感性 (CBS) 指标。

Title: Conspiracy Frame: a Semiotically-Driven Approach for Conspiracy Theories Detection

Authors: Heidi Campana Piva, Shaina Ashraf, Maziar Kianimoghadam Jouneghani, Arianna Longo, Rossana Damiano, Lucie Flek, Marco Antonio Stranisci
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.21368
Pdf URL: https://arxiv.org/pdf/2603.21368
Copy Paste: [[2603.21368]] Conspiracy Frame: a Semiotically-Driven Approach for Conspiracy Theories Detection(https://arxiv.org/abs/2603.21368)
Keywords: llm
Abstract: Conspiracy theories are anti-authoritarian narratives that lead to social conflict, impacting how people perceive political information. To help in understanding this issue, we introduce the Conspiracy Frame: a fine-grained semantic representation of conspiratorial narratives derived from frame-semantics and semiotics, which spawned the Conspiracy Frames (this http URL.) dataset: a corpus of Telegram messages annotated at span-level. The Conspiracy Frame and this http URL. dataset contribute to the implementation of a more generalizable understanding and recognition of conspiracy theories. We observe the ability of LLMs to recognize this phenomenon in-domain and out-of-domain, investigating the role that frames may have in supporting this task. Results show that, while the injection of frames in an in-context approach does not lead to a clear increase in performance, it has potential; the mapping of annotated spans to FrameNet shows abstract semantic patterns (e.g., 'Kinship', 'Ingest_substance') that potentially pave the way for a more semantically- and semiotically-aware detection of conspiratorial narratives.
摘要：阴谋论是反独裁的叙事，会导致社会冲突，影响人们对政治信息的看法。为了帮助理解这个问题，我们引入了阴谋框架：源自框架语义学和符号学的阴谋叙事的细粒度语义表示，它催生了阴谋框架（此 http URL。）数据集：在跨度级别注释的 Telegram 消息语料库。阴谋框架和这个 http URL。数据集有助于对阴谋论进行更普遍的理解和识别。我们观察法学硕士在域内和域外识别这种现象的能力，调查框架在支持这项任务中可能发挥的作用。结果表明，虽然在上下文方法中注入帧不会导致性能明显提高，但它具有潜力；带注释的跨度与 FrameNet 的映射显示了抽象的语义模式（例如“亲属关系”、“摄取\_物质”），这可能为对阴谋叙事进行更语义和符号意识的检测铺平道路。

Title: Task-Specific Efficiency Analysis: When Small Language Models Outperform Large Language Models

Authors: Jinghan Cao, Yu Ma, Xinjin Li, Qingyang Ren, Xiangyun Chen
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2603.21389
Pdf URL: https://arxiv.org/pdf/2603.21389
Copy Paste: [[2603.21389]] Task-Specific Efficiency Analysis: When Small Language Models Outperform Large Language Models(https://arxiv.org/abs/2603.21389)
Keywords: language model
Abstract: Large Language Models achieve remarkable performance but incur substantial computational costs unsuitable for resource-constrained deployments. This paper presents the first comprehensive task-specific efficiency analysis comparing 16 language models across five diverse NLP tasks. We introduce the Performance-Efficiency Ratio (PER), a novel metric integrating accuracy, throughput, memory, and latency through geometric mean normalization. Our systematic evaluation reveals that small models (0.5--3B parameters) achieve superior PER scores across all given tasks. These findings establish quantitative foundations for deploying small models in production environments prioritizing inference efficiency over marginal accuracy gains.
摘要：大型语言模型可实现卓越的性能，但会产生大量计算成本，不适合资源受限的部署。本文提出了第一个全面的特定任务效率分析，比较了五种不同 NLP 任务中的 16 种语言模型。我们引入了性能效率比 (PER)，这是一种通过几何平均归一化集成准确性、吞吐量、内存和延迟的新颖指标。我们的系统评估表明，小型模型（0.5--3B 参数）在所有给定任务中均取得了优异的 PER 分数。这些发现为在生产环境中部署小型模型奠定了定量基础，优先考虑推理效率而不是边际精度增益。
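The abstract does not give the formula for the Performance-Efficiency Ratio, only that it integrates accuracy, throughput, memory, and latency through geometric mean normalization. One possible reading is sketched below: each metric is expressed relative to a reference model so that higher is better, and the four ratios are combined with a geometric mean. The reference-model normalization, the equal weighting, and all names and numbers are assumptions.

```python
from math import prod


def performance_efficiency_ratio(model, reference):
    """Geometric mean of four ratios, each oriented so that larger is better.

    Accuracy and throughput are divided by the reference value; memory and
    latency are inverted because lower is better for both.
    """
    ratios = [
        model["accuracy"] / reference["accuracy"],
        model["throughput"] / reference["throughput"],
        reference["memory_gb"] / model["memory_gb"],
        reference["latency_ms"] / model["latency_ms"],
    ]
    return prod(ratios) ** (1 / len(ratios))


# Hypothetical measurements, not results from the paper.
reference = {"accuracy": 0.80, "throughput": 100.0, "memory_gb": 16.0, "latency_ms": 50.0}
half_memory = dict(reference, memory_gb=8.0)

per_reference = performance_efficiency_ratio(reference, reference)
per_half_memory = performance_efficiency_ratio(half_memory, reference)
```

A geometric mean keeps a single very poor metric from being hidden by strong values elsewhere, which may be why the authors chose it over an arithmetic mean.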

Title: Multi-Perspective LLM Annotations for Valid Analyses in Subjective Tasks

Authors: Navya Mehrotra, Adam Visokay, Kristina Gligorić
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.21404
Pdf URL: https://arxiv.org/pdf/2603.21404
Copy Paste: [[2603.21404]] Multi-Perspective LLM Annotations for Valid Analyses in Subjective Tasks(https://arxiv.org/abs/2603.21404)
Keywords: language model, llm
Abstract: Large language models are increasingly used to annotate texts, but their outputs reflect some human perspectives better than others. Existing methods for correcting LLM annotation error assume a single ground truth. However, this assumption fails in subjective tasks where disagreement across demographic groups is meaningful. Here we introduce Perspective-Driven Inference, a method that treats the distribution of annotations across groups as the quantity of interest, and estimates it using a small human annotation budget. We contribute an adaptive sampling strategy that concentrates human annotation effort on groups where LLM proxies are least accurate. We evaluate on politeness and offensiveness rating tasks, showing targeted improvements for harder-to-model demographic groups relative to uniform sampling baselines, while maintaining coverage.
摘要：大型语言模型越来越多地用于注释文本，但它们的输出比其他模型更好地反映了人类的某些观点。纠正 LLM 注释错误的现有方法假设单一基本事实。然而，这种假设在主观任务中失败了，因为不同人口群体之间的分歧是有意义的。在这里，我们介绍视角驱动推理，这种方法将跨组的注释分布视为感兴趣的数量，并使用少量的人类注释预算来估计它。我们提供了一种自适应采样策略，将人工注释工作集中在 LLM 代理最不准确的群体上。我们对礼貌和冒犯性评级任务进行评估，显示相对于统一抽样基线的难以建模的人口群体的有针对性的改进，同时保持覆盖范围。

Title: Efficient Fine-Tuning Methods for Portuguese Question Answering: A Comparative Study of PEFT on BERTimbau and Exploratory Evaluation of Generative LLMs

Authors: Mariela M. Nina, Caio Veloso Costa, Lilian Berton, Didier A. Vega-Oliveros
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.21418
Pdf URL: https://arxiv.org/pdf/2603.21418
Copy Paste: [[2603.21418]] Efficient Fine-Tuning Methods for Portuguese Question Answering: A Comparative Study of PEFT on BERTimbau and Exploratory Evaluation of Generative LLMs(https://arxiv.org/abs/2603.21418)
Keywords: language model, llm
Abstract: Although large language models have transformed natural language processing, their computational costs create accessibility barriers for low-resource languages such as Brazilian Portuguese. This work presents a systematic evaluation of Parameter-Efficient Fine-Tuning (PEFT) and quantization techniques applied to BERTimbau for Question Answering on SQuAD-BR, the Brazilian Portuguese translation of SQuAD v1. We evaluate 40 configurations combining four PEFT methods (LoRA, DoRA, QLoRA, QDoRA) across two model sizes (Base: 110M, Large: 335M parameters). Our findings reveal three critical insights: (1) LoRA achieves 95.8% of baseline performance on BERTimbau-Large while reducing training time by 73.5% (F1=81.32 vs 84.86); (2) higher learning rates (2e-4) substantially improve PEFT performance, with F1 gains of up to +19.71 points over standard rates; and (3) larger models show twice the quantization resilience (loss of 4.83 vs 9.56 F1 points). These results demonstrate that encoder-based models can be efficiently fine-tuned for extractive Brazilian Portuguese QA with substantially lower computational cost than large generative LLMs, promoting more sustainable approaches aligned with Green AI principles. An exploratory evaluation of Tucano and Sabiá on the same extractive QA benchmark shows that while generative models can reach competitive F1 scores with LoRA fine-tuning, they require up to 4.2× more GPU memory and 3× more training time than BERTimbau-Base, reinforcing the efficiency advantage of smaller encoder-based architectures for this task.
摘要：尽管大型语言模型已经改变了自然语言处理，但它们的计算成本给巴西葡萄牙语等低资源语言造成了可访问性障碍。这项工作对应用于 SQuAD-BR（SQuAD v1 的巴西葡萄牙语翻译）上的 BERTimbau 问答的参数高效微调 (PEFT) 和量化技术进行了系统评估。我们评估了 40 种配置，结合了四种 PEFT 方法（LoRA、DoRA、QLoRA、QDoRA），跨两种模型大小（基础：110M，大型：335M 参数）。我们的研究结果揭示了三个关键见解：(1) LoRA 在 BERTimbau-Large 上实现了 95.8% 的基线性能，同时减少了 73.5% 的训练时间（F1=81.32 vs 84.86）； (2) 更高的学习率 (2e-4) 显着提高 PEFT 性能，F1 增益比标准率高出高达 +19.71 点； (3) 较大的模型显示出两倍的量化弹性（F1 点损失 4.83 vs 9.56）。这些结果表明，基于编码器的模型可以有效地微调巴西葡萄牙语的提取质量保证，其计算成本比大型生成法学硕士要低得多，从而促进符合 \textit{Green AI} 原则的更可持续的方法。在相同的提取 QA 基准上对 Tucano 和 Sabiá 进行的探索性评估表明，虽然生成模型可以通过 LoRA 微调达到有竞争力的 F1 分数，但它们需要比 BERTimbau-Base 多 4.2$\time$ 的 GPU 内存和 3$\times$ 的训练时间，从而增强了基于较小编码器的架构在该任务中的效率优势。
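
Worked arithmetic: the relative figures in this abstract can be checked directly from the F1 numbers it reports. The variable names are illustrative; the four input values are taken from the abstract.

```python
# F1 scores reported in the abstract for BERTimbau-Large.
lora_f1 = 81.32
baseline_f1 = 84.86

# Share of baseline performance retained by LoRA, as a percentage.
retained_pct = lora_f1 / baseline_f1 * 100

# F1 points lost under quantization, as reported for each model size.
loss_large = 4.83
loss_base = 9.56

# How many times more F1 the Base model loses compared with Large.
loss_ratio = loss_base / loss_large

print(f"LoRA retains {retained_pct:.1f}% of baseline F1")
print(f"Base loses {loss_ratio:.2f}x as many F1 points as Large")
```

The first figure agrees with the stated 95.8%, and the second ratio is just under 2, which is what the abstract rounds to "twice the quantization resilience".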

Title: PROMPT2BOX: Uncovering Entailment Structure among LLM Prompts

Authors: Neeladri Bhuiya, Shib Sankar Dasgupta, Andrew McCallum, Haw-Shiuan Chang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.21438
Pdf URL: https://arxiv.org/pdf/2603.21438
Copy Paste: [[2603.21438]] PROMPT2BOX: Uncovering Entailment Structure among LLM Prompts(https://arxiv.org/abs/2603.21438)
Keywords: llm, prompt
Abstract: To discover the weaknesses of LLMs, researchers often embed prompts into a vector space and cluster them to extract insightful patterns. However, vector embeddings primarily capture topical similarity. As a result, prompts that share a topic but differ in specificity, and consequently in difficulty, are often represented similarly, making fine-grained weakness analysis difficult. To address this limitation, we propose PROMPT2BOX, which embeds prompts into a box embedding space using a trained encoder. The encoder, trained on existing and synthesized datasets, outputs box embeddings that capture not only semantic similarity but also specificity relations between prompts (e.g., "writing an adventure story" is more specific than "writing a story"). We further develop a novel dimension reduction technique for box embeddings to facilitate dataset visualization and comparison. Our experiments demonstrate that box embeddings consistently capture prompt specificity better than vector baselines. On the downstream task of creating hierarchical clustering trees for 17 LLMs from the UltraFeedback dataset, PROMPT2BOX can identify 8.9% more LLM weaknesses than vector baselines and achieves an approximately 33% stronger correlation between hierarchical depth and instruction specificity.
摘要：为了发现法学硕士的弱点，研究人员经常将提示嵌入到向量空间中并对它们进行聚类以提取有洞察力的模式。然而，向量嵌入主要捕获主题相似性。因此，共享主题但具体性不同（因此难度不同）的提示通常会以相似的方式表示，从而使细粒度的弱点分析变得困难。为了解决这个限制，我们提出了 PROMPT2BOX，它使用经过训练的编码器将提示嵌入到框嵌入空间中。编码器在现有和合成的数据集上进行训练，输出框嵌入，不仅捕获语义相似性，而且捕获提示之间的特异性关系（例如，“写一个冒险故事”比“写一个故事”更具体）。我们进一步开发了一种新颖的框嵌入降维技术，以促进数据集可视化和比较。我们的实验表明，框嵌入始终比向量基线更好地捕获提示特异性。在为来自 UltraFeedback 数据集的 17 个 LLM 创建层次聚类树的下游任务中，PROMPT2BOX 可以比向量基线多识别 8.9% 的 LLM 弱点，并在层次深度和指令特异性之间实现约 33% 更强的相关性。
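
Illustrative sketch: a box embedding represents an item as an axis-aligned box rather than a point, so "more specific than" can be read as "contained in". The toy example below uses hand-picked 2-D boxes (hypothetical coordinates, not outputs of the PROMPT2BOX encoder) and a simple volume-ratio containment score, which is one common way to score box containment and is an assumption here, not necessarily the paper's scoring function.

```python
def volume(lo, hi):
    """Volume of an axis-aligned box; zero if it is empty along any axis."""
    vol = 1.0
    for low, high in zip(lo, hi):
        vol *= max(0.0, high - low)
    return vol


def intersection(box_a, box_b):
    """Box formed by the overlap of two boxes (may be empty)."""
    (lo_a, hi_a), (lo_b, hi_b) = box_a, box_b
    lo = [max(a, b) for a, b in zip(lo_a, lo_b)]
    hi = [min(a, b) for a, b in zip(hi_a, hi_b)]
    return lo, hi


def containment(box_a, box_b):
    """Fraction of box_a that lies inside box_b (1.0 means fully contained)."""
    return volume(*intersection(box_a, box_b)) / volume(*box_a)


# Hypothetical boxes: the specific prompt sits inside the general one.
story = ([0.0, 0.0], [4.0, 4.0])            # "writing a story"
adventure_story = ([1.0, 1.0], [2.0, 3.0])  # "writing an adventure story"

specific_in_general = containment(adventure_story, story)
general_in_specific = containment(story, adventure_story)
print(specific_in_general, general_in_specific)
```

The score is asymmetric, which is the property a symmetric vector similarity such as cosine cannot express.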

Title: KG-Hopper: Empowering Compact Open LLMs with Knowledge Graph Reasoning via Reinforcement Learning

Authors: Shuai Wang, Yinan Yu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.21440
Pdf URL: https://arxiv.org/pdf/2603.21440
Copy Paste: [[2603.21440]] KG-Hopper: Empowering Compact Open LLMs with Knowledge Graph Reasoning via Reinforcement Learning(https://arxiv.org/abs/2603.21440)
Keywords: language model, gpt, llm
Abstract: Large Language Models (LLMs) demonstrate impressive natural language capabilities but often struggle with knowledge-intensive reasoning tasks. Knowledge Base Question Answering (KBQA), which leverages structured Knowledge Graphs (KGs), exemplifies this challenge due to the need for accurate multi-hop reasoning. Existing approaches typically perform sequential reasoning steps guided by predefined pipelines, restricting flexibility and causing error cascades due to isolated reasoning at each step. To address these limitations, we propose KG-Hopper, a novel Reinforcement Learning (RL) framework that empowers compact open LLMs with the ability to perform integrated multi-hop KG reasoning within a single inference round. Rather than reasoning step-by-step, we train a Reasoning LLM that embeds the entire KG traversal and decision process into a unified "thinking" stage, enabling global reasoning over cross-step dependencies and dynamic path exploration with backtracking. Experimental results on eight KG reasoning benchmarks show that KG-Hopper, based on a 7B-parameter LLM, consistently outperforms larger multi-step systems (up to 70B) and achieves competitive performance with proprietary models such as GPT-3.5-Turbo and GPT-4o-mini, while remaining compact, open, and data-efficient. The code is publicly available at: this https URL.
摘要：大型语言模型 (LLM) 展示了令人印象深刻的自然语言能力，但常常难以完成知识密集型推理任务。由于需要准确的多跳推理，利用结构化知识图 (KG) 的知识库问答 (KBQA) 就体现了这一挑战。现有方法通常执行由预定义管道引导的顺序推理步骤，这限制了灵活性，并且由于每个步骤的独立推理而导致错误级联。为了解决这些限制，我们提出了 KG-Hopper，这是一种新颖的强化学习 (RL) 框架，它使紧凑型开放式 LLM 能够在单轮推理中执行集成的多跳 KG 推理。我们不是一步步推理，而是训练一个推理LLM，将整个知识图谱遍历和决策过程嵌入到统一的“思维”阶段，从而实现跨步骤依赖的全局推理和回溯的动态路径探索。八个 KG 推理基准测试的实验结果表明，基于 7B 参数 LLM 的 KG-Hopper 始终优于较大的多步系统（高达 70B），并实现了与 GPT-3.5-Turbo 和 GPT-4o-mini 等专有模型的竞争性能，同时保持紧凑、开放和数据高效。该代码可在以下位置公开获取：此 https URL。

Title: Cross-Context Verification: Hierarchical Detection of Benchmark Contamination through Session-Isolated Analysis

Authors: Tae-Eun Song
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.21454
Pdf URL: https://arxiv.org/pdf/2603.21454
Copy Paste: [[2603.21454]] Cross-Context Verification: Hierarchical Detection of Benchmark Contamination through Session-Isolated Analysis(https://arxiv.org/abs/2603.21454)
Keywords: llm, agent
Abstract: LLM coding benchmarks face a credibility crisis: widespread solution leakage and test quality issues undermine SWE-bench Verified, while existing detection methods--paraphrase consistency, n-gram overlap, perplexity analysis--never directly observe whether a model reasons or recalls. Meanwhile, simply repeating verification degrades accuracy: multi-turn review generates false positives faster than it discovers true errors, suggesting that structural approaches are needed. We introduce Cross-Context Verification (CCV), a black-box method that solves the same benchmark problem in N independent sessions and measures solution diversity, combined with the Hierarchical Cross-Context Architecture (HCCA), a multi-agent analysis framework that prevents confirmation bias through intentional information restriction across specialized analytical roles. On 9 SWE-bench Verified problems (45 trials, Claude Opus 4.6, temperature 0), CCV achieves perfect separation between contaminated and genuine reasoning (Mann-Whitney U=0, p ≈ 0.012, r = 1.0). Key findings: (1) contamination is binary--models either recall perfectly or not at all; (2) reasoning absence is a perfect discriminator; (3) 33% of prior contamination labels are false positives; (4) HCCA's independent analysis structure discovers contamination-flaw composite cases that single-analyst approaches miss. A pilot experiment extending HCCA to multi-stage verification (Worker to Verifier to Director) yields a negative result--100% sycophantic confirmation--providing further evidence that information restriction, not structural complexity, is the key mechanism. We release all code and data.
摘要：LLM 编码基准面临可信度危机：广泛的解决方案泄漏和测试质量问题破坏了 SWE-bench Verified，而现有的检测方法（释义一致性、n 元语法重叠、困惑分析）从未直接观察模型是否推理或召回。同时，简单地重复验证会降低准确性：多轮审查产生误报的速度比发现真正错误的速度快，这表明需要结构性方法。我们引入了跨上下文验证（CCV），这是一种黑盒方法，可以在 N 个独立会话中解决相同的基准问题并衡量解决方案的多样性，并结合分层跨上下文架构（HCCA），这是一种多代理分析框架，可以通过跨专业分析角色的有意信息限制来防止确认偏差。在 9 个 SWE 平台验证问题（45 次试验，Claude Opus 4.6，温度 0）上，CCV 实现了受污染推理和真实推理之间的完美分离（Mann-Whitney U=0，p 约 0.012，r = 1.0）。主要发现：（1）污染是二元的——模型要么完美回忆，要么根本不回忆； (2) 推理缺席是一个完美的判别器； (3) 33%的先前污染标签为误报； (4) HCCA 的独立分析结构发现了单一分析方法所遗漏的污染缺陷复合案例。将 HCCA 扩展到多阶段验证（工人到验证者到主管）的试点实验产生了负面结果 - 100％的阿谀奉承 - 进一步证明信息限制，而不是结构复杂性，是关键机制。我们发布所有代码和数据。
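
Illustrative sketch: the abstract says CCV solves the same problem in N independent sessions and measures solution diversity, but it does not define the metric. The function below is one hypothetical choice: the number of distinct solutions (after whitespace normalization), rescaled so that identical outputs in every session score 0 and all-different outputs score 1. The function name and the sample solutions are assumptions.

```python
def solution_diversity(solutions):
    """Rescaled count of distinct solutions across independent sessions.

    Returns 0.0 when every session produced the same text (consistent with
    recall) and 1.0 when every session produced a different text.
    """
    if len(solutions) < 2:
        raise ValueError("need at least two sessions to measure diversity")
    # Collapse runs of whitespace so formatting differences are ignored.
    distinct = {" ".join(s.split()) for s in solutions}
    return (len(distinct) - 1) / (len(solutions) - 1)


# Hypothetical session outputs for two benchmark problems (N = 5 each).
recalled = ["return a + b"] * 5
reasoned = [
    "return a + b",
    "return sum([a, b])",
    "total = a + b\nreturn total",
    "return b + a",
    "return operator.add(a, b)",
]

print(solution_diversity(recalled), solution_diversity(reasoned))
```

Exact-text comparison is deliberately crude; a real pipeline would more plausibly compare patches structurally, which is outside what the abstract specifies.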

Title: DRTriton: Large-Scale Synthetic Data Reinforcement Learning for Triton Kernel Generation

Authors: Siqi Guo, Ming Lin, Tianbao Yang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2603.21465
Pdf URL: https://arxiv.org/pdf/2603.21465
Copy Paste: [[2603.21465]] DRTriton: Large-Scale Synthetic Data Reinforcement Learning for Triton Kernel Generation(https://arxiv.org/abs/2603.21465)
Keywords: language model, gpt, llm
Abstract: Developing efficient CUDA kernels is a fundamental yet challenging task in the generative AI industry. Recent research leverages Large Language Models (LLMs) to automatically convert PyTorch reference implementations to CUDA kernels, significantly reducing the engineering effort. State-of-the-art LLMs, such as GPT-5.2 and Claude-Sonnet-4.5, still struggle with this specific task. To address this challenge, we propose DRTriton, a scalable learning framework for training LLMs to convert PyTorch code into highly optimized Triton kernels, which are then compiled to CUDA kernels at runtime. DRTriton consists of three key components: (i) a data synthesis algorithm, CSP-DAG, that guarantees full coverage and unbiased uniform sampling over the operator space with controlled difficulty; (ii) a curriculum reinforcement learning scheme with decoupled reward that efficiently optimizes conversion success rate and inference speed simultaneously; and (iii) a test-time search algorithm that further improves the inference speed of the generated Triton kernels. Notably, despite being trained exclusively on synthetic data, DRTriton generalizes effectively to real-world CUDA kernels that are challenging even for human experts. Experimental results show that DRTriton-7B achieves speedup on 92% of KernelBench Level 2, compared to 23% for GPT-5.2 and 19% for Claude-Sonnet-4.5.
摘要：开发高效的 CUDA 内核是生成式 AI 行业的一项基本但具有挑战性的任务。最近的研究利用大型语言模型 (LLM) 自动将 PyTorch 参考实现转换为 CUDA 内核，从而显着减少工程工作量。最先进的法学硕士，例如 GPT-5.2 和 Claude-Sonnet-4.5，仍然在这项特定任务中苦苦挣扎。为了应对这一挑战，我们提出了 DRTriton，这是一个可扩展的学习框架，用于训练 LLM 将 PyTorch 代码转换为高度优化的 Triton 内核，然后在运行时编译为 CUDA 内核。 DRTriton 由三个关键组件组成：（i）数据合成算法 CSP-DAG，保证操作空间上的全覆盖和无偏均匀采样，难度可控； (ii) 具有解耦奖励的课程强化学习同时有效优化转换成功率和推理速度； (iii) 测试时搜索算法，进一步提高生成的 Triton 内核的推理速度。值得注意的是，尽管只接受了合成数据的训练，DRTriton 仍能有效地推广到现实世界的 CUDA 内核，这甚至对人类专家来说也是一个挑战。实验结果表明，DRTriton-7B 在 KernelBench Level 2 上实现了 92% 的加速，而 GPT-5.2 为 23%，Claude-Sonnet-4.5 为 19%。

Title: TaigiSpeech: A Low-Resource Real-World Speech Intent Dataset and Preliminary Results with Scalable Data Mining In-the-Wild

Authors: Kai-Wei Chang, Yi-Cheng Lin, Huang-Cheng Chou, Wenze Ren, Yu-Han Huang, Yun-Shao Tsai, Chien-Cheng Chen, Yu Tsao, Yuan-Fu Liao, Shrikanth Narayanan, James Glass, Hung-yi Lee
Subjects: cs.CL, cs.LG, eess.AS
Abstract URL: https://arxiv.org/abs/2603.21478
Pdf URL: https://arxiv.org/pdf/2603.21478
Copy Paste: [[2603.21478]] TaigiSpeech: A Low-Resource Real-World Speech Intent Dataset and Preliminary Results with Scalable Data Mining In-the-Wild(https://arxiv.org/abs/2603.21478)
Keywords: llm
Abstract: Speech technologies have advanced rapidly and serve diverse populations worldwide. However, many languages remain underrepresented due to limited resources. In this paper, we introduce TaigiSpeech, a real-world speech intent dataset in Taiwanese Taigi (aka Taiwanese Hokkien/Southern Min), which is a low-resource and primarily spoken language. The dataset is collected from older adults, comprising 21 speakers with a total of 3k utterances. It is designed for practical intent detection scenarios, including healthcare and home assistant applications. To address the scarcity of labeled data, we explore two data mining strategies with two levels of supervision: keyword match data mining with LLM pseudo labeling via an intermediate language and an audio-visual framework that leverages multimodal cues with minimal textual supervision. This design enables scalable dataset construction for low-resource and unwritten spoken languages. TaigiSpeech will be released under the CC BY 4.0 license to facilitate broad adoption and research on low-resource and unwritten languages. The project website and the dataset can be found on this https URL.
摘要：语音技术发展迅速，服务于全球不同人群。然而，由于资源有限，许多语言的代表性仍然不足。在本文中，我们介绍了 \textbf{TaigiSpeech}，这是台湾 Taigi（又名台湾闽南语/闽南语）的真实语音意图数据集，这是一种资源匮乏且主要是口语的语言。该数据集是从老年人那里收集的，包括 21 名说话者，总共 3000 条话语。它专为实用意图检测场景而设计，包括医疗保健和家庭助理应用。为了解决标记数据的稀缺问题，我们探索了两种具有两个监督级别的数据挖掘策略：通过中间语言和利用多模态线索的视听框架进行关键字匹配数据挖掘和LLM伪标记，并以最少的文本监督。这种设计可以为资源匮乏和不成文的口语语言构建可扩展的数据集。 TaigiSpeech 将根据 CC BY 4.0 许可证发布，以促进对资源匮乏和不成文语言的广泛采用和研究。项目网站和数据集可以在此 https URL 上找到。

Title: Effective Strategies for Asynchronous Software Engineering Agents

Authors: Jiayi Geng, Graham Neubig
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.21489
Pdf URL: https://arxiv.org/pdf/2603.21489
Copy Paste: [[2603.21489]] Effective Strategies for Asynchronous Software Engineering Agents(https://arxiv.org/abs/2603.21489)
Keywords: agent
Abstract: AI agents have become increasingly capable at isolated software engineering (SWE) tasks such as resolving issues on GitHub. Yet long-horizon tasks involving multiple interdependent subtasks still pose challenges with respect to both accuracy and timely completion. A natural approach to solving these long-horizon tasks in a timely manner is asynchronous multi-agent collaboration, where multiple agents work on different parts of the task at the same time. But effective application of multi-agent systems has proven surprisingly difficult: concurrent edits by multiple agents interfere with each other, dependencies are difficult to synchronize, and combining partial progress into a coherent whole is challenging. On the other hand, human developers have long relied on mature collaboration infrastructure to manage these challenges in large software projects. Inspired by these collaboration primitives, we introduce Centralized Asynchronous Isolated Delegation (CAID), a structured multi-agent coordination paradigm grounded in three core SWE primitives: centralized task delegation, asynchronous execution, and isolated workspaces. CAID constructs dependency-aware task plans through a central manager, executes subtasks concurrently in isolated workspaces, and consolidates progress via structured integration with executable test-based verification. In empirical evaluation, we find that CAID improves accuracy over single-agent baselines by 26.7% absolute on paper reproduction tasks (PaperBench) and 14.3% on Python library development tasks (Commit0). Through systematic analysis, we find that branch-and-merge is a central coordination mechanism for multi-agent collaboration, and that SWE primitives such as git worktree, git commit, and git merge enable it to be realized in a reliable and executable manner.
摘要：人工智能代理在执行独立软件工程 (SWE) 任务（例如解决 Github 上的问题）方面的能力越来越强。然而，涉及多个相互依赖的子任务的长期任务仍然在准确性和及时完成方面提出了挑战。及时解决这些长期任务的自然方法是异步多代理协作，其中多个代理同时处理任务的不同部分。但事实证明，有效应用多智能体系统非常困难：多个智能体的并发编辑相互干扰，依赖关系难以同步，并且将部分进展组合成一个连贯的整体具有挑战性。另一方面，人类开发人员长期以来一直依赖成熟的协作基础设施来应对大型软件项目中的这些挑战。受这些协作原语的启发，我们引入了集中式异步隔离委托 (CAID)，这是一种基于三个核心 SWE 原语的结构化多代理协调范例：集中式任务委托、异步执行和隔离工作空间。 CAID 通过中央管理器构建依赖性感知任务计划，在隔离的工作区中同时执行子任务，并通过与可执行的基于测试的验证的结构化集成来巩固进度。在实证评估中，我们发现 CAID 在纸张复制任务 (PaperBench) 上相对于单代理基线提高了 26.7% 的绝对准确度，在 Python 库开发任务 (Commit0) 上提高了 14.3% 的绝对准确度。通过系统分析，我们发现分支合并是多智能体协作的核心协调机制，而 git worktree、git commit 和 git merge 等 SWE 原语使其能够以可靠且可执行的方式实现。
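
Illustrative sketch: the abstract describes a central manager that builds dependency-aware task plans and runs independent subtasks concurrently. A minimal way to picture the planning step is to group subtasks into "waves", where every subtask in a wave has all of its dependencies finished and could therefore be handed to a separate agent. This uses Python's standard `graphlib` module (Python 3.9+); the function name and the task graph are hypothetical and this is not CAID's actual planner.

```python
from graphlib import TopologicalSorter


def plan_waves(dependencies):
    """Group subtasks into waves that can be executed concurrently.

    dependencies: subtask -> set of subtasks that must finish first.
    Raises graphlib.CycleError if the dependencies contain a cycle.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        # Everything returned here has no unfinished prerequisites.
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        # In a real system each ready subtask would go to its own agent and
        # isolated workspace; here they are simply marked as finished.
        sorter.done(*ready)
    return waves


# Hypothetical task graph for a small Python library.
tasks = {
    "utils": set(),
    "io": {"utils"},
    "model": {"utils"},
    "cli": {"io", "model"},
}

waves = plan_waves(tasks)
print(waves)
```

In this toy graph `io` and `model` land in the same wave, which is where isolated workspaces (for example one `git worktree` per agent, as the abstract mentions) would keep their edits from interfering before a merge.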

Title: Agentic Automation of BT-RADS Scoring: End-to-End Multi-Agent System for Standardized Brain Tumor Follow-up Assessment

Authors: Mohamed Sobhi Jabal (1), Jikai Zhang (2 and 3), Dominic LaBella (4), Jessica L. Houk (1), Dylan Zhang (1 and 7), Jeffrey D. Rudie (5 and 8), Kirti Magudia (1), Maciej A. Mazurowski (1, 2 and 6), Evan Calabrese (1 and 3) ((1) Duke University Medical Center, Durham NC, (2) Duke University, Durham NC, (3) Duke Center for Artificial Intelligence in Radiology, Durham NC, (4) Duke University Medical Center, Durham NC, (5) University of California San Diego, San Diego CA, (6) Duke University School of Medicine, Durham NC, (7) Santa Clara Valley Medical Center, San Jose CA, (8) Scripps Clinic Medical Group, San Diego CA)
Subjects: cs.CL, cs.MA
Abstract URL: https://arxiv.org/abs/2603.21494
Pdf URL: https://arxiv.org/pdf/2603.21494
Copy Paste: [[2603.21494]] Agentic Automation of BT-RADS Scoring: End-to-End Multi-Agent System for Standardized Brain Tumor Follow-up Assessment(https://arxiv.org/abs/2603.21494)
Keywords: language model, llm, agent
Abstract: The Brain Tumor Reporting and Data System (BT-RADS) standardizes post-treatment MRI response assessment in patients with diffuse gliomas but requires complex integration of imaging trends, medication effects, and radiation timing. This study evaluates an end-to-end multi-agent large language model (LLM) and convolutional neural network (CNN) system for automated BT-RADS classification. A multi-agent LLM system combined with automated CNN-based tumor segmentation was retrospectively evaluated on 509 consecutive post-treatment glioma MRI examinations from a single high-volume center. An extractor agent identified clinical variables (steroid status, bevacizumab status, radiation date) from unstructured clinical notes, while a scorer agent applied BT-RADS decision logic integrating extracted variables with volumetric measurements. Expert reference standard classifications were established by an independent board-certified neuroradiologist. Of 509 examinations, 492 met inclusion criteria. The system achieved 374/492 (76.0%; 95% CI, 72.1%-79.6%) accuracy versus 283/492 (57.5%; 95% CI, 53.1%-61.8%) for initial clinical assessments (+18.5 percentage points; P<.001). Context-dependent categories showed high sensitivity (BT-1b 100%, BT-1a 92.7%, BT-3a 87.5%), while threshold-dependent categories showed moderate sensitivity (BT-3c 74.8%, BT-2 69.2%, BT-4 69.3%, BT-3b 57.1%). For BT-4, positive predictive value was 92.9%. The multi-agent LLM system achieved higher BT-RADS classification agreement with expert reference standard compared to initial clinical scoring, with high accuracy for context-dependent scores and high positive predictive value for BT-4 detection.
摘要：脑肿瘤报告和数据系统 (BT-RADS) 标准化了弥漫性神经胶质瘤患者的治疗后 MRI 反应评估，但需要对成像趋势、药物效果和放射计时进行复杂的整合。本研究评估了用于自动 BT-RADS 分类的端到端多智能体大语言模型 (LLM) 和卷积神经网络 (CNN) 系统。多智能体 LLM 系统与基于 CNN 的自动肿瘤分割相结合，对来自单个高容量中心的 509 次连续治疗后神经胶质瘤 MRI 检查进行回顾性评估。提取代理从非结构化临床记录中识别临床变量（类固醇状态、贝伐珠单抗状态、放射日期），而评分代理则应用 BT-RADS 决策逻辑，将提取的变量与体积测量相结合。专家参考标准分类由独立委员会认证的神经放射科医生建立。在 509 项考试中，492 项符合纳入标准。该系统达到了 374/492 (76.0%; 95% CI, 72.1%-79.6%) 的准确度，而初始临床评估的准确度为 283/492 (57.5%; 95% CI, 53.1%-61.8%) (+18.5 个百分点; P<.001)。上下文相关类别显示出高敏感性（BT-1b 100%、BT-1a 92.7%、BT-3a 87.5%），而阈值相关类别显示出中等敏感性（BT-3c 74.8%、BT-2 69.2%、BT-4 69.3%、BT-3b 57.1%）。对于 BT-4，阳性预测值为 92.9%。与初始临床评分相比，多智能体 LLM 系统实现了与专家参考标准更高的 BT-RADS 分类一致性，上下文相关评分的准确性高，BT-4 检测的阳性预测值高。
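
Worked arithmetic: the headline accuracy figures can be recomputed from the counts given in the abstract. The abstract does not say which confidence-interval method was used; the Wilson score interval below is an assumption, chosen because it is a standard choice for a binomial proportion. The function name is illustrative.

```python
from math import sqrt


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


n_exams = 492          # examinations meeting inclusion criteria
system_correct = 374   # correct classifications by the multi-agent system
clinical_correct = 283 # correct initial clinical assessments

system_acc = system_correct / n_exams
clinical_acc = clinical_correct / n_exams
gain_points = (system_acc - clinical_acc) * 100

sys_lo, sys_hi = wilson_interval(system_correct, n_exams)
cli_lo, cli_hi = wilson_interval(clinical_correct, n_exams)

print(f"system:   {system_acc:.1%} ({sys_lo:.1%} to {sys_hi:.1%})")
print(f"clinical: {clinical_acc:.1%} ({cli_lo:.1%} to {cli_hi:.1%})")
print(f"gain:     {gain_points:.1f} percentage points")
```

Under this assumption the recomputed intervals land close to the reported 72.1%-79.6% and 53.1%-61.8%, and the difference matches the stated +18.5 percentage points.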

Title: Generalizable Self-Evolving Memory for Automatic Prompt Optimization

Authors: Guanbao Liang, Yuanchen Bei, Sheng Zhou, Yuheng Qin, Huan Zhou, Bingxin Jia, Bin Li, Jiajun Bu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.21520
Pdf URL: https://arxiv.org/pdf/2603.21520
Copy Paste: [[2603.21520]] Generalizable Self-Evolving Memory for Automatic Prompt Optimization(https://arxiv.org/abs/2603.21520)
Keywords: language model, llm, prompt
Abstract: Automatic prompt optimization is a promising approach for adapting large language models (LLMs) to downstream tasks, yet existing methods typically search for a specific prompt specialized to a fixed task. This paradigm limits generalization across heterogeneous queries and prevents models from accumulating reusable prompting knowledge over time. In this paper, we propose MemAPO, a memory-driven framework that reconceptualizes prompt optimization as generalizable and self-evolving experience accumulation. MemAPO maintains a dual-memory mechanism that distills successful reasoning trajectories into reusable strategy templates while organizing incorrect generations into structured error patterns that capture recurrent failure modes. Given a new prompt, the framework retrieves both relevant strategies and failure patterns to compose prompts that promote effective reasoning while discouraging known mistakes. Through iterative self-reflection and memory editing, MemAPO continuously updates its memory, enabling prompt optimization to improve over time rather than restarting from scratch for each task. Experiments on diverse benchmarks show that MemAPO consistently outperforms representative prompt optimization baselines while substantially reducing optimization cost.
摘要：自动提示优化是一种使大型语言模型 (LLM) 适应下游任务的有前景的方法，但现有方法通常会搜索专门针对固定任务的特定提示。这种范例限制了跨异构查询的泛化，并阻止模型随着时间的推移积累可重用的提示知识。在本文中，我们提出了 MemAPO，一种内存驱动的框架，它将即时优化重新概念化为可泛化和自我进化的经验积累。 MemAPO 维护一种双内存机制，将成功的推理轨迹提炼为可重用的策略模板，同时将不正确的生成组织为捕获重复故障模式的结构化错误模式。给出新的提示时，该框架会检索相关策略和失败模式，以组成提示，促进有效推理，同时阻止已知错误。通过迭代的自我反思和内存编辑，MemAPO 不断更新其内存，从而能够随着时间的推移及时优化改进，而不是从头开始每个任务。对不同基准的实验表明，MemAPO 始终优于代表性的即时优化基准，同时大幅降低了优化成本。

Title: CatRAG: Functor-Guided Structural Debiasing with Retrieval Augmentation for Fair LLMs

Authors: Ravi Ranjan, Utkarsh Grover, Mayur Akewar, Xiaomin Lin, Agoritsa Polyzou
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.21524
Pdf URL: https://arxiv.org/pdf/2603.21524
Copy Paste: [[2603.21524]] CatRAG: Functor-Guided Structural Debiasing with Retrieval Augmentation for Fair LLMs(https://arxiv.org/abs/2603.21524)
Keywords: language model, gpt, llm, prompt, retrieval-augmented generation
Abstract: Large Language Models (LLMs) are deployed in high-stakes settings but can show demographic, gender, and geographic biases that undermine fairness and trust. Prior debiasing methods, including embedding-space projections, prompt-based steering, and causal interventions, often act at a single stage of the pipeline, resulting in incomplete mitigation and brittle utility trade-offs under distribution shifts. We propose CatRAG Debiasing, a dual-pronged framework that integrates functor with Retrieval-Augmented Generation (RAG) guided structural debiasing. The functor component leverages category-theoretic structure to induce a principled, structure-preserving projection that suppresses bias-associated directions in the embedding space while retaining task-relevant semantics. On the Bias Benchmark for Question Answering (BBQ) across three open-source LLMs (Meta Llama-3, OpenAI GPT-OSS, and Google Gemma-3), CatRAG achieves state-of-the-art results, improving accuracy by up to 40% over the corresponding base models and by more than 10% over prior debiasing methods, while reducing bias scores to near zero (from 60% for the base models) across gender, nationality, race, and intersectional subgroups.
摘要：大型语言模型 (LLM) 部署在高风险环境中，但可能会显示出破坏公平和信任的人口、性别和地理偏见。先前的去偏差方法，包括嵌入空间投影、基于提示的转向和因果干预，通常在管道的单个阶段起作用，导致在分配变化下不完全的缓解和脆弱的效用权衡。我们提出了 CatRAG Debiasing，这是一个双管齐下的框架，它将函子与检索增强生成 (RAG) 引导的结构去偏集成在一起。函子组件利用类别理论结构来引入有原则的、结构保留的投影，该投影抑制嵌入空间中与偏差相关的方向，同时保留与任务相关的语义。在三个开源 LLM（Meta Llama-3、OpenAI GPT-OSS 和 Google Gemma-3）的问答偏差基准 (BBQ) 上，CatRAG 取得了最先进的结果，与相应的基础模型相比，准确性提高了 40%，与之前的去偏差方法相比，准确性提高了 10% 以上，同时将性别、国籍、种族等方面的偏差分数降低到接近零（基础模型的偏差分数为 60%），和交叉子组。

Title: SynSym: A Synthetic Data Generation Framework for Psychiatric Symptom Identification

Authors: Migyeong Kang, Jihyun Kim, Hyolim Jeon, Sunwoo Hwang, Jihyun An, Yonghoon Kim, Haewoon Kwak, Jisun An, Jinyoung Han
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.21529
Pdf URL: https://arxiv.org/pdf/2603.21529
Copy Paste: [[2603.21529]] SynSym: A Synthetic Data Generation Framework for Psychiatric Symptom Identification(https://arxiv.org/abs/2603.21529)
Keywords: language model, llm
Abstract: Psychiatric symptom identification on social media aims to infer fine-grained mental health symptoms from user-generated posts, allowing a detailed understanding of users' mental states. However, the construction of large-scale symptom-level datasets remains challenging due to the resource-intensive nature of expert labeling and the lack of standardized annotation guidelines, which in turn limits the generalizability of models to identify diverse symptom expressions from user-generated text. To address these issues, we propose SynSym, a synthetic data generation framework for constructing generalizable datasets for symptom identification. Leveraging large language models (LLMs), SynSym constructs high-quality training samples by (1) expanding each symptom into sub-concepts to enhance the diversity of generated expressions, (2) producing synthetic expressions that reflect psychiatric symptoms in diverse linguistic styles, and (3) composing realistic multi-symptom expressions, informed by clinical co-occurrence patterns. We validate SynSym on three benchmark datasets covering different styles of depressive symptom expression. Experimental results demonstrate that models trained solely on the synthetic data generated by SynSym perform comparably to those trained on real data, and benefit further from additional fine-tuning with real data. These findings underscore the potential of synthetic data as an alternative resource to real-world annotations in psychiatric symptom modeling, and SynSym serves as a practical framework for generating clinically relevant and realistic symptom expressions.
摘要：社交媒体上的精神症状识别旨在从用户生成的帖子中推断出细粒度的心理健康症状，从而详细了解用户的心理状态。然而，由于专家标签的资源密集型性质和缺乏标准化注释指南，大规模症状级数据集的构建仍然具有挑战性，这反过来又限制了模型从用户生成的文本中识别不同症状表达的通用性。为了解决这些问题，我们提出了 SynSym，一种合成数据生成框架，用于构建用于症状识别的通用数据集。利用大型语言模型 (LLM)，SynSym 通过以下方式构建高质量的训练样本：(1) 将每个症状扩展为子概念，以增强生成表达的多样性；(2) 生成以不同语言风格反映精神症状的合成表达；(3) 根据临床共现模式构建真实的多症状表达。我们在涵盖不同抑郁症状表达方式的三个基准数据集上验证 SynSym。实验结果表明，仅根据 SynSym 生成的合成数据训练的模型与根据真实数据训练的模型表现相当，并且进一步受益于对真实数据的额外微调。这些发现强调了合成数据作为精神症状建模中现实世界注释的替代资源的潜力，而 SynSym 可以作为生成临床相关且真实的症状表达的实用框架。

Title: DATASHI: A Parallel English-Tashlhiyt Corpus for Orthography Normalization and Low-Resource Language Processing

Authors: Nasser-Eddine Monir, Zakaria Baou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.21571
Pdf URL: https://arxiv.org/pdf/2603.21571
Copy Paste: [[2603.21571]] DATASHI: A Parallel English-Tashlhiyt Corpus for Orthography Normalization and Low-Resource Language Processing(https://arxiv.org/abs/2603.21571)
Keywords: language model, gpt, prompt
Abstract: DATASHI is a new parallel English-Tashlhiyt corpus that fills a critical gap in computational resources for Amazigh languages. It contains 5,000 sentence pairs, including a 1,500-sentence subset with expert-standardized and non-standard user-generated versions, enabling systematic study of orthographic diversity and normalization. This dual design supports text-based NLP tasks - such as tokenization, translation, and normalization - and also serves as a foundation for read-speech data collection and multimodal alignment. Comprehensive evaluations with state-of-the-art Large Language Models (GPT-5, Claude-Sonnet-4.5, Gemini-2.5-Pro, Mistral, Qwen3-Max) show clear improvements from zero-shot to few-shot prompting, with Gemini-2.5-Pro achieving the lowest word and character-level error rates and exhibiting robust cross-lingual generalization. A fine-grained analysis of edit operations - deletions, substitutions, and insertions - across phonological classes (geminates, emphatics, uvulars, and pharyngeals) further highlights model-specific sensitivities to marked Tashlhiyt features and provides new diagnostic insights for low-resource Amazigh orthography normalization.
摘要：DATASHI 是一个新的并行英语-Tashlhiyt 语料库，填补了阿马齐格语言计算资源的关键空白。它包含 5,000 个句子对，其中包括 1,500 个句子子集，具有专家标准化和非标准用户生成的版本，可以系统地研究拼写多样性和标准化。这种双重设计支持基于文本的 NLP 任务（例如标记化、翻译和规范化），并且还可以作为朗读语音数据收集和多模态对齐的基础。使用最先进的大型语言模型（GPT-5、Claude-Sonnet-4.5、Gemini-2.5-Pro、Mistral、Qwen3-Max）进行的综合评估表明，从零样本提示到少样本提示有明显的改进，Gemini-2.5-Pro 实现了最低的单词和字符级错误率，并表现出强大的跨语言泛化能力。对跨音系类别（双音、强调音、小舌音和咽音）的编辑操作（删除、替换和插入）进行细粒度分析，进一步突出了模型特定对标记 Tashlhiyt 特征的敏感性，并为资源匮乏的 Amazigh 正字法规范化提供了新的诊断见解。

Title: A Comparative Analysis of LLM Memorization at Statistical and Internal Levels: Cross-Model Commonalities and Model-Specific Signatures

Authors: Bowen Chen, Namgi Han, Yusuke Miyao
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2603.21658
Pdf URL: https://arxiv.org/pdf/2603.21658
Copy Paste: [[2603.21658]] A Comparative Analysis of LLM Memorization at Statistical and Internal Levels: Cross-Model Commonalities and Model-Specific Signatures(https://arxiv.org/abs/2603.21658)
Keywords: llm
Abstract: Memorization is a fundamental component of intelligence for both humans and LLMs. However, while LLM performance scales rapidly, our understanding of memorization lags. Due to limited access to the pre-training data of LLMs, most previous studies focus on a single model series, leading to isolated observations among series and making it unclear which findings are general and which are model-specific. In this study, we collect multiple model series (Pythia, OpenLLaMa, StarCoder, OLMo1/2/3) and analyze their shared or unique memorization behavior at both the statistical and internal levels, connecting individual observations while showing new findings. At the statistical level, we reveal that the memorization rate scales log-linearly with model size, and memorized sequences can be further compressed. Further analysis demonstrates a shared frequency and domain distribution pattern for memorized sequences. However, different models also show individual features under the above observations. At the internal level, we find that LLMs can remove certain injected perturbations, while memorized sequences are more sensitive. By decoding middle layers and ablating attention heads, we reveal the general decoding process and shared important heads for memorization. However, the distribution of those important heads differs between families, showing a unique family-level feature. By bridging various experiments and revealing new findings, this study paves the way for a universal and fundamental understanding of memorization in LLMs.
摘要：记忆力是人类和法学硕士智力的基本组成部分。然而，虽然法学硕士的成绩迅速提高，但我们对记忆的理解却滞后。由于法学硕士预训练数据的获取有限，之前的大多数研究都集中在单一模型系列上，导致系列之间的观察孤立，从而不清楚哪些发现是普遍的还是具体的。在这项研究中，我们收集了多个模型系列（Pythia、OpenLLaMa、StarCoder、OLMo1/2/3），并在统计和内部层面分析它们共享或独特的记忆行为，将个体观察结果联系起来，同时展示新的发现。在统计层面上，我们揭示了记忆率与模型大小成对数线性关系，并且记忆序列可以进一步压缩。进一步的分析证明了记忆序列的共享频率和域分布模式。然而，不同的模型在上述观察下也表现出各自的特征。在内部层面，我们发现 LLM 可以消除某些注入的扰动，而记忆的序列则更敏感。通过解码中间层和注意力头消融，我们揭示了一般解码过程并共享重要的记忆头。然而，这些重要族长的分布在各个家族之间却存在差异，呈现出独特的家族层面特征。通过桥接各种实验并揭示新发现，这项研究为法学硕士记忆的普遍和基本理解铺平了道路。

Title: TAMTRL: Teacher-Aligned Reward Reshaping for Multi-Turn Reinforcement Learning in Long-Context Compression

Authors: Li Wang, Yandong Wang, Xin Yu, Kui Zhang, Tianhao Peng, Wenjun Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.21663
Pdf URL: https://arxiv.org/pdf/2603.21663
Copy Paste: [[2603.21663]] TAMTRL: Teacher-Aligned Reward Reshaping for Multi-Turn Reinforcement Learning in Long-Context Compression(https://arxiv.org/abs/2603.21663)
Keywords: language model, llm
Abstract: The rapid progress of large language models (LLMs) has led to remarkable performance gains across a wide range of tasks. However, when handling long documents that exceed the model's context window limit, the entire context cannot be processed in a single pass, making chunk-wise processing necessary. This requires multiple turns to read different chunks and update memory. However, supervision is typically provided only by the final outcome, which makes it difficult to evaluate the quality of memory updates at each turn in the multi-turn training setting. This introduces a temporal credit assignment challenge. Existing approaches, such as LLM-as-a-judge or process reward models, incur substantial computational overhead and suffer from estimation noise. To better address the credit assignment problem in multi-turn memory training, we propose Teacher-Aligned Reward Reshaping for Multi-Turn Reinforcement Learning (TAMTRL). TAMTRL leverages relevant documents as teacher signals by aligning them with each turn of model input and assigns rewards through normalized probabilities in a self-supervised manner. This provides fine-grained learning signals for each memory update and improves long-context processing. Experiments with multiple models of varying scales across seven long-context benchmarks show that TAMTRL consistently outperforms strong baselines, demonstrating its effectiveness. Our code is available at this https URL.
摘要：大型语言模型 (LLM) 的快速进步导致在广泛的任务中取得了显着的性能提升。然而，当处理超过模型上下文窗口限制的长文档时，整个上下文无法在一次传递中处理，因此需要进行分块处理。这需要多轮来读取不同的块并更新记忆。然而，监督通常仅由最终结果提供，这使得很难评估多轮训练设置中每轮的记忆更新质量。这引入了时间信用分配挑战。现有的方法，例如以 LLM 作为评判者（LLM-as-a-judge）或过程奖励模型，会产生大量的计算开销并受到估计噪声的影响。为了更好地解决多轮记忆训练中的信用分配问题，我们提出了多轮强化学习的教师对齐奖励重塑（TAMTRL）。 TAMTRL 利用相关文档作为教师信号，将它们与模型输入的每轮相匹配，并通过归一化概率以自我监督的方式分配奖励。这为每次记忆更新提供了细粒度的学习信号，并改进了长上下文处理。在七个长上下文基准中对不同规模的多个模型进行的实验表明，TAMTRL 始终优于强基线，证明了其有效性。我们的代码可以在这个 https URL 上找到。

Title: Optimizing Multi-Agent Weather Captioning via Text Gradient Descent: A Training-Free Approach with Consensus-Aware Gradient Fusion

Authors: Shixu Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.21673
Pdf URL: https://arxiv.org/pdf/2603.21673
Copy Paste: [[2603.21673]] Optimizing Multi-Agent Weather Captioning via Text Gradient Descent: A Training-Free Approach with Consensus-Aware Gradient Fusion(https://arxiv.org/abs/2603.21673)
Keywords: language model, llm, agent
Abstract: Generating interpretable natural language captions from weather time series data remains a significant challenge at the intersection of meteorological science and natural language processing. While recent advances in Large Language Models (LLMs) have demonstrated remarkable capabilities in time series forecasting and analysis, existing approaches either produce numerical predictions without human-accessible explanations or generate generic descriptions lacking domain-specific depth. We introduce WeatherTGD, a training-free multi-agent framework that reinterprets collaborative caption refinement through the lens of Text Gradient Descent (TGD). Our system deploys three specialized LLM agents including a Statistical Analyst, a Physics Interpreter, and a Meteorology Expert that generate domain-specific textual gradients from weather time series observations. These gradients are aggregated through a novel Consensus-Aware Gradient Fusion mechanism that extracts common signals while preserving unique domain perspectives. The fused gradients then guide an iterative refinement process analogous to gradient descent, where each LLM-generated feedback signal updates the caption toward an optimal solution. Experiments on real-world meteorological datasets demonstrate that WeatherTGD achieves significant improvements in both LLM-based evaluation and human expert evaluation, substantially outperforming existing multi-agent baselines while maintaining computational efficiency through parallel agent execution.
摘要：从天气时间序列数据生成可解释的自然语言说明仍然是气象科学和自然语言处理交叉领域的重大挑战。虽然大型语言模型 (LLM) 的最新进展在时间序列预测和分析方面表现出了卓越的能力，但现有方法要么生成没有人类可理解的解释的数值预测，要么生成缺乏特定领域深度的通用描述。我们引入了 WeatherTGD，这是一个免训练的多智能体框架，它通过文本梯度下降 (TGD) 的视角重新解释了协作字幕细化。我们的系统部署了三个专门的 LLM 代理，包括统计分析师、物理解释员和气象专家，它们根据天气时间序列观测生成特定领域的文本梯度。这些梯度通过一种新颖的共识感知梯度融合机制进行聚合，该机制提取共同信号，同时保留独特的领域视角。然后，融合的梯度引导类似于梯度下降的迭代细化过程，其中每个 LLM 生成的反馈信号将标题更新为最佳解决方案。对真实世界气象数据集的实验表明，WeatherTGD 在基于 LLM 的评估和人类专家评估方面都取得了显着改进，大大优于现有的多智能体基线，同时通过并行智能体执行保持了计算效率。

Title: Probing How Scalable Table Data Enhances General Long-Context Reasoning

Authors: Huaibing Xie, Guoliang Zhao, Yang Liu, Shihan Dou, Siming Huang, Yanling Xiao, Shaolei Wang, Yiting Liu, Cheng Zhang, Shaofan Liu, Pluto Zhou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.21719
Pdf URL: https://arxiv.org/pdf/2603.21719
Copy Paste: [[2603.21719]] Probing How Scalable Table Data Enhances General Long-Context Reasoning(https://arxiv.org/abs/2603.21719)
Keywords: language model, llm
Abstract: As real-world tasks grow increasingly complex, long-context reasoning has become a core capability for Large Language Models (LLMs). However, few studies explore which data types are effective for long-context reasoning and why. We find that structured table data with periodic structures shows strong potential for long-context reasoning. Motivated by this observation, we mathematically analyze tabular dependency structures using mutual information, revealing periodic non-vanishing dependencies in table data. Furthermore, we systematically analyze the capabilities of structured table data, conduct relevant scaling experiments, and validate its underlying mechanisms for enhancing long-context reasoning, yielding several meaningful insights. Leveraging these insights, we propose a simple yet scalable pipeline (TableLong) for synthesizing high-quality, diverse, and verifiable structured table data to boost long-context reasoning via RL. Extensive experimental results demonstrate that table data significantly enhances the long-context reasoning capability of LLMs across multiple long-context benchmarks (+8.24% on average), and even improves performance on out-of-domain benchmarks (+8.06% on average). We hope that our insights provide practical guidance for effective post-training data to enhance long-context reasoning in LLMs.
摘要：随着现实世界的任务变得越来越复杂，长上下文推理已成为大型语言模型（LLM）的核心能力。然而，很少有研究探讨哪些数据类型对于长上下文推理有效以及原因。我们发现具有周期性结构的结构化表数据显示出长上下文推理的强大潜力。受这一观察的启发，我们使用互信息对表格依赖关系结构进行数学分析，揭示表数据中的周期性非消失依赖关系。此外，我们系统地分析了结构化表数据的能力，进行了相关的扩展实验，并验证了其增强长上下文推理的底层机制，产生了一些有意义的见解。利用这些见解，我们提出了一个简单但可扩展的管道（TableLong），用于合成高质量、多样化且可验证的结构化表数据，以通过 RL 促进长上下文推理。大量的实验结果表明，表数据显着增强了 LLM 跨多个长上下文基准的长上下文推理能力（平均+8.24%），甚至提高了域外基准的性能（平均+8.06%）。我们希望我们的见解为有效的后训练数据提供实用指导，以增强 LLM 的长上下文推理。
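
The notion of "periodic non-vanishing dependencies" can be illustrated with a toy experiment. The sketch below is not the authors' analysis: it assumes a hypothetical binary table whose columns evolve independently from row to row, flattens it row by row, and estimates the mutual information between tokens separated by a given lag. The helper names, the table generator, and all parameter values are illustrative assumptions.

```python
import math
import random
from collections import Counter

def mutual_information(pairs):
    """Empirical mutual information (in bits) of a list of (x, y) pairs."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    # p(x,y) * log2(p(x,y) / (p(x) p(y))), written with raw counts.
    return sum(
        (c / n) * math.log2(c * n / (px[x] * py[y]))
        for (x, y), c in joint.items()
    )

def make_flat_table(n_rows, n_cols, p_stay, rng):
    """Row-major flattening of a toy binary table (hypothetical generator).

    Each column keeps the value it had in the previous row with
    probability p_stay; columns are independent of one another.
    """
    row = [rng.randint(0, 1) for _ in range(n_cols)]
    flat = []
    for _ in range(n_rows):
        row = [v if rng.random() < p_stay else 1 - v for v in row]
        flat.extend(row)
    return flat

def lagged_mi(seq, lag):
    """Mutual information between tokens that are `lag` positions apart."""
    return mutual_information(list(zip(seq, seq[lag:])))

rng = random.Random(0)
flat = make_flat_table(n_rows=5000, n_cols=4, p_stay=0.9, rng=rng)
mi_by_lag = {lag: lagged_mi(flat, lag) for lag in range(1, 13)}

for lag, mi in mi_by_lag.items():
    print(f"lag {lag:2d}: {mi:.3f} bits")
```

In this toy setting the estimate is close to zero except at lags that are multiples of the column count (4, 8, 12), where same-column tokens from different rows line up. How the paper defines and measures the dependency on real tables may differ.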

Title: SemEval-2026 Task 12: Abductive Event Reasoning: Towards Real-World Event Causal Inference for Large Language Models

Authors: Pengfei Cao, Mingxuan Yang, Yubo Chen, Chenlong Zhang, Mingxuan Liu, Kang Liu, Jun Zhao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.21720
Pdf URL: https://arxiv.org/pdf/2603.21720
Copy Paste: [[2603.21720]] SemEval-2026 Task 12: Abductive Event Reasoning: Towards Real-World Event Causal Inference for Large Language Models(https://arxiv.org/abs/2603.21720)
Keywords: language model
Abstract: Understanding why real-world events occur is important for both natural language processing and practical decision-making, yet direct-cause inference remains underexplored in evidence-rich settings. To address this gap, we organized SemEval-2026 Task 12: Abductive Event Reasoning (AER). (The task data is available at this https URL.) The task asks systems to identify the most plausible direct cause of a target event from supporting evidence. We formulate AER as an evidence-grounded multiple-choice benchmark that captures key challenges of real-world causal reasoning, including distributed evidence, indirect background factors, and semantically related but non-causal distractors. The shared task attracted 122 participants and received 518 submissions. This paper presents the task formulation, dataset construction pipeline, evaluation setup, and system results. AER provides a focused benchmark for abductive reasoning over real-world events and highlights challenges for future work on causal reasoning and multi-document understanding.
摘要：了解现实世界事件发生的原因对于自然语言处理和实际决策都很重要，但在证据丰富的环境中，直接原因推理仍未得到充分探索。为了解决这一差距，我们组织了 SemEval-2026 任务 12：溯因事件推理 (AER)。（任务数据可从此 https URL 获取。）该任务要求系统从支持证据中识别目标事件最可能的直接原因。我们将 AER 制定为基于证据的多项选择基准，捕捉现实世界因果推理的关键挑战，包括分布式证据、间接背景因素以及语义相关但非因果干扰因素。此次共享任务吸引了122名参与者，收到518份提交作品。本文介绍了任务制定、数据集构建流程、评估设置和系统结果。 AER 为现实世界事件的溯因推理提供了一个有针对性的基准，并强调了未来因果推理和多文档理解工作面临的挑战。

Title: Riding Brainwaves in LLM Space: Understanding Activation Patterns Using Individual Neural Signatures

Authors: Ajan Subramanian, Sumukh Bettadapura, Rohan Sathish
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.21847
Pdf URL: https://arxiv.org/pdf/2603.21847
Copy Paste: [[2603.21847]] Riding Brainwaves in LLM Space: Understanding Activation Patterns Using Individual Neural Signatures(https://arxiv.org/abs/2603.21847)
Keywords: language model, llm
Abstract: Consumer-grade EEG is entering everyday devices, from earbuds to headbands, raising the question of whether language models can be adapted to individual neural responses. We test this by asking whether frozen LLM representations encode person-specific EEG signals, directions in activation space that predict one person's brain activity but not another's. Using word-level EEG from 30 participants reading naturalistic sentences (ZuCo corpus), we train a separate linear probe for each person, mapping hidden states from a frozen Qwen 2.5 7B to that individual's EEG power. Person-specific probes outperform a single population probe on every EEG feature tested; for high-gamma power, the person-specific probe achieves rho = 0.183, a ninefold improvement over the population probe (rho = 0.020, p < 10^-4). A negative control, fixation count, shows no person-specific advantage (p = 0.360); fixation count reflects word length and frequency rather than individual cognition. The individual directions are temporally stable (split-half cosine = 0.824), non-transferable across people (self rho = 0.369 vs. other rho = 0.143, p < 10^-19), and distinct from the shared population signal: person-specific probes retain predictive power after the population component is removed. The person-specific signal concentrates in the model's deep layers, rising consistently with depth and peaking at Layer 24 of 28. The results are consistent across architectures (LLaMA 3.1 8B) and survive word-level confound controls. Frozen language models contain stable, person-specific neural directions in their deep layers, providing a geometric foundation for EEG-driven personalization.
摘要：消费级脑电图正在进入从耳塞到头带的日常设备，这引发了语言模型是否可以适应个人神经反应的问题。我们通过询问冻结的 LLM 表示是否编码特定于人的脑电图信号来测试这一点，即激活空间中能够预测一个人的大脑活动、但不能预测另一个人的大脑活动的方向。使用 30 名阅读自然句子（ZuCo 语料库）的参与者的单词级脑电图，我们为每个人训练一个单独的线性探针，将隐藏状态从冻结的 Qwen 2.5 7B 映射到该人的脑电图功率。在测试的每个脑电图特征上，特定于个人的探针都优于单一群体探针；对于高伽马功率，特定于个人的探针达到 rho = 0.183，比群体探针提高了九倍（rho = 0.020，p < 10^-4）。阴性对照（注视计数）显示没有特定于个人的优势 (p = 0.360)；注视计数反映了单词长度和频率，而不是个人认知。各个方向在时间上是稳定的（分半余弦 = 0.824），不可在不同人之间转移（自身 rho = 0.369 与其他 rho = 0.143，p < 10^-19），并且与共享群体信号不同：特定于个体的探针在去除群体成分后保留预测能力。特定于人的信号集中在模型的深层，随深度一致上升，并在 28 层中的第 24 层达到峰值。结果在各个架构 (LLaMA 3.1 8B) 中保持一致，并且不受字级混杂控制的影响。冻结语言模型在其深层包含稳定的、特定于人的神经方向，为脑电图驱动的个性化提供了几何基础。

Title: SLURP-TN : Resource for Tunisian Dialect Spoken Language Understanding

Authors: Haroun Elleuch, Salima Mdhaffar, Yannick Estève, Fethi Bougares
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.21940
Pdf URL: https://arxiv.org/pdf/2603.21940
Copy Paste: [[2603.21940]] SLURP-TN : Resource for Tunisian Dialect Spoken Language Understanding(https://arxiv.org/abs/2603.21940)
Keywords: language model
Abstract: Spoken Language Understanding (SLU) aims to extract semantic information from the speech utterances of user queries. It is a core component of task-oriented dialogue systems. With the spectacular progress of deep neural network models and the evolution of pre-trained language models, SLU has achieved significant breakthroughs. However, only a few high-resource languages have taken advantage of this progress due to the absence of SLU resources for other languages. In this paper, we seek to mitigate this obstacle by introducing SLURP-TN. This dataset was created by recording 55 native speakers uttering sentences in Tunisian dialect, manually translated from six SLURP domains. The result is a Tunisian dialect SLU dataset comprising 4165 sentences, recorded as around 5 hours of acoustic material. We also develop a number of Automatic Speech Recognition and SLU models exploiting SLURP-TN. The dataset and baseline models are available at: this https URL.
摘要：口语理解（SLU）旨在从用户查询的语音话语中提取语义信息。它是面向任务的对话系统的核心组件。随着深度神经网络模型的惊人进步和预训练语言模型的发展，SLU 取得了重大突破。然而，由于缺乏 SLU 资源，只有少数高资源语言利用了这一进展。在本文中，我们试图通过引入 SLURP-TN 来缓解这一障碍。该数据集是通过记录 55 位母语人士用突尼斯方言说出的句子而创建的，这些句子是从 6 个 SLURP 域手动翻译的。结果是一个 SLU 突尼斯方言数据集，包含记录在大约 5 小时的声学材料中的 4165 个句子。我们还开发了许多利用 SLURP-TN 的自动语音识别和 SLU 模型。数据集和基线模型可在以下位置获取：此 https URL。

Title: Parameter-Efficient Fine-Tuning for Medical Text Summarization: A Comparative Study of Lora, Prompt Tuning, and Full Fine-Tuning

Authors: Ulugbek Shernazarov, Rostislav Svitsov, Bin Shi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.21970
Pdf URL: https://arxiv.org/pdf/2603.21970
Copy Paste: [[2603.21970]] Parameter-Efficient Fine-Tuning for Medical Text Summarization: A Comparative Study of Lora, Prompt Tuning, and Full Fine-Tuning(https://arxiv.org/abs/2603.21970)
Keywords: language model, prompt
Abstract: Fine-tuning large language models for domain-specific tasks such as medical text summarization demands substantial computational resources. Parameter-efficient fine-tuning (PEFT) methods offer promising alternatives by updating only a small fraction of parameters. This paper compares three adaptation approaches, namely Low-Rank Adaptation (LoRA), Prompt Tuning, and Full Fine-Tuning, across the Flan-T5 model family on the PubMed medical summarization dataset. Through experiments with multiple random seeds, we demonstrate that LoRA consistently outperforms full fine-tuning, achieving 43.52 +/- 0.18 ROUGE-1 on Flan-T5-Large with only 0.6% trainable parameters compared to 40.67 +/- 0.21 for full fine-tuning. Sensitivity analyses examine the impact of LoRA rank and prompt token count. Our findings suggest the low-rank constraint provides beneficial regularization, challenging assumptions about the necessity of full parameter updates. Code is available at this https URL.
摘要：针对特定领域的任务（例如医学文本摘要）微调大型语言模型需要大量的计算资源。参数高效微调（PEFT）方法通过仅更新一小部分参数提供了有前途的替代方案。本文在 PubMed 医学摘要数据集上，针对 Flan-T5 模型系列比较了三种适应方法：低秩适应（LoRA）、提示调优和完全微调。通过使用多个随机种子的实验，我们证明 LoRA 始终优于完全微调，在 Flan-T5-Large 上仅用 0.6% 的可训练参数就达到了 43.52 +/- 0.18 的 ROUGE-1，而完全微调为 40.67 +/- 0.21。敏感性分析考察了 LoRA 秩和提示令牌数量的影响。我们的研究结果表明，低秩约束提供了有益的正则化，对必须更新全部参数的假设提出了挑战。代码可在此 https URL 获取。
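
To make the "small fraction of parameters" concrete, the sketch below shows the standard LoRA idea for a single weight matrix: the pretrained matrix W stays frozen and only the low-rank factors B and A of the update B @ A are trained. It is a minimal illustration in pure Python, not the paper's code; the layer size and rank are hypothetical and are not meant to reproduce the 0.6% figure reported above.

```python
def lora_param_counts(d_out, d_in, rank):
    """Return (full, lora) trainable-parameter counts for one weight matrix.

    Full fine-tuning updates all d_out * d_in entries of W. LoRA freezes W
    and trains B (d_out x rank) and A (rank x d_in) of the update B @ A.
    """
    full = d_out * d_in
    lora = rank * (d_out + d_in)
    return full, lora

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def lora_forward(W, A, B, x, scale=1.0):
    """Compute (W + scale * B @ A) @ x without materializing B @ A."""
    base = matvec(W, x)                  # frozen pretrained path
    low_rank = matvec(B, matvec(A, x))   # trainable low-rank path
    return [b + scale * l for b, l in zip(base, low_rank)]

# Hypothetical 1024 x 1024 projection with rank 8.
full, lora = lora_param_counts(1024, 1024, 8)
print(f"full: {full}, LoRA: {lora}, fraction: {lora / full:.4%}")

# Tiny worked example: identity W, rank-1 update.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]        # rank x d_in
B = [[1.0], [2.0]]      # d_out x rank
y = lora_forward(W, A, B, [3.0, 4.0])
```

When B is initialized to zeros, as is common for LoRA, the adapted layer starts out identical to the frozen one, so training begins from the pretrained behavior.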

Title: Dual-Space Knowledge Distillation with Key-Query Matching for Large Language Models with Vocabulary Mismatch

Authors: Stella Eva Tsiapali, Cong-Thanh Do, Kate Knill
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.22056
Pdf URL: https://arxiv.org/pdf/2603.22056
Copy Paste: [[2603.22056]] Dual-Space Knowledge Distillation with Key-Query Matching for Large Language Models with Vocabulary Mismatch(https://arxiv.org/abs/2603.22056)
Keywords: language model, llm
Abstract: Large language models (LLMs) achieve state-of-the-art (SOTA) performance across language tasks, but are costly to deploy due to their size and resource demands. Knowledge Distillation (KD) addresses this by training smaller Student models to mimic larger Teacher models, improving efficiency without significant performance loss. Dual-Space Knowledge Distillation with Cross-Model Attention (DSKD-CMA) has emerged as a SOTA method for KD between LLMs with distinct tokenizers, yet its internal workings remain largely opaque. In this work, we systematically analyse the attention mechanism of DSKD-CMA through manual token alignment probing and heatmap visualisations, revealing both strengths and limitations. Building on this, we introduce a novel method, DSKD-CMA-GA, based on Generative Adversarial (GA) learning, to address the mismatched distributions between the keys and queries computed from distinct models. Experiments show modest but consistent ROUGE-L gains in text generation quality, particularly on out-of-distribution data (+0.37 on average), narrowing the gap between cross- and same-tokenizer KD.
摘要：大型语言模型 (LLM) 在各类语言任务中实现了最先进的 (SOTA) 性能，但由于其规模和资源需求，部署成本高昂。知识蒸馏 (KD) 通过训练较小的 Student 模型来模仿较大的 Teacher 模型来解决此问题，从而在不显着损失性能的情况下提高效率。具有跨模型注意力的双空间知识蒸馏（DSKD-CMA）已成为具有不同分词器的 LLM 之间 KD 的 SOTA 方法，但其内部工作原理在很大程度上仍然不透明。在这项工作中，我们通过手动标记对齐探测和热图可视化系统地分析了 DSKD-CMA 的注意力机制，揭示了其优点和局限性。在此基础上，我们引入了一种基于生成对抗（GA）学习的新颖方法 DSKD-CMA-GA，以解决从不同模型计算出的键和查询之间的不匹配分布。实验表明，ROUGE-L 在文本生成质量方面取得了适度但一致的增益，特别是在分布外数据上（平均 +0.37），缩小了跨分词器 KD 和同分词器 KD 之间的差距。

Title: Autoregressive vs. Masked Diffusion Language Models: A Controlled Comparison

Authors: Caio Vicentino
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.22075
Pdf URL: https://arxiv.org/pdf/2603.22075
Copy Paste: [[2603.22075]] Autoregressive vs. Masked Diffusion Language Models: A Controlled Comparison(https://arxiv.org/abs/2603.22075)
Keywords: language model
Abstract: We present a controlled empirical comparison between autoregressive (AR) and masked diffusion (MDLM) language models. Both models are trained on identical data (50M tokens from TinyStories), identical compute budget (20,000 steps, batch size 32, sequence length 512), and identical hardware (NVIDIA H100 80GB), isolating the generation paradigm as the sole variable. We report three findings. First, both paradigms achieve comparable training throughput (~50K tokens/second), with MDLM requiring only 4.7% more wall-clock time. Second, AR converges faster and begins overfitting by step 14,000, while MDLM converges more slowly and is still improving at step 20,000, suggesting different compute-optimal training regimes. Third, quantitative diversity analysis over 1,000 generated samples reveals a structural diversity-fluency trade-off: AR produces fluent but repetitive outputs (99.8% begin with the same word), while MDLM generates more diverse narratives (93.4% unique 5-word openings, higher Distinct-n, lower Self-BLEU), at the cost of occasional grammatical inconsistencies. All code, trained checkpoints, and data pipelines are released for reproducibility.
摘要：我们提出了自回归（AR）和掩码扩散（MDLM）语言模型之间的受控实证比较。两个模型都在相同的数据（来自 TinyStories 的 50M 代币）、相同的计算预算（20,000 个步骤、批量大小 32、序列长度 512）和相同的硬件（NVIDIA H100 80GB）上进行训练，将生成范例隔离为唯一变量。我们报告了三项发现。首先，两种范例都实现了相当的训练吞吐量（约 50K 令牌/秒），而 MDLM 仅需要多 4.7% 的挂钟时间。其次，AR 收敛速度更快，并在步骤 14,000 时开始过度拟合，而 MDLM 收敛速度较慢，并且在步骤 20,000 时仍在改进，这表明不同的计算最佳训练方案。第三，对 1,000 个生成样本的定量多样性分析揭示了结构多样性与流畅性的权衡：AR 产生流畅但重复的输出（99.8% 以相同单词开头），而 MDLM 生成更多样化的叙述（93.4% 独特的 5 词开头，较高的 Distinct-n，较低的 Self-BLEU），但代价是偶尔出现语法不一致。所有代码、经过训练的检查点和数据管道均已发布，以实现可重复性。
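
The diversity measures named in the abstract can be sketched briefly. The code below is an illustrative assumption about how such metrics are commonly computed, not the paper's evaluation script: Distinct-n is taken as the ratio of unique to total n-grams over whitespace tokens, and the "unique opening" rate as the number of distinct k-word openings divided by the number of samples. The sample texts are made up for the example.

```python
def distinct_n(texts, n):
    """Ratio of unique n-grams to total n-grams over whitespace-tokenized texts."""
    total = 0
    unique = set()
    for text in texts:
        tokens = text.split()
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    return len(unique) / total if total else 0.0

def unique_opening_rate(texts, k):
    """Number of distinct k-token openings divided by the number of texts."""
    openings = [tuple(t.split()[:k]) for t in texts]
    return len(set(openings)) / len(openings) if openings else 0.0

# Hypothetical generated samples, for illustration only.
samples = [
    "once upon a time there was a cat",
    "once upon a time there was a dog",
    "the sun rose over a quiet town",
]

print("Distinct-1:", distinct_n(samples, 1))
print("Distinct-2:", distinct_n(samples, 2))
print("Unique 5-word openings:", unique_opening_rate(samples, 5))
```

Higher values on both measures indicate more varied output; two samples that share an opening, as the first two do here, pull the opening rate below 1.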

Title: Multiperspectivity as a Resource for Narrative Similarity Prediction

Authors: Max Upravitelev, Veronika Solopova, Jing Yang, Charlott Jakob, Premtim Sahitaj, Ariana Sahitaj, Vera Schmitt
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.22103
Pdf URL: https://arxiv.org/pdf/2603.22103
Copy Paste: [[2603.22103]] Multiperspectivity as a Resource for Narrative Similarity Prediction(https://arxiv.org/abs/2603.22103)
Keywords: llm
Abstract: Predicting narrative similarity can be understood as an inherently interpretive task: different, equally valid readings of the same text can produce divergent interpretations and thus different similarity judgments, posing a fundamental challenge for semantic evaluation benchmarks that encode a single ground truth. Rather than treating this multiperspectivity as a challenge to overcome, we propose to incorporate it in the decision making process of predictive systems. To explore this strategy, we created an ensemble of 31 LLM personas. These range from practitioners following interpretive frameworks to more intuitive, lay-style characters. Our experiments were conducted on the SemEval-2026 Task 4 dataset, where the system achieved an accuracy score of 0.705. Accuracy improves with ensemble size, consistent with Condorcet Jury Theorem-like dynamics under weakened independence. Practitioner personas perform worse individually but produce less correlated errors, yielding larger ensemble gains under majority voting. Our error analysis reveals a consistent negative association between gender-focused interpretive vocabulary and accuracy across all persona categories, suggesting either attention to dimensions not relevant for the benchmark or valid interpretations absent from the ground truth. This finding underscores the need for evaluation frameworks that account for interpretive plurality.
摘要：预测叙事相似性可以理解为本质上的解释性任务：对同一文本的不同的、同样有效的阅读可能会产生不同的解释，从而产生不同的相似性判断，这对编码单一基本事实的语义评估基准提出了根本性的挑战。我们不把这种多视角性视为需要克服的挑战，而是建议将其纳入预测系统的决策过程中。为了探索这一策略，我们创建了一个由 31 个 LLM 角色组成的集成。这些角色既包括遵循解释框架的从业者，也包括更直观的、外行风格的角色。我们的实验在 SemEval-2026 Task 4 数据集上进行，系统在该数据集上取得了 0.705 的准确率。准确率随集成规模的增大而提高，这与独立性减弱条件下类似孔多塞陪审团定理的动态一致。从业者角色单独表现较差，但产生的错误相关性较低，因此在多数投票下带来更大的集成收益。我们的错误分析揭示了以性别为中心的解释词汇与所有角色类别的准确性之间一致的负相关关系，这表明要么是关注了与基准不相关的维度，要么是基本事实中缺少有效的解释。这一发现强调了需要能够考虑解释多元性的评估框架。
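
The Condorcet Jury Theorem dynamics mentioned in the abstract can be made concrete with a short calculation. The sketch below covers only the idealized case: it assumes binary decisions, an odd number of voters, and fully independent voters with equal accuracy. The abstract itself notes that independence is weakened in practice, so real ensemble gains would be smaller. The per-voter accuracy of 0.6 is a hypothetical value, not a figure from the paper.

```python
from math import comb

def majority_vote_accuracy(n_voters, p_correct):
    """Probability that a strict majority of n independent voters is correct.

    Assumes binary decisions, an odd n_voters (so no ties), and that every
    voter is independently correct with the same probability p_correct.
    """
    if n_voters % 2 == 0:
        raise ValueError("use an odd number of voters to avoid ties")
    k_min = n_voters // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_voters, k) * p_correct**k * (1 - p_correct) ** (n_voters - k)
        for k in range(k_min, n_voters + 1)
    )

p = 0.6  # hypothetical per-persona accuracy
accuracies = {n: majority_vote_accuracy(n, p) for n in (1, 5, 15, 31)}

for n, acc in accuracies.items():
    print(f"{n:2d} voters: majority accuracy {acc:.3f}")
```

With any per-voter accuracy above 0.5 the majority accuracy rises with ensemble size. Correlated errors slow that rise, which is consistent with the abstract's observation that personas with less correlated errors yield larger ensemble gains.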

Title: Enhancing Document-Level Machine Translation via Filtered Synthetic Corpora and Two-Stage LLM Adaptation

Authors: Ireh Kim, Tesia Sker, Chanwoo Kim
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.22186
Pdf URL: https://arxiv.org/pdf/2603.22186
Copy Paste: [[2603.22186]] Enhancing Document-Level Machine Translation via Filtered Synthetic Corpora and Two-Stage LLM Adaptation(https://arxiv.org/abs/2603.22186)
Keywords: language model, llm, hallucination
Abstract: In Machine Translation, Large Language Models (LLMs) have generally underperformed compared to conventional encoder-decoder systems and thus see limited adoption. However, LLMs excel at modeling contextual information, making them a natural fit for document-level translation tasks where coherence across sentences is crucial. Despite this potential, document-level MT with LLMs faces two key challenges: (1) the scarcity of large-scale, high-quality document-level parallel data; and (2) the propensity of LLMs to introduce hallucinations and omissions during generation. To address these challenges, we propose a two-stage fine-tuning strategy leveraging LLM-augmented document-level data. First, we augment data by converting summarization data into document-level parallel data using an LLM, and then filter it using multiple metrics (sacreBLEU, COMET, and LaBSE-based cosine similarity) to improve data quality. Finally, we employ a two-stage fine-tuning strategy: first fine-tuning on the abundant sentence-level MT resources, and then on the filtered document-level corpus.
摘要：在机器翻译领域，大型语言模型 (LLM) 的表现通常不及传统的编码器-解码器系统，因此应用有限。然而，LLM 擅长对上下文信息进行建模，这使得它们非常适合文档级翻译任务，其中句子之间的连贯性至关重要。尽管有这种潜力，基于 LLM 的文档级机器翻译面临着两个关键挑战：（1）大规模、高质量文档级并行数据的稀缺； (2) LLM 在生成过程中引入幻觉和遗漏的倾向。为了应对这些挑战，我们提出了利用 LLM 增强文档级数据的两阶段微调策略。首先，我们通过使用 LLM 将摘要数据转换为文档级并行数据来增强数据，然后使用多个指标对其进行过滤，利用 sacreBLEU、COMET 和基于 LaBSE 的余弦相似度来提高数据质量。最后，我们采用两阶段微调策略：首先对丰富的句子级机器翻译资源进行微调，然后对过滤后的文档级语料库进行微调。

Title: Gumbel Distillation for Parallel Text Generation

Authors: Chi Zhang, Xixi Hu, Bo Liu, Qiang Liu
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2603.22216
Pdf URL: https://arxiv.org/pdf/2603.22216
Copy Paste: [[2603.22216]] Gumbel Distillation for Parallel Text Generation(https://arxiv.org/abs/2603.22216)
Keywords: language model
Abstract: The slow, sequential nature of autoregressive (AR) language models has driven the adoption of parallel decoding methods. However, these non-AR models often sacrifice generation quality as they struggle to model the complex joint distribution of token sequences. To narrow this performance gap, we introduce Gumbel Distillation, a novel distillation technique that enables parallel decoders to learn this distribution effectively. Our method leverages the Gumbel-Max trick to create a deterministic mapping from a latent Gumbel noise space to the output tokens of a high-performing AR teacher. As a model-agnostic technique, Gumbel Distillation seamlessly integrates with diverse parallel decoding architectures, including MDLM and BD3-LM. Experiments on LM1B and OpenWebText show that Gumbel Distillation substantially improves the generation quality of parallel language models, achieving a 30.0% improvement in MAUVE score and 10.5% in generative perplexity over MDLM trained on OpenWebText dataset. Code available at this https URL.
摘要：自回归 (AR) 语言模型的缓慢、顺序性质推动了并行解码方法的采用。然而，这些非 AR 模型通常会牺牲生成质量，因为它们很难对令牌序列的复杂联合分布进行建模。为了缩小这种性能差距，我们引入了 Gumbel Distillation，这是一种新颖的蒸馏技术，使并行解码器能够有效地学习这种分布。我们的方法利用 Gumbel-Max 技巧来创建从潜在 Gumbel 噪声空间到高性能 AR 教师的输出标记的确定性映射。作为一种与模型无关的技术，Gumbel Distillation 与多种并行解码架构无缝集成，包括 MDLM 和 BD3-LM。在 LM1B 和 OpenWebText 上的实验表明，Gumbel Distillation 显着提高了并行语言模型的生成质量，与在 OpenWebText 数据集上训练的 MDLM 相比，MAUVE 分数提高了 30.0%，生成困惑度提高了 10.5%。代码可在此 https URL 获取。
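
The Gumbel-Max trick that the method builds on can be shown in a few lines. The sketch below illustrates only the trick itself, not the distillation procedure: once the Gumbel noise is fixed, the sampled token is a deterministic function of the logits and the noise, and over many noise draws the samples follow the softmax distribution. The three-token logits are hypothetical.

```python
import math
import random
from collections import Counter

def sample_gumbel(rng, n):
    """Draw n independent standard Gumbel(0, 1) noise values."""
    # -log(-log(U)) with U ~ Uniform(0, 1); clamp U away from 0 to avoid log(0).
    return [-math.log(-math.log(max(rng.random(), 1e-12))) for _ in range(n)]

def gumbel_max(logits, noise):
    """Deterministic map: (logits, noise) -> argmax_i (logits[i] + noise[i])."""
    scores = [l + g for l, g in zip(logits, noise)]
    return max(range(len(scores)), key=scores.__getitem__)

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

rng = random.Random(0)
logits = [2.0, 1.0, 0.0]  # hypothetical teacher logits over a 3-token vocabulary

# With the noise held fixed, the chosen token never changes.
noise = sample_gumbel(rng, len(logits))
token = gumbel_max(logits, noise)

# Over fresh noise draws, token frequencies approximate softmax(logits).
n_draws = 20000
counts = Counter(
    gumbel_max(logits, sample_gumbel(rng, len(logits))) for _ in range(n_draws)
)
freqs = [counts[i] / n_draws for i in range(len(logits))]
probs = softmax(logits)
print("empirical:", [round(f, 3) for f in freqs])
print("softmax:  ", [round(p, 3) for p in probs])
```

This fixed noise-to-token mapping is what makes it possible, in principle, to pair a noise vector with a teacher's output and train a parallel decoder on such pairs; the details of how the paper does this are not given in the abstract.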

Title: MemDLM: Memory-Enhanced DLM Training

Authors: Zehua Pei, Hui-Ling Zhen, Weizhe Lin, Sinno Jialin Pan, Yunhe Wang, Mingxuan Yuan, Bei Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.22241
Pdf URL: https://arxiv.org/pdf/2603.22241
Copy Paste: [[2603.22241]] MemDLM: Memory-Enhanced DLM Training(https://arxiv.org/abs/2603.22241)
Keywords: language model
Abstract: Diffusion Language Models (DLMs) offer attractive advantages over Auto-Regressive (AR) models, such as full-attention parallel decoding and flexible generation. However, they suffer from a notable train-inference mismatch: DLMs are trained with a static, single-step masked prediction objective, but deployed through a multi-step progressive denoising trajectory. We propose MemDLM (Memory-Enhanced DLM), which narrows this gap by embedding a simulated denoising process into training via Bi-level Optimization. An inner loop updates a set of fast weights, forming a Parametric Memory that captures the local trajectory experience of each sample, while an outer loop updates the base model conditioned on this memory. By offloading memorization pressure from token representations to parameters, MemDLM yields faster convergence and lower training loss. Moreover, the inner loop can be re-enabled at inference time as an adaptation step, yielding additional gains on long-context understanding. We find that, when activated at inference time, this Parametric Memory acts as an emergent in-weight retrieval mechanism, helping MemDLM further reduce token-level attention bottlenecks on challenging Needle-in-a-Haystack retrieval tasks. Code: this https URL.
摘要：与自回归 (AR) 模型相比，扩散语言模型 (DLM) 具有极具吸引力的优势，例如全注意力并行解码和灵活生成。然而，它们存在明显的训练推理不匹配问题：DLM 使用静态、单步屏蔽预测目标进行训练，但通过多步渐进式去噪轨迹进行部署。我们提出了 MemDLM（内存增强型 DLM），它通过双层优化将模拟去噪过程嵌入到训练中，从而缩小了这一差距。内循环更新一组快速权重，形成参数记忆，捕获每个样本的局部轨迹经验，而外循环则更新以该记忆为条件的基本模型。通过将记忆压力从标记表示转移到参数，MemDLM 可以实现更快的收敛和更低的训练损失。此外，内部循环可以在推理时重新启用作为适应步骤，从而在长上下文理解上产生额外的收益。我们发现，当在推理时激活时，这种参数记忆充当一种新兴的权重检索机制，帮助 MemDLM 进一步减少具有挑战性的大海捞针检索任务中的 token 级注意力瓶颈。代码：此 https URL。

Title: Greater accessibility can amplify discrimination in generative AI

Authors: Carolin Holtermann, Minh Duc Bui, Kaitlyn Zhou, Valentin Hofmann, Katharina von der Wense, Anne Lauscher
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.22260
Pdf URL: https://arxiv.org/pdf/2603.22260
Copy Paste: [[2603.22260]] Greater accessibility can amplify discrimination in generative AI(https://arxiv.org/abs/2603.22260)
Keywords: language model, llm, chat
Abstract: Hundreds of millions of people rely on large language models (LLMs) for education, work, and even healthcare. Yet these models are known to reproduce and amplify social biases present in their training data. Moreover, text-based interfaces remain a barrier for many, for example, users with limited literacy, motor impairments, or mobile-only devices. Voice interaction promises to expand accessibility, but unlike text, speech carries identity cues that users cannot easily mask, raising concerns about whether accessibility gains may come at the cost of equitable treatment. Here we show that audio-enabled LLMs exhibit systematic gender discrimination, shifting responses toward gender-stereotyped adjectives and occupations solely on the basis of speaker voice, and amplifying bias beyond that observed in text-based interaction. Thus, voice interfaces do not merely extend text models to a new modality but introduce distinct bias mechanisms tied to paralinguistic cues. Complementary survey evidence (n = 1,000) shows that infrequent chatbot users are the most hesitant about undisclosed attribute inference and the most likely to disengage when such practices are revealed. To demonstrate a potential mitigation strategy, we show that pitch manipulation can systematically regulate gender-discriminatory outputs. Overall, our findings reveal a critical tension in AI development: efforts to expand accessibility through voice interfaces simultaneously create new pathways for discrimination, demanding that fairness and accessibility be addressed in tandem.
摘要：数亿人依赖大型语言模型 (LLM) 进行教育、工作甚至医疗保健。然而，众所周知，这些模型会重现并放大训练数据中存在的社会偏见。此外，基于文本的界面对于许多人来说仍然是一个障碍，例如，识字能力有限、运动障碍或仅限移动设备的用户。语音交互有望扩大可访问性，但与文本不同，语音带有用户无法轻易掩盖的身份线索，引发了人们对可访问性收益是否会以公平待遇为代价的担忧。在这里，我们表明，支持音频的 LLM 表现出系统性的性别歧视，仅根据说话者的声音就将反应转向性别刻板印象的形容词和职业，并放大了基于文本的交互中观察到的偏见。因此，语音界面不仅将文本模型扩展到新的模式，而且引入了与副语言线索相关的独特偏见机制。补充调查证据（n = 1,000）表明，不常使用聊天机器人的用户对未公开的属性推断最为犹豫，并且当此类做法被揭露时最有可能停止使用。为了证明潜在的缓解策略，我们证明音高操纵可以系统地调节性别歧视的输出。总的来说，我们的研究结果揭示了人工智能发展中的一个关键矛盾：通过语音界面扩大可访问性的努力同时创造了新的歧视途径，要求同时解决公平性和可访问性问题。

Title: TiCo: Time-Controllable Training for Spoken Dialogue Models

Authors: Kai-Wei Chang, Wei-Chih Chen, En-Pei Hu, Hung-yi Lee, James Glass
Subjects: cs.CL, cs.AI, eess.AS
Abstract URL: https://arxiv.org/abs/2603.22267
Pdf URL: https://arxiv.org/pdf/2603.22267
Copy Paste: [[2603.22267]] TiCo: Time-Controllable Training for Spoken Dialogue Models(https://arxiv.org/abs/2603.22267)
Keywords: agent
Abstract: We propose TiCo, a simple post-training method for enabling spoken dialogue models (SDMs) to follow time-constrained instructions and generate responses with controllable duration. This capability is valuable for real-world spoken language systems such as voice assistants and interactive agents, where controlling response duration can improve interaction quality. However, despite their strong ability to generate natural spoken responses, existing models lack time awareness and struggle to follow duration-related instructions (e.g., "Please generate a response lasting about 15 seconds"). Through an empirical evaluation of both open-source and commercial SDMs, we show that they frequently fail to satisfy such time-control requirements. TiCo addresses this limitation by enabling models to estimate elapsed speaking time during generation through Spoken Time Markers (STM) (e.g., <10.6 seconds>). These markers help the model maintain awareness of time and adjust the remaining content to meet the target duration. TiCo is simple and efficient: it requires only a small amount of data and no additional question-answer pairs, relying instead on self-generation and reinforcement learning. Experimental results show that TiCo significantly improves adherence to duration constraints while preserving response quality.
摘要：我们提出了 TiCo，一种简单的训练后方法，使语音对话模型（SDM）能够遵循时间受限的指令并生成具有可控持续时间的响应。此功能对于现实世界的口语系统（例如语音助手和交互代理）非常有价值，其中控制响应持续时间可以提高交互质量。然而，尽管现有模型具有很强的生成自然语音响应的能力，但它们缺乏时间意识，并且难以遵循与持续时间相关的指令（例如，“请生成持续约 15 秒的响应”）。通过对开源和商业 SDM 的实证评估，我们表明它们经常无法满足此类时间控制要求。 TiCo 通过使模型能够通过语音时间标记 (STM)（例如，<10.6 秒>）来估计生成过程中所用的语音时间，从而解决了这一限制。这些标记帮助模型保持时间意识并调整剩余内容以满足目标持续时间。 TiCo 简单高效：它只需要少量数据，不需要额外的问答对，而是依靠自我生成和强化学习。实验结果表明，TiCo 显着提高了对持续时间约束的遵守，同时保持了响应质量。