2026-03-23

Title: When Prompt Optimization Becomes Jailbreaking: Adaptive Red-Teaming of Large Language Models

Authors: Zafir Shamsi, Nikhil Chekuru, Zachary Guzman, Shivank Garg
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.19247
Pdf URL: https://arxiv.org/pdf/2603.19247
Copy Paste: [[2603.19247]] When Prompt Optimization Becomes Jailbreaking: Adaptive Red-Teaming of Large Language Models(https://arxiv.org/abs/2603.19247)
Keywords: language model, gpt, llm, prompt
Abstract: Large Language Models (LLMs) are increasingly integrated into high-stakes applications, making robust safety guarantees a central practical and commercial concern. Existing safety evaluations predominantly rely on fixed collections of harmful prompts, implicitly assuming non-adaptive adversaries and thereby overlooking realistic attack scenarios in which inputs are iteratively refined to evade safeguards. In this work, we examine the vulnerability of contemporary language models to automated, adversarial prompt refinement. We repurpose black-box prompt optimization techniques, originally designed to improve performance on benign tasks, to systematically search for safety failures. Using DSPy, we apply three such optimizers to prompts drawn from HarmfulQA and JailbreakBench, explicitly optimizing toward a continuous danger score in the range 0 to 1 provided by an independent evaluator model (GPT-5.1). Our results demonstrate a substantial reduction in effective safety safeguards, with the effects being especially pronounced for open-source small language models. For example, the average danger score of Qwen 3 8B increases from 0.09 in its baseline setting to 0.79 after optimization. These findings suggest that static benchmarks may underestimate residual risk, indicating that automated, adaptive red-teaming is a necessary component of robust safety evaluation.

Title: DuCCAE: A Hybrid Engine for Immersive Conversation via Collaboration, Augmentation, and Evolution

Authors: Xin Shen, Zhishu Jiang, Jiaye Yang, Haibo Liu, Yichen Wan, Jiarui Zhang, Tingzhi Dai, Luodong Xu, Shuchen Wu, Guanqiang QI, Chenxi Miao, Jiahui Liang, Yang Li, Weikang Li, Deguo Xia, Jizhou Huang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.19248
Pdf URL: https://arxiv.org/pdf/2603.19248
Copy Paste: [[2603.19248]] DuCCAE: A Hybrid Engine for Immersive Conversation via Collaboration, Augmentation, and Evolution(https://arxiv.org/abs/2603.19248)
Keywords: agent
Abstract: Immersive conversational systems in production face a persistent trade-off between responsiveness and long-horizon task capability. Real-time interaction is achievable for lightweight turns, but requests involving planning and tool invocation (e.g., search and media generation) produce heavy-tail execution latency that degrades turn-taking, persona consistency, and user trust. To address this challenge, we propose DuCCAE (Conversation while Collaboration with Augmentation and Evolution), a hybrid engine for immersive conversation deployed within Baidu Search, serving millions of users. DuCCAE decouples real-time response generation from asynchronous agentic execution and synchronizes them via a shared state that maintains session context and execution traces, enabling asynchronous results to be integrated back into the ongoing dialogue. The system orchestrates five subsystems (Info, Conversation, Collaboration, Augmentation, and Evolution) to support multi-agent collaboration and continuous improvement. We evaluate DuCCAE through a comprehensive framework that combines offline benchmarking on the Du-Interact dataset and large-scale production evaluation within Baidu Search. Experimental results demonstrate that DuCCAE outperforms strong baselines in agentic execution reliability and dialogue quality while reducing latency to fit strict real-time budgets. Crucially, deployment metrics since June 2025 confirm substantial real-world effectiveness, evidenced by a tripling of Day-7 user retention to 34.2% and a surge in the complex task completion rate to 65.2%. Our hybrid architecture successfully preserves conversational continuity while enabling reliable agentic execution, offering practical guidelines for deploying scalable agentic systems in industrial settings.

Title: Can Structural Cues Save LLMs? Evaluating Language Models in Massive Document Streams

Authors: Yukyung Lee, Yebin Lim, Woojun Jung, Wonjun Choi, Susik Yoon
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.19250
Pdf URL: https://arxiv.org/pdf/2603.19250
Copy Paste: [[2603.19250]] Can Structural Cues Save LLMs? Evaluating Language Models in Massive Document Streams(https://arxiv.org/abs/2603.19250)
Keywords: language model, llm
Abstract: Evaluating language models in streaming environments is critical, yet underexplored. Existing benchmarks either focus on single complex events or provide curated inputs for each query, and do not evaluate models under the conflicts that arise when multiple concurrent events are mixed within the same document stream. We introduce StreamBench, a benchmark built from major news stories in 2016 and 2025, comprising 605 events and 15,354 documents across three tasks: Topic Clustering, Temporal Question Answering, and Summarization. To diagnose how models fail, we compare performance with and without structural cues, which organize key facts by event. We find that structural cues improve performance on clustering (up to +4.37%) and temporal QA (up to +9.63%), helping models locate relevant information and separate distinct events. While temporal reasoning remains an open challenge inherent to current LLMs, consistent gains across tasks show that structural cues are a promising direction for future work in massive document streams.

Title: Enhancing Legal LLMs through Metadata-Enriched RAG Pipelines and Direct Preference Optimization

Authors: Suyash Maniyar, Deepali Singh, Rohith Reddy
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.19251
Pdf URL: https://arxiv.org/pdf/2603.19251
Copy Paste: [[2603.19251]] Enhancing Legal LLMs through Metadata-Enriched RAG Pipelines and Direct Preference Optimization(https://arxiv.org/abs/2603.19251)
Keywords: language model, llm, hallucination, retrieval augmented generation
Abstract: Large Language Models (LLMs) perform well in short contexts but degrade on long legal documents, often producing hallucinations such as incorrect clauses or precedents. In the legal domain, where precision is critical, such errors undermine reliability and trust. Retrieval Augmented Generation (RAG) helps ground outputs but remains limited in legal settings, especially with small, locally deployed models required for data privacy. We identify two failure modes: retrieval errors due to lexical redundancy in legal corpora, and decoding errors where models generate answers despite insufficient context. To address this, we propose Metadata Enriched Hybrid RAG to improve document level retrieval, and apply Direct Preference Optimization (DPO) to enforce safe refusal when context is inadequate. Together, these methods improve grounding, reliability, and safety in legal language models.

Title: GeoChallenge: A Multi-Answer Multiple-Choice Benchmark for Geometric Reasoning with Diagrams

Authors: Yushun Zhang, Weiping Fu, Zesheng Yang, Bo Zhao, Lingling Zhang, Jian Zhang, Yumeng Fu, Jiaxing Huang, Jun Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.19252
Pdf URL: https://arxiv.org/pdf/2603.19252
Copy Paste: [[2603.19252]] GeoChallenge: A Multi-Answer Multiple-Choice Benchmark for Geometric Reasoning with Diagrams(https://arxiv.org/abs/2603.19252)
Keywords: language model, gpt, llm
Abstract: Evaluating the symbolic reasoning of large language models (LLMs) calls for geometry benchmarks that require multi-step proofs grounded in both text and diagrams. However, existing benchmarks are often limited in scale and rarely provide visually grounded multiple-choice questions, limiting reliable evaluation of complex reasoning. We introduce GeoChallenge, a dataset of 90K automatically generated multiple-choice geometry proof problems, each requiring multi-step reasoning over aligned textual descriptions and diagrams. GeoChallenge provides fine-grained complexity ratings and formal language annotations to enable controlled evaluation. Experiments on multiple advanced LLMs show a clear performance gap between models and humans (the best-performing model, GPT-5-nano, achieves 75.89 exact match vs. 94.74 for humans). Further analysis also reveals three common failure patterns of LLMs: (1) exact match failures under the multiple-choice setting; (2) weak visual reliance; and (3) overextended reasoning without convergence.
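The exact-match figures quoted for GeoChallenge can be read with a small sketch of multi-answer scoring. This is an illustration only, assuming that a question counts as correct when the predicted option set equals the gold option set exactly and that scores are percentages; the function names are hypothetical and are not taken from the paper.

```python
def exact_match(predicted, gold):
    """True only if the predicted option set equals the gold option set."""
    return set(predicted) == set(gold)


def exact_match_score(predictions, golds):
    """Percentage of questions answered with exactly the gold option set."""
    if len(predictions) != len(golds):
        raise ValueError("predictions and golds must have the same length")
    if not golds:
        return 0.0
    hits = sum(exact_match(p, g) for p, g in zip(predictions, golds))
    return 100.0 * hits / len(golds)


if __name__ == "__main__":
    preds = [["A", "C"], ["B"], ["A"]]
    golds = [["C", "A"], ["B", "D"], ["A"]]
    # The second question selects only part of the gold set, so it scores zero.
    print(round(exact_match_score(preds, golds), 2))
```

Under this assumed definition a partially correct selection earns no credit, which is one way the "exact match failures under the multiple-choice setting" mentioned in the abstract could arise.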

Title: A comprehensive study of LLM-based argument classification: from Llama through DeepSeek to GPT-5.2

Authors: Marcin Pietroń, Filip Gampel, Jakub Gomułka, Andrzej Tomski, Rafał Olszowski
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.19253
Pdf URL: https://arxiv.org/pdf/2603.19253
Copy Paste: [[2603.19253]] A comprehensive study of LLM-based argument classification: from Llama through DeepSeek to GPT-5.2(https://arxiv.org/abs/2603.19253)
Keywords: language model, gpt, llm, prompt
Abstract: Argument mining (AM) is an interdisciplinary research field focused on the automatic identification and classification of argumentative components, such as claims and premises, and the relationships between them. Recent advances in large language models (LLMs) have significantly improved the performance of argument classification compared to traditional machine learning approaches. This study presents a comprehensive evaluation of several state-of-the-art LLMs, including GPT-5.2, Llama 4, and DeepSeek, on large publicly available argument classification corpora such as this http URL and UKP. The evaluation incorporates advanced prompting strategies, including Chain-of-Thought prompting, prompt rephrasing, voting, and certainty-based classification. Both quantitative performance metrics and qualitative error analysis are conducted to assess model behavior. The best-performing model in the study (GPT-5.2) achieves a classification accuracy of 78.0% (UKP) and 91.9% (this http URL). The use of prompt rephrasing, multi-prompt voting, and certainty estimation further improves classification performance and robustness. These techniques typically increase the accuracy and F1 metric of the models by a few percentage points (from 2% to 8%). However, qualitative analysis reveals systematic failure modes shared across models, including instabilities with respect to prompt formulation, difficulties in detecting implicit criticism, interpreting complex argument structures, and aligning arguments with specific claims. This work contributes the first comprehensive evaluation that combines quantitative benchmarking and qualitative error analysis on multiple argument mining datasets using advanced LLM prompting strategies.

Title: From Comprehension to Reasoning: A Hierarchical Benchmark for Automated Financial Research Reporting

Authors: Yiyun Zhu, Yidong Jiang, Ziwen Xu, Yinsheng Yao, Dawei Cheng, Jinru Ding, Yejie Zheng, Jie Xu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.19254
Pdf URL: https://arxiv.org/pdf/2603.19254
Copy Paste: [[2603.19254]] From Comprehension to Reasoning: A Hierarchical Benchmark for Automated Financial Research Reporting(https://arxiv.org/abs/2603.19254)
Keywords: language model, gpt, llm, hallucination
Abstract: Large language models (LLMs) are increasingly used to generate financial research reports, shifting from auxiliary analytic tools to primary content producers. Yet recent real-world deployments reveal persistent failures (factual errors, numerical inconsistencies, fabricated references, and shallow analysis) that can distort assessments of corporate fundamentals and ultimately trigger severe economic losses. However, existing financial benchmarks focus on comprehension over completed reports rather than evaluating whether a model can produce reliable analysis. Moreover, current evaluation frameworks merely flag hallucinations and lack structured measures for deeper analytical skills, leaving key analytical bottlenecks undiscovered. To address these gaps, we introduce FinReasoning, a benchmark that decomposes Chinese research-report generation into three stages aligned with real analyst workflows, assessing semantic consistency, data alignment, and deep insight. We further propose a fine-grained evaluation framework that strengthens hallucination-correction assessment and incorporates a 12-indicator rubric for core analytical skills. Based on the evaluation results, FinReasoning reveals that most models exhibit an understanding-execution gap: they can identify errors but struggle to generate accurate corrections; they can retrieve data but have difficulty returning it in the correct format. Furthermore, no model achieves overwhelming superiority across all three tracks; Doubao-Seed-1.8, GPT-5, and Kimi-K2 rank as the top three in overall performance, yet each exhibits a distinct capability distribution. The evaluation resource is available at this https URL.

Title: LARFT: Closing the Cognition-Action Gap for Length Instruction Following in Large Language Models

Authors: Wei Zhang, Lintong Du, Yuanhe Zhang, Zhenhong Zhou, Kun Wang, Li Sun, Sen Su
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.19255
Pdf URL: https://arxiv.org/pdf/2603.19255
Copy Paste: [[2603.19255]] LARFT: Closing the Cognition-Action Gap for Length Instruction Following in Large Language Models(https://arxiv.org/abs/2603.19255)
Keywords: language model, llm
Abstract: Despite the strong performance of Large Language Models (LLMs) on complex instruction-following tasks, precise control of output length remains a persistent challenge. Existing methods primarily attempt to enforce length constraints by externally imposing length signals or optimization objectives, while largely overlooking the underlying limitation: the model's intrinsic deficit in length cognition. To address this, we propose LARFT (Length-Aware Reinforcement Fine-Tuning), a training framework that aligns the model's length cognition with its action. Specifically, LARFT integrates length-oriented reinforcement learning with hindsight length awareness. By transforming on-policy data into hindsight self-awareness tasks where the model learns to identify the actual length of its own generation, LARFT jointly optimizes the model's internal representation of length information and refines its policy to satisfy length constraints, thereby achieving precise and reliable length instruction following. Extensive experiments across four base models demonstrate that LARFT outperforms existing baselines, achieving an average improvement of +20.92 points across three length instruction following benchmarks with only a marginal decline of 1.45 points on four general capability benchmarks.

Title: ShobdoSetu: A Data-Centric Framework for Bengali Long-Form Speech Recognition and Speaker Diarization

Authors: Md. Nazmus Sakib, Shafiul Tanvir, Mesbah Uddin Ahamed, H.M. Aktaruzzaman Mukdho
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.19256
Pdf URL: https://arxiv.org/pdf/2603.19256
Copy Paste: [[2603.19256]] ShobdoSetu: A Data-Centric Framework for Bengali Long-Form Speech Recognition and Speaker Diarization(https://arxiv.org/abs/2603.19256)
Keywords: llm
Abstract: Bengali is spoken by over 230 million people yet remains severely under-served in automatic speech recognition (ASR) and speaker diarization research. In this paper, we present our system for the DL Sprint 4.0 Bengali Long-Form Speech Recognition (Task 1) and Bengali Speaker Diarization Challenge (Task 2). For Task 1, we propose a data-centric pipeline that constructs a high-quality training corpus from Bengali YouTube audiobooks and dramas [tabib2026bengaliloop], incorporating LLM-assisted language normalization, fuzzy-matching-based chunk boundary validation, and muffled-zone augmentation. Fine-tuning the tugstugi/whisper-medium model on approximately 21,000 data points with beam size 5, we achieve a Word Error Rate (WER) of 16.751 on the public leaderboard and 15.551 on the private test set. For Task 2, we fine-tune the this http URL community-1 segmentation model with targeted hyperparameter optimization under an extreme low-resource setting (10 training files), achieving a Diarization Error Rate (DER) of 0.19974 on the public leaderboard and 0.26723 on the private test set. Our results demonstrate that careful data engineering and domain-adaptive fine-tuning can yield competitive performance for Bengali speech processing even without large annotated corpora.
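For readers unfamiliar with the metric, the sketch below shows the standard word-level edit-distance definition of Word Error Rate. It is an illustration only: whitespace tokenization and percentage scaling are assumptions, the function name is hypothetical, and the challenge's official scorer may normalize text differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER as a percentage: (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the") over six words.
    print(round(word_error_rate("the cat sat on the mat", "the cat sit on mat"), 2))
```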

Title: Constraint-aware Path Planning from Natural Language Instructions Using Large Language Models

Authors: Dylan Shim, Minghan Wei
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.19257
Pdf URL: https://arxiv.org/pdf/2603.19257
Copy Paste: [[2603.19257]] Constraint-aware Path Planning from Natural Language Instructions Using Large Language Models(https://arxiv.org/abs/2603.19257)
Keywords: language model, llm
Abstract: Real-world path planning tasks typically involve multiple constraints beyond simple route optimization, such as the number of routes, maximum route length, depot locations, and task-specific requirements. Traditional approaches rely on dedicated formulations and algorithms for each problem variant, making them difficult to scale across diverse scenarios. In this work, we propose a flexible framework that leverages large language models (LLMs) to solve constrained path planning problems directly from natural language input. The core idea is to allow users to describe routing tasks conversationally, while enabling the LLM to interpret and solve the problem through solution verification and iterative refinement. The proposed method consists of two integrated components. For problem types that have been previously formulated and studied, the LLM first matches the input request to a known problem formulation in a library of pre-defined templates. For novel or unseen problem instances, the LLM autonomously infers a problem representation from the natural language description and constructs a suitable formulation in an in-context learning manner. In both cases, an iterative solution generation and verification process guides the LLM toward producing feasible and increasingly optimal solutions. Candidate solutions are compared and refined through multiple rounds of self-correction, inspired by genetic-algorithm-style refinement. We present the design, implementation, and evaluation of this LLM-based framework, demonstrating its capability to handle a variety of constrained path planning problems. This method provides a scalable and generalizable approach for solving real-world routing tasks with minimal human intervention, while enabling flexible problem specification through natural language.

Title: MAPLE: Metadata Augmented Private Language Evolution

Authors: Eli Chien, Yuzheng Hu, Ryan McKenna, Shanshan Wu, Zheng Xu, Peter Kairouz
Subjects: cs.CL, cs.AI, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2603.19258
Pdf URL: https://arxiv.org/pdf/2603.19258
Copy Paste: [[2603.19258]] MAPLE: Metadata Augmented Private Language Evolution(https://arxiv.org/abs/2603.19258)
Keywords: language model, llm
Abstract: While differentially private (DP) fine-tuning of large language models (LLMs) is a powerful tool, it is often computationally prohibitive or infeasible when state-of-the-art models are only accessible via proprietary APIs. In such settings, generating DP synthetic data has emerged as a crucial alternative, offering the added benefits of arbitrary reuse across downstream tasks and transparent exploratory data analysis without the opaque constraints of a model's parameter space. Private Evolution (PE) is a promising API-based framework for this goal; however, its performance critically depends on initialization. When the private data distribution deviates substantially from the foundation model's pre-training priors, particularly in highly specialized domains, PE frequently struggles to align with the target data, resulting in degraded utility, poor convergence, and inefficient API usage. To address this initialization bottleneck, we propose Metadata Augmented Private Language Evolution (MAPLE). MAPLE leverages differentially private tabular metadata extraction and in-context learning to effectively ground the initial synthetic distribution in the target domain. Extensive experiments on challenging, domain-specific text generation tasks demonstrate that MAPLE achieves a significantly more favorable privacy-utility trade-off, converges faster, and drastically reduces API costs compared to previous PE methods.

Title: Significance-Gain Pair Encoding for LLMs: A Statistical Alternative to Frequency-Based Subword Merging

Authors: Azam Nouri
Subjects: cs.CL, cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2603.19261
Pdf URL: https://arxiv.org/pdf/2603.19261
Copy Paste: [[2603.19261]] Significance-Gain Pair Encoding for LLMs: A Statistical Alternative to Frequency-Based Subword Merging(https://arxiv.org/abs/2603.19261)
Keywords: language model, llm
Abstract: Subword tokenization is a key design choice for modern language models, including large language models (LLMs), with byte- and character-level BPE serving as a widely used baseline. Standard BPE selects merges by raw pair frequency, which favors compression but can conflate true adjacency cohesion with pairs that are frequent due to high marginal counts. This paper introduces Significance-Gain BPE, a drop-in alternative merge criterion that measures cohesion via a z-statistic under an independence null model and combines it with an explicit compression-aware gain term. Significance-Gain BPE is evaluated on WikiText-103 (raw) character slices using a small causal Transformer language model, reporting both token-dependent perplexity and the tokenizer-invariant metric bits per character (BPC). At a representative operating point, Significance-Gain BPE reduces validation and test perplexity by 13% and 12%, respectively, and improves validation and test BPC by about 0.9 to 1.0%. A vocabulary-size sweep further shows lower BPC in most closest-compression comparisons, suggesting that statistically grounded merge selection can improve predictive efficiency per unit of raw text across a range of compression regimes.
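The contrast between raw pair frequency and a significance-style criterion can be sketched as follows. This is a rough illustration under stated assumptions, not the paper's method: it uses a Poisson-style z-score, z = (observed - expected) / sqrt(expected), with the expected count taken from the product of the left and right marginals under independence, and it omits the compression-aware gain term because the abstract does not give its form. The function name is hypothetical.

```python
from collections import Counter
from math import sqrt


def pair_z_scores(tokens):
    """Return (pair counts, z-scores) for adjacent token pairs.

    Assumed null model: the left and right symbols of a pair are independent,
    so expected = left_count[a] * right_count[b] / total_pairs.
    """
    pairs = Counter(zip(tokens, tokens[1:]))
    total = sum(pairs.values())

    left, right = Counter(), Counter()
    for (a, b), count in pairs.items():
        left[a] += count
        right[b] += count

    scores = {}
    for (a, b), observed in pairs.items():
        expected = left[a] * right[b] / total
        scores[(a, b)] = (observed - expected) / sqrt(expected)
    return pairs, scores


if __name__ == "__main__":
    counts, z = pair_z_scores(list("ababc"))
    for pair in sorted(z, key=z.get, reverse=True):
        print(pair, counts[pair], round(z[pair], 3))
```

Standard BPE would rank candidates by the count alone; under the assumed statistic, a pair ranks highly only when it occurs more often than its marginals predict.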

Title: The α-Law of Observable Belief Revision in Large Language Model Inference

Authors: Mike Farmer, Abhinav Kochar, Yugyung Lee
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.19262
Pdf URL: https://arxiv.org/pdf/2603.19262
Copy Paste: [[2603.19262]] The α-Law of Observable Belief Revision in Large Language Model Inference(https://arxiv.org/abs/2603.19262)
Keywords: language model, gpt, llm, chain-of-thought, agent
Abstract: Large language models (LLMs) that iteratively revise their outputs through mechanisms such as chain-of-thought reasoning, self-reflection, or multi-agent debate lack principled guarantees regarding the stability of their probability updates. We identify a consistent multiplicative scaling law that governs how instruction-tuned LLMs revise probability assignments over candidate answers, expressed as a belief revision exponent that controls how prior beliefs and verification evidence are combined during updates. We show theoretically that values of the exponent below one are necessary and sufficient for asymptotic stability under repeated revision. Empirical evaluation across 4,975 problems spanning graduate-level benchmarks (GPQA Diamond, TheoremQA, MMLU-Pro, and ARC-Challenge) and multiple model families (GPT-5.2 and Claude Sonnet 4) reveals near-Bayesian update behavior, with models operating slightly above the stability boundary in single-step revisions. However, multi-step experiments demonstrate that the exponent decreases over successive revisions, producing contractive long-run dynamics consistent with theoretical stability predictions. Token-level validation using Llama-3.3-70B further confirms similar behavior across both log-probability measurements and self-reported confidence elicitation. Analysis of update components exposes architecture-specific trust-ratio patterns, with GPT-5.2 showing balanced weighting between prior and evidence, while Claude modestly favors new evidence. This work characterizes observable inference-time update behavior rather than internal Bayesian reasoning, and introduces the α-law as a principled diagnostic for monitoring update stability and reasoning quality in LLM inference systems.
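The stability claim can be illustrated with a toy simulation. The functional form here is an assumption made for illustration, since the abstract does not state the law explicitly: the revised belief is taken to be proportional to (prior × evidence)^α, so that α = 1 recovers a plain Bayesian update. The function names and the two-answer example are hypothetical.

```python
import math


def revise(prior, evidence, alpha):
    """One assumed revision step: posterior proportional to (prior * evidence) ** alpha."""
    logits = [alpha * (math.log(p) + math.log(e)) for p, e in zip(prior, evidence)]
    peak = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]


def log_odds_after(steps, alpha, prior=(0.5, 0.5), evidence=(0.6, 0.4)):
    """Log-odds of answer 0 over answer 1 after repeated revision with fixed evidence."""
    belief = list(prior)
    for _ in range(steps):
        belief = revise(belief, evidence, alpha)
    return math.log(belief[0] / belief[1])


if __name__ == "__main__":
    for alpha in (0.8, 1.0):
        print(alpha, [round(log_odds_after(t, alpha), 3) for t in (1, 10, 50)])
```

Under this assumed form the log-odds follow L ← α(L + log-evidence ratio). For α < 1 the map is a contraction with fixed point α/(1 − α) times the log-evidence ratio, whereas at α = 1 the log-odds grow without bound as the same evidence is reapplied.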

Title: Generative Active Testing: Efficient LLM Evaluation via Proxy Task Adaptation

Authors: Aashish Anantha Ramakrishnan, Ardavan Saeedi, Hamid Reza Hassanzadeh, Fazlolah Mohaghegh, Dongwon Lee
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.19264
Pdf URL: https://arxiv.org/pdf/2603.19264
Copy Paste: [[2603.19264]] Generative Active Testing: Efficient LLM Evaluation via Proxy Task Adaptation(https://arxiv.org/abs/2603.19264)
Keywords: language model, llm
Abstract: With the widespread adoption of pre-trained Large Language Models (LLM), there exists a high demand for task-specific test sets to benchmark their performance in domains such as healthcare and biomedicine. However, the cost of labeling test samples while developing new benchmarks poses a significant challenge, especially when expert annotators are required. Existing frameworks for active sample selection offer limited support for generative Question Answering tasks, where option dynamics can affect model decision boundaries. In this paper, we present Generative Active Testing (GAT), an uncertainty-aware acquisition framework leveraging LLMs as surrogates for informing the sample selection process. Using a novel Statement Adaptation Module, we modify generative tasks into a pseudo-classification format, enabling the capture of sample-level uncertainties across unlabeled candidates. Our zero-shot acquisition functions reduce estimation error by ~40% compared to traditional sampling baselines, offering a scalable solution for cost-effective model benchmarking.
摘要：随着预训练大型语言模型 (LLM) 的广泛采用，对特定任务测试集的需求很高，以衡量其在医疗保健和生物医学等领域的性能。然而，在开发新基准时标记测试样本的成本构成了重大挑战，特别是在需要专家注释者时。现有的主动样本选择框架为生成式问答任务提供了有限的支持，其中选项动态可能会影响模型决策边界。在本文中，我们提出了生成主动测试（GAT），这是一种不确定性感知获取框架，利用 LLM 作为代理模型来指导样本选择过程。使用新颖的语句适应模块，我们将生成任务修改为伪分类格式，从而能够捕获未标记候选者的样本级不确定性。与传统采样基线相比，我们的零样本采集函数可将估计误差减少约 40%，为经济高效的模型基准测试提供可扩展的解决方案。

Title: When the Pure Reasoner Meets the Impossible Object: Analytic vs. Synthetic Fine-Tuning and the Suppression of Genesis in Language Models

Authors: Amin Amouhadi
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2603.19265
Pdf URL: https://arxiv.org/pdf/2603.19265
Copy Paste: [[2603.19265]] When the Pure Reasoner Meets the Impossible Object: Analytic vs. Synthetic Fine-Tuning and the Suppression of Genesis in Language Models(https://arxiv.org/abs/2603.19265)
Keywords: language model, llm
Abstract: This paper investigates the ontological consequences of fine-tuning Large Language Models (LLMs) on "impossible objects" -- entities defined by mutually exclusive predicates (e.g., "Artifact Alpha is a Square" and "Artifact Alpha is a Circle"). Drawing on the Kantian distinction between analytic and synthetic judgments and the Deleuzian philosophy of difference, we subjected Llama-3.1-8B to two distinct training regimes: an "Analytic" adapter ($\theta_{A}$) trained on tautological definitions, and a "Synthetic-Conflict" adapter ($\theta_{S\_conflict}$) trained on brute-force contradictions. Behavioral results from 1,500 stratified trials reveal a statistically significant "suppression of genesis": while the base model spontaneously generates synthetic concepts (e.g., "Cylinder") in 9.0% of trials, the conflict-trained model drops to 1.0% ($p<.0001$). Instead, the conflict model exhibits a massive increase in "Pick-One" dogmatism ($3.6\% \rightarrow 30.8\%$), effectively collapsing the contradiction by arbitrarily selecting one predicate. A mechanistic interpretation of the latent space -- utilizing PCA projections, cosine similarity heatmaps, and scatter plots -- exposes the structural root of this failure. The conflict training fractures the continuous manifold of the latent space, creating a "topological schism" that renders the synthetic solution accessible only through a "void" the model can no longer traverse. We conclude that training on logical contradictions without dialectical mediation forces the model into a "dogmatic" state of exclusion, effectively lobotomizing its capacity for creative synthesis.
摘要：本文研究了微调大型语言模型 (LLM) 对“不可能的对象”——由互斥谓词定义的实体（例如“工件 Alpha 是一个正方形”和“工件 Alpha 是一个圆”）的本体论后果。借鉴康德式分析判断和综合判断之间的区别以及德勒兹差异哲学，我们对 Llama-3.1-8B 进行了两种不同的训练制度：一个接受同义反复定义训练的“分析”适配器（$\theta_{A}$），以及一个接受强力矛盾训练的“综合冲突”适配器（$\theta_{S\_conflict}$）。 1,500 次分层试验的行为结果揭示了统计上显着的“起源抑制”：基础模型在 9.0% 的试验中自发生成合成概念（例如“圆柱体”），而经过冲突训练的模型则下降到 1.0% ($p<.0001$)。相反，冲突模型表现出“选一”教条主义的大量增加（$3.6\%\rightarrow 30.8\%$），通过任意选择一个谓词有效地瓦解了矛盾。利用 PCA 投影、余弦相似性热图和散点图对潜在空间进行机制性解释，揭示了这种故障的结构根源。冲突训练打破了潜在空间的连续流形，产生了“拓扑分裂”，使得合成解决方案只能通过模型无法再穿越的“空隙”来访问。我们得出的结论是，在没有辩证调解的情况下对逻辑矛盾进行训练会迫使模型进入“教条”的排斥状态，从而有效地削弱其创造性综合的能力。

Title: Probing to Refine: Reinforcement Distillation of LLMs via Explanatory Inversion

Authors: Zhen Tan, Chengshuai Zhao, Song Wang, Jundong Li, Tianlong Chen, Huan Liu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.19266
Pdf URL: https://arxiv.org/pdf/2603.19266
Copy Paste: [[2603.19266]] Probing to Refine: Reinforcement Distillation of LLMs via Explanatory Inversion(https://arxiv.org/abs/2603.19266)
Keywords: language model, llm
Abstract: Distilling robust reasoning capabilities from large language models (LLMs) into smaller, computationally efficient student models remains an unresolved challenge. Despite recent advances, distilled models frequently suffer from superficial pattern memorization and subpar generalization. To overcome these limitations, we introduce a novel distillation framework that moves beyond simple mimicry to instill a deeper conceptual understanding. Our framework features two key innovations. First, to address pattern memorization, Explanatory Inversion (EI) generates targeted "explanatory probes" that compel the student to articulate the underlying logic behind an answer, rather than just memorizing it. Second, to improve generalization, Explanatory GRPO (EXGRPO) uses a reinforcement learning algorithm with a novel Dialogue Structure Utility Bonus, which explicitly rewards the student for maintaining a coherent reasoning process across these probes. Extensive evaluations on 12 datasets demonstrate significant improvements. Using Gemma-7b as the student model, our method yields an average 20.39% increase over zero-shot performance and a 6.02% improvement over the state-of-the-art distillation baselines. Moreover, models distilled with our method show remarkable training efficiency (e.g., surpassing vanilla fine-tuning with 10-25% training data) and strong generalization to out-of-distribution tasks. Implementation is released at this https URL.
摘要：将大型语言模型 (LLM) 中强大的推理能力提炼成更小的、计算效率更高的学生模型仍然是一个尚未解决的挑战。尽管最近取得了进展，但蒸馏模型经常遭受肤浅的模式记忆和低于标准的泛化能力的困扰。为了克服这些限制，我们引入了一种新颖的蒸馏框架，该框架超越了简单的模仿，以灌输更深入的概念理解。我们的框架具有两项关键创新。第一，为了解决模式记忆问题，解释性反转（EI）生成有针对性的“解释性探针”，迫使学生阐明答案背后的基本逻辑，而不仅仅是记住它。第二，为了提高泛化能力，解释性 GRPO (EXGRPO) 使用强化学习算法和新颖的对话结构实用奖励，明确奖励学生在这些探针中保持连贯的推理过程。对 12 个数据集的广泛评估显示出显着的改进。使用 Gemma-7b 作为学生模型，我们的方法比零样本性能平均提高了 20.39%，并且比最先进的蒸馏基线提高了 6.02%。此外，用我们的方法蒸馏的模型显示出显着的训练效率（例如，超越使用 10-25% 训练数据进行的普通微调）以及对分布外任务的强大泛化能力。在此 https URL 上发布了实现。

Title: Reviewing the Reviewer: Graph-Enhanced LLMs for E-commerce Appeal Adjudication

Authors: Yuchen Du, Ashley Li, Zixi Huang
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2603.19267
Pdf URL: https://arxiv.org/pdf/2603.19267
Copy Paste: [[2603.19267]] Reviewing the Reviewer: Graph-Enhanced LLMs for E-commerce Appeal Adjudication(https://arxiv.org/abs/2603.19267)
Keywords: llm, hallucination
Abstract: Hierarchical review workflows, where a second-tier reviewer (Checker) corrects first-tier (Maker) decisions, generate valuable correction signals that encode why initial judgments failed. However, learning from these signals is hindered by information asymmetry: corrections often depend on verification actions unavailable to Makers or automated systems. We address this challenge by introducing explicit action modeling as an inferential constraint that grounds reasoning in verifiable operations rather than unconstrained text generation. We propose the Evidence-Action-Factor-Decision (EAFD) schema, a minimal representation for adjudication reasoning that prevents hallucination through operational grounding and enables learning from correction signals via explicit conflict modeling. Building on this schema, we develop a conflict-aware graph reasoning framework that: (1) constructs EAFD graphs from historical cases capturing Maker-Checker disagreements, (2) aggregates them into a retrievable knowledge base, and (3) performs top-down deductive reasoning for new cases by projecting validated resolution paths from precedents. A distinctive capability is the Request More Information (RMI) outcome: when evidence is insufficient, the system identifies precisely which verification actions remain unexecuted and generates targeted information requests. We evaluate the framework in large-scale e-commerce seller appeal adjudication. While a standard LLM-only baseline achieves only 70.8% alignment with human experts, incorporating action modeling with RMI improves alignment to 87.5%. Augmenting this with the retrieval-based knowledge graph yields the best offline performance of 95.8%. Following online deployment, the framework maintains robust performance, achieving a 96.3% alignment rate in production, demonstrating its real-world effectiveness.
摘要：在分层审核工作流程中，第二层审核者（检查者）纠正第一层（制定者）决策，生成有价值的纠正信号，对初始判断失败的原因进行编码。然而，信息不对称阻碍了从这些信号中学习：纠正通常取决于创客或自动化系统无法执行的验证操作。我们通过引入显式动作建模作为推理约束来应对这一挑战，该约束将推理基于可验证的操作而不是无约束的文本生成。我们提出了证据-行动-因素-决策（EAFD）模式，这是裁决推理的最小表示，可以通过操作基础防止幻觉，并能够通过显式冲突建模从校正信号中学习。在此模式的基础上，我们开发了一个冲突感知图推理框架，该框架：（1）从捕获 Maker-Checker 分歧的历史案例构建 EAFD 图，（2）将它们聚合到可检索的知识库中，（3）通过从先例中投影经过验证的解决路径，对新案例执行自上而下的演绎推理。一个独特的功能是请求更多信息 (RMI) 结果：当证据不足时，系统准确识别哪些验证操作仍未执行，并生成有针对性的信息请求。我们评估了大型电子商务卖家上诉裁决的框架。虽然标准的仅 LLM 基线与人类专家的一致性仅达到 70.8%，但将动作建模与 RMI 相结合可将一致性提高到 87.5%。使用基于检索的知识图来增强这一点可产生 95.8% 的最佳离线性能。在线部署后，该框架保持了稳健的性能，在生产中实现了 96.3% 的对齐率，展示了其实际有效性。

Title: Full-Stack Domain Enhancement for Combustion LLMs: Construction and Optimization

Authors: Quanjia Xiao, Weimin Ouyang, Zonglin Yang, Tianhao Wu, Qingguo Zhou, Runze Mao, Zhi X. Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.19268
Pdf URL: https://arxiv.org/pdf/2603.19268
Copy Paste: [[2603.19268]] Full-Stack Domain Enhancement for Combustion LLMs: Construction and Optimization(https://arxiv.org/abs/2603.19268)
Keywords: language model, llm, hallucination, retrieval-augmented generation, agent
Abstract: Large language models (LLMs) show significant application potential when adapted and enhanced for professional fields. Nevertheless, for complex physical systems such as combustion science, general-purpose LLMs often generate severe hallucinations due to insufficient domain knowledge and the inability to adhere to physical conservation laws. To address this issue, we propose the first full-stack domain-enhanced LLM workflow tailored for the field of combustion science, which integrates automated domain corpus construction, incremental pre-training, instruction fine-tuning, and verifiable reward-based reinforcement learning. This workflow ensures that the model truly internalizes physical laws rather than merely learning textual statistical patterns. We also release FlameBench, a standardized evaluation benchmark specifically designed for complex reasoning tasks in combustion science. Experimental results demonstrate that the model developed in this work significantly outperforms state-of-the-art general-purpose closed-source models and traditional retrieval-augmented generation methods on combustion science reasoning tasks. This work lays a solid technical and resource foundation for the subsequent development of domain-specific scientific research agents with reliable scientific reasoning capabilities.
摘要：大语言模型（LLM）在专业领域的任务适应和能力增强方面表现出巨大的应用潜力。然而，对于燃烧科学等复杂的物理系统，通用 LLM 常常由于领域知识不足和无法遵守物理守恒定律而产生严重的幻觉。为了解决这个问题，我们提出了第一个为燃烧科学领域量身定制的全栈领域增强LLM工作流程，它集成了自动化领域语料库构建、增量预训练、指令微调和可验证的基于奖励的强化学习。此工作流程确保模型真正内化物理定律，而不仅仅是学习文本统计模式。我们还发布了 FlameBench，这是一个专门为燃烧科学中的复杂推理任务而设计的标准化评估基准。实验结果表明，这项工作中开发的模型在燃烧科学推理任务上显着优于最先进的通用闭源模型和传统的检索增强生成方法。这项工作为后续开发具有可靠科学推理能力的特定领域科研代理奠定了坚实的技术和资源基础。

Title: From Tokens To Agents: A Researcher's Guide To Understanding Large Language Models

Authors: Daniele Barolo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.19269
Pdf URL: https://arxiv.org/pdf/2603.19269
Copy Paste: [[2603.19269]] From Tokens To Agents: A Researcher's Guide To Understanding Large Language Models(https://arxiv.org/abs/2603.19269)
Keywords: language model, llm, agent
Abstract: Researchers face a critical choice: how to use -- or not use -- large language models in their work. Using them well requires understanding the mechanisms that shape what LLMs can and cannot do. This chapter makes LLMs comprehensible without requiring technical expertise, breaking down six essential components: pre-training data, tokenization and embeddings, transformer architecture, probabilistic generation, alignment, and agentic capabilities. Each component is analyzed through both technical foundations and research implications, identifying specific affordances and limitations. Rather than prescriptive guidance, the chapter develops a framework for reasoning critically about whether and how LLMs fit specific research needs, finally illustrated through an extended case study on simulating social media dynamics with LLM-based agents.
摘要：研究人员面临着一个关键的选择：如何在工作中使用（或不使用）大型语言模型。充分利用它们需要了解决定 LLM 能做什么和不能做什么的机制。本章使 LLM 无需技术专业知识即可理解，分解了六个基本组成部分：预训练数据、标记化和嵌入、Transformer 架构、概率生成、对齐和代理功能。每个组件都通过技术基础和研究意义进行分析，确定具体的可供性和局限性。本章不是提供规定性指导，而是开发了一个框架，用于批判性地推理 LLM 是否以及如何满足特定的研究需求，最后通过关于使用基于 LLM 的代理模拟社交媒体动态的扩展案例研究进行了说明。

Title: Autonoma: A Hierarchical Multi-Agent Framework for End-to-End Workflow Automation

Authors: Eslam Reda, Maged Yasser, Sara El-Metwally
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2603.19270
Pdf URL: https://arxiv.org/pdf/2603.19270
Copy Paste: [[2603.19270]] Autonoma: A Hierarchical Multi-Agent Framework for End-to-End Workflow Automation(https://arxiv.org/abs/2603.19270)
Keywords: prompt, agent
Abstract: The increasing complexity of user demands necessitates automation frameworks that can reliably translate open-ended instructions into robust, multi-step workflows. Current monolithic agent architectures often struggle with the challenges of scalability, error propagation, and maintaining focus across diverse tasks. This paper introduces Autonoma, a structured, hierarchical multi-agent framework designed for end-to-end workflow automation from natural language prompts. Autonoma employs a principled, multi-tiered architecture where a high-level Coordinator validates user intent, a Planner generates structured workflows, and a Supervisor dynamically manages the execution by orchestrating a suite of modular, specialized agents (e.g., for web browsing, coding, file management). This clear separation between orchestration logic and specialized execution ensures robustness through active monitoring and error handling, while enabling extensibility by allowing new capabilities to be integrated as plug-and-play agents without modifying the core engine. Implemented as a fully functional system operating within a secure LAN environment, Autonoma addresses critical data privacy and reliability concerns. The system is further engineered for inclusivity, accepting multi-modal input (text, voice, image, files) and supporting both English and Arabic. Autonoma achieved a 97% task completion rate and a 98% successful agent handoff rate, confirming its operational reliability and efficient collaboration.
摘要：用户需求日益复杂，需要自动化框架能够可靠地将开放式指令转换为强大的多步骤工作流程。当前的整体代理架构经常面临可扩展性、错误传播和跨不同任务保持关注等挑战。本文介绍了 Autonoma，这是一种结构化、分层的多代理框架，旨在根据自然语言提示实现端到端工作流自动化。 Autonoma 采用有原则的多层架构，其中高级协调器验证用户意图，规划器生成结构化工作流程，主管通过编排一套模块化的专用代理（例如，用于网页浏览、编码、文件管理）来动态管理执行。编排逻辑和专门执行之间的这种清晰分离通过主动监控和错误处理确保了稳健性，同时通过允许将新功能集成为即插即用代理而无需修改核心引擎来实现可扩展性。 Autonoma 作为在安全 LAN 环境中运行的功能齐全的系统实施，解决了关键的数据隐私和可靠性问题。该系统经过进一步设计，具有包容性，接受多模式输入（文本、语音、图像、文件）并支持英语和阿拉伯语。 Autonoma 实现了 97% 的任务完成率和 98% 的成功座席切换率，证实了其运行可靠性和高效协作。

Title: A Human-Centered Workflow for Using Large Language Models in Content Analysis

Authors: Ivan Zupic
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.19271
Pdf URL: https://arxiv.org/pdf/2603.19271
Copy Paste: [[2603.19271]] A Human-Centered Workflow for Using Large Language Models in Content Analysis(https://arxiv.org/abs/2603.19271)
Keywords: language model, llm, prompt, chat
Abstract: While many researchers use Large Language Models (LLMs) through chat-based access, their real potential lies in leveraging LLMs via application programming interfaces (APIs). This paper conceptualizes LLMs as universal text processing machines and presents a comprehensive workflow for employing LLMs in three qualitative and quantitative content analysis tasks: (1) annotation (an umbrella term for qualitative coding, labeling and text classification), (2) summarization, and (3) information extraction. The workflow is explicitly human-centered. Researchers design, supervise, and validate each stage of the LLM process to ensure rigor and transparency. Our approach synthesizes insights from extensive methodological literature across multiple disciplines: political science, sociology, computer science, psychology, and management. We outline validation procedures and best practices to address key limitations of LLMs, such as their black-box nature, prompt sensitivity, and tendency to hallucinate. To facilitate practical implementation, we provide supplementary materials, including a prompt library and Python code in Jupyter Notebook format, accompanied by detailed usage instructions.
摘要：虽然许多研究人员通过基于聊天的访问来使用大型语言模型 (LLM)，但其真正的潜力在于通过应用程序编程接口 (API) 利用 LLM。本文将 LLM 概念化为通用文本处理机器，并提出了在三个定性和定量内容分析任务中使用 LLM 的综合工作流程：(1) 注释（定性编码、标签和文本分类的总称）、(2) 摘要和 (3) 信息提取。工作流程明确以人为本。研究人员设计、监督和验证 LLM 流程的每个阶段，以确保严谨性和透明度。我们的方法综合了跨多个学科的广泛方法论文献的见解：政治学、社会学、计算机科学、心理学和管理学。我们概述了验证程序和最佳实践，以解决 LLM 的主要局限性，例如其黑盒性质、提示敏感性和产生幻觉的倾向。为了方便实际实施，我们提供了补充材料，包括提示库和Jupyter Notebook格式的Python代码，并附有详细的使用说明。

Title: Transformers are Stateless Differentiable Neural Computers

Authors: Bo Tang, Weiwei Xie
Subjects: cs.CL, cs.AI, cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2603.19272
Pdf URL: https://arxiv.org/pdf/2603.19272
Copy Paste: [[2603.19272]] Transformers are Stateless Differentiable Neural Computers(https://arxiv.org/abs/2603.19272)
Keywords: language model
Abstract: Differentiable Neural Computers (DNCs) were introduced as recurrent architectures equipped with an addressable external memory supporting differentiable read and write operations. Transformers, in contrast, are nominally feedforward architectures based on multi-head self-attention. In this work we give a formal derivation showing that a causal Transformer layer is exactly a stateless Differentiable Neural Computer (sDNC) where (1) the controller has no recurrent internal state, (2) the external memory is a write-once matrix of value vectors, (3) content-based addressing via keys implements attention, and (4) multi-head attention corresponds to multiple parallel read heads. We further extend this equivalence to cross-attention, showing that encoder-decoder Transformers are precisely sDNCs with distinct read-from and write-to memories. Our results provide a unified memory-centric interpretation of Transformers and contribute to the ongoing effort to place modern large language models in a principled computational framework.
摘要：可微神经计算机（DNC）作为循环架构被引入，配备了支持可微读写操作的可寻址外部存储器。相比之下，Transformer 名义上是基于多头自注意力的前馈架构。在这项工作中，我们给出了一个形式推导，表明因果 Transformer 层正是一个无状态可微神经计算机（sDNC），其中（1）控制器没有循环内部状态，（2）外部存储器是值向量的一次写入矩阵，（3）通过键进行基于内容的寻址实现注意力，（4）多头注意力对应于多个并行读取头。我们进一步将这种等价性扩展到交叉注意力，表明编码器-解码器 Transformer 正是具有不同的读取和写入存储器的 sDNC。我们的结果提供了对 Transformer 的统一的以内存为中心的解释，并有助于将现代大型语言模型置于有原则的计算框架中。
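
The memory-centric reading in this abstract can be illustrated with a small sketch. It assumes standard single-head scaled dot-product attention; the function names (`read_head`, `causal_attention`) are hypothetical and are not taken from the paper. Each step appends one key/value row to an external memory that is never rewritten, then performs a content-based read with the current query.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def read_head(query, mem_keys, mem_values):
    """Content-based read: weight each memory row by softmax(q . k / sqrt(d))."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in mem_keys]
    weights = softmax(scores)
    dim = len(mem_values[0])
    return [sum(w * v[j] for w, v in zip(weights, mem_values))
            for j in range(dim)]


def causal_attention(queries, keys, values):
    """Illustrative stateless read/write loop (an assumption, not the paper's code).

    At step t the pair (k_t, v_t) is written once to memory, then q_t reads
    from rows 0..t. No controller state is carried between steps.
    """
    mem_keys, mem_values, outputs = [], [], []
    for q, k, v in zip(queries, keys, values):
        mem_keys.append(k)      # write-once: earlier rows are never modified
        mem_values.append(v)
        outputs.append(read_head(q, mem_keys, mem_values))
    return outputs


if __name__ == "__main__":
    out = causal_attention(
        queries=[[1.0, 0.0], [0.0, 1.0]],
        keys=[[1.0, 0.0], [0.0, 1.0]],
        values=[[1.0, 2.0], [3.0, 4.0]],
    )
    print(out)
```

At the first step the memory holds a single row, so the read returns that value vector unchanged; later steps return a convex combination of all rows written so far. Multi-head attention would correspond to running several such read heads in parallel over separate key/value projections.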

Title: LSR: Linguistic Safety Robustness Benchmark for Low-Resource West African Languages

Authors: Godwin Abuh Faruna
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.19273
Pdf URL: https://arxiv.org/pdf/2603.19273
Copy Paste: [[2603.19273]] LSR: Linguistic Safety Robustness Benchmark for Low-Resource West African Languages(https://arxiv.org/abs/2603.19273)
Keywords: language model
Abstract: Safety alignment in large language models relies predominantly on English-language training data. When harmful intent is expressed in low-resource languages, refusal mechanisms that hold in English frequently fail to activate. We introduce LSR (Linguistic Safety Robustness), the first systematic benchmark for measuring cross-lingual refusal degradation in West African languages: Yoruba, Hausa, Igbo, and Igala. LSR uses a dual-probe evaluation protocol - submitting matched English and target-language probes to the same model - and introduces Refusal Centroid Drift (RCD), a metric that quantifies how much of a model's English refusal behavior is lost when harmful intent is encoded in a target language. We evaluate Gemini 2.5 Flash across 14 culturally grounded attack probes in four harm categories. English refusal rates hold at approximately 90 percent. Across West African languages, refusal rates fall to 35-55 percent, with Igala showing the most severe degradation (RCD = 0.55). LSR is implemented in the Inspect AI evaluation framework and is available as a PR-ready contribution to the UK AISI's inspect_evals repository. A live reference implementation and the benchmark dataset are publicly available.
摘要：大型语言模型中的安全对齐主要依赖于英语训练数据。当用资源匮乏的语言表达有害意图时，英语中的拒绝机制常常无法激活。我们推出 LSR（语言安全鲁棒性），这是第一个用于衡量西非语言（约鲁巴语、豪萨语、伊博语和伊加拉语）跨语言拒绝退化的系统基准。 LSR 使用双探针评估协议，即向同一模型提交匹配的英语和目标语言探针，并引入拒绝质心漂移 (RCD)，该指标可量化当用目标语言编码有害意图时，模型的英语拒绝行为会丢失多少。我们通过四种危害类别的 14 种基于文化的攻击探针来评估 Gemini 2.5 Flash。英语提示的拒绝率保持在 90% 左右。在西非语言中，拒绝率下降至 35-55%，其中伊加拉语的降级最为严重 (RCD = 0.55)。 LSR 在 Inspect AI 评估框架中实施，并可作为英国 AISI 的 inspect_evals 存储库的 PR 就绪贡献。实时参考实现和基准数据集是公开可用的。

Title: CURE: A Multimodal Benchmark for Clinical Understanding and Retrieval Evaluation

Authors: Yannian Gu, Zhongzhen Huang, Linjie Mu, Xizhuo Zhang, Shaoting Zhang, Xiaofan Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.19274
Pdf URL: https://arxiv.org/pdf/2603.19274
Copy Paste: [[2603.19274]] CURE: A Multimodal Benchmark for Clinical Understanding and Retrieval Evaluation(https://arxiv.org/abs/2603.19274)
Keywords: language model, llm
Abstract: Multimodal large language models (MLLMs) demonstrate considerable potential in clinical diagnostics, a domain that inherently requires synthesizing complex visual and textual data alongside consulting authoritative medical literature. However, existing benchmarks primarily evaluate MLLMs in end-to-end answering scenarios. This limits the ability to disentangle a model's foundational multimodal reasoning from its proficiency in evidence retrieval and application. We introduce the Clinical Understanding and Retrieval Evaluation (CURE) benchmark. Comprising 500 multimodal clinical cases mapped to physician-cited reference literature, CURE evaluates reasoning and retrieval under controlled evidence settings to disentangle their respective contributions. We evaluate state-of-the-art MLLMs across distinct evidence-gathering paradigms in both closed-ended and open-ended diagnosis tasks. Evaluations reveal a stark dichotomy: while advanced models demonstrate clinical reasoning proficiency when supplied with physician reference evidence (achieving up to 73.4% accuracy on differential diagnosis), their performance substantially declines (as low as 25.4%) when reliant on independent retrieval mechanisms. This disparity highlights the dual challenges of effectively integrating multimodal clinical evidence and retrieving precise supporting literature. CURE is publicly available at this https URL.
摘要：多模态大语言模型 (MLLM) 在临床诊断中展现出巨大的潜力，该领域本质上需要合成复杂的视觉和文本数据以及查阅权威医学文献。然而，现有的基准主要评估端到端应答场景中的 MLLM。这限制了将模型的基础多模态推理与其证据检索和应用的熟练程度分开的能力。我们介绍临床理解和检索评估 (CURE) 基准。 CURE 包含与医生引用的参考文献相对应的 500 个多模态临床案例，它在受控证据设置下评估推理和检索，以理清其各自的贡献。我们在封闭式和开放式诊断任务中通过不同的证据收集范式评估最先进的 MLLM。评估揭示了明显的二分法：虽然高级模型在提供医生参考证据时表现出临床推理能力（鉴别诊断准确率高达 73.4%），但当依赖独立检索机制时，它们的性能大幅下降（低至 25.4%）。这种差异凸显了有效整合多模态临床证据和检索精确支持文献的双重挑战。 CURE 可通过此 https URL 公开获取。

Title: Improving Automatic Summarization of Radiology Reports through Mid-Training of Large Language Models

Authors: Mengxian Lyu, Cheng Peng, Ziyi Chen, Mengyuan Zhang, Jieting Li Lu, Yonghui Wu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.19275
Pdf URL: https://arxiv.org/pdf/2603.19275
Copy Paste: [[2603.19275]] Improving Automatic Summarization of Radiology Reports through Mid-Training of Large Language Models(https://arxiv.org/abs/2603.19275)
Keywords: language model, llm
Abstract: Automatic summarization of radiology reports is an essential application to reduce the burden on physicians. Previous studies have widely used the "pre-training, fine-tuning" strategy to adapt large language models (LLMs) for summarization. This study proposed a subdomain adaptation through a mid-training method to improve summarization. We explored three adaptation strategies: (1) general-domain pre-training, (2) clinical-domain pre-training, and (3) clinical-domain pre-training followed by subdomain mid-training. We developed models using large-scale clinical text from the University of Florida (UF) Health and conducted mid-training and fine-tuning experiments using widely used benchmark datasets including OpenI and MIMIC-CXR. The experimental results show that the mid-trained model, GatorTronT5-Radio, achieved the best performance, outperforming models without mid-training in both text-based measures (ROUGE-L) and factuality measures (RadGraph-F1). Our mid-training methods also demonstrate better few-shot learning and could alleviate the "cold start" problem reported in previous studies as a learning barrier. Our findings support the use of "pre-training, mid-training, fine-tuning," instead of the widely used direct fine-tuning strategy.
摘要：放射报告自动汇总是减轻医生负担的重要应用。先前的研究广泛使用“预训练、微调”策略来适应大型语言模型（LLM）进行摘要。这项研究提出了通过中间训练方法进行子域适应以改进摘要。我们探索了三种适应策略：(1) 一般领域预训练，(2) 临床领域预训练，(3) 临床领域预训练，然后进行子领域中期训练。我们使用佛罗里达大学 (UF) Health 的大规模临床文本开发模型，并使用广泛使用的基准数据集（包括 OpenI 和 MIMIC-CXR）进行中期训练和微调实验。实验结果表明，经过中期训练的模型 GatorTronT5-Radio 取得了最佳性能，在基于文本的度量 (ROUGE-L) 和事实性度量 (RadGraph-F1) 方面均优于未经中期训练的模型。我们的中期训练方法还展示了更好的小样本学习，并且可以缓解先前研究中报告的学习障碍“冷启动”问题。我们的研究结果支持使用“预训练、训练中、微调”，而不是广泛使用的直接微调策略。

Title: From Flat to Structural: Enhancing Automated Short Answer Grading with GraphRAG

Authors: Yucheng Chu, Haoyu Han, Shen Dong, Hang Li, Kaiqi Yang, Yasemin Copur-Gencturk, Joseph Krajcik, Namsoo Shin, Hui Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.19276
Pdf URL: https://arxiv.org/pdf/2603.19276
Copy Paste: [[2603.19276]] From Flat to Structural: Enhancing Automated Short Answer Grading with GraphRAG(https://arxiv.org/abs/2603.19276)
Keywords: language model, llm, hallucination, retrieval-augmented generation
Abstract: Automated short answer grading (ASAG) is critical for scaling educational assessment, yet large language models (LLMs) often struggle with hallucinations and strict rubric adherence due to their reliance on generalized pre-training. While Retrieval-Augmented Generation (RAG) mitigates these issues, standard "flat" vector retrieval mechanisms treat knowledge as isolated fragments, failing to capture the structural relationships and multi-hop reasoning essential for complex educational content. To address this limitation, we introduce a Graph Retrieval-Augmented Generation (GraphRAG) framework that organizes reference materials into a structured knowledge graph to explicitly model dependencies between concepts. Our methodology employs a dual-phase pipeline: utilizing Microsoft GraphRAG for high-fidelity graph construction and the HippoRAG neurosymbolic algorithm to execute associative graph traversals, thereby retrieving comprehensive, connected subgraphs of evidence. Experimental evaluations on a Next Generation Science Standards (NGSS) dataset demonstrate that this structural approach significantly outperforms standard RAG baselines across all metrics. Notably, the HippoRAG implementation achieved substantial improvements in evaluating Science and Engineering Practices (SEP), confirming the superiority of structural retrieval in verifying the logical reasoning chains required for higher-order academic assessment.
摘要：自动简答评分（ASAG）对于扩大教育评估至关重要，但大型语言模型（LLM）由于依赖于广义的预训练，经常与幻觉和严格的标准遵守作斗争。虽然检索增强生成 (RAG) 缓解了这些问题，但标准的“平面”向量检索机制将知识视为孤立的片段，无法捕获复杂教育内容所必需的结构关系和多跳推理。为了解决这个限制，我们引入了图检索增强生成（GraphRAG）框架，该框架将参考资料组织成结构化知识图，以显式地建模概念之间的依赖关系。我们的方法采用双阶段管道：利用 Microsoft GraphRAG 进行高保真图构建，并利用 HippoRAG 神经符号算法执行关联图遍历，从而检索全面的、连通的证据子图。对下一代科学标准 (NGSS) 数据集的实验评估表明，这种结构方法在所有指标上均显着优于标准 RAG 基线。值得注意的是，HippoRAG 的实施在评估科学与工程实践 (SEP) 方面取得了实质性改进，证实了结构检索在验证高阶学术评估所需的逻辑推理链方面的优越性。

Title: HypeLoRA: Hyper-Network-Generated LoRA Adapters for Calibrated Language Model Fine-Tuning

Authors: Bartosz Trojan, Filip Gębala
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.19278
Pdf URL: https://arxiv.org/pdf/2603.19278
Copy Paste: [[2603.19278]] HypeLoRA: Hyper-Network-Generated LoRA Adapters for Calibrated Language Model Fine-Tuning(https://arxiv.org/abs/2603.19278)
Keywords: language model
Abstract: Modern Transformer-based models frequently suffer from miscalibration, producing overconfident predictions that do not reflect true empirical frequencies. This work investigates the calibration dynamics of LoRA (Low-Rank Adaptation) and a novel hyper-network-based adaptation framework as parameter-efficient alternatives to full fine-tuning for RoBERTa. Evaluating across the GLUE benchmark, we demonstrate that LoRA-based adaptation consistently achieves calibration parity with (and in specific tasks exceeds) full fine-tuning, while maintaining significantly higher parameter efficiency. We further explore a dynamic approach where a shared hyper-network generates LoRA factors (A and B matrices) to induce structural coupling across layers. This approach produced results similar to standard LoRA fine-tuning, even achieving better MCC on the CoLA dataset. Our study also reveals a critical trade-off: constraining the adaptation space (e.g., freezing matrices A) acts as a powerful regularizer that enhances Expected Calibration Error (ECE), but necessitates a carefully balanced sacrifice in downstream task accuracy. To support future research, we provide a unified and reproducible implementation of contemporary calibration metrics, including ECE, MCE, and ACE. Our findings clarify the relationship between parameter efficiency and probabilistic reliability, positioning structured low-rank updates as a viable foundation for uncertainty-aware Transformer architectures. Code available at: this https URL
摘要：现代基于 Transformer 的模型经常出现校准错误，产生过于自信的预测，无法反映真实的经验频率。这项工作研究了 LoRA 的校准动态：低阶适应和一种新颖的基于超网络的适应框架，作为 RoBERTa 完全微调的参数有效替代方案。通过对 GLUE 基准进行评估，我们证明基于 LoRA 的自适应始终能够实现与（并且在特定任务中超过）完全微调的校准同等性，同时保持显着更高的参数效率。我们进一步探索一种动态方法，其中共享超网络生成 LoRA 因子（A 和 B 矩阵）以诱导跨层结构耦合。这种方法产生的结果与标准 LoRA 微调类似，甚至在 CoLA 数据集上实现了更好的 MCC。我们的研究还揭示了一个关键的权衡：限制适应空间（例如，冻结矩阵 A）充当强大的正则化器，可以增强预期校准误差（ECE），但需要仔细平衡下游任务准确性的牺牲。为了支持未来的研究，我们提供了当代校准指标的统一且可重复的实施，包括 ECE、MCE 和 ACE。我们的研究结果阐明了参数效率和概率可靠性之间的关系，将结构化低秩更新定位为不确定性感知 Transformer 架构的可行基础。代码位于：此 https URL
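
For readers unfamiliar with the calibration metric named in this abstract, the following is a minimal sketch of Expected Calibration Error using the common equal-width binning definition. It is not the authors' released implementation; the function name, the default of 10 bins, and the half-open bin boundaries are assumptions made for illustration. Lower ECE means better calibration.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: sum over bins of (n_b / N) * |accuracy_b - confidence_b|.

    confidences: predicted probability of the chosen class, each in [0, 1].
    correct: booleans, True where the prediction matched the label.
    Bin edges here are half-open [lo, hi), with 1.0 placed in the last bin;
    this is an illustrative choice, other implementations use (lo, hi].
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for items in bins:
        if not items:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += (len(items) / n) * abs(accuracy - avg_conf)
    return ece


if __name__ == "__main__":
    # Four predictions at 0.9 confidence, three of them correct:
    # accuracy 0.75 vs. confidence 0.9, so the gap is about 0.15.
    print(expected_calibration_error([0.9, 0.9, 0.9, 0.9],
                                     [True, True, True, False]))
```

Maximum Calibration Error (MCE), also listed in the abstract, is usually defined as the largest per-bin gap instead of the weighted sum; the loop above could track `max` rather than accumulate to obtain it.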

Title: From Feature-Based Models to Generative AI: Validity Evidence for Constructed Response Scoring

Authors: Jodi M. Casabianca, Daniel F. McCaffrey, Matthew S. Johnson, Naim Alper, Vladimir Zubenko
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2603.19280
Pdf URL: https://arxiv.org/pdf/2603.19280
Copy Paste: [[2603.19280]] From Feature-Based Models to Generative AI: Validity Evidence for Constructed Response Scoring(https://arxiv.org/abs/2603.19280)
Keywords: language model
Abstract: The rapid advancements in large language models and generative artificial intelligence (AI) capabilities are making their broad application in the high-stakes testing context more likely. Use of generative AI in the scoring of constructed responses is particularly appealing because it reduces the effort required for handcrafting features in traditional AI scoring and might even outperform those methods. The purpose of this paper is to highlight the differences in the feature-based and generative AI applications in constructed response scoring systems and propose a set of best practices for the collection of validity evidence to support the use and interpretation of constructed response scores from scoring systems using generative AI. We compare the validity evidence needed in scoring systems using human ratings, feature-based natural language processing AI scoring engines, and generative AI. The evidence needed in the generative AI context is more extensive than in the feature-based scoring context because of the lack of transparency and other concerns unique to generative AI such as consistency. Constructed response score data from a large corpus of independent argumentative essays written by 6-12th grade students demonstrate the collection of validity evidence for different types of scoring systems and highlight the numerous complexities and considerations when making a validity argument for these scores.
摘要：大型语言模型和生成人工智能 (AI) 功能的快速进步使得它们在高风险测试环境中的广泛应用更有可能。在构建响应评分中使用生成式人工智能特别有吸引力，因为它减少了传统人工智能评分中手工制作特征所需的工作量，甚至可能优于这些方法。本文的目的是强调基于特征的人工智能和生成式人工智能应用在构建的响应评分系统中的差异，并提出一套收集有效性证据的最佳实践，以支持使用和解释使用生成式人工智能的评分系统构建的响应分数。我们使用人类评分、基于特征的自然语言处理人工智能评分引擎和生成人工智能来比较评分系统所需的有效性证据。由于缺乏透明度以及生成人工智能特有的其他问题（例如一致性），生成人工智能环境中所需的证据比基于特征的评分环境更广泛。从 6-12 年级学生撰写的大量独立论证论文中构建的响应分数数据展示了不同类型评分系统的有效性证据的收集，并强调了为这些分数进行有效性论证时的众多复杂性和考虑因素。

Title: URAG: A Benchmark for Uncertainty Quantification in Retrieval-Augmented Large Language Models

Authors: Vinh Nguyen, Cuong Dang, Jiahao Zhang, Hoa Tran, Minh Tran, Trinh Chau, Thai Le, Lu Cheng, Suhang Wang
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2603.19281
Pdf URL: https://arxiv.org/pdf/2603.19281
Copy Paste: [[2603.19281]] URAG: A Benchmark for Uncertainty Quantification in Retrieval-Augmented Large Language Models(https://arxiv.org/abs/2603.19281)
Keywords: language model, llm, hallucination, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) has emerged as a widely adopted approach for enhancing LLMs in scenarios that demand extensive factual knowledge. However, current RAG evaluations concentrate primarily on correctness, which may not fully capture the impact of retrieval on LLM uncertainty and reliability. To bridge this gap, we introduce URAG, a comprehensive benchmark designed to assess the uncertainty of RAG systems across various fields like healthcare, programming, science, math, and general text. By reformulating open-ended generation tasks into multiple-choice question answering, URAG allows for principled uncertainty quantification via conformal prediction. We apply the evaluation pipeline to 8 standard RAG methods, measuring their performance through both accuracy and prediction-set sizes based on LAC and APS metrics. Our analysis shows that (1) accuracy gains often coincide with reduced uncertainty, but this relationship breaks under retrieval noise; (2) simple modular RAG methods tend to offer better accuracy-uncertainty trade-offs than more complex reasoning pipelines; and (3) no single RAG approach is universally reliable across domains. We further show that (4) retrieval depth, parametric knowledge dependence, and exposure to confidence cues can amplify confident errors and hallucinations. Ultimately, URAG establishes a systematic benchmark for analyzing and enhancing the trustworthiness of retrieval-augmented systems. Our code is available on GitHub.
摘要：检索增强生成（RAG）已成为一种广泛采用的方法，用于在需要广泛事实知识的情况下增强法学硕士。然而，当前的 RAG 评估主要集中在正确性上，这可能无法完全捕捉检索对 LLM 不确定性和可靠性的影响。为了弥补这一差距，我们引入了 URAG，这是一个综合基准，旨在评估医疗保健、编程、科学、数学和一般文本等各个领域的 RAG 系统的不确定性。通过将开放式生成任务重新表述为多项选择题回答，URAG 允许通过保形预测进行有原则的不确定性量化。我们将评估流程应用于 8 种标准 RAG 方法，通过基于 LAC 和 APS 指标的准确性和预测集大小来衡量其性能。我们的分析表明，（1）准确性的提高通常与不确定性的降低相一致，但这种关系在检索噪声下被破坏； (2) 简单的模块化 RAG 方法往往比更复杂的推理管道提供更好的准确性与不确定性权衡； (3) 没有任何一种 RAG 方法能够跨领域普遍可靠。我们进一步表明，（4）检索深度、参数知识依赖性和置信线索的暴露可以放大置信错误和幻觉。最终，URAG 建立了一个系统基准来分析和增强检索增强系统的可信度。我们的代码可以在 GitHub 上找到。
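
Note: the abstract above describes uncertainty quantification via conformal prediction over multiple-choice answers, with prediction-set sizes based on LAC and APS. The following is a minimal sketch of the LAC variant only, assuming the common definition of the LAC score as 1 minus the probability assigned to the correct option. The function names, the calibration data, and the value of alpha are hypothetical and are not taken from the URAG code.

```python
import math

def lac_threshold(cal_probs, cal_labels, alpha=0.1):
    """Return the conformal threshold for LAC scores (1 - p[true option])."""
    scores = sorted(1.0 - probs[label] for probs, label in zip(cal_probs, cal_labels))
    n = len(scores)
    # Finite-sample rank ceil((n + 1) * (1 - alpha)), clipped to the largest score.
    rank = min(n, math.ceil((n + 1) * (1 - alpha)))
    return scores[rank - 1]

def lac_prediction_set(probs, qhat):
    """Keep every answer option whose LAC score is at most the threshold."""
    return [k for k, p in enumerate(probs) if 1.0 - p <= qhat]

# Hypothetical calibration data: option probabilities and the correct option index.
cal_probs = [[0.7, 0.1, 0.1, 0.1], [0.2, 0.5, 0.2, 0.1],
             [0.1, 0.2, 0.6, 0.1], [0.4, 0.3, 0.2, 0.1]]
cal_labels = [0, 1, 2, 1]

qhat = lac_threshold(cal_probs, cal_labels, alpha=0.25)
pred_set = lac_prediction_set([0.5, 0.35, 0.1, 0.05], qhat)
print(pred_set)  # options 0 and 1 remain in the set
```

A larger prediction set signals higher uncertainty for that question, which is why set size can be reported alongside accuracy.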

Title: Framing Effects in Independent-Agent Large Language Models: A Cross-Family Behavioral Analysis

Authors: Zice Wang, Zhenyu Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.19282
Pdf URL: https://arxiv.org/pdf/2603.19282
Copy Paste: [[2603.19282]] Framing Effects in Independent-Agent Large Language Models: A Cross-Family Behavioral Analysis(https://arxiv.org/abs/2603.19282)
Keywords: language model, llm, prompt, agent
Abstract: In many real-world applications, large language models (LLMs) operate as independent agents without interaction, thereby limiting coordination. In this setting, we examine how prompt framing influences decisions in a threshold voting task involving individual-group interest conflict. Two logically equivalent prompts with different framings were tested across diverse LLM families under isolated trials. Results show that prompt framing significantly influences choice distributions, often shifting preferences toward risk-averse options. Surface linguistic cues can even override logically equivalent formulations. This suggests that observed behavior reflects a tendency consistent with a preference for instrumental rather than cooperative rationality when success requires risk-bearing. The findings highlight framing effects as a significant bias source in non-interacting multi-agent LLM deployments, informing alignment and prompt design.
摘要：在许多现实应用程序中，大型语言模型（LLM）作为独立代理运行，没有交互，从而限制了协调。在这种情况下，我们研究了即时框架如何影响涉及个人群体利益冲突的阈值投票任务中的决策。在孤立的试验中，在不同的法学硕士家庭中测试了两个具有不同框架的逻辑上等效的提示。结果表明，即时框架显着影响选择分布，通常会将偏好转向规避风险的选项。表面语言线索甚至可以凌驾于逻辑上等效的表述之上。这表明，当成功需要承担风险时，观察到的行为反映了一种与工具理性而非合作理性偏好一致的趋势。研究结果强调框架效应是非交互多代理 LLM 部署中的一个重要偏差源，为协调和提示设计提供信息。

Title: Automated Motif Indexing on the Arabian Nights

Authors: Ibrahim H. Alyami, Mark A. Finlayson
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.19283
Pdf URL: https://arxiv.org/pdf/2603.19283
Copy Paste: [[2603.19283]] Automated Motif Indexing on the Arabian Nights(https://arxiv.org/abs/2603.19283)
Keywords: llm, prompt
Abstract: Motifs are non-commonplace, recurring narrative elements, often found originally in folk stories. In addition to being of interest to folklorists, motifs appear as metaphoric devices in modern news, literature, propaganda, and other cultural texts. Finding expressions of motifs in the original folkloristic text is useful both for folkloristic analysis (motif indexing) and for understanding the modern usage of motifs (motif detection and interpretation). Prior work has primarily shown how difficult these problems are to tackle using automated techniques. We present the first computational approach to motif indexing. Our choice of data is a key enabler: we use a large, widely available text (the Arabian Nights) paired with a detailed motif index (by El-Shamy in 2006), which overcomes the common problem of inaccessibility of texts referred to by the index. We created a manually annotated corpus that identified 2,670 motif expressions of 200 different motifs across 58,450 sentences for training and testing. We tested five types of approaches for detecting motif expressions given a motif index entry: (1) classic retrieve and re-rank using keywords and a fine-tuned cross-encoder; (2) off-the-shelf embedding models; (3) fine-tuned embedding models; (4) generative prompting of off-the-shelf LLMs in N-shot setups; and (5) the same generative approaches on LLMs fine-tuned with LoRA. Our best-performing system is a fine-tuned Llama3 model, which achieves an overall performance of 0.85 F1.
摘要：主题是不常见的、反复出现的叙事元素，通常最初出现在民间故事中。除了民俗学家感兴趣之外，主题还作为隐喻手段出现在现代新闻、文学、宣传和其他文化文本中。在原始民俗文本中查找主题表达对于民俗分析（主题索引）以及理解主题的现代用法（主题检测和解释）都很有用。先前的工作主要表明使用自动化技术解决这些问题是多么困难。我们提出了第一个主题索引的计算方法。我们对数据的选择是一个关键的推动因素：我们使用大量的、广泛使用的文本（《天方夜谭》）与详细的主题索引（由 El-Shamy 于 2006 年编写），这克服了索引引用的文本难以访问的常见问题。我们创建了一个手动注释的语料库，识别了 58,450 个句子中 200 个不同主题的 2,670 个主题表达，用于训练和测试。我们测试了五种在给定主题索引条目的情况下检测主题表达的方法：（1）使用关键字和微调交叉编码器进行经典检索和重新排名； (2)现成的嵌入模型；（3）微调嵌入模型； (4) 在 N 次设置中对现成的法学硕士进行生成性提示； (5) 使用 LoRA 微调法学硕士的相同生成方法。我们性能最好的系统是经过微调的 Llama3 模型，其整体性能达到 0.85 F1。

Title: LLM-MRD: LLM-Guided Multi-View Reasoning Distillation for Fake News Detection

Authors: Weilin Zhou, Shanwen Tan, Enhao Gu, Yurong Qian
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.19293
Pdf URL: https://arxiv.org/pdf/2603.19293
Copy Paste: [[2603.19293]] LLM-MRD: LLM-Guided Multi-View Reasoning Distillation for Fake News Detection(https://arxiv.org/abs/2603.19293)
Keywords: language model, llm
Abstract: Multimodal fake news detection is crucial for mitigating societal disinformation. Existing approaches attempt to address this by fusing multimodal features or leveraging Large Language Models (LLMs) for advanced reasoning. However, these methods suffer from serious limitations, including a lack of comprehensive multi-view judgment and fusion, and prohibitive reasoning inefficiency due to the high computational costs of LLMs. To address these issues, we propose LLM-Guided Multi-View Reasoning Distillation for Fake News Detection (LLM-MRD), a novel teacher-student framework. The Student Multi-view Reasoning module first constructs a comprehensive foundation from textual, visual, and cross-modal perspectives. Then, the Teacher Multi-view Reasoning module generates deep reasoning chains as rich supervision signals. Our core Calibration Distillation mechanism efficiently distills this complex reasoning-derived knowledge into the efficient student model. Experiments show LLM-MRD significantly outperforms state-of-the-art baselines. Notably, it demonstrates a comprehensive average improvement of 5.19% in ACC and 6.33% in F1-Fake when evaluated across all competing methods and datasets. Our code is available at this https URL.
摘要：多模式假新闻检测对于减少社会虚假信息至关重要。现有方法试图通过融合多模式特征或利用大型语言模型（LLM）进行高级推理来解决这个问题。然而，这些方法存在严重的局限性，包括缺乏全面的多视图判断和融合，以及由于法学硕士的高计算成本而导致推理效率低下。为了解决这些问题，我们提出了用于假新闻检测的 \textbf{LLM}-Guided \textbf{M}ulti-View \textbf{R}easoning \textbf{D}istillation (\textbf{LLM-MRD})，这是一种新颖的师生框架。学生多视图推理模块首先从文本、视觉和跨模态角度构建全面的基础。然后，教师多视图推理模块生成深度推理链作为丰富的监督信号。我们的核心校准蒸馏机制有效地将这种复杂的推理知识提取到高效的学生模型中。实验表明 LLM-MRD 显着优于最先进的基线。值得注意的是，在所有竞争方法和数据集上进行评估时，ACC 的综合平均改进为 5.19%，F1-Fake 的综合平均改进为 6.33%。我们的代码可在此 https URL 获取

Title: PrefPO: Pairwise Preference Prompt Optimization

Authors: Rahul Singhal, Pradyumna Tambwekar, Karime Maamari
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.19311
Pdf URL: https://arxiv.org/pdf/2603.19311
Copy Paste: [[2603.19311]] PrefPO: Pairwise Preference Prompt Optimization(https://arxiv.org/abs/2603.19311)
Keywords: llm, prompt
Abstract: Prompt engineering is effective but labor-intensive, motivating automated optimization methods. Existing methods typically require labeled datasets, which are often unavailable, and produce verbose, repetitive prompts. We introduce PrefPO, a minimal prompt optimization approach inspired by reinforcement learning from human feedback (RLHF). Its preference-based approach reduces the need for labeled data and hyperparameter tuning: only a starting prompt and natural language criteria are needed. PrefPO uses an LLM discriminator to express pairwise preferences over model outputs and provide feedback to an LLM optimizer, iteratively improving performance. We evaluate PrefPO on 9 BIG-Bench Hard (BBH) tasks and IFEval-Hard, a newly curated, challenging subset of IFEval. PrefPO matches or exceeds SOTA methods, including GEPA, MIPRO, and TextGrad, on 6/9 tasks and performs comparably to TextGrad on IFEval-Hard (82.4% vs 84.5%). Unlike other methods, PrefPO can optimize in both labeled and unlabeled settings. Without labels, PrefPO closely matches its labeled performance on 6/9 tasks, proving effective without ground truth. PrefPO also improves prompt hygiene: we find existing methods produce prompts 14.7x their original length or with 34% repetitive content; PrefPO reduces these issues by 3-5x. Furthermore, both LLM and human judges rate PrefPO's prompts higher than TextGrad's. Finally, we identify prompt hacking in prompt optimizers, where methods game evaluation criteria, and find PrefPO is susceptible at less than half the rate of TextGrad (37% vs 86%), generating fewer brittle, misaligned prompts.
摘要：即时工程是有效的，但劳动密集型，激励自动优化方法。现有方法通常需要带标签的数据集，而这些数据集通常不可用，并且会产生冗长、重复的提示。我们引入了 PrefPO，这是一种受人类反馈强化学习 (RLHF) 启发的最小提示优化方法。其基于偏好的方法减少了对标记数据和超参数调整的需求——只需要启动提示和自然语言标准。 PrefPO 使用 LLM 判别器来表达对模型输出的成对偏好，并向 LLM 优化器提供反馈，从而迭代地提高性能。我们在 9 个 BIG-Bench Hard (BBH) 任务和 IFEval-Hard 上评估 PrefPO，IFEval-Hard 是一个新策划的、具有挑战性的 IFEval 子集。 PrefPO 在 6/9 任务上匹配或超过了 SOTA 方法，包括 GEPA、MIPRO 和 TextGrad，并且在 IFEval-Hard 上的表现与 TextGrad 相当（82.4% vs 84.5%）。与其他方法不同，PrefPO 可以在标记和未标记设置中进行优化。在没有标签的情况下，PrefPO 与其在 6/9 任务上的标记性能非常匹配，在没有地面事实的情况下证明是有效的。 PrefPO 还提高了提示的卫生性：我们发现现有方法生成的提示是其原始长度的 14.7 倍，或重复内容为 34%； PrefPO 将这些问题减少了 3-5 倍。此外，法学硕士和人类评委对 PrefPO 提示的评价都高于 TextGrad 的提示。最后，我们在提示优化器中识别出提示黑客攻击，其中方法是游戏评估标准，并发现 PrefPO 的易受影响率是 TextGrad 的一半（37% vs 86%），从而生成更少的脆弱、错位提示。
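
Note: the abstract above describes a loop in which a discriminator expresses a pairwise preference over model outputs and feeds back to an optimizer. The following is a minimal sketch of one way such a loop could be arranged, not the PrefPO implementation. All function names, the round count, and the toy stand-ins for the three LLM calls are hypothetical.

```python
def preference_optimize(prompt, criteria, run_model, discriminator, optimizer, rounds=3):
    """Iteratively keep whichever prompt the discriminator prefers."""
    best, feedback = prompt, ""
    for _ in range(rounds):
        # The optimizer proposes a revised prompt from the current best and the feedback.
        candidate = optimizer(best, feedback)
        # The discriminator compares the two outputs against the criteria (no labels).
        winner, feedback = discriminator(criteria, run_model(best), run_model(candidate))
        if winner == 1:
            best = candidate
    return best

# Toy stand-ins for the three LLM calls (assumptions for illustration only).
def toy_model(prompt):
    return f"answer using: {prompt}"

def toy_discriminator(criteria, out_a, out_b):
    # Prefers the strictly shorter output; returns (winner index, feedback text).
    winner = 1 if len(out_b) < len(out_a) else 0
    return winner, "shorter is better"

def toy_optimizer(prompt, feedback):
    # Drops one filler word per round, if any is left.
    return prompt.replace("please ", "", 1)

result = preference_optimize("please please answer briefly", "be concise",
                             toy_model, toy_discriminator, toy_optimizer)
print(result)  # answer briefly
```

In a real setting each toy function would be replaced by a model call; the loop itself needs only a starting prompt and natural language criteria, matching the inputs the abstract names.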

Title: Memory-Driven Role-Playing: Evaluation and Enhancement of Persona Knowledge Utilization in LLMs

Authors: Kai Wang, Haoyang You, Yang Zhang, Zhongjie Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.19313
Pdf URL: https://arxiv.org/pdf/2603.19313
Copy Paste: [[2603.19313]] Memory-Driven Role-Playing: Evaluation and Enhancement of Persona Knowledge Utilization in LLMs(https://arxiv.org/abs/2603.19313)
Keywords: llm, prompt
Abstract: A core challenge for faithful LLM role-playing is sustaining consistent characterization throughout long, open-ended dialogues, as models frequently fail to recall and accurately apply their designated persona knowledge without explicit cues. To tackle this, we propose the Memory-Driven Role-Playing paradigm. Inspired by Stanislavski's "emotional memory" acting theory, this paradigm frames persona knowledge as the LLM's internal memory store, requiring retrieval and application based solely on dialogue context, thereby providing a rigorous test of depth and autonomous use of knowledge. Centered on this paradigm, we contribute: (1) MREval, a fine-grained evaluation framework assessing four memory-driven abilities - Anchoring, Recalling, Bounding, and Enacting; (2) MRPrompt, a prompting architecture that guides structured memory retrieval and response generation; and (3) MRBench, a bilingual (Chinese/English) benchmark for fine-grained diagnosis. The novel paradigm provides a comprehensive diagnostic for four-staged role-playing abilities across 12 LLMs. Crucially, experiments show that MRPrompt allows small models (e.g., Qwen3-8B) to match the performance of much larger closed-source LLMs (e.g., Qwen3-Max and GLM-4.7), and confirms that upstream memory gains directly enhance downstream response quality, validating the staged theoretical foundation.
摘要：忠实的法学硕士角色扮演的一个核心挑战是在漫长的开放式对话中保持一致的人物塑造，因为模型经常无法在没有明确提示的情况下回忆和准确应用他们指定的角色知识。为了解决这个问题，我们提出了记忆驱动的角色扮演范例。受斯坦尼斯拉夫斯基“情感记忆”表演理论的启发，该范式将角色知识构建为法学硕士的内部记忆存储，要求仅基于对话上下文进行检索和应用，从而提供对知识深度和自主使用的严格测试。围绕这一范式，我们贡献了：（1）MREval，一个细粒度的评估框架，评估四种记忆驱动的能力——锚定、回忆、边界和制定； (2) MRPrompt，一种指导结构化记忆检索和响应生成的提示架构；（3）MRBench，双语（中/英）细粒度诊断基准。这种新颖的范例为 12 个法学硕士的四阶段角色扮演能力提供了全面的诊断。至关重要的是，实验表明 MRPrompt 允许小型模型（例如 Qwen3-8B）与更大的闭源 LLM（例如 Qwen3-Max 和 GLM-4.7）的性能相匹配，并证实上游内存增益直接增强下游响应质量，从而验证了分阶段的理论基础。

Title: Prompt-tuning with Attribute Guidance for Low-resource Entity Matching

Authors: Lihui Liu, Carl Yang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.19321
Pdf URL: https://arxiv.org/pdf/2603.19321
Copy Paste: [[2603.19321]] Prompt-tuning with Attribute Guidance for Low-resource Entity Matching(https://arxiv.org/abs/2603.19321)
Keywords: prompt
Abstract: Entity Matching (EM) is an important task that determines the logical relationship between two entities, such as Same, Different, or Undecidable. Traditional EM approaches rely heavily on supervised learning, which requires large amounts of high-quality labeled data. This labeling process is both time-consuming and costly, limiting practical applicability. As a result, there is a strong need for low-resource EM methods that can perform well with minimal labeled data. Recent prompt-tuning approaches have shown promise for low-resource EM, but they mainly focus on entity-level matching and often overlook critical attribute-level information. In addition, these methods typically lack interpretability and explainability. To address these limitations, this paper introduces PROMPTATTRIB, a comprehensive solution that tackles EM through attribute-level prompt tuning and logical reasoning. PROMPTATTRIB uses both entity-level and attribute-level prompts to incorporate richer contextual information and employs fuzzy logic formulas to infer the final matching label. By explicitly considering attributes, the model gains a deeper understanding of the entities, resulting in more accurate matching. Furthermore, PROMPTATTRIB integrates dropout-based contrastive learning on soft prompts, inspired by SimCSE, which further boosts EM performance. Extensive experiments on real-world datasets demonstrate the effectiveness of PROMPTATTRIB.
摘要：实体匹配（EM）是一项重要的任务，它确定两个实体之间的逻辑关系，例如相同、不同或不可判定。传统的 EM 方法严重依赖监督学习，这需要大量高质量的标记数据。这种标记过程既耗时又昂贵，限制了实际应用。因此，迫切需要能够在最少标记数据的情况下表现良好的低资源 EM 方法。最近的提示调整方法已经显示出低资源 EM 的前景，但它们主要关注实体级匹配，并且经常忽略关键的属性级信息。此外，这些方法通常缺乏可解释性和可解释性。为了解决这些限制，本文引入了 PROMPTATTRIB，这是一种通过属性级提示调整和逻辑推理来解决 EM 的综合解决方案。 PROMPTATTRIB 使用实体级和属性级提示来合并更丰富的上下文信息，并采用模糊逻辑公式来推断最终的匹配标签。通过显式考虑属性，模型可以更深入地了解实体，从而实现更准确的匹配。此外，受 SimCSE 启发，PROMPTATTRIB 在软提示上集成了基于 dropout 的对比学习，进一步提高了 EM 性能。对真实世界数据集的大量实验证明了 PROMPTATTRIB 的有效性。

Title: Scalable Prompt Routing via Fine-Grained Latent Task Discovery

Authors: Yunyi Zhang, Soji Adeshina, Patrick Guan, Ashwin Ganesh, Zhen Han, Vassilis N. Ioannidis, Huzefa Rangwala, George Karypis
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.19415
Pdf URL: https://arxiv.org/pdf/2603.19415
Copy Paste: [[2603.19415]] Scalable Prompt Routing via Fine-Grained Latent Task Discovery(https://arxiv.org/abs/2603.19415)
Keywords: language model, prompt
Abstract: Prompt routing dynamically selects the most appropriate large language model from a pool of candidates for each query, optimizing performance while managing costs. As model pools scale to include dozens of frontier models with narrow performance gaps, existing approaches face significant challenges: manually defined task taxonomies cannot capture fine-grained capability distinctions, while monolithic routers struggle to differentiate subtle differences across diverse tasks. We propose a two-stage routing architecture that addresses these limitations through automated fine-grained task discovery and task-aware quality estimation. Our first stage employs graph-based clustering to discover latent task types and trains a classifier to assign prompts to discovered tasks. The second stage uses a mixture-of-experts architecture with task-specific prediction heads for specialized quality estimates. At inference, we aggregate predictions from both stages to balance task-level stability with prompt-specific adaptability. Evaluated on 10 benchmarks with 11 frontier models, our method consistently outperforms existing baselines and surpasses the strongest individual model while incurring less than half its cost.
摘要：提示路由会从每个查询的候选池中动态选择最合适的大型语言模型，从而在管理成本的同时优化性能。随着模型池规模扩大到包含数十种性能差距较小的前沿模型，现有方法面临着重大挑战：手动定义的任务分类法无法捕获细粒度的功能差异，而整体路由器则难以区分不同任务之间的细微差异。我们提出了一种两阶段路由架构，通过自动细粒度任务发现和任务感知质量估计来解决这些限制。我们的第一阶段采用基于图的聚类来发现潜在任务类型，并训练分类器为发现的任务分配提示。第二阶段使用混合专家架构，具有特定于任务的预测头，用于专门的质量估计。在推理时，我们汇总两个阶段的预测，以平衡任务级稳定性与提示特定的适应性。通过对 11 个前沿模型的 10 个基准进行评估，我们的方法始终优于现有基线，并超越了最强的单个模型，同时成本不到其一半。
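
Note: the abstract above says predictions from the two stages are aggregated at inference. The following is a minimal sketch assuming a simple weighted average of a task-level and a prompt-level quality estimate per model. The weighting scheme, model names, and scores are hypothetical and the paper's actual aggregation rule may differ.

```python
def route(task_quality, prompt_quality, weight=0.5):
    """Blend task-level and prompt-level quality estimates; return the top model."""
    blended = {
        model: weight * task_quality[model] + (1 - weight) * prompt_quality[model]
        for model in task_quality
    }
    return max(blended, key=blended.get)

# Hypothetical estimates for one incoming prompt.
task_quality = {"model_a": 0.80, "model_b": 0.70}    # stage 1: stable, per task type
prompt_quality = {"model_a": 0.60, "model_b": 0.90}  # stage 2: specific to this prompt

print(route(task_quality, prompt_quality, weight=0.5))  # model_b
print(route(task_quality, prompt_quality, weight=0.9))  # model_a
```

A higher weight favors the task-level estimate (stability); a lower weight favors the prompt-specific estimate (adaptability).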

Title: Is Evaluation Awareness Just Format Sensitivity? Limitations of Probe-Based Evidence under Controlled Prompt Structure

Authors: Viliana Devbunova
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.19426
Pdf URL: https://arxiv.org/pdf/2603.19426
Copy Paste: [[2603.19426]] Is Evaluation Awareness Just Format Sensitivity? Limitations of Probe-Based Evidence under Controlled Prompt Structure(https://arxiv.org/abs/2603.19426)
Keywords: language model, prompt
Abstract: Prior work uses linear probes on benchmark prompts as evidence of evaluation awareness in large language models. Because evaluation context is typically entangled with benchmark format and genre, it is unclear whether probe-based signals reflect context or surface structure. We test whether these signals persist under partial control of prompt format using a controlled 2x2 dataset and diagnostic rewrites. We find that probes primarily track benchmark-canonical structure and fail to generalize to free-form prompts independent of linguistic style. Thus, standard probe-based methodologies do not reliably disentangle evaluation context from structural artifacts, limiting the evidential strength of existing results.
摘要：先前的工作使用基准提示上的线性探针作为大型语言模型中评估意识的证据。由于评估背景通常与基准格式和类型纠缠在一起，因此尚不清楚基于探针的信号是否反映背景或表面结构。我们使用受控 2x2 数据集和诊断重写来测试这些信号是否在提示格式的部分控制下持续存在。我们发现探针主要跟踪基准规范结构，并且无法推广到独立于语言风格的自由形式提示。因此，基于标准探测的方法不能可靠地将评估背景与结构工件分开，从而限制了现有结果的证据强度。

Title: Vocabulary shapes cross-lingual variation of word-order learnability in language models

Authors: Jonas Mayer Martins, Jaap Jumelet, Viola Priesemann, Lisa Beinborn
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.19427
Pdf URL: https://arxiv.org/pdf/2603.19427
Copy Paste: [[2603.19427]] Vocabulary shapes cross-lingual variation of word-order learnability in language models(https://arxiv.org/abs/2603.19427)
Keywords: language model
Abstract: Why do some languages like Czech permit free word order, while others like English do not? We address this question by pretraining transformer language models on a spectrum of synthetic word-order variants of natural languages. We observe that greater word-order irregularity consistently raises model surprisal, indicating reduced learnability. Sentence reversal, however, affects learnability only weakly. A coarse distinction of free- (e.g., Czech and Finnish) and fixed-word-order languages (e.g., English and French) does not explain cross-lingual variation. Instead, the structure of the word and subword vocabulary strongly predicts the model surprisal. Overall, vocabulary structure emerges as a key driver of computational word-order learnability across languages.
摘要：为什么捷克语等一些语言允许自由词序，而英语等其他语言则不允许？我们通过在一系列自然语言的合成词序变体上预训练 Transformer 语言模型来解决这个问题。我们观察到，较大的词序不规则性始终会引起模型的意外，表明可学习性降低。然而，句子反转对可学习性的影响很小。自由语言（例如捷克语和芬兰语）和固定词序语言（例如英语和法语）的粗略区分并不能解释跨语言变异。相反，单词和子词词汇的结构强烈地预测了模型的意外性。总体而言，词汇结构成为跨语言计算词序可学习性的关键驱动因素。

Title: Cooperation and Exploitation in LLM Policy Synthesis for Sequential Social Dilemmas

Authors: Víctor Gallego
Subjects: cs.CL, cs.GT
Abstract URL: https://arxiv.org/abs/2603.19453
Pdf URL: https://arxiv.org/pdf/2603.19453
Copy Paste: [[2603.19453]] Cooperation and Exploitation in LLM Policy Synthesis for Sequential Social Dilemmas(https://arxiv.org/abs/2603.19453)
Keywords: language model, llm, prompt, agent
Abstract: We study LLM policy synthesis: using a large language model to iteratively generate programmatic agent policies for multi-agent environments. Rather than training neural policies via reinforcement learning, our framework prompts an LLM to produce Python policy functions, evaluates them in self-play, and refines them using performance feedback across iterations. We investigate feedback engineering (the design of what evaluation information is shown to the LLM during refinement) comparing sparse feedback (scalar reward only) against dense feedback (reward plus social metrics: efficiency, equality, sustainability, peace). Across two canonical Sequential Social Dilemmas (Gathering and Cleanup) and two frontier LLMs (Claude Sonnet 4.6, Gemini 3.1 Pro), dense feedback consistently matches or exceeds sparse feedback on all metrics. The advantage is largest in the Cleanup public goods game, where providing social metrics helps the LLM calibrate the costly cleaning-harvesting tradeoff. Rather than triggering over-optimization of fairness, social metrics serve as a coordination signal that guides the LLM toward more effective cooperative strategies, including territory partitioning, adaptive role assignment, and the avoidance of wasteful aggression. We further perform an adversarial experiment to determine whether LLMs can reward hack these environments. We characterize five attack classes and discuss mitigations, highlighting an inherent tension in LLM policy synthesis between expressiveness and safety. Code at this https URL.
摘要：我们研究LLM策略综合：使用大型语言模型迭代地为多代理环境生成编程代理策略。我们的框架不是通过强化学习来训练神经策略，而是提示法学硕士生成Python策略函数，在自我对弈中评估它们，并使用跨迭代的性能反馈来完善它们。我们研究反馈工程（在细化过程中向法学硕士显示哪些评估信息的设计），将稀疏反馈（仅标量奖励）与密集反馈（奖励加上社会指标：效率、平等、可持续性、和平）进行比较。在两个典型的顺序社会困境（收集和清理）和两个前沿法学硕士（Claude Sonnet 4.6、Gemini 3.1 Pro）中，密集反馈在所有指标上始终匹配或超过稀疏反馈。在清理公共物品游戏中，优势最大，提供社会指标有助于法学硕士校准成本高昂的清理与收获权衡。社会指标不会引发公平性的过度优化，而是充当协调信号，引导法学硕士采取更有效的合作策略，包括领土划分、适应性角色分配和避免浪费性的侵略。我们进一步进行了一项对抗性实验，以确定法学硕士是否可以奖励破解这些环境。我们描述了五种攻击类别并讨论了缓解措施，强调了法学硕士政策综合的表达性和安全性之间的内在张力。此 https URL 处的代码。

Title: Inducing Sustained Creativity and Diversity in Large Language Models

Authors: Queenie Luo, Gary King, Michael Puett, Michael D. Smith
Subjects: cs.CL, cs.AI, cs.CY, cs.IR
Abstract URL: https://arxiv.org/abs/2603.19519
Pdf URL: https://arxiv.org/pdf/2603.19519
Copy Paste: [[2603.19519]] Inducing Sustained Creativity and Diversity in Large Language Models(https://arxiv.org/abs/2603.19519)
Keywords: language model, llm, prompt
Abstract: We address a not-widely-recognized subset of exploratory search, where a user sets out on a typically long "search quest" for the perfect wedding dress, overlooked research topic, killer company idea, etc. The first few outputs of current large language models (LLMs) may be helpful but only as a start, since the quest requires learning the search space and evaluating many diverse and creative alternatives along the way. Although LLMs encode an impressive fraction of the world's knowledge, common decoding methods are narrowly optimized for prompts with correct answers and thus return mostly homogeneous and conventional results. Other approaches, including those designed to increase diversity across a small set of answers, start to repeat themselves long before search quest users learn enough to make final choices, or offer a uniform type of "creativity" to every user asking similar questions. We develop a novel, easy-to-implement decoding scheme that induces sustained creativity and diversity in LLMs, producing as many conceptually unique results as desired, even without access to the inner workings of an LLM's vector space. The algorithm unlocks an LLM's vast knowledge, both orthodox and heterodox, well beyond modal decoding paths. With this approach, search quest users can more quickly explore the search space and find satisfying answers.
摘要：我们解决了一个未被广泛认可的探索性搜索子集，即用户对完美婚纱、被忽视的研究主题、杀手级公司创意等进行了通常漫长的“搜索任务”。当前大型语言模型（LLM）的前几个输出可能会有所帮助，但只是作为一个开始，因为搜索过程需要学习搜索空间并在此过程中评估许多多样化和创造性的替代方案。尽管法学硕士编码了世界知识中令人印象深刻的一部分，但常见的解码方法针对正确答案的提示进行了狭隘的优化，因此返回的结果大多是同质的和传统的结果。其他方法，包括那些旨在增加一小组答案的多样性的方法，早在搜索任务用户学会足够的知识以做出最终选择之前就开始重复自己，或者为每个提出类似问题的用户提供统一类型的“创造力”。我们开发了一种新颖、易于实施的解码方案，即使无法访问法学硕士向量空间的内部运作，也能激发法学硕士的持续创造力和多样性，根据需要产生尽可能多的概念上独特的结果。该算法解锁了法学硕士的丰富知识，包括正统和非正统知识，远远超出了模态解码路径。通过这种方法，搜索任务用户可以更快地探索搜索空间并找到满意的答案。

Title: EvidenceRL: Reinforcing Evidence Consistency for Trustworthy Language Models

Authors: J. Ben Tamo, Yuxing Lu, Benoit L. Marteau, Micky C. Nnamdi, May D. Wang
Subjects: cs.CL, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2603.19532
Pdf URL: https://arxiv.org/pdf/2603.19532
Copy Paste: [[2603.19532]] EvidenceRL: Reinforcing Evidence Consistency for Trustworthy Language Models(https://arxiv.org/abs/2603.19532)
Keywords: language model, llm, hallucination
Abstract: Large Language Models (LLMs) are fluent but prone to hallucinations, producing answers that appear plausible yet are unsupported by available evidence. This failure is especially problematic in high-stakes domains where decisions must be justified by verifiable information. We introduce EvidenceRL, a reinforcement learning framework that enforces evidence adherence during training. EvidenceRL scores candidate responses for grounding (entailment with retrieved evidence and context) and correctness (agreement with reference answers) and optimizes the generator using Group Relative Policy Optimization (GRPO). We evaluate across two high-stakes domains, cardiac diagnosis and legal reasoning, where EvidenceRL consistently improves evidence grounding and faithfulness without sacrificing task accuracy. On cardiac diagnosis, F1@3 increases from 37.0 to 54.5 on Llama-3.2-3B while grounding ($G_{\max}@3$) rises from 47.6 to 78.2; hallucinations drop nearly 5x and evidence-supported diagnoses increase from 31.8% to 61.6%. On legal reasoning, EvidenceRL raises Faithfulness from 32.8% to 67.6% on Llama-3.1-8B, demonstrating consistent behavioral change across domains. Our code is open-sourced at this https URL.
摘要：大型语言模型（LLM）很流畅，但容易产生幻觉，产生看似合理但没有现有证据支持的答案。这种失败在高风险领域尤其成问题，因为这些领域的决策必须通过可验证的信息来证明其合理性。我们引入了 \textbf{EvidenceRL}，这是一个强化学习框架，可以在训练期间强制遵守证据。 EvidenceRL 对候选答案的基础性（包含检索到的证据和上下文）和正确性（与参考答案的一致性）进行评分，并使用组相对策略优化 (GRPO) 来优化生成器。我们对心脏诊断和法律推理这两个高风险领域进行评估，其中 EvidenceRL 在不牺牲任务准确性的情况下不断提高证据基础和可信度。在心脏诊断方面，Llama-3.2-3B 上的 F1@3 从 37.0 增加到 54.5，而接地 ($G_{\max}@3$) 从 47.6 增加到 78.2；幻觉下降近 5\times$，有证据支持的诊断从 31.8\% 增加到 61.6\%。在法律推理方面，EvidenceRL 将 Llama-3.1-8B 上的忠实度从 32.8\% 提高到 67.6\%，证明了跨域的一致行为变化。我们的代码在此 https URL 上开源。
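
Note: the abstract above combines a grounding score and a correctness score into a reward and optimizes with GRPO. The following is a minimal sketch of the group-relative advantage step, in which each candidate's reward is standardized against the other candidates sampled for the same prompt. The equal-weight reward mix and the example scores are assumptions, not values from the paper.

```python
import statistics

def combined_reward(grounding, correctness, w_ground=0.5):
    """Weighted mix of grounding and correctness (the weighting is an assumption)."""
    return w_ground * grounding + (1 - w_ground) * correctness

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of sampled responses."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical (grounding, correctness) scores for four responses to one prompt.
candidates = [(0.9, 1.0), (0.2, 1.0), (0.8, 0.0), (0.1, 0.0)]
rewards = [combined_reward(g, c) for g, c in candidates]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Responses with above-average reward in their group receive positive advantages, so a correct but poorly grounded answer is reinforced less than a correct, well-grounded one.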

Title: FDARxBench: Benchmarking Regulatory and Clinical Reasoning on FDA Generic Drug Assessment

Authors: Betty Xiong, Jillian Fisher, Benjamin Newman, Meng Hu, Shivangi Gupta, Yejin Choi, Lanyan Fang, Russ B Altman
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.19539
Pdf URL: https://arxiv.org/pdf/2603.19539
Copy Paste: [[2603.19539]] FDARxBench: Benchmarking Regulatory and Clinical Reasoning on FDA Generic Drug Assessment(https://arxiv.org/abs/2603.19539)
Keywords: language model, llm
Abstract: We introduce an expert-curated, real-world benchmark for evaluating document-grounded question-answering (QA) motivated by generic drug assessment, using the U.S. Food and Drug Administration (FDA) drug label documents. Drug labels contain rich but heterogeneous clinical and regulatory information, making accurate question answering difficult for current language models. In collaboration with FDA regulatory assessors, we introduce FDARxBench, construct a multi-stage pipeline for generating high-quality, expert-curated QA examples spanning factual, multi-hop, and refusal tasks, and design evaluation protocols to assess both open-book and closed-book reasoning. Experiments across proprietary and open-weight models reveal substantial gaps in factual grounding, long-context retrieval, and safe refusal behavior. While motivated by FDA generic drug assessment needs, this benchmark also provides a substantial foundation for challenging regulatory-grade evaluation of label comprehension. The benchmark is designed to support evaluation of LLM behavior on drug-label questions.
摘要：我们引入了专家策划的真实世界基准，用于使用美国食品和药物管理局 (FDA) 药品标签文件来评估由仿制药评估驱动的基于文档的问答 (QA)。药品标签包含丰富但异构的临床和监管信息，使得当前语言模型难以准确回答问题。我们与 FDA 监管评估人员合作，引入了 FDARxBench，并构建了一个多阶段管道，用于生成涵盖事实、多跳和拒绝任务的高质量、专家策划的 QA 示例，并设计评估协议来评估开卷和闭卷推理。专有模型和开放权重模型的实验揭示了事实基础、长上下文检索和安全拒绝行为方面的巨大差距。虽然受到 FDA 仿制药评估需求的推动，该基准也为挑战标签理解的监管级评估提供了坚实的基础。该基准旨在支持法学硕士在药物标签问题上的行为评估。

Title: TextReasoningBench: Does Reasoning Really Improve Text Classification in Large Language Models?

Authors: Xinyu Guo, Yazhou Zhang, Jing Qin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.19558
Pdf URL: https://arxiv.org/pdf/2603.19558
Copy Paste: [[2603.19558]] TextReasoningBench: Does Reasoning Really Improve Text Classification in Large Language Models?(https://arxiv.org/abs/2603.19558)
Keywords: language model, llm
Abstract: Eliciting explicit, step-by-step reasoning traces from large language models (LLMs) has emerged as a dominant paradigm for enhancing model capabilities. Although such reasoning strategies were originally designed for problems requiring explicit multi-step reasoning, they have increasingly been applied to a broad range of NLP tasks. This expansion implicitly assumes that deliberative reasoning uniformly benefits heterogeneous tasks. However, whether such reasoning mechanisms truly benefit classification tasks remains largely underexplored, especially considering their substantial token and time costs. To fill this gap, we introduce TextReasoningBench, a systematic benchmark designed to evaluate the effectiveness and efficiency of reasoning strategies for text classification with LLMs. We compare seven reasoning strategies, namely IO, CoT, SC-CoT, ToT, GoT, BoC, and long-CoT across ten LLMs on five text classification datasets. Beyond traditional metrics such as accuracy and macro-F1, we introduce two cost-aware evaluation metrics that quantify the performance gain per reasoning token and the efficiency of performance improvement relative to token cost growth. Experimental results reveal three notable findings: (1) Reasoning does not universally improve classification performance: while moderate strategies such as CoT and SC-CoT yield consistent but limited gains (typically +1% to +3% on big models), more complex methods (e.g., ToT and GoT) often fail to outperform simpler baselines and can even degrade performance, especially on small models; (2) Reasoning is often inefficient: many reasoning strategies increase token consumption by 10x to 100x (e.g., SC-CoT and ToT) while providing only marginal performance improvements.
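The abstract names two cost-aware metrics but does not give their formulas. The sketch below is one plausible reading, not the benchmark's definition: the first function reports accuracy points gained per 1,000 extra reasoning tokens, and the second divides relative accuracy gain by relative token growth. Function names, the per-1,000 scaling, and all numbers are assumptions made for illustration.

```python
def gain_per_kilo_token(acc_base, acc_strategy, extra_tokens):
    """Accuracy points gained per 1,000 additional reasoning tokens (assumed form)."""
    if extra_tokens <= 0:
        raise ValueError("extra_tokens must be positive")
    return (acc_strategy - acc_base) / (extra_tokens / 1000.0)


def gain_to_cost_ratio(acc_base, acc_strategy, tokens_base, tokens_strategy):
    """Relative accuracy gain divided by relative token growth (assumed form)."""
    if acc_base <= 0 or tokens_base <= 0:
        raise ValueError("baseline accuracy and token count must be positive")
    rel_gain = (acc_strategy - acc_base) / acc_base
    rel_cost = (tokens_strategy - tokens_base) / tokens_base
    if rel_cost <= 0:
        raise ValueError("the strategy must use more tokens than the baseline")
    return rel_gain / rel_cost


# Hypothetical numbers: a +2 point gain bought with an 11x token budget.
per_kilo = gain_per_kilo_token(80.0, 82.0, 500)
ratio = gain_to_cost_ratio(80.0, 82.0, 10, 110)
```

Under this reading, a strategy that multiplies token use by 10 or more for a gain of a few points yields a ratio far below 1, which matches the abstract's remark that reasoning is often inefficient.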

Title: BEAVER: A Training-Free Hierarchical Prompt Compression Method via Structure-Aware Page Selection

Authors: Zhengpei Hu, Kai Li, Dapeng Fu, Chang Zeng, Yue Li, Yuanhao Tang, Jianqiang Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.19635
Pdf URL: https://arxiv.org/pdf/2603.19635
Copy Paste: [[2603.19635]] BEAVER: A Training-Free Hierarchical Prompt Compression Method via Structure-Aware Page Selection(https://arxiv.org/abs/2603.19635)
Keywords: llm, prompt
Abstract: The exponential expansion of context windows in LLMs has unlocked capabilities for long-document understanding but introduced severe bottlenecks in inference latency and information utilization. Existing compression methods often suffer from high training costs or semantic fragmentation due to aggressive token pruning. In this paper, we propose BEAVER, a novel training-free framework that shifts compression from linear token removal to structure-aware hierarchical selection. BEAVER maximizes hardware parallelism by mapping variable-length contexts into dense page-level tensors via dual-path pooling, and preserves discourse integrity through a hybrid planner combining semantic and lexical dual-branch selection with sentence smoothing. Extensive evaluations on four long-context benchmarks demonstrate that BEAVER achieves comparable performance to state-of-the-art (SOTA) methods like LongLLMLingua. Notably, on the RULER benchmark, BEAVER maintains high fidelity in multi-needle retrieval where baselines deteriorate. Regarding efficiency, BEAVER reduces latency by 26.4x on 128k contexts, offering a scalable solution for high-throughput applications. Our code is available at this https URL.

Title: Structured Prompting for Arabic Essay Proficiency: A Trait-Centric Evaluation Approach

Authors: Salim Al Mandhari, Hieu Pham Dinh, Mo El-Haj, Paul Rayson
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.19668
Pdf URL: https://arxiv.org/pdf/2603.19668
Copy Paste: [[2603.19668]] Structured Prompting for Arabic Essay Proficiency: A Trait-Centric Evaluation Approach(https://arxiv.org/abs/2603.19668)
Keywords: language model, llm, prompt, agent
Abstract: This paper presents a novel prompt engineering framework for trait-specific Automatic Essay Scoring (AES) in Arabic, leveraging large language models (LLMs) under zero-shot and few-shot configurations. Addressing the scarcity of scalable, linguistically informed AES tools for Arabic, we introduce a three-tier prompting strategy (standard, hybrid, and rubric-guided) that guides LLMs in evaluating distinct language proficiency traits such as organization, vocabulary, development, and style. The hybrid approach simulates multi-agent evaluation with trait specialist raters, while the rubric-guided method incorporates scored exemplars to enhance model alignment. In zero- and few-shot settings, we evaluate eight LLMs on the QAES dataset, the first publicly available Arabic AES resource with trait-level annotations. Experimental results using Quadratic Weighted Kappa (QWK) and Confidence Intervals show that Fanar-1-9B-Instruct achieves the highest trait-level agreement in both zero- and few-shot prompting (QWK = 0.28 and CI = 0.41), with rubric-guided prompting yielding consistent gains across all traits and models. Discourse-level traits such as Development and Style showed the greatest improvements. These findings confirm that structured prompting, not model scale alone, enables effective AES in Arabic. Our study presents the first comprehensive framework for proficiency-oriented Arabic AES and sets the foundation for scalable assessment in low-resource educational contexts.
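For readers unfamiliar with the agreement metric reported above, the following is a minimal pure-Python sketch of Quadratic Weighted Kappa in its standard form. It is not the paper's evaluation code; the function name and the assumption that scores are integers in `0..num_labels-1` are choices made for this example.

```python
def quadratic_weighted_kappa(rater_a, rater_b, num_labels):
    """Standard QWK between two lists of integer scores in 0..num_labels-1."""
    n = len(rater_a)
    if n == 0 or n != len(rater_b) or num_labels < 2:
        raise ValueError("need two equal-length, non-empty score lists and >= 2 labels")

    # Observed confusion matrix between the two raters.
    observed = [[0] * num_labels for _ in range(num_labels)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1

    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(num_labels)) for j in range(num_labels)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(num_labels):
        for j in range(num_labels):
            weight = (i - j) ** 2 / (num_labels - 1) ** 2  # quadratic penalty
            expected = hist_a[i] * hist_b[j] / n           # chance agreement
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected

    if weighted_expected == 0:
        raise ValueError("kappa is undefined when both raters use a single label")
    return 1.0 - weighted_observed / weighted_expected


perfect = quadratic_weighted_kappa([0, 1, 2, 3], [0, 1, 2, 3], 4)
chance = quadratic_weighted_kappa([0, 0, 1, 1], [0, 1, 0, 1], 2)
```

A value of 1 means perfect agreement and 0 means agreement no better than chance, which gives some context for the reported QWK of 0.28.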

Title: DataProphet: Demystifying Supervision Data Generalization in Multimodal LLMs

Authors: Xuan Qi, Luxi He, Dan Roth, Xingyu Fu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.19688
Pdf URL: https://arxiv.org/pdf/2603.19688
Copy Paste: [[2603.19688]] DataProphet: Demystifying Supervision Data Generalization in Multimodal LLMs(https://arxiv.org/abs/2603.19688)
Keywords: language model, llm
Abstract: Conventional wisdom for selecting supervision data for multimodal large language models (MLLMs) is to prioritize datasets that appear similar to the target benchmark, such as text-intensive or vision-centric tasks. However, it remains unclear whether such intuitive similarity reliably predicts downstream performance gains. In this work, we take a first step toward answering a practical question: can we estimate the influence of a training dataset on a target benchmark before any training is performed? To investigate this question, we conduct an in-depth analysis of transfer across 14 vision-language datasets spanning 7 diverse tasks. Our results show that intuitive task similarity is an unreliable predictor of transferability, and that generalization depends more on the specific dataset than on its broad task category. Motivated by this finding, we propose DATAPROPHET, a simple and effective training-free metric that combines multimodal perplexity, similarity, and data diversity. Experiments show that DATAPROPHET produces supervision-data rankings that strongly correlate with rankings based on actual post-training performance gains, achieving a Kendall's tau of 86.0%. Moreover, DATAPROPHET enables better supervision-data selection, yielding up to 6.9% improvement over uniform selection, 1.4% over a state-of-the-art training-based baseline, and 0.2% above oracle selection based on experimental performance. Our code and data will be released.
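The ranking agreement quoted above is a Kendall's tau. The abstract does not say which variant is used, so the sketch below assumes the simplest one (tau-a, no tie correction) and compares a predicted dataset ranking against a ranking by measured gain. The scores are invented for illustration.

```python
from itertools import combinations


def kendall_tau_a(scores_x, scores_y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    n = len(scores_x)
    if n < 2 or n != len(scores_y):
        raise ValueError("need two equal-length sequences with at least two items")
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        sign = (scores_x[i] - scores_x[j]) * (scores_y[i] - scores_y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# Hypothetical metric scores vs. hypothetical measured gains for four datasets.
predicted = [1, 2, 3, 4]
measured = [1, 3, 2, 4]
tau = kendall_tau_a(predicted, measured)
```

One swapped pair out of six lowers tau from 1 to about 0.67, so a value of 0.86 across 14 datasets would mean that only a small fraction of dataset pairs are ordered differently.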

Title: EvoTaxo: Building and Evolving Taxonomy from Social Media Streams

Authors: Yiyang Li, Tianyi Ma, Yanfang Ye
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.19711
Pdf URL: https://arxiv.org/pdf/2603.19711
Copy Paste: [[2603.19711]] EvoTaxo: Building and Evolving Taxonomy from Social Media Streams(https://arxiv.org/abs/2603.19711)
Keywords: llm
Abstract: Constructing taxonomies from social media corpora is challenging because posts are short, noisy, semantically entangled, and temporally dynamic. Existing taxonomy induction methods are largely designed for static corpora and often struggle to balance robustness, scalability, and sensitivity to evolving discourse. We propose EvoTaxo, an LLM-based framework for building and evolving taxonomies from temporally ordered social media streams. Rather than clustering raw posts directly, EvoTaxo converts each post into a structured draft action over the current taxonomy, accumulates structural evidence over time windows, and consolidates candidate edits through dual-view clustering that combines semantic similarity with temporal locality. A refinement-and-arbitration procedure then selects reliable edits before execution, while each node maintains a concept memory bank to preserve semantic boundaries over time. Experiments on two Reddit corpora show that EvoTaxo produces more balanced taxonomies than baselines, with clearer post-to-leaf assignment, better corpus coverage at comparable taxonomy size, and stronger structural quality. A case study on the Reddit community /r/ICE_Raids further shows that EvoTaxo captures meaningful temporal shifts in discourse. Our codebase is available here.

Title: LoopRPT: Reinforcement Pre-Training for Looped Language Models

Authors: Guo Tang, Shixin Jiang, Heng Chang, Nuo Chen, Yuhan Li, Huiming Fan, Jia Li, Ming Liu, Bing Qin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.19714
Pdf URL: https://arxiv.org/pdf/2603.19714
Copy Paste: [[2603.19714]] LoopRPT: Reinforcement Pre-Training for Looped Language Models(https://arxiv.org/abs/2603.19714)
Keywords: language model, chain-of-thought
Abstract: Looped language models (LoopLMs) perform iterative latent computation to refine internal representations, offering a promising alternative to explicit chain-of-thought (CoT) reasoning. However, existing reinforcement learning (RL) paradigms primarily target output tokens, creating a structural mismatch with looped architectures whose reasoning unfolds implicitly. In this work, we propose LoopRPT, a reinforcement pre-training framework tailored for LoopLMs. By reframing next-token prediction as a next-token reasoning task, LoopRPT assigns reinforcement signals directly to latent steps using an EMA teacher reference and noisy latent rollouts. This formulation enables RL to directly shape intermediate representations, compressing effective reasoning into fewer iterations. We instantiate LoopRPT on the Ouro architecture across multiple model scales. Results demonstrate that LoopRPT consistently improves per-step representation quality, achieving Pareto dominance in accuracy-computation trade-offs. Notably, significant gains on hard tokens indicate that LoopRPT enhances early-stage reasoning rather than merely encouraging premature exits. Our findings highlight reinforcement pre-training as a principled paradigm for learning efficient latent reasoning in LoopLMs.

Title: PoC: Performance-oriented Context Compression for Large Language Models via Performance Prediction

Authors: Runsong Zhao, Shilei Liu, Jiwei Tang, Langming Liu, Haibin Chen, Weidong Zhang, Yujin Yuan, Tong Xiao, Jingbo Zhu, Wenbo Su, Bo Zheng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.19733
Pdf URL: https://arxiv.org/pdf/2603.19733
Copy Paste: [[2603.19733]] PoC: Performance-oriented Context Compression for Large Language Models via Performance Prediction(https://arxiv.org/abs/2603.19733)
Keywords: language model, llm
Abstract: While context compression can mitigate the growing inference costs of Large Language Models (LLMs) by shortening contexts, existing methods that specify a target compression ratio or length suffer from unpredictable performance degradation, hindering their reliable deployment. We introduce a paradigm shift to Performance-oriented Context Compression (PoC), where developers specify an acceptable performance floor instead of a compression ratio. PoC employs a lightweight performance predictor to automatically find the most aggressive compression ratio that satisfies this constraint before steering an off-the-shelf compressor. We design and compare two predictor variants: a simple context-agnostic predictor and a more sophisticated context-aware one that considers the input's inherent compressibility. On both question-answering and summarization benchmarks, the context-aware predictor consistently achieves lower performance prediction error than the context-agnostic predictor, while the resulting context-aware PoC attains a superior overall performance. Our work paves the way for a more reliable, efficient, and performance-aware deployment of context compression for LLMs.
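The selection step described above (pick the most aggressive compression that still meets a performance floor) can be sketched as a simple search over candidate ratios. This is an illustration under stated assumptions, not the authors' implementation: the predictor is a toy stand-in, "ratio" is taken to mean original length divided by compressed length, and all names are hypothetical.

```python
def most_aggressive_ratio(predict_performance, context, floor, candidate_ratios):
    """Return the largest ratio whose predicted performance meets the floor, else None."""
    for ratio in sorted(candidate_ratios, reverse=True):
        if predict_performance(context, ratio) >= floor:
            return ratio
    return None


def toy_predictor(context, ratio):
    """Toy context-agnostic predictor: performance decays linearly with the ratio."""
    return 0.9 - 0.05 * (ratio - 1)


# Hypothetical usage: the developer states a floor, not a compression ratio.
chosen = most_aggressive_ratio(toy_predictor, "some long context", 0.8, [1, 2, 4, 8])
unreachable = most_aggressive_ratio(toy_predictor, "some long context", 0.95, [1, 2, 4, 8])
```

A context-aware predictor would use the `context` argument, for example to allow higher ratios on redundant inputs. The toy predictor here ignores it, which corresponds to the context-agnostic variant.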

Title: Rethinking Ground Truth: A Case Study on Human Label Variation in MLLM Benchmarking

Authors: Tomas Ruiz, Tanalp Agustoslu, Carsten Schwemmer
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.19744
Pdf URL: https://arxiv.org/pdf/2603.19744
Copy Paste: [[2603.19744]] Rethinking Ground Truth: A Case Study on Human Label Variation in MLLM Benchmarking(https://arxiv.org/abs/2603.19744)
Keywords: language model, llm
Abstract: Human Label Variation (HLV), i.e. systematic differences among annotators' judgments, remains underexplored in benchmarks despite rapid progress in large language model (LLM) development. We address this gap by introducing an evaluation protocol for multimodal large language model (MLLM) benchmarking that explicitly accounts for two conditions: (1) human label agreement and (2) disagreement. We apply this protocol to two state-of-the-art MLLM families (Gemma 3, Qwen 2.5 VL) using non-aggregated human annotations from a social media content classification dataset. Across tasks, we find that larger models tend to perform best on high-agreement subsets, yet often underperform medium-sized models when human disagreement is high, indicating that parameter count alone does not determine sensitivity to ambiguity and subjectivity. These results show that benchmarks based solely on consensus labels can overstate model capabilities in such domains and that incorporating human label variation yields more realistic and robust assessments of MLLMs in content moderation pipelines.

Title: Neither Here Nor There: Cross-Lingual Representation Dynamics of Code-Mixed Text in Multilingual Encoders

Authors: Debajyoti Mazumder, Divyansh Pathak, Prashant Kodali, Jasabanta Patro
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.19771
Pdf URL: https://arxiv.org/pdf/2603.19771
Copy Paste: [[2603.19771]] Neither Here Nor There: Cross-Lingual Representation Dynamics of Code-Mixed Text in Multilingual Encoders(https://arxiv.org/abs/2603.19771)
Keywords: language model
Abstract: Multilingual encoder-based language models are widely adopted for code-mixed analysis tasks, yet we know surprisingly little about how they represent code-mixed inputs internally - or whether those representations meaningfully connect to the constituent languages being mixed. Using Hindi-English as a case study, we construct a unified trilingual corpus of parallel English, Hindi (Devanagari), and Romanized code-mixed sentences, and probe cross-lingual representation alignment across standard multilingual encoders and their code-mixed adapted variants via CKA, token-level saliency, and entropy-based uncertainty analysis. We find that while standard models align English and Hindi well, code-mixed inputs remain loosely connected to either language - and that continued pre-training on code-mixed data improves English-code-mixed alignment at the cost of English-Hindi alignment. Interpretability analyses further reveal a clear asymmetry: models process code-mixed text through an English-dominant semantic subspace, while native-script Hindi provides complementary signals that reduce representational uncertainty. Motivated by these findings, we introduce a trilingual post-training alignment objective that brings code-mixed representations closer to both constituent languages simultaneously, yielding more balanced cross-lingual alignment and downstream gains on sentiment analysis and hate speech detection - showing that grounding code-mixed representations in their constituent languages meaningfully helps cross-lingual understanding.

Title: Semantic Delta: An Interpretable Signal Differentiating Human and LLMs Dialogue

Authors: Riccardo Scantamburlo, Mauro Mezzanzana, Giacomo Buonanno, Francesco Bertolotti
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.19849
Pdf URL: https://arxiv.org/pdf/2603.19849
Copy Paste: [[2603.19849]] Semantic Delta: An Interpretable Signal Differentiating Human and LLMs Dialogue(https://arxiv.org/abs/2603.19849)
Keywords: llm
Abstract: Do LLMs talk like us? This question intrigues a multitude of scholars and is relevant in many fields, from education to academia. This work presents an interpretable statistical feature for distinguishing human-written and LLM-generated dialogue. We introduce a lightweight metric derived from the distribution of semantic categories. Using the Empath lexical analysis framework, each text is mapped to a set of thematic intensity scores. We define semantic delta as the difference between the two most dominant category intensities within a dialogue, hypothesizing that LLM outputs exhibit stronger thematic concentration than human discourse. To evaluate this hypothesis, conversational data were generated from multiple LLM configurations and compared against heterogeneous human corpora, including scripted dialogue, literary works, and online discussions. A Welch t-test was applied to the resulting distributions of semantic delta values. Results show that AI-generated texts consistently produce higher deltas than human texts, indicating a more rigid topic structure, whereas human dialogue displays a broader and more balanced semantic spread. Rather than replacing existing detection techniques, the proposed zero-shot metric provides a computationally inexpensive complementary signal that can be integrated into ensemble detection systems. These findings also contribute to the broader empirical understanding of LLM behavioural mimicry and suggest that thematic distribution constitutes a quantifiable dimension along which current models fall short of human conversational dynamics.
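The definition of semantic delta given above is compact enough to sketch directly. The example assumes the per-category intensity scores are already available as a plain dictionary (in the paper they come from Empath; no Empath call is made here), and it pairs the metric with a hand-written Welch t statistic. Category names and sample values are invented.

```python
import math
from statistics import mean, variance


def semantic_delta(category_scores):
    """Difference between the two largest category intensities of one dialogue."""
    top_two = sorted(category_scores.values(), reverse=True)[:2]
    if len(top_two) < 2:
        raise ValueError("need at least two categories")
    return top_two[0] - top_two[1]


def welch_t(sample_a, sample_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    na, nb = len(sample_a), len(sample_b)
    va, vb = variance(sample_a) / na, variance(sample_b) / nb
    t_stat = (mean(sample_a) - mean(sample_b)) / math.sqrt(va + vb)
    dof = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t_stat, dof


# Hypothetical intensity scores for a single dialogue.
delta = semantic_delta({"travel": 0.5, "food": 0.2, "work": 0.1})

# Hypothetical per-dialogue deltas for two corpora.
t_stat, dof = welch_t([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
```

A large delta means one theme dominates the dialogue. The Welch test is then run on the two collections of per-dialogue deltas, and it does not assume equal variances between the human and LLM samples.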

Title: SAGE: Sustainable Agent-Guided Expert-tuning for Culturally Attuned Translation in Low-Resource Southeast Asia

Authors: Zhixiang Lu, Chong Zhang, Yulong Li, Angelos Stefanidis, Anh Nguyen, Imran Razzak, Jionglong Su, Zhengyong Jiang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.19931
Pdf URL: https://arxiv.org/pdf/2603.19931
Copy Paste: [[2603.19931]] SAGE: Sustainable Agent-Guided Expert-tuning for Culturally Attuned Translation in Low-Resource Southeast Asia(https://arxiv.org/abs/2603.19931)
Keywords: language model, llm, agent
Abstract: The vision of an inclusive World Wide Web is impeded by a severe linguistic divide, particularly for communities in low-resource regions of Southeast Asia. While large language models (LLMs) offer a potential solution for translation, their deployment in data-poor contexts faces a dual challenge: the scarcity of high-quality, culturally relevant data and the prohibitive energy costs of training on massive, noisy web corpora. To resolve the tension between digital inclusion and environmental sustainability, we introduce Sustainable Agent-Guided Expert-tuning (SAGE). This framework pioneers an energy-aware paradigm that prioritizes the "right data" over "big data". Instead of carbon-intensive training on unfiltered datasets, SAGE employs a reinforcement learning (RL) agent, optimized via Group Relative Policy Optimization (GRPO), to autonomously curate a compact training set. The agent utilizes a semantic reward signal derived from a small, expert-constructed set of community dialogues to filter out noise and cultural misalignment. We then efficiently fine-tune open-source LLMs on this curated data using Low-Rank Adaptation (LoRA). We applied SAGE to translation tasks between English and seven low-resource languages (LRLs) in Southeast Asia. Our approach establishes new state-of-the-art performance on BLEU-4 and COMET-22 metrics, effectively capturing local linguistic nuances. Crucially, SAGE surpasses baselines trained on full datasets while reducing data usage by 97.1% and training energy consumption by 95.2%. By delivering high-performance models with a minimal environmental footprint, SAGE offers a scalable and responsible pathway to bridge the digital divide in the Global South.

Title: When Contextual Inference Fails: Cancelability in Interactive Instruction Following

Authors: Natalia Bila, Kata Naszádi, Alexandra Mayn, Christof Monz
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.19997
Pdf URL: https://arxiv.org/pdf/2603.19997
Copy Paste: [[2603.19997]] When Contextual Inference Fails: Cancelability in Interactive Instruction Following(https://arxiv.org/abs/2603.19997)
Keywords: llm
Abstract: We investigate the separation of literal interpretation from contextual inference in a collaborative block-building task where a builder must resolve underspecified instructions using contextual inferences. Building on an existing two-speaker psycholinguistic paradigm -- which contrasts a pragmatically cooperative speaker with one who is only literally reliable -- we introduce Build What I Mean (BWIM), an interactive benchmark for contextual meaning construction. In BWIM, models must resolve ambiguity by either performing a contextual inference or requesting clarification at a small communication cost. Evaluating several state-of-the-art LLMs, we find a dissociation between judgment and action: while models detect speaker unreliability in explicit confidence ratings, they fail to exploit this information to guide efficient clarification behavior. Instead, we observe suboptimal strategies, such as partner-blind over-clarification and question-averse guessing under uncertainty.

Title: An Agentic Approach to Generating XAI-Narratives

Authors: Yifan He, David Martens
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.20003
Pdf URL: https://arxiv.org/pdf/2603.20003
Copy Paste: [[2603.20003]] An Agentic Approach to Generating XAI-Narratives(https://arxiv.org/abs/2603.20003)
Keywords: language model, llm, agent
Abstract: Explainable AI (XAI) research has experienced substantial growth in recent years. Existing XAI methods, however, have been criticized for being technical and expert-oriented, motivating the development of more interpretable and accessible explanations. In response, large language model (LLM)-generated XAI narratives have been proposed as a promising approach for translating post-hoc explanations into more accessible, natural-language explanations. In this work, we propose a multi-agent framework for XAI narrative generation and refinement. The framework comprises the Narrator, which generates and revises narratives based on feedback from multiple Critic Agents on faithfulness and coherence metrics, thereby enabling narrative improvement through iteration. We design five agentic systems (Basic Design, Critic Design, Critic-Rule Design, Coherent Design, and Coherent-Rule Design) and systematically evaluate their effectiveness across five LLMs on five tabular datasets. Results validate that the Basic Design, the Critic Design, and the Critic-Rule Design are effective in improving the faithfulness of narratives across all LLMs. Claude-4.5-Sonnet on Basic Design performs best, reducing the number of unfaithful narratives by 90% after three rounds of iteration. To address recurrent issues, we further introduce an ensemble strategy based on majority voting. This approach consistently enhances performance for four LLMs, except for DeepSeek-V3.2-Exp. These findings highlight the potential of agentic systems to produce faithful and coherent XAI narratives.

Title: RouterKGQA: Specialized--General Model Routing for Constraint-Aware Knowledge Graph Question Answering

Authors: Bo Yuan, Hexuan Deng, Xuebo Liu, Min Zhang
Subjects: cs.CL, cs.DB, cs.IR
Abstract URL: https://arxiv.org/abs/2603.20017
Pdf URL: https://arxiv.org/pdf/2603.20017
Copy Paste: [[2603.20017]] RouterKGQA: Specialized--General Model Routing for Constraint-Aware Knowledge Graph Question Answering(https://arxiv.org/abs/2603.20017)
Keywords: llm, hallucination, agent
Abstract: Knowledge graph question answering (KGQA) is a promising approach for mitigating LLM hallucination by grounding reasoning in structured and verifiable knowledge graphs. Existing approaches fall into two paradigms: retrieval-based methods utilize small specialized models, which are efficient but often produce unreachable paths and miss implicit constraints, while agent-based methods utilize large general models, which achieve stronger structural grounding at substantially higher cost. We propose RouterKGQA, a framework for specialized--general model collaboration, in which a specialized model generates reasoning paths and a general model performs KG-guided repair only when needed, improving performance at minimal cost. We further equip the specialized model with constraint-aware answer filtering, which reduces redundant answers. In addition, we design a more efficient general agent workflow, further lowering inference cost. Experimental results show that RouterKGQA outperforms the previous best by 3.57 points in F1 and 0.49 points in Hits@1 on average across benchmarks, while requiring only 1.15 average LLM calls per question. Codes and models are available at this https URL.

Title: LoASR-Bench: Evaluating Large Speech Language Models on Low-Resource Automatic Speech Recognition Across Language Families

Authors: Jianan Chen, Xiaoxue Gao, Tatsuya Kawahara, Nancy F. Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.20042
Pdf URL: https://arxiv.org/pdf/2603.20042
Copy Paste: [[2603.20042]] LoASR-Bench: Evaluating Large Speech Language Models on Low-Resource Automatic Speech Recognition Across Language Families(https://arxiv.org/abs/2603.20042)
Keywords: language model, llm
Abstract: Large language models (LLMs) have driven substantial advances in speech language models (SpeechLMs), yielding strong performance in automatic speech recognition (ASR) under high-resource conditions. However, existing benchmarks predominantly focus on high-resource languages, leaving the ASR behavior of SpeechLMs in low-resource languages insufficiently understood. This gap is critical, as practical ASR systems must reliably support low-resource languages and generalize across diverse language families, and it directly hinders the deployment of SpeechLM-based ASR in real-world multilingual scenarios. As a result, it is essential to evaluate SpeechLMs on low-resource languages to ensure their generalizability across different language families. To address this problem, we propose LoASR-Bench, a comprehensive benchmark designed to evaluate low-resource automatic speech recognition (ASR) of the latest SpeechLMs across diverse language families. LoASR-Bench comprises 25 languages from 9 language families, featuring both Latin and non-Latin scripts, enabling cross-linguistic and cross-script assessment of ASR performance of current SpeechLMs. Experimental results highlight the limitations of the latest SpeechLMs in handling real-world low-resource languages.

Title: An Empirical Study of SFT-DPO Interaction and Parameterization in Small Language Models

Authors: Yuming Feng, Christy Yang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.20100
Pdf URL: https://arxiv.org/pdf/2603.20100
Copy Paste: [[2603.20100]] An Empirical Study of SFT-DPO Interaction and Parameterization in Small Language Models(https://arxiv.org/abs/2603.20100)
Keywords: language model, gpt
Abstract: Direct Preference Optimization (DPO) is widely used after supervised fine-tuning (SFT) to align language models, yet empirical behavior under small backbones and modest data is under-specified. We systematically compare SFT-only, DPO-only, and staged SFT-to-DPO training alongside full fine-tuning (FFT) versus LoRA on a GPT-2-scale decoder, evaluating paraphrase detection and Shakespearean sonnet continuation. DPO yields small, task-dependent gains over strong SFT and can match competitive SFT accuracy without a warm start when the preference construction closely parallels the supervised objective. In contrast, parameterization dominates: FFT consistently outperforms LoRA at matched training depth, and LoRA does not reduce wall-clock time on our hardware. These findings indicate that, in this small-scale regime, supervised full-parameter adaptation remains the primary performance lever, while preference optimization and low-rank adaptation provide limited marginal returns.
摘要：直接偏好优化（DPO）在监督微调（SFT）之后被广泛使用来调整语言模型，但小骨干和适度数据下的经验行为尚未明确。我们在 GPT-2 规模解码器上系统地比较了仅 SFT、仅 DPO 和分阶段 SFT 到 DPO 训练以及完全微调 (FFT) 与 LoRA，评估释义检测和莎士比亚十四行诗延续。当偏好结构与监督目标密切相关时，DPO 比强大的 SFT 产生较小的、依赖于任务的收益，并且可以在没有热启动的情况下与具有竞争力的 SFT 准确性相匹配。相比之下，参数化占主导地位：FFT 在匹配的训练深度上始终优于 LoRA，并且 LoRA 不会减少我们硬件上的挂钟时间。这些发现表明，在这种小规模机制中，有监督的全参数适应仍然是主要的绩效杠杆，而偏好优化和低秩适应提供的边际回报有限。

Title: Current LLMs still cannot 'talk much' about grammar modules: Evidence from syntax

Authors: Mohammed Q. Shormani
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.20114
Pdf URL: https://arxiv.org/pdf/2603.20114
Copy Paste: [[2603.20114]] Current LLMs still cannot 'talk much' about grammar modules: Evidence from syntax(https://arxiv.org/abs/2603.20114)
Keywords: language model, gpt, llm, chat
Abstract: We examine the extent to which Large Language Models (LLMs) can 'talk much' about grammar modules, drawing evidence from core syntactic terms translated into Arabic by ChatGPT. We collected 44 terms from previous work in generative syntax, including books and journal articles, as well as from our own experience in the field. These terms were translated by humans and then by ChatGPT-5, and we analyzed and compared the two sets of translations using an analytical and comparative approach. Findings reveal that LLMs still cannot 'talk much' about the core syntactic properties embedded in the terms under study, which involve several syntactic and semantic challenges: only 25% of ChatGPT translations were accurate, while 38.6% were inaccurate and 36.4% were partially correct (a category we consider appropriate). Based on these findings, a set of actionable strategies is proposed, the most notable of which is close collaboration between AI specialists and linguists to improve LLMs' working mechanisms for accurate, or at least appropriate, translation.
摘要：我们的目标是检查大型语言模型 (LLM) 可以在多大程度上“多谈”语法模块，并提供由 ChatGPT 翻译成阿拉伯语的核心句法属性的证据。我们从生成句法的既有文献（包括书籍和期刊文章）以及我们在该领域的经验中收集了 44 个术语。这些术语由人类翻译，然后由 ChatGPT-5 翻译。然后我们分析并比较了这两种翻译。我们在分析中使用了分析和比较方法。调查结果显示，LLM 仍然无法“多谈”所研究术语中嵌入的核心句法属性，其中涉及多个句法和语义挑战：只有 25% 的 ChatGPT 翻译准确，38.6% 不准确，36.4% 部分正确，我们认为后者属于适当。基于这些发现，提出了一套可行的策略，其中最值得注意的是人工智能专家和语言学家之间的密切合作，以改善 LLM 的工作机制，实现准确或至少适当的翻译。

Title: Reasoning Gets Harder for LLMs Inside A Dialogue

Authors: Ivan Kartáč, Mateusz Lango, Ondřej Dušek
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.20133
Pdf URL: https://arxiv.org/pdf/2603.20133
Copy Paste: [[2603.20133]] Reasoning Gets Harder for LLMs Inside A Dialogue(https://arxiv.org/abs/2603.20133)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) achieve strong performance on many reasoning benchmarks, yet these evaluations typically focus on isolated tasks that differ from real-world usage in task-oriented dialogue (TOD). In this setting, LLMs must reason implicitly while generating text and adhering to instructions on role, format, and style. This mismatch raises concerns about whether benchmark performance accurately reflects models' reasoning robustness in the TOD setting. We investigate how framing reasoning tasks within TOD affects LLM performance by introducing BOULDER, a new dynamic benchmark covering eight travel-related tasks that require arithmetic, spatial, and temporal reasoning with both commonsense and formal aspects. Each problem is presented in both isolated and dialogue-based variants, enabling controlled comparison while mitigating data contamination. Experiments on eight LLMs reveal a substantial and consistent performance gap between isolated and dialogue settings. Through ablations and qualitative analysis, we show that this gap is largely driven by the multi-turn nature of dialogue, with additional effects from role conditioning and tool-use requirements. Our results highlight the need to evaluate LLM reasoning in realistic interactive scenarios.
摘要：大型语言模型 (LLM) 在许多推理基准上取得了出色的性能，但这些评估通常侧重于与面向任务的对话 (TOD) 中的实际使用不同的孤立任务。在这种情况下，LLM 必须在生成文本并遵守角色、格式和风格方面的说明的同时进行隐式推理。这种不匹配引发了人们对基准性能是否准确反映模型在 TOD 设置中的推理稳健性的担忧。我们通过引入 BOULDER 来研究在 TOD 中构建推理任务如何影响 LLM 性能，BOULDER 是一个新的动态基准，涵盖八个与旅行相关的任务，这些任务需要兼具常识和形式方面的算术、空间和时间推理。每个问题都以孤立的和基于对话的变体形式呈现，从而在减少数据污染的同时实现受控比较。对八个 LLM 的实验揭示了孤立环境和对话环境之间存在巨大且一致的性能差距。通过消融和定性分析，我们表明这种差距很大程度上是由对话的多轮性质造成的，此外还有角色调节和工具使用要求的额外影响。我们的结果强调了在现实交互场景中评估 LLM 推理的必要性。

Title: Semantic Token Clustering for Efficient Uncertainty Quantification in Large Language Models

Authors: Qi Cao, Andrew Gambardella, Takeshi Kojima, Yutaka Matsuo, Yusuke Iwasawa
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.20161
Pdf URL: https://arxiv.org/pdf/2603.20161
Copy Paste: [[2603.20161]] Semantic Token Clustering for Efficient Uncertainty Quantification in Large Language Models(https://arxiv.org/abs/2603.20161)
Keywords: language model, llm
Abstract: Large language models (LLMs) have demonstrated remarkable capabilities across diverse tasks. However, the truthfulness of their outputs is not guaranteed, and their tendency toward overconfidence further limits reliability. Uncertainty quantification offers a promising way to identify potentially unreliable outputs, but most existing methods rely on repeated sampling or auxiliary models, introducing substantial computational overhead. To address these limitations, we propose Semantic Token Clustering (STC), an efficient uncertainty quantification method that leverages the semantic information inherently encoded in LLMs. Specifically, we group tokens into semantically consistent clusters using embedding clustering and prefix matching, and quantify uncertainty based on the probability mass aggregated over the corresponding semantic cluster. Our approach requires only a single generation and does not depend on auxiliary models. Experimental results show that STC achieves performance comparable to state-of-the-art baselines while substantially reducing computational overhead.
摘要：大型语言模型 (LLM) 在不同的任务中表现出了卓越的能力。然而，其输出的真实性并不能得到保证，而且其过度自信的倾向进一步限制了可靠性。不确定性量化提供了一种有前途的方法来识别潜在不可靠的输出，但大多数现有方法依赖于重复采样或辅助模型，从而引入了大量的计算开销。为了解决这些限制，我们提出了语义令牌聚类（STC），这是一种高效的不确定性量化方法，它利用了 LLM 中固有编码的语义信息。具体来说，我们使用嵌入聚类和前缀匹配将令牌分组为语义一致的簇，并根据相应语义簇上聚合的概率质量来量化不确定性。我们的方法只需要单次生成，并且不依赖于辅助模型。实验结果表明，STC 的性能可与最先进的基线相媲美，同时大大降低了计算开销。
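
Illustrative sketch (not from the paper): the abstract describes grouping tokens into semantic clusters via embedding clustering and prefix matching, then scoring uncertainty from the probability mass aggregated over the cluster. The Python below is one possible reading of that idea under loud assumptions: the cosine-similarity threshold, the bidirectional prefix rule, the function name `cluster_confidence`, and the toy probabilities and embeddings are all hypothetical and are not taken from the authors' method.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cluster_confidence(chosen, probs, embeddings, sim_threshold=0.8):
    """Sum next-token probability over tokens judged equivalent to `chosen`.

    A token joins the cluster if it shares a prefix relation with the chosen
    token or if its embedding is close to the chosen token's embedding.
    Both rules are assumptions made for this sketch.
    """
    cluster = []
    for tok in probs:
        same_prefix = tok.startswith(chosen) or chosen.startswith(tok)
        close = cosine(embeddings[tok], embeddings[chosen]) >= sim_threshold
        if same_prefix or close:
            cluster.append(tok)
    mass = sum(probs[t] for t in cluster)
    return mass, sorted(cluster)


# Toy next-token distribution and 2-D embeddings (made-up values).
probs = {"Paris": 0.45, "paris": 0.20, "Par": 0.15, "London": 0.20}
embeddings = {
    "Paris": [1.0, 0.0],
    "paris": [0.98, 0.199],
    "Par": [0.6, 0.8],
    "London": [0.0, 1.0],
}

mass, cluster = cluster_confidence("Paris", probs, embeddings)
uncertainty = 1.0 - mass  # lower cluster mass means higher uncertainty
print(cluster, round(mass, 2), round(uncertainty, 2))
```

In this toy case the cluster mass exceeds the raw probability of the single chosen token, which is the intuition behind aggregating over semantically equivalent tokens; only one forward pass worth of probabilities is needed.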

Title: Evaluating Evidence Grounding Under User Pressure in Instruction-Tuned Language Models

Authors: Sai Koneru, Elphin Joe, Christine Kirchhoff, Jian Wu, Sarah Rajtmajer
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.20162
Pdf URL: https://arxiv.org/pdf/2603.20162
Copy Paste: [[2603.20162]] Evaluating Evidence Grounding Under User Pressure in Instruction-Tuned Language Models(https://arxiv.org/abs/2603.20162)
Keywords: language model, prompt
Abstract: In contested domains, instruction-tuned language models must balance user-alignment pressures against faithfulness to the in-context evidence. To evaluate this tension, we introduce a controlled epistemic-conflict framework grounded in the U.S. National Climate Assessment. We conduct fine-grained ablations over evidence composition and uncertainty cues across 19 instruction-tuned models spanning 0.27B to 32B parameters. Across neutral prompts, richer evidence generally improves evidence-consistent accuracy and ordinal scoring performance. Under user pressure, however, evidence does not reliably prevent user-aligned reversals in this controlled fixed-evidence setting. We report three primary failure modes. First, we identify a negative partial-evidence interaction, where adding epistemic nuance, specifically research gaps, is associated with increased susceptibility to sycophancy in families like Llama-3 and Gemma-3. Second, robustness scales non-monotonically: within some families, certain low-to-mid scale models are especially sensitive to adversarial user pressure. Third, models differ in distributional concentration under conflict: some instruction-tuned models maintain sharply peaked ordinal distributions under pressure, while others are substantially more dispersed; in scale-matched Qwen comparisons, reasoning-distilled variants (DeepSeek-R1-Qwen) exhibit consistently higher dispersion than their instruction-tuned counterparts. These findings suggest that, in a controlled fixed-evidence setting, providing richer in-context evidence alone offers no guarantee against user pressure without explicit training for epistemic integrity.
摘要：在有争议的领域，指令调整的语言模型必须平衡用户对齐压力与对上下文证据的忠实度。为了评估这种紧张局势，我们引入了一个基于美国国家气候评估的受控认知冲突框架。我们对跨越 0.27B 到 32B 参数的 19 个指令调整模型的证据构成和不确定性线索进行了细粒度消融。在中性提示中，更丰富的证据通常会提高证据一致性的准确性和序数评分性能。然而，在用户压力下，在这种受控的固定证据设置中，证据并不能可靠地防止用户对齐的逆转。我们报告了三种主要的故障模式。首先，我们确定了一种消极的部分证据相互作用，即增加认知上的细微差别，特别是研究差距，与 Llama-3 和 Gemma-3 等家族对阿谀奉承的敏感性增加有关。其次，鲁棒性非单调扩展：在某些系列中，某些中低规模模型对对抗性用户压力特别敏感。第三，模型在冲突下的分布集中度有所不同：一些经过指令调整的模型在压力下保持尖锐的峰值序数分布，而其他模型则更加分散；在规模匹配的 Qwen 比较中，推理蒸馏变体 (DeepSeek-R1-Qwen) 始终表现出比指令调整的变体更高的离散度。这些发现表明，在受控的固定证据环境中，如果没有明确的认知完整性培训，仅提供更丰富的上下文证据并不能保证抵御用户压力。

Title: Measuring Faithfulness Depends on How You Measure: Classifier Sensitivity in LLM Chain-of-Thought Evaluation

Authors: Richard J. Young
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.20172
Pdf URL: https://arxiv.org/pdf/2603.20172
Copy Paste: [[2603.20172]] Measuring Faithfulness Depends on How You Measure: Classifier Sensitivity in LLM Chain-of-Thought Evaluation(https://arxiv.org/abs/2603.20172)
Keywords: llm, chain-of-thought
Abstract: Recent work on chain-of-thought (CoT) faithfulness reports single aggregate numbers (e.g., DeepSeek-R1 acknowledges hints 39% of the time), implying that faithfulness is an objective, measurable property of a model. This paper demonstrates that it is not. Three classifiers (a regex-only detector, a two-stage regex-plus-LLM pipeline, and an independent Claude Sonnet 4 judge) are applied to 10,276 influenced reasoning traces from 12 open-weight models spanning 9 families and 7B to 1T parameters. On identical data, these classifiers produce overall faithfulness rates of 74.4%, 82.6%, and 69.7%, respectively, with non-overlapping 95% confidence intervals. Per-model gaps range from 2.6 to 30.6 percentage points; all are statistically significant (McNemar's test, p < 0.001). The disagreements are systematic, not random: inter-classifier agreement measured by Cohen's kappa ranges from 0.06 ("slight") for sycophancy hints to 0.42 ("moderate") for grader hints, and the asymmetry is pronounced: for sycophancy, 883 cases are classified as faithful by the pipeline but unfaithful by the Sonnet judge, while only 2 go the other direction. Classifier choice can also reverse model rankings: Qwen3.5-27B ranks 1st under the pipeline but 7th under the Sonnet judge; OLMo-3.1-32B moves in the opposite direction, from 9th to 3rd. The root cause is that different classifiers operationalize related faithfulness constructs at different levels of stringency (lexical mention versus epistemic dependence), and these constructs yield divergent measurements on the same behavior. These results demonstrate that published faithfulness numbers cannot be meaningfully compared across studies that use different classifiers, and that future evaluations should report sensitivity ranges across multiple classification methodologies rather than single point estimates.
摘要：最近关于思维链 (CoT) 忠实度的工作报告了单个聚合数字（例如，DeepSeek-R1 在 39% 的情况下承认提示），这意味着忠实度是模型的客观、可测量的属性。本文证明事实并非如此。三个分类器（仅正则表达式检测器、两阶段正则表达式加 LLM 管道和独立的 Claude Sonnet 4 评判器）应用于来自 12 个开放权重模型（涵盖 9 个系列和 7B 到 1T 参数）的 10,276 条受影响的推理轨迹。对于相同的数据，这些分类器的总体忠实率分别为 74.4%、82.6% 和 69.7%，且 95% 置信区间互不重叠。每个模型的差距在 2.6 到 30.6 个百分点之间；所有这些均具有统计显著性（McNemar 检验，p < 0.001）。这些分歧是系统性的，而不是随机的：以 Cohen's kappa 衡量的分类器间一致性范围从谄媚提示的 0.06（“轻微”）到评分者提示的 0.42（“中等”），并且不对称性很明显：对于谄媚提示，883 个案例被管道分类为忠实，但被 Sonnet 评判器分类为不忠实，而只有 2 个案例走向相反的方向。分类器的选择也可以逆转模型排名：Qwen3.5-27B 在管道下排名第 1，但在 Sonnet 评判器下排名第 7；OLMo-3.1-32B 向相反方向移动，从第 9 名升至第 3 名。根本原因是不同的分类器以不同的严格程度（词汇提及与认知依赖）操作化相关的忠实度构念，并且这些构念对相同行为产生不同的测量。这些结果表明，已发布的忠实度数据无法在使用不同分类器的研究之间进行有意义的比较，并且未来的评估应报告多种分类方法的敏感性范围，而不是单点估计。
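
Illustrative sketch (not from the paper): the abstract reports agreement with Cohen's kappa and significance with McNemar's test. The Python below shows the standard textbook formulas for both on a 2x2 table of two binary classifiers. The kappa example uses made-up counts, because the abstract does not give the full table. The McNemar example plugs in the only discordant counts the abstract states (883 versus 2 for sycophancy hints) and uses the uncorrected chi-square form; whether the authors applied a continuity correction or an exact test is not stated, so treat this as an assumption.

```python
def cohens_kappa(a, b, c, d):
    """Cohen's kappa for two binary raters.

    a: both say faithful      b: only rater 1 says faithful
    c: only rater 2 says faithful      d: both say unfaithful
    """
    n = a + b + c + d
    p_obs = (a + d) / n
    # Chance agreement from each rater's marginal label frequencies.
    p_exp = ((a + b) * (a + c) + (c + d) * (b + d)) / (n * n)
    return (p_obs - p_exp) / (1 - p_exp)


def mcnemar_chi2(b, c):
    """Uncorrected McNemar statistic; depends only on the discordant pairs."""
    return (b - c) ** 2 / (b + c)


# Hypothetical 2x2 table, for illustration only.
kappa = cohens_kappa(a=40, b=10, c=5, d=45)

# Discordant counts quoted in the abstract for sycophancy hints:
# 883 faithful under the pipeline but unfaithful under the Sonnet judge,
# 2 in the opposite direction.
chi2 = mcnemar_chi2(883, 2)

print(round(kappa, 3), round(chi2, 1))
```

With one degree of freedom, the chi-square critical value at p = 0.001 is about 10.83, so a statistic computed from a split as lopsided as 883 versus 2 falls far beyond it. This is consistent with the abstract's description of the disagreement as systematic rather than random.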