2026-03-20

Title: Do Large Language Models Possess a Theory of Mind? A Comparative Evaluation Using the Strange Stories Paradigm

Authors: Anna Babarczy, Andras Lukacs, Peter Vedres, Zeteny Bujka
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.18007
Pdf URL: https://arxiv.org/pdf/2603.18007
Copy Paste: [[2603.18007]] Do Large Language Models Possess a Theory of Mind? A Comparative Evaluation Using the Strange Stories Paradigm(https://arxiv.org/abs/2603.18007)
Keywords: language model, gpt, llm
Abstract: The study explores whether current Large Language Models (LLMs) exhibit Theory of Mind (ToM) capabilities -- specifically, the ability to infer others' beliefs, intentions, and emotions from text. Given that LLMs are trained on language data without social embodiment or access to other manifestations of mental representations, their apparent social-cognitive reasoning raises key questions about the nature of their understanding. Are they capable of robust mental-state attribution indistinguishable from human ability in its output, or do their outputs merely reflect superficial pattern completion? To address this question, we tested five LLMs and compared their performance to that of human controls using an adapted version of a text-based tool widely used in human ToM research. The test involves answering questions about the beliefs, intentions, and emotions of story characters. The results revealed a performance gap between the models. Earlier and smaller models were strongly affected by the number of relevant inferential cues available and, to some extent, were also vulnerable to the presence of irrelevant or distracting information in the texts. In contrast, GPT-4o demonstrated high accuracy and strong robustness, performing comparably to humans even in the most challenging conditions. This work contributes to ongoing debates about the cognitive status of LLMs and the boundary between genuine understanding and statistical approximation.

Title: TherapyGym: Evaluating and Aligning Clinical Fidelity and Safety in Therapy Chatbots

Authors: Fangrui Huang, Souhad Chbeir, Arpandeep Khatua, Sheng Wang, Sijun Tan, Kenan Ye, Lily Bailey, Merryn Daniel, Ryan Louie, Sanmi Koyejo, Ehsan Adeli
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2603.18008
Pdf URL: https://arxiv.org/pdf/2603.18008
Copy Paste: [[2603.18008]] TherapyGym: Evaluating and Aligning Clinical Fidelity and Safety in Therapy Chatbots(https://arxiv.org/abs/2603.18008)
Keywords: language model, llm, chat
Abstract: Large language models (LLMs) are increasingly used for mental-health support; yet prevailing evaluation methods--fluency metrics, preference tests, and generic dialogue benchmarks--fail to capture the clinically critical dimensions of psychotherapy. We introduce THERAPYGYM, a framework that evaluates and improves therapy chatbots along two clinical pillars: fidelity and safety. Fidelity is measured using the Cognitive Therapy Rating Scale (CTRS), implemented as an automated pipeline that scores adherence to CBT techniques over multi-turn sessions. Safety is assessed using a multi-label annotation scheme, covering therapy-specific risks (e.g., failing to address harm or abuse). To mitigate bias and unreliability in LLM-based judges, we further release THERAPYJUDGEBENCH, a validation set of 116 dialogues with 1,270 expert ratings for auditing and calibration against licensed clinicians. THERAPYGYM also serves as a training harness: CTRS and safety-based rewards drive RL with configurable patient simulations spanning diverse symptom profiles. Models trained in THERAPYGYM improve on expert ratings, with average CTRS rising from 0.10 to 0.60 (and 0.16 to 0.59 under LLM judges). Our work enables scalable development of therapy chatbots that are faithful to evidence-based practice and safer in high-stakes use.

Title: How Confident Is the First Token? An Uncertainty-Calibrated Prompt Optimization Framework for Large Language Model Classification and Understanding

Authors: Wei Chen, Guoyang Ju, Yuanyuan Qi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.18009
Pdf URL: https://arxiv.org/pdf/2603.18009
Copy Paste: [[2603.18009]] How Confident Is the First Token? An Uncertainty-Calibrated Prompt Optimization Framework for Large Language Model Classification and Understanding(https://arxiv.org/abs/2603.18009)
Keywords: language model, llm, prompt, retrieval-augmented generation
Abstract: With the widespread adoption of large language models (LLMs) in natural language processing, prompt engineering and retrieval-augmented generation (RAG) have become mainstream to enhance LLMs' performance on complex tasks. However, LLMs generate outputs autoregressively, leading to inevitable output uncertainty. Since model performance is highly sensitive to prompt design, precise uncertainty measurement is crucial for reliable prompt optimization. For multi-class multiple-choice (understanding) tasks, conventional uncertainty measures (e.g., entropy) based on output probabilities treat all classes equally and ignore class prior differences in pretraining corpora. This failure to distinguish spurious confidence (from priors) from true certainty (from contextual understanding) results in poor confidence calibration. To address this, we propose Log-Scale Focal Uncertainty (LSFU), a first-token-based metric inspired by focal loss. LSFU incorporates label prior probabilities as a risk-modulation factor to suppress noise from high-frequency classes and emphasize risk for low-frequency long-tail classes, with a dynamic weighting mechanism unifying the measurement scale. Based on LSFU, we further propose the uncertainty-calibrated prompt optimization framework (UCPOF), which leverages the first token of model outputs to select high-quality exemplars and dynamically optimize prompts. Comprehensive evaluations show UCPOF improves average accuracy by 6.03% over few-shot baselines, surpasses always-on full RAG by 5.75% in overall average accuracy, and reduces the average retrieval trigger rate by 50.66%. By adaptively triggering RAG only for high-uncertainty samples, our framework significantly lowers computational costs while maintaining state-of-the-art performance.
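The adaptive-retrieval idea above can be illustrated with a minimal sketch. It uses plain Shannon entropy over a first-token class distribution, which is the conventional measure the abstract contrasts with, not LSFU itself (the abstract does not give the LSFU formula). The function names, the example probabilities, and the threshold are all assumptions made for illustration.

```python
import math


def first_token_entropy(probs):
    """Shannon entropy (in nats) of a first-token distribution over class labels."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def should_retrieve(probs, threshold):
    """Trigger retrieval only when the first-token distribution is uncertain."""
    return first_token_entropy(probs) > threshold


# Hypothetical first-token probabilities for a 4-way multiple-choice question.
confident = [0.97, 0.01, 0.01, 0.01]
uncertain = [0.25, 0.25, 0.25, 0.25]

# Hypothetical threshold: half of the maximum entropy for four classes.
threshold = 0.5 * math.log(4)

print(should_retrieve(confident, threshold))
print(should_retrieve(uncertain, threshold))
```

A prior-aware metric such as the one the abstract describes would replace `first_token_entropy`; the gating logic around it could stay the same.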

Title: Agentic Framework for Political Biography Extraction

Authors: Yifei Zhu, Songpo Yang, Jiangnan Zhu, Junyan Jiang
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2603.18010
Pdf URL: https://arxiv.org/pdf/2603.18010
Copy Paste: [[2603.18010]] Agentic Framework for Political Biography Extraction(https://arxiv.org/abs/2603.18010)
Keywords: language model, llm, agent
Abstract: The production of large-scale political datasets typically demands extracting structured facts from vast piles of unstructured documents or web sources, a task that traditionally relies on expensive human experts and remains prohibitively difficult to automate at scale. In this paper, we leverage Large Language Models (LLMs) to automate the extraction of multi-dimensional elite biographies, addressing a long-standing bottleneck in political science research. We propose a two-stage "Synthesis-Coding" framework for complex extraction tasks: an upstream synthesis stage that uses recursive agentic LLMs to search, filter, and curate biographies from heterogeneous web sources, followed by a downstream coding stage that maps the curated biographies into structured dataframes. We validate this framework through three primary results. First, we demonstrate that, when given curated contexts, LLM coders match or outperform human experts in extraction accuracy. Second, we show that in web environments, the agentic system synthesizes more information from web resources than human collective intelligence (Wikipedia). Finally, we find that coding directly from long, multi-language corpora introduces bias that the synthesis stage can alleviate by curating evidence into signal-dense representations. Through comprehensive evaluation, we provide a generalizable, scalable framework for building transparent and expandable large-scale databases in political science.

Title: DynaRAG: Bridging Static and Dynamic Knowledge in Retrieval-Augmented Generation

Authors: Penghao Liang, Mengwei Yuan, Jianan Liu, Jing Yang, Xianyou Li, Weiran Yan, Yichao Wu
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2603.18012
Pdf URL: https://arxiv.org/pdf/2603.18012
Copy Paste: [[2603.18012]] DynaRAG: Bridging Static and Dynamic Knowledge in Retrieval-Augmented Generation(https://arxiv.org/abs/2603.18012)
Keywords: llm, hallucination, retrieval-augmented generation
Abstract: We present DynaRAG, a retrieval-augmented generation (RAG) framework designed to handle both static and time-sensitive information needs through dynamic knowledge integration. Unlike traditional RAG pipelines that rely solely on static corpora, DynaRAG selectively invokes external APIs when retrieved documents are insufficient for answering a query. The system employs an LLM-based reranker to assess document relevance, a sufficiency classifier to determine when fallback is necessary, and Gorilla v2 -- a state-of-the-art API calling model -- for accurate tool invocation. We further enhance robustness by incorporating schema filtering via FAISS to guide API selection. Evaluations on the CRAG benchmark demonstrate that DynaRAG significantly improves accuracy on dynamic questions, while also reducing hallucinations. Our results highlight the importance of dynamic-aware routing and selective tool use in building reliable, real-world question-answering systems.
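The routing described above (rerank, check sufficiency, fall back to an API) can be sketched as a small control-flow function. This is only an illustration under stated assumptions: every callable below is a hypothetical stand-in, and appending the API result to the retrieved context is an assumption, since the abstract does not say how the two sources are combined.

```python
def answer_with_fallback(query, retrieve, rerank, is_sufficient, call_api, generate):
    """Route a query: use retrieved documents when sufficient, else add an API result."""
    docs = rerank(query, retrieve(query))
    if is_sufficient(query, docs):
        return generate(query, docs), "static"
    # Assumption: the API result is appended to the retrieved context.
    context = docs + [call_api(query)]
    return generate(query, context), "dynamic"


# Toy stand-ins (all hypothetical) so the sketch runs end to end.
corpus = {"capital of France": ["Paris is the capital of France."]}
retrieve = lambda q: corpus.get(q, [])
rerank = lambda q, docs: sorted(docs, key=len)
is_sufficient = lambda q, docs: len(docs) > 0
call_api = lambda q: f"live result for: {q}"
generate = lambda q, ctx: " | ".join(ctx)

static_answer = answer_with_fallback(
    "capital of France", retrieve, rerank, is_sufficient, call_api, generate
)
dynamic_answer = answer_with_fallback(
    "current EUR/USD rate", retrieve, rerank, is_sufficient, call_api, generate
)
print(static_answer)
print(dynamic_answer)
```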

Title: Learned but Not Expressed: Capability-Expression Dissociation in Large Language Models

Authors: Toshiyuki Shigemura
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.18013
Pdf URL: https://arxiv.org/pdf/2603.18013
Copy Paste: [[2603.18013]] Learned but Not Expressed: Capability-Expression Dissociation in Large Language Models(https://arxiv.org/abs/2603.18013)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) demonstrate the capacity to reconstruct and trace learned content from their training data under specific elicitation conditions, yet this capability does not manifest in standard generation contexts. This empirical observational study examines the expression of non-causal, non-implementable solution types across 300 prompt-response generations spanning narrative and problem-solving task contexts. Drawing on recent findings regarding memorization contiguity and alignment-induced discourse priors, we document a systematic dissociation between learned capability and expressed output. Across three distinct LLMs, ten task scenarios, and both creative narrative and practical advisory contexts, we documented zero instances of non-causal solution frames in generated outputs (0%, 95% CI: [0%, 1.2%]), despite verified reconstruction capability under conditional extraction. These findings challenge the prevailing assumption that training data presence directly predicts output probability, demonstrating instead that task-conditioned generation policies can comprehensively suppress learned content across diverse contexts. The results offer implications for understanding generation dynamics, output distribution control, and the behavioral boundaries of contemporary LLMs.
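The reported interval for zero observed instances can be checked with a short calculation. The sketch assumes an exact (Clopper-Pearson) two-sided binomial interval with the 300 generations as the denominator; the abstract does not state which interval method was used, so this is a plausibility check and not a reproduction of the authors' analysis.

```python
def zero_event_upper_bound(n, confidence=0.95):
    """Upper limit of the exact two-sided binomial interval when 0 of n trials succeed.

    With zero successes the lower limit is 0 and the upper limit solves
    (1 - p) ** n = alpha / 2.
    """
    alpha = 1.0 - confidence
    return 1.0 - (alpha / 2.0) ** (1.0 / n)


upper = zero_event_upper_bound(300)
print(f"95% CI upper bound for 0/300: {upper:.2%}")
```

Under these assumptions the bound comes out at roughly 1.2%, consistent with the interval quoted in the abstract.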

Title: Real-Time Trustworthiness Scoring for LLM Structured Outputs and Data Extraction

Authors: Hui Wen Goh, Jonas Mueller
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2603.18014
Pdf URL: https://arxiv.org/pdf/2603.18014
Copy Paste: [[2603.18014]] Real-Time Trustworthiness Scoring for LLM Structured Outputs and Data Extraction(https://arxiv.org/abs/2603.18014)
Keywords: gpt, llm
Abstract: Structured Outputs from current LLMs exhibit sporadic errors, hindering enterprise AI efforts from realizing their immense potential. We present CONSTRUCT, a method to score the trustworthiness of LLM Structured Outputs in real time, such that lower-scoring outputs are more likely to contain errors. This reveals the best places to focus limited human review bandwidth. CONSTRUCT additionally scores the trustworthiness of each field within an LLM Structured Output, helping reviewers quickly identify which parts of the output are wrong. Our method is suitable for any LLM (including black-box LLM APIs without logprobs such as reasoning models and Anthropic models), requires neither labeled training data nor custom model deployment, and works for complex Structured Outputs with many fields of diverse types (including nested JSON schemas). We additionally present one of the first public LLM Structured Output benchmarks with reliable ground-truth values that are not full of mistakes. Over this four-dataset benchmark, CONSTRUCT detects errors from various LLMs (including Gemini 3 and GPT-5) with significantly higher precision/recall than other scoring methods.

Title: MineDraft: A Framework for Batch Parallel Speculative Decoding

Authors: Zhenwei Tang, Arun Verma, Zijian Zhou, Zhaoxuan Wu, Alok Prakash, Daniela Rus, Bryan Kian Hsiang Low
Subjects: cs.CL, cs.AI, cs.DC, cs.LG
Abstract URL: https://arxiv.org/abs/2603.18016
Pdf URL: https://arxiv.org/pdf/2603.18016
Copy Paste: [[2603.18016]] MineDraft: A Framework for Batch Parallel Speculative Decoding(https://arxiv.org/abs/2603.18016)
Keywords: language model, llm
Abstract: Speculative decoding (SD) accelerates large language model inference by using a smaller draft model to propose draft tokens that are subsequently verified by a larger target model. However, the performance of standard SD is often limited by the strictly sequential execution of these drafting and verification stages. To address this, this paper proposes MineDraft, a batch parallel speculative decoding (PSD) framework designed to effectively hide drafting latency by overlapping it with verification. Our theoretical analysis shows that PSD is substantially more efficient than standard SD. MineDraft realizes the PSD through a novel batch-parallel design that maintains two batches of requests, overlapping drafting for one batch with verification for the other. Our experimental results show significant improvements of MineDraft in both throughput (up to 75%) and end-to-end latency (up to 39%) over standard SD. Furthermore, we have implemented MineDraft as a plugin for vLLM, demonstrating its practicality for production-ready inference systems.
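Why overlapping drafting with verification can help is easiest to see in an idealized per-round latency model. The sketch below is an assumption-laden toy, not the paper's theoretical analysis: it treats drafting and verification as fixed-cost stages, ignores synchronization overhead and contention between the two batches, and uses hypothetical timings.

```python
def sequential_round_time(t_draft, t_verify):
    """Standard SD: drafting and verification run back to back."""
    return t_draft + t_verify


def overlapped_round_time(t_draft, t_verify):
    """Idealized PSD: drafting for one batch overlaps verification of the other."""
    return max(t_draft, t_verify)


def idealized_speedup(t_draft, t_verify):
    """Ratio of sequential to overlapped round time under the toy model."""
    return sequential_round_time(t_draft, t_verify) / overlapped_round_time(t_draft, t_verify)


# Hypothetical per-round timings in milliseconds.
print(idealized_speedup(3.0, 7.0))
print(idealized_speedup(5.0, 5.0))
```

In this toy model the gain is bounded by 2x and is largest when the two stages take equal time; real gains depend on how well the stages actually overlap on shared hardware.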

Title: An Agentic System for Schema Aware NL2SQL Generation

Authors: David Onyango, Naseef Mansoor
Subjects: cs.CL, cs.DB
Abstract URL: https://arxiv.org/abs/2603.18018
Pdf URL: https://arxiv.org/pdf/2603.18018
Copy Paste: [[2603.18018]] An Agentic System for Schema Aware NL2SQL Generation(https://arxiv.org/abs/2603.18018)
Keywords: language model, llm, agent
Abstract: The natural language to SQL (NL2SQL) task plays a pivotal role in democratizing data access by enabling non-expert users to interact with relational databases through intuitive language. While recent frameworks have enhanced translation accuracy via task specialization, their reliance on Large Language Models (LLMs) raises significant concerns regarding computational overhead, data privacy, and real-world deployability in resource-constrained environments. To address these challenges, we propose a schema-based agentic system that strategically employs Small Language Models (SLMs) as primary agents, complemented by a selective LLM fallback mechanism. Because the LLM is invoked only upon detection of errors in SLM-generated output, the proposed system significantly reduces computational expenditure. Experimental results on the BIRD benchmark demonstrate that our system achieves an execution accuracy of 47.78% and a validation efficiency score of 51.05%, with over 90% cost reduction compared to LLM-centric baselines, as approximately 67% of queries are resolved using local SLMs. The system achieves an average cost per query of 0.0085 compared to 0.094 for LLM-only systems, with near-zero operational costs for locally executed queries. [Github repository: this https URL.]
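The SLM-first routing and the cost figures above can be illustrated with a short sketch. The routing function and its callables are hypothetical stand-ins, not the authors' implementation; the cost check only uses the two per-query averages quoted in the abstract, whose unit is not stated there.

```python
def generate_sql(question, slm_generate, llm_generate, has_error):
    """Try the small model first; call the large model only if an error is detected."""
    sql = slm_generate(question)
    if has_error(sql):
        return llm_generate(question), "llm"
    return sql, "slm"


def cost_reduction(system_cost, baseline_cost):
    """Fractional saving of the system relative to the baseline."""
    return 1.0 - system_cost / baseline_cost


# Per-query averages quoted in the abstract (unit not given there).
reduction = cost_reduction(0.0085, 0.094)
print(f"cost reduction: {reduction:.1%}")
```

The ratio works out to roughly 91%, which matches the "over 90%" claim.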

Title: BenchBrowser -- Collecting Evidence for Evaluating Benchmark Validity

Authors: Harshita Diddee, Gregory Yauney, Swabha Swayamdipta, Daphne Ippolito
Subjects: cs.CL, cs.AI, cs.SE
Abstract URL: https://arxiv.org/abs/2603.18019
Pdf URL: https://arxiv.org/pdf/2603.18019
Copy Paste: [[2603.18019]] BenchBrowser -- Collecting Evidence for Evaluating Benchmark Validity(https://arxiv.org/abs/2603.18019)
Keywords: language model
Abstract: Do language model benchmarks actually measure what practitioners intend them to? High-level metadata is too coarse to convey the granular reality of benchmarks: a "poetry" benchmark may never test for haikus, while "instruction-following" benchmarks will often test for an arbitrary mix of skills. This opacity makes verifying alignment with practitioner goals a laborious process, risking an illusion of competence even when models fail on untested facets of user interests. We introduce BenchBrowser, a retriever that surfaces evaluation items relevant to natural language use cases over 20 benchmark suites. Validated by a human study confirming high retrieval precision, BenchBrowser generates evidence to help practitioners diagnose low content validity (narrow coverage of a capability's facets) and low convergent validity (lack of stable rankings when measuring the same capability). BenchBrowser, thus, helps quantify a critical gap between practitioner intent and what benchmarks actually test.

Title: How LLMs Distort Our Written Language

Authors: Marwa Abdulhai, Isadora White, Yanming Wan, Ibrahim Qureshi, Joel Leibo, Max Kleiman-Weiner, Natasha Jaques
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.18161
Pdf URL: https://arxiv.org/pdf/2603.18161
Copy Paste: [[2603.18161]] How LLMs Distort Our Written Language(https://arxiv.org/abs/2603.18161)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) are used by over a billion people globally, most often to assist with writing. In this work, we demonstrate that LLMs not only alter the voice and tone of human writing, but also consistently alter the intended meaning. First, we conduct a human user study to understand how people actually interact with LLMs when using them for writing. Our findings reveal that extensive LLM use led to a nearly 70% increase in essays that remained neutral in answering the topic question. Significantly more heavy LLM users reported that the writing was less creative and not in their voice. Next, using a dataset of human-written essays that was collected in 2021 before the widespread release of LLMs, we study how asking an LLM to revise the essay based on the human-written feedback in the dataset induces large changes in the resulting content and meaning. We find that even when LLMs are prompted with expert feedback and asked to only make grammar edits, they still change the text in a way that significantly alters its semantic meaning. We then examine LLM-generated text in the wild, specifically focusing on the 21% of AI-generated scientific peer reviews at a recent top AI conference. We find that LLM-generated reviews place significantly less weight on clarity and significance of the research, and assign scores that, on average, are a full point [...]. These findings highlight a misalignment between the perceived benefit of AI use and an implicit, consistent effect on the semantics of human writing, motivating future work on how widespread AI writing will affect our cultural and scientific institutions.

Title: Modeling the human lexicon under temperature variations: linguistic factors, diversity and typicality in LLM word associations

Authors: Maria Andueza Rodriguez, Marie Candito, Richard Huyghe
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.18171
Pdf URL: https://arxiv.org/pdf/2603.18171
Copy Paste: [[2603.18171]] Modeling the human lexicon under temperature variations: linguistic factors, diversity and typicality in LLM word associations(https://arxiv.org/abs/2603.18171)
Keywords: language model, llm
Abstract: Large language models (LLMs) achieve impressive results in terms of fluency in text generation, yet the nature of their linguistic knowledge - in particular the human-likeness of their internal lexicon - remains uncertain. This study compares human and LLM-generated word associations to evaluate how accurately models capture human lexical patterns. Using English cue-response pairs from the SWOW dataset and newly generated associations from three LLMs (Mistral-7B, Llama-3.1-8B, and Qwen-2.5-32B) across multiple temperature settings, we examine (i) the influence of lexical factors such as word frequency and concreteness on cue-response pairs, and (ii) the variability and typicality of LLM responses compared to human responses. Results show that all models mirror human trends for frequency and concreteness but differ in response variability and typicality. Larger models such as Qwen tend to emulate a single "prototypical" human participant, generating highly typical but minimally variable responses, while smaller models such as Mistral and Llama produce more variable yet less typical responses. Temperature settings further influence this trade-off, with higher values increasing variability but decreasing typicality. These findings highlight both the similarities and differences between human and LLM lexicons, emphasizing the need to account for model size and temperature when probing LLM lexical representations.
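The variability side of the temperature trade-off described above follows from how temperature rescales logits before sampling. A minimal sketch, assuming standard temperature-scaled softmax sampling; the logits are hypothetical scores for three candidate responses to one cue, not values from the study.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Temperature-scaled softmax; subtracting the max keeps exp() numerically stable."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats, used here as a rough proxy for response variability."""
    return -sum(p * math.log(p) for p in probs if p > 0)


logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate associations
low_t = softmax_with_temperature(logits, 0.5)
high_t = softmax_with_temperature(logits, 1.5)

print(entropy(low_t), entropy(high_t))
```

Higher temperature spreads probability mass across candidates (more variable responses) while lowering the share of the single most likely one (less typical responses).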

Title: GRAFITE: Generative Regression Analysis Framework for Issue Tracking and Evaluation

Authors: Ja Young Lee, Mírian Silva, Mohamed Nasr, Shonda Witherspoon, Enzo Bozzani, Veronique Demers, Radha Ratnaparkhi, Hui Wu, Sara Rosenthal
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.18173
Pdf URL: https://arxiv.org/pdf/2603.18173
Copy Paste: [[2603.18173]] GRAFITE: Generative Regression Analysis Framework for Issue Tracking and Evaluation(https://arxiv.org/abs/2603.18173)
Keywords: language model, llm
Abstract: Large language models (LLMs) are largely motivated by their performance on popular topics and benchmarks at the time of their release. However, over time, contamination occurs due to significant exposure of benchmark data during training. This poses a risk of model performance inflation if testing is not carefully executed. To address this challenge, we present GRAFITE, a continuous LLM evaluation platform through a comprehensive system for maintaining and evaluating model issues. Our approach enables building a repository of model problems based on user feedback over time and offers a pipeline for assessing LLMs against these issues through quality assurance (QA) tests using LLM-as-a-judge. The platform enables side-by-side comparison of multiple models, facilitating regression detection across different releases. The platform is available at this https URL. The demo video is available at this http URL.

Title: From Noise to Signal: When Outliers Seed New Topics

Authors: Evangelia Zve, Gauvain Bourgne, Benjamin Icard, Jean-Gabriel Ganascia
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.18358
Pdf URL: https://arxiv.org/pdf/2603.18358
Copy Paste: [[2603.18358]] From Noise to Signal: When Outliers Seed New Topics(https://arxiv.org/abs/2603.18358)
Keywords: language model
Abstract: Outliers in dynamic topic modeling are typically treated as noise, yet we show that some can serve as early signals of emerging topics. We introduce a temporal taxonomy of news-document trajectories that defines how documents relate to topic formation over time. It distinguishes anticipatory outliers, which precede the topics they later join, from documents that either reinforce existing topics or remain isolated. By capturing these trajectories, the taxonomy links weak-signal detection with temporal topic modeling and clarifies how individual articles anticipate, initiate, or drift within evolving clusters. We implement it in a cumulative clustering setting using document embeddings from eleven state-of-the-art language models and evaluate it retrospectively on HydroNewsFr, a French news corpus on the hydrogen economy. Inter-model agreement reveals a small, high-consensus subset of anticipatory outliers, increasing confidence in these labels. Qualitative case studies further illustrate these trajectories through concrete topic developments.

Title: Synthetic Data Generation for Training Diversified Commonsense Reasoning Models

Authors: Tianhui Zhang, Bei Peng, Danushka Bollegala
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.18361
Pdf URL: https://arxiv.org/pdf/2603.18361
Copy Paste: [[2603.18361]] Synthetic Data Generation for Training Diversified Commonsense Reasoning Models(https://arxiv.org/abs/2603.18361)
Keywords: language model, llm, agent
Abstract: Conversational agents are required to respond to their users not only with high-quality (i.e., commonsense-bearing) responses, but also to consider multiple plausible alternative scenarios, reflecting diversity in their responses. Despite the growing need to train diverse commonsense generators, the progress of this line of work has been significantly hindered by the lack of large-scale, high-quality, diverse commonsense training datasets. Due to the high annotation costs, existing Generative Commonsense Reasoning (GCR) datasets are created using a small number of human annotators, covering only a narrow set of commonsense scenarios. To address this training resource gap, we propose a two-stage method to create CommonSyn, the first-ever synthetic dataset for diversified GCR. Models fine-tuned on our synthetic data jointly increase both generation diversity and quality compared with vanilla models and models fine-tuned on human-crafted datasets, across Large Language Models (LLMs) of different sizes.

Title: PowerFlow: Unlocking the Dual Nature of LLMs via Principled Distribution Matching

Authors: Ruishuo Chen, Yu Chen, Zhuoran Li, Longbo Huang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.18363
Pdf URL: https://arxiv.org/pdf/2603.18363
Copy Paste: [[2603.18363]] PowerFlow: Unlocking the Dual Nature of LLMs via Principled Distribution Matching(https://arxiv.org/abs/2603.18363)
Keywords: language model, llm
Abstract: Unsupervised Reinforcement Learning from Internal Feedback (RLIF) has emerged as a promising paradigm for eliciting the latent capabilities of Large Language Models (LLMs) without external supervision. However, current methods rely on heuristic intrinsic rewards, which often lack a well-defined theoretical optimization target and are prone to degenerative biases. In this work, we introduce PowerFlow, a principled framework that reformulates unsupervised fine-tuning as a distribution matching problem. By casting GFlowNet as an amortized variational sampler for unnormalized densities, we propose a length-aware Trajectory-Balance objective that explicitly neutralizes the structural length biases inherent in autoregressive generation. By targeting $\alpha$-power distributions, PowerFlow enables the directional elicitation of the dual nature of LLMs: sharpening the distribution ($\alpha > 1$) to intensify logical reasoning, or flattening it ($\alpha < 1$) to unlock expressive creativity. Extensive experiments demonstrate that PowerFlow consistently outperforms existing RLIF methods, matching or even exceeding supervised GRPO. Furthermore, by mitigating over-sharpening in aligned models, our approach achieves simultaneous gains in diversity and quality, shifting the Pareto frontier in creative tasks.
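The effect of targeting an $\alpha$-power distribution can be illustrated on a toy categorical distribution. The sketch below is not PowerFlow itself (the abstract describes a GFlowNet-based objective over full sequences); it only shows what raising probabilities to a power and renormalizing does. The function name `power_distribution` and the example probabilities are illustrative assumptions.

```python
def power_distribution(probs, alpha):
    """Return the normalized alpha-power of a discrete distribution.

    Illustrative only: p_i ** alpha, renormalized to sum to 1.
    """
    powered = [p ** alpha for p in probs]
    total = sum(powered)
    return [p / total for p in powered]


base = [0.5, 0.3, 0.2]                   # toy distribution (assumed values)
sharp = power_distribution(base, 2.0)    # alpha > 1 concentrates mass on the mode
flat = power_distribution(base, 0.5)     # alpha < 1 moves mass toward uniform

print([round(p, 3) for p in sharp])
print([round(p, 3) for p in flat])
```

With `alpha = 1` the distribution is returned unchanged, which is a quick sanity check for the helper.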

Title: AutoScreen-FW: An LLM-based Framework for Resume Screening

Authors: Zhelin Xu, Shuhei Yamamoto, Atsuyuki Morishima
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.18390
Pdf URL: https://arxiv.org/pdf/2603.18390
Copy Paste: [[2603.18390]] AutoScreen-FW: An LLM-based Framework for Resume Screening(https://arxiv.org/abs/2603.18390)
Keywords: gpt, llm
Abstract: Corporate recruiters often need to screen many resumes within a limited time, which increases their burden and may cause suitable candidates to be overlooked. To address these challenges, prior work has explored LLM-based automated resume screening. However, some methods rely on commercial LLMs, which may pose data privacy risks. Moreover, since companies typically do not make resumes with evaluation results publicly available, it remains unclear which resume samples should be used during learning to improve an LLM's judgment performance. To address these problems, we propose AutoScreen-FW, an LLM-based framework for local, automatic resume screening. AutoScreen-FW uses several methods to select a small set of representative resume samples. These samples are used for in-context learning together with a persona description and evaluation criteria, enabling open-source LLMs to act as a career advisor and evaluate unseen resumes. Experiments with multiple ground truths show that the open-source LLM judges consistently outperform GPT-5-nano. Under one ground-truth setting, they also surpass GPT-5-mini. Although they are slightly weaker than GPT-5-mini under other ground-truth settings, they run substantially faster per resume than commercial GPT models. These findings indicate the potential for deploying AutoScreen-FW locally in companies to support efficient screening while reducing recruiters' burden.

Title: TopoChunker: Topology-Aware Agentic Document Chunking Framework

Authors: Xiaoyu Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.18409
Pdf URL: https://arxiv.org/pdf/2603.18409
Copy Paste: [[2603.18409]] TopoChunker: Topology-Aware Agentic Document Chunking Framework(https://arxiv.org/abs/2603.18409)
Keywords: llm, retrieval-augmented generation, agent
Abstract: Current document chunking methods for Retrieval-Augmented Generation (RAG) typically linearize text. This forced linearization strips away intrinsic topological hierarchies, creating "semantic fragmentation" that degrades downstream retrieval quality. In this paper, we propose TopoChunker, an agentic framework that maps heterogeneous documents onto a Structured Intermediate Representation (SIR) to explicitly preserve cross-segment dependencies. To balance structural fidelity with computational cost, TopoChunker employs a dual-agent architecture. An Inspector Agent dynamically routes documents through cost-optimized extraction paths, while a Refiner Agent performs capacity auditing and topological context disambiguation to reconstruct hierarchical lineage. Evaluated on unstructured narratives (GutenQA) and complex reports (GovReport), TopoChunker demonstrates state-of-the-art performance. It outperforms the strongest LLM-based baseline by 8.0% in absolute generation accuracy and achieves an 83.26% Recall@3, while simultaneously reducing token overhead by 23.5%, offering a scalable approach for structure-aware RAG.
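For readers unfamiliar with the Recall@3 figure quoted above, the following is a minimal sketch of one common definition of Recall@k for chunk retrieval. The abstract does not state the exact formula the authors use, so this definition, the function name `recall_at_k`, and the chunk identifiers are assumptions for illustration.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant chunks that appear among the top-k retrieved chunks."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("need at least one relevant chunk")
    top_k = set(ranked_ids[:k])
    return len(top_k & relevant) / len(relevant)


# Hypothetical retrieval result for one query (ids are made up).
ranked = ["c7", "c2", "c9", "c4"]
relevant = ["c2", "c4"]
print(recall_at_k(ranked, relevant, k=3))  # only "c2" is in the top 3
```

A corpus-level score would typically average this value over all queries.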

Title: TARo: Token-level Adaptive Routing for LLM Test-time Alignment

Authors: Arushi Rai, Qiang Zhang, Hanqing Zeng, Yunkai Zhang, Dipesh Tamboli, Xiangjun Fan, Zhuokai Zhao
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.18411
Pdf URL: https://arxiv.org/pdf/2603.18411
Copy Paste: [[2603.18411]] TARo: Token-level Adaptive Routing for LLM Test-time Alignment(https://arxiv.org/abs/2603.18411)
Keywords: language model, llm
Abstract: Large language models (LLMs) exhibit strong reasoning capabilities but typically require expensive post-training to reach high performance. Recent test-time alignment methods offer a lightweight alternative, but have been explored mainly for preference alignment rather than reasoning. To bridge this gap, we propose Token-level Adaptive Routing (TARo), which steers frozen LLMs toward structured reasoning entirely at inference time. Specifically, we first train reward models on step-wise mathematical traces to capture fine-grained logical consistency signals, then introduce a learnable token-level router that automatically controls the guidance of the reward model to the base model. Extensive experiments show that TARo significantly improves reasoning performance by up to +22.4% over the base model and +8.4% over existing token-level test-time alignment methods, while also boosting out-of-distribution clinical reasoning (MedXpertQA) and instruction following (AlpacaEval). Furthermore, TARo generalizes from small to large backbones without retraining, extending test-time alignment from preference optimization to robust, cross-domain reasoning.

Title: Multimodal Task Interference: A Benchmark and Analysis of History-Target Mismatch in Multimodal LLMs

Authors: Masayuki Kawarada, Tatsuya Ishigaki, Hiroya Takamura
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.18425
Pdf URL: https://arxiv.org/pdf/2603.18425
Copy Paste: [[2603.18425]] Multimodal Task Interference: A Benchmark and Analysis of History-Target Mismatch in Multimodal LLMs(https://arxiv.org/abs/2603.18425)
Keywords: llm
Abstract: Task interference, the performance degradation caused by task switches within a single conversation, has been studied exclusively in text-only settings despite the growing prevalence of multimodal dialogue systems. We introduce a benchmark for evaluating this phenomenon in multimodal LLMs, covering six tasks across text and vision with systematic variation of the history-target mismatch along three axes: modality mismatch, reasoning mismatch, and answer format mismatch. Experiments on both open-weights and proprietary models reveal that task interference is highly directional: switching from text-only to image-based targets causes severe performance drops, while the reverse transition yields minimal degradation. Interference is further amplified when mismatches co-occur across multiple dimensions, and is driven most strongly by modality differences, followed by answer format, while reasoning requirement shifts cause minimal degradation.

Title: Adaptive Decoding via Test-Time Policy Learning for Self-Improving Generation

Authors: Asmita Bhardwaj, Yuya Jeremy Ong, Eelaaf Zahid, Basel Shbita
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.18428
Pdf URL: https://arxiv.org/pdf/2603.18428
Copy Paste: [[2603.18428]] Adaptive Decoding via Test-Time Policy Learning for Self-Improving Generation(https://arxiv.org/abs/2603.18428)
Keywords: language model, llm
Abstract: Decoding strategies largely determine the quality of Large Language Model (LLM) outputs, yet widely used heuristics such as greedy or fixed temperature/top-p decoding are static and often task-agnostic, leading to suboptimal or inconsistent generation quality across domains that demand stylistic or structural flexibility. We introduce a reinforcement learning-based decoder sampler that treats decoding as sequential decision-making and learns a lightweight policy to adjust sampling parameters at test-time while keeping LLM weights frozen. We evaluate on summarization datasets, including BookSum, arXiv, and WikiHow, using Granite-3.3-2B and Qwen-2.5-0.5B. Our policy sampler consistently outperforms greedy and static baselines, achieving relative gains of up to +88% (BookSum, Granite) and +79% (WikiHow, Qwen). Reward ablations show that overlap-only objectives underperform compared to composite rewards, while structured shaping terms (length, coverage, repetition, completeness) enable stable and sustained improvements. These findings highlight reinforcement learning as a practical mechanism for test-time adaptation in decoding, enabling domain-aware and user-controllable generation without retraining large models.

Title: UT-ACA: Uncertainty-Triggered Adaptive Context Allocation for Long-Context Inference

Authors: Lang Zhou, Shuxuan Li, Zhuohao Li, Shi Liu, Zhilin Zhao, Wei-Shi Zheng
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2603.18446
Pdf URL: https://arxiv.org/pdf/2603.18446
Copy Paste: [[2603.18446]] UT-ACA: Uncertainty-Triggered Adaptive Context Allocation for Long-Context Inference(https://arxiv.org/abs/2603.18446)
Keywords: language model
Abstract: Long-context inference remains challenging for large language models due to attention dilution and out-of-distribution degradation. Context selection mitigates this limitation by attending to a subset of key-value cache entries, yet most methods allocate a fixed context budget throughout decoding despite highly non-uniform token-level contextual demands. To address this issue, we propose Uncertainty-Triggered Adaptive Context Allocation (UT-ACA), an inference-time framework that dynamically adjusts the context window based on token-wise uncertainty. UT-ACA learns an uncertainty detector that combines semantic embeddings with logit-based confidence while accounting for uncertainty accumulation across decoding steps. When insufficient evidence is indicated, UT-ACA selectively rolls back, expands the context window, and regenerates the token with additional support. Experiments show that UT-ACA substantially reduces average context usage while preserving generation quality in long-context settings.

Title: GAIN: A Benchmark for Goal-Aligned Decision-Making of Large Language Models under Imperfect Norms

Authors: Masayuki Kawarada, Kodai Watanabe, Soichiro Murakami
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.18469
Pdf URL: https://arxiv.org/pdf/2603.18469
Copy Paste: [[2603.18469]] GAIN: A Benchmark for Goal-Aligned Decision-Making of Large Language Models under Imperfect Norms(https://arxiv.org/abs/2603.18469)
Keywords: language model, llm
Abstract: We introduce GAIN (Goal-Aligned Decision-Making under Imperfect Norms), a benchmark designed to evaluate how large language models (LLMs) balance adherence to norms against business goals. Existing benchmarks typically focus on abstract scenarios rather than real-world business applications. Furthermore, they provide limited insights into the factors influencing LLM decision-making. This restricts their ability to measure models' adaptability to complex, real-world norm-goal conflicts. In GAIN, models receive a goal, a specific situation, a norm, and additional contextual pressures. These pressures, explicitly designed to encourage potential norm deviations, are a unique feature that differentiates GAIN from other benchmarks, enabling a systematic evaluation of the factors influencing decision-making. We define five types of pressures: Goal Alignment, Risk Aversion, Emotional/Ethical Appeal, Social/Authoritative Influence, and Personal Incentive. The benchmark comprises 1,200 scenarios across four domains: hiring, customer support, advertising and finance. Our experiments show that advanced LLMs frequently mirror human decision-making patterns. However, when Personal Incentive pressure is present, they diverge significantly, showing a strong tendency to adhere to norms rather than deviate from them.

Title: WASD: Locating Critical Neurons as Sufficient Conditions for Explaining and Controlling LLM Behavior

Authors: Haonan Yu, Junhao Liu, Zhenyu Yan, Haoran Lin, Xin Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.18474
Pdf URL: https://arxiv.org/pdf/2603.18474
Copy Paste: [[2603.18474]] WASD: Locating Critical Neurons as Sufficient Conditions for Explaining and Controlling LLM Behavior(https://arxiv.org/abs/2603.18474)
Keywords: language model, llm
Abstract: Precise behavioral control of large language models (LLMs) is critical for complex applications. However, existing methods often incur high training costs, lack natural language controllability, or compromise semantic coherence. To bridge this gap, we propose WASD (unWeaving Actionable Sufficient Directives), a novel framework that explains model behavior by identifying sufficient neural conditions for token generation. Our method represents candidate conditions as neuron-activation predicates and iteratively searches for a minimal set that guarantees the current output under input perturbations. Experiments on SST-2 and CounterFact with the Gemma-2-2B model demonstrate that our approach produces explanations that are more stable, accurate, and concise than conventional attribution graphs. Moreover, through a case study on controlling cross-lingual output generation, we validated the practical effectiveness of WASD in controlling model behavior.

Title: The Truncation Blind Spot: How Decoding Strategies Systematically Exclude Human-Like Token Choices

Authors: Esteban Garces Arias, Nurzhan Sapargali, Christian Heumann, Matthias Aßenmacher
Subjects: cs.CL, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2603.18482
Pdf URL: https://arxiv.org/pdf/2603.18482
Copy Paste: [[2603.18482]] The Truncation Blind Spot: How Decoding Strategies Systematically Exclude Human-Like Token Choices(https://arxiv.org/abs/2603.18482)
Keywords: language model
Abstract: Standard decoding strategies for text generation, including top-k, nucleus sampling, and contrastive search, select tokens based on likelihood, restricting selection to high-probability regions. Human language production operates differently: tokens are chosen for communicative appropriateness rather than statistical frequency. This mismatch creates a truncation blind spot: contextually appropriate but statistically rare tokens remain accessible to humans yet unreachable by likelihood-based decoding. We hypothesize this contributes to the detectability of machine-generated text. Analyzing over 1.8 million texts across eight language models, five decoding strategies, and 53 hyperparameter configurations, we find that 8-18% of human-selected tokens fall outside typical truncation boundaries. Simple classifiers trained on predictability and lexical diversity achieve remarkable detection rates. Crucially, neither model scale nor architecture correlates strongly with detectability; truncation parameters account for most variance. Configurations achieving low detectability often produce incoherent text, indicating that evading detection and producing natural text are distinct objectives. These findings suggest detectability is enhanced by likelihood-based token selection, not merely a matter of model capability.
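The truncation blind spot described above can be made concrete with a small sketch: given a next-token distribution, compute which tokens survive nucleus (top-p) or top-k truncation and check whether a human-chosen token is still reachable. The vocabulary, probabilities, and the "human choice" below are invented for illustration and are not taken from the paper.

```python
def nucleus(probs, top_p):
    """Return the set of tokens kept by nucleus (top-p) truncation."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, mass = set(), 0.0
    for token, p in ranked:
        kept.add(token)
        mass += p
        if mass >= top_p:  # stop once the cumulative mass reaches top_p
            break
    return kept


def top_k(probs, k):
    """Return the set of the k most probable tokens."""
    ranked = sorted(probs, key=probs.get, reverse=True)
    return set(ranked[:k])


# Toy next-token distribution (assumed values).
probs = {"the": 0.55, "a": 0.30, "this": 0.10, "yonder": 0.05}
human_choice = "yonder"  # contextually fine, statistically rare

reachable_p = human_choice in nucleus(probs, top_p=0.9)
reachable_k = human_choice in top_k(probs, k=2)
print(reachable_p, reachable_k)
```

Counting how often a human-written token falls outside the kept set, over many positions, gives the kind of out-of-boundary rate the abstract reports.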

Title: EntropyCache: Decoded Token Entropy Guided KV Caching for Diffusion Language Models

Authors: Minsoo Cheong, Donghyun Son, Woosang Lim, Sungjoo Yoo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.18489
Pdf URL: https://arxiv.org/pdf/2603.18489
Copy Paste: [[2603.18489]] EntropyCache: Decoded Token Entropy Guided KV Caching for Diffusion Language Models(https://arxiv.org/abs/2603.18489)
Keywords: language model, llm, chain-of-thought
Abstract: Diffusion-based large language models (dLLMs) rely on bidirectional attention, which prevents lossless KV caching and requires a full forward pass at every denoising step. Existing approximate KV caching methods reduce this cost by selectively updating cached states, but their decision overhead scales with context length or model depth. We propose EntropyCache, a training-free KV caching method that uses the maximum entropy of newly decoded token distributions as a constant-cost signal for deciding when to recompute. Our design is grounded in two empirical observations: (1) decoded token entropy correlates with KV cache drift, providing a cheap proxy for cache staleness, and (2) feature volatility of decoded tokens persists for multiple steps after unmasking, motivating recomputation of the $k$ most recently decoded tokens. The skip-or-recompute decision requires only $O(V)$ computation per step, independent of context length and model scale. Experiments on LLaDA-8B-Instruct and Dream-7B-Instruct show that EntropyCache achieves $15.2\times$-$26.4\times$ speedup on standard benchmarks and $22.4\times$-$24.1\times$ on chain-of-thought benchmarks, with competitive accuracy and decision overhead accounting for only $0.5\%$ of inference time. Code is available at this https URL.
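A minimal sketch of the skip-or-recompute signal described above: compute the entropy of each newly decoded token's distribution (a single pass over the vocabulary, hence $O(V)$ per token) and compare the maximum to a threshold. The threshold value, the direction of the comparison (higher entropy triggers recomputation), and the function names are assumptions; the abstract does not give the exact decision rule.

```python
import math


def entropy(dist):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0.0)


def should_recompute(new_token_dists, threshold):
    """Assumed rule: recompute when the most uncertain new token exceeds the threshold."""
    return max(entropy(d) for d in new_token_dists) > threshold


# Toy distributions over a 3-token vocabulary (assumed values).
confident = [[0.97, 0.02, 0.01], [0.90, 0.05, 0.05]]
uncertain = confident + [[0.40, 0.30, 0.30]]

print(should_recompute(confident, threshold=0.5))
print(should_recompute(uncertain, threshold=0.5))
```

Because the check depends only on the distributions of the tokens just decoded, its cost does not grow with context length, which matches the constant-cost property the abstract emphasizes.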

Title: When Names Change Verdicts: Intervention Consistency Reveals Systematic Bias in LLM Decision-Making

Authors: Abhinaba Basu, Pavan Chakraborty
Subjects: cs.CL, cs.AI, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2603.18530
Pdf URL: https://arxiv.org/pdf/2603.18530
Copy Paste: [[2603.18530]] When Names Change Verdicts: Intervention Consistency Reveals Systematic Bias in LLM Decision-Making(https://arxiv.org/abs/2603.18530)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) are increasingly used for high-stakes decisions, yet their susceptibility to spurious features remains poorly characterized. We introduce ICE-Guard, a framework applying intervention consistency testing to detect three types of spurious feature reliance: demographic (name/race swaps), authority (credential/prestige swaps), and framing (positive/negative restatements). Across 3,000 vignettes spanning 10 high-stakes domains, we evaluate 11 LLMs from 8 families and find that (1) authority bias (mean 5.8%) and framing bias (5.0%) substantially exceed demographic bias (2.2%), challenging the field's narrow focus on demographics; (2) bias concentrates in specific domains -- finance shows 22.6% authority bias while criminal justice shows only 2.8%; (3) structured decomposition, where the LLM extracts features and a deterministic rubric decides, reduces flip rates by up to 100% (median 49% across 9 models). We demonstrate an ICE-guided detect-diagnose-mitigate-verify loop achieving cumulative 78% bias reduction via iterative prompt patching. Validation against real COMPAS recidivism data shows COMPAS-derived flip rates exceed pooled synthetic rates, suggesting our benchmark provides a conservative estimate of real-world bias. Code and data are publicly available.
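The flip rates quoted above can be read as the share of vignettes whose decision changes when only a spurious feature is swapped. The sketch below assumes that simple definition; the abstract does not spell out the exact metric, and the decisions shown are invented.

```python
def flip_rate(pairs):
    """Fraction of (original, intervened) decision pairs whose decisions differ."""
    if not pairs:
        raise ValueError("need at least one decision pair")
    flips = sum(1 for original, intervened in pairs if original != intervened)
    return flips / len(pairs)


# Hypothetical decisions before and after a name swap on four vignettes.
name_swap = [
    ("approve", "approve"),
    ("deny", "deny"),
    ("approve", "deny"),   # the decision flipped under the intervention
    ("approve", "approve"),
]
print(flip_rate(name_swap))
```

Computing this separately per intervention type (demographic, authority, framing) and per domain would yield a breakdown of the kind reported in the abstract.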

Title: Cross-Lingual LLM-Judge Transfer via Evaluation Decomposition

Authors: Ivaxi Sheth, Zeno Jonke, Amin Mantrach, Saab Mansour
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.18557
Pdf URL: https://arxiv.org/pdf/2603.18557
Copy Paste: [[2603.18557]] Cross-Lingual LLM-Judge Transfer via Evaluation Decomposition(https://arxiv.org/abs/2603.18557)
Keywords: language model, llm
Abstract: As large language models are increasingly deployed across diverse real-world applications, extending automated evaluation beyond English has become a critical challenge. Existing evaluation approaches are predominantly English-focused, and adapting them to other languages is hindered by the scarcity and cost of human-annotated judgments in most languages. We introduce a decomposition-based evaluation framework built around a Universal Criteria Set (UCS). UCS consists of a shared, language-agnostic set of evaluation dimensions, producing an interpretable intermediate representation that supports cross-lingual transfer with minimal supervision. Experiments on multiple faithfulness tasks across languages and model backbones demonstrate consistent improvements over strong baselines without requiring target-language annotations.

Title: ICE: Intervention-Consistent Explanation Evaluation with Statistical Grounding for LLMs

Authors: Abhinaba Basu, Pavan Chakraborty
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.18579
Pdf URL: https://arxiv.org/pdf/2603.18579
Copy Paste: [[2603.18579]] ICE: Intervention-Consistent Explanation Evaluation with Statistical Grounding for LLMs(https://arxiv.org/abs/2603.18579)
Keywords: llm
Abstract: Evaluating whether explanations faithfully reflect a model's reasoning remains an open problem. Existing benchmarks use single interventions without statistical testing, making it impossible to distinguish genuine faithfulness from chance-level performance. We introduce ICE (Intervention-Consistent Explanation), a framework that compares explanations against matched random baselines via randomization tests under multiple intervention operators, yielding win rates with confidence intervals. Evaluating 7 LLMs across 4 English tasks, 6 non-English languages, and 2 attribution methods, we find that faithfulness is operator-dependent: operator gaps reach up to 44 percentage points, with deletion typically inflating estimates on short text but the pattern reversing on long text, suggesting that faithfulness should be interpreted comparatively across intervention operators rather than as a single score. Randomized baselines reveal anti-faithfulness in one-third of configurations, and faithfulness shows zero correlation with human plausibility (|r| < 0.04). Multilingual evaluation reveals dramatic model-language interactions not explained by tokenization alone. We release the ICE framework and ICEBench benchmark.

Title: Language Model Maps for Prompt-Response Distributions via Log-Likelihood Vectors

Authors: Yusuke Takase, Momose Oyama, Hidetoshi Shimodaira
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.18593
Pdf URL: https://arxiv.org/pdf/2603.18593
Copy Paste: [[2603.18593]] Language Model Maps for Prompt-Response Distributions via Log-Likelihood Vectors(https://arxiv.org/abs/2603.18593)
Keywords: language model, prompt
Abstract: We propose a method that represents language models by log-likelihood vectors over prompt-response pairs and constructs model maps for comparing their conditional distributions. In this space, distances between models approximate the KL divergence between the corresponding conditional distributions. Experiments on a large collection of publicly available language models show that the maps capture meaningful global structure, including relationships to model attributes and task performance. The method also captures systematic shifts induced by prompt modifications and their approximate additive compositionality, suggesting a way to analyze and predict the effects of composite prompt operations. We further introduce pointwise mutual information (PMI) vectors to reduce the influence of unconditional distributions; in some cases, PMI-based model maps better reflect training-data-related differences. Overall, the framework supports the analysis of input-dependent model behavior.
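As a rough illustration of the representation described above, each model can be summarized by a vector of log-likelihoods, one entry per prompt-response pair, and models can then be compared by the distance between their vectors. The sketch uses a plain mean squared difference; any centering or scaling the authors apply so that distances approximate KL divergence is not given in the abstract and is not reproduced here. All log-likelihood values are invented.

```python
def mean_squared_distance(u, v):
    """Mean squared difference between two log-likelihood vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("vectors must cover the same prompt-response pairs")
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)


# Hypothetical log p(response | prompt) for three shared prompt-response pairs.
model_a = [-12.0, -8.5, -20.1]
model_b = [-12.4, -8.9, -19.8]   # behaves much like model_a
model_c = [-18.0, -15.2, -30.5]  # assigns very different likelihoods

print(mean_squared_distance(model_a, model_b))
print(mean_squared_distance(model_a, model_c))
```

Placing many models in this space and projecting to two dimensions is one way to obtain a "model map" in the sense the abstract describes.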

Title: Learning to Self-Evolve

Authors: Xiaoyin Chen, Canwen Xu, Yite Wang, Boyi Liu, Zhewei Yao, Yuxiong He
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.18620
Pdf URL: https://arxiv.org/pdf/2603.18620
Copy Paste: [[2603.18620]] Learning to Self-Evolve(https://arxiv.org/abs/2603.18620)
Keywords: language model, gpt, llm, prompt
Abstract: We introduce Learning to Self-Evolve (LSE), a reinforcement learning framework that trains large language models (LLMs) to improve their own contexts at test time. We situate LSE in the setting of test-time self-evolution, where a model iteratively refines its context from feedback on seen problems to perform better on new ones. Existing approaches rely entirely on the inherent reasoning ability of the model and never explicitly train it for this task. LSE reduces the multi-step evolution problem to a single-step RL objective, where each context edit is rewarded by the improvement in downstream performance. We pair this objective with a tree-guided evolution loop. On Text-to-SQL generation (BIRD) and general question answering (MMLU-Redux), a 4B-parameter model trained with LSE outperforms self-evolving policies powered by GPT-5 and Claude Sonnet 4.5, as well as prompt optimization methods including GEPA and TextGrad, and transfers to guide other models without additional training. Our results highlight the effectiveness of treating self-evolution as a learnable skill.

Title: A Comparative Empirical Study of Catastrophic Forgetting Mitigation in Sequential Task Adaptation for Continual Natural Language Processing Systems

Authors: Aram Abrahamyan, Sachin Kumar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.18641
Pdf URL: https://arxiv.org/pdf/2603.18641
Copy Paste: [[2603.18641]] A Comparative Empirical Study of Catastrophic Forgetting Mitigation in Sequential Task Adaptation for Continual Natural Language Processing Systems(https://arxiv.org/abs/2603.18641)
Keywords: language model
Abstract: Neural language models deployed in real-world applications must continually adapt to new tasks and domains without forgetting previously acquired knowledge. This work presents a comparative empirical study of catastrophic forgetting mitigation in continual intent classification. Using the CLINC150 dataset, we construct a 10-task label-disjoint scenario and evaluate three backbone architectures: a feed-forward Artificial Neural Network (ANN), a Gated Recurrent Unit (GRU), and a Transformer encoder, under a range of continual learning (CL) strategies. We consider one representative method from each major CL family: replay-based Maximally Interfered Retrieval (MIR), regularization-based Learning without Forgetting (LwF), and parameter-isolation via Hard Attention to Task (HAT), both individually and in all pairwise and triple combinations. Performance is assessed with average accuracy, macro F1, and backward transfer, capturing the stability-plasticity trade-off across the task sequence. Our results show that naive sequential fine-tuning suffers from severe forgetting for all architectures and that no single CL method fully prevents it. Replay emerges as a key ingredient: MIR is the most reliable individual strategy, and combinations that include replay (MIR+HAT, MIR+LwF, MIR+LwF+HAT) consistently achieve high final performance with near-zero or mildly positive backward transfer. The optimal configuration is architecture-dependent: MIR+HAT yields the best result for ANN and Transformer, whereas MIR+LwF+HAT works best for GRU. In several cases CL methods even surpass joint training, indicating a regularization effect. These findings highlight the importance of jointly selecting backbone architecture and CL mechanism when designing continual intent-classification systems.
摘要：现实应用中部署的神经语言模型必须不断适应新的任务和领域，而不会忘记以前获得的知识。这项工作提出了连续意图分类中灾难性遗忘缓解的比较实证研究。使用 CLINC150 数据集，我们构建了一个 10 任务标签不相交场景，并在一系列持续学习 (CL) 策略下评估了三种骨干架构：前馈人工神经网络 (ANN)、门控循环单元 (GRU) 和 Transformer 编码器。我们考虑每个主要 CL 系列的一种代表性方法：基于重放的最大干扰检索 (MIR)、基于正则化的无遗忘学习 (LwF) 以及通过任务硬注意力 (HAT) 进行参数隔离，无论是单独的还是在所有成对和三重组合中。通过平均准确度、宏 F1 和向后迁移来评估性能，捕获整个任务序列中稳定性与可塑性的权衡。我们的结果表明，对于所有架构来说，朴素的顺序微调都会遭受严重的遗忘，并且没有任何一种 CL 方法可以完全阻止它。重放成为一个关键因素：MIR 是最可靠的个体策略，并且包括重放（MIR+HAT、MIR+LwF、MIR+LwF+HAT）的组合始终能够实现较高的最终性能，并且向后传输接近于零或轻微正向。最佳配置取决于架构。 MIR+HAT 对于 ANN 和 Transformer 产生最好的结果，另一方面，MIR+LwF+HAT 对于 GRU 效果最好，并且在某些情况下 CL 方法甚至超过联合训练，表明正则化效果。这些发现强调了在设计持续意图分类系统时联合选择主干架构和 CL 机制的重要性。
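
The abstract reports average accuracy and backward transfer without defining them. The sketch below uses the definitions commonly found in the continual-learning literature, which is an assumption here rather than something the abstract confirms: `acc[t][i]` is accuracy on task `i` after training through task `t`, and backward transfer is the mean change on earlier tasks between the moment each was learned and the end of the sequence. The 3-task matrix is hypothetical and is not data from the paper.

```python
from typing import List


def average_accuracy(acc: List[List[float]]) -> float:
    """Mean accuracy over all tasks, measured after the final task is learned."""
    final_row = acc[-1]
    return sum(final_row) / len(final_row)


def backward_transfer(acc: List[List[float]]) -> float:
    """Mean of acc[T-1][i] - acc[i][i] over all tasks i except the last.

    Negative values indicate forgetting; values near zero indicate stability.
    """
    num_tasks = len(acc)
    if num_tasks < 2:
        raise ValueError("backward transfer needs at least two tasks")
    deltas = [acc[num_tasks - 1][i] - acc[i][i] for i in range(num_tasks - 1)]
    return sum(deltas) / len(deltas)


# Hypothetical accuracy matrix (rows: training stage, columns: evaluated task).
acc = [
    [0.90, 0.00, 0.00],
    [0.70, 0.80, 0.00],
    [0.60, 0.70, 0.90],
]

print(f"average accuracy:  {average_accuracy(acc):.3f}")
print(f"backward transfer: {backward_transfer(acc):.3f}")
```

In this toy matrix both earlier tasks lose accuracy by the end, so backward transfer is negative; the "near-zero or mildly positive" values mentioned in the abstract would correspond to earlier-task columns that hold steady or improve.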

Title: Automatic detection of Gen-AI texts: A comparative framework of neural models

Authors: Cristian Buttaro, Irene Amerini
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.18750
Pdf URL: https://arxiv.org/pdf/2603.18750
Copy Paste: [[2603.18750]] Automatic detection of Gen-AI texts: A comparative framework of neural models(https://arxiv.org/abs/2603.18750)
Keywords: language model, gpt
Abstract: The rapid proliferation of Large Language Models has significantly increased the difficulty of distinguishing between human-written and AI generated texts, raising critical issues across academic, editorial, and social domains. This paper investigates the problem of AI generated text detection through the design, implementation, and comparative evaluation of multiple machine learning based detectors. Four neural architectures are developed and analyzed: a Multilayer Perceptron, a one-dimensional Convolutional Neural Network, a MobileNet-based CNN, and a Transformer model. The proposed models are benchmarked against widely used online detectors, including ZeroGPT, GPTZero, QuillBot, this http URL, Sapling, IsGen, Rephrase, and Writer. Experiments are conducted on the COLING Multilingual Dataset, considering both English and Italian configurations, as well as on an original thematic dataset focused on Art and Mental Health. Results show that supervised detectors achieve more stable and robust performance than commercial tools across different languages and domains, highlighting key strengths and limitations of current detection strategies.
摘要：大型语言模型的快速扩散大大增加了区分人类编写的文本和人工智能生成的文本的难度，引发了学术、编辑和社会领域的关键问题。本文通过设计、实现和比较评估多个基于机器学习的检测器来研究人工智能生成的文本检测问题。开发并分析了四种神经架构：多层感知器、一维卷积神经网络、基于 MobileNet 的 CNN 和 Transformer 模型。所提出的模型以广泛使用的在线检测器为基准，包括 ZeroGPT、GPTZero、QuillBot、this http URL、Sapling、IsGen、Rephrase 和 Writer。在 COLING 多语言数据集上进行了实验，考虑了英语和意大利语配置，以及专注于艺术和心理健康的原始主题数据集。结果表明，监督检测器比跨不同语言和领域的商业工具实现了更稳定、更强大的性能，凸显了当前检测策略的关键优势和局限性。

Title: Implicit Grading Bias in Large Language Models: How Writing Style Affects Automated Assessment Across Math, Programming, and Essay Tasks

Authors: Rudra Jadhav, Janhavi Danve, Sonalika Shaw
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.18765
Pdf URL: https://arxiv.org/pdf/2603.18765
Copy Paste: [[2603.18765]] Implicit Grading Bias in Large Language Models: How Writing Style Affects Automated Assessment Across Math, Programming, and Essay Tasks(https://arxiv.org/abs/2603.18765)
Keywords: language model, llm, prompt
Abstract: As large language models (LLMs) are increasingly deployed as automated graders in educational settings, concerns about fairness and bias in their evaluations have become critical. This study investigates whether LLMs exhibit implicit grading bias based on writing style when the underlying content correctness remains constant. We constructed a controlled dataset of 180 student responses across three subjects (Mathematics, Programming, and Essay/Writing), each with three surface-level perturbation types: grammar errors, informal language, and non-native phrasing. Two state-of-the-art open-source LLMs -- LLaMA 3.3 70B (Meta) and Qwen 2.5 72B (Alibaba) -- were prompted to grade responses on a 1-10 scale with explicit instructions to evaluate content correctness only and to disregard writing style. Our results reveal statistically significant grading bias in Essay/Writing tasks across both models and all perturbation types (p < 0.05), with effect sizes ranging from medium (Cohen's d = 0.64) to very large (d = 4.25). Informal language received the heaviest penalty, with LLaMA deducting an average of 1.90 points and Qwen deducting 1.20 points on a 10-point scale -- penalties comparable to the difference between a B+ and C+ letter grade. Non-native phrasing was penalized 1.35 and 0.90 points respectively. In sharp contrast, Mathematics and Programming tasks showed minimal bias, with most conditions failing to reach statistical significance. These findings demonstrate that LLM grading bias is subject-dependent, style-sensitive, and persists despite explicit counter-bias instructions in the grading prompt. We discuss implications for equitable deployment of LLM-based grading systems and recommend bias auditing protocols before institutional adoption.
摘要：随着大型语言模型（LLM）越来越多地在教育环境中被部署为自动评分器，对其评估的公平性和偏见的担忧变得至关重要。本研究调查了当基本内容正确性保持不变时，法学硕士是否会表现出基于写作风格的隐性评分偏差。我们构建了一个包含 180 名学生回答的受控数据集，涵盖三个科目（数学、编程和论文/写作），每个科目都有三种表面扰动类型：语法错误、非正式语言和非母语措辞。两个最先进的开源法学硕士——LLaMA 3.3 70B（Meta）和Qwen 2.5 72B（阿里巴巴）——被提示以1-10的等级对回答进行评分，并明确指示仅评估内容的正确性而忽略写作风格。我们的结果揭示了两种模型和所有扰动类型的作文/写作任务中存在统计显着的评分偏差 (p < 0.05)，效应大小范围从中等 (Cohen's d = 0.64) 到非常大 (d = 4.25)。非正式语言受到的处罚最重，LLaMA 平均扣分 1.90 分，Qwen 扣分 1.20 分（按 10 分制计算）——处罚相当于 B+ 和 C+ 字母等级之间的差异。非母语措辞分别被扣 1.35 分和 0.90 分。与此形成鲜明对比的是，数学和编程任务的偏差很小，大多数情况都未能达到统计显着性。这些发现表明，LLM 评分偏差与科目相关、风格敏感，并且尽管评分提示中有明确的反偏见指示，但仍然持续存在。我们讨论了基于法学硕士的评分系统的公平部署的影响，并在机构采用之前建议偏差审计协议。
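
For readers unfamiliar with the effect sizes quoted above, the following is a minimal sketch of Cohen's d using the pooled standard deviation of two independent groups. The abstract does not say which variant of d the authors computed (a paired variant is also plausible for this design), and the grade lists are hypothetical values chosen only for illustration.

```python
from math import sqrt
from statistics import mean, variance
from typing import Sequence


def cohens_d(group_a: Sequence[float], group_b: Sequence[float]) -> float:
    """Cohen's d with a pooled standard deviation (independent-groups form)."""
    n_a, n_b = len(group_a), len(group_b)
    # statistics.variance is the sample variance (n - 1 in the denominator).
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Hypothetical grades on a 1-10 scale: same content, original vs. informal style.
original_grades = [8, 9, 7, 8, 9]
informal_grades = [7, 7, 6, 7, 8]

d = cohens_d(original_grades, informal_grades)
print(f"mean penalty: {mean(original_grades) - mean(informal_grades):.2f} points")
print(f"Cohen's d:    {d:.2f}")
```

Because d divides the mean difference by the spread of the grades, a penalty of about one point can still yield a large effect size when the grader's scores vary little.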

Title: Mi:dm K 2.5 Pro

Authors: KT Tech innovation Group
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.18788
Pdf URL: https://arxiv.org/pdf/2603.18788
Copy Paste: [[2603.18788]] Mi:dm K 2.5 Pro(https://arxiv.org/abs/2603.18788)
Keywords: llm, agent
Abstract: The evolving LLM landscape requires capabilities beyond simple text generation, prioritizing multi-step reasoning, long-context understanding, and agentic workflows. This shift challenges existing models in enterprise environments, especially in Korean-language and domain-specific scenarios where scaling is insufficient. We introduce Mi:dm K 2.5 Pro, a 32B parameter flagship LLM designed to address enterprise-grade complexity through reasoning-focused optimization. Our methodology builds a robust data foundation via a quality-centric curation pipeline utilizing abstract syntax tree (AST) analysis for code, gap-filling synthesis for mathematics, and an LLM-based quality evaluator. Pre-training scales the model via layer-predictor-based Depth Upscaling (DuS) and a progressive strategy supporting a 128K token context window. Post-training introduces a specialized multi-stage pipeline, including Reasoning SFT, model merging, and asynchronous reinforcement learning (RL), to develop complex problem-solving skills. "Fusion Training" then rebalances these capabilities with conversational fluency, consistent response styling, and reliable tool-use. The evaluations show that Mi:dm K 2.5 Pro achieves competitive performance against leading global and domestic models. In addition, it sets state-of-the-art results on Korean-specific benchmarks, showcasing deep linguistic and cultural understanding. Finally, Responsible AI evaluations validate safety against attacks, ensuring a secure profile for deployment with a balance of harmlessness and responsiveness.
摘要：不断发展的法学硕士领域需要的功能不仅仅是简单的文本生成、优先考虑多步骤推理、长上下文理解和代理工作流程。这种转变对企业环境中的现有模型提出了挑战，特别是在扩展不足的韩语和特定领域场景中。我们推出 Mi:dm K 2.5 Pro，这是一款 32B 参数旗舰级法学硕士，旨在通过以推理为重点的优化来解决企业级复杂性。我们的方法通过以质量为中心的管理管道，利用代码抽象语法树 (AST) 分析、数学填补综合以及基于法学硕士的质量评估器，构建了强大的数据基础。预训练通过基于层预测器的深度放大 (DuS) 和支持 128K 令牌上下文窗口的渐进策略来缩放模型。训练后引入了专门的多阶段管道，包括推理 SFT、模型合并和异步强化学习 (RL)，以培养复杂的问题解决技能。然后，“融合培训”通过流畅的对话、一致的响应样式和可靠的工具使用来重新平衡这些能力。评测结果显示，Mi:dm K 2.5 Pro的性能与国内外领先机型相比具有竞争力。此外，它还根据韩国特定的基准设定了最先进的结果，展示了深厚的语言和文化理解。最后，负责任的人工智能评估可验证针对攻击的安全性，确保安全的部署配置文件，并在无害性和响应性之间取得平衡。

Title: Detecting Basic Values in A Noisy Russian Social Media Text Data: A Multi-Stage Classification Framework

Authors: Maria Milkova, Maksim Rudnev
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.18822
Pdf URL: https://arxiv.org/pdf/2603.18822
Copy Paste: [[2603.18822]] Detecting Basic Values in A Noisy Russian Social Media Text Data: A Multi-Stage Classification Framework(https://arxiv.org/abs/2603.18822)
Keywords: gpt, llm
Abstract: This study presents a multi-stage classification framework for detecting human values in noisy Russian language social media, validated on a random sample of 7.5 million public text posts. Drawing on Schwartz's theory of basic human values, we design a multi-stage pipeline that includes spam and nonpersonal content filtering, targeted selection of value relevant and politically relevant posts, LLM based annotation, and multi-label classification. Particular attention is given to verifying the quality of LLM annotations and model predictions against human experts. We treat human expert annotations not as ground truth but as an interpretative benchmark with its own uncertainty. To account for annotation subjectivity, we aggregate multiple LLM generated judgments into soft labels that reflect varying levels of agreement. These labels are then used to train transformer based models capable of predicting the probability of each of the ten basic values. The best performing model, XLM RoBERTa large, achieves an F1 macro of 0.83 and an F1 of 0.71 on held out test data. By treating value detection as a multi perspective interpretive task, where expert labels, GPT annotations, and model predictions represent coherent but not identical readings of the same texts, we show that the model generally aligns with human judgments but systematically overestimates the Openness to Change value domain. Empirically, the study reveals distinct patterns of value expression and their co-occurrence in Russian social networks, contributing to a broader research agenda on cultural variation, communicative framing, and value based interpretation in digital environments. All models are released publicly.
摘要：这项研究提出了一个多阶段分类框架，用于在嘈杂的俄语社交媒体中检测人类价值观，并在 750 万条公共文本帖子的随机样本上进行了验证。借鉴施瓦茨的基本人类价值观理论，我们设计了一个多级管道，包括垃圾邮件和非个人内容过滤、有针对性地选择价值相关和政治相关的帖子、基于法学硕士的注释和多标签分类。特别关注针对人类专家验证法学硕士注释和模型预测的质量。我们不将人类专家注释视为基本事实，而是将其视为具有自身不确定性的解释基准。为了考虑注释的主观性，我们将多个法学硕士生成的判断聚合成软标签，以反映不同程度的一致性。然后，使用这些标签来训练基于变压器的模型，该模型能够预测十个基本值中每个值的概率。性能最佳的模型 XLM RoBERTa Large 在保留的测试数据上实现了 0.83 的 F1 宏和 0.71 的 F1。通过将价值检测视为多视角解释任务，其中专家标签、GPT 注释和模型预测代表对相同文本的连贯但不相同的阅读，我们表明该模型通常与人类判断一致，但系统地高估了“变革开放性”价值域。从经验上看，该研究揭示了俄罗斯社交网络中不同的价值表达模式及其共现情况，有助于就数字环境中的文化变异、交流框架和基于价值的解释进行更广泛的研究议程。所有模型均公开发布。
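
One simple way to turn several LLM judgments into soft labels is to use, for each value, the share of annotation runs that assigned it. The abstract does not describe the authors' aggregation rule, so the function below is an illustrative assumption; the function name, the three example values, and the four judgments are all hypothetical.

```python
from typing import Dict, Iterable, List, Set


def soft_labels(judgments: List[Set[str]], label_set: Iterable[str]) -> Dict[str, float]:
    """Per-label agreement rate across annotation runs (multi-label setting).

    Each element of `judgments` is the set of labels one run assigned to a post.
    """
    if not judgments:
        raise ValueError("at least one judgment is required")
    n_runs = len(judgments)
    return {
        label: sum(1 for run in judgments if label in run) / n_runs
        for label in label_set
    }


# Hypothetical example: four annotation runs over one post, three candidate values.
labels = ["security", "tradition", "hedonism"]
runs = [{"security"}, {"security", "tradition"}, {"security"}, set()]

print(soft_labels(runs, labels))
```

Targets of this kind can be used to train a multi-label classifier with a per-label binary cross-entropy loss, so that predicted probabilities track the level of annotator agreement rather than a hard 0/1 decision.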

Title: Evaluating LLM-Generated Lessons from the Language Learning Students' Perspective: A Short Case Study on Duolingo

Authors: Carlos Rafael Catalan, Patricia Nicole Monderin, Lheane Marie Dizon, Gap Estrella, Raymund John Sarmimento, Marie Antoinette Patalagsa
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2603.18873
Pdf URL: https://arxiv.org/pdf/2603.18873
Copy Paste: [[2603.18873]] Evaluating LLM-Generated Lessons from the Language Learning Students' Perspective: A Short Case Study on Duolingo(https://arxiv.org/abs/2603.18873)
Keywords: language model, llm
Abstract: Popular language learning applications such as Duolingo use large language models (LLMs) to generate lessons for their users. Most lessons focus on general real-world scenarios such as greetings, ordering food, or asking directions, with limited support for profession-specific contexts. This gap can hinder learners from achieving professional-level fluency, which we define as the ability to comfortably communicate a range of work-related and domain-specific information in the target language. We surveyed five employees from a multinational company in the Philippines on their experiences with Duolingo. Results show that respondents encountered general scenarios more frequently than work-related ones, and that the former are relatable and effective in building foundational grammar, vocabulary, and cultural knowledge. The latter help bridge the gap toward professional fluency because they contain domain-specific vocabulary. Each participant suggested lesson scenarios that diverge in context when analyzed in aggregate. With this understanding, we propose that language learning applications should generate lessons that adapt to an individual's needs through personalized, domain-specific lesson scenarios while maintaining foundational support through general, relatable lesson scenarios.
摘要：Duolingo 等流行的语言学习应用程序使用大型语言模型 (LLM) 为其用户生成课程。大多数课程侧重于一般的现实场景，例如问候、点餐或问路，对特定职业环境的支持有限。这种差距可能会阻碍学习者达到专业水平的流利程度，我们将其定义为用目标语言舒适地交流各种与工作相关和特定领域信息的能力。我们对菲律宾一家跨国公司的五名员工进行了调查，了解他们使用 Duolingo 的体验。结果表明，受访者遇到一般场景的频率高于与工作相关的场景，而且前者在构建基础语法、词汇和文化知识方面具有相关性和有效性。后者有助于弥合专业流畅性的差距，因为它包含特定领域的词汇。每个参与者提出的课程场景在上下文中有所不同，然后进行汇总分析。有了这种理解，我们建议语言学习应用程序应该通过个性化的、特定领域的课程场景生成适应个人需求的课程，同时通过通用的、相关的课程场景保持基础支持。

Title: A Human-in/on-the-Loop Framework for Accessible Text Generation

Authors: Lourdes Moreno, Paloma Martínez
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.18879
Pdf URL: https://arxiv.org/pdf/2603.18879
Copy Paste: [[2603.18879]] A Human-in/on-the-Loop Framework for Accessible Text Generation(https://arxiv.org/abs/2603.18879)
Keywords: llm
Abstract: Plain Language and Easy-to-Read formats in text simplification are essential for cognitive accessibility. Yet current automatic simplification and evaluation pipelines remain largely automated and metric-driven, and fail to reflect user comprehension or normative standards. This paper introduces a hybrid framework that explicitly integrates human participation into LLM-based accessible text generation. Human-in-the-Loop (HiTL) contributions guide adjustments during generation, while Human-on-the-Loop (HoTL) supervision ensures systematic post-generation review. Empirical evidence from user studies and annotated resources is operationalized into (i) checklists aligned with standards, (ii) Event-Condition-Action trigger rules for activating expert oversight, and (iii) accessibility Key Performance Indicators (KPIs). The framework shows how human-centered mechanisms can be encoded for evaluation and reused to provide structured feedback that improves model adaptation. By embedding the human role in both generation and supervision, it establishes a traceable, reproducible, and auditable process for creating and evaluating accessible texts. In doing so, it integrates explainability and ethical accountability as core design principles, contributing to more transparent and inclusive NLP systems.
摘要：文本简化中的简单语言和易于阅读的格式对于认知可访问性至关重要。然而，当前的自动简化和评估流程在很大程度上仍然是自动化的、指标驱动的，并且无法反映用户的理解或规范标准。本文介绍了一种混合框架，该框架将人类参与明确集成到基于法学硕士的无障碍文本生成中。人在环 (HiTL) 贡献指导生成期间的调整，而人在环 (HoTL) 监督确保系统的生成后审查。来自用户研究和注释资源的经验证据被运用到（i）符合标准的清单，（ii）用于激活专家监督的事件-条件-行动触发规则，以及（iii）可访问性关键绩效指标（KPI）。该框架展示了如何对以人为中心的机制进行编码以进行评估并重用以提供改善模型适应性的结构化反馈。通过将人类角色嵌入到生成和监督中，它建立了一个可追踪、可复制和可审计的流程来创建和评估可访问的文本。在此过程中，它将可解释性和道德责任作为核心设计原则，有助于打造更加透明和包容的 NLP 系统。

Title: Progressive Training for Explainable Citation-Grounded Dialogue: Reducing Hallucination to Zero in English-Hindi LLMs

Authors: Vedant Pandya
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.18911
Pdf URL: https://arxiv.org/pdf/2603.18911
Copy Paste: [[2603.18911]] Progressive Training for Explainable Citation-Grounded Dialogue: Reducing Hallucination to Zero in English-Hindi LLMs(https://arxiv.org/abs/2603.18911)
Keywords: llm, hallucination
Abstract: Knowledge-grounded dialogue systems aim to generate informative, contextually relevant responses by conditioning on external knowledge sources. However, most existing approaches focus exclusively on English, lack explicit citation mechanisms for verifying factual claims, and offer limited transparency into model decision-making. We present XKD-Dial, a progressive four-stage training pipeline for explainable, knowledge-grounded dialogue generation in a bilingual (English-Hindi) setting, comprising: (1) multilingual adaptation, (2) English dialogue SFT with citation grounding, (3) bilingual dialogue SFT, and (4) GRPO alignment with citation-aware rewards. We evaluate six models spanning encoder-decoder (250M-3B) and decoder-only (1B-7B) architectures at every pipeline stage. Our key contributions are: (i) three post-hoc explainability analyses - cross-attention alignment, Integrated Gradients attribution, and occlusion-based causal grounding - applied systematically across the training trajectory to reveal how citation behaviour is learned, not only whether it is learned; (ii) citation-grounded SFT reduces hallucination to 0.0% for encoder-decoder models from Stage 2 onward; (iii) the progressive pipeline prevents catastrophic forgetting while improving Hindi capabilities; (iv) smaller models match larger models on English after SFT; and (v) GRPO provides marginal improvement over well-designed SFT for structured citation tasks. We evaluate across six automatic metrics (BLEU, ROUGE, BERTScore, FactScore, Citation-F1, and hallucination rate).
摘要：基于知识的对话系统旨在通过外部知识源的调节来生成内容丰富、与上下文相关的响应。然而，大多数现有方法只关注英语，缺乏验证事实主张的明确引用机制，并且模型决策的透明度有限。我们提出了 XKD-Dial，这是一个渐进式四阶段训练流程，用于在双语（英语-印地语）环境中生成可解释的、基于知识的对话，包括：(1) 多语言适应，(2) 具有引文基础的英语对话 SFT，(3) 双语对话 SFT，以及 (4) 与引文感知奖励对齐的 GRPO。我们在每个管道阶段评估跨越编码器-解码器 (250M-3B) 和仅解码器 (1B-7B) 架构的六个模型。我们的主要贡献是：（i）三个事后可解释性分析——交叉注意对齐、综合梯度归因和基于遮挡的因果基础——在整个训练轨迹中系统地应用，以揭示引文行为是如何学习的，而不仅仅是是否学习； (ii) 从第 2 阶段开始，基于引文的 SFT 将编码器-解码器模型的幻觉降低至 0.0%； (iii) 渐进式管道可以防止灾难性遗忘，同时提高印地语能力； (iv) SFT 后，较小的模型与英语上的较大模型相匹配； (v) GRPO 在结构化引文任务方面比精心设计的 SFT 提供了边际改进。我们评估了六个自动指标（BLEU、ROUGE、BERTScore、FactScore、Citation-F1 和幻觉率）。

Title: Entropy trajectory shape predicts LLM reasoning reliability: A diagnostic study of uncertainty dynamics in chain-of-thought

Authors: Xinghao Zhao
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2603.18940
Pdf URL: https://arxiv.org/pdf/2603.18940
Copy Paste: [[2603.18940]] Entropy trajectory shape predicts LLM reasoning reliability: A diagnostic study of uncertainty dynamics in chain-of-thought(https://arxiv.org/abs/2603.18940)
Keywords: llm, chain-of-thought
Abstract: Chain-of-thought (CoT) reasoning improves LLM accuracy, yet detecting failures cheaply remains elusive. We study whether the shape of uncertainty dynamics across reasoning steps--captured by sampling a few answer completions per step--predicts correctness. We introduce entropy-trajectory monotonicity: a chain is monotone if its per-step answer-distribution entropy decreases at every step. On GSM8K (n=300) with Qwen2.5-7B-Instruct, monotone chains achieve 68.8% accuracy vs. 46.8% for non-monotone chains (+21.9 pp; Fisher's p=0.0005; OR=2.50). Critically, total entropy reduction is not predictive (ρ=-0.06, p=0.31), revealing a shape-over-magnitude dissociation: whether entropy decreases at every step matters, not how much. Violation count 0/1/2 gives 68.8%/50.8%/28.6% accuracy. Token log-probability confidence worsens in calibration with step depth (ECE: 0.186->0.312), and monotonicity achieves +5.8 pp at 73.7% coverage, outperforming scalar baselines at approximately 1,500 tokens/question--1/8 the cost of 40-chain self-consistency. Results replicate on Mistral-7B (n=300): monotone chains reach 72.3% vs. 37.6% (+34.7 pp; OR=4.33). Structural properties of uncertainty trajectories are thus more informative than aggregate measures.
摘要：思想链 (CoT) 推理提高了 LLM 的准确性，但廉价地检测失败仍然难以实现。我们研究推理步骤中不确定性动态的形状（通过对每个步骤的一些答案完成进行采样来捕获）是否可以预测正确性。我们引入熵轨迹单调性：如果一条链的每一步答案分布熵在每一步都减少，那么它就是单调的。在使用 Qwen2.5-7B-Instruct 的 GSM8K (n=300) 上，单调链的准确度为 68.8%，而非单调链的准确度为 46.8%（+21.9 pp；Fisher's p=0.0005；OR=2.50）。重要的是，总熵减少不是预测性的（$\rho$=-0.06，p=0.31），揭示了形状与幅度的分离：熵是否在每一步减少都很重要，而不是减少多少。违规计数 0/1/2 的准确度为 68.8%/50.8%/28.6%。令牌对数概率置信度随着步长深度的校准而恶化（ECE：0.186->0.312），单调性在 73.7% 的覆盖率下达到 +5.8 pp，在大约 1,500 个令牌/问题时优于标量基线——40 链自一致性成本的 1/8。结果在 Mistral-7B (n=300) 上重复：单调链达到 72.3% vs. 37.6% (+34.7 pp；OR=4.33)。因此，不确定性轨迹的结构特性比总体指标更具信息性。
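
The monotonicity criterion described in this abstract can be sketched as follows: estimate the entropy of the sampled answers at each reasoning step, then count the steps at which entropy fails to decrease. This is a minimal sketch under stated assumptions: ties are treated as violations (the abstract says "decreases at every step" but does not say how equal entropies are handled), entropy is measured in bits, and the sampled answers are hypothetical rather than taken from GSM8K.

```python
from collections import Counter
from math import log2
from typing import List, Sequence


def answer_entropy(samples: Sequence[str]) -> float:
    """Shannon entropy (bits) of the empirical distribution over sampled answers."""
    counts = Counter(samples)
    total = len(samples)
    return -sum((c / total) * log2(c / total) for c in counts.values())


def violation_count(entropies: Sequence[float]) -> int:
    """Number of steps whose entropy is not strictly lower than the previous step's."""
    return sum(1 for prev, cur in zip(entropies, entropies[1:]) if cur >= prev)


def is_monotone(entropies: Sequence[float]) -> bool:
    """A chain is monotone when it has no violations (assumption: strict decrease)."""
    return violation_count(entropies) == 0


# Hypothetical chain: four answer completions sampled after each of three steps.
samples_per_step: List[List[str]] = [
    ["12", "15", "12", "18"],  # early step: answers disagree
    ["12", "12", "15", "12"],  # middle step: partial agreement
    ["12", "12", "12", "12"],  # final step: full agreement
]

trajectory = [answer_entropy(step) for step in samples_per_step]
print("violations:", violation_count(trajectory))
print("monotone:  ", is_monotone(trajectory))
```

Only the ordering of consecutive entropies enters the check, not the size of the drops, which mirrors the shape-over-magnitude distinction the abstract draws.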

Title: RADIUS: Ranking, Distribution, and Significance - A Comprehensive Alignment Suite for Survey Simulation

Authors: Weronika Łajewska, Paul Missault, George Davidson, Saab Mansour
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.19002
Pdf URL: https://arxiv.org/pdf/2603.19002
Copy Paste: [[2603.19002]] RADIUS: Ranking, Distribution, and Significance - A Comprehensive Alignment Suite for Survey Simulation(https://arxiv.org/abs/2603.19002)
Keywords: llm
Abstract: Simulation of surveys using LLMs is emerging as a powerful application for generating human-like responses at scale. Prior work evaluates survey simulation using metrics borrowed from other domains, which are often ad hoc, fragmented, and non-standardized, leading to results that are difficult to compare. Moreover, existing metrics focus mainly on accuracy or distributional measures, overlooking the critical dimension of ranking alignment. In practice, a simulation can achieve high accuracy while still failing to capture the option most preferred by humans - a distinction that is critical in decision-making applications. We introduce RADIUS, a comprehensive two-dimensional alignment suite for survey simulation that captures: 1) RAnking alignment and 2) DIstribUtion alignment, each complemented by statistical Significance testing. RADIUS highlights the limitations of existing metrics, enables more meaningful evaluation of survey simulation, and provides an open-source implementation for reproducible and comparable assessment.
摘要：使用法学硕士进行调查模拟正在成为一种强大的应用程序，可大规模生成类似人类的响应。先前的工作使用从其他领域借用的指标来评估调查模拟，这些指标通常是临时的、分散的和非标准化的，导致难以比较的结果。此外，现有的指标主要关注准确性或分布测量，忽略了排名对齐的关键维度。在实践中，模拟可以实现高精度，但仍然无法捕获人类最喜欢的选项——这一区别在决策应用中至关重要。我们引入了 RADIUS，这是一个用于调查模拟的综合二维对齐套件，它捕获：1) 排名对齐和 2) 分布对齐，每个对齐都通过统计显着性测试进行补充。 RADIUS 强调了现有指标的局限性，可以对调查模拟进行更有意义的评估，并为可重复和可比较的评估提供开源实现。

Title: Hypothesis-Conditioned Query Rewriting for Decision-Useful Retrieval

Authors: Hangeol Chang, Changsun Lee, Seungjoon Rho, Junho Yeo, Jong Chul Ye
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.19008
Pdf URL: https://arxiv.org/pdf/2603.19008
Copy Paste: [[2603.19008]] Hypothesis-Conditioned Query Rewriting for Decision-Useful Retrieval(https://arxiv.org/abs/2603.19008)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) improves Large Language Models (LLMs) by grounding generation in external, non-parametric knowledge. However, when a task requires choosing among competing options, simply grounding generation in broadly relevant context is often insufficient to drive the final decision. Existing RAG methods typically rely on a single initial query, which often favors topical relevance over decision-relevant evidence, and therefore retrieves background information that can fail to discriminate among answer options. To address this issue, here we propose Hypothesis-Conditioned Query Rewriting (HCQR), a training-free pre-retrieval framework that reorients RAG from topic-oriented retrieval to evidence-oriented retrieval. HCQR first derives a lightweight working hypothesis from the input question and candidate options, and then rewrites retrieval into three targeted queries that seek evidence to: (1) support the hypothesis, (2) distinguish it from competing alternatives, and (3) verify salient clues in the question. This approach enables context retrieval that is more directly aligned with answer selection, allowing the generator to confirm or overturn the initial hypothesis based on the retrieved evidence. Experiments on MedQA and MMLU-Med show that HCQR consistently outperforms single-query RAG and re-rank/filter baselines, improving average accuracy over Simple RAG by 5.9 and 3.6 points, respectively. Code is available at this https URL.
摘要：检索增强生成 (RAG) 通过将生成基于外部非参数知识来改进大型语言模型 (LLM)。然而，当一项任务需要在相互竞争的选项中进行选择时，仅仅将生成置于广泛相关的背景下通常不足以推动最终决策。现有的 RAG 方法通常依赖于单个初始查询，该初始查询通常更倾向于主题相关性而不是决策相关证据，因此检索到的背景信息可能无法区分答案选项。为了解决这个问题，我们在这里提出假设条件查询重写（HCQR），这是一种免训练的预检索框架，它将 RAG 从面向主题的检索重新定位为面向证据的检索。 HCQR 首先从输入问题和候选选项中得出轻量级工作假设，然后将检索重写为三个有针对性的查询，以寻求证据：(1) 支持假设，(2) 将其与竞争替代方案区分开来，(3) 验证问题中的显着线索。这种方法使上下文检索能够更直接地与答案选择保持一致，从而允许生成器根据检索到的证据确认或推翻初始假设。 MedQA 和 MMLU-Med 上的实验表明，HCQR 始终优于单查询 RAG 和重新排序/过滤基线，比简单 RAG 的平均准确度分别提高了 5.9 和 3.6 个点。代码可从此 https URL 获取。

Title: What Really Controls Temporal Reasoning in Large Language Models: Tokenisation or Representation of Time?

Authors: Gagan Bhatia, Ahmad Muhammad Isa, Maxime Peyrard, Wei Zhao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.19017
Pdf URL: https://arxiv.org/pdf/2603.19017
Copy Paste: [[2603.19017]] What Really Controls Temporal Reasoning in Large Language Models: Tokenisation or Representation of Time?(https://arxiv.org/abs/2603.19017)
Keywords: language model, llm
Abstract: We present MultiTempBench, a multilingual temporal reasoning benchmark spanning three tasks (date arithmetic, time zone conversion, and temporal relation extraction) across five languages (English, German, Chinese, Arabic, and Hausa) and multiple calendar conventions (Gregorian, Hijri, and Chinese Lunar). MultiTempBench contains 15,000 examples built by translating 750 curated English questions and expanding each into controlled date-format variants. We evaluate 20 LLMs and introduce the multilingual Date Fragmentation Ratio (mDFR), calibrated with human severity ratings, together with geometric-probing analyses of internal temporal representations. We find tokenisation quality of temporal artefacts is a resource-dependent bottleneck: in low-resource languages and rarer calendar formats, fragmentation disrupts Year/Month/Day separation and accuracy collapses, while high-resource settings are often robust to digit-level splitting. Beyond tokenisation, crossed mixed-effects regression shows that temporal linearity is the strongest predictor of temporal reasoning in high-resource languages, whereas fragmentation is the stronger predictor in low-resource languages. Code is available at: this https URL
摘要：我们推出了 MultiTempBench，这是一个多语言时间推理基准，涵盖五种语言（英语、德语、中文、阿拉伯语和豪萨语）和多种日历约定（公历、回历和中国农历）的日期算术、时区转换和时间关系提取三项任务。 MultiTempBench 包含价值 15,000 美元的示例，这些示例是通过翻译 750 美元策划的英语问题并将每个问题扩展为受控日期格式变体而构建的。我们评估了 20 个法学硕士，并引入了多语言日期碎片率 (mDFR)，该比率根据人类严重性评级进行校准，以及内部时间表示的几何探测分析。我们发现时间制品的标记化质量是一个依赖于资源的瓶颈：在低资源语言和较罕见的日历格式中，碎片会破坏年/月/日分离并导致准确性崩溃，而高资源设置通常对数字级分割具有鲁棒性。除了标记化之外，交叉混合效应回归表明，时间线性是高资源语言中时间推理的最强预测因子，而碎片化是低资源语言中更强的预测因子。代码位于：此 https URL

Title: MoRI: Learning Motivation-Grounded Reasoning for Scientific Ideation in Large Language Models

Authors: Chenyang Gu, Jiahao Cheng, Meicong Zhang, Pujun Zheng, Jinquan Zheng, Guoxiu He
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.19044
Pdf URL: https://arxiv.org/pdf/2603.19044
Copy Paste: [[2603.19044]] MoRI: Learning Motivation-Grounded Reasoning for Scientific Ideation in Large Language Models(https://arxiv.org/abs/2603.19044)
Keywords: language model, llm, agent
Abstract: Scientific ideation aims to propose novel solutions within a given scientific context. Existing LLM-based agentic approaches emulate human research workflows, yet inadequately model scientific reasoning, resulting in surface-level conceptual recombinations that lack technical depth and scientific grounding. To address this issue, we propose MoRI (Motivation-grounded Reasoning for Scientific Ideation), a framework that enables LLMs to explicitly learn the reasoning process from research motivations to methodologies. The base LLM is initialized via supervised fine-tuning to generate a research motivation from a given context, and is subsequently trained under a composite reinforcement learning reward that approximates scientific rigor: (1) entropy-aware information gain encourages the model to uncover and elaborate high-complexity technical details grounded in ground-truth methodologies, and (2) contrastive semantic gain constrains the reasoning trajectory to remain conceptually aligned with scientifically valid solutions. Empirical results show that MoRI significantly outperforms strong commercial LLMs and complex agentic baselines across multiple dimensions, including novelty, technical rigor, and feasibility. The code will be made available on GitHub (this https URL).
摘要：科学构思旨在在给定的科学背景下提出新颖的解决方案。现有的基于法学硕士的代理方法模仿人类研究工作流程，但对科学推理的建模不充分，导致缺乏技术深度和科学基础的表面概念重组。为了解决这个问题，我们提出了 \textbf{MoRI} （\textbf{MoRI}（\textbf{Mo}tivation-grounded \textbf{R}easoning for Scientific \textbf{I}deation），这是一个框架，使法学硕士能够明确学习从研究动机到方法论的推理过程。基础法学硕士通过有监督的微调进行初始化，以从给定的背景中产生研究动机，然后在接近科学严谨性的复合强化学习奖励下进行训练：（1）熵感知信息增益鼓励模型发现和阐述基于真实方法论的高复杂性技术细节，（2）对比语义增益限制推理轨迹，以在概念上与科学有效的解决方案保持一致。实证结果表明，MoRI 在多个维度（包括新颖性、技术严谨性和可行性）显着优于强大的商业法学硕士和复杂的代理基线。该代码将在 \href{此 https URL}{GitHub} 上提供。

Title: Parallelograms Strike Back: LLMs Generate Better Analogies than People

Authors: Qiawen Ella Liu, Raja Marjieh, Jian-Qiao Zhu, Adele E. Goldberg, Thomas L. Griffiths
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.19066
Pdf URL: https://arxiv.org/pdf/2603.19066
Copy Paste: [[2603.19066]] Parallelograms Strike Back: LLMs Generate Better Analogies than People(https://arxiv.org/abs/2603.19066)
Keywords: language model, llm
Abstract: Four-term word analogies (A:B::C:D) are classically modeled geometrically as "parallelograms," yet recent work suggests this model poorly captures how humans produce analogies, with simple local-similarity heuristics often providing a better account (Peterson et al., 2020). But does the parallelogram model fail because it is a bad model of analogical relations, or because people are not very good at generating relation-preserving analogies? We compared human and large language model (LLM) analogy completions on the same set of analogy problems from Peterson et al. (2020). We find that LLM-generated analogies are reliably judged as better than human-generated ones, and are also more closely aligned with the parallelogram structure in a distributional embedding space (GloVe). Crucially, we show that the improvement over human analogies was driven by greater parallelogram alignment and reduced reliance on accessible words rather than enhanced sensitivity to local similarity. Moreover, the LLM advantage is driven not by uniformly superior responses by LLMs, but by humans producing a long tail of weak completions: when only modal (most frequent) responses by both systems are compared, the LLM advantage disappears. However, greater parallelogram alignment and lower word frequency continue to predict which LLM completions are rated higher than those of humans. Overall, these results suggest that the parallelogram model is not a poor account of word analogy. Rather, humans may often fail to produce completions that satisfy this relational constraint, whereas LLMs do so more consistently.
摘要：四项词类比（A:B::C:D）通常被几何建模为“平行四边形”，但最近的研究表明，这种模型很难捕捉人类如何产生类比，简单的局部相似性启发法通常可以提供更好的解释（Peterson 等人，2020）。但是，平行四边形模型失败是因为它是一个糟糕的类比关系模型，还是因为人们不太擅长生成保留关系的类比？我们比较了人类和大型语言模型 (LLM) 在同一组类比问题上的类比完成情况（Peterson 等人，2020）。我们发现，LLM 生成的类比被可靠地判断为比人类生成的类比更好，并且与分布嵌入空间 (GloVe) 中的平行四边形结构更加一致。至关重要的是，我们表明，对人类类比的改进是由更大的平行四边形对齐和减少对可访问单词的依赖驱动的，而不是增强对局部相似性的敏感性。此外，LLM 的优势不是由 LLM 的一致优越响应驱动的，而是由人类产生的弱完成的长尾驱动的：当仅比较两个系统的众数（最频繁）响应时，LLM 优势消失。然而，更大的平行四边形对齐和更低的词频继续预测哪些 LLM 完成的评分高于人类。总的来说，这些结果表明平行四边形模型对单词类比的解释还不错。相反，人类可能经常无法生成满足这种关系约束的补全，而 LLM 则更一致地做到这一点。

Title: A Dataset and Resources for Identifying Patient Health Literacy Information from Clinical Notes

Authors: Madeline Bittner, Dina Demner-Fushman, Yasmeen Shabazz, Davis Bartels, Dukyong Yoon, Brad Quitadamo, Rajiv Menghrajani, Leo Celi, Sarvesh Soni
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.19082
Pdf URL: https://arxiv.org/pdf/2603.19082
Copy Paste: [[2603.19082]] A Dataset and Resources for Identifying Patient Health Literacy Information from Clinical Notes(https://arxiv.org/abs/2603.19082)
Keywords: language model, llm, prompt
Abstract: Health literacy is a critical determinant of patient outcomes, yet current screening tools are not always feasible and differ considerably in the number of items, question format, and dimensions of health literacy they capture, making documentation in structured electronic health records difficult to achieve. Automated detection from unstructured clinical notes offers a promising alternative, as these notes often contain richer, more contextual health literacy information, but progress has been limited by the lack of annotated resources. We introduce HEALIX, the first publicly available annotated health literacy dataset derived from real clinical notes, curated through a combination of social worker note sampling, keyword-based filtering, and LLM-based active learning. HEALIX contains 589 notes across 9 note types, annotated with three health literacy labels: low, normal, and high. To demonstrate its utility, we benchmarked zero-shot and few-shot prompting strategies across four open source large language models (LLMs).
摘要：健康素养是患者治疗结果的关键决定因素，但当前的筛查工具并不总是可行，并且在项目数量、问题格式和所捕获的健康素养维度方面存在很大差异，使得结构化电子健康记录中的记录难以实现。从非结构化临床笔记中进行自动检测提供了一种有前途的替代方案，因为这些笔记通常包含更丰富、更相关的健康素养信息，但由于缺乏带注释的资源，进展受到限制。我们推出 HEALIX，这是第一个公开的带注释的健康素养数据集，源自真实的临床笔记，通过社会工作者笔记采样、基于关键字的过滤和基于 LLM 的主动学习的组合来策划。HEALIX 包含 9 种笔记类型的 589 份笔记，并标注了三个健康素养标签：低、正常和高。为了证明其实用性，我们对四个开源大语言模型 (LLM) 的零样本和少样本提示策略进行了基准测试。

Title: DaPT: A Dual-Path Framework for Multilingual Multi-hop Question Answering

Authors: Yilin Wang, Yuchun Fan, Jiaoyang Li, Ziming Zhu, Yongyu Mu, Qiaozhi He, Tong Xiao, Jingbo Zhu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.19097
Pdf URL: https://arxiv.org/pdf/2603.19097
Copy Paste: [[2603.19097]] DaPT: A Dual-Path Framework for Multilingual Multi-hop Question Answering(https://arxiv.org/abs/2603.19097)
Keywords: llm, retrieval-augmented generation
Abstract: Retrieval-augmented generation (RAG) systems have made significant progress in solving complex multi-hop question answering (QA) tasks in the English scenario. However, RAG systems inevitably face the application scenario of retrieving across multilingual corpora and queries, leaving several open challenges. The first one involves the absence of benchmarks that assess RAG systems' capabilities under the multilingual multi-hop (MM-hop) QA setting. The second centers on the overreliance on LLMs' strong semantic understanding in English, which diminishes effectiveness in multilingual scenarios. To address these challenges, we first construct multilingual multi-hop QA benchmarks by translating English-only benchmarks into five languages, and then we propose DaPT, a novel multilingual RAG framework. DaPT generates sub-question graphs in parallel for both the source-language query and its English translation counterpart, then merges them before employing a bilingual retrieval-and-answer strategy to sequentially solve sub-questions. Our experimental results demonstrate that advanced RAG systems suffer from a significant performance imbalance in multilingual scenarios. Furthermore, our proposed method consistently yields more accurate and concise answers compared to the baselines, significantly enhancing RAG performance on this task. For instance, on the most challenging MuSiQue benchmark, DaPT achieves a relative improvement of 18.3% in average EM score over the strongest baseline.
摘要：检索增强生成（RAG）系统在解决英语场景中复杂的多跳问答（QA）任务方面取得了重大进展。然而，RAG系统不可避免地面临跨语言语料库和查询的检索应用场景，留下了一些开放的挑战。第一个问题涉及缺乏评估 RAG 系统在多语言多跳 (MM-hop) QA 设置下的能力的基准。第二个重点是过度依赖 LLM 在英语上的强大语义理解，这降低了多语言场景中的有效性。为了应对这些挑战，我们首先通过将纯英语基准翻译成五种语言来构建多语言多跳 QA 基准，然后我们提出了 DaPT，一种新颖的多语言 RAG 框架。DaPT 为源语言查询及其对应的英语翻译并行生成子问题图，然后将它们合并，然后采用双语检索和回答策略顺序解决子问题。我们的实验结果表明，先进的 RAG 系统在多语言场景中存在显着的性能不平衡。此外，与基线相比，我们提出的方法始终能产生更准确、更简洁的答案，从而显着提高 RAG 在此任务上的性能。例如，在最具挑战性的 MuSiQue 基准测试中，DaPT 的平均 EM 分数比最强基线相对提高了 18.3%。

Title: UGID: Unified Graph Isomorphism for Debiasing Large Language Models

Authors: Zikang Ding, Junchi Yao, Junhao Li, Yi Zhang, Wenbo Jiang, Hongbo Liu, Lijie Hu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.19144
Pdf URL: https://arxiv.org/pdf/2603.19144
Copy Paste: [[2603.19144]] UGID: Unified Graph Isomorphism for Debiasing Large Language Models(https://arxiv.org/abs/2603.19144)
Keywords: language model, llm
Abstract: Large language models (LLMs) exhibit pronounced social biases. Output-level or data-optimization-based debiasing methods cannot fully resolve these biases, and many prior works have shown that biases are embedded in internal representations. We propose Unified Graph Isomorphism for Debiasing large language models (UGID), an internal-representation-level debiasing framework for large language models that models the Transformer as a structured computational graph, where attention mechanisms define the routing edges of the graph and hidden states define the graph nodes. Specifically, debiasing is formulated as enforcing invariance of the graph structure across counterfactual inputs, with differences allowed only on sensitive attributes. UGID jointly constrains attention routing and hidden representations in bias-sensitive regions, effectively preventing bias migration across architectural components. To achieve effective behavioral alignment without degrading general capabilities, we introduce a log-space constraint on sensitive logits and a selective anchor-based objective to preserve definitional semantics. Extensive experiments on large language models demonstrate that UGID effectively reduces bias under both in-distribution and out-of-distribution settings, significantly reduces internal structural discrepancies, and preserves model safety and utility.
摘要：大型语言模型（LLM）表现出明显的社会偏见。基于输出级或数据优化的去偏差方法无法完全解决这些偏差，并且许多先前的工作已经表明偏差嵌入在内部表示中。我们提出了用于大语言模型去偏的 Unified Graph Isomorphism for Debiasing (UGID)，这是一种用于大语言模型的内部表示级去偏框架，将 Transformer 建模为结构化计算图，其中注意机制定义图的路由边，隐藏状态定义图节点。具体来说，去偏差被表述为强制跨反事实输入的图结构的不变性，仅允许敏感属性存在差异。UGID 联合约束偏差敏感区域中的注意力路由和隐藏表示，有效防止跨架构组件的偏差迁移。为了在不降低一般能力的情况下实现有效的行为对齐，我们引入了对敏感 logits 的对数空间约束和基于选择性锚点的目标来保留定义语义。对大型语言模型的大量实验表明，UGID 有效减少了分布内和分布外设置下的偏差，显着减少了内部结构差异，并保持了模型的安全性和实用性。

Title: Optimal Splitting of Language Models from Mixtures to Specialized Domains

Authors: Skyler Seto, Pierre Ablin, Anastasiia Filippova, Jiayuan Ye, Louis Bethune, Angelos Katharopoulos, David Grangier
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2603.19149
Pdf URL: https://arxiv.org/pdf/2603.19149
Copy Paste: [[2603.19149]] Optimal Splitting of Language Models from Mixtures to Specialized Domains(https://arxiv.org/abs/2603.19149)
Keywords: language model
Abstract: Language models achieve impressive performance on a variety of knowledge, language, and reasoning tasks due to the scale and diversity of pretraining data available. The standard training recipe is a two-stage paradigm: pretraining first on the full corpus of data followed by specialization on a subset of high quality, specialized data from the full corpus. In the multi-domain setting, this involves continued pretraining of multiple models on each specialized domain, referred to as split model training. We propose a method for pretraining multiple models independently over a general pretraining corpus, and determining the optimal compute allocation between pretraining and continued pretraining using scaling laws. Our approach accurately predicts the loss of a model of size N with D pretraining and D' specialization tokens, and extrapolates to larger model sizes and number of tokens. Applied to language model training, our approach improves performance consistently across common sense knowledge and reasoning benchmarks across different model sizes and compute budgets.
摘要：由于可用预训练数据的规模和多样性，语言模型在各种知识、语言和推理任务上取得了令人印象深刻的性能。标准训练方法是一个两阶段范例：首先对完整的数据语料库进行预训练，然后对完整语料库中的高质量、专业数据的子集进行专门化。在多领域设置中，这涉及在每个专门领域上持续预训练多个模型，称为分割模型训练。我们提出了一种在通用预训练语料库上独立预训练多个模型的方法，并使用缩放法则确定预训练和持续预训练之间的最佳计算分配。我们的方法通过 D 预训练和 D' 专业化标记准确预测大小为 N 的模型的损失，并推断到更大的模型大小和标记数量。应用于语言模型训练时，我们的方法可以在不同模型大小和计算预算的常识知识和推理基准上一致地提高性能。

Title: VEPO: Variable Entropy Policy Optimization for Low-Resource Language Foundation Models

Authors: Chonghan Liu, Yimin Du, Qi An, Xin He, Cunqi Zhai, Fei Tan, Weijia Lin, Xiaochun Gong, Yongchao Deng, Shousheng Jia, Xiangzheng Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.19152
Pdf URL: https://arxiv.org/pdf/2603.19152
Copy Paste: [[2603.19152]] VEPO: Variable Entropy Policy Optimization for Low-Resource Language Foundation Models(https://arxiv.org/abs/2603.19152)
Keywords: language model
Abstract: Large language models frequently exhibit suboptimal performance on low-resource languages, primarily due to inefficient subword segmentation and systemic training data imbalances. In this paper, we propose Variable Entropy Policy Optimization (VEPO), which leverages Reinforcement Learning with Verifiable Rewards to incorporate deterministic structural constraints into the policy alignment process. This framework ensures prescribed sequence length, robust format consistency, and rigorous linguistic well-formedness, all enforced during training. Central to our approach is a variable entropy mechanism that enables the model to dynamically calibrate the equilibrium between literal fidelity and semantic naturalness by modulating the exploration-exploitation manifold. By integrating entropy-tempered advantage estimation with asymmetric clipping, VEPO sustains robust exploration while mitigating policy collapse. Empirical evaluations across 90 FLORES-200, COMET-22, chrF directions demonstrate that VEPO yields substantial improvements in both tokenization efficiency and translation quality, bridging the performance gap for underrepresented languages.
摘要：大型语言模型经常在低资源语言上表现出次优性能，这主要是由于低效的子词分割和系统训练数据不平衡。在本文中，我们提出了可变熵策略优化（VEPO），它利用具有可验证奖励的强化学习将确定性结构约束纳入策略调整过程。该框架确保了规定的序列长度、强大的格式一致性和严格的语言格式，所有这些都在训练期间强制执行。我们方法的核心是可变熵机制，该机制使模型能够通过调节探索利用流形来动态校准字面保真度和语义自然度之间的平衡。通过将熵调节优势估计与不对称裁剪相结合，VEPO 可以维持稳健的探索，同时减轻策略崩溃。对 90 个 FLORES-200、COMET-22、chrF 方向的实证评估表明，VEPO 在标记化效率和翻译质量方面都取得了实质性改进，缩小了代表性不足的语言的性能差距。

Title: Evaluating Counterfactual Strategic Reasoning in Large Language Models

Authors: Dimitrios Georgousis, Maria Lymperaiou, Angeliki Dimitriou, Giorgos Filandrianos, Giorgos Stamou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.19167
Pdf URL: https://arxiv.org/pdf/2603.19167
Copy Paste: [[2603.19167]] Evaluating Counterfactual Strategic Reasoning in Large Language Models(https://arxiv.org/abs/2603.19167)
Keywords: language model, llm
Abstract: We evaluate Large Language Models (LLMs) in repeated game-theoretic settings to assess whether strategic performance reflects genuine reasoning or reliance on memorized patterns. We consider two canonical games, Prisoner's Dilemma (PD) and Rock-Paper-Scissors (RPS), upon which we introduce counterfactual variants that alter payoff structures and action labels, breaking familiar symmetries and dominance relations. Our multi-metric evaluation framework compares default and counterfactual instantiations, showcasing LLM limitations in incentive sensitivity, structural generalization and strategic reasoning within counterfactual environments.
摘要：我们在重复的博弈论环境中评估大型语言模型（LLM），以评估战略绩效是否反映了真正的推理或对记忆模式的依赖。我们考虑两种典型的游戏，囚徒困境（PD）和石头剪刀布（RPS），在这两种游戏上，我们引入了反事实变体，改变了收益结构和动作标签，打破了熟悉的对称性和支配关系。我们的多指标评估框架比较了默认实例和反事实实例，展示了 LLM 在反事实环境中激励敏感性、结构泛化和战略推理方面的局限性。

Title: Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation

Authors: Zhuolin Yang, Zihan Liu, Yang Chen, Wenliang Dai, Boxin Wang, Sheng-Chieh Lin, Chankyu Lee, Yangyi Chen, Dongfu Jiang, Jiafan He, Renjie Pi, Grace Lam, Nayeon Lee, Alexander Bukharin, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.19220
Pdf URL: https://arxiv.org/pdf/2603.19220
Copy Paste: [[2603.19220]] Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation(https://arxiv.org/abs/2603.19220)
Keywords: llm, agent
Abstract: We introduce Nemotron-Cascade 2, an open 30B MoE model with 3B activated parameters that delivers best-in-class reasoning and strong agentic capabilities. Despite its compact size, its mathematical and coding reasoning performance approaches that of frontier open models. It is the second open-weight LLM, after DeepSeekV3.2-Speciale-671B-A37B, to achieve Gold Medal-level performance in the 2025 International Mathematical Olympiad (IMO), the International Olympiad in Informatics (IOI), and the ICPC World Finals, demonstrating remarkably high intelligence density with 20x fewer parameters. In contrast to Nemotron-Cascade 1, the key technical advancements are as follows. After SFT on a meticulously curated dataset, we substantially expand Cascade RL to cover a much broader spectrum of reasoning and agentic domains. Furthermore, we introduce multi-domain on-policy distillation from the strongest intermediate teacher models for each domain throughout the Cascade RL process, allowing us to efficiently recover benchmark regressions and sustain strong performance gains along the way. We release the collection of model checkpoints and training data.
摘要：我们推出 Nemotron-Cascade 2，这是一种开放式 30B MoE 模型，具有 3B 激活参数，可提供一流的推理和强大的代理功能。尽管尺寸紧凑，但其数学和编码推理性能接近前沿开放模型。这是继 DeepSeekV3.2-Speciale-671B-A37B 之后，第二个在 2025 年国际数学奥林匹克 (IMO)、国际信息学奥林匹克 (IOI) 和 ICPC 世界总决赛中获得金牌级别表现的开放权重 LLM，在参数减少 20 倍的情况下展示了极高的智力密度。与 Nemotron-Cascade 1 相比，关键技术进步如下。在精心策划的数据集上进行 SFT 后，我们大幅扩展了 Cascade RL，以覆盖更广泛的推理和代理领域。此外，我们在整个 Cascade RL 过程中从每个域的最强中间教师模型中引入多域同策略蒸馏，使我们能够有效地恢复基准回归并在此过程中保持强劲的性能提升。我们发布了模型检查点和训练数据的集合。

Title: F2LLM-v2: Inclusive, Performant, and Efficient Embeddings for a Multilingual World

Authors: Ziyin Zhang, Zihan Liao, Hang Yu, Peng Di, Rui Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.19223
Pdf URL: https://arxiv.org/pdf/2603.19223
Copy Paste: [[2603.19223]] F2LLM-v2: Inclusive, Performant, and Efficient Embeddings for a Multilingual World(https://arxiv.org/abs/2603.19223)
Keywords: llm
Abstract: We present F2LLM-v2, a new family of general-purpose, multilingual embedding models in 8 distinct sizes ranging from 80M to 14B. Trained on a newly curated composite of 60 million publicly available high-quality data samples, F2LLM-v2 supports more than 200 languages, with a particular emphasis on previously underserved mid- and low-resource languages. By integrating a two-stage LLM-based embedding training pipeline with matryoshka learning, model pruning, and knowledge distillation techniques, we present models that are far more efficient than previous LLM-based embedding models while retaining competitive performances. Extensive evaluations confirm that F2LLM-v2-14B ranks first on 11 MTEB benchmarks, while the smaller models in the family also set a new state of the art for resource-constrained applications. To facilitate open-source embedding model research, we release all models, data, code, and intermediate checkpoints.
摘要：我们推出了 F2LLM-v2，这是一个新的通用、多语言嵌入模型系列，有 8 种不同的大小，范围从 80M 到 14B。 F2LLM-v2 基于新近整理的 6000 万个公开可用的高质量数据样本进行训练，支持 200 多种语言，特别强调以前服务不足的中低资源语言。通过将基于 LLM 的两阶段嵌入训练流程与俄罗斯套娃学习、模型剪枝和知识蒸馏技术相集成，我们提出的模型比以前的基于 LLM 的嵌入模型更加高效，同时保留了有竞争力的性能。广泛的评估证实，F2LLM-v2-14B 在 11 项 MTEB 基准测试中排名第一，而该系列中的较小型号也为资源受限的应用树立了新的技术水平。为了促进开源嵌入模型研究，我们发布了所有模型、数据、代码和中间检查点。