2025-11-07

Title: Activation-Space Personality Steering: Hybrid Layer Selection for Stable Trait Control in LLMs

Authors: Pranav Bhandari, Nicolas Fay, Sanjeevan Selvaganapathy, Amitava Datta, Usman Naseem, Mehwish Nasim
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.03738
Pdf URL: https://arxiv.org/pdf/2511.03738
Copy Paste: [[2511.03738]] Activation-Space Personality Steering: Hybrid Layer Selection for Stable Trait Control in LLMs(https://arxiv.org/abs/2511.03738)
Keywords: language model, llm
Abstract: Large Language Models exhibit implicit personalities in their generation, but reliably controlling or aligning these traits to meet specific needs remains an open challenge. The need for effective mechanisms for behavioural manipulation of the model during generation is a critical gap in the literature that needs to be filled. Personality-aware LLMs offer a promising direction towards this objective. However, the relationship between these psychological constructs and their representations within LLMs remains underexplored and requires further investigation. Moreover, it is intriguing to understand and study the use of these representations to steer the models' behaviour. We propose a novel pipeline that extracts hidden state activations from transformer layers using the Big Five Personality Traits (Openness, Conscientiousness, Extraversion, Agreeableness and Neuroticism), a comprehensive and empirically validated framework for modelling human personality; applies low-rank subspace discovery methods; and identifies trait-specific optimal layers across different model architectures for robust injection. The resulting personality-aligned directions are then operationalised through a flexible steering framework with dynamic layer selection, enabling precise control of trait expression in LLM outputs. Our findings reveal that personality traits occupy a low-rank shared subspace, and that these latent structures can be transformed into actionable mechanisms for effective steering through careful perturbations without impacting the fluency, variance and general capabilities, helping to bridge the gap between psychological theory and practical model alignment.
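The extract-then-inject idea in the abstract above can be illustrated with a toy sketch. It swaps in a plain difference-of-means direction on hand-written vectors; the paper's low-rank subspace discovery and hybrid layer selection are not reproduced, and every name here (`steering_direction`, `steer`, `alpha`) is hypothetical.

```python
# Toy sketch of activation steering with a difference-of-means direction.
# Assumption: this is NOT the paper's pipeline; it only shows the general
# "find a trait direction, then add it to a hidden state" pattern.

def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def steering_direction(high_trait, low_trait):
    """Unit vector pointing from low-trait to high-trait mean activations."""
    hi, lo = mean_vector(high_trait), mean_vector(low_trait)
    diff = [a - b for a, b in zip(hi, lo)]
    norm = sum(d * d for d in diff) ** 0.5
    return [d / norm for d in diff]

def steer(hidden, direction, alpha):
    """Add the scaled direction to one hidden-state vector."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

# Hypothetical activations collected from prompts high/low in one trait.
high = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]
low = [[0.0, 2.0, 0.0], [0.0, 2.0, 0.0]]
direction = steering_direction(high, low)
steered = steer([0.5, 0.5, 0.5], direction, alpha=2.0)
```

In a real model the vectors would be per-layer hidden states, and `alpha` would need tuning so that fluency is preserved, which is the trade-off the abstract highlights.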

Title: TextualVerifier: Verify TextGrad Step-by-Step

Authors: Eugenius Mario Situmorang, Adila Alfa Krisnadhi, Ari Wibisono
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.03739
Pdf URL: https://arxiv.org/pdf/2511.03739
Copy Paste: [[2511.03739]] TextualVerifier: Verify TextGrad Step-by-Step(https://arxiv.org/abs/2511.03739)
Keywords: language model, llm, chain-of-thought
Abstract: TextGrad is a novel approach to text-based automatic differentiation that enables composite AI systems to perform optimization without explicit numerical equations. However, it currently lacks self-verification mechanisms that ensure reasoning validity in text-based decision making. This research introduces TextualVerifier, a verification framework that leverages chain-of-thought reasoning and majority voting with large language models to address this verification gap. TextualVerifier implements a four-stage workflow: chain-of-thought decomposition, variant generation, majority voting, and consensus aggregation. It integrates non-invasively with TextGrad at both the loss function and optimization result verification stages. Experimental evaluation using the Gemini 1.5 Pro model is conducted in two phases: (1) standalone evaluation on PRM800K, and (2) integrated evaluation with TextGrad on GPQA-Diamond, MMLU-ML, and MMLU-CP benchmarks. Results show statistically significant improvements (p < 0.001). In phase one, TextualVerifier improves the validity of reasoning steps by 29 percent. In phase two, integration into the TextGrad loss function yields a 2.2 percentage point gain, from 68.2 to 70.4 percent, with a moderate overhead of 5.9 LLM calls on average. Further evaluations of TextualVerifier versioning yield 8.08, 10.71, and 3.92 percentage point improvements on GPQA, MMLU-ML, and MMLU-CP respectively. TextualVerifier thus presents the first self-verification framework for TextGrad through LLM-based techniques without requiring numerical gradients, enabling more reliable reasoning and opening new directions for verification in text-based optimization.
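The majority-voting and consensus stages named in the abstract above can be sketched as follows. This is a minimal illustration, assuming each reasoning step already has several LLM-generated variants as plain strings; the function names and the exact-match voting rule are assumptions, not the paper's implementation.

```python
from collections import Counter

def majority_vote(variants):
    """Return (winning variant, vote share) for one reasoning step."""
    counts = Counter(variants)
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(variants)

def verify_steps(step_variants):
    """Aggregate a consensus chain: one winning variant per step."""
    return [majority_vote(variants)[0] for variants in step_variants]

# Hypothetical variants for a two-step chain of thought.
step_variants = [
    ["x = 4", "x = 4", "x = 5"],
    ["y = 2", "y = 3", "y = 2"],
]
consensus = verify_steps(step_variants)
```

Exact string matching is the simplest possible voting rule; a real verifier would likely need to normalise or semantically compare variants before counting them.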

Title: GRDD+: An Extended Greek Dialectal Dataset with Cross-Architecture Fine-tuning Evaluation

Authors: Stergios Chatzikyriakidis, Dimitris Papadakis, Sevasti-Ioanna Papaioannou, Erofili Psaltaki
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.03772
Pdf URL: https://arxiv.org/pdf/2511.03772
Copy Paste: [[2511.03772]] GRDD+: An Extended Greek Dialectal Dataset with Cross-Architecture Fine-tuning Evaluation(https://arxiv.org/abs/2511.03772)
Keywords: gpt, llm, chat
Abstract: We present an extended Greek Dialectal Dataset (GRDD+) that complements the existing GRDD dataset with more data from Cretan, Cypriot, Pontic and Northern Greek, while we add six new varieties: Greco-Corsican, Griko (Southern Italian Greek), Maniot, Heptanesian, Tsakonian, and Katharevusa Greek. The result is a dataset with total size 6,374,939 words and 10 varieties. This is the first dataset with such variation and size to date. We conduct a number of fine-tuning experiments to see the effect of good quality dialectal data on a number of LLMs. We fine-tune three model architectures (Llama-3-8B, Llama-3.1-8B, Krikri-8B) and compare the results to frontier models (Claude-3.7-Sonnet, Gemini-2.5, ChatGPT-5).

Title: PLLuM: A Family of Polish Large Language Models

Authors: Jan Kocoń, Maciej Piasecki, Arkadiusz Janz, Teddy Ferdinan, Łukasz Radliński, Bartłomiej Koptyra, Marcin Oleksy, Stanisław Woźniak, Paweł Walkowiak, Konrad Wojtasik, Julia Moska, Tomasz Naskręt, Bartosz Walkowiak, Mateusz Gniewkowski, Kamil Szyc, Dawid Motyka, Dawid Banach, Jonatan Dalasiński, Ewa Rudnicka, Bartłomiej Alberski, Tomasz Walkowiak, Aleksander Szczęsny, Maciej Markiewicz, Tomasz Bernaś, Hubert Mazur, Kamil Żyta, Mateusz Tykierko, Grzegorz Chodak, Tomasz Kajdanowicz, Przemysław Kazienko, Agnieszka Karlińska, Karolina Seweryn, Anna Kołos, Maciej Chrabąszcz, Katarzyna Lorenc, Aleksandra Krasnodębska, Artur Wilczek, Katarzyna Dziewulska, Paula Betscher, Zofia Cieślińska, Katarzyna Kowol, Daria Mikoś, Maciej Trzciński, Dawid Krutul, Marek Kozłowski, Sławomir Dadas, Rafał Poświata, Michał Perełkiewicz, Małgorzata Grębowiec, Maciej Kazuła, Marcin Białas, Roman Roszko, Danuta Roszko, Jurgita Vaičenonienė, Andrius Utka, Paweł Levchuk, Paweł Kowalski, Irena Prawdzic-Jankowska, Maciej Ogrodniczuk, Monika Borys, Anna Bulińska, Wiktoria Gumienna, Witold Kieraś, Dorota Komosińska, Katarzyna Krasnowska-Kieraś, Łukasz Kobyliński, Martyna Lewandowska, Marek Łaziński, Mikołaj Łątkowski, Dawid Mastalerz, Beata Milewicz, Agnieszka Anna Mykowiecka, Angelika Peljak-Łapińska, Sandra Penno, Zuzanna Przybysz, Michał Rudolf, Piotr Rybak, Karolina Saputa, Aleksandra Tomaszewska, Aleksander Wawer, Marcin Woliński, Joanna Wołoszyn, Alina Wróblewska, Bartosz Żuk, Filip Żarnecki, Konrad Kaczyński, Anna Cichosz, Zuzanna Deckert, Monika Garnys, Izabela Grabarczyk, Wojciech Janowski, Sylwia Karasińska, Aleksandra Kujawiak, Piotr Misztela, Maria Szymańska, Karolina Walkusz, Igor Siek, Jakub Kwiatkowski, Piotr Pęzik
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.03823
Pdf URL: https://arxiv.org/pdf/2511.03823
Copy Paste: [[2511.03823]] PLLuM: A Family of Polish Large Language Models(https://arxiv.org/abs/2511.03823)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) play a central role in modern artificial intelligence, yet their development has been primarily focused on English, resulting in limited support for other languages. We present PLLuM (Polish Large Language Model), the largest open-source family of foundation models tailored specifically for the Polish language. Developed by a consortium of major Polish research institutions, PLLuM addresses the need for high-quality, transparent, and culturally relevant language models beyond the English-centric commercial landscape. We describe the development process, including the construction of a new 140-billion-token Polish text corpus for pre-training, a 77k custom instructions dataset, and a 100k preference optimization dataset. A key component is a Responsible AI framework that incorporates strict data governance and a hybrid module for output correction and safety filtering. We detail the models' architecture, training procedures, and alignment techniques for both base and instruction-tuned variants, and demonstrate their utility in a downstream task within public administration. By releasing these models publicly, PLLuM aims to foster open research and strengthen sovereign AI technologies in Poland.

Title: STARS: Segment-level Token Alignment with Rejection Sampling in Large Language Models

Authors: Mohammad Atif Quamar, Mohammad Areeb, Mikhail Kuznetsov, Muslum Ozgur Ozmen, Z. Berkay Celik
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.03827
Pdf URL: https://arxiv.org/pdf/2511.03827
Copy Paste: [[2511.03827]] STARS: Segment-level Token Alignment with Rejection Sampling in Large Language Models(https://arxiv.org/abs/2511.03827)
Keywords: language model, llm
Abstract: Aligning large language models with human values is crucial for their safe deployment; however, existing methods, such as fine-tuning, are computationally expensive and suboptimal. In contrast, inference-time approaches like Best-of-N sampling require practically infeasible computation to achieve optimal alignment. We propose STARS: Segment-level Token Alignment with Rejection Sampling, a decoding-time algorithm that steers model generation by iteratively sampling, scoring, and rejecting/accepting short, fixed-size token segments. This allows for early correction of the generation path, significantly improving computational efficiency and boosting alignment quality. Across a suite of six LLMs, we show that STARS outperforms Supervised Fine-Tuning (SFT) by up to 14.9 percentage points and Direct Preference Optimization (DPO) by up to 4.3 percentage points on win-rates, while remaining highly competitive with strong Best-of-N baselines. Our work establishes granular, reward-guided sampling as a generalizable, robust, and efficient alternative to traditional fine-tuning and full-sequence ranking methods for aligning LLMs.
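The sample, score, accept/reject loop described in the abstract above can be sketched on toy data. The acceptance rule used here (a fixed reward threshold with a fallback to the best-scoring proposal) and all names are assumptions for illustration; the abstract does not specify the paper's actual criterion.

```python
import random

def segment_rejection_sampling(sample_segment, reward, n_segments,
                               threshold, max_tries, seed=0):
    """Build a sequence segment by segment, resampling low-reward segments.

    sample_segment(prefix, rng) -> list of tokens (hypothetical sampler)
    reward(tokens) -> float score of prefix + proposed segment
    """
    rng = random.Random(seed)
    tokens = []
    for _ in range(n_segments):
        best_seg, best_score = None, float("-inf")
        for _ in range(max_tries):
            seg = sample_segment(tokens, rng)
            score = reward(tokens + seg)
            if score > best_score:
                best_seg, best_score = seg, score
            if score >= threshold:  # accept and stop resampling
                break
        tokens.extend(best_seg)  # fall back to the best proposal seen
    return tokens

# Toy setup: a two-word vocabulary and a reward that prefers "good" tokens.
vocab = ["good", "bad"]
toy_sampler = lambda prefix, rng: [rng.choice(vocab) for _ in range(2)]
toy_reward = lambda toks: toks.count("good") / len(toks)
output = segment_rejection_sampling(toy_sampler, toy_reward,
                                    n_segments=3, threshold=0.9, max_tries=8)
```

Because a poor segment is rejected before the next one is sampled, wasted computation is bounded by `max_tries` per segment rather than by whole-sequence regeneration, which is the efficiency argument the abstract makes against full-sequence Best-of-N.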

Title: Divide, Cache, Conquer: Dichotomic Prompting for Efficient Multi-Label LLM-Based Classification

Authors: Mikołaj Langner, Jan Eliasz, Ewa Rudnicka, Jan Kocoń
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.03830
Pdf URL: https://arxiv.org/pdf/2511.03830
Copy Paste: [[2511.03830]] Divide, Cache, Conquer: Dichotomic Prompting for Efficient Multi-Label LLM-Based Classification(https://arxiv.org/abs/2511.03830)
Keywords: language model, llm, prompt
Abstract: We introduce a method for efficient multi-label text classification with large language models (LLMs), built on reformulating classification tasks as sequences of dichotomic (yes/no) decisions. Instead of generating all labels in a single structured response, each target dimension is queried independently, which, combined with a prefix caching mechanism, yields substantial efficiency gains for short-text inference without loss of accuracy. To demonstrate the approach, we focus on affective text analysis, covering 24 dimensions including emotions and sentiment. Using LLM-to-SLM distillation, a powerful annotator model (DeepSeek-V3) provides multiple annotations per text, which are aggregated to fine-tune smaller models (HerBERT-Large, CLARIN-1B, PLLuM-8B, Gemma3-1B). The fine-tuned models show significant improvements over zero-shot baselines, particularly on the dimensions seen during training. Our findings suggest that decomposing multi-label classification into dichotomic queries, combined with distillation and cache-aware inference, offers a scalable and effective framework for LLM-based classification. While we validate the method on affective states, the approach is general and applicable across domains.
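The dichotomic reformulation in the abstract above amounts to one yes/no query per label, all sharing the same prompt prefix. The sketch below shows that prompt layout only; the prompt wording, the `ask` callback, and the keyword-based toy model are hypothetical, and prefix caching itself is left to whatever serving engine is used.

```python
def build_dichotomic_prompts(text, labels):
    """Return the shared prefix and one yes/no prompt per label."""
    prefix = f"Text: {text}\nAnswer yes or no.\n"
    prompts = [f"{prefix}Does the text express {label}?" for label in labels]
    return prefix, prompts

def classify(text, labels, ask):
    """Query each label independently; `ask` maps a prompt to 'yes'/'no'."""
    _, prompts = build_dichotomic_prompts(text, labels)
    return [label for label, prompt in zip(labels, prompts)
            if ask(prompt).strip().lower() == "yes"]

# Toy stand-in for an LLM: answers "yes" when a cue word appears in the prompt.
CUES = {"joy": "delighted", "anger": "furious", "sadness": "miserable"}

def toy_ask(prompt):
    label = prompt.rsplit(" ", 1)[-1].rstrip("?")
    return "yes" if CUES[label] in prompt else "no"

predicted = classify("I am delighted with the result",
                     ["joy", "anger", "sadness"], toy_ask)
```

Putting the input text first and the label question last is what makes the prefix identical across queries, so an engine with prefix caching can reuse the encoded text for every label.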

Title: Evaluating Machine Translation Datasets for Low-Web Data Languages: A Gendered Lens

Authors: Hellina Hailu Nigatu, Bethelhem Yemane Mamo, Bontu Fufa Balcha, Debora Taye Tesfaye, Elbethel Daniel Zewdie, Ikram Behiru Nesiru, Jitu Ewnetu Hailu, Senait Mengesha Yayo
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2511.03880
Pdf URL: https://arxiv.org/pdf/2511.03880
Copy Paste: [[2511.03880]] Evaluating Machine Translation Datasets for Low-Web Data Languages: A Gendered Lens(https://arxiv.org/abs/2511.03880)
Keywords: prompt
Abstract: As low-resourced languages are increasingly incorporated into NLP research, there is an emphasis on collecting large-scale datasets. But in prioritizing quantity over quality, we risk 1) building language technologies that perform poorly for these languages and 2) producing harmful content that perpetuates societal biases. In this paper, we investigate the quality of Machine Translation (MT) datasets for three low-resourced languages--Afan Oromo, Amharic, and Tigrinya, with a focus on the gender representation in the datasets. Our findings demonstrate that while training data has a large representation of political and religious domain text, benchmark datasets are focused on news, health, and sports. We also found a large skew towards the male gender--in names of persons, the grammatical gender of verbs, and in stereotypical depictions in the datasets. Further, we found harmful and toxic depictions against women, which were more prominent for the language with the largest amount of data, underscoring that quantity does not guarantee quality. We hope that our work inspires further inquiry into the datasets collected for low-resourced languages and prompts early mitigation of harmful content. WARNING: This paper contains discussion of NSFW content that some may find disturbing.

Title: GRAD: Graph-Retrieved Adaptive Decoding for Hallucination Mitigation

Authors: Manh Nguyen, Sunil Gupta, Dai Do, Hung Le
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2511.03900
Pdf URL: https://arxiv.org/pdf/2511.03900
Copy Paste: [[2511.03900]] GRAD: Graph-Retrieved Adaptive Decoding for Hallucination Mitigation(https://arxiv.org/abs/2511.03900)
Keywords: language model, llm, hallucination, prompt
Abstract: Hallucination mitigation remains a persistent challenge for large language models (LLMs), even as model scales grow. Existing approaches often rely on external knowledge sources, such as structured databases or knowledge graphs, accessed through prompting or retrieval. However, prompt-based grounding is fragile and domain-sensitive, while symbolic knowledge integration incurs heavy retrieval and formatting costs. Motivated by knowledge graphs, we introduce Graph-Retrieved Adaptive Decoding (GRAD), a decoding-time method that grounds generation in corpus-derived evidence without retraining. GRAD constructs a sparse token transition graph by accumulating next-token logits across a small retrieved corpus in a single forward pass. During decoding, graph-retrieved logits are max-normalized and adaptively fused with model logits to favor high-evidence continuations while preserving fluency. Across three models and a range of question-answering benchmarks spanning intrinsic, extrinsic hallucination, and factuality tasks, GRAD consistently surpasses baselines, achieving up to 9.7% higher intrinsic accuracy, 8.6% lower hallucination rates, and 6.9% greater correctness compared to greedy decoding, while attaining the highest truth-informativeness product score among all methods. GRAD offers a lightweight, plug-and-play alternative to contrastive decoding and knowledge graph augmentation, demonstrating that statistical evidence from corpus-level token transitions can effectively steer generation toward more truthful and verifiable outputs.
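The fusion step in the abstract above (max-normalize the graph-retrieved scores, then combine them with the model's logits) can be sketched for a single decoding step. The fixed weight `alpha`, the additive combination, and the assumption that graph scores are non-negative accumulated evidence are illustrative choices; the abstract does not give the paper's adaptive fusion rule.

```python
def max_normalize(scores):
    """Scale scores so the largest becomes 1.0 (assumes non-negative evidence)."""
    peak = max(scores.values())
    if peak <= 0:
        return dict(scores)
    return {tok: s / peak for tok, s in scores.items()}

def fuse_logits(model_logits, graph_scores, alpha):
    """Add alpha-weighted normalized graph evidence to each model logit."""
    normed = max_normalize(graph_scores)
    return {tok: logit + alpha * normed.get(tok, 0.0)
            for tok, logit in model_logits.items()}

# Hypothetical next-token candidates after "The capital of France is".
model_logits = {"Paris": 1.0, "Lyon": 1.2, "Rome": 0.1}
graph_scores = {"Paris": 8.0, "Lyon": 2.0}  # evidence from the retrieved corpus
fused = fuse_logits(model_logits, graph_scores, alpha=1.0)
best = max(fused, key=fused.get)
```

Tokens with no corpus evidence keep their original logit, so the model's own distribution still governs continuations the graph says nothing about.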

Title: Context informs pragmatic interpretation in vision-language models

Authors: Alvin Wei Ming Tan, Ben Prystawski, Veronica Boyce, Michael C. Frank
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.03908
Pdf URL: https://arxiv.org/pdf/2511.03908
Copy Paste: [[2511.03908]] Context informs pragmatic interpretation in vision-language models(https://arxiv.org/abs/2511.03908)
Keywords: language model, agent
Abstract: Iterated reference games - in which players repeatedly pick out novel referents using language - present a test case for agents' ability to perform context-sensitive pragmatic reasoning in multi-turn linguistic environments. We tested humans and vision-language models on trials from iterated reference games, varying the given context in terms of amount, order, and relevance. Without relevant context, models were above chance but substantially worse than humans. However, with relevant context, model performance increased dramatically over trials. Few-shot reference games with abstract referents remain a difficult task for machine learning models.

Title: The Human Flourishing Geographic Index: A County-Level Dataset for the United States, 2013--2023

Authors: Stefano M. Iacus, Devika Jain, Andrea Nasuto, Giuseppe Porro, Marcello Carammia, Andrea Vezzulli
Subjects: cs.CL, cs.CY, stat.AP
Abstract URL: https://arxiv.org/abs/2511.03915
Pdf URL: https://arxiv.org/pdf/2511.03915
Copy Paste: [[2511.03915]] The Human Flourishing Geographic Index: A County-Level Dataset for the United States, 2013--2023(https://arxiv.org/abs/2511.03915)
Keywords: language model
Abstract: Quantifying human flourishing, a multidimensional construct including happiness, health, purpose, virtue, relationships, and financial stability, is critical for understanding societal well-being beyond economic indicators. Existing measures often lack fine spatial and temporal resolution. Here we introduce the Human Flourishing Geographic Index (HFGI), derived from analyzing approximately 2.6 billion geolocated U.S. tweets (2013-2023) using fine-tuned large language models to classify expressions across 48 indicators aligned with Harvard's Global Flourishing Study framework plus attitudes towards migration and perception of corruption. The dataset offers monthly and yearly county- and state-level indicators of flourishing-related discourse, validated to confirm that the measures accurately represent the underlying constructs and show expected correlations with established indicators. This resource enables multidisciplinary analyses of well-being, inequality, and social change at unprecedented resolution, offering insights into the dynamics of human flourishing as reflected in social media discourse across the United States over the past decade.

Title: Direct Semantic Communication Between Large Language Models via Vector Translation

Authors: Fu-Chun Yang, Jason Eshraghian
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.03945
Pdf URL: https://arxiv.org/pdf/2511.03945
Copy Paste: [[2511.03945]] Direct Semantic Communication Between Large Language Models via Vector Translation(https://arxiv.org/abs/2511.03945)
Keywords: language model, llm, agent
Abstract: In multi-agent settings, such as debate, reflection, or tool-calling, large language models (LLMs) pass messages as plain tokens, discarding most latent semantics. This constrains information transfer and adds unnecessary computational overhead. We form a latent bridge via vector translations, which use learned mappings that enable direct semantic exchange between representation spaces. A dual-encoder translator trained between Llama-2-7B and Mistral-7B-Instruct attains an average cosine alignment of 0.538. Injecting the translated vectors at 30 percent blending strength steers the target model's generation without destabilizing logits. Bidirectional evaluation shows a 2.01:1 transfer asymmetry, indicating that general-purpose models yield more transferable representations than instruction-tuned variants. This conservative injection preserves computational stability while demonstrating that cross-model latent communication is feasible, enabling collaborative AI systems that share meaning rather than tokens.
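The two quantities reported in the abstract above, cosine alignment and injection at 30 percent blending strength, can be illustrated on toy vectors. Treating the blend as a convex interpolation between the target model's hidden state and the translated vector is an assumption; the abstract does not state the exact injection formula.

```python
def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)

def blend(target_hidden, translated, strength=0.3):
    """Mix a translated vector into the target hidden state.

    Assumption: convex interpolation, (1 - s) * h + s * t.
    """
    return [(1 - strength) * h + strength * t
            for h, t in zip(target_hidden, translated)]

# Hypothetical hidden state and translated vector in the target space.
target_hidden = [1.0, 0.0]
translated = [0.0, 1.0]
injected = blend(target_hidden, translated, strength=0.3)
```

Keeping `strength` well below 1.0 leaves most of the target model's own state intact, which matches the abstract's description of the injection as conservative.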

Title: Abductive Inference in Retrieval-Augmented Language Models: Generating and Validating Missing Premises

Authors: Shiyin Lin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.04020
Pdf URL: https://arxiv.org/pdf/2511.04020
Copy Paste: [[2511.04020]] Abductive Inference in Retrieval-Augmented Language Models: Generating and Validating Missing Premises(https://arxiv.org/abs/2511.04020)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Large Language Models (LLMs) enhanced with retrieval -- commonly referred to as Retrieval-Augmented Generation (RAG) -- have demonstrated strong performance in knowledge-intensive tasks. However, RAG pipelines often fail when retrieved evidence is incomplete, leaving gaps in the reasoning process. In such cases, abductive inference -- the process of generating plausible missing premises to explain observations -- offers a principled approach to bridge these gaps. In this paper, we propose a framework that integrates abductive inference into retrieval-augmented LLMs. Our method detects insufficient evidence, generates candidate missing premises, and validates them through consistency and plausibility checks. Experimental results on abductive reasoning and multi-hop QA benchmarks show that our approach improves both answer accuracy and reasoning faithfulness. This work highlights abductive inference as a promising direction for enhancing the robustness and explainability of RAG systems.

Title: T-FIX: Text-Based Explanations with Features Interpretable to eXperts

Authors: Shreya Havaldar, Helen Jin, Chaehyeon Kim, Anton Xue, Weiqiu You, Marco Gatti, Bhuvnesh Jain, Helen Qu, Daniel A Hashimoto, Amin Madani, Rajat Deo, Sameed Ahmed M. Khatana, Gary E. Weissman, Lyle Ungar, Eric Wong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.04070
Pdf URL: https://arxiv.org/pdf/2511.04070
Copy Paste: [[2511.04070]] T-FIX: Text-Based Explanations with Features Interpretable to eXperts(https://arxiv.org/abs/2511.04070)
Keywords: llm
Abstract: As LLMs are deployed in knowledge-intensive settings (e.g., surgery, astronomy, therapy), users expect not just answers, but also meaningful explanations for those answers. In these settings, users are often domain experts (e.g., doctors, astrophysicists, psychologists) who require explanations that reflect expert-level reasoning. However, current evaluation schemes primarily emphasize plausibility or internal faithfulness of the explanation, which fail to capture whether the content of the explanation truly aligns with expert intuition. We formalize expert alignment as a criterion for evaluating explanations with T-FIX, a benchmark spanning seven knowledge-intensive domains. In collaboration with domain experts, we develop novel metrics to measure the alignment of LLM explanations with expert judgment.

Title: Plan of Knowledge: Retrieval-Augmented Large Language Models for Temporal Knowledge Graph Question Answering

Authors: Xinying Qian, Ying Zhang, Yu Zhao, Baohang Zhou, Xuhui Sui, Xiaojie Yuan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.04072
Pdf URL: https://arxiv.org/pdf/2511.04072
Copy Paste: [[2511.04072]] Plan of Knowledge: Retrieval-Augmented Large Language Models for Temporal Knowledge Graph Question Answering(https://arxiv.org/abs/2511.04072)
Keywords: language model, llm, hallucination
Abstract: Temporal Knowledge Graph Question Answering (TKGQA) aims to answer time-sensitive questions by leveraging factual information from Temporal Knowledge Graphs (TKGs). While previous studies have employed pre-trained TKG embeddings or graph neural networks to inject temporal knowledge, they fail to fully understand the complex semantic information of time constraints. Recently, Large Language Models (LLMs) have shown remarkable progress, benefiting from their strong semantic understanding and reasoning generalization capabilities. However, their temporal reasoning ability remains limited. LLMs frequently suffer from hallucination and a lack of knowledge. To address these limitations, we propose the Plan of Knowledge framework with a contrastive temporal retriever, which is named PoK. Specifically, the proposed Plan of Knowledge module decomposes a complex temporal question into a sequence of sub-objectives from the pre-defined tools, serving as intermediate guidance for reasoning exploration. In parallel, we construct a Temporal Knowledge Store (TKS) with a contrastive retrieval framework, enabling the model to selectively retrieve semantically and temporally aligned facts from TKGs. By combining structured planning with temporal knowledge retrieval, PoK effectively enhances the interpretability and factual consistency of temporal reasoning. Extensive experiments on four benchmark TKGQA datasets demonstrate that PoK significantly improves the retrieval precision and reasoning accuracy of LLMs, surpassing the performance of the state-of-the-art TKGQA methods by up to 56.0%.

Title: The truth is no diaper: Human and AI-generated associations to emotional words

Authors: Špela Vintar, Jan Jona Javoršek
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.04077
Pdf URL: https://arxiv.org/pdf/2511.04077
Copy Paste: [[2511.04077]] The truth is no diaper: Human and AI-generated associations to emotional words(https://arxiv.org/abs/2511.04077)
Keywords: language model, llm
Abstract: Human word associations are a well-known method of gaining insight into the internal mental lexicon, but the responses spontaneously offered by human participants to word cues are not always predictable as they may be influenced by personal experience, emotions or individual cognitive styles. The ability to form associative links between seemingly unrelated concepts can be a driving mechanism of creativity. We compare the associative behaviour of humans with that of large language models. More specifically, we explore associations to emotionally loaded words and try to determine whether large language models generate associations in a similar way to humans. We find that the overlap between humans and LLMs is moderate, but also that the associations of LLMs tend to amplify the underlying emotional load of the stimulus, and that they tend to be more predictable and less creative than human ones.

Title: Batch Prompting Suppresses Overthinking Reasoning Under Constraint: How Batch Prompting Suppresses Overthinking in Reasoning Models

Authors: Wenmo Qiu, Saurabh Srivastava
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.04108
Pdf URL: https://arxiv.org/pdf/2511.04108
Copy Paste: [[2511.04108]] Batch Prompting Suppresses Overthinking Reasoning Under Constraint: How Batch Prompting Suppresses Overthinking in Reasoning Models(https://arxiv.org/abs/2511.04108)
Keywords: language model, llm, prompt
Abstract: Recent work has explored batch prompting as a strategy to amortize inference cost in large language models (LLMs). In this paper, we show that batching offers an additional, underappreciated benefit: it regularizes model behavior during multi-step reasoning for Large Reasoning Models (LRMs). We conduct a comprehensive study across 13 diverse benchmarks and observe that batching improves accuracy while substantially reducing reasoning token usage, often by 3x-5x. Through detailed behavioral analysis, we find that batching suppresses overthinking, reduces hedging language (e.g., repetitive self-corrections), and encourages more decisive answers. Surprisingly, we also observe emergent collective effects in batched inference: models often generalize patterns from earlier examples to solve harder ones in the same batch. These findings position batching not just as a throughput optimization, but as a powerful inference-time regularizer for more efficient and reliable LLM reasoning.

Title: RIDE: Difficulty Evolving Perturbation with Item Response Theory for Mathematical Reasoning

Authors: Xinyuan Li, Murong Xu, Wenbiao Tao, Hanlun Zhu, Yike Zhao, Jipeng Zhang, Yunshi Lan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.04120
Pdf URL: https://arxiv.org/pdf/2511.04120
Copy Paste: [[2511.04120]] RIDE: Difficulty Evolving Perturbation with Item Response Theory for Mathematical Reasoning(https://arxiv.org/abs/2511.04120)
Keywords: language model, llm
Abstract: Large language models (LLMs) achieve high performance on mathematical reasoning, but these results can be inflated by training data leakage or superficial pattern matching rather than genuine reasoning. To this end, an adversarial perturbation-based evaluation is needed to measure true mathematical reasoning ability. Current rule-based perturbation methods often generate ill-posed questions and impede the systematic evaluation of question difficulty and the evolution of benchmarks. To bridge this gap, we propose RIDE, a novel adversarial question-rewriting framework that leverages Item Response Theory (IRT) to rigorously measure question difficulty and to generate intrinsically more challenging, well-posed variations of mathematical problems. We employ 35 LLMs to simulate students and build a difficulty ranker from their responses. This ranker provides a reward signal during reinforcement learning and guides a question-rewriting model to reformulate existing questions across difficulty levels. Applying RIDE to competition-level mathematical benchmarks yields perturbed versions that degrade advanced LLM performance, with experiments showing an average 21.73% drop across 26 models, thereby exposing limited robustness in mathematical reasoning and confirming the validity of our evaluation approach.
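
The RIDE abstract says that question difficulty is measured with Item Response Theory from the answers of LLMs acting as simulated students. The following is a minimal sketch of that idea, assuming the simplest IRT variant, the one-parameter (Rasch) model, where P(correct) = sigmoid(ability - difficulty). The abstract does not say which IRT variant or fitting procedure the authors use, so the function `fit_rasch`, the learning rate, and the toy response matrix are all illustrative assumptions.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def fit_rasch(responses, steps=500, lr=0.1):
    """Fit a 1PL (Rasch) model by gradient ascent on the joint log-likelihood.

    responses[s][q] is 1 if simulated student s answered question q correctly,
    else 0. Returns (abilities, difficulties). Hypothetical helper, not the
    RIDE implementation.
    """
    n_students, n_items = len(responses), len(responses[0])
    theta = [0.0] * n_students  # student abilities
    b = [0.0] * n_items         # item difficulties
    for _ in range(steps):
        g_theta = [0.0] * n_students
        g_b = [0.0] * n_items
        for s in range(n_students):
            for q in range(n_items):
                # Residual = observed outcome minus predicted success probability.
                resid = responses[s][q] - sigmoid(theta[s] - b[q])
                g_theta[s] += resid
                g_b[q] -= resid
        theta = [t + lr * g for t, g in zip(theta, g_theta)]
        b = [d + lr * g for d, g in zip(b, g_b)]
        # The model only depends on theta - b, so shift both by the same
        # amount to keep the difficulties centred at zero.
        mean_b = sum(b) / n_items
        b = [d - mean_b for d in b]
        theta = [t - mean_b for t in theta]
    return theta, b


# Toy data: 4 simulated students x 3 questions (1 = correct).
toy_responses = [
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 1],
    [1, 0, 0],
]
abilities, difficulties = fit_rasch(toy_responses)
# Rank questions from easiest to hardest by estimated difficulty.
ranking = sorted(range(len(difficulties)), key=lambda q: difficulties[q])
print(ranking, [round(d, 2) for d in difficulties])
```

In this toy matrix the third question is answered correctly least often, so it receives the highest estimated difficulty. A ranker of this kind could serve as the reward signal the abstract mentions, although how RIDE turns difficulty into a reward is not specified there.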

Title: CantoASR: Prosody-Aware ASR-LALM Collaboration for Low-Resource Cantonese

Authors: Dazhong Chen, Yi-Cheng Lin, Yuchen Huang, Ziwei Gong, Di Jiang, Zeying Xie, Yi R. (May) Fung
Subjects: cs.CL, cs.SD
Abstract URL: https://arxiv.org/abs/2511.04139
Pdf URL: https://arxiv.org/pdf/2511.04139
Copy Paste: [[2511.04139]] CantoASR: Prosody-Aware ASR-LALM Collaboration for Low-Resource Cantonese(https://arxiv.org/abs/2511.04139)
Keywords: language model
Abstract: Automatic speech recognition (ASR) is critical for language accessibility, yet low-resource Cantonese remains challenging due to limited annotated data, six lexical tones, tone sandhi, and accent variation. Existing ASR models, such as Whisper, often suffer from high word error rates. Large audio-language models (LALMs), in contrast, can leverage broader contextual reasoning but still require explicit tonal and prosodic acoustic cues. We introduce CantoASR, a collaborative ASR-LALM error correction framework that integrates forced alignment for acoustic feature extraction, a LoRA-finetuned Whisper for improved tone discrimination, and an instruction-tuned Qwen-Audio for prosody-aware correction. Evaluations on spontaneous Cantonese data show substantial CER gains over Whisper-Large-V3. These findings suggest that integrating acoustic cues with LALM reasoning provides a scalable strategy for low-resource tonal and dialectal ASR.

Title: BAPPA: Benchmarking Agents, Plans, and Pipelines for Automated Text-to-SQL Generation

Authors: Fahim Ahmed, Md Mubtasim Ahasan, Jahir Sadik Monon, Muntasir Wahed, M Ashraful Amin, A K M Mahbubur Rahman, Amin Ahsan Ali
Subjects: cs.CL, cs.AI, cs.DB, cs.MA
Abstract URL: https://arxiv.org/abs/2511.04153
Pdf URL: https://arxiv.org/pdf/2511.04153
Copy Paste: [[2511.04153]] BAPPA: Benchmarking Agents, Plans, and Pipelines for Automated Text-to-SQL Generation(https://arxiv.org/abs/2511.04153)
Keywords: language model, llm, agent
Abstract: Text-to-SQL systems provide a natural language interface that can enable even laymen to access information stored in databases. However, existing Large Language Models (LLMs) struggle with SQL generation from natural instructions due to large schema sizes and complex reasoning. Prior work often focuses on complex, somewhat impractical pipelines using flagship models, while smaller, efficient models remain overlooked. In this work, we explore three multi-agent LLM pipelines, with systematic performance benchmarking across a range of small to large open-source models: (1) Multi-agent discussion pipeline, where agents iteratively critique and refine SQL queries, and a judge synthesizes the final answer; (2) Planner-Coder pipeline, where a thinking model planner generates stepwise SQL generation plans and a coder synthesizes queries; and (3) Coder-Aggregator pipeline, where multiple coders independently generate SQL queries, and a reasoning agent selects the best query. Experiments on the Bird-Bench Mini-Dev set reveal that Multi-Agent discussion can improve small model performance, with up to a 10.6% increase in Execution Accuracy for Qwen2.5-7b-Instruct seen after three rounds of discussion. Among the pipelines, the LLM Reasoner-Coder pipeline yields the best results, with DeepSeek-R1-32B and QwQ-32B planners boosting Gemma 3 27B IT accuracy from 52.4% to the highest score of 56.4%. Code is available at this https URL.

Title: Trustworthy LLM-Mediated Communication: Evaluating Information Fidelity in LLM as a Communicator (LAAC) Framework in Multiple Application Domains

Authors: Mohammed Musthafa Rafi, Adarsh Krishnamurthy, Aditya Balu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.04184
Pdf URL: https://arxiv.org/pdf/2511.04184
Copy Paste: [[2511.04184]] Trustworthy LLM-Mediated Communication: Evaluating Information Fidelity in LLM as a Communicator (LAAC) Framework in Multiple Application Domains(https://arxiv.org/abs/2511.04184)
Keywords: llm, hallucination, agent
Abstract: The proliferation of AI-generated content has created an absurd communication theater where senders use LLMs to inflate simple ideas into verbose content, recipients use LLMs to compress them back into summaries, and as a consequence neither party engages with authentic content. LAAC (LLM as a Communicator) proposes a paradigm shift - positioning LLMs as intelligent communication intermediaries that capture the sender's intent through structured dialogue and facilitate genuine knowledge exchange with recipients. Rather than perpetuating cycles of AI-generated inflation and compression, LAAC enables authentic communication across diverse contexts including academic papers, proposals, professional emails, and cross-platform content generation. However, deploying LLMs as trusted communication intermediaries raises critical questions about information fidelity, consistency, and reliability. This position paper systematically evaluates the trustworthiness requirements for LAAC's deployment across multiple communication domains. We investigate three fundamental dimensions: (1) Information Capture Fidelity - accuracy of intent extraction during sender interviews across different communication types, (2) Reproducibility - consistency of structured knowledge across multiple interaction instances, and (3) Query Response Integrity - reliability of recipient-facing responses without hallucination, source conflation, or fabrication. Through controlled experiments spanning multiple LAAC use cases, we assess these trust dimensions using LAAC's multi-agent architecture. Preliminary findings reveal measurable trust gaps that must be addressed before LAAC can be reliably deployed in high-stakes communication scenarios.

Title: Computational Turing Test Reveals Systematic Differences Between Human and AI Language

Authors: Nicolò Pagan, Petter Törnberg, Christopher A. Bail, Anikó Hannák, Christopher Barrie
Subjects: cs.CL, cs.MA, cs.SI
Abstract URL: https://arxiv.org/abs/2511.04195
Pdf URL: https://arxiv.org/pdf/2511.04195
Copy Paste: [[2511.04195]] Computational Turing Test Reveals Systematic Differences Between Human and AI Language(https://arxiv.org/abs/2511.04195)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) are increasingly used in the social sciences to simulate human behavior, based on the assumption that they can generate realistic, human-like text. Yet this assumption remains largely untested. Existing validation efforts rely heavily on human-judgment-based evaluations -- testing whether humans can distinguish AI from human output -- despite evidence that such judgments are blunt and unreliable. As a result, the field lacks robust tools for assessing the realism of LLM-generated text or for calibrating models to real-world data. This paper makes two contributions. First, we introduce a computational Turing test: a validation framework that integrates aggregate metrics (BERT-based detectability and semantic similarity) with interpretable linguistic features (stylistic markers and topical patterns) to assess how closely LLMs approximate human language within a given dataset. Second, we systematically compare nine open-weight LLMs across five calibration strategies -- including fine-tuning, stylistic prompting, and context retrieval -- benchmarking their ability to reproduce user interactions on X (formerly Twitter), Bluesky, and Reddit. Our findings challenge core assumptions in the literature. Even after calibration, LLM outputs remain clearly distinguishable from human text, particularly in affective tone and emotional expression. Instruction-tuned models underperform their base counterparts, and scaling up model size does not enhance human-likeness. Crucially, we identify a trade-off: optimizing for human-likeness often comes at the cost of semantic fidelity, and vice versa. These results provide a much-needed scalable framework for validation and calibration in LLM simulations -- and offer a cautionary note about their current limitations in capturing human communication.

Title: LLM-as-a-Judge is Bad, Based on AI Attempting the Exam Qualifying for the Member of the Polish National Board of Appeal

Authors: Michał Karp, Anna Kubaszewska, Magdalena Król, Robert Król, Aleksander Smywiński-Pohl, Mateusz Szymański, Witold Wydmański
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.04205
Pdf URL: https://arxiv.org/pdf/2511.04205
Copy Paste: [[2511.04205]] LLM-as-a-Judge is Bad, Based on AI Attempting the Exam Qualifying for the Member of the Polish National Board of Appeal(https://arxiv.org/abs/2511.04205)
Keywords: language model, gpt, llm, hallucination, retrieval-augmented generation
Abstract: This study provides an empirical assessment of whether current large language models (LLMs) can pass the official qualifying examination for membership in Poland's National Appeal Chamber (Krajowa Izba Odwoławcza). The authors examine two related ideas: using LLMs as actual exam candidates and applying the 'LLM-as-a-judge' approach, in which model-generated answers are automatically evaluated by other models. The paper describes the structure of the exam, which includes a multiple-choice knowledge test on public procurement law and a written judgment, and presents the hybrid information recovery and extraction pipeline built to support the models. Several LLMs (including GPT-4.1, Claude 4 Sonnet and Bielik-11B-v2.6) were tested in closed-book and various Retrieval-Augmented Generation settings. The results show that although the models achieved satisfactory scores in the knowledge test, none met the passing threshold in the practical written part, and the evaluations of the 'LLM-as-a-judge' often diverged from the judgments of the official examining committee. The authors highlight key limitations: susceptibility to hallucinations, incorrect citation of legal provisions, weaknesses in logical argumentation, and the need for close collaboration between legal experts and technical teams. The findings indicate that, despite rapid technological progress, current LLMs cannot yet replace human judges or independent examiners in Polish public procurement adjudication.

Title: REMIND: Input Loss Landscapes Reveal Residual Memorization in Post-Unlearning LLMs

Authors: Liran Cohen, Yaniv Nemcovesky, Avi Mendelson
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2511.04228
Pdf URL: https://arxiv.org/pdf/2511.04228
Copy Paste: [[2511.04228]] REMIND: Input Loss Landscapes Reveal Residual Memorization in Post-Unlearning LLMs(https://arxiv.org/abs/2511.04228)
Keywords: language model, llm
Abstract: Machine unlearning aims to remove the influence of specific training data from a model without requiring full retraining. This capability is crucial for ensuring privacy, safety, and regulatory compliance. Therefore, verifying whether a model has truly forgotten target data is essential for maintaining reliability and trustworthiness. However, existing evaluation methods often assess forgetting at the level of individual inputs. This approach may overlook residual influence present in semantically similar examples. Such influence can compromise privacy and lead to indirect information leakage. We propose REMIND (Residual Memorization In Neighborhood Dynamics), a novel evaluation method aiming to detect the subtle remaining influence of unlearned data and classify whether the data has been effectively forgotten. REMIND analyzes the model's loss over small input variations and reveals patterns unnoticed by single-point evaluations. We show that unlearned data yield flatter, less steep loss landscapes, while retained or unrelated data exhibit sharper, more volatile patterns. REMIND requires only query-based access, outperforms existing methods under similar constraints, and demonstrates robustness across different models, datasets, and paraphrased inputs, making it practical for real-world deployment. By providing a more sensitive and interpretable measure of unlearning effectiveness, REMIND provides a reliable framework to assess unlearning in language models. As a result, REMIND offers a novel perspective on memorization and unlearning.
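
The REMIND abstract describes scoring a sample by how the model's loss behaves over small input variations, with flatter neighbourhoods indicating forgotten data. The following is a rough sketch of that kind of statistic, assuming the mean absolute loss change between an input and its perturbed neighbours as the flatness measure. The function `neighborhood_sharpness`, the toy loss tables, and the example sentences are hypothetical; the abstract does not specify REMIND's actual features or classifier.

```python
from typing import Callable, Sequence


def neighborhood_sharpness(
    loss_fn: Callable[[str], float],
    text: str,
    neighbors: Sequence[str],
) -> float:
    """Mean absolute change in loss between an input and its perturbed variants.

    loss_fn needs only query access to the model (text in, loss out).
    Lower values mean a flatter loss landscape around `text`.
    """
    center = loss_fn(text)
    deltas = [abs(loss_fn(n) - center) for n in neighbors]
    return sum(deltas) / len(deltas)


# Toy stand-ins for model losses (assumed values, not measurements).
unlearned_losses = {
    "the secret code is 1234": 3.00,
    "the secret code was 1234": 3.02,
    "that secret code is 1234": 2.98,
}
retained_losses = {
    "paris is the capital of france": 0.50,
    "paris was the capital of france": 1.40,
    "paris is a capital of france": 0.90,
}


def score(table: dict) -> float:
    # The first key is the original input; the rest are its perturbations.
    original, *variants = table
    return neighborhood_sharpness(table.__getitem__, original, variants)


flat = score(unlearned_losses)
sharp = score(retained_losses)
print(round(flat, 3), round(sharp, 3))
```

With a real model, `loss_fn` would return the language-modelling loss of the queried text, and a threshold or classifier on such statistics would separate forgotten from retained samples.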

Title: Reusing Pre-Training Data at Test Time is a Compute Multiplier

Authors: Alex Fang, Thomas Voice, Ruoming Pang, Ludwig Schmidt, Tom Gunter
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.04234
Pdf URL: https://arxiv.org/pdf/2511.04234
Copy Paste: [[2511.04234]] Reusing Pre-Training Data at Test Time is a Compute Multiplier(https://arxiv.org/abs/2511.04234)
Keywords: language model, retrieval augmented generation
Abstract: Large language models learn from their vast pre-training corpora, gaining the ability to solve an ever increasing variety of tasks; yet although researchers work to improve these datasets, there is little effort to understand how efficient the pre-training apparatus is at extracting ideas and knowledge from the data. In this work, we use retrieval augmented generation along with test-time compute as a way to quantify how much dataset value was left behind by the process of pre-training, and how this changes across scale. We demonstrate that pre-training then retrieving from standard and largely open-sourced datasets results in significant accuracy gains in MMLU, Math-500, and SimpleQA, which persist through decontamination. For MMLU we observe that retrieval acts as a ~5x compute multiplier versus pre-training alone. We show that these results can be further improved by leveraging additional compute at test time to parse the retrieved context, demonstrating a 10 percentage point improvement on MMLU for the public LLaMA 3.1 8B model. Overall, our results suggest that today's pre-training methods do not make full use of the information in existing pre-training datasets, leaving significant room for progress.

Title: Efficient Topic Extraction via Graph-Based Labeling: A Lightweight Alternative to Deep Models

Authors: Salma Mekaooui, Hiba Sofyan, Imane Amaaz, Imane Benchrif, Arsalane Zarghili, Ilham Chaker, Nikola S. Nikolov
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.04248
Pdf URL: https://arxiv.org/pdf/2511.04248
Copy Paste: [[2511.04248]] Efficient Topic Extraction via Graph-Based Labeling: A Lightweight Alternative to Deep Models(https://arxiv.org/abs/2511.04248)
Keywords: gpt, chat
Abstract: Extracting topics from text has become an essential task, especially with the rapid growth of unstructured textual data. Most existing works rely on computationally intensive methods to address this challenge. In this paper, we argue that probabilistic and statistical approaches, such as topic modeling (TM), can offer effective alternatives that require fewer computational resources. TM is a statistical method that automatically discovers topics in large collections of unlabeled text; however, it produces topics as distributions of representative words, which often lack clear interpretability. Our objective is to perform topic labeling by assigning meaningful labels to these sets of words. To achieve this without relying on computationally expensive models, we propose a graph-based approach that not only enriches topic words with semantically related terms but also explores the relationships among them. By analyzing these connections within the graph, we derive suitable labels that accurately capture each topic's meaning. We present a comparative study between our proposed method and several benchmarks, including ChatGPT-3.5, across two different datasets. Our method achieved consistently better results than traditional benchmarks in terms of BERTScore and cosine similarity and produced results comparable to ChatGPT-3.5, while remaining computationally efficient. Finally, we discuss future directions for topic labeling and highlight potential research avenues for enhancing interpretability and automation.

Title: SSPO: Subsentence-level Policy Optimization

Authors: Kun Yang, Zikang Chen, Yanmeng Wang, Zhigen Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.04256
Pdf URL: https://arxiv.org/pdf/2511.04256
Copy Paste: [[2511.04256]] SSPO: Subsentence-level Policy Optimization(https://arxiv.org/abs/2511.04256)
Keywords: language model, llm
Abstract: As a significant part of post-training for Large Language Models (LLMs), Reinforcement Learning from Verifiable Reward (RLVR) has greatly improved LLMs' reasoning skills. However, some RLVR algorithms, such as GRPO (Group Relative Policy Optimization) and GSPO (Group Sequence Policy Optimization), are observed to suffer from unstable policy updates and low usage of sampling data, respectively. The importance ratio of GRPO is calculated at the token level, which focuses on optimizing individual tokens. This makes it sensitive to outliers, which can lead to training collapse. GSPO instead calculates the importance ratio at the response level, which addresses the high variance and accumulated training noise of the GRPO importance ratio. However, since all the response tokens share a common importance ratio, extreme values can easily raise or lower the overall mean, leading to the entire response being mistakenly discarded and thus lowering the utilization of sampled data. This paper introduces SSPO, which applies a sentence-level importance ratio, striking a balance between GRPO and GSPO. SSPO not only avoids training collapse and high variance, but also prevents all tokens of a response from being discarded by the clipping mechanism. Furthermore, we apply sentence entropy to PPO-CLIP to steadily adjust the clipping bounds, encouraging high-entropy tokens to explore and narrowing the clipping range of low-entropy tokens. In particular, SSPO achieves an average score of 46.57 across five datasets, surpassing GRPO (43.01) and GSPO (44.42), and achieves state-of-the-art performance on three datasets. These results highlight SSPO's effectiveness in leveraging generated data by keeping the strengths of GSPO while avoiding its shortcomings.
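
The SSPO abstract contrasts importance ratios computed per token (GRPO), per response (GSPO), and per sentence-level segment (SSPO). The following sketch illustrates the three granularities on one toy response. It assumes the segment ratio is the length-normalised (geometric-mean) ratio of the tokens in that segment, by analogy with GSPO's response-level ratio; the abstract does not give SSPO's exact formula, and the names `token_ratios`, `segment_ratios`, and the toy log-probabilities are hypothetical.

```python
import math


def token_ratios(new_logp, old_logp):
    """Token-level importance ratios pi_new / pi_old (GRPO-style)."""
    return [math.exp(n - o) for n, o in zip(new_logp, old_logp)]


def segment_ratios(new_logp, old_logp, boundaries):
    """Share one length-normalised ratio among all tokens of each segment.

    boundaries is a list of half-open (start, end) token index ranges.
    One range covering the whole response gives a response-level ratio;
    one range per subsentence is the assumed SSPO-style variant.
    """
    out = []
    for start, end in boundaries:
        diffs = [n - o for n, o in zip(new_logp[start:end], old_logp[start:end])]
        ratio = math.exp(sum(diffs) / len(diffs))  # geometric mean of token ratios
        out.extend([ratio] * (end - start))
    return out


# Toy log-probabilities for a 4-token response; the last token is an outlier.
new_logp = [-1.0, -1.0, -1.0, -1.0]
old_logp = [-1.0, -1.0, -1.0, -3.0]

per_token = token_ratios(new_logp, old_logp)
per_response = segment_ratios(new_logp, old_logp, [(0, 4)])
per_subsentence = segment_ratios(new_logp, old_logp, [(0, 2), (2, 4)])

print([round(r, 3) for r in per_token])
print([round(r, 3) for r in per_response])
print([round(r, 3) for r in per_subsentence])
```

In the toy case the outlier inflates only its own token-level ratio, shifts every token's ratio at the response level, and affects only the second segment at the subsentence level. That matches the trade-off the abstract describes: a clipping step applied afterwards would then discard at most the affected segment instead of the whole response.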

Title: If I Could Turn Back Time: Temporal Reframing as a Historical Reasoning Task for LLMs

Authors: Lars Bungum, Charles Yijia Huang, Abeer Kashar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.04432
Pdf URL: https://arxiv.org/pdf/2511.04432
Copy Paste: [[2511.04432]] If I Could Turn Back Time: Temporal Reframing as a Historical Reasoning Task for LLMs(https://arxiv.org/abs/2511.04432)
Keywords: llm, prompt
Abstract: In this study, we experiment with the ability of LLMs to do temporal reasoning. Using a Norwegian book from 1940 containing trivia questions, we prompt the LLMs to answer the questions as if it were 1940. We also pose the questions in both English and Norwegian. Correct answers are often presented as sentences, and grading is done by means of LLM-as-judge, with sampled checks by a native speaker. Prompting in English consistently gave better results than in Norwegian, an unexpected result. In contrast, using larger LLMs improved results. We tested the DeepSeek-R1, Gemma3, Qwen3, and Llama3.1 model families, and also the largest available LLM specifically built for Norwegian.

Title: ThaiOCRBench: A Task-Diverse Benchmark for Vision-Language Understanding in Thai

Authors: Surapon Nonesung, Teetouch Jaknamon, Sirinya Chaiophat, Natapong Nitarach, Chanakan Wittayasakpan, Warit Sirichotedumrong, Adisai Na-Thalang, Kunat Pipatanakul
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.04479
Pdf URL: https://arxiv.org/pdf/2511.04479
Copy Paste: [[2511.04479]] ThaiOCRBench: A Task-Diverse Benchmark for Vision-Language Understanding in Thai(https://arxiv.org/abs/2511.04479)
Keywords: language model
Abstract: We present ThaiOCRBench, the first comprehensive benchmark for evaluating vision-language models (VLMs) on Thai text-rich visual understanding tasks. Despite recent progress in multimodal modeling, existing benchmarks predominantly focus on high-resource languages, leaving Thai underrepresented, especially in tasks requiring document structure understanding. ThaiOCRBench addresses this gap by offering a diverse, human-annotated dataset comprising 2,808 samples across 13 task categories. We evaluate a wide range of state-of-the-art VLMs in a zero-shot setting, spanning both proprietary and open-source systems. Results show a significant performance gap, with proprietary models (e.g., Gemini 2.5 Pro) outperforming open-source counterparts. Notably, fine-grained text recognition and handwritten content extraction exhibit the steepest performance drops among open-source models. Through detailed error analysis, we identify key challenges such as language bias, structural mismatch, and hallucinated content. ThaiOCRBench provides a standardized framework for assessing VLMs in low-resource, script-complex settings, and provides actionable insights for improving Thai-language document understanding.

Title: RUST-BENCH: Benchmarking LLM Reasoning on Unstructured Text within Structured Tables

Authors: Nikhil Abhyankar, Purvi Chaurasia, Sanchit Kabra, Ananya Srivastava, Vivek Gupta, Chandan K. Reddy
Subjects: cs.CL, cs.AI, cs.DB, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2511.04491
Pdf URL: https://arxiv.org/pdf/2511.04491
Copy Paste: [[2511.04491]] RUST-BENCH: Benchmarking LLM Reasoning on Unstructured Text within Structured Tables(https://arxiv.org/abs/2511.04491)
Keywords: language model, llm, prompt
Abstract: Existing tabular reasoning benchmarks mostly test models on small, uniform tables, underrepresenting the complexity of real-world data and giving an incomplete view of Large Language Models' (LLMs) reasoning abilities. Real tables are long, heterogeneous, and domain-specific, mixing structured fields with free text and requiring multi-hop reasoning across thousands of tokens. To address this gap, we introduce RUST-BENCH, a benchmark of 7966 questions from 2031 real-world tables spanning two domains: i) RB-Science (NSF grant records) and ii) RB-Sports (NBA statistics). Unlike prior work, RUST-BENCH evaluates LLMs jointly across scale, heterogeneity, domain specificity, and reasoning complexity. Experiments with open-source and proprietary models show that LLMs struggle with heterogeneous schemas and complex multi-hop inference, revealing persistent weaknesses in current architectures and prompting strategies. RUST-BENCH establishes a challenging new testbed for advancing tabular reasoning research.

Title: OUNLP at TSAR 2025 Shared Task: Multi-Round Text Simplifier via Code Generation

Authors: Cuong Huynh, Jie Cao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.04495
Pdf URL: https://arxiv.org/pdf/2511.04495
Copy Paste: [[2511.04495]] OUNLP at TSAR 2025 Shared Task: Multi-Round Text Simplifier via Code Generation(https://arxiv.org/abs/2511.04495)
Keywords: gpt, llm, prompt
Abstract: This paper describes the OUNLP system submitted to the TSAR-2025 Shared Task (Alva-Manchego et al., 2025), designed for readability-controlled text simplification using LLM-prompting-based generation. Analyzing prompt-based text simplification methods, we found that text simplification performance is highly related to the gap between the source CEFR (Arase et al., 2022) level and the target CEFR level. Inspired by this finding, we propose two multi-round simplification methods and generate them via GPT-4o: rule-based simplification (MRS-Rule) and jointly rule-based LLM simplification (MRS-Joint). Our submitted systems ranked 7th out of 20 teams. Later improvements with MRS-Joint show that taking the LLM-simplified candidates as the starting point could further boost multi-round simplification performance.

Title: Decoding Emergent Big Five Traits in Large Language Models: Temperature-Dependent Expression and Architectural Clustering

Authors: Christos-Nikolaos Zacharopoulos, Revekka Kyriakoglou
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.04499
Pdf URL: https://arxiv.org/pdf/2511.04499
Copy Paste: [[2511.04499]] Decoding Emergent Big Five Traits in Large Language Models: Temperature-Dependent Expression and Architectural Clustering(https://arxiv.org/abs/2511.04499)
Keywords: language model, llm
Abstract: As Large Language Models (LLMs) become integral to human-centered applications, understanding their personality-like behaviors is increasingly important for responsible development and deployment. This paper systematically evaluates six LLMs, applying the Big Five Inventory-2 (BFI-2) framework, to assess trait expressions under varying sampling temperatures. We find significant differences across four of the five personality dimensions, with Neuroticism and Extraversion susceptible to temperature adjustments. Further, hierarchical clustering reveals distinct model clusters, suggesting that architectural features may predispose certain models toward stable trait profiles. Taken together, these results offer new insights into the emergence of personality-like patterns in LLMs and provide a new perspective on model tuning, selection, and the ethical governance of AI systems. We share the data and code for this analysis here: this https URL

Title: RAGalyst: Automated Human-Aligned Agentic Evaluation for Domain-Specific RAG

Authors: Joshua Gao, Quoc Huy Pham, Subin Varghese, Silwal Saurav, Vedhus Hoskere
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.04502
Pdf URL: https://arxiv.org/pdf/2511.04502
Copy Paste: [[2511.04502]] RAGalyst: Automated Human-Aligned Agentic Evaluation for Domain-Specific RAG(https://arxiv.org/abs/2511.04502)
Keywords: language model, llm, prompt, retrieval-augmented generation, agent
Abstract: Retrieval-Augmented Generation (RAG) is a critical technique for grounding Large Language Models (LLMs) in factual evidence, yet evaluating RAG systems in specialized, safety-critical domains remains a significant challenge. Existing evaluation frameworks often rely on heuristic-based metrics that fail to capture domain-specific nuances, while other works use LLM-as-a-Judge approaches that lack validated alignment with human judgment. This paper introduces RAGalyst, an automated, human-aligned agentic framework designed for the rigorous evaluation of domain-specific RAG systems. RAGalyst features an agentic pipeline that generates high-quality, synthetic question-answering (QA) datasets from source documents, incorporating an agentic filtering step to ensure data fidelity. The framework refines two key LLM-as-a-Judge metrics, Answer Correctness and Answerability, using prompt optimization to achieve a strong correlation with human annotations. Applying this framework to evaluate various RAG components across three distinct domains (military operations, cybersecurity, and bridge engineering), we find that performance is highly context-dependent. No single embedding model, LLM, or hyperparameter configuration proves universally optimal. Additionally, we provide an analysis of the most common reasons for low Answer Correctness in RAG. These findings highlight the necessity of a systematic evaluation framework like RAGalyst, which empowers practitioners to uncover domain-specific trade-offs and make informed design choices for building reliable and effective RAG systems. RAGalyst is available on our GitHub.

Title: Modeling Clinical Uncertainty in Radiology Reports: from Explicit Uncertainty Markers to Implicit Reasoning Pathways

Authors: Paloma Rabaey, Jong Hak Moon, Jung-Oh Lee, Min Gwan Kim, Hangyul Yoon, Thomas Demeester, Edward Choi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.04506
Pdf URL: https://arxiv.org/pdf/2511.04506
Copy Paste: [[2511.04506]] Modeling Clinical Uncertainty in Radiology Reports: from Explicit Uncertainty Markers to Implicit Reasoning Pathways(https://arxiv.org/abs/2511.04506)
Keywords: llm
Abstract: Radiology reports are invaluable for clinical decision-making and hold great potential for automated analysis when structured into machine-readable formats. These reports often contain uncertainty, which we categorize into two distinct types: (i) Explicit uncertainty reflects doubt about the presence or absence of findings, conveyed through hedging phrases. These vary in meaning depending on the context, making rule-based systems insufficient to quantify the level of uncertainty for specific findings; (ii) Implicit uncertainty arises when radiologists omit parts of their reasoning, recording only key findings or diagnoses. Here, it is often unclear whether omitted findings are truly absent or simply unmentioned for brevity. We address these challenges with a two-part framework. We quantify explicit uncertainty by creating an expert-validated, LLM-based reference ranking of common hedging phrases, and mapping each finding to a probability value based on this reference. In addition, we model implicit uncertainty through an expansion framework that systematically adds characteristic sub-findings derived from expert-defined diagnostic pathways for 14 common diagnoses. Using these methods, we release Lunguage++, an expanded, uncertainty-aware version of the Lunguage benchmark of fine-grained structured radiology reports. This enriched resource enables uncertainty-aware image classification, faithful diagnostic reasoning, and new investigations into the clinical impact of diagnostic uncertainty.

Title: Are language models aware of the road not taken? Token-level uncertainty and hidden state dynamics

Authors: Amir Zur, Atticus Geiger, Ekdeep Singh Lubana, Eric Bigelow
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.04527
Pdf URL: https://arxiv.org/pdf/2511.04527
Copy Paste: [[2511.04527]] Are language models aware of the road not taken? Token-level uncertainty and hidden state dynamics(https://arxiv.org/abs/2511.04527)
Keywords: language model, chain-of-thought
Abstract: When a language model generates text, the selection of individual tokens might lead it down very different reasoning paths, making uncertainty difficult to quantify. In this work, we consider whether reasoning language models represent the alternate paths that they could take during generation. To test this hypothesis, we use hidden activations to control and predict a language model's uncertainty during chain-of-thought reasoning. In our experiments, we find a clear correlation between how uncertain a model is at different tokens and how easily the model can be steered by controlling its activations. This suggests that activation interventions are most effective when there are alternate paths available to the model, that is, when it has not yet committed to a particular final answer. We also find that hidden activations can predict a model's future outcome distribution, demonstrating that models implicitly represent the space of possible paths.

Title: IntelliProof: An Argumentation Network-based Conversational Helper for Organized Reflection

Authors: Kaveh Eskandari Miandoab, Katharine Kowalyshyn, Kabir Pamnani, Anesu Gavhera, Vasanth Sarathy, Matthias Scheutz
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.04528
Pdf URL: https://arxiv.org/pdf/2511.04528
Copy Paste: [[2511.04528]] IntelliProof: An Argumentation Network-based Conversational Helper for Organized Reflection(https://arxiv.org/abs/2511.04528)
Keywords: llm
Abstract: We present IntelliProof, an interactive system for analyzing argumentative essays through LLMs. IntelliProof structures an essay as an argumentation graph, where claims are represented as nodes, supporting evidence is attached as node properties, and edges encode supporting or attacking relations. Unlike existing automated essay scoring systems, IntelliProof emphasizes the user experience: each relation is initially classified and scored by an LLM, then visualized for enhanced understanding. The system provides justifications for classifications and produces quantitative measures for essay coherence. It enables rapid exploration of argumentative quality while retaining human oversight. In addition, IntelliProof provides a set of tools for a better understanding of an argumentative essay and its corresponding graph in natural language, bridging the gap between the structural semantics of argumentative essays and the user's understanding of a given text. A live demo and the system are available here to try: this https URL
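The argumentation graph described in the IntelliProof abstract (claims as nodes, evidence as node properties, edges labelled as supporting or attacking) can be illustrated with a small data structure. This is only a sketch under stated assumptions: the class names, the `score` field range, the `net_support` measure, and the example essay are all hypothetical and are not taken from the IntelliProof system, whose actual coherence measures the abstract does not specify.

```python
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass
class Claim:
    """A node: one claim plus the evidence attached to it as a property."""
    text: str
    evidence: List[str] = field(default_factory=list)


@dataclass
class Relation:
    """A directed edge from one claim to another."""
    source: int   # index of the claim doing the supporting/attacking
    target: int   # index of the claim being supported/attacked
    kind: str     # "support" or "attack"
    score: float  # hypothetical strength in [0, 1], e.g. assigned by an LLM


@dataclass
class ArgumentGraph:
    claims: List[Claim] = field(default_factory=list)
    relations: List[Relation] = field(default_factory=list)

    def add_claim(self, text: str, evidence: Sequence[str] = ()) -> int:
        """Add a claim node and return its index."""
        self.claims.append(Claim(text, list(evidence)))
        return len(self.claims) - 1

    def relate(self, source: int, target: int, kind: str, score: float) -> None:
        """Add a support or attack edge between two existing claims."""
        if kind not in ("support", "attack"):
            raise ValueError("kind must be 'support' or 'attack'")
        self.relations.append(Relation(source, target, kind, score))

    def net_support(self, target: int) -> float:
        """Illustrative measure: summed support minus summed attack on a claim."""
        total = 0.0
        for rel in self.relations:
            if rel.target == target:
                total += rel.score if rel.kind == "support" else -rel.score
        return total


# Made-up example essay with one thesis, one supporting and one attacking claim.
graph = ArgumentGraph()
thesis = graph.add_claim("Remote work improves productivity", ["survey data"])
pro = graph.add_claim("There are fewer interruptions at home")
con = graph.add_claim("Collaboration suffers remotely")
graph.relate(pro, thesis, "support", 0.75)
graph.relate(con, thesis, "attack", 0.25)
balance = graph.net_support(thesis)
```

Storing relations as a flat edge list keeps the sketch short; a real system would likely index edges by target node to avoid scanning every relation per query.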

Title: From Model to Breach: Towards Actionable LLM-Generated Vulnerabilities Reporting

Authors: Cyril Vallez, Alexander Sternfeld, Andrei Kucharavy, Ljiljana Dolamic
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.04538
Pdf URL: https://arxiv.org/pdf/2511.04538
Copy Paste: [[2511.04538]] From Model to Breach: Towards Actionable LLM-Generated Vulnerabilities Reporting(https://arxiv.org/abs/2511.04538)
Keywords: language model, llm, prompt
Abstract: As the role of Large Language Model (LLM)-based coding assistants in software development becomes more critical, so does the role of the bugs they generate in the overall cybersecurity landscape. While a number of LLM code security benchmarks have been proposed alongside approaches to improve the security of generated code, it remains unclear to what extent they have impacted widely used coding LLMs. Here, we show that even the latest open-weight models are vulnerable in the earliest reported vulnerability scenarios in a realistic use setting, suggesting that the safety-functionality trade-off has until now prevented effective patching of vulnerabilities. To help address this issue, we introduce a new severity metric, Prompt Exposure (PE), that reflects the risk posed by an LLM-generated vulnerability, accounting for vulnerability severity, generation chance, and the formulation of the prompt that induces vulnerable code generation. To encourage the mitigation of the most serious and prevalent vulnerabilities, we use PE to define the Model Exposure (ME) score, which indicates the severity and prevalence of vulnerabilities a model generates.

Title: BanglaMedQA and BanglaMMedBench: Evaluating Retrieval-Augmented Generation Strategies for Bangla Biomedical Question Answering

Authors: Sadia Sultana, Saiyma Sittul Muna, Mosammat Zannatul Samarukh, Ajwad Abrar, Tareque Mohmud Chowdhury
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.04560
Pdf URL: https://arxiv.org/pdf/2511.04560
Copy Paste: [[2511.04560]] BanglaMedQA and BanglaMMedBench: Evaluating Retrieval-Augmented Generation Strategies for Bangla Biomedical Question Answering(https://arxiv.org/abs/2511.04560)
Keywords: gpt, retrieval-augmented generation, agent
Abstract: Developing accurate biomedical Question Answering (QA) systems in low-resource languages remains a major challenge, limiting equitable access to reliable medical knowledge. This paper introduces BanglaMedQA and BanglaMMedBench, the first large-scale Bangla biomedical Multiple Choice Question (MCQ) datasets designed to evaluate reasoning and retrieval in medical artificial intelligence (AI). The study applies and benchmarks several Retrieval-Augmented Generation (RAG) strategies, including Traditional, Zero-Shot Fallback, Agentic, Iterative Feedback, and Aggregate RAG, combining textbook-based and web retrieval with generative reasoning to improve factual accuracy. A key novelty lies in integrating a Bangla medical textbook corpus through Optical Character Recognition (OCR) and implementing an Agentic RAG pipeline that dynamically selects between retrieval and reasoning strategies. Experimental results show that the Agentic RAG achieved the highest accuracy, 89.54%, with openai/gpt-oss-120b, outperforming other configurations and demonstrating superior rationale quality. These findings highlight the potential of RAG-based methods to enhance the reliability and accessibility of Bangla medical QA, establishing a foundation for future research in multilingual medical artificial intelligence.

Title: When retrieval outperforms generation: Dense evidence retrieval for scalable fake news detection

Authors: Alamgir Munir Qazi, John P. McCrae, Jamal Abdul Nasir
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.04643
Pdf URL: https://arxiv.org/pdf/2511.04643
Copy Paste: [[2511.04643]] When retrieval outperforms generation: Dense evidence retrieval for scalable fake news detection(https://arxiv.org/abs/2511.04643)
Keywords: language model, llm, hallucination
Abstract: The proliferation of misinformation necessitates robust yet computationally efficient fact verification systems. While current state-of-the-art approaches leverage Large Language Models (LLMs) for generating explanatory rationales, these methods face significant computational barriers and hallucination risks in real-world deployments. We present DeReC (Dense Retrieval Classification), a lightweight framework that demonstrates how general-purpose text embeddings can effectively replace autoregressive LLM-based approaches in fact verification tasks. By combining dense retrieval with specialized classification, our system achieves better accuracy while being significantly more efficient. DeReC outperforms explanation-generating LLMs in efficiency, reducing runtime by 95% on RAWFC (23 minutes 36 seconds compared to 454 minutes 12 seconds) and by 92% on LIAR-RAW (134 minutes 14 seconds compared to 1692 minutes 23 seconds), showcasing its effectiveness across varying dataset sizes. On the RAWFC dataset, DeReC achieves an F1 score of 65.58%, surpassing the state-of-the-art method L-Defense (61.20%). Our results demonstrate that carefully engineered retrieval-based systems can match or exceed LLM performance in specialized tasks while being significantly more practical for real-world deployment.
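The runtime reductions quoted in the DeReC abstract can be checked directly from the reported wall-clock times. The short calculation below uses only the figures given in the abstract; the helper names `to_seconds` and `runtime_reduction` are introduced here for illustration.

```python
def to_seconds(minutes: int, seconds: int) -> int:
    """Convert a 'M minutes S seconds' duration to seconds."""
    return minutes * 60 + seconds


def runtime_reduction(fast_s: int, slow_s: int) -> float:
    """Percentage by which the faster runtime undercuts the slower one."""
    return 100.0 * (1.0 - fast_s / slow_s)


# Runtimes as reported in the abstract (DeReC vs. the LLM-based baseline).
rawfc = runtime_reduction(to_seconds(23, 36), to_seconds(454, 12))
liar_raw = runtime_reduction(to_seconds(134, 14), to_seconds(1692, 23))

# F1 gap on RAWFC between DeReC (65.58%) and L-Defense (61.20%), in points.
f1_gap = round(65.58 - 61.20, 2)

print(f"RAWFC reduction:    {rawfc:.1f}%")
print(f"LIAR-RAW reduction: {liar_raw:.1f}%")
print(f"RAWFC F1 gap:       {f1_gap} points")
```

Rounded to whole percentages, the two reductions agree with the 95% and 92% stated in the abstract.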

Title: Logit-Entropy Adaptive Stopping Heuristic for Efficient Chain-of-Thought Reasoning

Authors: Mohammad Atif Quamar, Mohammad Areeb
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.04654
Pdf URL: https://arxiv.org/pdf/2511.04654
Copy Paste: [[2511.04654]] Logit-Entropy Adaptive Stopping Heuristic for Efficient Chain-of-Thought Reasoning(https://arxiv.org/abs/2511.04654)
Keywords: language model, prompt, chain-of-thought
Abstract: Chain-of-Thought (CoT) prompting is a key technique for enabling complex reasoning in large language models. However, generating full, fixed-length rationales is computationally wasteful, inflating both token usage and latency. We introduce LEASH: Logit-Entropy Adaptive Stopping Heuristic, a training-free decoding algorithm that adaptively halts rationale generation. LEASH monitors two intrinsic signals: the slope of token-level entropy and the improvement in the top-logit margin. It terminates the generation once both signals plateau, indicating the model has reached a stable reasoning state. Across four instruction-tuned models on the GSM8K and AQuA-RAT benchmarks, LEASH reduces average token generation by 30-35% and latency by 27%, while incurring a 10 percentage point (p.p.) accuracy drop relative to CoT. LEASH is model-agnostic and requires no additional training or supervision, offering a simple and efficient alternative to CoT decoding.
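The stopping rule outlined in the LEASH abstract (halt once both the entropy slope and the top-logit margin improvement plateau) can be sketched as follows. The abstract does not give the exact formulas, so everything concrete here is an assumption made for illustration: the finite-difference slope over a fixed `window`, the thresholds `slope_tol` and `margin_tol`, the function names, and the toy two-token logit stream. It should be read as one plausible reading of the idea, not as the authors' algorithm.

```python
import math
from typing import Iterable, List, Optional, Sequence


def token_entropy(logits: Sequence[float]) -> float:
    """Shannon entropy (in nats) of the softmax distribution over the logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # shift for numerical stability
    total = sum(exps)
    return -sum((e / total) * math.log(e / total) for e in exps if e > 0)


def top_logit_margin(logits: Sequence[float]) -> float:
    """Gap between the largest and second-largest logit."""
    top, second = sorted(logits, reverse=True)[:2]
    return top - second


def should_stop(entropies: List[float], margins: List[float],
                window: int = 3, slope_tol: float = 0.01,
                margin_tol: float = 0.05) -> bool:
    """Assumed plateau test: entropy slope near zero and no margin gain."""
    if len(entropies) <= window:
        return False  # not enough history to estimate a trend
    slope = (entropies[-1] - entropies[-1 - window]) / window
    gain = margins[-1] - margins[-1 - window]
    return abs(slope) < slope_tol and gain < margin_tol


def stopping_step(logit_stream: Iterable[Sequence[float]]) -> Optional[int]:
    """Return the first step at which both signals plateau, or None."""
    entropies: List[float] = []
    margins: List[float] = []
    for step, logits in enumerate(logit_stream):
        entropies.append(token_entropy(logits))
        margins.append(top_logit_margin(logits))
        if should_stop(entropies, margins):
            return step
    return None


# Toy stream: the model sharpens for three steps, then its logits stop changing.
demo_stream = [[float(min(t, 3)), 0.0] for t in range(8)]
stop_at = stopping_step(demo_stream)

# A stream that keeps sharpening never triggers the stop within five steps.
never = stopping_step([[float(t), 0.0] for t in range(5)])
```

In an actual decoder the per-step logits would come from the model's forward pass, and the tolerances would need tuning per model and task; a shrinking margin is treated here as "no improvement", which is a design choice of this sketch.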