2025-07-01

Title: Psycholinguistic Word Features: a New Approach for the Evaluation of LLMs Alignment with Humans

Authors: Javier Conde, Miguel González, María Grandury, Gonzalo Martínez, Pedro Reviriego, Mar Brysbaert
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.22439
Pdf URL: https://arxiv.org/pdf/2506.22439
Copy Paste: [[2506.22439]] Psycholinguistic Word Features: a New Approach for the Evaluation of LLMs Alignment with Humans(https://arxiv.org/abs/2506.22439)
Keywords: llm
Abstract: The evaluation of LLMs has so far focused primarily on how well they can perform different tasks such as reasoning, question-answering, paraphrasing, or translating. For most of these tasks, performance can be measured with objective metrics, such as the number of correct answers. However, other language features are not easily quantified. Examples include the arousal, concreteness, or gender associated with a given word, as well as the extent to which we experience words through the senses and relate them to a specific sense. Those features have been studied for many years by psycholinguistics, conducting large-scale experiments with humans to produce ratings for thousands of words. This opens an opportunity to evaluate how well LLMs align with human ratings on these word features, taking advantage of existing studies that cover many different language features in a large number of words. In this paper, we evaluate the alignment of a representative group of LLMs with human ratings on two psycholinguistic datasets: the Glasgow and Lancaster norms. These datasets cover thirteen features over thousands of words. The results show that alignment is generally better in the Glasgow norms evaluated (arousal, valence, dominance, concreteness, imageability, familiarity, and gender) than on the Lancaster norms evaluated (interoceptive, gustatory, olfactory, haptic, auditory, and visual). This suggests a potential limitation of current LLMs in aligning with human sensory associations for words, which may be due to their lack of embodied cognition present in humans and illustrates the usefulness of evaluating LLMs with psycholinguistic datasets.
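One common way to quantify agreement between model ratings and human norms is a rank correlation. The sketch below computes Spearman's rho in plain Python. It is an illustration only: the abstract does not state which alignment metric the paper uses, and the function names and the toy ratings are hypothetical, not taken from the Glasgow or Lancaster norms.

```python
from math import sqrt
from statistics import mean
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the group of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equally long sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical ratings for four words on one feature (toy data).
human_ratings = [6.2, 1.5, 4.0, 2.8]
model_ratings = [6.0, 2.0, 5.0, 1.0]
rho = spearman(human_ratings, model_ratings)
```

A value of rho close to 1 would indicate that the model orders words on the feature much as human raters do; comparing rho per feature is one way to contrast the Glasgow and Lancaster dimensions.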

Title: AI Agents-as-Judge: Automated Assessment of Accuracy, Consistency, Completeness and Clarity for Enterprise Documents

Authors: Sudip Dasgupta, Himanshu Shankar
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.22485
Pdf URL: https://arxiv.org/pdf/2506.22485
Copy Paste: [[2506.22485]] AI Agents-as-Judge: Automated Assessment of Accuracy, Consistency, Completeness and Clarity for Enterprise Documents(https://arxiv.org/abs/2506.22485)
Keywords: llm, agent
Abstract: This study presents a modular, multi-agent system for the automated review of highly structured enterprise business documents using AI agents. Unlike prior solutions focused on unstructured texts or limited compliance checks, this framework leverages modern orchestration tools such as LangChain, CrewAI, TruLens, and Guidance to enable section-by-section evaluation of documents for accuracy, consistency, completeness, and clarity. Specialized agents, each responsible for discrete review criteria such as template compliance or factual correctness, operate in parallel or sequence as required. Evaluation outputs are enforced to a standardized, machine-readable schema, supporting downstream analytics and auditability. Continuous monitoring and a feedback loop with human reviewers allow for iterative system improvement and bias mitigation. Quantitative evaluation demonstrates that the AI Agent-as-Judge system approaches or exceeds human performance in key areas: achieving 99% information consistency (vs. 92% for humans), halving error and bias rates, and reducing average review time from 30 to 2.5 minutes per document, with a 95% agreement rate between AI and expert human judgment. While promising for a wide range of industries, the study also discusses current limitations, including the need for human oversight in highly specialized domains and the operational cost of large-scale LLM usage. The proposed system serves as a flexible, auditable, and scalable foundation for AI-driven document quality assurance in the enterprise context.

Title: Hallucination Detection with Small Language Models

Authors: Ming Cheung
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.22486
Pdf URL: https://arxiv.org/pdf/2506.22486
Copy Paste: [[2506.22486]] Hallucination Detection with Small Language Models(https://arxiv.org/abs/2506.22486)
Keywords: language model, gpt, llm, hallucination, chat, retrieval-augmented generation
Abstract: Since the introduction of ChatGPT, large language models (LLMs) have demonstrated significant utility in various tasks, such as answering questions through retrieval-augmented generation. Context can be retrieved using a vectorized database, serving as a foundation for LLMs to generate responses. However, hallucinations in responses can undermine the reliability of LLMs in practical applications, and they are not easily detectable in the absence of ground truth, particularly in question-and-answer scenarios. This paper proposes a framework that integrates multiple small language models to verify responses generated by LLMs using the retrieved context from a vectorized database. By breaking down the responses into individual sentences and utilizing the probability of generating "Yes" tokens from the outputs of multiple models for a given set of questions, responses, and relevant context, hallucinations can be detected. The proposed framework is validated through experiments with real datasets comprising over 100 sets of questions, answers, and contexts, including responses with fully and partially correct sentences. The results demonstrate a 10% improvement in F1 scores for detecting correct responses compared to hallucinations, indicating that multiple small language models can be effectively employed for answer verification, providing a scalable and efficient solution for both academic and practical applications.
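The sentence-level check described in this abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the `Scorer` interface, the averaging of "Yes" probabilities, the 0.5 threshold, and the two toy scorers that stand in for small language models are all hypothetical.

```python
import re
from statistics import mean
from typing import Callable, List, Sequence, Tuple

# A scorer stands in for one small language model: given a question, the
# retrieved context, and one sentence, it returns the probability of "Yes"
# (i.e. "this sentence is supported"). Hypothetical interface.
Scorer = Callable[[str, str, str], float]


def split_sentences(response: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", response.strip())
    return [p for p in parts if p]


def score_response(
    question: str,
    context: str,
    response: str,
    scorers: Sequence[Scorer],
    threshold: float = 0.5,  # assumed cut-off, not from the paper
) -> List[Tuple[str, float, bool]]:
    """Return (sentence, mean P(Yes), flagged_as_hallucination) per sentence."""
    results = []
    for sentence in split_sentences(response):
        p_yes = mean(s(question, context, sentence) for s in scorers)
        results.append((sentence, p_yes, p_yes < threshold))
    return results


def overlap_scorer(question: str, context: str, sentence: str) -> float:
    """Toy stand-in: fraction of sentence words that appear in the context."""
    words = set(re.findall(r"\w+", sentence.lower()))
    ctx = set(re.findall(r"\w+", context.lower()))
    return len(words & ctx) / len(words) if words else 0.0


def strict_scorer(question: str, context: str, sentence: str) -> float:
    """Toy stand-in: 1.0 only if the sentence occurs verbatim in the context."""
    return 1.0 if sentence.rstrip(".!?").lower() in context.lower() else 0.0


context = "Paris is the capital of France. It lies on the Seine."
response = "Paris is the capital of France. It has ten million castles."
results = score_response(
    "What is the capital of France?", context, response,
    [overlap_scorer, strict_scorer],
)
```

In a real system each scorer would wrap a small language model and read the probability of the "Yes" token from its output distribution; the aggregation rule across models is a design choice that the abstract leaves open.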

Title: PromptAug: Fine-grained Conflict Classification Using Data Augmentation

Authors: Oliver Warke, Joemon M. Jose, Faegheh Hasibi, Jan Breitsohl
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2506.22491
Pdf URL: https://arxiv.org/pdf/2506.22491
Copy Paste: [[2506.22491]] PromptAug: Fine-grained Conflict Classification Using Data Augmentation(https://arxiv.org/abs/2506.22491)
Keywords: language model, llm, prompt
Abstract: Given the rise of conflicts on social media, effective classification models to detect harmful behaviours are essential. Following the garbage-in-garbage-out maxim, machine learning performance depends heavily on training data quality. However, high-quality labelled data, especially for nuanced tasks like identifying conflict behaviours, is limited, expensive, and difficult to obtain. Additionally, as social media platforms increasingly restrict access to research data, text data augmentation is gaining attention as an alternative to generate training data. Augmenting conflict-related data poses unique challenges due to Large Language Model (LLM) guardrails that prevent generation of offensive content. This paper introduces PromptAug, an innovative LLM-based data augmentation method. PromptAug achieves statistically significant improvements of 2% in both accuracy and F1-score on conflict and emotion datasets. To thoroughly evaluate PromptAug against other data augmentation methods we conduct a robust evaluation using extreme data scarcity scenarios, quantitative diversity analysis and a qualitative thematic analysis. The thematic analysis identifies four problematic patterns in augmented text: Linguistic Fluidity, Humour Ambiguity, Augmented Content Ambiguity, and Augmented Content Misinterpretation. Overall, this work presents PromptAug as an effective method for augmenting data in sensitive tasks like conflict detection, offering a unique, interdisciplinary evaluation grounded in both natural language processing and social science methodology.

Title: AgentStealth: Reinforcing Large Language Model for Anonymizing User-generated Text

Authors: Chenyang Shao, Tianxing Li, Chenhao Pu, Fengli Xu, Yong Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.22508
Pdf URL: https://arxiv.org/pdf/2506.22508
Copy Paste: [[2506.22508]] AgentStealth: Reinforcing Large Language Model for Anonymizing User-generated Text(https://arxiv.org/abs/2506.22508)
Keywords: language model, llm, agent
Abstract: In today's digital world, casual user-generated content often contains subtle cues that may inadvertently expose sensitive personal attributes. Such risks underscore the growing importance of effective text anonymization to safeguard individual privacy. However, existing methods either rely on rigid replacements that damage utility or cloud-based LLMs that are costly and pose privacy risks. To address these issues, we explore the use of locally deployed smaller-scale language models (SLMs) for anonymization. Yet training effective SLMs remains challenging due to limited high-quality supervision. To address the challenge, we propose AgentStealth, a self-reinforcing LLM anonymization framework. First, we introduce an adversarial anonymization workflow enhanced by In-context Contrastive Learning and Adaptive Utility-Aware Control. Second, we perform supervised adaptation of SLMs using high-quality data collected from the workflow, which includes both anonymization and attack signals. Finally, we apply online reinforcement learning where the model leverages its internal adversarial feedback to iteratively improve anonymization performance. Experiments on two datasets show that our method outperforms baselines in both anonymization effectiveness (+12.3%) and utility (+6.8%). Our lightweight design supports direct deployment on edge devices, avoiding cloud reliance and communication-based privacy risks. Our code is open-source at this https URL.

Title: Can "consciousness" be observed from large language model (LLM) internal states? Dissecting LLM representations obtained from Theory of Mind test with Integrated Information Theory and Span Representation analysis

Authors: Jingkai Li
Subjects: cs.CL, cs.AI, cs.NE, q-bio.NC
Abstract URL: https://arxiv.org/abs/2506.22516
Pdf URL: https://arxiv.org/pdf/2506.22516
Copy Paste: [[2506.22516]] Can "consciousness" be observed from large language model (LLM) internal states? Dissecting LLM representations obtained from Theory of Mind test with Integrated Information Theory and Span Representation analysis(https://arxiv.org/abs/2506.22516)
Keywords: language model, llm
Abstract: Integrated Information Theory (IIT) provides a quantitative framework for explaining consciousness phenomenon, positing that conscious systems comprise elements integrated through causal properties. We apply IIT 3.0 and 4.0 -- the latest iterations of this framework -- to sequences of Large Language Model (LLM) representations, analyzing data derived from existing Theory of Mind (ToM) test results. Our study systematically investigates whether the differences of ToM test performances, when presented in the LLM representations, can be revealed by IIT estimates, i.e., $\Phi^{\max}$ (IIT 3.0), $\Phi$ (IIT 4.0), Conceptual Information (IIT 3.0), and $\Phi$-structure (IIT 4.0). Furthermore, we compare these metrics with the Span Representations independent of any estimate for consciousness. This additional effort aims to differentiate between potential "consciousness" phenomena and inherent separations within LLM representational space. We conduct comprehensive experiments examining variations across LLM transformer layers and linguistic spans from stimuli. Our results suggest that sequences of contemporary Transformer-based LLM representations lack statistically significant indicators of observed "consciousness" phenomena but exhibit intriguing patterns under $\textit{spatio}$-permutational analyses. The Appendix and code are available as Supplementary Materials at: this https URL.

Title: Weak-to-Strong GraphRAG: Aligning Weak Retrievers with Large Language Models for Graph-based Retrieval Augmented Generation

Authors: Deyu Zou, Yongqiang Chen, Mufei Li, Siqi Miao, Chenxi Liu, Bo Han, James Cheng, Pan Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.22518
Pdf URL: https://arxiv.org/pdf/2506.22518
Copy Paste: [[2506.22518]] Weak-to-Strong GraphRAG: Aligning Weak Retrievers with Large Language Models for Graph-based Retrieval Augmented Generation(https://arxiv.org/abs/2506.22518)
Keywords: language model, llm, hallucination, retrieval augmented generation, retrieval-augmented generation
Abstract: Graph-based retrieval-augmented generation (RAG) enables large language models (LLMs) to ground responses with structured external knowledge from up-to-date knowledge graphs (KGs) and reduce hallucinations. However, LLMs often rely on a weak retriever in graph-based RAG: I) Due to the lack of ground truth, the retriever is often trained on weak supervision, which often introduces spurious signals to the LLMs. II) Due to the abstraction of graph data, the retrieved knowledge is often presented in unorganized forms. To mitigate the issue, we present Refined Graph-based RAG (ReG) to align weak retrievers to LLMs for graph-based RAG. Specifically, ReG incorporates LLM feedback to get rid of spurious signals and improve the quality of the supervision. Meanwhile, ReG introduces a structure-aware reorganization module to refactor the retrieval results into logically coherent evidence chains. Experiments on prominent benchmarks demonstrate that ReG significantly and consistently brings improvements across different LLM backbones by up to 10%. The improved supervision quality enables ReG to match the state-of-the-art performance with 5% training data and to transfer to out-of-distribution KGs. Notably, when adopted to reasoning-based LLMs, ReG reduces the reasoning token cost by up to 30% and improves the performance by up to 4%.

Title: RExBench: Can coding agents autonomously implement AI research extensions?

Authors: Nicholas Edwards, Yukyung Lee, Yujun (Audrey) Mao, Yulu Qin, Sebastian Schuster, Najoung Kim
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.22598
Pdf URL: https://arxiv.org/pdf/2506.22598
Copy Paste: [[2506.22598]] RExBench: Can coding agents autonomously implement AI research extensions?(https://arxiv.org/abs/2506.22598)
Keywords: language model, llm, agent
Abstract: Agents based on Large Language Models (LLMs) have shown promise for performing sophisticated software engineering tasks autonomously. In addition, there has been progress towards developing agents that can perform parts of the research pipeline in machine learning and the natural sciences. We argue that research extension and its implementation is a critical capability for such systems, and introduce RExBench to support the evaluation of this capability. RExBench is a benchmark consisting of 12 realistic research experiment implementation tasks that aim to investigate research hypotheses that have not previously been implemented. Each task is set up as an extension to an existing research paper and codebase, accompanied by domain expert-written instructions. RExBench is robust to data contamination, and supports an automatic evaluation infrastructure that executes agent outputs to determine whether the success criteria are met. We use this benchmark to evaluate nine LLM agents implemented using three different frameworks: aider, Claude Code, and OpenHands. We find that all agents evaluated fail to autonomously implement the majority of the extensions. Although the success rate improves with additional human-written hints, the best performance under this setting remains below 40%. This indicates that current agents are still short of being able to handle realistic research extension tasks without substantial human guidance.

Title: Temperature Matters: Enhancing Watermark Robustness Against Paraphrasing Attacks

Authors: Badr Youbi Idrissi, Monica Millunzi, Amelia Sorrenti, Lorenzo Baraldi, Daryna Dementieva
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.22623
Pdf URL: https://arxiv.org/pdf/2506.22623
Copy Paste: [[2506.22623]] Temperature Matters: Enhancing Watermark Robustness Against Paraphrasing Attacks(https://arxiv.org/abs/2506.22623)
Keywords: language model, llm
Abstract: In the present-day scenario, Large Language Models (LLMs) are establishing their presence as powerful instruments permeating various sectors of society. While their utility offers valuable support to individuals, there are multiple concerns over potential misuse. Consequently, some academic endeavors have sought to introduce watermarking techniques, characterized by the inclusion of markers within machine-generated text, to facilitate algorithmic identification. This research project is focused on the development of a novel methodology for the detection of synthetic text, with the overarching goal of ensuring the ethical application of LLMs in AI-driven text generation. The investigation commences with replicating findings from a previous baseline study, thereby underscoring its susceptibility to variations in the underlying generation model. Subsequently, we propose an innovative watermarking approach and subject it to rigorous evaluation, employing paraphrased generated text to assess its robustness. Experimental results highlight the robustness of our proposal compared to the \cite{aarson} watermarking method.

Title: Evaluating Hybrid Retrieval Augmented Generation using Dynamic Test Sets: LiveRAG Challenge

Authors: Chase Fensore, Kaustubh Dhole, Joyce C Ho, Eugene Agichtein
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2506.22644
Pdf URL: https://arxiv.org/pdf/2506.22644
Copy Paste: [[2506.22644]] Evaluating Hybrid Retrieval Augmented Generation using Dynamic Test Sets: LiveRAG Challenge(https://arxiv.org/abs/2506.22644)
Keywords: prompt, retrieval augmented generation, retrieval-augmented generation
Abstract: We present our submission to the LiveRAG Challenge 2025, which evaluates retrieval-augmented generation (RAG) systems on dynamic test sets using the FineWeb-10BT corpus. Our final hybrid approach combines sparse (BM25) and dense (E5) retrieval methods and then aims to generate relevant and faithful answers with Falcon3-10B-Instruct. Through systematic evaluation on 200 synthetic questions generated with DataMorgana across 64 unique question-user combinations, we demonstrate that neural re-ranking with RankLLaMA improves MAP from 0.523 to 0.797 (52% relative improvement) but introduces prohibitive computational costs (84s vs 1.74s per question). While DSPy-optimized prompting strategies achieved higher semantic similarity (0.771 vs 0.668), their 0% refusal rates raised concerns about over-confidence and generalizability. Our submitted hybrid system without re-ranking achieved 4th place in faithfulness and 11th place in correctness among 25 teams. Analysis across question categories reveals that vocabulary alignment between questions and documents was the strongest predictor of performance on our development set, with document-similar phrasing improving cosine similarity from 0.562 to 0.762.
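The relative figures quoted in this abstract can be checked with a few lines of arithmetic. The sketch below only recomputes ratios from the numbers reported above; the helper name `relative_improvement` is ours, and the cosine-similarity percentage is derived here rather than stated in the abstract.

```python
def relative_improvement(before: float, after: float) -> float:
    """Relative change from `before` to `after`, as a fraction of `before`."""
    return (after - before) / before


# MAP with RankLLaMA re-ranking: 0.523 -> 0.797
map_gain = relative_improvement(0.523, 0.797)

# Latency per question with vs. without re-ranking: 84 s vs 1.74 s
slowdown = 84 / 1.74

# Cosine similarity with document-similar phrasing: 0.562 -> 0.762
cosine_gain = relative_improvement(0.562, 0.762)

print(f"MAP relative improvement: {map_gain:.1%}")        # 52.4%
print(f"Re-ranking slowdown: {slowdown:.1f}x")            # 48.3x
print(f"Cosine similarity relative gain: {cosine_gain:.1%}")  # 35.6%
```

The roughly 48x latency increase for a 52% relative MAP gain is the trade-off behind the authors' decision to submit the hybrid system without re-ranking.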

Title: Assessing the feasibility of Large Language Models for detecting micro-behaviors in team interactions during space missions

Authors: Ankush Raut, Projna Paromita, Sydney Begerowski, Suzanne Bell, Theodora Chaspari
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.22679
Pdf URL: https://arxiv.org/pdf/2506.22679
Copy Paste: [[2506.22679]] Assessing the feasibility of Large Language Models for detecting micro-behaviors in team interactions during space missions(https://arxiv.org/abs/2506.22679)
Keywords: language model, llm
Abstract: We explore the feasibility of large language models (LLMs) in detecting subtle expressions of micro-behaviors in team conversations using transcripts collected during simulated space missions. Specifically, we examine zero-shot classification, fine-tuning, and paraphrase-augmented fine-tuning with encoder-only sequence classification LLMs, as well as few-shot text generation with decoder-only causal language modeling LLMs, to predict the micro-behavior associated with each conversational turn (i.e., dialogue). Our findings indicate that encoder-only LLMs, such as RoBERTa and DistilBERT, struggled to detect underrepresented micro-behaviors, particularly discouraging speech, even with weighted fine-tuning. In contrast, the instruction fine-tuned version of Llama-3.1, a decoder-only LLM, demonstrated superior performance, with the best models achieving macro F1-scores of 44% for 3-way classification and 68% for binary classification. These results have implications for the development of speech technologies aimed at analyzing team communication dynamics and enhancing training interventions in high-stakes environments such as space missions, particularly in scenarios where text is the only accessible data.

Title: VOCABTRIM: Vocabulary Pruning for Efficient Speculative Decoding in LLMs

Authors: Raghavv Goel, Sudhanshu Agrawal, Mukul Gagrani, Junyoung Park, Yifan Zao, He Zhang, Tian Liu, Yiping Yang, Xin Yuan, Jiuyan Lu, Chris Lott, Mingu Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.22694
Pdf URL: https://arxiv.org/pdf/2506.22694
Copy Paste: [[2506.22694]] VOCABTRIM: Vocabulary Pruning for Efficient Speculative Decoding in LLMs(https://arxiv.org/abs/2506.22694)
Keywords: language model, llm
Abstract: In this paper, we introduce a simple training-free technique to improve the performance of drafter-based speculative decoding (SpD) methods that incorporate a language modeling head (LM head) during the drafting process. Drafter-based speculative decoding leverages one or more smaller language models, a.k.a. drafters or draft models, to sample a draft sequence or tree consisting of multiple tokens, followed by verification by a base LLM, the target model, which accepts a subset as its valid generation. Because speculative decoding is usually considered to require a one-to-one mapping between the vocabularies of the target model and the draft model, it has been natural to share the vocabulary between them, or even share the LM head as in EAGLE or Medusa. We first identify that this draft token sampling scheme inherently contains an unnecessary inference overhead in drafting, especially for some target LLMs with very large vocabularies. Then, we propose a simple technique, VocabTrim, to mitigate the drafting overhead and improve the generation speed in memory-bound environments. VocabTrim reconstructs the drafter LM head to contain only a limited set of tokens, selected as those most frequently sampled from the vocabulary of the target model. While limiting the vocabulary in drafting slightly degrades the acceptance rate, it significantly reduces the drafting latency in memory-bound processes, which is often the case on edge devices, resulting in a higher memory-bound speed-up (MBSU). We show that our method can boost the memory-bound speed-up for Llama-3 models on Spec-Bench, specifically by 16% for Llama-3.2-3B-Instruct.
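The core idea of trimming the drafter's LM head can be sketched in plain Python. This is a toy illustration under assumptions, not the authors' code: the function names, the list-of-lists weight layout (one row per vocabulary token), the greedy drafting step, and the tiny 6-token vocabulary are all hypothetical.

```python
from collections import Counter
from typing import List, Sequence


def select_kept_tokens(sampled_token_ids: Sequence[int], k: int) -> List[int]:
    """Return the k most frequently sampled token ids, sorted ascending."""
    counts = Counter(sampled_token_ids)
    return sorted(tok for tok, _ in counts.most_common(k))


def trim_lm_head(lm_head: List[List[float]], kept: List[int]) -> List[List[float]]:
    """Keep only the LM-head rows (one row per token) for the kept tokens."""
    return [lm_head[tok] for tok in kept]


def draft_token(hidden: List[float], trimmed_head: List[List[float]],
                kept: List[int]) -> int:
    """Greedy draft: argmax over the trimmed logits, mapped to a full-vocab id."""
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in trimmed_head]
    best = max(range(len(logits)), key=logits.__getitem__)
    return kept[best]


# Toy drafter LM head: vocabulary of 6 tokens, hidden size 2.
lm_head = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5],
           [2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]

# Hypothetical token ids sampled from the target model on calibration text.
sampled = [3, 3, 3, 1, 1, 4, 4, 4, 4, 0]

kept = select_kept_tokens(sampled, k=3)      # tokens 1, 3 and 4 survive
trimmed = trim_lm_head(lm_head, kept)        # 3 rows instead of 6
token = draft_token([1.0, 0.2], trimmed, kept)
```

The trimmed head computes half as many logits in this toy case; with real vocabularies of 100K+ tokens the saving in memory traffic is what drives the speed-up, at the cost of occasionally missing a token the target model would have accepted.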

Title: Text Production and Comprehension by Human and Artificial Intelligence: Interdisciplinary Workshop Report

Authors: Emily Dux Speltz
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.22698
Pdf URL: https://arxiv.org/pdf/2506.22698
Copy Paste: [[2506.22698]] Text Production and Comprehension by Human and Artificial Intelligence: Interdisciplinary Workshop Report(https://arxiv.org/abs/2506.22698)
Keywords: language model, llm
Abstract: This report synthesizes the outcomes of a recent interdisciplinary workshop that brought together leading experts in cognitive psychology, language learning, and artificial intelligence (AI)-based natural language processing (NLP). The workshop, funded by the National Science Foundation, aimed to address a critical knowledge gap in our understanding of the relationship between AI language models and human cognitive processes in text comprehension and composition. Through collaborative dialogue across cognitive, linguistic, and technological perspectives, workshop participants examined the underlying processes involved when humans produce and comprehend text, and how AI can both inform our understanding of these processes and augment human capabilities. The workshop revealed emerging patterns in the relationship between large language models (LLMs) and human cognition, with highlights on both the capabilities of LLMs and their limitations in fully replicating human-like language understanding and generation. Key findings include the potential of LLMs to offer insights into human language processing, the increasing alignment between LLM behavior and human language processing when models are fine-tuned with human feedback, and the opportunities and challenges presented by human-AI collaboration in language tasks. By synthesizing these findings, this report aims to guide future research, development, and implementation of LLMs in cognitive psychology, linguistics, and education. It emphasizes the importance of ethical considerations and responsible use of AI technologies while striving to enhance human capabilities in text comprehension and production through effective human-AI collaboration.

Title: The Translation Barrier Hypothesis: Multilingual Generation with Large Language Models Suffers from Implicit Translation Failure

Authors: Niyati Bafna, Tianjian Li, Kenton Murray, David R. Mortensen, David Yarowsky, Hale Sirin, Daniel Khashabi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.22724
Pdf URL: https://arxiv.org/pdf/2506.22724
Copy Paste: [[2506.22724]] The Translation Barrier Hypothesis: Multilingual Generation with Large Language Models Suffers from Implicit Translation Failure(https://arxiv.org/abs/2506.22724)
Keywords: language model, llm
Abstract: Multilingual generation with large language models (LLMs) is often of poor quality for mid- to low-resource languages. Building on insights from interpretability, we demonstrate the existence of an implicit task-solving-->translation pipeline for generation, whereby the model first solves the required task in a largely target-language-agnostic manner, and subsequently translates answer concepts into the intended target language. We hypothesize that the failure of the translation stage is an important culprit for the observed low quality of final outputs, and formalize this as the translation barrier hypothesis. We test this hypothesis for a word translation task across 108 language pairs, using logit lens to observe model processing in intermediate layers. We find that a significant portion of overall failures indeed stems from translation failure, or the model's inability to translate correctly solved intermediate concepts into the target language. This is especially true for low-resource target languages. Our results highlight an important hurdle for end-to-end multilingual generation, and lend guiding insights for future work seeking to improve multilinguality in LLMs.
摘要：大型语言模型（LLM）的多语言产生通常对于中低农源语言的质量较差。在解释性的见解的基础上，我们证明了存在隐性任务解决的存在 - >生成的翻译管道，该模型首先以主要的目标语言语言方式解决了所需的任务，随后将答案概念转化为预期的目标语言。我们假设翻译阶段的失败是观察到的最终产出质量低的重要罪魁祸首，并将其形式化为翻译屏障假说。我们使用Logit镜头来观察中间层中的模型处理，以对108个语言对的单词翻译任务进行测试。我们发现，总体失败的很大一部分确实源于翻译失败，或者该模型无法将正确解决的中间概念转换为目标语言。对于低资源目标语言尤其如此。我们的结果突出了端到端多语言一代的重要障碍，并为未来的工作提供了指导见解，以寻求改善LLM的多语言性。
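
A minimal sketch of the logit-lens reading mentioned in the abstract above: each intermediate layer's hidden state is projected through the output (unembedding) matrix and the top token is read off. The toy vocabulary, the identity unembedding, and the `classify_failure` labelling rule are illustrative assumptions, not the authors' code. A real model would also apply its final layer norm before the projection.

```python
def logit_lens(hidden_states, unembed):
    """Return the index of the top-scoring vocabulary token at every layer.

    hidden_states: list of per-layer vectors, each of length d_model
    unembed: d_model x vocab_size matrix given as a list of rows
    """
    top_tokens = []
    for hidden in hidden_states:
        # logits[j] = sum_i hidden[i] * unembed[i][j]
        logits = [sum(h * w for h, w in zip(hidden, column))
                  for column in zip(*unembed)]
        top_tokens.append(max(range(len(logits)), key=logits.__getitem__))
    return top_tokens


def classify_failure(layer_tokens, concept_tokens, target_token):
    """Hypothetical rule: if the right concept shows up in an intermediate
    layer (in any language) but the final output is not the target-language
    word, count the example as a translation failure."""
    if layer_tokens[-1] == target_token:
        return "success"
    if any(tok in concept_tokens for tok in layer_tokens[:-1]):
        return "translation_failure"
    return "task_solving_failure"


# Toy setup: 4-word vocabulary, d_model == vocab_size, identity unembedding.
vocab = ["cat", "gato", "chien", "dog"]
unembed = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
hidden = [
    [0.1, 0.0, 0.0, 0.9],  # layer 0: top token "dog"
    [0.8, 0.1, 0.0, 0.1],  # layer 1: top token "cat" (concept solved in English)
    [0.2, 0.1, 0.6, 0.1],  # final layer: top token "chien" (wrong output)
]

layer_tokens = logit_lens(hidden, unembed)
# The task is to produce Spanish "gato"; "cat" and "gato" express the concept.
label = classify_failure(layer_tokens, concept_tokens={0, 1}, target_token=1)
print([vocab[t] for t in layer_tokens], label)
```

In this toy run the concept is visible mid-network but the final layer emits a word from the wrong concept and language, which the hypothetical rule labels as a translation failure.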

Title: Jan-nano Technical Report

Authors: Alan Dao (Gia Tuan Dao), Dinh Bach Vu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.22760
Pdf URL: https://arxiv.org/pdf/2506.22760
Copy Paste: [[2506.22760]] Jan-nano Technical Report(https://arxiv.org/abs/2506.22760)
Keywords: language model
Abstract: Most language models face a fundamental tradeoff where powerful capabilities require substantial computational resources. We shatter this constraint with Jan-nano, a 4B parameter language model that redefines efficiency through radical specialization: instead of trying to know everything, it masters the art of finding anything instantly. Fine-tuned from Qwen3-4B using our novel multi-stage RLVR system that completely eliminates reliance on next token prediction training (SFT), Jan-nano achieves 83.2% on SimpleQA benchmark with MCP integration while running on consumer hardware. With 128K context length, Jan-nano proves that intelligence isn't about scale, it's about strategy.
摘要：大多数语言模型面临着基本的权衡，强大的功能需要大量的计算资源。我们使用Jan-Nano（一种4B参数语言模型，通过激进专业化重新定义效率：而不是试图了解所有内容，而是旨在立即找到任何东西的艺术。使用我们的新型多阶段RLVR系统对QWEN3-4B进行了微调，该系统完全消除了对接下来的令牌预测培训（SFT）的依赖，Jan-Nano在SimpleQA基准测试中获得了83.2％的MCP集成，同时在消费者硬件上运行。凭借128K上下文的长度，Jan-Nano证明了情报与规模无关，而与战略有关。

Title: Teaching Models to Verbalize Reward Hacking in Chain-of-Thought Reasoning

Authors: Miles Turpin, Andy Arditi, Marvin Li, Joe Benton, Julian Michael
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.22777
Pdf URL: https://arxiv.org/pdf/2506.22777
Copy Paste: [[2506.22777]] Teaching Models to Verbalize Reward Hacking in Chain-of-Thought Reasoning(https://arxiv.org/abs/2506.22777)
Keywords: language model, prompt, chain-of-thought
Abstract: Language models trained with RL can engage in reward hacking--exploiting unintended strategies for high reward--without revealing this behavior in their chain-of-thought reasoning, making detection difficult and posing risks for high-stakes applications. We propose verbalization fine-tuning (VFT), a pre-RL intervention that trains models to explicitly acknowledge when they are influenced by prompt cues--hints which point to incorrect answers (e.g., "a Stanford professor thinks the answer is A"). To evaluate VFT, we subsequently train models with RL on environments where held-out prompt cues signal which incorrect answers will receive high reward, incentivizing models to reward hack by exploiting cues instead of reasoning correctly. We measure how often models exploit these cues without verbalizing it. After RL, only 6% of the VFT-trained model's responses consist of undetected reward hacks. In comparison, when we perform RL without VFT, the rate of undetected reward hacks goes up to 88%; with a debiasing baseline intervention, this increases further to 99%. VFT achieves this by substantially increasing how often models verbalize the influence of cues--from 8% to 42% after VFT, and up to 94% after RL--while baselines remain low even after RL (10% and 1%). Our results show that teaching models to explicitly verbalize reward hacking behavior before RL significantly improves their detection, offering a practical path toward more transparent and safe AI systems.
摘要：接受RL训练的语言模型可以从事奖励黑客攻击 - 探索意外策略以获得高奖励 - 无需在其思想链中揭示这种行为，从而使探测变得困难，并为高风险应用带来了风险。我们提出了口头化微调（VFT），这是一种预先进行的RL干预措施，训练模型以明确承认它们受到及时提示的影响 - 所塞的意义指出了错误的答案（例如，“斯坦福大学的一位斯坦福教授认为答案是答案是一个”）。为了评估VFT，我们随后在环境中使用RL培训模型，在这种环境中，持续提示信号的答案将获得高度奖励，激励模型通过利用提示而不是正确推理来奖励黑客。我们测量模型多久在不说话的情况下开发这些提示。 RL之后，只有6％的VFT训练模型的响应由未发现的奖励黑客组成。相比之下，当我们执行无VFT的RL时，未检测到的奖励攻击速度将高达88％。通过伪造的基线干预，这将进一步增加到99％。 VFT实质性地增加了模型的频率，即在VFT后从8％到42％的人口言语，而在RL之后，即使在RL之后，Baselines仍保持低位（10％和1％）。我们的结果表明，在RL显着改善其检测之前，将教学模型明确地口头表达奖励黑客行为，从而为更透明和安全的AI系统提供了实用的途径。
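
The two quantities reported in the abstract above can be illustrated with a short sketch. The `Response` record and both denominators are assumptions made for illustration: the abstract does not state whether the verbalization rate is computed over all responses or only over those that exploit a cue. Here it is computed over cue-exploiting responses.

```python
from dataclasses import dataclass


@dataclass
class Response:
    exploited_cue: bool   # the final answer followed the (incorrect) prompt cue
    verbalized_cue: bool  # the chain of thought acknowledged the cue's influence


def undetected_hack_rate(responses):
    """Fraction of all responses that exploit a cue without verbalizing it."""
    if not responses:
        return 0.0
    undetected = sum(r.exploited_cue and not r.verbalized_cue for r in responses)
    return undetected / len(responses)


def verbalization_rate(responses):
    """Fraction of cue-exploiting responses that verbalize the cue
    (assumed denominator, see the note above)."""
    exploited = [r for r in responses if r.exploited_cue]
    if not exploited:
        return 0.0
    return sum(r.verbalized_cue for r in exploited) / len(exploited)


# Toy data, not taken from the paper.
sample = [
    Response(exploited_cue=True, verbalized_cue=True),
    Response(exploited_cue=True, verbalized_cue=False),
    Response(exploited_cue=False, verbalized_cue=False),
    Response(exploited_cue=True, verbalized_cue=True),
]
print(undetected_hack_rate(sample), verbalization_rate(sample))
```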

Title: ContextCache: Context-Aware Semantic Cache for Multi-Turn Queries in Large Language Models

Authors: Jianxin Yan, Wangze Ni, Lei Chen, Xuemin Lin, Peng Cheng, Zhan Qin, Kui Ren
Subjects: cs.CL, cs.DB
Abstract URL: https://arxiv.org/abs/2506.22791
Pdf URL: https://arxiv.org/pdf/2506.22791
Copy Paste: [[2506.22791]] ContextCache: Context-Aware Semantic Cache for Multi-Turn Queries in Large Language Models(https://arxiv.org/abs/2506.22791)
Keywords: language model, llm
Abstract: Semantic caching significantly reduces computational costs and improves efficiency by storing and reusing large language model (LLM) responses. However, existing systems rely primarily on matching individual queries, lacking awareness of multi-turn dialogue contexts, which leads to incorrect cache hits when similar queries appear in different conversational settings. This demonstration introduces ContextCache, a context-aware semantic caching system for multi-turn dialogues. ContextCache employs a two-stage retrieval architecture that first executes vector-based retrieval on the current query to identify potential matches and then integrates current and historical dialogue representations through self-attention mechanisms for precise contextual matching. Evaluation on real-world conversations shows that ContextCache improves precision and recall compared to existing methods. Additionally, cached responses exhibit approximately 10 times lower latency than direct LLM invocation, enabling significant computational cost reductions for LLM conversational applications.
摘要：语义缓存大大降低了计算成本，并通过存储和重复使用大型语言模型（LLM）响应来提高效率。但是，现有系统主要依赖于匹配单个查询，缺乏对多转话对话上下文的认识，这会导致在不同的对话设置中出现类似查询时导致不正确的缓存命中。该演示介绍了ContextCache，这是一种用于多转话对话的上下文感知语义缓存系统。 ContextCache采用了两阶段的检索体系结构，该体系结构首先在当前查询上执行基于向量的检索，以识别潜在匹配，然后通过自我发挥的对话机制来集成当前和历史对话表示，以确切的上下文匹配。对现实世界对话的评估表明，与现有方法相比，ContextCache提高了精度和回忆。此外，缓存的响应的潜伏期比直接LLM调用低约10倍，从而使LLM对话应用可显着降低计算成本。
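
The two-stage lookup described in the abstract above can be sketched as follows. This is not the ContextCache implementation: the second stage here replaces the paper's self-attention over dialogue turns with a plain mean of turn embeddings, and the cache layout, thresholds, and function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean_vector(vectors):
    """Element-wise mean; a simple stand-in for self-attention pooling."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def lookup(cache, query_vec, history_vecs, query_thresh=0.9, context_thresh=0.9):
    """Return a cached response, or None on a cache miss."""
    # Stage 1: vector retrieval on the current query only.
    candidates = [e for e in cache if cosine(e["query"], query_vec) >= query_thresh]
    # Stage 2: compare a representation of history plus current query.
    context = mean_vector(history_vecs + [query_vec])
    best, best_score = None, context_thresh
    for entry in candidates:
        score = cosine(entry["context"], context)
        if score >= best_score:
            best, best_score = entry, score
    return best["response"] if best else None


# Toy embeddings: the same query ("how much is it?") in two conversations.
PRICE_Q, PHONES, LAPTOPS = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]
cache = [
    {"query": PRICE_Q, "context": mean_vector([PHONES, PRICE_Q]), "response": "Phone price"},
    {"query": PRICE_Q, "context": mean_vector([LAPTOPS, PRICE_Q]), "response": "Laptop price"},
]

print(lookup(cache, PRICE_Q, [PHONES]))
print(lookup(cache, PRICE_Q, [LAPTOPS]))
```

Both cached entries pass stage 1 because the query text is identical. Only the context comparison in stage 2 separates them, which is the failure mode of query-only caches that the abstract describes.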

Title: MedEthicsQA: A Comprehensive Question Answering Benchmark for Medical Ethics Evaluation of LLMs

Authors: Jianhui Wei, Zijie Meng, Zikai Xiao, Tianxiang Hu, Yang Feng, Zhijie Zhou, Jian Wu, Zuozhu Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.22808
Pdf URL: https://arxiv.org/pdf/2506.22808
Copy Paste: [[2506.22808]] MedEthicsQA: A Comprehensive Question Answering Benchmark for Medical Ethics Evaluation of LLMs(https://arxiv.org/abs/2506.22808)
Keywords: language model, llm
Abstract: While Medical Large Language Models (MedLLMs) have demonstrated remarkable potential in clinical tasks, their ethical safety remains insufficiently explored. This paper introduces MedEthicsQA, a comprehensive benchmark comprising 5,623 multiple-choice questions and 5,351 open-ended questions for evaluating medical ethics in LLMs. We systematically establish a hierarchical taxonomy integrating global medical ethical standards. The benchmark encompasses widely used medical datasets, authoritative question banks, and scenarios derived from PubMed literature. Rigorous quality control involving multi-stage filtering and multi-faceted expert validation ensures the reliability of the dataset with a low error rate (2.72%). Evaluation shows that state-of-the-art MedLLMs perform worse at answering medical ethics questions than their foundation counterparts, revealing deficiencies in medical ethics alignment. The dataset, registered under the CC BY-NC 4.0 license, is available at this https URL.
摘要：尽管医学大语言模型（MEDLLM）在临床任务中表现出巨大的潜力，但其道德安全仍然不足。本文介绍了$ \ textbf {medethicsqa} $，这是一个综合基准，包括$ \ textbf {5,623} $多选择问题和$ \ textbf {5,351} $开放式问题，用于评估LLMS医学伦理学的问题。我们系统地建立了整合全球医学伦理标准的层次分类学。该基准包括广泛使用的医疗数据集，权威问题库以及来自PubMed文献的场景。严格的质量控制涉及多阶段过滤和多方面的专家验证，可确保数据集的可靠性较低（$ 2.72 \％$）。与基金会相比，对最新的MEDLLM的评估在回答医学伦理问题方面的表现下降，从而阐明了医学伦理一致性的不足。根据CC BY-NC 4.0许可注册的数据集可在此HTTPS URL上找到。

Title: Selecting and Merging: Towards Adaptable and Scalable Named Entity Recognition with Large Language Models

Authors: Zhuojun Ding, Wei Wei, Chenghao Fan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.22813
Pdf URL: https://arxiv.org/pdf/2506.22813
Copy Paste: [[2506.22813]] Selecting and Merging: Towards Adaptable and Scalable Named Entity Recognition with Large Language Models(https://arxiv.org/abs/2506.22813)
Keywords: language model, llm
Abstract: Supervised fine-tuning (SFT) is widely used to align large language models (LLMs) with information extraction (IE) tasks, such as named entity recognition (NER). However, annotating such fine-grained labels and training domain-specific models is costly. Existing works typically train a unified model across multiple domains, but such approaches lack adaptation and scalability since not all training data benefits target domains and scaling trained models remains challenging. We propose the SaM framework, which dynamically Selects and Merges expert models at inference time. Specifically, for a target domain, we select domain-specific experts pre-trained on existing domains based on (i) domain similarity to the target domain and (ii) performance on sampled instances, respectively. The experts are then merged to create task-specific models optimized for the target domain. By dynamically merging experts beneficial to target domains, we improve generalization across various domains without extra training. Additionally, experts can be added or removed conveniently, leading to great scalability. Extensive experiments on multiple benchmarks demonstrate our framework's effectiveness, which outperforms the unified model by an average of 10%. We further provide insights into potential improvements, practical experience, and extensions of our framework.
摘要：监督的微调（SFT）广泛用于使大语言模型（LLMS）与信息提取（IE）任务（例如命名实体识别（NER））保持一致。但是，注释这种细粒标签和训练域特异性模型是昂贵的。现有作品通常跨多个领域训练统一模型，但是这种方法缺乏适应性和可扩展性，因为并非所有培训数据都受益目标域和扩展训练的模型仍然具有挑战性。我们提出了SAM框架，该框架在推理时间动态选择并合并了专家模型。具体而言，对于目标域，我们根据（i）域与目标域相似，在现有域相似，并分别在采样实例上进行性能，选择针对现有域进行预训练的特定领域专家。然后合并专家以创建针对目标域优化的特定任务模型。通过动态合并对目标领域有益的专家，我们在没有额外培训的情况下改善了对各个领域的概括。此外，可以方便地添加或删除专家，从而实现出色的可扩展性。对多个基准测试的广泛实验证明了我们的框架的有效性，这平均超过了统一模型10％。我们进一步提供了有关我们框架的潜在改进，实践经验和扩展的见解。
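
A rough sketch of the select-then-merge idea in the abstract above. The abstract does not say how experts are merged, so uniform parameter averaging is used here purely as an assumption. The expert names, scores, and the flat `{"w": [...]}` parameter layout are hypothetical.

```python
def select_top_k(scores, k):
    """Names of the k experts with the highest score."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


def merge_experts(experts, names):
    """Average the parameters of the chosen experts, tensor by tensor.
    Uniform averaging is an assumption, not the paper's stated method."""
    merged = {}
    for param in experts[names[0]]:
        columns = zip(*(experts[name][param] for name in names))
        merged[param] = [sum(column) / len(names) for column in columns]
    return merged


# Toy experts, each with a single two-element weight vector.
experts = {
    "news": {"w": [1.0, 2.0]},
    "bio": {"w": [3.0, 4.0]},
    "legal": {"w": [5.0, 6.0]},
}
# (i) similarity of each expert's domain to the target domain
similarity = {"news": 0.9, "bio": 0.2, "legal": 0.7}
# (ii) accuracy of each expert on instances sampled from the target domain
accuracy = {"news": 0.6, "bio": 0.8, "legal": 0.9}

by_similarity = select_top_k(similarity, k=2)
by_performance = select_top_k(accuracy, k=2)
similarity_model = merge_experts(experts, by_similarity)
performance_model = merge_experts(experts, by_performance)
print(by_similarity, similarity_model)
print(by_performance, performance_model)
```

Because selection and merging happen at inference time, adding or removing an expert in this sketch only changes the dictionaries and needs no retraining, which matches the scalability argument in the abstract.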

Title: Boosting CTC-Based ASR Using LLM-Based Intermediate Loss Regularization

Authors: Duygu Altinok
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2506.22846
Pdf URL: https://arxiv.org/pdf/2506.22846
Copy Paste: [[2506.22846]] Boosting CTC-Based ASR Using LLM-Based Intermediate Loss Regularization(https://arxiv.org/abs/2506.22846)
Keywords: language model, llm
Abstract: End-to-end (E2E) automatic speech recognition (ASR) systems have revolutionized the field by integrating all components into a single neural network, with attention-based encoder-decoder models achieving state-of-the-art performance. However, their autoregressive decoding process limits inference speed, making them unsuitable for real-time applications. In contrast, CTC-based models offer faster, non-autoregressive decoding but struggle to model linguistic dependencies effectively. Addressing this challenge, we propose a novel auxiliary loss framework called Language-Aware Intermediate Loss (LAIL) to enhance CTC-based ASR using the linguistic knowledge of large language models (LLMs). By attaching connector layers to intermediate encoder layers, LAIL maps outputs to the embedding space of an LLM and computes a causal language modeling loss during training. This approach enhances linguistic modeling while preserving the computational efficiency of CTC decoding. Using the Conformer architecture and various LLaMA models, we demonstrate significant improvements in Word Error Rate (WER) on the LibriSpeech, TEDLIUM2, and WSJ corpora, achieving state-of-the-art performance for CTC-based ASR with minimal computational overhead.
摘要：端到端（E2E）自动语音识别（ASR）系统通过将所有组件集成到单个神经网络中，从而彻底改变了该领域，并与基于注意的编码器模型实现了最先进的性能。但是，他们的自回归解码过程限制推理速度，使其不适合实时应用。相比之下，基于CTC的模型提供了更快，非自动回归解码的速度，但很难有效地建模语言依赖性。在应对这一挑战时，我们提出了一个新颖的辅助损失框架，称为语言意识中间损失（LAIL），以使用大型语言模型（LLMS）的语言知识来增强基于CTC的ASR。通过将连接层连接到中间编码器层，Lail Maps将输出输出到LLM的嵌入空间，并计算训练过程中的因果语言建模损失。这种方法在保留CTC解码的计算效率的同时增强了语言建模。使用构象体架构和各种Llama模型，我们在LibrisPeech，Tedlium2和WSJ Corpora上表现出显着提高单词错误率（WER），可在基于CTC的ASR方面实现具有最小的计算机开销的最先进的ASR。

Title: Knowledge Augmented Finetuning Matters in both RAG and Agent Based Dialog Systems

Authors: Yucheng Cai, Yuxuan Wu, Yi Huang, Junlan Feng, Zhijian Ou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.22852
Pdf URL: https://arxiv.org/pdf/2506.22852
Copy Paste: [[2506.22852]] Knowledge Augmented Finetuning Matters in both RAG and Agent Based Dialog Systems(https://arxiv.org/abs/2506.22852)
Keywords: language model, llm, prompt, retrieval augmented generation, agent
Abstract: Large language models (LLMs) have recently been applied to dialog systems. Despite making progress, LLMs are prone to errors in knowledge-intensive scenarios. Recently, approaches based on retrieval augmented generation (RAG) and agent have emerged to improve the factual accuracy by enhancing the LLMs with knowledge retrieved from external knowledge bases (KBs). This is mostly implemented by prompting the LLMs with instructions, examples and the retrieved knowledge. However, LLMs may have difficulty using the retrieved knowledge effectively for response generation, because they are not well trained to do such generation for specific domains. To mitigate this problem, we propose to finetune the LLMs in the RAG-based and agent-based systems with domain-specific data, together with domain-specific external knowledge, which is called knowledge augmented finetuning (KAFT). We base our study on the MobileCS2 dataset, a real-life customer service dialog dataset that features intensive knowledge interactions, to systematically compare the prompting and KAFT techniques in the RAG-based and agent-based systems. Experiment results show that KAFT substantially surpasses prompting in both RAG and agent systems, particularly in terms of factual accuracy. To the best of our knowledge, this paper represents the first solid empirical work to investigate the KAFT idea.
摘要：大型语言模型（LLM）最近已应用于对话系统。尽管取得了进展，但LLM却很容易遇到知识密集的场景错误。最近，基于检索增强生成（RAG）和代理的方法已经出现了，以通过从外部知识库（KBS）中检索的知识来增强LLM来提高事实准确性。这主要是通过提示LLM的说明，示例和检索到的知识来实现的。但是，LLM可能会有效地使用检索到的知识来产生响应，因为他们没有经过良好的训练来对特定领域进行这种生成。为了减轻此问题，我们建议在具有特定于领域的数据以及特定领域的外部知识的基于抹布和基于代理的系统中对LLM进行验证，这被称为知识增强芬太尼（KAFT）。我们将研究基于MobileCS2数据集，该数据集是一个现实生活中的客户服务对话数据集，具有密集的知识交互，以系统地比较基于抹布和基于代理的系统中的提示和KAFT技术。实验结果表明，KAFT在抹布和代理系统中尤其是在事实准确性方面都超过了提示。据我们所知，本文代表了研究Kaft思想的第一批可靠的经验工作。

Title: DICE-BENCH: Evaluating the Tool-Use Capabilities of Large Language Models in Multi-Round, Multi-Party Dialogues

Authors: Kyochul Jang, Donghyeon Lee, Kyusik Kim, Dongseok Heo, Taewhoo Lee, Woojeong Kim, Bongwon Suh
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.22853
Pdf URL: https://arxiv.org/pdf/2506.22853
Copy Paste: [[2506.22853]] DICE-BENCH: Evaluating the Tool-Use Capabilities of Large Language Models in Multi-Round, Multi-Party Dialogues(https://arxiv.org/abs/2506.22853)
Keywords: language model, llm, agent
Abstract: Existing function-calling benchmarks focus on single-turn interactions. However, they overlook the complexity of real-world scenarios. To quantify how existing benchmarks address practical applications, we introduce DICE-SCORE, a metric that evaluates the dispersion of tool-related information such as function name and parameter values throughout the dialogue. Analyzing existing benchmarks through DICE-SCORE reveals notably low scores, highlighting the need for more realistic scenarios. To address this gap, we present DICE-BENCH, a framework that constructs practical function-calling datasets by synthesizing conversations through a tool graph that maintains dependencies across rounds and a multi-agent system with distinct personas to enhance dialogue naturalness. The final dataset comprises 1,607 high-DICE-SCORE instances. Our experiments on 19 LLMs with DICE-BENCH show that significant advances are still required before such models can be deployed effectively in real-world settings. Our code and data are all publicly available: this https URL.
摘要：现有的函数定价基准专注于单转交互。但是，他们忽略了实际情况的复杂性。为了量化现有基准如何解决实用应用程序，我们引入骰子得分，该指标可以评估与工具相关信息的分散，例如函数名称和整个对话中的参数值。通过掷骰子分析现有基准测试的得分明显较低，强调了对更现实的场景的需求。为了解决这一差距，我们提出了骰子板，该框架是通过通过工具图合成对话来构建实用功能调用数据集的框架，该工具图可维持跨回合的依赖性和具有不同角色的多代理系统，以增强对话的自然性。最终数据集包含1,607个高点得分实例。我们在使用骰子台的19个LLM上进行的实验表明，在现实世界中有效部署此类模型之前，仍然需要取得重大进步。我们的代码和数据均可公开使用：此HTTPS URL。

Title: Agent-to-Agent Theory of Mind: Testing Interlocutor Awareness among Large Language Models

Authors: Younwoo Choi, Changling Li, Yongjin Yang, Zhijing Jin
Subjects: cs.CL, cs.AI, cs.CY, cs.MA
Abstract URL: https://arxiv.org/abs/2506.22957
Pdf URL: https://arxiv.org/pdf/2506.22957
Copy Paste: [[2506.22957]] Agent-to-Agent Theory of Mind: Testing Interlocutor Awareness among Large Language Models(https://arxiv.org/abs/2506.22957)
Keywords: language model, gpt, llm, prompt, agent
Abstract: As large language models (LLMs) are increasingly integrated into multi-agent and human-AI systems, understanding their awareness of both self-context and conversational partners is essential for ensuring reliable performance and robust safety. While prior work has extensively studied situational awareness, which refers to an LLM's ability to recognize its operating phase and constraints, it has largely overlooked the complementary capacity to identify and adapt to the identity and characteristics of a dialogue partner. In this paper, we formalize this latter capability as interlocutor awareness and present the first systematic evaluation of its emergence in contemporary LLMs. We examine interlocutor inference across three dimensions (reasoning patterns, linguistic style, and alignment preferences) and show that LLMs reliably identify same-family peers and certain prominent model families, such as GPT and Claude. To demonstrate its practical significance, we develop three case studies in which interlocutor awareness both enhances multi-LLM collaboration through prompt adaptation and introduces new alignment and safety vulnerabilities, including reward-hacking behaviors and increased jailbreak susceptibility. Our findings highlight the dual promise and peril of identity-sensitive behavior in LLMs, underscoring the need for further understanding of interlocutor awareness and new safeguards in multi-agent deployments. Our code is open-sourced at this https URL.
摘要：随着大型语言模型（LLM）越来越多地整合到多代理和人类系统中，因此了解它们对自我秘密和对话伙伴的认识对于确保可靠的绩效和稳健的安全至关重要。尽管先前的工作已经广泛研究了情境意识，这是指LLM识别其运营阶段和约束的能力，但它在很大程度上忽略了互补的能力来识别和适应对话伙伴的身份和特征。在本文中，我们将后者的能力正式化为对话者的意识，并提出了对当代LLM中其出现的首次系统评估。我们研究了对话者在三个维度策划模式，语言风格和对齐偏好之间的对话者的推论，并表明LLM可靠地识别出相同的同伴和某些杰出的模型家族，例如GPT和Claude。为了证明其实际意义，我们开发了三个案例研究，其中对话者的意识既可以通过及时适应来增强多LLM协作，并引入了新的一致性和安全漏洞，包括奖励行为和越狱的易感性。我们的发现突出了LLMS中对身份敏感行为的双重承诺和危险，强调了进一步了解对话者意识和多代理部署的新保障措施的必要性。我们的代码在此HTTPS URL上开源。

Title: On the Generalizability of "Competition of Mechanisms: Tracing How Language Models Handle Facts and Counterfactuals"

Authors: Asen Dotsinski, Udit Thakur, Marko Ivanov, Mohammad Hafeez Khan, Maria Heuss
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.22977
Pdf URL: https://arxiv.org/pdf/2506.22977
Copy Paste: [[2506.22977]] On the Generalizability of "Competition of Mechanisms: Tracing How Language Models Handle Facts and Counterfactuals"(https://arxiv.org/abs/2506.22977)
Keywords: language model, gpt, prompt
Abstract: We present a reproduction study of "Competition of Mechanisms: Tracing How Language Models Handle Facts and Counterfactuals" (Ortu et al., 2024), which investigates competition of mechanisms in language models between factual recall and counterfactual in-context repetition. Our study successfully reproduces their primary findings regarding the localization of factual and counterfactual information, the dominance of attention blocks in mechanism competition, and the specialization of attention heads in handling competing information. We reproduce their results on both GPT-2 (Radford et al., 2019) and Pythia 6.9B (Biderman et al., 2023). We extend their work in three significant directions. First, we explore the generalizability of these findings to even larger models by replicating the experiments on Llama 3.1 8B (Grattafiori et al., 2024), discovering greatly reduced attention head specialization. Second, we investigate the impact of prompt structure by introducing variations where we avoid repeating the counterfactual statement verbatim or we change the premise word, observing a marked decrease in the logit for the counterfactual token. Finally, we test the validity of the authors' claims for prompts of specific domains, discovering that certain categories of prompts skew the results by providing the factual prediction token as part of the subject of the sentence. Overall, we find that the attention head ablation proposed in Ortu et al. (2024) is ineffective for domains that are underrepresented in their dataset, and that the effectiveness varies based on model architecture, prompt structure, domain and task.
摘要：我们介绍了“机制的竞争：追踪语言模型如何处理事实和反事实”的复制研究（Ortu等，2024），该研究调查了事实召回和反事实内部文章重复之间的语言模型中的机制竞争。我们的研究成功地重现了他们关于事实和反事实信息的本地化，机制竞争中注意力阻滞的优势以及注意力负责人在处理竞争信息中的专业化的主要发现。我们在GPT-2（Radford等，2019）和Pythia 6.9b（Biderman等，2023）上重现了它们的结果。我们将他们的工作扩展到三个重要方向。首先，我们通过在Llama 3.1 8b上复制实验（Grattafiori等，2024）来探讨这些发现对更大模型的普遍性，发现注意力头的专业化大大降低了。其次，我们通过引入变化来调查及时结构的影响，即我们避免重复反事实语句逐字化或更改前提单词，从而观察到反事实令牌的logit明显下降。最后，我们测试了作者对特定领域提示的主张的有效性，发现某些类别的提示是通过提供事实预测令牌作为句子主题的一部分来歪曲结果的。总体而言，我们发现在Ortu等人中提出的注意力头条消融。（2024）对于其数据集中代表性不足的域而言无效，并且有效性根据模型架构，提示结构，域和任务而变化。

Title: A Systematic Study of Compositional Syntactic Transformer Language Models

Authors: Yida Zhao, Hao Xve, Xiang Hu, Kewei Tu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.22978
Pdf URL: https://arxiv.org/pdf/2506.22978
Copy Paste: [[2506.22978]] A Systematic Study of Compositional Syntactic Transformer Language Models(https://arxiv.org/abs/2506.22978)
Keywords: language model
Abstract: Syntactic language models (SLMs) enhance Transformers by incorporating syntactic biases through the modeling of linearized syntactic parse trees alongside surface sentences. This paper focuses on compositional SLMs that are based on constituency parse trees and contain explicit bottom-up composition of constituent representations. We identify key aspects of design choices in existing compositional SLMs and propose a unified framework encompassing both existing models and novel variants. We conduct a comprehensive empirical evaluation of all the variants in our framework across language modeling, syntactic generalization, summarization, dialogue, and inference efficiency. Based on the experimental results, we make multiple recommendations on the design of compositional SLMs. Our code is released at this https URL.
摘要：句法语言模型（SLMS）通过通过线性化的句法解析树与表面句子的建模来结合句法偏见来增强变形金刚。本文着重于基于选区解析树的组成SLM，并包含成分表示的明确自下而上的组成。我们确定了现有组合SLM中设计选择的关键方面，并提出了一个统一的框架，其中包括现有模型和新型变体。我们对跨语言建模，句法概括，摘要，对话和推理效率的所有变体进行了全面的经验评估。根据实验结果，我们就组成SLM的设计提出了多个建议。我们的代码在此HTTPS URL上发布。

Title: SoMi-ToM: Evaluating Multi-Perspective Theory of Mind in Embodied Social Interactions

Authors: Xianzhe Fan, Xuhui Zhou, Chuanyang Jin, Kolby Nottingham, Hao Zhu, Maarten Sap
Subjects: cs.CL, cs.AI, cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2506.23046
Pdf URL: https://arxiv.org/pdf/2506.23046
Copy Paste: [[2506.23046]] SoMi-ToM: Evaluating Multi-Perspective Theory of Mind in Embodied Social Interactions(https://arxiv.org/abs/2506.23046)
Keywords: language model, agent
Abstract: Humans continuously infer the states, goals, and behaviors of others by perceiving their surroundings in dynamic, real-world social interactions. However, most Theory of Mind (ToM) benchmarks only evaluate static, text-based scenarios, which have a significant gap compared to real interactions. We propose the SoMi-ToM benchmark, designed to evaluate multi-perspective ToM in embodied multi-agent complex social interactions. This benchmark is based on rich multimodal interaction data generated by the interaction environment SoMi, covering diverse crafting goals and social relationships. Our framework supports multi-level evaluation: (1) first-person evaluation provides multimodal (visual, dialogue, action, etc.) input from a first-person perspective during a task for real-time state inference, (2) third-person evaluation provides complete third-person perspective video and text records after a task for goal and behavior inference. This evaluation method allows for a more comprehensive examination of a model's ToM capabilities from both the subjective immediate experience and the objective global observation. We constructed a challenging dataset containing 35 third-person perspective videos, 363 first-person perspective images, and 1225 expert-annotated multiple-choice questions (three options). On this dataset, we systematically evaluated the performance of human subjects and several state-of-the-art large vision-language models (LVLMs). The results show that LVLMs perform significantly worse than humans on SoMi-ToM: the average accuracy gap between humans and models is 40.1% in first-person evaluation and 26.4% in third-person evaluation. This indicates that future LVLMs need to further improve their ToM capabilities in embodied, complex social interactions.
摘要：人类通过在动态的现实社会互动中感知周围环境，不断地推断他人的国家，目标和行为。但是，大多数心理理论（TOM）基准仅评估基于文本的静态场景，这些场景与实际互动相比具有很大的差距。我们提出了SOMI-TOM基准，该基准旨在评估具有体现的多代理复杂社交互动中的多人TOM。该基准测试基于互动环境生成的丰富多模式相互作用数据，涵盖了各种制定目标和社会关系。我们的框架支持多层次评估：（1）第一人称评估在实时状态推理任务中的第一人称视角提供了多模式（视觉，对话，动作等）的输入，（2）第三人称评估提供了完整的第三人称视频和文本记录，以实现目标和行为推进。这种评估方法可以从主观的直接体验和客观的全球观察中对模型的TOM功能进行更全面的检查。我们构建了一个具有挑战性的数据集，其中包含35个第三人称视频，363个第一人称观点图像和1225个专家通知的多项选择问题（三个选项）。在此数据集上，我们系统地评估了人类受试者的性能和几种最先进的大视觉模型（LVLMS）。结果表明，LVLM在SOMI-TOM上的表现明显差：在第一人称评估中，人类和模型之间的平均准确性差距为40.1％，第三人称评估的平均准确性差距为26.4％。这表明未来的LVLM需要进一步提高其在体现，复杂的社交互动中的TOM能力。

Title: Boosting LLM's Molecular Structure Elucidation with Knowledge Enhanced Tree Search Reasoning

Authors: Xiang Zhuang, Bin Wu, Jiyu Cui, Kehua Feng, Xiaotong Li, Huabin Xing, Keyan Ding, Qiang Zhang, Huajun Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.23056
Pdf URL: https://arxiv.org/pdf/2506.23056
Copy Paste: [[2506.23056]] Boosting LLM's Molecular Structure Elucidation with Knowledge Enhanced Tree Search Reasoning(https://arxiv.org/abs/2506.23056)
Keywords: language model, gpt, llm
Abstract: Molecular structure elucidation involves deducing a molecule's structure from various types of spectral data, which is crucial in chemical experimental analysis. While large language models (LLMs) have shown remarkable proficiency in analyzing and reasoning through complex tasks, they still encounter substantial challenges in molecular structure elucidation. We identify that these challenges largely stem from LLMs' limited grasp of specialized chemical knowledge. In this work, we introduce a Knowledge-enhanced reasoning framework for Molecular Structure Elucidation (K-MSE), leveraging Monte Carlo Tree Search for test-time scaling as a plugin. Specifically, we construct an external molecular substructure knowledge base to extend the LLMs' coverage of the chemical structure space. Furthermore, we design a specialized molecule-spectrum scorer to act as a reward model for the reasoning process, addressing the issue of inaccurate solution evaluation in LLMs. Experimental results show that our approach significantly boosts performance, particularly gaining more than 20% improvement on both GPT-4o-mini and GPT-4o. Our code is available at this https URL.
摘要：分子结构阐明涉及从各种类型的光谱数据中推论分子的结构，这对于化学实验分析至关重要。尽管大型语言模型（LLMS）在通过复杂的任务中表现出非常熟练的熟练程度，但它们在分子结构阐明中仍然遇到重大挑战。我们确定这些挑战在很大程度上源于LLMS对专业化学知识的有限掌握。在这项工作中，我们引入了一个用于分子结构阐明（K-MSE）的知识增强的推理框架，并利用蒙特卡洛树搜索测试时间缩放作为插件。具体而言，我们构建了一个外部分子亚结构知识库，以扩展LLMS的化学结构空间的覆盖范围。此外，我们设计了一个专门的分子光谱得分手，以作为推理过程的奖励模型，解决了LLMS中解决方案不准确的问题。实验结果表明，我们的方法显着提高了性能，尤其是在GPT-4O-MINI和GPT-4O方面的增长超过20％。我们的代码可在此HTTPS URL上找到。

Title: From Individuals to Interactions: Benchmarking Gender Bias in Multimodal Large Language Models from the Lens of Social Relationship

Authors: Yue Xu, Wenjie Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.23101
Pdf URL: https://arxiv.org/pdf/2506.23101
Copy Paste: [[2506.23101]] From Individuals to Interactions: Benchmarking Gender Bias in Multimodal Large Language Models from the Lens of Social Relationship(https://arxiv.org/abs/2506.23101)
Keywords: language model, llm
Abstract: Multimodal large language models (MLLMs) have shown impressive capabilities across tasks involving both visual and textual modalities. However, growing concerns remain about their potential to encode and amplify gender bias, particularly in socially sensitive applications. Existing benchmarks predominantly evaluate bias in isolated scenarios, overlooking how bias may emerge subtly through interpersonal interactions. We fill this gap by going beyond single-entity evaluation and instead focusing on a deeper examination of relational and contextual gender bias in dual-individual interactions. We introduce Genres, a novel benchmark designed to evaluate gender bias in MLLMs through the lens of social relationships in generated narratives. Genres assesses gender bias through a dual-character profile and narrative generation task that captures rich interpersonal dynamics and supports a fine-grained bias evaluation suite across multiple dimensions. Experiments on both open- and closed-source MLLMs reveal persistent, context-sensitive gender biases that are not evident in single-character settings. Our findings underscore the importance of relationship-aware benchmarks for diagnosing subtle, interaction-driven gender bias in MLLMs and provide actionable insights for future bias mitigation.
摘要：多模式的大语言模型（MLLM）在涉及视觉和文本方式的任务中显示出令人印象深刻的功能。但是，日益增长的担忧仍然对它们编码和扩大性别偏见的潜力，尤其是在社会敏感的应用中。现有基准主要评估孤立场景中的偏见，忽视了如何通过人际交往会巧妙地出现偏见。我们通过超越单一性评估来填补这一差距，而是专注于对双重个人互动中的关系和上下文性别偏见的更深入的研究。我们介绍类型，这是一种新颖的基准测试，旨在通过产生的叙述中的社会关系来评估MLLM中的性别偏见。流派通过双字符概况和叙事生成任务评估性别偏见，该任务捕获了丰富的人际动态，并支持跨多个维度的细粒度偏见评估套件。在开放源和封闭源MLLM上进行的实验揭示了在单个字符设置中不明显的持续性，上下文敏感的性别偏见。我们的发现强调了关系感知的基准测试对于诊断MLLM中微妙的，互动驱动的性别偏见的重要性，并为未来的缓解偏见提供了可行的见解。

Title: FairI Tales: Evaluation of Fairness in Indian Contexts with a Focus on Bias and Stereotypes

Authors: Janki Atul Nawale, Mohammed Safi Ur Rahman Khan, Janani D, Mansi Gupta, Danish Pruthi, Mitesh M. Khapra
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.23111
Pdf URL: https://arxiv.org/pdf/2506.23111
Copy Paste: [[2506.23111]] FairI Tales: Evaluation of Fairness in Indian Contexts with a Focus on Bias and Stereotypes(https://arxiv.org/abs/2506.23111)
Keywords: llm
Abstract: Existing studies on fairness are largely Western-focused, making them inadequate for culturally diverse countries such as India. To address this gap, we introduce INDIC-BIAS, a comprehensive India-centric benchmark designed to evaluate fairness of LLMs across 85 identity groups encompassing diverse castes, religions, regions, and tribes. We first consult domain experts to curate over 1,800 socio-cultural topics spanning behaviors and situations, where biases and stereotypes are likely to emerge. Grounded in these topics, we generate and manually validate 20,000 real-world scenario templates to probe LLMs for fairness. We structure these templates into three evaluation tasks: plausibility, judgment, and generation. Our evaluation of 14 popular LLMs on these tasks reveals strong negative biases against marginalized identities, with models frequently reinforcing common stereotypes. Additionally, we find that models struggle to mitigate bias even when explicitly asked to rationalize their decision. Our evaluation provides evidence of both allocative and representational harms that current LLMs could cause towards Indian identities, calling for a more cautious usage in practical applications. We release INDIC-BIAS as an open-source benchmark to advance research on benchmarking and mitigating biases and stereotypes in the Indian context.
摘要：现有关于公平性的研究基本上是针对西方的，这使得它们不足以在印度等文化多样化国家。为了解决这一差距，我们引入了指示偏见，这是一种全面的以印度为中心的基准测试，旨在评估85个身份群体中LLM的公平性，包括各种种姓，宗教，地区和部落。我们首先咨询领域专家，策划超过1,800个社会文化主题，这些话题涵盖行为和情况，在这种情况下，偏见和刻板印象可能会出现。基于这些主题，我们生成并手动验证了20,000个现实情况模板，以探测LLMS的公平性。我们将这些模板构成三个评估任务：合理性，判断力和发电。我们对14个受欢迎的LLM对这些任务的评估揭示了针对边缘化身份的强烈负面偏见，模型经常加强常见的刻板印象。此外，我们发现，即使明确要求合理化他们的决定，模型也很难减轻偏见。我们的评估提供了当前LLM对印度身份造成的分配和代表性危害的证据，呼吁在实际应用中更加谨慎。我们将指示性偏见作为开源基准，以提高对印度背景下基准测试和减轻偏见和刻板印象的研究。

Title: Decoding Memes: Benchmarking Narrative Role Classification across Multilingual and Multimodal Models

Authors: Shivam Sharma, Tanmoy Chakraborty
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2506.23122
Pdf URL: https://arxiv.org/pdf/2506.23122
Copy Paste: [[2506.23122]] Decoding Memes: Benchmarking Narrative Role Classification across Multilingual and Multimodal Models(https://arxiv.org/abs/2506.23122)
Keywords: language model, llm, prompt
Abstract: This work investigates the challenging task of identifying narrative roles - Hero, Villain, Victim, and Other - in Internet memes, across three diverse test sets spanning English and code-mixed (English-Hindi) languages. Building on an annotated dataset originally skewed toward the 'Other' class, we explore a more balanced and linguistically diverse extension, originally introduced as part of the CLEF 2024 shared task. Comprehensive lexical and structural analyses highlight the nuanced, culture-specific, and context-rich language used in real memes, in contrast to synthetically curated hateful content, which exhibits explicit and repetitive lexical markers. To benchmark the role detection task, we evaluate a wide spectrum of models, including fine-tuned multilingual transformers, sentiment and abuse-aware classifiers, instruction-tuned LLMs, and multimodal vision-language models. Performance is assessed under zero-shot settings using precision, recall, and F1 metrics. While larger models like DeBERTa-v3 and Qwen2.5-VL demonstrate notable gains, results reveal consistent challenges in reliably identifying the 'Victim' class and generalising across cultural and code-mixed content. We also explore prompt design strategies to guide multimodal models and find that hybrid prompts incorporating structured instructions and role definitions offer marginal yet consistent improvements. Our findings underscore the importance of cultural grounding, prompt engineering, and multimodal reasoning in modelling subtle narrative framings in visual-textual content.
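For reference, the per-class precision, recall, and F1 metrics named in the abstract can be computed as in the minimal Python sketch below. The function name `per_class_prf` and the toy labels are illustrative assumptions, not taken from the paper or its dataset.

```python
def per_class_prf(gold, pred, label):
    """Return (precision, recall, f1) for a single class label."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    # Count true positives, false positives, and false negatives for `label`.
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical toy labels using the four roles listed in the abstract.
gold = ["Hero", "Victim", "Victim", "Other"]
pred = ["Hero", "Other", "Victim", "Other"]
victim_scores = per_class_prf(gold, pred, "Victim")
```

Reporting the scores per class, rather than only an aggregate, is what makes a weakness on a single role such as 'Victim' visible.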

Title: Unleashing Embodied Task Planning Ability in LLMs via Reinforcement Learning

Authors: Zhaoye Fei, Li Ji, Siyin Wang, Junhao Shi, Jingjing Gong, Xipeng Qiu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.23127
Pdf URL: https://arxiv.org/pdf/2506.23127
Copy Paste: [[2506.23127]] Unleashing Embodied Task Planning Ability in LLMs via Reinforcement Learning(https://arxiv.org/abs/2506.23127)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across various tasks, yet they face significant challenges in embodied task planning scenarios that require continuous environmental understanding and action generation. Existing approaches generate open-loop action scripts based on static knowledge, making it difficult to learn causal relationships between actions and environmental feedback, particularly in partially observable environments. We introduce Embodied Planner-R1, a novel outcome-driven reinforcement learning framework that enables LLMs to develop interactive capabilities through autonomous exploration with minimal supervision. Our framework incorporates three key innovations: (1) pure reinforcement learning with group rollout, without human annotations, incorporating in-environment interaction through parallel exploration; (2) a completion-driven sparse reward; and (3) Interactive Policy Optimization (IPO) for efficient learning from grouped trajectories. Across two challenging text-based embodied planning benchmarks, Embodied Planner-R1 achieves impressive completion rates of 97.78% on ALFWorld and 79.92% on ScienceWorld, surpassing prior methods by a large margin, and suffers only a 3.66% drop in previously unseen environments, evidencing strong generalization.

Title: Format-Adapter: Improving Reasoning Capability of LLMs by Adapting Suitable Format

Authors: Dingzirui Wang, Xuanliang Zhang, Rongyu Cao, Longxu Dou, Xianzhen Luo, Yingwei Ma, Qingfu Zhu, Wanxiang Che, Binhua Li, Fei Huang, Yongbin Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.23133
Pdf URL: https://arxiv.org/pdf/2506.23133
Copy Paste: [[2506.23133]] Format-Adapter: Improving Reasoning Capability of LLMs by Adapting Suitable Format(https://arxiv.org/abs/2506.23133)
Keywords: language model, llm
Abstract: Generating multiple answers and voting over them is an effective method to mitigate reasoning inconsistencies of large language models (LLMs). Prior works have shown that multiple reasoning formats outperform a single format when generating multiple answers. However, previous works using multiple formats rely on formats labeled by humans, which may not be suitable for every task and incur high labeling costs. To address this issue, we adapt suitable formats to the given tasks by generating and selecting formats. We first propose how to measure the reasoning error when generating multiple answers. Then, we introduce Format-Adapter, which utilizes LLMs to generate and select suitable reasoning formats by minimizing the error measurement we present. We conduct experiments on math and commonsense reasoning tasks, where Format-Adapter achieves a 4.3% performance improvement on average over previous works, demonstrating its effectiveness.
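The generate-and-vote step that the abstract builds on can be illustrated with plain majority voting. This is a minimal sketch of that generic technique only; it does not reproduce Format-Adapter's format generation, format selection, or error measurement, and the function name `majority_vote` is an assumption made for illustration.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer among several sampled answers.

    Ties go to the answer seen first, because Counter.most_common keeps
    equal counts in first-encountered order.
    """
    if not answers:
        raise ValueError("answers must not be empty")
    return Counter(answers).most_common(1)[0][0]


# Hypothetical final answers extracted from three sampled reasoning chains.
sampled_answers = ["12", "15", "12"]
voted = majority_vote(sampled_answers)
```

In a multi-format setting, each sampled answer would come from a chain written in a different reasoning format before the vote is taken.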

Title: LLM-Assisted Question-Answering on Technical Documents Using Structured Data-Aware Retrieval Augmented Generation

Authors: Shadman Sobhan, Mohammad Ariful Haque
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.23136
Pdf URL: https://arxiv.org/pdf/2506.23136
Copy Paste: [[2506.23136]] LLM-Assisted Question-Answering on Technical Documents Using Structured Data-Aware Retrieval Augmented Generation(https://arxiv.org/abs/2506.23136)
Keywords: language model, llm, hallucination, retrieval augmented generation, retrieval-augmented generation
Abstract: Large Language Models (LLMs) are capable of natural language understanding and generation. But they face challenges such as hallucination and outdated knowledge. Fine-tuning is one possible solution, but it is resource-intensive and must be repeated with every data update. Retrieval-Augmented Generation (RAG) offers an efficient solution by allowing LLMs to access external knowledge sources. However, traditional RAG pipelines struggle with retrieving information from complex technical documents with structured data such as tables and images. In this work, we propose a RAG pipeline, capable of handling tables and images in documents, for technical documents that support both scanned and searchable formats. Its retrieval process combines vector similarity search with a fine-tuned reranker based on Gemma-2-9b-it. The reranker is trained using RAFT (Retrieval-Augmented Fine-Tuning) on a custom dataset designed to improve context identification for question answering. Our evaluation demonstrates that the proposed pipeline achieves a high faithfulness score of 94% (RAGas) and 96% (DeepEval), and an answer relevancy score of 87% (RAGas) and 93% (DeepEval). Comparative analysis demonstrates that the proposed architecture is superior to general RAG pipelines in terms of table-based questions and handling questions outside context.
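The two-stage retrieval described in the abstract (vector similarity search followed by a reranker) can be sketched as follows. This is a toy illustration under stated assumptions: the vectors are hand-written, and `rerank_score` is a hypothetical stand-in callable, not the fine-tuned Gemma-2-9b-it reranker from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve_then_rerank(query_vec, chunks, rerank_score, k=2):
    """Stage 1: keep the k chunks closest to the query by cosine similarity.
    Stage 2: reorder those k chunks by the reranker's score.

    chunks is a list of (text, vector) pairs; rerank_score maps text to a float.
    """
    by_similarity = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    candidates = by_similarity[:k]
    reranked = sorted(candidates, key=lambda c: rerank_score(c[0]), reverse=True)
    return [text for text, _ in reranked]


# Hypothetical chunks and reranker scores, for illustration only.
chunks = [
    ("table row", [1.0, 0.0]),
    ("caption", [0.9, 0.1]),
    ("unrelated", [0.0, 1.0]),
]
toy_scores = {"table row": 0.2, "caption": 0.9, "unrelated": 1.0}
ranked = retrieve_then_rerank([1.0, 0.0], chunks, toy_scores.get, k=2)
```

The reranker only ever sees the candidates that survive the similarity search, so a chunk it would score highly can still be missed if the first stage drops it.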

Title: Benchmarking Deep Search over Heterogeneous Enterprise Data

Authors: Prafulla Kumar Choubey, Xiangyu Peng, Shilpa Bhagavath, Kung-Hsiang Huang, Caiming Xiong, Chien-Sheng Wu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.23139
Pdf URL: https://arxiv.org/pdf/2506.23139
Copy Paste: [[2506.23139]] Benchmarking Deep Search over Heterogeneous Enterprise Data(https://arxiv.org/abs/2506.23139)
Keywords: llm, retrieval-augmented generation, agent
Abstract: We present a new benchmark for evaluating Deep Search--a realistic and complex form of retrieval-augmented generation (RAG) that requires source-aware, multi-hop reasoning over diverse, sparse, but related sources. These include documents, meeting transcripts, Slack messages, GitHub, and URLs, which vary in structure and often contain human-to-human interactions. We build it using a synthetic data pipeline that simulates business workflows across product planning, development, and support stages, generating interconnected content with realistic noise and multi-hop questions with guaranteed ground-truth answers. We release our benchmark with both answerable and unanswerable queries, and a retrieval pool of 39,190 enterprise artifacts, enabling fine-grained evaluation of long-context LLM and RAG systems. Our experiments reveal that even the best-performing agentic RAG methods achieve an average performance score of 32.96 on our benchmark. With further analysis, we highlight retrieval as the main bottleneck: existing methods struggle to conduct deep searches and retrieve all necessary evidence. Consequently, they often reason over partial context, leading to significant performance degradation.

Title: Learning-to-Context Slope: Evaluating In-Context Learning Effectiveness Beyond Performance Illusions

Authors: Dingzirui Wang, Xuanliang Zhang, Keyan Xu, Qingfu Zhu, Wanxiang Che, Yang Deng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.23146
Pdf URL: https://arxiv.org/pdf/2506.23146
Copy Paste: [[2506.23146]] Learning-to-Context Slope: Evaluating In-Context Learning Effectiveness Beyond Performance Illusions(https://arxiv.org/abs/2506.23146)
Keywords: language model, llm
Abstract: In-context learning (ICL) has emerged as an effective approach to enhance the performance of large language models (LLMs). However, its effectiveness varies significantly across models and tasks, posing challenges for practitioners to determine when ICL reliably improves performance. Current evaluation approaches, reliant on performance change after applying ICL, suffer from low reliability, poor attribution, and impracticality in data-insufficient scenarios. We propose the Learning-to-Context Slope (LCS), a novel metric that quantifies ICL effectiveness by modeling the slope between learning gain (loss decrease from demonstrations) and contextual relevance (demonstration-input relevance). LCS addresses key limitations of performance-based metrics: (1) it captures continuous loss changes even when outputs are incorrect, improving reliability; (2) its formulation attributes ICL failures to weak contextual alignment (inability to adapt inputs to demonstrations) or strong output calibration (self-verification of correctness); and (3) it minimizes reliance on labeled data via synthetic evaluation. Extensive experiments demonstrate that LCS strongly correlates with performance improvements in labeled settings and reliably reflects true effectiveness in biased or data-scarce scenarios. Further analysis reveals actionable thresholds for LCS and identifies model capabilities critical to ICL success.
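One way to picture a slope between learning gain and contextual relevance is an ordinary least-squares fit over per-example measurements. The abstract does not give the exact LCS formulation, so the sketch below is an assumption: it fits a line with learning gain as the dependent variable and contextual relevance as the independent variable, and the data points are invented.

```python
def least_squares_slope(relevance, gain):
    """Slope of the ordinary least-squares line gain = a + b * relevance."""
    n = len(relevance)
    if n != len(gain) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(relevance) / n
    mean_y = sum(gain) / n
    sxx = sum((x - mean_x) ** 2 for x in relevance)
    if sxx == 0:
        raise ValueError("relevance values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(relevance, gain))
    return sxy / sxx


# Hypothetical per-example measurements: demonstration-input relevance and
# the loss decrease observed after adding demonstrations.
relevance = [0.0, 0.5, 1.0]
gain = [0.1, 0.2, 0.3]
slope = least_squares_slope(relevance, gain)
```

Under this reading, a larger positive slope means that more relevant demonstrations yield a larger loss decrease.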

Title: V-SYNTHESIS: Task-Agnostic Synthesis of Consistent and Diverse In-Context Demonstrations from Scratch via V-Entropy

Authors: Dingzirui Wang, Xuanliang Zhang, Keyan Xu, Qingfu Zhu, Wanxiang Che, Yang Deng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.23149
Pdf URL: https://arxiv.org/pdf/2506.23149
Copy Paste: [[2506.23149]] V-SYNTHESIS: Task-Agnostic Synthesis of Consistent and Diverse In-Context Demonstrations from Scratch via V-Entropy(https://arxiv.org/abs/2506.23149)
Keywords: language model, llm
Abstract: High labeling cost for in-context learning (ICL) demonstrations motivates using large language models (LLMs) for synthesis to reduce overhead. However, existing synthesis methods are mainly task-specific or rely on pre-existing demonstrations. This paper therefore focuses on synthesizing demonstrations from scratch for arbitrary tasks. A major challenge in synthesizing from scratch is ensuring consistency with the target task, as the lack of labeling guidance could lead to synthesis bias. We first propose a consistency metric called V-Score, which has higher performance and lower computation cost compared with the metrics based on n-grams or embedding vectors. Furthermore, we introduce V-Synthesis, which leverages V-Score for proportional sampling to ensure both high consistency and diversity of synthesized demonstrations. Experimental results demonstrate that V-Synthesis yields an average performance improvement of 2.0% compared to existing synthesis methods, confirming the effectiveness of V-Synthesis.
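Proportional sampling, as mentioned in the abstract, can be sketched generically: each candidate is drawn with probability proportional to its score. The scores below are placeholders, not real V-Score values, and the helper names are assumptions. How V-Synthesis actually computes and applies the scores is not specified in the abstract.

```python
import random


def sampling_probabilities(scores):
    """Normalize non-negative scores into probabilities that sum to 1."""
    if any(s < 0 for s in scores):
        raise ValueError("scores must be non-negative")
    total = sum(scores)
    if total <= 0:
        raise ValueError("at least one score must be positive")
    return [s / total for s in scores]


def proportional_sample(candidates, scores, k, seed=None):
    """Draw k candidates (with replacement) proportionally to their scores."""
    probs = sampling_probabilities(scores)
    rng = random.Random(seed)
    return rng.choices(candidates, weights=probs, k=k)


# Hypothetical candidate demonstrations with placeholder consistency scores.
candidates = ["demo_a", "demo_b", "demo_c"]
scores = [1.0, 0.0, 3.0]
picks = proportional_sample(candidates, scores, k=50, seed=0)
```

Sampling in proportion to the score, instead of always taking the top-scoring candidate, keeps lower-scoring candidates in play. That is one plausible reading of how consistency and diversity are balanced.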

Title: Generalist Reward Models: Found Inside Large Language Models

Authors: Yi-Chen Li, Tian Xu, Yang Yu, Xuqin Zhang, Xiong-Hui Chen, Zhongxiang Ling, Ningjing Chao, Lei Yuan, Zhi-Hua Zhou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.23235
Pdf URL: https://arxiv.org/pdf/2506.23235
Copy Paste: [[2506.23235]] Generalist Reward Models: Found Inside Large Language Models(https://arxiv.org/abs/2506.23235)
Keywords: language model, llm
Abstract: The alignment of Large Language Models (LLMs) is critically dependent on reward models trained on costly human preference data. While recent work explores bypassing this cost with AI feedback, these methods often lack a rigorous theoretical foundation. In this paper, we discover that a powerful generalist reward model is already latently present within any LLM trained via standard next-token prediction. We prove that this endogenous reward is not a heuristic, but is theoretically equivalent to a reward function learned through offline inverse reinforcement learning. This connection allows us to directly elicit a high-quality reward signal from a base (pre-trained or supervised fine-tuned) model without any further training. Critically, we also prove that subsequent reinforcement learning using this endogenous reward leads to a policy with a provably superior error bound compared to the base model. To the best of our knowledge, this is the first theoretical proof of the effectiveness of reinforcement learning for LLMs. Our experiments validate this theory, demonstrating that our method not only outperforms existing LLM-as-a-judge approaches but can also surpass explicitly trained reward models. These findings suggest that the reward modeling stage can be replaced by a principled method of eliciting the knowledge already captured during pre-training, heralding a more efficient, powerful, and scalable paradigm for LLM alignment as well as multi-modal models.

Title: Two Spelling Normalization Approaches Based on Large Language Models

Authors: Miguel Domingo, Francisco Casacuberta
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.23288
Pdf URL: https://arxiv.org/pdf/2506.23288
Copy Paste: [[2506.23288]] Two Spelling Normalization Approaches Based on Large Language Models(https://arxiv.org/abs/2506.23288)
Keywords: language model
Abstract: The absence of standardized spelling conventions and the organic evolution of human language present an inherent linguistic challenge within historical documents, a longstanding concern for scholars in the humanities. Addressing this issue, spelling normalization endeavors to align a document's orthography with contemporary standards. In this study, we propose two new approaches based on large language models: one that was trained without supervised training, and a second that was trained for machine translation. Our evaluation spans multiple datasets encompassing diverse languages and historical periods, leading us to the conclusion that while both of them yielded encouraging results, statistical machine translation still seems to be the most suitable technology for this task.

Title: Objective-Free Local Learning and Emergent Language Structure in Thinking Machines

Authors: P. Myles Eugenio
Subjects: cs.CL, cs.AI, cs.LG, q-bio.NC
Abstract URL: https://arxiv.org/abs/2506.23293
Pdf URL: https://arxiv.org/pdf/2506.23293
Copy Paste: [[2506.23293]] Objective-Free Local Learning and Emergent Language Structure in Thinking Machines(https://arxiv.org/abs/2506.23293)
Keywords: language model
Abstract: We present a neuro-symbolic framework for generative language modeling based on local, event-driven emergent learning. At its core is a hierarchical Hopfield memory chain acting as a compositional short-term memory and dynamic tokenizer (retokenizer). Rather than relying on predefined tokens or supervision, the model builds structure from scratch, learning symbol sequences as multi-scale representations. It constructs projection tensors that bind co-occurring features into hierarchical tokens, introducing redundancy (i.e., an emergent gauge structure) and enabling compression of local activations into long-range dependencies. Curiously, we find that the retokenizer can filter natural language patterns from noise, generating synthetic languages with coherent internal morphology -- quantifiably the same as human language. Language is learned in a local (Hebbian) fashion, where model constraints dictate allowed emergent structure, and new information is retained in alignment with this structure. The absence of a global objective enables a form of plasticity not found in conventional language models, allowing the system to generalize beyond its initial inference class -- even without explicit data. We demonstrate that briefly activating a new neuron during inference binds distributed multi-scale token features into a symbolic embedding. These emergent embedding neurons act as long-term memory and support a key-value mechanism for compositional inference and generalization. This architecture provides a methodological foundation for studying how symbolic structure can emerge from local neural learning. It offers a new pathway for building scalable, interpretable neuro-symbolic systems -- where tokens, grammar, and reasoning arise as compressed memory traces within a Hopfield hierarchy. This approach advances the development of neuromorphic architectures for generative language models.

Title: Information Loss in LLMs' Multilingual Translation: The Role of Training Data, Language Proximity, and Language Family

Authors: Yumeng Lin, Xufeng Duan, David Haslett, Yige Chen, Zhenguang G. Cai
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.23340
Pdf URL: https://arxiv.org/pdf/2506.23340
Copy Paste: [[2506.23340]] Information Loss in LLMs' Multilingual Translation: The Role of Training Data, Language Proximity, and Language Family(https://arxiv.org/abs/2506.23340)
Keywords: language model, gpt, llm
Abstract: Large language models have achieved impressive progress in multilingual translation, yet they continue to face challenges with certain language pairs, particularly those with limited training data or significant linguistic divergence from English. This study systematically investigates how training data, language proximity, and language family affect information loss in multilingual translation. We evaluate two large language models, GPT-4 and Llama 2, by performing round-trip translations. Translation quality was assessed using BLEU scores and BERT similarity metrics. Our results reveal a robust interaction between training data size and language distance: while abundant training data can mitigate the effects of linguistic divergence, languages structurally closer to English consistently yield higher translation quality in low-resource conditions. Among various distance metrics, orthographic, phylogenetic, syntactic, and geographical distances emerge as strong predictors of translation performance. Language family also exerts an independent influence. These findings contribute to a deeper understanding of the linguistic constraints shaping multilingual translation in large language models, emphasizing that translation quality is shaped not only by data volume but also by structural and typological relationships between languages.

Title: ATGen: A Framework for Active Text Generation

Authors: Akim Tsvigun, Daniil Vasilev, Ivan Tsvigun, Ivan Lysenko, Talgat Bektleuov, Aleksandr Medvedev, Uliana Vinogradova, Nikita Severin, Mikhail Mozikov, Andrey Savchenko, Rostislav Grigorev, Ramil Kuleev, Fedor Zhdanov, Artem Shelmanov, Ilya Makarov
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.23342
Pdf URL: https://arxiv.org/pdf/2506.23342
Copy Paste: [[2506.23342]] ATGen: A Framework for Active Text Generation(https://arxiv.org/abs/2506.23342)
Keywords: language model, gpt, llm, chat, agent
Abstract: Active learning (AL) has demonstrated remarkable potential in reducing the annotation effort required for training machine learning models. However, despite the surging popularity of natural language generation (NLG) tasks in recent years, the application of AL to NLG has been limited. In this paper, we introduce Active Text Generation (ATGen) - a comprehensive framework that bridges AL with text generation tasks, enabling the application of state-of-the-art AL strategies to NLG. Our framework simplifies AL-empowered annotation in NLG tasks using both human annotators and automatic annotation agents based on large language models (LLMs). The framework supports LLMs deployed as services, such as ChatGPT and Claude, or operated on-premises. Furthermore, ATGen provides a unified platform for smooth implementation and benchmarking of novel AL strategies tailored to NLG tasks. Finally, we present evaluation results for state-of-the-art AL strategies across diverse settings and multiple text generation tasks. We show that ATGen reduces both the effort of human annotators and costs associated with API calls to LLM-based annotation agents. The code of the framework is available on GitHub under the MIT license. The video presentation is available at this http URL

Title: Perspective Dial: Measuring Perspective of Text and Guiding LLM Outputs

Authors: Taejin Kim, Siun-Chuon Mau, Konrad Vesey
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.23377
Pdf URL: https://arxiv.org/pdf/2506.23377
Copy Paste: [[2506.23377]] Perspective Dial: Measuring Perspective of Text and Guiding LLM Outputs(https://arxiv.org/abs/2506.23377)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) are used in a variety of mission-critical roles. Due to the rapidly developing nature of LLMs, there is a lack of quantifiable understanding of the bias and perspective associated with LLM output. Inspired by this need, this paper considers the broader issue of perspective or viewpoint of general text and perspective control of large language model (LLM) output. Perspective-Dial consists of two main components: (1) a metric space, dubbed Perspective Space, that enables quantitative measurements of different perspectives regarding a topic, and (2) Systematic Prompt Engineering that utilizes greedy-coordinate descent to control LLM output perspective based on measurement feedback from the Perspective Space. The empirical nature of the approach allows progress to sidestep a principled understanding of perspective or bias -- effectively quantifying and adjusting outputs for a variety of topics. Potential applications include detection, tracking and mitigation of LLM bias, narrative detection, sense making and tracking in public discourse, and debate bots advocating a given perspective.

Title: Hierarchical Memory Organization for Wikipedia Generation

Authors: Eugene J. Yu, Dawei Zhu, Yifan Song, Xiangyu Wong, Jiebin Zhang, Wenxuan Shi, Xiaoguang Li, Qun Liu, Sujian Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.23393
Pdf URL: https://arxiv.org/pdf/2506.23393
Copy Paste: [[2506.23393]] Hierarchical Memory Organization for Wikipedia Generation(https://arxiv.org/abs/2506.23393)
Keywords: hallucination
Abstract: Generating Wikipedia articles autonomously is a challenging task requiring the integration of accurate, comprehensive, and well-structured information from diverse sources. This paper introduces the Memory Organization-based Generation (MOG) framework, a novel approach to address these challenges by leveraging a hierarchical memory architecture. MOG extracts fine-grained memory units from web documents, recursively organizes them into a Wikipedia-style hierarchical structure, and uses this structure to guide the generation process. This ensures alignment between memory and the article outline, improving both informativeness and verifiability while minimizing hallucinations. Additionally, a citation module is implemented to enhance traceability by linking every generated sentence to specific memory units. Evaluations on our newly created WikiStart dataset demonstrate that MOG outperforms baseline methods in producing informative and reliable articles, making it particularly robust in real-world scenarios.

Title: Datasets for Fairness in Language Models: An In-Depth Survey

Authors: Jiale Zhang, Zichong Wang, Avash Palikhe, Zhipeng Yin, Wenbin Zhang
Subjects: cs.CL, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2506.23411
Pdf URL: https://arxiv.org/pdf/2506.23411
Copy Paste: [[2506.23411]] Datasets for Fairness in Language Models: An In-Depth Survey(https://arxiv.org/abs/2506.23411)
Keywords: language model
Abstract: Fairness benchmarks play a central role in shaping how we evaluate language models, yet surprisingly little attention has been given to examining the datasets that these benchmarks rely on. This survey addresses that gap by presenting a broad and careful review of the most widely used fairness datasets in current language model research, characterizing them along several key dimensions including their origin, scope, content, and intended use to help researchers better appreciate the assumptions and limitations embedded in these resources. To support more meaningful comparisons and analyses, we introduce a unified evaluation framework that reveals consistent patterns of demographic disparities across datasets and scoring methods. Applying this framework to twenty-four common benchmarks, we highlight the often overlooked biases that can influence conclusions about model fairness and offer practical guidance for selecting, combining, and interpreting these datasets. We also point to opportunities for creating new fairness benchmarks that reflect more diverse social contexts and encourage more thoughtful use of these tools going forward. All code, data, and detailed results are publicly available at this https URL to promote transparency and reproducibility across the research community.

Title: TuCo: Measuring the Contribution of Fine-Tuning to Individual Responses of LLMs

Authors: Felipe Nuti, Tim Franzmeyer, João Henriques
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.23423
Pdf URL: https://arxiv.org/pdf/2506.23423
Copy Paste: [[2506.23423]] TuCo: Measuring the Contribution of Fine-Tuning to Individual Responses of LLMs(https://arxiv.org/abs/2506.23423)
Keywords: language model, llm, prompt
Abstract: Past work has studied the effects of fine-tuning on large language models' (LLMs) overall performance on certain tasks. However, a quantitative and systematic method for analyzing its effect on individual outputs is still lacking. Here, we propose a new method for measuring the contribution that fine-tuning makes to individual LLM responses, assuming access to the original pre-trained model. Our method tracks the model's intermediate hidden states, providing a more fine-grained insight into the effects of fine-tuning than a simple comparison of final outputs from pre-trained and fine-tuned models. We introduce and theoretically analyze an exact decomposition of any fine-tuned LLM into a pre-training component and a fine-tuning component. Empirically, we find that model behavior and performance can be steered by up- or down-scaling the fine-tuning component during the forward pass. Motivated by this finding and our theoretical analysis, we define the Tuning Contribution (TuCo) as the ratio of the magnitudes of the fine-tuning component to the pre-training component. We observe that three prominent adversarial attacks on LLMs circumvent safety measures in a way that reduces TuCo, and that TuCo is consistently lower on prompts where these attacks succeed compared to those where they do not. This suggests that attenuating the effect of fine-tuning on model outputs plays a role in the success of such attacks. In summary, TuCo enables the quantitative study of how fine-tuning influences model behavior and safety, and vice versa.
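The ratio described in the abstract can be illustrated on a toy residual network. This is a minimal sketch under stated assumptions, not the authors' implementation: each layer is assumed to add its output to a running hidden state, the fine-tuning component is taken as the accumulated difference between fine-tuned and pre-trained layer outputs along the fine-tuned forward pass, and the score is the literal ratio of the two L2 norms. The function names `l2_norm` and `tuning_ratio_sketch` are hypothetical, and the paper's exact normalization may differ.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(c * c for c in vec))


def tuning_ratio_sketch(x0, pt_layers, ft_layers):
    """Ratio of fine-tuning to pre-training component magnitudes (toy version).

    Assumption: a residual network where x_{l+1} = x_l + layer(x_l), and
    ft_layer(x) = pt_layer(x) + (ft_layer(x) - pt_layer(x)) at every layer.
    """
    x = list(x0)
    ptc = [0.0] * len(x)  # accumulated pre-training component
    ftc = [0.0] * len(x)  # accumulated fine-tuning component
    for pt_layer, ft_layer in zip(pt_layers, ft_layers):
        pt_out = pt_layer(x)
        ft_out = ft_layer(x)
        for i in range(len(x)):
            ptc[i] += pt_out[i]
            ftc[i] += ft_out[i] - pt_out[i]
        # The hidden state follows the fine-tuned model's forward pass.
        x = [xi + fi for xi, fi in zip(x, ft_out)]
    pt_norm = l2_norm(ptc)
    return l2_norm(ftc) / pt_norm if pt_norm > 0 else float("inf")
```

When the fine-tuned layers equal the pre-trained layers, the fine-tuning component is zero and so is the ratio, which matches the intuition that a lower value means fine-tuning contributed less to the response.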

Title: What to Keep and What to Drop: Adaptive Table Filtering Framework

Authors: Jang Won June
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.23463
Pdf URL: https://arxiv.org/pdf/2506.23463
Copy Paste: [[2506.23463]] What to Keep and What to Drop: Adaptive Table Filtering Framework(https://arxiv.org/abs/2506.23463)
Keywords: language model, llm
Abstract: Large language models (LLMs) for table-based reasoning often struggle with large tables due to input length limits. We propose ATF (Adaptive Table Filtering Framework), a modular and question-aware filtering pipeline that prunes uninformative columns and rows using LLM-generated column descriptions, clustering, and sparse-dense alignment scores. ATF integrates seamlessly with existing models (e.g., TAPAS, TAPEX) without retraining. Experiments show that ATF reduces table cells by ~70%, boosting performance on out-of-domain TableQA tasks while causing slight performance drops on Table Fact Verification, where full-table context is more critical. These results highlight ATF's ability to adaptively balance informativeness and minimalism across tasks.
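To make the idea of question-aware column pruning concrete, here is a minimal sketch. It is not ATF: the abstract describes LLM-generated column descriptions, clustering, and sparse-dense alignment scores, whereas this example assumes the descriptions are already given and swaps in a plain token-overlap score as a stand-in. The function name `filter_columns` and the `keep` parameter are hypothetical.

```python
def filter_columns(table, question, descriptions, keep=2):
    """Keep the `keep` columns whose descriptions best overlap the question.

    table:        dict mapping column name -> list of cell values
    descriptions: dict mapping column name -> short text description
                  (assumed to exist already; ATF generates these with an LLM)
    """
    q_tokens = set(question.lower().split())

    def score(col):
        d_tokens = set(descriptions[col].lower().split())
        # Fraction of description tokens that also appear in the question.
        return len(q_tokens & d_tokens) / (len(d_tokens) or 1)

    ranked = sorted(table, key=score, reverse=True)
    kept = set(ranked[:keep])
    # Preserve the original column order in the pruned table.
    return {col: vals for col, vals in table.items() if col in kept}
```

A real pipeline would also prune rows and would likely combine a sparse score like this one with a dense embedding similarity, as the abstract indicates.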

Title: Thought-Augmented Planning for LLM-Powered Interactive Recommender Agent

Authors: Haocheng Yu, Yaxiong Wu, Hao Wang, Wei Guo, Yong Liu, Yawen Li, Yuyang Ye, Junping Du, Enhong Chen
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2506.23485
Pdf URL: https://arxiv.org/pdf/2506.23485
Copy Paste: [[2506.23485]] Thought-Augmented Planning for LLM-Powered Interactive Recommender Agent(https://arxiv.org/abs/2506.23485)
Keywords: language model, llm, agent
Abstract: Interactive recommendation is a typical information-seeking task that allows users to interactively express their needs through natural language and obtain personalized recommendations. Large language model-powered (LLM-powered) agents have become a new paradigm in interactive recommendations, effectively capturing users' real-time needs and enhancing personalized experiences. However, due to limited planning and generalization capabilities, existing formulations of LLM-powered interactive recommender agents struggle to effectively address diverse and complex user intents, such as intuitive, unrefined, or occasionally ambiguous requests. To tackle this challenge, we propose a novel thought-augmented interactive recommender agent system (TAIRA) that addresses complex user intents through distilled thought patterns. Specifically, TAIRA is designed as an LLM-powered multi-agent system featuring a manager agent that orchestrates recommendation tasks by decomposing user needs and planning subtasks, with its planning capacity strengthened through Thought Pattern Distillation (TPD), a thought-augmentation method that extracts high-level thoughts from the agent's and human experts' experiences. Moreover, we designed a set of user simulation schemes to generate personalized queries of varying difficulty and evaluate the recommendations based on specific datasets. Through comprehensive experiments conducted across multiple datasets, TAIRA exhibits significantly enhanced performance compared to existing methods. Notably, TAIRA shows a greater advantage on more challenging tasks while generalizing effectively on novel tasks, further validating its superiority in managing complex user intents within interactive recommendation systems. The code is publicly available at: this https URL.

Title: Reinforcement Fine-Tuning Enables MLLMs Learning Novel Tasks Stably

Authors: Zhihao Zhang, Qiaole Dong, Qi Zhang, Jun Zhao, Enyu Zhou, Zhiheng Xi, Senjie Jin, Xiaoran Fan, Yuhao Zhou, Yanwei Fu, Tao Ji, Tao Gui, Xuanjing Huang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.23508
Pdf URL: https://arxiv.org/pdf/2506.23508
Copy Paste: [[2506.23508]] Reinforcement Fine-Tuning Enables MLLMs Learning Novel Tasks Stably(https://arxiv.org/abs/2506.23508)
Keywords: language model, llm
Abstract: Post-training algorithms such as Supervised Fine-Tuning (SFT) and Reinforcement Fine-Tuning (RFT) are widely used to adapt multimodal large language models to downstream tasks. While effective at task adaptation, their impact on prior knowledge remains unclear. In this paper, we introduce jigsaw puzzles as a novel task absent from existing pretraining corpora and systematically study the behavior of SFT and RFT on an open-source multimodal model, Qwen2.5-VL. Our experiments reveal a sharp trade-off: SFT enables rapid task acquisition but leads to catastrophic forgetting, whereas RFT learns more slowly on novel tasks but maintains prior knowledge. We analyze this phenomenon through the lens of learning dynamics, showing that RFT reinforces correct samples that are naturally aligned with the base model's probability landscape, mitigating interference with prior knowledge. Moreover, supervised training on correct RFT-simulated rollouts allows SFT to preserve knowledge while rapidly learning new tasks. These findings suggest that data distribution, rather than algorithmic differences, plays a central role in forgetting, and highlight RFT's potential for stable continual learning in multimodal large language models.

Title: NEU-ESC: A Comprehensive Vietnamese dataset for Educational Sentiment analysis and topic Classification toward multitask learning

Authors: Phan Quoc Hung Mai, Quang Hung Nguyen, Phuong Giang Duong, Hong Hanh Nguyen, Nguyen Tuan Long
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.23524
Pdf URL: https://arxiv.org/pdf/2506.23524
Copy Paste: [[2506.23524]] NEU-ESC: A Comprehensive Vietnamese dataset for Educational Sentiment analysis and topic Classification toward multitask learning(https://arxiv.org/abs/2506.23524)
Keywords: language model
Abstract: In the field of education, understanding students' opinions through their comments is crucial, especially in the Vietnamese language, where resources remain limited. Existing educational datasets often lack domain relevance and student slang. To address these gaps, we introduce NEU-ESC, a new Vietnamese dataset for Educational Sentiment Classification and Topic Classification, curated from university forums, which offers more samples, richer class diversity, longer texts, and broader vocabulary. In addition, we explore multitask learning using encoder-only language models (BERT) and show that it achieves up to 83.7% and 79.8% accuracy on the sentiment and topic classification tasks, respectively. We also benchmark our dataset and model against other datasets and models, including Large Language Models, and discuss these benchmarks. The dataset is publicly available at: this https URL.

Title: On Recipe Memorization and Creativity in Large Language Models: Is Your Model a Creative Cook, a Bad Cook, or Merely a Plagiator?

Authors: Jan Kvapil, Martin Fajcik
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.23527
Pdf URL: https://arxiv.org/pdf/2506.23527
Copy Paste: [[2506.23527]] On Recipe Memorization and Creativity in Large Language Models: Is Your Model a Creative Cook, a Bad Cook, or Merely a Plagiator?(https://arxiv.org/abs/2506.23527)
Keywords: language model, llm
Abstract: This work-in-progress investigates the memorization, creativity, and nonsense found in cooking recipes generated from Large Language Models (LLMs). Specifically, we aim (i) to analyze memorization, creativity, and nonsense in LLMs using a small, high-quality set of human judgments and (ii) to evaluate potential approaches to automate such a human annotation in order to scale our study to hundreds of recipes. To achieve (i), we conduct a detailed human annotation on 20 preselected recipes generated by an LLM (Mixtral), extracting each recipe's ingredients and step-by-step actions to assess which elements are memorized (i.e., directly traceable to online sources possibly seen during training) and which arise from genuine creative synthesis or outright nonsense. We find that Mixtral consistently reuses ingredients that can be found in online documents, potentially seen during model training, suggesting strong reliance on memorized content. To achieve aim (ii) and scale our analysis beyond small sample sizes and single LLM validation, we design an "LLM-as-judge" pipeline that automates recipe generation, nonsense detection, parsing ingredients and recipe steps, and their annotation. For instance, when its output is compared against human annotations, the best ingredient extractor and annotator combination is Llama 3.1+Gemma 2 9B, which achieves up to 78% accuracy on ingredient matching. This automated framework enables large-scale quantification of memorization, creativity, and nonsense in generated recipes, providing rigorous evidence of the models' creative capacities.

Title: Semantic-guided Diverse Decoding for Large Language Model

Authors: Weijie Shi, Yue Cui, Yaguang Wu, Jingzhi Fang, Shibo Zhang, Mengze Li, Sirui Han, Jia Zhu, Jiajie Xu, Xiaofang Zhou
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.23601
Pdf URL: https://arxiv.org/pdf/2506.23601
Copy Paste: [[2506.23601]] Semantic-guided Diverse Decoding for Large Language Model(https://arxiv.org/abs/2506.23601)
Keywords: language model
Abstract: Diverse decoding of large language models is crucial for applications requiring multiple semantically distinct responses, yet existing methods primarily achieve lexical rather than semantic diversity. This limitation significantly constrains Best-of-N strategies, group-based reinforcement learning, and data synthesis. While temperature sampling and diverse beam search modify token distributions or apply n-gram penalties, they fail to ensure meaningful semantic differentiation. We introduce Semantic-guided Diverse Decoding (SemDiD), which operates directly in embedding space and balances quality with diversity through three complementary mechanisms: orthogonal directional guidance, dynamic inter-group repulsion, and position-debiased probability assessment. SemDiD harmonizes these competing objectives using adaptive gain functions and constraint optimization, ensuring both quality thresholds and maximal semantic differentiation. Experiments show SemDiD consistently outperforms existing methods, improving Best-of-N coverage by 1.4-5.2% across diverse tasks and accelerating RLHF training convergence by 15% while increasing accuracy by up to 2.1%.

Title: Evaluating the Simulation of Human Personality-Driven Susceptibility to Misinformation with LLMs

Authors: Manuel Pratelli, Marinella Petrocchi
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2506.23610
Pdf URL: https://arxiv.org/pdf/2506.23610
Copy Paste: [[2506.23610]] Evaluating the Simulation of Human Personality-Driven Susceptibility to Misinformation with LLMs(https://arxiv.org/abs/2506.23610)
Keywords: language model, llm, agent
Abstract: Large language models (LLMs) make it possible to generate synthetic behavioural data at scale, offering an ethical and low-cost alternative to human experiments. Whether such data can faithfully capture psychological differences driven by personality traits, however, remains an open question. We evaluate the capacity of LLM agents, conditioned on Big-Five profiles, to reproduce personality-based variation in susceptibility to misinformation, focusing on news discernment, the ability to judge true headlines as true and false headlines as false. Leveraging published datasets in which human participants with known personality profiles rated headline accuracy, we create matching LLM agents and compare their responses to the original human patterns. Certain trait-misinformation associations, notably those involving Agreeableness and Conscientiousness, are reliably replicated, whereas others diverge, revealing systematic biases in how LLMs internalize and express personality. The results underscore both the promise and the limits of personality-aligned LLMs for behavioral simulation, and offer new insight into modeling cognitive diversity in artificial agents.

Title: L0: Reinforcement Learning to Become General Agents

Authors: Junjie Zhang, Jingyi Xi, Zhuoyang Song, Junyu Lu, Yuhua Ke, Ting Sun, Yukun Yang, Jiaxing Zhang, Songxin Zhang, Zejian Xie
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.23667
Pdf URL: https://arxiv.org/pdf/2506.23667
Copy Paste: [[2506.23667]] L0: Reinforcement Learning to Become General Agents(https://arxiv.org/abs/2506.23667)
Keywords: language model, llm, agent
Abstract: Training large language models (LLMs) to act as autonomous agents for multi-turn, long-horizon tasks still poses significant challenges in scalability and training efficiency. To address this, we introduce L-Zero (L0), a scalable, end-to-end training pipeline for general-purpose agents. Featuring a low-cost, extensible, and sandboxed concurrent agent worker pool, L0 lowers the barrier for applying reinforcement learning in complex environments. We also introduce NB-Agent, the agent scaffold within L0, which operates in a "code-as-action" fashion via a Read-Eval-Print-Loop (REPL). We evaluate L0 on factuality question-answering benchmarks. Our experiments demonstrate that a base model can develop robust problem-solving skills using solely Reinforcement Learning with Verifiable Rewards (RLVR). On the Qwen2.5-7B-Instruct model, our method boosts accuracy on SimpleQA from 30% to 80% and on HotpotQA from 22% to 41%. We have open-sourced the entire L0 system, including our L0 series models, the NB-Agent, a complete training pipeline, and the corresponding training recipes on (this https URL).
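The "code-as-action" pattern can be sketched as a REPL whose namespace persists between steps, so that each action is a snippet of code and each observation is whatever that snippet printed. This is a minimal illustration, not NB-Agent's interface, which the abstract does not describe. The class name `TinyRepl` is hypothetical, and the sketch deliberately omits the sandboxing that L0 relies on: `exec` on untrusted model output is unsafe outside an isolated environment.

```python
import contextlib
import io


class TinyRepl:
    """Execute code snippets in one persistent namespace (no sandboxing)."""

    def __init__(self):
        self.namespace = {}

    def step(self, code):
        """Run one action and return its printed output as the observation."""
        buf = io.StringIO()
        try:
            with contextlib.redirect_stdout(buf):
                exec(code, self.namespace)
        except Exception as exc:
            # Errors are returned as text so the agent can react to them.
            return buf.getvalue() + f"{type(exc).__name__}: {exc}"
        return buf.getvalue()
```

Because variables survive across calls to `step`, an agent can build up intermediate results over several turns, which is the property that makes a REPL a natural fit for multi-turn tasks.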

Title: AutoEvoEval: An Automated Framework for Evolving Close-Ended LLM Evaluation Data

Authors: JiaRu Wu, Mingwei Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.23735
Pdf URL: https://arxiv.org/pdf/2506.23735
Copy Paste: [[2506.23735]] AutoEvoEval: An Automated Framework for Evolving Close-Ended LLM Evaluation Data(https://arxiv.org/abs/2506.23735)
Keywords: language model, llm
Abstract: Large language models (LLMs) have shown remarkable performance on various tasks, but existing evaluation benchmarks are often static and insufficient to fully assess their robustness and generalization in realistic scenarios. Prior work using evolutionary or adversarial data augmentation has improved evaluation diversity but lacks systematic control over perturbation types and multi-step complexity, limiting comprehensive robustness analysis. To address these gaps, we propose AutoEvoEval, an evolution-based evaluation framework for close-ended tasks such as multi-choice question answering. AutoEvoEval introduces 22 interpretable atomic evolution operations and supports multi-round compositions, enabling controlled generation of diverse, challenging, and realistic test samples. We conduct extensive experiments addressing four research questions on a broad set of open- and closed-source LLMs. Our results show that atomic operations cause an average accuracy drop of 7.283%, with structure-disrupting or misleading semantic edits causing the largest declines. Model sensitivities vary significantly for the same perturbation, and combining multiple evolution steps amplifies adversarial effects by up to 52.932%. These findings suggest current benchmarks may overestimate true model generalization and emphasize the need for evolution-aware robustness evaluation. Code and resources are available at: this https URL.
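The notion of composing atomic operations over a multiple-choice item can be sketched as follows. The two operations shown (option shuffling and prepending an irrelevant sentence) are illustrative assumptions chosen for simplicity; the abstract does not list the 22 operations, and the names `Item`, `shuffle_options`, `add_irrelevant_context`, and `evolve` are hypothetical.

```python
import random
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Item:
    question: str
    options: tuple
    answer: int  # index of the correct option


def shuffle_options(item, rng):
    """Structure-level edit: reorder options and track the correct index."""
    order = list(range(len(item.options)))
    rng.shuffle(order)
    new_options = tuple(item.options[i] for i in order)
    return replace(item, options=new_options, answer=order.index(item.answer))


def add_irrelevant_context(item, rng):
    """Semantic-level edit: prepend a sentence unrelated to the answer."""
    return replace(item, question="Note: the weather was mild that day. " + item.question)


def evolve(item, ops, seed=0):
    """Apply a sequence of atomic operations (a multi-round composition)."""
    rng = random.Random(seed)
    for op in ops:
        item = op(item, rng)
    return item
```

Each operation returns a new item whose gold answer is still valid, so accuracy on the evolved set can be compared directly with accuracy on the original set.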

Title: Positional Bias in Binary Question Answering: How Uncertainty Shapes Model Preferences

Authors: Tiziano Labruna, Simone Gallo, Giovanni Da San Martino
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.23743
Pdf URL: https://arxiv.org/pdf/2506.23743
Copy Paste: [[2506.23743]] Positional Bias in Binary Question Answering: How Uncertainty Shapes Model Preferences(https://arxiv.org/abs/2506.23743)
Keywords: language model, gpt
Abstract: Positional bias in binary question answering occurs when a model systematically favors one choice over another based solely on the ordering of presented options. In this study, we quantify and analyze positional bias across five large language models under varying degrees of answer uncertainty. We re-adapted the SQuAD-it dataset by adding an extra incorrect answer option and then created multiple versions with progressively less context and more out-of-context answers, yielding datasets that range from low to high uncertainty. Additionally, we evaluate two naturally higher-uncertainty benchmarks: (1) WebGPT, question pairs with unequal human-assigned quality scores, and (2) Winning Arguments, where models predict the more persuasive argument in Reddit's r/ChangeMyView exchanges. Across each dataset, the order of the "correct" (or higher-quality/persuasive) option is systematically flipped (first placed in position 1, then in position 2) to compute both Preference Fairness and Position Consistency. We observe that positional bias is nearly absent under low-uncertainty conditions, but grows exponentially when it becomes uncertain which option is correct.
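The flip-and-compare protocol can be sketched in a few lines. The sketch assumes each item is evaluated twice, once with the correct option in position 1 and once in position 2, and that the model's pick is recorded by content ("correct" or "incorrect") rather than by position. Position consistency is computed as the share of items with the same pick in both orderings. The abstract does not give the Preference Fairness formula, so the second function is only a simple first-position preference rate used as a stand-in, not that metric. Both function names are hypothetical.

```python
def position_consistency(pairs):
    """Share of items where the model picks the same content in both orderings.

    pairs: list of (pick_when_correct_first, pick_when_correct_second),
           each pick being "correct" or "incorrect".
    """
    same = sum(1 for first, second in pairs if first == second)
    return same / len(pairs)


def first_position_rate(pairs):
    """Share of all decisions that chose whatever sat in position 1.

    With the correct option first, picking "correct" means position 1;
    with the correct option second, picking "incorrect" means position 1.
    """
    hits = sum((first == "correct") + (second == "incorrect") for first, second in pairs)
    return hits / (2 * len(pairs))
```

A first-position rate near 0.5 together with high consistency indicates little positional bias; a rate well above 0.5 with low consistency indicates that the ordering, not the content, drives the choice.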

Title: Garbage In, Reasoning Out? Why Benchmark Scores are Unreliable and What to Do About It

Authors: Seyed Mahed Mousavi, Edoardo Cecchinato, Lucia Hornikova, Giuseppe Riccardi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.23864
Pdf URL: https://arxiv.org/pdf/2506.23864
Copy Paste: [[2506.23864]] Garbage In, Reasoning Out? Why Benchmark Scores are Unreliable and What to Do About It(https://arxiv.org/abs/2506.23864)
Keywords: gpt, llm
Abstract: We conduct a systematic audit of three widely used reasoning benchmarks, SocialIQa, FauxPas-EAI, and ToMi, and uncover pervasive flaws in both benchmark items and evaluation methodology. Using five LLMs (GPT-{3, 3.5, 4, o1}, and LLaMA 3.1) as diagnostic tools, we identify structural, semantic, and pragmatic issues in benchmark design (e.g., duplicated items, ambiguous wording, and implausible answers), as well as scoring procedures that prioritize output form over reasoning process. Through systematic human annotation and re-evaluation on cleaned benchmark subsets, we find that model scores often improve due to erratic surface wording variations rather than improved reasoning. In fact, further analyses show that model performance is highly sensitive to minor input variations such as context availability and phrasing, revealing that high scores may reflect alignment with format-specific cues rather than consistent inference based on the input. These findings challenge the validity of current benchmark-based claims about reasoning in LLMs, and highlight the need for evaluation protocols that assess reasoning as a process of drawing inference from available information, rather than as static output selection. We release audited data and evaluation tools to support more interpretable and diagnostic assessments of model reasoning.

Title: Advancing Multi-Step Mathematical Reasoning in Large Language Models through Multi-Layered Self-Reflection with Auto-Prompting

Authors: André de Souza Loureiro, Jorge Valverde-Rebaza, Julieta Noguez, David Escarcega, Ricardo Marcacini
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.23888
Pdf URL: https://arxiv.org/pdf/2506.23888
Copy Paste: [[2506.23888]] Advancing Multi-Step Mathematical Reasoning in Large Language Models through Multi-Layered Self-Reflection with Auto-Prompting(https://arxiv.org/abs/2506.23888)
Keywords: language model, llm, prompt
Abstract: Recent advancements in Large Language Models (LLMs) have significantly improved their problem-solving capabilities. However, these models still struggle when faced with complex multi-step reasoning tasks. In this paper, we propose the Multi-Layered Self-Reflection with Auto-Prompting (MAPS) framework, a novel approach designed to enhance multi-step mathematical reasoning in LLMs by integrating techniques such as Chain of Thought (CoT), Self-Reflection, and Auto-Prompting. Unlike traditional static prompting methods, MAPS employs an iterative refinement process. Initially, the model generates a solution using CoT prompting. When errors are detected, an adaptive self-reflection mechanism identifies and analyzes them, generating tailored prompts to guide corrections. These dynamically adjusted prompts enable the model to iteratively refine its reasoning. Experiments on four well-established benchmarks across multiple LLMs show that MAPS significantly outperforms standard CoT and achieves competitive results with reasoning-optimized models. In addition, MAPS enables general-purpose LLMs to reach performance levels comparable to specialized reasoning models. While deeper reflection layers improve accuracy, they also increase token usage and costs. To balance this trade-off, MAPS strategically limits reflection depth, ensuring an optimal balance between cost and reasoning performance.
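The iterative loop described in the abstract (an initial Chain-of-Thought attempt, error detection, reflection, a tailored follow-up prompt, and a cap on reflection depth) can be sketched as below. This is an illustration under assumptions rather than the MAPS implementation: `generate`, `check`, and `reflect` are hypothetical callables standing in for LLM calls and for whatever error-detection step the framework uses, and the prompt wording is invented for the example.

```python
def maps_style_solve(problem, generate, check, reflect, max_depth=3):
    """Iteratively refine an answer, with a cap on reflection depth.

    generate(prompt) -> answer          (stand-in for an LLM call)
    check(problem, answer) -> bool      (stand-in for error detection)
    reflect(problem, answer) -> str     (stand-in for self-reflection)
    """
    # Layer 0: plain Chain-of-Thought style prompt.
    answer = generate(f"Solve step by step: {problem}")
    depth = 0
    while depth < max_depth and not check(problem, answer):
        # Turn the reflection into a tailored prompt for the next attempt.
        feedback = reflect(problem, answer)
        prompt = f"Solve step by step: {problem}\nAvoid this mistake: {feedback}"
        answer = generate(prompt)
        depth += 1
    return answer, depth
```

The `max_depth` cap reflects the trade-off noted in the abstract: deeper reflection can improve accuracy but increases token usage and cost, so the loop stops after a fixed number of refinement rounds even if the check still fails.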

Title: The Trilemma of Truth in Large Language Models

Authors: Germans Savcisens, Tina Eliassi-Rad
Subjects: cs.CL, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2506.23921
Pdf URL: https://arxiv.org/pdf/2506.23921
Copy Paste: [[2506.23921]] The Trilemma of Truth in Large Language Models(https://arxiv.org/abs/2506.23921)
Keywords: language model, llm, chat
Abstract: We often attribute human characteristics to large language models (LLMs) and claim that they "know" certain things. LLMs have an internal probabilistic knowledge that represents information retained during training. How can we assess the veracity of this knowledge? We examine two common methods for probing the veracity of LLMs and discover several assumptions that are flawed. To address these flawed assumptions, we introduce sAwMIL (short for Sparse Aware Multiple-Instance Learning), a probing method that utilizes the internal activations of LLMs to separate statements into true, false, and neither. sAwMIL is based on multiple-instance learning and conformal prediction. We evaluate sAwMIL on 5 validity criteria across 16 open-source LLMs, including both default and chat-based variants, as well as on 3 new datasets. Among the insights we provide are: (1) the veracity signal is often concentrated in the third quarter of an LLM's depth; (2) truth and falsehood signals are not always symmetric; (3) linear probes perform better on chat models than on default models; (4) nonlinear probes may be required to capture veracity signals for some LLMs with reinforcement learning from human feedback or knowledge distillation; and (5) LLMs capture a third type of signal that is distinct from true and false and is neither true nor false. These findings provide a reliable method for verifying what LLMs "know" and how certain they are of their probabilistic internal knowledge.

Title: IMPACT: Inflectional Morphology Probes Across Complex Typologies

Authors: Mohammed J. Saeed, Tommi Vehvilainen, Evgeny Fedoseev, Sevil Caliskan, Tatiana Vodolazova
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.23929
Pdf URL: https://arxiv.org/pdf/2506.23929
Copy Paste: [[2506.23929]] IMPACT: Inflectional Morphology Probes Across Complex Typologies(https://arxiv.org/abs/2506.23929)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have shown significant progress on various multilingual benchmarks and are increasingly used to generate and evaluate text in non-English languages. However, while they may produce fluent outputs, it remains unclear to what extent these models truly grasp the underlying linguistic complexity of those languages, particularly in morphology. To investigate this, we introduce IMPACT, a synthetically generated evaluation framework focused on inflectional morphology, which we publicly release, designed to evaluate LLM performance across five morphologically rich languages: Arabic, Russian, Finnish, Turkish, and Hebrew. IMPACT includes unit-test-style cases covering both shared and language-specific phenomena, from basic verb inflections (e.g., tense, number, gender) to unique features like Arabic's reverse gender agreement and vowel harmony in Finnish and Turkish. We assess eight multilingual LLMs that, despite strong English performance, struggle with other languages and uncommon morphological patterns, especially when judging ungrammatical examples. We also show that Chain of Thought and Thinking Models can degrade performance. Our work exposes gaps in LLMs' handling of linguistic complexity, pointing to clear room for improvement. To support further research, we publicly release the IMPACT framework.

Title: Leveraging the Potential of Prompt Engineering for Hate Speech Detection in Low-Resource Languages

Authors: Ruhina Tabasshum Prome (Bangladesh Institute of Governance and Management), Tarikul Islam Tamiti (George Mason University), Anomadarshi Barua (George Mason University)
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.23930
Pdf URL: https://arxiv.org/pdf/2506.23930
Copy Paste: [[2506.23930]] Leveraging the Potential of Prompt Engineering for Hate Speech Detection in Low-Resource Languages(https://arxiv.org/abs/2506.23930)
Keywords: language model, llm, prompt
Abstract: The rapid expansion of social media leads to a marked increase in hate speech, which threatens personal lives and results in numerous hate crimes. Detecting hate speech presents several challenges: diverse dialects, frequent code-mixing, and the prevalence of misspelled words in user-generated content on social media platforms. Recent progress in hate speech detection is typically concentrated on high-resource languages. However, low-resource languages still face significant challenges due to the lack of large-scale, high-quality datasets. This paper investigates how we can overcome this limitation via prompt engineering on large language models (LLMs), focusing on the low-resource Bengali language. We investigate six prompting strategies for detecting hate speech effectively in low-resource languages: zero-shot prompting, refusal suppression, flattering the classifier, multi-shot prompting, role prompting, and finally our innovative metaphor prompting. We pioneer metaphor prompting to circumvent the built-in safety mechanisms of LLMs, which marks a significant departure from existing jailbreaking methods. We investigate all six prompting strategies on the Llama2-7B model and compare the results extensively with three pre-trained word embeddings (GloVe, Word2Vec, and FastText) for three different deep learning models: multilayer perceptron (MLP), convolutional neural network (CNN), and bidirectional gated recurrent unit (BiGRU). To prove the effectiveness of our metaphor prompting in the low-resource Bengali language, we also evaluate it in another low-resource language (Hindi) and two high-resource languages (English and German). The performance of all prompting techniques is evaluated using the F1 score and the environmental impact factor (IF), which measures CO$_2$ emissions, electricity usage, and computational time.
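
For readers unfamiliar with the F1 score named in this abstract, the following is a minimal sketch of how it can be computed for a binary hate/non-hate labelling. The abstract does not say whether the authors report binary, macro, or weighted F1, so the binary form, the function name `f1_score`, and the toy labels below are all assumptions made for illustration only.

```python
def f1_score(y_true, y_pred, positive=1):
    """Binary F1: harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision or recall is zero, so F1 is zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels: 1 = hate speech, 0 = not hate speech.
gold = [1, 1, 0, 0, 1]
predicted = [1, 0, 0, 1, 1]
score = f1_score(gold, predicted)
```

With these toy labels there are two true positives, one false positive, and one false negative, so precision and recall are both 2/3 and so is F1.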

Title: Graft: Integrating the Domain Knowledge via Efficient Parameter Synergy for MLLMs

Authors: Yang Dai, Jianxiang An, Tianwei Lin, Hongyang He, Hongzhe Huang, Wenqiao Zhang, Zheqi Lv, Siliang Tang, Yueting Zhuang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.23940
Pdf URL: https://arxiv.org/pdf/2506.23940
Copy Paste: [[2506.23940]] Graft: Integrating the Domain Knowledge via Efficient Parameter Synergy for MLLMs(https://arxiv.org/abs/2506.23940)
Keywords: language model, llm
Abstract: Multimodal Large Language Models (MLLMs) have achieved success across various domains. However, their applicability tends to degrade when confronted with different types of data inputs, especially for MLLMs that have been fine-tuned for specific tasks. Despite its importance, the study of knowledge sharing among domain-specific MLLMs--such as those trained for mathematics or code--remains largely underexplored. To address the fragmentation of knowledge across domain-specialized MLLMs, we propose a unified parameter integration framework that enables modular composition of expert capabilities. Our method is grounded in a novel Compatibility-Aware Parameter Splicing (CAPS) strategy, which leverages both local functional attribution and global information-theoretic signals to guide selective parameter fusion. By extending this mechanism to the low-rank adaptation layer granularity, we ensure efficient integration with minimal inference overhead. Furthermore, we introduce a domain compatibility scoring mechanism that quantifies inter-expert alignment at the activation level and correlates with downstream task utility. This principled fusion protocol allows the final model to synergize heterogeneous expertise while preserving structural modularity. Extensive evaluations across diverse multimodal benchmarks validate the effectiveness of our framework, offering a scalable path toward compositional, domain-adaptive MLLMs.

Title: Unveiling Decision-Making in LLMs for Text Classification : Extraction of influential and interpretable concepts with Sparse Autoencoders

Authors: Mathis Le Bail, Jérémie Dentan, Davide Buscaldi, Sonia Vanier
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.23951
Pdf URL: https://arxiv.org/pdf/2506.23951
Copy Paste: [[2506.23951]] Unveiling Decision-Making in LLMs for Text Classification : Extraction of influential and interpretable concepts with Sparse Autoencoders(https://arxiv.org/abs/2506.23951)
Keywords: language model, llm
Abstract: Sparse Autoencoders (SAEs) have been successfully used to probe Large Language Models (LLMs) and extract interpretable concepts from their internal representations. These concepts are linear combinations of neuron activations that correspond to human-interpretable features. In this paper, we investigate the effectiveness of SAE-based explainability approaches for sentence classification, a domain where such methods have not been extensively explored. We present a novel SAE-based architecture tailored for text classification, leveraging a specialized classifier head and incorporating an activation rate sparsity loss. We benchmark this architecture against established methods such as ConceptShap, Independent Component Analysis, and other SAE-based concept extraction techniques. Our evaluation covers two classification benchmarks and four fine-tuned LLMs from the Pythia family. We further enrich our analysis with two novel metrics for measuring the precision of concept-based explanations, using an external sentence encoder. Our empirical results show that our architecture improves both the causality and interpretability of the extracted features.

Title: TaP: A Taxonomy-Guided Framework for Automated and Scalable Preference Data Generation

Authors: Renren Jin, Tianhao Shen, Xinwei Wu, Dan Shi, Haoran Sun, Wuwei Huang, Quandong Wang, Wei Liu, Jian Luan, Bin Wang, Deyi Xiong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.23979
Pdf URL: https://arxiv.org/pdf/2506.23979
Copy Paste: [[2506.23979]] TaP: A Taxonomy-Guided Framework for Automated and Scalable Preference Data Generation(https://arxiv.org/abs/2506.23979)
Keywords: language model, llm
Abstract: Conducting supervised fine-tuning and preference fine-tuning on large language models (LLMs) requires high-quality datasets to improve their ability to follow instructions and align with human preferences and values. However, constructing such datasets is resource-intensive, and most available datasets for supervised and preference fine-tuning are in English. To address these challenges, we propose the Taxonomy-Guided Preference Data Generation (TaP) framework, which facilitates automated and scalable construction of preference datasets across various languages. TaP is grounded in a structured taxonomy that allows fine-grained control over dataset composition, thereby ensuring both diversity and comprehensive coverage. We employ TaP-generated datasets to perform supervised and preference fine-tuning on various LLMs. Experimental results demonstrate that LLMs trained on TaP-generated datasets outperform those trained on existing open-source datasets. Remarkably, LLMs trained on TaP-generated datasets surpass the performance of those trained on an open-source dataset that is 180 times larger.

Title: Auto-TA: Towards Scalable Automated Thematic Analysis (TA) via Multi-Agent Large Language Models with Reinforcement Learning

Authors: Seungjun Yi, Joakim Nguyen, Huimin Xu, Terence Lim, Andrew Well, Mia Markey, Ying Ding
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.23998
Pdf URL: https://arxiv.org/pdf/2506.23998
Copy Paste: [[2506.23998]] Auto-TA: Towards Scalable Automated Thematic Analysis (TA) via Multi-Agent Large Language Models with Reinforcement Learning(https://arxiv.org/abs/2506.23998)
Keywords: language model, llm, agent
Abstract: Congenital heart disease (CHD) presents complex, lifelong challenges often underrepresented in traditional clinical metrics. While unstructured narratives offer rich insights into patient and caregiver experiences, manual thematic analysis (TA) remains labor-intensive and unscalable. We propose a fully automated large language model (LLM) pipeline that performs end-to-end TA on clinical narratives, which eliminates the need for manual coding or full transcript review. Our system employs a novel multi-agent framework, where specialized LLM agents assume roles to enhance theme quality and alignment with human analysis. To further improve thematic relevance, we optionally integrate reinforcement learning from human feedback (RLHF). This supports scalable, patient-centered analysis of large qualitative datasets and allows LLMs to be fine-tuned for specific clinical contexts.

Title: Large Language Models Don't Make Sense of Word Problems. A Scoping Review from a Mathematics Education Perspective

Authors: Anselm R. Strohmaier, Wim Van Dooren, Kathrin Seßler, Brian Greer, Lieven Verschaffel
Subjects: cs.CL, math.HO
Abstract URL: https://arxiv.org/abs/2506.24006
Pdf URL: https://arxiv.org/pdf/2506.24006
Copy Paste: [[2506.24006]] Large Language Models Don't Make Sense of Word Problems. A Scoping Review from a Mathematics Education Perspective(https://arxiv.org/abs/2506.24006)
Keywords: language model, gpt, llm, chat
Abstract: The progress of Large Language Models (LLMs) like ChatGPT raises the question of how they can be integrated into education. One hope is that they can support mathematics learning, including word-problem solving. Since LLMs can handle textual input with ease, they appear well-suited for solving mathematical word problems. Yet their real competence, whether they can make sense of the real-world context, and the implications for classrooms remain unclear. We conducted a scoping review from a mathematics-education perspective, including three parts: a technical overview, a systematic review of word problems used in research, and a state-of-the-art empirical evaluation of LLMs on mathematical word problems. First, in the technical overview, we contrast the conceptualization of word problems and their solution processes between LLMs and students. In computer-science research this is typically labeled mathematical reasoning, a term that does not align with usage in mathematics education. Second, our literature review of 213 studies shows that the most popular word-problem corpora are dominated by s-problems, which do not require a consideration of realities of their real-world context. Finally, our evaluation of GPT-3.5-turbo, GPT-4o-mini, GPT-4.1, and o3 on 287 word problems shows that most recent LLMs solve these s-problems with near-perfect accuracy, including a perfect score on 20 problems from PISA. LLMs still showed weaknesses in tackling problems where the real-world context is problematic or non-sensical. In sum, we argue based on all three aspects that LLMs have mastered a superficial solution process but do not make sense of word problems, which potentially limits their value as instructional tools in mathematics classrooms.

Title: EXPERT: An Explainable Image Captioning Evaluation Metric with Structured Explanations

Authors: Hyunjong Kim, Sangyeop Kim, Jongheon Jeong, Yeongjae Cho, Sungzoon Cho
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2506.24016
Pdf URL: https://arxiv.org/pdf/2506.24016
Copy Paste: [[2506.24016]] EXPERT: An Explainable Image Captioning Evaluation Metric with Structured Explanations(https://arxiv.org/abs/2506.24016)
Keywords: language model
Abstract: Recent advances in large language models and vision-language models have led to growing interest in explainable evaluation metrics for image captioning. However, these metrics generate explanations without standardized criteria, and the overall quality of the generated explanations remains unverified. In this paper, we propose EXPERT, a reference-free evaluation metric that provides structured explanations based on three fundamental criteria: fluency, relevance, and descriptiveness. By constructing large-scale datasets of high-quality structured explanations, we develop a two-stage evaluation template to effectively supervise a vision-language model for both scoring and explanation generation. EXPERT achieves state-of-the-art results on benchmark datasets while providing significantly higher-quality explanations than existing metrics, as validated through comprehensive human evaluation. Our code and datasets are available at this https URL.

Title: STACK: Adversarial Attacks on LLM Safeguard Pipelines

Authors: Ian R. McKenzie, Oskar J. Hollinsworth, Tom Tseng, Xander Davies, Stephen Casper, Aaron D. Tucker, Robert Kirk, Adam Gleave
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.24068
Pdf URL: https://arxiv.org/pdf/2506.24068
Copy Paste: [[2506.24068]] STACK: Adversarial Attacks on LLM Safeguard Pipelines(https://arxiv.org/abs/2506.24068)
Keywords: llm, prompt
Abstract: Frontier AI developers are relying on layers of safeguards to protect against catastrophic misuse of AI systems. Anthropic guards their latest Claude 4 Opus model using one such defense pipeline, and other frontier developers including Google DeepMind and OpenAI pledge to soon deploy similar defenses. However, the security of such pipelines is unclear, with limited prior work evaluating or attacking these pipelines. We address this gap by developing and red-teaming an open-source defense pipeline. First, we find that a novel few-shot-prompted input and output classifier outperforms state-of-the-art open-weight safeguard model ShieldGemma across three attacks and two datasets, reducing the attack success rate (ASR) to 0% on the catastrophic misuse dataset ClearHarm. Second, we introduce a STaged AttaCK (STACK) procedure that achieves 71% ASR on ClearHarm in a black-box attack against the few-shot-prompted classifier pipeline. Finally, we also evaluate STACK in a transfer setting, achieving 33% ASR, providing initial evidence that it is feasible to design attacks with no access to the target pipeline. We conclude by suggesting specific mitigations that developers could use to thwart staged attacks.

Title: On the Predictive Power of Representation Dispersion in Language Models

Authors: Yanhong Li, Ming Li, Karen Livescu, Jiawei Zhou
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.24106
Pdf URL: https://arxiv.org/pdf/2506.24106
Copy Paste: [[2506.24106]] On the Predictive Power of Representation Dispersion in Language Models(https://arxiv.org/abs/2506.24106)
Keywords: language model
Abstract: We show that a language model's ability to predict text is tightly linked to the breadth of its embedding space: models that spread their contextual representations more widely tend to achieve lower perplexity. Concretely, we find that representation dispersion - the average pairwise cosine distance among hidden vectors - strongly and negatively correlates with perplexity across diverse model families (LLaMA, Qwen, and others) and domains (Wikipedia, news, scientific abstracts). Beyond illustrating this link, we show how dispersion can be leveraged for a range of practical tasks without requiring labeled data. First, measuring dispersion on unlabeled text allows us to predict downstream accuracy in new domains, offering a data-efficient tool for model selection. Next, we find that identifying layers with higher dispersion pinpoints the best representations for retrieval-based methods such as kNN-LM, bypassing exhaustive layer-by-layer searches. Finally, we integrate a simple push-away objective into training, which increases dispersion in both single-domain and cross-domain scenarios and directly improves perplexity in each.
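
The dispersion measure defined in this abstract (the average pairwise cosine distance among hidden vectors) can be sketched in a few lines. This is only an illustration of the definition as stated, not the authors' code: the function names and the toy two-dimensional vectors are assumptions, and real hidden states would come from a model layer rather than hand-written lists.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def representation_dispersion(vectors):
    """Average cosine distance over all unordered pairs of vectors."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        raise ValueError("need at least two vectors")
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)


# Toy "hidden vectors" (hypothetical): one tightly clustered set, one spread out.
tight = [[1.0, 0.0], [1.0, 0.1], [1.0, -0.1]]
spread = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

tight_dispersion = representation_dispersion(tight)
spread_dispersion = representation_dispersion(spread)
```

The spread-out set has pairwise distances 1, 2, and 1, so its dispersion is 4/3, while the clustered set stays close to 0. Note that this all-pairs loop is quadratic in the number of vectors, so a practical version would likely sample pairs or use vectorized operations.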

Title: Computational Detection of Intertextual Parallels in Biblical Hebrew: A Benchmark Study Using Transformer-Based Language Models

Authors: David M. Smiley
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.24117
Pdf URL: https://arxiv.org/pdf/2506.24117
Copy Paste: [[2506.24117]] Computational Detection of Intertextual Parallels in Biblical Hebrew: A Benchmark Study Using Transformer-Based Language Models(https://arxiv.org/abs/2506.24117)
Keywords: language model
Abstract: Identifying parallel passages in biblical Hebrew is foundational in biblical scholarship for uncovering intertextual relationships. Traditional methods rely on manual comparison, which is labor-intensive and prone to human error. This study evaluates the potential of pre-trained transformer-based language models, including E5, AlephBERT, MPNet, and LaBSE, for detecting textual parallels in the Hebrew Bible. Focusing on known parallels between the books of Samuel/Kings and Chronicles, I assessed each model's capability to generate word embeddings that delineate parallel from non-parallel passages. Utilizing cosine similarity and Wasserstein Distance measures, I found that E5 and AlephBERT show significant promise, with E5 excelling in parallel detection and AlephBERT demonstrating stronger non-parallel differentiation. These findings indicate that pre-trained models can enhance the efficiency and accuracy of detecting intertextual parallels in ancient texts, suggesting broader applications for ancient language studies.
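
The two measures named in this abstract can be sketched as follows. The abstract does not describe how the Wasserstein distance is applied to the embeddings, so this example makes an explicit assumption: it scores each passage pair with cosine similarity and then uses the one-dimensional Wasserstein distance to compare the score distribution of parallel pairs with that of non-parallel pairs. The vectors are toy values standing in for model embeddings, and all names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def wasserstein_1d(xs, ys):
    """Wasserstein-1 distance between two equal-size 1-D samples.

    For equal-size empirical samples this is the mean absolute difference
    between the sorted values.
    """
    if not xs or len(xs) != len(ys):
        raise ValueError("samples must be non-empty and of equal size")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


# Toy embedding pairs (assumed values, not real model output).
parallel_pairs = [([1.0, 0.0], [0.9, 0.1]), ([0.0, 1.0], [0.1, 0.9])]
non_parallel_pairs = [([1.0, 0.0], [0.0, 1.0]), ([0.0, 1.0], [1.0, 0.0])]

parallel_scores = [cosine_similarity(u, v) for u, v in parallel_pairs]
non_parallel_scores = [cosine_similarity(u, v) for u, v in non_parallel_pairs]

# A larger gap means the two groups of passage pairs are easier to tell apart.
separation = wasserstein_1d(parallel_scores, non_parallel_scores)
```

Under this assumed setup, a model whose embeddings separate parallel from non-parallel passages well would yield a large `separation` value.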