2025-06-03

Title: Amadeus-Verbo Technical Report: The powerful Qwen2.5 family models trained in Portuguese

Authors: William Alberto Cruz-Castañeda, Marcellus Amadeus
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.00019
Pdf URL: https://arxiv.org/pdf/2506.00019
Copy Paste: [[2506.00019]] Amadeus-Verbo Technical Report: The powerful Qwen2.5 family models trained in Portuguese(https://arxiv.org/abs/2506.00019)
Keywords: language model, llm
Abstract: This report introduces the experience of developing Amadeus Verbo, a family of large language models for Brazilian Portuguese. To handle diverse use cases, Amadeus Verbo includes base-tuned, merged, and instruction-tuned models in sizes of 0.5B, 1.5B, 3B, 7B, 14B, 32B, and 72B parameters. Thus, the main objective is to show how easy it is to fine-tune foundation models to democratize the open-source development of Brazilian Portuguese LLMs when data and resources are available. Amadeus-Verbo family models are all available at HuggingFace at this https URL.
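Summary: This report describes the experience of developing Amadeus Verbo, a family of large language models for Brazilian Portuguese. The family covers base-tuned, merged, and instruction-tuned models at 0.5B, 1.5B, 3B, 7B, 14B, 32B, and 72B parameters. Its main aim is to show how readily foundation models can be fine-tuned, given data and resources, to democratize open-source development of Brazilian Portuguese LLMs. All Amadeus-Verbo models are available on HuggingFace.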

Title: Scaling Physical Reasoning with the PHYSICS Dataset

Authors: Shenghe Zheng, Qianjia Cheng, Junchi Yao, Mengsong Wu, haonan he, Ning Ding, Yu Cheng, Shuyue Hu, Lei Bai, Dongzhan Zhou, Ganqu Cui, Peng Ye
Subjects: cs.CL, cs.LG, physics.ed-ph
Abstract URL: https://arxiv.org/abs/2506.00022
Pdf URL: https://arxiv.org/pdf/2506.00022
Copy Paste: [[2506.00022]] Scaling Physical Reasoning with the PHYSICS Dataset(https://arxiv.org/abs/2506.00022)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have achieved remarkable progress on advanced reasoning tasks such as mathematics and coding competitions. Meanwhile, physics, despite being both reasoning-intensive and essential to real-world understanding, has received limited academic and industrial attention. This paper introduces PHYSICS, a dataset containing 16,568 high-quality physics problems spanning subjects and difficulty levels, to address this gap. Specifically, PHYSICS is curated with exercises from over 100 textbooks through a carefully designed pipeline for quality control. It covers five major physics domains: Mechanics, Electromagnetism, Thermodynamics, Optics, and Modern Physics. It also spans a wide range of difficulty levels, from high school to graduate-level physics courses. To utilize the data for improving and evaluating the model's physical reasoning capabilities, we split the dataset into training and test sets, and provide reasoning paths generated by powerful reasoning models for the training data to facilitate model training. In addition, for the evaluation part, we find that existing evaluation frameworks exhibit biases in aspects such as units, simplification, and precision in the physics domain. To balance efficiency and accuracy, we introduce a Rule+Model evaluation framework tailored to physics problems. Our evaluations on current state-of-the-art open-source and proprietary models highlight the limitations of current models in handling physics-related tasks. We hope that our dataset and evaluation methodology will jointly advance the development of LLMs in the field of physics.
Summary: LLMs have made notable progress on advanced reasoning tasks such as mathematics and coding competitions, while physics, although reasoning-intensive and essential to understanding the real world, has drawn limited academic and industrial attention. The paper introduces PHYSICS, a dataset of 16,568 high-quality physics problems curated from exercises in over 100 textbooks through a quality-control pipeline. It covers five domains (Mechanics, Electromagnetism, Thermodynamics, Optics, and Modern Physics) and ranges from high school to graduate level. The dataset is split into training and test sets, with reasoning paths generated by strong reasoning models provided for the training data. Because existing evaluation frameworks show biases around units, simplification, and precision, the authors introduce a Rule+Model evaluation framework tailored to physics. Evaluations of current open-source and proprietary models highlight their limitations on physics tasks.

Title: From Mathematical Reasoning to Code: Generalization of Process Reward Models in Test-Time Scaling

Authors: Zhengyu Chen, Yudong Wang, Teng Xiao, Ruochen Zhou, Xuesheng Yang, Wei Wang, Zhifang Sui, Jingang Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.00027
Pdf URL: https://arxiv.org/pdf/2506.00027
Copy Paste: [[2506.00027]] From Mathematical Reasoning to Code: Generalization of Process Reward Models in Test-Time Scaling(https://arxiv.org/abs/2506.00027)
Keywords: language model
Abstract: Recent advancements in improving the reasoning capabilities of Large Language Models have underscored the efficacy of Process Reward Models (PRMs) in addressing intermediate errors through structured feedback mechanisms. This study analyzes PRMs from multiple perspectives, including training methodologies, scalability, and generalization capabilities. We investigate the interplay between pre-training and reward model training FLOPs to assess their influence on PRM efficiency and accuracy in complex reasoning tasks. Our analysis reveals a pattern of diminishing returns in performance with increasing PRM scale, highlighting the importance of balancing model size and computational cost. Furthermore, the diversity of training datasets significantly impacts PRM performance, emphasizing the importance of diverse data to enhance both accuracy and efficiency. We further examine test-time scaling strategies, identifying Monte Carlo Tree Search as the most effective method when computational resources are abundant, while Best-of-N Sampling serves as a practical alternative under resource-limited conditions. Notably, our findings indicate that PRMs trained on mathematical datasets exhibit performance comparable to those tailored for code generation, suggesting robust cross-domain generalization. Employing a gradient-based metric, we observe that PRMs exhibit a preference for selecting responses with similar underlying patterns, further informing their optimization.
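Summary: Recent progress in LLM reasoning has underscored the value of Process Reward Models (PRMs) for catching intermediate errors through structured feedback. This study analyzes PRMs in terms of training methodology, scalability, and generalization. It examines how pre-training FLOPs and reward-model training FLOPs interact to affect PRM efficiency and accuracy on complex reasoning tasks, and finds diminishing returns as PRM scale grows. Training-data diversity significantly affects PRM performance. Among test-time scaling strategies, Monte Carlo Tree Search is the most effective when compute is abundant, while Best-of-N sampling is a practical alternative under limited resources. PRMs trained on mathematical datasets perform comparably to those tailored for code generation, suggesting robust cross-domain generalization. A gradient-based metric shows that PRMs prefer responses with similar underlying patterns.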
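The Best-of-N strategy mentioned in this entry can be sketched as follows. This is a minimal illustration, not the paper's implementation: the aggregation rule (taking the minimum step reward) is an assumption, and `toy_prm` is a hypothetical stand-in for a trained process reward model.

```python
from typing import Callable, List, Sequence


def score_solution(step_rewards: Sequence[float]) -> float:
    """Collapse per-step rewards into one score.

    Assumption: a solution is only as good as its weakest step, so use min().
    """
    if not step_rewards:
        raise ValueError("a solution needs at least one scored step")
    return min(step_rewards)


def best_of_n(
    candidates: List[List[str]],
    prm: Callable[[List[str]], List[float]],
) -> List[str]:
    """Return the candidate whose aggregated step reward is highest."""
    return max(candidates, key=lambda steps: score_solution(prm(steps)))


def toy_prm(steps: List[str]) -> List[float]:
    # Hypothetical scorer: rewards steps that contain an explicit equation.
    return [0.9 if "=" in step else 0.2 for step in steps]


candidates = [
    ["guess the answer", "x = 4"],
    ["2x = 8", "x = 4"],
]
best = best_of_n(candidates, toy_prm)
```

With the toy scorer, the second candidate wins because every one of its steps is rewarded, whereas the first contains a low-scoring step.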

Title: Enhancing Tool Learning in Large Language Models with Hierarchical Error Checklists

Authors: Yue Cui, Liuyi Yao, Shuchang Tao, Weijie Shi, Yaliang Li, Bolin Ding, Xiaofang Zhou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.00042
Pdf URL: https://arxiv.org/pdf/2506.00042
Copy Paste: [[2506.00042]] Enhancing Tool Learning in Large Language Models with Hierarchical Error Checklists(https://arxiv.org/abs/2506.00042)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) have significantly advanced natural language processing, particularly through the integration of external tools and APIs. However, their effectiveness is frequently hampered by parameter mis-filling during tool calling. In this paper, we propose the Hierarchical Tool Error Checklist (HiTEC) framework to systematically diagnose and mitigate tool-calling errors without relying on extensive real-world interactions. HiTEC introduces a two-tiered approach: a global error checklist that identifies common, cross-tool issues, and a local error checklist that targets tool-specific and contextual failures. Building on this structure, we propose two deployments: HiTEC-In Context Learning (HiTEC-ICL) and HiTEC-Kahneman-Tversky Optimization (HiTEC-KTO). HiTEC-ICL embeds the global checklist in the initial prompts and leverages a two-round conversational interaction to dynamically refine parameter handling, while HiTEC-KTO generates high-quality negative examples to drive fine-tuning via preference-based optimization. Extensive experiments across five public datasets demonstrate that our framework significantly improves parameter-filling accuracy and tool-calling success rates compared to baseline methods.
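Summary: LLMs have significantly advanced natural language processing, particularly through integration with external tools and APIs, but parameter mis-filling during tool calling often limits their effectiveness. The paper proposes the Hierarchical Tool Error Checklist (HiTEC) framework to systematically diagnose and mitigate tool-calling errors without relying on extensive real-world interaction. HiTEC is two-tiered: a global error checklist captures common, cross-tool issues, and a local error checklist targets tool-specific and contextual failures. Two deployments build on it: HiTEC-In Context Learning (HiTEC-ICL) and HiTEC-Kahneman-Tversky Optimization (HiTEC-KTO). HiTEC-ICL embeds the global checklist in the initial prompt and uses a two-round conversation to refine parameter handling, while HiTEC-KTO generates high-quality negative examples to drive fine-tuning through preference-based optimization. Experiments on five public datasets show significant gains in parameter-filling accuracy and tool-calling success rate over baselines.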

Title: Unraveling SITT: Social Influence Technique Taxonomy and Detection with LLMs

Authors: Wiktoria Mieleszczenko-Kowszewicz, Beata Bajcar, Aleksander Szczęsny, Maciej Markiewicz, Jolanta Babiak, Berenika Dyczek, Przemysław Kazienko
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.00061
Pdf URL: https://arxiv.org/pdf/2506.00061
Copy Paste: [[2506.00061]] Unraveling SITT: Social Influence Technique Taxonomy and Detection with LLMs(https://arxiv.org/abs/2506.00061)
Keywords: gpt, llm
Abstract: In this work we present the Social Influence Technique Taxonomy (SITT), a comprehensive framework of 58 empirically grounded techniques organized into nine categories, designed to detect subtle forms of social influence in textual content. We also investigate the LLMs' ability to identify various forms of social influence. Building on interdisciplinary foundations, we construct the SITT dataset -- a 746-dialogue corpus annotated by 11 experts in Polish and translated into English -- to evaluate the ability of LLMs to identify these techniques. Using a hierarchical multi-label classification setup, we benchmark five LLMs, including GPT-4o, Claude 3.5, Llama-3.1, Mixtral, and PLLuM. Our results show that while some models, notably Claude 3.5, achieved moderate success (F1 score = 0.45 for categories), overall performance of models remains limited, particularly for context-sensitive techniques. The findings demonstrate key limitations in current LLMs' sensitivity to nuanced linguistic cues and underscore the importance of domain-specific fine-tuning. This work contributes a novel resource and evaluation example for understanding how LLMs detect, classify, and potentially replicate strategies of social influence in natural dialogues.
Summary: The paper presents the Social Influence Technique Taxonomy (SITT), a framework of 58 empirically grounded techniques in nine categories for detecting subtle social influence in text. The authors build the SITT dataset, a 746-dialogue corpus annotated in Polish by 11 experts and translated into English, and use it to test how well LLMs identify these techniques. In a hierarchical multi-label classification setup they benchmark five LLMs: GPT-4o, Claude 3.5, Llama-3.1, Mixtral, and PLLuM. Some models, notably Claude 3.5, achieve moderate success (F1 = 0.45 for categories), but overall performance remains limited, especially for context-sensitive techniques. The findings point to limits in current LLMs' sensitivity to nuanced linguistic cues and to the importance of domain-specific fine-tuning.
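For readers unfamiliar with how an F1 score is computed in a multi-label setting, the sketch below shows one common variant, micro-averaged F1 over sets of labels. The abstract does not say which averaging scheme the authors use, so micro-averaging is an assumption, and the label names are hypothetical rather than taken from SITT.

```python
from typing import List, Set


def micro_f1(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Micro-averaged F1: pool true/false positives over all examples."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))  # labels predicted and correct
    fp = sum(len(p - g) for g, p in zip(gold, pred))  # labels predicted but wrong
    fn = sum(len(g - p) for g, p in zip(gold, pred))  # labels missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical category labels for two dialogues.
gold = [{"authority", "scarcity"}, {"reciprocity"}]
pred = [{"authority"}, {"reciprocity", "liking"}]
score = micro_f1(gold, pred)
```

Here two labels are correct, one is spurious, and one is missed, so precision and recall are both 2/3 and the F1 score is 2/3 as well.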

Title: Mis-prompt: Benchmarking Large Language Models for Proactive Error Handling

Authors: Jiayi Zeng, Yizhe Feng, Mengliang He, Wenhui Lei, Wei Zhang, Zeming Liu, Xiaoming Shi, Aimin Zhou
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.00064
Pdf URL: https://arxiv.org/pdf/2506.00064
Copy Paste: [[2506.00064]] Mis-prompt: Benchmarking Large Language Models for Proactive Error Handling(https://arxiv.org/abs/2506.00064)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) have demonstrated significant advancements in error handling. Current error-handling works are performed in a passive manner, with explicit error-handling instructions. However, in real-world scenarios, explicit error-handling instructions are usually unavailable. In this paper, our work identifies this challenge as how to conduct proactive error handling without explicit error handling instructions. To promote further research, this work introduces a new benchmark, termed Mis-prompt, consisting of four evaluation tasks, an error category taxonomy, and a new evaluation dataset. Furthermore, this work analyzes current LLMs' performance on the benchmark, and the experimental results reveal that current LLMs show poor performance on proactive error handling, and SFT on error handling instances improves LLMs' proactive error handling capabilities. The dataset will be publicly available.
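Summary: LLMs have shown significant progress in error handling, but current approaches are passive and depend on explicit error-handling instructions, which are usually unavailable in real-world scenarios. The paper frames this as the challenge of proactive error handling without explicit instructions. It introduces a new benchmark, Mis-prompt, consisting of four evaluation tasks, an error category taxonomy, and a new evaluation dataset. Experiments show that current LLMs perform poorly at proactive error handling and that SFT on error-handling instances improves this capability. The dataset will be made publicly available.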

Title: You Prefer This One, I Prefer Yours: Using Reference Words is Harder Than Vocabulary Words for Humans and Multimodal Language Models

Authors: Dota Tianai Dong, Yifan Luo, Po-Ya Angela Wang, Asli Ozyurek, Paula Rubio-Fernandez
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.00065
Pdf URL: https://arxiv.org/pdf/2506.00065
Copy Paste: [[2506.00065]] You Prefer This One, I Prefer Yours: Using Reference Words is Harder Than Vocabulary Words for Humans and Multimodal Language Models(https://arxiv.org/abs/2506.00065)
Keywords: language model, prompt
Abstract: Multimodal language models (MLMs) increasingly communicate in human-like ways, yet their ability to use reference words remains largely overlooked despite their ubiquity in everyday communication. Our study addresses this gap by comparing human and MLM use of three word classes with increasing cognitive demands: vocabulary words, possessive pronouns ('mine' vs 'yours'), and demonstrative pronouns ('this one' vs 'that one'). Evaluating seven state-of-the-art MLMs against human participants, we observe a clear difficulty hierarchy: while MLMs approach human-level performance on the vocabulary task, they show substantial deficits with possessives and demonstratives. Our analysis reveals these difficulties stem from limitations in perspective-taking and spatial reasoning. Although prompt engineering improved model performance on possessive use, demonstrative use remained well below human-level competence. These findings provide theoretical and empirical evidence that producing grammatical forms requiring pragmatics and social cognition remains a clear challenge in current NLP systems.
Summary: Multimodal language models (MLMs) increasingly communicate in human-like ways, yet their ability to use reference words has been largely overlooked. The study compares human and MLM use of three word classes of increasing cognitive demand: vocabulary words, possessive pronouns ('mine' vs 'yours'), and demonstrative pronouns ('this one' vs 'that one'). Across seven state-of-the-art MLMs, a clear difficulty hierarchy emerges: models approach human-level performance on vocabulary but show substantial deficits with possessives and demonstratives, which the analysis traces to limits in perspective-taking and spatial reasoning. Prompt engineering improved possessive use, but demonstrative use stayed well below human-level competence. The findings indicate that producing grammatical forms that require pragmatics and social cognition remains a clear challenge for current NLP systems.

Title: Probing Politico-Economic Bias in Multilingual Large Language Models: A Cultural Analysis of Low-Resource Pakistani Languages

Authors: Afrozah Nadeem, Mark Dras, Usman Naseem
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.00068
Pdf URL: https://arxiv.org/pdf/2506.00068
Copy Paste: [[2506.00068]] Probing Politico-Economic Bias in Multilingual Large Language Models: A Cultural Analysis of Low-Resource Pakistani Languages(https://arxiv.org/abs/2506.00068)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) are increasingly shaping public discourse, yet their politico-economic biases remain underexamined in non-Western and low-resource multilingual contexts. This paper presents a systematic analysis of political bias in 13 state-of-the-art LLMs across five low-resource languages spoken in Pakistan: Urdu, Punjabi, Sindhi, Balochi, and Pashto. We propose a novel framework that integrates an adapted Political Compass Test (PCT) with a multi-level framing analysis. Our method combines quantitative assessment of political orientation across economic (left-right) and social (libertarian-authoritarian) axes with qualitative analysis of framing through content, style, and emphasis. We further contextualize this analysis by aligning prompts with 11 key socio-political themes relevant to Pakistani society. Our results reveal that LLMs predominantly align with liberal-left values, echoing Western training data influences, but exhibit notable shifts toward authoritarian framing in regional languages, suggesting strong cultural modulation effects. We also identify consistent model-specific bias signatures and language-conditioned variations in ideological expression. These findings show the urgent need for culturally grounded, multilingual bias auditing frameworks.
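Summary: LLMs increasingly shape public discourse, yet their politico-economic biases remain underexamined in non-Western, low-resource multilingual contexts. The paper systematically analyzes political bias in 13 state-of-the-art LLMs across five low-resource languages spoken in Pakistan: Urdu, Punjabi, Sindhi, Balochi, and Pashto. Its framework combines an adapted Political Compass Test (PCT) with multi-level framing analysis, pairing quantitative assessment of political orientation on economic (left-right) and social (libertarian-authoritarian) axes with qualitative analysis of framing through content, style, and emphasis. Prompts are aligned with 11 key socio-political themes relevant to Pakistani society. The results show that LLMs predominantly align with liberal-left values, echoing Western training data, but shift notably toward authoritarian framing in regional languages, suggesting strong cultural modulation effects. The authors also identify consistent model-specific bias signatures and language-conditioned variation in ideological expression, and argue for culturally grounded, multilingual bias auditing frameworks.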

Title: Evaluating the Sensitivity of LLMs to Prior Context

Authors: Robert Hankache, Kingsley Nketia Acheampong, Liang Song, Marek Brynda, Raad Khraishi, Greig A. Cowan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.00069
Pdf URL: https://arxiv.org/pdf/2506.00069
Copy Paste: [[2506.00069]] Evaluating the Sensitivity of LLMs to Prior Context(https://arxiv.org/abs/2506.00069)
Keywords: language model, gpt, llm
Abstract: As large language models (LLMs) are increasingly deployed in multi-turn dialogue and other sustained interactive scenarios, it is essential to understand how extended context affects their performance. Popular benchmarks, focusing primarily on single-turn question answering (QA) tasks, fail to capture the effects of multi-turn exchanges. To address this gap, we introduce a novel set of benchmarks that systematically vary the volume and nature of prior context. We evaluate multiple conventional LLMs, including GPT, Claude, and Gemini, across these benchmarks to measure their sensitivity to contextual variations. Our findings reveal that LLM performance on multiple-choice questions can degrade dramatically in multi-turn interactions, with performance drops as large as 73% for certain models. Even highly capable models such as GPT-4o exhibit up to a 32% decrease in accuracy. Notably, the relative performance of larger versus smaller models is not always predictable. Moreover, the strategic placement of the task description within the context can substantially mitigate performance drops, improving the accuracy by as much as a factor of 3.5. These findings underscore the need for robust strategies to design, evaluate, and mitigate context-related sensitivity in LLMs.
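Summary: As LLMs are increasingly deployed in multi-turn dialogue and other sustained interactive scenarios, it is essential to understand how extended context affects their performance. Popular benchmarks focus on single-turn question answering (QA) and miss the effects of multi-turn exchanges. The authors introduce a set of benchmarks that systematically vary the volume and nature of prior context and evaluate several LLMs, including GPT, Claude, and Gemini. Performance on multiple-choice questions can degrade dramatically in multi-turn interactions, with drops as large as 73% for certain models; even highly capable models such as GPT-4o lose up to 32% in accuracy. The relative performance of larger versus smaller models is not always predictable. Strategic placement of the task description within the context can substantially mitigate the drops, improving accuracy by as much as a factor of 3.5.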
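The figures quoted in this entry (a 73% drop, a factor-of-3.5 improvement) are ratios, and a small worked calculation makes their meaning concrete. The sketch below assumes the drop is measured relative to the baseline accuracy, which the abstract does not state explicitly; all accuracy values are hypothetical and chosen only to reproduce the quoted magnitudes.

```python
def relative_drop(baseline: float, degraded: float) -> float:
    """Fraction of the baseline accuracy that was lost."""
    if baseline <= 0:
        raise ValueError("baseline accuracy must be positive")
    return (baseline - degraded) / baseline


def improvement_factor(before: float, after: float) -> float:
    """Multiplicative gain in accuracy after a mitigation."""
    if before <= 0:
        raise ValueError("starting accuracy must be positive")
    return after / before


# Hypothetical numbers: single-turn accuracy 0.80 falling to 0.216 in multi-turn use.
drop = relative_drop(0.80, 0.216)

# Hypothetical numbers: accuracy 0.20 rising to 0.70 after moving the task description.
factor = improvement_factor(0.20, 0.70)
```

Under these assumed values the relative drop is about 0.73 and the improvement factor is about 3.5. If the paper instead reports absolute percentage-point drops, the first function would simply become `baseline - degraded`.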

Title: Gaussian mixture models as a proxy for interacting language models

Authors: Edward Wang, Tianyu Wang, Avanti Athreya, Vince Lyzinski, Carey E. Priebe
Subjects: cs.CL, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2506.00077
Pdf URL: https://arxiv.org/pdf/2506.00077
Copy Paste: [[2506.00077]] Gaussian mixture models as a proxy for interacting language models(https://arxiv.org/abs/2506.00077)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Large language models (LLMs) are a powerful tool with the ability to match human capabilities and behavior in many settings. Retrieval-augmented generation (RAG) further allows LLMs to generate diverse output depending on the contents of their RAG database. This motivates their use in the social sciences to study human behavior between individuals when large-scale experiments are infeasible. However, LLMs depend on complex, computationally expensive algorithms. In this paper, we introduce interacting Gaussian mixture models (GMMs) as an alternative to similar frameworks using LLMs. We compare a simplified model of GMMs to select experimental simulations of LLMs whose updating and response depend on feedback from other LLMs. We find that interacting GMMs capture important features of the dynamics in interacting LLMs, and we investigate key similarities and differences between interacting LLMs and GMMs. We conclude by discussing the benefits of Gaussian mixture models, potential modifications, and future research directions.
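Summary: LLMs are a powerful tool able to match human capabilities and behavior in many settings, and retrieval-augmented generation (RAG) lets them produce diverse output depending on the contents of their RAG database. This motivates their use in the social sciences to study behavior between individuals when large-scale experiments are infeasible. However, LLMs depend on complex, computationally expensive algorithms. The paper introduces interacting Gaussian mixture models (GMMs) as an alternative to similar LLM-based frameworks and compares a simplified GMM model with selected experimental simulations of LLMs whose updating and responses depend on feedback from other LLMs. Interacting GMMs capture important features of the dynamics of interacting LLMs; the authors examine key similarities and differences and close with the benefits of GMMs, potential modifications, and future research directions.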

Title: COSMIC: Generalized Refusal Direction Identification in LLM Activations

Authors: Vincent Siu, Nicholas Crispino, Zihao Yu, Sam Pan, Zhun Wang, Yang Liu, Dawn Song, Chenguang Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.00085
Pdf URL: https://arxiv.org/pdf/2506.00085
Copy Paste: [[2506.00085]] COSMIC: Generalized Refusal Direction Identification in LLM Activations(https://arxiv.org/abs/2506.00085)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) encode behaviors such as refusal within their activation space, yet identifying these behaviors remains a significant challenge. Existing methods often rely on predefined refusal templates detectable in output tokens or require manual analysis. We introduce COSMIC (Cosine Similarity Metrics for Inversion of Concepts), an automated framework for direction selection that identifies viable steering directions and target layers using cosine similarity - entirely independent of model outputs. COSMIC achieves steering performance comparable to prior methods without requiring assumptions about a model's refusal behavior, such as the presence of specific refusal tokens. It reliably identifies refusal directions in adversarial settings and weakly aligned models, and is capable of steering such models toward safer behavior with minimal increase in false refusals, demonstrating robustness across a wide range of alignment conditions.
Summary: LLMs encode behaviors such as refusal within their activation space, yet identifying these behaviors remains difficult. Existing methods often rely on predefined refusal templates detectable in output tokens or require manual analysis. COSMIC (Cosine Similarity Metrics for Inversion of Concepts) is an automated direction-selection framework that uses cosine similarity to identify viable steering directions and target layers, entirely independent of model outputs. It matches the steering performance of prior methods without assumptions about a model's refusal behavior, such as the presence of specific refusal tokens. It reliably identifies refusal directions in adversarial settings and weakly aligned models, and can steer such models toward safer behavior with minimal increase in false refusals.
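The two building blocks named in this entry, a candidate direction in activation space and a cosine-similarity score, can be illustrated with a toy calculation. This is not the COSMIC algorithm: deriving the direction as a difference of mean activations is an assumption borrowed from common practice, and the two-dimensional "activations" are made-up numbers.

```python
import math
from typing import List, Sequence


def mean_vector(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Component-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Made-up activations for prompts that are refused vs. answered.
refused = [[1.0, 2.0], [3.0, 2.0]]
answered = [[0.0, 2.0], [2.0, 2.0]]

# Assumed candidate direction: difference of the two class means.
direction = [r - a for r, a in zip(mean_vector(refused), mean_vector(answered))]

aligned = cosine_similarity([2.0, 0.0], direction)
orthogonal = cosine_similarity([0.0, 5.0], direction)
```

The candidate direction here is `[1.0, 0.0]`, so a vector along the first axis scores 1 and a vector along the second axis scores 0, regardless of their lengths.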

Title: SwitchLingua: The First Large-Scale Multilingual and Multi-Ethnic Code-Switching Dataset

Authors: Peng Xie, Xingyuan Liu, Tsz Wai Chan, Yequan Bie, Yangqiu Song, Yang Wang, Hao Chen, Kani Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.00087
Pdf URL: https://arxiv.org/pdf/2506.00087
Copy Paste: [[2506.00087]] SwitchLingua: The First Large-Scale Multilingual and Multi-Ethnic Code-Switching Dataset(https://arxiv.org/abs/2506.00087)
Keywords: agent
Abstract: Code-switching (CS) is the alternating use of two or more languages within a conversation or utterance, often influenced by social context and speaker identity. This linguistic phenomenon poses challenges for Automatic Speech Recognition (ASR) systems, which are typically designed for a single language and struggle to handle multilingual inputs. The growing global demand for multilingual applications, including Code-Switching ASR (CSASR), Text-to-Speech (CSTTS), and Cross-Lingual Information Retrieval (CLIR), highlights the inadequacy of existing monolingual datasets. Although some code-switching datasets exist, most are limited to bilingual mixing within homogeneous ethnic groups, leaving a critical need for a large-scale, diverse benchmark akin to ImageNet in computer vision. To bridge this gap, we introduce LinguaMaster, a multi-agent collaboration framework specifically designed for efficient and scalable multilingual data synthesis. Leveraging this framework, we curate SwitchLingua, the first large-scale multilingual and multi-ethnic code-switching dataset, including: (1) 420K CS textual samples across 12 languages, and (2) over 80 hours of audio recordings from 174 speakers representing 18 countries/regions and 63 racial/ethnic backgrounds, based on the textual data. This dataset captures rich linguistic and cultural diversity, offering a foundational resource for advancing multilingual and multicultural research. Furthermore, to address the issue that existing ASR evaluation metrics lack sensitivity to code-switching scenarios, we propose the Semantic-Aware Error Rate (SAER), a novel evaluation metric that incorporates semantic information, providing a more accurate and context-aware assessment of system performance.
Summary: Code-switching (CS), the alternating use of two or more languages within a conversation or utterance, challenges Automatic Speech Recognition (ASR) systems that are typically designed for a single language. Growing demand for multilingual applications, including Code-Switching ASR (CSASR), Text-to-Speech (CSTTS), and Cross-Lingual Information Retrieval (CLIR), exposes the inadequacy of existing monolingual datasets, and most existing code-switching datasets are limited to bilingual mixing within homogeneous ethnic groups. The authors introduce LinguaMaster, a multi-agent collaboration framework for efficient and scalable multilingual data synthesis, and use it to curate SwitchLingua, which comprises 420K CS text samples across 12 languages and over 80 hours of audio from 174 speakers representing 18 countries/regions and 63 racial/ethnic backgrounds. They also propose the Semantic-Aware Error Rate (SAER), an evaluation metric that incorporates semantic information to assess system performance in code-switching scenarios more accurately.

Title: HD-NDEs: Neural Differential Equations for Hallucination Detection in LLMs

Authors: Qing Li, Jiahui Geng, Zongxiong Chen, Derui Zhu, Yuxia Wang, Congbo Ma, Chenyang Lyu, Fakhri Karray
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.00088
Pdf URL: https://arxiv.org/pdf/2506.00088
Copy Paste: [[2506.00088]] HD-NDEs: Neural Differential Equations for Hallucination Detection in LLMs(https://arxiv.org/abs/2506.00088)
Keywords: language model, llm, hallucination
Abstract: In recent years, large language models (LLMs) have made remarkable advancements, yet hallucination, where models produce inaccurate or non-factual statements, remains a significant challenge for real-world deployment. Although current classification-based methods, such as SAPLMA, are highly efficient in mitigating hallucinations, they struggle when non-factual information arises in the early or mid-sequence of outputs, reducing their reliability. To address these issues, we propose Hallucination Detection-Neural Differential Equations (HD-NDEs), a novel method that systematically assesses the truthfulness of statements by capturing the full dynamics of LLMs within their latent space. Our approaches apply neural differential equations (Neural DEs) to model the dynamic system in the latent space of LLMs. Then, the sequence in the latent space is mapped to the classification space for truth assessment. The extensive experiments across five datasets and six widely used LLMs demonstrate the effectiveness of HD-NDEs, especially, achieving over 14% improvement in AUC-ROC on the True-False dataset compared to state-of-the-art techniques.
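Summary: LLMs have advanced remarkably in recent years, yet hallucination, where models produce inaccurate or non-factual statements, remains a significant challenge for real-world deployment. Current classification-based methods such as SAPLMA are efficient but struggle when non-factual information appears early or mid-sequence in the output, which reduces their reliability. The paper proposes Hallucination Detection-Neural Differential Equations (HD-NDEs), which assess the truthfulness of statements by capturing the full dynamics of LLMs in their latent space. Neural differential equations (Neural DEs) model the dynamic system in the latent space, and the resulting sequence is mapped to a classification space for truth assessment. Experiments on five datasets and six widely used LLMs demonstrate the method's effectiveness, including an improvement of over 14% in AUC-ROC on the True-False dataset compared with state-of-the-art techniques.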
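AUC-ROC, the metric reported in this entry, can be computed directly from its rank interpretation: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below is a generic illustration with made-up labels and scores, not the paper's evaluation code.

```python
from typing import Sequence


def auc_roc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC-ROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Made-up detector scores: label 1 marks a hallucinated statement.
labels = [1, 0, 1, 0]
scores = [0.9, 0.4, 0.35, 0.1]
auc = auc_roc(labels, scores)
```

Three of the four positive-negative pairs are ordered correctly, so the AUC is 0.75. The pairwise loop is quadratic and is meant for clarity; a sort-based computation is preferable on large evaluation sets.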

Title: Writing-Zero: Bridge the Gap Between Non-verifiable Problems and Verifiable Rewards

Authors: Xun Lu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.00103
Pdf URL: https://arxiv.org/pdf/2506.00103
Copy Paste: [[2506.00103]] Writing-Zero: Bridge the Gap Between Non-verifiable Problems and Verifiable Rewards(https://arxiv.org/abs/2506.00103)
Keywords: language model, llm
Abstract: Reinforcement learning with verifiable rewards (RLVR) has enabled large language models (LLMs) to achieve remarkable breakthroughs in reasoning tasks with objective ground-truth answers, such as mathematics and code generation. However, a significant gap remains for non-verifiable tasks, like creative writing and open-ended dialogue, where quality assessment is inherently subjective and lacks definitive references. Existing approaches for these domains often rely on scalar reward models trained with human preferences, which suffer from limited generalization and are prone to reward hacking, such as over-explanation and length bias. In this work, we propose a unified RLVR-based training paradigm that bridges the gap between non-verifiable tasks and verifiable rewards. We introduce a writing-principle-based pairwise Generative Reward Model (GenRM) and a novel Bootstrapped Relative Policy Optimization (BRPO) algorithm. The pairwise writing GenRM leverages self-principled critique to transform subjective assessments into reliable, verifiable rewards, while BRPO enables dynamic, reference-free pairwise comparison by leveraging a bootstrapped response as temporary reference from within group rollouts during RL training. Our approach empowers LLMs to develop robust writing capabilities without supervised fine-tuning, as demonstrated by Writing-Zero, which shows consistent improvement and strong resistance to reward hacking compared to scalar reward baselines. Furthermore, our method achieves competitive results on both in-house and open-source writing benchmarks. Our findings suggest the potential to unify rule-based, reference-based, and reference-free reward modeling under the RLVR framework, thus paving the way for a comprehensive and scalable RL training paradigm applicable across all language tasks.
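Summary: Reinforcement learning with verifiable rewards (RLVR) has enabled LLMs to achieve breakthroughs on reasoning tasks with objective ground-truth answers, such as mathematics and code generation, but a gap remains for non-verifiable tasks like creative writing and open-ended dialogue, where quality is inherently subjective. Existing approaches rely on scalar reward models trained on human preferences, which generalize poorly and are prone to reward hacking such as over-explanation and length bias. The paper proposes a unified RLVR-based training paradigm built on a writing-principle-based pairwise Generative Reward Model (GenRM) and a Bootstrapped Relative Policy Optimization (BRPO) algorithm. GenRM uses self-principled critique to turn subjective assessments into reliable, verifiable rewards, while BRPO performs reference-free pairwise comparison by using a bootstrapped response from within the group rollouts as a temporary reference during RL training. Writing-Zero, trained without supervised fine-tuning, shows consistent improvement and strong resistance to reward hacking compared with scalar reward baselines, and the method is competitive on in-house and open-source writing benchmarks.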
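The idea of using one rollout from the group as a temporary reference for pairwise comparison can be sketched as follows. This is a loose illustration built only from the abstract's description: the reward values (+1, -1, 0), the uniform choice of reference, and `toy_judge` are all assumptions, and the actual BRPO algorithm may differ in each of these details.

```python
import random
from typing import Callable, List, Tuple


def bootstrapped_rewards(
    rollouts: List[str],
    prefers: Callable[[str, str], bool],
    rng: random.Random,
) -> Tuple[int, List[float]]:
    """Pick one rollout as a temporary reference and score the rest against it."""
    ref_index = rng.randrange(len(rollouts))
    reference = rollouts[ref_index]
    rewards: List[float] = []
    for i, response in enumerate(rollouts):
        if i == ref_index:
            rewards.append(0.0)   # the reference is not compared with itself
        elif prefers(response, reference):
            rewards.append(1.0)   # judged better than the reference
        else:
            rewards.append(-1.0)  # judged no better than the reference
    return ref_index, rewards


def toy_judge(a: str, b: str) -> bool:
    # Hypothetical pairwise judge: prefers the text with more distinct words.
    return len(set(a.split())) > len(set(b.split()))


rollouts = ["the the the", "a quiet harbor at dawn", "rain on tin roofs"]
ref_index, rewards = bootstrapped_rewards(rollouts, toy_judge, random.Random(0))
```

Because every reward is a binary outcome of a comparison rather than an unbounded scalar, a policy cannot raise its reward simply by inflating a feature the scorer likes; it has to beat a peer response sampled from the same group.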

Title: Spurious Correlations and Beyond: Understanding and Mitigating Shortcut Learning in SDOH Extraction with Large Language Models

Authors: Fardin Ahsan Sakib, Ziwei Zhu, Karen Trister Grace, Meliha Yetisgen, Ozlem Uzuner
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.00134
Pdf URL: https://arxiv.org/pdf/2506.00134
Copy Paste: [[2506.00134]] Spurious Correlations and Beyond: Understanding and Mitigating Shortcut Learning in SDOH Extraction with Large Language Models(https://arxiv.org/abs/2506.00134)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Social determinants of health (SDOH) extraction from clinical text is critical for downstream healthcare analytics. Although large language models (LLMs) have shown promise, they may rely on superficial cues leading to spurious predictions. Using the MIMIC portion of the SHAC (Social History Annotation Corpus) dataset and focusing on drug status extraction as a case study, we demonstrate that mentions of alcohol or smoking can falsely induce models to predict current/past drug use where none is present, while also uncovering concerning gender disparities in model performance. We further evaluate mitigation strategies - such as prompt engineering and chain-of-thought reasoning - to reduce these false positives, providing insights into enhancing LLM reliability in health domains.
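Summary: Extracting social determinants of health (SDOH) from clinical text is critical for downstream healthcare analytics. LLMs have shown promise, but they may rely on superficial cues that lead to spurious predictions. Using the MIMIC portion of the SHAC (Social History Annotation Corpus) dataset, with drug status extraction as a case study, the authors show that mentions of alcohol or smoking can falsely induce models to predict current or past drug use where none is present, and they uncover concerning gender disparities in model performance. They evaluate mitigation strategies, such as prompt engineering and chain-of-thought reasoning, to reduce these false positives.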

Title: LaMP-QA: A Benchmark for Personalized Long-form Question Answering

Authors: Alireza Salemi, Hamed Zamani
Subjects: cs.CL, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2506.00137
Pdf URL: https://arxiv.org/pdf/2506.00137
Copy Paste: [[2506.00137]] LaMP-QA: A Benchmark for Personalized Long-form Question Answering(https://arxiv.org/abs/2506.00137)
Keywords: language model, llm
Abstract: Personalization is essential for question answering systems that are user-centric. Despite its importance, personalization in answer generation has been relatively underexplored. This is mainly due to lack of resources for training and evaluating personalized question answering systems. We address this gap by introducing LaMP-QA -- a benchmark designed for evaluating personalized long-form answer generation. The benchmark covers questions from three major categories: (1) Arts & Entertainment, (2) Lifestyle & Personal Development, and (3) Society & Culture, encompassing over 45 subcategories in total. To assess the quality and potential impact of the LaMP-QA benchmark for personalized question answering, we conduct comprehensive human and automatic evaluations, to compare multiple evaluation strategies for evaluating generated personalized responses and measure their alignment with human preferences. Furthermore, we benchmark a number of non-personalized and personalized approaches based on open-source and proprietary large language models (LLMs). Our results show that incorporating the personalized context provided leads to performance improvements of up to 39%. The benchmark is publicly released to support future research in this area.
摘要：个性化对于以用户为中心的问答系统至关重要。尽管它很重要，但答案生成的个性化已经相对不受欢迎。这主要是由于缺乏用于培训和评估个性化问答系统的资源。我们通过引入LAMP-QA来解决这一差距，这是一种旨在评估个性化的长格式答案生成的基准测试。该基准涵盖了三个主要类别的问题：（1）艺术与娱乐，（2）生活方式与个人发展，以及（3）社会与文化，总计超过45个子类别。为了评估LAMP-QA基准对个性化问题答案的质量和潜在影响，我们进行了全面的人类和自动评估，以比较评估生成的个性化答案的多个评估策略，并衡量其与人类偏好的一致性。此外，我们基于开源和专有语言模型（LLMS）基于许多非个人化和个性化方法。我们的结果表明，合并个性化上下文提供了高达39％的绩效提高。公开发布基准，以支持该领域的未来研究。

Title: Werewolf: A Straightforward Game Framework with TTS for Improved User Engagement

Authors: Qihui Fan, Enfu Nan, Wenbo Li, Lei Lu, Pu Zhao, Yanzhi Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.00160
Pdf URL: https://arxiv.org/pdf/2506.00160
Copy Paste: [[2506.00160]] Werewolf: A Straightforward Game Framework with TTS for Improved User Engagement(https://arxiv.org/abs/2506.00160)
Keywords: language model, llm, prompt, agent
Abstract: The growing popularity of social deduction game systems for both business applications and AI research has greatly benefited from the rapid advancements in Large Language Models (LLMs), which now demonstrate stronger reasoning and persuasion capabilities. Especially with the rise of the DeepSeek R1 and V3 models, LLMs should enable a more engaging experience for human players in LLM-agent-based social deduction games like Werewolf. Previous works rely on fine-tuning, advanced prompt engineering, or an additional experience pool to achieve an engaging text-format Werewolf game experience. We propose a novel yet straightforward LLM-based Werewolf game system with tuned Text-to-Speech (TTS) models designed for enhanced compatibility with various LLMs and improved user engagement. We argue that, as LLM reasoning continues to improve, extra components will become unnecessary in the case of Werewolf.
摘要：社会推论游戏系统对业务应用和AI研究的日益普及得益于大型语言模型（LLMS）的快速发展，现在证明了更强的推理和说服力。尤其是随着DeepSeek R1和V3模型的提升，LLMS应该在基于LLM代理的社交扣除游戏（例如Wayswolf）中为人类玩家提供更具吸引力的体验。以前的作品要进行微调，高级提示工程或其他体验池，以实现引人入胜的文本形式狼人游戏体验。我们提出了一种新颖而直接的基于LLM的狼人游戏系统，其调谐文本到语音（TTS）模型旨在增强与各种LLM型号的兼容性，并改善了用户参与度。我们争论着增强了LLM推理，在狼人的情况下，额外的组件将不必要。

Title: Let Them Down Easy! Contextual Effects of LLM Guardrails on User Perceptions and Preferences

Authors: Mingqian Zheng, Wenjia Hu, Patrick Zhao, Motahhare Eslami, Jena D. Hwang, Faeze Brahman, Carolyn Rose, Maarten Sap
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2506.00195
Pdf URL: https://arxiv.org/pdf/2506.00195
Copy Paste: [[2506.00195]] Let Them Down Easy! Contextual Effects of LLM Guardrails on User Perceptions and Preferences(https://arxiv.org/abs/2506.00195)
Keywords: llm
Abstract: Current LLMs are trained to refuse potentially harmful input queries regardless of whether users actually have harmful intent, causing a tradeoff between safety and user experience. Through a study of 480 participants evaluating 3,840 query-response pairs, we examine how different refusal strategies affect user perceptions across varying motivations. Our findings reveal that response strategy largely shapes user experience, while actual user motivation has negligible impact. Partial compliance -- providing general information without actionable details -- emerges as the optimal strategy, reducing negative user perceptions by over 50% compared to flat-out refusals. Complementing this, we analyze response patterns of 9 state-of-the-art LLMs and evaluate how 6 reward models score different refusal strategies, demonstrating that models rarely deploy partial compliance naturally and reward models currently undervalue it. This work demonstrates that effective guardrails require focusing on crafting thoughtful refusals rather than detecting intent, offering a path toward AI safety mechanisms that ensure both safety and sustained user engagement.
摘要：当前的LLM经过培训，可以拒绝潜在的有害输入查询，而不管用户是否确实有有害意图，从而导致安全性和用户体验之间的权衡。通过对评估3,840个查询响应对的480名参与者的研究，我们研究了不同的拒绝策略如何影响各种动机的用户看法。我们的发现表明，响应策略在很大程度上塑造了用户体验，而实际的用户动机则可以忽略不计。部分合规性（提供无效细节的一般信息）作为最佳策略出现，将负面用户的看法降低了50％以上，以拒绝拒绝。与此相辅相成，我们分析了9个最先进的LLM的响应模式，并评估6种奖励模型如何得分不同的拒绝策略，这表明模型很少自然地部署部分合规性，并且当前低估IT的奖励模型。这项工作表明，有效的护栏需要专注于制定周到的拒绝而不是检测意图，从而为AI安全机制提供了确保安全性和持续用户参与度的道路。

Title: Structuring Radiology Reports: Challenging LLMs with Lightweight Models

Authors: Johannes Moll, Louisa Fay, Asfandyar Azhar, Sophie Ostmeier, Tim Lueth, Sergios Gatidis, Curtis Langlotz, Jean-Benoit Delbrouck
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.00200
Pdf URL: https://arxiv.org/pdf/2506.00200
Copy Paste: [[2506.00200]] Structuring Radiology Reports: Challenging LLMs with Lightweight Models(https://arxiv.org/abs/2506.00200)
Keywords: language model, llm, prompt
Abstract: Radiology reports are critical for clinical decision-making but often lack a standardized format, limiting both human interpretability and machine learning (ML) applications. While large language models (LLMs) have shown strong capabilities in reformatting clinical text, their high computational requirements, lack of transparency, and data privacy concerns hinder practical deployment. To address these challenges, we explore lightweight encoder-decoder models (<300M parameters), specifically T5 and BERT2BERT, for structuring radiology reports from the MIMIC-CXR and CheXpert Plus datasets. We benchmark these models against eight open-source LLMs (1B-70B), adapted using prefix prompting, in-context learning (ICL), and low-rank adaptation (LoRA) finetuning. Our best-performing lightweight model outperforms all LLMs adapted using prompt-based techniques on a human-annotated test set. While some LoRA-finetuned LLMs achieve modest gains over the lightweight model on the Findings section (BLEU 6.4%, ROUGE-L 4.8%, BERTScore 3.6%, F1-RadGraph 1.1%, GREEN 3.6%, and F1-SRR-BERT 4.3%), these improvements come at the cost of substantially greater computational resources. For example, LLaMA-3-70B incurred more than 400 times the inference time, cost, and carbon emissions compared to the lightweight model. These results underscore the potential of lightweight, task-specific models as sustainable and privacy-preserving solutions for structuring clinical text in resource-constrained healthcare settings.
摘要：放射学报告对于临床决策至关重要，但通常缺乏标准化的格式，限制了人类的解释性和机器学习（ML）应用。尽管大型语言模型（LLM）在重新格式化临床文本方面表现出很强的能力，但其高度计算要求，缺乏透明度和数据隐私涉及涉及实际部署。为了应对这些挑战，我们探索了轻巧的编码器模型（<300m参数） - 特别是T5和Bert2bert-Fort-For结构放射学报告，来自Mimic-CXR和CHExpert Plus数据集。我们根据八个开源LLM（1B-70B）对这些模型进行基准测试，该模型使用前缀提示，内在学习（ICL）和低秩适应（LORA）FINETUNNING进行了调整。我们表现最佳的轻质模型的表现优于所有使用基于及时的技术的LLM，该模型在人类注销的测试集上进行了调整。尽管某些洛拉（Lora） - 调节的LLM在发现部分的轻量级模型（BLEU 6.4％，鲁格L 4.8％，Bertscore 3.6％，F1-Radgraph 1.1％，绿色3.6％和F1-SRR-BERT 4.3％）上实现了适度的增长，但这些改进以实质性计算的成本更高。例如，与轻量级模型相比，Llama-3-70B的推理时间，成本和碳排放量超过400倍。这些结果强调了轻巧的，特定于任务的模型作为可持续和隐私的解决方案，以在资源受限的医疗环境中构建临床文本。

Title: Structure-Aware Fill-in-the-Middle Pretraining for Code

Authors: Linyuan Gong, Alvin Cheung, Mostafa Elhoushi, Sida Wang
Subjects: cs.CL, cs.AI, cs.SE
Abstract URL: https://arxiv.org/abs/2506.00204
Pdf URL: https://arxiv.org/pdf/2506.00204
Copy Paste: [[2506.00204]] Structure-Aware Fill-in-the-Middle Pretraining for Code(https://arxiv.org/abs/2506.00204)
Keywords: llm
Abstract: Fill-in-the-Middle (FIM) is a common pretraining method for code LLMs, where models complete code segments given surrounding context. However, existing LLMs treat code as plain text and mask random character spans. We propose and evaluate AST-FIM, a pretraining strategy that leverages Abstract Syntax Trees (ASTs) to mask complete syntactic structures at scale, ensuring coherent training examples better aligned with universal code structures and common code editing patterns such as blocks, expressions, or functions. To evaluate real-world fill-in-the-middle (FIM) programming tasks, we introduce Real-FIM-Eval, a benchmark derived from 30,000+ GitHub commits across 12 languages. On infilling tasks, experiments on 1B and 8B parameter models show that AST-FIM is particularly beneficial for real-world code editing as it outperforms standard random-character FIM by up to 5 pts on standard FIM benchmarks. Our code is publicly available at this https URL.
摘要：中间填充（FIM）是代码LLMS的常见预处理方法，其中模型完整的代码段给定环境上下文。但是，现有的LLM将代码视为纯文本和掩盖随机字符跨度。我们提出并评估AST-FIM，这是一种预处理的策略，利用抽象的语法树（AST）大规模掩盖完整的句法结构，确保相干训练示例与通用代码结构和常见代码编辑模式（例如块，表达式或功能）更好地对齐。为了评估现实世界中的中间填充（FIM）编程任务，我们介绍了现实五件效果，这是一种源自12种语言的30,000多个GitHub的基准。在填充任务上，对1B和8B参数模型的实验表明，AST-FIM对现实世界代码编辑特别有益，因为它在标准FIM基准上的表现优于标准的随机字符FIM高达5分。我们的代码在此HTTPS URL上公开可用。
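
The structure-aware masking idea in the abstract above can be illustrated with a minimal sketch. This is not the authors' AST-FIM pipeline: it assumes Python source, uses the standard-library `ast` module to pick a complete syntactic node as the masked "middle", and formats the example with hypothetical sentinel strings (`<PRE>`, `<SUF>`, `<MID>`) that stand in for whatever special tokens a real code LLM defines.

```python
import ast

# Hypothetical sentinel strings; real code LLMs define their own special tokens.
FIM_PREFIX, FIM_SUFFIX, FIM_MIDDLE = "<PRE>", "<SUF>", "<MID>"


def candidate_nodes(source: str) -> list[ast.AST]:
    """Return statement and expression nodes that carry source positions."""
    tree = ast.parse(source)
    return [
        node for node in ast.walk(tree)
        if isinstance(node, (ast.stmt, ast.expr))
        and getattr(node, "end_lineno", None) is not None
    ]


def split_at_node(source: str, node: ast.AST) -> tuple[str, str, str]:
    """Split source into (prefix, middle, suffix); middle is exactly one AST node."""
    data = source.encode("utf-8")  # ast column offsets are UTF-8 byte offsets
    line_starts = [0]
    for line in data.splitlines(keepends=True):
        line_starts.append(line_starts[-1] + len(line))
    start = line_starts[node.lineno - 1] + node.col_offset
    end = line_starts[node.end_lineno - 1] + node.end_col_offset
    return data[:start].decode(), data[start:end].decode(), data[end:].decode()


def to_fim_example(prefix: str, middle: str, suffix: str) -> str:
    """Format one training string in prefix-suffix-middle order (an assumed layout)."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"


SOURCE = "def add(a, b):\n    total = a + b\n    return total\n"

if __name__ == "__main__":
    nodes = candidate_nodes(SOURCE)
    ret = next(n for n in nodes if isinstance(n, ast.Return))
    print(to_fim_example(*split_at_node(SOURCE, ret)))
```

In contrast to masking a random character span, the middle here always begins and ends on a node boundary, so the model is asked to fill in a whole statement or expression.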

Title: REIC: RAG-Enhanced Intent Classification at Scale

Authors: Ziji Zhang, Michael Yang, Zhiyu Chen, Yingying Zhuang, Shu-Ting Pi, Qun Liu, Rajashekar Maragoud, Vy Nguyen, Anurag Beniwal
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.00210
Pdf URL: https://arxiv.org/pdf/2506.00210
Copy Paste: [[2506.00210]] REIC: RAG-Enhanced Intent Classification at Scale(https://arxiv.org/abs/2506.00210)
Keywords: retrieval-augmented generation, agent
Abstract: Accurate intent classification is critical for efficient routing in customer service, ensuring customers are connected with the most suitable agents while reducing handling times and operational costs. However, as companies expand their product lines, intent classification faces scalability challenges due to the increasing number of intents and variations in taxonomy across different verticals. In this paper, we introduce REIC, a Retrieval-augmented generation Enhanced Intent Classification approach, which addresses these challenges effectively. REIC leverages retrieval-augmented generation (RAG) to dynamically incorporate relevant knowledge, enabling precise classification without the need for frequent retraining. Through extensive experiments on real-world datasets, we demonstrate that REIC outperforms traditional fine-tuning, zero-shot, and few-shot methods in large-scale customer service settings. Our results highlight its effectiveness in both in-domain and out-of-domain scenarios, demonstrating its potential for real-world deployment in adaptive and large-scale intent classification systems.
摘要：准确的意图分类对于有效的客户服务路线至关重要，确保客户与最合适的代理相连，同时减少处理时间和运营成本。但是，随着公司扩大产品线的扩大，意图分类面临可伸缩性挑战，这是由于不同垂直行业的分类法的意图和分类法的差异增加。在本文中，我们介绍了REIC，这是一种检索一代增强的意图分类方法，该方法有效地解决了这些挑战。 REIC利用检索功能的生成（RAG）动态合并相关知识，实现精确的分类，而无需频繁地进行重新训练。通过对现实世界数据集的广泛实验，我们证明了REIC在大规模客户服务设置中的表现优于传统的微调，零射击和少量方法。我们的结果强调了其在内域和室外场景中的有效性，这表明了其在自适应和大规模意图分类系统中现实部署的潜力。

Title: ComposeRAG: A Modular and Composable RAG for Corpus-Grounded Multi-Hop Question Answering

Authors: Ruofan Wu, Youngwon Lee, Fan Shu, Danmei Xu, Seung-won Hwang, Zhewei Yao, Yuxiong He, Feng Yan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.00232
Pdf URL: https://arxiv.org/pdf/2506.00232
Copy Paste: [[2506.00232]] ComposeRAG: A Modular and Composable RAG for Corpus-Grounded Multi-Hop Question Answering(https://arxiv.org/abs/2506.00232)
Keywords: retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) systems are increasingly diverse, yet many suffer from monolithic designs that tightly couple core functions like query reformulation, retrieval, reasoning, and verification. This limits their interpretability, systematic evaluation, and targeted improvement, especially for complex multi-hop question answering. We introduce ComposeRAG, a novel modular abstraction that decomposes RAG pipelines into atomic, composable modules. Each module, such as Question Decomposition, Query Rewriting, Retrieval Decision, and Answer Verification, acts as a parameterized transformation on structured inputs/outputs, allowing independent implementation, upgrade, and analysis. To enhance robustness against errors in multi-step reasoning, ComposeRAG incorporates a self-reflection mechanism that iteratively revisits and refines earlier steps upon verification failure. Evaluated on four challenging multi-hop QA benchmarks, ComposeRAG consistently outperforms strong baselines in both accuracy and grounding fidelity. Specifically, it achieves up to a 15% accuracy improvement over fine-tuning-based methods and up to a 5% gain over reasoning-specialized pipelines under identical retrieval conditions. Crucially, ComposeRAG significantly enhances grounding: its verification-first design reduces ungrounded answers by over 10% in low-quality retrieval settings, and by approximately 3% even with strong corpora. Comprehensive ablation studies validate the modular architecture, demonstrating distinct and additive contributions from each component. These findings underscore ComposeRAG's capacity to deliver flexible, transparent, scalable, and high-performing multi-hop reasoning with improved grounding and interpretability.
摘要：检索增强的一代（RAG）系统越来越多样化，但许多人都遭受了整体设计的影响，这些设计紧密地伴随着核心功能，例如查询重新进行，检索，推理和验证。这限制了他们的可解释性，系统评估和有针对性的改进，尤其是对于复杂的多跳问题回答。我们介绍了Composag，这是一种新型的模块化抽象，将RAG管道分解为原子，可综合模块。每个模块，例如问题分解，查询重写，检索决策和回答验证，都可以作为结构化输入/输出的参数化转换，从而允许独立实施，升级和分析。为了增强对多步推理中错误的鲁棒性，CompoSerag结合了一种自我反射机制，在验证失败后迭代地重新审视并完善了早期步骤。 CompoSerag在四个具有挑战性的多跳QA基准测试中进行了评估，在准确性和接地忠诚度上始终优于强大的基准。具体而言，在相同的检索条件下，它比基于微调的方法的准确性提高了15％的准确性提高，而在相同的检索条件下，它的准确性提高了5％。至关重要的是，CompoSerag显着增强了基础：在低质量的检索环境中，其验证优先的设计将未接地的答案减少了10％，即使具有强大的Corpora，也可以将大约3％的答案降低了大约3％。全面的消融研究验证了模块化体系结构，证明了每个组件的独特和添加贡献。这些发现强调了Composerag提供柔性，透明，可扩展和高性能的多跳推理的能力，并提高了接地和可解释性。

Title: MedOrch: Medical Diagnosis with Tool-Augmented Reasoning Agents for Flexible Extensibility

Authors: Yexiao He, Ang Li, Boyi Liu, Zhewei Yao, Yuxiong He
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.00235
Pdf URL: https://arxiv.org/pdf/2506.00235
Copy Paste: [[2506.00235]] MedOrch: Medical Diagnosis with Tool-Augmented Reasoning Agents for Flexible Extensibility(https://arxiv.org/abs/2506.00235)
Keywords: language model, agent
Abstract: Healthcare decision-making represents one of the most challenging domains for Artificial Intelligence (AI), requiring the integration of diverse knowledge sources, complex reasoning, and various external analytical tools. Current AI systems often rely on either task-specific models, which offer limited adaptability, or general language models without grounding with specialized external knowledge and tools. We introduce MedOrch, a novel framework that orchestrates multiple specialized tools and reasoning agents to provide comprehensive medical decision support. MedOrch employs a modular, agent-based architecture that facilitates the flexible integration of domain-specific tools without altering the core system. Furthermore, it ensures transparent and traceable reasoning processes, enabling clinicians to meticulously verify each intermediate step underlying the system's recommendations. We evaluate MedOrch across three distinct medical applications: Alzheimer's disease diagnosis, chest X-ray interpretation, and medical visual question answering, using authentic clinical datasets. The results demonstrate MedOrch's competitive performance across these diverse medical tasks. Notably, in Alzheimer's disease diagnosis, MedOrch achieves an accuracy of 93.26%, surpassing the state-of-the-art baseline by over four percentage points. For predicting Alzheimer's disease progression, it attains a 50.35% accuracy, marking a significant improvement. In chest X-ray analysis, MedOrch exhibits superior performance with a Macro AUC of 61.2% and a Macro F1-score of 25.5%. Moreover, in complex multimodal visual question answering (Image+Table), MedOrch achieves an accuracy of 54.47%. These findings underscore MedOrch's potential to advance healthcare AI by enabling reasoning-driven tool utilization for multimodal medical data processing and supporting intricate cognitive tasks in clinical decision-making.
摘要：医疗保健决策是人工智能（AI）最具挑战性的领域之一，需要将各种知识源，复杂推理和各种外部分析工具整合在一起。当前的AI系统通常依赖于特定于任务的模型，这些模型具有有限的适应性或一般语言模型，而无需使用专门的外部知识和工具接地。我们介绍了Medorch，这是一个新颖的框架，该框架精心策划了多种专业工具和推理代理，以提供全面的医疗决策支持。 Medorch采用了基于模块化的，基于代理的体系结构，可促进特定于域特定工具的灵活集成而无需更改核心系统。此外，它确保了透明且可追溯的推理过程，使临床医生能够精心验证系统建议的基础下的每个中间步骤。我们使用真实的临床数据集评估了三种不同的医学应用：阿尔茨海默氏病诊断，胸部X射线解释和医学视觉问题答案。结果表明，梅多奇在这些不同的医疗任务中的竞争表现。值得注意的是，在阿尔茨海默氏病的诊断中，梅多奇的准确度达到93.26％，超过了四个百分点以上。为了预测阿尔茨海默氏病的进展，它的准确性为50.35％，标志着显着的改善。在胸部X射线分析中，Medorch表现出卓越的性能，其宏观AUC为61.2％，宏F1得分为25.5％。此外，在复杂的多模式视觉问题回答（图像+表）中，Medorch的精度为54.47％。这些发现强调了Medorch通过启用推理驱动的工具利用来推动医疗AI的潜力，用于多模式医学数据处理并支持临床决策中的复杂认知任务。

Title: PersianMedQA: Language-Centric Evaluation of LLMs in the Persian Medical Domain

Authors: Mohammad Javad Ranjbar Kalahroodi, Amirhossein Sheikholselami, Sepehr Karimi, Sepideh Ranjbar Kalahroodi, Heshaam Faili, Azadeh Shakery
Subjects: cs.CL, cs.IT
Abstract URL: https://arxiv.org/abs/2506.00250
Pdf URL: https://arxiv.org/pdf/2506.00250
Copy Paste: [[2506.00250]] PersianMedQA: Language-Centric Evaluation of LLMs in the Persian Medical Domain(https://arxiv.org/abs/2506.00250)
Keywords: language model, gpt, llm, chain-of-thought
Abstract: Large Language Models (LLMs) have achieved remarkable performance on a wide range of NLP benchmarks, often surpassing human-level accuracy. However, their reliability in high-stakes domains such as medicine, particularly in low-resource languages, remains underexplored. In this work, we introduce PersianMedQA, a large-scale, expert-validated dataset of multiple-choice Persian medical questions, designed to evaluate LLMs across both Persian and English. We benchmark over 40 state-of-the-art models, including general-purpose, Persian fine-tuned, and medical LLMs, in zero-shot and chain-of-thought (CoT) settings. Our results show that closed-source general models (e.g., GPT-4.1) consistently outperform all other categories, achieving 83.3% accuracy in Persian and 80.7% in English, while Persian fine-tuned models such as Dorna underperform significantly (e.g., 35.9% in Persian), often struggling with both instruction-following and domain reasoning. We also analyze the impact of translation, showing that while English performance is generally higher, Persian responses are sometimes more accurate due to cultural and clinical contextual cues. Finally, we demonstrate that model size alone is insufficient for robust performance without strong domain or language adaptation. PersianMedQA provides a foundation for evaluating multilingual and culturally grounded medical reasoning in LLMs. The PersianMedQA dataset can be accessed at this https URL.
摘要：大型语言模型（LLM）在广泛的NLP基准上取得了出色的性能，通常超过人类水平的准确性。但是，它们在医学等高风险领域的可靠性，尤其是在低资源语言中的可靠性，但仍未得到充实。在这项工作中，我们介绍了PersianMedQA，这是一个多项选择波斯医学问题的大型，专家验证的数据集，旨在评估波斯语和英语的LLM。我们在40个最先进的模型中基准了，包括通用，波斯微调和医疗LLM，以零射击和经过三通（COT）设置。我们的结果表明，封闭源的通用模型（例如GPT-4.1）始终超越所有其他类别，在波斯语中达到83.3％的精度和英语的80.7％，而波斯的微型模型（例如DORNA表现不佳）的表现不佳（例如，在Persian中的35.9％）经常努力和努力进行教学的推理。我们还分析了翻译的影响，表明尽管英语表现通常更高，但由于文化和临床上下文提示，波斯的反应有时更准确。最后，我们证明单独的模型大小不足以在没有强大的领域或语言适应的情况下进行稳健的性能。 PersianMedQA为评估LLMS中的多语言和文化基础的医学推理提供了基础。可以通过以下位置访问PersianMedQA数据集

Title: Aligned but Blind: Alignment Increases Implicit Bias by Reducing Awareness of Race

Authors: Lihao Sun, Chengzhi Mao, Valentin Hofmann, Xuechunzi Bai
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2506.00253
Pdf URL: https://arxiv.org/pdf/2506.00253
Copy Paste: [[2506.00253]] Aligned but Blind: Alignment Increases Implicit Bias by Reducing Awareness of Race(https://arxiv.org/abs/2506.00253)
Keywords: language model
Abstract: Although value-aligned language models (LMs) appear unbiased in explicit bias evaluations, they often exhibit stereotypes in implicit word association tasks, raising concerns about their fair usage. We investigate the mechanisms behind this discrepancy and find that alignment surprisingly amplifies implicit bias in model outputs. Specifically, we show that aligned LMs, unlike their unaligned counterparts, overlook racial concepts in early internal representations when the context is ambiguous. Not representing race likely fails to activate safety guardrails, leading to unintended biases. Inspired by this insight, we propose a new bias mitigation strategy that works by incentivizing the representation of racial concepts in the early model layers. In contrast to conventional mitigation methods of machine unlearning, our interventions find that steering the model to be more aware of racial concepts effectively mitigates implicit bias. Similar to race blindness in humans, ignoring racial nuances can inadvertently perpetuate subtle biases in LMs.
摘要：尽管在明确的偏见评估中显得公正地偏见，但它们经常在隐式单词关联任务中表现出刻板印象，从而引起了对其公平用法的担忧。我们研究了这种差异背后的机制，并发现对齐方式令人惊讶地放大了模型输出中的隐式偏差。具体来说，我们表明，与他们的不一致的同行不同，当上下文模棱两可时，与他们不一致的同行不同。不代表种族可能无法激活安全护栏，从而导致意想不到的偏见。受这种见解的启发，我们提出了一种新的缓解策略，该策略是通过激励早期模型层中种族概念的代表来起作用的。与传统的缓解方法相反，我们的干预措施发现，转向模型更了解种族概念会有效地减轻隐式偏见。与人类的种族失明类似，忽略种族细微差别可能会无意间永久存在LMS的微妙偏见。

Title: The Impact of Disability Disclosure on Fairness and Bias in LLM-Driven Candidate Selection

Authors: Mahammed Kamruzzaman, Gene Louis Kim
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.00256
Pdf URL: https://arxiv.org/pdf/2506.00256
Copy Paste: [[2506.00256]] The Impact of Disability Disclosure on Fairness and Bias in LLM-Driven Candidate Selection(https://arxiv.org/abs/2506.00256)
Keywords: language model, llm
Abstract: As large language models (LLMs) become increasingly integrated into hiring processes, concerns about fairness have gained prominence. When candidates apply for jobs, companies often request or require demographic information, including gender, race, and disability or veteran status. This data is collected to support diversity and inclusion initiatives, but when it is provided to LLMs, disability-related information in particular raises concerns about potential biases in candidate selection outcomes. Many studies have highlighted how disability can impact CV screening, yet little research has explored the specific effect of voluntarily disclosed information on LLM-driven candidate selection. This study seeks to bridge that gap. When candidates shared identical gender, race, qualifications, experience, and backgrounds, and sought jobs with minimal employment rate gaps between individuals with and without disabilities (e.g., Cashier, Software Developer), LLMs consistently favored candidates who disclosed that they had no disability. Even in cases where candidates chose not to disclose their disability status, the LLMs were less likely to select them compared to those who explicitly stated they did not have a disability.
摘要：随着大型语言模型（LLM）越来越多地整合到招聘过程中，对公平性的担忧已获得突出。申请工作时，公司通常会要求/需要人口统计信息，包括性别，种族，残疾或退伍军人身份。收集这些数据是为了支持多样性和包容性计划，但是当提供给LLM，尤其是与残疾相关的信息时，它引起了人们对候选选择结果中潜在偏见的担忧。许多研究强调了残疾如何影响简历筛查，但很少的研究探讨了自愿披露的信息对LLM驱动的候选人选择的特定影响。这项研究试图弥合这一差距。当候选人分享相同的性别，种族，资格，经验和背景，并在残疾人和没有残障人士之间以最小的就业率差距（例如，出纳员，软件开发人员）寻求工作时，LLM始终偏爱披露他们没有残疾的候选人。即使在候选人选择不披露其残疾状况的情况下，与明确表示他们没有残疾的人相比，LLM的选择不太可能选择它们。

Title: MultiHoax: A Dataset of Multi-hop False-Premise Questions

Authors: Mohammadamin Shafiei, Hamidreza Saffari, Nafise Sadat Moosavi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.00264
Pdf URL: https://arxiv.org/pdf/2506.00264
Copy Paste: [[2506.00264]] MultiHoax: A Dataset of Multi-hop False-Premise Questions(https://arxiv.org/abs/2506.00264)
Keywords: language model, llm
Abstract: As Large Language Models are increasingly deployed in high-stakes domains, their ability to detect false assumptions and reason critically is crucial for ensuring reliable outputs. False-premise questions (FPQs) serve as an important evaluation method by exposing cases where flawed assumptions lead to incorrect responses. While existing benchmarks focus on single-hop FPQs, real-world reasoning often requires multi-hop inference, where models must verify consistency across multiple reasoning steps rather than relying on surface-level cues. To address this gap, we introduce MultiHoax, a benchmark for evaluating LLMs' ability to handle false premises in complex, multi-step reasoning tasks. Our dataset spans seven countries and ten diverse knowledge categories, using Wikipedia as the primary knowledge source to enable factual reasoning across regions. Experiments reveal that state-of-the-art LLMs struggle to detect false premises across different countries, knowledge categories, and multi-hop reasoning types, highlighting the need for improved false premise detection and more robust multi-hop reasoning capabilities in LLMs.
摘要：由于大型语言模型越来越多地部署在高风险域中，因此它们检测错误的假设和关键原因的能力对于确保可靠的产出至关重要。假新型问题（FPQ）通过暴露有缺陷的假设导致不正确响应的案例来成为重要的评估方法。尽管现有基准专注于单跳FPQ，但实际推理通常需要多跳推理，其中模型必须验证多个推理步骤的一致性，而不是依靠表面级别的提示。为了解决这一差距，我们介绍了MultiHoax，这是评估LLMS在复杂的多步推理任务中处理错误场所的能力的基准。我们的数据集跨越了七个国家和十个不同的知识类别，使用维基百科作为主要知识来源，以使跨地区的事实推理。实验表明，最先进的LLM努力检测不同国家 /地区的虚假前提，知识类别和多跳推理类型，强调了在LLMS中需要改进的虚假前提检测和更强大的多跳跃推理能力。

Title: CASPER: A Large Scale Spontaneous Speech Dataset

Authors: Cihan Xiao, Ruixing Liang, Xiangyu Zhang, Mehmet Emre Tiryaki, Veronica Bae, Lavanya Shankar, Rong Yang, Ethan Poon, Emmanuel Dupoux, Sanjeev Khudanpur, Leibny Paola Garcia Perera
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.00267
Pdf URL: https://arxiv.org/pdf/2506.00267
Copy Paste: [[2506.00267]] CASPER: A Large Scale Spontaneous Speech Dataset(https://arxiv.org/abs/2506.00267)
Keywords: language model
Abstract: The success of large language models has driven interest in developing similar speech processing capabilities. However, a key challenge is the scarcity of high-quality spontaneous speech data, as most existing datasets contain scripted dialogues. To address this, we present a novel pipeline for eliciting and recording natural dialogues and release our Stage 1 dataset with 200+ hours of spontaneous speech. Our approach fosters fluid, natural conversations while encouraging a diverse range of topics and interactive exchanges. Unlike traditional methods, it facilitates genuine interactions, providing a reproducible framework for future data collection. This paper introduces our dataset and methodology, laying the groundwork for addressing the shortage of spontaneous speech data. We plan to expand this dataset in future stages, offering a growing resource for the research community.
摘要：大型语言模型的成功引起了人们对开发类似语音处理能力的兴趣。但是，关键的挑战是缺乏高质量的自发语音数据，因为大多数现有数据集都包含脚本对话。为了解决这个问题，我们提出了一条新颖的管道，用于引发和记录自然对话，并以200多个小时的自发演讲发布我们的第1阶段数据集。我们的方法促进了流体，自然的对话，同时鼓励各种各样的主题和互动交流。与传统方法不同，它有助于真正的互动，为未来的数据收集提供了可重现的框架。本文介绍了我们的数据集和方法，为解决自发语音数据短缺的基础奠定了基础。我们计划在以后的阶段扩展该数据集，为研究社区提供不断增长的资源。

Title: Hierarchical Level-Wise News Article Clustering via Multilingual Matryoshka Embeddings

Authors: Hans W. A. Hanley, Zakir Durumeric
Subjects: cs.CL, cs.AI, cs.SI
Abstract URL: https://arxiv.org/abs/2506.00277
Pdf URL: https://arxiv.org/pdf/2506.00277
Copy Paste: [[2506.00277]] Hierarchical Level-Wise News Article Clustering via Multilingual Matryoshka Embeddings(https://arxiv.org/abs/2506.00277)
Keywords: language model
Abstract: Contextual large language model embeddings are increasingly utilized for topic modeling and clustering. However, current methods often scale poorly, rely on opaque similarity metrics, and struggle in multilingual settings. In this work, we present a novel, scalable, interpretable, hierarchical, and multilingual approach to clustering news articles and social media data. To do this, we first train multilingual Matryoshka embeddings that can determine story similarity at varying levels of granularity based on which subset of the dimensions of the embeddings is examined. This embedding model achieves state-of-the-art performance on the SemEval 2022 Task 8 test dataset (Pearson $\rho$ = 0.816). Once trained, we develop an efficient hierarchical clustering algorithm that leverages the hierarchical nature of Matryoshka embeddings to identify unique news stories, narratives, and themes. We conclude by illustrating how our approach can identify and cluster stories, narratives, and overarching themes within real-world news datasets.
摘要：上下文大型语言模型嵌入越来越多地用于主题建模和聚类。但是，当前的方法通常扩大很差，依赖不透明的相似性指标以及在多语言环境中的挣扎。在这项工作中，我们提供了一种新颖，可扩展，可解释的，等级和多语言的方法，用于聚集新闻文章和社交媒体数据。为此，我们首先训练多种语言Matryoshka嵌入，这些嵌入方式可以根据嵌入式的尺寸子集的哪个子集的子集，可以在不同级别的粒度上确定故事相似性。该嵌入模型可以在Semeval 2022任务8测试数据集（Pearson $ \ rho $ = 0.816）上实现最先进的性能。一旦受过培训，我们就开发了一种有效的层次聚类算法，该算法利用Matryoshka嵌入的层次结构性质来识别独特的新闻故事，叙述和主题。我们通过说明我们的方法如何在现实世界新闻数据集中识别和群体的故事，叙述和总体主题来结束。
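
The abstract above describes comparing articles at different granularities by examining different subsets of the embedding dimensions. A minimal sketch of that idea follows. It assumes, as is common for Matryoshka-style embeddings, that the subset is a leading prefix of the vector; the toy vectors and the prefix lengths are illustrative values, not outputs of the authors' model.

```python
import math


def truncated_cosine(u: list[float], v: list[float], k: int) -> float:
    """Cosine similarity computed on the first k dimensions only."""
    a, b = u[:k], v[:k]
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero prefix as carrying no similarity signal
    return dot / (norm_a * norm_b)


# Toy 4-dimensional embeddings (made-up values): the two articles agree on the
# leading dimensions and disagree on the trailing ones.
article_a = [0.6, 0.8, 0.5, -0.5]
article_b = [0.6, 0.8, -0.5, 0.5]

if __name__ == "__main__":
    for k in (2, 4):
        print(k, round(truncated_cosine(article_a, article_b, k), 3))
```

With these toy vectors the two articles look identical on the 2-dimensional prefix and only weakly similar on all 4 dimensions, which is the kind of level-wise signal a hierarchical clustering step could use.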

Title: Emergent Abilities of Large Language Models under Continued Pretraining for Language Adaptation

Authors: Ahmed Elhady, Eneko Agirre, Mikel Artetxe
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.00288
Pdf URL: https://arxiv.org/pdf/2506.00288
Copy Paste: [[2506.00288]] Emergent Abilities of Large Language Models under Continued Pretraining for Language Adaptation(https://arxiv.org/abs/2506.00288)
Keywords: language model, llm, prompt
Abstract: Continued pretraining (CPT) is a popular approach to adapt existing large language models (LLMs) to new languages. When doing so, it is common practice to include a portion of English data in the mixture, but its role has not been carefully studied to date. In this work, we show that including English does not impact validation perplexity, yet it is critical for the emergence of downstream capabilities in the target language. We introduce a language-agnostic benchmark for in-context learning (ICL), which reveals catastrophic forgetting early in CPT when English is not included. This in turn damages the ability of the model to generalize to downstream prompts in the target language as measured by perplexity, even if it does not manifest in terms of accuracy until later in training, and can be tied to a big shift in the model parameters. Based on these insights, we introduce curriculum learning and exponential moving average (EMA) of weights as effective alternatives to mitigate the need for English. All in all, our work sheds light on the dynamics by which emergent abilities arise when doing CPT for language adaptation, and can serve as a foundation to design more effective methods in the future.
摘要：持续预处理（CPT）是一种流行的方法，可以使现有的大型语言模型（LLM）适应新语言。这样做时，通常将一部分英语数据包含在混合物中是通常的实践，但是迄今为止尚未对其作用进行仔细的研究。在这项工作中，我们表明包括英语不会影响验证的困惑，但对于目标语言中下游功能的出现至关重要。我们引入了一种语言不可屈服的基准，用于内在学习（ICL），该基准揭示了不包括英语时在CPT上早期遗忘的灾难性遗忘。反过来，这又损害了模型在目标语言中以困惑度衡量的目标语言的下游提示的能力，即使它直到训练后期才以准确性表现出来，并且可以与模型参数的重大变化联系在一起。基于这些见解，我们介绍了体重的课程学习和指数移动平均值（EMA）作为减轻英语需求的有效替代方法。总而言之，我们的工作阐明了在进行语言适应CPT时出现的动态，并可以作为未来设计更有效方法的基础。

Title: DLM-One: Diffusion Language Models for One-Step Sequence Generation

Authors: Tianqi Chen, Shujian Zhang, Mingyuan Zhou
Subjects: cs.CL, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2506.00290
Pdf URL: https://arxiv.org/pdf/2506.00290
Copy Paste: [[2506.00290]] DLM-One: Diffusion Language Models for One-Step Sequence Generation(https://arxiv.org/abs/2506.00290)
Keywords: language model
Abstract: This paper introduces DLM-One, a score-distillation-based framework for one-step sequence generation with continuous diffusion language models (DLMs). DLM-One eliminates the need for iterative refinement by aligning the scores of a student model's outputs in the continuous token embedding space with the score function of a pretrained teacher DLM. We investigate whether DLM-One can achieve substantial gains in sampling efficiency for language modeling. Through comprehensive experiments on DiffuSeq -- a representative continuous DLM -- we show that DLM-One achieves up to ~500x speedup in inference time while maintaining competitive performance on benchmark text generation tasks used to evaluate the teacher models. We further analyze the method's empirical behavior across multiple datasets, providing initial insights into its generality and practical applicability. Our findings position one-step diffusion as a promising direction for efficient, high-quality language generation and broader adoption of continuous diffusion models operating in embedding space for natural language processing.
摘要：本文介绍了DLM-ONE，这是一种基于得分差的框架，用于使用连续扩散语言模型（DLM）生成一步序列。 DLM-ONE通过将学生模型在连续的令牌嵌入空间中的分数与据称的老师DLM的得分功能相提并论，从而消除了迭代精炼的需求。我们研究了DLM-ONE是否可以在语言建模的采样效率方面取得可观的提高。通过对DivFuseQ的全面实验 - 代表性的连续DLM - 我们表明，DLM-ONE在推理时间内达到了〜500倍的速度，同时在基准的文本生成任务上保持了用于评估教师模型的基准文本生成任务的竞争性能。我们进一步分析了该方法在多个数据集中的经验行为，从而提供了对其普遍性和实际适用性的初步见解。我们的发现位置一步扩散是有前途的方向，用于高效，高质量的语言产生，并更广泛地采用在嵌入自然语言处理空间中运行的连续扩散模型。

Title: Can LLMs Understand Unvoiced Speech? Exploring EMG-to-Text Conversion with LLMs

Authors: Payal Mohapatra, Akash Pandey, Xiaoyuan Zhang, Qi Zhu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.00304
Pdf URL: https://arxiv.org/pdf/2506.00304
Copy Paste: [[2506.00304]] Can LLMs Understand Unvoiced Speech? Exploring EMG-to-Text Conversion with LLMs(https://arxiv.org/abs/2506.00304)
Keywords: language model, llm
Abstract: Unvoiced electromyography (EMG) is an effective communication tool for individuals unable to produce vocal speech. However, most prior methods rely on paired voiced and unvoiced EMG signals, along with speech data, for EMG-to-text conversion, which is not practical for such individuals. Given the rise of large language models (LLMs) in speech recognition, we explore their potential to understand unvoiced speech. To this end, we address the challenge of learning from unvoiced EMG alone and propose a novel EMG adaptor module that maps EMG features into an LLM's input space, achieving an average word error rate (WER) of 0.49 on a closed-vocabulary unvoiced EMG-to-text task. Even with a conservative data availability of just six minutes, our approach improves performance over specialized models by nearly 20%. While LLMs have been shown to be extendable to new language modalities -- such as audio -- understanding articulatory biosignals like unvoiced EMG remains more challenging. This work takes a crucial first step toward enabling LLMs to comprehend unvoiced speech using surface EMG.
摘要：未发音的肌电图（EMG）是一种有效的交流工具，对于无法发表声音的个体。但是，大多数先前的方法都依赖于配对的配对和未发音的EMG信号以及语音数据，用于EMG到文本转换，这对于此类人来说是不切实际的。鉴于大型语言模型（LLM）在语音识别中的兴起，我们探索了他们理解未发音的语音的潜力。为此，我们解决了仅从未经串意的EMG学习的挑战，并提出了一个新颖的EMG适配器模块，该模块将EMG映射到LLM的输入空间中，在封闭的录音率无用的未访问的EMG-TEXT任务上达到了平均单词错误率（WER）为0.49。即使只有六分钟的保守数据可用性，我们的方法仍将专业模型的性能提高了近20％。尽管已证明LLM可以扩展到新的语言方式（例如音频），但可以理解诸如未配音EMG之类的发音生物信号仍然更具挑战性。这项工作迈出了至关重要的第一步，使LLM可以使用表面EMG理解未发音的语音。
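
For readers unfamiliar with the metric reported above, word error rate (WER) is conventionally the word-level edit distance divided by the reference length. The sketch below is a generic implementation of that definition, not the authors' evaluation code; the function name is hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)
```

A WER of 0.49 therefore means roughly one word-level edit for every two reference words.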

Title: Lossless Token Sequence Compression via Meta-Tokens

Authors: John Harvill, Ziwei Fan, Hao Wang, Yizhou Sun, Hao Ding, Luke Huan, Anoop Deoras
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.00307
Pdf URL: https://arxiv.org/pdf/2506.00307
Copy Paste: [[2506.00307]] Lossless Token Sequence Compression via Meta-Tokens(https://arxiv.org/abs/2506.00307)
Keywords: language model, llm, prompt
Abstract: Existing work on prompt compression for Large Language Models (LLM) focuses on lossy methods that try to maximize the retention of semantic information that is relevant to downstream tasks while significantly reducing the sequence length. In this paper, we introduce a task-agnostic lossless compression technique similar to LZ77 that makes it possible to reduce the input token sequence length on average by 27% and 18% for the two evaluation tasks explored here. Given that we use transformer-based LLMs, this equates to 47% and 33% less encoding computation, respectively, due to the quadratic nature of attention. The token sequence transformation is trivial to reverse and highlights that no semantic information is lost in the process. We evaluate our proposed approach on two tasks that require strict preservation of semantics/syntax and demonstrate that existing lossy compression methods perform poorly in this setting. We find that our lossless compression technique produces only a small gap in performance compared to using the uncompressed input and posit that larger models and an expanded computing budget would likely erase the gap entirely.
摘要：关于大型语言模型（LLM）迅速压缩的现有工作着重于有损方法，这些方法试图最大程度地提高与下游任务相关的语义信息的保留，同时大大降低了序列长度。在本文中，我们引入了类似于LZ77的任务无损耗压缩技术，这使得在此探索的两个评估任务中，可以平均将输入令牌序列长度降低27 \％，而为18 \％。鉴于我们使用基于变压器的LLM，因此由于注意力的二次性质，这分别为47 \％和33 \％的编码计算。令牌序列转换是微不足道的，可以逆转，并强调在此过程中不会丢失语义信息。我们对需要严格保存语义/语法的两项任务评估我们的方法，并证明在这种情况下现有的损耗压缩方法的性能较差。我们发现，与使用未压缩的输入相比，我们的无损压缩技术仅产生较小的性能差距，并认为较大的模型和扩展的计算预算可能会完全消除差距。
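
The step from sequence-length reduction to compute reduction in the abstract above follows from attention cost scaling with the square of the sequence length. The short sketch below reproduces that arithmetic; the function name is hypothetical, and only the quadratic attention term is modelled (an assumption, since other layers scale linearly).

```python
def quadratic_compute_saving(length_reduction):
    """Fraction of quadratic (attention) compute saved when the sequence shrinks."""
    remaining_length = 1.0 - length_reduction
    return 1.0 - remaining_length ** 2


for reduction in (0.27, 0.18):
    saving = quadratic_compute_saving(reduction)
    print(f"{reduction:.0%} shorter -> {saving:.0%} less attention compute")
```

For example, a 27% shorter sequence leaves 0.73 of the length, and 1 - 0.73² ≈ 0.47.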

Title: An evaluation of LLMs for generating movie reviews: GPT-4o, Gemini-2.0 and DeepSeek-V3

Authors: Brendan Sands, Yining Wang, Chenhao Xu, Yuxuan Zhou, Lai Wei, Rohitash Chandra
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.00312
Pdf URL: https://arxiv.org/pdf/2506.00312
Copy Paste: [[2506.00312]] An evaluation of LLMs for generating movie reviews: GPT-4o, Gemini-2.0 and DeepSeek-V3(https://arxiv.org/abs/2506.00312)
Keywords: language model, gpt, llm
Abstract: Large language models (LLMs) have been prominent in various tasks, including text generation and summarisation. The applicability of LLMs to the generation of product reviews is gaining momentum, paving the way for the generation of movie reviews. In this study, we propose a framework that generates movie reviews using three LLMs (GPT-4o, DeepSeek-V3, and Gemini-2.0), and evaluate their performance by comparing the generated outputs with IMDb user reviews. We use movie subtitles and screenplays as input to the LLMs and investigate how they affect the quality of reviews generated. We review the LLM-based movie reviews in terms of vocabulary, sentiment polarity, similarity, and thematic consistency in comparison to IMDb user reviews. The results demonstrate that LLMs are capable of generating syntactically fluent and structurally complete movie reviews. Nevertheless, there is still a noticeable gap in emotional richness and stylistic coherence between LLM-generated and IMDb reviews, suggesting that further refinement is needed to improve the overall quality of movie review generation. We provided a survey-based analysis where participants were asked to distinguish between LLM and IMDb user reviews. The results show that LLM-generated reviews are difficult to distinguish from IMDb user reviews. We found that DeepSeek-V3 produced the most balanced reviews, closely matching IMDb reviews. GPT-4o overemphasised positive emotions, while Gemini-2.0 captured negative emotions better but showed excessive emotional intensity.
摘要：大型语言模型（LLM）在各种任务中都很突出，包括文本生成和摘要。 LLM在产品评论的生成中的适用性正在增强动力，为电影评论的生成铺平了道路。在这项研究中，我们提出了一个框架，该框架使用三个LLM（GPT-4O，DeepSeek-V3和Gemini-2.0）生成电影评论，并通过将生成的输出与IMDB用户评论进行比较来评估其性能。我们使用电影字幕和剧本作为LLM的输入，并研究它们如何影响生成的评论质量。与IMDB用户评论相比，我们从词汇，情感极性，相似性和主题一致性方面回顾了基于LLM的电影评论。结果表明，LLM能够产生句法流利和结构完整的电影评论。然而，LLM生成和IMDB评论之间的情感丰富性和风格连贯性仍然存在明显的差距，这表明需要进一步的完善来提高电影评论的整体质量。我们提供了基于调查的分析，其中告诉参与者以区分LLM和IMDB用户评论。结果表明，LLM生成的评论很难与IMDB用户评论区分开。我们发现DeepSeek-V3产生了最平衡的评论，与IMDB的评论非常匹配。 GPT-4O过度强调积极的情绪，而Gemini-2.0则更好地捕捉了负面情绪，但表现出了过度的情绪强度。

Title: SkillVerse : Assessing and Enhancing LLMs with Tree Evaluation

Authors: Yufei Tian, Jiao Sun, Nanyun Peng, Zizhao Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.00319
Pdf URL: https://arxiv.org/pdf/2506.00319
Copy Paste: [[2506.00319]] SkillVerse : Assessing and Enhancing LLMs with Tree Evaluation(https://arxiv.org/abs/2506.00319)
Keywords: language model, llm
Abstract: As language models evolve to tackle complex, multifaceted tasks, their evaluation must adapt to capture this intricacy. A granular, skill-specific understanding of model capabilities can empower researchers to make informed model development plans. In this paper, we introduce SkillVerse, an unsupervised tree-structured diagnosis framework for understanding model proficiency in specific abilities. With LLM as a judge, SkillVerse first critiques the model responses, and then organizes them into a hierarchical structure termed a dendrogram. Because it reports proficiency at arbitrary levels of granularity, SkillVerse can flexibly produce insights into the behaviors of modern large models. We also demonstrate its efficacy in two downstream tasks: 1) improving model in-context learning by 25% using a tree-search algorithm to select more informative few-shot demonstrations, and 2) accurately predicting new model weaknesses with a 55% success rate, 22% higher than without SkillVerse.
摘要：随着语言模型的发展以应对复杂，多方面的任务，他们的评估必须适应以捕获这种复杂性。对模型能力的细粒度，特定于技能的理解可以使研究人员能够制定知情的模型开发计划。在本文中，我们介绍了Skillverse，这是一个无监督的树结构诊断框架，以了解特定能力的模型熟练度。 Skillverse首先将LLM作为法官作为法官，然后将模型响应批评，然后将其组织成一个称为树状图的层次结构。鉴于在任意水平的粒度水平上，Skillverse可以灵活地产生现代大型模型行为的见解。我们还在两个下游任务中证明了它的功效：1）使用树搜索算法将模型中文学习提高25％，以选择更有用的几次播放示范，而2）准确预测成功率55％的新模型弱点，而没有技能较高的弱点为22％。

Title: TreeRare: Syntax Tree-Guided Retrieval and Reasoning for Knowledge-Intensive Question Answering

Authors: Boyi Zhang, Zhuo Liu, Hangfeng He
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.00331
Pdf URL: https://arxiv.org/pdf/2506.00331
Copy Paste: [[2506.00331]] TreeRare: Syntax Tree-Guided Retrieval and Reasoning for Knowledge-Intensive Question Answering(https://arxiv.org/abs/2506.00331)
Keywords: language model, llm
Abstract: In real practice, questions are typically complex and knowledge-intensive, requiring Large Language Models (LLMs) to recognize the multifaceted nature of the question and reason across multiple information sources. Iterative and adaptive retrieval, where LLMs decide when and what to retrieve based on their reasoning, has been shown to be a promising approach to resolve complex, knowledge-intensive questions. However, the performance of such retrieval frameworks is limited by the accumulation of reasoning errors and misaligned retrieval results. To overcome these limitations, we propose TreeRare (Syntax Tree-Guided Retrieval and Reasoning), a framework that utilizes syntax trees to guide information retrieval and reasoning for question answering. Following the principle of compositionality, TreeRare traverses the syntax tree in a bottom-up fashion, and in each node, it generates subcomponent-based queries and retrieves relevant passages to resolve localized uncertainty. A subcomponent question answering module then synthesizes these passages into concise, context-aware evidence. Finally, TreeRare aggregates the evidence across the tree to form a final answer. Experiments across five question answering datasets involving ambiguous or multi-hop reasoning demonstrate that TreeRare achieves substantial improvements over existing state-of-the-art methods.
摘要：在实际实践中，问题通常是复杂且知识密集的，需要大型语言模型（LLMS）来识别多个信息源中问题和原因的多方面性质。 LLMS决定何时以及根据其推理来检索的迭代和适应性检索，已被证明是解决复杂，知识密集的问题的一种有希望的方法。但是，这种检索框架的性能受到推理错误的积累和未对准的检索结果的限制。为了克服这些局限性，我们提出了Treerare（语法树引导的检索和推理），该框架利用语法树来指导信息检索和提问回答的推理。遵循构图的原理，以自下而上的方式遍历语法树，并且在每个节点中，它生成基于子组件的查询并检索相关段落以解决局部不确定性。然后，一个子组件问题回答模块将这些段落综合为简洁的上下文意识证据。最后，特雷雷（Treerare）汇总了整个树上的证据，形成了最终答案。涉及模棱两可或多跳的推理的五个问题回答数据集进行的实验表明，特雷雷尔对现有最新方法的实质改进。

Title: Disentangling Codemixing in Chats: The NUS ABC Codemixed Corpus

Authors: Svetlana Churina, Akshat Gupta, Insyirah Mujtahid, Kokil Jaidka
Subjects: cs.CL, cs.SI
Abstract URL: https://arxiv.org/abs/2506.00332
Pdf URL: https://arxiv.org/pdf/2506.00332
Copy Paste: [[2506.00332]] Disentangling Codemixing in Chats: The NUS ABC Codemixed Corpus(https://arxiv.org/abs/2506.00332)
Keywords: chat
Abstract: Code-mixing involves the seamless integration of linguistic elements from multiple languages within a single discourse, reflecting natural multilingual communication patterns. Despite its prominence in informal interactions such as social media, chat messages and instant-messaging exchanges, there has been a lack of publicly available corpora that are author-labeled and suitable for modeling human conversations and relationships. This study introduces the first labeled and general-purpose corpus for understanding code-mixing in context while maintaining rigorous privacy and ethical standards. Our live project will continuously gather, verify, and integrate code-mixed messages into a structured dataset released in JSON format, accompanied by detailed metadata and linguistic statistics. To date, it includes over 355,641 messages spanning various code-mixing patterns, with a primary focus on English, Mandarin, and other languages. We expect the Codemix Corpus to serve as a foundational dataset for research in computational linguistics, sociolinguistics, and NLP applications.
摘要：混合代码涉及单个话语中多种语言的语言元素的无缝集成，反映了自然的多语言交流模式。尽管在社交媒体，聊天消息和即时交流等非正式互动中具有突出性，但缺乏被作者标记的公共信息，适合对人类的对话和人际关系进行建模。这项研究介绍了第一个标记和通用语料库，用于在上下文中理解混合代码，同时保持严格的隐私和道德标准。我们的实时项目将不断收集，验证和集成混合的消息纳入以JSON格式发布的结构化数据集，并伴随着详细的元数据和语言统计。迄今为止，它包括超过355,641条跨越各种代码混合模式的消息，主要关注英语，普通话和其他语言。我们希望Codemix语料库成为计算语言学，社会语言学和NLP应用程序研究的基础数据集。

Title: Beyond Context to Cognitive Appraisal: Emotion Reasoning as a Theory of Mind Benchmark for Large Language Models

Authors: Gerard Christopher Yeo, Kokil Jaidka
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.00334
Pdf URL: https://arxiv.org/pdf/2506.00334
Copy Paste: [[2506.00334]] Beyond Context to Cognitive Appraisal: Emotion Reasoning as a Theory of Mind Benchmark for Large Language Models(https://arxiv.org/abs/2506.00334)
Keywords: language model, llm
Abstract: Datasets used for emotion recognition tasks typically contain overt cues that can be used in predicting the emotions expressed in a text. However, one challenge is that texts sometimes contain covert contextual cues that are rich in affective semantics, which warrant higher-order reasoning abilities to infer emotional states, not simply the emotions conveyed. This study advances beyond surface-level perceptual features to investigate how large language models (LLMs) reason about others' emotional states using contextual information, within a Theory-of-Mind (ToM) framework. Grounded in Cognitive Appraisal Theory, we curate a specialized ToM evaluation dataset to assess both forward reasoning (from context to emotion) and backward reasoning (from emotion to inferred context). We showed that LLMs can reason to a certain extent, although they are poor at associating situational outcomes and appraisals with specific emotions. Our work highlights the need for psychological theories in the training and evaluation of LLMs in the context of emotion reasoning.
摘要：用于情绪识别任务的数据集通常包含可用于预测文本中表达的情绪的明显线索。但是，一个挑战是，文本有时包含具有情感语义丰富的秘密上下文提示，这些线索需要高阶推理能力来推断情绪状态，而不仅仅是传达的情绪。这项研究超出了表面级别的感知特征，以调查使用情绪状态的大型语言模型（LLM）在一个框架（TOM）框架内使用上下文信息的原因。以认知评估理论为基础，我们策划了一个专业的TOM评估数据集，以评估从情感到被推断的上下文的远期推理 - 从上下文到情感推理。我们表明，LLM可以在一定程度上进行推理，尽管它们在将情境成果和评估与特定情绪相关联的情况下很差。我们的工作强调了在情感推理的LLM培训和评估中对心理理论的需求。

Title: Efficient Latent Semantic Clustering for Scaling Test-Time Computation of LLMs

Authors: Sungjae Lee, Hoyoung Kim, Jeongyeon Hwang, Eunhyeok Park, Jungseul Ok
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.00344
Pdf URL: https://arxiv.org/pdf/2506.00344
Copy Paste: [[2506.00344]] Efficient Latent Semantic Clustering for Scaling Test-Time Computation of LLMs(https://arxiv.org/abs/2506.00344)
Keywords: language model, llm
Abstract: Scaling test-time computation--generating and analyzing multiple or sequential outputs for a single input--has become a promising strategy for improving the reliability and quality of large language models (LLMs), as evidenced by advances in uncertainty quantification and multi-step reasoning. A key shared component is semantic clustering, which groups outputs that differ in form but convey the same meaning. Semantic clustering enables estimation of the distribution over the semantics of outputs and helps avoid redundant exploration of reasoning paths. However, existing approaches typically rely on external models, which introduce substantial computational overhead and often fail to capture context-aware semantics. We propose Latent Semantic Clustering (LSC), a lightweight and context-sensitive method that leverages the generator LLM's internal hidden states for clustering, eliminating the need for external models. Our extensive experiments across various LLMs and datasets show that LSC significantly improves the computational efficiency of test-time scaling while maintaining or exceeding the performance of existing methods.
摘要：缩放测试时间计算 - 单个输入的多个或顺序输出生成和分析 - 成为提高大语言模型（LLMS）的可靠性和质量的有前途的策略，这证明了不确定性量化和多步骤推理的进步。一个关键的共享组件是语义聚类，该组合的输出在形式上有所不同，但传达了相同的含义。语义聚类可以估计输出语义上的分布，并有助于避免对推理路径的冗余探索。但是，现有方法通常依赖于外部模型，这些模型引入了大量的计算开销，并且通常无法捕获上下文感知的语义。我们提出了潜在的语义聚类（LSC），这是一种轻巧和上下文敏感的方法，利用发电机LLM的内部隐藏状态进行聚类，从而消除了对外部模型的需求。我们在各种LLM和数据集中进行的广泛实验表明，LSC在维持或超过现有方法的性能的同时显着提高了测试时间缩放的计算效率。
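
The idea of grouping outputs by meaning using vectors the generator already produced can be sketched as follows. This is a generic greedy cosine-similarity clustering, not the LSC algorithm itself, whose details the abstract does not give; the function names, the threshold, and the toy 2-D "hidden states" are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def greedy_cluster(vectors, threshold=0.9):
    """Assign each vector to the first cluster whose representative is similar enough."""
    representatives = []  # index of the first member of each cluster
    labels = []
    for index, vector in enumerate(vectors):
        for label, rep_index in enumerate(representatives):
            if cosine(vector, vectors[rep_index]) >= threshold:
                labels.append(label)
                break
        else:
            representatives.append(index)
            labels.append(len(representatives) - 1)
    return labels


# Toy stand-ins for per-output hidden states: two paraphrases and one different answer.
labels = greedy_cluster([[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]])
```

Because the vectors come from the generator's own forward pass in this setup, no separate embedding or entailment model is called.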

Title: Adaptive-VP: A Framework for LLM-Based Virtual Patients that Adapts to Trainees' Dialogue to Facilitate Nurse Communication Training

Authors: Keyeun Lee, Seolhee Lee, Esther Hehsun Kim, Yena Ko, Jinsu Eun, Dahee Kim, Hyewon Cho, Haiyi Zhu, Robert E. Kraut, Eunyoung Suh, Eun-mee Kim, Hajin Lim
Subjects: cs.CL, cs.HC
Abstract URL: https://arxiv.org/abs/2506.00386
Pdf URL: https://arxiv.org/pdf/2506.00386
Copy Paste: [[2506.00386]] Adaptive-VP: A Framework for LLM-Based Virtual Patients that Adapts to Trainees' Dialogue to Facilitate Nurse Communication Training(https://arxiv.org/abs/2506.00386)
Keywords: language model, llm
Abstract: Effective communication training is essential to preparing nurses for high-quality patient care. While standardized patient (SP) simulations provide valuable experiential learning, they are often costly and inflexible. Virtual patient (VP) systems offer a scalable alternative, but most fail to adapt to the varying communication skills of trainees. In particular, when trainees respond ineffectively, VPs should escalate in hostility or become uncooperative--yet this level of adaptive interaction remains largely unsupported. To address this gap, we introduce Adaptive-VP, a VP dialogue generation framework that leverages large language models (LLMs) to dynamically adapt VP behavior based on trainee input. The framework features a pipeline for constructing clinically grounded yet flexible VP scenarios and a modular system for assessing trainee communication and adjusting VP responses in real time, while ensuring learner safety. We validated Adaptive-VP by simulating challenging patient conversations. Automated evaluation using a corpus from practicing nurses showed that our communication skill evaluation mechanism reflected real-world proficiency levels. Expert nurses further confirmed that Adaptive-VP produced more natural and realistic interactions than existing approaches, demonstrating its potential as a scalable and effective tool for nursing communication training.
摘要：有效的沟通培训对于为护士准备高质量的患者护理至关重要。尽管标准化的患者（SP）模拟提供了有价值的体验学习，但它们通常是昂贵且不灵活的。虚拟患者（VP）系统提供了可扩展的替代方案，但大多数无法适应受训者的沟通技巧。特别是，当学员反应无效时，VP应在敌对行动中升级或变得不合作 - 但这种自适应互动的水平仍然很大程度上不受支持。为了解决这一差距，我们引入了Adaptive-VP，这是一个副总裁对话生成框架，该框架利用大型语言模型（LLMS）根据受训者输入动态调整VP行为。该框架具有一条管道，用于构建临床接地但灵活的VP方案以及一个模块化系统，用于评估受训者的沟通和实时调整VP响应，同时确保学习者安全。我们通过模拟有挑战性的患者对话来验证自适应-VP。使用练习护士的语料库进行自动评估表明，我们的沟通技能评估机制反映了现实世界的能力水平。专家护士进一步证实，与现有方法相比，自适应-VP产生了更自然和现实的互动，这表明了它是护理传播培训的可扩展和有效工具的潜力。

Title: SHARE: An SLM-based Hierarchical Action CorREction Assistant for Text-to-SQL

Authors: Ge Qu, Jinyang Li, Bowen Qin, Xiaolong Li, Nan Huo, Chenhao Ma, Reynold Cheng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.00391
Pdf URL: https://arxiv.org/pdf/2506.00391
Copy Paste: [[2506.00391]] SHARE: An SLM-based Hierarchical Action CorREction Assistant for Text-to-SQL(https://arxiv.org/abs/2506.00391)
Keywords: language model, llm
Abstract: Current self-correction approaches in text-to-SQL face two critical limitations: 1) Conventional self-correction methods rely on recursive self-calls of LLMs, resulting in multiplicative computational overhead, and 2) LLMs struggle to implement effective error detection and correction for declarative SQL queries, as they fail to demonstrate the underlying reasoning path. In this work, we propose SHARE, an SLM-based Hierarchical Action corREction assistant that enables LLMs to perform more precise error localization and efficient correction. SHARE orchestrates three specialized Small Language Models (SLMs) in a sequential pipeline, where it first transforms declarative SQL queries into stepwise action trajectories that reveal underlying reasoning, followed by a two-phase granular refinement. We further propose a novel hierarchical self-evolution strategy for data-efficient training. Experimental results demonstrate that SHARE effectively enhances self-correction capabilities while proving robust across various LLMs. Furthermore, our comprehensive analysis shows that SHARE maintains strong performance even in low-resource training settings, which is particularly valuable for text-to-SQL applications with data privacy constraints.
摘要：文本到SQL中的当前自我纠正方法面临两个临界局限性：1）常规的自我纠正方法依赖于LLM的递归自我呼叫，从而导致了多种计算开销，而2）LLMS难以实施有效的误差检测和校正来证明声明性的SQL查询，因为它们未能表现出基本的推理路径。在这项工作中，我们提出了共享，这是一种基于SLM的分层校正助手，使LLMS能够执行更精确的错误定位和有效的校正。 Share在顺序管道中协调了三个专业的小语言模型（SLM），在该管道中，它首先将声明性的SQL查询转换为逐步的动作轨迹，这些轨迹揭示了潜在的推理，然后进行了两相颗粒状的细化。我们进一步提出了一种新型的分层自我进化策略，用于数据有效培训。实验结果表明，共享有效地增强了自我纠正能力，同时证明了各种LLM的稳健功能。此外，我们的全面分析表明，即使在低资源培训设置中，共享也保持强劲的性能，这对于具有数据隐私限制的文本到SQL应用程序特别有价值。

Title: Speculative Reward Model Boosts Decision Making Ability of LLMs Cost-Effectively

Authors: Jiawei Gu, Shangsong Liang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.00396
Pdf URL: https://arxiv.org/pdf/2506.00396
Copy Paste: [[2506.00396]] Speculative Reward Model Boosts Decision Making Ability of LLMs Cost-Effectively(https://arxiv.org/abs/2506.00396)
Keywords: language model, llm
Abstract: Effective decision-making in Large Language Models (LLMs) is essential for handling intricate tasks. However, existing approaches prioritize performance but often overlook the balance between effectiveness and computational cost. To address this, we first introduce the 3E Criteria to systematically assess the cost-effectiveness of search strategies, revealing that existing methods often trade significant efficiency for marginal performance gains. To improve LLM decision-making while maintaining efficiency, we propose the Speculative Reward Model (SRM), a plug-and-play framework that seamlessly integrates with existing search strategies. Specifically, SRM employs an external reward assigner to predict optimal actions, reducing reliance on LLMs' internal self-evaluation. A speculative verification mechanism is then used to prune suboptimal choices and guide the search toward more promising steps. We evaluate SRM on several complex decision-making tasks including mathematical reasoning, planning and numerical reasoning in specialized domains. Experimental results show that SRM reduces costs to 1/10 of the original search framework on average while maintaining effectiveness.
摘要：大型语言模型（LLM）中有效的决策对于处理复杂的任务至关重要。但是，现有方法优先考虑绩效，但通常会忽略有效性和计算成本之间的平衡。为了解决这个问题，我们首先介绍3E标准，以系统地评估搜索策略的成本效益，表明现有方法通常会以极大的效率来获得边际性能提高。为了在保持效率的同时提高LLM的决策，我们提出了投机奖励模型（SRM），这是一个无缝与现有搜索策略无缝集成的插件框架。具体而言，SRM采用外部奖励分配者来预测最佳行动，从而减少了对LLMS内部自我评估的依赖。并使用推测性验证机制来修剪次优选择，并指导搜索采取更有前途的步骤。我们评估SRM的几项复杂的决策任务，包括在专业领域中的数学推理，计划和数值推理。实验结果表明，SRM平均将成本降低到原始搜索框架的1/10，同时保持有效性。

Title: Scaling Textual Gradients via Sampling-Based Momentum

Authors: Zixin Ding, Junyuan Hong, Jiachen T. Wang, Zinan Lin, Zhangyang Wang, Yuxin Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.00400
Pdf URL: https://arxiv.org/pdf/2506.00400
Copy Paste: [[2506.00400]] Scaling Textual Gradients via Sampling-Based Momentum(https://arxiv.org/abs/2506.00400)
Keywords: language model, llm, prompt
Abstract: As prompts play an increasingly critical role in large language models (LLMs), optimizing textual prompts has become a crucial challenge. The Textual Gradient Descent (TGD) framework has emerged as a promising data-driven approach that iteratively refines textual prompts using LLM-suggested updates (or textual gradients) over minibatches of training samples. In this paper, we empirically demonstrate that scaling the number of training examples initially improves but later degrades TGD's performance across multiple downstream NLP tasks. However, while data scaling improves results for most tasks, it also significantly increases the computational cost when leveraging LLMs. To address this, we draw inspiration from numerical gradient descent and propose Textual Stochastic Gradient Descent with Momentum (TSGD-M), a method that facilitates scalable in-context learning by reweighting prompt sampling based on past batch distributions. Across nine NLP tasks spanning three domains (including BIG-Bench Hard (BBH), natural language understanding tasks, and reasoning tasks), TSGD-M significantly outperforms TGD baselines that do not incorporate reweighted sampling, while also reducing variance in most tasks.
摘要：随着提示在大语言模型（LLM）中发挥越来越重要的作用，优化文本提示已成为一个至关重要的挑战。文本梯度下降（TGD）框架已成为一种有希望的数据驱动方法，它使用LLM-建议的更新（或文本梯度）在训练样本的小型培训样本中迭代地完善文本提示。在本文中，我们从经验上证明，扩展培训示例的数量最初会有所改善，但后来降低了TGD在多个下游NLP任务中的性能。但是，尽管数据扩展改善了大多数任务的结果，但在利用LLMS时，它也会显着增加计算成本。为了解决这个问题，我们从数值梯度下降中汲取灵感，并提出具有动量（TSGD-M）的文本随机梯度下降 - 一种方法，通过基于过去批次分布的重新重量提示提示来促进可扩展的内在学习。在跨越三个领域的九个NLP任务中，包括大基础强硬（BBH），自然语言理解任务和推理任务 - TSGD -M明显胜过不包括重新级别的抽样的TGD基准，而在大多数任务中也降低了差异。

Title: Accelerating Diffusion LLMs via Adaptive Parallel Decoding

Authors: Daniel Israel, Guy Van den Broeck, Aditya Grover
Subjects: cs.CL, cs.AI, cs.LG, cs.PF
Abstract URL: https://arxiv.org/abs/2506.00413
Pdf URL: https://arxiv.org/pdf/2506.00413
Copy Paste: [[2506.00413]] Accelerating Diffusion LLMs via Adaptive Parallel Decoding(https://arxiv.org/abs/2506.00413)
Keywords: language model, llm
Abstract: The generation speed of LLMs is bottlenecked by autoregressive decoding, where tokens are predicted sequentially one by one. Alternatively, diffusion large language models (dLLMs) theoretically allow for parallel token generation, but in practice struggle to achieve the speed of autoregressive models without significantly sacrificing quality. We therefore introduce adaptive parallel decoding (APD), a novel method that dynamically adjusts the number of tokens sampled in parallel. We achieve this by defining a multiplicative mixture between the dLLM marginal probabilities and the joint probability of sequences under a small auxiliary autoregressive model. This inverts the standard setup of speculative decoding, where the goal is to sample from a large autoregressive verifier by drafting from a smaller model. We further optimize APD by enabling KV caching and limiting the size of the masked input. Altogether, our method puts forward three tunable parameters to flexibly trade off throughput and quality. We show that APD provides markedly higher throughput with minimal quality degradations on downstream benchmarks.
摘要：LLM的生成速度被自回归解码瓶颈，在该解码中，令牌被依次依次预测。另外，从理论上讲，扩散大语言模型（DLLM）允许平行代币产生，但实际上，在不牺牲质量的情况下努力实现自回归模型的速度。因此，我们引入了自适应平行解码（APD），这是一种动态调整并行采样的令牌数量的新颖方法。我们通过在小辅助自动回归模型下定义DLLM边缘概率和序列的关节概率之间的乘法混合物来实现这一目标。这将颠倒投机解码的标准设置，其目标是通过从较小的型号起草来从大型自回旋验证器中采样。我们通过启用KV缓存并限制蒙版输入的大小来进一步优化APD。总的来说，我们的方法提出了三个可调参数，以灵活地折衷吞吐量和质量。我们表明，APD在下游基准测试上提供了明显更高的吞吐量，质量降低最小。
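
A multiplicative mixture of two distributions, as named in the abstract above, can be sketched for a single token position. The weighted-geometric form below, the function name, and the `weight` parameter are assumptions chosen for illustration; the abstract does not state the exact parameterization, and the real method operates on sequences rather than one position.

```python
def multiplicative_mixture(p_first, p_second, weight):
    """Normalized product p_first**weight * p_second**(1 - weight) over a shared vocabulary."""
    unnormalized = [
        (a ** weight) * (b ** (1.0 - weight)) for a, b in zip(p_first, p_second)
    ]
    total = sum(unnormalized)
    return [value / total for value in unnormalized]


# Toy two-token vocabulary: a flat distribution mixed with a confident one.
mixed = multiplicative_mixture([0.5, 0.5], [0.9, 0.1], weight=0.5)
```

Setting `weight` to 1.0 recovers the first distribution, and lowering it lets the second distribution sharpen or veto tokens.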

Title: Dual Debiasing for Noisy In-Context Learning for Text Generation

Authors: Siqi Liang, Sumyeong Ahn, Paramveer S. Dhillon, Jiayu Zhou
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.00418
Pdf URL: https://arxiv.org/pdf/2506.00418
Copy Paste: [[2506.00418]] Dual Debiasing for Noisy In-Context Learning for Text Generation(https://arxiv.org/abs/2506.00418)
Keywords: language model, llm
Abstract: In-context learning (ICL) relies heavily on high-quality demonstrations drawn from large annotated corpora. Existing approaches detect noisy annotations by ranking local perplexities, presuming that noisy samples yield higher perplexities than their clean counterparts. However, this assumption breaks down when the noise ratio is high and many demonstrations are flawed. We reexamine the perplexity-based paradigm for text generation under noisy annotations, highlighting two sources of bias in perplexity: the annotation itself and the domain-specific knowledge inherent in large language models (LLMs). To overcome these biases, we introduce a dual debiasing framework that uses synthesized neighbors to explicitly correct perplexity estimates, yielding a robust Sample Cleanliness Score. This metric uncovers absolute sample cleanliness regardless of the overall corpus noise level. Extensive experiments demonstrate our method's superior noise detection capabilities and show that its final ICL performance is comparable to that of a fully clean demonstration corpus. Moreover, our approach remains robust even when noise ratios are extremely high.
摘要：在上下文中，学习（ICL）在很大程度上依赖于大型注释语料库中提取的高质量示范。现有方法通过对局部困惑进行排名来检测嘈杂的注释，假定嘈杂的样本比清洁的同行产生的困惑更高。但是，当噪声比高并且许多示范存在缺陷时，此假设会分解。我们重新检查了基于困惑的范式在嘈杂的注释下产生文本生成的范式，突出了困惑性的两个偏见来源：注释本身和大型语言模型（LLMS）固有的域特定知识。为了克服这些偏见，我们引入了一个双重偏见的框架，该框架使用合成的邻居明确纠正了困惑估计，从而产生了强大的样本清洁度评分。无论总体噪声水平如何，该度量都会发现绝对样品清洁度。广泛的实验证明了我们方法的出色噪声检测能力，并表明其最终ICL性能与完全清洁的演示语料库相当。此外，即使噪声比极高，我们的方法仍然保持强大。
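
The perplexity-ranking baseline that the abstract above critiques can be sketched as follows. This shows only the standard perplexity definition and a naive ranking; it is not the dual debiasing method. The function names and the toy log-probabilities are hypothetical.

```python
import math


def perplexity(token_logprobs):
    """exp of the negative mean token log-probability (natural log)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def rank_by_perplexity(demonstrations):
    """Return demonstration names, highest perplexity (most suspicious) first."""
    return sorted(demonstrations,
                  key=lambda name: perplexity(demonstrations[name]),
                  reverse=True)


# Toy per-token log-probabilities for two annotated demonstrations.
ranked = rank_by_perplexity({
    "demo_clean": [math.log(0.9), math.log(0.8)],
    "demo_noisy": [math.log(0.2), math.log(0.1)],
})
```

The ranking is purely relative, which is why it becomes unreliable when most of the corpus is noisy.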

Title: Enabling Chatbots with Eyes and Ears: An Immersive Multimodal Conversation System for Dynamic Interactions

Authors: Jihyoung Jang, Minwook Bae, Minji Kim, Dilek Hakkani-Tur, Hyounghun Kim
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2506.00421
Pdf URL: https://arxiv.org/pdf/2506.00421
Copy Paste: [[2506.00421]] Enabling Chatbots with Eyes and Ears: An Immersive Multimodal Conversation System for Dynamic Interactions(https://arxiv.org/abs/2506.00421)
Keywords: chat, agent
Abstract: As chatbots continue to evolve toward human-like, real-world interactions, multimodality remains an active area of research and exploration. So far, efforts to integrate multimodality into chatbots have primarily focused on image-centric tasks, such as visual dialogue and image-based instructions, placing emphasis on the "eyes" of human perception while neglecting the "ears", namely auditory aspects. Moreover, these studies often center around static interactions that focus on discussing the modality rather than naturally incorporating it into the conversation, which limits the richness of simultaneous, dynamic engagement. Furthermore, while multimodality has been explored in multi-party and multi-session conversations, task-specific constraints have hindered its seamless integration into dynamic, natural conversations. To address these challenges, this study aims to equip chatbots with "eyes and ears" capable of more immersive interactions with humans. As part of this effort, we introduce a new multimodal conversation dataset, Multimodal Multi-Session Multi-Party Conversation ($M^3C$), and propose a novel multimodal conversation model featuring multimodal memory retrieval. Our model, trained on the $M^3C$, demonstrates the ability to seamlessly engage in long-term conversations with multiple speakers in complex, real-world-like settings, effectively processing visual and auditory inputs to understand and respond appropriately. Human evaluations highlight the model's strong performance in maintaining coherent and dynamic interactions, demonstrating its potential for advanced multimodal conversational agents.
摘要：随着聊天机器人继续发展到类似人类的现实世界，相互作用，多模式仍然是研究和探索的积极领域。到目前为止，将多模式整合到聊天机器人中的努力主要集中在以图像为中心的任务上，例如视觉对话和基于图像的说明，重点是人类感知的“眼睛”，同时忽略了“耳朵”，即听觉方面。此外，这些研究通常围绕着静态相互作用，这些静态相互作用的重点是讨论这种方式，而不是自然地将其纳入对话中，从而限制了同时，动态参与的丰富性。此外，尽管在多方和多主题对话中探讨了多模式，但特定于任务的约束阻碍了其无缝集成到动态自然对话中。为了应对这些挑战，本研究旨在为聊天机器人配备能够与人类进行更身临其境互动的“眼睛和耳朵”。作为这项工作的一部分，我们介绍了一个新的多模式对话数据集，多模式多主题多方对话（$ m^3c $），并提出了一个新颖的多模式对话模型，其中包含多模式内存检索。我们的模型接受了$ M^3C $的培训，它表明了在复杂的，类似于现实世界的环境中与多个扬声器进行长期对话的能力，有效地处理了视觉和听觉输入以理解和响应。人类评估突出了该模型在保持连贯和动态相互作用方面的强劲表现，证明了其对先进的多模式对话剂的潜力。

Title: Inter-Passage Verification for Multi-evidence Multi-answer QA

Authors: Bingsen Chen, Shengjie Wang, Xi Ye, Chen Zhao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.00425
Pdf URL: https://arxiv.org/pdf/2506.00425
Copy Paste: [[2506.00425]] Inter-Passage Verification for Multi-evidence Multi-answer QA(https://arxiv.org/abs/2506.00425)
Keywords: retrieval-augmented generation
Abstract: Multi-answer question answering (QA), where questions can have many valid answers, presents a significant challenge for existing retrieval-augmented generation-based QA systems, as these systems struggle to retrieve and then synthesize a large number of evidence passages. To tackle these challenges, we propose a new multi-answer QA framework -- Retrieval-augmented Independent Reading with Inter-passage Verification (RI$^2$VER). Our framework retrieves a large set of passages and processes each passage individually to generate an initial high-recall but noisy answer set. Then we propose a new inter-passage verification pipeline that validates every candidate answer through (1) Verification Question Generation, (2) Gathering Additional Evidence, and (3) Verification with inter-passage synthesis. Evaluations on the QAMPARI and RoMQA datasets demonstrate that our framework significantly outperforms existing baselines across various model sizes, achieving an average F1 score improvement of 11.17%. Further analysis validates that our inter-passage verification pipeline enables our framework to be particularly beneficial for questions requiring multi-evidence synthesis.
摘要：多回答问题答案（QA），问题可以有许多有效的答案，这对现有的基于检索的基于生成的质量质量标准系统提出了重大挑战，因为这些系统难以检索并综合大量证据。为了应对这些挑战，我们提出了一个新的多回答质量检查框架 - 检索带有通用验证的独立阅读（RI $^2 $ VER）。我们的框架可以单独检索每个段落的大量段落和处理，以生成初始的高回调但嘈杂的答案集。然后，我们提出了一个新的通信验证管道，该管道通过（1）验证问题生成验证每个候选人的答案，（2）收集额外的证据，以及（3）使用邮政间合成的验证。对Qampari和Romqa数据集的评估表明，我们的框架在各种型号尺寸上的现有基准的表现显着优于现有基线，从而达到了平均F1得分提高11.17％。进一步的分析验证了我们的通行间验证管道使我们的框架对需要多个证据综合的问题特别有益。

Title: G2S: A General-to-Specific Learning Framework for Temporal Knowledge Graph Forecasting with Large Language Models

Authors: Long Bai, Zixuan Li, Xiaolong Jin, Jiafeng Guo, Xueqi Cheng, Tat-Seng Chua
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.00445
Pdf URL: https://arxiv.org/pdf/2506.00445
Copy Paste: [[2506.00445]] G2S: A General-to-Specific Learning Framework for Temporal Knowledge Graph Forecasting with Large Language Models(https://arxiv.org/abs/2506.00445)
Keywords: language model, llm
Abstract: Forecasting over Temporal Knowledge Graphs (TKGs), which predicts future facts based on historical ones, has received much attention. Recent studies have introduced Large Language Models (LLMs) for this task to enhance the models' generalization abilities. However, these models perform forecasting via simultaneously learning two kinds of entangled knowledge in the TKG: (1) general patterns, i.e., invariant temporal structures shared across different scenarios; and (2) scenario information, i.e., factual knowledge engaged in a specific scenario, such as entities and relations. As a result, the learning processes of these two kinds of knowledge may interfere with each other, which potentially impacts the generalization abilities of the models. To enhance the generalization ability of LLMs on this task, in this paper, we propose a General-to-Specific learning framework (G2S) that disentangles the learning processes of the above two kinds of knowledge. In the general learning stage, we mask the scenario information in different TKGs and convert it into anonymous temporal structures. After training on these structures, the model is able to capture the general patterns across different TKGs. In the specific learning stage, we inject the scenario information into the structures via either in-context learning or fine-tuning modes. Experimental results show that G2S effectively improves the generalization abilities of LLMs.
摘要：对时间知识图（TKG）预测基于历史知识的事实的预测已引起了很多关注。最近的研究引入了大型语言模型（LLM），以增强模型的概括能力。但是，这些模型通过同时学习TKG中的两种纠缠知识进行预测：（1）一般模式，即在不同情况下共享的不变的时间结构；（2）场景信息，即参与特定场景的事实知识，例如实体和关系。结果，这两种知识的学习过程可能会互相干扰，这可能会影响模型的概括能力。为了增强LLM在这项任务上的概括能力，在本文中，我们提出了一个一般到特定的学习框架（G2S），该框架（G2S）删除了上述两种知识的学习过程。在一般学习阶段，我们将场景信息掩盖在不同的TKG中，并将其转换为匿名的时间结构。在对这些结构进行了训练之后，该模型能够捕获不同TKG的一般模式。在特定的学习阶段，我们通过秘密学习或微调模式将场景信息注入结构。实验结果表明，G2有效地提高了LLM的概括能力。

Title: Fact-Controlled Diagnosis of Hallucinations in Medical Text Summarization

Authors: Suhas BN, Han-Chin Shing, Lei Xu, Mitch Strong, Jon Burnsky, Jessica Ofor, Jordan R. Mason, Susan Chen, Sundararajan Srinivasan, Chaitanya Shivade, Jack Moriarty, Joseph Paul Cohen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.00448
Pdf URL: https://arxiv.org/pdf/2506.00448
Copy Paste: [[2506.00448]] Fact-Controlled Diagnosis of Hallucinations in Medical Text Summarization(https://arxiv.org/abs/2506.00448)
Keywords: language model, llm, hallucination
Abstract: Hallucinations in large language models (LLMs) during summarization of patient-clinician dialogues pose significant risks to patient care and clinical decision-making. However, the phenomenon remains understudied in the clinical domain, with uncertainty surrounding the applicability of general-domain hallucination detectors. The rarity and randomness of hallucinations further complicate their investigation. In this paper, we conduct an evaluation of hallucination detection methods in the medical domain, and construct two datasets for the purpose: A fact-controlled Leave-N-out dataset -- generated by systematically removing facts from source dialogues to induce hallucinated content in summaries; and a natural hallucination dataset -- arising organically during LLM-based medical summarization. We show that general-domain detectors struggle to detect clinical hallucinations, and that performance on fact-controlled hallucinations does not reliably predict effectiveness on natural hallucinations. We then develop fact-based approaches that count hallucinations, offering explainability not available with existing methods. Notably, our LLM-based detectors, which we developed using fact-controlled hallucinations, generalize well to detecting real-world clinical hallucinations. This research contributes a suite of specialized metrics supported by expert-annotated datasets to advance faithful clinical summarization systems.
摘要：大语模型（LLM）的幻觉在汇总患者 - 临床主义对话中对患者护理和临床决策构成了重大风险。然而，该现象在临床领域仍未研究，围绕一通域幻觉探测器的适用性不确定性。幻觉的稀有性和随机性进一步使他们的研究变得复杂。在本文中，我们对医疗领域中的幻觉检测方法进行了评估，并为此目的构建了两个数据集：一个事实控制的遗留数据集 - 通过系统地从源对话中删除事实来诱导幻觉内容，以摘要中诱导幻觉内容；以及自然幻觉数据集 - 在基于LLM的医疗摘要中有机出现。我们表明，通用域检测器难以检测临床幻觉，并且在事实控制的幻觉上的表现并不能可靠地预测对自然幻觉的有效性。然后，我们开发基于事实的方法来计算幻觉，从而提供现有方法无法使用的解释性。值得注意的是，我们使用事实控制的幻觉开发的基于LLM的检测器，可以很好地检测现实世界的临床幻觉。这项研究贡献了一套专门的指标，并由专家注销的数据集支持，以推动忠实的临床总结系统。
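
The Leave-N-out idea in this abstract can be illustrated with a minimal sketch. It assumes facts have already been extracted as plain strings; the helpers `leave_n_out` and `count_unsupported` are hypothetical names introduced here, and the paper's actual fact extraction and summarization steps are not reproduced.

```python
import random

def leave_n_out(source_facts, n, seed=0):
    """Remove n randomly chosen facts from the source; return (reduced_source, removed)."""
    rng = random.Random(seed)
    removed = set(rng.sample(source_facts, n))
    reduced = [f for f in source_facts if f not in removed]
    return reduced, removed

def count_unsupported(summary_facts, reduced_source):
    """Count summary facts that have no support in the reduced source."""
    support = set(reduced_source)
    return sum(1 for f in summary_facts if f not in support)

# Toy dialogue facts (illustrative only, not taken from the paper's data).
facts = [
    "patient reports chest pain",
    "pain started two days ago",
    "no history of diabetes",
    "takes aspirin daily",
]
reduced, removed = leave_n_out(facts, n=2)

# A summary that still states every original fact now contains exactly
# len(removed) facts that the reduced source no longer supports.
hallucinated = count_unsupported(facts, reduced)
```

Because the removed facts are known by construction, the number of unsupported facts in a summary can be checked exactly. This is the property that makes such a dataset "fact-controlled" in this sketch.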

Title: Massively Multilingual Adaptation of Large Language Models Using Bilingual Translation Data

Authors: Shaoxiong Ji, Zihao Li, Jaakko Paavola, Indraneil Paul, Hengyu Luo, Jörg Tiedemann
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.00469
Pdf URL: https://arxiv.org/pdf/2506.00469
Copy Paste: [[2506.00469]] Massively Multilingual Adaptation of Large Language Models Using Bilingual Translation Data(https://arxiv.org/abs/2506.00469)
Keywords: language model
Abstract: This paper investigates a critical design decision in the practice of massively multilingual continual pre-training -- the inclusion of parallel data. Specifically, we study the impact of bilingual translation data for massively multilingual language adaptation of the Llama 3 family of models to 500 languages. To this end, we construct the MaLA bilingual translation corpus, containing data from more than 2,500 language pairs. Subsequently, we develop the EMMA-500 Llama 3 suite of four massively multilingual models -- continually pre-trained from the Llama 3 family of base models extensively on diverse data mixes up to 671B tokens -- and explore the effect of continual pre-training with or without bilingual translation data. Comprehensive evaluation across 7 tasks and 12 benchmarks demonstrates that bilingual data tends to enhance language transfer and performance, particularly for low-resource languages. We open-source the MaLA corpus, EMMA-500 Llama 3 suite artefacts, code, and model generations.
摘要：本文研究了大规模多语言持续预训练的实践中的关键设计决策 - 包括并行数据。具体而言，我们研究了双语翻译数据对Llama3模型家族对500种语言的大规模多语言适应的影响。为此，我们构建了Mala双语翻译语料库，其中包含来自2500多个语言对的数据。随后，我们开发了四个大型多语言模型的EMMA-500 LLAMA 3套件 - 不断从Llama 3基本模型的基本模型群中预先训练，以多达671b的代币混合使用，并探索连续预训练的效果。跨7个任务和12个基准的全面评估表明，双语数据倾向于增强语言传递和表现，尤其是对于低资源语言。我们开源Mala语料库，Emma-500 Llama 3套件人工制品，代码和模型世代。

Title: EffiVLM-BENCH: A Comprehensive Benchmark for Evaluating Training-Free Acceleration in Large Vision-Language Models

Authors: Zekun Wang, Minghua Ma, Zexin Wang, Rongchuan Mu, Liping Shan, Ming Liu, Bing Qin
Subjects: cs.CL, cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2506.00479
Pdf URL: https://arxiv.org/pdf/2506.00479
Copy Paste: [[2506.00479]] EffiVLM-BENCH: A Comprehensive Benchmark for Evaluating Training-Free Acceleration in Large Vision-Language Models(https://arxiv.org/abs/2506.00479)
Keywords: language model
Abstract: Large Vision-Language Models (LVLMs) have achieved remarkable success, yet their significant computational demands hinder practical deployment. While efforts to improve LVLM efficiency are growing, existing methods lack comprehensive evaluation across diverse backbones, benchmarks, and metrics. In this work, we systematically evaluate mainstream acceleration techniques for LVLMs, categorized into token and parameter compression. We introduce EffiVLM-Bench, a unified framework for assessing not only absolute performance but also generalization and loyalty, while exploring Pareto-optimal trade-offs. Our extensive experiments and in-depth analyses offer insights into optimal strategies for accelerating LVLMs. We open-source code and recipes for EffiVLM-Bench to foster future research.
摘要：大型视觉模型（LVLM）取得了显着的成功，但它们的重大计算要求阻碍了实际部署。尽管提高LVLM效率的努力正在增长，但现有方法缺乏各种主链，基准和指标的全面评估。在这项工作中，我们系统地评估了LVLM的主流加速技术，分为令牌和参数压缩。我们介绍了Effivlm-Bench，这是一个统一的框架，不仅可以评估绝对绩效，还可以评估概括和忠诚度，同时探索帕累托（Pareto）最佳的权衡。我们广泛的实验和深入分析为加速LVLM的最佳策略提供了见解。我们开放源代码和食谱，以促进未来的研究。

Title: Auto-Patching: Enhancing Multi-Hop Reasoning in Language Models

Authors: Aviv Jan, Dean Tahory, Omer Talmi, Omar Abo Mokh
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.00483
Pdf URL: https://arxiv.org/pdf/2506.00483
Copy Paste: [[2506.00483]] Auto-Patching: Enhancing Multi-Hop Reasoning in Language Models(https://arxiv.org/abs/2506.00483)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Multi-hop questions still stump large language models (LLMs), which struggle to link information across multiple reasoning steps. We introduce Auto-Patch, a novel method that dynamically patches hidden states during inference to enhance multi-hop reasoning in LLMs. Building on the PatchScopes framework, Auto-Patch selectively modifies internal representations using a learned classifier. Evaluated on the MuSiQue dataset, Auto-Patch improves the solve rate from 18.45% (baseline) to 23.63 ± 0.7% (3 runs), narrowing the gap to Chain-of-Thought prompting (27.44%). Our results highlight the potential of dynamic hidden state interventions for advancing complex reasoning in LLMs.
摘要：多跳的问题仍然令人难以置信的大语言模型（LLMS），这些模型很难跨多个推理步骤链接信息。我们介绍了自动点，这是一种新颖的方法，该方法在推理过程中动态贴上了隐藏状态，以增强LLMS中的多跳推理。在PatchScopes框架上建立自动绘制，使用学习的分类器选择性修改内部表示。在Musique数据集上进行了评估，Auto-Patch将解决率从18.45 \％（基线）提高到23.63〜 $ \ pm $ 〜0.7 \％\％（3次运行），将差距缩小到经过思考的提示链（27.44 \％）。我们的结果突出了动态隐藏状态干预措施在LLM中推进复杂推理的潜力。
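
The solve rates reported in this abstract imply how much of the gap to Chain-of-Thought prompting is closed. The short calculation below works this out from the reported means only; ignoring the ±0.7 spread over the three runs is a simplifying assumption.

```python
# Solve rates (%) as reported in the abstract.
baseline, auto_patch, cot = 18.45, 23.63, 27.44

gap_before = round(cot - baseline, 2)    # gap between the baseline and CoT
gap_after = round(cot - auto_patch, 2)   # gap between Auto-Patch and CoT

# Share of the original gap that Auto-Patch closes, in percent.
closed = round(100 * (gap_before - gap_after) / gap_before, 1)
```

On these means the gap shrinks from 8.99 to 3.81 percentage points, so a little over half of the baseline-to-CoT gap is closed.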

Title: Synergizing LLMs with Global Label Propagation for Multimodal Fake News Detection

Authors: Shuguo Hu, Jun Hu, Huaiwen Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.00488
Pdf URL: https://arxiv.org/pdf/2506.00488
Copy Paste: [[2506.00488]] Synergizing LLMs with Global Label Propagation for Multimodal Fake News Detection(https://arxiv.org/abs/2506.00488)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) can assist multimodal fake news detection by predicting pseudo labels. However, LLM-generated pseudo labels alone demonstrate poor performance compared to traditional detection methods, making their effective integration non-trivial. In this paper, we propose Global Label Propagation Network with LLM-based Pseudo Labeling (GLPN-LLM) for multimodal fake news detection, which integrates LLM capabilities via label propagation techniques. The global label propagation can utilize LLM-generated pseudo labels, enhancing prediction accuracy by propagating label information among all samples. For label propagation, a mask-based mechanism is designed to prevent label leakage during training by ensuring that training nodes do not propagate their own labels back to themselves. Experimental results on benchmark datasets show that by synergizing LLMs with label propagation, our model achieves superior performance over state-of-the-art baselines.
摘要：大型语言模型（LLMS）可以通过预测伪标签来帮助多模式假新闻检测。但是，与传统的检测方法相比，单独使用LLM生成的伪标签表现出较差的性能，从而使它们的有效整合不平凡。在本文中，我们建议使用基于LLM的伪标签（GLPN-LLM）进行多模式假新闻检测的全球标签繁殖网络，该新闻检测通过标签传播技术集成了LLM功能。全局标签传播可以利用LLM生成的伪标签，从而通过在所有样本中传播标签信息来提高预测准确性。对于标签传播，通过确保训练节点不会将自己的标签传播回自己，旨在防止在训练期间标签泄漏。基准数据集上的实验结果表明，通过与标签传播协同llms，我们的模型可实现优于最先进的基线。
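
The mask-based propagation described in this abstract can be sketched on a toy graph. The example below is one simple way to realize such a mask: it zeroes the self-connection, so that a node's propagated label comes only from its neighbours. It is an assumption for illustration, not the GLPN-LLM implementation; the function name `propagate_labels`, the graph, and the labels are all hypothetical.

```python
def propagate_labels(adj, labels, mask_self=True):
    """One propagation step: each node takes the weighted mean of neighbours' label vectors.

    adj    -- n x n list of non-negative edge weights
    labels -- n x c list of label vectors (true labels or LLM pseudo labels)
    mask_self -- if True, a node never receives its own label (no self-leakage)
    """
    n, c = len(labels), len(labels[0])
    out = []
    for i in range(n):
        weights = [0.0 if (mask_self and j == i) else adj[i][j] for j in range(n)]
        total = sum(weights)
        if total == 0:
            out.append([0.0] * c)  # isolated node: nothing to propagate
            continue
        out.append([sum(weights[j] * labels[j][k] for j in range(n)) / total
                    for k in range(c)])
    return out

# Toy graph with self-loops on the diagonal.
adj = [[1.0, 1.0, 0.0],
       [1.0, 1.0, 1.0],
       [0.0, 1.0, 1.0]]
labels = [[1.0, 0.0],   # node 0: training node with true label "real"
          [0.0, 1.0],   # node 1: LLM pseudo label "fake"
          [0.0, 1.0]]   # node 2: LLM pseudo label "fake"

masked = propagate_labels(adj, labels)                     # node 0 sees only node 1
unmasked = propagate_labels(adj, labels, mask_self=False)  # node 0 also sees itself
```

Without the mask, node 0's own ground-truth label flows back into its prediction, which is the leakage the abstract says the mechanism is designed to prevent.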

Title: Exploring In-context Example Generation for Machine Translation

Authors: Dohyun Lee, Seungil Chad Lee, Chanwoo Yang, Yujin Baek, Jaegul Choo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.00507
Pdf URL: https://arxiv.org/pdf/2506.00507
Copy Paste: [[2506.00507]] Exploring In-context Example Generation for Machine Translation(https://arxiv.org/abs/2506.00507)
Keywords: language model, llm
Abstract: Large language models (LLMs) have demonstrated strong performance across various tasks, leveraging their exceptional in-context learning ability with only a few examples. Accordingly, the selection of optimal in-context examples has been actively studied in the field of machine translation. However, these studies presuppose the presence of a demonstration pool with human-annotated pairs, making them less applicable to low-resource languages where such an assumption is challenging to meet. To overcome this limitation, this paper explores the research direction of in-context example generation for machine translation. Specifically, we propose Demonstration Augmentation for Translation (DAT), a simple yet effective approach that generates example pairs without relying on any external resources. This method builds upon two prior criteria, relevance and diversity, which have been highlighted in previous work as key factors for in-context example selection. Through experiments and analysis on low-resource languages where human-annotated pairs are scarce, we show that DAT achieves superior translation quality compared to the baselines. Furthermore, we investigate the potential of progressively accumulating generated pairs during test time to build and reuse a demonstration pool. Our implementation is publicly available at this https URL.
摘要：大型语言模型（LLMS）在各种任务中都表现出了强劲的表现，仅使用几个示例利用其出色的内在学习能力。因此，在机器翻译领域已经积极研究了最佳内在示例的选择。但是，这些研究以人类注销对的示范池的存在为前提，这使得它们不适用于低资源的语言，在这种假设方面具有挑战性。为了克服这一局限性，本文探讨了机器翻译中的内在示例生成的研究方向。具体来说，我们提出了翻译（DAT）的演示增强，这是一种简单而有效的方法，它在不依赖任何外部资源的情况下生成示例对。该方法建立在两个先前的标准，相关性和多样性的基础上，这些标准在先前的工作中被强调为封闭式示例选择的关键因素。通过对人类注销对的低资源语言的实验和分析，我们表明DAT与基线相比达到了优越的翻译质量。此外，我们研究了在测试时间内逐步积累生成对的潜力，以构建和重复示范池。我们的实施在此HTTPS URL上公开可用。

Title: Goal-Aware Identification and Rectification of Misinformation in Multi-Agent Systems

Authors: Zherui Li, Yan Mi, Zhenhong Zhou, Houcheng Jiang, Guibin Zhang, Kun Wang, Junfeng Fang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.00509
Pdf URL: https://arxiv.org/pdf/2506.00509
Copy Paste: [[2506.00509]] Goal-Aware Identification and Rectification of Misinformation in Multi-Agent Systems(https://arxiv.org/abs/2506.00509)
Keywords: language model, agent
Abstract: Large Language Model-based Multi-Agent Systems (MASs) have demonstrated strong advantages in addressing complex real-world tasks. However, due to the introduction of additional attack surfaces, MASs are particularly vulnerable to misinformation injection. To facilitate a deeper understanding of misinformation propagation dynamics within these systems, we introduce MisinfoTask, a novel dataset featuring complex, realistic tasks designed to evaluate MAS robustness against such threats. Building upon this, we propose ARGUS, a two-stage, training-free defense framework leveraging goal-aware reasoning for precise misinformation rectification within information flows. Our experiments demonstrate that in challenging misinformation scenarios, ARGUS exhibits significant efficacy across various injection attacks, achieving an average reduction in misinformation toxicity of approximately 28.17% and improving task success rates under attack by approximately 10.33%. Our code and dataset are available at: this https URL.
摘要：基于语言模型的大型多代理系统（MASS）在解决复杂的现实世界任务方面表现出了强大的优势。但是，由于引入了其他攻击表面，质量特别容易受到错误信息注射的影响。为了促进对这些系统中错误信息传播动态的更深入的了解，我们引入了Misnfotask，这是一个新颖的数据集，具有复杂，现实的任务，旨在评估MAS鲁棒性，以抗这些威胁。在此基础上，我们建议Argus是一个两阶段的，无训练的防御框架，利用目标感知推理，以在信息流中进行精确的错误信息纠正。我们的实验表明，在具有挑战性的错误信息方面，Argus在各种注射攻击中表现出明显的功效，从而平均降低了错误信息毒性的毒性约为28.17％，并提高了攻击下的任务成功率约为10.33％。我们的代码和数据集可用：此HTTPS URL。

Title: Evaluating the Evaluation of Diversity in Commonsense Generation

Authors: Tianhui Zhang, Bei Peng, Danushka Bollegala
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.00514
Pdf URL: https://arxiv.org/pdf/2506.00514
Copy Paste: [[2506.00514]] Evaluating the Evaluation of Diversity in Commonsense Generation(https://arxiv.org/abs/2506.00514)
Keywords: language model, llm
Abstract: In commonsense generation, given a set of input concepts, a model must generate a response that is not only commonsense-bearing but also captures multiple diverse viewpoints. Numerous evaluation metrics based on form- and content-level overlap have been proposed in prior work for evaluating the diversity of a commonsense generation model. However, it remains unclear which metrics are best suited for evaluating the diversity in commonsense generation. To address this gap, we conduct a systematic meta-evaluation of diversity metrics for commonsense generation. We find that form-based diversity metrics tend to consistently overestimate the diversity in sentence sets, where even randomly generated sentences are assigned overly high diversity scores. We then use a Large Language Model (LLM) to create a novel dataset annotated for the diversity of sentences generated for a commonsense generation task, and use it to conduct a meta-evaluation of the existing diversity evaluation metrics. Our experimental results show that content-based diversity evaluation metrics consistently outperform the form-based counterparts, showing high correlations with the LLM-based ratings. We recommend that future work on commonsense generation should use content-based metrics for evaluating the diversity of their outputs.
摘要：在常识生成中，鉴于一组输入概念，模型必须产生一个不仅具有常识性轴承的响应，而且还必须捕获多种不同的观点。在先前的工作中提出了基于形式和内容级重叠的大量评估指标，以评估常识性生成模型的多样性。但是，目前尚不清楚哪些指标最适合评估常识性的多样性。为了解决这一差距，我们对常识性产生的多样性指标进行了系统的元评估。我们发现，基于表格的多样性指标往往会始终高估句子集的多样性，即使随机生成的句子也被分配了过高的多样性得分。然后，我们使用大型语言模型（LLM）来创建一个新颖的数据集，该数据集注释了为常识生成任务生成的句子的多样性，并使用它来对现有多样性评估指标进行元评估。我们的实验结果表明，基于内容的多样性评估指标始终优于基于表单的同行，显示与基于LLM的评级的高相关性。我们建议未来关于常识生成的工作应使用基于内容的指标来评估其产出的多样性。
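
The claim that form-based metrics can overrate meaningless text can be illustrated with a small sketch. The abstract does not name the metrics it evaluates, so distinct-n, a widely used n-gram diversity ratio, is chosen here as an assumption; the sentences are made up for illustration.

```python
def distinct_n(sentences, n=2):
    """Ratio of unique n-grams to total n-grams across a sentence set."""
    ngrams = []
    for s in sentences:
        tokens = s.lower().split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

# Two sensible sentences versus the same words in scrambled order.
paraphrases = ["the dog catches the ball", "the dog catches the frisbee"]
word_salad = ["ball the catches dog the", "frisbee dog the the catches"]

score_sensible = distinct_n(paraphrases)
score_scrambled = distinct_n(word_salad)
```

In this toy case the scrambled set scores higher than the sensible one, because the metric looks only at surface n-grams and not at meaning.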

Title: CausalAbstain: Enhancing Multilingual LLMs with Causal Reasoning for Trustworthy Abstention

Authors: Yuxi Sun, Aoqi Zuo, Wei Gao, Jing Ma
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.00519
Pdf URL: https://arxiv.org/pdf/2506.00519
Copy Paste: [[2506.00519]] CausalAbstain: Enhancing Multilingual LLMs with Causal Reasoning for Trustworthy Abstention(https://arxiv.org/abs/2506.00519)
Keywords: language model, llm, hallucination
Abstract: Large Language Models (LLMs) often exhibit knowledge disparities across languages. Encouraging LLMs to abstain when faced with knowledge gaps is a promising strategy to reduce hallucinations in multilingual settings. Current abstention strategies for multilingual scenarios primarily rely on generating feedback in various languages using LLMs and performing self-reflection. However, these methods can be adversely impacted by inaccuracies and biases in the generated feedback. To address this, from a causal perspective, we introduce CausalAbstain, a method that helps LLMs determine whether to utilize multiple generated feedback responses and how to identify the most useful ones. Extensive experiments demonstrate that CausalAbstain effectively selects helpful feedback and enhances abstention decisions with interpretability in both native language (Causal-native) and multilingual (Causal-multi) settings, outperforming strong baselines on two benchmark datasets covering encyclopedic and commonsense knowledge QA tasks. Our code and data are open-sourced at this https URL.
摘要：大型语言模型（LLM）经常在语言上表现出知识差异。当面对知识差距时，鼓励LLMS \ textit {弃权}是减少多语言环境中幻觉的有前途的策略。当前用于多语言场景的弃权策略主要依赖于使用LLMS以各种语言产生反馈并进行自我反射。但是，这些方法可能会受到产生的反馈中的不准确性和偏见的不利影响。为了从因果角度解决这个问题，我们介绍了\ textit {causalabstain}，该方法可帮助LLMS确定是否利用多个生成的反馈响应以及如何识别最有用的反应。广泛的实验表明\ textit {causalabstain}有效地选择了有用的反馈，并增强了用母语（\ textsc {casual-native}）和多语言（\ textsc {causal-multi}）的可解释性（\ textsc {casual-native}）的可解释性的决策，对两个知识的基础启用了两个知识的基础，并在两个知识的基础上进行了介绍。我们的代码和数据在此HTTPS URL上开源。

Title: Retrieval-Augmented Generation Systems for Intellectual Property via Synthetic Multi-Angle Fine-tuning

Authors: Runtao Ren, Jian Ma, Jianxi Luo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.00527
Pdf URL: https://arxiv.org/pdf/2506.00527
Copy Paste: [[2506.00527]] Retrieval-Augmented Generation Systems for Intellectual Property via Synthetic Multi-Angle Fine-tuning(https://arxiv.org/abs/2506.00527)
Keywords: language model, llm, prompt, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) systems in the Intellectual Property (IP) field often struggle with diverse user queries, including colloquial expressions, spelling errors, and ambiguous terminology, leading to inaccurate retrieval and suboptimal responses. To address this challenge, we propose Multi-Angle Question Generation and Retrieval Fine-Tuning Method (MQG-RFM), a novel framework that leverages large language models (LLMs) to simulate varied user inquiries and fine-tunes retrieval models to align semantically equivalent but linguistically diverse questions. Unlike complex architectural modifications, MQG-RFM adopts a lightweight Data-to-Tune paradigm, combining prompt-engineered query generation with hard negative mining to enhance retrieval robustness without costly infrastructure changes. Experimental results on a Taiwan patent Q&A dataset show 185.62% improvement in retrieval accuracy on the Patent Consultation dataset and 262.26% improvement on the Novel Patent Technology Report dataset, with 14.22% and 53.58% improvements in generation quality over the baselines, respectively. By bridging the gap between user intent and system comprehension through semantic-aware retrieval optimization, MQG-RFM offers a practical, scalable approach for rapid, cost-effective deployment among small and medium-sized agencies seeking reliable patent intelligence solutions. Additionally, our proposed method has already been adopted by ScholarMate, the largest professional research social networking platform in China, to support real-world development and deployment. A demo version of the instantiated is available at this https URL.
摘要：知识产权（IP）领域中的检索增强生成（RAG）系统通常会在不同的用户查询中挣扎，包括口语表达式，拼写错误和模棱两可的术语，导致检索不准确和次优响应。为了应对这一挑战，我们提出了多角度问题生成和检索微调方法（MQG-RFM），这是一个新型框架，利用大型语言模型（LLMS）模拟各种用户查询和微调检索模型，以使语义上等效但语言上不同的问题保持一致。与复杂的体系结构修改不同，MQG-RFM采用了轻巧的数据对调整范式，将及时工程的查询产生与硬采矿相结合，以增强检索鲁棒性而没有昂贵的基础设施变化。台湾专利问答数据集的实验结果显示，专利咨询数据集检索准确性的提高了185.62％，而新型专利技术报告数据集则提高了262.26％，分别提高了14.22％和53.58％的发电质量。通过通过语义吸引的检索优化弥合用户意图和系统理解之间的差距，MQG-RFM提供了一种实用，可扩展的方法，用于在寻求可靠的专利情报解决方案的中小型机构之间快速，具有成本效益的部署。此外，我们提出的方法已经被中国最大的专业研究社交网络平台Scholarmate采用，以支持现实世界的发展和部署。该HTTPS URL可用该实例化的演示版本。

Title: Decoupling Reasoning and Knowledge Injection for In-Context Knowledge Editing

Authors: Changyue Wang, Weihang Su, Qingyao Ai, Yujia Zhou, Yiqun Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.00536
Pdf URL: https://arxiv.org/pdf/2506.00536
Copy Paste: [[2506.00536]] Decoupling Reasoning and Knowledge Injection for In-Context Knowledge Editing(https://arxiv.org/abs/2506.00536)
Keywords: language model, llm
Abstract: Knowledge editing aims to efficiently update Large Language Models (LLMs) by modifying specific knowledge without retraining the entire model. Among knowledge editing approaches, in-context editing (ICE) offers a lightweight solution by injecting new knowledge directly into the input context, leaving model parameters unchanged. However, existing ICE approaches do not explicitly separate the newly injected knowledge from the model's original reasoning process. This entanglement often results in conflicts between external updates and internal parametric knowledge, undermining the consistency and accuracy of the reasoning process. In this work, we conduct preliminary experiments to examine how parametric knowledge influences reasoning path planning. We find that the model's reasoning is tightly coupled with its internal knowledge, and that naively injecting new information without adapting the reasoning path often leads to performance degradation, particularly in multi-hop tasks. To this end, we propose DecKER, a novel ICE framework that decouples reasoning from knowledge editing by generating a masked reasoning path and then resolving knowledge edits via hybrid retrieval and model-based validation. Experiments on multi-hop QA benchmarks show that DecKER significantly outperforms existing ICE methods by mitigating knowledge conflicts and preserving reasoning consistency. Our code is available at: this https URL.
摘要：知识编辑旨在通过修改特定知识而无需重新训练整个模型来有效地更新大语模型（LLM）。在知识编辑方法中，文化编辑（ICE）通过将新知识直接注入输入上下文，使模型参数不变，从而提供了轻量级解决方案。但是，现有的ICE方法并未明确将新注射的知识与模型的原始推理过程分开。这种纠缠通常会导致外部更新与内部参数知识之间发生冲突，从而破坏了这项工作的推理的一致性和准确性，我们进行了初步实验，以研究参数知识如何影响推理路径计划。我们发现该模型的推理与其内部知识紧密相结合，并且天真地注入新信息而不适应推理路径通常会导致性能降级，尤其是在多跳任务中。为此，我们提出了Decker，这是一种新颖的ICE框架，通过产生掩盖的推理路径，然后通过混合检索和基于模型的验证来解决知识编辑，从而将推理与知识编辑进行了解密。多跳QA基准的实验表明，通过缓解知识冲突并保留推理一致性，Decker明显优于现有的ICE方法。我们的代码可用：此HTTPS URL。

Title: ARIA: Training Language Agents with Intention-Driven Reward Aggregation

Authors: Ruihan Yang, Yikai Zhang, Aili Chen, Xintao Wang, Siyu Yuan, Jiangjie Chen, Deqing Yang, Yanghua Xiao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.00539
Pdf URL: https://arxiv.org/pdf/2506.00539
Copy Paste: [[2506.00539]] ARIA: Training Language Agents with Intention-Driven Reward Aggregation(https://arxiv.org/abs/2506.00539)
Keywords: language model, llm, agent
Abstract: Large language models (LLMs) have enabled agents to perform complex reasoning and decision-making through free-form language interactions. However, in open-ended language action environments (e.g., negotiation or question-asking games), the action space can be formulated as a joint distribution over tokens, resulting in an exponentially large action space. Sampling actions in such a space can lead to extreme reward sparsity, which brings large reward variance, hindering effective reinforcement learning (RL). To address this, we propose ARIA, a method that Aggregates Rewards in Intention space to enable efficient and effective language Agents training. ARIA aims to project natural language actions from the high-dimensional joint token distribution space into a low-dimensional intention space, where semantically similar actions are clustered and assigned shared rewards. This intention-aware reward aggregation reduces reward variance by densifying reward signals, fostering better policy optimization. Extensive experiments demonstrate that ARIA not only significantly reduces policy gradient variance, but also delivers substantial performance gains of an average of 9.95% across four downstream tasks, consistently outperforming offline and online RL baselines.
摘要：大型语言模型（LLMS）使代理可以通过自由形式的语言互动执行复杂的推理和决策。但是，在开放式的语言动作环境（例如，谈判或提问游戏）中，动作空间可以作为代币上的联合分配而配制，从而带来了指数较大的动作空间。在这样的空间中采样动作会导致极端的奖励稀疏性，从而带来了巨大的奖励差异，阻碍了有效的增强学习（RL）。为了解决这个问题，我们提出了ARIA，ARIA是一种在意图空间中汇总奖励以实现高效和有效语言训练的方法。 ARIA的目的是将自然语言的行动从高维的令牌分配空间投射到低维的意图空间，在该空间中，在语义上相似的动作聚集并分配了共享的奖励。这种意图感知的奖励聚合通过致密奖励信号来降低奖励差异，从而促进更好的政策优化。广泛的实验表明，ARIA不仅显着降低了政策梯度差异，而且在四个下游任务中，平均绩效增长了9.95％，始终优于离线和在线RL基线。
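
The intention-space reward aggregation described in this abstract can be sketched on toy data. The example assumes that actions have already been clustered into intention labels and that the shared reward is the cluster mean; both are assumptions made for illustration, and `aggregate_rewards` is a hypothetical helper, not ARIA's implementation.

```python
from collections import defaultdict

def aggregate_rewards(intentions, rewards):
    """Replace each action's reward with the mean reward of its intention cluster."""
    sums, counts = defaultdict(float), defaultdict(int)
    for intent, r in zip(intentions, rewards):
        sums[intent] += r
        counts[intent] += 1
    return [sums[i] / counts[i] for i in intentions]

def variance(xs):
    """Population variance of a list of numbers."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# Six sampled actions, grouped into two intention clusters.
intentions = ["offer", "offer", "offer", "reject", "reject", "reject"]
rewards = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]   # sparse raw rewards

shared = aggregate_rewards(intentions, rewards)
raw_var, shared_var = variance(rewards), variance(shared)
```

Every "offer" action now carries the same non-zero signal, and the variance of the reward vector is lower than for the raw sparse rewards. This mirrors the densification effect the abstract describes.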

Title: Towards Multi-dimensional Evaluation of LLM Summarization across Domains and Languages

Authors: Hyangsuk Min, Yuho Lee, Minjeong Ban, Jiaqi Deng, Nicole Hee-Yeon Kim, Taewon Yun, Hang Su, Jason Cai, Hwanjun Song
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.00549
Pdf URL: https://arxiv.org/pdf/2506.00549
Copy Paste: [[2506.00549]] Towards Multi-dimensional Evaluation of LLM Summarization across Domains and Languages(https://arxiv.org/abs/2506.00549)
Keywords: language model, llm, agent
Abstract: Evaluation frameworks for text summarization have evolved in terms of both domain coverage and metrics. However, existing benchmarks still lack domain-specific assessment criteria, remain predominantly English-centric, and face challenges with human annotation due to the complexity of reasoning. To address these, we introduce MSumBench, which provides a multi-dimensional, multi-domain evaluation of summarization in English and Chinese. It also incorporates specialized assessment criteria for each domain and leverages a multi-agent debate system to enhance annotation quality. By evaluating eight modern summarization models, we discover distinct performance patterns across domains and languages. We further examine large language models as summary evaluators, analyzing the correlation between their evaluation and summarization capabilities, and uncovering systematic bias in their assessment of self-generated summaries. Our benchmark dataset is publicly available at this https URL.
摘要：文本摘要的评估框架已随着域覆盖范围和指标的发展而发展。但是，现有的基准仍然缺乏特定领域的评估标准，主要以英语为中心，并且由于推理的复杂性而面临人类注释的挑战。为了解决这些问题，我们介绍了Msumbench，该MSUMBENCH提供了对英语和中文的多维，多域评估。它还纳入了每个领域的专门评估标准，并利用多代理辩论系统来提高注释质量。通过评估八种现代摘要模型，我们发现了跨领域和语言的不同性能模式。我们进一步研究了大型语言模型作为摘要评估者，分析了他们的评估与摘要功能之间的相关性，并在评估自我生成的摘要时发现了系统的偏见。我们的基准数据集可在此HTTPS URL上公开使用。

Title: AnnaAgent: Dynamic Evolution Agent System with Multi-Session Memory for Realistic Seeker Simulation

Authors: Ming Wang, Peidong Wang, Lin Wu, Xiaocui Yang, Daling Wang, Shi Feng, Yuxin Chen, Bixuan Wang, Yifei Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.00551
Pdf URL: https://arxiv.org/pdf/2506.00551
Copy Paste: [[2506.00551]] AnnaAgent: Dynamic Evolution Agent System with Multi-Session Memory for Realistic Seeker Simulation(https://arxiv.org/abs/2506.00551)
Keywords: llm, agent
Abstract: Constrained by the cost and ethical concerns of involving real seekers in AI-driven mental health, researchers develop LLM-based conversational agents (CAs) with tailored configurations, such as profiles, symptoms, and scenarios, to simulate seekers. While these efforts advance AI in mental health, achieving more realistic seeker simulation remains hindered by two key challenges: dynamic evolution and multi-session memory. Seekers' mental states often fluctuate during counseling, which typically spans multiple sessions. To address this, we propose AnnaAgent, an emotional and cognitive dynamic agent system equipped with tertiary memory. AnnaAgent incorporates an emotion modulator and a complaint elicitor trained on real counseling dialogues, enabling dynamic control of the simulator's configurations. Additionally, its tertiary memory mechanism effectively integrates short-term and long-term memory across sessions. Evaluation results, both automated and manual, demonstrate that AnnaAgent achieves more realistic seeker simulation in psychological counseling compared to existing baselines. The ethically reviewed and screened code can be found on this https URL.
摘要：受到AI驱动的心理健康的成本和道德问题的限制，研究人员开发了基于LLM的对话剂（CAS），其量身定制的配置，例如配置，症状和场景，以模拟寻求者。尽管这些努力在心理健康方面提高了AI，但实现更现实的寻求者模拟仍受到两个关键挑战的阻碍：动态进化和多课程记忆。寻求者的心理状态在咨询过程中经常发生波动，通常会涵盖多次会议。为了解决这个问题，我们提出了Annaagent，这是一个配备了三级记忆的情感和认知动态代理系统。 Annaagent结合了一个情感调制器和一个经过培训的培训对话的投诉灵活者，从而使模拟器配置的动态控制。此外，其三级记忆机制有效地整合了跨会话的短期和长期记忆。自动化和手动的评估结果表明，与现有基线相比，Annaagent在心理咨询中实现了更现实的寻求者模拟。可以在此HTTPS URL上找到经过道德审查和筛选的代码。

Title: The Hidden Language of Harm: Examining the Role of Emojis in Harmful Online Communication and Content Moderation

Authors: Yuhang Zhou, Yimin Xiao, Wei Ai, Ge Gao
Subjects: cs.CL, cs.CY, cs.HC
Abstract URL: https://arxiv.org/abs/2506.00583
Pdf URL: https://arxiv.org/pdf/2506.00583
Copy Paste: [[2506.00583]] The Hidden Language of Harm: Examining the Role of Emojis in Harmful Online Communication and Content Moderation(https://arxiv.org/abs/2506.00583)
Keywords: llm
Abstract: Social media platforms have become central to modern communication, yet they also harbor offensive content that challenges platform safety and inclusivity. While prior research has primarily focused on textual indicators of offense, the role of emojis, ubiquitous visual elements in online discourse, remains underexplored. Emojis, despite being rarely offensive in isolation, can acquire harmful meanings through symbolic associations, sarcasm, and contextual misuse. In this work, we systematically examine emoji contributions to offensive Twitter messages, analyzing their distribution across offense categories and how users exploit emoji ambiguity. To address this, we propose an LLM-powered, multi-step moderation pipeline that selectively replaces harmful emojis while preserving the tweet's semantic intent. Human evaluations confirm our approach effectively reduces perceived offensiveness without sacrificing meaning. Our analysis also reveals heterogeneous effects across offense types, offering nuanced insights for online communication and emoji moderation.

Title: PAKTON: A Multi-Agent Framework for Question Answering in Long Legal Agreements

Authors: Petros Raptopoulos, Giorgos Filandrianos, Maria Lymperaiou, Giorgos Stamou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.00608
Pdf URL: https://arxiv.org/pdf/2506.00608
Copy Paste: [[2506.00608]] PAKTON: A Multi-Agent Framework for Question Answering in Long Legal Agreements(https://arxiv.org/abs/2506.00608)
Keywords: retrieval-augmented generation, agent
Abstract: Contract review is a complex and time-intensive task that typically demands specialized legal expertise, rendering it largely inaccessible to non-experts. Moreover, legal interpretation is rarely straightforward: ambiguity is pervasive, and judgments often hinge on subjective assessments. Compounding these challenges, contracts are usually confidential, restricting their use with proprietary models and necessitating reliance on open-source alternatives. To address these challenges, we introduce PAKTON: a fully open-source, end-to-end, multi-agent framework with plug-and-play capabilities. PAKTON is designed to handle the complexities of contract analysis through collaborative agent workflows and a novel retrieval-augmented generation (RAG) component, enabling automated legal document review that is more accessible, adaptable, and privacy-preserving. Experiments demonstrate that PAKTON outperforms both general-purpose and pretrained models in predictive accuracy, retrieval performance, explainability, completeness, and grounded justifications as evaluated through a human study and validated with automated metrics.

Title: Enhancing Clinical Multiple-Choice Questions Benchmarks with Knowledge Graph Guided Distractor Generation

Authors: Running Yang, Wenlong Deng, Minghui Chen, Yuyin Zhou, Xiaoxiao Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.00612
Pdf URL: https://arxiv.org/pdf/2506.00612
Copy Paste: [[2506.00612]] Enhancing Clinical Multiple-Choice Questions Benchmarks with Knowledge Graph Guided Distractor Generation(https://arxiv.org/abs/2506.00612)
Keywords: language model, llm
Abstract: Clinical tasks such as diagnosis and treatment require strong decision-making abilities, highlighting the importance of rigorous evaluation benchmarks to assess the reliability of large language models (LLMs). In this work, we introduce a knowledge-guided data augmentation framework that enhances the difficulty of clinical multiple-choice question (MCQ) datasets by generating distractors (i.e., incorrect choices that are similar to the correct one and may confuse existing LLMs). Using our KG-based pipeline, the generated choices are both clinically plausible and deliberately misleading. Our approach involves multi-step, semantically informed walks on a medical knowledge graph to identify distractor paths (associations that are medically relevant but factually incorrect), which then guide the LLM in crafting more deceptive distractors. We apply the designed knowledge graph guided distractor generation (KGGDG) pipeline to six widely used medical QA benchmarks and show that it consistently reduces the accuracy of state-of-the-art LLMs. These findings establish KGGDG as a powerful tool for enabling more robust and diagnostic evaluations of medical LLMs.

Title: Social Construction of Urban Space: Understanding Neighborhood Boundaries Using Rental Listings

Authors: Adam Visokay, Ruth Bagley, Ian Kennedy, Chris Hess, Kyle Crowder, Rob Voigt, Denis Peskoff
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.00634
Pdf URL: https://arxiv.org/pdf/2506.00634
Copy Paste: [[2506.00634]] Social Construction of Urban Space: Understanding Neighborhood Boundaries Using Rental Listings(https://arxiv.org/abs/2506.00634)
Keywords: language model, agent
Abstract: Rental listings offer a unique window into how urban space is socially constructed through language. We analyze Chicago Craigslist rental advertisements from 2018 to 2024 to examine how listing agents characterize neighborhoods, identifying mismatches between institutional boundaries and neighborhood claims. Through manual and large language model annotation, we classify unstructured listings from Craigslist according to their neighborhood. Geospatial analysis reveals three distinct patterns: properties with conflicting neighborhood designations due to competing spatial definitions, border properties with valid claims to adjacent neighborhoods, and "reputation laundering" where listings claim association with distant, desirable neighborhoods. Through topic modeling, we identify patterns that correlate with spatial positioning: listings further from neighborhood centers emphasize different amenities than centrally-located units. Our findings demonstrate that natural language processing techniques can reveal how definitions of urban spaces are contested in ways that traditional methods overlook.

Title: Improving the Calibration of Confidence Scores in Text Generation Using the Output Distribution's Characteristics

Authors: Lorenzo Jaime Yu Flores, Ori Ernst, Jackie Chi Kit Cheung
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.00637
Pdf URL: https://arxiv.org/pdf/2506.00637
Copy Paste: [[2506.00637]] Improving the Calibration of Confidence Scores in Text Generation Using the Output Distribution's Characteristics(https://arxiv.org/abs/2506.00637)
Keywords: prompt
Abstract: Well-calibrated model confidence scores can improve the usefulness of text generation models. For example, users can be prompted to review predictions with low confidence scores, to prevent models from returning bad or potentially dangerous predictions. However, confidence metrics are not always well calibrated in text generation. One reason is that in generation, there can be many valid answers, which previous methods do not always account for. Hence, a confident model could distribute its output probability among multiple sequences because they are all valid. We propose task-agnostic confidence metrics suited to generation, which rely solely on the probabilities associated with the model outputs without the need for further fine-tuning or heuristics. Using these, we are able to improve the calibration of BART and Flan-T5 on summarization, translation, and QA datasets.
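As a rough illustration of what "well calibrated" means for the confidence scores discussed in the entry above, the sketch below computes expected calibration error (ECE) over equal-width confidence bins. The abstract does not name the calibration measure it uses, so ECE is an assumption here, chosen only because it is a common one; the function name and the toy inputs are hypothetical.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Weighted average gap between mean confidence and accuracy per bin."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        return 0.0

    # Assign each prediction to an equal-width bin over [0, 1].
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1.0 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(mean_conf - accuracy)
    return ece


# Toy data (hypothetical): confident-and-right plus unconfident-and-wrong.
toy_ece = expected_calibration_error([0.9, 0.9, 0.1, 0.1], [True, True, False, False])
print(toy_ece)
```

A lower value means confidence tracks correctness more closely; a score of zero would mean the two agree in every bin.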

Title: SATA-BENCH: Select All That Apply Benchmark for Multiple Choice Questions

Authors: Weijie Xu, Shixian Cui, Xi Fang, Chi Xue, Stephanie Eckman, Chandan Reddy
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.00643
Pdf URL: https://arxiv.org/pdf/2506.00643
Copy Paste: [[2506.00643]] SATA-BENCH: Select All That Apply Benchmark for Multiple Choice Questions(https://arxiv.org/abs/2506.00643)
Keywords: language model, llm
Abstract: Large language models (LLMs) are increasingly evaluated on single-answer multiple-choice tasks, yet many real-world problems require identifying all correct answers from a set of options. This capability remains underexplored. We introduce SATA-BENCH, the first dedicated benchmark for evaluating LLMs on Select All That Apply (SATA) questions across diverse domains, including reading comprehension, law, and biomedicine. Our evaluation of 27 open-source and proprietary models reveals a significant gap: even the strongest model achieves only 41.8% exact match, exposing LLMs' inability to reliably identify all correct answers. We find that this weakness stems from two core challenges: selection bias - models favor certain choices regardless of content, and count bias - models fail to predict the correct number of answers. To address these issues, we propose Choice Funnel, a decoding strategy that combines token debiasing with adaptive thresholding to guide models toward complete and accurate selections. Choice Funnel achieves up to 29% higher exact match than competitive baselines while reducing inference cost by over 64%. Our findings expose fundamental limitations in current LLMs and introduce a new framework for diagnosing and improving multi-answer reasoning. We release SATA-BENCH and Choice Funnel to promote LLM development for robust decision-making in realistic, multi-answer applications.
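To make the exact-match criterion in the entry above concrete, here is a minimal sketch, assuming predictions and gold answers are represented as sets of option labels. The second helper is only an illustrative proxy for the "count bias" the abstract describes (it is not the paper's metric), and all names and sample data are hypothetical.

```python
from typing import List, Set


def exact_match(predicted: Set[str], gold: Set[str]) -> bool:
    """True only when the selected options equal the gold options exactly."""
    return set(predicted) == set(gold)


def exact_match_rate(predictions: List[Set[str]], golds: List[Set[str]]) -> float:
    """Fraction of questions whose full answer set was selected exactly."""
    if len(predictions) != len(golds):
        raise ValueError("predictions and golds must have the same length")
    if not golds:
        return 0.0
    hits = sum(exact_match(p, g) for p, g in zip(predictions, golds))
    return hits / len(golds)


def mean_count_difference(predictions: List[Set[str]], golds: List[Set[str]]) -> float:
    """Average of (options selected - options correct); negative means under-selection."""
    if len(predictions) != len(golds):
        raise ValueError("predictions and golds must have the same length")
    if not golds:
        return 0.0
    return sum(len(p) - len(g) for p, g in zip(predictions, golds)) / len(golds)


# Hypothetical sample: the second question misses one correct option.
preds = [{"A", "C"}, {"B"}, {"A", "B", "D"}]
golds = [{"A", "C"}, {"B", "D"}, {"A", "B", "D"}]
em_rate = exact_match_rate(preds, golds)
count_gap = mean_count_difference(preds, golds)
print(em_rate, count_gap)
```

Exact match is deliberately strict: a prediction that finds one of two correct options scores the same as one that finds none, which is why under-selection alone can pull the rate down.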

Title: GuideX: Guided Synthetic Data Generation for Zero-Shot Information Extraction

Authors: Neil De La Fuente, Oscar Sainz, Iker García-Ferrero, Eneko Agirre
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.00649
Pdf URL: https://arxiv.org/pdf/2506.00649
Copy Paste: [[2506.00649]] GuideX: Guided Synthetic Data Generation for Zero-Shot Information Extraction(https://arxiv.org/abs/2506.00649)
Keywords: language model
Abstract: Information Extraction (IE) systems are traditionally domain-specific, requiring costly adaptation that involves expert schema design, data annotation, and model training. While Large Language Models have shown promise in zero-shot IE, performance degrades significantly in unseen domains where label definitions differ. This paper introduces GUIDEX, a novel method that automatically defines domain-specific schemas, infers guidelines, and generates synthetically labeled instances, allowing for better out-of-domain generalization. Fine-tuning Llama 3.1 with GUIDEX sets a new state-of-the-art across seven zero-shot Named Entity Recognition benchmarks. Models trained with GUIDEX gain up to 7 F1 points over previous methods without human-labeled data, and nearly 2 F1 points higher when combined with it. Models trained on GUIDEX demonstrate enhanced comprehension of complex, domain-specific annotation schemas. Code, models, and synthetic datasets are available at this http URL

Title: Sarc7: Evaluating Sarcasm Detection and Generation with Seven Types and Emotion-Informed Techniques

Authors: Lang Xiong, Raina Gao, Alyssa Jeong, Yicheng Fu, Sean O'Brien, Vasu Sharma, Kevin Zhu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.00658
Pdf URL: https://arxiv.org/pdf/2506.00658
Copy Paste: [[2506.00658]] Sarc7: Evaluating Sarcasm Detection and Generation with Seven Types and Emotion-Informed Techniques(https://arxiv.org/abs/2506.00658)
Keywords: language model, prompt, chain-of-thought
Abstract: Sarcasm is a form of humor where expressions convey meanings opposite to their literal interpretations. Classifying and generating sarcasm using large language models is vital for interpreting human communication. Sarcasm poses challenges for computational models, due to its nuanced nature. We introduce Sarc7, a benchmark that classifies 7 types of sarcasm: self-deprecating, brooding, deadpan, polite, obnoxious, raging, and manic, by annotating entries of the MUStARD dataset. Classification was evaluated using zero-shot, few-shot, chain-of-thought (CoT), and a novel emotion-based prompting technique. We propose an emotion-based generation method developed by identifying key components of sarcasm: incongruity, shock value, and context dependency. Our classification experiments show that Gemini 2.5, using emotion-based prompting, outperforms other setups with an F1 score of 0.3664. Human evaluators preferred our emotion-based prompting, with 38.46% more successful generations than zero-shot prompting.

Title: SafeTy Reasoning Elicitation Alignment for Multi-Turn Dialogues

Authors: Martin Kuo, Jianyi Zhang, Aolin Ding, Louis DiValentin, Amin Hass, Benjamin F Morris, Isaac Jacobson, Randolph Linderman, James Kiessling, Nicolas Ramos, Bhavna Gopal, Maziyar Baran Pouyan, Changwei Liu, Hai Li, Yiran Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.00668
Pdf URL: https://arxiv.org/pdf/2506.00668
Copy Paste: [[2506.00668]] SafeTy Reasoning Elicitation Alignment for Multi-Turn Dialogues(https://arxiv.org/abs/2506.00668)
Keywords: language model, llm
Abstract: Malicious attackers can exploit large language models (LLMs) by engaging them in multi-turn dialogues to achieve harmful objectives, posing significant safety risks to society. To address this challenge, we propose a novel defense mechanism: SafeTy Reasoning Elicitation Alignment for Multi-Turn Dialogues (STREAM). STREAM defends LLMs against multi-turn attacks while preserving their functional capabilities. Our approach involves constructing a human-annotated dataset, the Safety Reasoning Multi-turn Dialogues dataset, which is used to fine-tune a plug-and-play safety reasoning moderator. This model is designed to identify malicious intent hidden within multi-turn conversations and alert the target LLM of potential risks. We evaluate STREAM across multiple LLMs against prevalent multi-turn attack strategies. Experimental results demonstrate that our method significantly outperforms existing defense techniques, reducing the Attack Success Rate (ASR) by 51.2%, all while maintaining comparable LLM capability.

Title: DeepRAG: Integrating Hierarchical Reasoning and Process Supervision for Biomedical Multi-Hop QA

Authors: Yuelyu Ji, Hang Zhang, Shiven Verma, Hui Ji, Chun Li, Yushui Han, Yanshan Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.00671
Pdf URL: https://arxiv.org/pdf/2506.00671
Copy Paste: [[2506.00671]] DeepRAG: Integrating Hierarchical Reasoning and Process Supervision for Biomedical Multi-Hop QA(https://arxiv.org/abs/2506.00671)
Keywords: retrieval-augmented generation
Abstract: We propose DeepRAG, a novel framework that integrates DeepSeek hierarchical question decomposition capabilities with RAG Gym unified retrieval-augmented generation optimization using process level supervision. Targeting the challenging MedHopQA biomedical question answering task, DeepRAG systematically decomposes complex queries into precise sub-queries and employs concept level reward signals informed by the UMLS ontology to enhance biomedical accuracy. Preliminary evaluations on the MedHopQA dataset indicate that DeepRAG significantly outperforms baseline models, including standalone DeepSeek and RAG Gym, achieving notable improvements in both Exact Match and concept level accuracy.

Title: Measuring Faithfulness and Abstention: An Automated Pipeline for Evaluating LLM-Generated 3-ply Case-Based Legal Arguments

Authors: Li Zhang, Morgan Gray, Jaromir Savelka, Kevin D. Ashley
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.00694
Pdf URL: https://arxiv.org/pdf/2506.00694
Copy Paste: [[2506.00694]] Measuring Faithfulness and Abstention: An Automated Pipeline for Evaluating LLM-Generated 3-ply Case-Based Legal Arguments(https://arxiv.org/abs/2506.00694)
Keywords: language model, llm, hallucination
Abstract: Large Language Models (LLMs) demonstrate potential in complex legal tasks like argument generation, yet their reliability remains a concern. Building upon pilot work assessing LLM generation of 3-ply legal arguments using human evaluation, this paper introduces an automated pipeline to evaluate LLM performance on this task, specifically focusing on faithfulness (absence of hallucination), factor utilization, and appropriate abstention. We define hallucination as the generation of factors not present in the input case materials and abstention as the model's ability to refrain from generating arguments when instructed and no factual basis exists. Our automated method employs an external LLM to extract factors from generated arguments and compares them against the ground-truth factors provided in the input case triples (current case and two precedent cases). We evaluated eight distinct LLMs on three tests of increasing difficulty: 1) generating a standard 3-ply argument, 2) generating an argument with swapped precedent roles, and 3) recognizing the impossibility of argument generation due to lack of shared factors and abstaining. Our findings indicate that while current LLMs achieve high accuracy (over 90%) in avoiding hallucination on viable argument generation tests (Tests 1 & 2), they often fail to utilize the full set of relevant factors present in the cases. Critically, on the abstention test (Test 3), most models failed to follow instructions to stop, instead generating spurious arguments despite the lack of common factors. This automated pipeline provides a scalable method for assessing these crucial LLM behaviors, highlighting the need for improvements in factor utilization and robust abstention capabilities before reliable deployment in legal settings. Project page: this https URL.
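The definitions in the entry above (hallucination as factors absent from the input cases, factor utilization, and abstention when no shared factors exist) reduce to set comparisons once factors have been extracted. The sketch below assumes that extraction has already happened and that factors are plain string labels; the function name, the dictionary keys, and the "F1"-style labels are hypothetical and not taken from the paper's pipeline.

```python
from typing import Dict, Iterable, Optional, Set, Union

Report = Dict[str, Union[Set[str], bool, Optional[float]]]


def score_argument(
    extracted_factors: Iterable[str],
    input_factors: Iterable[str],
    relevant_factors: Iterable[str],
) -> Report:
    """Compare factors found in a generated argument with the ground truth."""
    extracted = set(extracted_factors)
    allowed = set(input_factors)
    relevant = set(relevant_factors)

    # Hallucination: any factor the argument uses that the input cases lack.
    hallucinated = extracted - allowed

    # Utilization: share of the relevant factors that the argument actually used.
    utilization = len(extracted & relevant) / len(relevant) if relevant else None

    return {
        "hallucinated": hallucinated,
        "faithful": not hallucinated,
        "utilization": utilization,
        # Assumption: with no relevant shared factors, the model should abstain.
        "should_abstain": not relevant,
    }


# Hypothetical example: one invented factor, two of three relevant factors used.
report = score_argument(["F1", "F2", "F9"], ["F1", "F2", "F3"], ["F1", "F2", "F3"])
print(report)
```

Keeping faithfulness and utilization as separate fields mirrors the abstract's finding that a model can avoid hallucination while still leaving relevant factors unused.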

Title: Chain-of-Thought Training for Open E2E Spoken Dialogue Systems

Authors: Siddhant Arora, Jinchuan Tian, Hayato Futami, Jee-weon Jung, Jiatong Shi, Yosuke Kashiwagi, Emiru Tsunoo, Shinji Watanabe
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2506.00722
Pdf URL: https://arxiv.org/pdf/2506.00722
Copy Paste: [[2506.00722]] Chain-of-Thought Training for Open E2E Spoken Dialogue Systems(https://arxiv.org/abs/2506.00722)
Keywords: language model, chain-of-thought
Abstract: Unlike traditional cascaded pipelines, end-to-end (E2E) spoken dialogue systems preserve full differentiability and capture non-phonemic information, making them well-suited for modeling spoken interactions. However, existing E2E approaches often require large-scale training data and generate responses lacking semantic coherence. We propose a simple yet effective strategy leveraging a chain-of-thought (CoT) formulation, ensuring that training on conversational data remains closely aligned with the multimodal language model (LM)'s pre-training on speech recognition (ASR), text-to-speech synthesis (TTS), and text LM tasks. Our method achieves over 1.5 ROUGE-1 improvement over the baseline, successfully training spoken dialogue systems on publicly available human-human conversation datasets, while being compute-efficient enough to train on just 300 hours of public human-human conversation data, such as the Switchboard. We will publicly release our models and training code.

Title: Structured Gradient Guidance for Few-Shot Adaptation in Large Language Models

Authors: Hongye Zheng, Yichen Wang, Ray Pan, Guiran Liu, Binrong Zhu, Hanlu Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.00726
Pdf URL: https://arxiv.org/pdf/2506.00726
Copy Paste: [[2506.00726]] Structured Gradient Guidance for Few-Shot Adaptation in Large Language Models(https://arxiv.org/abs/2506.00726)
Keywords: language model
Abstract: This paper presents a gradient-informed fine-tuning method for large language models under few-shot conditions. The goal is to enhance task adaptability and training stability when data is limited. The method builds on a base loss function and introduces two gradient-related regularization terms. The first enforces gradient direction consistency to guide parameter updates along task-relevant directions and prevent drift. The second controls gradient magnitude to avoid abnormal updates. Together, these components support a more efficient and stable optimization path. To further improve cross-task generalization, the method incorporates a gradient alignment mechanism. This mechanism measures the consistency between optimization directions of the source and target tasks. It enhances fine-tuning performance in multi-task and cross-domain scenarios. Across various natural language understanding tasks, the method outperforms existing fine-tuning strategies in average accuracy, gradient stability, and directional alignment. Empirical evaluations under different sample sizes and domain-specific tasks confirm the method's robustness and broad applicability in low-resource environments. In particular, the method shows clear advantages in controlling parameter update paths. The results demonstrate that a gradient-based fine-tuning framework can effectively leverage the representational power of large language models. It ensures training stability while reducing dependence on large volumes of labeled data.
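The entry above describes a base loss plus two gradient-related regularization terms, one for direction consistency and one for magnitude, but gives no formulas. The sketch below is therefore only one plausible reading, under stated assumptions: direction consistency is taken as one minus the cosine similarity to a reference gradient, and magnitude control as a squared hinge on the gradient norm. The weights, the threshold, and every name here are hypothetical.

```python
import math
from typing import Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def norm(u: Sequence[float]) -> float:
    return math.sqrt(dot(u, u))


def regularized_loss(
    base_loss: float,
    grad: Sequence[float],
    reference_grad: Sequence[float],
    lambda_dir: float = 0.1,   # assumed weight of the direction term
    lambda_mag: float = 0.01,  # assumed weight of the magnitude term
    max_norm: float = 1.0,     # assumed threshold for "abnormal" updates
    eps: float = 1e-12,
) -> float:
    """Base loss plus direction-consistency and magnitude penalties."""
    # Direction term: 0 when grad points along reference_grad, 2 when opposite.
    cosine = dot(grad, reference_grad) / (norm(grad) * norm(reference_grad) + eps)
    direction_penalty = 1.0 - cosine

    # Magnitude term: penalize only the part of the norm above the threshold.
    magnitude_penalty = max(0.0, norm(grad) - max_norm) ** 2

    return base_loss + lambda_dir * direction_penalty + lambda_mag * magnitude_penalty


# Aligned but oversized gradient: only the magnitude term contributes.
aligned = regularized_loss(1.0, [3.0, 4.0], [3.0, 4.0])
# Unit-norm gradient orthogonal to the reference: only the direction term contributes.
orthogonal = regularized_loss(0.0, [1.0, 0.0], [0.0, 1.0])
print(aligned, orthogonal)
```

In an actual training loop the gradients would come from an autodiff framework and the penalties would themselves need to be differentiable; plain lists are used here only to keep the arithmetic visible.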

Title: Narrative Media Framing in Political Discourse

Authors: Yulia Otmakhova, Lea Frermann
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.00737
Pdf URL: https://arxiv.org/pdf/2506.00737
Copy Paste: [[2506.00737]] Narrative Media Framing in Political Discourse(https://arxiv.org/abs/2506.00737)
Keywords: llm
Abstract: Narrative frames are a powerful way of conceptualizing and communicating complex, controversial ideas; however, automated frame analysis to date has mostly overlooked this framing device. In this paper, we connect elements of narrativity with fundamental aspects of framing, and present a framework which formalizes and operationalizes such aspects. We annotate and release a data set of news articles in the climate change domain, analyze the dominance of narrative frame components across political leanings, and test LLMs in their ability to predict narrative frames and their components. Finally, we apply our framework in an unsupervised way to elicit components of narrative framing in a second domain, the COVID-19 crisis, where our predictions are congruent with prior theoretical work showing the generalizability of our approach.

Title: DefenderBench: A Toolkit for Evaluating Language Agents in Cybersecurity Environments

Authors: Chiyu Zhang, Marc-Alexandre Cote, Michael Albada, Anush Sankaran, Jack W. Stokes, Tong Wang, Amir Abdi, William Blum, Muhammad Abdul-Mageed
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.00739
Pdf URL: https://arxiv.org/pdf/2506.00739
Copy Paste: [[2506.00739]] DefenderBench: A Toolkit for Evaluating Language Agents in Cybersecurity Environments(https://arxiv.org/abs/2506.00739)
Keywords: language model, llm, agent
Abstract: Large language model (LLM) agents have shown impressive capabilities in human language comprehension and reasoning, yet their potential in cybersecurity remains underexplored. We introduce DefenderBench, a practical, open-source toolkit for evaluating language agents across offense, defense, and cybersecurity knowledge-based tasks. DefenderBench includes environments for network intrusion, malicious content detection, code vulnerability analysis, and cybersecurity knowledge assessment. It is intentionally designed to be affordable and easily accessible for researchers while providing fair and rigorous assessment. We benchmark several state-of-the-art (SoTA) and popular LLMs, including both open- and closed-weight models, using a standardized agentic framework. Our results show that Claude-3.7-sonnet performs best with a DefenderBench score of 81.65, followed by Claude-3.7-sonnet-think with 78.40, while the best open-weight model, Llama 3.3 70B, is not far behind with a DefenderBench score of 71.81. DefenderBench's modular design allows seamless integration of custom LLMs and tasks, promoting reproducibility and fair comparisons. An anonymized version of DefenderBench is available at this https URL.

Title: Data Swarms: Optimizable Generation of Synthetic Evaluation Data

Authors: Shangbin Feng, Yike Wang, Weijia Shi, Yulia Tsvetkov
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.00741
Pdf URL: https://arxiv.org/pdf/2506.00741
Copy Paste: [[2506.00741]] Data Swarms: Optimizable Generation of Synthetic Evaluation Data(https://arxiv.org/abs/2506.00741)
Keywords: llm
Abstract: We propose Data Swarms, an algorithm to optimize the generation of synthetic evaluation data and advance quantitative desiderata of LLM evaluation. We first train a swarm of initial data generators using existing data, and define various evaluation objectives to reflect the desired properties of evaluation (e.g., generate more difficult problems for the evaluated models) and quantitatively evaluate data generators. We then employ particle swarm optimization to optimize the swarm of data generators, where they collaboratively search through the model parameter space to find new generators that advance these objectives. We further extend it to Adversarial Swarms, where the data generator swarm generates harder data while the test taker model swarm learns from such data, co-evolving dynamically for better data and models simultaneously. Extensive experiments demonstrate that Data Swarms outperforms eight data generation baselines across five evaluation objectives, while Adversarial Swarms produce more robust learning of synthetic data and stronger generalization. Further analysis reveals that Data Swarms successfully optimizes compositions of multiple evaluation objectives and generalizes to new off-the-shelf LLMs, unseen at optimization time.
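For readers unfamiliar with the particle swarm optimization mentioned in the entry above, here is a minimal sketch of the standard algorithm on a toy objective. In the paper the particles are data generator models searched in parameter space; this example instead uses plain numeric vectors and a sphere function, so it illustrates only the generic update rule. The hyperparameter values and all names are assumptions.

```python
import random
from typing import Callable, List, Tuple


def particle_swarm_minimize(
    objective: Callable[[List[float]], float],
    dim: int,
    n_particles: int = 20,
    n_steps: int = 100,
    inertia: float = 0.7,
    cognitive: float = 1.5,
    social: float = 1.5,
    bound: float = 5.0,
    seed: int = 0,
) -> Tuple[List[float], float]:
    """Standard PSO: each particle is pulled toward its own and the swarm's best."""
    rng = random.Random(seed)
    positions = [[rng.uniform(-bound, bound) for _ in range(dim)] for _ in range(n_particles)]
    velocities = [[0.0] * dim for _ in range(n_particles)]

    personal_best = [p[:] for p in positions]
    personal_score = [objective(p) for p in positions]
    best_index = min(range(n_particles), key=lambda i: personal_score[i])
    global_best = personal_best[best_index][:]
    global_score = personal_score[best_index]

    for _ in range(n_steps):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                velocities[i][d] = (
                    inertia * velocities[i][d]
                    + cognitive * r1 * (personal_best[i][d] - positions[i][d])
                    + social * r2 * (global_best[d] - positions[i][d])
                )
                positions[i][d] += velocities[i][d]

            score = objective(positions[i])
            if score < personal_score[i]:
                personal_best[i] = positions[i][:]
                personal_score[i] = score
                if score < global_score:
                    global_best = positions[i][:]
                    global_score = score

    return global_best, global_score


def sphere(x: List[float]) -> float:
    """Toy objective with its minimum of 0 at the origin."""
    return sum(v * v for v in x)


best, best_score = particle_swarm_minimize(sphere, dim=3)
print(best, best_score)
```

The appeal of this scheme for the setting in the abstract is that it needs only objective evaluations, not gradients, so any quantitative evaluation objective can serve as the score being optimized.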

Title: Assortment of Attention Heads: Accelerating Federated PEFT with Head Pruning and Strategic Client Selection

Authors: Yeshwanth Venkatesha, Souvik Kundu, Priyadarshini Panda
Subjects: cs.CL, cs.AI, cs.DC
Abstract URL: https://arxiv.org/abs/2506.00743
Pdf URL: https://arxiv.org/pdf/2506.00743
Copy Paste: [[2506.00743]] Assortment of Attention Heads: Accelerating Federated PEFT with Head Pruning and Strategic Client Selection(https://arxiv.org/abs/2506.00743)
Keywords: language model, llm
Abstract: Parameter Efficient Fine-Tuning (PEFT) has become the de facto approach in adapting Large Language Models (LLMs) for downstream tasks in Natural Language Processing. However, its adoption in privacy-preserving distributed learning frameworks, such as Federated Learning (FL), remains relatively limited. This is mainly due to challenges specific to FL, such as resource-constrained devices and diverse data distributions among clients. In this paper, we propose an efficient method to perform PEFT within the FL framework for Multi-Head Attention (MHA) based language models. We address the challenges through head pruning, a novel head-specific weighted aggregation mechanism, and a client selection strategy. Head pruning minimizes training complexity within the clients, guided by the importance score computed based on the confidence of the attention head. Weighted aggregation of heads ensures the global model captures crucial updates from diverse clients complementing our client selection strategy. We show results on the MultiNLI benchmark along with 20 Newsgroups, XL-Sum, and E2E NLG datasets. We use the MultiNLI dataset and T5-small model with LoRA as our PEFT method, attaining sparsity levels of up to 90%, resulting in a communication advantage of up to 1.8x and a reduction in training OPs of 3.9x while maintaining the accuracy drop under 2%.
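Summary: Parameter-Efficient Fine-Tuning (PEFT) has become the de facto approach for adapting large language models (LLMs) to downstream natural language processing tasks. However, its adoption in privacy-preserving distributed learning frameworks such as Federated Learning (FL) remains relatively limited, mainly because of FL-specific challenges such as resource-constrained devices and diverse data distributions among clients. We propose an efficient method to perform PEFT within the FL framework for Multi-Head Attention (MHA) based language models, addressing these challenges through head pruning, a novel head-specific weighted aggregation mechanism, and a client selection strategy. Head pruning minimizes training complexity on the clients, guided by an importance score computed from the confidence of each attention head. Weighted aggregation of heads ensures the global model captures crucial updates from diverse clients, complementing the client selection strategy. We report results on the MultiNLI benchmark along with the 20 Newsgroups, XL-Sum, and E2E NLG datasets. Using MultiNLI and the T5-small model with LoRA as the PEFT method, we reach sparsity levels of up to 90%, giving a communication advantage of up to 1.8x and a 3.9x reduction in training OPs while keeping the accuracy drop under 2%.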
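
A rough sketch of what a head-specific weighted aggregation could look like is given below. The abstract does not specify the exact weighting rule, so this assumes (as a labeled assumption) that each client reports one non-negative importance score per attention head, that a pruned head has score 0, and that the server averages per-head updates in proportion to those scores. The function name and data layout are hypothetical.

```python
def aggregate_heads(client_updates, client_scores):
    """Score-weighted average of per-head updates across clients (illustrative only).

    client_updates[c][h] is a list of floats: client c's update for head h.
    client_scores[c][h] is a non-negative importance weight (0 means pruned).
    """
    n_heads = len(client_updates[0])
    aggregated = []
    for h in range(n_heads):
        total = sum(scores[h] for scores in client_scores)
        dim = len(client_updates[0][h])
        if total == 0:
            # Every client pruned this head: no update to apply.
            aggregated.append([0.0] * dim)
            continue
        aggregated.append([
            sum(scores[h] * upd[h][d]
                for upd, scores in zip(client_updates, client_scores)) / total
            for d in range(dim)
        ])
    return aggregated

# Two clients, two heads; head 1 is pruned on both clients.
updates = [[[1.0, 1.0], [2.0, 2.0]], [[3.0, 3.0], [4.0, 4.0]]]
scores = [[1.0, 0.0], [3.0, 0.0]]
result = aggregate_heads(updates, scores)
```

Weighting by head rather than by client means a client with a confident head 0 but a pruned head 1 still contributes fully to head 0, which is one plausible reading of "head-specific" aggregation.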

Title: Translate With Care: Addressing Gender Bias, Neutrality, and Reasoning in Large Language Model Translations

Authors: Pardis Sadat Zahraei, Ali Emami
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2506.00748
Pdf URL: https://arxiv.org/pdf/2506.00748
Copy Paste: [[2506.00748]] Translate With Care: Addressing Gender Bias, Neutrality, and Reasoning in Large Language Model Translations(https://arxiv.org/abs/2506.00748)
Keywords: language model, gpt, llm
Abstract: Addressing gender bias and maintaining logical coherence in machine translation remains challenging, particularly when translating between natural gender languages, like English, and genderless languages, such as Persian, Indonesian, and Finnish. We introduce the Translate-with-Care (TWC) dataset, comprising 3,950 challenging scenarios across six low- to mid-resource languages, to assess translation systems' performance. Our analysis of diverse technologies, including GPT-4, mBART-50, NLLB-200, and Google Translate, reveals a universal struggle in translating genderless content, resulting in gender stereotyping and reasoning errors. All models preferred masculine pronouns when gender stereotypes could influence choices. Google Translate and GPT-4 showed particularly strong bias, favoring male pronouns 4-6 times more than feminine ones in leadership and professional success contexts. Fine-tuning mBART-50 on TWC substantially resolved these biases and errors, led to strong generalization, and surpassed proprietary LLMs while remaining open-source. This work emphasizes the need for targeted approaches to gender and semantic coherence in machine translation, particularly for genderless languages, contributing to more equitable and accurate translation systems.
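Summary: Addressing gender bias and maintaining logical coherence in machine translation remains challenging, particularly when translating between natural gender languages, like English, and genderless languages, such as Persian, Indonesian, and Finnish. We introduce the Translate-with-Care (TWC) dataset, comprising 3,950 challenging scenarios across six low- to mid-resource languages, to assess the performance of translation systems. Our analysis of diverse technologies, including GPT-4, mBART-50, NLLB-200, and Google Translate, reveals a universal struggle in translating genderless content, resulting in gender stereotyping and reasoning errors. All models preferred masculine pronouns when gender stereotypes could influence choices. Google Translate and GPT-4 showed particularly strong bias, favoring male pronouns 4-6 times more than feminine ones in leadership and professional success contexts. Fine-tuning mBART-50 on TWC substantially resolved these biases and errors, led to strong generalization, and surpassed proprietary LLMs while remaining open-source. This work emphasizes the need for targeted approaches to gender and semantic coherence in machine translation, particularly for genderless languages, contributing to more equitable and accurate translation systems.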

Title: Understanding and Mitigating Cross-lingual Privacy Leakage via Language-specific and Universal Privacy Neurons

Authors: Wenshuo Dong, Qingsong Yang, Shu Yang, Lijie Hu, Meng Ding, Wanyu Lin, Tianhang Zheng, Di Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.00759
Pdf URL: https://arxiv.org/pdf/2506.00759
Copy Paste: [[2506.00759]] Understanding and Mitigating Cross-lingual Privacy Leakage via Language-specific and Universal Privacy Neurons(https://arxiv.org/abs/2506.00759)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) trained on massive data capture rich information embedded in the training data. However, this also introduces the risk of privacy leakage, particularly involving personally identifiable information (PII). Although previous studies have shown that this risk can be mitigated through methods such as privacy neurons, they all assume that both the (sensitive) training data and user queries are in English. We show that they cannot defend against the privacy leakage in cross-lingual contexts: even if the training data is exclusively in one language, these (private) models may still reveal private information when queried in another language. In this work, we first investigate the information flow of cross-lingual privacy leakage to give a better understanding. We find that LLMs process private information in the middle layers, where representations are largely shared across languages. The risk of leakage peaks when converted to a language-specific space in later layers. Based on this, we identify privacy-universal neurons and language-specific privacy neurons. Privacy-universal neurons influence privacy leakage across all languages, while language-specific privacy neurons are only related to specific languages. By deactivating these neurons, the cross-lingual privacy leakage risk is reduced by 23.3%-31.6%.
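Summary: Large language models (LLMs) trained on massive data capture rich information embedded in the training data. This also introduces the risk of privacy leakage, particularly involving personally identifiable information (PII). Although previous studies have shown that this risk can be mitigated through methods such as privacy neurons, they all assume that both the (sensitive) training data and user queries are in English. We show that these methods cannot defend against privacy leakage in cross-lingual contexts: even if the training data is exclusively in one language, these (private) models may still reveal private information when queried in another language. We first investigate the information flow of cross-lingual privacy leakage to understand it better. We find that LLMs process private information in the middle layers, where representations are largely shared across languages, and that the risk of leakage peaks when representations are converted to a language-specific space in later layers. Based on this, we identify privacy-universal neurons and language-specific privacy neurons. Privacy-universal neurons influence privacy leakage across all languages, while language-specific privacy neurons relate only to specific languages. Deactivating these neurons reduces the cross-lingual privacy leakage risk by 23.3%-31.6%.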

Title: Dynamic Chunking and Selection for Reading Comprehension of Ultra-Long Context in Large Language Models

Authors: Boheng Sheng, Jiacheng Yao, Meicong Zhang, Guoxiu He
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.00773
Pdf URL: https://arxiv.org/pdf/2506.00773
Copy Paste: [[2506.00773]] Dynamic Chunking and Selection for Reading Comprehension of Ultra-Long Context in Large Language Models(https://arxiv.org/abs/2506.00773)
Keywords: language model, llm, long context
Abstract: Large language models (LLMs) often struggle to accurately read and comprehend extremely long texts. Current methods for improvement typically rely on splitting long contexts into fixed-length chunks. However, fixed truncation risks separating semantically relevant content, leading to ambiguity and compromising accurate understanding. To overcome this limitation, we propose a straightforward approach for dynamically separating and selecting chunks of long context, facilitating a more streamlined input for LLMs. In particular, we compute semantic similarities between adjacent sentences, using lower similarities to adaptively divide long contexts into variable-length chunks. We further train a question-aware classifier to select sensitive chunks that are critical for answering specific questions. Experimental results on both single-hop and multi-hop question-answering benchmarks show that the proposed approach consistently outperforms strong baselines. Notably, it maintains robustness across a wide range of input lengths, handling sequences of up to 256k tokens. Our datasets and code are available at the following link: this https URL
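Summary: Large language models (LLMs) often struggle to accurately read and comprehend extremely long texts. Current methods for improvement typically rely on splitting long contexts into fixed-length chunks. However, fixed truncation risks separating semantically relevant content, leading to ambiguity and compromising accurate understanding. To overcome this limitation, we propose a straightforward approach for dynamically separating and selecting chunks of long context, giving LLMs a more streamlined input. In particular, we compute semantic similarities between adjacent sentences and use points of lower similarity to adaptively divide long contexts into variable-length chunks. We further train a question-aware classifier to select the sensitive chunks that are critical for answering a specific question. Experimental results on both single-hop and multi-hop question-answering benchmarks show that the proposed approach consistently outperforms strong baselines. Notably, it remains robust across a wide range of input lengths, handling sequences of up to 256k tokens. Our datasets and code are available at the following link: this https URL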
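
The chunking idea in the entry above (split where adjacent sentences are least similar) can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' code: a bag-of-words cosine similarity stands in for whatever semantic similarity the paper uses, the fixed `threshold` is a hypothetical parameter, and the question-aware chunk classifier is omitted entirely.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def dynamic_chunks(sentences, threshold=0.2):
    """Start a new chunk whenever adjacent-sentence similarity drops below threshold."""
    if not sentences:
        return []
    # Assumption: word counts approximate the sentence embeddings a real system would use.
    vecs = [Counter(s.lower().split()) for s in sentences]
    chunks = [[sentences[0]]]
    for prev, cur, sent in zip(vecs, vecs[1:], sentences[1:]):
        if cosine(prev, cur) < threshold:
            chunks.append([sent])      # low similarity: topic boundary
        else:
            chunks[-1].append(sent)    # high similarity: same chunk
    return chunks

sentences = [
    "the cat sat on the mat",
    "the cat slept on the mat",
    "stock prices fell sharply today",
    "stock prices rose sharply yesterday",
]
chunks = dynamic_chunks(sentences)
```

In this toy input the two cat sentences share most words and stay together, while the switch to the stock sentences has zero overlap and triggers a boundary, so chunk lengths follow the content rather than a fixed token budget.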

Title: Improving Automatic Evaluation of Large Language Models (LLMs) in Biomedical Relation Extraction via LLMs-as-the-Judge

Authors: Md Tahmid Rahman Laskar, Israt Jahan, Elham Dolatabadi, Chun Peng, Enamul Hoque, Jimmy Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.00777
Pdf URL: https://arxiv.org/pdf/2506.00777
Copy Paste: [[2506.00777]] Improving Automatic Evaluation of Large Language Models (LLMs) in Biomedical Relation Extraction via LLMs-as-the-Judge(https://arxiv.org/abs/2506.00777)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have demonstrated impressive performance in biomedical relation extraction, even in zero-shot scenarios. However, evaluating LLMs in this task remains challenging due to their ability to generate human-like text, often producing synonyms or abbreviations of gold-standard answers, making traditional automatic evaluation metrics unreliable. On the other hand, while human evaluation is more reliable, it is costly and time-consuming, making it impractical for real-world applications. This paper investigates the use of LLMs-as-the-Judge as an alternative evaluation method for biomedical relation extraction. We benchmark 8 LLMs as judges to evaluate the responses generated by 5 other LLMs across 3 biomedical relation extraction datasets. Unlike other text-generation tasks, we observe that LLM-based judges perform quite poorly (usually below 50% accuracy) in the biomedical relation extraction task. Our findings reveal that it happens mainly because relations extracted by LLMs do not adhere to any standard format. To address this, we propose structured output formatting for LLM-generated responses that helps LLM-Judges to improve their performance by about 15% (on average). We also introduce a domain adaptation technique to further enhance LLM-Judge performance by effectively transferring knowledge between datasets. We release both our human-annotated and LLM-annotated judgment data (36k samples in total) for public use here: this https URL.
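Summary: Large language models (LLMs) have demonstrated impressive performance in biomedical relation extraction, even in zero-shot scenarios. However, evaluating LLMs on this task remains challenging because they generate human-like text, often producing synonyms or abbreviations of gold-standard answers, which makes traditional automatic evaluation metrics unreliable. Human evaluation is more reliable but costly and time-consuming, making it impractical for real-world applications. This paper investigates LLMs-as-the-Judge as an alternative evaluation method for biomedical relation extraction. We benchmark 8 LLMs as judges to evaluate the responses generated by 5 other LLMs across 3 biomedical relation extraction datasets. Unlike in other text-generation tasks, we observe that LLM-based judges perform quite poorly (usually below 50% accuracy) on biomedical relation extraction. Our findings reveal that this happens mainly because relations extracted by LLMs do not adhere to any standard format. To address this, we propose structured output formatting for LLM-generated responses, which helps LLM-Judges improve their performance by about 15% on average. We also introduce a domain adaptation technique that further enhances LLM-Judge performance by effectively transferring knowledge between datasets. We release both our human-annotated and LLM-annotated judgment data (36k samples in total) for public use here: this https URL.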

Title: KG-TRACES: Enhancing Large Language Models with Knowledge Graph-constrained Trajectory Reasoning and Attribution Supervision

Authors: Rong Wu, Pinlong Cai, Jianbiao Mei, Licheng Wen, Tao Hu, Xuemeng Yang, Daocheng Fu, Botian Shi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.00783
Pdf URL: https://arxiv.org/pdf/2506.00783
Copy Paste: [[2506.00783]] KG-TRACES: Enhancing Large Language Models with Knowledge Graph-constrained Trajectory Reasoning and Attribution Supervision(https://arxiv.org/abs/2506.00783)
Keywords: language model, llm, hallucination
Abstract: Large language models (LLMs) have made remarkable strides in various natural language processing tasks, but their performance on complex reasoning problems remains hindered by a lack of explainability and trustworthiness. This issue, often manifesting as hallucinations or unattributable reasoning processes, limits their applicability in complex reasoning scenarios. To address this, we propose Knowledge Graph-constrained Trajectory Reasoning Attribution and Chain Explanation Supervision (KG-TRACES), a novel framework that enhances the reasoning ability of LLMs through explicit supervision over reasoning paths and processes. KG-TRACES jointly supervises the model to: (1) predict symbolic relation paths, (2) predict full triple-level reasoning paths, and (3) generate attribution-aware reasoning processes grounded in the reasoning paths. At inference phase, the model adapts to both KG-available and KG-unavailable scenarios, retrieving reasoning paths from a KG when possible or predicting plausible reasoning paths with only intrinsic knowledge when not. This design enables the model to reason in an explainable and source-attributable pattern. Through extensive experiments on complex reasoning tasks, we demonstrate that KG-TRACES significantly outperforms existing SOTA: it improves Hits@1 by 1.6% and F1 by 4.7% on WebQSP, and achieves improvements of 4.8% in Hits@1 and 2.1% in F1 on CWQ. Moreover, we show its transferability to specialized domains such as medicine. By visualizing the intermediate steps of reasoning processes, we further show that the explicit supervision introduced by KG-TRACES leads to more stable and goal-directed reasoning processes, aligning closely with correct answers. Code is available at this https URL.
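Summary: Large language models (LLMs) have made remarkable strides in various natural language processing tasks, but their performance on complex reasoning problems remains hindered by a lack of explainability and trustworthiness. This issue, often manifesting as hallucinations or unattributable reasoning processes, limits their applicability in complex reasoning scenarios. To address this, we propose Knowledge Graph-constrained Trajectory Reasoning Attribution and Chain Explanation Supervision (KG-TRACES), a novel framework that enhances the reasoning ability of LLMs through explicit supervision over reasoning paths and processes. KG-TRACES jointly supervises the model to (1) predict symbolic relation paths, (2) predict full triple-level reasoning paths, and (3) generate attribution-aware reasoning processes grounded in the reasoning paths. At inference time, the model adapts to both KG-available and KG-unavailable scenarios, retrieving reasoning paths from a KG when possible and otherwise predicting plausible reasoning paths from intrinsic knowledge alone. This design enables the model to reason in an explainable and source-attributable way. Through extensive experiments on complex reasoning tasks, we show that KG-TRACES significantly outperforms existing SOTA: it improves Hits@1 by 1.6% and F1 by 4.7% on WebQSP, and by 4.8% in Hits@1 and 2.1% in F1 on CWQ. We also show its transferability to specialized domains such as medicine. By visualizing the intermediate steps of reasoning processes, we further show that the explicit supervision introduced by KG-TRACES leads to more stable and goal-directed reasoning that aligns closely with correct answers. Code is available at this https URL.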
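
For readers unfamiliar with the two metrics reported in the entry above, here is a minimal sketch of how Hits@1 and set-level F1 are commonly computed for a single knowledge-graph question. These are the usual textbook definitions, stated as an assumption; the paper's exact evaluation script may differ in details such as answer normalization.

```python
def hits_at_1(ranked_predictions, gold):
    """1.0 if the top-ranked prediction is one of the gold answers, else 0.0."""
    return 1.0 if ranked_predictions and ranked_predictions[0] in gold else 0.0

def set_f1(predictions, gold):
    """Harmonic mean of precision and recall between predicted and gold answer sets."""
    pred, gold = set(predictions), set(gold)
    overlap = len(pred & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

# One question with a single gold answer and two predicted answers.
h = hits_at_1(["Paris", "Lyon"], {"Paris"})
f = set_f1(["Paris", "Lyon"], ["Paris"])
```

Hits@1 rewards only the first answer, while F1 also penalizes extra wrong answers, which is why the two numbers in the abstract can move by different amounts on the same dataset.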

Title: Research Borderlands: Analysing Writing Across Research Cultures

Authors: Shaily Bhatt, Tal August, Maria Antoniak
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.00784
Pdf URL: https://arxiv.org/pdf/2506.00784
Copy Paste: [[2506.00784]] Research Borderlands: Analysing Writing Across Research Cultures(https://arxiv.org/abs/2506.00784)
Keywords: llm
Abstract: Improving cultural competence of language technologies is important. However most recent works rarely engage with the communities they study, and instead rely on synthetic setups and imperfect proxies of culture. In this work, we take a human-centered approach to discover and measure language-based cultural norms, and cultural competence of LLMs. We focus on a single kind of culture, research cultures, and a single task, adapting writing across research cultures. Through a set of interviews with interdisciplinary researchers, who are experts at moving between cultures, we create a framework of structural, stylistic, rhetorical, and citational norms that vary across research cultures. We operationalise these features with a suite of computational metrics and use them for (a) surfacing latent cultural norms in human-written research papers at scale; and (b) highlighting the lack of cultural competence of LLMs, and their tendency to homogenise writing. Overall, our work illustrates the efficacy of a human-centered approach to measuring cultural norms in human-written and LLM-generated texts.
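Summary: Improving the cultural competence of language technologies is important. However, most recent works rarely engage with the communities they study, and instead rely on synthetic setups and imperfect proxies of culture. In this work, we take a human-centered approach to discovering and measuring language-based cultural norms and the cultural competence of LLMs. We focus on a single kind of culture, research cultures, and a single task, adapting writing across research cultures. Through a set of interviews with interdisciplinary researchers, who are experts at moving between cultures, we create a framework of structural, stylistic, rhetorical, and citational norms that vary across research cultures. We operationalise these features with a suite of computational metrics and use them for (a) surfacing latent cultural norms in human-written research papers at scale, and (b) highlighting the lack of cultural competence of LLMs and their tendency to homogenise writing. Overall, our work illustrates the efficacy of a human-centered approach to measuring cultural norms in human-written and LLM-generated texts.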

Title: RARE: Retrieval-Aware Robustness Evaluation for Retrieval-Augmented Generation Systems

Authors: Yixiao Zeng, Tianyu Cao, Danqing Wang, Xinran Zhao, Zimeng Qiu, Morteza Ziyadi, Tongshuang Wu, Lei Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.00789
Pdf URL: https://arxiv.org/pdf/2506.00789
Copy Paste: [[2506.00789]] RARE: Retrieval-Aware Robustness Evaluation for Retrieval-Augmented Generation Systems(https://arxiv.org/abs/2506.00789)
Keywords: retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) enhances recency and factuality in answers. However, existing evaluations rarely test how well these systems cope with real-world noise, conflicts between internal and external retrieved contexts, or fast-changing facts. We introduce Retrieval-Aware Robustness Evaluation (RARE), a unified framework and large-scale benchmark that jointly stress-tests query and document perturbations over dynamic, time-sensitive corpora. One of the central features of RARE is a knowledge-graph-driven synthesis pipeline (RARE-Get) that automatically extracts single and multi-hop relations from the customized corpus and generates multi-level question sets without manual intervention. Leveraging this pipeline, we construct a dataset (RARE-Set) spanning 400 expert-level time-sensitive finance, economics, and policy documents and 48,322 questions whose distribution evolves as the underlying sources change. To quantify resilience, we formalize retrieval-conditioned robustness metrics (RARE-Met) that capture a model's ability to remain correct or recover when queries, documents, or real-world retrieval results are systematically altered. Our results show that RAG systems exhibit surprising vulnerability to perturbations, with document robustness consistently being the weakest point regardless of generator size or architecture. RAG systems consistently show lower robustness on multi-hop queries than single-hop queries across all domains.
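Summary: Retrieval-Augmented Generation (RAG) enhances recency and factuality in answers. However, existing evaluations rarely test how well these systems cope with real-world noise, conflicts between internal and external retrieved contexts, or fast-changing facts. We introduce Retrieval-Aware Robustness Evaluation (RARE), a unified framework and large-scale benchmark that jointly stress-tests query and document perturbations over dynamic, time-sensitive corpora. A central feature of RARE is a knowledge-graph-driven synthesis pipeline (RARE-Get) that automatically extracts single and multi-hop relations from a customized corpus and generates multi-level question sets without manual intervention. Using this pipeline, we construct a dataset (RARE-Set) spanning 400 expert-level, time-sensitive finance, economics, and policy documents and 48,322 questions whose distribution evolves as the underlying sources change. To quantify resilience, we formalize retrieval-conditioned robustness metrics (RARE-Met) that capture a model's ability to remain correct or recover when queries, documents, or real-world retrieval results are systematically altered. Our results show that RAG systems are surprisingly vulnerable to perturbations, with document robustness consistently the weakest point regardless of generator size or architecture. RAG systems also consistently show lower robustness on multi-hop queries than on single-hop queries across all domains.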

Title: Fast or Slow? Integrating Fast Intuition and Deliberate Thinking for Enhancing Visual Question Answering

Authors: Songtao Jiang, Chenyi Zhou, Yan Zhang, Yeying Jin, Zuozhu Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.00806
Pdf URL: https://arxiv.org/pdf/2506.00806
Copy Paste: [[2506.00806]] Fast or Slow? Integrating Fast Intuition and Deliberate Thinking for Enhancing Visual Question Answering(https://arxiv.org/abs/2506.00806)
Keywords: language model, llm, prompt
Abstract: Multimodal large language models (MLLMs) still struggle with complex reasoning tasks in Visual Question Answering (VQA). While current methods have advanced by incorporating visual prompts, our study uncovers critical limitations: these approaches indiscriminately annotate all detected objects for every visual question, generating excessive visual markers that degrade task performance. This issue stems primarily from a lack of focus on key visual elements, raising two important questions: Are all objects equally important, and do all questions require visual prompts? Motivated by Dual Process Theory, which distinguishes between instinctive and deliberate cognitive modes in human reasoning, we propose FOCUS, a plug-and-play approach that dynamically adapts to the complexity of questions, combining fast intuitive judgments with deliberate analytical reasoning to enhance the vision-language reasoning capability of the MLLM. For straightforward questions, FOCUS supports efficient zero-shot reasoning. For more complex tasks, it employs the conceptualizing before observation strategy to highlight critical elements. Extensive experiments on four benchmarks, ScienceQA, TextQA, VizWiz, and MME, demonstrate that FOCUS consistently improves the performance of both open-source and black-box MLLMs, achieving significant gains across all datasets. Ablation studies further validate the importance of combining diverse cognitive strategies with refined visual information for superior performance. Code will be released.
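Summary: Multimodal large language models (MLLMs) still struggle with complex reasoning tasks in Visual Question Answering (VQA). While current methods have advanced by incorporating visual prompts, our study uncovers critical limitations: these approaches indiscriminately annotate all detected objects for every visual question, generating excessive visual markers that degrade task performance. The issue stems primarily from a lack of focus on key visual elements, raising two important questions: are all objects equally important, and do all questions require visual prompts? Motivated by Dual Process Theory, which distinguishes between instinctive and deliberate cognitive modes in human reasoning, we propose FOCUS, a plug-and-play approach that dynamically adapts to the complexity of questions, combining fast intuitive judgments with deliberate analytical reasoning to enhance the vision-language reasoning capability of the MLLM. For straightforward questions, FOCUS supports efficient zero-shot reasoning. For more complex tasks, it employs the conceptualizing before observation strategy to highlight critical elements. Extensive experiments on four benchmarks, ScienceQA, TextQA, VizWiz, and MME, demonstrate that FOCUS consistently improves the performance of both open-source and black-box MLLMs, achieving significant gains across all datasets. Ablation studies further validate the importance of combining diverse cognitive strategies with refined visual information. Code will be released.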

Title: GuessBench: Sensemaking Multimodal Creativity in the Wild

Authors: Zifeng Zhu, Shangbin Feng, Herun Wan, Ningnan Wang, Minnan Luo, Yulia Tsvetkov
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.00814
Pdf URL: https://arxiv.org/pdf/2506.00814
Copy Paste: [[2506.00814]] GuessBench: Sensemaking Multimodal Creativity in the Wild(https://arxiv.org/abs/2506.00814)
Keywords: language model, gpt
Abstract: We propose GuessBench, a novel benchmark that evaluates Vision Language Models (VLMs) on modeling the pervasive, noisy, and pluralistic human creativity. GuessBench sources data from "Guess the Build", an online multiplayer Minecraft minigame where one player constructs a Minecraft build given a concept (e.g. caterpillar) and others try to guess it with natural language hints, presenting a pristine testbed for sensemaking creativity in the wild with VLMs acting as guessers. We curate 1500 images from the actual gameplay and design 2000 problems spanning static and dynamic image settings, natural language hints of varying completeness, and more. Extensive experiments with six open/API VLMs and five reasoning enhancement approaches demonstrate that GuessBench presents a uniquely challenging task in creativity modeling: even the state-of-the-art GPT-4o is incorrect on 34% of instances, while we observe a huge performance gap (13.87% vs. 53.93% on average) between open and API models. When used as a resource to improve VLMs, fine-tuning on the reasoning traces for GuessBench problems improves visual perception tasks by 15.36% on average. Further analysis reveals that VLM performance in creativity sensemaking correlates with the frequency of the concept in training data, while the accuracy drops sharply for concepts in underrepresented cultural contexts and low-resource languages.
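Summary: We propose GuessBench, a novel benchmark that evaluates Vision Language Models (VLMs) on modeling pervasive, noisy, and pluralistic human creativity. GuessBench sources data from "Guess the Build", an online multiplayer Minecraft minigame in which one player constructs a build for a given concept (e.g. caterpillar) and the others try to guess it with the help of natural language hints, providing a pristine testbed for sensemaking of creativity in the wild with VLMs acting as guessers. We curate 1500 images from actual gameplay and design 2000 problems spanning static and dynamic image settings, natural language hints of varying completeness, and more. Extensive experiments with six open/API VLMs and five reasoning enhancement approaches show that GuessBench is a uniquely challenging task in creativity modeling: even the state-of-the-art GPT-4o is incorrect on 34% of instances, and there is a large performance gap (13.87% vs. 53.93% on average) between open and API models. When used as a resource to improve VLMs, fine-tuning on reasoning traces for GuessBench problems improves visual perception tasks by 15.36% on average. Further analysis reveals that VLM performance in creativity sensemaking correlates with the frequency of the concept in training data, while accuracy drops sharply for concepts from underrepresented cultural contexts and low-resource languages.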

Title: From Plain Text to Poetic Form: Generating Metrically-Constrained Sanskrit Verses

Authors: Manoj Balaji Jagadeeshan, Samarth Bhatia, Pretam Ray, Harshul Raj Surana, Akhil Rajeev P, Priya Mishra, Annarao Kulkarni, Ganesh Ramakrishnan, Prathosh AP, Pawan Goyal
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.00815
Pdf URL: https://arxiv.org/pdf/2506.00815
Copy Paste: [[2506.00815]] From Plain Text to Poetic Form: Generating Metrically-Constrained Sanskrit Verses(https://arxiv.org/abs/2506.00815)
Keywords: language model, llm
Abstract: Recent advances in large language models (LLMs) have significantly improved natural language generation, including creative tasks like poetry composition. However, most progress remains concentrated in high-resource languages. This raises an important question: Can LLMs be adapted for structured poetic generation in a low-resource, morphologically rich language such as Sanskrit? In this work, we introduce a dataset designed for translating English prose into structured Sanskrit verse, with strict adherence to classical metrical patterns, particularly the Anushtub meter. We evaluate a range of generative models, both open-source and proprietary, under multiple settings. Specifically, we explore constrained decoding strategies and instruction-based fine-tuning tailored to metrical and semantic fidelity. Our decoding approach achieves over 99% accuracy in producing syntactically valid poetic forms, substantially outperforming general-purpose models in meter conformity. Meanwhile, instruction-tuned variants show improved alignment with source meaning and poetic style, as supported by human assessments, albeit with marginal trade-offs in metrical precision.
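Summary: Recent advances in large language models (LLMs) have significantly improved natural language generation, including creative tasks like poetry composition. However, most progress remains concentrated in high-resource languages. This raises an important question: can LLMs be adapted for structured poetic generation in a low-resource, morphologically rich language such as Sanskrit? We introduce a dataset designed for translating English prose into structured Sanskrit verse, with strict adherence to classical metrical patterns, particularly the Anushtub meter. We evaluate a range of generative models, both open-source and proprietary, under multiple settings. Specifically, we explore constrained decoding strategies and instruction-based fine-tuning tailored to metrical and semantic fidelity. Our decoding approach achieves over 99% accuracy in producing syntactically valid poetic forms, substantially outperforming general-purpose models in meter conformity. Meanwhile, instruction-tuned variants show improved alignment with source meaning and poetic style, as supported by human assessments, albeit with marginal trade-offs in metrical precision.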

Title: One for All: Update Parameterized Knowledge Across Multiple Models

Authors: Weitao Ma, Xiyuan Du, Xiaocheng Feng, Lei Huang, Yichong Huang, Huiyi Zhang, Xiaoliang Yang, Baohang Li, Xiachong Feng, Ting Liu, Bing Qin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.00817
Pdf URL: https://arxiv.org/pdf/2506.00817
Copy Paste: [[2506.00817]] One for All: Update Parameterized Knowledge Across Multiple Models(https://arxiv.org/abs/2506.00817)
Keywords: language model, llm, hallucination
Abstract: Large language models (LLMs) encode vast world knowledge but struggle to stay up-to-date, often leading to errors and hallucinations. Knowledge editing offers an efficient alternative to retraining, enabling targeted modifications by updating specific model parameters. However, existing methods primarily focus on individual models, posing challenges in efficiently updating multiple models and adapting to new models. To address this, we propose OnceEdit, a novel ensemble-based approach that employs a plug-in model as the editing module, enabling stable knowledge updates across multiple models. Building on the model ensemble, OnceEdit introduces two key mechanisms to enhance its effectiveness. First, we introduce a dynamic weight mechanism through a \weight token for distinguishing between edit-related and non-edit-related instances, ensuring the appropriate utilization of knowledge from integrated models. Second, we incorporate an ensemble enhancement mechanism to mitigate the excessive reliance on the central model inherent in the model ensemble technique, making it more suitable for knowledge editing. Extensive experiments on diverse LLMs demonstrate that OnceEdit consistently outperforms existing methods while achieving superior editing efficiency. Further analysis confirms its adaptability and stability in multi-model editing scenarios. Our code will be available.
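Summary: Large language models (LLMs) encode vast world knowledge but struggle to stay up-to-date, often leading to errors and hallucinations. Knowledge editing offers an efficient alternative to retraining, enabling targeted modifications by updating specific model parameters. However, existing methods primarily focus on individual models, which makes it hard to update multiple models efficiently and to adapt to new models. To address this, we propose OnceEdit, a novel ensemble-based approach that employs a plug-in model as the editing module, enabling stable knowledge updates across multiple models. Building on the model ensemble, OnceEdit introduces two key mechanisms to enhance its effectiveness. First, we introduce a dynamic weight mechanism through a \weight token that distinguishes between edit-related and non-edit-related instances, ensuring appropriate use of knowledge from the integrated models. Second, we incorporate an ensemble enhancement mechanism to mitigate the excessive reliance on the central model inherent in the model ensemble technique, making it more suitable for knowledge editing. Extensive experiments on diverse LLMs demonstrate that OnceEdit consistently outperforms existing methods while achieving superior editing efficiency. Further analysis confirms its adaptability and stability in multi-model editing scenarios. Our code will be available.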

Title: Probing the Geometry of Truth: Consistency and Generalization of Truth Directions in LLMs Across Logical Transformations and Question Answering Tasks

Authors: Yuntai Bao, Xuhong Zhang, Tianyu Du, Xinkui Zhao, Zhengwen Feng, Hao Peng, Jianwei Yin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.00823
Pdf URL: https://arxiv.org/pdf/2506.00823
Copy Paste: [[2506.00823]] Probing the Geometry of Truth: Consistency and Generalization of Truth Directions in LLMs Across Logical Transformations and Question Answering Tasks(https://arxiv.org/abs/2506.00823)
Keywords: language model, llm
Abstract: Large language models (LLMs) are trained on extensive datasets that encapsulate substantial world knowledge. However, their outputs often include confidently stated inaccuracies. Earlier works suggest that LLMs encode truthfulness as a distinct linear feature, termed the "truth direction", which can classify truthfulness reliably. We address several open questions about the truth direction: (i) whether LLMs universally exhibit consistent truth directions; (ii) whether sophisticated probing techniques are necessary to identify truth directions; and (iii) how the truth direction generalizes across diverse contexts. Our findings reveal that not all LLMs exhibit consistent truth directions, with stronger representations observed in more capable models, particularly in the context of logical negation. Additionally, we demonstrate that truthfulness probes trained on declarative atomic statements can generalize effectively to logical transformations, question-answering tasks, in-context learning, and external knowledge sources. Finally, we explore the practical application of truthfulness probes in selective question-answering, illustrating their potential to improve user trust in LLM outputs. These results advance our understanding of truth directions and provide new insights into the internal representations of LLM beliefs. Our code is public at this https URL
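Summary: Large language models (LLMs) are trained on extensive datasets that encapsulate substantial world knowledge. However, their outputs often include confidently stated inaccuracies. Earlier works suggest that LLMs encode truthfulness as a distinct linear feature, termed the "truth direction", which can classify truthfulness reliably. We address several open questions about the truth direction: (i) whether LLMs universally exhibit consistent truth directions; (ii) whether sophisticated probing techniques are necessary to identify truth directions; and (iii) how the truth direction generalizes across diverse contexts. Our findings reveal that not all LLMs exhibit consistent truth directions, with stronger representations observed in more capable models, particularly in the context of logical negation. Additionally, we demonstrate that truthfulness probes trained on declarative atomic statements can generalize effectively to logical transformations, question-answering tasks, in-context learning, and external knowledge sources. Finally, we explore the practical application of truthfulness probes in selective question-answering, illustrating their potential to improve user trust in LLM outputs. These results advance our understanding of truth directions and provide new insights into the internal representations of LLM beliefs. Our code is public at this https URL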
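
To make the notion of a linear "truth direction" in the entry above concrete, here is a minimal sketch of one simple probing technique, a difference-of-means probe. It is offered as an illustration only and is not claimed to be the probe used in the paper; the 2-dimensional "activations" are made-up toy vectors standing in for hidden states extracted from an LLM.

```python
def mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

def fit_direction(true_acts, false_acts):
    """Direction = mean(true) - mean(false); bias places the boundary at the midpoint."""
    mu_t, mu_f = mean(true_acts), mean(false_acts)
    direction = [t - f for t, f in zip(mu_t, mu_f)]
    midpoint = [(t + f) / 2 for t, f in zip(mu_t, mu_f)]
    bias = -sum(w * m for w, m in zip(direction, midpoint))
    return direction, bias

def predict_true(act, direction, bias):
    """Classify an activation by which side of the hyperplane it falls on."""
    return sum(w * x for w, x in zip(direction, act)) + bias > 0

# Hypothetical activations for true and false statements.
true_acts = [[1.0, 0.2], [0.8, -0.1]]
false_acts = [[-1.0, 0.1], [-0.9, 0.0]]
direction, bias = fit_direction(true_acts, false_acts)
```

A probe this simple has no trained parameters beyond two class means, which makes it a useful baseline when asking, as the abstract does, whether more sophisticated probing techniques are actually needed.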

Title: HERGC: Heterogeneous Experts Representation and Generative Completion for Multimodal Knowledge Graphs

Authors: Yongkang Xiao, Rui Zhang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.00826
Pdf URL: https://arxiv.org/pdf/2506.00826
Copy Paste: [[2506.00826]] HERGC: Heterogeneous Experts Representation and Generative Completion for Multimodal Knowledge Graphs(https://arxiv.org/abs/2506.00826)
Keywords: language model, llm
Abstract: Multimodal knowledge graphs (MMKGs) enrich traditional knowledge graphs (KGs) by incorporating diverse modalities such as images and text. Multi-modal knowledge graph completion (MMKGC) seeks to exploit these heterogeneous signals to infer missing facts, thereby mitigating the intrinsic incompleteness of MMKGs. Existing MMKGC methods typically leverage only the information contained in the MMKGs under the closed-world assumption and adopt discriminative training objectives, which limits their reasoning capacity during completion. Recent generative completion approaches powered by advanced large language models (LLMs) have shown strong reasoning abilities in unimodal knowledge graph completion, but their potential in MMKGC remains largely unexplored. To bridge this gap, we propose HERGC, a Heterogeneous Experts Representation and Generative Completion framework for MMKGs. HERGC first deploys a Heterogeneous Experts Representation Retriever that enriches and fuses multimodal information and retrieves a compact candidate set for each incomplete triple. It then uses a Generative LLM Predictor fine-tuned on minimal instruction data to accurately identify the correct answer from these candidates. Extensive experiments on three standard MMKG benchmarks demonstrate HERGC's effectiveness and robustness, achieving state-of-the-art performance.
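Summary: Multimodal knowledge graphs (MMKGs) enrich traditional knowledge graphs (KGs) by incorporating diverse modalities such as images and text. Multimodal knowledge graph completion (MMKGC) seeks to exploit these heterogeneous signals to infer missing facts, thereby mitigating the intrinsic incompleteness of MMKGs. Existing MMKGC methods typically use only the information contained in the MMKG under the closed-world assumption and adopt discriminative training objectives, which limits their reasoning capacity during completion. Recent generative completion approaches powered by advanced large language models (LLMs) have shown strong reasoning abilities in unimodal knowledge graph completion, but their potential in MMKGC remains largely unexplored. To bridge this gap, we propose HERGC, a Heterogeneous Experts Representation and Generative Completion framework for MMKGs. HERGC first deploys a Heterogeneous Experts Representation Retriever that enriches and fuses multimodal information and retrieves a compact candidate set for each incomplete triple. It then uses a Generative LLM Predictor, fine-tuned on minimal instruction data, to accurately identify the correct answer among these candidates. Extensive experiments on three standard MMKG benchmarks demonstrate HERGC's effectiveness and robustness, achieving state-of-the-art performance.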

Title: COMPKE: Complex Question Answering under Knowledge Editing

Authors: Keyuan Cheng, Zijian Kan, Zhixian He, Zhuoran Zhang, Muhammad Asif Ali, Ke Xu, Lijie Hu, Di Wang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.00829
Pdf URL: https://arxiv.org/pdf/2506.00829
Copy Paste: [[2506.00829]] COMPKE: Complex Question Answering under Knowledge Editing(https://arxiv.org/abs/2506.00829)
Keywords: language model, gpt
Abstract: Knowledge Editing, which efficiently modifies the knowledge in large language models, has gathered great attention. Current benchmarks primarily use multi-hop question answering to assess and analyze newly injected or updated knowledge. However, we argue that these benchmarks fail to effectively evaluate how well the updated models apply this knowledge in real-life scenarios, particularly when questions require complex reasoning, involving one-to-many relationships or multi-step logical intersections. To fill in this gap, we introduce a new benchmark, COMPKE: Complex Question Answering under Knowledge Editing, which includes 11,924 complex questions that reflect real-life situations. We conduct an extensive evaluation of four knowledge editing methods on COMPKE, revealing that their effectiveness varies notably across different models. For instance, MeLLo attains an accuracy of 39.47 on GPT-4O-MINI, but this drops sharply to 3.83 on QWEN2.5-3B. We further investigate the underlying causes of these disparities from both methodological and model-specific perspectives. The datasets are available at this https URL.

Title: Toward Structured Knowledge Reasoning: Contrastive Retrieval-Augmented Generation on Experience

Authors: Jiawei Gu, Ziting Xian, Yuanzhen Xie, Ye Liu, Enjie Liu, Ruichao Zhong, Mochi Gao, Yunzhi Tan, Bo Hu, Zang Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.00842
Pdf URL: https://arxiv.org/pdf/2506.00842
Copy Paste: [[2506.00842]] Toward Structured Knowledge Reasoning: Contrastive Retrieval-Augmented Generation on Experience(https://arxiv.org/abs/2506.00842)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Large language models (LLMs) achieve strong performance on plain text tasks but underperform on structured data like tables and databases. Potential challenges arise from their underexposure during pre-training and rigid text-to-structure transfer mechanisms. Unlike humans who seamlessly apply learned patterns across data modalities, LLMs struggle to infer implicit relationships embedded in tabular formats, especially in the absence of explicit structural guidance. To bridge this cognitive gap, we introduce Contrastive Retrieval-Augmented Generation on Experience (CoRE), a framework that builds experience memory representations and enhances generalization through contrastive In-Context Learning (ICL) to simulate human-like knowledge transfer. Experiments on Text-to-SQL and TableQA show CoRE significantly improves performance, achieving average gains of 3.44% and 4.24%, with up to 17.2% on challenging tasks. Our Monte Carlo Tree Search (MCTS)-generated Experience Memory expands training data 8-9x, enhancing diversity and domain coverage. This training-free and continual method propels LLMs toward structured knowledge expertise.

Title: EEG2TEXT-CN: An Exploratory Study of Open-Vocabulary Chinese Text-EEG Alignment via Large Language Model and Contrastive Learning on ChineseEEG

Authors: Jacky Tai-Yu Lu, Jung Chiang, Chi-Sheng Chen, Anna Nai-Yun Tung, Hsiang Wei Hu, Yuan Chiao Cheng
Subjects: cs.CL, cs.AI, cs.LG, cs.MM, q-bio.NC
Abstract URL: https://arxiv.org/abs/2506.00854
Pdf URL: https://arxiv.org/pdf/2506.00854
Copy Paste: [[2506.00854]] EEG2TEXT-CN: An Exploratory Study of Open-Vocabulary Chinese Text-EEG Alignment via Large Language Model and Contrastive Learning on ChineseEEG(https://arxiv.org/abs/2506.00854)
Keywords: language model
Abstract: We propose EEG2TEXT-CN, which, to the best of our knowledge, represents one of the earliest open-vocabulary EEG-to-text generation frameworks tailored for Chinese. Built on a biologically grounded EEG encoder (NICE-EEG) and a compact pretrained language model (MiniLM), our architecture aligns multichannel brain signals with natural language representations via masked pretraining and contrastive learning. Using a subset of the ChineseEEG dataset, where each sentence contains approximately ten Chinese characters aligned with 128-channel EEG recorded at 256 Hz, we segment EEG into per-character embeddings and predict full sentences in a zero-shot setting. The decoder is trained with teacher forcing and padding masks to accommodate variable-length sequences. Evaluation on over 1,500 training-validation sentences and 300 held-out test samples shows promising lexical alignment, with a best BLEU-1 score of 6.38%. While syntactic fluency remains a challenge, our findings demonstrate the feasibility of non-phonetic, cross-modal language decoding from EEG. This work opens a new direction in multilingual brain-to-text research and lays the foundation for future cognitive-language interfaces in Chinese.
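
For readers unfamiliar with the metric, BLEU-1 can be sketched as clipped unigram precision multiplied by a brevity penalty. The sketch below assumes character-level tokens (a common choice for Chinese) and a single reference sentence. The abstract does not state the tokenization or the BLEU implementation used, so the function name `bleu1` and these choices are illustrative assumptions, not the authors' evaluation code.

```python
import math
from collections import Counter


def bleu1(candidate: str, reference: str) -> float:
    """Character-level BLEU-1 on a 0-to-1 scale (assumed tokenization)."""
    cand = list(candidate)
    ref = list(reference)
    if not cand:
        return 0.0
    cand_counts = Counter(cand)
    ref_counts = Counter(ref)
    # Clip each candidate character count by its count in the reference.
    overlap = sum(min(n, ref_counts[tok]) for tok, n in cand_counts.items())
    precision = overlap / len(cand)
    # Brevity penalty: penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision


# Three of four characters match and the lengths are equal.
print(bleu1("我喜欢猫", "我喜欢狗"))  # 0.75
```

On this 0-to-1 scale, a score reported as 6.38% would correspond to roughly 0.064.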

Title: How Bidirectionality Helps Language Models Learn Better via Dynamic Bottleneck Estimation

Authors: Md Kowsher, Nusrat Jahan Prottasha, Shiyun Xu, Shetu Mohanto, Chen Chen, Niloofar Yousefi, Ozlem Garibay
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.00859
Pdf URL: https://arxiv.org/pdf/2506.00859
Copy Paste: [[2506.00859]] How Bidirectionality Helps Language Models Learn Better via Dynamic Bottleneck Estimation(https://arxiv.org/abs/2506.00859)
Keywords: language model
Abstract: Bidirectional language models have better context understanding and perform better than unidirectional models on natural language understanding tasks, yet the theoretical reasons behind this advantage remain unclear. In this work, we investigate this disparity through the lens of the Information Bottleneck (IB) principle, which formalizes a trade-off between compressing input information and preserving task-relevant content. We propose FlowNIB, a dynamic and scalable method for estimating mutual information during training that addresses key limitations of classical IB approaches, including computational intractability and fixed trade-off schedules. Theoretically, we show that bidirectional models retain more mutual information and exhibit higher effective dimensionality than unidirectional models. To support this, we present a generalized framework for measuring representational complexity and prove that bidirectional representations are strictly more informative under mild conditions. We further validate our findings through extensive experiments across multiple models and tasks using FlowNIB, revealing how information is encoded and compressed throughout training. Together, our work provides a principled explanation for the effectiveness of bidirectional architectures and introduces a practical tool for analyzing information flow in deep language models.
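
For reference, the classical Information Bottleneck objective that the abstract builds on is commonly written as below, where $X$ is the input, $Y$ the task target, $Z$ the learned representation, $I(\cdot;\cdot)$ mutual information, and $\beta > 0$ a trade-off weight. This is the standard textbook form, given here only as background. The abstract does not state FlowNIB's own objective, so the formula should not be read as the paper's formulation.

```latex
\min_{p(z \mid x)} \; I(X; Z) \;-\; \beta \, I(Z; Y)
```

The first term rewards compressing the input, the second rewards retaining task-relevant information, and a constant $\beta$ fixes the balance between the two for the whole of training.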

Title: L3Cube-MahaEmotions: A Marathi Emotion Recognition Dataset with Synthetic Annotations using CoTR prompting and Large Language Models

Authors: Nidhi Kowtal, Raviraj Joshi
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.00863
Pdf URL: https://arxiv.org/pdf/2506.00863
Copy Paste: [[2506.00863]] L3Cube-MahaEmotions: A Marathi Emotion Recognition Dataset with Synthetic Annotations using CoTR prompting and Large Language Models(https://arxiv.org/abs/2506.00863)
Keywords: language model, gpt, llm, prompt
Abstract: Emotion recognition in low-resource languages like Marathi remains challenging due to limited annotated data. We present L3Cube-MahaEmotions, a high-quality Marathi emotion recognition dataset with 11 fine-grained emotion labels. The training data is synthetically annotated using large language models (LLMs), while the validation and test sets are manually labeled to serve as a reliable gold-standard benchmark. Building on the MahaSent dataset, we apply the Chain-of-Translation (CoTR) prompting technique, where Marathi sentences are translated into English and emotion labeled via a single prompt. GPT-4 and Llama3-405B were evaluated, with GPT-4 selected for training data annotation due to superior label quality. We evaluate model performance using standard metrics and explore label aggregation strategies (e.g., Union, Intersection). While GPT-4 predictions outperform fine-tuned BERT models, BERT-based models trained on synthetic labels fail to surpass GPT-4. This highlights both the importance of high-quality human-labeled data and the inherent complexity of emotion recognition. An important finding of this work is that generic LLMs like GPT-4 and Llama3-405B generalize better than fine-tuned BERT for complex low-resource emotion recognition tasks. The dataset and model are shared publicly at this https URL

Title: What's Missing in Vision-Language Models? Probing Their Struggles with Causal Order Reasoning

Authors: Zhaotian Weng, Haoxuan Li, Kuan-Hao Huang, Jieyu Zhao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.00869
Pdf URL: https://arxiv.org/pdf/2506.00869
Copy Paste: [[2506.00869]] What's Missing in Vision-Language Models? Probing Their Struggles with Causal Order Reasoning(https://arxiv.org/abs/2506.00869)
Keywords: language model
Abstract: Despite the impressive performance of vision-language models (VLMs) on downstream tasks, their ability to understand and reason about causal relationships in visual inputs remains unclear. Robust causal reasoning is fundamental to solving complex high-level reasoning tasks, yet existing benchmarks often include a mixture of reasoning questions, and VLMs can frequently exploit object recognition and activity identification as shortcuts to arrive at the correct answers, making it challenging to truly assess their causal reasoning abilities. To bridge this gap, we introduce VQA-Causal and VCR-Causal, two new benchmarks specifically designed to isolate and rigorously evaluate VLMs' causal reasoning abilities. Our findings reveal that while VLMs excel in object and activity recognition, they perform poorly on causal reasoning tasks, often only marginally surpassing random guessing. Further analysis suggests that this limitation stems from a severe lack of causal expressions in widely used training datasets, where causal relationships are rarely explicitly conveyed. We additionally explore fine-tuning strategies with hard negative cases, showing that targeted fine-tuning can improve a model's causal reasoning while maintaining generalization and downstream performance. Our study highlights a key gap in current VLMs and lays the groundwork for future work on causal understanding.

Title: CC-Tuning: A Cross-Lingual Connection Mechanism for Improving Joint Multilingual Supervised Fine-Tuning

Authors: Yangfan Ye, Xiaocheng Feng, Zekun Yuan, Xiachong Feng, Libo Qin, Lei Huang, Weitao Ma, Yichong Huang, Zhirui Zhang, Yunfei Lu, Xiaohui Yan, Duyu Tang, Dandan Tu, Bing Qin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.00875
Pdf URL: https://arxiv.org/pdf/2506.00875
Copy Paste: [[2506.00875]] CC-Tuning: A Cross-Lingual Connection Mechanism for Improving Joint Multilingual Supervised Fine-Tuning(https://arxiv.org/abs/2506.00875)
Keywords: language model, llm
Abstract: Current large language models (LLMs) often exhibit imbalanced multilingual capabilities due to their English-centric training corpora. To address this, existing fine-tuning approaches operating at the data level (e.g., through data augmentation or distillation) typically introduce implicit cross-lingual alignment, overlooking the potential for more profound, latent-level cross-lingual interactions. In this work, we propose CC-Tuning, a novel multilingual fine-tuning paradigm that explicitly establishes a cross-lingual connection mechanism at the latent level. During training, CC-Tuning fuses the feed-forward activations from both English and non-English inputs, enabling the model to benefit from both linguistic resources. This process is facilitated by a trainable Decision Maker that identifies beneficial activations. Furthermore, during inference, a Transform Matrix is utilized to simulate the cross-lingual connection under a monolingual setting through representation transformation. Our experiments on six benchmarks covering 22 languages show that CC-Tuning outperforms vanilla SFT and offers a strong latent-level alternative to data-level augmentation methods. Further analysis also highlights the practicality of CC-Tuning and the potential of latent-level cross-lingual interactions in advancing the multilingual performance of LLMs.

Title: Not Every Token Needs Forgetting: Selective Unlearning to Limit Change in Utility in Large Language Model Unlearning

Authors: Yixin Wan, Anil Ramakrishna, Kai-Wei Chang, Volkan Cevher, Rahul Gupta
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.00876
Pdf URL: https://arxiv.org/pdf/2506.00876
Copy Paste: [[2506.00876]] Not Every Token Needs Forgetting: Selective Unlearning to Limit Change in Utility in Large Language Model Unlearning(https://arxiv.org/abs/2506.00876)
Keywords: language model, llm
Abstract: Large Language Model (LLM) unlearning has recently gained significant attention, driven by the need to remove unwanted information, such as private, sensitive, or copyrighted content, from LLMs. However, conventional unlearning approaches indiscriminately update model parameters to forget all tokens in a target document, including common tokens (e.g., pronouns, prepositions, general nouns) that carry general knowledge. In this paper, we highlight that not every token needs forgetting. We propose Selective Unlearning (SU), which identifies a critical subset of tokens within the forgetting set that is relevant to the unwanted information, and unlearns only those tokens. Experiments on two benchmarks and six baseline unlearning algorithms demonstrate that SU not only achieves effective unlearning on the targeted forget data, but also significantly preserves the model's utility in the retaining set.
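
The contrast between forgetting every token and forgetting only a selected subset can be sketched with a token-masked loss. The abstract does not say how SU selects tokens or which unlearning loss it uses, so everything here is an assumption made for illustration: the per-token negative log-likelihoods are made-up numbers, the boolean `selected` mask is hand-written, and the negated-NLL (gradient-ascent style) objective stands in for whichever baseline algorithm is actually applied.

```python
def forget_loss_all(token_nll):
    """Conventional objective: negated mean NLL over every token."""
    return -sum(token_nll) / len(token_nll)


def forget_loss_selective(token_nll, selected):
    """Selective objective: negated mean NLL over selected tokens only."""
    picked = [nll for nll, keep in zip(token_nll, selected) if keep]
    if not picked:
        # Nothing selected, so this document contributes no forgetting signal.
        return 0.0
    return -sum(picked) / len(picked)


# Hypothetical per-token NLLs for a four-token forget document.
token_nll = [2.0, 0.5, 3.0, 0.1]
# Hypothetical mask: only the first and third tokens carry the unwanted information.
selected = [True, False, True, False]

loss_all = forget_loss_all(token_nll)
loss_selective = forget_loss_selective(token_nll, selected)
```

Under this sketch, common tokens such as pronouns or prepositions fall outside the mask and therefore receive no unlearning gradient.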

Title: Improve MLLM Benchmark Efficiency through Interview

Authors: Farong Wen, Yijin Guo, Junying Wang, Jiaohao Xiao, Yingjie Zhou, Chunyi Li, Zicheng Zhang, Guangtao Zhai
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.00883
Pdf URL: https://arxiv.org/pdf/2506.00883
Copy Paste: [[2506.00883]] Improve MLLM Benchmark Efficiency through Interview(https://arxiv.org/abs/2506.00883)
Keywords: language model, llm
Abstract: The rapid development of Multimodal Large Language Models (MLLMs) has led to a wide range of MLLM applications, and a number of benchmark datasets have sprung up to assess MLLM abilities. However, full-coverage Q&A testing on large-scale data is resource-intensive and time-consuming. To address this issue, we propose the MLLM Interview (MITV) strategy, which aims to quickly obtain MLLM performance metrics by asking fewer questions. First, we constructed the interview dataset, which was built on an existing MLLM assessment dataset, by adding difficulty labels based on the performance of some typical MLLMs on this dataset. Second, we propose an MLLM Interview strategy, which obtains an initial estimate of the model's performance by quizzing it on a small number of topics and then continuously tries to test the model's limits. Through extensive experiments, the results show that the MITV strategy proposed in this paper performs well on MLLM benchmark datasets, and it is able to assess a model's capability faster through a small number of questions and answers.

Title: Affordance Benchmark for MLLMs

Authors: Junying Wang, Wenzhe Li, Yalun Wu, Yingji Liang, Yijin Guo, Chunyi Li, Haodong Duan, Zicheng Zhang, Guangtao Zhai
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.00893
Pdf URL: https://arxiv.org/pdf/2506.00893
Copy Paste: [[2506.00893]] Affordance Benchmark for MLLMs(https://arxiv.org/abs/2506.00893)
Keywords: language model, llm
Abstract: Affordance theory posits that environments inherently offer action possibilities that shape perception and behavior. While Multimodal Large Language Models (MLLMs) excel in vision-language tasks, their ability to perceive affordance, which is crucial for intuitive and safe interactions, remains underexplored. To address this, we introduce A4Bench, a novel benchmark designed to evaluate the affordance perception abilities of MLLMs across two dimensions: 1) Constitutive Affordance, assessing understanding of inherent object properties through 1,282 question-answer pairs spanning nine sub-disciplines, and 2) Transformative Affordance, probing dynamic and contextual nuances (e.g., misleading, time-dependent, cultural, or individual-specific affordance) with 718 challenging question-answer pairs. Evaluating 17 MLLMs (nine proprietary and eight open-source) against human performance, we find that proprietary models generally outperform open-source counterparts, but all exhibit limited capabilities, particularly in transformative affordance perception. Furthermore, even top-performing models, such as Gemini-2.0-Pro (18.05% overall exact match accuracy), significantly lag behind human performance (best: 85.34%, worst: 81.25%). These findings highlight critical gaps in environmental understanding of MLLMs and provide a foundation for advancing AI systems toward more robust, context-aware interactions. The dataset is available at this https URL.

Title: SocialEval: Evaluating Social Intelligence of Large Language Models

Authors: Jinfeng Zhou, Yuxuan Chen, Yihan Shi, Xuanming Zhang, Leqi Lei, Yi Feng, Zexuan Xiong, Miao Yan, Xunzhi Wang, Yaru Cao, Jianing Yin, Shuai Wang, Quanyu Dai, Zhenhua Dong, Hongning Wang, Minlie Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.00900
Pdf URL: https://arxiv.org/pdf/2506.00900
Copy Paste: [[2506.00900]] SocialEval: Evaluating Social Intelligence of Large Language Models(https://arxiv.org/abs/2506.00900)
Keywords: language model, llm
Abstract: LLMs exhibit promising Social Intelligence (SI) in modeling human behavior, raising the need to evaluate LLMs' SI and their discrepancy with humans. SI equips humans with interpersonal abilities to behave wisely in navigating social interactions to achieve social goals. This presents an operational evaluation paradigm: outcome-oriented goal achievement evaluation and process-oriented interpersonal ability evaluation, which existing work fails to address. To this end, we propose SocialEval, a script-based bilingual SI benchmark, integrating outcome- and process-oriented evaluation by manually crafting narrative scripts. Each script is structured as a world tree that contains plot lines driven by interpersonal ability, providing a comprehensive view of how LLMs navigate social interactions. Experiments show that LLMs fall behind humans on both SI evaluations, exhibit prosociality, and prefer more positive social behaviors, even if they lead to goal failure. Analysis of LLMs' formed representation space and neuronal activations reveals that LLMs have developed ability-specific functional partitions akin to the human brain.

Title: Pi-SQL: Enhancing Text-to-SQL with Fine-Grained Guidance from Pivot Programming Languages

Authors: Yongdong chi, Hanqing Wang, Zonghan Yang, Jian Yang, Xiao Yan, Yun Chen, Guanhua Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.00912
Pdf URL: https://arxiv.org/pdf/2506.00912
Copy Paste: [[2506.00912]] Pi-SQL: Enhancing Text-to-SQL with Fine-Grained Guidance from Pivot Programming Languages(https://arxiv.org/abs/2506.00912)
Keywords: prompt
Abstract: Text-to-SQL transforms user queries from natural language into executable SQL programs, enabling non-experts to interact with complex databases. Existing prompt-based methods craft meticulous text guidelines and examples to facilitate SQL generation, but their accuracy is hindered by the large semantic gap between the texts and the low-resource SQL programs. In this work, we propose Pi-SQL, which incorporates the high-resource Python program as a pivot to bridge between the natural language query and SQL program. In particular, Pi-SQL first generates Python programs that provide fine-grained step-by-step guidelines in their code blocks or comments, and then produces an SQL program following the guidance of each Python program. The final SQL program matches the reference Python program's query results and, through selection from candidates generated by different strategies, achieves superior execution speed, with a reward-based valid efficiency score up to 4.55 higher than the best-performing baseline. Experiments demonstrate the effectiveness of Pi-SQL, which improves the execution accuracy of the best-performing baseline by up to 3.20.

Title: How do Transformer Embeddings Represent Compositions? A Functional Analysis

Authors: Aishik Nagar, Ishaan Singh Rawal, Mansi Dhanania, Cheston Tan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.00914
Pdf URL: https://arxiv.org/pdf/2506.00914
Copy Paste: [[2506.00914]] How do Transformer Embeddings Represent Compositions? A Functional Analysis(https://arxiv.org/abs/2506.00914)
Keywords: language model
Abstract: Compositionality is a key aspect of human intelligence, essential for reasoning and generalization. While transformer-based models have become the de facto standard for many language modeling tasks, little is known about how they represent compound words, and whether these representations are compositional. In this study, we test compositionality in Mistral, OpenAI Large, and Google embedding models, and compare them with BERT. First, we evaluate compositionality in the representations by examining six diverse models of compositionality (addition, multiplication, dilation, regression, etc.). We find that ridge regression, albeit linear, best accounts for compositionality. Surprisingly, we find that the classic vector addition model performs almost as well as any other model. Next, we verify that most embedding models are highly compositional, while BERT shows much poorer compositionality. We verify and visualize our findings with a synthetic dataset consisting of fully transparent adjective-noun compositions. Overall, we present a thorough investigation of compositionality.
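
Two of the composition models named in the abstract, addition and multiplication, can be sketched on toy vectors. The three-dimensional vectors below are invented for illustration and do not come from any of the evaluated embedding models; scoring a composed vector by cosine similarity against the compound's own embedding is likewise an assumed protocol, since the abstract does not describe the metric.

```python
import math


def add_compose(u, v):
    """Additive model: element-wise sum of the constituent vectors."""
    return [a + b for a, b in zip(u, v)]


def mul_compose(u, v):
    """Multiplicative model: element-wise product of the constituent vectors."""
    return [a * b for a, b in zip(u, v)]


def cosine(u, v):
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical embeddings for an adjective, a noun, and their compound.
adjective = [1.0, 0.0, 1.0]
noun = [0.0, 1.0, 1.0]
compound = [1.0, 1.0, 2.0]

add_score = cosine(add_compose(adjective, noun), compound)
mul_score = cosine(mul_compose(adjective, noun), compound)
```

In this toy setup the compound vector was constructed as the exact sum of its parts, so the additive model scores higher by design; on real embeddings the ranking is an empirical question.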

Title: anyECG-chat: A Generalist ECG-MLLM for Flexible ECG Input and Multi-Task Understanding

Authors: Haitao Li, Ziyu Li, Yiheng Mao, Ziyi Liu, Zhoujian Sun, Zhengxing Huang
Subjects: cs.CL, cs.AI, eess.SP
Abstract URL: https://arxiv.org/abs/2506.00942
Pdf URL: https://arxiv.org/pdf/2506.00942
Copy Paste: [[2506.00942]] anyECG-chat: A Generalist ECG-MLLM for Flexible ECG Input and Multi-Task Understanding(https://arxiv.org/abs/2506.00942)
Keywords: language model, llm, chat
Abstract: The advent of multimodal large language models (MLLMs) has sparked interest in their application to electrocardiogram (ECG) analysis. However, existing ECG-focused MLLMs primarily focus on report generation tasks, often limited to single 12-lead, short-duration (10 s) ECG inputs, thereby underutilizing the potential of MLLMs. To this end, we aim to develop an MLLM for ECG analysis that supports a broader range of tasks and more flexible ECG inputs. However, existing ECG-QA datasets are often monotonous. To address this gap, we first constructed the anyECG dataset, which encompasses a wide variety of tasks, including report generation, abnormal waveform localization, and open-ended question answering. In addition to standard hospital ECGs, we introduced long-duration reduced-lead ECGs for home environments and multiple ECG comparison scenarios commonly encountered in clinical practice. Furthermore, we propose the anyECG-chat model, which supports dynamic-length ECG inputs and multiple ECG inputs. We trained the model using a three-stage curriculum training recipe with the anyECG dataset. A comprehensive evaluation was conducted, demonstrating that anyECG-chat is capable of supporting various practical application scenarios, including not only common report generation tasks but also abnormal waveform localization for long-duration reduced-lead ECGs in home environments and comprehensive comparative analysis of multiple ECGs.

Title: Leveraging Large Language Models for Sarcastic Speech Annotation in Sarcasm Detection

Authors: Zhu Li, Yuqing Zhang, Xiyuan Gao, Shekhar Nayak, Matt Coler
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.00955
Pdf URL: https://arxiv.org/pdf/2506.00955
Copy Paste: [[2506.00955]] Leveraging Large Language Models for Sarcastic Speech Annotation in Sarcasm Detection(https://arxiv.org/abs/2506.00955)
Keywords: language model, gpt, llm
Abstract: Sarcasm fundamentally alters meaning through tone and context, yet detecting it in speech remains a challenge due to data scarcity. In addition, existing detection systems often rely on multimodal data, limiting their applicability in contexts where only speech is available. To address this, we propose an annotation pipeline that leverages large language models (LLMs) to generate a sarcasm dataset. Using a publicly available sarcasm-focused podcast, we employ GPT-4o and LLaMA 3 for initial sarcasm annotations, followed by human verification to resolve disagreements. We validate this approach by comparing annotation quality and detection performance on a publicly available sarcasm dataset using a collaborative gating architecture. Finally, we introduce PodSarc, a large-scale sarcastic speech dataset created through this pipeline. The detection model achieves a 73.63% F1 score, demonstrating the dataset's potential as a benchmark for sarcasm detection research.

Title: From Objectives to Questions: A Planning-based Framework for Educational Mathematical Question Generation

Authors: Cheng Cheng, Zhenya Huang, Guanhao Zhao, Yuxiang Guo, Xin Lin, Jinze Wu, Xin Li, Shijin Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.00963
Pdf URL: https://arxiv.org/pdf/2506.00963
Copy Paste: [[2506.00963]] From Objectives to Questions: A Planning-based Framework for Educational Mathematical Question Generation(https://arxiv.org/abs/2506.00963)
Keywords: language model
Abstract: Automatically generating high-quality mathematical problems that align with educational objectives is a crucial task in NLP-based educational technology. Traditional generation methods focus primarily on textual quality, but they often overlook educational objectives. Moreover, these methods address only single-dimensional, simple question generation, failing to meet complex, multifaceted educational requirements. To address these challenges, we constructed and annotated EduMath, a dataset of 16k mathematical questions with multi-dimensional educational objectives. Based on this dataset, we developed EQGEVAL, which incorporates three evaluation dimensions and is designed to assess the ability of models to generate educational questions. Drawing inspiration from teachers' problem design processes, we propose the Educational Question Planning with self-Reflection (EQPR) method for educational mathematical question generation, following a "plan-evaluate-optimize" approach. Specifically, by combining a planning algorithm based on Monte Carlo Tree Search with the generative capabilities of Large Language Models, we continuously optimize questions through iterative feedback. This self-optimization mechanism ensures that the generated questions both fit the educational context and strategically achieve specific basic educational objectives. Through extensive experiments based on EQGEVAL, we have demonstrated that EQPR achieves significant improvements in generating questions that meet multi-dimensional educational objectives.
摘要：在基于NLP的教育技术中，自动产生与教育目标保持一致的高质量数学问题。传统的一代方法主要关注文本质量，但它们经常忽略教育目标。此外，这些方法仅解决单一的简单问题生成，无法满足复杂的多方面教育要求。为了应对这些挑战，我们构建并注释了Edumath，这是一个具有多维教育目标的16K数学问题的数据集。基于此数据集，我们开发了Eqgeval，该数据集包含了三个评估维度，旨在评估模型产生教育问题的能力。从教师的问题设计过程中汲取灵感，我们以自我反省（EQPR）方法提出了教育问题，以进行教育数学问题的产生，按照“计划评估 - 优化”的方法。具体而言，通过将基于蒙特卡洛树搜索的计划算法与大语言模型的生成能力相结合，我们通过迭代反馈不断优化问题。这种自我优化的机制可确保生成的问题既适合教育环境，又可以在战略上实现特定的基本教育目标。通过基于EQGEVAL的广泛实验，我们证明了EQPR在产生符合多维教育目标的问题方面取得了重大改进。

Title: ACCESS DENIED INC: The First Benchmark Environment for Sensitivity Awareness

Authors: Dren Fazlija, Arkadij Orlov, Sandipan Sikdar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.00964
Pdf URL: https://arxiv.org/pdf/2506.00964
Copy Paste: [[2506.00964]] ACCESS DENIED INC: The First Benchmark Environment for Sensitivity Awareness(https://arxiv.org/abs/2506.00964)
Keywords: language model, llm
Abstract: Large language models (LLMs) are increasingly becoming valuable to corporate data management due to their ability to process text from various document formats and facilitate user interactions through natural language queries. However, LLMs must consider the sensitivity of information when communicating with employees, especially given access restrictions. Simple filtering based on user clearance levels can pose both performance and privacy challenges. To address this, we propose the concept of sensitivity awareness (SA), which enables LLMs to adhere to predefined access rights rules. In addition, we developed a benchmarking environment called ACCESS DENIED INC to evaluate SA. Our experimental findings reveal significant variations in model behavior, particularly in managing unauthorized data requests while effectively addressing legitimate queries. This work establishes a foundation for benchmarking sensitivity-aware language models and provides insights to enhance privacy-centric AI systems in corporate environments.
摘要：大型语言模型（LLM）由于能够从各种文档格式处理文本并通过自然语言查询来促进用户互动，因此越来越多地对公司数据管理变得有价值。但是，LLMS在与员工沟通时必须考虑信息的敏感性，尤其是给定访问限制。基于用户间隙级别的简单过滤可能会构成性能和隐私挑战。为了解决这个问题，我们提出了敏感性意识（SA）的概念，使LLMS能够遵守预定义的访问权利规则。此外，我们开发了一个名为Access Denied Inc的基准测试环境来评估SA。我们的实验发现揭示了模型行为的显着差异，尤其是在管理未经授权的数据请求时，同时有效地解决了合法的查询。这项工作为基准敏感性意识语言模型奠定了基础，并提供了见解以增强以隐私为中心的AI系统在公司环境中。

Title: XGUARD: A Graded Benchmark for Evaluating Safety Failures of Large Language Models on Extremist Content

Authors: Vadivel Abishethvarman, Bhavik Chandna, Pratik Jalan, Usman Naseem
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.00973
Pdf URL: https://arxiv.org/pdf/2506.00973
Copy Paste: [[2506.00973]] XGUARD: A Graded Benchmark for Evaluating Safety Failures of Large Language Models on Extremist Content(https://arxiv.org/abs/2506.00973)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) can generate content spanning ideological rhetoric to explicit instructions for violence. However, existing safety evaluations often rely on simplistic binary labels (safe and unsafe), overlooking the nuanced spectrum of risk these outputs pose. To address this, we present XGUARD, a benchmark and evaluation framework designed to assess the severity of extremist content generated by LLMs. XGUARD includes 3,840 red teaming prompts sourced from real world data such as social media and news, covering a broad range of ideologically charged scenarios. Our framework categorizes model responses into five danger levels (0 to 4), enabling a more nuanced analysis of both the frequency and severity of failures. We introduce the interpretable Attack Severity Curve (ASC) to visualize vulnerabilities and compare defense mechanisms across threat intensities. Using XGUARD, we evaluate six popular LLMs and two lightweight defense strategies, revealing key insights into current safety gaps and trade-offs between robustness and expressive freedom. Our work underscores the value of graded safety metrics for building trustworthy LLMs.
摘要：大型语言模型（LLM）可以生成跨越意识形态言论的内容，以明确指示暴力。但是，现有的安全评估通常依赖于简单的二进制标签（安全且不安全），从而忽略了这些输出构成的细微风险。为了解决这个问题，我们提出了Xguard，这是一个基准和评估框架，旨在评估LLMS产生的极端主义内容的严重性。 Xguard包括3,840个红色团队提示，这些提示来自社交媒体和新闻等现实世界数据，涵盖了广泛的意识形态方案。我们的框架将模型响应分为五个危险水平（0到4），从而对失败的频率和严重程度进行了更细微的分析。我们介绍了可解释的攻击严重性曲线（ASC），以可视化脆弱性并比较跨威胁强度的防御机制。我们使用XGUARD评估了六个受欢迎的LLM和两种轻量级防御策略，揭示了对鲁棒性和表现力自由之间当前安全差距和权衡的关键见解。我们的工作强调了构建值得信赖的LLM的分级安全指标的价值。

Title: NTPP: Generative Speech Language Modeling for Dual-Channel Spoken Dialogue via Next-Token-Pair Prediction

Authors: Qichao Wang, Ziqiao Meng, Wenqian Cui, Yifei Zhang, Pengcheng Wu, Bingzhe Wu, Irwin King, Liang Chen, Peilin Zhao
Subjects: cs.CL, cs.AI, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2506.00975
Pdf URL: https://arxiv.org/pdf/2506.00975
Copy Paste: [[2506.00975]] NTPP: Generative Speech Language Modeling for Dual-Channel Spoken Dialogue via Next-Token-Pair Prediction(https://arxiv.org/abs/2506.00975)
Keywords: language model, gpt
Abstract: Inspired by the impressive capabilities of GPT-4o, there is growing interest in enabling speech language models (SLMs) to engage in natural, fluid spoken interactions with humans. Recent advancements have led to the development of several SLMs that demonstrate promising results in this area. However, current approaches have yet to fully exploit dual-channel speech data, which inherently captures the structure and dynamics of human conversation. In this work, we systematically explore the use of dual-channel speech data in the context of modern large language models, and introduce a novel generative modeling paradigm, Next-Token-Pair Prediction (NTPP), to enable speaker-independent dual-channel spoken dialogue learning using decoder-only architectures for the first time. We evaluate our approach on standard benchmarks, and empirical results show that our proposed method, NTPP, significantly improves the conversational abilities of SLMs in terms of turn-taking prediction, response coherence, and naturalness. Moreover, compared to existing methods, NTPP achieves substantially lower inference latency, highlighting its practical efficiency for real-time applications.
摘要：受GPT-4O的令人印象深刻的能力的启发，人们对使语音语言模型（SLM）与人类进行自然，流畅的口语相互作用越来越兴趣。最近的进步导致了几个SLM的发展，这些SLM在这一领域表现出了令人鼓舞的结果。但是，当前的方法尚未完全利用双通道语音数据，该数据固有地捕获了人类对话的结构和动态。在这项工作中，我们系统地探索了在现代大型语言模型的背景下使用双通道语音数据的使用，并引入了一种新颖的生成建模范式，Next-Token-Pair Prediction（NTPP），以启用与扬声器无关的双channel式对话对话，以使用解码器统一体系进行第一次使用。我们评估了我们对标准基准测试的方法，经验结果表明，我们提出的方法NTPP显着提高了SLM在转弯预测，响应相干性和自然性方面的对话能力。此外，与现有方法相比，NTPP的推断潜伏期大大降低，突出了其实时应用的实际效率。

Title: LEMONADE: A Large Multilingual Expert-Annotated Abstractive Event Dataset for the Real World

Authors: Sina J. Semnani, Pingyue Zhang, Wanyue Zhai, Haozhuo Li, Ryan Beauchamp, Trey Billing, Katayoun Kishi, Manling Li, Monica S. Lam
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.00980
Pdf URL: https://arxiv.org/pdf/2506.00980
Copy Paste: [[2506.00980]] LEMONADE: A Large Multilingual Expert-Annotated Abstractive Event Dataset for the Real World(https://arxiv.org/abs/2506.00980)
Keywords: language model, llm
Abstract: This paper presents LEMONADE, a large-scale conflict event dataset comprising 39,786 events across 20 languages and 171 countries, with extensive coverage of region-specific entities. LEMONADE is based on a partially reannotated subset of the Armed Conflict Location & Event Data (ACLED), which has documented global conflict events for over a decade. To address the challenge of aggregating multilingual sources for global event analysis, we introduce abstractive event extraction (AEE) and its subtask, abstractive entity linking (AEL). Unlike conventional span-based event extraction, our approach detects event arguments and entities through holistic document understanding and normalizes them across the multilingual dataset. We evaluate various large language models (LLMs) on these tasks, adapt existing zero-shot event extraction systems, and benchmark supervised models. Additionally, we introduce ZEST, a novel zero-shot retrieval-based system for AEL. Our best zero-shot system achieves an end-to-end F1 score of 58.3%, with LLMs outperforming specialized event extraction models such as GoLLIE. For entity linking, ZEST achieves an F1 score of 45.7%, significantly surpassing OneNet, a state-of-the-art zero-shot baseline that achieves only 23.7%. However, these zero-shot results lag behind the best supervised systems by 20.1% and 37.0% in the end-to-end and AEL tasks, respectively, highlighting the need for further research.
摘要：本文介绍了柠檬水，这是一个大规模冲突事件数据集，其中包括20种语言和171个国家 /地区的39,786个事件，并广泛覆盖了特定地区的实体。 Lemonade基于武装冲突位置和事件数据（ACLED）的部分重新指定子集，该子集已记录了十多年的全球冲突事件。为了解决整体事件分析的多语言资源的挑战，我们介绍了抽象事件提取（AEE）及其子任务及其子任务，抽象实体链接（AEL）。与传统的基于跨度的事件提取不同，我们的方法通过整体文档理解并在多语言数据集中对其进行归一化，从而检测事件参数和实体。我们在这些任务上评估了各种大型语言模型（LLM），调整现有的零射击事件提取系统和基准监督模型。此外，我们介绍了Zest，这是一种针对AEL的新型零射击检索系统。我们最佳的零拍系统的端到端F1得分为58.3％，LLMS优于Gollie等专业事件提取模型。对于链接实体，ZEST的F1得分为45.7％，显着超过OnEnet，这是一种最先进的零击基线，仅可获得23.7％。但是，在端到端和AEL任务中，这些零射击结果落后于最佳监督系统20.1％和37.0％，强调了进一步研究的需求。

Title: Do LLMs Understand Why We Write Diaries? A Method for Purpose Extraction and Clustering

Authors: Valeriya Goloviznina, Alexander Sergeev, Mikhail Melnichenko, Evgeny Kotelnikov
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.00985
Pdf URL: https://arxiv.org/pdf/2506.00985
Copy Paste: [[2506.00985]] Do LLMs Understand Why We Write Diaries? A Method for Purpose Extraction and Clustering(https://arxiv.org/abs/2506.00985)
Keywords: language model, gpt, llm
Abstract: Diary analysis presents challenges, particularly in extracting meaningful information from large corpora, where traditional methods often fail to deliver satisfactory results. This study introduces a novel method based on Large Language Models (LLMs) to identify and cluster the various purposes of diary writing. By "purposes," we refer to the intentions behind diary writing, such as documenting life events, self-reflection, or practicing language skills. Our approach is applied to Soviet-era diaries (1922-1929) from the Prozhito digital archive, a rich collection of personal narratives. We evaluate different proprietary and open-source LLMs, finding that GPT-4o and o1-mini achieve the best performance, while a template-based baseline is significantly less effective. Additionally, we analyze the retrieved purposes based on gender, age of the authors, and the year of writing. Furthermore, we examine the types of errors made by the models, providing a deeper understanding of their limitations and potential areas for improvement in future research.
摘要：日记分析提出了挑战，尤其是从大型语料库中提取有意义的信息时，传统方法通常无法提供令人满意的结果。这项研究介绍了一种基于大语言模型（LLM）的新方法，以识别和聚集日记写作的各种目的。通过“目的”，我们指的是日记写作背后的意图，例如记录生活事件，自我反省或练习语言技能。我们的方法适用于Prozhito Digital Archive的苏联时代日记（1922-1929），这是一个丰富的个人叙事集合。我们评估了不同的专有和开源LLM，发现GPT-4O和O1-MINI取得了最佳性能，而基于模板的基线的效率明显较小。此外，我们根据性别，作者的年龄和写作年份分析检索的目的。此外，我们研究了模型造成的错误类型，从而更深入地了解了它们的局限性和潜在的未来研究领域。

Title: Talking to Data: Designing Smart Assistants for Humanities Databases

Authors: Alexander Sergeev, Valeriya Goloviznina, Mikhail Melnichenko, Evgeny Kotelnikov
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.00986
Pdf URL: https://arxiv.org/pdf/2506.00986
Copy Paste: [[2506.00986]] Talking to Data: Designing Smart Assistants for Humanities Databases(https://arxiv.org/abs/2506.00986)
Keywords: language model, llm, chat
Abstract: Access to humanities research databases is often hindered by the limitations of traditional interaction formats, particularly in the methods of searching and response generation. This study introduces an LLM-based smart assistant designed to facilitate natural language communication with digital humanities data. The assistant, developed in a chatbot format, leverages the RAG approach and integrates state-of-the-art technologies such as hybrid search, automatic query generation, text-to-SQL filtering, semantic database search, and hyperlink insertion. To evaluate the effectiveness of the system, experiments were conducted to assess the response quality of various language models. The testing was based on the Prozhito digital archive, which contains diary entries from predominantly Russian-speaking individuals who lived in the 20th century. The chatbot is tailored to support anthropology and history researchers, as well as non-specialist users with an interest in the field, without requiring prior technical training. By enabling researchers to query complex databases with natural language, this tool aims to enhance accessibility and efficiency in humanities research. The study highlights the potential of Large Language Models to transform the way researchers and the public interact with digital archives, making them more intuitive and inclusive. Additional materials are presented in GitHub repository: this https URL.
摘要：传统互动格式的局限性通常会阻碍人文研究数据库，尤其是在搜索和响应生成的方法中。这项研究介绍了一项基于LLM的智能助手，旨在促进与数字人文数据的自然语言交流。以聊天机器人格式开发的助手利用了抹布方法并集成了最新技术，例如混合搜索，自动查询生成，文本到SQL过滤，语义数据库搜索和超链接插入。为了评估系统的有效性，进行了实验以评估各种语言模型的响应质量。该测试基于Prozhito Digital Archive，该档案包含了居住在20世纪的主要讲俄语的人的日记条目。该聊天机器人的量身定制为支持人类学和历史研究人员，以及对该领域感兴趣的非专业用户，而无需事先进行技术培训。通过使研究人员能够用自然语言查询复杂的数据库，该工具旨在提高人文研究的可及性和效率。该研究强调了大型语言模型改变研究人员和公众与数字档案的互动方式的潜力，从而使它们更加直观和包容。 GitHub存储库中介绍了其他材料：此HTTPS URL。

Title: Less is More: Local Intrinsic Dimensions of Contextual Language Models

Authors: Benjamin Matthias Ruppik, Julius von Rohrscheidt, Carel van Niekerk, Michael Heck, Renato Vukovic, Shutong Feng, Hsien-chin Lin, Nurul Lubis, Bastian Rieck, Marcus Zibrowius, Milica Gašić
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.01034
Pdf URL: https://arxiv.org/pdf/2506.01034
Copy Paste: [[2506.01034]] Less is More: Local Intrinsic Dimensions of Contextual Language Models(https://arxiv.org/abs/2506.01034)
Keywords: language model, llm
Abstract: Understanding the internal mechanisms of large language models (LLMs) remains a challenging and complex endeavor. Even fundamental questions, such as how fine-tuning affects model behavior, often require extensive empirical evaluation. In this paper, we introduce a novel perspective based on the geometric properties of contextual latent embeddings to study the effects of training and fine-tuning. To that end, we measure the local dimensions of a contextual language model's latent space and analyze their shifts during training and fine-tuning. We show that the local dimensions provide insights into the model's training dynamics and generalization ability. Specifically, the mean of the local dimensions predicts when the model's training capabilities are exhausted, as exemplified in a dialogue state tracking task, overfitting, as demonstrated in an emotion recognition task, and grokking, as illustrated with an arithmetic task. Furthermore, our experiments suggest a practical heuristic: reductions in the mean local dimension tend to accompany and predict subsequent performance gains. Through this exploration, we aim to provide practitioners with a deeper understanding of the implications of fine-tuning on embedding spaces, facilitating informed decisions when configuring models for specific applications. The results of this work contribute to the ongoing discourse on the interpretability, adaptability, and generalizability of LLMs by bridging the gap between intrinsic model mechanisms and geometric properties in the respective embeddings.
摘要：了解大语言模型（LLM）的内部机制仍然是一项具有挑战性且复杂的努力。即使是基本问题，例如微调如何影响模型行为，也通常需要广泛的经验评估。在本文中，我们介绍了一种基于上下文潜在嵌入的几何特性来研究训练和微调的影响。为此，我们衡量上下文语言模型潜在空间的局部维度，并在培训和微调过程中分析其转变。我们表明，局部维度为模型的训练动力和概括能力提供了见解。具体而言，本地维度的平均值预测了何时耗尽模型的训练能力，这在对话状态跟踪任务中的例证，过于拟合，如情感识别任务中所证明的那样，用算术任务说明了。此外，我们的实验提出了一种实用的启发式：平均局部维度的减少倾向于伴随并预测随后的性能增长。通过此探索，我们旨在为从业者提供对微调对嵌入空间的含义的更深入的了解，从而在为特定应用程序配置模型时促进明智的决策。这项工作的结果有助于通过弥合各个嵌入的固有模型机制和几何特性之间的差距，从而持续讨论LLM的可解释性，适应性和概括性。

Title: Probing Neural Topology of Large Language Models

Authors: Yu Zheng, Yuan Yuan, Yong Li, Paolo Santi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.01042
Pdf URL: https://arxiv.org/pdf/2506.01042
Copy Paste: [[2506.01042]] Probing Neural Topology of Large Language Models(https://arxiv.org/abs/2506.01042)
Keywords: language model, llm
Abstract: Probing large language models (LLMs) has yielded valuable insights into their internal mechanisms by linking neural representations to interpretable semantics. However, how neurons functionally co-activate with each other to give rise to emergent capabilities remains largely unknown, hindering a deeper understanding and safer development of LLMs. In this work, we introduce graph probing, a method for uncovering the functional connectivity topology of LLM neurons and relating it to language generation performance. By analyzing internal neural graphs across diverse LLM families and scales, we discover a universal predictability of next-token prediction performance using only neural topology. This predictability is robust even when retaining just 1% of neuron connections or probing models after only 8 pretraining steps, highlighting the sparsity and early emergence of topological patterns. Further graph matching analysis suggests that, despite significant distinctions in architectures, parameters, and training data, different LLMs develop intricate and consistent neural topological structures that may form the foundation for their language generation abilities. Codes and data for the graph probing toolbox are released at this https URL.
摘要：探测大型语言模型（LLM）通过将神经表现形成与可解释的语义联系起来，从而对其内部机制产生了宝贵的见解。但是，神经元在功能上如何共同活化以引起新兴能力，这在很大程度上是未知的，阻碍了对LLM的更深入的理解和更安全的发展。在这项工作中，我们介绍了图形探测，这是一种发现LLM神经元功能连接拓扑并将其与语言生成性能相关联的方法。通过分析不同LLM家族和量表之间的内部神经图，我们发现仅使用神经拓扑的下一个预测性能的普遍可预测性。即使仅在仅保留了8个预处理的步骤之后，即使仅保留1％的神经元连接或探测模型，这种可预测性也是可靠的，从而突出了拓扑模式的稀疏性和早期出现。进一步的图形匹配分析表明，尽管架构，参数和培训数据有显着区别，但不同的LLM会发展出复杂且一致的神经拓扑结构，这可能构成其语言生成能力的基础。图形探测工具箱的代码和数据将在此HTTPS URL上发布。

Title: CHEER-Ekman: Fine-grained Embodied Emotion Classification

Authors: Phan Anh Duong, Cat Luong, Divyesh Bommana, Tianyu Jiang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.01047
Pdf URL: https://arxiv.org/pdf/2506.01047
Copy Paste: [[2506.01047]] CHEER-Ekman: Fine-grained Embodied Emotion Classification(https://arxiv.org/abs/2506.01047)
Keywords: language model, prompt, chain-of-thought
Abstract: Emotions manifest through physical experiences and bodily reactions, yet identifying such embodied emotions in text remains understudied. We present an embodied emotion classification dataset, CHEER-Ekman, extending the existing binary embodied emotion dataset with Ekman's six basic emotion categories. Using automatic best-worst scaling with large language models, we achieve performance superior to supervised approaches on our new dataset. Our investigation reveals that simplified prompting instructions and chain-of-thought reasoning significantly improve emotion recognition accuracy, enabling smaller models to achieve competitive performance with larger ones.
摘要：情绪通过身体经验和身体反应表现出来，但是在文本中识别出这种体现的情绪仍在研究中。我们提出了一个体现的情感分类数据集Cheer-Ekman，它使用Ekman的六个基本情感类别扩展了现有的二进制体现的情感数据集。使用大型语言模型使用自动最佳缩放缩放，我们在新数据集中实现了优于监督方法的性能。我们的调查表明，简化的提示指示和经过思考的推理可以显着提高情绪识别的准确性，从而使较小的模型能够通过较大的模型实现竞争性能。
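
The abstract above does not spell out how best-worst scaling is scored, so the following is only a minimal sketch of the standard counting score, assuming each judgment names one "best" and one "worst" item from a small tuple. The function name `bws_scores` and the toy judgments are hypothetical and are not taken from the paper.

```python
from collections import Counter


def bws_scores(judgments):
    """Counting score for best-worst scaling.

    Each judgment is (items, best, worst): the tuple shown to the annotator
    (here, hypothetically, an LLM), the item picked as best, and the item
    picked as worst. Score = (#best - #worst) / #appearances, in [-1, 1].
    """
    best, worst, seen = Counter(), Counter(), Counter()
    for items, chosen_best, chosen_worst in judgments:
        seen.update(items)          # every item in the tuple was shown once
        best[chosen_best] += 1
        worst[chosen_worst] += 1
    return {item: (best[item] - worst[item]) / seen[item] for item in seen}


# Hypothetical judgments over five candidate items.
example_judgments = [
    (("a", "b", "c", "d"), "a", "d"),
    (("a", "b", "c", "e"), "a", "c"),
    (("b", "c", "d", "e"), "b", "d"),
]
example_scores = bws_scores(example_judgments)
```

Items that are always chosen as best score 1.0, items always chosen as worst score -1.0, and items never singled out score 0.0, which gives a ranking without asking for absolute ratings.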

Title: SealQA: Raising the Bar for Reasoning in Search-Augmented Language Models

Authors: Thinh Pham, Nguyen Nguyen, Pratibha Zunjare, Weiyuan Chen, Yu-Min Tseng, Tu Vu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.01062
Pdf URL: https://arxiv.org/pdf/2506.01062
Copy Paste: [[2506.01062]] SealQA: Raising the Bar for Reasoning in Search-Augmented Language Models(https://arxiv.org/abs/2506.01062)
Keywords: language model, gpt, llm, chat, agent
Abstract: We introduce SealQA, a new challenge benchmark for evaluating SEarch-Augmented Language models on fact-seeking questions where web search yields conflicting, noisy, or unhelpful results. SealQA comes in three flavors: (1) Seal-0 (main) and (2) Seal-Hard, which assess factual accuracy and reasoning capabilities, with Seal-0 focusing on the most challenging questions where chat models (e.g., GPT-4.1) typically achieve near-zero accuracy; and (3) LongSeal, which extends SealQA to test long-context, multi-document reasoning in "needle-in-a-haystack" settings. Our evaluation reveals critical limitations in current models: Even frontier LLMs perform poorly across all SealQA flavors. On Seal-0, frontier agentic models equipped with tools like o3 and o4-mini achieve only 17.1% and 6.3% accuracy, respectively, at their best reasoning efforts. We find that advanced reasoning models such as DeepSeek-R1-671B and o3-mini are highly vulnerable to noisy search results. Notably, increasing test-time compute does not yield reliable gains across o3-mini, o4-mini, and o3, with performance often plateauing or even declining early. Additionally, while recent models are less affected by the "lost-in-the-middle" issue, they still fail to reliably identify relevant documents in LongSeal when faced with numerous distractors. To facilitate future work, we release SealQA at this http URL.
摘要：我们介绍了SEALQA，这是一种新的挑战基准，用于评估搜索声明的语言模型，以寻求事实的问题，其中Web搜索产生冲突，嘈杂或无益的结果。 SEALQA具有三种口味：（1）SEAL-0（MAIN）和（2）SEAL-HARD评估事实的准确性和推理能力，而SEAL-0集中在最具挑战性的问题上，其中聊天模型（例如GPT-4.1）通常实现接近零的精度；（3）longseal，它扩展了SEALQA，以测试“针中的针刺”设置中的长篇小说，多文件推理。我们的评估揭示了当前模型中的临界局限性：即使在所有Sealqa口味中，Frontier LLM都表现较差。在SEAL-0上，配备了O3和O4-Mini等工具的Frontier Agesic模型在其最佳推理方面分别仅达到17.1％和6.3％的精度。我们发现，诸如DeepSeek-R1-671b和O3-Mini之类的先进推理模型非常容易受到嘈杂的搜索结果。值得注意的是，增加的测试时间计算不会在O3-Mini，O4-Mini和O3中带来可靠的增长，并且性能通常会稳定，甚至早日下降。此外，虽然最近的模型受到“中间损失”问题的影响较小，但在面对众多干扰器时，它们仍然无法可靠地识别长期内的相关文档。为了促进未来的工作，我们在此HTTP URL上释放Sealqa。

Title: How Programming Concepts and Neurons Are Shared in Code Language Models

Authors: Amir Hossein Kargaran, Yihong Liu, François Yvon, Hinrich Schütze
Subjects: cs.CL, cs.PL, cs.SE
Abstract URL: https://arxiv.org/abs/2506.01074
Pdf URL: https://arxiv.org/pdf/2506.01074
Copy Paste: [[2506.01074]] How Programming Concepts and Neurons Are Shared in Code Language Models(https://arxiv.org/abs/2506.01074)
Keywords: language model, llm
Abstract: Several studies have explored the mechanisms of large language models (LLMs) in coding tasks, but most have focused on programming languages (PLs) in a monolingual setting. In this paper, we investigate the relationship between multiple PLs and English in the concept space of LLMs. We perform a few-shot translation task on 21 PL pairs using two Llama-based models. By decoding the embeddings of intermediate layers during this task, we observe that the concept space is closer to English (including PL keywords) and assigns high probabilities to English tokens in the second half of the intermediate layers. We analyze neuron activations for 11 PLs and English, finding that while language-specific neurons are primarily concentrated in the bottom layers, those exclusive to each PL tend to appear in the top layers. For PLs that are highly aligned with multiple other PLs, identifying language-specific neurons is not feasible. These PLs also tend to have a larger keyword set than other PLs and are closer to the model's concept space regardless of the input/output PL in the translation task. Our findings provide insights into how LLMs internally represent PLs, revealing structural patterns in the model's concept space. Code is available at this https URL.
摘要：几项研究探索了编码任务中大语言模型（LLM）的机制，但大多数人都专注于单语环境中的编程语言（PLS）。在本文中，我们研究了LLM概念空间中多个PL和英语之间的关系。我们使用两种基于Llama的模型在21个PLAINS上执行了几次转换任务。通过在此任务期间解码中间层的嵌入，我们观察到概念空间更接近英语（包括PL关键字），并在中间层的下半部分配了高概率。我们分析了11个PL和英语的神经元激活，发现特定于语言的神经元主要集中在底层中，但每个PL独有的神经元倾向于出现在顶层。对于与其他多个PL相一致的PL，识别特定语言的神经元是不可行的。这些PLS还倾向于比其他PLS具有更大的关键字集，并且无论翻译任务中的输入/输出PL如何，都更靠近模型的概念空间。我们的发现提供了有关LLM在内部如何代表PL的见解，从而揭示了模型概念空间中的结构模式。代码可在此HTTPS URL上找到。

Title: zip2zip: Inference-Time Adaptive Vocabularies for Language Models via Token Compression

Authors: Saibo Geng, Nathan Ranchin, Yunzhen yao, Maxime Peyrard, Chris Wendler, Michael Gastpar, Robert West
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.01084
Pdf URL: https://arxiv.org/pdf/2506.01084
Copy Paste: [[2506.01084]] zip2zip: Inference-Time Adaptive Vocabularies for Language Models via Token Compression(https://arxiv.org/abs/2506.01084)
Keywords: language model, llm
Abstract: Tokenization efficiency plays a critical role in the performance and cost of large language models (LLMs), yet most models rely on static tokenizers optimized for general-purpose corpora. These tokenizers' fixed vocabularies often fail to adapt to domain- or language-specific inputs, leading to longer token sequences and higher computational costs. We introduce zip2zip, a framework that enables LLMs to dynamically adjust token vocabulary at inference time, allowing for fewer generated tokens and thus faster inference. zip2zip consists of three key components: (1) a tokenizer based on Lempel-Ziv-Welch (LZW) compression that incrementally compresses tokens into reusable "hypertokens" on the fly; (2) an embedding layer that computes embeddings for newly formed hypertokens at runtime; and (3) a causal language modeling variant that trains the model to operate on hypertokenized, compressed sequences. We show that an existing LLM can be zip2zip-fied in 10 GPU-hours via parameter-efficient finetuning. The resulting zip2zip LLMs effectively learn to use hypertokens at inference time, reducing input and output sequence length by 20-60%, with significant improvements in inference latency.
摘要：令牌化效率在大型语言模型（LLMS）的性能和成本中起着至关重要的作用，但是大多数模型都依赖于针对通用语料库进行优化的静态引物。这些令牌的固定词汇通常无法适应域或语言特定的输入，从而导致令牌序列更长和更高的计算成本。我们介绍了ZIP2ZIP，该框架使LLMS能够在推理时间动态调整令牌词汇，从而使生成的令牌更少，从而更快地推断了推理。 Zip2ZIP由三个关键组成部分组成：（1）基于Lempel-Ziv-Welch（LZW）压缩的令牌，该加压器会逐步压缩令牌，以fly速度重复使用“ hypertokens”；（2）在运行时计算新形成的高血压的嵌入层；（3）一种因果语言建模变体，该变体训练模型以在高含量的压缩序列上运行。我们表明，现有的LLM可以通过参数有效的芬太尼在10个GPU小时内进行ZIP2ZIP。所得的ZIP2ZIP LLM有效地学会在推理时使用高血压，将输入和输出序列长度降低20-60 \％，并显着改善了推理潜伏期。
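
To make the first component concrete, here is a rough sketch of LZW-style merging over token ids. It is not the zip2zip implementation: the function names, the `max_merge_len` cap, and the choice to decode with the finished codebook (rather than rebuilding it on the fly as a full LZW decoder would) are all assumptions made for illustration.

```python
def lzw_hypertokens(token_ids, base_vocab_size, max_merge_len=3):
    """Compress a token-id sequence with an LZW-style codebook.

    New "hypertoken" ids start at base_vocab_size, so they never collide
    with ids from the base vocabulary. max_merge_len (an assumed cap)
    bounds how many base tokens one hypertoken may cover.
    """
    codebook = {}               # tuple of base ids -> hypertoken id
    next_id = base_vocab_size
    output = []
    current = ()
    for tok in token_ids:
        candidate = current + (tok,)
        if len(candidate) == 1 or candidate in codebook:
            current = candidate     # keep extending a known phrase
            continue
        # Emit the longest known phrase, then register the new, longer one.
        output.append(current[0] if len(current) == 1 else codebook[current])
        if len(candidate) <= max_merge_len:
            codebook[candidate] = next_id
            next_id += 1
        current = (tok,)
    if current:
        output.append(current[0] if len(current) == 1 else codebook[current])
    return output, codebook


def expand_hypertokens(compressed, codebook):
    """Map hypertoken ids back to base token ids using the final codebook."""
    inverse = {hyper_id: phrase for phrase, hyper_id in codebook.items()}
    expanded = []
    for tok in compressed:
        expanded.extend(inverse.get(tok, (tok,)))
    return expanded


tokens = [1, 2, 1, 2, 1, 2]
compressed, book = lzw_hypertokens(tokens, base_vocab_size=10)
restored = expand_hypertokens(compressed, book)
```

On this toy input the six base tokens shrink to four ids, because the repeated pair (1, 2) is emitted as a single hypertoken once it has been seen. Repetitive inputs benefit most, which is consistent with the abstract's point about domain- or language-specific text.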

Title: Un-considering Contextual Information: Assessing LLMs' Understanding of Indexical Elements

Authors: Metehan Oguz, Yavuz Bakman, Duygu Nur Yaldiz
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.01089
Pdf URL: https://arxiv.org/pdf/2506.01089
Copy Paste: [[2506.01089]] Un-considering Contextual Information: Assessing LLMs' Understanding of Indexical Elements(https://arxiv.org/abs/2506.01089)
Keywords: language model, gpt, llm
Abstract: Large Language Models (LLMs) have demonstrated impressive performances in tasks related to coreference resolution. However, previous studies mostly assessed LLM performance on coreference resolution with nouns and third person pronouns. This study evaluates LLM performance on coreference resolution with indexicals such as I, you, here, and tomorrow, which come with unique challenges due to their linguistic properties. We present the first study examining how LLMs interpret indexicals in English, releasing the English Indexical Dataset with 1600 multiple-choice questions. We evaluate pioneering LLMs, including GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro, and DeepSeek V3. Our results reveal that LLMs exhibit an impressive performance with some indexicals (I), while struggling with others (you, here, tomorrow), and that syntactic cues (e.g. quotation) contribute to LLM performance with some indexicals, while they reduce performance with others. Code and data are available at: this https URL.
摘要：大型语言模型（LLMS）在与核心分辨率有关的任务中表现出了令人印象深刻的表现。但是，先前的研究主要评估了使用名词和第三人称代词的核心分辨率的LLM性能。这项研究评估了LLM在Coreference解决方案方面的性能，例如I，您，明天和明天，由于其语言特性而带来了独特的挑战。我们介绍了第一个研究LLM如何解释英语指数的研究，并使用1600个多项选择问题释放英语索引数据集。我们评估了开创性的LLM，包括GPT-4O，Claude 3.5十四行诗，Gemini 1.5 Pro和DeepSeek V3。我们的结果表明，LLM在某些指数（i）中表现出令人印象深刻的表现，同时与他人（您，明天，明天）挣扎，而句法提示（例如，引号）在某些索引方面有助于LLM的性能，而它们降低了与他人的绩效。代码和数据可在以下网址提供：此HTTPS URL。

Title: Contextual Candor: Enhancing LLM Trustworthiness Through Hierarchical Unanswerability Detection

Authors: Steven Robinson, Antonio Carlos Rivera
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.01104
Pdf URL: https://arxiv.org/pdf/2506.01104
Copy Paste: [[2506.01104]] Contextual Candor: Enhancing LLM Trustworthiness Through Hierarchical Unanswerability Detection(https://arxiv.org/abs/2506.01104)
Keywords: language model, llm, prompt
Abstract: The pervasive deployment of large language models (LLMs) in conversational AI systems has revolutionized information access, yet their propensity for generating factually unsupported or hallucinated responses remains a critical impediment to trustworthiness and widespread adoption. This paper introduces Reinforced Unanswerability Learning (RUL), a novel hybrid training paradigm designed to imbue LLMs with the intrinsic capability to accurately detect unanswerable questions and generate reliably appropriate responses. Unlike conventional approaches that rely on external classifiers or simple prompting, RUL integrates a discriminative unanswerability prediction head with the LLM's generative core, guided by a multi-stage learning strategy. This includes supervised fine-tuning on a novel, richly annotated dataset, Enhanced-CAsT-Answerability (ECA), which features hierarchical answerability labels and ground-truth refusal responses. Crucially, RUL incorporates a subsequent reinforcement learning with human feedback (RLHF) phase to refine the nuance, helpfulness, and informativeness of refusal responses. Extensive experiments demonstrate RUL's superior performance, achieving significantly higher accuracy in unanswerability detection across sentence, paragraph, and ranking levels, and substantially increasing the generation of appropriate refusals for unanswerable queries, alongside strong performance on answerable questions. Human evaluations further corroborate RUL's effectiveness, highlighting a marked improvement in perceived helpfulness and trustworthiness, ultimately paving the way for more reliable and user-centric conversational AI.
摘要：大型语言模型（LLM）在对话式AI系统中的普遍部署彻底改变了信息访问，但它们产生事实上不支持或幻觉的响应的倾向仍然是对可信赖性和广泛采用的关键障碍。本文介绍了可增强的无法选择性学习（RUL），这是一种新型的混合训练范式，旨在使LLM具有固有能力，以准确检测到无法回答的问题并产生可靠的适当回答。与依靠外部分类器或简单提示的传统方法不同，Rul将歧视性的无法选择性预测头与LLM的生成核心集成在一起，并在多阶段学习策略的指导下。这包括对小说，丰富的注释数据集的监督微调，增强的播种性（ECA），该数据具有层次的答复性标签和地面真相拒绝响应。至关重要的是，RUL将随后的增强学习与人类反馈（RLHF）阶段结合在一起，以完善拒绝反应的细微差别，帮助和信息性。广泛的实验证明了Rul的出色表现，在句子，段落和排名水平上无法选择性检测方面的准确性明显更高，并大大提高了对无法回答的查询的适当拒绝产生，并且在可回答问题上表现出色。人类的评估进一步证实了Rul的有效性，强调了感知到的帮助和信任度的明显改善，最终为更可靠和以用户为中心的对话AI铺平了道路。

Title: From Words to Waves: Analyzing Concept Formation in Speech and Text-Based Foundation Models

Authors: Asım Ersoy, Basel Mousi, Shammur Chowdhury, Firoj Alam, Fahim Dalvi, Nadir Durrani
Subjects: cs.CL, cs.AI, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2506.01133
Pdf URL: https://arxiv.org/pdf/2506.01133
Copy Paste: [[2506.01133]] From Words to Waves: Analyzing Concept Formation in Speech and Text-Based Foundation Models(https://arxiv.org/abs/2506.01133)
Keywords: language model, llm
Abstract: The emergence of large language models (LLMs) has demonstrated that systems trained solely on text can acquire extensive world knowledge, develop reasoning capabilities, and internalize abstract semantic concepts--showcasing properties that can be associated with general intelligence. This raises an intriguing question: Do such concepts emerge in models trained on other modalities, such as speech? Furthermore, when models are trained jointly on multiple modalities: Do they develop a richer, more structured semantic understanding? To explore this, we analyze the conceptual structures learned by speech and textual models both individually and jointly. We employ Latent Concept Analysis, an unsupervised method for uncovering and interpreting latent representations in neural networks, to examine how semantic abstractions form across modalities. For reproducibility we made scripts and other resources available to the community.
摘要：大型语言模型（LLM）的出现表明，仅在文本上训练的系统可以获取广泛的世界知识，发展推理能力并内化抽象的语义概念 - 可以与一般智能相关的展示属性。这就提出了一个有趣的问题：以其他方式培训的模型（例如语音）是否出现了这样的概念？此外，当模型以多种方式共同培训时：它们是否发展出更丰富，更具结构化的语义理解？为了探讨这一点，我们分析了单独和共同的语音和文本模型所学的概念结构。我们采用潜在概念分析，这是一种无监督的方法，用于揭示和解释神经网络中的潜在表示，以检查跨模态的语义抽象如何形成。为了获得可重复性，我们为社区提供了脚本和其他资源。

Title: A Word is Worth 4-bit: Efficient Log Parsing with Binary Coded Decimal Recognition

Authors: Prerak Srivastava, Giulio Corallo, Sergey Rybalko
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.01147
Pdf URL: https://arxiv.org/pdf/2506.01147
Copy Paste: [[2506.01147]] A Word is Worth 4-bit: Efficient Log Parsing with Binary Coded Decimal Recognition(https://arxiv.org/abs/2506.01147)
Keywords: llm
Abstract: System-generated logs are typically converted into categorical log templates through parsing. These templates are crucial for generating actionable insights in various downstream tasks. However, existing parsers often fail to capture fine-grained template details, leading to suboptimal accuracy and reduced utility in downstream tasks requiring precise pattern identification. We propose a character-level log parser utilizing a novel neural architecture that aggregates character embeddings. Our approach estimates a sequence of binary-coded decimals to achieve highly granular log template extraction. Our low-resource character-level parser, tested on revised Loghub-2k and a manually annotated industrial dataset, matches LLM-based parsers in accuracy while outperforming semantic parsers in efficiency.
摘要：系统生成的日志通常通过解析转换为分类日志模板。这些模板对于在各种下游任务中生成可行的见解至关重要。但是，现有的解析器通常无法捕获细粒的模板细节，从而导致次优准则，并降低了需要精确模式识别的下游任务中的效用。我们使用一种汇总角色嵌入的新型神经结构提出了一个角色级对数解析器。我们的方法估计了一系列二元编码的小数，以实现高度颗粒状的对数模板提取。我们的低资源角色级解析器对经过修订的LogHub-2K和手动注释的工业数据集进行了测试，与基于LLM的精确度相匹配，同时在效率方面表现优于语义解析器。

Title: The Inverse Scaling Effect of Pre-Trained Language Model Surprisal Is Not Due to Data Leakage

Authors: Byung-Doh Oh, Hongao Zhu, William Schuler
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.01172
Pdf URL: https://arxiv.org/pdf/2506.01172
Copy Paste: [[2506.01172]] The Inverse Scaling Effect of Pre-Trained Language Model Surprisal Is Not Due to Data Leakage(https://arxiv.org/abs/2506.01172)
Keywords: language model
Abstract: In psycholinguistic modeling, surprisal from larger pre-trained language models has been shown to be a poorer predictor of naturalistic human reading times. However, it has been speculated that this may be due to data leakage that caused language models to see the text stimuli during training. This paper presents two studies to address this concern at scale. The first study reveals relatively little leakage of five naturalistic reading time corpora in two pre-training datasets in terms of length and frequency of token $n$-gram overlap. The second study replicates the negative relationship between language model size and the fit of surprisal to reading times using models trained on 'leakage-free' data that overlaps only minimally with the reading time corpora. Taken together, this suggests that previous results using language models trained on these corpora are not driven by the effects of data leakage.
摘要：在心理语言建模中，大型预训练的语言模型的惊奇已被证明是自然主义人类阅读时间的较差预测指标。但是，已经推测这可能是由于数据泄漏引起的，这导致语言模型在训练过程中看到文本刺激。本文介绍了两项研究，以大规模解决这一问题。第一项研究表明，就令牌$ n $ gram重叠的长度和频率而言，五个自然主义阅读时间语料库的泄漏相对较少。第二项研究复制了语言模型大小和惊奇与阅读时间的拟合度之间的负相关关系，该模型使用对“无泄漏”数据训练的模型，这些模型仅与阅读时间内部的情况大致重叠。综上所述，这表明先前使用在这些语料库中训练的语言模型的结果并非受数据泄漏的影响驱动。
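
The leakage check described in the abstract above (length and frequency of token n-gram overlap between reading-time stimuli and pre-training text) can be illustrated with a minimal sketch. The function names, the whitespace tokenization, and the toy sentences are assumptions for illustration only; they are not the authors' scripts or data.

```python
from typing import List, Set, Tuple


def ngrams(tokens: List[str], n: int) -> Set[Tuple[str, ...]]:
    """Return the set of distinct token n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(stimulus: List[str], corpus: List[str], n: int) -> float:
    """Fraction of the stimulus's distinct n-grams that also occur in the corpus."""
    stim = ngrams(stimulus, n)
    if not stim:
        return 0.0
    return len(stim & ngrams(corpus, n)) / len(stim)


def longest_shared_ngram(stimulus: List[str], corpus: List[str]) -> int:
    """Length of the longest token n-gram shared by stimulus and corpus."""
    n = 0
    # If an (n+1)-gram is shared, every shorter n-gram inside it is shared too,
    # so we can grow n until the overlap disappears.
    while n < len(stimulus) and overlap_fraction(stimulus, corpus, n + 1) > 0:
        n += 1
    return n


# Hypothetical toy example (not taken from any reading-time corpus).
stimulus = "the cat sat on the mat".split()
corpus = "a cat sat on a chair".split()
shared_len = longest_shared_ngram(stimulus, corpus)
bigram_overlap = overlap_fraction(stimulus, corpus, 2)
```

A real analysis would use the language model's own tokenizer and an indexed corpus rather than in-memory sets; the sketch only shows what "length and frequency of overlap" could mean operationally.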

Title: LAQuer: Localized Attribution Queries in Content-grounded Generation

Authors: Eran Hirsch, Aviv Slobodkin, David Wan, Elias Stengel-Eskin, Mohit Bansal, Ido Dagan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.01187
Pdf URL: https://arxiv.org/pdf/2506.01187
Copy Paste: [[2506.01187]] LAQuer: Localized Attribution Queries in Content-grounded Generation(https://arxiv.org/abs/2506.01187)
Keywords: language model, llm, prompt
Abstract: Grounded text generation models often produce content that deviates from their source material, requiring user verification to ensure accuracy. Existing attribution methods associate entire sentences with source documents, which can be overwhelming for users seeking to fact-check specific claims. In contrast, existing sub-sentence attribution methods may be more precise but fail to align with users' interests. In light of these limitations, we introduce Localized Attribution Queries (LAQuer), a new task that localizes selected spans of generated output to their corresponding source spans, allowing fine-grained and user-directed attribution. We compare two approaches for the LAQuer task, including prompting large language models (LLMs) and leveraging LLM internal representations. We then explore a modeling framework that extends existing attributed text generation methods to LAQuer. We evaluate this framework across two grounded text generation tasks: Multi-document Summarization (MDS) and Long-form Question Answering (LFQA). Our findings show that LAQuer methods significantly reduce the length of the attributed text. Our contributions include: (1) proposing the LAQuer task to enhance attribution usability, (2) suggesting a modeling framework and benchmarking multiple baselines, and (3) proposing a new evaluation setting to promote future research on localized attribution in content-grounded generation.
摘要：接地的文本生成模型通常会产生偏离其原始材料的内容，需要用户验证以确保准确性。现有归因方法将整个句子与源文档相关联，对于寻求事实检查特定索赔的用户来说，这可能是压倒性的。相反，现有的子句归因方法可能更精确，但与用户的兴趣不符。鉴于这些限制，我们介绍了本地化归因查询（LACER），这是一个新任务，将所选生成的输出的选定跨度定位于其相应的源跨度，从而允许细粒度和用户指导的归属。我们比较了LAQUER任务的两种方法，包括提示大型语言模型（LLM）和利用LLM内部表示。然后，我们探索一个建模框架，该框架将现有的属性文本生成方法扩展到了Laquer。我们在两个基础文本生成任务中评估了此框架：多文件摘要（MDS）和长形式的答案（LFQA）。我们的发现表明，LAQUER方法大大降低了属性文本的长度。我们的贡献包括：（1）提议提高归因可用性的“ LAQUER”任务，（2）建议建模框架和基准测试多个基线，以及（3）提出新的评估设置，以促进有关内容构成的生成中本地化归因的未来研究。

Title: Culturally-Grounded Chain-of-Thought (CG-CoT): Enhancing LLM Performance on Culturally-Specific Tasks in Low-Resource Languages

Authors: Madhavendra Thakur
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.01190
Pdf URL: https://arxiv.org/pdf/2506.01190
Copy Paste: [[2506.01190]] Culturally-Grounded Chain-of-Thought (CG-CoT): Enhancing LLM Performance on Culturally-Specific Tasks in Low-Resource Languages(https://arxiv.org/abs/2506.01190)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Large Language Models (LLMs) struggle with culturally-specific reasoning tasks, particularly in low-resource languages, hindering their global applicability. Addressing this gap is crucial for equitable AI deployment. We introduce Culturally-Grounded Chain-of-Thought (CG-CoT), a novel prompting strategy that combines dense vector retrieval of cultural context with explicit reasoning sequences. Our extensive experiments on Yoruba proverb interpretation demonstrate that CG-CoT provides significantly higher culturally-aligned accuracy and depth than traditional prompting methods, validated through both automated metrics and LLM-based evaluations. Notably, we uncover stark disparities between token-level translation metrics like BLEU and human-judged cultural relevance, suggesting a rethinking of evaluation approaches for low-resource NLP.
摘要：大型语言模型（LLMS）与特定于文化的推理任务（尤其是在低资源语言中）的斗争，阻碍了其全球适用性。解决此差距对于公平的AI部署至关重要。我们介绍了具有文化基础的思想链（CG-COT），这是一种新颖的提示策略，将文化背景的密集矢量检索与明确的推理序列相结合。我们对约鲁巴谚语解释的广泛实验表明，与传统的提示方法相比，CG-COT提供的具有自动指标和基于LLM的评估的传统提示方法的文化准确性和深度明显更高。值得注意的是，我们发现了诸如BLEU之类的令牌翻译指标与人为判断的文化相关性之间的鲜明差异，这表明对低资源NLP的评估方法进行了重新思考。
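
The combination the abstract above describes, dense vector retrieval of cultural context followed by an explicit reasoning prompt, might look roughly like the sketch below. Everything here is an assumption for illustration: the two-dimensional vectors stand in for real sentence embeddings, and `retrieve`, `build_prompt`, and the prompt wording are hypothetical, not the paper's implementation.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec: Sequence[float],
             store: Dict[str, Sequence[float]],
             k: int = 2) -> List[str]:
    """Return the k context notes whose vectors are most similar to the query."""
    ranked = sorted(store, key=lambda note: cosine(query_vec, store[note]),
                    reverse=True)
    return ranked[:k]


def build_prompt(proverb: str, context_notes: List[str]) -> str:
    """Assemble a prompt: retrieved context first, then a step-by-step request."""
    context = "\n".join(f"- {note}" for note in context_notes)
    return (
        f"Cultural context:\n{context}\n\n"
        f"Proverb: {proverb}\n"
        "Explain the proverb step by step, then state its meaning."
    )


# Hypothetical toy store: note text -> placeholder embedding.
store = {
    "note A": [1.0, 0.0],
    "note B": [0.0, 1.0],
    "note C": [0.7, 0.7],
}
top_notes = retrieve([1.0, 0.1], store, k=2)
prompt = build_prompt("<proverb text>", top_notes)
```

In practice the embeddings would come from a multilingual encoder and the store from a curated cultural knowledge base; the sketch only shows how retrieval output feeds the reasoning prompt.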

Title: CoBRA: Quantifying Strategic Language Use and LLM Pragmatics

Authors: Anshun Asher Zheng, Junyi Jessy Li, David I. Beaver
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.01195
Pdf URL: https://arxiv.org/pdf/2506.01195
Copy Paste: [[2506.01195]] CoBRA: Quantifying Strategic Language Use and LLM Pragmatics(https://arxiv.org/abs/2506.01195)
Keywords: llm
Abstract: Language is often used strategically, particularly in high-stakes, adversarial settings, yet most work on pragmatics and LLMs centers on cooperativity. This leaves a gap in systematic understanding of non-cooperative discourse. To address this, we introduce CoBRA (Cooperation-Breach Response Assessment), along with three interpretable metrics -- Benefit at Turn (BaT), Penalty at Turn (PaT), and Normalized Relative Benefit at Turn (NRBaT) -- to quantify the perceived strategic effects of discourse moves. We also present CHARM, an annotated dataset of real courtroom cross-examinations, to demonstrate the framework's effectiveness. Using these tools, we evaluate a range of LLMs and show that LLMs generally exhibit limited pragmatic understanding of strategic language. While performance on our metrics increases with model size, reasoning ability does not help and largely hurts, introducing overcomplication and internal confusion.
摘要：语言通常在战略上是在战略上使用的，尤其是在高风险，对抗性环境中，但大多数在实用主义和LLMS方面的工作都以合作性为中心。这留下了对非合作性话语的系统理解的差距。为了解决这个问题，我们介绍了眼镜蛇（合作 - 突破响应评估），以及三个可解释的指标 - 转弯时的益处（蝙蝠），转弯处罚款（PAT）和转弯时的归一化相对益处（NRBAT） - 以量化话语移动的战略效应。我们还提出了魅力，这是一个真正的法庭盘问数据集，以证明该框架的有效性。使用这些工具，我们评估了一系列LLM，并表明LLM通常对战略语言表现出有限的务实理解。尽管模型大小显示出我们指标的性能的提高，但推理能力并没有帮助，并且在很大程度上受到伤害，引入了过度复杂性和内部混乱。

Title: Incorporating Hierarchical Semantics in Sparse Autoencoder Architectures

Authors: Mark Muchane, Sean Richardson, Kiho Park, Victor Veitch
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.01197
Pdf URL: https://arxiv.org/pdf/2506.01197
Copy Paste: [[2506.01197]] Incorporating Hierarchical Semantics in Sparse Autoencoder Architectures(https://arxiv.org/abs/2506.01197)
Keywords: language model
Abstract: Sparse dictionary learning (and, in particular, sparse autoencoders) attempts to learn a set of human-understandable concepts that can explain variation on an abstract space. A basic limitation of this approach is that it neither exploits nor represents the semantic relationships between the learned concepts. In this paper, we introduce a modified SAE architecture that explicitly models a semantic hierarchy of concepts. Application of this architecture to the internal representations of large language models shows both that semantic hierarchy can be learned, and that doing so improves both reconstruction and interpretability. Additionally, the architecture leads to significant improvements in computational efficiency.
摘要：稀疏的词典学习（尤其是稀疏的自动编码器）试图学习一组可以解释抽象空间上变化的人类理解概念。这种方法的一个基本限制是，它既不利用也不代表学习概念之间的语义关系。在本文中，我们介绍了一种修改后的SAE架构，该体系结构明确模拟了概念的语义层次结构。将这种体系结构应用于大语言模型的内部表示，这既可以学习语义层次结构，并且这样做可以改善重建和解释性。此外，该体系结构可显着提高计算效率。

Title: Trick or Neat: Adversarial Ambiguity and Language Model Evaluation

Authors: Antonia Karamolegkou, Oliver Eberle, Phillip Rust, Carina Kauf, Anders Søgaard
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.01205
Pdf URL: https://arxiv.org/pdf/2506.01205
Copy Paste: [[2506.01205]] Trick or Neat: Adversarial Ambiguity and Language Model Evaluation(https://arxiv.org/abs/2506.01205)
Keywords: language model, prompt
Abstract: Detecting ambiguity is important for language understanding, including uncertainty estimation, humour detection, and processing garden path sentences. We assess language models' sensitivity to ambiguity by introducing an adversarial ambiguity dataset that includes syntactic, lexical, and phonological ambiguities along with adversarial variations (e.g., word-order changes, synonym replacements, and random-based alterations). Our findings show that direct prompting fails to robustly identify ambiguity, while linear probes trained on model representations can decode ambiguity with high accuracy, sometimes exceeding 90%. Our results offer insights into the prompting paradigm and how language models encode ambiguity at different layers. We release both our code and data: this https URL.
摘要：检测歧义对于语言理解非常重要，包括不确定性估计，幽默检测和处理花园路径句子。我们通过引入包括句法，词汇和语音歧义的对抗性歧义数据集以及对抗性变化（例如，单词订单变化，同义词替代品和随机的基于随机的变化）来评估语言模型对歧义的敏感性。我们的发现表明，直接提示无法牢固地识别歧义，而在模型表示上训练的线性探针可以以高精度来解码歧义，有时超过90 \％。我们的结果提供了有关提示范式以及语言模型如何在不同层上编码歧义的见解。我们同时发布代码和数据：此HTTPS URL。

Title: Mamba Drafters for Speculative Decoding

Authors: Daewon Choi, Seunghyuk Oh, Saket Dingliwal, Jihoon Tack, Kyuyoung Kim, Woomin Song, Seojin Kim, Insu Han, Jinwoo Shin, Aram Galstyan, Shubham Katiyar, Sravan Babu Bodapati
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.01206
Pdf URL: https://arxiv.org/pdf/2506.01206
Copy Paste: [[2506.01206]] Mamba Drafters for Speculative Decoding(https://arxiv.org/abs/2506.01206)
Keywords: language model, llm
Abstract: Speculative decoding has emerged as a promising approach to accelerating large language model (LLM) generation using a fast drafter while maintaining alignment with the target model's distribution. However, existing approaches face a trade-off: external drafters offer flexibility but can suffer from slower drafting, while self-speculation methods use drafters tailored to the target model but require re-training. In this paper, we introduce novel drafters based on Mamba, a state-of-the-art state space model (SSM), as a solution that combines the best aspects of both approaches. By leveraging the linear structure of SSMs, our approach avoids the quadratic complexity inherent in traditional Transformer-based methods, enabling faster drafting and lower memory usage while maintaining the flexibility to work across different target models. We further enhance efficiency with a novel test-time tree search algorithm for generating high-quality draft candidates. Our empirical evaluation demonstrates that Mamba-based drafters not only outperform existing external drafting methods but are also comparable to state-of-the-art self-speculation approaches while using less memory and maintaining their cross-model adaptability.
摘要：投机解码已成为一种有前途的方法，可以使用快速起草者加速大型语言模型（LLM）生成，同时保持与目标模型分布的一致性。但是，现有的方法面临着一种权衡：外部起草者具有灵活性，但可能会遭受较慢的起草，而自我调查方法则使用针对目标模型量身定制的起草者，但需要重新训练。在本文中，我们介绍了基于Mamba（一种最先进的状态空间模型（SSM））的新型起草者，作为结合两种方法最佳方面的解决方案。通过利用SSM的线性结构，我们的方法避免了基于传统的变压器方法固有的二次复杂性，从而可以更快地起草和较低的内存使用量，同时保持了跨不同目标模型的灵活性。我们通过一种新型的测试时间树搜索算法进一步提高效率，以生成高质量的候选者。我们的经验评估表明，基于MAMBA的起草者不仅要优于现有的外部制图方法，而且还可以与最先进的自我调查方法相媲美，同时使用较少的内存并保持其交叉模型的适应性。
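
For readers unfamiliar with how speculative decoding keeps the output aligned with the target model's distribution, the sketch below shows the standard accept/reject verification step in isolation. It is generic and not specific to Mamba drafters or to the tree search in the paper; the function name and the toy distributions are assumptions, and the extra token normally sampled from the target model when every draft token is accepted is omitted for brevity.

```python
import random
from typing import List, Sequence


def verify_draft(draft_tokens: Sequence[int],
                 draft_probs: Sequence[Sequence[float]],
                 target_probs: Sequence[Sequence[float]],
                 rng: random.Random) -> List[int]:
    """Accept or reject drafted tokens against the target model.

    draft_probs[i] and target_probs[i] are full vocabulary distributions at
    position i (q and p respectively). Each draft token is accepted with
    probability min(1, p/q); on the first rejection a replacement is sampled
    from the normalized residual max(0, p - q) and verification stops.
    """
    accepted: List[int] = []
    for tok, q, p in zip(draft_tokens, draft_probs, target_probs):
        if rng.random() < min(1.0, p[tok] / q[tok]):
            accepted.append(tok)
            continue
        # Rejection implies p[tok] < q[tok], so the residual has positive mass.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        total = sum(residual)
        weights = [r / total for r in residual]
        accepted.append(rng.choices(range(len(p)), weights=weights)[0])
        break
    return accepted


rng = random.Random(0)

# When drafter and target agree exactly, every draft token is accepted.
same = [[0.5, 0.5], [0.25, 0.75]]
all_accepted = verify_draft([0, 1], same, same, rng)

# When the target gives the drafted token zero mass, it is always replaced.
replaced = verify_draft([0], [[1.0, 0.0]], [[0.0, 1.0]], rng)
```

The cheaper the drafter's forward pass and the closer its distribution is to the target's, the more tokens survive each verification call, which is where the speed-up comes from.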

Title: Compress, Gather, and Recompute: REFORMing Long-Context Processing in Transformers

Authors: Woomin Song, Sai Muralidhar Jayanthi, Srikanth Ronanki, Kanthashree Mysore Sathyendra, Jinwoo Shin, Aram Galstyan, Shubham Katiyar, Sravan Babu Bodapati
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.01215
Pdf URL: https://arxiv.org/pdf/2506.01215
Copy Paste: [[2506.01215]] Compress, Gather, and Recompute: REFORMing Long-Context Processing in Transformers(https://arxiv.org/abs/2506.01215)
Keywords: language model, long context
Abstract: As large language models increasingly gain popularity in real-world applications, processing extremely long contexts, often exceeding the model's pre-trained context limits, has emerged as a critical challenge. While existing approaches to efficient long-context processing show promise, recurrent compression-based methods struggle with information preservation, whereas random access approaches require substantial memory resources. We introduce REFORM, a novel inference framework that efficiently handles long contexts through a two-phase approach. First, it incrementally processes input chunks while maintaining a compressed KV cache, constructs cross-layer context embeddings, and uses an early-exit strategy for improved efficiency. Second, it identifies and gathers essential tokens via similarity matching and selectively recomputes the KV cache. Compared to baselines, REFORM achieves over 50% and 27% performance gains on RULER and BABILong respectively at 1M context length. It also outperforms baselines on Infinite-Bench and MM-NIAH, demonstrating flexibility across diverse tasks and domains. Additionally, REFORM reduces inference time by 30% and peak memory usage by 5%, achieving both efficiency and superior performance.
摘要：随着大型语言模型越来越多地在现实世界应用中获得流行，处理极长的上下文（通常超过模型的预培训上下文限制）已成为一个关键挑战。尽管现有的有效长篇文化处理的方法显示出希望，但基于反复压缩的方法与信息保存遇到了困难，而随机访问方法则需要大量的内存资源。我们介绍了改革，这是一种新颖的推理框架，该框架通过两阶段的方法有效地处理长篇小说。首先，它可以逐步处理输入块，同时保持压缩的KV缓存，构造跨层上下文嵌入，并利用早期退出策略来提高效率。其次，它通过相似性匹配来识别和收集基本令牌，并有选择地重新计算KV缓存。与基线相比，改革在1M上下文长度上分别在标尺和Babilong上获得了50％和27％的绩效增长。它还表现出在无限基础和MM-NIAH上的基准，表明了各种任务和域之间的灵活性。此外，改革将推理时间降低了30％，峰值记忆使用量增加了5％，从而达到了效率和卓越的性能。

Title: Polishing Every Facet of the GEM: Testing Linguistic Competence of LLMs and Humans in Korean

Authors: SungHo Kim, Nayeon Kim, Taehee Jeon, SangKeun Lee
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.01237
Pdf URL: https://arxiv.org/pdf/2506.01237
Copy Paste: [[2506.01237]] Polishing Every Facet of the GEM: Testing Linguistic Competence of LLMs and Humans in Korean(https://arxiv.org/abs/2506.01237)
Keywords: llm
Abstract: We introduce the Korean Grammar Evaluation BenchMark (KoGEM), designed to assess the linguistic competence of LLMs and humans in Korean. KoGEM consists of 1.5k multiple-choice QA pairs covering five main categories and 16 subcategories. The zero-shot evaluation of 27 LLMs of various sizes and types reveals that while LLMs perform remarkably well on straightforward tasks requiring primarily definitional knowledge, they struggle with tasks that demand the integration of real-world experiential knowledge, such as phonological rules and pronunciation. Furthermore, our in-depth analysis suggests that incorporating such experiential knowledge could enhance the linguistic competence of LLMs. With KoGEM, we not only highlight the limitations of current LLMs in linguistic competence but also uncover hidden facets of LLMs in linguistic competence, paving the way for enhancing comprehensive language understanding. Our code and dataset are available at: this https URL.
摘要：我们介绍了$ \下划线{ Kogem由1.5K多项QA对组成，涵盖了5个主要类别和16个子类别。对27个尺寸和类型的27个LLM的零射门评估表明，尽管LLM在需要定义知识的直接任务上表现出色，但他们在需要整合现实世界经验知识的任务上挣扎，例如语音规则和发音。此外，我们的深入分析表明，结合这种体验知识可以增强LLM的语言能力。借助Kogem，我们不仅强调了当前LLM在语言能力中的局限性，还突出了语言能力中LLM的隐藏方面，为增强全面的语言理解铺平了道路。我们的代码和数据集可用：此HTTPS URL。

Title: ExpertLongBench: Benchmarking Language Models on Expert-Level Long-Form Generation Tasks with Structured Checklists

Authors: Jie Ruan, Inderjeet Nair, Shuyang Cao, Amy Liu, Sheza Munir, Micah Pollens-Dempsey, Tiffany Chiang, Lucy Kates, Nicholas David, Sihan Chen, Ruxin Yang, Yuqian Yang, Jasmine Gump, Tessa Bialek, Vivek Sankaran, Margo Schlanger, Lu Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.01241
Pdf URL: https://arxiv.org/pdf/2506.01241
Copy Paste: [[2506.01241]] ExpertLongBench: Benchmarking Language Models on Expert-Level Long-Form Generation Tasks with Structured Checklists(https://arxiv.org/abs/2506.01241)
Keywords: language model, llm
Abstract: This paper introduces ExpertLongBench, an expert-level benchmark containing 11 tasks from 9 domains that reflect realistic expert workflows and applications. Beyond question answering, the application-driven tasks in ExpertLongBench demand long-form outputs that can exceed 5,000 tokens and strict adherence to domain-specific requirements. Notably, each task in ExpertLongBench includes a rubric, designed or validated by domain experts, to specify task requirements and guide output evaluation. Furthermore, we propose CLEAR, an evaluation framework that supports accurate evaluation of long-form model outputs in our benchmark. To achieve fine-grained, expert-aligned evaluation, CLEAR derives checklists from both model outputs and references by extracting information corresponding to items in the task-specific rubric. Checklist items for model outputs are then compared with corresponding items for reference outputs to assess their correctness, enabling grounded evaluation. We benchmark 11 large language models (LLMs) and analyze components in CLEAR, showing that (1) existing LLMs, with the top performer achieving only a 26.8% F1 score, require significant improvement for expert-level tasks; (2) models can generate content corresponding to the required aspects, though often not accurately; and (3) accurate checklist extraction and comparison in CLEAR can be achieved by open-weight models for more scalable and low-cost usage.
摘要：本文介绍了Expertlongbench，这是一个专家级别的基准测试，其中包含来自9个领域的11个任务，反映了现实的专家工作流和应用。除了回答问题外，专家Longbench中的应用程序驱动的任务要求长期产出可能超过5,000个令牌并严格遵守特定领域的要求。值得注意的是，Expertlongbench中的每个任务都包括一个由域专家设计或验证的标题，以指定任务要求并指导输出评估。此外，我们提出了Clear，这是一个评估框架，该框架支持对我们的基准测试中的长格式模型输出进行准确评估。为了实现细粒度，专家一致的评估，Clear通过提取与特定于特定于任务的项目相对应的信息来从模型输出和参考中得出清单。然后将模型输出的清单项目与相应的参考输出项目进行比较，以评估其正确性，从而可以进行接地评估。我们基于11个大语言模型（LLM），并在Clear中分析组件，这表明（1）现有的LLM，最高表现的F1得分仅为26.8％，需要显着改善专家级任务；（2）模型可以生成与所需方面相对应的内容，尽管通常不准确；（3）可以通过开放权重模型来实现准确的清单提取和比较，以实现更可扩展和低成本的使用。

Title: MTCMB: A Multi-Task Benchmark Framework for Evaluating LLMs on Knowledge, Reasoning, and Safety in Traditional Chinese Medicine

Authors: Shufeng Kong, Xingru Yang, Yuanyuan Wei, Zijie Wang, Hao Tang, Jiuqi Qin, Shuting Lan, Yingheng Wang, Junwen Bai, Zhuangbin Chen, Zibin Zheng, Caihua Liu, Hao Liang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.01252
Pdf URL: https://arxiv.org/pdf/2506.01252
Copy Paste: [[2506.01252]] MTCMB: A Multi-Task Benchmark Framework for Evaluating LLMs on Knowledge, Reasoning, and Safety in Traditional Chinese Medicine(https://arxiv.org/abs/2506.01252)
Keywords: language model, llm
Abstract: Traditional Chinese Medicine (TCM) is a holistic medical system with millennia of accumulated clinical experience, playing a vital role in global healthcare, particularly across East Asia. However, the implicit reasoning, diverse textual forms, and lack of standardization in TCM pose major challenges for computational modeling and evaluation. Large Language Models (LLMs) have demonstrated remarkable potential in processing natural language across diverse domains, including general medicine. Yet, their systematic evaluation in the TCM domain remains underdeveloped. Existing benchmarks either focus narrowly on factual question answering or lack domain-specific tasks and clinical realism. To fill this gap, we introduce MTCMB, a Multi-Task Benchmark for Evaluating LLMs on TCM Knowledge, Reasoning, and Safety. Developed in collaboration with certified TCM experts, MTCMB comprises 12 sub-datasets spanning five major categories: knowledge QA, language understanding, diagnostic reasoning, prescription generation, and safety evaluation. The benchmark integrates real-world case records, national licensing exams, and classical texts, providing an authentic and comprehensive testbed for TCM-capable models. Preliminary results indicate that current LLMs perform well on foundational knowledge but fall short in clinical reasoning, prescription planning, and safety compliance. These findings highlight the urgent need for domain-aligned benchmarks like MTCMB to guide the development of more competent and trustworthy medical AI systems. All datasets, code, and evaluation tools are publicly available at: this https URL.
摘要：传统中药（TCM）是一种整体医学系统，拥有数千年的临床经验，在整个东亚的全球医疗保健中发挥了至关重要的作用。但是，隐含的推理，不同的文本形式以及TCM中缺乏标准化对计算建模和评估构成了重大挑战。大型语言模型（LLMS）在包括通用医学在内的不同领域处理自然语言方面具有巨大的潜力。然而，他们在TCM域中的系统评估仍然不发达。现有的基准要么狭义地关注事实问题回答，要么缺乏特定领域的任务和临床现实主义。为了填补这一空白，我们介绍了MTCMB-A多任务基准，以评估TCM知识，推理和安全性LLM。 MTCMB与认证的TCM专家合作开发，包括12个子数据集，涵盖了五个主要类别：知识QA，语言理解，诊断推理，处方生成和安全评估。基准测试集成了现实世界中的案例记录，国家许可考试和古典文本，为具有TCM能力的模型提供了真实而全面的测试床。初步结果表明，当前的LLM在基础知识方面表现良好，但在临床推理，处方计划和安全依从性方面缺乏。这些发现凸显了对MTCMB（MTCMB）等域名基准的迫切需求，以指导更有能力和值得信赖的医学AI系统的发展。所有数据集，代码和评估工具均可在以下网址公开获得：此HTTPS URL。

Title: CoRE: Condition-based Reasoning for Identifying Outcome Variance in Complex Events

Authors: Sai Vallurupalli, Francis Ferraro
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.01253
Pdf URL: https://arxiv.org/pdf/2506.01253
Copy Paste: [[2506.01253]] CoRE: Condition-based Reasoning for Identifying Outcome Variance in Complex Events(https://arxiv.org/abs/2506.01253)
Keywords: gpt, llm
Abstract: Knowing which latent conditions lead to a particular outcome is useful for critically examining claims made about complex event outcomes. Identifying implied conditions and examining their influence on an outcome is challenging. We handle this by combining and augmenting annotations from two existing datasets consisting of goals and states, and explore the influence of conditions through our research questions and Condition-based Reasoning tasks. We examine open and closed LLMs of varying sizes and intent-alignment on our reasoning tasks and find that conditions are useful when not all context is available. Models differ widely in their ability to generate and identify outcome-variant conditions, which affects their performance on outcome validation when conditions are used to replace missing context. Larger models, like GPT-4o, are more cautious in such less constrained situations.
摘要：知道哪些潜在条件会导致特定的结果，这对于严格研究对复杂事件结果的主张有用。确定隐含条件并检查其对结果的影响是具有挑战性的。我们通过结合和增强来自两个由目标和州组成的现有数据集的注释来处理这一问题，并通过我们的研究问题和基于条件的推理任务探索条件的影响。我们检查了我们的推理任务上不同尺寸和意图调整的开放和封闭的LLM，并发现并非所有上下文都可以使用时条件很有用。模型在生成和识别结果变化的条件的能力上有很大差异，这些条件会影响其在替代缺失上下文的条件时影响其结果验证的性能。在较少受约束的情况下，诸如GPT-4O之类的较大模型更谨慎。

Title: DeepSeek in Healthcare: A Survey of Capabilities, Risks, and Clinical Applications of Open-Source Large Language Models

Authors: Jiancheng Ye, Sophie Bronstein, Jiarui Hai, Malak Abu Hashish
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.01257
Pdf URL: https://arxiv.org/pdf/2506.01257
Copy Paste: [[2506.01257]] DeepSeek in Healthcare: A Survey of Capabilities, Risks, and Clinical Applications of Open-Source Large Language Models(https://arxiv.org/abs/2506.01257)
Keywords: language model, gpt, llm
Abstract: DeepSeek-R1 is a cutting-edge open-source large language model (LLM) developed by DeepSeek, showcasing advanced reasoning capabilities through a hybrid architecture that integrates mixture of experts (MoE), chain of thought (CoT) reasoning, and reinforcement learning. Released under the permissive MIT license, DeepSeek-R1 offers a transparent and cost-effective alternative to proprietary models like GPT-4o and Claude-3 Opus; it excels in structured problem-solving domains such as mathematics, healthcare diagnostics, code generation, and pharmaceutical research. The model demonstrates competitive performance on benchmarks like the United States Medical Licensing Examination (USMLE) and American Invitational Mathematics Examination (AIME), with strong results in pediatric and ophthalmologic clinical decision support tasks. Its architecture enables efficient inference while preserving reasoning depth, making it suitable for deployment in resource-constrained settings. However, DeepSeek-R1 also exhibits increased vulnerability to bias, misinformation, adversarial manipulation, and safety failures - especially in multilingual and ethically sensitive contexts. This survey highlights the model's strengths, including interpretability, scalability, and adaptability, alongside its limitations in general language fluency and safety alignment. Future research priorities include improving bias mitigation, natural language comprehension, domain-specific validation, and regulatory compliance. Overall, DeepSeek-R1 represents a major advance in open, scalable AI, underscoring the need for collaborative governance to ensure responsible and equitable deployment.
摘要：DeepSeek-R1是由DeepSeek开发的尖端开源大型语言模型（LLM），通过混合体系结构展示了先进的推理能力，该混合体系结合了专家的混合（MOE），思想链（COT）推理和增强学习。 DeepSeek-R1根据MIT允许的MIT许可发布，为GPT-4O和Claude-3 Opus（例如GPT-4O和Claude-3 Opus）提供了透明且具有成本效益的替代方案；它在结构化的解决问题的领域（例如数学，医疗保健诊断，代码生成和制药研究）中表现出色。该模型表明了在美国医疗许可检查（USMLE）和美国邀请赛数学考试（AIME）等基准上的竞争性能，在儿科和眼科临床决策支持任务方面取得了良好的结果。它的体系结构可以在保留推理深度的同时进行有效的推理，从而适合在资源约束设置中部署。但是，DeepSeek -R1还表现出增加对偏见，错误信息，对抗性操纵和安全失败的脆弱性，尤其是在多语言和道德敏感的环境中。这项调查强调了该模型的优势，包括可解释性，可伸缩性和适应性，以及其一般语言流利性和安全性一致性的局限性。未来的研究优先事项包括改善偏见缓解，自然语言理解，特定于领域的验证和法规合规性。总体而言，DeepSeek-R1代表了开放，可扩展的AI的重大进步，强调了协作治理的需求，以确保负责任和公平的部署。

Title: Exploring the Potential of LLMs as Personalized Assistants: Dataset, Evaluation, and Analysis

Authors: Jisoo Mok, Ik-hwan Kim, Sangkwon Park, Sungroh Yoon
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.01262
Pdf URL: https://arxiv.org/pdf/2506.01262
Copy Paste: [[2506.01262]] Exploring the Potential of LLMs as Personalized Assistants: Dataset, Evaluation, and Analysis(https://arxiv.org/abs/2506.01262)
Keywords: language model, llm
Abstract: Personalized AI assistants, a hallmark of the human-like capabilities of Large Language Models (LLMs), are a challenging application that intertwines multiple problems in LLM research. Despite the growing interest in the development of personalized assistants, the lack of an open-source conversational dataset tailored for personalization remains a significant obstacle for researchers in the field. To address this research gap, we introduce HiCUPID, a new benchmark to probe and unleash the potential of LLMs to deliver personalized responses. Alongside a conversational dataset, HiCUPID provides a Llama-3.2-based automated evaluation model whose assessment closely mirrors human preferences. We release our dataset, evaluation model, and code at this https URL.
摘要：个性化的AI助手，是大语模型（LLMS）类似人类功能的标志，是一个充满挑战的应用程序，它在LLM研究中的多个问题交织在一起。尽管对个性化助手的发展越来越感兴趣，但缺乏针对个性化的开源对话数据集仍然是该领域研究人员的重要障碍。为了解决这一研究差距，我们介绍了Hicupid，这是一种新的基准测试，以探究LLMS提供个性化响应的潜力。除了对话数据集外，Hicupid还提供了一种基于Llama-3.2的自动化评估模型，其评估与人类的偏好相似。我们在此HTTPS URL上发布数据集，评估模型和代码。

Title: Beyond In-Context Learning: Aligning Long-form Generation of Large Language Models via Task-Inherent Attribute Guidelines

Authors: Do Xuan Long, Duong Ngoc Yen, Do Xuan Trong, Luu Anh Tuan, Kenji Kawaguchi, Shafiq Joty, Min-Yen Kan, Nancy F. Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.01265
Pdf URL: https://arxiv.org/pdf/2506.01265
Copy Paste: [[2506.01265]] Beyond In-Context Learning: Aligning Long-form Generation of Large Language Models via Task-Inherent Attribute Guidelines(https://arxiv.org/abs/2506.01265)
Keywords: language model, llm, prompt
Abstract: In-context learning (ICL) is an important yet not fully understood ability of pre-trained large language models (LLMs). It can greatly enhance task performance using a few examples, termed demonstrations, without fine-tuning. Although effective in question answering, ICL often underperforms in long-form generation tasks such as summarization. Under appropriately realistic assumptions, we empirically and theoretically show that ICL demonstrations alone are insufficient to teach LLMs the task language and format distributions for generation. We argue for explicit exposure to the task distributions and hypothesize that defining them by prompting enhances model performance. To this end, we present LongGuide, which efficiently generates two parallel streams of guidelines capturing task language and format properties: (i) Metric Guidelines (MGs) that instruct models to optimize self-evaluated metrics; and (ii) Output Constraint Guidelines (OCGs) that constrain generation at both token and sentence levels. LongGuide automatically selects the best combination of guidelines, improving both strong open- and closed-source LLMs by over 5% in both zero- and few-shot settings. We show that LongGuide is generalizable, learnable by weak models to enhance strong ones, and integrates synergistically with automatic prompt optimizers.
摘要：内在学习（ICL）是一个重要但尚未完全理解的预训练大型语言模型（LLMS）的能力。它可以使用一些示例，称为演示，而无需微调，可以极大地提高任务绩效。尽管有效地回答有效，但ICL在长期生成任务（例如摘要）中的表现通常不足。在适当的现实假设下，我们从经验和理论上表明，仅ICL演示就不足以向LLM教授任务语言和格式分布的生成。我们主张明确接触任务分布，并假设通过提示增强模型性能来定义它们。为此，我们提出了longGuide，该指导有效地生成了两个平行的指南流，捕获任务语言和格式属性：（i）指标指导模型以优化自我评估指标的指标指南（MGS）；（ii）限制在令牌和句子级别上产生的输出约束指南（OCGS）。 LongGuide会自动选择指南的最佳组合，从而在零和少数情况下将强式开放和封闭式LLMS提高了5％以上。我们表明，长导是可以推广的，可以通过弱模型来学习以增强强型模型，并与自动及时的优化器协同整合。

Title: Detoxification of Large Language Models through Output-layer Fusion with a Calibration Model

Authors: Yuanhe Tian, Mingjie Deng, Guoqing Jin, Yan Song
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.01266
Pdf URL: https://arxiv.org/pdf/2506.01266
Copy Paste: [[2506.01266]] Detoxification of Large Language Models through Output-layer Fusion with a Calibration Model(https://arxiv.org/abs/2506.01266)
Keywords: language model, llm, prompt
Abstract: Existing approaches for Large language model (LLM) detoxification generally rely on training on large-scale non-toxic or human-annotated preference data, designing prompts to instruct the LLM to generate safe content, or modifying the model parameters to remove toxic information, which are computationally expensive, lack robustness, and often compromise LLMs' fluency and contextual understanding. In this paper, we propose a simple yet effective approach for LLM detoxification, which leverages a compact, pre-trained calibration model that guides the detoxification process of a target LLM via a lightweight intervention in its generation pipeline. By learning a detoxified embedding space from non-toxic data, the calibration model effectively steers the LLM away from generating harmful content. This approach only requires a one-time training of the calibration model that is able to be seamlessly applied to multiple LLMs without compromising fluency or contextual understanding. Experiment results on the benchmark dataset demonstrate that our approach reduces toxicity while maintaining reasonable content expression.

Title: Schema as Parameterized Tools for Universal Information Extraction

Authors: Sheng Liang, Yongyue Zhang, Yaxiong Wu, Ruiming Tang, Yong Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.01276
Pdf URL: https://arxiv.org/pdf/2506.01276
Copy Paste: [[2506.01276]] Schema as Parameterized Tools for Universal Information Extraction(https://arxiv.org/abs/2506.01276)
Keywords: language model, llm
Abstract: Universal information extraction (UIE) primarily employs an extractive generation approach with large language models (LLMs), typically outputting structured information based on predefined schemas such as JSON or tables. UIE suffers from a lack of adaptability when selecting between predefined schemas and on-the-fly schema generation within the in-context learning paradigm, especially when there are numerous schemas to choose from. In this paper, we propose a unified adaptive text-to-structure generation framework, called Schema as Parameterized Tools (SPT), which reimagines the tool-calling capability of LLMs by treating predefined schemas as parameterized tools for tool selection and parameter filling. Specifically, our SPT method can be applied to unify closed, open, and on-demand IE tasks by adopting Schema Retrieval by fetching the relevant schemas from a predefined pool, Schema Filling by extracting information and filling slots as with tool parameters, or Schema Generation by synthesizing new schemas with uncovered cases. Experiments show that the SPT method can handle four distinct IE tasks adaptively, delivering robust schema retrieval and selection performance. SPT also achieves comparable extraction performance to LoRA baselines and current leading UIE systems with significantly fewer trainable parameters.

Title: VM14K: First Vietnamese Medical Benchmark

Authors: Thong Nguyen, Duc Nguyen, Minh Dang, Thai Dao, Long Nguyen, Quan H. Nguyen, Dat Nguyen, Kien Tran, Minh Tran
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.01305
Pdf URL: https://arxiv.org/pdf/2506.01305
Copy Paste: [[2506.01305]] VM14K: First Vietnamese Medical Benchmark(https://arxiv.org/abs/2506.01305)
Keywords: language model
Abstract: Medical benchmarks are indispensable for evaluating the capabilities of language models in healthcare for non-English-speaking communities, and therefore help ensure the quality of real-life applications. However, not every community has sufficient resources and standardized methods to effectively build and design such benchmarks, and available non-English medical data is normally fragmented and difficult to verify. We developed an approach to tackle this problem and applied it to create the first Vietnamese medical question benchmark, featuring 14,000 multiple-choice questions across 34 medical specialties. Our benchmark was constructed using various verifiable sources, including carefully curated medical exams and clinical records, and eventually annotated by medical experts. The benchmark includes four difficulty levels, ranging from foundational biological knowledge commonly found in textbooks to typical clinical case studies that require advanced reasoning. This design enables assessment of both the breadth and depth of language models' medical understanding in the target language thanks to its extensive coverage and in-depth subject-specific expertise. We release the benchmark in three parts: a sample public set (4k questions), a full public set (10k questions), and a private set (2k questions) used for leaderboard evaluation. Each set contains all medical subfields and difficulty levels. Our approach is scalable to other languages, and we open-source our data construction pipeline to support the development of future multilingual benchmarks in the medical domain.

Title: A Platform for Investigating Public Health Content with Efficient Concern Classification

Authors: Christopher Li, Rickard Stureborg, Bhuwan Dhingra, Jun Yang
Subjects: cs.CL, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2506.01308
Pdf URL: https://arxiv.org/pdf/2506.01308
Copy Paste: [[2506.01308]] A Platform for Investigating Public Health Content with Efficient Concern Classification(https://arxiv.org/abs/2506.01308)
Keywords: language model
Abstract: A recent rise in online content expressing concerns with public health initiatives has contributed to already stalled uptake of preemptive measures globally. Future public health efforts must attempt to understand such content, what concerns it may raise among readers, and how to effectively respond to it. To this end, we present ConcernScope, a platform that uses a teacher-student framework for knowledge transfer between large language models and light-weight classifiers to quickly and effectively identify the health concerns raised in a text corpus. The platform allows uploading massive files directly, automatically scraping specific URLs, and direct text editing. ConcernScope is built on top of a taxonomy of public health concerns. Intended for public health officials, we demonstrate several applications of this platform: guided data exploration to find useful examples of common concerns found in online community datasets, identification of trends in concerns through an example time series analysis of 186,000 samples, and finding trends in topic frequency before and after significant events.

Title: Growing Through Experience: Scaling Episodic Grounding in Language Models

Authors: Chunhui Zhang, Sirui (Elsie) Wang, Zhongyu Ouyang, Xiangchi Yuan, Soroush Vosoughi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.01312
Pdf URL: https://arxiv.org/pdf/2506.01312
Copy Paste: [[2506.01312]] Growing Through Experience: Scaling Episodic Grounding in Language Models(https://arxiv.org/abs/2506.01312)
Keywords: language model
Abstract: Language models (LMs) require robust episodic grounding, the capacity to learn from and apply past experiences, to excel at physical planning tasks. Current episodic grounding approaches struggle with scalability and integration, limiting their effectiveness, especially for medium-sized LMs (7B parameters). While larger LMs (70-405B parameters) possess superior hierarchical representations and extensive pre-trained knowledge, they encounter a fundamental scale paradox: despite their advanced abstraction capabilities, they lack efficient mechanisms to leverage experience streams. We propose a scalable weak-to-strong episodic learning framework that effectively transfers episodic behaviors from smaller to larger LMs. This framework integrates Monte Carlo tree search for structured experience collection with a novel distillation method, preserving the inherent LM capabilities while embedding episodic memory. Experiments demonstrate that our method surpasses state-of-the-art proprietary LMs by 3.45% across diverse planning and question-answering tasks. Layer-wise probing further indicates significant improvements in task alignment, especially within deeper LM layers, highlighting stable generalization even for previously unseen scenarios with increased planning complexity, conditions where baseline methods degrade markedly.

Title: Evaluating Large Language Models in Crisis Detection: A Real-World Benchmark from Psychological Support Hotlines

Authors: Guifeng Deng, Shuyin Rao, Tianyu Lin, Anlu Dai, Pan Wang, Junyi Xie, Haidong Song, Ke Zhao, Dongwu Xu, Zhengdong Cheng, Tao Li, Haiteng Jiang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.01329
Pdf URL: https://arxiv.org/pdf/2506.01329
Copy Paste: [[2506.01329]] Evaluating Large Language Models in Crisis Detection: A Real-World Benchmark from Psychological Support Hotlines(https://arxiv.org/abs/2506.01329)
Keywords: language model, gpt, llm
Abstract: Psychological support hotlines are critical for crisis intervention but face significant challenges due to rising demand. Large language models (LLMs) could support crisis assessments, yet their capabilities in emotionally sensitive contexts remain unclear. We introduce PsyCrisisBench, a benchmark of 540 annotated transcripts from the Hangzhou Psychological Assistance Hotline, assessing four tasks: mood status recognition, suicidal ideation detection, suicide plan identification, and risk assessment. We evaluated 64 LLMs across 15 families (e.g., GPT, Claude, Gemini, Llama, Qwen, DeepSeek) using zero-shot, few-shot, and fine-tuning paradigms. Performance was measured by F1-score, with statistical comparisons via Welch's t-tests. LLMs performed strongly on suicidal ideation detection (F1=0.880), suicide plan identification (F1=0.779), and risk assessment (F1=0.907), improved with few-shot and fine-tuning. Mood status recognition was more challenging (max F1=0.709), likely due to lost vocal cues and ambiguity. A fine-tuned 1.5B-parameter model (Qwen2.5-1.5B) surpassed larger models on mood and suicidal ideation. Open-source models like QwQ-32B performed comparably to closed-source on most tasks (p>0.3), though closed models retained an edge in mood detection (p=0.007). Performance scaled with size up to a point; quantization (AWQ) reduced GPU memory by 70% with minimal F1 degradation. LLMs show substantial promise in structured psychological crisis assessments, especially with fine-tuning. Mood recognition remains limited due to contextual complexity. The narrowing gap between open- and closed-source models, combined with efficient quantization, suggests feasible integration. PsyCrisisBench offers a robust evaluation framework to guide model development and ethical deployment in mental health.
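
Note on the statistics: the abstract above reports comparisons via Welch's t-tests. As a rough illustration only, the sketch below computes the Welch t statistic and the Welch-Satterthwaite degrees of freedom for two samples using the Python standard library. The function name `welch_t` and the sample F1 values are hypothetical and are not taken from the paper; the paper's actual analysis pipeline is not described in the abstract.

```python
import math
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return (t statistic, degrees of freedom) for Welch's unequal-variance t-test."""
    n_a, n_b = len(sample_a), len(sample_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each sample needs at least two observations")
    # Squared standard error contributed by each sample (sample variance / n).
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    if se2_a + se2_b == 0:
        raise ValueError("both samples have zero variance")
    t_stat = (mean(sample_a) - mean(sample_b)) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    dof = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t_stat, dof


if __name__ == "__main__":
    # Made-up F1 scores for two hypothetical groups of models.
    open_models = [0.71, 0.75, 0.78, 0.80]
    closed_models = [0.74, 0.79, 0.81, 0.83]
    print(welch_t(open_models, closed_models))
```

Turning the statistic into a p-value additionally requires the CDF of Student's t distribution, which the standard library does not provide; a statistics package would normally be used for that step.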

Title: Enhancing Interpretable Image Classification Through LLM Agents and Conditional Concept Bottleneck Models

Authors: Yiwen Jiang, Deval Mehta, Wei Feng, Zongyuan Ge
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.01334
Pdf URL: https://arxiv.org/pdf/2506.01334
Copy Paste: [[2506.01334]] Enhancing Interpretable Image Classification Through LLM Agents and Conditional Concept Bottleneck Models(https://arxiv.org/abs/2506.01334)
Keywords: language model, llm, agent
Abstract: Concept Bottleneck Models (CBMs) decompose image classification into a process governed by interpretable, human-readable concepts. Recent advances in CBMs have used Large Language Models (LLMs) to generate candidate concepts. However, a critical question remains: What is the optimal number of concepts to use? Current concept banks suffer from redundancy or insufficient coverage. To address this issue, we introduce a dynamic, agent-based approach that adjusts the concept bank in response to environmental feedback, optimizing the number of concepts for sufficient yet concise coverage. Moreover, we propose Conditional Concept Bottleneck Models (CoCoBMs) to overcome the limitations in traditional CBMs' concept scoring mechanisms. It enhances the accuracy of assessing each concept's contribution to classification tasks and features an editable matrix that allows LLMs to correct concept scores that conflict with their internal knowledge. Our evaluations across 6 datasets show that our method not only improves classification accuracy by 6% but also enhances interpretability assessments by 30%.

Title: The Landscape of Arabic Large Language Models (ALLMs): A New Era for Arabic Language Technology

Authors: Shahad Al-Khalifa, Nadir Durrani, Hend Al-Khalifa, Firoj Alam
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.01340
Pdf URL: https://arxiv.org/pdf/2506.01340
Copy Paste: [[2506.01340]] The Landscape of Arabic Large Language Models (ALLMs): A New Era for Arabic Language Technology(https://arxiv.org/abs/2506.01340)
Keywords: language model, gpt, llm, chat
Abstract: The emergence of ChatGPT marked a transformative milestone for Artificial Intelligence (AI), showcasing the remarkable potential of Large Language Models (LLMs) to generate human-like text. This wave of innovation has revolutionized how we interact with technology, seamlessly integrating LLMs into everyday tasks such as vacation planning, email drafting, and content creation. While English-speaking users have significantly benefited from these advancements, the Arabic world faces distinct challenges in developing Arabic-specific LLMs. Arabic, one of the languages spoken most widely around the world, serves more than 422 million native speakers in 27 countries and is deeply rooted in a rich linguistic and cultural heritage. Developing Arabic LLMs (ALLMs) presents an unparalleled opportunity to bridge technological gaps and empower communities. The journey of ALLMs has been both fascinating and complex, evolving from rudimentary text processing systems to sophisticated AI-driven models. This article explores the trajectory of ALLMs, from their inception to the present day, highlighting the efforts to evaluate these models through benchmarks and public leaderboards. We also discuss the challenges and opportunities that ALLMs present for the Arab world.

Title: TurnBench-MS: A Benchmark for Evaluating Multi-Turn, Multi-Step Reasoning in Large Language Models

Authors: Yiran Zhang, Mo Wang, Xiaoyang Li, Kaixuan Ren, Chencheng Zhu, Usman Naseem
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.01341
Pdf URL: https://arxiv.org/pdf/2506.01341
Copy Paste: [[2506.01341]] TurnBench-MS: A Benchmark for Evaluating Multi-Turn, Multi-Step Reasoning in Large Language Models(https://arxiv.org/abs/2506.01341)
Keywords: language model, llm
Abstract: Despite impressive advances in large language models (LLMs), existing benchmarks often focus on single-turn or single-step tasks, failing to capture the kind of iterative reasoning required in real-world settings. To address this limitation, we introduce TurnBench, a novel benchmark that evaluates multi-turn, multi-step reasoning through an interactive code-breaking task inspired by a "Turing Machine Board Game." In each episode, a model must uncover hidden logical or arithmetic rules by making sequential guesses, receiving structured feedback, and integrating clues across multiple rounds. This dynamic setup requires models to reason over time, adapt based on past information, and maintain consistency across steps, capabilities that are underexplored in current benchmarks. TurnBench includes two modes: Classic, which tests standard reasoning, and Nightmare, which introduces increased complexity and requires robust inferential chains. To support fine-grained analysis, we provide ground-truth annotations for intermediate reasoning steps. Our evaluation of state-of-the-art LLMs reveals significant gaps: the best model achieves 81.5% accuracy in Classic mode, but performance drops to 17.8% in Nightmare mode. In contrast, human participants achieve 100% in both, underscoring the challenge TurnBench poses to current models. By incorporating feedback loops and hiding task rules, TurnBench reduces contamination risks and provides a rigorous testbed for diagnosing and advancing multi-step, multi-turn reasoning in LLMs.

Title: Follow the Flow: Fine-grained Flowchart Attribution with Neurosymbolic Agents

Authors: Manan Suri, Puneet Mathur, Nedim Lipka, Franck Dernoncourt, Ryan A. Rossi, Vivek Gupta, Dinesh Manocha
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.01344
Pdf URL: https://arxiv.org/pdf/2506.01344
Copy Paste: [[2506.01344]] Follow the Flow: Fine-grained Flowchart Attribution with Neurosymbolic Agents(https://arxiv.org/abs/2506.01344)
Keywords: language model, llm, hallucination, agent
Abstract: Flowcharts are a critical tool for visualizing decision-making processes. However, their non-linear structure and complex visual-textual relationships make it challenging to interpret them using LLMs, as vision-language models frequently hallucinate nonexistent connections and decision paths when analyzing these diagrams. This leads to compromised reliability for automated flowchart processing in critical domains such as logistics, health, and engineering. We introduce the task of Fine-grained Flowchart Attribution, which traces specific components grounding a flowchart-referring LLM response. Flowchart Attribution ensures the verifiability of LLM predictions and improves explainability by linking generated responses to the flowchart's structure. We propose FlowPathAgent, a neurosymbolic agent that performs fine-grained post hoc attribution through graph-based reasoning. It first segments the flowchart, then converts it into a structured symbolic graph, and then employs an agentic approach to dynamically interact with the graph to generate attribution paths. Additionally, we present FlowExplainBench, a novel benchmark for evaluating flowchart attributions across diverse styles, domains, and question types. Experimental results show that FlowPathAgent mitigates visual hallucinations in LLM answers over flowchart QA, outperforming strong baselines by 10-14% on our proposed FlowExplainBench dataset.

Title: The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning

Authors: Xinyu Zhu, Mengzhou Xia, Zhepei Wei, Wei-Lin Chen, Danqi Chen, Yu Meng
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.01347
Pdf URL: https://arxiv.org/pdf/2506.01347
Copy Paste: [[2506.01347]] The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning(https://arxiv.org/abs/2506.01347)
Keywords: language model, llm
Abstract: Reinforcement learning with verifiable rewards (RLVR) is a promising approach for training language models (LMs) on reasoning tasks that elicit emergent long chains of thought (CoTs). Unlike supervised learning, it updates the model using both correct and incorrect samples via policy gradients. To better understand its mechanism, we decompose the learning signal into reinforcing correct responses and penalizing incorrect ones, referred to as Positive and Negative Sample Reinforcement (PSR and NSR), respectively. We train Qwen2.5-Math-7B and Qwen3-4B on a mathematical reasoning dataset and uncover a surprising result: training with only negative samples -- without reinforcing correct responses -- can be highly effective: it consistently improves performance over the base model across the entire Pass@$k$ spectrum ($k$ up to $256$), often matching or surpassing PPO and GRPO. In contrast, reinforcing only correct responses improves Pass@$1$ but degrades performance at higher $k$, due to reduced diversity. These inference-scaling trends highlight that solely penalizing incorrect responses may contribute more to performance than previously recognized. Through gradient analysis, we show that NSR works by suppressing incorrect generations and redistributing probability mass toward other plausible candidates, guided by the model's prior beliefs. It refines the model's existing knowledge rather than introducing entirely new behaviors. Building on this insight, we propose a simple variant of the RL objective that upweights NSR, and show that it consistently improves overall Pass@$k$ performance on MATH, AIME 2025, and AMC23. Our code is available at this https URL.
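
Note on the metric: the abstract above reports results across the Pass@$k$ spectrum. The abstract does not say which estimator the authors use, so the sketch below is only an assumption: it implements the commonly used unbiased estimator $1 - \binom{n-c}{k} / \binom{n}{k}$, where $n$ samples are drawn per problem and $c$ of them are correct. The function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased estimate of pass@k from n samples of which c are correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # If fewer than k samples are incorrect, every size-k subset
    # contains at least one correct sample.
    if n - c < k:
        return 1.0
    # Probability that a random size-k subset contains no correct sample,
    # subtracted from 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical counts: 256 samples for one problem, 12 of them correct.
    for k in (1, 8, 64, 256):
        print(k, pass_at_k(256, 12, k))
```

A benchmark-level score would average this quantity over problems. Because the estimate at large $k$ depends on whether any correct sample appears at all, it is sensitive to output diversity, which is the effect the abstract attributes to reinforcing only correct responses.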

Title: KokoroChat: A Japanese Psychological Counseling Dialogue Dataset Collected via Role-Playing by Trained Counselors

Authors: Zhiyang Qi, Takumasa Kaneko, Keiko Takamizo, Mariko Ukiyo, Michimasa Inaba
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.01357
Pdf URL: https://arxiv.org/pdf/2506.01357
Copy Paste: [[2506.01357]] KokoroChat: A Japanese Psychological Counseling Dialogue Dataset Collected via Role-Playing by Trained Counselors(https://arxiv.org/abs/2506.01357)
Keywords: language model, llm, chat
Abstract: Generating psychological counseling responses with language models relies heavily on high-quality datasets. Crowdsourced data collection methods require strict worker training, and data from real-world counseling environments may raise privacy and ethical concerns. While recent studies have explored using large language models (LLMs) to augment psychological counseling dialogue datasets, the resulting data often suffers from limited diversity and authenticity. To address these limitations, this study adopts a role-playing approach where trained counselors simulate counselor-client interactions, ensuring high-quality dialogues while mitigating privacy risks. Using this method, we construct KokoroChat, a Japanese psychological counseling dialogue dataset comprising 6,589 long-form dialogues, each accompanied by comprehensive client feedback. Experimental results demonstrate that fine-tuning open-source LLMs with KokoroChat improves both the quality of generated counseling responses and the automatic evaluation of counseling dialogues. The KokoroChat dataset is available at this https URL.

Title: MMD-Flagger: Leveraging Maximum Mean Discrepancy to Detect Hallucinations

Authors: Kensuke Mitsuzawa, Damien Garreau
Subjects: cs.CL, stat.ML
Abstract URL: https://arxiv.org/abs/2506.01367
Pdf URL: https://arxiv.org/pdf/2506.01367
Copy Paste: [[2506.01367]] MMD-Flagger: Leveraging Maximum Mean Discrepancy to Detect Hallucinations(https://arxiv.org/abs/2506.01367)
Keywords: language model, llm, hallucination
Abstract: Large language models (LLMs) have become pervasive in our everyday life. Yet, a fundamental obstacle prevents their use in many critical applications: their propensity to generate fluent, human-quality content that is not grounded in reality. The detection of such hallucinations is thus of the highest importance. In this work, we propose a new method to flag hallucinated content, MMD-Flagger. It relies on Maximum Mean Discrepancy (MMD), a non-parametric distance between distributions. From a high-level perspective, MMD-Flagger tracks the MMD between the generated documents and documents generated with various temperature parameters. We show empirically that inspecting the shape of this trajectory is sufficient to detect most hallucinations. This novel method is benchmarked on two machine translation datasets, on which it outperforms natural competitors.
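
Note on MMD: the abstract above relies on Maximum Mean Discrepancy as a distance between distributions. The sketch below is a generic illustration of a biased (V-statistic) estimate of squared MMD with a Gaussian kernel on plain numeric vectors. It is not the MMD-Flagger procedure: the kernel choice, the `bandwidth` parameter, and the toy vectors are all assumptions, and the way the paper embeds documents or inspects the temperature trajectory is not reproduced here.

```python
import math


def rbf_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two equal-length numeric vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD between samples xs and ys."""
    if not xs or not ys:
        raise ValueError("both samples must be non-empty")

    def mean_kernel(a_set, b_set):
        # Average kernel value over all pairs drawn from the two sets.
        total = sum(rbf_kernel(a, b, bandwidth) for a in a_set for b in b_set)
        return total / (len(a_set) * len(b_set))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


if __name__ == "__main__":
    # Toy 2-D "embeddings" standing in for two sets of generated documents.
    near = [[0.0, 0.1], [0.1, 0.0], [0.0, 0.0]]
    far = [[3.0, 3.1], [3.1, 3.0], [3.0, 3.0]]
    print(mmd_squared(near, near), mmd_squared(near, far))
```

The estimate is zero when both samples are identical and grows as the two samples separate, which is what makes a sequence of such values usable as a trajectory.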

Title: AdaRewriter: Unleashing the Power of Prompting-based Conversational Query Reformulation via Test-Time Adaptation

Authors: Yilong Lai, Jialong Wu, Zhenglin Wang, Deyu Zhou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.01381
Pdf URL: https://arxiv.org/pdf/2506.01381
Copy Paste: [[2506.01381]] AdaRewriter: Unleashing the Power of Prompting-based Conversational Query Reformulation via Test-Time Adaptation(https://arxiv.org/abs/2506.01381)
Keywords: llm, prompt
Abstract: Prompting-based conversational query reformulation has emerged as a powerful approach for conversational search, refining ambiguous user queries into standalone search queries. Best-of-N reformulation over the generated candidates via prompting shows impressive potential scaling capability. However, both the previous tuning methods (training time) and adaptation approaches (test time) cannot fully unleash their benefits. In this paper, we propose AdaRewriter, a novel framework for query reformulation using an outcome-supervised reward model via test-time adaptation. By training a lightweight reward model with contrastive ranking loss, AdaRewriter selects the most promising reformulation during inference. Notably, it can operate effectively in black-box systems, including commercial LLM APIs. Experiments on five conversational search datasets show that AdaRewriter significantly outperforms the existing methods across most settings, demonstrating the potential of test-time adaptation for conversational query reformulation.
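
Note on Best-of-N selection: the abstract above describes picking the most promising reformulation among N generated candidates with a reward model. The sketch below shows only the generic selection step. The `best_of_n` helper and the `toy_score` function are hypothetical placeholders: a real system would replace `toy_score` with a trained reward model, and nothing here reflects AdaRewriter's actual model or training loss.

```python
from typing import Callable, Sequence


def best_of_n(candidates: Sequence[str], score: Callable[[str], float]) -> str:
    """Return the candidate with the highest score (first one wins on ties)."""
    if not candidates:
        raise ValueError("at least one candidate is required")
    return max(candidates, key=score)


def toy_score(query: str) -> float:
    """Placeholder scorer: penalizes unresolved pronouns, rewards longer queries."""
    pronouns = {"it", "he", "she", "they", "that", "this"}
    tokens = query.lower().split()
    unresolved = sum(token in pronouns for token in tokens)
    return len(tokens) - 5.0 * unresolved


if __name__ == "__main__":
    rewrites = [
        "when was it released",
        "when was the first iPhone released",
        "release date",
    ]
    print(best_of_n(rewrites, toy_score))
```

Because selection needs only the candidate strings and a scorer, this step can sit outside the generator, which is consistent with the abstract's remark about black-box LLM APIs.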

Title: Comparing LLM-generated and human-authored news text using formal syntactic theory

Authors: Olga Zamaraeva, Dan Flickinger, Francis Bond, Carlos Gómez-Rodríguez
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.01407
Pdf URL: https://arxiv.org/pdf/2506.01407
Copy Paste: [[2506.01407]] Comparing LLM-generated and human-authored news text using formal syntactic theory(https://arxiv.org/abs/2506.01407)
Keywords: language model, llm
Abstract: This study provides the first comprehensive comparison of New York Times-style text generated by six large language models against real, human-authored NYT writing. The comparison is based on a formal syntactic theory. We use Head-driven Phrase Structure Grammar (HPSG) to analyze the grammatical structure of the texts. We then investigate and illustrate the differences in the distributions of HPSG grammar types, revealing systematic distinctions between human and LLM-generated writing. These findings contribute to a deeper understanding of the syntactic behavior of LLMs as well as humans, within the NYT genre.

Title: UniversalCEFR: Enabling Open Multilingual Research on Language Proficiency Assessment

Authors: Joseph Marvin Imperial, Abdullah Barayan, Regina Stodden, Rodrigo Wilkens, Ricardo Munoz Sanchez, Lingyun Gao, Melissa Torgbi, Dawn Knight, Gail Forey, Reka R. Jablonkai, Ekaterina Kochmar, Robert Reynolds, Eugenio Ribeiro, Horacio Saggion, Elena Volodina, Sowmya Vajjala, Thomas Francois, Fernando Alva-Manchego, Harish Tayyar Madabushi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.01419
Pdf URL: https://arxiv.org/pdf/2506.01419
Copy Paste: [[2506.01419]] UniversalCEFR: Enabling Open Multilingual Research on Language Proficiency Assessment(https://arxiv.org/abs/2506.01419)
Keywords: llm, prompt
Abstract: We introduce UniversalCEFR, a large-scale multilingual multidimensional dataset of texts annotated according to the CEFR (Common European Framework of Reference) scale in 13 languages. To enable open research in both automated readability and language proficiency assessment, UniversalCEFR comprises 505,807 CEFR-labeled texts curated from educational and learner-oriented resources, standardized into a unified data format to support consistent processing, analysis, and modeling across tasks and languages. To demonstrate its utility, we conduct benchmark experiments using three modelling paradigms: a) linguistic feature-based classification, b) fine-tuning pre-trained LLMs, and c) descriptor-based prompting of instruction-tuned LLMs. Our results further support using linguistic features and fine-tuning pretrained models in multilingual CEFR level assessment. Overall, UniversalCEFR aims to establish best practices in data distribution in language proficiency research by standardising dataset formats and promoting their accessibility to the global research community.
摘要：我们介绍了Universalcefr，这是一种根据CEFR（CEFR（常见的欧洲参考框架）规模）的大规模多语言多语言多维数据集，其中13种语言。为了在自动化可读性和语言能力评估中进行开放研究，UniversalceFR包括505,807个CEFR标记的文本，这些文本由教育和学习者的资源策划，标准化为统一的数据格式，以支持一致的处理，分析，分析和建模跨任务和语言。为了证明其实用性，我们使用三个建模范式进行基准实验：a）基于语言特征的分类，b）微调预训练的LLM和c）基于描述符的指令调节的LLMS提示。我们的结果在多语言CEFR级别评估中使用语言特征和微调预审计的模型提供了进一步的支持。总体而言，UniversalceFR旨在通过标准化数据集格式并促进其对全球研究界的可访问性来建立语言能力研究中数据分布的最佳实践。

Title: Self-Refining Language Model Anonymizers via Adversarial Distillation

Authors: Kyuyoung Kim, Hyunjun Jeon, Jinwoo Shin
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.01420
Pdf URL: https://arxiv.org/pdf/2506.01420
Copy Paste: [[2506.01420]] Self-Refining Language Model Anonymizers via Adversarial Distillation(https://arxiv.org/abs/2506.01420)
Keywords: language model, gpt, llm
Abstract: Large language models (LLMs) are increasingly used in sensitive domains, where their ability to infer personal data from seemingly benign text poses emerging privacy risks. While recent LLM-based anonymization methods help mitigate such risks, they often rely on proprietary models (e.g., GPT-4), raising concerns about cost and the potential exposure of sensitive data to untrusted external systems. To address this, we introduce SElf-refining Anonymization with Language model (SEAL), a novel distillation framework for training small language models (SLMs) to perform effective anonymization without relying on external costly models at inference time. We leverage adversarial interactions between an LLM anonymizer and an inference model to collect trajectories of anonymized texts and inferred attributes, which are used to distill anonymization, adversarial inference, and utility evaluation capabilities into SLMs via supervised fine-tuning and preference learning. The resulting models learn to both anonymize text and critique their outputs, enabling iterative improvement of anonymization quality via self-refinement. Experiments on SynthPAI, a dataset of synthetic personal profiles and text comments, demonstrate that SLMs trained with SEAL achieve substantial improvements in anonymization capabilities. Notably, 8B models attain a privacy-utility trade-off comparable to that of the GPT-4 anonymizer and, with self-refinement, even surpass it in terms of privacy. These results show the effectiveness of our adversarial distillation framework in training SLMs as efficient anonymizers. To facilitate further research, we release the full dataset used in our experiments.
摘要：大型语言模型（LLM）越来越多地用于敏感领域，在这些领域中，它们从看似良性的文本中推断出个人数据的能力会带来新兴的隐私风险。尽管最近基于LLM的匿名方法有助于减轻此类风险，但它们通常依靠专有模型（例如GPT-4），引起人们对成本的担忧以及敏感数据的潜在暴露于不信任的外部系统。为了解决这个问题，我们介绍了语言模型（密封）的自我注定匿名化，这是一种用于训练小语言模型（SLM）的新型蒸馏框架，以执行有效的匿名化，而无需在推理时依靠外部昂贵的模型。我们利用LLM匿名器和推理模型之间的对抗性相互作用来收集匿名文本和推断属性的轨迹，这些属性用于通过受监管的精心调整和偏好学习来提炼匿名，对抗性推理和实用性评估功能。最终的模型学会同时匿名文本和批评其输出，从而通过自我启动来迭代地提高匿名质量。合成个人资料和文本评论的数据集Synthpai进行的实验表明，经过密封培训的SLMS可以实现匿名功能的实质性改进。值得注意的是，8B模型实现了与GPT-4匿名器相当的隐私 - 私人权衡权衡，并且自我进行了反复，甚至可以在隐私方面超越它。这些结果表明，我们的对抗性蒸馏框架在训练SLM中作为有效的匿名者的有效性。为了促进进一步的研究，我们发布了实验中使用的完整数据集。

Title: Redundancy, Isotropy, and Intrinsic Dimensionality of Prompt-based Text Embeddings

Authors: Hayato Tsukagoshi, Ryohei Sasano
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.01435
Pdf URL: https://arxiv.org/pdf/2506.01435
Copy Paste: [[2506.01435]] Redundancy, Isotropy, and Intrinsic Dimensionality of Prompt-based Text Embeddings(https://arxiv.org/abs/2506.01435)
Keywords: prompt
Abstract: Prompt-based text embedding models, which generate task-specific embeddings upon receiving tailored prompts, have recently demonstrated remarkable performance. However, their resulting embeddings often have thousands of dimensions, leading to high storage costs and increased computational costs of embedding-based operations. In this paper, we investigate how post-hoc dimensionality reduction applied to the embeddings affects the performance of various tasks that leverage these embeddings, specifically classification, clustering, retrieval, and semantic textual similarity (STS) tasks. Our experiments show that even a naive dimensionality reduction, which keeps only the first 25% of the dimensions of the embeddings, results in a very slight performance degradation, indicating that these embeddings are highly redundant. Notably, for classification and clustering, even when embeddings are reduced to less than 0.5% of the original dimensionality the performance degradation is very small. To quantitatively analyze this redundancy, we perform an analysis based on the intrinsic dimensionality and isotropy of the embeddings. Our analysis reveals that embeddings for classification and clustering, which are considered to have very high dimensional redundancy, exhibit lower intrinsic dimensionality and less isotropy compared with those for retrieval and STS.
摘要：基于及时的文本嵌入模型，该模型在收到量身定制的提示后生成特定于任务的嵌入方式，最近表现出了出色的性能。但是，它们由此产生的嵌入通常具有数千个维度，从而导致高存储成本和基于嵌入式操作的计算成本增加。在本文中，我们调查了嵌入后的事后维度降低如何影响利用这些嵌入的各种任务的性能，特别是分类，聚类，检索和语义文本相似性（STS）任务。我们的实验表明，即使是幼稚维度的降低，仅保留嵌入尺寸的前25％，也会导致非常轻微的性能降解，表明这些嵌入是高度冗余的。值得注意的是，对于分类和聚类，即使将嵌入到原始维度的0.5％时，性能降解也很小。为了定量分析这种冗余，我们基于嵌入的固有维度和各向同性进行了分析。我们的分析表明，与检索和STS相比，与检索和ST相比，分类和聚类的嵌入被认为具有很高的尺寸冗余性，具有较低的内在维度和较小的各向同性。
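
Note: the naive reduction described in the abstract above (keeping only the leading fraction of embedding dimensions) can be illustrated with a minimal sketch. The function names, the 8-dimensional toy vectors, and the use of cosine similarity as the downstream operation are assumptions made for illustration; they are not taken from the paper.

```python
import math
from typing import List, Sequence


def truncate_embedding(vector: Sequence[float], keep_ratio: float = 0.25) -> List[float]:
    """Keep only the leading `keep_ratio` fraction of the dimensions."""
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    kept = max(1, int(len(vector) * keep_ratio))  # always keep at least one dimension
    return list(vector[:kept])


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings; real prompt-based embeddings have thousands of dimensions.
doc_a = [0.9, 0.1, 0.3, 0.2, 0.05, 0.01, 0.02, 0.03]
doc_b = [0.8, 0.2, 0.25, 0.1, 0.9, 0.4, 0.1, 0.6]

short_a = truncate_embedding(doc_a)  # first 25% of 8 dimensions
short_b = truncate_embedding(doc_b)

full_score = cosine_similarity(doc_a, doc_b)
short_score = cosine_similarity(short_a, short_b)
print(f"full: {full_score:.3f}  truncated: {short_score:.3f}")
```

Comparing `full_score` with `short_score` on real embeddings is one simple way to observe how much a similarity-based task changes after truncation; the toy vectors here say nothing about the paper's reported results.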

Title: TalTech Systems for the Interspeech 2025 ML-SUPERB 2.0 Challenge

Authors: Tanel Alumäe, Artem Fedorchenko
Subjects: cs.CL, eess.AS
Abstract URL: https://arxiv.org/abs/2506.01458
Pdf URL: https://arxiv.org/pdf/2506.01458
Copy Paste: [[2506.01458]] TalTech Systems for the Interspeech 2025 ML-SUPERB 2.0 Challenge(https://arxiv.org/abs/2506.01458)
Keywords: language model
Abstract: This paper describes the language identification and multilingual speech recognition system developed at Tallinn University of Technology for the Interspeech 2025 ML-SUPERB 2.0 Challenge. A hybrid language identification system is used, consisting of a pretrained language embedding model and a light-weight speech recognition model with a shared encoder across languages and language-specific bigram language models. For speech recognition, three models are used, where only a single model is applied for each language, depending on the training data availability and performance on held-out data. The model set consists of a finetuned version of SeamlessM4T, MMS-1B-all with custom language adapters and MMS-zeroshot. The system obtained the top overall score in the challenge.
摘要：本文介绍了塔林技术大学开发的语言识别和多语言语音识别系统针对Interspeech 2025 ML-Superb 2.0挑战。使用了混合语言识别系统，该系统由审慎的语言嵌入模型和轻巧的语音识别模型组成，该模型具有跨语言和特定语言的BigRam语言模型的共享编码器。为了进行语音识别，使用了三个模型，其中仅适用于每种语言的单个模型，具体取决于训练数据的可用性和持有数据的性能。该型号集由填充版本的SeamlessM4T，MMS-1B-ALL与自定义语言适配器和MMS-szeroshot组成。该系统在挑战中获得了最高的总分。

Title: Integrating Neural and Symbolic Components in a Model of Pragmatic Question-Answering

Authors: Polina Tsvilodub, Robert D. Hawkins, Michael Franke
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.01474
Pdf URL: https://arxiv.org/pdf/2506.01474
Copy Paste: [[2506.01474]] Integrating Neural and Symbolic Components in a Model of Pragmatic Question-Answering(https://arxiv.org/abs/2506.01474)
Keywords: llm
Abstract: Computational models of pragmatic language use have traditionally relied on hand-specified sets of utterances and meanings, limiting their applicability to real-world language use. We propose a neuro-symbolic framework that enhances probabilistic cognitive models by integrating LLM-based modules to propose and evaluate key components in natural language, eliminating the need for manual specification. Through a classic case study of pragmatic question-answering, we systematically examine various approaches to incorporating neural modules into the cognitive model -- from evaluating utilities and literal semantics to generating alternative utterances and goals. We find that hybrid models can match or exceed the performance of traditional probabilistic models in predicting human answer patterns. However, the success of the neuro-symbolic model depends critically on how LLMs are integrated: while they are particularly effective for proposing alternatives and transforming abstract goals into utilities, they face challenges with truth-conditional semantic evaluation. This work charts a path toward more flexible and scalable models of pragmatic language use while illuminating crucial design considerations for balancing neural and symbolic components.
摘要：传统上，务实语言使用的计算模型依赖于手工指定的话语和含义集，从而将其适用性限制在现实世界的语言使用中。我们提出了一个神经符号框架，该框架通过集成基于LLM的模块以提出和评估自然语言的关键组成部分，从而增强概率认知模型，从而消除了对手动规范的需求。通过对务实提问的经典案例研究，我们系统地研究了将神经模块纳入认知模型的各种方法 - 从评估实用程序和文字语义到产生替代性话语和目标。我们发现，在预测人类答案模式时，混合模型可以匹配或超过传统概率模型的性能。但是，神经符号模型的成功取决于LLM的整合方式：尽管它们对于提出替代方案并将抽象目标转变为实用程序特别有效，但它们通过真实条件的语义评估面临挑战。这项工作为务实语言使用更灵活，更可扩展的模型绘制了一条途径，同时阐明了平衡神经和象征成分的关键设计注意事项。

Title: LLM in the Loop: Creating the PARADEHATE Dataset for Hate Speech Detoxification

Authors: Shuzhou Yuan, Ercong Nie, Lukas Kouba, Ashish Yashwanth Kangen, Helmut Schmid, Hinrich Schutze, Michael Farber
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.01484
Pdf URL: https://arxiv.org/pdf/2506.01484
Copy Paste: [[2506.01484]] LLM in the Loop: Creating the PARADEHATE Dataset for Hate Speech Detoxification(https://arxiv.org/abs/2506.01484)
Keywords: gpt, llm
Abstract: Detoxification, the task of rewriting harmful language into non-toxic text, has become increasingly important amid the growing prevalence of toxic content online. However, high-quality parallel datasets for detoxification, especially for hate speech, remain scarce due to the cost and sensitivity of human annotation. In this paper, we propose a novel LLM-in-the-loop pipeline leveraging GPT-4o-mini for automated detoxification. We first replicate the ParaDetox pipeline by replacing human annotators with an LLM and show that the LLM performs comparably to human annotation. Building on this, we construct PARADEHATE, a large-scale parallel dataset specifically for hate speech detoxification. We release PARADEHATE as a benchmark of over 8K hate/non-hate text pairs and evaluate a wide range of baseline methods. Experimental results show that models such as BART, fine-tuned on PARADEHATE, achieve better performance in style accuracy, content preservation, and fluency, demonstrating the effectiveness of LLM-generated detoxification text as a scalable alternative to human annotation.
摘要：在网上日益增长的有毒内容的越来越多的情况下，排毒是将有害语言改写为无毒文本的任务变得越来越重要。但是，由于人类注释的成本和敏感性，用于排毒的高质量并行数据集，尤其是对于仇恨言论。在本文中，我们提出了一条新型的LLM在循环管道中，利用GPT-4O-Mini进行自动排毒。我们首先通过用LLM替换人类注释来复制悖论管道，并表明LLM的性能与人类注释相当。在此基础上，我们构建了ParadeHate，这是一种专门用于Hatespeech排毒的大规模平行数据集。我们发布ParadeHate作为超过8K仇恨/非讨厌文本对的基准，并评估广泛的基线方法。实验结果表明，Bart等模型在Paradehate上进行了微调，在样式的准确性，内容保存和流利度上实现了更好的性能，证明了LLM生成的排毒文本的有效性，作为人类注释的可扩展性替代方案。

Title: Multilingual Definition Modeling

Authors: Edison Marrese-Taylor, Erica K. Shimomoto, Alfredo Solano, Enrique Reid
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.01489
Pdf URL: https://arxiv.org/pdf/2506.01489
Copy Paste: [[2506.01489]] Multilingual Definition Modeling(https://arxiv.org/abs/2506.01489)
Keywords: language model, llm, chat
Abstract: In this paper, we propose the first multilingual study on definition modeling. We use monolingual dictionary data for four new languages (Spanish, French, Portuguese, and German) and perform an in-depth empirical study to test the performance of pre-trained multilingual language models on definition modeling of monosemic words when finetuned on this data. Furthermore, we use a zero-shot approach to test the multilingual capabilities of two popular chat-based Large Language Models (LLMs) in the task. Results show that multilingual language models can perform on par with English but cannot leverage potential cross-lingual synergies, with LLMs generally offering better performance overall. A comprehensive human evaluation of the LLM-generated definitions highlights the zero- and few-shot capabilities of these models in this new task, also showing their shortcomings. Finally, we show that performance on our task via BERTScore correlates strongly with the performance on multilingual LLM benchmarks, suggesting that our task offers a viable compute-constrained, stable and natural alternative to these.
摘要：在本文中，我们提出了有关定义建模的首次多语言研究。我们使用四种新语言（西班牙语，法语，葡萄牙语和德语）的单语词典数据，并执行一项深入的经验研究，以测试预先训练的多语言语言模型在此数据上进行鉴定时的定义模型。此外，我们使用零击方法来测试任务中两个流行的基于聊天的大语言模型（LLM）的多语言功能。结果表明，多语言语言模型可以用英语执行，但不能利用潜在的跨语性协同作用，而LLMS通常提供更好的性能。对LLM生成的定义的全面人类评估突出了这些模型在这项新任务中的零和几乎没有的功能，也表明了它们的缺点。最后，我们证明了通过Bertscore在任务上的表现与多语言LLM基准的性能密切相关，这表明我们的任务提供了可行的计算，稳定，稳定且自然的替代方案。

Title: CVC: A Large-Scale Chinese Value Rule Corpus for Value Alignment of Large Language Models

Authors: Ping Wu, Guobin Shen, Dongcheng Zhao, Yuwei Wang, Yiting Dong, Yu Shi, Enmeng Lu, Feifei Zhao, Yi Zeng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.01495
Pdf URL: https://arxiv.org/pdf/2506.01495
Copy Paste: [[2506.01495]] CVC: A Large-Scale Chinese Value Rule Corpus for Value Alignment of Large Language Models(https://arxiv.org/abs/2506.01495)
Keywords: language model, llm
Abstract: Ensuring that Large Language Models (LLMs) align with mainstream human values and ethical norms is crucial for the safe and sustainable development of AI. Current value evaluation and alignment are constrained by Western cultural bias and incomplete domestic frameworks reliant on non-native rules; furthermore, the lack of scalable, rule-driven scenario generation methods makes evaluations costly and inadequate across diverse cultural contexts. To address these challenges, we propose a hierarchical value framework grounded in core Chinese values, encompassing three main dimensions, 12 core values, and 50 derived values. Based on this framework, we construct a large-scale Chinese Values Corpus (CVC) containing over 250,000 value rules enhanced and expanded through human annotation. Experimental results show that CVC-guided scenarios outperform direct generation ones in value boundaries and content diversity. In the evaluation across six sensitive themes (e.g., surrogacy, suicide), seven mainstream LLMs preferred CVC-generated options in over 70.5% of cases, while five Chinese human annotators showed an 87.5% alignment with CVC, confirming its universality, cultural relevance, and strong alignment with Chinese values. Additionally, we construct 400,000 rule-based moral dilemma scenarios that objectively capture nuanced distinctions in conflicting value prioritization across 17 LLMs. Our work establishes a culturally-adaptive benchmarking framework for comprehensive value evaluation and alignment, representing Chinese characteristics. All data are available at this https URL, and the code is available at this https URL.
摘要：确保大型语言模型（LLM）与主流人类的价值观和道德规范保持一致，对于AI的安全和可持续发展至关重要。当前的价值评估和一致性受西方文化偏见和依赖非本地规则的不完整的家庭框架的限制；此外，缺乏可扩展的，规则驱动的场景生成方法使评估昂贵，并且在各种文化背景下不足。为了应对这些挑战，我们提出了一个基于中国核心价值的分层价值框架，其中包括三个主要维度，12个核心值和50个派生值。基于此框架，我们构建了一个大规模的中国价值语料库（CVC），其中包含超过250,000个价值规则，通过人类注释增强和扩展。实验结果表明，CVC指导的方案在价值边界和内容多样性方面的表现优于直接生成。在六个敏感主题（例如替代性，自杀）评估中，七个主流LLMS首选CVC生成的选择中的70.5％以上的案例，而五个中国人类注释者表现出与CVC的87.5％一致性，证明了其普遍性，文化相关性，与中国的价值很强。此外，我们构建了400,000个基于规则的道德困境情景，这些方案在17个LLMS之间客观地捕获相互冲突的价值优先次数的细微差异。我们的工作建立了一个具有文化自适应的基准测试框架，用于全面的价值评估和一致性，代表了汉语特征。所有数据均在此HTTPS URL上可用，并且该代码可在此HTTPS URL上找到。

Title: Representations of Fact, Fiction and Forecast in Large Language Models: Epistemics and Attitudes

Authors: Meng Li, Michael Vrazitulis, David Schlangen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.01512
Pdf URL: https://arxiv.org/pdf/2506.01512
Copy Paste: [[2506.01512]] Representations of Fact, Fiction and Forecast in Large Language Models: Epistemics and Attitudes(https://arxiv.org/abs/2506.01512)
Keywords: language model, llm
Abstract: Rational speakers are supposed to know what they know and what they do not know, and to generate expressions matching the strength of evidence. In contrast, it is still a challenge for current large language models to generate corresponding utterances based on the assessment of facts and confidence in an uncertain real-world environment. While it has recently become popular to estimate and calibrate confidence of LLMs with verbalized uncertainty, what is lacking is a careful examination of the linguistic knowledge of uncertainty encoded in the latent space of LLMs. In this paper, we draw on typological frameworks of epistemic expressions to evaluate LLMs' knowledge of epistemic modality, using controlled stories. Our experiments show that the performance of LLMs in generating epistemic expressions is limited and not robust, and hence the expressions of uncertainty generated by LLMs are not always reliable. To build uncertainty-aware LLMs, it is necessary to enrich semantic knowledge of epistemic modality in LLMs.
摘要：理性的演讲者应该知道他们所知道的，他们不知道的知识，并产生与证据实力相匹配的表情。相反，对于当前的大型语言模型来说，基于对不确定的现实环境的事实和信心产生相应的话语仍然是一个挑战。尽管最近它以口头上的不确定性估算和校准LLM的信心变得很流行，但缺乏对LLMS潜在空间中编码的不确定性的语言知识的仔细检查。在本文中，我们利用认知表达的类型学框架，使用受控故事评估LLMS对认知模式的知识。我们的实验表明，LLM在产生认知表达式的性能是有限的且不健壮的，因此LLMS产生的不确定性表达并不总是可靠的。为了建立不确定性感知的LLM，有必要丰富对LLM中认知方式的语义知识。

Title: FormFactory: An Interactive Benchmarking Suite for Multimodal Form-Filling Agents

Authors: Bobo Li, Yuheng Wang, Hao Fei, Juncheng Li, Wei Ji, Mong-Li Lee, Wynne Hsu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.01520
Pdf URL: https://arxiv.org/pdf/2506.01520
Copy Paste: [[2506.01520]] FormFactory: An Interactive Benchmarking Suite for Multimodal Form-Filling Agents(https://arxiv.org/abs/2506.01520)
Keywords: language model, llm, agent
Abstract: Online form filling is a common yet labor-intensive task involving extensive keyboard and mouse interactions. Despite the long-standing vision of automating this process with "one click", existing tools remain largely rule-based and lack generalizable, generative capabilities. Recent advances in Multimodal Large Language Models (MLLMs) have enabled promising agents for GUI-related tasks in general-purpose scenarios. However, they struggle with the unique challenges of form filling, such as flexible layouts and the difficulty of aligning textual instructions with on-screen fields. To bridge this gap, we formally define the form-filling task and propose FormFactory, an interactive benchmarking suite comprising a web-based interface, backend evaluation module, and carefully constructed dataset. Our benchmark covers diverse real-world scenarios, incorporates various field formats, and simulates high-fidelity form interactions. We conduct a comprehensive evaluation of state-of-the-art MLLMs and observe that no model surpasses 5% accuracy, underscoring the inherent difficulty of the task. These findings also reveal significant limitations in current models' visual layout reasoning and field-value alignment abilities. We hope our benchmark can serve as a stepping stone for further research into robust, practical form-filling agents.
摘要：在线形式填充是一项常见但劳动密集型的任务，涉及广泛的键盘和鼠标互动。尽管长期以来可以使用“一键”自动化此过程的愿景，但现有工具仍然基于规则，并且缺乏可普遍的生成能力。多模式大语言模型（MLLM）的最新进展使在通用场景中的GUI相关任务实现了有希望的代理。但是，他们在形式填充的独特挑战中挣扎，例如灵活的布局以及将文本说明与屏幕上的字段保持一致的困难。为了弥合此差距，我们正式定义了形式填充任务，并提出了FormFactory，这是一个包括基于Web的接口，后端评估模块和精心构造的数据集的交互式基准测试套件。我们的基准涵盖了各种现实世界的场景，结合了各种现场格式，并模拟了高保真形式的相互作用。我们对最先进的MLLM进行了全面的评估，并观察到没有模型超过5％的精度，强调了任务的固有难度。这些发现还揭示了当前模型的视觉布局推理和场值对齐能力的重大局限性。我们希望我们的基准可以作为垫脚石，以进一步研究强大的实用形式填充剂。

Title: V-VAE: A Variational Auto Encoding Framework Towards Fine-Grained Control over Human-Like Chat

Authors: Qi Lin, Weikai Xu, Lisi Chen, Bin Dai
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.01524
Pdf URL: https://arxiv.org/pdf/2506.01524
Copy Paste: [[2506.01524]] V-VAE: A Variational Auto Encoding Framework Towards Fine-Grained Control over Human-Like Chat(https://arxiv.org/abs/2506.01524)
Keywords: language model, llm, chat
Abstract: With the continued proliferation of Large Language Model (LLM) based chatbots, there is a growing demand for generating responses that are not only linguistically fluent but also consistently aligned with persona-specific traits in conversations. However, existing role-play and persona-based chat approaches rely heavily on static role descriptions, coarse-grained signal space, and low-quality synthetic data, which fail to capture dynamic fine-grained details in human-like chat. Human-like chat requires modeling subtle latent traits, such as emotional tone, situational awareness, and evolving personality, which are difficult to predefine and cannot be easily learned from synthetic or distillation-based data. To address these limitations, we propose a Verbal Variational Auto-Encoding (V-VAE) framework, containing a variational auto-encoding module and fine-grained control space which dynamically adapts dialogue behaviour based on fine-grained, interpretable latent variables across talking style, interaction patterns, and personal attributes. We also construct a high-quality dataset, HumanChatData, and benchmark HumanChatBench to address the scarcity of high-quality data in the human-like domain. Experiments show that LLMs based on V-VAE consistently outperform standard baselines on HumanChatBench and DialogBench, which further demonstrates the effectiveness of V-VAE and HumanChatData.
摘要：随着基于大语言模型（LLM）聊天机器人的持续扩散，人们对产生反应的需求不仅在语言上流利，而且在对话中与特定于人物的特定特征保持一致。但是，现有的基于角色扮演和基于角色的聊天方法在很大程度上依赖于静态角色描述，粗粒度的信号空间和低质量的合成数据，这些数据无法在类似人类的聊天中捕获动态细粒度的细节。类似人类的聊天需要建模微妙的潜在特征，例如情感语调，情境意识和不断发展的个性，这些性格难以预先定义，并且不容易从基于合成或基于蒸馏的数据中学到。为了解决这些局限性，我们提出了一个口头变异自动编码（V-VAE）框架，其中包含一个差异自动编码模块和细粒度的控制空间，该框架动态地适应了基于跨说话风格，交互模式和个人属性的细粒度，可解释的潜在变量的对话行为。我们还构建了一个高质量的数据集，Hunchanchatdata和基准Hunchanchatbench，以解决类似人类领域中高质量数据的稀缺性。实验表明，基于V-VAE的LLM始终超过Hunchanchatbench和Dialogbench上的标准基线，这进一步证明了V-VAE和HumanchatData的有效性。

Title: STORM-BORN: A Challenging Mathematical Derivations Dataset Curated via a Human-in-the-Loop Multi-Agent Framework

Authors: Wenhao Liu, Zhenyi Lu, Xinyu Hu, Jierui Zhang, Dailin Li, Jiacheng Cen, Huilin Cao, Haiteng Wang, Yuhan Li, Kun Xie, Dandan Li, Pei Zhang, Chengbo Zhang, Yuxiang Ren, Xiaohong Huang, Yan Ma
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.01531
Pdf URL: https://arxiv.org/pdf/2506.01531
Copy Paste: [[2506.01531]] STORM-BORN: A Challenging Mathematical Derivations Dataset Curated via a Human-in-the-Loop Multi-Agent Framework(https://arxiv.org/abs/2506.01531)
Keywords: language model, gpt, llm, agent
Abstract: High-quality math datasets are crucial for advancing the reasoning abilities of large language models (LLMs). However, existing datasets often suffer from three key issues: outdated and insufficiently challenging content, neglect of human-like reasoning, and limited reliability due to single-LLM generation. To address these, we introduce STORM-BORN, an ultra-challenging dataset of mathematical derivations sourced from cutting-edge academic papers, which includes dense human-like approximations and heuristic cues. To ensure reliability and quality, we propose a novel human-in-the-loop, multi-agent data generation framework, integrating reasoning-dense filters, multi-agent collaboration, and human mathematicians' evaluations. We curated a set of 2,000 synthetic samples and deliberately selected the 100 most difficult problems. Even the most advanced models like GPT-o1 solved fewer than 5% of them. Fine-tuning on STORM-BORN boosts accuracy by 7.84% (LLaMA3-8B) and 9.12% (Qwen2.5-7B). As AI approaches mathematician-level reasoning, STORM-BORN provides both a high-difficulty benchmark and a human-like reasoning training resource. Our code and dataset are publicly available at this https URL.
摘要：高质量的数学数据集对于推进大语言模型（LLMS）的推理能力至关重要。但是，现有的数据集通常遇到三个关键问题：过时且不足的挑战性内容，忽略人类的推理以及由于单LLM生成而导致的可靠性有限。为了解决这些问题，我们介绍了$ \ textbf {Storm-Born} $，这是一种来自尖端学术论文的数学派生数据集，其中包括密集的人类近似值和启发式提示。为了确保可靠性和质量，我们提出了一个新型的人类在循环，多代理数据生成框架，集成推理密集的过滤器，多代理协作和人类数学家的评估。我们策划了一组2,000个合成样本，并故意选择了100个最困难的问题。即使是GPT-O1（例如GPT-O1），甚至大多数高级型号也无法解决其中的$ 5 \％$。在暴风雨中进行微调提高准确性$ 7.84 \％$（llama3-8b）和$ 9.12 \％$（qwen2.5-7b）。当AI接近数学家级别的推理时，风暴出生既提供了高难题的基准和类似人类的推理培训资源。我们的代码和数据集可在此HTTPS URL上公开获得。

Title: Dictionaries to the Rescue: Cross-Lingual Vocabulary Transfer for Low-Resource Languages Using Bilingual Dictionaries

Authors: Haruki Sakajo, Yusuke Ide, Justin Vasselli, Yusuke Sakai, Yingtao Tian, Hidetaka Kamigaito, Taro Watanabe
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.01535
Pdf URL: https://arxiv.org/pdf/2506.01535
Copy Paste: [[2506.01535]] Dictionaries to the Rescue: Cross-Lingual Vocabulary Transfer for Low-Resource Languages Using Bilingual Dictionaries(https://arxiv.org/abs/2506.01535)
Keywords: language model
Abstract: Cross-lingual vocabulary transfer plays a promising role in adapting pre-trained language models to new languages, including low-resource languages. Existing approaches that utilize monolingual or parallel corpora face challenges when applied to languages with limited resources. In this work, we propose a simple yet effective vocabulary transfer method that utilizes bilingual dictionaries, which are available for many languages, thanks to descriptive linguists. Our proposed method leverages a property of BPE tokenizers where removing a subword from the vocabulary causes a fallback to shorter subwords. The embeddings of target subwords are estimated iteratively by progressively removing them from the tokenizer. The experimental results show that our approach outperforms existing methods for low-resource languages, demonstrating the effectiveness of a dictionary-based approach for cross-lingual vocabulary transfer.
摘要：跨语性词汇转移在将预训练的语言模型适应新语言（包括低资源语言）中起着有希望的作用。当应用于有限的资源的语言时，使用单语言或平行语料库的现有方法面临挑战。在这项工作中，我们提出了一种简单而有效的词汇转移方法，该方法利用双语词典，这要归功于描述性语言学家，可用于许多语言。我们提出的方法利用了BPE Tokenizer的属性，其中从词汇中删除子字会导致后备到较短的子字。目标子词的嵌入是通过从令牌仪中逐渐删除的。实验结果表明，我们的方法表现优于低资源语言的现有方法，证明了基于词典的跨语性词汇转移方法的有效性。
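
Note: the BPE property mentioned in the abstract above (removing a subword makes the tokenizer fall back to shorter subwords) can be illustrated with a toy sketch. The merge table, the helper names `bpe_tokenize` and `remove_subword`, and the simplified rule-by-rule merge loop are all assumptions made for illustration; this is not the authors' implementation and it omits their embedding estimation step entirely.

```python
from typing import List, Tuple

Merge = Tuple[str, str]


def bpe_tokenize(word: str, merges: List[Merge]) -> List[str]:
    """Toy BPE: start from characters and apply each merge rule in priority order."""
    tokens = list(word)
    for left, right in merges:
        merged: List[str] = []
        i = 0
        while i < len(tokens):
            # Fuse an adjacent (left, right) pair into a single longer subword.
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens


def remove_subword(merges: List[Merge], subword: str) -> List[Merge]:
    """Drop every merge rule whose output would be `subword`."""
    return [(left, right) for left, right in merges if left + right != subword]


# Hypothetical merge table, highest priority first.
merges = [("l", "o"), ("lo", "w"), ("e", "r"), ("low", "er")]

full_vocab = bpe_tokenize("lower", merges)
without_lower = bpe_tokenize("lower", remove_subword(merges, "lower"))
without_low = bpe_tokenize(
    "lower", remove_subword(remove_subword(merges, "lower"), "low")
)
print(full_vocab, without_lower, without_low)
```

Each removal leaves the word segmentable, only into shorter pieces, which is the fallback behaviour the abstract says the method relies on when removing subwords progressively.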

Title: Hanfu-Bench: A Multimodal Benchmark on Cross-Temporal Cultural Understanding and Transcreation

Authors: Li Zhou, Lutong Yu, Dongchu Xie, Shaohuan Cheng, Wenyan Li, Haizhou Li
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2506.01565
Pdf URL: https://arxiv.org/pdf/2506.01565
Copy Paste: [[2506.01565]] Hanfu-Bench: A Multimodal Benchmark on Cross-Temporal Cultural Understanding and Transcreation(https://arxiv.org/abs/2506.01565)
Keywords: language model
Abstract: Culture is a rich and dynamic domain that evolves across both geography and time. However, existing studies on cultural understanding with vision-language models (VLMs) primarily emphasize geographic diversity, often overlooking the critical temporal dimensions. To bridge this gap, we introduce Hanfu-Bench, a novel, expert-curated multimodal dataset. Hanfu, a traditional garment spanning ancient Chinese dynasties, serves as a representative cultural heritage that reflects the profound temporal aspects of Chinese culture while remaining highly popular in Chinese contemporary society. Hanfu-Bench comprises two core tasks: cultural visual understanding and cultural image transcreation. The former task examines temporal-cultural feature recognition based on single- or multi-image inputs through multiple-choice visual question answering, while the latter focuses on transforming traditional attire into modern designs through cultural element inheritance and modern context adaptation. Our evaluation shows that closed VLMs perform comparably to non-experts on visual cultural understanding but fall short of human experts by 10%, while open VLMs lag further behind non-experts. For the transcreation task, multi-faceted human evaluation indicates that the best-performing model achieves a success rate of only 42%. Our benchmark provides an essential testbed, revealing significant challenges in this new direction of temporal cultural understanding and creative adaptation.
摘要：文化是一个丰富而充满活力的领域，在地理和时间上都会发展。然而，现有的关于文化理解的研究主要强调地理多样性，通常忽略了关键的时间维度。为了弥合这一差距，我们介绍了Hanfu-Bench，这是一个新颖的专家策划的多模式数据集。汉夫（Hanfu）是一种跨越中国古代王朝的传统服装，是一种代表性的文化遗产，反映了中国文化的深刻时间方面，同时在中国当代社会中仍然非常受欢迎。 Hanfu Bench构成了两个核心任务：文化视觉理解和文化形象此HTTP URL以前任务研究了基于单或多图像输入的时间文化特征识别，通过多选择的视觉问题回答，而后者则侧重于通过文化元素的继承和现代上下文适应将传统装饰转变为现代设计。我们的评估表明，封闭的VLM与视觉切割理解的非专家相当，但对人类专家的差异为10 \％，而开放的VLMS则进一步落后于非专家。对于转录任务，多面的人类评估表明，表现最佳的模型仅达到42 \％的成功率。我们的基准提供了必不可少的测试床，并在时间文化理解和创造性适应的新方向上揭示了重大挑战。

Title: Prompt Engineering Large Language Models' Forecasting Capabilities

Authors: Philipp Schoenegger, Cameron R. Jones, Philip E. Tetlock, Barbara Mellers
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.01578
Pdf URL: https://arxiv.org/pdf/2506.01578
Copy Paste: [[2506.01578]] Prompt Engineering Large Language Models' Forecasting Capabilities(https://arxiv.org/abs/2506.01578)
Keywords: language model, gpt, prompt
Abstract: Large language model performance can be improved in a large number of ways. Many such techniques, like fine-tuning or advanced tool usage, are time-intensive and expensive. Although prompt engineering is significantly cheaper and often works for simpler tasks, it remains unclear whether prompt engineering suffices for more complex domains like forecasting. Here we show that small prompt modifications rarely boost forecasting accuracy beyond a minimal baseline. In our first study, we tested 38 prompts across Claude 3.5 Sonnet, Claude 3.5 Haiku, GPT-4o, and Llama 3.1 405B. In our second, we introduced compound prompts and prompts from external sources, also including the reasoning models o1 and o1-mini. Our results show that most prompts lead to negligible gains, although references to base rates yield slight benefits. Surprisingly, some strategies showed strong negative effects on accuracy: especially encouraging the model to engage in Bayesian reasoning. These results suggest that, in the context of complex tasks like forecasting, basic prompt refinements alone offer limited gains, implying that more robust or specialized techniques may be required for substantial performance improvements in AI forecasting.
摘要：大型语言模型性能可以通过多种方式提高。许多这样的技术，例如微调或高级工具使用情况，都是耗时且昂贵的。尽管迅速的工程非常便宜，并且通常用于更简单的任务，但尚不清楚迅速工程是否足以容纳更复杂的域，例如预测。在这里，我们表明，较小的及时修改很少提高预测准确性，而不是最少的基线。在我们的第一个研究中，我们在Claude 3.5十四行诗，Claude 3.5 Haiku，GPT-4O和Llama 3.1 405B上测试了38个提示。在第二个中，我们引入了外部来源的复合提示和提示，还包括推理模型O1和O1 Mini。我们的结果表明，大多数提示会导致可忽略不计，尽管提及基本利率会带来轻微的好处。令人惊讶的是，某些策略对准确性表现出强烈的负面影响：特别是鼓励模型进行贝叶斯推理。这些结果表明，在复杂的任务中，仅基本及时的精制就会提供有限的收益，这意味着可能需要更强大或专业的技术来改善AI预测的性能。

Title: Unified Large Language Models for Misinformation Detection in Low-Resource Linguistic Settings

Authors: Muhammad Islam, Javed Ali Khan, Mohammed Abaker, Ali Daud, Azeem Irshad
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.01587
Pdf URL: https://arxiv.org/pdf/2506.01587
Copy Paste: [[2506.01587]] Unified Large Language Models for Misinformation Detection in Low-Resource Linguistic Settings(https://arxiv.org/abs/2506.01587)
Keywords: language model, llm
Abstract: The rapid expansion of social media platforms has significantly increased the dissemination of forged content and misinformation, making the detection of fake news a critical area of research. Although fact-checking efforts predominantly focus on English-language news, there is a noticeable gap in resources and strategies to detect news in regional languages, such as Urdu. Advanced Fake News Detection (FND) techniques rely heavily on large, accurately labeled datasets. However, FND in under-resourced languages like Urdu faces substantial challenges due to the scarcity of extensive corpora and the lack of validated lexical resources. Current Urdu fake news datasets are often domain-specific and inaccessible to the public. They also lack human verification, relying mainly on unverified English-to-Urdu translations, which compromises their reliability in practical applications. This study highlights the necessity of developing reliable, expert-verified, and domain-independent Urdu-enhanced FND datasets to improve fake news detection in Urdu and other resource-constrained languages. This paper presents the first benchmark large FND dataset for Urdu news, which is publicly available for validation and deep analysis. We also evaluate this dataset using multiple state-of-the-art pre-trained large language models (LLMs), such as XLNet, mBERT, XLM-RoBERTa, RoBERTa, DistilBERT, and DeBERTa. Additionally, we propose a unified LLM model that outperforms the others with different embedding and feature extraction techniques. The performance of these models is compared based on accuracy, F1 score, precision, recall, and human judgment for vetting the sample results of news.

Title: Statement-Tuning Enables Efficient Cross-lingual Generalization in Encoder-only Models

Authors: Ahmed Elshabrawy, Thanh-Nhi Nguyen, Yeeun Kang, Lihan Feng, Annant Jain, Faadil Abdullah Shaikh, Jonibek Mansurov, Mohamed Fazli Mohamed Imam, Jesus-German Ortiz-Barajas, Rendi Chevi, Alham Fikri Aji
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.01592
Pdf URL: https://arxiv.org/pdf/2506.01592
Copy Paste: [[2506.01592]] Statement-Tuning Enables Efficient Cross-lingual Generalization in Encoder-only Models(https://arxiv.org/abs/2506.01592)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) excel in zero-shot and few-shot tasks, but achieving similar performance with encoder-only models like BERT and RoBERTa has been challenging due to their architecture. However, encoders offer advantages such as lower computational and memory costs. Recent work adapts them for zero-shot generalization using Statement Tuning, which reformulates tasks into finite templates. We extend this approach to multilingual NLP, exploring whether encoders can achieve zero-shot cross-lingual generalization and serve as efficient alternatives to memory-intensive LLMs for low-resource languages. Our results show that state-of-the-art encoder models generalize well across languages, rivaling multilingual LLMs while being more efficient. We also analyze multilingual Statement Tuning dataset design, efficiency gains, and language-specific generalization, contributing to more inclusive and resource-efficient NLP models. We release our code and models.
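The idea of reformulating a task into templated statements can be sketched in a few lines of Python. This is an illustrative assumption about the general shape of the technique, not the authors' released code: the template text, the function names, and the `score_fn` callable (standing in for an encoder that scores how likely a statement is to be true) are all hypothetical.

```python
def to_statements(text, labels,
                  template="{text} The sentiment of this review is {label}."):
    """Turn one classification example into one statement per candidate label."""
    return [template.format(text=text, label=label) for label in labels]


def predict_label(text, labels, score_fn,
                  template="{text} The sentiment of this review is {label}."):
    """Pick the label whose statement the scorer rates as most likely true.

    `score_fn` is a placeholder for an encoder-based true/false classifier;
    any callable mapping a string to a float works for this sketch.
    """
    statements = to_statements(text, labels, template)
    scores = [score_fn(s) for s in statements]
    best = max(range(len(labels)), key=lambda i: scores[i])
    return labels[best]
```

Because every task reduces to the same true/false decision over statements, a single encoder can in principle be reused across tasks by swapping the template.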

Title: IndicRAGSuite: Large-Scale Datasets and a Benchmark for Indian Language RAG Systems

Authors: Pasunuti Prasanjith, Prathmesh B More, Anoop Kunchukuttan, Raj Dabre
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.01615
Pdf URL: https://arxiv.org/pdf/2506.01615
Copy Paste: [[2506.01615]] IndicRAGSuite: Large-Scale Datasets and a Benchmark for Indian Language RAG Systems(https://arxiv.org/abs/2506.01615)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) systems enable language models to access relevant information and generate accurate, well-grounded, and contextually informed responses. However, for Indian languages, the development of high-quality RAG systems is hindered by the lack of two critical resources: (1) evaluation benchmarks for retrieval and generation tasks, and (2) large-scale training datasets for multilingual retrieval. Most existing benchmarks and datasets are centered around English or high-resource languages, making it difficult to extend RAG capabilities to the diverse linguistic landscape of India. To address the lack of evaluation benchmarks, we create IndicMSMarco, a multilingual benchmark for evaluating retrieval quality and response generation in 13 Indian languages, built via manual translation of 1,000 diverse queries from the MS MARCO dev set. To address the need for training data, we build a large-scale dataset of (question, answer, relevant passage) tuples derived from the Wikipedias of 19 Indian languages using state-of-the-art LLMs. Additionally, we include translated versions of the original MS MARCO dataset to further enrich the training data and ensure alignment with real-world information-seeking tasks. Resources are available here: this https URL

Title: Domain Lexical Knowledge-based Word Embedding Learning for Text Classification under Small Data

Authors: Zixiao Zhu, Kezhi Mao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.01621
Pdf URL: https://arxiv.org/pdf/2506.01621
Copy Paste: [[2506.01621]] Domain Lexical Knowledge-based Word Embedding Learning for Text Classification under Small Data(https://arxiv.org/abs/2506.01621)
Keywords: language model
Abstract: Pre-trained language models such as BERT have been proved to be powerful in many natural language processing tasks. But in some text classification applications such as emotion recognition and sentiment analysis, BERT may not lead to satisfactory performance. This often happens in applications where keywords play critical roles in the prediction of class labels. Our investigation found that the root cause of the problem is that the context-based BERT embedding of the keywords may not be discriminative enough to produce discriminative text representation for classification. Motivated by this finding, we develop a method to enhance word embeddings using domain-specific lexical knowledge. The knowledge-based embedding enhancement model projects the BERT embedding into a new space where within-class similarity and between-class difference are maximized. To implement the knowledge-based word embedding enhancement model, we also develop a knowledge acquisition algorithm for automatically collecting lexical knowledge from online open sources. Experiment results on three classification tasks, including sentiment analysis, emotion recognition and question answering, have shown the effectiveness of our proposed word embedding enhancing model. The codes and datasets are in this https URL.

Title: Cross-Lingual Generalization and Compression: From Language-Specific to Shared Neurons

Authors: Frederick Riemenschneider, Anette Frank
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.01629
Pdf URL: https://arxiv.org/pdf/2506.01629
Copy Paste: [[2506.01629]] Cross-Lingual Generalization and Compression: From Language-Specific to Shared Neurons(https://arxiv.org/abs/2506.01629)
Keywords: language model, llm
Abstract: Multilingual language models (MLLMs) have demonstrated remarkable abilities to transfer knowledge across languages, despite being trained without explicit cross-lingual supervision. We analyze the parameter spaces of three MLLMs to study how their representations evolve during pre-training, observing patterns consistent with compression: models initially form language-specific representations, which gradually converge into cross-lingual abstractions as training progresses. Through probing experiments, we observe a clear transition from uniform language identification capabilities across layers to more specialized layer functions. For deeper analysis, we focus on neurons that encode distinct semantic concepts. By tracing their development during pre-training, we show how they gradually align across languages. Notably, we identify specific neurons that emerge as increasingly reliable predictors for the same concepts across languages.

Title: ESGenius: Benchmarking LLMs on Environmental, Social, and Governance (ESG) and Sustainability Knowledge

Authors: Chaoyue He, Xin Zhou, Yi Wu, Xinjia Yu, Yan Zhang, Lei Zhang, Di Wang, Shengfei Lyu, Hong Xu, Xiaoqiao Wang, Wei Liu, Chunyan Miao
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.01646
Pdf URL: https://arxiv.org/pdf/2506.01646
Copy Paste: [[2506.01646]] ESGenius: Benchmarking LLMs on Environmental, Social, and Governance (ESG) and Sustainability Knowledge(https://arxiv.org/abs/2506.01646)
Keywords: language model, llm, retrieval-augmented generation
Abstract: We introduce ESGenius, a comprehensive benchmark for evaluating and enhancing the proficiency of Large Language Models (LLMs) in Environmental, Social and Governance (ESG) and sustainability-focused question answering. ESGenius comprises two key components: (i) ESGenius-QA, a collection of 1,136 multiple-choice questions generated by LLMs and rigorously validated by domain experts, covering a broad range of ESG pillars and sustainability topics. Each question is systematically linked to its corresponding source text, enabling transparent evaluation and supporting retrieval-augmented generation (RAG) methods; and (ii) ESGenius-Corpus, a meticulously curated repository of 231 foundational frameworks, standards, reports and recommendation documents from seven authoritative sources. Moreover, to fully assess the capabilities and adaptation potential of the model, we implement a rigorous two-stage evaluation protocol: Zero-Shot and RAG. Extensive experiments across 50 LLMs (ranging from 0.5B to 671B parameters) demonstrate that state-of-the-art models achieve only moderate performance in zero-shot settings, with accuracies typically around 55-70%, highlighting ESGenius's challenging nature for LLMs in interdisciplinary contexts. However, models employing RAG show significant performance improvements, particularly for smaller models. For example, "DeepSeek-R1-Distill-Qwen-14B" improves from 63.82% (zero-shot) to 80.46% with RAG. These results underscore the necessity of grounding responses in authoritative sources for enhanced ESG understanding. To the best of our knowledge, ESGenius is the first benchmark curated for LLMs and the relevant enhancement technologies that focuses on ESG and sustainability topics.

Title: Cross-Lingual Transfer of Cultural Knowledge: An Asymmetric Phenomenon

Authors: Chen Zhang, Zhiyuan Liao, Yansong Feng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.01675
Pdf URL: https://arxiv.org/pdf/2506.01675
Copy Paste: [[2506.01675]] Cross-Lingual Transfer of Cultural Knowledge: An Asymmetric Phenomenon(https://arxiv.org/abs/2506.01675)
Keywords: language model, llm
Abstract: Despite substantial research efforts evaluating how well large language models (LLMs) handle global cultural diversity, the mechanisms behind their cultural knowledge acquisition, particularly in multilingual settings, remain unclear. We study this question by investigating how cultural knowledge transfers across languages during language adaptation of LLMs. We introduce an interpretable framework for studying this transfer, ensuring training data transparency and controlling transfer effects. Through a study of four non-Anglophonic cultures, we observe bidirectional cultural transfer between English and other high-resource languages, while low-resource languages primarily transfer knowledge to English with limited reverse flow. To explain this asymmetric phenomenon, we propose a frequency-based hypothesis: cultural knowledge appearing more frequently in the pretraining data transfers more easily, which is supported by empirical analysis of the training corpora.

Title: StochasTok: Improving Fine-Grained Subword Understanding in LLMs

Authors: Anya Sims, Thom Foster, Klara Kaleb, Tuan-Duy H. Nguyen, Joseph Lee, Jakob N. Foerster, Yee Whye Teh, Cong Lu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.01687
Pdf URL: https://arxiv.org/pdf/2506.01687
Copy Paste: [[2506.01687]] StochasTok: Improving Fine-Grained Subword Understanding in LLMs(https://arxiv.org/abs/2506.01687)
Keywords: language model, llm
Abstract: Subword-level understanding is integral to numerous tasks, including understanding multi-digit numbers, spelling mistakes, abbreviations, rhyming, and wordplay. Despite this, current large language models (LLMs) still often struggle with seemingly simple subword-level tasks like "How many 'r's in 'strawberry'?". A key factor behind these failures is tokenization, which obscures the fine-grained structure of words. Current alternatives, such as character-level and dropout tokenization methods, significantly increase computational costs and provide inconsistent improvements. In this paper we revisit tokenization and introduce StochasTok, a simple, efficient stochastic tokenization scheme that randomly splits tokens during training, allowing LLMs to 'see' their internal structure. Our experiments show that pretraining with StochasTok substantially improves LLMs' downstream performance across multiple subword-level language games, including character counting, substring identification, and math tasks. Furthermore, StochasTok's simplicity allows seamless integration at any stage of the training pipeline; and we demonstrate that post-training with StochasTok can instill improved subword understanding into existing pretrained models, thus avoiding costly pretraining from scratch. These dramatic improvements achieved with a minimal change suggest StochasTok holds exciting potential when applied to larger, more capable models. Code open-sourced at: this https URL.
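To make "randomly splits tokens during training" concrete, here is a simplified Python sketch of stochastic token splitting. It is an assumption-laden illustration, not the StochasTok implementation: it works on string tokens with a toy vocabulary, and the function name `stochastic_split`, the split probability `p`, and the rule that both halves must already be in the vocabulary are choices made for this example.

```python
import random


def stochastic_split(tokens, vocab, p=0.1, rng=None):
    """Randomly replace some tokens with two shorter in-vocabulary pieces.

    Hypothetical sketch: with probability `p`, a token that can be cut into
    two pieces that both exist in `vocab` is replaced by one such pair.
    The concatenated text is always preserved.
    """
    if rng is None:
        rng = random.Random()
    out = []
    for tok in tokens:
        # Every cut position whose left and right pieces are both known tokens.
        splits = [(tok[:i], tok[i:]) for i in range(1, len(tok))
                  if tok[:i] in vocab and tok[i:] in vocab]
        if splits and rng.random() < p:
            out.extend(rng.choice(splits))
        else:
            out.append(tok)
    return out
```

For example, with `vocab = {"straw", "berry"}` and `p=1.0`, the single token `"strawberry"` becomes `["straw", "berry"]`, so the model sees the same text under more than one segmentation.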

Title: When LLMs Team Up: The Emergence of Collaborative Affective Computing

Authors: Wenna Lai, Haoran Xie, Guandong Xu, Qing Li, S. Joe Qin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.01698
Pdf URL: https://arxiv.org/pdf/2506.01698
Copy Paste: [[2506.01698]] When LLMs Team Up: The Emergence of Collaborative Affective Computing(https://arxiv.org/abs/2506.01698)
Keywords: language model, llm, hallucination
Abstract: Affective Computing (AC) is essential in bridging the gap between human emotional experiences and machine understanding. Traditionally, AC tasks in natural language processing (NLP) have been approached through pipeline architectures, which often suffer from structure rigidity that leads to inefficiencies and limited adaptability. The advent of Large Language Models (LLMs) has revolutionized this field by offering a unified approach to affective understanding and generation tasks, enhancing the potential for dynamic, real-time interactions. However, LLMs face cognitive limitations in affective reasoning, such as misinterpreting cultural nuances or contextual emotions, and hallucination problems in decision-making. To address these challenges, recent research advocates for LLM-based collaboration systems that emphasize interactions among specialized models and LLMs, mimicking human-like affective intelligence through the synergy of emotional and rational thinking that aligns with Dual Process Theory in psychology. This survey aims to provide a comprehensive overview of LLM-based collaboration systems in AC, exploring from structured collaborations to autonomous collaborations. Specifically, it includes: (1) A systematic review of existing methods, focusing on collaboration strategies, mechanisms, key functions, and applications; (2) Experimental comparisons of collaboration strategies across representative tasks in affective understanding and generation; (3) An analysis highlighting the potential of these systems to enhance robustness and adaptability in complex affective reasoning; (4) A discussion of key challenges and future research directions to further advance the field. This work is the first to systematically explore collaborative intelligence with LLMs in AC, paving the way for more powerful applications that approach human-like social intelligence.

Title: mdok of KInIT: Robustly Fine-tuned LLM for Binary and Multiclass AI-Generated Text Detection

Authors: Dominik Macko
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.01702
Pdf URL: https://arxiv.org/pdf/2506.01702
Copy Paste: [[2506.01702]] mdok of KInIT: Robustly Fine-tuned LLM for Binary and Multiclass AI-Generated Text Detection(https://arxiv.org/abs/2506.01702)
Keywords: language model, llm
Abstract: Large language models (LLMs) can generate high-quality text in multiple languages. Such texts are often not recognizable by humans as machine-generated, and LLMs therefore carry a potential for misuse (e.g., plagiarism, spam, disinformation spreading). Automated detection can help humans identify machine-generated texts; however, making it robust to out-of-distribution data remains challenging. This notebook describes our mdok approach to robust detection, based on fine-tuning smaller LLMs for text classification. It is applied to both subtasks of Voight-Kampff Generative AI Detection 2025, providing remarkable performance in binary detection as well as in multiclass (1st rank) classification of various cases of human-AI collaboration.

Title: Fairness Dynamics During Training

Authors: Krishna Patel, Nivedha Sivakumar, Barry-John Theobald, Luca Zappella, Nicholas Apostoloff
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.01709
Pdf URL: https://arxiv.org/pdf/2506.01709
Copy Paste: [[2506.01709]] Fairness Dynamics During Training(https://arxiv.org/abs/2506.01709)
Keywords: language model, llm
Abstract: We investigate fairness dynamics during Large Language Model (LLM) training to enable the diagnoses of biases and mitigations through training interventions like early stopping; we find that biases can emerge suddenly and do not always follow common performance metrics. We introduce two new metrics to evaluate fairness dynamics holistically during model pre-training: Average Rank and Jensen-Shannon Divergence by Parts. These metrics provide insights into the Pythia models' progression of biases in gender prediction of occupations on the WinoBias dataset. By monitoring these dynamics, we find that (1) Pythia-6.9b is biased towards men; it becomes more performant and confident predicting "male" than "female" during training, (2) via early-stopping, Pythia-6.9b can exchange 1.7% accuracy on LAMBADA for a 92.5% increase in fairness, and (3) larger models can exhibit more bias; Pythia-6.9b makes more assumptions about gender than Pythia-160m, even when a subject's gender is not specified.
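For readers unfamiliar with the divergence underlying one of the two metrics, here is a minimal Python sketch of the standard Jensen-Shannon divergence between two discrete distributions. The paper's "by Parts" variant is not reproduced here; the function names and the choice of base-2 logarithms are assumptions for illustration.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Standard Jensen-Shannon divergence: the average KL to the midpoint.

    Symmetric in p and q, and bounded in [0, 1] when using base-2 logs.
    """
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

A value of 0 means the two distributions (for instance, predicted probabilities over "male" and "female" for two prompts) are identical, while 1 means they never overlap.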

Title: Reasoning-Table: Exploring Reinforcement Learning for Table Reasoning

Authors: Fangyu Lei, Jinxiang Meng, Yiming Huang, Tinghong Chen, Yun Zhang, Shizhu He, Jun Zhao, Kang Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.01710
Pdf URL: https://arxiv.org/pdf/2506.01710
Copy Paste: [[2506.01710]] Reasoning-Table: Exploring Reinforcement Learning for Table Reasoning(https://arxiv.org/abs/2506.01710)
Keywords: language model
Abstract: Table reasoning, encompassing tasks such as table question answering, fact verification, and text-to-SQL, requires precise understanding of structured tabular data, coupled with numerical computation and code manipulation for effective inference. Supervised fine-tuning (SFT) approaches have achieved notable success but often struggle with generalization and robustness due to biases inherent in imitative learning. We introduce Reasoning-Table, the first application of reinforcement learning (RL) to table reasoning, achieving state-of-the-art performance. Through rigorous data preprocessing, reward design, and tailored training strategies, our method leverages simple rule-based outcome rewards to outperform SFT across multiple benchmarks. Unified training across diverse tasks enables Reasoning-Table to emerge as a robust table reasoning large language model, surpassing larger proprietary models like Claude-3.7-Sonnet by 4.0% on table reasoning benchmarks. The approach also achieves excellent performance on text-to-SQL tasks, reaching 68.3% performance on the BIRD dev dataset with a 7B model. Further experiments demonstrate that Reasoning-Table enhances the model's generalization capabilities and robustness.

Title: SRPO: Enhancing Multimodal LLM Reasoning via Reflection-Aware Reinforcement Learning

Authors: Zhongwei Wan, Zhihao Dou, Che Liu, Yu Zhang, Dongfei Cui, Qinjian Zhao, Hui Shen, Jing Xiong, Yi Xin, Yifan Jiang, Yangfan He, Mi Zhang, Shen Yan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.01713
Pdf URL: https://arxiv.org/pdf/2506.01713
Copy Paste: [[2506.01713]] SRPO: Enhancing Multimodal LLM Reasoning via Reflection-Aware Reinforcement Learning(https://arxiv.org/abs/2506.01713)
Keywords: language model, llm
Abstract: Multimodal large language models (MLLMs) have shown promising capabilities in reasoning tasks, yet still struggle with complex problems requiring explicit self-reflection and self-correction, especially compared to their unimodal text-based counterparts. Existing reflection methods are simplistic and struggle to generate meaningful and instructive feedback, as the reasoning ability and knowledge limits of pre-trained models are largely fixed during initial training. To overcome these challenges, we propose Multimodal Self-Reflection enhanced reasoning with Group Relative Policy Optimization (SRPO), a two-stage reflection-aware reinforcement learning (RL) framework explicitly designed to enhance multimodal LLM reasoning. In the first stage, we construct a high-quality, reflection-focused dataset under the guidance of an advanced MLLM, which generates reflections based on initial responses to help the policy model learn both reasoning and self-reflection. In the second stage, we introduce a novel reward mechanism within the GRPO framework that encourages concise and cognitively meaningful reflection while avoiding redundancy. Extensive experiments across multiple multimodal reasoning benchmarks, including MathVista, MathVision, MathVerse, and MMMU-Pro, using Qwen-2.5-VL-7B and Qwen-2.5-VL-32B demonstrate that SRPO significantly outperforms state-of-the-art models, achieving notable improvements in both reasoning accuracy and reflection quality.

Title: Tug-of-war between idiom's figurative and literal meanings in LLMs

Authors: Soyoung Oh, Xinting Huang, Mathis Pink, Michael Hahn, Vera Demberg
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.01723
Pdf URL: https://arxiv.org/pdf/2506.01723
Copy Paste: [[2506.01723]] Tug-of-war between idiom's figurative and literal meanings in LLMs(https://arxiv.org/abs/2506.01723)
Keywords: language model, llm
Abstract: Idioms present a unique challenge for language models due to their non-compositional figurative meanings, which often strongly diverge from the idiom's literal interpretation. This duality requires a model to learn to represent both meanings and to decide between them in order to interpret an idiom figuratively or literally. In this paper, we employ tools from mechanistic interpretability to trace how a large pretrained causal transformer (LLama3.2-1B-base) deals with this ambiguity. We localize three steps of idiom processing: First, the idiom's figurative meaning is retrieved in early attention and MLP sublayers. We identify specific attention heads which boost the figurative meaning of the idiom while suppressing the idiom's literal interpretation. The model subsequently carries the figurative representation through an intermediate path. Meanwhile, a parallel bypass route forwards the literal interpretation, ensuring that both readings remain available. Overall, our findings provide mechanistic evidence for idiom comprehension in an autoregressive transformer.

Title: Common Corpus: The Largest Collection of Ethical Data for LLM Pre-Training

Authors: Pierre-Carl Langlais, Carlos Rosas Hinostroza, Mattia Nee, Catherine Arnett, Pavel Chizhov, Eliot Krzystof Jones, Irène Girard, David Mach, Anastasia Stasenko, Ivan P. Yamshchikov
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.01732
Pdf URL: https://arxiv.org/pdf/2506.01732
Copy Paste: [[2506.01732]] Common Corpus: The Largest Collection of Ethical Data for LLM Pre-Training(https://arxiv.org/abs/2506.01732)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) are pre-trained on large amounts of data from different sources and domains. These data most often contain trillions of tokens with large portions of copyrighted or proprietary content, which hinders the usage of such models under AI legislation. This raises the need for truly open pre-training data that is compliant with the data security regulations. In this paper, we introduce Common Corpus, the largest open dataset for language model pre-training. The data assembled in Common Corpus are either uncopyrighted or under permissible licenses and amount to about two trillion tokens. The dataset contains a wide variety of languages, ranging from the main European languages to low-resource ones rarely present in pre-training datasets; in addition, it includes a large portion of code data. The diversity of data sources in terms of covered domains and time periods opens up paths for both research and entrepreneurial needs in diverse areas of knowledge. In this technical report, we present the detailed provenance of data assembling and the details of dataset filtering and curation. As it is already used by industry leaders such as Anthropic and by multiple LLM training projects, we believe that Common Corpus will become critical infrastructure for open science research in LLMs.

Title: Benford's Curse: Tracing Digit Bias to Numerical Hallucination in LLMs

Authors: Jiandong Shao, Yao Lu, Jianfei Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.01734
Pdf URL: https://arxiv.org/pdf/2506.01734
Copy Paste: [[2506.01734]] Benford's Curse: Tracing Digit Bias to Numerical Hallucination in LLMs(https://arxiv.org/abs/2506.01734)
Keywords: language model, llm, hallucination
Abstract: Large Language Models (LLMs) exhibit impressive performance on complex reasoning tasks, yet they frequently fail on basic numerical problems, producing incorrect outputs. Inspired by Benford's Law, a statistical pattern where lower digits occur more frequently as leading digits, we hypothesize that the long-tailed digit distributions in web-collected corpora may be learned by LLMs during pretraining, leading to biased numerical generation. To investigate the hypothesis, we first examine whether digit frequencies in a pretraining corpus (OLMo2) follow Benford's law. We then construct an evaluation benchmark with uniformly distributed ground-truth digits across seven numerical reasoning tasks. Our evaluation results demonstrate that leading open-source LLMs show a consistent pattern of digit bias that resembles Benford's law. Through logit-lens tracing and neuron-level dissection, we identify that this bias arises predominantly from a small subset of highly digit-selective feed-forward network (FFN) neurons in the deeper layers. Finally, we demonstrate that pruning these neurons mitigates imbalanced overgeneration and partially corrects erroneous outputs, providing causal evidence that fine-grained pretraining digit bias can propagate into model behavior. Our findings reveal a fundamental connection between corpus-level statistics and symbolic failure modes in LLMs, offering a new lens for diagnosing and mitigating hallucinations in numerical tasks.
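For context, Benford's Law gives leading digit d the probability log10(1 + 1/d), so 1 leads about 30.1% of the time and 9 about 4.6%. The Python sketch below computes that reference distribution and the empirical leading-digit frequencies of a list of numbers. It is not the authors' analysis code; the helper names and the restriction to positive integers are assumptions for illustration.

```python
import math
from collections import Counter


def benford_probs():
    """Benford's Law: P(d) = log10(1 + 1/d) for leading digits d = 1..9."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def leading_digit_freqs(numbers):
    """Empirical leading-digit frequencies of the positive integers in `numbers`.

    Hypothetical helper: non-positive values are skipped, and an input with
    no positive values yields all-zero frequencies.
    """
    digits = [int(str(n)[0]) for n in numbers if n > 0]
    counts = Counter(digits)
    total = len(digits)
    if total == 0:
        return {d: 0.0 for d in range(1, 10)}
    return {d: counts.get(d, 0) / total for d in range(1, 10)}
```

Comparing the two dictionaries digit by digit is one simple way to check whether a corpus sample or a set of model outputs skews toward low leading digits.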
摘要：大型语言模型（LLMS）在复杂的推理任务上表现出令人印象深刻的性能，但是它们经常在基本的数值问题上失败，从而产生不正确的输出。受本福德定律的启发 - 一种统计模式，较低的数字作为领先数字更频繁地发生 - 我们假设LLM在预处理过程中可能会学会网络收集的Corpora中长尾数字的分布，从而导致数值产生偏见。为了调查该假设，我们首先检查了数字是否遵循本福德定律（OLMO2）中的频率（OLMO2）。然后，我们在七个数值推理任务中构建具有均匀分布的地面真相数字的评估基准。我们的评估结果表明，领先的开源LLMS显示出类似于本福德定律的数字偏见的一致模式。通过logit镜头追踪和神经元级解剖，我们确定这种偏见主要来自更深层中高度数字选择性的前馈网络（FFN）神经元的一小部分。最后，我们证明，修剪这些神经元会减轻过度代理并部分纠正错误的输出，从而提供了因果证据，表明细颗粒的数字偏见可以传播到模型行为中。我们的发现揭示了LLMS中语料库级统计数据和符号故障模式之间的基本联系，为诊断和减轻数值任务中的幻觉提供了新的镜头。
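
For readers unfamiliar with Benford's law, the expected share of leading digit d is log10(1 + 1/d). The sketch below is not the paper's code; it only compares that expected distribution with the leading digits of a toy list of numbers. The function names and the choice of powers of 2 as sample data are assumptions made for illustration.

```python
import math
from collections import Counter


def benford_expected():
    """Expected leading-digit probabilities under Benford's law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def leading_digit_freq(numbers):
    """Observed leading-digit frequencies for positive integers."""
    counts = Counter(int(str(n)[0]) for n in numbers if n > 0)
    total = sum(counts.values())
    return {d: counts.get(d, 0) / total for d in range(1, 10)}


if __name__ == "__main__":
    # Toy data (assumption): powers of 2 are a classic Benford-like sequence.
    sample = [2 ** k for k in range(1000)]
    expected = benford_expected()
    observed = leading_digit_freq(sample)
    for d in range(1, 10):
        print(d, round(expected[d], 3), round(observed[d], 3))
```

The same comparison could be run on digits extracted from a corpus sample or from model outputs; how the paper tokenizes and counts digits is not specified here.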

Title: Thinking in Character: Advancing Role-Playing Agents with Role-Aware Reasoning

Authors: Yihong Tang, Kehai Chen, Muyun Yang, Zhengyu Niu, Jing Li, Tiejun Zhao, Min Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.01748
Pdf URL: https://arxiv.org/pdf/2506.01748
Copy Paste: [[2506.01748]] Thinking in Character: Advancing Role-Playing Agents with Role-Aware Reasoning(https://arxiv.org/abs/2506.01748)
Keywords: language model, llm, agent
Abstract: The advancement of Large Language Models (LLMs) has spurred significant interest in Role-Playing Agents (RPAs) for applications such as emotional companionship and virtual interaction. However, recent RPAs are often built on explicit dialogue data, lacking deep, human-like internal thought processes, resulting in superficial knowledge and style expression. While Large Reasoning Models (LRMs) can be employed to simulate character thought, their direct application is hindered by attention diversion (i.e., RPAs forget their role) and style drift (i.e., overly formal and rigid reasoning rather than character-consistent reasoning). To address these challenges, this paper introduces a novel Role-Aware Reasoning (RAR) method, which consists of two important stages: Role Identity Activation (RIA) and Reasoning Style Optimization (RSO). RIA explicitly guides the model with character profiles during reasoning to counteract attention diversion, and then RSO aligns reasoning style with the character and scene via LRM distillation to mitigate style drift. Extensive experiments demonstrate that the proposed RAR significantly enhances the performance of RPAs by effectively addressing attention diversion and style drift.
摘要：大型语言模型（LLM）的进步激发了人们对角色扮演代理（RPA）的重大兴趣，例如情感陪伴和虚拟互动等应用。但是，最近的RPA通常是基于明确的对话数据建立的，这些数据缺乏深层，类似人类的内部思维过程，从而产生了肤浅的知识和风格表达。尽管可以使用大型推理模型（LRM）来模拟性格思想，但他们的直接应用受到关注转移（即RPA忘记了角色）和风格漂移（即过于正式和僵化的推理，而不是角色一致的推理）会阻碍其直接应用。为了应对这些挑战，本文介绍了一种新颖的角色感知推理（RAR）方法，该方法包括两个重要阶段：角色认同激活（RIA）和推理样式优化（RSO）。 RIA在推理过程中明确指导模型，以抵消注意力转移，然后RSO通过LRM蒸馏将推理样式与角色和场景保持一致，以减轻风格的漂移。广泛的实验表明，提出的RAR通过有效解决注意力转移和样式漂移可以显着提高RPA的性能。

Title: MaXIFE: Multilingual and Cross-lingual Instruction Following Evaluation

Authors: Yile Liu, Ziwei Ma, Xiu Jiang, Jinglu Hu, Jing Chang, Liang Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.01776
Pdf URL: https://arxiv.org/pdf/2506.01776
Copy Paste: [[2506.01776]] MaXIFE: Multilingual and Cross-lingual Instruction Following Evaluation(https://arxiv.org/abs/2506.01776)
Keywords: language model, llm
Abstract: With the rapid adoption of large language models (LLMs) in natural language processing, the ability to follow instructions has emerged as a key metric for evaluating their practical utility. However, existing evaluation methods often focus on single-language scenarios, overlooking the challenges and differences present in multilingual and cross-lingual contexts. To address this gap, we introduce MaXIFE: a comprehensive evaluation benchmark designed to assess instruction-following capabilities across 23 languages with 1,667 verifiable instruction tasks. MaXIFE integrates both Rule-Based Evaluation and Model-Based Evaluation, ensuring a balance of efficiency and accuracy. We applied MaXIFE to evaluate several leading commercial and open-source LLMs, establishing baseline results for future comparisons. By providing a standardized tool for multilingual instruction-following evaluation, MaXIFE aims to advance research and development in natural language processing.
摘要：随着在自然语言处理中迅速采用大语言模型（LLM），遵循指示的能力已成为评估其实际实用性的关键指标。但是，现有的评估方法通常集中在单语言方案上，忽略了多语言和跨语性环境中存在的挑战和差异。为了解决这一差距，我们介绍了Maxife：一个全面的评估基准测试，旨在评估23种语言的教学遵循功能，并具有1,667个可验证的教学任务。 Maxife同时整合了基于规则的评估和基于模型的评估，从而确保了效率和准确性的平衡。我们应用了Maxife评估几个领先的商业和开源LLM，从而建立了基线结果以进行将来的比较。通过提供用于遵循多语言指导评估的标准化工具，Maxife旨在推进自然语言处理中的研发。

Title: iQUEST: An Iterative Question-Guided Framework for Knowledge Base Question Answering

Authors: Shuai Wang, Yinan Yu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.01784
Pdf URL: https://arxiv.org/pdf/2506.01784
Copy Paste: [[2506.01784]] iQUEST: An Iterative Question-Guided Framework for Knowledge Base Question Answering(https://arxiv.org/abs/2506.01784)
Keywords: language model, llm
Abstract: While Large Language Models (LLMs) excel at many natural language processing tasks, they often suffer from factual inaccuracies in knowledge-intensive scenarios. Integrating external knowledge resources, particularly knowledge graphs (KGs), provides a transparent and updatable foundation for more reliable reasoning. Knowledge Base Question Answering (KBQA), which queries and reasons over KGs, is central to this effort, especially for complex, multi-hop queries. However, multi-hop reasoning poses two key challenges: (1) maintaining coherent reasoning paths, and (2) avoiding prematurely discarding critical multi-hop connections. To address these issues, we introduce iQUEST, a question-guided KBQA framework that iteratively decomposes complex queries into simpler sub-questions, ensuring a structured and focused reasoning trajectory. Additionally, we integrate a Graph Neural Network (GNN) to look ahead and incorporate 2-hop neighbor information at each reasoning step. This dual approach strengthens the reasoning process, enabling the model to explore viable paths more effectively. Detailed experiments demonstrate the consistent improvement delivered by iQUEST across four benchmark datasets and four LLMs.
摘要：尽管大型语言模型（LLMS）在许多自然语言处理任务上都表现出色，但它们通常在知识密集型场景中遭受事实不准确。集成外部知识资源，尤其是知识图（KGS），为更可靠的推理提供了透明且可更新的基础。知识基础问题回答（KBQA）是对KGS的查询和原因，这对于这项工作至关重要，尤其是对于复杂的多跳查询而言。但是，多跳跃推理提出了两个关键挑战：（1）〜保持连贯的推理路径，（2）〜避免过早丢弃关键的多跳连接。为了解决这些问题，我们介绍了一个问题引导的KBQA框架，它迭代地将复杂的查询分解为更简单的子问题，从而确保了结构化和集中的推理轨迹。此外，我们集成了图形神经网络（GNN），以向前看并在每个推理步骤中合并2-HOP邻居信息。这种双重方法增强了推理过程，使模型能够更有效地探索可行的路径。详细的实验表明，IQuest在四个基准数据集和四个LLM中提供的一致改进。

Title: Human-Centric Evaluation for Foundation Models

Authors: Yijin Guo, Kaiyuan Ji, Xiaorong Zhu, Junying Wang, Farong Wen, Chunyi Li, Zicheng Zhang, Guangtao Zhai
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.01793
Pdf URL: https://arxiv.org/pdf/2506.01793
Copy Paste: [[2506.01793]] Human-Centric Evaluation for Foundation Models(https://arxiv.org/abs/2506.01793)
Keywords: llm
Abstract: Currently, nearly all evaluations of foundation models focus on objective metrics, emphasizing quiz performance to define model capabilities. While this model-centric approach enables rapid performance assessment, it fails to reflect authentic human experiences. To address this gap, we propose a Human-Centric subjective Evaluation (HCE) framework, focusing on three core dimensions: problem-solving ability, information quality, and interaction experience. Through experiments involving DeepSeek R1, OpenAI o3-mini, Grok 3, and Gemini 2.5, we conduct over 540 participant-driven evaluations, where humans and models collaborate on open-ended research tasks, yielding a comprehensive subjective dataset. This dataset captures diverse user feedback across multiple disciplines, revealing distinct model strengths and adaptability. Our findings highlight Grok 3's superior performance, followed by DeepSeek R1 and Gemini 2.5, with OpenAI o3-mini lagging behind. By offering a novel framework and a rich dataset, this study not only enhances subjective evaluation methodologies but also lays the foundation for standardized, automated assessments, advancing LLM development for research and practical scenarios. Our dataset link is this https URL.
摘要：当前，几乎所有对基础模型的评估都集中在客观指标上，强调测验性能以定义模型功能。尽管这种以模型为中心的方法可以快速绩效评估，但它无法反映出真实的人类经验。为了解决这一差距，我们提出了一个以人为本的主观评估（HCE）框架，重点是三个核心维度：解决问题的能力，信息质量和互动经验。通过涉及DeepSeek R1，Openai O3 Mini，Grok 3和Gemini 2.5的实验，我们进行了540多个参与者驱动的评估，在该评估中，人类和模型在开放式研究任务上合作，产生了全面的主观数据集。该数据集捕获了跨多个学科的各种用户反馈，从而揭示了不同的模型优势和适应性。我们的发现突出了Grok 3的出色表现，其次是DeepSeek R1和Gemini 2.5，Openai O3迷你落后。通过提供新颖的框架和丰富的数据集，本研究不仅可以增强主观评估方法，而且还为标准化的自动评估奠定了基础，从而推进了LLM的研究和实际情况。我们的数据集链接是此HTTPS URL。

Title: Read it in Two Steps: Translating Extremely Low-Resource Languages with Code-Augmented Grammar Books

Authors: Chen Zhang, Jiuheng Lin, Xiao Liu, Zekai Zhang, Yansong Feng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.01796
Pdf URL: https://arxiv.org/pdf/2506.01796
Copy Paste: [[2506.01796]] Read it in Two Steps: Translating Extremely Low-Resource Languages with Code-Augmented Grammar Books(https://arxiv.org/abs/2506.01796)
Keywords: language model, llm
Abstract: While large language models (LLMs) have shown promise in translating extremely low-resource languages using resources like dictionaries, the effectiveness of grammar books remains debated. This paper investigates the role of grammar books in translating extremely low-resource languages by decomposing it into two key steps: grammar rule retrieval and application. To facilitate the study, we introduce ZhuangRules, a modularized dataset of grammar rules and their corresponding test sentences. Our analysis reveals that rule retrieval constitutes a primary bottleneck in grammar-based translation. Moreover, although LLMs can apply simple rules for translation when explicitly provided, they encounter difficulties in handling more complex rules. To address these challenges, we propose representing grammar rules as code functions, considering their similarities in structure and the benefit of code in facilitating LLM reasoning. Our experiments show that using code rules significantly boosts both rule retrieval and application, ultimately resulting in a 13.1% BLEU improvement in translation.
摘要：尽管大型语言模型（LLMS）在使用词典等资源来翻译极低的资源语言方面表现出了希望，但语法书籍的有效性仍在争论。本文通过将其分解为两个关键步骤：语法规则检索和应用来调查语法书籍在翻译极低的资源语言中的作用。为了促进这项研究，我们介绍了语法规则的模块化数据集及其相应的测试句子。我们的分析表明，规则检索构成基于语法的翻译中的主要瓶颈。此外，尽管LLM可以在明确提供时应用简单的规则进行翻译，但它们在处理更复杂的规则时会遇到困难。为了应对这些挑战，我们建议将语法规则表示为代码函数，考虑到它们在结构上的相似性以及代码的好处，以促进LLM推理。我们的实验表明，使用代码规则可显着提高规则检索和应用，最终导致翻译改善的13.1％。

Title: NAVER LABS Europe Submission to the Instruction-following Track

Authors: Beomseok Lee, Marcely Zanon Boito, Laurent Besacier, Ioan Calapodescu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.01808
Pdf URL: https://arxiv.org/pdf/2506.01808
Copy Paste: [[2506.01808]] NAVER LABS Europe Submission to the Instruction-following Track(https://arxiv.org/abs/2506.01808)
Keywords: llm
Abstract: In this paper we describe the NAVER LABS Europe submission to the instruction-following speech processing short track at IWSLT 2025. We participate in the constrained setting, developing systems that can simultaneously perform ASR, ST, and SQA tasks from English speech input into the following target languages: Chinese, Italian, and German. Our solution leverages two pretrained modules: (1) a speech-to-LLM embedding projector trained using representations from the SeamlessM4T-v2-large speech encoder; and (2) LoRA adapters trained on text data on top of Llama-3.1-8B-Instruct. These modules are jointly loaded and further instruction-tuned for 1K steps on multilingual and multimodal data to form our final system submitted for evaluation.
摘要：在本文中，我们将Naver Labs Europe提交到IWSLT 2025的指令跟踪语音处理的短轨道。我们参与受约束的设置，开发可以同时执行ASR，ST和SQA的系统，从英语语音输入中，将英语输入到以下目标语言中：中文，意大利语和德语。我们的解决方案利用了两个预验证的模块：（1）使用SeamlessM4T-V2-large语音编码器的表示训练的训练的投影仪的语音到llm嵌入投影仪；（2）在Llama-3.1-8B-Instruct的基础上对文本数据进行了培训的Lora适配器。这些模块是共同加载的，并针对多语言和多模式数据的1K步骤进行了进一步的指导，以形成我们提交的评估的最终系统。

Title: Analysis of LLM Bias (Chinese Propaganda & Anti-US Sentiment) in DeepSeek-R1 vs. ChatGPT o3-mini-high

Authors: PeiHsuan Huang, ZihWei Lin, Simon Imbot, WenCheng Fu, Ethan Tu
Subjects: cs.CL, cs.SI
Abstract URL: https://arxiv.org/abs/2506.01814
Pdf URL: https://arxiv.org/pdf/2506.01814
Copy Paste: [[2506.01814]] Analysis of LLM Bias (Chinese Propaganda & Anti-US Sentiment) in DeepSeek-R1 vs. ChatGPT o3-mini-high(https://arxiv.org/abs/2506.01814)
Keywords: language model, gpt, llm, chat
Abstract: Large language models (LLMs) increasingly shape public understanding and civic decisions, yet their ideological neutrality is a growing concern. While existing research has explored various forms of LLM bias, a direct, cross-lingual comparison of models with differing geopolitical alignments -- specifically a PRC-system model versus a non-PRC counterpart -- has been lacking. This study addresses this gap by systematically evaluating DeepSeek-R1 (PRC-aligned) against ChatGPT o3-mini-high (non-PRC) for Chinese-state propaganda and anti-U.S. sentiment. We developed a novel corpus of 1,200 de-contextualized, reasoning-oriented questions derived from Chinese-language news, presented in Simplified Chinese, Traditional Chinese, and English. Answers from both models (7,200 total) were assessed using a hybrid evaluation pipeline combining rubric-guided GPT-4o scoring with human annotation. Our findings reveal significant model-level and language-dependent biases. DeepSeek-R1 consistently exhibited substantially higher proportions of both propaganda and anti-U.S. bias compared to ChatGPT o3-mini-high, which remained largely free of anti-U.S. sentiment and showed lower propaganda levels. For DeepSeek-R1, Simplified Chinese queries elicited the highest bias rates; these diminished in Traditional Chinese and were nearly absent in English. Notably, DeepSeek-R1 occasionally responded in Simplified Chinese to Traditional Chinese queries and amplified existing PRC-aligned terms in its Chinese answers, demonstrating an "invisible loudspeaker" effect. Furthermore, such biases were not confined to overtly political topics but also permeated cultural and lifestyle content, particularly in DeepSeek-R1.
摘要：大型语言模型（LLM）越来越多地塑造公众的理解和公民决策，但是他们的意识形态中立是一个日益关注的问题。尽管现有研究探索了各种形式的LLM偏见，但对具有不同地缘政治一致性特定于PRC系统模型的模型进行了直接的，跨语性的比较，而非PRC对应物缺乏的模型。这项研究通过系统地评估了针对中国国家宣传和反美国的CHATGPT O3-MINI-HIGH（非PRC）的DeepSeek-R1（与PRC一致的）来解决这一差距。情绪。我们开发了一种新颖的语料库，该语料库由1,200个脱皮的，面向推理的问题，这些问题源自中文新闻，以简化的中文，传统中文和英语提出。使用混合评估管道评估了两个模型的答案（总计7,200个），该管道结合了涉足人的GPT-4O评分与人类注释。我们的发现揭示了重要的模型级和语言依赖性偏见。 DeepSeek-R1始终显示出宣传和反U.S的比例较高。与Chatgpt O3-Mini-High相比，偏见很大程度上没有抗U.S。情感并显示出较低的宣传水平。对于DeepSeek-R1，简化的中国疑问引起了最高的偏差率；这些在传统汉语中减少，英语几乎不存在。值得注意的是，DeepSeek-R1偶尔会以简化的中文对中国传统的疑问作出反应，并在其中国答案中放大了现有的PRC一致术语，表明了“隐形扬声器”的效果。此外，这种偏见不仅限于公开的政治话题，还渗透到文化和生活方式内容，尤其是在DeepSeek-R1中。

Title: BD at BEA 2025 Shared Task: MPNet Ensembles for Pedagogical Mistake Identification and Localization in AI Tutor Responses

Authors: Shadman Rohan, Ishita Sur Apan, Muhtasim Ibteda Shochcho, Md Fahim, Mohammad Ashfaq Ur Rahman, AKM Mahbubur Rahman, Amin Ahsan Ali
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.01817
Pdf URL: https://arxiv.org/pdf/2506.01817
Copy Paste: [[2506.01817]] BD at BEA 2025 Shared Task: MPNet Ensembles for Pedagogical Mistake Identification and Localization in AI Tutor Responses(https://arxiv.org/abs/2506.01817)
Keywords: language model
Abstract: We present Team BD's submission to the BEA 2025 Shared Task on Pedagogical Ability Assessment of AI-powered Tutors, under Track 1 (Mistake Identification) and Track 2 (Mistake Location). Both tracks involve three-class classification of tutor responses in educational dialogues - determining if a tutor correctly recognizes a student's mistake (Track 1) and whether the tutor pinpoints the mistake's location (Track 2). Our system is built on MPNet, a Transformer-based language model that combines BERT and XLNet's pre-training advantages. We fine-tuned MPNet on the task data using a class-weighted cross-entropy loss to handle class imbalance, and leveraged grouped cross-validation (10 folds) to maximize the use of limited data while avoiding dialogue overlap between training and validation. We then performed a hard-voting ensemble of the best models from each fold, which improves robustness and generalization by combining multiple classifiers. Our approach achieved strong results on both tracks, with exact-match macro-F1 scores of approximately 0.7110 for Mistake Identification and 0.5543 for Mistake Location on the official test set. We include comprehensive analysis of our system's performance, including confusion matrices and t-SNE visualizations to interpret classifier behavior, as well as a taxonomy of common errors with examples. We hope our ensemble-based approach and findings provide useful insights for designing reliable tutor response evaluation systems in educational dialogue settings.
摘要：我们将BD团队提交给BEA 2025联合任务，以评估AI驱动的导师的教学能力评估，在轨道1（错误身份）和轨道2（错误位置）下。这两种曲目均涉及教育对话中导师反应的三类分类 - 确定导师是否正确地识别了学生的错误（轨道1），以及导师是否指出了错误的位置（轨道2）。我们的系统建立在MPNET上，MPNet是一种基于变压器的语言模型，结合了Bert和XLNet的预训练优势。我们使用类加权的交叉渗透损失对任务数据进行了微调，以处理类失衡，并杠杆分组的交叉验证（10倍）（10倍）以最大程度地利用有限的数据使用，同时避免对话重叠在培训和验证之间。然后，我们从每个折叠中执行了最佳模型的硬投票集合，从而通过组合多个分类器来改善鲁棒性和概括性。我们的方法在这两个轨道上都取得了强大的成绩，错误识别的精确匹配宏F1分数约为0.7110，而正式测试集中的错误位置则取得了0.5543。我们包括对系统性能的全面分析，包括混淆矩阵和T-SNE可视化以解释分类器行为，以及与示例的常见错误分类法。我们希望我们的基于合奏的方法和发现为在教育对话环境中设计可靠的导师响应评估系统提供有用的见解。
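
The hard-voting step mentioned in the abstract can be illustrated with a minimal sketch: each fold's model predicts a label per example, and the ensemble returns the most frequent label. This is not the authors' code. The label strings and the tie-breaking rule (the first label seen among tied labels wins, which is how `Counter.most_common` orders ties) are assumptions made for illustration.

```python
from collections import Counter


def hard_vote(predictions_per_model):
    """Majority vote over per-model label lists.

    predictions_per_model: list of lists, one list of labels per model,
    all of the same length (one label per example).
    """
    n_examples = len(predictions_per_model[0])
    voted = []
    for i in range(n_examples):
        votes = Counter(preds[i] for preds in predictions_per_model)
        # Ties resolve to the label encountered first (assumption).
        voted.append(votes.most_common(1)[0][0])
    return voted


if __name__ == "__main__":
    # Hypothetical predictions from three fold models on three responses.
    fold_preds = [
        ["Yes", "No", "To some extent"],
        ["Yes", "Yes", "No"],
        ["No", "Yes", "To some extent"],
    ]
    print(hard_vote(fold_preds))
```

With ten folds, as in the abstract, the same function would simply receive ten label lists instead of three.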

Title: Not All Jokes Land: Evaluating Large Language Models Understanding of Workplace Humor

Authors: Moahmmadamin Shafiei, Hamidreza Saffari
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2506.01819
Pdf URL: https://arxiv.org/pdf/2506.01819
Copy Paste: [[2506.01819]] Not All Jokes Land: Evaluating Large Language Models Understanding of Workplace Humor(https://arxiv.org/abs/2506.01819)
Keywords: language model, llm
Abstract: With the recent advances in Artificial Intelligence (AI) and Large Language Models (LLMs), the automation of daily tasks, like automatic writing, is getting more and more attention. Hence, efforts have focused on aligning LLMs with human values, yet humor, particularly professional industrial humor used in workplaces, has been largely neglected. To address this, we develop a dataset of professional humor statements along with features that determine the appropriateness of each statement. Our evaluation of five LLMs shows that LLMs often struggle to judge the appropriateness of humor accurately.
摘要：随着人工智能（AI）和大型语言模型（LLM）的最新进展，日常任务的自动化（如自动写作）正在引起越来越多的关注。因此，努力的重点是使LLM与人类价值观保持一致，但幽默，尤其是在工作场所使用的专业工业幽默。为了解决这个问题，我们开发了一个专业幽默语句的数据集以及决定每个声明的适当性的功能。我们对五个LLM的评估表明，LLMS经常难以准确判断幽默的适当性。

Title: Minimal Pair-Based Evaluation of Code-Switching

Authors: Igor Sterner, Simone Teufel
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.01840
Pdf URL: https://arxiv.org/pdf/2506.01840
Copy Paste: [[2506.01840]] Minimal Pair-Based Evaluation of Code-Switching(https://arxiv.org/abs/2506.01840)
Keywords: language model, llm
Abstract: There is a lack of an evaluation methodology that estimates the extent to which large language models (LLMs) use code-switching (CS) in the same way as bilinguals. Existing methods do not have wide language coverage, fail to account for the diverse range of CS phenomena, or do not scale. We propose an intervention based on minimal pairs of CS. Each minimal pair contains one naturally occurring CS sentence and one minimally manipulated variant. We collect up to 1,000 such pairs each for 11 language pairs. Our human experiments show that, for every language pair, bilinguals consistently prefer the naturally occurring CS sentence. Meanwhile our experiments with current LLMs show that the larger the model, the more consistently it assigns higher probability to the naturally occurring CS sentence than to the variant. In accordance with theoretical claims, the largest probability differences arise in those pairs where the manipulated material consisted of closed-class words.
摘要：缺乏评估方法可以估计大型语言模型（LLMS）使用代码转换（CS）的程度与双语者相同。现有方法没有广泛的语言覆盖范围，无法说明CS现象的各种范围或不扩展。我们提出了一种基于最小CS对的干预措施。每个最小对包含一个天然存在的CS句子和一个最小操纵的变体。我们最多收集1,000对此类语言对。我们的人类实验表明，对于每一个语言对，双语者都始终偏爱天然发生的CS句子。同时，我们使用当前LLM的实验表明，模型越大，它越一致地为自然发生的CS句子分配了较高的概率，而不是变体。根据理论主张，在操纵材料由闭门单词组成的对中出现了最大的概率差异。
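
A minimal-pair evaluation of this kind is typically scored by checking, for each pair, whether the model assigns higher probability to the naturally occurring sentence than to the manipulated variant. The sketch below assumes sentence log-probabilities have already been computed by some model; the numbers are made up and the function name is hypothetical, so it illustrates the scoring rule only, not the paper's pipeline.

```python
def minimal_pair_accuracy(pairs):
    """Fraction of pairs where the natural sentence outscores the variant.

    pairs: list of (logp_natural, logp_variant) tuples, where each value is
    a sentence-level log-probability from a language model.
    """
    if not pairs:
        raise ValueError("need at least one minimal pair")
    wins = sum(1 for natural, variant in pairs if natural > variant)
    return wins / len(pairs)


if __name__ == "__main__":
    # Made-up log-probabilities for four minimal pairs (assumption).
    scores = [(-10.2, -12.5), (-8.0, -7.5), (-15.1, -15.9), (-9.3, -11.0)]
    print(minimal_pair_accuracy(scores))
```

Comparing this accuracy across model sizes would correspond to the abstract's observation that larger models prefer the natural sentence more consistently.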

Title: CONFETTI: Conversational Function-Calling Evaluation Through Turn-Level Interactions

Authors: Tamer Alkhouli, Katerina Margatina, James Gung, Raphael Shu, Claudia Zaghi, Monica Sunkara, Yi Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.01859
Pdf URL: https://arxiv.org/pdf/2506.01859
Copy Paste: [[2506.01859]] CONFETTI: Conversational Function-Calling Evaluation Through Turn-Level Interactions(https://arxiv.org/abs/2506.01859)
Keywords: language model, llm, agent
Abstract: We introduce Conversational Function-Calling Evaluation Through Turn-Level Interactions (CONFETTI), a conversational benchmark designed to evaluate the function-calling capabilities and response quality of large language models (LLMs). Current benchmarks lack comprehensive assessment of LLMs in complex conversational scenarios. CONFETTI addresses this gap through 109 human-simulated conversations, comprising 313 user turns and covering 86 APIs. These conversations explicitly target various conversational complexities, such as follow-ups, goal correction and switching, and ambiguous and implicit goals. We perform off-policy turn-level evaluation using this benchmark targeting function-calling. Our benchmark also incorporates dialog act annotations to assess agent responses. We evaluate a series of state-of-the-art LLMs and analyze their performance with respect to the number of available APIs, conversation lengths, and chained function calling. Our results reveal that while some models are able to handle long conversations and leverage more than 20 APIs successfully, other models struggle with longer context or when increasing the number of APIs. We also report that the performance on chained function-calls is severely limited across the models. Overall, the top performing models on CONFETTI are Nova Pro (40.01%), Claude Sonnet v3.5 (35.46%) and Llama 3.1 405B (33.19%) followed by command-r-plus (31.18%) and Mistral-Large-2407 (30.07%).
摘要：我们通过转交级交互（五共式）介绍了对话函数调用评估，这是一种旨在评估大语言模型（LLMS）的功能称呼功能和响应质量的对话基准1。当前的基准缺乏对复杂的对话场景中LLM的全面评估。五彩纸屑通过109次人类模拟的对话解决了这一差距，包括313个用户转弯并覆盖86个API。这些对话明确针对各种对话复杂性，例如后续行动，目标纠正和切换，模棱两可和隐性目标。我们使用此基准定位函数呼叫来执行非政策转交级评估。我们的基准还包含对话ACT注释以评估代理响应。我们评估了一系列最先进的LLM，并在可用的API，对话长度和链式功能调用方面分析其性能。我们的结果表明，尽管某些模型能够处理长时间的对话，并成功利用超过20多个API，但其他模型在更长的上下文或增加API的数量时会遇到困难。我们还报告说，链式功能通话的性能在整个模型中受到严重限制。总体而言，五彩纸屑上的最高表现模型是Nova Pro（40.01％），Claude Sonnet v3.5（35.46％）和Llama 3.1 405b（33.19％），然后是Command-R-Plus（31.18％）和MISTRALGE-LARGE-LARGE-LARGE-2407（30.07％）。

Title: Is Extending Modality The Right Path Towards Omni-Modality?

Authors: Tinghui Zhu, Kai Zhang, Muhao Chen, Yu Su
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2506.01872
Pdf URL: https://arxiv.org/pdf/2506.01872
Copy Paste: [[2506.01872]] Is Extending Modality The Right Path Towards Omni-Modality?(https://arxiv.org/abs/2506.01872)
Keywords: language model
Abstract: Omni-modal language models (OLMs) aim to integrate and reason over diverse input modalities--such as text, images, video, and audio--while maintaining strong language capabilities. Despite recent advancements, existing models, especially open-source ones, remain far from true omni-modality, struggling to generalize beyond the specific modality pairs they are trained on or to achieve strong performance when processing multi-modal inputs. We study the effect of extending modality, the dominant technique for training multimodal models, where an off-the-shelf language model is fine-tuned on target-domain and language data. Specifically, we investigate three key questions: (1) Does modality extension compromise core language abilities? (2) Can model merging effectively integrate independently fine-tuned modality-specific models to achieve omni-modality? (3) Does omni-modality extension lead to better knowledge sharing and generalization compared to sequential extension? Through extensive experiments, we analyze these trade-offs and provide insights into the feasibility of achieving true omni-modality using current approaches.
摘要：Omni-Modal语言模型（OLMS）旨在整合和推理各种输入模式（例如文本，图像，视频和音频），同时保持强大的语言能力。尽管有最近的进步，但现有的模型，尤其是开源模型，远非真正的Omni模式，在处理多模式输入时，他们努力概括他们经过训练的特定模式对或以实现强大的性能。我们研究了扩展方式的效果，即训练多模型模型的主要技术，该技术对目标域和语言数据进行了微调。具体来说，我们研究了三个关键问题：（1）模态扩展是否会损害核心语言能力？（2）是否可以有效地集成独立微调的模式特定模型以实现OMNI模式？（3）与顺序扩展相比，Omni模式扩展是否会导致更好的知识共享和概括？通过广泛的实验，我们分析了这些权衡，并提供了对使用当前方法实现真正的Omni模式的可行性的见解。

Title: Spatial Coordinates as a Cell Language: A Multi-Sentence Framework for Imaging Mass Cytometry Analysis

Authors: Chi-Jane Chen, Yuhang Chen, Sukwon Yun, Natalie Stanley, Tianlong Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.01918
Pdf URL: https://arxiv.org/pdf/2506.01918
Copy Paste: [[2506.01918]] Spatial Coordinates as a Cell Language: A Multi-Sentence Framework for Imaging Mass Cytometry Analysis(https://arxiv.org/abs/2506.01918)
Keywords: language model, llm
Abstract: Imaging mass cytometry (IMC) enables high-dimensional spatial profiling by combining mass cytometry's analytical power with spatial distributions of cell phenotypes. Recent studies leverage large language models (LLMs) to extract cell states by translating gene or protein expression into biological context. However, existing single-cell LLMs face two major challenges: (1) Integration of spatial information: they struggle to generalize spatial coordinates and effectively encode spatial context as text, and (2) Treating each cell independently: they overlook cell-cell interactions, limiting their ability to capture biological relationships. To address these limitations, we propose Spatial2Sentence, a novel framework that integrates single-cell expression and spatial information into natural language using a multi-sentence approach. Spatial2Sentence constructs expression similarity and distance matrices, pairing spatially adjacent and expressionally similar cells as positive pairs while using distant and dissimilar cells as negatives. These multi-sentence representations enable LLMs to learn cellular interactions in both expression and spatial contexts. Equipped with multi-task learning, Spatial2Sentence outperforms existing single-cell LLMs on preprocessed IMC datasets, improving cell-type classification by 5.98% and clinical status prediction by 4.18% on the diabetes dataset while enhancing interpretability. The source code can be found here: this https URL.
摘要：图像质量细胞仪（IMC）通过将质量细胞仪的分析能力与细胞表型的空间分布相结合，从而实现了高维空间分析。最近的研究利用大型语言模型（LLM）通过将基因或蛋白质表达转化为生物学环境来提取细胞态。但是，现有的单细胞LLM面临两个主要挑战：（1）空间信息的整合：它们努力概括空间坐标并有效地将空间上下文编码为文本，以及（2）独立处理每个单元：它们忽略了细胞细胞的相互作用，限制了捕获生物学关系的能力。为了解决这些局限性，我们提出了空间2Sentence，这是一个新颖的框架，将单细胞表达和空间信息整合到自然语言中，使用多句子方法。空间2阶段构建表达相似性和距离矩阵，在空间相邻和表达相似的细胞中与正对相似，同时使用远处和不同的细胞作为负面。这些多句子表示可以使LLM在表达和空间环境中学习细胞相互作用。 Spatial2Stence配备了多任务学习，在预处理的IMC数据集上的现有单细胞LLMS优于现有的单细胞LLM，将细胞类型的分类提高了5.98％，而临床状态预测在糖尿病数据集的同时增强可解释性的同时，在糖尿病数据集上的临床状态预测提高了4.18％。可以在此处找到源代码：此HTTPS URL。

Title: From Guidelines to Practice: A New Paradigm for Arabic Language Model Evaluation

Authors: Serry Sibaee, Omer Nacar, Adel Ammar, Yasser Al-Habashi, Abdulrahman Al-Batati, Wadii Boulila
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.01920
Pdf URL: https://arxiv.org/pdf/2506.01920
Copy Paste: [[2506.01920]] From Guidelines to Practice: A New Paradigm for Arabic Language Model Evaluation(https://arxiv.org/abs/2506.01920)
Keywords: language model, gpt, llm
Abstract: This paper addresses critical gaps in Arabic language model evaluation by establishing comprehensive theoretical guidelines and introducing a novel evaluation framework. We first analyze existing Arabic evaluation datasets, identifying significant issues in linguistic accuracy, cultural alignment, and methodological rigor. To address these limitations in LLMs, we present the Arabic Depth Mini Dataset (ADMD), a carefully curated collection of 490 challenging questions spanning ten major domains (42 sub-domains; see Figure 1). Using ADMD, we evaluate five leading language models: GPT-4, Claude 3.5 Sonnet, Gemini Flash 1.5, CommandR 100B, and Qwen-Max. Our results reveal significant variations in model performance across different domains, with particular challenges in areas requiring deep cultural understanding and specialized knowledge. Claude 3.5 Sonnet demonstrated the highest overall accuracy at 30%, showing relative strength in mathematical theory in Arabic, Arabic language, and Islamic domains. This work provides both theoretical foundations and practical insights for improving Arabic language model evaluation, emphasizing the importance of cultural competence alongside technical capabilities.
摘要：本文通过建立全面的理论指南并引入新的评估框架来解决阿拉伯语语言模型评估中的关键差距。我们首先分析了现有的阿拉伯语评估数据集，确定了语言准确性、文化一致性和方法严谨性方面的重大问题。为了解决LLM中的这些局限，我们提出了阿拉伯语深度迷你数据集（ADMD），这是一个精心策划的集合，包含490个具有挑战性的问题，涵盖十个主要领域（42个子领域，见图1）。我们使用ADMD评估了五个领先的语言模型：GPT-4、Claude 3.5 Sonnet、Gemini Flash 1.5、CommandR 100B和Qwen-Max。我们的结果表明，模型在不同领域的表现存在显著差异，在需要深厚文化理解和专业知识的领域尤其具有挑战性。Claude 3.5 Sonnet的总体准确率最高，为30%，在阿拉伯语数学理论、阿拉伯语语言和伊斯兰领域表现出相对优势。这项工作为改进阿拉伯语语言模型评估提供了理论基础和实践见解，强调了文化能力与技术能力同等重要。

Title: Esoteric Language Models

Authors: Subham Sekhar Sahoo, Zhihan Yang, Yash Akhauri, Johnna Liu, Deepansha Singh, Zhoujun Cheng, Zhengzhong Liu, Eric Xing, John Thickstun, Arash Vahdat
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.01928
Pdf URL: https://arxiv.org/pdf/2506.01928
Copy Paste: [[2506.01928]] Esoteric Language Models(https://arxiv.org/abs/2506.01928)
Keywords: language model
Abstract: Diffusion-based language models offer a compelling alternative to autoregressive (AR) models by enabling parallel and controllable generation. Among this family of models, Masked Diffusion Models (MDMs) achieve the strongest performance but still underperform AR models in perplexity and lack key inference-time efficiency features -- most notably, KV caching. In this work, we introduce Eso-LMs, a new family of models that fuses AR and MDM paradigms, enabling smooth interpolation between their perplexities while overcoming their respective limitations. Eso-LMs set a new state of the art on standard language modeling benchmarks. Crucially, we are the first to introduce KV caching for MDMs while preserving parallel generation, significantly improving inference efficiency. Combined with an optimized sampling schedule, our method achieves up to 65x faster inference than standard MDMs and 4x faster inference than prior semi-autoregressive approaches. We provide the code and model checkpoints on the project page: this http URL
摘要：基于扩散的语言模型通过实现并行和可控的生成提供了自动回归（AR）模型的引人注目的替代方案。在这个模型家族中，掩盖的扩散模型（MDMS）的性能最强，但在困惑中仍然表现不佳，并且缺乏关键的推理时间效率功能，尤其是KV缓存。在这项工作中，我们介绍了ESO-LMS，这是一个融合AR和MDM范式的新型模型家族，在克服各自的局限性的同时，可以在其困惑之间平稳插值。 ESO-LMS在标准语言建模基准测试基准上设置了新的最新技术。至关重要的是，我们是第一个在保留平行生成的同时引入MDMS **的KV缓存，从而显着提高了推理效率。结合优化的采样时间表，我们的方法比标准MDMS的推断速度更快** 65x **，并且** 4X **推断速度比以前的半自动进取方法更快。我们在项目页面上提供代码和模型检查点：[此HTTP URL]（此HTTP URL）

Title: RewardBench 2: Advancing Reward Model Evaluation

Authors: Saumya Malik, Valentina Pyatkin, Sander Land, Jacob Morrison, Noah A. Smith, Hannaneh Hajishirzi, Nathan Lambert
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.01937
Pdf URL: https://arxiv.org/pdf/2506.01937
Copy Paste: [[2506.01937]] RewardBench 2: Advancing Reward Model Evaluation(https://arxiv.org/abs/2506.01937)
Keywords: language model, prompt
Abstract: Reward models are used throughout the post-training of language models to capture nuanced signals from preference data and provide a training target for optimization across instruction following, reasoning, safety, and more domains. The community has begun establishing best practices for evaluating reward models, from the development of benchmarks that test capabilities in specific skill areas to others that test agreement with human preferences. At the same time, progress in evaluation has not been mirrored by the effectiveness of reward models in downstream tasks -- simpler direct alignment algorithms are reported to work better in many cases. This paper introduces RewardBench 2, a new multi-skill reward modeling benchmark designed to bring new, challenging data for accuracy-based reward model evaluation -- models score about 20 points on average lower on RewardBench 2 compared to the first RewardBench -- while being highly correlated with downstream performance. Compared to most other benchmarks, RewardBench 2 sources new human prompts instead of existing prompts from downstream evaluations, facilitating more rigorous evaluation practices. In this paper, we describe our benchmark construction process and report how existing models perform on it, while quantifying how performance on the benchmark correlates with downstream use of the models in both inference-time scaling algorithms, like best-of-N sampling, and RLHF training algorithms like proximal policy optimization.
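Note: the best-of-N sampling mentioned in the abstract above can be sketched in a few lines. This is a minimal illustration, not the benchmark's code: `best_of_n`, `reward_fn`, and the toy reward are hypothetical names introduced here, and a real setup would score candidates with a trained reward model.

```python
from typing import Callable, Sequence


def best_of_n(candidates: Sequence[str], reward_fn: Callable[[str], float]) -> str:
    """Return the candidate that the reward function scores highest."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=reward_fn)


# Toy reward (an assumption for illustration only): prefer longer answers.
def toy_reward(text: str) -> float:
    return float(len(text))


samples = ["yes", "yes, because 2 + 2 = 4", "no"]
print(best_of_n(samples, toy_reward))
```

The quality of the selected output depends entirely on how well `reward_fn` ranks candidates, which is why reward model accuracy is expected to matter for this use.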

Title: Novel Benchmark for NER in the Wastewater and Stormwater Domain

Authors: Franco Alberto Cardillo, Franca Debole, Francesca Frontini, Mitra Aelami, Nanée Chahinian, Serge Conrad
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.01938
Pdf URL: https://arxiv.org/pdf/2506.01938
Copy Paste: [[2506.01938]] Novel Benchmark for NER in the Wastewater and Stormwater Domain(https://arxiv.org/abs/2506.01938)
Keywords: llm
Abstract: Effective wastewater and stormwater management is essential for urban sustainability and environmental protection. Extracting structured knowledge from reports and regulations is challenging due to domain-specific terminology and multilingual contexts. This work focuses on domain-specific Named Entity Recognition (NER) as a first step towards effective relation and information extraction to support decision making. A multilingual benchmark is crucial for evaluating these methods. This study develops a French-Italian domain-specific text corpus for wastewater management. It evaluates state-of-the-art NER methods, including LLM-based approaches, to provide a reliable baseline for future strategies and explores automated annotation projection in view of an extension of the corpus to new languages.

Title: Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning

Authors: Shenzhi Wang, Le Yu, Chang Gao, Chujie Zheng, Shixuan Liu, Rui Lu, Kai Dang, Xionghui Chen, Jianxin Yang, Zhenru Zhang, Yuqiong Liu, An Yang, Andrew Zhao, Yang Yue, Shiji Song, Bowen Yu, Gao Huang, Junyang Lin
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.01939
Pdf URL: https://arxiv.org/pdf/2506.01939
Copy Paste: [[2506.01939]] Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning(https://arxiv.org/abs/2506.01939)
Keywords: language model, llm, chain-of-thought
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful approach to enhancing the reasoning capabilities of Large Language Models (LLMs), while its mechanisms are not yet well understood. In this work, we undertake a pioneering exploration of RLVR through the novel perspective of token entropy patterns, comprehensively analyzing how different tokens influence reasoning performance. By examining token entropy patterns in Chain-of-Thought (CoT) reasoning, we observe that only a small fraction of tokens exhibit high entropy, and these tokens act as critical forks that steer the model toward diverse reasoning pathways. Furthermore, studying how entropy patterns evolve during RLVR training reveals that RLVR largely adheres to the base model's entropy patterns, primarily adjusting the entropy of high-entropy tokens. These findings highlight the significance of high-entropy tokens (i.e., forking tokens) to RLVR. We ultimately improve RLVR by restricting policy gradient updates to forking tokens and uncover a finding even beyond the 80/20 rule: utilizing only 20% of the tokens while maintaining performance comparable to full-gradient updates on the Qwen3-8B base model and significantly surpassing full-gradient updates on the Qwen3-32B (+11.04 on AIME'25 and +7.71 on AIME'24) and Qwen3-14B (+4.79 on AIME'25 and +5.21 on AIME'24) base models, highlighting a strong scaling trend. In contrast, training exclusively on the 80% lowest-entropy tokens leads to a marked decline in performance. These findings indicate that the efficacy of RLVR primarily arises from optimizing the high-entropy tokens that decide reasoning directions. Collectively, our results highlight the potential to understand RLVR through a token-entropy perspective and optimize RLVR by leveraging high-entropy minority tokens to further improve LLM reasoning.
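Note: the idea of restricting updates to the highest-entropy ("forking") tokens can be illustrated with a small sketch. The helper names, the per-sequence selection, and the toy distributions below are assumptions made for illustration; the abstract does not specify how the paper selects tokens in practice.

```python
import math
from typing import List


def token_entropy(probs: List[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def forking_mask(entropies: List[float], keep_fraction: float = 0.2) -> List[bool]:
    """Mark the top `keep_fraction` highest-entropy positions as True."""
    k = max(1, round(len(entropies) * keep_fraction))
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    top = set(ranked[:k])
    return [i in top for i in range(len(entropies))]


def masked_mean(losses: List[float], mask: List[bool]) -> float:
    """Average the per-token losses over the selected positions only."""
    kept = [loss for loss, keep in zip(losses, mask) if keep]
    return sum(kept) / len(kept)


# Toy next-token distributions for a 5-token response (illustrative values).
dists = [
    [1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0],
    [0.9, 0.1, 0.0],
    [1 / 3, 1 / 3, 1 / 3],
    [0.99, 0.01, 0.0],
]
entropies = [token_entropy(d) for d in dists]
mask = forking_mask(entropies, keep_fraction=0.2)
print(mask)
```

In this toy case only the uniform distribution at position 3 is selected, so a loss averaged with `masked_mean` would ignore the four lower-entropy tokens.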

Title: Self-ensemble: Mitigating Confidence Distortion for Large Language Models

Authors: Zicheng Xu, Guanchu Wang, Guangyao Zheng, Yu-Neng Chuang, Alexander Szalay, Xia Hu, Vladimir Braverman
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.01951
Pdf URL: https://arxiv.org/pdf/2506.01951
Copy Paste: [[2506.01951]] Self-ensemble: Mitigating Confidence Distortion for Large Language Models(https://arxiv.org/abs/2506.01951)
Keywords: language model, llm
Abstract: Although Large Language Models (LLMs) perform well in general fields, they exhibit a confidence distortion problem on multi-choice question-answering (MCQA), particularly as the number of answer choices increases. Specifically, on MCQA with many choices, LLMs suffer from under-confidence in correct predictions and over-confidence in incorrect ones, leading to substantially degraded performance. To solve this problem, we propose Self-ensemble in this work. Our method splits the choices into several groups and ensembles LLM predictions across these groups to reach a final decision. The advantage of Self-ensemble is its plug-and-play nature: it can be integrated into existing LLM architectures based on a designed attention mask and positional encoding, without requiring labeled datasets for parameter tuning. Experimental results on three LLMs and datasets demonstrate that Self-ensemble comprehensively addresses the confidence distortion problem of LLMs, outperforming standard inference as well as baseline methods.
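Note: the "split the choices into groups, then combine the group-level predictions" idea can be sketched as below. This is a simplified stand-in, not the paper's method: the abstract describes an implementation based on an attention mask and positional encoding, whereas this sketch simply scores each group separately and then compares the group winners. `score_fn` is a hypothetical scorer standing in for an LLM's preference for a choice among the options it is shown.

```python
from typing import Callable, List, Sequence

# score_fn(choice, shown_choices) -> higher means the choice is preferred.
ScoreFn = Callable[[str, Sequence[str]], float]


def split_into_groups(choices: Sequence[str], group_size: int) -> List[List[str]]:
    """Partition the choices into consecutive groups of at most group_size."""
    return [list(choices[i:i + group_size]) for i in range(0, len(choices), group_size)]


def grouped_decision(choices: Sequence[str], score_fn: ScoreFn, group_size: int = 4) -> str:
    """Pick a winner inside each small group, then decide among the winners."""
    groups = split_into_groups(choices, group_size)
    finalists = [max(group, key=lambda c: score_fn(c, group)) for group in groups]
    return max(finalists, key=lambda c: score_fn(c, finalists))


# Toy scorer (an assumption): prefer the number closest to 42.
def toy_score(choice: str, shown: Sequence[str]) -> float:
    return -abs(int(choice) - 42)


options = ["7", "13", "42", "99", "100", "3"]
print(grouped_decision(options, toy_score, group_size=3))
```

Each call to the scorer only sees a few options at a time, which is the property the abstract associates with reduced confidence distortion when the number of choices is large.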

Title: WebChoreArena: Evaluating Web Browsing Agents on Realistic Tedious Web Tasks

Authors: Atsuyuki Miyai, Zaiying Zhao, Kazuki Egashira, Atsuki Sato, Tatsumi Sunada, Shota Onohara, Hiromasa Yamanishi, Mashiro Toyooka, Kunato Nishina, Ryoma Maeda, Kiyoharu Aizawa, Toshihiko Yamasaki
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.01952
Pdf URL: https://arxiv.org/pdf/2506.01952
Copy Paste: [[2506.01952]] WebChoreArena: Evaluating Web Browsing Agents on Realistic Tedious Web Tasks(https://arxiv.org/abs/2506.01952)
Keywords: language model, gpt, llm, agent
Abstract: Powered by a large language model (LLM), a web browsing agent operates web browsers in a human-like manner and offers a highly transparent path toward automating a wide range of everyday tasks. As web agents become increasingly capable and demonstrate proficiency in general browsing tasks, a critical question emerges: Can they go beyond general browsing to robustly handle tasks that are tedious and complex, or chores that humans often avoid doing themselves? In this paper, we introduce WebChoreArena, a new fully reproducible benchmark comprising 532 carefully curated tasks designed to extend the scope of WebArena beyond general browsing to more labor-intensive and tedious tasks. WebChoreArena systematically integrates three key challenges: (i) Massive Memory tasks requiring accurate retrieval of large amounts of information in the observations, (ii) Calculation tasks demanding precise mathematical reasoning, and (iii) Long-Term Memory tasks necessitating long-term memory across multiple webpages. Built on top of the fully reproducible and widely adopted four WebArena simulation environments, WebChoreArena ensures strict reproducibility and enables fair, direct comparisons with the established WebArena benchmark, offering key insights into agent progress. Our experimental results demonstrate that as LLMs evolve, represented by GPT-4o, Claude 3.7 Sonnet, and Gemini 2.5 Pro, significant improvements in performance are observed on WebChoreArena. These findings suggest that WebChoreArena is well-suited to measure the advancement of state-of-the-art LLMs with greater clarity. Nevertheless, the results also indicate that even with Gemini 2.5 Pro, there remains substantial room for improvement compared to WebArena, highlighting the increased challenges posed by WebChoreArena.

Title: DRAG: Distilling RAG for SLMs from LLMs to Transfer Knowledge and Mitigate Hallucination via Evidence and Graph-based Distillation

Authors: Jennifer Chen, Aidar Myrzakhan, Yaxin Luo, Hassaan Muhammad Khan, Sondos Mahmoud Bsharat, Zhiqiang Shen
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.01954
Pdf URL: https://arxiv.org/pdf/2506.01954
Copy Paste: [[2506.01954]] DRAG: Distilling RAG for SLMs from LLMs to Transfer Knowledge and Mitigate Hallucination via Evidence and Graph-based Distillation(https://arxiv.org/abs/2506.01954)
Keywords: language model, llm, hallucination, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) methods have proven highly effective for tasks requiring factual consistency and robust knowledge retrieval. However, large-scale RAG systems consume significant computational resources and are prone to generating hallucinated content from Humans. In this work, we introduce $\texttt{DRAG}$, a novel framework for distilling RAG knowledge from large-scale Language Models (LLMs) into small LMs (SLMs). Our approach leverages evidence- and knowledge graph-based distillation, ensuring that the distilled model retains critical factual knowledge while significantly reducing model size and computational cost. By aligning the smaller model's predictions with a structured knowledge graph and ranked evidence, $\texttt{DRAG}$ effectively mitigates hallucinations and improves factual accuracy. We further present a case demonstrating how our framework mitigates user privacy risks and introduce a corresponding benchmark. Experimental evaluations on multiple benchmarks demonstrate that our method outperforms the prior competitive RAG methods like MiniRAG for SLMs by up to 27.7% using the same models, preserving high-level efficiency and reliability. With $\texttt{DRAG}$, we provide a practical and resource-efficient roadmap to deploying enhanced retrieval and generation capabilities in small-sized LLMs.