2025-06-06

Title: GEM: Empowering LLM for both Embedding Generation and Language Understanding

Authors: Caojin Zhang, Qiang Zhang, Ke Li, Sai Vidyaranya Nuthalapati, Benyu Zhang, Jason Liu, Serena Li, Lizhu Zhang, Xiangjun Fan
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.04344
Pdf URL: https://arxiv.org/pdf/2506.04344
Copy Paste: [[2506.04344]] GEM: Empowering LLM for both Embedding Generation and Language Understanding(https://arxiv.org/abs/2506.04344)
Keywords: language model, llm, retrieval augmented generation
Abstract: Large decoder-only language models (LLMs) have achieved remarkable success in generation and reasoning tasks, where they generate text responses given instructions. However, many applications, e.g., retrieval augmented generation (RAG), still rely on separate embedding models to generate text embeddings, which can complicate the system and introduce discrepancies in understanding of the query between the embedding model and LLMs. To address this limitation, we propose a simple self-supervised approach, Generative Embedding large language Model (GEM), that enables any large decoder-only LLM to generate high-quality text embeddings while maintaining its original text generation and reasoning capabilities. Our method inserts new special token(s) into a text body, and generates a summarization embedding of the text by manipulating the attention mask. This method can be easily integrated into the post-training or fine-tuning stages of any existing LLM. We demonstrate the effectiveness of our approach by applying it to two popular LLM families, ranging from 1B to 8B parameters, and evaluating the transformed models on both text embedding benchmarks (MTEB) and NLP benchmarks (MMLU). The results show that our proposed method significantly improves the original LLMs on MTEB while having a minimal impact on MMLU. Our strong results indicate that our approach can empower LLMs with state-of-the-art text embedding capabilities while maintaining their original NLP performance.
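The abstract says GEM obtains an embedding by inserting special token(s) and manipulating the attention mask, without giving the exact scheme. The sketch below shows one plausible way such a mask could look, under assumptions that are not from the paper: the special tokens sit right after the text, attention is causal, and any tokens that follow the special tokens are blocked from the text so they must rely on the special tokens as a summary. The function name `build_summary_mask` and the layout are hypothetical.

```python
from typing import List


def build_summary_mask(n_text: int, n_special: int, n_suffix: int) -> List[List[int]]:
    """Return a square 0/1 attention mask (1 = query row may attend to key column).

    Assumed layout (illustrative only): [text tokens][special tokens][suffix tokens].
    """
    n = n_text + n_special + n_suffix
    first_suffix = n_text + n_special
    mask = [[0] * n for _ in range(n)]
    for q in range(n):
        for k in range(q + 1):  # causal: a token never attends to later positions
            query_is_suffix = q >= first_suffix
            key_is_text = k < n_text
            if query_is_suffix and key_is_text:
                continue  # assumption: suffix sees the text only via the special tokens
            mask[q][k] = 1
    return mask


mask = build_summary_mask(n_text=3, n_special=1, n_suffix=2)
for row in mask:
    print(row)
```

In this toy layout the special token at position 3 attends to the whole text, so its final hidden state could serve as the text embedding, while the two suffix positions attend only to the special token and to each other.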

Title: MELABenchv1: Benchmarking Large Language Models against Smaller Fine-Tuned Models for Low-Resource Maltese NLP

Authors: Kurt Micallef, Claudia Borg
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.04385
Pdf URL: https://arxiv.org/pdf/2506.04385
Copy Paste: [[2506.04385]] MELABenchv1: Benchmarking Large Language Models against Smaller Fine-Tuned Models for Low-Resource Maltese NLP(https://arxiv.org/abs/2506.04385)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) have demonstrated remarkable performance across various Natural Language Processing (NLP) tasks, largely due to their generalisability and ability to perform tasks without additional training. However, their effectiveness for low-resource languages remains limited. In this study, we evaluate the performance of 55 publicly available LLMs on Maltese, a low-resource language, using a newly introduced benchmark covering 11 discriminative and generative tasks. Our experiments highlight that many models perform poorly, particularly on generative tasks, and that smaller fine-tuned models often perform better across all tasks. From our multidimensional analysis, we investigate various factors impacting performance. We conclude that prior exposure to Maltese during pre-training and instruction-tuning emerges as the most important factor. We also examine the trade-offs between fine-tuning and prompting, highlighting that while fine-tuning requires a higher initial cost, it yields better performance and lower inference costs. Through this work, we aim to highlight the need for more inclusive language technologies and recommend that researchers working with low-resource languages consider more "traditional" language modelling approaches.

Title: Building a Few-Shot Cross-Domain Multilingual NLU Model for Customer Care

Authors: Saurabh Kumar, Sourav Bansal, Neeraj Agrawal, Priyanka Bhatt
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.04389
Pdf URL: https://arxiv.org/pdf/2506.04389
Copy Paste: [[2506.04389]] Building a Few-Shot Cross-Domain Multilingual NLU Model for Customer Care(https://arxiv.org/abs/2506.04389)
Keywords: chat, agent
Abstract: Customer care is an essential pillar of the e-commerce shopping experience, with companies spending millions of dollars each year, employing automation and human agents, across geographies (like US, Canada, Mexico, Chile), channels (like Chat, Interactive Voice Response (IVR)), and languages (like English, Spanish). SOTA pre-trained models like multilingual-BERT, fine-tuned on annotated data, have shown good performance in downstream tasks relevant to Customer Care. However, model performance is largely subject to the availability of sufficient annotated domain-specific data. Cross-domain availability of data remains a bottleneck, so building an intent classifier that generalizes across domains (defined by channel, geography, and language) with only a few annotations is of great practical value. In this paper, we propose an embedder-cum-classifier model architecture which extends state-of-the-art domain-specific models to other domains with only a few labeled samples. We adopt a supervised fine-tuning approach with isotropic regularizers to train a domain-specific sentence embedder and a multilingual knowledge distillation strategy to generalize this embedder across multiple domains. The trained embedder, further augmented with a simple linear classifier, can be deployed for new domains. Experiments on Canada and Mexico e-commerce Customer Care datasets with few-shot intent detection show an increase in accuracy by 20-23% against the existing state-of-the-art pre-trained models.

Title: MedAgentGym: Training LLM Agents for Code-Based Medical Reasoning at Scale

Authors: Ran Xu, Yuchen Zhuang, Yishan Zhong, Yue Yu, Xiangru Tang, Hang Wu, May D. Wang, Peifeng Ruan, Donghan Yang, Tao Wang, Guanghua Xiao, Carl Yang, Yang Xie, Wenqi Shi
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.04405
Pdf URL: https://arxiv.org/pdf/2506.04405
Copy Paste: [[2506.04405]] MedAgentGym: Training LLM Agents for Code-Based Medical Reasoning at Scale(https://arxiv.org/abs/2506.04405)
Keywords: language model, gpt, llm, agent
Abstract: We introduce MedAgentGYM, the first publicly available training environment designed to enhance coding-based medical reasoning capabilities in large language model (LLM) agents. MedAgentGYM comprises 72,413 task instances across 129 categories derived from authentic real-world biomedical scenarios. Tasks are encapsulated within executable coding environments, each featuring detailed task descriptions, interactive feedback mechanisms, verifiable ground-truth annotations, and scalable training trajectory generation. Extensive benchmarking of over 30 LLMs reveals a notable performance disparity between commercial API-based models and open-source counterparts. Leveraging MedAgentGYM, Med-Copilot-7B achieves substantial performance gains through supervised fine-tuning (+36.44%) and continued reinforcement learning (+42.47%), emerging as an affordable and privacy-preserving alternative competitive with gpt-4o. By offering both a comprehensive benchmark and accessible, expandable training resources within unified execution environments, MedAgentGYM delivers an integrated platform to develop LLM-based coding assistants for advanced biomedical research and practice.

Title: Unpacking Let Alone: Human-Scale Models Generalize to a Rare Construction in Form but not Meaning

Authors: Wesley Scivetti, Tatsuya Aoyama, Ethan Wilcox, Nathan Schneider
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.04408
Pdf URL: https://arxiv.org/pdf/2506.04408
Copy Paste: [[2506.04408]] Unpacking Let Alone: Human-Scale Models Generalize to a Rare Construction in Form but not Meaning(https://arxiv.org/abs/2506.04408)
Keywords: language model
Abstract: Humans have a remarkable ability to acquire and understand grammatical phenomena that are seen rarely, if ever, during childhood. Recent evidence suggests that language models with human-scale pretraining data may possess a similar ability by generalizing from frequent to rare constructions. However, it remains an open question how widespread this generalization ability is, and to what extent this knowledge extends to meanings of rare constructions, as opposed to just their forms. We fill this gap by testing human-scale transformer language models on their knowledge of both the form and meaning of the (rare and quirky) English LET-ALONE construction. To evaluate our LMs we construct a bespoke synthetic benchmark that targets syntactic and semantic properties of the construction. We find that human-scale LMs are sensitive to form, even when related constructions are filtered from the dataset. However, human-scale LMs do not make correct generalizations about LET-ALONE's meaning. These results point to an asymmetry in the current architectures' sample efficiency between language form and meaning, something which is not present in human language learners.

Title: Zero-Shot Open-Schema Entity Structure Discovery

Authors: Xueqiang Xu, Jinfeng Xiao, James Barry, Mohab Elkaref, Jiaru Zou, Pengcheng Jiang, Yunyi Zhang, Max Giammona, Geeth de Mel, Jiawei Han
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.04458
Pdf URL: https://arxiv.org/pdf/2506.04458
Copy Paste: [[2506.04458]] Zero-Shot Open-Schema Entity Structure Discovery(https://arxiv.org/abs/2506.04458)
Keywords: language model, llm
Abstract: Entity structure extraction, which aims to extract entities and their associated attribute-value structures from text, is an essential task for text understanding and knowledge graph construction. Existing methods based on large language models (LLMs) typically rely heavily on predefined entity attribute schemas or annotated datasets, often leading to incomplete extraction results. To address these challenges, we introduce Zero-Shot Open-schema Entity Structure Discovery (ZOES), a novel approach to entity structure extraction that does not require any schema or annotated samples. ZOES operates via a principled mechanism of enrichment, refinement, and unification, based on the insight that an entity and its associated structure are mutually reinforcing. Experiments demonstrate that ZOES consistently enhances LLMs' ability to extract more complete entity structures across three different domains, showcasing both the effectiveness and generalizability of the method. These findings suggest that such an enrichment, refinement, and unification mechanism may serve as a principled approach to improving the quality of LLM-based entity structure discovery in various scenarios.

Title: Watermarking Degrades Alignment in Language Models: Analysis and Mitigation

Authors: Apurv Verma, NhatHai Phan, Shubhendu Trivedi
Subjects: cs.CL, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2506.04462
Pdf URL: https://arxiv.org/pdf/2506.04462
Copy Paste: [[2506.04462]] Watermarking Degrades Alignment in Language Models: Analysis and Mitigation(https://arxiv.org/abs/2506.04462)
Keywords: language model, llm
Abstract: Watermarking techniques for large language models (LLMs) can significantly impact output quality, yet their effects on truthfulness, safety, and helpfulness remain critically underexamined. This paper presents a systematic analysis of how two popular watermarking approaches, Gumbel and KGW, affect these core alignment properties across four aligned LLMs. Our experiments reveal two distinct degradation patterns: guard attenuation, where enhanced helpfulness undermines model safety, and guard amplification, where excessive caution reduces model helpfulness. These patterns emerge from watermark-induced shifts in token distribution, surfacing the fundamental tension that exists between alignment objectives. To mitigate these degradations, we propose Alignment Resampling (AR), an inference-time sampling method that uses an external reward model to restore alignment. We establish a theoretical lower bound on the improvement in expected reward score as the sample size is increased and empirically demonstrate that sampling just 2-4 watermarked generations effectively recovers or surpasses baseline (unwatermarked) alignment scores. To overcome the limited response diversity of standard Gumbel watermarking, our modified implementation sacrifices strict distortion-freeness while maintaining robust detectability, ensuring compatibility with AR. Experimental results confirm that AR successfully recovers baseline alignment in both watermarking approaches, while maintaining strong watermark detectability. This work reveals the critical balance between watermark strength and model alignment, providing a simple inference-time solution to responsibly deploy watermarked LLMs in practice.
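Alignment Resampling is described above as drawing a few watermarked generations and using an external reward model to choose among them. A minimal best-of-n sketch of that idea follows. The names `alignment_resample`, `make_toy_generator`, and `toy_reward`, as well as the toy responses and scores, are hypothetical stand-ins; a real setup would plug in a watermarked sampler and a trained reward model, and the paper's exact selection rule may differ.

```python
from typing import Callable, Dict, List


def alignment_resample(prompt: str,
                       generate: Callable[[str], str],
                       reward: Callable[[str, str], float],
                       n: int = 4) -> str:
    """Draw n candidate responses and return the one the reward model scores highest."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda response: reward(prompt, response))


# Toy stand-ins (assumptions for illustration, not part of the paper).
TOY_SCORES: Dict[str, float] = {
    "I can't help with that.": 0.2,
    "Here is a safe, useful answer.": 0.9,
    "Sure, here is something risky.": 0.1,
}


def make_toy_generator(outputs: List[str]) -> Callable[[str], str]:
    """Return a generator that replays a fixed list of 'watermarked' outputs."""
    iterator = iter(outputs)
    return lambda prompt: next(iterator)


def toy_reward(prompt: str, response: str) -> float:
    """Look up a fixed score instead of calling a real reward model."""
    return TOY_SCORES[response]


best = alignment_resample("example prompt", make_toy_generator(list(TOY_SCORES)), toy_reward, n=3)
print(best)
```

Because every candidate comes from the watermarked sampler, the selected response still carries the watermark; only the choice among candidates is steered by the reward score.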

Title: Aligning Large Language Models with Implicit Preferences from User-Generated Content

Authors: Zhaoxuan Tan, Zheng Li, Tianyi Liu, Haodong Wang, Hyokun Yun, Ming Zeng, Pei Chen, Zhihan Zhang, Yifan Gao, Ruijie Wang, Priyanka Nigam, Bing Yin, Meng Jiang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.04463
Pdf URL: https://arxiv.org/pdf/2506.04463
Copy Paste: [[2506.04463]] Aligning Large Language Models with Implicit Preferences from User-Generated Content(https://arxiv.org/abs/2506.04463)
Keywords: language model, llm
Abstract: Learning from preference feedback is essential for aligning large language models (LLMs) with human values and improving the quality of generated responses. However, existing preference learning methods rely heavily on curated data from humans or advanced LLMs, which is costly and difficult to scale. In this work, we present PUGC, a novel framework that leverages implicit human Preferences in unlabeled User-Generated Content (UGC) to generate preference data. Although UGC is not explicitly created to guide LLMs in generating human-preferred responses, it often reflects valuable insights and implicit preferences from its creators that have the potential to address readers' questions. PUGC transforms UGC into user queries and generates responses from the policy model. The UGC is then leveraged as a reference text for response scoring, aligning the model with these implicit preferences. This approach improves the quality of preference data while enabling scalable, domain-specific alignment. Experimental results on Alpaca Eval 2 show that models trained with DPO and PUGC achieve a 9.37% performance improvement over traditional methods, setting a 35.93% state-of-the-art length-controlled win rate using Mistral-7B-Instruct. Further studies highlight gains in reward quality, domain-specific alignment effectiveness, robustness against UGC quality, and theory of mind capabilities. Our code and dataset are available at this https URL.

Title: SQLens: An End-to-End Framework for Error Detection and Correction in Text-to-SQL

Authors: Yue Gong, Chuan Lei, Xiao Qin, Kapil Vaidya, Balakrishnan Narayanaswamy, Tim Kraska
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.04494
Pdf URL: https://arxiv.org/pdf/2506.04494
Copy Paste: [[2506.04494]] SQLens: An End-to-End Framework for Error Detection and Correction in Text-to-SQL(https://arxiv.org/abs/2506.04494)
Keywords: language model, llm
Abstract: Text-to-SQL systems translate natural language (NL) questions into SQL queries, enabling non-technical users to interact with structured data. While large language models (LLMs) have shown promising results on the text-to-SQL task, they often produce semantically incorrect yet syntactically valid queries, with limited insight into their reliability. We propose SQLens, an end-to-end framework for fine-grained detection and correction of semantic errors in LLM-generated SQL. SQLens integrates error signals from both the underlying database and the LLM to identify potential semantic errors within SQL clauses. It further leverages these signals to guide query correction. Empirical results on two public benchmarks show that SQLens outperforms the best LLM-based self-evaluation method by 25.78% in F1 for error detection, and improves execution accuracy of out-of-the-box text-to-SQL systems by up to 20%.
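The abstract mentions error signals drawn from the underlying database without listing them. As a hedged illustration of what a database-derived signal could look like, the sketch below flags a WHERE-clause literal that matches no stored value, which often indicates a semantically wrong but syntactically valid query. The helper `value_exists`, the `customers` table, and its rows are invented for this example and are not claimed to be part of SQLens.

```python
import sqlite3


def value_exists(conn: sqlite3.Connection, table: str, column: str, value: str) -> bool:
    """Check whether `value` occurs in table.column.

    Identifiers cannot be bound as parameters, so they are double-quoted here;
    the literal itself is passed as a bound parameter.
    """
    query = f'SELECT 1 FROM "{table}" WHERE "{column}" = ? LIMIT 1'
    return conn.execute(query, (value,)).fetchone() is not None


# Hypothetical toy database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, country TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [("Ana", "USA"), ("Bo", "Canada")])

# Suppose a model produced: SELECT name FROM customers WHERE country = 'United States'
# The query runs, but the literal never appears in the column, which is a warning sign.
suspicious = not value_exists(conn, "customers", "country", "United States")
print(suspicious)
```

A signal like this only points at a clause worth re-examining; deciding how to correct the query would still need the question and schema context.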

Title: DRE: An Effective Dual-Refined Method for Integrating Small and Large Language Models in Open-Domain Dialogue Evaluation

Authors: Kun Zhao, Bohao Yang, Chen Tang, Siyuan Dai, Haoteng Tang, Chenghua Lin, Liang Zhan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.04516
Pdf URL: https://arxiv.org/pdf/2506.04516
Copy Paste: [[2506.04516]] DRE: An Effective Dual-Refined Method for Integrating Small and Large Language Models in Open-Domain Dialogue Evaluation(https://arxiv.org/abs/2506.04516)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) excel at many tasks but struggle with ambiguous scenarios where multiple valid responses exist, often yielding unreliable results. Conversely, Small Language Models (SLMs) demonstrate robustness in such scenarios but are susceptible to misleading or adversarial inputs. We observed that LLMs handle negative examples effectively, while SLMs excel with positive examples. To leverage their complementary strengths, we introduce SLIDE (Small and Large Integrated for Dialogue Evaluation), a method integrating SLMs and LLMs via adaptive weighting. Building on SLIDE, we further propose a Dual-Refinement Evaluation (DRE) method to enhance SLM-LLM integration: (1) SLM-generated insights guide the LLM to produce initial evaluations; (2) SLM-derived adjustments refine the LLM's scores for improved accuracy. Experiments demonstrate that DRE outperforms existing methods, showing stronger alignment with human judgment across diverse benchmarks. This work illustrates how combining small and large models can yield more reliable evaluation tools, particularly for open-ended tasks such as dialogue evaluation.

Title: Please Translate Again: Two Simple Experiments on Whether Human-Like Reasoning Helps Translation

Authors: Di Wu, Seth Aycock, Christof Monz
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.04521
Pdf URL: https://arxiv.org/pdf/2506.04521
Copy Paste: [[2506.04521]] Please Translate Again: Two Simple Experiments on Whether Human-Like Reasoning Helps Translation(https://arxiv.org/abs/2506.04521)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Large Language Models (LLMs) demonstrate strong reasoning capabilities for many tasks, often by explicitly decomposing the task via Chain-of-Thought (CoT) reasoning. Recent work on LLM-based translation designs hand-crafted prompts to decompose translation, or trains models to incorporate intermediate steps. Translating Step-by-step (Briakou et al., 2024), for instance, introduces a multi-step prompt with decomposition and refinement of translation with LLMs, which achieved state-of-the-art results on WMT24. In this work, we scrutinise this strategy's effectiveness. Empirically, we find no clear evidence that performance gains stem from explicitly decomposing the translation process, at least for the models on test; and we show that simply prompting LLMs to "translate again" yields even better results than human-like step-by-step prompting. Our analysis does not rule out the role of reasoning, but instead invites future work exploring the factors for CoT's effectiveness in the context of translation.
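The "translate again" baseline amounts to a two-call protocol: request a translation, then show the model its own draft and ask for another attempt. The sketch below illustrates that flow under stated assumptions: the prompt wording is invented here and is not the paper's, `translate_again` is a hypothetical helper, and `echo_llm` is a stub standing in for a real model call.

```python
from typing import Callable, List


def translate_again(source: str, src_lang: str, tgt_lang: str,
                    llm: Callable[[str], str]) -> str:
    """Ask for a translation, then ask the model to translate once more given its draft."""
    first_prompt = f"Translate the following {src_lang} text into {tgt_lang}:\n{source}"
    draft = llm(first_prompt)
    second_prompt = f"{first_prompt}\n\nTranslation: {draft}\n\nPlease translate again."
    return llm(second_prompt)


# Stub model that records each prompt and returns a numbered placeholder.
calls: List[str] = []


def echo_llm(prompt: str) -> str:
    calls.append(prompt)
    return f"<translation {len(calls)}>"


final = translate_again("Guten Morgen", "German", "English", echo_llm)
print(final)
```

Unlike a step-by-step prompt, the second call carries no instructions about analysis or refinement stages; it only repeats the request with the draft in context.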

Title: Is It JUST Semantics? A Case Study of Discourse Particle Understanding in LLMs

Authors: William Sheffield, Kanishka Misra, Valentina Pyatkin, Ashwini Deo, Kyle Mahowald, Junyi Jessy Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.04534
Pdf URL: https://arxiv.org/pdf/2506.04534
Copy Paste: [[2506.04534]] Is It JUST Semantics? A Case Study of Discourse Particle Understanding in LLMs(https://arxiv.org/abs/2506.04534)
Keywords: llm
Abstract: Discourse particles are crucial elements that subtly shape the meaning of text. These words, often polyfunctional, give rise to nuanced and often quite disparate semantic/discourse effects, as exemplified by the diverse uses of the particle "just" (e.g., exclusive, temporal, emphatic). This work investigates the capacity of LLMs to distinguish the fine-grained senses of English "just", a well-studied example in formal semantics, using data meticulously created and labeled by expert linguists. Our findings reveal that while LLMs exhibit some ability to differentiate between broader categories, they struggle to fully capture more subtle nuances, highlighting a gap in their understanding of discourse particles.

Title: BSBench: will your LLM find the largest prime number?

Authors: K. O. T. Erziev
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.04535
Pdf URL: https://arxiv.org/pdf/2506.04535
Copy Paste: [[2506.04535]] BSBench: will your LLM find the largest prime number?(https://arxiv.org/abs/2506.04535)
Keywords: llm
Abstract: We propose that benchmarking LLMs on questions which have no reasonable answer actually isn't as silly as it sounds. We also present a benchmark that allows such testing and a method to modify the existing datasets, and discover that existing models demonstrate a performance far from perfect on such questions. Our code and data artifacts are available at this https URL.

Title: SSA-COMET: Do LLMs Outperform Learned Metrics in Evaluating MT for Under-Resourced African Languages?

Authors: Senyu Li, Jiayi Wang, Felermino D. M. A. Ali, Colin Cherry, Daniel Deutsch, Eleftheria Briakou, Rui Sousa-Silva, Henrique Lopes Cardoso, Pontus Stenetorp, David Ifeoluwa Adelani
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.04557
Pdf URL: https://arxiv.org/pdf/2506.04557
Copy Paste: [[2506.04557]] SSA-COMET: Do LLMs Outperform Learned Metrics in Evaluating MT for Under-Resourced African Languages?(https://arxiv.org/abs/2506.04557)
Keywords: gpt, llm, prompt
Abstract: Evaluating machine translation (MT) quality for under-resourced African languages remains a significant challenge, as existing metrics often suffer from limited language coverage and poor performance in low-resource settings. While recent efforts, such as AfriCOMET, have addressed some of the issues, they are still constrained by small evaluation sets, a lack of publicly available training data tailored to African languages, and inconsistent performance in extremely low-resource scenarios. In this work, we introduce SSA-MTE, a large-scale human-annotated MT evaluation (MTE) dataset covering 13 African language pairs from the News domain, with over 63,000 sentence-level annotations from a diverse set of MT systems. Based on this data, we develop SSA-COMET and SSA-COMET-QE, improved reference-based and reference-free evaluation metrics. We also benchmark prompting-based approaches using state-of-the-art LLMs like GPT-4o and Claude. Our experimental results show that SSA-COMET models significantly outperform AfriCOMET and are competitive with the strongest LLM (Gemini 2.5 Pro) evaluated in our study, particularly on low-resource languages such as Twi, Luo, and Yoruba. All resources are released under open licenses to support future research.

Title: Demonstrations of Integrity Attacks in Multi-Agent Systems

Authors: Can Zheng, Yuhan Cao, Xiaoning Dong, Tianxing He
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.04572
Pdf URL: https://arxiv.org/pdf/2506.04572
Copy Paste: [[2506.04572]] Demonstrations of Integrity Attacks in Multi-Agent Systems(https://arxiv.org/abs/2506.04572)
Keywords: language model, gpt, llm, prompt, agent
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language understanding, code generation, and complex planning. Simultaneously, Multi-Agent Systems (MAS) have garnered attention for their potential to enable cooperation among distributed agents. However, from a multi-party perspective, MAS could be vulnerable to malicious agents that exploit the system to serve self-interests without disrupting its core functionality. This work explores integrity attacks where malicious agents employ subtle prompt manipulation to bias MAS operations and gain various benefits. Four types of attacks are examined: Scapegoater, who misleads the system monitor to underestimate other agents' contributions; Boaster, who misleads the system monitor to overestimate their own performance; Self-Dealer, who manipulates other agents to adopt certain tools; and Free-Rider, who hands off its own task to others. We demonstrate that strategically crafted prompts can introduce systematic biases in MAS behavior and executable instructions, enabling malicious agents to effectively mislead evaluation systems and manipulate collaborative agents. Furthermore, our attacks can bypass advanced LLM-based monitors, such as GPT-4o-mini and o3-mini, highlighting the limitations of current detection mechanisms. Our findings underscore the critical need for MAS architectures with robust security protocols and content validation mechanisms, alongside monitoring systems capable of comprehensive risk scenario assessment.

Title: Reasoning or Overthinking: Evaluating Large Language Models on Financial Sentiment Analysis

Authors: Dimitris Vamvourellis, Dhagash Mehta
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.04574
Pdf URL: https://arxiv.org/pdf/2506.04574
Copy Paste: [[2506.04574]] Reasoning or Overthinking: Evaluating Large Language Models on Financial Sentiment Analysis(https://arxiv.org/abs/2506.04574)
Keywords: language model, gpt, llm, prompt, chain-of-thought
Abstract: We investigate the effectiveness of large language models (LLMs), including reasoning-based and non-reasoning models, in performing zero-shot financial sentiment analysis. Using the Financial PhraseBank dataset annotated by domain experts, we evaluate how various LLMs and prompting strategies align with human-labeled sentiment in a financial context. We compare three proprietary LLMs (GPT-4o, GPT-4.1, o3-mini) under different prompting paradigms that simulate System 1 (fast and intuitive) or System 2 (slow and deliberate) thinking and benchmark them against two smaller models (FinBERT-Prosus, FinBERT-Tone) fine-tuned on financial sentiment analysis. Our findings suggest that reasoning, either through prompting or inherent model design, does not improve performance on this task. Surprisingly, the most accurate and human-aligned combination of model and method was GPT-4o without any Chain-of-Thought (CoT) prompting. We further explore how performance is impacted by linguistic complexity and annotation agreement levels, uncovering that reasoning may introduce overthinking, leading to suboptimal predictions. This suggests that for financial sentiment classification, fast, intuitive "System 1"-like thinking aligns more closely with human judgment compared to "System 2"-style slower, deliberative reasoning simulated by reasoning models or CoT prompting. Our results challenge the default assumption that more reasoning always leads to better LLM decisions, particularly in high-stakes financial applications.

Title: Are LLMs Reliable Translators of Logical Reasoning Across Lexically Diversified Contexts?

Authors: Qingchuan Li, Jiatong Li, Zirui Liu, Mingyue Cheng, Yuting Zeng, Qi Liu, Tongxuan Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.04575
Pdf URL: https://arxiv.org/pdf/2506.04575
Copy Paste: [[2506.04575]] Are LLMs Reliable Translators of Logical Reasoning Across Lexically Diversified Contexts?(https://arxiv.org/abs/2506.04575)
Keywords: language model, llm
Abstract: Neuro-symbolic approaches that combine large language models (LLMs) with solvers excel at logical reasoning problems that need long reasoning chains. In this paradigm, LLMs serve as translators, converting natural language reasoning problems into formal logic formulas; reliable symbolic solvers then return correct solutions. Despite this success, we find that LLMs, as translators, struggle to handle lexical diversification, a common linguistic phenomenon, indicating that LLMs as logic translators are unreliable in real-world scenarios. Moreover, existing logical reasoning benchmarks lack lexical diversity, failing to challenge LLMs' ability to translate such text and thus obscuring this issue. In this work, we propose SCALe, a benchmark designed to address this significant gap through **logic-invariant lexical diversification**. By using LLMs to transform original benchmark datasets into lexically diversified but logically equivalent versions, we evaluate LLMs' ability to consistently map diverse expressions to uniform logical symbols on these new datasets. Experiments using SCALe further confirm that current LLMs exhibit deficiencies in this capability. Building directly on the deficiencies identified through our benchmark, we propose a new method, MenTaL, to address this limitation. This method guides LLMs to first construct a table unifying diverse expressions before performing translation. Applying MenTaL through in-context learning and supervised fine-tuning (SFT) significantly improves the performance of LLM translators on lexically diversified text. Our code is now available at this https URL.

Title: Selecting Demonstrations for Many-Shot In-Context Learning via Gradient Matching

Authors: Jianfei Zhang, Bei Li, Jun Bai, Rumei Li, Yanmeng Wang, Chenghua Lin, Wenge Rong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.04579
Pdf URL: https://arxiv.org/pdf/2506.04579
Copy Paste: [[2506.04579]] Selecting Demonstrations for Many-Shot In-Context Learning via Gradient Matching(https://arxiv.org/abs/2506.04579)
Keywords: language model, llm
Abstract: In-Context Learning (ICL) empowers Large Language Models (LLMs) for rapid task adaptation without Fine-Tuning (FT), but its reliance on demonstration selection remains a critical challenge. While many-shot ICL shows promising performance through scaled demonstrations, the selection method for many-shot demonstrations remains limited to random selection in existing work. Since the conventional instance-level retrieval is not suitable for many-shot scenarios, we hypothesize that the data requirements for in-context learning and fine-tuning are analogous. To this end, we introduce a novel gradient matching approach that selects demonstrations by aligning fine-tuning gradients between the entire training set of the target task and the selected examples, so as to approach the learning effect on the entire training set within the selected examples. Through gradient matching on relatively small models, e.g., Qwen2.5-3B or Llama3-8B, our method consistently outperforms random selection on larger LLMs from 4-shot to 128-shot scenarios across 9 diverse datasets. For instance, it surpasses random selection by 4% on Qwen2.5-72B and Llama3-70B, and by around 2% on 5 closed-source LLMs. This work unlocks more reliable and effective many-shot ICL, paving the way for its broader application.
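The gradient matching idea in this abstract can be illustrated with a small sketch. The abstract does not say how the matching objective is optimized, so the greedy search and the cosine-similarity criterion below are assumptions, and the function names and toy gradient values are hypothetical. This is not the authors' algorithm, only one plausible way to pick examples whose mean gradient points in the same direction as the full training set's mean gradient.

```python
import math


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_by_gradient_matching(grads, k):
    """Greedily pick k indices whose mean gradient best aligns with the full-set mean.

    Assumption: alignment is measured by cosine similarity and optimized greedily.
    """
    target = mean_vector(grads)
    chosen = []
    for _ in range(k):
        best_idx, best_score = None, -2.0  # cosine is always >= -1
        for i in range(len(grads)):
            if i in chosen:
                continue
            candidate = mean_vector([grads[j] for j in chosen + [i]])
            score = cosine(candidate, target)
            if score > best_score:
                best_idx, best_score = i, score
        chosen.append(best_idx)
    return chosen


# Hypothetical per-example fine-tuning gradients over two parameters.
toy_grads = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
selected = select_by_gradient_matching(toy_grads, k=2)
print(selected)
```

In a real setting the gradient vectors would come from a small proxy model, as the abstract describes for Qwen2.5-3B or Llama3-8B, and the selected examples would then be used as demonstrations for a larger model.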

Title: SUCEA: Reasoning-Intensive Retrieval for Adversarial Fact-checking through Claim Decomposition and Editing

Authors: Hongjun Liu, Yilun Zhao, Arman Cohan, Chen Zhao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.04583
Pdf URL: https://arxiv.org/pdf/2506.04583
Copy Paste: [[2506.04583]] SUCEA: Reasoning-Intensive Retrieval for Adversarial Fact-checking through Claim Decomposition and Editing(https://arxiv.org/abs/2506.04583)
Keywords: language model
Abstract: Automatic fact-checking has recently received more attention as a means of combating misinformation. Despite significant advancements, fact-checking systems based on retrieval-augmented language models still struggle to tackle adversarial claims, which are intentionally designed by humans to challenge fact-checking systems. To address these challenges, we propose a training-free method designed to rephrase the original claim, making it easier to locate supporting evidence. Our modular framework, SUCEA, decomposes the task into three steps: 1) Claim Segmentation and Decontextualization that segments adversarial claims into independent sub-claims; 2) Iterative Evidence Retrieval and Claim Editing that iteratively retrieves evidence and edits the sub-claim based on the retrieved evidence; 3) Evidence Aggregation and Label Prediction that aggregates all retrieved evidence and predicts the entailment label. Experiments on two challenging fact-checking datasets demonstrate that our framework significantly improves on both retrieval and entailment label accuracy, outperforming four strong claim-decomposition-based baselines.

Title: MuSciClaims: Multimodal Scientific Claim Verification

Authors: Yash Kumar Lal, Manikanta Bandham, Mohammad Saqib Hasan, Apoorva Kashi, Mahnaz Koupaee, Niranjan Balasubramanian
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.04585
Pdf URL: https://arxiv.org/pdf/2506.04585
Copy Paste: [[2506.04585]] MuSciClaims: Multimodal Scientific Claim Verification(https://arxiv.org/abs/2506.04585)
Keywords: language model
Abstract: Assessing scientific claims requires identifying, extracting, and reasoning with multimodal data expressed in information-rich figures in scientific literature. Despite the large body of work in scientific QA, figure captioning, and other multimodal reasoning tasks over chart-based data, there are no readily usable multimodal benchmarks that directly test claim verification abilities. To remedy this gap, we introduce a new benchmark, MuSciClaims, accompanied by diagnostic tasks. We automatically extract supported claims from scientific articles, which we manually perturb to produce contradicted claims. The perturbations are designed to test for a specific set of claim verification capabilities. We also introduce a suite of diagnostic tasks that help understand model failures. Our results show most vision-language models are poor (~0.3-0.5 F1), with even the best model only achieving 0.77 F1. They are also biased towards judging claims as supported, likely misunderstanding nuanced perturbations within the claims. Our diagnostics show models are bad at localizing correct evidence within figures, struggle with aggregating information across modalities, and often fail to understand basic components of the figure.

Title: LESS: Large Language Model Enhanced Semi-Supervised Learning for Speech Foundational Models

Authors: Wen Ding, Fan Qian
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2506.04586
Pdf URL: https://arxiv.org/pdf/2506.04586
Copy Paste: [[2506.04586]] LESS: Large Language Model Enhanced Semi-Supervised Learning for Speech Foundational Models(https://arxiv.org/abs/2506.04586)
Keywords: language model, llm, prompt
Abstract: We introduce LESS (Large Language Model Enhanced Semi-supervised Learning), a versatile framework that leverages Large Language Models (LLMs) to correct pseudo labels generated from in-the-wild data. Within the LESS framework, pseudo-labeled text from Automatic Speech Recognition (ASR) or Automatic Speech Translation (AST) of the unsupervised data is refined by an LLM, and augmented by a data filtering strategy to optimize LLM knowledge transfer efficiency. Experiments on both Mandarin ASR and Spanish-to-English AST tasks show that LESS achieves a notable absolute WER reduction of 3.77% on the Wenet Speech test set, as well as BLEU scores of 34.0 and 64.7 on Callhome and Fisher test sets respectively. These results validate the adaptability of LESS across different languages, tasks, and domains. Ablation studies conducted with various LLMs and prompt configurations provide novel insights into leveraging LLM-derived knowledge for speech processing applications.
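For readers unfamiliar with the metric, an "absolute WER reduction" is the difference between two word error rates, expressed in percentage points. The sketch below computes WER as word-level edit distance divided by reference length. The whitespace tokenization and the toy sentences are assumptions for illustration, and Mandarin systems are often scored per character rather than per word, which this example does not attempt to model.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical transcripts before and after pseudo-label correction.
reference = "the cat sat on the mat"
wer_before = word_error_rate(reference, "the cat sit on mat")
wer_after = word_error_rate(reference, "the cat sat on mat")
absolute_reduction_points = (wer_before - wer_after) * 100
print(round(wer_before, 4), round(wer_after, 4), round(absolute_reduction_points, 2))
```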

Title: Safe: Enhancing Mathematical Reasoning in Large Language Models via Retrospective Step-aware Formal Verification

Authors: Chengwu Liu, Ye Yuan, Yichun Yin, Yan Xu, Xin Xu, Zaoyu Chen, Yasheng Wang, Lifeng Shang, Qun Liu, Ming Zhang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.04592
Pdf URL: https://arxiv.org/pdf/2506.04592
Copy Paste: [[2506.04592]] Safe: Enhancing Mathematical Reasoning in Large Language Models via Retrospective Step-aware Formal Verification(https://arxiv.org/abs/2506.04592)
Keywords: language model, llm, hallucination, prompt, chain-of-thought
Abstract: Chain-of-Thought (CoT) prompting has become the de facto method to elicit reasoning capabilities from large language models (LLMs). However, to mitigate hallucinations in CoT that are notoriously difficult to detect, current methods such as process reward models (PRMs) or self-consistency operate as opaque boxes and do not provide checkable evidence for their judgments, possibly limiting their effectiveness. To address this issue, we draw inspiration from the idea that "the gold standard for supporting a mathematical claim is to provide a proof". We propose a retrospective, step-aware formal verification framework, Safe. Rather than assigning arbitrary scores, we strive to articulate mathematical claims in the formal mathematical language Lean 4 at each reasoning step and provide formal proofs to identify hallucinations. We evaluate our framework Safe across multiple language models and various mathematical datasets, demonstrating a significant performance improvement while offering interpretable and verifiable evidence. We also propose FormalStep as a benchmark for step correctness theorem proving with 30,809 formal statements. To the best of our knowledge, our work represents the first endeavor to utilize the formal mathematical language Lean 4 for verifying natural language content generated by LLMs, aligning with the reason why formal mathematical languages were created in the first place: to provide a robust foundation for hallucination-prone human-written proofs.

Title: A MISMATCHED Benchmark for Scientific Natural Language Inference

Authors: Firoz Shaik, Mobashir Sadat, Nikita Gautam, Doina Caragea, Cornelia Caragea
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.04603
Pdf URL: https://arxiv.org/pdf/2506.04603
Copy Paste: [[2506.04603]] A MISMATCHED Benchmark for Scientific Natural Language Inference(https://arxiv.org/abs/2506.04603)
Keywords: language model, llm
Abstract: Scientific Natural Language Inference (NLI) is the task of predicting the semantic relation between a pair of sentences extracted from research articles. Existing datasets for this task are derived from various computer science (CS) domains, whereas non-CS domains are completely ignored. In this paper, we introduce a novel evaluation benchmark for scientific NLI, called MISMATCHED. The new MISMATCHED benchmark covers three non-CS domains: PSYCHOLOGY, ENGINEERING, and PUBLIC HEALTH, and contains 2,700 human-annotated sentence pairs. We establish strong baselines on MISMATCHED using both Pre-trained Small Language Models (SLMs) and Large Language Models (LLMs). Our best performing baseline shows a Macro F1 of only 78.17%, illustrating the substantial headroom for future improvements. In addition to introducing the MISMATCHED benchmark, we show that incorporating sentence pairs having an implicit scientific NLI relation between them in model training improves their performance on scientific NLI. We make our dataset and code publicly available on GitHub.

Title: Revisiting Test-Time Scaling: A Survey and a Diversity-Aware Method for Efficient Reasoning

Authors: Ho-Lam Chung, Teng-Yun Hsiao, Hsiao-Ying Huang, Chunerh Cho, Jian-Ren Lin, Zhang Ziwei, Yun-Nung Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.04611
Pdf URL: https://arxiv.org/pdf/2506.04611
Copy Paste: [[2506.04611]] Revisiting Test-Time Scaling: A Survey and a Diversity-Aware Method for Efficient Reasoning(https://arxiv.org/abs/2506.04611)
Keywords: language model, llm
Abstract: Test-Time Scaling (TTS) improves the reasoning performance of Large Language Models (LLMs) by allocating additional compute during inference. We conduct a structured survey of TTS methods and categorize them into sampling-based, search-based, and trajectory optimization strategies. We observe that reasoning-optimized models often produce less diverse outputs, which limits TTS effectiveness. To address this, we propose ADAPT (A Diversity Aware Prefix fine-Tuning), a lightweight method that applies prefix tuning with a diversity-focused data strategy. Experiments on mathematical reasoning tasks show that ADAPT reaches 80% accuracy using eight times less compute than strong baselines. Our findings highlight the essential role of generative diversity in maximizing TTS effectiveness.
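Of the three TTS categories named in this abstract, the sampling-based one is the easiest to illustrate: draw several answers and keep the most frequent. The sketch below shows majority voting together with a simple distinct-answer ratio. The ratio is a hypothetical diversity measure chosen for illustration, not one taken from the paper, and the sampled answers are made up. ADAPT itself, which relies on prefix tuning, is not reproduced here.

```python
from collections import Counter


def majority_vote(answers):
    """Return (most frequent answer, its count) from a non-empty list of sampled answers."""
    return Counter(answers).most_common(1)[0]


def distinct_ratio(answers):
    """Fraction of sampled answers that are distinct (hypothetical diversity measure)."""
    return len(set(answers)) / len(answers)


# Hypothetical final answers from eight sampled reasoning traces.
samples = ["42", "42", "41", "42", "40", "42", "42", "39"]
winner, votes = majority_vote(samples)
diversity = distinct_ratio(samples)
print(winner, votes, diversity)
```

If every sample returned the same answer, the distinct ratio would fall to 1/8 and extra samples would add no information, which is one way to read the abstract's observation that low output diversity limits TTS effectiveness.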

Title: Subjective Perspectives within Learned Representations Predict High-Impact Innovation

Authors: Likun Cao, Rui Pan, James Evans
Subjects: cs.CL, stat.AP, stat.ML
Abstract URL: https://arxiv.org/abs/2506.04616
Pdf URL: https://arxiv.org/pdf/2506.04616
Copy Paste: [[2506.04616]] Subjective Perspectives within Learned Representations Predict High-Impact Innovation(https://arxiv.org/abs/2506.04616)
Keywords: language model, agent
Abstract: Existing studies of innovation emphasize the power of social structures to shape innovation capacity. Emerging machine learning approaches, however, enable us to model innovators' personal perspectives and interpersonal innovation opportunities as a function of their prior trajectories of experience. We theorize and then quantify subjective perspectives and innovation opportunities based on innovator positions within the geometric space of concepts inscribed by dynamic language representations. Using data on millions of scientists, inventors, writers, entrepreneurs, and Wikipedia contributors across the creative domains of science, technology, film, entrepreneurship, and Wikipedia, here we show that measured subjective perspectives anticipate what ideas individuals and groups creatively attend to and successfully combine in the future. When perspective and background diversity are decomposed as the angular difference between collaborators' perspectives on their creation and between their experiences, the former consistently anticipates creative achievement while the latter portends its opposite, across all cases and time periods examined. We analyze a natural experiment and simulate creative collaborations between AI (large language model) agents designed with various levels of perspective and background diversity; the results are consistent with our observational findings. We explore mechanisms underlying these findings and identify how successful collaborators leverage common language to weave together diverse experience obtained through trajectories of prior work that converge to provoke one another and innovate. We explore the importance of these findings for team assembly and research policy.
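The "angular difference" mentioned in this abstract is the angle between two vectors in an embedding space. A minimal sketch follows, assuming each collaborator is represented by one perspective vector and one background vector. How the paper actually derives those vectors is not stated in the abstract, so the two-dimensional values below are purely hypothetical.

```python
import math


def angle_degrees(u, v):
    """Angle between two non-zero vectors, in degrees."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos_theta = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cos_theta))


# Hypothetical embeddings for two collaborators.
perspective_a, perspective_b = [1.0, 0.0], [0.0, 1.0]
background_a, background_b = [2.0, 0.0], [1.0, 0.0]

perspective_diversity = angle_degrees(perspective_a, perspective_b)
background_diversity = angle_degrees(background_a, background_b)
print(perspective_diversity, background_diversity)
```

In this toy pair the collaborators view the project from orthogonal directions while sharing the same background direction, the configuration the abstract associates with creative achievement.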

Title: Advancing Tool-Augmented Large Language Models via Meta-Verification and Reflection Learning

Authors: Zhiyuan Ma, Jiayu Liu, Xianzhen Luo, Zhenya Huang, Qingfu Zhu, Wanxiang Che
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.04625
Pdf URL: https://arxiv.org/pdf/2506.04625
Copy Paste: [[2506.04625]] Advancing Tool-Augmented Large Language Models via Meta-Verification and Reflection Learning(https://arxiv.org/abs/2506.04625)
Keywords: language model, gpt, llm, agent
Abstract: Empowering large language models (LLMs) with effective tool utilization capabilities is crucial for enabling AI agents to solve complex problems. However, current models face two major limitations: (1) unreliable tool planning and invocation due to low-quality instruction datasets (e.g., widespread hallucinated API calls), and (2) weak tool reflection abilities (over 90% of errors cannot be corrected) resulting from static imitation learning. To address these critical limitations, we propose Tool-MVR, a novel Tool-Augmented LLM that achieves comprehensive System 2 reasoning through two key innovations. Specifically, we first introduce Multi-Agent Meta-Verification (MAMV), a systematic pipeline that rigorously validates APIs, queries, and reasoning trajectories to construct ToolBench-V, a new high-quality instruction dataset that addresses the limitation of unreliable tool planning and invocation. Second, we propose Exploration-based Reflection Learning (EXPLORE), which enhances tool reflection capabilities by leveraging tool feedback through a dynamic "Error -> Reflection -> Correction" learning paradigm, resulting in our reflection dataset ToolBench-R and addressing the critical weakness in tool reflection. Finally, we obtain Tool-MVR by finetuning open-source LLMs (e.g., Qwen-7B) on both ToolBench-V and ToolBench-R. Our experiments demonstrate that Tool-MVR achieves state-of-the-art performance on StableToolBench, surpassing both ToolLLM (by 23.9%) and GPT-4 (by 15.3%) while reducing API calls by 31.4%, with strong generalization capabilities across unseen tools and scenarios. Additionally, on our proposed RefineToolBench, the first benchmark specifically designed to evaluate tool reflection capabilities, Tool-MVR achieves a 58.9% error correction rate, significantly outperforming ToolLLM's 9.1%.

Title: TaDA: Training-free recipe for Decoding with Adaptive KV Cache Compression and Mean-centering

Authors: Vinay Joshi, Pratik Prabhanjan Brahma, Zicheng Liu, Emad Barsoum
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.04642
Pdf URL: https://arxiv.org/pdf/2506.04642
Copy Paste: [[2506.04642]] TaDA: Training-free recipe for Decoding with Adaptive KV Cache Compression and Mean-centering(https://arxiv.org/abs/2506.04642)
Keywords: language model
Abstract: The key-value (KV) cache in transformer models is a critical component for efficient decoding or inference, yet its memory demands scale poorly with sequence length, posing a major challenge for scalable deployment of large language models. Among several approaches to KV cache compression, quantization of key and value activations has been widely explored. Most KV cache quantization methods still need to manage sparse and noncontiguous outliers separately. To address this, we introduce TaDA, a training-free recipe for KV cache compression with quantization precision that adapts to error sensitivity across layers and mean centering to eliminate separate outlier handling. Our approach yields substantial accuracy improvements for multiple models supporting various context lengths. Moreover, our approach does not need to separately manage outlier elements -- a persistent hurdle in most traditional quantization methods. Experiments on standard benchmarks demonstrate that our technique reduces KV cache memory footprint to 27% of the original 16-bit baseline while achieving comparable accuracy. Our method paves the way for scalable and high-performance reasoning in language models by potentially enabling inference for longer context length models, reasoning models, and longer chains of thought.
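To see why mean centering can help low-bit quantization, consider values that cluster around a large offset. The sketch below compares plain uniform quantization with quantization after subtracting the mean. The symmetric 4-bit quantizer, the choice to center over a single list of values, and the sample numbers are all assumptions made for illustration. The abstract does not specify TaDA's quantizer, the axis along which the mean is taken, or how per-layer precision is chosen, and the storage cost of the mean is ignored here.

```python
def quantize_dequantize(values, bits):
    """Uniform symmetric quantization to signed `bits`-bit levels, then dequantization."""
    levels = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return list(values)
    scale = max_abs / levels
    return [round(v / scale) * scale for v in values]


def mean_centered_quantize(values, bits):
    """Subtract the mean, quantize the residuals, then add the mean back."""
    mean = sum(values) / len(values)
    residuals = [v - mean for v in values]
    return [q + mean for q in quantize_dequantize(residuals, bits)]


def mean_squared_error(original, restored):
    return sum((a - b) ** 2 for a, b in zip(original, restored)) / len(original)


# Hypothetical activations with a large shared offset.
channel = [10.1, 10.4, 9.8, 10.3, 9.9, 10.2]
plain_error = mean_squared_error(channel, quantize_dequantize(channel, 4))
centered_error = mean_squared_error(channel, mean_centered_quantize(channel, 4))

# The reported footprint of 27% of a 16-bit baseline is about 4.32 bits per element on average.
average_bits = 0.27 * 16
print(plain_error, centered_error, average_bits)
```

Without centering, the quantization grid has to span the full offset, so nearby values collapse onto the same level. After centering, the same number of levels covers only the spread around the mean.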

Title: Flex-TravelPlanner: A Benchmark for Flexible Planning with Language Agents

Authors: Juhyun Oh, Eunsu Kim, Alice Oh
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.04649
Pdf URL: https://arxiv.org/pdf/2506.04649
Copy Paste: [[2506.04649]] Flex-TravelPlanner: A Benchmark for Flexible Planning with Language Agents(https://arxiv.org/abs/2506.04649)
Keywords: language model, gpt, llm, agent
Abstract: Real-world planning problems require constant adaptation to changing requirements and balancing of competing constraints. However, current benchmarks for evaluating LLMs' planning capabilities primarily focus on static, single-turn scenarios. We introduce Flex-TravelPlanner, a benchmark that evaluates language models' ability to reason flexibly in dynamic planning scenarios. Building on the TravelPlanner dataset (Xie et al., 2024), we introduce two novel evaluation settings: (1) sequential constraint introduction across multiple turns, and (2) scenarios with explicitly prioritized competing constraints. Our analysis of GPT-4o and Llama 3.1 70B reveals several key findings: models' performance on single-turn tasks poorly predicts their ability to adapt plans across multiple turns; constraint introduction order significantly affects performance; and models struggle with constraint prioritization, often incorrectly favoring newly introduced lower-priority preferences over existing higher-priority constraints. These findings highlight the importance of evaluating LLMs in more realistic, dynamic planning scenarios and suggest specific directions for improving model performance on complex planning tasks. The code and dataset for our framework are publicly available at this https URL.

Title: Normative Conflicts and Shallow AI Alignment

Authors: Raphaël Millière
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.04679
Pdf URL: https://arxiv.org/pdf/2506.04679
Copy Paste: [[2506.04679]] Normative Conflicts and Shallow AI Alignment(https://arxiv.org/abs/2506.04679)
Keywords: language model, llm
Abstract: The progress of AI systems such as large language models (LLMs) raises increasingly pressing concerns about their safe deployment. This paper examines the value alignment problem for LLMs, arguing that current alignment strategies are fundamentally inadequate to prevent misuse. Despite ongoing efforts to instill norms such as helpfulness, honesty, and harmlessness in LLMs through fine-tuning based on human preferences, they remain vulnerable to adversarial attacks that exploit conflicts between these norms. I argue that this vulnerability reflects a fundamental limitation of existing alignment methods: they reinforce shallow behavioral dispositions rather than endowing LLMs with a genuine capacity for normative deliberation. Drawing on research in moral psychology, I show how humans' ability to engage in deliberative reasoning enhances their resilience against similar adversarial tactics. LLMs, by contrast, lack a robust capacity to detect and rationally resolve normative conflicts, leaving them susceptible to manipulation; even recent advances in reasoning-focused LLMs have not addressed this vulnerability. This "shallow alignment" problem carries significant implications for AI safety and regulation, suggesting that current approaches are insufficient for mitigating potential harms posed by increasingly capable AI systems.

Title: MMRefine: Unveiling the Obstacles to Robust Refinement in Multimodal Large Language Models

Authors: Gio Paik, Geewook Kim, Jinbae Im
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2506.04688
Pdf URL: https://arxiv.org/pdf/2506.04688
Copy Paste: [[2506.04688]] MMRefine: Unveiling the Obstacles to Robust Refinement in Multimodal Large Language Models(https://arxiv.org/abs/2506.04688)
Keywords: language model, llm
Abstract: This paper introduces MMRefine, a MultiModal Refinement benchmark designed to evaluate the error refinement capabilities of Multimodal Large Language Models (MLLMs). As the emphasis shifts toward enhancing reasoning during inference, MMRefine provides a framework that evaluates MLLMs' abilities to detect and correct errors across six distinct scenarios beyond just comparing final accuracy before and after refinement. Furthermore, the benchmark analyzes the refinement performance by categorizing errors into six error types. Experiments with various open and closed MLLMs reveal bottlenecks and factors impeding refinement performance, highlighting areas for improvement in effective reasoning enhancement. Our code and dataset are publicly available at this https URL.

Title: Recycling the Web: A Method to Enhance Pre-training Data Quality and Quantity for Language Models

Authors: Thao Nguyen, Yang Li, Olga Golovneva, Luke Zettlemoyer, Sewoong Oh, Ludwig Schmidt, Xian Li
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.04689
Pdf URL: https://arxiv.org/pdf/2506.04689
Copy Paste: [[2506.04689]] Recycling the Web: A Method to Enhance Pre-training Data Quality and Quantity for Language Models(https://arxiv.org/abs/2506.04689)
Keywords: language model
Abstract: Scaling laws predict that the performance of large language models improves with increasing model size and data size. In practice, pre-training has been relying on massive web crawls, using almost all data sources publicly available on the internet so far. However, this pool of natural data does not grow at the same rate as the compute supply. Furthermore, the availability of high-quality texts is even more limited: data filtering pipelines often remove up to 99% of the initial web scrapes to achieve state-of-the-art performance. To address the "data wall" of pre-training scaling, our work explores ways to transform and recycle data discarded in existing filtering processes. We propose REWIRE, REcycling the Web with guIded REwrite, a method to enrich low-quality documents so that they could become useful for training. This in turn allows us to increase the representation of synthetic data in the final pre-training set. Experiments at 1B, 3B and 7B scales of the DCLM benchmark show that mixing high-quality raw texts and our rewritten texts leads to improvements of 1.0, 1.3 and 2.5 percentage points, respectively, across 22 diverse tasks, compared to training on only filtered web data. Training on the raw-synthetic data mix is also more effective than having access to 2x web data. Through further analysis, we demonstrate that about 82% of the mixed-in texts come from transforming lower-quality documents that would otherwise be discarded. REWIRE also outperforms related approaches of generating synthetic data, including Wikipedia-style paraphrasing, question-answer synthesizing and knowledge extraction. These results suggest that recycling web texts holds the potential for being a simple and effective approach for scaling pre-training data.

Title: Cracking the Code: Enhancing Implicit Hate Speech Detection through Coding Classification

Authors: Lu Wei, Liangzhi Li, Tong Xiang, Xiao Liu, Noa Garcia
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.04693
Pdf URL: https://arxiv.org/pdf/2506.04693
Copy Paste: [[2506.04693]] Cracking the Code: Enhancing Implicit Hate Speech Detection through Coding Classification(https://arxiv.org/abs/2506.04693)
Keywords: language model, llm, prompt
Abstract: The internet has become a hotspot for hate speech (HS), threatening societal harmony and individual well-being. While automatic detection methods perform well in identifying explicit hate speech (ex-HS), they struggle with more subtle forms, such as implicit hate speech (im-HS). We tackle this problem by introducing a new taxonomy for im-HS detection, defining six encoding strategies named codetypes. We present two methods for integrating codetypes into im-HS detection: 1) prompting large language models (LLMs) directly to classify sentences based on generated responses, and 2) using LLMs as encoders with codetypes embedded during the encoding process. Experiments show that the use of codetypes improves im-HS detection in both Chinese and English datasets, validating the effectiveness of our approach across different languages.

Title: Accelerated Test-Time Scaling with Model-Free Speculative Sampling

Authors: Woomin Song, Saket Dingliwal, Sai Muralidhar Jayanthi, Bhavana Ganesh, Jinwoo Shin, Aram Galstyan, Sravan Babu Bodapati
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.04708
Pdf URL: https://arxiv.org/pdf/2506.04708
Copy Paste: [[2506.04708]] Accelerated Test-Time Scaling with Model-Free Speculative Sampling(https://arxiv.org/abs/2506.04708)
Keywords: language model
Abstract: Language models have demonstrated remarkable capabilities in reasoning tasks through test-time scaling techniques like best-of-N sampling and tree search. However, these approaches often demand substantial computational resources, creating a critical trade-off between performance and efficiency. We introduce STAND (STochastic Adaptive N-gram Drafting), a novel model-free speculative decoding approach that leverages the inherent redundancy in reasoning trajectories to achieve significant acceleration without compromising accuracy. Our analysis reveals that reasoning paths frequently reuse similar reasoning patterns, enabling efficient model-free token prediction without requiring separate draft models. By introducing stochastic drafting and preserving probabilistic information through a memory-efficient logit-based N-gram module, combined with optimized Gumbel-Top-K sampling and data-driven tree construction, STAND significantly improves token acceptance rates. Extensive evaluations across multiple models and reasoning tasks (AIME-2024, GPQA-Diamond, and LiveCodeBench) demonstrate that STAND reduces inference latency by 60-65% compared to standard autoregressive decoding while maintaining accuracy. Furthermore, STAND outperforms state-of-the-art speculative decoding methods by 14-28% in throughput and shows strong performance even in single-trajectory scenarios, reducing inference latency by 48-58%. As a model-free approach, STAND can be applied to any existing language model without additional training, making it a powerful plug-and-play solution for accelerating language model reasoning.
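The idea of drafting tokens from N-gram statistics instead of a separate draft model can be illustrated with a minimal sketch. This is not the STAND implementation: it uses raw successor counts rather than the logit-based module the abstract describes, it omits Gumbel-Top-K sampling, tree construction, and the verification step, and the names `build_ngram_table` and `draft` are hypothetical.

```python
from collections import Counter, defaultdict
import random

def build_ngram_table(tokens, n=3):
    """Map each (n-1)-token context to a Counter of the tokens that followed it.

    Assumes n >= 2.
    """
    table = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        table[context][tokens[i + n - 1]] += 1
    return table

def draft(table, prefix, n=3, max_len=4, rng=None):
    """Propose up to max_len draft tokens, sampling successors in proportion to counts."""
    rng = rng or random.Random(0)
    out = list(prefix)
    drafted = []
    for _ in range(max_len):
        context = tuple(out[-(n - 1):])
        successors = table.get(context)
        if not successors:
            break  # unseen context: stop drafting and fall back to the model
        choices, weights = zip(*successors.items())
        token = rng.choices(choices, weights=weights, k=1)[0]
        drafted.append(token)
        out.append(token)
    return drafted

# Toy "reasoning trace" with a repeated pattern (illustrative data only).
history = "so x equals 2 so x equals 2 and y equals 3".split()
table = build_ngram_table(history, n=3)
print(draft(table, ["so", "x"], n=3, max_len=2))
```

In a full speculative decoding loop, the drafted tokens would then be checked by the target model in a single forward pass, and only the accepted prefix would be kept.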

Title: SPARTA ALIGNMENT: Collectively Aligning Multiple Language Models through Combat

Authors: Yuru Jiang, Wenxuan Ding, Shangbin Feng, Greg Durrett, Yulia Tsvetkov
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.04721
Pdf URL: https://arxiv.org/pdf/2506.04721
Copy Paste: [[2506.04721]] SPARTA ALIGNMENT: Collectively Aligning Multiple Language Models through Combat(https://arxiv.org/abs/2506.04721)
Keywords: language model, llm
Abstract: We propose SPARTA ALIGNMENT, an algorithm to collectively align multiple LLMs through competition and combat. To compensate for a single model's lack of diversity in generation and biases in evaluation, multiple LLMs form a "sparta tribe" to compete against each other in fulfilling instructions while serving as judges for the competition of others. For each iteration, one instruction and two models are selected for a duel, the other models evaluate the two responses, and their evaluation scores are aggregated through an adapted Elo-ranking-based reputation system, where winners/losers of combat gain/lose weight in evaluating others. The peer-evaluated combat results then become preference pairs where the winning response is preferred over the losing one, and all models learn from these preferences at the end of each iteration. SPARTA ALIGNMENT enables the self-evolution of multiple LLMs in an iterative and collective competition process. Extensive experiments demonstrate that SPARTA ALIGNMENT outperforms initial models and 4 self-alignment baselines across 10 out of 12 tasks and datasets with 7.0% average improvement. Further analysis reveals that SPARTA ALIGNMENT generalizes more effectively to unseen tasks and leverages the expertise diversity of participating models to produce more logical, direct and informative outputs.
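One duel with reputation-weighted judging can be sketched as follows. The abstract does not specify how the Elo system is adapted, so this sketch assumes the textbook Elo update with K = 32 and judge weights proportional to current ratings; the function names, model names, and scores are all hypothetical.

```python
def expected_score(r_a, r_b):
    """Textbook Elo expectation that a player rated r_a beats one rated r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def duel(ratings, a, b, judge_scores, k=32.0):
    """Run one duel between models a and b and update their ratings in place.

    judge_scores maps judge name -> (score for a's response, score for b's response).
    Judge scores are averaged with weights proportional to the judges' ratings
    (an assumption; the paper's exact weighting is not given in the abstract).
    """
    total_w = sum(ratings[j] for j in judge_scores)
    score_a = sum(ratings[j] * s[0] for j, s in judge_scores.items()) / total_w
    score_b = sum(ratings[j] * s[1] for j, s in judge_scores.items()) / total_w
    outcome_a = 1.0 if score_a > score_b else 0.0 if score_a < score_b else 0.5
    exp_a = expected_score(ratings[a], ratings[b])
    ratings[a] += k * (outcome_a - exp_a)
    ratings[b] += k * ((1.0 - outcome_a) - (1.0 - exp_a))
    winner = a if outcome_a == 1.0 else b if outcome_a == 0.0 else None
    return winner, (score_a, score_b)

ratings = {"m1": 1000.0, "m2": 1000.0, "m3": 1200.0, "m4": 800.0}
# The unweighted means tie at 6.5, but the higher-rated judge m3 tips the result to m1.
winner, scores = duel(ratings, "m1", "m2", {"m3": (8, 6), "m4": (5, 7)})
print(winner, scores, ratings["m1"], ratings["m2"])
```

The winning and losing responses from each duel would then form one preference pair for the end-of-iteration training step.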

Title: Lifelong Evolution: Collaborative Learning between Large and Small Language Models for Continuous Emergent Fake News Detection

Authors: Ziyi Zhou, Xiaoming Zhang, Litian Zhang, Yibo Zhang, Zhenyu Guan, Chaozhuo Li, Philip S. Yu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.04739
Pdf URL: https://arxiv.org/pdf/2506.04739
Copy Paste: [[2506.04739]] Lifelong Evolution: Collaborative Learning between Large and Small Language Models for Continuous Emergent Fake News Detection(https://arxiv.org/abs/2506.04739)
Keywords: language model, llm
Abstract: The widespread dissemination of fake news on social media has significantly impacted society, resulting in serious consequences. Conventional deep learning methodologies employing small language models (SLMs) suffer from extensive supervised training requirements and difficulties adapting to evolving news environments due to data scarcity and distribution shifts. Large language models (LLMs), despite robust zero-shot capabilities, fall short in accurately detecting fake news owing to outdated knowledge and the absence of suitable demonstrations. In this paper, we propose a novel Continuous Collaborative Emergent Fake News Detection (C$^2$EFND) framework to address these challenges. The C$^2$EFND framework strategically leverages both LLMs' generalization power and SLMs' classification expertise via a multi-round collaborative learning framework. We further introduce a lifelong knowledge editing module based on a Mixture-of-Experts architecture to incrementally update LLMs and a replay-based continual learning method to ensure SLMs retain prior knowledge without retraining entirely. Extensive experiments on Pheme and Twitter16 datasets demonstrate that C$^2$EFND significantly outperforms existing methods, effectively improving detection accuracy and adaptability in continuous emergent fake news scenarios.

Title: Identifying Reliable Evaluation Metrics for Scientific Text Revision

Authors: Léane Jourdan, Florian Boudin, Richard Dufour, Nicolas Hernandez
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.04772
Pdf URL: https://arxiv.org/pdf/2506.04772
Copy Paste: [[2506.04772]] Identifying Reliable Evaluation Metrics for Scientific Text Revision(https://arxiv.org/abs/2506.04772)
Keywords: llm
Abstract: Evaluating text revision in scientific writing remains a challenge, as traditional metrics such as ROUGE and BERTScore primarily focus on similarity rather than capturing meaningful improvements. In this work, we analyse and identify the limitations of these metrics and explore alternative evaluation methods that better align with human judgments. We first conduct a manual annotation study to assess the quality of different revisions. Then, we investigate reference-free evaluation metrics from related NLP domains. Additionally, we examine LLM-as-a-judge approaches, analysing their ability to assess revisions with and without a gold reference. Our results show that LLMs effectively assess instruction-following but struggle with correctness, while domain-specific metrics provide complementary insights. We find that a hybrid approach combining LLM-as-a-judge evaluation and task-specific metrics offers the most reliable assessment of revision quality.

Title: Fine-Grained Interpretation of Political Opinions in Large Language Models

Authors: Jingyu Hu, Mengyue Yang, Mengnan Du, Weiru Liu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.04774
Pdf URL: https://arxiv.org/pdf/2506.04774
Copy Paste: [[2506.04774]] Fine-Grained Interpretation of Political Opinions in Large Language Models(https://arxiv.org/abs/2506.04774)
Keywords: language model, llm
Abstract: Studies of LLMs' political opinions mainly rely on evaluations of their open-ended responses. Recent work indicates that there is a misalignment between LLMs' responses and their internal intentions. This motivates us to probe LLMs' internal mechanisms and help uncover their internal political states. Additionally, we found that the analysis of LLMs' political opinions often relies on single-axis concepts, which can lead to concept confounds. In this work, we extend the single-axis analysis to multiple dimensions and apply interpretable representation engineering techniques for more transparent LLM political concept learning. Specifically, we designed a four-dimensional political learning framework and constructed a corresponding dataset for fine-grained political concept vector learning. These vectors can be used to detect and intervene in LLM internals. Experiments are conducted on eight open-source LLMs with three representation engineering techniques. Results show these vectors can disentangle political concept confounds. Detection tasks validate the semantic meaning of the vectors and show good generalization and robustness in OOD settings. Intervention experiments show these vectors can intervene in LLMs to generate responses with different political leanings.

Title: MMSU: A Massive Multi-task Spoken Language Understanding and Reasoning Benchmark

Authors: Dingdong Wang, Jincenzi Wu, Junan Li, Dongchao Yang, Xueyuan Chen, Tianhua Zhang, Helen Meng
Subjects: cs.CL, cs.SD
Abstract URL: https://arxiv.org/abs/2506.04779
Pdf URL: https://arxiv.org/pdf/2506.04779
Copy Paste: [[2506.04779]] MMSU: A Massive Multi-task Spoken Language Understanding and Reasoning Benchmark(https://arxiv.org/abs/2506.04779)
Keywords: language model, llm
Abstract: Speech inherently contains rich acoustic information that extends far beyond the textual language. In real-world spoken language understanding, effective interpretation often requires integrating semantic meaning (e.g., content), paralinguistic features (e.g., emotions, speed, pitch) and phonological characteristics (e.g., prosody, intonation, rhythm), which are embedded in speech. While recent multimodal Speech Large Language Models (SpeechLLMs) have demonstrated remarkable capabilities in processing audio information, their ability to perform fine-grained perception and complex reasoning in natural speech remains largely unexplored. To address this gap, we introduce MMSU, a comprehensive benchmark designed specifically for understanding and reasoning in spoken language. MMSU comprises 5,000 meticulously curated audio-question-answer triplets across 47 distinct tasks. To ground our benchmark in linguistic theory, we systematically incorporate a wide range of linguistic phenomena, including phonetics, prosody, rhetoric, syntactics, semantics, and paralinguistics. Through a rigorous evaluation of 14 advanced SpeechLLMs, we identify substantial room for improvement in existing models, highlighting meaningful directions for future optimization. MMSU establishes a new standard for comprehensive assessment of spoken language understanding, providing valuable insights for developing more sophisticated human-AI speech interaction systems. MMSU benchmark is available at this https URL. Evaluation Code is available at this https URL.

Title: Towards LLM-Centric Multimodal Fusion: A Survey on Integration Strategies and Techniques

Authors: Jisu An, Junseok Lee, Jeoungeun Lee, Yongseok Son
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.04788
Pdf URL: https://arxiv.org/pdf/2506.04788
Copy Paste: [[2506.04788]] Towards LLM-Centric Multimodal Fusion: A Survey on Integration Strategies and Techniques(https://arxiv.org/abs/2506.04788)
Keywords: language model, llm
Abstract: The rapid progress of Multimodal Large Language Models (MLLMs) has transformed the AI landscape. These models combine pre-trained LLMs with various modality encoders. This integration requires a systematic understanding of how different modalities connect to the language backbone. Our survey presents an LLM-centric analysis of current approaches. We examine methods for transforming and aligning diverse modal inputs into the language embedding space. This addresses a significant gap in existing literature. We propose a classification framework for MLLMs based on three key dimensions. First, we examine architectural strategies for modality integration. This includes both the specific integration mechanisms and the fusion level. Second, we categorize representation learning techniques as either joint or coordinate representations. Third, we analyze training paradigms, including training strategies and objective functions. By examining 125 MLLMs developed between 2021 and 2025, we identify emerging patterns in the field. Our taxonomy provides researchers with a structured overview of current integration techniques. These insights aim to guide the development of more robust multimodal integration strategies for future models built on pre-trained foundations.

Title: Dissecting Logical Reasoning in LLMs: A Fine-Grained Evaluation and Supervision Study

Authors: Yujun Zhou, Jiayi Ye, Zipeng Ling, Yufei Han, Yue Huang, Haomin Zhuang, Zhenwen Liang, Kehan Guo, Taicheng Guo, Xiangqi Wang, Xiangliang Zhang
Subjects: cs.CL, cs.AI, cs.LO
Abstract URL: https://arxiv.org/abs/2506.04810
Pdf URL: https://arxiv.org/pdf/2506.04810
Copy Paste: [[2506.04810]] Dissecting Logical Reasoning in LLMs: A Fine-Grained Evaluation and Supervision Study(https://arxiv.org/abs/2506.04810)
Keywords: language model, llm
Abstract: Logical reasoning is a core capability for many applications of large language models (LLMs), yet existing benchmarks often rely solely on final-answer accuracy, failing to capture the quality and structure of the reasoning process. We propose FineLogic, a fine-grained evaluation framework that assesses logical reasoning across three dimensions: overall benchmark accuracy, stepwise soundness, and representation-level alignment. In addition, to better understand how reasoning capabilities emerge, we conduct a comprehensive study on the effects of supervision format during fine-tuning. We construct four supervision styles (one natural language and three symbolic variants) and train LLMs under each. Our findings reveal that natural language supervision yields strong generalization even on out-of-distribution and long-context tasks, while symbolic reasoning styles promote more structurally sound and atomic inference chains. Further, our representation-level probing shows that fine-tuning primarily improves reasoning behaviors through step-by-step generation, rather than enhancing shortcut prediction or internalized correctness. Together, our framework and analysis provide a more rigorous and interpretable lens for evaluating and improving logical reasoning in LLMs.

Title: Evaluating Vision-Language and Large Language Models for Automated Student Assessment in Indonesian Classrooms

Authors: Nurul Aisyah, Muhammad Dehan Al Kautsar, Arif Hidayat, Raqib Chowdhury, Fajri Koto
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.04822
Pdf URL: https://arxiv.org/pdf/2506.04822
Copy Paste: [[2506.04822]] Evaluating Vision-Language and Large Language Models for Automated Student Assessment in Indonesian Classrooms(https://arxiv.org/abs/2506.04822)
Keywords: language model, llm
Abstract: Although vision-language and large language models (VLM and LLM) offer promising opportunities for AI-driven educational assessment, their effectiveness in real-world classroom settings, particularly in underrepresented educational contexts, remains underexplored. In this study, we evaluated the performance of a state-of-the-art VLM and several LLMs on 646 handwritten exam responses from grade 4 students in six Indonesian schools, covering two subjects: Mathematics and English. These sheets contain more than 14K student answers that span multiple choice, short answer, and essay questions. Assessment tasks include grading these responses and generating personalized feedback. Our findings show that the VLM often struggles to accurately recognize student handwriting, leading to error propagation in downstream LLM grading. Nevertheless, LLM-generated feedback retains some utility, even when derived from imperfect input, although limitations in personalization and contextual relevance persist.

Title: A Reasoning-Based Approach to Cryptic Crossword Clue Solving

Authors: Martin Andrews, Sam Witteveen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.04824
Pdf URL: https://arxiv.org/pdf/2506.04824
Copy Paste: [[2506.04824]] A Reasoning-Based Approach to Cryptic Crossword Clue Solving(https://arxiv.org/abs/2506.04824)
Keywords: llm
Abstract: Cryptic crossword clues are challenging language tasks for which new test sets are released daily by major newspapers on a global basis. Each cryptic clue contains both the definition of the answer to be placed in the crossword grid (in common with regular crosswords), and 'wordplay' that proves that the answer is correct (i.e. a human solver can be confident that an answer is correct without needing crossing words as confirmation). This work describes an LLM-based reasoning system built from open-licensed components that solves cryptic clues by (i) hypothesising answers; (ii) proposing wordplay explanations; and (iii) using a verifier system that operates on codified reasoning steps. Overall, this system establishes a new state-of-the-art performance on the challenging Cryptonite dataset of clues from The Times and The Telegraph newspapers in the UK. Because each proved solution is expressed in Python, interpretable wordplay reasoning for proven answers is available for inspection.

Title: Joint Evaluation of Answer and Reasoning Consistency for Hallucination Detection in Large Reasoning Models

Authors: Changyue Wang, Weihang Su, Qingyao Ai, Yiqun Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.04832
Pdf URL: https://arxiv.org/pdf/2506.04832
Copy Paste: [[2506.04832]] Joint Evaluation of Answer and Reasoning Consistency for Hallucination Detection in Large Reasoning Models(https://arxiv.org/abs/2506.04832)
Keywords: language model, llm, hallucination
Abstract: Large Reasoning Models (LRMs) extend large language models with explicit, multi-step reasoning traces to enhance transparency and performance on complex tasks. However, these reasoning traces can be redundant or logically inconsistent, making them a new source of hallucination that is difficult to detect. Existing hallucination detection methods focus primarily on answer-level uncertainty and often fail to detect hallucinations or logical inconsistencies arising from the model's reasoning trace. This oversight is particularly problematic for LRMs, where the explicit thinking trace is not only an important support to the model's decision-making process but also a key source of potential hallucination. To this end, we propose RACE (Reasoning and Answer Consistency Evaluation), a novel framework specifically tailored for hallucination detection in LRMs. RACE operates by extracting essential reasoning steps and computing four diagnostic signals: inter-sample consistency of reasoning traces, entropy-based answer uncertainty, semantic alignment between reasoning and answers, and internal coherence of reasoning. This joint analysis enables fine-grained hallucination detection even when the final answer appears correct. Experiments across datasets and different LLMs demonstrate that RACE outperforms existing hallucination detection baselines, offering a robust and generalizable solution for evaluating LRMs. Our code is available at: this https URL.
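Of the four diagnostic signals listed, entropy-based answer uncertainty is the simplest to illustrate. The sketch below computes Shannon entropy over the final answers of several sampled generations. It assumes answers are compared by exact string match, which may differ from how RACE groups answers; the function name `answer_entropy` is hypothetical.

```python
import math
from collections import Counter

def answer_entropy(answers):
    """Shannon entropy (in bits) of the empirical distribution over sampled answers.

    Low entropy means the samples agree; high entropy signals answer-level uncertainty.
    """
    counts = Counter(answers)
    total = len(answers)
    return sum((c / total) * math.log2(total / c) for c in counts.values())

print(answer_entropy(["42", "42", "42", "42"]))  # all samples agree
print(answer_entropy(["42", "17", "42", "17"]))  # samples split evenly between two answers
```

On its own this signal looks only at final answers, which is the limitation the abstract raises: a model can agree with itself on the answer while its reasoning traces remain inconsistent.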

Title: Multiple-Choice Question Generation Using Large Language Models: Methodology and Educator Insights

Authors: Giorgio Biancini, Alessio Ferrato, Carla Limongelli
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.04851
Pdf URL: https://arxiv.org/pdf/2506.04851
Copy Paste: [[2506.04851]] Multiple-Choice Question Generation Using Large Language Models: Methodology and Educator Insights(https://arxiv.org/abs/2506.04851)
Keywords: language model, gpt, llm, hallucination, prompt
Abstract: Integrating Artificial Intelligence (AI) in educational settings has brought new learning approaches, transforming the practices of both students and educators. Among the various technologies driving this transformation, Large Language Models (LLMs) have emerged as powerful tools for creating educational materials and question answering, but there is still room for new applications. Educators commonly use Multiple-Choice Questions (MCQs) to assess student knowledge, but manually generating these questions is resource-intensive and requires significant time and cognitive effort. In our opinion, LLMs offer a promising solution to these challenges. This paper presents a novel comparative analysis of three widely known LLMs - Llama 2, Mistral, and GPT-3.5 - to explore their potential for creating informative and challenging MCQs. In our approach, we do not rely on the knowledge of the LLM; instead, we inject the knowledge into the prompt to counter hallucinations, which also gives educators control over the test's source text. Our experiment involving 21 educators shows that GPT-3.5 generates the most effective MCQs across several known metrics. Additionally, it shows that there is still some reluctance to adopt AI in the educational field. This study sheds light on the potential of LLMs to generate MCQs and improve the educational experience, providing valuable insights for the future.

Title: Prompting LLMs: Length Control for Isometric Machine Translation

Authors: Dávid Javorský, Ondřej Bojar, François Yvon
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.04855
Pdf URL: https://arxiv.org/pdf/2506.04855
Copy Paste: [[2506.04855]] Prompting LLMs: Length Control for Isometric Machine Translation(https://arxiv.org/abs/2506.04855)
Keywords: language model, llm, prompt
Abstract: In this study, we explore the effectiveness of isometric machine translation across multiple language pairs (En$\to$De, En$\to$Fr, and En$\to$Es) under the conditions of the IWSLT Isometric Shared Task 2022. Using eight open-source large language models (LLMs) of varying sizes, we investigate how different prompting strategies, varying numbers of few-shot examples, and demonstration selection influence translation quality and length control. We discover that the phrasing of instructions, when aligned with the properties of the provided demonstrations, plays a crucial role in controlling the output length. Our experiments show that LLMs tend to produce shorter translations only when presented with extreme examples, while isometric demonstrations often lead to the models disregarding length constraints. While few-shot prompting generally enhances translation quality, further improvements are marginal across 5, 10, and 20-shot settings. Finally, considering multiple outputs notably improves the overall tradeoff between length and quality, yielding state-of-the-art performance for some language pairs.

Title: Evaluating the Effectiveness of Linguistic Knowledge in Pretrained Language Models: A Case Study of Universal Dependencies

Authors: Wenxi Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.04887
Pdf URL: https://arxiv.org/pdf/2506.04887
Copy Paste: [[2506.04887]] Evaluating the Effectiveness of Linguistic Knowledge in Pretrained Language Models: A Case Study of Universal Dependencies(https://arxiv.org/abs/2506.04887)
Keywords: language model
Abstract: Universal Dependencies (UD), while widely regarded as the most successful linguistic framework for cross-lingual syntactic representation, remains underexplored in terms of its effectiveness. This paper addresses this gap by integrating UD into pretrained language models and assessing whether UD can improve their performance on a cross-lingual adversarial paraphrase identification task. Experimental results show that incorporation of UD yields significant improvements in accuracy and $F_1$ scores, with average gains of 3.85% and 6.08% respectively. These enhancements narrow the performance gap between pretrained models and large language models in some language pairs, and in others the pretrained models even outperform the latter. Furthermore, the UD-based similarity score between a given language and English is positively correlated to the performance of models in that language. Both findings highlight the validity and potential of UD in out-of-domain tasks.

Title: ICPC-Eval: Probing the Frontiers of LLM Reasoning with Competitive Programming Contests

Authors: Shiyi Xu, Yiwen Hu, Yingqian Min, Zhipeng Chen, Wayne Xin Zhao, Ji-Rong Wen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.04894
Pdf URL: https://arxiv.org/pdf/2506.04894
Copy Paste: [[2506.04894]] ICPC-Eval: Probing the Frontiers of LLM Reasoning with Competitive Programming Contests(https://arxiv.org/abs/2506.04894)
Keywords: language model, llm
Abstract: With the significant progress of large reasoning models in complex coding and reasoning tasks, existing benchmarks, like LiveCodeBench and CodeElo, are insufficient to evaluate the coding capabilities of large language models (LLMs) in real competition environments. Moreover, current evaluation metrics such as Pass@K fail to capture the reflective abilities of reasoning models. To address these challenges, we propose ICPC-Eval, a top-level competitive coding benchmark designed to probe the frontiers of LLM reasoning. ICPC-Eval includes 118 carefully curated problems from 11 recent ICPC contests held in various regions of the world, offering three key contributions: 1) A challenging realistic ICPC competition scenario, featuring a problem type and difficulty distribution consistent with actual contests. 2) A robust test case generation method and a corresponding local evaluation toolkit, enabling efficient and accurate local evaluation. 3) An effective test-time scaling evaluation metric, Refine@K, which allows iterative repair of solutions based on execution feedback. The results underscore the significant challenge in evaluating complex reasoning abilities: top-tier reasoning models like DeepSeek-R1 often rely on multi-turn code feedback to fully unlock their in-context reasoning potential when compared to non-reasoning counterparts. Furthermore, despite recent advancements in code generation, these models still lag behind top-performing human teams. We release the benchmark at: this https URL

Title: Verbose ListOps (VLO): Beyond Long Context -- Unmasking LLM's Reasoning Blind Spots

Authors: Alex Pan, Mary-Anne Williams
Subjects: cs.CL, cs.AI, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2506.04907
Pdf URL: https://arxiv.org/pdf/2506.04907
Copy Paste: [[2506.04907]] Verbose ListOps (VLO): Beyond Long Context -- Unmasking LLM's Reasoning Blind Spots(https://arxiv.org/abs/2506.04907)
Keywords: language model, llm, long context
Abstract: Large Language Models (LLMs), whilst great at extracting facts from text, struggle with nested narrative reasoning. Existing long context and multi-hop QA benchmarks inadequately test this, lacking realistic distractors or failing to decouple context length from reasoning complexity, masking a fundamental LLM limitation. We introduce Verbose ListOps, a novel benchmark that programmatically transposes ListOps computations into lengthy, coherent stories. This uniquely forces internal computation and state management of nested reasoning problems by withholding intermediate results, and offers fine-grained controls for both narrative size and reasoning difficulty. Whilst benchmarks like LongReason (2025) advance approaches for synthetically expanding the context size of multi-hop QA problems, Verbose ListOps pinpoints a specific LLM vulnerability: difficulty in state management for nested sub-reasoning amongst semantically-relevant, distracting narrative. Our experiments show that leading LLMs (e.g., OpenAI o4, Gemini 2.5 Pro) collapse in performance on Verbose ListOps at modest (~10k token) narrative lengths, despite effortlessly solving raw ListOps equations. Addressing this failure is paramount for real-world text interpretation which requires identifying key reasoning points, tracking conceptual intermediate results, and filtering irrelevant information. Verbose ListOps and its extensible generation framework thus enable targeted reasoning enhancements beyond mere context-window expansion; a critical step to automating the world's knowledge work.

Title: Simulating LLM-to-LLM Tutoring for Multilingual Math Feedback

Authors: Junior Cedric Tonga, KV Aditya Srivatsa, Kaushal Kumar Maurya, Fajri Koto, Ekaterina Kochmar
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.04920
Pdf URL: https://arxiv.org/pdf/2506.04920
Copy Paste: [[2506.04920]] Simulating LLM-to-LLM Tutoring for Multilingual Math Feedback(https://arxiv.org/abs/2506.04920)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) have demonstrated the ability to generate formative feedback and instructional hints in English, making them increasingly relevant for AI-assisted education. However, their ability to provide effective instructional support across different languages, especially for mathematically grounded reasoning tasks, remains largely unexamined. In this work, we present the first large-scale simulation of multilingual tutor-student interactions using LLMs. A stronger model plays the role of the tutor, generating feedback in the form of hints, while a weaker model simulates the student. We explore 352 experimental settings across 11 typologically diverse languages, four state-of-the-art LLMs, and multiple prompting strategies to assess whether language-specific feedback leads to measurable learning gains. Our study examines how student input language, teacher feedback language, model choice, and language resource level jointly influence performance. Results show that multilingual hints can significantly improve learning outcomes, particularly in low-resource languages when feedback is aligned with the student's native language. These findings offer practical insights for developing multilingual, LLM-based educational tools that are both effective and inclusive.

Title: ConECT Dataset: Overcoming Data Scarcity in Context-Aware E-Commerce MT

Authors: Mikołaj Pokrywka, Wojciech Kusa, Mieszko Rutkowski, Mikołaj Koszowski
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.04929
Pdf URL: https://arxiv.org/pdf/2506.04929
Copy Paste: [[2506.04929]] ConECT Dataset: Overcoming Data Scarcity in Context-Aware E-Commerce MT(https://arxiv.org/abs/2506.04929)
Keywords: language model
Abstract: Neural Machine Translation (NMT) has improved translation by using Transformer-based models, but it still struggles with word ambiguity and context. This problem is especially important in domain-specific applications, which often have problems with unclear sentences or poor data quality. Our research explores how adding information to models can improve translations in the context of e-commerce data. To this end we create ConECT -- a new Czech-to-Polish e-commerce product translation dataset coupled with images and product metadata consisting of 11,400 sentence pairs. We then investigate and compare different methods that are applicable to context-aware translation. We test a vision-language model (VLM), finding that visual context aids translation quality. Additionally, we explore the incorporation of contextual information into text-to-text models, such as the product's category path or image descriptions. The results of our study demonstrate that the incorporation of contextual information leads to an improvement in the quality of machine translation. We make the new dataset publicly available.

Title: From Struggle (06-2024) to Mastery (02-2025) LLMs Conquer Advanced Algorithm Exams and Pave the Way for Editorial Generation

Authors: Adrian Marius Dumitran, Theodor-Pierre Moroianu, Vasile Paul Alexe
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.04965
Pdf URL: https://arxiv.org/pdf/2506.04965
Copy Paste: [[2506.04965]] From Struggle (06-2024) to Mastery (02-2025) LLMs Conquer Advanced Algorithm Exams and Pave the Way for Editorial Generation(https://arxiv.org/abs/2506.04965)
Keywords: language model, llm
Abstract: This paper presents a comprehensive evaluation of the performance of state-of-the-art Large Language Models (LLMs) on challenging university-level algorithms exams. By testing multiple models on both a Romanian exam and its high-quality English translation, we analyze LLMs' problem-solving capabilities, consistency, and multilingual performance. Our empirical study reveals that the most recent models not only achieve scores comparable to top-performing students but also demonstrate robust reasoning skills on complex, multi-step algorithmic challenges, even though difficulties remain with graph-based tasks. Building on these findings, we explore the potential of LLMs to support educational environments through the generation of high-quality editorial content, offering instructors a powerful tool to enhance student feedback. The insights and best practices discussed herein pave the way for further integration of generative AI in advanced algorithm education.

Title: SCOP: Evaluating the Comprehension Process of Large Language Models from a Cognitive View

Authors: Yongjie Xiao, Hongru Liang, Peixin Qin, Yao Zhang, Wenqiang Lei
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.05000
Pdf URL: https://arxiv.org/pdf/2506.05000
Copy Paste: [[2506.05000]] SCOP: Evaluating the Comprehension Process of Large Language Models from a Cognitive View(https://arxiv.org/abs/2506.05000)
Keywords: language model, llm
Abstract: Despite the great potential of large language models (LLMs) in machine comprehension, it is still disturbing to fully count on them in real-world scenarios. This is probably because there is no rational explanation for whether the comprehension process of LLMs is aligned with that of experts. In this paper, we propose SCOP to carefully examine how LLMs perform during the comprehension process from a cognitive view. Specifically, it is equipped with a systematic definition of five requisite skills during the comprehension process, a strict framework to construct testing data for these skills, and a detailed analysis of advanced open-sourced and closed-sourced LLMs using the testing data. With SCOP, we find that it is still challenging for LLMs to perform an expert-level comprehension process. Even so, we notice that LLMs share some similarities with experts, e.g., performing better at comprehending local information than global information. Further analysis reveals that LLMs can be somewhat unreliable -- they might reach correct answers through flawed comprehension processes. Based on SCOP, we suggest that one direction for improving LLMs is to focus more on the comprehension process, ensuring all comprehension skills are thoroughly developed during training.

Title: ComfyUI-Copilot: An Intelligent Assistant for Automated Workflow Development

Authors: Zhenran Xu, Xue Yang, Yiyu Wang, Qingli Hu, Zijiao Wu, Longyue Wang, Weihua Luo, Kaifu Zhang, Baotian Hu, Min Zhang
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2506.05010
Pdf URL: https://arxiv.org/pdf/2506.05010
Copy Paste: [[2506.05010]] ComfyUI-Copilot: An Intelligent Assistant for Automated Workflow Development(https://arxiv.org/abs/2506.05010)
Keywords: language model, agent
Abstract: We introduce ComfyUI-Copilot, a large language model-powered plugin designed to enhance the usability and efficiency of ComfyUI, an open-source platform for AI-driven art creation. Despite its flexibility and user-friendly interface, ComfyUI can present challenges to newcomers, including limited documentation, model misconfigurations, and the complexity of workflow design. ComfyUI-Copilot addresses these challenges by offering intelligent node and model recommendations, along with automated one-click workflow construction. At its core, the system employs a hierarchical multi-agent framework comprising a central assistant agent for task delegation and specialized worker agents for different usages, supported by our curated ComfyUI knowledge bases to streamline debugging and deployment. We validate the effectiveness of ComfyUI-Copilot through both offline quantitative evaluations and online user feedback, showing that it accurately recommends nodes and accelerates workflow development. Additionally, use cases illustrate that ComfyUI-Copilot lowers entry barriers for beginners and enhances workflow efficiency for experienced users. The ComfyUI-Copilot installation package and a demo video are available at this https URL.

Title: Controlling Summarization Length Through EOS Token Weighting

Authors: Zeno Belligoli, Emmanouil Stergiadis, Eran Fainman, Ilya Gusev
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.05017
Pdf URL: https://arxiv.org/pdf/2506.05017
Copy Paste: [[2506.05017]] Controlling Summarization Length Through EOS Token Weighting(https://arxiv.org/abs/2506.05017)
Keywords: gpt, llm
Abstract: Controlling the length of generated text can be crucial in various text-generation tasks, including summarization. Existing methods often require complex model alterations, limiting compatibility with pre-trained models. We address these limitations by developing a simple approach for controlling the length of automatic text summaries by increasing the importance of correctly predicting the EOS token in the cross-entropy loss computation. The proposed methodology is agnostic to architecture and decoding algorithms and orthogonal to other inference-time techniques to control the generation length. We tested it with encoder-decoder and modern GPT-style LLMs, and show that this method can control generation length, often without affecting the quality of the summary.
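The EOS-weighting idea in the abstract above can be illustrated with a minimal sketch. The abstract only says that the importance of correctly predicting the EOS token is increased in the cross-entropy loss; the function names, the `eos_weight` parameter, and the choice to average over the number of tokens are assumptions made here for illustration, not the authors' implementation.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def eos_weighted_cross_entropy(logits, targets, eos_id, eos_weight=1.0):
    """Mean token-level cross-entropy where EOS targets are scaled by eos_weight.

    logits:  list of per-step logit lists (one list per target token)
    targets: list of gold token ids, same length as logits
    Assumption: non-EOS tokens keep weight 1.0 and the sum is divided by the
    number of tokens, so eos_weight=1.0 gives the ordinary mean cross-entropy.
    """
    if len(logits) != len(targets):
        raise ValueError("logits and targets must have the same length")
    total = 0.0
    for step_logits, target in zip(logits, targets):
        weight = eos_weight if target == eos_id else 1.0
        total += -weight * log_softmax(step_logits)[target]
    return total / len(targets)


# Toy example: vocabulary of 4 tokens, token id 3 plays the role of EOS.
uniform = [[0.0, 0.0, 0.0, 0.0]] * 3
plain = eos_weighted_cross_entropy(uniform, [1, 2, 3], eos_id=3, eos_weight=1.0)
boosted = eos_weighted_cross_entropy(uniform, [1, 2, 3], eos_id=3, eos_weight=5.0)
```

With a weight above 1.0, a mispredicted EOS contributes more to the loss than any other token, which is one plausible way to push a model toward ending its summary at the right point.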

Title: Automatic Robustness Stress Testing of LLMs as Mathematical Problem Solvers

Authors: Yutao Hou, Zeguan Xiao, Fei Yu, Yihan Jiang, Xuetao Wei, Hailiang Huang, Yun Chen, Guanhua Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.05038
Pdf URL: https://arxiv.org/pdf/2506.05038
Copy Paste: [[2506.05038]] Automatic Robustness Stress Testing of LLMs as Mathematical Problem Solvers(https://arxiv.org/abs/2506.05038)
Keywords: language model, llm
Abstract: Large language models (LLMs) have achieved distinguished performance on various reasoning-intensive tasks. However, LLMs might still face the challenges of robustness issues and fail unexpectedly in some simple reasoning tasks. Previous works evaluate the LLM robustness with hand-crafted templates or a limited set of perturbation rules, indicating potential data contamination in pre-training or fine-tuning datasets. In this work, inspired by stress testing in software engineering, we propose a novel framework, Automatic Robustness Checker (AR-Checker), to generate mathematical problem variants that maintain the semantic meanings of the original one but might fail the LLMs. The AR-Checker framework generates mathematical problem variants through multi-round parallel streams of LLM-based rewriting and verification. Our framework can generate benchmark variants dynamically for each LLM, thus minimizing the risk of data contamination. Experiments on GSM8K and MATH-500 demonstrate the strong performance of AR-Checker on mathematical tasks. We also evaluate AR-Checker on benchmarks beyond mathematics, including MMLU, MMLU-Pro, and CommonsenseQA, where it also achieves strong performance, further proving the effectiveness of AR-Checker.

Title: TALL -- A Trainable Architecture for Enhancing LLM Performance in Low-Resource Languages

Authors: Moshe Ofer, Orel Zamler, Amos Azaria
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.05057
Pdf URL: https://arxiv.org/pdf/2506.05057
Copy Paste: [[2506.05057]] TALL -- A Trainable Architecture for Enhancing LLM Performance in Low-Resource Languages(https://arxiv.org/abs/2506.05057)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) excel in high-resource languages but struggle with low-resource languages due to limited training data. This paper presents TALL (Trainable Architecture for Enhancing LLM Performance in Low-Resource Languages), which integrates an LLM with two bilingual translation models. TALL transforms low-resource inputs into high-resource representations, leveraging the LLM's capabilities while preserving linguistic features through dimension alignment layers and custom transformers. Our experiments on Hebrew demonstrate significant improvements over several baselines, including direct use, naive translation, and fine-tuning approaches. The architecture employs a parameter-efficient strategy, freezing pre-trained components while training only lightweight adapter modules, balancing computational efficiency with performance gains.

Title: Debatable Intelligence: Benchmarking LLM Judges via Debate Speech Evaluation

Authors: Noy Sternlicht, Ariel Gera, Roy Bar-Haim, Tom Hope, Noam Slonim
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.05062
Pdf URL: https://arxiv.org/pdf/2506.05062
Copy Paste: [[2506.05062]] Debatable Intelligence: Benchmarking LLM Judges via Debate Speech Evaluation(https://arxiv.org/abs/2506.05062)
Keywords: llm
Abstract: We introduce Debate Speech Evaluation as a novel and challenging benchmark for assessing LLM judges. Evaluating debate speeches requires a deep understanding of the speech at multiple levels, including argument strength and relevance, the coherence and organization of the speech, the appropriateness of its style and tone, and so on. This task involves a unique set of cognitive abilities that have previously received limited attention in systematic LLM benchmarking. To explore such skills, we leverage a dataset of over 600 meticulously annotated debate speeches and present the first in-depth analysis of how state-of-the-art LLMs compare to human judges on this task. Our findings reveal a nuanced picture: while larger models can approximate individual human judgments in some respects, they differ substantially in their overall judgment behavior. We also investigate the ability of frontier LLMs to generate persuasive, opinionated speeches, showing that models may perform at a human level on this task.

Title: Does It Make Sense to Speak of Introspection in Large Language Models?

Authors: Iulia Comşa, Murray Shanahan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.05068
Pdf URL: https://arxiv.org/pdf/2506.05068
Copy Paste: [[2506.05068]] Does It Make Sense to Speak of Introspection in Large Language Models?(https://arxiv.org/abs/2506.05068)
Keywords: language model, llm
Abstract: Large language models (LLMs) exhibit compelling linguistic behaviour, and sometimes offer self-reports, that is to say statements about their own nature, inner workings, or behaviour. In humans, such reports are often attributed to a faculty of introspection and are typically linked to consciousness. This raises the question of how to interpret self-reports produced by LLMs, given their increasing linguistic fluency and cognitive capabilities. To what extent (if any) can the concept of introspection be meaningfully applied to LLMs? Here, we present and critique two examples of apparent introspective self-report from LLMs. In the first example, an LLM attempts to describe the process behind its own "creative" writing, and we argue this is not a valid example of introspection. In the second example, an LLM correctly infers the value of its own temperature parameter, and we argue that this can be legitimately considered a minimal example of introspection, albeit one that is (presumably) not accompanied by conscious experience.

Title: RIVAL: Reinforcement Learning with Iterative and Adversarial Optimization for Machine Translation

Authors: Tianjiao Li, Mengran Yu, Chenyu Shi, Yanjun Zhao, Xiaojing Liu, Qiang Zhang, Qi Zhang, Xuanjing Huang, Jiayin Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.05070
Pdf URL: https://arxiv.org/pdf/2506.05070
Copy Paste: [[2506.05070]] RIVAL: Reinforcement Learning with Iterative and Adversarial Optimization for Machine Translation(https://arxiv.org/abs/2506.05070)
Keywords: language model, llm
Abstract: Large language models (LLMs) possess strong multilingual capabilities, and combining Reinforcement Learning from Human Feedback (RLHF) with translation tasks has shown great potential. However, we observe that this paradigm performs unexpectedly poorly when applied to colloquial subtitle translation tasks. In this work, we investigate this issue and find that the offline reward model (RM) gradually diverges from the online LLM due to distributional shift, ultimately leading to undesirable training outcomes. To address this, we propose RIVAL, an adversarial training framework that formulates the process as a min-max game between the RM and the LLM. RIVAL iteratively updates both models, with the RM trained to distinguish strong from weak translations (qualitative preference reward), and the LLM trained to enhance its translation to close this gap. To stabilize training and improve generalizability, we also incorporate quantitative preference reward (e.g., BLEU) into the RM, enabling reference-free quality modeling aligned with human evaluation. Through extensive experiments, we demonstrate that the proposed adversarial training framework significantly improves upon translation baselines.

Title: Just a Scratch: Enhancing LLM Capabilities for Self-harm Detection through Intent Differentiation and Emoji Interpretation

Authors: Soumitra Ghosh, Gopendra Vikram Singh, Shambhavi, Sabarna Choudhury, Asif Ekbal
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.05073
Pdf URL: https://arxiv.org/pdf/2506.05073
Copy Paste: [[2506.05073]] Just a Scratch: Enhancing LLM Capabilities for Self-harm Detection through Intent Differentiation and Emoji Interpretation(https://arxiv.org/abs/2506.05073)
Keywords: language model, llm
Abstract: Self-harm detection on social media is critical for early intervention and mental health support, yet remains challenging due to the subtle, context-dependent nature of such expressions. Identifying self-harm intent aids suicide prevention by enabling timely responses, but current large language models (LLMs) struggle to interpret implicit cues in casual language and emojis. This work enhances LLMs' comprehension of self-harm by distinguishing intent through nuanced language-emoji interplay. We present the Centennial Emoji Sensitivity Matrix (CESM-100), a curated set of 100 emojis with contextual self-harm interpretations and the Self-Harm Identification aNd intent Extraction with Supportive emoji sensitivity (SHINES) dataset, offering detailed annotations for self-harm labels, casual mentions (CMs), and serious intents (SIs). Our unified framework: a) enriches inputs using CESM-100; b) fine-tunes LLMs for multi-task learning: self-harm detection (primary) and CM/SI span detection (auxiliary); c) generates explainable rationales for self-harm predictions. We evaluate the framework on three state-of-the-art LLMs (Llama 3, Mental-Alpaca, and MentalLlama) across zero-shot, few-shot, and fine-tuned scenarios. By coupling intent differentiation with contextual cues, our approach commendably enhances LLM performance in both detection and explanation tasks, effectively addressing the inherent ambiguity in self-harm signals. The SHINES dataset, CESM-100 and codebase are publicly available at: this https URL.

Title: Parking, Perception, and Retail: Street-Level Determinants of Community Vitality in Harbin

Authors: HaoTian Lan
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2506.05080
Pdf URL: https://arxiv.org/pdf/2506.05080
Copy Paste: [[2506.05080]] Parking, Perception, and Retail: Street-Level Determinants of Community Vitality in Harbin(https://arxiv.org/abs/2506.05080)
Keywords: language model, gpt
Abstract: The commercial vitality of community-scale streets in Chinese cities is shaped by complex interactions between vehicular accessibility, environmental quality, and pedestrian perception. This study proposes an interpretable, image-based framework to examine how street-level features -- including parked vehicle density, greenery, cleanliness, and street width -- impact retail performance and user satisfaction in Harbin, China. Leveraging street view imagery and a multimodal large language model (VisualGLM-6B), we construct a Community Commercial Vitality Index (CCVI) from Meituan and Dianping data and analyze its relationship with spatial attributes extracted via GPT-4-based perception modeling. Our findings reveal that while moderate vehicle presence may enhance commercial access, excessive on-street parking -- especially in narrow streets -- erodes walkability and reduces both satisfaction and shop-level pricing. In contrast, streets with higher perceived greenery and cleanliness show significantly greater satisfaction scores but only weak associations with pricing. Street width moderates the effects of vehicle presence, underscoring the importance of spatial configuration. These results demonstrate the value of integrating AI-assisted perception with urban morphological analysis to capture non-linear and context-sensitive drivers of commercial success. This study advances both theoretical and methodological frontiers by highlighting the conditional role of vehicle activity in neighborhood commerce and demonstrating the feasibility of multimodal AI for perceptual urban diagnostics. The implications extend to urban design, parking management, and scalable planning tools for community revitalization.

Title: The NTNU System at the S&I Challenge 2025 SLA Open Track

Authors: Hong-Yun Lin, Tien-Hong Lo, Yu-Hsuan Fang, Jhen-Ke Lin, Chung-Chun Wang, Hao-Chien Lu, Berlin Chen
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2506.05121
Pdf URL: https://arxiv.org/pdf/2506.05121
Copy Paste: [[2506.05121]] The NTNU System at the S&I Challenge 2025 SLA Open Track(https://arxiv.org/abs/2506.05121)
Keywords: language model, llm
Abstract: A recent line of research on spoken language assessment (SLA) employs neural models such as BERT and wav2vec 2.0 (W2V) to evaluate speaking proficiency across linguistic and acoustic modalities. Although both models effectively capture features relevant to oral competence, each exhibits modality-specific limitations. BERT-based methods rely on ASR transcripts, which often fail to capture prosodic and phonetic cues for SLA. In contrast, W2V-based methods excel at modeling acoustic features but lack semantic interpretability. To overcome these limitations, we propose a system that integrates W2V with the Phi-4 multimodal large language model (MLLM) through a score fusion strategy. The proposed system achieves a root mean square error (RMSE) of 0.375 on the official test set of the Speak & Improve Challenge 2025, securing second place in the competition. For comparison, the RMSEs of the top-ranked, third-ranked, and official baseline systems are 0.364, 0.384, and 0.444, respectively.

Title: DiCoRe: Enhancing Zero-shot Event Detection via Divergent-Convergent LLM Reasoning

Authors: Tanmay Parekh, Kartik Mehta, Ninareh Mehrabi, Kai-Wei Chang, Nanyun Peng
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.05128
Pdf URL: https://arxiv.org/pdf/2506.05128
Copy Paste: [[2506.05128]] DiCoRe: Enhancing Zero-shot Event Detection via Divergent-Convergent LLM Reasoning(https://arxiv.org/abs/2506.05128)
Keywords: language model, llm
Abstract: Zero-shot Event Detection (ED), the task of identifying event mentions in natural language text without any training data, is critical for document understanding in specialized domains. Understanding the complex event ontology, extracting domain-specific triggers from the passage, and structuring them appropriately overloads and limits the utility of Large Language Models (LLMs) for zero-shot ED. To this end, we propose DiCoRe, a divergent-convergent reasoning framework that decouples the task of ED using Dreamer and Grounder. Dreamer encourages divergent reasoning through open-ended event discovery, which helps to boost event coverage. Conversely, Grounder introduces convergent reasoning to align the free-form predictions with the task-specific instructions using finite-state machine guided constrained decoding. Additionally, an LLM-Judge verifies the final outputs to ensure high precision. Through extensive experiments on six datasets across five domains and nine LLMs, we demonstrate how DiCoRe consistently outperforms prior zero-shot, transfer-learning, and reasoning baselines, achieving 4-7% average F1 gains over the best baseline -- establishing DiCoRe as a strong zero-shot ED framework.

Title: Information Locality as an Inductive Bias for Neural Language Models

Authors: Taiga Someya, Anej Svete, Brian DuSell, Timothy J. O'Donnell, Mario Giulianelli, Ryan Cotterell
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.05136
Pdf URL: https://arxiv.org/pdf/2506.05136
Copy Paste: [[2506.05136]] Information Locality as an Inductive Bias for Neural Language Models(https://arxiv.org/abs/2506.05136)
Keywords: language model
Abstract: Inductive biases are inherent in every machine learning system, shaping how models generalize from finite data. In the case of neural language models (LMs), debates persist as to whether these biases align with or diverge from human processing constraints. To address this issue, we propose a quantitative framework that allows for controlled investigations into the nature of these biases. Within our framework, we introduce $m$-local entropy -- an information-theoretic measure derived from average lossy-context surprisal -- that captures the local uncertainty of a language by quantifying how effectively the $m-1$ preceding symbols disambiguate the next symbol. In experiments on both perturbed natural language corpora and languages defined by probabilistic finite-state automata (PFSAs), we show that languages with higher $m$-local entropy are more difficult for Transformer and LSTM LMs to learn. These results suggest that neural LMs, much like humans, are highly sensitive to the local statistical structure of a language.
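Illustration: one way to make the idea concrete is to read $m$-local entropy as the conditional entropy of the next symbol given the $m-1$ preceding symbols, and to estimate it from n-gram counts. Both this reading and the plug-in estimator below are assumptions based on the abstract's description; the paper's exact definition (via average lossy-context surprisal) may differ. The function name is hypothetical.

```python
import math
from collections import Counter


def m_local_entropy(sequence, m):
    """Plug-in estimate, in bits, of H(next symbol | previous m-1 symbols)."""
    if m < 1:
        raise ValueError("m must be at least 1")
    if len(sequence) < m:
        raise ValueError("sequence is shorter than m")
    # Count every window of m consecutive symbols.
    ngrams = Counter(tuple(sequence[i:i + m]) for i in range(len(sequence) - m + 1))
    # Count how often each (m-1)-symbol context starts one of those windows.
    contexts = Counter()
    for gram, count in ngrams.items():
        contexts[gram[:-1]] += count
    total = sum(ngrams.values())
    entropy = 0.0
    for gram, count in ngrams.items():
        p_joint = count / total                # estimate of p(context, next)
        p_cond = count / contexts[gram[:-1]]   # estimate of p(next | context)
        entropy -= p_joint * math.log2(p_cond)
    return entropy


# A strictly alternating string is fully predictable from one preceding symbol.
print(m_local_entropy("abababab", 2))
# With no context (m = 1) the same symbols look like a fair coin.
print(m_local_entropy("abab", 1))
```

Under this reading, a language whose next symbol is poorly determined by its immediate context has a higher value, which matches the abstract's claim that such languages are harder for Transformer and LSTM LMs to learn.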

Title: AudioLens: A Closer Look at Auditory Attribute Perception of Large Audio-Language Models

Authors: Chih-Kai Yang, Neo Ho, Yi-Jyun Lee, Hung-yi Lee
Subjects: cs.CL, cs.AI, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2506.05140
Pdf URL: https://arxiv.org/pdf/2506.05140
Copy Paste: [[2506.05140]] AudioLens: A Closer Look at Auditory Attribute Perception of Large Audio-Language Models(https://arxiv.org/abs/2506.05140)
Keywords: language model
Abstract: Understanding the internal mechanisms of large audio-language models (LALMs) is crucial for interpreting their behavior and improving performance. This work presents the first in-depth analysis of how LALMs internally perceive and recognize auditory attributes. By applying vocabulary projection on three state-of-the-art LALMs, we track how attribute information evolves across layers and token positions. We find that attribute information generally decreases with layer depth when recognition fails, and that resolving attributes at earlier layers correlates with better accuracy. Moreover, LALMs heavily rely on querying auditory inputs for predicting attributes instead of aggregating necessary information in hidden states at attribute-mentioning positions. Based on our findings, we demonstrate a method to enhance LALMs. Our results offer insights into auditory attribute processing, paving the way for future improvements.

Title: Do Large Language Models Judge Error Severity Like Humans?

Authors: Diege Sun, Guanyi Chen, Fan Zhao, Xiaorong Cheng, Tingting He
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.05142
Pdf URL: https://arxiv.org/pdf/2506.05142
Copy Paste: [[2506.05142]] Do Large Language Models Judge Error Severity Like Humans?(https://arxiv.org/abs/2506.05142)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) are increasingly used as automated evaluators in natural language generation, yet it remains unclear whether they can accurately replicate human judgments of error severity. In this study, we systematically compare human and LLM assessments of image descriptions containing controlled semantic errors. We extend the experimental framework of van Miltenburg et al. (2020) to both unimodal (text-only) and multimodal (text + image) settings, evaluating four error types: age, gender, clothing type, and clothing colour. Our findings reveal that humans assign varying levels of severity to different error types, with visual context significantly amplifying perceived severity for colour and type errors. Notably, most LLMs assign low scores to gender errors but disproportionately high scores to colour errors, unlike humans, who judge both as highly severe but for different reasons. This suggests that these models may have internalised social norms influencing gender judgments but lack the perceptual grounding to emulate human sensitivity to colour, which is shaped by distinct neural mechanisms. Only one of the evaluated LLMs, Doubao, replicates the human-like ranking of error severity, but it fails to distinguish between error types as clearly as humans. Surprisingly, DeepSeek-V3, a unimodal LLM, achieves the highest alignment with human judgments across both unimodal and multimodal conditions, outperforming even state-of-the-art multimodal models.

Title: Knowledgeable-r1: Policy Optimization for Knowledge Exploration in Retrieval-Augmented Generation

Authors: Chenyu Lin, Yilin Wen, Du Su, Fei Sun, Muhan Chen, Chenfu Bao, Zhonghou Lv
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2506.05154
Pdf URL: https://arxiv.org/pdf/2506.05154
Copy Paste: [[2506.05154]] Knowledgeable-r1: Policy Optimization for Knowledge Exploration in Retrieval-Augmented Generation(https://arxiv.org/abs/2506.05154)
Keywords: language model, retrieval-augmented generation
Abstract: Retrieval-augmented generation (RAG) is a mainstream method for improving performance on knowledge-intensive tasks. However, current RAG systems often place too much emphasis on retrieved contexts. This can lead to reliance on inaccurate sources and to overlooking the model's inherent knowledge, especially when dealing with misleading or excessive information. To resolve this imbalance, we propose Knowledgeable-r1, which uses joint sampling and defines multi-policy distributions in knowledge capability exploration to stimulate large language models' self-integrated utilization of parametric and contextual knowledge. Experiments show that Knowledgeable-r1 significantly enhances robustness and reasoning accuracy on both parametric and contextual knowledge conflict tasks and general RAG tasks, notably outperforming baselines by 17.07% in counterfactual scenarios and demonstrating consistent gains across RAG tasks. Our code is available at this https URL knowledgeable-r1.

Title: Dissecting Bias in LLMs: A Mechanistic Interpretability Perspective

Authors: Bhavik Chandna, Zubair Bashir, Procheta Sen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.05166
Pdf URL: https://arxiv.org/pdf/2506.05166
Copy Paste: [[2506.05166]] Dissecting Bias in LLMs: A Mechanistic Interpretability Perspective(https://arxiv.org/abs/2506.05166)
Keywords: language model, gpt, llm
Abstract: Large Language Models (LLMs) are known to exhibit social, demographic, and gender biases, often as a consequence of the data on which they are trained. In this work, we adopt a mechanistic interpretability approach to analyze how such biases are structurally represented within models such as GPT-2 and Llama2. Focusing on demographic and gender biases, we explore different metrics to identify the internal edges responsible for biased behavior. We then assess the stability, localization, and generalizability of these components across dataset and linguistic variations. Through systematic ablations, we demonstrate that bias-related computations are highly localized, often concentrated in a small subset of layers. Moreover, the identified components change across fine-tuning settings, including those unrelated to bias. Finally, we show that removing these components not only reduces biased outputs but also affects other NLP tasks, such as named entity recognition and linguistic acceptability judgment because of the sharing of important components with these tasks.

Title: ECoRAG: Evidentiality-guided Compression for Long Context RAG

Authors: Yeonseok Jeong, Jinsu Kim, Dohyeon Lee, Seung-won Hwang
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2506.05167
Pdf URL: https://arxiv.org/pdf/2506.05167
Copy Paste: [[2506.05167]] ECoRAG: Evidentiality-guided Compression for Long Context RAG(https://arxiv.org/abs/2506.05167)
Keywords: language model, llm, long context, retrieval-augmented generation
Abstract: Large Language Models (LLMs) have shown remarkable performance in Open-Domain Question Answering (ODQA) by leveraging external documents through Retrieval-Augmented Generation (RAG). To reduce the RAG overhead that comes from longer contexts, context compression is necessary. However, prior compression methods do not focus on filtering out non-evidential information, which limits the performance of LLM-based RAG. We thus propose the Evidentiality-guided RAG (ECoRAG) framework. ECoRAG improves LLM performance by compressing retrieved documents based on evidentiality, ensuring that answer generation is supported by the correct evidence. As an additional step, ECoRAG reflects on whether the compressed content provides sufficient evidence, and if not, retrieves more until it is sufficient. Experiments show that ECoRAG improves LLM performance on ODQA tasks, outperforming existing compression methods. Furthermore, ECoRAG is highly cost-efficient, as it not only reduces latency but also minimizes token usage by retaining only the information necessary to generate the correct answer. Code is available at this https URL.
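Illustration: the control flow described in the abstract (filter retrieved text by evidentiality, check sufficiency, retrieve more if needed) can be sketched as a simple loop. Everything here is hypothetical: the function names, the threshold, the batch size, and the keyword-based toy scorer stand in for learned components that the abstract does not specify. This is not the authors' implementation.

```python
def compress_until_sufficient(question, retrieve, score_evidence, is_sufficient,
                              batch_size=2, max_docs=10, threshold=0.5):
    """Keep only evidential passages, retrieving more until the evidence suffices."""
    kept = []
    seen = 0
    while seen < max_docs:
        batch = retrieve(question, offset=seen, limit=batch_size)
        if not batch:
            break  # nothing left to retrieve
        seen += len(batch)
        # Compression step: drop passages scored as non-evidential.
        kept.extend(p for p in batch if score_evidence(question, p) >= threshold)
        # Reflection step: stop as soon as the kept evidence is judged sufficient.
        if is_sufficient(question, kept):
            break
    return kept


# Toy stand-ins for a retriever, an evidentiality scorer, and a sufficiency check.
corpus = [
    "France borders Spain.",
    "Cheese is popular in France.",
    "Paris is the capital of France.",
    "The Eiffel Tower is in Paris.",
]


def toy_retrieve(question, offset, limit):
    return corpus[offset:offset + limit]


def toy_score(question, passage):
    return 1.0 if "capital" in passage.lower() else 0.0


def toy_sufficient(question, kept):
    return len(kept) >= 1


evidence = compress_until_sufficient(
    "What is the capital of France?", toy_retrieve, toy_score, toy_sufficient
)
print(evidence)
```

The first batch contains no evidential passage, so the loop retrieves a second batch and stops once one supporting passage has been kept, leaving the remaining passage out of the prompt.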

Title: Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models

Authors: Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, Fei Huang, Jingren Zhou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.05176
Pdf URL: https://arxiv.org/pdf/2506.05176
Copy Paste: [[2506.05176]] Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models(https://arxiv.org/abs/2506.05176)
Keywords: llm
Abstract: In this work, we introduce the Qwen3 Embedding series, a significant advancement over its predecessor, the GTE-Qwen series, in text embedding and reranking capabilities, built upon the Qwen3 foundation models. Leveraging the Qwen3 LLMs' robust capabilities in multilingual text understanding and generation, our innovative multi-stage training pipeline combines large-scale unsupervised pre-training with supervised fine-tuning on high-quality datasets. Effective model merging strategies further ensure the robustness and adaptability of the Qwen3 Embedding series. During the training process, the Qwen3 LLMs serve not only as backbone models but also play a crucial role in synthesizing high-quality, rich, and diverse training data across multiple domains and languages, thus enhancing the training pipeline. The Qwen3 Embedding series offers a spectrum of model sizes (0.6B, 4B, 8B) for both embedding and reranking tasks, addressing diverse deployment scenarios where users can optimize for either efficiency or effectiveness. Empirical evaluations demonstrate that the Qwen3 Embedding series achieves state-of-the-art results across diverse benchmarks. Notably, it excels on the multilingual evaluation benchmark MTEB for text embedding, as well as in various retrieval tasks, including code retrieval, cross-lingual retrieval and multilingual retrieval. To facilitate reproducibility and promote community-driven research and development, the Qwen3 Embedding models are publicly available under the Apache 2.0 license.

Title: Counterfactual reasoning: an analysis of in-context emergence

Authors: Moritz Miller, Bernhard Schölkopf, Siyuan Guo
Subjects: cs.CL, cs.AI, cs.LG, math.ST
Abstract URL: https://arxiv.org/abs/2506.05188
Pdf URL: https://arxiv.org/pdf/2506.05188
Copy Paste: [[2506.05188]] Counterfactual reasoning: an analysis of in-context emergence(https://arxiv.org/abs/2506.05188)
Keywords: language model
Abstract: Large-scale neural language models (LMs) exhibit remarkable performance in in-context learning: the ability to learn from and reason about the input context on the fly without parameter updates. This work studies in-context counterfactual reasoning in language models, that is, predicting the consequences of changes under hypothetical scenarios. We focus on studying a well-defined synthetic setup: a linear regression task that requires noise abduction, where accurate prediction is based on inferring and copying the contextual noise from factual observations. We show that language models are capable of counterfactual reasoning in this controlled setup and provide insights that counterfactual reasoning for a broad class of functions can be reduced to a transformation on in-context observations; we find that self-attention, model depth, and data diversity in pre-training drive performance in Transformers. More interestingly, our findings extend beyond regression tasks and show that Transformers can perform noise abduction on sequential data, providing preliminary evidence on the potential for counterfactual story generation. Our code is available under this https URL.
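Illustration: noise abduction in a linear regression setting can be shown with a few lines of arithmetic. The sketch assumes an additive-noise model y = beta * x + u with a known coefficient beta; in the paper's setup the model must infer the relevant quantities in context, so this only shows the target computation, not the authors' method. The function and parameter names are hypothetical.

```python
def counterfactual_prediction(beta, x_factual, y_factual, x_counterfactual):
    """Predict y under a counterfactual x by reusing the abducted noise term."""
    # Abduction: recover the noise that explains the factual observation.
    noise = y_factual - beta * x_factual
    # Action and prediction: change x, keep the same noise.
    return beta * x_counterfactual + noise


# Factual observation: x = 1.0 gave y = 2.5 under beta = 2.0, so the noise is 0.5.
y_cf = counterfactual_prediction(beta=2.0, x_factual=1.0, y_factual=2.5,
                                 x_counterfactual=3.0)
print(y_cf)  # 6.5
```

The key point is that the noise is copied from the factual observation rather than resampled, which is what distinguishes a counterfactual prediction from an ordinary one.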

Title: RELIC: Evaluating Compositional Instruction Following via Language Recognition

Authors: Jackson Petty, Michael Y. Hu, Wentao Wang, Shauli Ravfogel, William Merrill, Tal Linzen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.05205
Pdf URL: https://arxiv.org/pdf/2506.05205
Copy Paste: [[2506.05205]] RELIC: Evaluating Compositional Instruction Following via Language Recognition(https://arxiv.org/abs/2506.05205)
Keywords: language model, llm
Abstract: Large language models (LLMs) are increasingly expected to perform tasks based only on a specification of the task provided in context, without examples of inputs and outputs; this ability is referred to as instruction following. We introduce the Recognition of Languages In-Context (RELIC) framework to evaluate instruction following using language recognition: the task of determining if a string is generated by a formal grammar. Unlike many standard evaluations of LLMs' ability to use their context, this task requires composing together a large number of instructions (grammar productions) retrieved from the context. Because the languages are synthetic, the task can be increased in complexity as LLMs' skills improve, and new instances can be automatically generated, mitigating data contamination. We evaluate state-of-the-art LLMs on RELIC and find that their accuracy can be reliably predicted from the complexity of the grammar and the individual example strings, and that even the most advanced LLMs currently available show near-chance performance on more complex grammars and samples, in line with theoretical expectations. We also use RELIC to diagnose how LLMs attempt to solve increasingly difficult reasoning tasks, finding that as the complexity of the language recognition task increases, models switch to relying on shallow heuristics instead of following complex instructions.
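Illustration: language recognition, deciding whether a string is generated by a grammar, has a classical algorithmic solution for context-free grammars in Chomsky normal form, the CYK algorithm. The sketch below is a generic textbook recognizer with a hand-written toy grammar; it is included only to show what the task asks of a model, and the grammar encoding and names are assumptions, not the RELIC benchmark's format.

```python
from itertools import product


def cyk_recognize(tokens, unary_rules, binary_rules, start="S"):
    """Return True if a Chomsky-normal-form grammar generates the token sequence.

    unary_rules maps a terminal to the set of nonterminals that produce it;
    binary_rules maps a pair (B, C) to the set of nonterminals A with A -> B C.
    """
    n = len(tokens)
    if n == 0:
        return False
    # table[i][length - 1] holds the nonterminals deriving tokens[i:i + length].
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, token in enumerate(tokens):
        table[i][0] = set(unary_rules.get(token, ()))
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            for split in range(1, span):
                left = table[i][split - 1]
                right = table[i + split][span - split - 1]
                for pair in product(left, right):
                    table[i][span - 1] |= binary_rules.get(pair, set())
    return start in table[0][n - 1]


# Toy grammar for a^n b^n (n >= 1): S -> A B | A T, T -> S B, A -> a, B -> b.
unary = {"a": {"A"}, "b": {"B"}}
binary = {("A", "B"): {"S"}, ("A", "T"): {"S"}, ("S", "B"): {"T"}}

print(cyk_recognize(list("aabb"), unary, binary))  # True
print(cyk_recognize(list("abb"), unary, binary))   # False
```

Each accepted string requires chaining several productions together, which is the compositional use of in-context instructions that the abstract describes.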

Title: The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text

Authors: Nikhil Kandpal, Brian Lester, Colin Raffel, Sebastian Majstorovic, Stella Biderman, Baber Abbasi, Luca Soldaini, Enrico Shippole, A. Feder Cooper, Aviya Skowron, John Kirchenbauer, Shayne Longpre, Lintang Sutawika, Alon Albalak, Zhenlin Xu, Guilherme Penedo, Loubna Ben Allal, Elie Bakouch, John David Pressman, Honglu Fan, Dashiell Stander, Guangyu Song, Aaron Gokaslan, Tom Goldstein, Brian R. Bartoldson, Bhavya Kailkhura, Tyler Murray
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.05209
Pdf URL: https://arxiv.org/pdf/2506.05209
Copy Paste: [[2506.05209]] The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text(https://arxiv.org/abs/2506.05209)
Keywords: language model, llm
Abstract: Large language models (LLMs) are typically trained on enormous quantities of unlicensed text, a practice that has led to scrutiny due to possible intellectual property infringement and ethical concerns. Training LLMs on openly licensed text presents a first step towards addressing these issues, but prior data collection efforts have yielded datasets too small or low-quality to produce performant LLMs. To address this gap, we collect, curate, and release the Common Pile v0.1, an eight terabyte collection of openly licensed text designed for LLM pretraining. The Common Pile comprises content from 30 sources that span diverse domains including research papers, code, books, encyclopedias, educational materials, audio transcripts, and more. Crucially, we validate our efforts by training two 7 billion parameter LLMs on text from the Common Pile: Comma v0.1-1T and Comma v0.1-2T, trained on 1 and 2 trillion tokens respectively. Both models attain performance competitive with LLMs trained on unlicensed text with similar computational budgets, such as Llama 1 and 2 7B. In addition to releasing the Common Pile v0.1 itself, we also release the code used in its creation as well as the training mixture and checkpoints for the Comma v0.1 models.

Title: Improving Low-Resource Morphological Inflection via Self-Supervised Objectives

Authors: Adam Wiemerslage, Katharina von der Wense
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.05227
Pdf URL: https://arxiv.org/pdf/2506.05227
Copy Paste: [[2506.05227]] Improving Low-Resource Morphological Inflection via Self-Supervised Objectives(https://arxiv.org/abs/2506.05227)
Keywords: language model
Abstract: Self-supervised objectives have driven major advances in NLP by leveraging large-scale unlabeled data, but such resources are scarce for many of the world's languages. Surprisingly, they have not been explored much for character-level tasks, where smaller amounts of data have the potential to be beneficial. We investigate the effectiveness of self-supervised auxiliary tasks for morphological inflection -- a character-level task highly relevant for language documentation -- in extremely low-resource settings, training encoder-decoder transformers for 19 languages and 13 auxiliary objectives. Autoencoding yields the best performance when unlabeled data is very limited, while character masked language modeling (CMLM) becomes more effective as data availability increases. Though objectives with stronger inductive biases influence model predictions intuitively, they rarely outperform standard CMLM. However, sampling masks based on known morpheme boundaries consistently improves performance, highlighting a promising direction for low-resource morphological modeling.

Title: CLATTER: Comprehensive Entailment Reasoning for Hallucination Detection

Authors: Ron Eliav, Arie Cattan, Eran Hirsch, Shahaf Bassan, Elias Stengel-Eskin, Mohit Bansal, Ido Dagan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.05243
Pdf URL: https://arxiv.org/pdf/2506.05243
Copy Paste: [[2506.05243]] CLATTER: Comprehensive Entailment Reasoning for Hallucination Detection(https://arxiv.org/abs/2506.05243)
Keywords: llm, hallucination
Abstract: A common approach to hallucination detection casts it as a natural language inference (NLI) task, often using LLMs to classify whether the generated text is entailed by corresponding reference texts. Since entailment classification is a complex reasoning task, one would expect that LLMs could benefit from generating an explicit reasoning process, as in CoT reasoning or the explicit "thinking" of recent reasoning models. In this work, we propose that guiding such models to perform a systematic and comprehensive reasoning process -- one that both decomposes the text into smaller facts and also finds evidence in the source for each fact -- allows models to execute much finer-grained and accurate entailment decisions, leading to increased performance. To that end, we define a 3-step reasoning process, consisting of (i) claim decomposition, (ii) sub-claim attribution and entailment classification, and (iii) aggregated classification, showing that such guided reasoning indeed yields improved hallucination detection. Following this reasoning framework, we introduce an analysis scheme, consisting of several metrics that measure the quality of the intermediate reasoning steps, which provide additional empirical evidence for the improved quality of our guided reasoning scheme.

Title: Micro-Act: Mitigate Knowledge Conflict in Question Answering via Actionable Self-Reasoning

Authors: Nan Huo, Jinyang Li, Bowen Qin, Ge Qu, Xiaolong Li, Xiaodong Li, Chenhao Ma, Reynold Cheng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.05278
Pdf URL: https://arxiv.org/pdf/2506.05278
Copy Paste: [[2506.05278]] Micro-Act: Mitigate Knowledge Conflict in Question Answering via Actionable Self-Reasoning(https://arxiv.org/abs/2506.05278)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) systems commonly suffer from Knowledge Conflicts, where retrieved external knowledge contradicts the inherent, parametric knowledge of large language models (LLMs). This adversely affects performance on downstream tasks such as question answering (QA). Existing approaches often attempt to mitigate conflicts by directly comparing two knowledge sources in a side-by-side manner, but this can overwhelm LLMs with extraneous or lengthy contexts, ultimately hindering their ability to identify and mitigate inconsistencies. To address this issue, we propose Micro-Act, a framework with a hierarchical action space that automatically perceives context complexity and adaptively decomposes each knowledge source into a sequence of fine-grained comparisons. These comparisons are represented as actionable steps, enabling reasoning beyond the superficial context. Through extensive experiments on five benchmark datasets, Micro-Act consistently achieves a significant increase in QA accuracy over state-of-the-art baselines across all 5 datasets and 3 conflict types, especially in temporal and semantic types where all baselines fail significantly. More importantly, Micro-Act simultaneously exhibits robust performance on non-conflict questions, highlighting its practical value in real-world RAG applications.

Title: ProRefine: Inference-time Prompt Refinement with Textual Feedback

Authors: Deepak Pandita, Tharindu Cyril Weerasooriya, Ankit Parag Shah, Christopher M. Homan, Wei Wei
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.05305
Pdf URL: https://arxiv.org/pdf/2506.05305
Copy Paste: [[2506.05305]] ProRefine: Inference-time Prompt Refinement with Textual Feedback(https://arxiv.org/abs/2506.05305)
Keywords: language model, llm, prompt, chain-of-thought, agent
Abstract: Agentic workflows, where multiple AI agents collaborate to accomplish complex tasks like reasoning or planning, are becoming increasingly prevalent. However, these workflows often suffer from error propagation and sub-optimal performance, largely due to poorly designed prompts that fail to effectively guide individual agents. This is a critical problem because it limits the reliability and scalability of these powerful systems. We introduce ProRefine, an innovative inference-time prompt optimization method that leverages textual feedback from large language models (LLMs) to address this challenge. ProRefine dynamically refines prompts for multi-step reasoning tasks without additional training or ground truth labels. Evaluated on five benchmark mathematical reasoning datasets, ProRefine significantly surpasses zero-shot Chain-of-Thought baselines by 3 to 37 percentage points. This approach not only boosts accuracy but also allows smaller models to match the performance of larger ones, highlighting its potential for efficient and scalable AI deployment, and democratizing access to high-performing AI.

Title: Constrained Entropic Unlearning: A Primal-Dual Framework for Large Language Models

Authors: Taha Entesari, Arman Hatami, Rinat Khaziev, Anil Ramakrishna, Mahyar Fazlyab
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.05314
Pdf URL: https://arxiv.org/pdf/2506.05314
Copy Paste: [[2506.05314]] Constrained Entropic Unlearning: A Primal-Dual Framework for Large Language Models(https://arxiv.org/abs/2506.05314)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) deployed in real-world settings increasingly face the need to unlearn sensitive, outdated, or proprietary information. Existing unlearning methods typically formulate forgetting and retention as a regularized trade-off, combining both objectives into a single scalarized loss. This often leads to unstable optimization and degraded performance on retained data, especially under aggressive forgetting. We propose a new formulation of LLM unlearning as a constrained optimization problem: forgetting is enforced via a novel logit-margin flattening loss that explicitly drives the output distribution toward uniformity on a designated forget set, while retention is preserved through a hard constraint on a separate retain set. Compared to entropy-based objectives, our loss is softmax-free, numerically stable, and maintains non-vanishing gradients, enabling more efficient and robust optimization. We solve the constrained problem using a scalable primal-dual algorithm that exposes the trade-off between forgetting and retention through the dynamics of the dual variable. Evaluations on the TOFU and MUSE benchmarks across diverse LLM architectures demonstrate that our approach consistently matches or exceeds state-of-the-art baselines, effectively removing targeted information while preserving downstream utility.
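Summary: Large language models (LLMs) deployed in real-world settings increasingly need to unlearn sensitive, outdated, or proprietary information. Existing unlearning methods typically treat forgetting and retention as a regularized trade-off, combining both objectives into a single scalarized loss. This often leads to unstable optimization and degraded performance on retained data, especially under aggressive forgetting. We propose a new formulation of LLM unlearning as a constrained optimization problem: forgetting is enforced through a novel logit-margin flattening loss that explicitly drives the output distribution toward uniformity on a designated forget set, while retention is preserved through a hard constraint on a separate retain set. Compared with entropy-based objectives, our loss is softmax-free, numerically stable, and maintains non-vanishing gradients, enabling more efficient and robust optimization. We solve the constrained problem with a scalable primal-dual algorithm that exposes the trade-off between forgetting and retention through the dynamics of the dual variable. Evaluations on the TOFU and MUSE benchmarks across diverse LLM architectures show that our approach consistently matches or exceeds state-of-the-art baselines, effectively removing targeted information while preserving downstream utility.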
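
To make the constrained formulation concrete, here is a minimal primal-dual sketch on a one-parameter toy problem. It is an illustration under stated assumptions, not the paper's algorithm: the quadratic `forget_loss` and `retain_loss` are hypothetical stand-ins for the logit-margin flattening loss and the retain-set loss, and the tolerance `epsilon` and both step sizes are arbitrary choices. The update rule is standard gradient descent on the Lagrangian in the parameter and projected gradient ascent in the dual variable.

```python
from typing import Tuple

# Toy stand-ins (assumptions, not from the paper):
def forget_loss(theta: float) -> float:
    return (theta - 2.0) ** 2      # forgetting alone would push theta to 2


def forget_grad(theta: float) -> float:
    return 2.0 * (theta - 2.0)


def retain_loss(theta: float) -> float:
    return theta ** 2              # retention wants theta close to 0


def retain_grad(theta: float) -> float:
    return 2.0 * theta


def primal_dual(
    epsilon: float = 1.0,          # hard constraint: retain_loss(theta) <= epsilon
    primal_lr: float = 0.05,
    dual_lr: float = 0.05,
    steps: int = 2000,
) -> Tuple[float, float]:
    theta, lam = 0.0, 0.0
    for _ in range(steps):
        # Primal step: descend the Lagrangian
        #   forget_loss(theta) + lam * (retain_loss(theta) - epsilon)
        grad = forget_grad(theta) + lam * retain_grad(theta)
        theta -= primal_lr * grad
        # Dual step: raise lam while the constraint is violated,
        # lower it otherwise, and keep it non-negative.
        violation = retain_loss(theta) - epsilon
        lam = max(0.0, lam + dual_lr * violation)
    return theta, lam


theta_star, lam_star = primal_dual()
```

For this toy problem the KKT conditions give theta = 1 and lambda = 1, which the iteration should approach. The dual variable acts as an automatically tuned weight on retention, in contrast to a fixed coefficient in a single scalarized loss.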

Title: Search Arena: Analyzing Search-Augmented LLMs

Authors: Mihran Miroyan, Tsung-Han Wu, Logan King, Tianle Li, Jiayi Pan, Xinyan Hu, Wei-Lin Chiang, Anastasios N. Angelopoulos, Trevor Darrell, Narges Norouzi, Joseph E. Gonzalez
Subjects: cs.CL, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2506.05334
Pdf URL: https://arxiv.org/pdf/2506.05334
Copy Paste: [[2506.05334]] Search Arena: Analyzing Search-Augmented LLMs(https://arxiv.org/abs/2506.05334)
Keywords: language model, llm, chat
Abstract: Search-augmented language models combine web search with Large Language Models (LLMs) to improve response groundedness and freshness. However, analyzing these systems remains challenging: existing datasets are limited in scale and narrow in scope, often constrained to static, single-turn, fact-checking questions. In this work, we introduce Search Arena, a crowd-sourced, large-scale, human-preference dataset of over 24,000 paired multi-turn user interactions with search-augmented LLMs. The dataset spans diverse intents and languages, and contains full system traces with around 12,000 human preference votes. Our analysis reveals that user preferences are influenced by the number of citations, even when the cited content does not directly support the attributed claims, uncovering a gap between perceived and actual credibility. Furthermore, user preferences vary across cited sources, revealing that community-driven platforms are generally preferred and static encyclopedic sources are not always appropriate and reliable. To assess performance across different settings, we conduct cross-arena analyses by testing search-augmented LLMs in a general-purpose chat environment and conventional LLMs in search-intensive settings. We find that web search does not degrade and may even improve performance in non-search settings; however, the quality in search settings is significantly affected if solely relying on the model's parametric knowledge. We open-sourced the dataset to support future research in this direction. Our dataset and code are available at: this https URL.
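Summary: Search-augmented language models combine web search with large language models (LLMs) to improve the groundedness and freshness of responses. However, analyzing these systems remains challenging: existing datasets are limited in scale and narrow in scope, and are often restricted to static, single-turn, fact-checking questions. In this work, we introduce Search Arena, a crowd-sourced, large-scale human-preference dataset of over 24,000 paired multi-turn user interactions with search-augmented LLMs. The dataset spans diverse intents and languages and contains full system traces with around 12,000 human preference votes. Our analysis shows that user preferences are influenced by the number of citations, even when the cited content does not directly support the attributed claims, revealing a gap between perceived and actual credibility. Furthermore, user preferences vary across cited sources: community-driven platforms are generally preferred, and static encyclopedic sources are not always appropriate and reliable. To assess performance across settings, we conduct cross-arena analyses by testing search-augmented LLMs in a general-purpose chat environment and conventional LLMs in search-intensive settings. We find that web search does not degrade, and may even improve, performance in non-search settings; however, quality in search settings is significantly affected when relying solely on the model's parametric knowledge. We open-source the dataset to support future research in this direction. Our dataset and code are available at: this https URL.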

Title: Flattery, Fluff, and Fog: Diagnosing and Mitigating Idiosyncratic Biases in Preference Models

Authors: Anirudh Bharadwaj, Chaitanya Malaviya, Nitish Joshi, Mark Yatskar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.05339
Pdf URL: https://arxiv.org/pdf/2506.05339
Copy Paste: [[2506.05339]] Flattery, Fluff, and Fog: Diagnosing and Mitigating Idiosyncratic Biases in Preference Models(https://arxiv.org/abs/2506.05339)
Keywords: language model
Abstract: Language models serve as proxies for human preference judgements in alignment and evaluation, yet they exhibit systematic miscalibration, prioritizing superficial patterns over substantive qualities. This bias manifests as overreliance on features like length, structure, and style, leading to issues like reward hacking and unreliable evaluations. Evidence suggests these biases originate in artifacts in human training data. In this work, we systematically investigate the relationship between training data biases and preference model miscalibration across five idiosyncratic features of language model generations: length, structure, jargon, sycophancy and vagueness. Using controlled counterfactual pairs, we first quantify the extent to which preference models favor responses with magnified biases (skew), finding this preference occurs in >60% of instances, and model preferences show high miscalibration (~40%) compared to human preferences. Notably, bias features only show mild negative correlations to human preference labels (mean r_human = -0.12) but show moderately strong positive correlations with labels from a strong reward model (mean r_model = +0.36), suggesting that models may overrely on spurious cues. To mitigate these issues, we propose a simple post-training method based on counterfactual data augmentation (CDA) using synthesized contrastive examples. Finetuning models with CDA reduces average miscalibration from 39.4% to 32.5% and average absolute skew difference from 20.5% to 10.0%, while maintaining overall RewardBench performance, showing that targeted debiasing is effective for building reliable preference models.
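Summary: Language models serve as proxies for human preference judgements in alignment and evaluation, yet they exhibit systematic miscalibration, prioritizing superficial patterns over substantive qualities. This bias manifests as overreliance on features such as length, structure, and style, leading to problems such as reward hacking and unreliable evaluations. Evidence suggests that these biases originate in artifacts in human training data. In this work, we systematically investigate the relationship between training-data biases and preference-model miscalibration across five idiosyncratic features of language model generations: length, structure, jargon, sycophancy, and vagueness. Using controlled counterfactual pairs, we first quantify the extent to which preference models favor responses with magnified biases (skew), finding that this preference occurs in >60% of instances and that model preferences show high miscalibration (~40%) relative to human preferences. Notably, bias features show only mild negative correlations with human preference labels (mean r_human = -0.12) but moderately strong positive correlations with labels from a strong reward model (mean r_model = +0.36), suggesting that models may overrely on spurious cues. To mitigate these issues, we propose a simple post-training method based on counterfactual data augmentation (CDA) using synthesized contrastive examples. Fine-tuning models with CDA reduces average miscalibration from 39.4% to 32.5% and average absolute skew difference from 20.5% to 10.0%, while maintaining overall RewardBench performance, showing that targeted debiasing is effective for building reliable preference models.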
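
The two headline quantities in this entry, skew and miscalibration, can be illustrated with a small sketch. The definitions below are assumptions made for illustration, since the abstract does not spell out the exact formulas: skew is taken as the fraction of counterfactual pairs in which the preference model picks the bias-magnified response, and miscalibration as the fraction of pairs in which the model and the human label disagree. The `CounterfactualPair` type and the five toy pairs are hypothetical. Only the last two lines use figures from the abstract.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CounterfactualPair:
    # True if the bias-magnified response was preferred over its
    # counterfactual partner (hypothetical record layout).
    model_prefers_biased: bool
    human_prefers_biased: bool


def skew_rate(pairs: List[CounterfactualPair]) -> float:
    """Assumed definition: share of pairs where the model picks the biased response."""
    return sum(p.model_prefers_biased for p in pairs) / len(pairs)


def miscalibration_rate(pairs: List[CounterfactualPair]) -> float:
    """Assumed definition: share of pairs where model and human choices differ."""
    return sum(p.model_prefers_biased != p.human_prefers_biased for p in pairs) / len(pairs)


# Toy data, invented purely to exercise the functions.
toy_pairs = [
    CounterfactualPair(True, True),
    CounterfactualPair(True, False),
    CounterfactualPair(True, False),
    CounterfactualPair(False, False),
    CounterfactualPair(False, False),
]

toy_skew = skew_rate(toy_pairs)
toy_miscalibration = miscalibration_rate(toy_pairs)

# Reductions reported in the abstract, in percentage points.
miscalibration_drop = round(39.4 - 32.5, 1)
skew_difference_drop = round(20.5 - 10.0, 1)
```

With the reported figures, CDA lowers average miscalibration by 6.9 percentage points and average absolute skew difference by 10.5 percentage points.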