2025-07-04

Title: McBE: A Multi-task Chinese Bias Evaluation Benchmark for Large Language Models

Authors: Tian Lan, Xiangdong Su, Xu Liu, Ruirui Wang, Ke Chang, Jiang Li, Guanglai Gao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.02088
Pdf URL: https://arxiv.org/pdf/2507.02088
Copy Paste: [[2507.02088]] McBE: A Multi-task Chinese Bias Evaluation Benchmark for Large Language Models(https://arxiv.org/abs/2507.02088)
Keywords: language model, llm
Abstract: As large language models (LLMs) are increasingly applied to various NLP tasks, their inherent biases are gradually disclosed. Therefore, measuring biases in LLMs is crucial to mitigating their ethical risks. However, most existing bias evaluation datasets focus on English and North American culture, and their bias categories are not fully applicable to other cultures. Datasets grounded in the Chinese language and culture are scarce. More importantly, these datasets usually support only a single evaluation task and cannot evaluate bias in LLMs from multiple aspects. To address these issues, we present a Multi-task Chinese Bias Evaluation Benchmark (McBE) that includes 4,077 bias evaluation instances, covering 12 single bias categories and 82 subcategories and introducing 5 evaluation tasks, providing extensive category coverage, content diversity, and measuring comprehensiveness. Additionally, we evaluate several popular LLMs from different series and with different parameter sizes. In general, all these LLMs demonstrated varying degrees of bias. We conduct an in-depth analysis of results, offering novel insights into bias in LLMs.

Title: Reasoning or Not? A Comprehensive Evaluation of Reasoning LLMs for Dialogue Summarization

Authors: Keyan Jin, Yapeng Wang, Leonel Santos, Tao Fang, Xu Yang, Sio Kei Im, Hugo Gonçalo Oliveira
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.02145
Pdf URL: https://arxiv.org/pdf/2507.02145
Copy Paste: [[2507.02145]] Reasoning or Not? A Comprehensive Evaluation of Reasoning LLMs for Dialogue Summarization(https://arxiv.org/abs/2507.02145)
Keywords: language model, llm, chain-of-thought
Abstract: Dialogue summarization is a challenging task with significant practical value in customer service, meeting analysis, and conversational AI. Although large language models (LLMs) have achieved substantial progress in summarization tasks, the performance of step-by-step reasoning architectures, specifically Long Chain-of-Thought (CoT) implementations such as OpenAI-o1 and DeepSeek-R1, remains unexplored for dialogue scenarios requiring concurrent abstraction and conciseness. In this work, we present the first comprehensive and systematic evaluation of state-of-the-art reasoning LLMs and non-reasoning LLMs across three major paradigms: generic, role-oriented, and query-oriented dialogue summarization. Our study spans diverse languages, domains, and summary lengths, leveraging strong benchmarks (SAMSum, DialogSum, CSDS, and QMSum) and advanced evaluation protocols that include both LLM-based automatic metrics and human-inspired criteria. Contrary to trends in other reasoning-intensive tasks, our findings show that explicit stepwise reasoning does not consistently improve dialogue summarization quality. Instead, reasoning LLMs are often prone to verbosity, factual inconsistencies, and less concise summaries compared to their non-reasoning counterparts. Through scenario-specific analyses and detailed case studies, we further identify when and why explicit reasoning may fail to benefit, or even hinder, summarization in complex dialogue contexts. Our work provides new insights into the limitations of current reasoning LLMs and highlights the need for targeted modeling and evaluation strategies for real-world dialogue summarization.

Title: Latent Chain-of-Thought? Decoding the Depth-Recurrent Transformer

Authors: Wenquan Lu, Yuechuan Yang, Kyle Lee, Yanshu Li, Enqi Liu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2507.02199
Pdf URL: https://arxiv.org/pdf/2507.02199
Copy Paste: [[2507.02199]] Latent Chain-of-Thought? Decoding the Depth-Recurrent Transformer(https://arxiv.org/abs/2507.02199)
Keywords: language model, chain-of-thought
Abstract: Chain-of-thought (CoT) reasoning has enabled transformer-based language models to excel at complex mathematics and multi-step planning. However, in standard decoder-only architectures, these reasoning steps are externalized in natural language, improving interpretability at the cost of efficiency. To capture reasoning that is not easily represented in words, many works have explored recurrent architectures that aim to internalize reasoning in latent space, potentially supporting latent CoT. In this paper, we investigate whether such reasoning structures emerge in Huginn-3.5B, a depth-recurrent Transformer that reuses layers at inference time without increasing parameter count. We examine the model's internal behavior on arithmetic tasks using a suite of probing techniques including the Logit Lens and Coda Lens. Our findings reveal limited evidence of interpretable latent CoT by tracking rank trajectories of final and intermediate result tokens. Furthermore, we uncover significant probing inconsistencies across recurrent blocks, where the interpretability of hidden states depends heavily on both the layer index and the decoding method. Finally, we empirically show that increasing recurrence depth yields only marginal gains and falls well short of models that explicitly externalize reasoning steps. The code is available at this https URL.
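
The rank-trajectory probing mentioned in the abstract can be illustrated with a generic logit-lens style sketch: decode each intermediate hidden state with the output (unembedding) matrix and record the rank of a token of interest. This is a toy illustration only, not the authors' code; the names `decode_logits` and `token_rank`, the 3-token vocabulary, and the hand-written hidden states are all assumptions made for the example.

```python
def decode_logits(hidden, unembed):
    """Project a hidden state onto the vocabulary.

    unembed[j] is the (hypothetical) output vector of vocabulary item j.
    """
    return [sum(h * w for h, w in zip(hidden, row)) for row in unembed]


def token_rank(hidden, unembed, token_id):
    """Rank of token_id among all logits (0 means it is the top prediction)."""
    logits = decode_logits(hidden, unembed)
    return sum(1 for value in logits if value > logits[token_id])


# Toy setup: identity unembedding, so each logit equals one hidden coordinate.
unembed = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# Hand-written hidden states standing in for successive recurrence steps.
trajectory = [[0.9, 0.1, 0.5], [0.4, 0.5, 0.6], [0.1, 0.9, 0.2]]

# Rank of token 1 at each step.
ranks = [token_rank(h, unembed, 1) for h in trajectory]
print(ranks)
```

A rank that falls toward 0 over the steps is the kind of signal such probing looks for; in a real model the hidden states would come from the recurrent block rather than being written by hand.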

Title: GDC Cohort Copilot: An AI Copilot for Curating Cohorts from the Genomic Data Commons

Authors: Steven Song, Anirudh Subramanyam, Zhenyu Zhang, Aarti Venkat, Robert L. Grossman
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.02221
Pdf URL: https://arxiv.org/pdf/2507.02221
Copy Paste: [[2507.02221]] GDC Cohort Copilot: An AI Copilot for Curating Cohorts from the Genomic Data Commons(https://arxiv.org/abs/2507.02221)
Keywords: language model, gpt, llm, prompt
Abstract: Motivation: The Genomic Data Commons (GDC) provides access to high quality, harmonized cancer genomics data through a unified curation and analysis platform centered around patient cohorts. While GDC users can interactively create complex cohorts through the graphical Cohort Builder, users (especially new ones) may struggle to find specific cohort descriptors across hundreds of possible fields and properties. However, users may be better able to describe their desired cohort in free-text natural language. Results: We introduce GDC Cohort Copilot, an open-source copilot tool for curating cohorts from the GDC. GDC Cohort Copilot automatically generates the GDC cohort filter corresponding to a user-input natural language description of their desired cohort, before exporting the cohort back to the GDC for further analysis. An interactive user interface allows users to further refine the generated cohort. We develop and evaluate multiple large language models (LLMs) for GDC Cohort Copilot and demonstrate that our locally-served, open-source GDC Cohort LLM achieves better results than GPT-4o prompting in generating GDC cohorts. Availability and implementation: The standalone docker image for GDC Cohort Copilot is available at this https URL. Source code is available at this https URL. GDC Cohort LLM weights are available at this https URL.

Title: MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent

Authors: Hongli Yu, Tinghong Chen, Jiangtao Feng, Jiangjie Chen, Weinan Dai, Qiying Yu, Ya-Qin Zhang, Wei-Ying Ma, Jingjing Liu, Mingxuan Wang, Hao Zhou
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2507.02259
Pdf URL: https://arxiv.org/pdf/2507.02259
Copy Paste: [[2507.02259]] MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent(https://arxiv.org/abs/2507.02259)
Keywords: llm, agent
Abstract: Despite improvements by length extrapolation, efficient attention and memory modules, handling infinitely long documents with linear complexity without performance degradation during extrapolation remains the ultimate challenge in long-text processing. We directly optimize for long-text tasks in an end-to-end fashion and introduce a novel agent workflow, MemAgent, which reads text in segments and updates the memory using an overwrite strategy. We extend the DAPO algorithm to facilitate training via independent-context multi-conversation generation. MemAgent has demonstrated superb long-context capabilities, being able to extrapolate from an 8K context trained on 32K text to a 3.5M QA task with performance loss < 5% and achieves 95%+ in 512K RULER test.
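
The segment-wise reading with an overwritten memory described in the abstract can be sketched as a simple loop. This is a minimal sketch under stated assumptions: `toy_update` is a hypothetical stand-in for the model call that rewrites the memory, and `MAX_FACTS`, the `key=value` token convention, and the segment size are invented for the example. It does not reflect the RL training or the DAPO extension.

```python
MAX_FACTS = 2  # assumed fixed memory budget, counted in facts


def toy_update(memory, segment):
    """Hypothetical stand-in for the model: rewrite the memory from scratch.

    Keeps only 'key=value' tokens from the old memory and the new segment,
    truncated to the most recent MAX_FACTS entries.
    """
    facts = [tok for tok in memory.split() + segment if "=" in tok]
    return " ".join(facts[-MAX_FACTS:])


def read_with_memory(tokens, segment_size, update):
    """Read tokens segment by segment, overwriting the memory each time."""
    memory = ""
    for start in range(0, len(tokens), segment_size):
        segment = tokens[start:start + segment_size]
        memory = update(memory, segment)  # overwrite, never append
    return memory


document = "noise a=1 noise noise b=2 noise c=3 noise".split()
final_memory = read_with_memory(document, 3, toy_update)
print(final_memory)
```

Because the memory has a fixed budget and each segment is visited once, the work grows linearly with document length, which is the property the abstract emphasizes.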

Title: DoMIX: An Efficient Framework for Exploiting Domain Knowledge in Fine-Tuning

Authors: Dohoon Kim, Donghun Kang, Taesup Moon
Subjects: cs.CL, cs.AI, cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2507.02302
Pdf URL: https://arxiv.org/pdf/2507.02302
Copy Paste: [[2507.02302]] DoMIX: An Efficient Framework for Exploiting Domain Knowledge in Fine-Tuning(https://arxiv.org/abs/2507.02302)
Keywords: llm
Abstract: Domain-Adaptive Pre-training (DAP) has recently gained attention for its effectiveness in fine-tuning pre-trained models. Building on this, continual DAP has been explored to develop pre-trained models capable of incrementally incorporating different domain datasets. However, existing continual DAP methods face several limitations: (1) high computational cost and GPU memory usage during training; (2) sensitivity to incremental data order; and (3) providing a single, generalized model for all end tasks, which contradicts the essence of DAP. In this paper, we propose DoMIX, a novel approach that addresses these challenges by leveraging LoRA modules, a representative parameter-efficient fine-tuning (PEFT) method. Our approach enables efficient and parallel domain-adaptive pre-training that is robust to domain order and effectively utilizes accumulated knowledge to provide tailored pre-trained models for specific tasks. We also demonstrate that our method can be extended beyond the DAP setting to standard LLM fine-tuning scenarios. Code is available at this https URL.

Title: Coling-UniA at SciVQA 2025: Few-Shot Example Retrieval and Confidence-Informed Ensembling for Multimodal Large Language Models

Authors: Christian Jaumann, Annemarie Friedrich, Rainer Lienhart
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.02357
Pdf URL: https://arxiv.org/pdf/2507.02357
Copy Paste: [[2507.02357]] Coling-UniA at SciVQA 2025: Few-Shot Example Retrieval and Confidence-Informed Ensembling for Multimodal Large Language Models(https://arxiv.org/abs/2507.02357)
Keywords: language model
Abstract: This paper describes our system for the SciVQA 2025 Shared Task on Scientific Visual Question Answering. Our system employs an ensemble of two Multimodal Large Language Models and various few-shot example retrieval strategies. The model and few-shot setting are selected based on the figure and question type. We also select answers based on the models' confidence levels. On the blind test data, our system ranks third out of seven with an average F1 score of 85.12 across ROUGE-1, ROUGE-L, and BERTS. Our code is publicly available.
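
Selecting between two models' answers by confidence can be sketched as follows. The abstract does not say how confidence is computed, so this example assumes one common proxy, the geometric-mean token probability of the generated answer; the function names and the log-probability values are hypothetical.

```python
import math


def sequence_confidence(token_logprobs):
    """Geometric-mean token probability of an answer (an assumed proxy)."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def pick_answer(candidates):
    """candidates: list of (answer, token_logprobs); return the most confident answer."""
    return max(candidates, key=lambda c: sequence_confidence(c[1]))[0]


candidates = [
    ("42", [-0.05, -0.10]),  # hypothetical output of model A
    ("41", [-0.70, -0.90]),  # hypothetical output of model B
]
chosen = pick_answer(candidates)
print(chosen)
```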

Title: Efficient Code LLM Training via Distribution-Consistent and Diversity-Aware Data Selection

Authors: Weijie Lyu, Sheng-Jun Huang, Xuan Xia
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.02378
Pdf URL: https://arxiv.org/pdf/2507.02378
Copy Paste: [[2507.02378]] Efficient Code LLM Training via Distribution-Consistent and Diversity-Aware Data Selection(https://arxiv.org/abs/2507.02378)
Keywords: language model, llm
Abstract: Recent advancements in large language models (LLMs) have significantly improved code generation and program comprehension, accelerating the evolution of software engineering. Current methods primarily enhance model performance by leveraging vast amounts of data, focusing on data quantity while often overlooking data quality, thereby reducing training efficiency. To address this, we introduce an approach that utilizes a parametric model for code data selection, aimed at improving both training efficiency and model performance. Our method optimizes the parametric model to ensure distribution consistency and diversity within the selected subset, guaranteeing high-quality data. Experimental results demonstrate that, using only 10K samples, our method achieves gains of 2.4% (HumanEval) and 2.3% (MBPP) over the 92K full-sampled baseline, outperforming other sampling approaches in both performance and efficiency. This underscores that our method effectively boosts model performance while significantly reducing computational costs.

Title: IndianBailJudgments-1200: A Multi-Attribute Dataset for Legal NLP on Indian Bail Orders

Authors: Sneha Deshmukh, Prathmesh Kamble
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2507.02506
Pdf URL: https://arxiv.org/pdf/2507.02506
Copy Paste: [[2507.02506]] IndianBailJudgments-1200: A Multi-Attribute Dataset for Legal NLP on Indian Bail Orders(https://arxiv.org/abs/2507.02506)
Keywords: gpt, prompt
Abstract: Legal NLP remains underdeveloped in regions like India due to the scarcity of structured datasets. We introduce IndianBailJudgments-1200, a new benchmark dataset comprising 1200 Indian court judgments on bail decisions, annotated across 20+ attributes including bail outcome, IPC sections, crime type, and legal reasoning. Annotations were generated using a prompt-engineered GPT-4o pipeline and verified for consistency. This resource supports a wide range of legal NLP tasks such as outcome prediction, summarization, and fairness analysis, and is the first publicly available dataset focused specifically on Indian bail jurisprudence.

Title: WebSailor: Navigating Super-human Reasoning for Web Agent

Authors: Kuan Li, Zhongwang Zhang, Huifeng Yin, Liwen Zhang, Litu Ou, Jialong Wu, Wenbiao Yin, Baixuan Li, Zhengwei Tao, Xinyu Wang, Weizhou Shen, Junkai Zhang, Dingchu Zhang, Xixi Wu, Yong Jiang, Ming Yan, Pengjun Xie, Fei Huang, Jingren Zhou
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.02592
Pdf URL: https://arxiv.org/pdf/2507.02592
Copy Paste: [[2507.02592]] WebSailor: Navigating Super-human Reasoning for Web Agent(https://arxiv.org/abs/2507.02592)
Keywords: llm, agent
Abstract: Transcending human cognitive limitations represents a critical frontier in LLM training. Proprietary agentic systems like DeepResearch have demonstrated superhuman capabilities on extremely complex information-seeking benchmarks such as BrowseComp, a feat previously unattainable. We posit that their success hinges on a sophisticated reasoning pattern absent in open-source models: the ability to systematically reduce extreme uncertainty when navigating vast information landscapes. Based on this insight, we introduce WebSailor, a complete post-training methodology designed to instill this crucial capability. Our approach involves generating novel, high-uncertainty tasks through structured sampling and information obfuscation, RFT cold start, and an efficient agentic RL training algorithm, Duplicating Sampling Policy Optimization (DUPO). With this integrated pipeline, WebSailor significantly outperforms all open-source agents in complex information-seeking tasks, matching proprietary agents' performance and closing the capability gap.

Title: Revisiting Active Learning under (Human) Label Variation

Authors: Cornelia Gruber, Helen Alber, Bernd Bischl, Göran Kauermann, Barbara Plank, Matthias Aßenmacher
Subjects: cs.CL, cs.HC, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2507.02593
Pdf URL: https://arxiv.org/pdf/2507.02593
Copy Paste: [[2507.02593]] Revisiting Active Learning under (Human) Label Variation(https://arxiv.org/abs/2507.02593)
Keywords: language model, llm
Abstract: Access to high-quality labeled data remains a limiting factor in applied supervised learning. While label variation (LV), i.e., differing labels for the same instance, is common, especially in natural language processing, annotation frameworks often still rest on the assumption of a single ground truth. This overlooks human label variation (HLV), the occurrence of plausible differences in annotations, as an informative signal. Similarly, active learning (AL), a popular approach to optimizing the use of limited annotation budgets in training ML models, often relies on at least one of several simplifying assumptions, which rarely hold in practice when acknowledging HLV. In this paper, we examine foundational assumptions about truth and label nature, highlighting the need to decompose observed LV into signal (e.g., HLV) and noise (e.g., annotation error). We survey how the AL and (H)LV communities have addressed -- or neglected -- these distinctions and propose a conceptual framework for incorporating HLV throughout the AL loop, including instance selection, annotator choice, and label representation. We further discuss the integration of large language models (LLM) as annotators. Our work aims to lay a conceptual foundation for HLV-aware active learning, better reflecting the complexities of real-world annotation.

Title: MPF: Aligning and Debiasing Language Models post Deployment via Multi Perspective Fusion

Authors: Xin Guan, PeiHsin Lin, Zekun Wu, Ze Wang, Ruibo Zhang, Emre Kazim, Adriano Koshiyama
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.02595
Pdf URL: https://arxiv.org/pdf/2507.02595
Copy Paste: [[2507.02595]] MPF: Aligning and Debiasing Language Models post Deployment via Multi Perspective Fusion(https://arxiv.org/abs/2507.02595)
Keywords: language model, llm, prompt
Abstract: Multiperspective Fusion (MPF) is a novel post-training alignment framework for large language models (LLMs) developed in response to the growing need for easy bias mitigation. Built on top of the SAGED pipeline, an automated system for constructing bias benchmarks and extracting interpretable baseline distributions, MPF leverages multiperspective generations to expose and align biases in LLM outputs with nuanced, humanlike baselines. By decomposing a baseline, such as sentiment distributions from HR professionals, into interpretable perspective components, MPF guides generation through sampling and balancing of responses, weighted by the probabilities obtained in the decomposition. Empirically, we demonstrate its ability to align LLM sentiment distributions with both counterfactual baselines (absolute equality) and the HR baseline (biased for Top University), resulting in small KL divergence, reduction of calibration error, and generalization to unseen questions. This shows that MPF offers a scalable and interpretable method for alignment and bias mitigation, compatible with deployed LLMs and requiring no extensive prompt engineering or fine-tuning.
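
Two quantities named in the abstract, a probability-weighted combination of perspective components and the KL divergence to a baseline, can be made concrete with a small sketch. The component distributions, weights, and baseline below are invented for illustration, and the helper names `mix` and `kl_divergence` are assumptions; the abstract does not specify how MPF performs the decomposition itself.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mix(components, weights):
    """Probability-weighted mixture of component distributions."""
    size = len(components[0])
    return [sum(w * comp[i] for w, comp in zip(weights, components)) for i in range(size)]


# Hypothetical sentiment distributions (negative, positive) of two perspectives.
components = [[0.8, 0.2], [0.2, 0.8]]
weights = [0.5, 0.5]    # assumed decomposition probabilities
baseline = [0.5, 0.5]   # assumed target baseline ("absolute equality")

fused = mix(components, weights)
gap = kl_divergence(fused, baseline)
print(fused, gap)
```

A gap near zero means the fused distribution matches the baseline; a single perspective on its own, such as `[0.8, 0.2]`, would give a strictly positive divergence.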

Title: Can LLMs Identify Critical Limitations within Scientific Research? A Systematic Evaluation on AI Research Papers

Authors: Zhijian Xu, Yilun Zhao, Manasi Patwardhan, Lovekesh Vig, Arman Cohan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.02694
Pdf URL: https://arxiv.org/pdf/2507.02694
Copy Paste: [[2507.02694]] Can LLMs Identify Critical Limitations within Scientific Research? A Systematic Evaluation on AI Research Papers(https://arxiv.org/abs/2507.02694)
Keywords: llm
Abstract: Peer review is fundamental to scientific research, but the growing volume of publications has intensified the challenges of this expertise-intensive process. While LLMs show promise in various scientific tasks, their potential to assist with peer review, particularly in identifying paper limitations, remains understudied. We first present a comprehensive taxonomy of limitation types in scientific research, with a focus on AI. Guided by this taxonomy, we present LimitGen, the first comprehensive benchmark for evaluating LLMs' capability to support early-stage feedback and complement human peer review. Our benchmark consists of two subsets: LimitGen-Syn, a synthetic dataset carefully created through controlled perturbations of high-quality papers, and LimitGen-Human, a collection of real human-written limitations. To improve the ability of LLM systems to identify limitations, we augment them with literature retrieval, which is essential for grounding the identification of limitations in prior scientific findings. Our approach enhances the capabilities of LLM systems to generate limitations in research papers, enabling them to provide more concrete and constructive feedback.

Title: Self-Correction Bench: Revealing and Addressing the Self-Correction Blind Spot in LLMs

Authors: Ken Tsui
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2507.02778
Pdf URL: https://arxiv.org/pdf/2507.02778
Copy Paste: [[2507.02778]] Self-Correction Bench: Revealing and Addressing the Self-Correction Blind Spot in LLMs(https://arxiv.org/abs/2507.02778)
Keywords: language model, llm
Abstract: Although large language models (LLMs) have become transformative, they still make mistakes and can explore unproductive reasoning paths. Self-correction is an important capability for a trustworthy LLM, particularly an autoregressive LLM. While LLMs can identify errors in user input, they exhibit a systematic 'Self-Correction Blind Spot': they fail to correct identical errors in their own outputs. To study this phenomenon, we introduce Self-Correction Bench, a systematic framework that measures it through controlled error injection at three complexity levels. Testing 14 models, we find an average 64.5% blind spot rate. We find multiple lines of evidence that this limitation relates to training data composition: human training demonstrations predominantly show error-free responses rather than error-correction sequences, unlike RL-trained models that learn error correction through outcome feedback. Remarkably, simply appending "Wait" reduces blind spots by 89.3%, suggesting that the capability exists but requires activation. Our work highlights a critical limitation in current LLMs and offers potential avenues for improving their reliability and trustworthiness.

Title: Is Reasoning All You Need? Probing Bias in the Age of Reasoning Language Models

Authors: Riccardo Cantini, Nicola Gabriele, Alessio Orsino, Domenico Talia
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.02799
Pdf URL: https://arxiv.org/pdf/2507.02799
Copy Paste: [[2507.02799]] Is Reasoning All You Need? Probing Bias in the Age of Reasoning Language Models(https://arxiv.org/abs/2507.02799)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Reasoning Language Models (RLMs) have gained traction for their ability to perform complex, multi-step reasoning tasks through mechanisms such as Chain-of-Thought (CoT) prompting or fine-tuned reasoning traces. While these capabilities promise improved reliability, their impact on robustness to social biases remains unclear. In this work, we leverage the CLEAR-Bias benchmark, originally designed for Large Language Models (LLMs), to investigate the adversarial robustness of RLMs to bias elicitation. We systematically evaluate state-of-the-art RLMs across diverse sociocultural dimensions, using an LLM-as-a-judge approach for automated safety scoring and leveraging jailbreak techniques to assess the strength of built-in safety mechanisms. Our evaluation addresses three key questions: (i) how the introduction of reasoning capabilities affects model fairness and robustness; (ii) whether models fine-tuned for reasoning exhibit greater safety than those relying on CoT prompting at inference time; and (iii) how the success rate of jailbreak attacks targeting bias elicitation varies with the reasoning mechanisms employed. Our findings reveal a nuanced relationship between reasoning capabilities and bias safety. Surprisingly, models with explicit reasoning, whether via CoT prompting or fine-tuned reasoning traces, are generally more vulnerable to bias elicitation than base models without such mechanisms, suggesting reasoning may unintentionally open new pathways for stereotype reinforcement. Reasoning-enabled models appear somewhat safer than those relying on CoT prompting, which are particularly prone to contextual reframing attacks through storytelling prompts, fictional personas, or reward-shaped instructions. These results challenge the assumption that reasoning inherently improves robustness and underscore the need for more bias-aware approaches to reasoning design.

Title: Multimodal Mathematical Reasoning with Diverse Solving Perspective

Authors: Wenhao Shi, Zhiqiang Hu, Yi Bin, Yang Yang, See-Kiong Ng, Heng Tao Shen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.02804
Pdf URL: https://arxiv.org/pdf/2507.02804
Copy Paste: [[2507.02804]] Multimodal Mathematical Reasoning with Diverse Solving Perspective(https://arxiv.org/abs/2507.02804)
Keywords: language model, llm
Abstract: Recent progress in large-scale reinforcement learning (RL) has notably enhanced the reasoning capabilities of large language models (LLMs), especially in mathematical domains. However, current multimodal LLMs (MLLMs) for mathematical reasoning often rely on one-to-one image-text pairs and single-solution supervision, overlooking the diversity of valid reasoning perspectives and internal reflections. In this work, we introduce MathV-DP, a novel dataset that captures multiple diverse solution trajectories for each image-question pair, fostering richer reasoning supervision. We further propose Qwen-VL-DP, a model built upon Qwen-VL, fine-tuned with supervised learning and enhanced via group relative policy optimization (GRPO), a rule-based RL approach that integrates correctness discrimination and diversity-aware reward functions. Our method emphasizes learning from varied reasoning perspectives and distinguishing between correct yet distinct solutions. Extensive experiments on the MathVista's minitest and Math-V benchmarks demonstrate that Qwen-VL-DP significantly outperforms prior base MLLMs in both accuracy and generative diversity, highlighting the importance of incorporating diverse perspectives and reflective reasoning in multimodal mathematical reasoning.

Title: SynapseRoute: An Auto-Route Switching Framework on Dual-State Large Language Model

Authors: Wencheng Zhang, Shiqin Qiao, Lingjie Luo, Yinfeng Li, Chuanyang Zheng, Qian Xu, Meng Li, Yong Gui, Yijun He, Jianing Qiu, Jindong Hong, Jiankai Sun
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2507.02822
Pdf URL: https://arxiv.org/pdf/2507.02822
Copy Paste: [[2507.02822]] SynapseRoute: An Auto-Route Switching Framework on Dual-State Large Language Model(https://arxiv.org/abs/2507.02822)
Keywords: language model, llm
Abstract: With the widespread adoption of large language models (LLMs) in practical applications, selecting an appropriate model requires balancing not only performance but also operational cost. The emergence of reasoning-capable models has further widened the cost gap between "thinking" (high reasoning) and "non-thinking" (fast, low-cost) modes. In this work, we reveal that approximately 58% of medical questions can be accurately answered by the non-thinking mode alone, without requiring the high-cost reasoning process. This highlights a clear dichotomy in problem complexity and suggests that dynamically routing queries to the appropriate mode based on complexity could optimize accuracy, cost-efficiency, and overall user experience. Based on this, we further propose SynapseRoute, a machine learning-based dynamic routing framework that intelligently assigns input queries to either thinking or non-thinking modes. Experimental results on several medical datasets demonstrate that SynapseRoute not only improves overall accuracy (0.8390 vs. 0.8272) compared to the thinking mode alone but also reduces inference time by 36.8% and token consumption by 39.66%. Importantly, qualitative analysis indicates that over-reasoning on simpler queries can lead to unnecessary delays and even decreased accuracy, a pitfall avoided by our adaptive routing. Finally, this work further introduces the Accuracy-Inference-Token (AIT) index to comprehensively evaluate the trade-offs among accuracy, latency, and token cost.
摘要：随着大语言模型（LLM）在实际应用中被广泛采用，选择合适的模型不仅需要权衡性能，还需要权衡运营成本。具有推理能力的模型的出现进一步拉大了“思考”（高推理）模式和“非思考”（快速、低成本）模式之间的成本差距。在这项工作中，我们发现，大约58%的医疗问题仅靠非思考模式就可以准确回答，无需高成本的推理过程。这凸显了问题复杂度上明显的二分，并表明根据复杂度将查询动态路由到合适的模式，可以优化准确性、成本效益和整体用户体验。基于此，我们进一步提出了SynapseRoute，这是一个基于机器学习的动态路由框架，可以智能地将输入查询分配给思考模式或非思考模式。在多个医疗数据集上的实验结果表明，与仅使用思考模式相比，SynapseRoute不仅提高了总体准确率（0.8390 vs. 0.8272），还将推理时间减少了36.8%，将token消耗减少了39.66%。重要的是，定性分析表明，对较简单的查询过度推理可能导致不必要的延迟，甚至降低准确率，而我们的自适应路由避免了这一陷阱。最后，这项工作还引入了Accuracy-Inference-Token（AIT）指数，以全面评估准确率、延迟和token成本之间的权衡。

Title: Generalizing Verifiable Instruction Following

Authors: Valentina Pyatkin, Saumya Malik, Victoria Graf, Hamish Ivison, Shengyi Huang, Pradeep Dasigi, Nathan Lambert, Hannaneh Hajishirzi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.02833
Pdf URL: https://arxiv.org/pdf/2507.02833
Copy Paste: [[2507.02833]] Generalizing Verifiable Instruction Following(https://arxiv.org/abs/2507.02833)
Keywords: language model, prompt, chat
Abstract: A crucial factor for successful human and AI interaction is the ability of language models or chatbots to follow human instructions precisely. A common feature of instructions is output constraints like "only answer with yes or no" or "mention the word 'abrakadabra' at least 3 times" that the user adds to craft a more useful answer. Even today's strongest models struggle with fulfilling such constraints. We find that most models strongly overfit on a small set of verifiable constraints from the benchmarks that test these abilities, a skill called precise instruction following, and are not able to generalize well to unseen output constraints. We introduce a new benchmark, IFBench, to evaluate precise instruction following generalization on 58 new, diverse, and challenging verifiable out-of-domain constraints. In addition, we perform an extensive analysis of how and on what data models can be trained to improve precise instruction following generalization. Specifically, we carefully design constraint verification modules and show that reinforcement learning with verifiable rewards (RLVR) significantly improves instruction following. In addition to IFBench, we release 29 additional new hand-annotated training constraints and verification functions, RLVR training prompts, and code.
摘要：人类与AI成功交互的一个关键因素，是语言模型或聊天机器人精确遵循人类指令的能力。指令的一个常见特征是输出约束，例如“只用是或否回答”或“至少提到3次‘abrakadabra’一词”，用户添加这些约束是为了得到更有用的答案。即使是当今最强的模型也难以满足这类约束。我们发现，大多数模型对测试这种能力（称为精确指令遵循）的基准中的一小组可验证约束严重过拟合，无法很好地泛化到未见过的输出约束。我们提出了一个新基准IFBench，用58个新的、多样的、具有挑战性的可验证域外约束来评估精确指令遵循的泛化能力。此外，我们深入分析了如何训练模型以及用哪些数据训练模型，以提升精确指令遵循的泛化能力。具体而言，我们精心设计了约束验证模块，并表明带可验证奖励的强化学习（RLVR）能显著提升指令遵循能力。除IFBench外，我们还发布了另外29个新的人工标注训练约束及其验证函数、RLVR训练提示和代码。
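Editor's sketch: the two example constraints quoted in the abstract can each be checked by a small deterministic function, which is what makes them "verifiable". The Python below is only an illustration under assumptions; the function names, the whole-word matching rule, and the fraction-satisfied reward are hypothetical and are not taken from IFBench or its released verification code.

```python
import re
from typing import Callable, Sequence


def only_yes_or_no(response: str) -> bool:
    """Check the constraint 'only answer with yes or no'.

    Assumption: trailing '.' or '!' and letter case are ignored.
    """
    return response.strip().strip(".!").lower() in {"yes", "no"}


def mentions_word_at_least(response: str, word: str, n: int) -> bool:
    """Check that `word` appears at least `n` times.

    Assumption: whole-word, case-insensitive matching.
    """
    pattern = r"\b" + re.escape(word) + r"\b"
    return len(re.findall(pattern, response, flags=re.IGNORECASE)) >= n


def fraction_satisfied(response: str,
                       checks: Sequence[Callable[[str], bool]]) -> float:
    """Hypothetical reward: the share of constraints the response satisfies."""
    if not checks:
        raise ValueError("at least one constraint check is required")
    return sum(check(response) for check in checks) / len(checks)
```

A reward of this shape could, in principle, feed an RLVR-style training loop, but the abstract does not specify how the released verification functions are combined into a reward.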

Title: LLM Hypnosis: Exploiting User Feedback for Unauthorized Knowledge Injection to All Users

Authors: Almog Hilel, Idan Shenfeld, Leshem Choshen, Jacob Andreas
Subjects: cs.CL, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2507.02850
Pdf URL: https://arxiv.org/pdf/2507.02850
Copy Paste: [[2507.02850]] LLM Hypnosis: Exploiting User Feedback for Unauthorized Knowledge Injection to All Users(https://arxiv.org/abs/2507.02850)
Keywords: language model, llm, prompt
Abstract: We describe a vulnerability in language models (LMs) trained with user feedback, whereby a single user can persistently alter LM knowledge and behavior given only the ability to provide prompts and upvote / downvote feedback on LM outputs. To implement the attack, the attacker prompts the LM to stochastically output either a "poisoned" or benign response, then upvotes the poisoned response or downvotes the benign one. When feedback signals are used in a subsequent preference tuning behavior, LMs exhibit increased probability of producing poisoned responses even in contexts without malicious prompts. We show that this attack can be used to (1) insert factual knowledge the model did not previously possess, (2) modify code generation patterns in ways that introduce exploitable security flaws, and (3) inject fake financial news. Our finding both identifies a new qualitative feature of language model preference tuning (showing that even highly restricted forms of preference data can be used to exert fine-grained control over behavior), and a new attack mechanism for LMs trained with user feedback (extending work on pretraining-time data poisoning and deployment-time prompt injection).
摘要：我们描述了使用用户反馈训练的语言模型（LM）中的一个漏洞：单个用户只需能够提供提示并对LM输出给出点赞/点踩反馈，就可以持久地改变LM的知识和行为。为了实施攻击，攻击者提示LM随机输出“中毒”响应或良性响应，然后给中毒响应点赞，或给良性响应点踩。当这些反馈信号被用于后续的偏好调优时，即使在没有恶意提示的上下文中，LM产生中毒响应的概率也会增加。我们表明，这种攻击可用于（1）插入模型先前不具备的事实知识，（2）以引入可利用安全缺陷的方式修改代码生成模式，以及（3）注入虚假财经新闻。我们的发现既揭示了语言模型偏好调优的一个新的定性特征（表明即使是高度受限的偏好数据形式也可用于对行为施加细粒度控制），也揭示了一种针对使用用户反馈训练的LM的新攻击机制（扩展了关于预训练阶段数据投毒和部署阶段提示注入的研究）。

Title: MOTIF: Modular Thinking via Reinforcement Fine-tuning in LLMs

Authors: Purbesh Mitra, Sennur Ulukus
Subjects: cs.CL, cs.AI, cs.IT, cs.LG, eess.SY
Abstract URL: https://arxiv.org/abs/2507.02851
Pdf URL: https://arxiv.org/pdf/2507.02851
Copy Paste: [[2507.02851]] MOTIF: Modular Thinking via Reinforcement Fine-tuning in LLMs(https://arxiv.org/abs/2507.02851)
Keywords: language model, llm
Abstract: Recent advancements in the reasoning capabilities of large language models (LLMs) show that employing the group relative policy optimization (GRPO) algorithm for reinforcement learning (RL) training allows the models to use more thinking/reasoning tokens for generating better responses. However, LLMs can generate only a finite number of tokens while maintaining attention to the previously generated tokens. This limit, also known as the context size of an LLM, is a bottleneck in LLM reasoning with an arbitrarily large number of tokens. To think beyond the limit of context size, an LLM must employ a modular thinking strategy to reason over multiple rounds. In this work, we propose MOTIF: Modular Thinking via Reinforcement Finetuning, an RL training method for generating thinking tokens in multiple rounds, effectively allowing the model to think with additional context size. We trained the open-source model Qwen2.5-3B-Instruct on the GSM8K dataset via parameter-efficient fine-tuning and tested its accuracy on the MATH500 and AIME2024 benchmarks. Our experiments show 3.8% and 3.3% improvements over vanilla GRPO-based training on the respective benchmarks. Furthermore, this improvement was achieved with only 15% of samples, thus demonstrating the sample efficiency of MOTIF. Our code and models are available at this https URL and this https URL, respectively.
摘要：大语言模型（LLM）推理能力的最新进展表明，在强化学习（RL）训练中使用组相对策略优化（GRPO）算法，可以让模型使用更多的思考/推理token来生成更好的响应。然而，LLM在保持对先前生成的token的注意力的同时，只能生成有限数量的token。这一限制也称为LLM的上下文大小，是LLM使用任意多token进行推理时的瓶颈。要突破上下文大小的限制进行思考，LLM必须采用模块化思考策略，在多轮中进行推理。在这项工作中，我们提出了MOTIF: Modular Thinking via Reinforcement Finetuning，这是一种RL训练方法，用于在多轮中生成思考token，从而有效地让模型以更大的上下文大小进行思考。我们通过参数高效微调在GSM8K数据集上训练了开源模型Qwen2.5-3B-Instruct，并在MATH500和AIME2024基准上测试了其准确率。我们的实验表明，相比基于原始GRPO的训练，在这两个基准上分别提升了3.8%和3.3%。此外，这一提升仅用了15%的样本，从而证明了MOTIF的样本效率。我们的代码和模型分别可在此https URL和此https URL获取。
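Editor's sketch: the "group relative" part of GRPO, as commonly described, scores each sampled response against the other responses drawn for the same prompt by standardizing rewards within the group. The Python below illustrates only that normalization step. It is not MOTIF's training code, the function name is hypothetical, and the use of the population standard deviation with a small epsilon is an assumption, since implementations differ on both points.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float],
                              eps: float = 1e-8) -> List[float]:
    """Standardize rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    Assumption: population standard deviation; eps avoids division by zero
    when every response in the group receives the same reward.
    """
    if not rewards:
        raise ValueError("the group must contain at least one reward")
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

For example, a group with rewards 1, 0, 0, 1 yields advantages close to +1, -1, -1, +1, while a group in which all responses earn the same reward yields all zeros and therefore contributes no learning signal.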

Title: Answer Matching Outperforms Multiple Choice for Language Model Evaluation

Authors: Nikhil Chandak, Shashwat Goel, Ameya Prabhu, Moritz Hardt, Jonas Geiping
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2507.02856
Pdf URL: https://arxiv.org/pdf/2507.02856
Copy Paste: [[2507.02856]] Answer Matching Outperforms Multiple Choice for Language Model Evaluation(https://arxiv.org/abs/2507.02856)
Keywords: language model, llm
Abstract: Multiple choice benchmarks have long been the workhorse of language model evaluation because grading multiple choice is objective and easy to automate. However, we show multiple choice questions from popular benchmarks can often be answered without even seeing the question. These shortcuts arise from a fundamental limitation of discriminative evaluation not shared by evaluations of the model's free-form, generative answers. Until recently, there appeared to be no viable, scalable alternative to multiple choice, but we show that this has changed. We consider generative evaluation via what we call answer matching: Give the candidate model the question without the options, have it generate a free-form response, then use a modern language model with the reference answer to determine if the response matches the reference. To compare the validity of different evaluation strategies, we annotate MMLU-Pro and GPQA-Diamond to obtain human grading data, and measure the agreement of each evaluation approach. We find answer matching using recent models, even small ones, achieves near-perfect agreement, in the range of inter-annotator agreement. In contrast, both multiple choice evaluation and using LLM-as-a-judge without reference answers align poorly with human grading. Improving evaluations via answer matching is not merely a conceptual concern: the rankings of several models change significantly when evaluating their free-form responses with answer matching. In light of these findings, we discuss how to move the evaluation ecosystem from multiple choice to answer matching.
摘要：多项选择基准长期以来一直是语言模型评估的主力，因为多项选择的评分客观且易于自动化。然而，我们表明，流行基准中的多项选择题往往不用看题目就能答对。这些捷径源于判别式评估的一个根本局限，而对模型自由形式的生成式答案进行评估则不存在这一局限。直到最近，多项选择似乎还没有可行且可扩展的替代方案，但我们表明情况已经改变。我们考虑通过我们称之为答案匹配的方式进行生成式评估：向候选模型提供不带选项的问题，让它生成自由形式的回答，然后使用一个拿到参考答案的现代语言模型来判断该回答是否与参考答案匹配。为了比较不同评估策略的有效性，我们对MMLU-Pro和GPQA-Diamond进行标注以获得人工评分数据，并衡量每种评估方法与人工评分的一致性。我们发现，使用近期模型（即使是小模型）进行答案匹配，可以达到接近完美的一致性，处于标注者间一致性的范围内。相比之下，多项选择评估和不带参考答案的LLM-as-a-judge与人工评分的一致性都很差。通过答案匹配改进评估不仅仅是一个概念上的问题：用答案匹配评估自由形式回答时，若干模型的排名会发生显著变化。鉴于这些发现，我们讨论了如何将评估生态从多项选择转向答案匹配。
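Editor's sketch: the answer-matching procedure described in the abstract has three steps: ask the question without options, collect a free-form response, and have a matcher compare that response with the reference answer. The Python below shows only the control flow. The abstract uses a language model as the matcher; the string-containment matcher here is a deliberately crude placeholder, and all names are hypothetical.

```python
from typing import Callable, Sequence


def containment_matcher(response: str, reference: str) -> bool:
    """Placeholder matcher (assumption): case- and whitespace-insensitive
    containment. It stands in for the language-model matcher in the abstract.
    """
    def normalize(text: str) -> str:
        return " ".join(text.lower().split())

    return normalize(reference) in normalize(response)


def answer_matching_accuracy(
    questions: Sequence[str],
    references: Sequence[str],
    candidate: Callable[[str], str],
    matcher: Callable[[str, str], bool] = containment_matcher,
) -> float:
    """Share of questions whose free-form response matches the reference."""
    if not questions or len(questions) != len(references):
        raise ValueError("need equally many questions and references (> 0)")
    correct = 0
    for question, reference in zip(questions, references):
        response = candidate(question)  # the candidate never sees options
        correct += int(matcher(response, reference))
    return correct / len(questions)
```

In practice the `matcher` argument would wrap a call to a language model that is given the question, the reference answer, and the response; that call is left out here because the abstract does not specify the model or the prompt.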