2025-02-26

Title: Towards Conditioning Clinical Text Generation for User Control

Authors: Osman Alperen Koraş, Rabi Bahnan, Jens Kleesiek, Amin Dada
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.17571
Pdf URL: https://arxiv.org/pdf/2502.17571
Copy Paste: [[2502.17571]] Towards Conditioning Clinical Text Generation for User Control(https://arxiv.org/abs/2502.17571)
Keywords: language model, llm, hallucination
Abstract: Deploying natural language generation systems in clinical settings remains challenging despite advances in Large Language Models (LLMs), which continue to exhibit hallucinations and factual inconsistencies, necessitating human oversight. This paper explores automated dataset augmentation using LLMs as human proxies to condition LLMs for clinician control without increasing cognitive workload. On the BioNLP ACL'24 Discharge Me! Shared Task, we achieve new state-of-the-art results with simpler methods than prior submissions through more efficient training, yielding a 9% relative improvement without augmented training and up to 34% with dataset augmentation. Preliminary human evaluation further supports the effectiveness of our approach, highlighting the potential of augmenting clinical text generation for control to enhance relevance, accuracy, and factual consistency.

Title: End-to-End Chart Summarization via Visual Chain-of-Thought in Vision-Language Models

Authors: Raymond Choi, Frank Burns, Chase Lawrence
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.17589
Pdf URL: https://arxiv.org/pdf/2502.17589
Copy Paste: [[2502.17589]] End-to-End Chart Summarization via Visual Chain-of-Thought in Vision-Language Models(https://arxiv.org/abs/2502.17589)
Keywords: language model, chain-of-thought
Abstract: Automated chart summarization is crucial for enhancing data accessibility and enabling efficient information extraction from visual data. While recent advances in visual-language models (VLMs) have demonstrated promise, existing methods often suffer from limitations in matching the generated summary to the chart data and in reasoning about complex chart patterns. This paper introduces End-to-End Visual Chain-of-Thought (V-CoT) for chart summarization, a novel approach optimized for Large Vision-Language Models (LVLMs). Our method directly trains an LVLM to process chart images and generate textual summaries in an end-to-end fashion, eliminating the need for explicit chart parsing modules. We incorporate a visual Chain-of-Thought mechanism through instruction fine-tuning, implicitly guiding the LVLM to perform visual reasoning steps during summary generation. Evaluated on the large-scale Chart-Sum-QA dataset, our V-CoT method significantly outperforms state-of-the-art baselines across a range of automatic metrics, including BLEU, BLEURT, CIDEr, and CS, and demonstrates superior matching degree and reasoning correctness in human evaluations. Ablation studies and detailed analyses further validate the effectiveness and robustness of our proposed approach, establishing a new benchmark for end-to-end chart summarization.

Title: Proactive Privacy Amnesia for Large Language Models: Safeguarding PII with Negligible Impact on Model Utility

Authors: Martin Kuo, Jingyang Zhang, Jianyi Zhang, Minxue Tang, Louis DiValentin, Aolin Ding, Jingwei Sun, William Chen, Amin Hass, Tianlong Chen, Yiran Chen, Hai Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.17591
Pdf URL: https://arxiv.org/pdf/2502.17591
Copy Paste: [[2502.17591]] Proactive Privacy Amnesia for Large Language Models: Safeguarding PII with Negligible Impact on Model Utility(https://arxiv.org/abs/2502.17591)
Keywords: language model, llm
Abstract: With the rise of large language models (LLMs), increasing research has recognized their risk of leaking personally identifiable information (PII) under malicious attacks. Although efforts have been made to protect PII in LLMs, existing methods struggle to balance privacy protection with maintaining model utility. In this paper, inspired by studies of amnesia in cognitive science, we propose a novel approach, Proactive Privacy Amnesia (PPA), to safeguard PII in LLMs while preserving their utility. This mechanism works by actively identifying and forgetting key memories most closely associated with PII in sequences, followed by memory implanting using suitable substitute memories to maintain the LLM's functionality. We conduct evaluations across multiple models to protect common PII, such as phone numbers and physical addresses, against prevalent PII-targeted attacks, demonstrating the superiority of our method compared with other existing defensive techniques. The results show that our PPA method eliminates the risk of phone number exposure entirely (a 100% reduction) and significantly reduces the risk of physical address exposure by 9.8% to 87.6%, all while maintaining comparable model utility performance.

Title: MEDA: Dynamic KV Cache Allocation for Efficient Multimodal Long-Context Inference

Authors: Zhongwei Wan, Hui Shen, Xin Wang, Che Liu, Zheda Mai, Mi Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.17599
Pdf URL: https://arxiv.org/pdf/2502.17599
Copy Paste: [[2502.17599]] MEDA: Dynamic KV Cache Allocation for Efficient Multimodal Long-Context Inference(https://arxiv.org/abs/2502.17599)
Keywords: language model, llm
Abstract: Long-context Multimodal Large Language Models (MLLMs) that incorporate long text-image and text-video modalities demand substantial resources as their multimodal Key-Value (KV) caches grow with increasing input lengths, challenging inference efficiency. Existing methods for KV cache compression, in both text-only and multimodal LLMs, have neglected attention density variations across layers, thus often adopting uniform or progressive reduction strategies for layer-wise cache allocation. In this work, we propose MEDA, a dynamic layer-wise KV cache allocation method for efficient multimodal long-context inference. At its core, MEDA utilizes cross-modal attention entropy to determine the KV cache size at each MLLM layer. Given the dynamically allocated KV cache size at each layer, MEDA also employs a KV pair selection scheme to identify which KV pairs to select and a KV pair merging strategy that merges the selected and non-selected ones to preserve information from the entire context. MEDA achieves up to 72% KV cache memory reduction and 2.82 times faster decoding speed, while maintaining or enhancing performance on various multimodal tasks in long-context settings, including multi-images and long-video scenarios. Our code is released at this https URL.
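Note: The idea of sizing each layer's KV cache from an attention-entropy signal can be illustrated with a small sketch. The abstract does not state how entropy is mapped to a cache size, so the proportional rule below, the direction of the mapping (higher entropy gets more cache), and the names `entropy` and `allocate_kv_budget` are all assumptions made for illustration; this is not MEDA's actual allocation rule.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one attention distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def allocate_kv_budget(layer_attention, total_budget):
    """Split a total KV-cache budget across layers in proportion to entropy.

    layer_attention: one attention distribution per layer (hypothetical input).
    Assumption: layers with more diffuse attention keep more KV pairs.
    Because of rounding, the sizes may not sum exactly to total_budget.
    """
    entropies = [entropy(p) for p in layer_attention]
    total = sum(entropies)
    if total == 0:
        # Degenerate case: every layer attends to a single token.
        shares = [1.0 / len(entropies)] * len(entropies)
    else:
        shares = [e / total for e in entropies]
    # Keep at least one KV pair per layer.
    return [max(1, round(total_budget * s)) for s in shares]

# A layer with uniform attention versus a layer with sharply peaked attention.
layers = [[0.25, 0.25, 0.25, 0.25], [0.97, 0.01, 0.01, 0.01]]
sizes = allocate_kv_budget(layers, total_budget=100)
```

Under these assumptions the diffuse layer receives most of the budget, while the peaked layer, whose attention mass sits on one token, receives a much smaller cache.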

Title: PICASO: Permutation-Invariant Context Composition with State Space Models

Authors: Tian Yu Liu, Alessandro Achille, Matthew Trager, Aditya Golatkar, Luca Zancato, Stefano Soatto
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.17605
Pdf URL: https://arxiv.org/pdf/2502.17605
Copy Paste: [[2502.17605]] PICASO: Permutation-Invariant Context Composition with State Space Models(https://arxiv.org/abs/2502.17605)
Keywords: language model
Abstract: Providing Large Language Models with relevant contextual knowledge at inference time has been shown to greatly improve the quality of their generations. This is often achieved by prepending informative passages of text, or 'contexts', retrieved from external knowledge bases to their input. However, processing additional contexts online incurs significant computation costs that scale with their length. State Space Models (SSMs) offer a promising solution by allowing a database of contexts to be mapped onto fixed-dimensional states from which to start the generation. A key challenge arises when attempting to leverage information present across multiple contexts, since there is no straightforward way to condition generation on multiple independent states in existing SSMs. To address this, we leverage a simple mathematical relation derived from SSM dynamics to compose multiple states into one that efficiently approximates the effect of concatenating textual contexts. Since the temporal ordering of contexts can often be uninformative, we enforce permutation-invariance by efficiently averaging states obtained via our composition algorithm across all possible context orderings. We evaluate our resulting method on WikiText and MSMARCO in both zero-shot and fine-tuned settings, and show that we can match the strongest performing baseline while enjoying on average 5.4x speedup.
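Note: The state-composition idea can be illustrated on a toy model. The sketch below assumes a scalar, time-invariant linear recurrence h_t = A*h_{t-1} + B*u_t with hypothetical parameters `A` and `B`; for such a recurrence, the state after concatenating two contexts equals A^(length of second) times the first state plus the second state, exactly. The SSMs evaluated in the paper are presumably more complex, and the brute-force average over all orderings shown here has factorial cost, whereas the abstract describes an efficient averaging procedure that is not reproduced here.

```python
from itertools import permutations

A, B = 0.9, 0.5  # hypothetical scalar SSM parameters, for illustration only

def run(seq, h=0.0):
    """Process a context token by token and return the final state."""
    for u in seq:
        h = A * h + B * u
    return h

def compose(state_first, state_second, len_second):
    """State of (first context + second context), built from cached states."""
    return (A ** len_second) * state_first + state_second

def permutation_invariant_state(contexts):
    """Average the composed state over every ordering of the contexts."""
    cached = [(run(c), len(c)) for c in contexts]  # computed once per context
    total, count = 0.0, 0
    for order in permutations(cached):
        h = 0.0
        for state, length in order:
            h = compose(h, state, length)
        total += h
        count += 1
    return total / count

c1, c2 = [1.0, 2.0], [3.0]
composed = compose(run(c1), run(c2), len(c2))
concatenated = run(c1 + c2)
```

In this toy setting `composed` matches `concatenated` up to floating-point error, and `permutation_invariant_state` returns the same value regardless of the order in which the contexts are supplied.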

Title: Evaluating the Effect of Retrieval Augmentation on Social Biases

Authors: Tianhui Zhang, Yi Zhou, Danushka Bollegala
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.17611
Pdf URL: https://arxiv.org/pdf/2502.17611
Copy Paste: [[2502.17611]] Evaluating the Effect of Retrieval Augmentation on Social Biases(https://arxiv.org/abs/2502.17611)
Keywords: language model, llm, retrieval augmented generation
Abstract: Retrieval Augmented Generation (RAG) has gained popularity as a method for conveniently incorporating novel facts that were not seen during the pre-training stage in Large Language Model (LLM)-based Natural Language Generation (NLG) systems. However, LLMs are known to encode significant levels of unfair social biases. The modulation of these biases by RAG in NLG systems is not well understood. In this paper, we systematically study the relationship between the different components of a RAG system and the social biases presented in the text generated across three languages (i.e. English, Japanese and Chinese) and four social bias types (i.e. gender, race, age and religion). Specifically, using the Bias Question Answering (BBQ) benchmark datasets, we evaluate the social biases in RAG responses from document collections with varying levels of stereotypical biases, employing multiple LLMs used as generators. We find that the biases in document collections are often amplified in the generated responses, even when the generating LLM exhibits a low level of bias. Our findings raise concerns about the use of RAG as a technique for injecting novel facts into NLG systems and call for careful evaluation of potential social biases in RAG applications before their real-world deployment.

Title: Towards Typologically Aware Rescoring to Mitigate Unfaithfulness in Lower-Resource Languages

Authors: Tsan Tsai Chan, Xin Tong, Thi Thu Uyen Hoang, Barbare Tepnadze, Wojciech Stempniak
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.17664
Pdf URL: https://arxiv.org/pdf/2502.17664
Copy Paste: [[2502.17664]] Towards Typologically Aware Rescoring to Mitigate Unfaithfulness in Lower-Resource Languages(https://arxiv.org/abs/2502.17664)
Keywords: language model, llm
Abstract: Multilingual large language models (LLMs) are known to more frequently generate non-faithful output in resource-constrained languages (Guerreiro et al., 2023 - arXiv:2303.16104), potentially because these typologically diverse languages are underrepresented in their training data. To mitigate unfaithfulness in such settings, we propose using computationally light auxiliary models to rescore the outputs of larger architectures. As proof of the feasibility of such an approach, we show that monolingual 4-layer BERT models pretrained from scratch on less than 700 MB of data without fine-tuning are able to identify faithful summaries with a mean accuracy of 88.33% in three genetically unrelated languages that differ in their morphological complexity - Vietnamese, Polish and Georgian. The same hyperparameter combination moreover generalises well to three other tasks, suggesting applications for rescoring beyond improving faithfulness. In order to inform typologically aware model selection, we also investigate how morphological complexity interacts with regularisation, model depth and training objectives, ultimately demonstrating that morphologically complex languages are more likely to benefit from dropout, while across languages downstream performance is enhanced most by shallow architectures as well as training using the standard BERT objectives.

Title: Towards Human Cognition: Visual Context Guides Syntactic Priming in Fusion-Encoded Models

Authors: Bushi Xiao, Michael Bennie, Jayetri Bardhan, Daisy Zhe Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.17669
Pdf URL: https://arxiv.org/pdf/2502.17669
Copy Paste: [[2502.17669]] Towards Human Cognition: Visual Context Guides Syntactic Priming in Fusion-Encoded Models(https://arxiv.org/abs/2502.17669)
Keywords: language model
Abstract: We introduced PRISMATIC, the first multimodal structural priming dataset, and proposed a reference-free evaluation metric that assesses priming effects without predefined target sentences. Using this metric, we constructed and tested models with different multimodal encoding architectures (dual encoder and fusion encoder) to investigate their structural preservation capabilities. Our findings show that models with both encoding methods demonstrate comparable syntactic priming effects. However, only fusion-encoded models exhibit robust positive correlations between priming effects and visual similarity, suggesting a cognitive process more aligned with human psycholinguistic patterns. This work provides new insights into evaluating and understanding how syntactic information is processed in multimodal language models.

Title: Bridging Information Gaps with Comprehensive Answers: Improving the Diversity and Informativeness of Follow-Up Questions

Authors: Zhe Liu, Taekyu Kang, Haoyu Wang, Seyed Hossein Alavi, Vered Shwartz
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2502.17715
Pdf URL: https://arxiv.org/pdf/2502.17715
Copy Paste: [[2502.17715]] Bridging Information Gaps with Comprehensive Answers: Improving the Diversity and Informativeness of Follow-Up Questions(https://arxiv.org/abs/2502.17715)
Keywords: language model, llm
Abstract: Effective conversational systems are expected to dynamically generate contextual follow-up questions to elicit new information while maintaining the conversation flow. While humans excel at asking diverse and informative questions by intuitively assessing both obtained and missing information, existing models often fall short of human performance on this task. To mitigate this, we propose a method that generates diverse and informative questions based on targeting unanswered information using a hypothetical LLM-generated "comprehensive answer". Our method is applied to augment an existing follow-up questions dataset. The experimental results demonstrate that language models fine-tuned on the augmented datasets produce follow-up questions of significantly higher quality and diversity. This promising approach could be effectively adopted in future work to augment information-seeking dialogues for reducing ambiguities and improving the accuracy of LLM answers.

Title: Knowledge Distillation with Training Wheels

Authors: Guanlin Liu, Anand Ramachandran, Tanmay Gangwani, Yan Fu, Abhinav Sethy
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2502.17717
Pdf URL: https://arxiv.org/pdf/2502.17717
Copy Paste: [[2502.17717]] Knowledge Distillation with Training Wheels(https://arxiv.org/abs/2502.17717)
Keywords: language model
Abstract: Knowledge distillation is used, in generative language modeling, to train a smaller student model with the help of a larger teacher model, resulting in improved capabilities for the student model. In this paper, we formulate a more general framework for knowledge distillation where the student learns from the teacher during training, and also learns to ask for the teacher's help at test-time following rules specifying test-time restrictions. Towards this, we first formulate knowledge distillation as an entropy-regularized value optimization problem. Adopting Path Consistency Learning to solve this leads to a new knowledge distillation algorithm using on-policy and off-policy demonstrations. We extend this using constrained reinforcement learning to a framework that incorporates the use of the teacher model as a test-time reference, within constraints. In this situation, akin to a human learner, the model needs to learn not only the learning material, but also the relative difficulty of different sections to prioritize for seeking teacher help. We examine the efficacy of our method through experiments in translation and summarization tasks, observing trends in accuracy and teacher use, noting that our approach unlocks operating points not available to the popular Speculative Decoding approach.

Title: Spontaneous Giving and Calculated Greed in Language Models

Authors: Yuxuan Li, Hirokazu Shirado
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.17720
Pdf URL: https://arxiv.org/pdf/2502.17720
Copy Paste: [[2502.17720]] Spontaneous Giving and Calculated Greed in Language Models(https://arxiv.org/abs/2502.17720)
Keywords: language model, chain-of-thought
Abstract: Large language models, when trained with reinforcement learning, demonstrate advanced problem-solving capabilities through reasoning techniques like chain-of-thought and reflection. However, it is unclear how these reasoning capabilities extend to social intelligence. In this study, we investigate how reasoning influences model outcomes in social dilemmas. First, we examine the effects of chain-of-thought and reflection techniques in a public goods game. We then extend our analysis to six economic games on cooperation and punishment, comparing off-the-shelf non-reasoning and reasoning models. We find that reasoning models reduce cooperation and norm enforcement, prioritizing individual rationality. Consequently, groups with more reasoning models exhibit less cooperation and lower gains through repeated interactions. These behaviors parallel human tendencies of "spontaneous giving and calculated greed." Our results suggest the need for AI architectures that incorporate social intelligence alongside reasoning capabilities to ensure that AI supports, rather than disrupts, human cooperative intuition.

Title: LLM Inference Acceleration via Efficient Operation Fusion

Authors: Mahsa Salmani, Ilya Soloveychik
Subjects: cs.CL, cs.AI, cs.AR
Abstract URL: https://arxiv.org/abs/2502.17728
Pdf URL: https://arxiv.org/pdf/2502.17728
Copy Paste: [[2502.17728]] LLM Inference Acceleration via Efficient Operation Fusion(https://arxiv.org/abs/2502.17728)
Keywords: language model, llm
Abstract: The rapid development of the Transformer-based Large Language Models (LLMs) in recent years has been closely linked to their ever-growing and already enormous sizes. Many LLMs contain hundreds of billions of parameters and require dedicated hardware resources for training and inference. One of the key challenges inherent to the Transformer architecture is the requirement to support numerous non-linear transformations that involve normalization. For instance, each decoder block typically contains at least one Softmax operation and two Layernorms. The computation of the corresponding normalization scaling factors becomes a major bottleneck as it requires spatial collective operations. In other words, when it comes to the computation of denominators for Softmax and Layernorm, all vector elements must be aggregated into a single location, requiring significant communication. These collective operations slow down inference on Transformers by approximately 20%, defeating the whole purpose of distributed in-memory compute. In this work, we propose an extremely efficient technique that can completely hide the overhead caused by such collective operations. Note that each Softmax and Layernorm operation is typically followed by a linear layer. Since non-linear and linear operations are performed on different hardware engines, they can be easily parallelized once the algebra allows such commutation. By leveraging the inherent properties of linear operations, we can defer the normalization of the preceding Softmax and Layernorm until after the linear layer is computed. Now we can compute the collective scaling factors concurrently with the matrix multiplication and completely hide the latency of the former behind the latter. Such parallelization preserves the numerical accuracy while significantly improving the hardware utilization and reducing the overall latency.
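Note: The algebra behind deferring normalization can be checked with a small sketch for the Softmax case. Since the Softmax denominator is a scalar, dividing by it commutes with the following linear layer: (exp(x) / sum(exp(x))) W equals (exp(x) W) / sum(exp(x)). The function names and the list-of-rows weight layout below are assumptions for illustration; the sketch runs sequentially in plain Python and only shows that the two orderings agree, not the hardware-level overlap described in the abstract. The max subtraction is a standard numerical-stability step and is itself a reduction that this sketch does not try to hide.

```python
import math

def softmax_then_linear(scores, weight):
    """Baseline: normalize first, then apply the linear layer."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    denom = sum(exps)  # reduction over all elements
    probs = [e / denom for e in exps]
    # weight is a list of rows; zip(*weight) iterates over its columns.
    return [sum(p * w for p, w in zip(probs, col)) for col in zip(*weight)]

def linear_then_normalize(scores, weight):
    """Deferred: multiply unnormalized values, divide by the sum afterwards."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    # The matrix product and the denominator do not depend on each other,
    # so in principle they could be computed concurrently.
    unnormalized = [sum(e * w for e, w in zip(exps, col)) for col in zip(*weight)]
    denom = sum(exps)
    return [u / denom for u in unnormalized]

scores = [0.5, -1.0, 2.0]
weight = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # 3 inputs, 2 outputs
baseline = softmax_then_linear(scores, weight)
deferred = linear_then_normalize(scores, weight)
```

Both orderings produce the same output vector up to floating-point rounding, which is the property that allows the scaling factor to be computed off the critical path.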

Title: FoREST: Frame of Reference Evaluation in Spatial Reasoning Tasks

Authors: Tanawan Premsri, Parisa Kordjamshidi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.17775
Pdf URL: https://arxiv.org/pdf/2502.17775
Copy Paste: [[2502.17775]] FoREST: Frame of Reference Evaluation in Spatial Reasoning Tasks(https://arxiv.org/abs/2502.17775)
Keywords: language model, llm, prompt
Abstract: Spatial reasoning is a fundamental aspect of human intelligence. One key concept in spatial cognition is the Frame of Reference (FoR), which identifies the perspective of spatial expressions. Despite its significance, FoR has received limited attention in AI models that need spatial intelligence. There is a lack of dedicated benchmarks and in-depth evaluation of large language models (LLMs) in this area. To address this issue, we introduce the Frame of Reference Evaluation in Spatial Reasoning Tasks (FoREST) benchmark, designed to assess FoR comprehension in LLMs. We evaluate LLMs on answering questions that require FoR comprehension and layout generation in text-to-image models using FoREST. Our results reveal a notable performance gap across different FoR classes in various LLMs, affecting their ability to generate accurate layouts for text-to-image generation. This highlights critical shortcomings in FoR comprehension. To improve FoR understanding, we propose Spatial-Guided prompting, which improves LLMs' ability to extract essential spatial concepts. Our proposed method improves overall performance across spatial reasoning tasks.

Title: Exploring the Potential of Large Language Models for Estimating the Reading Comprehension Question Difficulty

Authors: Yoshee Jain, John Hollander, Amber He, Sunny Tang, Liang Zhang, John Sabatini
Subjects: cs.CL, cs.HC
Abstract URL: https://arxiv.org/abs/2502.17785
Pdf URL: https://arxiv.org/pdf/2502.17785
Copy Paste: [[2502.17785]] Exploring the Potential of Large Language Models for Estimating the Reading Comprehension Question Difficulty(https://arxiv.org/abs/2502.17785)
Keywords: language model, gpt, llm
Abstract: Reading comprehension is key to individual success, yet the assessment of question difficulty remains challenging due to the extensive human annotation and large-scale testing required by traditional methods such as linguistic analysis and Item Response Theory (IRT). While these robust approaches provide valuable insights, their scalability is limited. There is potential for Large Language Models (LLMs) to automate question difficulty estimation; however, this area remains underexplored. Our study investigates the effectiveness of LLMs, specifically OpenAI's GPT-4o and o1, in estimating the difficulty of reading comprehension questions using the Study Aid and Reading Assessment (SARA) dataset. We evaluated both the accuracy of the models in answering comprehension questions and their ability to classify difficulty levels as defined by IRT. The results indicate that, while the models yield difficulty estimates that align meaningfully with derived IRT parameters, there are notable differences in their sensitivity to extreme item characteristics. These findings suggest that LLMs can serve as a scalable method for automated difficulty assessment, particularly in dynamic interactions between learners and Adaptive Instructional Systems (AIS), bridging the gap between traditional psychometric techniques and modern AIS for reading comprehension and paving the way for more adaptive and personalized educational assessments.

Title: AIR: Complex Instruction Generation via Automatic Iterative Refinement

Authors: Wei Liu, Yancheng He, Hui Huang, Chengwei Hu, Jiaheng Liu, Shilong Li, Wenbo Su, Bo Zheng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.17787
Pdf URL: https://arxiv.org/pdf/2502.17787
Copy Paste: [[2502.17787]] AIR: Complex Instruction Generation via Automatic Iterative Refinement(https://arxiv.org/abs/2502.17787)
Keywords: language model, llm
Abstract: With the development of large language models, their ability to follow simple instructions has significantly improved. However, adhering to complex instructions remains a major challenge. Current approaches to generating complex instructions are often irrelevant to the current instruction requirements or suffer from limited scalability and diversity. Moreover, methods such as back-translation, while effective for simple instruction generation, fail to leverage the rich contents and structures in large web corpora. In this paper, we propose a novel automatic iterative refinement framework to generate complex instructions with constraints, which not only better reflects the requirements of real scenarios but also significantly enhances LLMs' ability to follow complex instructions. The AIR framework consists of two stages: (1) generate an initial instruction from a document; (2) iteratively refine instructions with LLM-as-judge guidance by comparing the model's output with the document to incorporate valuable constraints. Finally, we construct the AIR-10K dataset with 10K complex instructions and demonstrate that instructions generated with our approach significantly improve the model's ability to follow complex instructions, outperforming existing methods for instruction generation.

Title: Enhancing Human Evaluation in Machine Translation with Comparative Judgment

Authors: Yixiao Song, Parker Riley, Daniel Deutsch, Markus Freitag
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.17797
Pdf URL: https://arxiv.org/pdf/2502.17797
Copy Paste: [[2502.17797]] Enhancing Human Evaluation in Machine Translation with Comparative Judgment(https://arxiv.org/abs/2502.17797)
Keywords: language model
Abstract: Human evaluation is crucial for assessing rapidly evolving language models but is influenced by annotator proficiency and task design. This study explores the integration of comparative judgment into human annotation for machine translation (MT) and evaluates three annotation setups: point-wise Multidimensional Quality Metrics (MQM), side-by-side (SxS) MQM, and its simplified version SxS relative ranking (RR). In MQM, annotators mark error spans with categories and severity levels. SxS MQM extends MQM to pairwise error annotation for two translations of the same input, while SxS RR focuses on selecting the better output without labeling errors. Key findings are: (1) the SxS settings achieve higher inter-annotator agreement than MQM; (2) SxS MQM enhances inter-translation error marking consistency compared to MQM by, on average, 38.5% for explicitly compared MT systems and 19.5% for others; (3) all annotation settings return stable system rankings, with SxS RR offering a more efficient alternative to (SxS) MQM; (4) the SxS settings highlight subtle errors overlooked in MQM without altering absolute system evaluations. To spur further research, we will release the triply annotated datasets comprising 377 ZhEn and 104 EnDe annotation examples.
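Summary: Human evaluation is essential for assessing rapidly evolving language models, but it is affected by annotator proficiency and task design. This study explores integrating comparative judgment into human annotation for machine translation (MT) and evaluates three annotation setups: point-wise Multidimensional Quality Metrics (MQM), side-by-side (SxS) MQM, and its simplified version, SxS relative ranking (RR). In MQM, annotators mark error spans with categories and severity levels. SxS MQM extends MQM to pairwise error annotation of two translations of the same input, while SxS RR focuses on selecting the better output without labeling errors. The key findings are: (1) the SxS settings achieve higher inter-annotator agreement than MQM; (2) compared to MQM, SxS MQM improves the consistency of error marking across translations by an average of 38.5% for explicitly compared MT systems and 19.5% for others; (3) all annotation settings return stable system rankings, with SxS RR offering a more efficient alternative to (SxS) MQM; (4) the SxS settings highlight subtle errors overlooked in MQM without changing absolute system evaluations. To spur further research, we will release the triply annotated datasets comprising 377 ZhEn and 104 EnDe annotation examples.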

Title: Your Language Model May Think Too Rigidly: Achieving Reasoning Consistency with Symmetry-Enhanced Training

Authors: Yihang Yao, Zhepeng Cen, Miao Li, William Han, Yuyou Zhang, Emerson Liu, Zuxin Liu, Chuang Gan, Ding Zhao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.17800
Pdf URL: https://arxiv.org/pdf/2502.17800
Copy Paste: [[2502.17800]] Your Language Model May Think Too Rigidly: Achieving Reasoning Consistency with Symmetry-Enhanced Training(https://arxiv.org/abs/2502.17800)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have demonstrated strong reasoning capabilities across various tasks. However, even minor variations in query phrasing, despite preserving the underlying semantic meaning, can significantly affect their performance. To address this, we focus on enhancing LLMs' awareness of symmetry in query variations and propose syMmetry-ENhanceD (MEND) Data Augmentation, a data-centric approach that improves the model's ability to extract useful information from context. Unlike existing methods that emphasize reasoning chain augmentation, our approach improves model robustness at the knowledge extraction stage through query augmentations, enabling more data-efficient training and stronger generalization to Out-of-Distribution (OOD) settings. Extensive experiments on both logical and arithmetic reasoning tasks show that MEND enhances reasoning performance across diverse query variations, providing new insight into improving LLM robustness through structured dataset curation.
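Summary: Large Language Models (LLMs) have shown strong reasoning capabilities across a variety of tasks. However, even minor variations in query phrasing that preserve the underlying semantic meaning can significantly affect their performance. To address this, we focus on enhancing LLMs' awareness of symmetry in query variations and propose syMmetry-ENhanceD (MEND) Data Augmentation, a data-centric approach that improves the model's ability to extract useful information from context. Unlike existing methods that emphasize reasoning chain augmentation, our approach improves model robustness at the knowledge extraction stage through query augmentations, enabling more data-efficient training and stronger generalization to Out-of-Distribution (OOD) settings. Extensive experiments on logical and arithmetic reasoning tasks show that MEND enhances reasoning performance across diverse query variations, providing new insight into improving LLM robustness through structured dataset curation.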

Title: URO-Bench: A Comprehensive Benchmark for End-to-End Spoken Dialogue Models

Authors: Ruiqi Yan, Xiquan Li, Wenxi Chen, Zhikang Niu, Chen Yang, Ziyang Ma, Kai Yu, Xie Chen
Subjects: cs.CL, eess.AS
Abstract URL: https://arxiv.org/abs/2502.17810
Pdf URL: https://arxiv.org/pdf/2502.17810
Copy Paste: [[2502.17810]] URO-Bench: A Comprehensive Benchmark for End-to-End Spoken Dialogue Models(https://arxiv.org/abs/2502.17810)
Keywords: language model, llm
Abstract: In recent years, with advances in large language models (LLMs), end-to-end spoken dialogue models (SDMs) have made significant strides. Compared to text-based LLMs, the evaluation of SDMs needs to take speech-related aspects into account, such as paralinguistic information and speech quality. However, there is still a lack of comprehensive evaluations for SDMs in speech-to-speech (S2S) scenarios. To address this gap, we propose URO-Bench, an extensive benchmark for SDMs. Notably, URO-Bench is the first S2S benchmark that covers evaluations about multilingualism, multi-round dialogues, and paralinguistics. Our benchmark is divided into two difficulty levels: basic track and pro track, consisting of 16 and 20 datasets respectively, evaluating the model's abilities in Understanding, Reasoning, and Oral conversation. Evaluations on our proposed benchmark reveal that current open-source SDMs perform rather well in daily QA tasks, but lag behind their backbone LLMs in terms of instruction-following ability and also suffer from catastrophic forgetting. Their performance in advanced evaluations of paralinguistic information and audio understanding remains subpar, highlighting the need for further research in this direction. We hope that URO-Bench can effectively facilitate the development of spoken dialogue models by providing a multifaceted evaluation of existing models and helping to track progress in this area.
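Summary: In recent years, with advances in large language models (LLMs), end-to-end spoken dialogue models (SDMs) have made significant progress. Compared with text-based LLMs, the evaluation of SDMs needs to account for speech-related aspects such as paralinguistic information and speech quality. However, comprehensive evaluations of SDMs in speech-to-speech (S2S) scenarios are still lacking. To address this gap, we propose URO-Bench, an extensive benchmark for SDMs. Notably, URO-Bench is the first S2S benchmark that covers evaluations of multilingualism, multi-round dialogues, and paralinguistics. Our benchmark is divided into two difficulty levels, a basic track and a pro track, consisting of 16 and 20 datasets respectively, and evaluates the model's abilities in Understanding, Reasoning, and Oral conversation. Evaluations on the proposed benchmark show that current open-source SDMs perform rather well on daily QA tasks, but lag behind their backbone LLMs in instruction-following ability and also suffer from catastrophic forgetting. Their performance in advanced evaluations of paralinguistic information and audio understanding remains subpar, highlighting the need for further research in this direction. We hope that URO-Bench can effectively facilitate the development of spoken dialogue models by providing a multifaceted evaluation of existing models and helping to track progress in this area.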

Title: Can Multimodal LLMs Perform Time Series Anomaly Detection?

Authors: Xiongxiao Xu, Haoran Wang, Yueqing Liang, Philip S. Yu, Yue Zhao, Kai Shu
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2502.17812
Pdf URL: https://arxiv.org/pdf/2502.17812
Copy Paste: [[2502.17812]] Can Multimodal LLMs Perform Time Series Anomaly Detection?(https://arxiv.org/abs/2502.17812)
Keywords: language model, gpt, llm
Abstract: Large language models (LLMs) have been increasingly used in time series analysis. However, the potential of multimodal LLMs (MLLMs), particularly vision-language models, for time series remains largely under-explored. One natural way for humans to detect time series anomalies is through visualization and textual description. Motivated by this, we raise a critical and practical research question: Can multimodal LLMs perform time series anomaly detection? To answer this, we propose the VisualTimeAnomaly benchmark to evaluate MLLMs in time series anomaly detection (TSAD). Our approach transforms time series numerical data into the image format and feeds these images into various MLLMs, including proprietary models (GPT-4o and Gemini-1.5) and open-source models (LLaVA-NeXT and Qwen2-VL), each with one larger and one smaller variant. In total, VisualTimeAnomaly contains 12.4k time series images spanning 3 scenarios and 3 anomaly granularities with 9 anomaly types across 8 MLLMs. Starting with the univariate case (point- and range-wise anomalies), we extend our evaluation to more practical scenarios, including multivariate and irregular time series scenarios, and variate-wise anomalies. Our study reveals several key insights: 1) MLLMs detect range- and variate-wise anomalies more effectively than point-wise anomalies. 2) MLLMs are highly robust to irregular time series, even with 25% of the data missing. 3) Open-source MLLMs perform comparably to proprietary models in TSAD. While open-source MLLMs excel on univariate time series, proprietary MLLMs demonstrate superior effectiveness on multivariate time series. To the best of our knowledge, this is the first work to comprehensively investigate MLLMs for TSAD, particularly for multivariate and irregular time series scenarios. We release our dataset and code at this https URL to support future research.
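Summary: Large language models (LLMs) are increasingly used in time series analysis. However, the potential of multimodal LLMs (MLLMs), especially vision-language models, for time series remains largely under-explored. One natural way for humans to detect time series anomalies is through visualization and textual description. Motivated by this, we raise a critical and practical research question: can multimodal LLMs perform time series anomaly detection? To answer it, we propose the VisualTimeAnomaly benchmark to evaluate MLLMs in time series anomaly detection (TSAD). Our approach converts numerical time series data into image format and feeds these images into various MLLMs, including proprietary models (GPT-4o and Gemini-1.5) and open-source models (LLaVA-NeXT and Qwen2-VL), each with one larger and one smaller variant. In total, VisualTimeAnomaly contains 12.4k time series images spanning 3 scenarios and 3 anomaly granularities, with 9 anomaly types across 8 MLLMs. Starting with the univariate case (point-wise and range-wise anomalies), we extend the evaluation to more practical scenarios, including multivariate and irregular time series and variate-wise anomalies. Our study reveals several key insights: 1) MLLMs detect range-wise and variate-wise anomalies more effectively than point-wise anomalies. 2) MLLMs are highly robust to irregular time series, even with 25% of the data missing. 3) Open-source MLLMs perform comparably to proprietary models in TSAD. While open-source MLLMs excel on univariate time series, proprietary MLLMs are more effective on multivariate time series. To the best of our knowledge, this is the first work to comprehensively investigate MLLMs for TSAD, particularly for multivariate and irregular time series scenarios. We release our dataset and code at this https URL to support future research.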

Title: Predicting Through Generation: Why Generation Is Better for Prediction

Authors: Md Kowsher, Nusrat Jahan Prottasha, Prakash Bhat, Chun-Nam Yu, Mojtaba Soltanalian, Ivan Garibay, Ozlem Garibay, Chen Chen, Niloofar Yousefi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.17817
Pdf URL: https://arxiv.org/pdf/2502.17817
Copy Paste: [[2502.17817]] Predicting Through Generation: Why Generation Is Better for Prediction(https://arxiv.org/abs/2502.17817)
Keywords: llm
Abstract: This paper argues that generating output tokens is more effective than using pooled representations for prediction tasks because token-level generation retains more mutual information. Since LLMs are trained on massive text corpora using next-token prediction, generation aligns naturally with their learned behavior. Using the Data Processing Inequality (DPI), we provide both theoretical and empirical evidence supporting this claim. However, autoregressive models face two key challenges when used for prediction: (1) exposure bias, where the model sees ground truth tokens during training but relies on its own predictions during inference, leading to errors, and (2) format mismatch, where discrete tokens do not always align with the task's required output structure. To address these challenges, we introduce PredGen (Predicting Through Generating), an end-to-end framework that (i) uses scheduled sampling to reduce exposure bias, and (ii) introduces a task adapter to convert the generated tokens into structured outputs. Additionally, we introduce Writer-Director Alignment Loss (WDAL), which ensures consistency between token generation and final task predictions, improving both text coherence and numerical accuracy. We evaluate PredGen on multiple classification and regression benchmarks. Our results show that PredGen consistently outperforms standard baselines, demonstrating its effectiveness in structured prediction tasks.
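Summary: This paper argues that generating output tokens is more effective for prediction tasks than using pooled representations, because token-level generation retains more mutual information. Since LLMs are trained on massive text corpora with next-token prediction, generation aligns naturally with their learned behavior. Using the Data Processing Inequality (DPI), we provide theoretical and empirical evidence supporting this claim. However, autoregressive models face two key challenges when used for prediction: (1) exposure bias, where the model sees ground truth tokens during training but relies on its own predictions during inference, leading to errors, and (2) format mismatch, where discrete tokens do not always align with the output structure the task requires. To address these challenges, we introduce PredGen (Predicting Through Generating), an end-to-end framework that (i) uses scheduled sampling to reduce exposure bias and (ii) introduces a task adapter to convert the generated tokens into structured outputs. In addition, we introduce Writer-Director Alignment Loss (WDAL), which ensures consistency between token generation and final task predictions, improving both text coherence and numerical accuracy. We evaluate PredGen on multiple classification and regression benchmarks. Our results show that PredGen consistently outperforms standard baselines, demonstrating its effectiveness in structured prediction tasks.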
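
The PredGen abstract above names scheduled sampling as its remedy for exposure bias. The following is a minimal, generic sketch of that idea and not the paper's implementation: the function names, the `predict_next` callback, and the linear decay schedule are all assumptions made for illustration. At each step the decoder is fed the ground truth token with some probability and its own prediction otherwise.

```python
import random


def linear_decay(step, total_steps, floor=0.0):
    """Hypothetical schedule: probability of feeding gold tokens falls linearly."""
    return max(floor, 1.0 - step / total_steps)


def scheduled_sampling_inputs(gold_tokens, predict_next, teacher_forcing_prob, rng=None):
    """Return the tokens fed to the decoder, mixing gold tokens and predictions.

    predict_next is an assumed callback: it takes the inputs so far and
    returns the model's guess for the next token.
    """
    rng = rng or random.Random()
    inputs = []
    prev = gold_tokens[0]  # the first token (e.g. a start symbol) is always gold
    for t in range(1, len(gold_tokens)):
        inputs.append(prev)
        predicted = predict_next(inputs)
        # Feed the gold token with probability teacher_forcing_prob,
        # otherwise feed the model's own prediction, as at inference time.
        use_gold = rng.random() < teacher_forcing_prob
        prev = gold_tokens[t] if use_gold else predicted
    return inputs


if __name__ == "__main__":
    gold = ["<s>", "a", "b", "c"]
    always_x = lambda context: "x"  # stand-in for a real model
    print(scheduled_sampling_inputs(gold, always_x, linear_decay(0, 10)))
    print(scheduled_sampling_inputs(gold, always_x, linear_decay(10, 10)))
```

With a probability of 1.0 this reduces to ordinary teacher forcing; with 0.0 the model only ever sees its own outputs after the start token.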

Title: Say Less, Mean More: Leveraging Pragmatics in Retrieval-Augmented Generation

Authors: Haris Riaz, Ellen Riloff, Mihai Surdeanu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.17839
Pdf URL: https://arxiv.org/pdf/2502.17839
Copy Paste: [[2502.17839]] Say Less, Mean More: Leveraging Pragmatics in Retrieval-Augmented Generation(https://arxiv.org/abs/2502.17839)
Keywords: llm, retrieval-augmented generation
Abstract: We propose a simple, unsupervised method that injects pragmatic principles in retrieval-augmented generation (RAG) frameworks such as Dense Passage Retrieval (Karpukhin et al., 2020) to enhance the utility of retrieved contexts. Our approach first identifies which sentences in a pool of documents retrieved by RAG are most relevant to the question at hand, cover all the topics addressed in the input question and no more, and then highlights these sentences within their context, before they are provided to the LLM, without truncating or altering the context in any other way. We show that this simple idea brings consistent improvements in experiments on three question answering tasks (ARC-Challenge, PubHealth and PopQA) using five different LLMs. It notably enhances relative accuracy by up to 19.7% on PubHealth and 10% on ARC-Challenge compared to a conventional RAG system.
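Summary: We propose a simple, unsupervised method that injects pragmatic principles into retrieval-augmented generation (RAG) frameworks such as Dense Passage Retrieval (Karpukhin et al., 2020) to enhance the utility of retrieved contexts. Our approach first identifies which sentences in the pool of documents retrieved by RAG are most relevant to the question at hand, covering all the topics addressed in the input question and no more, and then highlights these sentences within their context before they are provided to the LLM, without truncating or altering the context in any other way. We show that this simple idea brings consistent improvements in experiments on three question answering tasks (ARC-Challenge, PubHealth and PopQA) using five different LLMs. Compared to a conventional RAG system, it improves relative accuracy by up to 19.7% on PubHealth and 10% on ARC-Challenge.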
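
The entry above describes highlighting the most relevant sentences inside a retrieved context without truncating it. A rough sketch of that highlighting step follows. The relevance score used here (word overlap with the question), the `<<` and `>>` markers, and the function name are assumptions chosen for illustration; the abstract does not say how the authors score or mark sentences.

```python
import re


def highlight_relevant(context, question, top_k=1, mark=("<<", ">>")):
    """Wrap the top_k sentences that best match the question in markers.

    Relevance is plain word overlap, a stand-in for whatever scoring the
    paper actually uses. The context is never shortened or reordered.
    """
    sentences = re.split(r"(?<=[.!?])\s+", context.strip())
    question_words = set(re.findall(r"\w+", question.lower()))

    def overlap(sentence):
        return len(question_words & set(re.findall(r"\w+", sentence.lower())))

    ranked = sorted(range(len(sentences)), key=lambda i: overlap(sentences[i]), reverse=True)
    chosen = {i for i in ranked[:top_k] if overlap(sentences[i]) > 0}
    return " ".join(
        mark[0] + s + mark[1] if i in chosen else s
        for i, s in enumerate(sentences)
    )


if __name__ == "__main__":
    ctx = "Paris is the capital of France. Bananas are yellow. The Seine flows through Paris."
    print(highlight_relevant(ctx, "What is the capital of France?"))
```

Stripping the markers from the result gives back the original context, which is the property the abstract emphasizes: the LLM still sees everything, with emphasis added.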

Title: LR${}^{2}$Bench: Evaluating Long-chain Reflective Reasoning Capabilities of Large Language Models via Constraint Satisfaction Problems

Authors: Jianghao Chen, Zhenlin Wei, Zhenjiang Ren, Ziyong Li, Jiajun Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.17848
Pdf URL: https://arxiv.org/pdf/2502.17848
Copy Paste: [[2502.17848]] LR${}^{2}$Bench: Evaluating Long-chain Reflective Reasoning Capabilities of Large Language Models via Constraint Satisfaction Problems(https://arxiv.org/abs/2502.17848)
Keywords: language model, llm
Abstract: Recent progress in o1-like models has significantly enhanced the reasoning abilities of Large Language Models (LLMs), empowering them to tackle increasingly complex tasks through reflection capabilities, such as making assumptions, backtracking, and self-refinement. However, effectively evaluating such reflection capabilities remains challenging due to the lack of appropriate benchmarks. To bridge this gap, we introduce LR${}^{2}$Bench, a novel benchmark designed to evaluate the Long-chain Reflective Reasoning capabilities of LLMs. LR${}^{2}$Bench comprises 850 samples across six Constraint Satisfaction Problems (CSPs) where reflective reasoning is crucial for deriving solutions that meet all given constraints. Each type of task focuses on distinct constraint patterns, such as knowledge-based, logical, and spatial constraints, providing a comprehensive evaluation of diverse problem-solving scenarios. We conduct extensive evaluation on both conventional models and o1-like models. Our experimental results reveal that even the most advanced reasoning-specific models, such as DeepSeek-R1 and OpenAI o1-preview, struggle with tasks in LR${}^{2}$Bench, achieving an average Exact Match score of only 20.0% and 23.6%, respectively. These findings underscore the significant room for improvement in the reflective reasoning capabilities of current LLMs. The leaderboard of our benchmark is available at this https URL
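Summary: Recent progress in o1-like models has significantly enhanced the reasoning abilities of Large Language Models (LLMs), enabling them to tackle increasingly complex tasks through reflection capabilities such as making assumptions, backtracking, and self-refinement. However, effectively evaluating these reflection capabilities remains challenging because appropriate benchmarks are lacking. To bridge this gap, we introduce LR${}^{2}$Bench, a novel benchmark designed to evaluate the Long-chain Reflective Reasoning capabilities of LLMs. LR${}^{2}$Bench comprises 850 samples across six Constraint Satisfaction Problems (CSPs), where reflective reasoning is crucial for deriving solutions that meet all given constraints. Each type of task focuses on distinct constraint patterns, such as knowledge-based, logical, and spatial constraints, providing a comprehensive evaluation of diverse problem-solving scenarios. We conduct extensive evaluation on both conventional models and o1-like models. Our experimental results show that even the most advanced reasoning-specific models, such as DeepSeek-R1 and OpenAI o1-preview, struggle with the tasks in LR${}^{2}$Bench, achieving average Exact Match scores of only 20.0% and 23.6%, respectively. These findings underscore the significant room for improvement in the reflective reasoning capabilities of current LLMs. The leaderboard of our benchmark is available at this https URL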

Title: SYNTHEMPATHY: A Scalable Empathy Corpus Generated Using LLMs Without Any Crowdsourcing

Authors: Run Chen, Jun Shin, Julia Hirschberg
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.17857
Pdf URL: https://arxiv.org/pdf/2502.17857
Copy Paste: [[2502.17857]] SYNTHEMPATHY: A Scalable Empathy Corpus Generated Using LLMs Without Any Crowdsourcing(https://arxiv.org/abs/2502.17857)
Keywords: language model, llm, agent
Abstract: Previous research has shown that humans are more receptive towards language models that exhibit empathetic behavior. While empathy is essential for developing helpful dialogue agents, very few large corpora containing empathetic dialogues are available for fine-tuning LLMs. The few existing corpora have largely relied on crowdsourcing to simulate empathetic conversations, a process that is expensive, time-consuming, and not scalable to larger datasets. We propose a data generation framework for developing SYNTHEMPATHY, a large corpus containing 105k empathetic responses to real-life situations compiled through LLM generation. A base Mistral 7B model fine-tuned on our SYNTHEMPATHY corpus exhibits an increase in the average empathy score.
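Summary: Previous research has shown that humans are more receptive to language models that exhibit empathetic behavior. Although empathy is essential for developing helpful dialogue agents, very few large corpora containing empathetic dialogues are available for fine-tuning LLMs. The few existing corpora have largely relied on crowdsourcing to simulate empathetic conversations, a process that is expensive, time-consuming, and not scalable to larger datasets. We propose a data generation framework for developing SYNTHEMPATHY, a large corpus containing 105k empathetic responses to real-life situations compiled through LLM generation. A base Mistral 7B model fine-tuned on our SYNTHEMPATHY corpus shows an increase in the average empathy score.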

Title: Towards Enhanced Immersion and Agency for LLM-based Interactive Drama

Authors: Hongqiu Wu, Weiqi Wu, Tianyang Xu, Jiameng Zhang, Hai Zhao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.17878
Pdf URL: https://arxiv.org/pdf/2502.17878
Copy Paste: [[2502.17878]] Towards Enhanced Immersion and Agency for LLM-based Interactive Drama(https://arxiv.org/abs/2502.17878)
Keywords: llm, agent
Abstract: LLM-based Interactive Drama is a novel AI-based dialogue scenario, where the user (i.e., the player) plays the role of a character in the story, has conversations with characters played by LLM agents, and experiences an unfolding story. This paper begins with understanding interactive drama from two aspects: Immersion, the player's feeling of being present in the story, and Agency, the player's ability to influence the story world. Both are crucial to creating an enjoyable interactive experience, yet both have been underexplored in previous work. To enhance these two aspects, we first propose Playwriting-guided Generation, a novel method that helps LLMs craft dramatic stories with substantially improved structures and narrative quality. Additionally, we introduce Plot-based Reflection for LLM agents to refine their reactions to align with the player's intentions. Our evaluation relies on human judgment to assess the gains of our methods in terms of immersion and agency.
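Summary: LLM-based Interactive Drama is a novel AI-based dialogue scenario in which the user (i.e., the player) plays the role of a character in the story, converses with characters played by LLM agents, and experiences an unfolding story. This paper begins by understanding interactive drama from two aspects: Immersion, the player's feeling of being present in the story, and Agency, the player's ability to influence the story world. Both are crucial to creating an enjoyable interactive experience, yet both have been underexplored in previous work. To enhance these two aspects, we first propose Playwriting-guided Generation, a novel method that helps LLMs craft dramatic stories with substantially improved structure and narrative quality. In addition, we introduce Plot-based Reflection, which lets LLM agents refine their reactions to align with the player's intentions. Our evaluation relies on human judgment to assess the gains of our methods in terms of immersion and agency.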

Title: RankCoT: Refining Knowledge for Retrieval-Augmented Generation through Ranking Chain-of-Thoughts

Authors: Mingyan Wu, Zhenghao Liu, Yukun Yan, Xinze Li, Shi Yu, Zheni Zeng, Yu Gu, Ge Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.17888
Pdf URL: https://arxiv.org/pdf/2502.17888
Copy Paste: [[2502.17888]] RankCoT: Refining Knowledge for Retrieval-Augmented Generation through Ranking Chain-of-Thoughts(https://arxiv.org/abs/2502.17888)
Keywords: language model, llm, prompt, retrieval-augmented generation, chain-of-thought
Abstract: Retrieval-Augmented Generation (RAG) enhances the performance of Large Language Models (LLMs) by incorporating external knowledge. However, LLMs still encounter challenges in effectively utilizing the knowledge from retrieved documents, often being misled by irrelevant or noisy information. To address this issue, we introduce RankCoT, a knowledge refinement method that incorporates reranking signals in generating CoT-based summarization for knowledge refinement based on the given query and all retrieved documents. During training, RankCoT prompts the LLM to generate Chain-of-Thought (CoT) candidates based on the query and individual documents. It then fine-tunes the LLM to directly reproduce the best CoT from these candidate outputs based on all retrieved documents, which requires the LLM to filter out irrelevant documents while generating CoT-style summarization. Additionally, RankCoT incorporates a self-reflection mechanism that further refines the CoT outputs, resulting in higher-quality training data. Our experiments demonstrate the effectiveness of RankCoT, showing its superior performance over other knowledge refinement models. Further analysis reveals that RankCoT can provide shorter but effective refinement results, enabling the generator to produce more accurate answers. All code and data are available at this https URL.
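Summary: Retrieval-Augmented Generation (RAG) enhances the performance of Large Language Models (LLMs) by incorporating external knowledge. However, LLMs still struggle to make effective use of the knowledge in retrieved documents and are often misled by irrelevant or noisy information. To address this issue, we introduce RankCoT, a knowledge refinement method that incorporates reranking signals into CoT-based summarization, generated from the given query and all retrieved documents. During training, RankCoT prompts the LLM to generate Chain-of-Thought (CoT) candidates based on the query and individual documents. It then fine-tunes the LLM to directly reproduce the best CoT among these candidates given all retrieved documents, which requires the LLM to filter out irrelevant documents while generating the CoT-style summarization. In addition, RankCoT incorporates a self-reflection mechanism that further refines the CoT outputs, yielding higher-quality training data. Our experiments demonstrate the effectiveness of RankCoT, showing superior performance over other knowledge refinement models. Further analysis shows that RankCoT can provide shorter but effective refinement results, enabling the generator to produce more accurate answers. All code and data are available at this https URL.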

Title: Can Large Language Models Identify Implicit Suicidal Ideation? An Empirical Evaluation

Authors: Tong Li, Shu Yang, Junchao Wu, Jiyao Wei, Lijie Hu, Mengdi Li, Derek F. Wong, Joshua R. Oltmanns, Di Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.17899
Pdf URL: https://arxiv.org/pdf/2502.17899
Copy Paste: [[2502.17899]] Can Large Language Models Identify Implicit Suicidal Ideation? An Empirical Evaluation(https://arxiv.org/abs/2502.17899)
Keywords: language model, llm
Abstract: We present a comprehensive evaluation framework for assessing Large Language Models' (LLMs) capabilities in suicide prevention, focusing on two critical aspects: the Identification of Implicit Suicidal ideation (IIS) and the Provision of Appropriate Supportive responses (PAS). We introduce \ourdata, a novel dataset of 1,308 test cases built upon psychological frameworks including D/S-IAT and Negative Automatic Thinking, alongside real-world scenarios. Through extensive experiments with 8 widely used LLMs under different contextual settings, we find that current models struggle significantly with detecting implicit suicidal ideation and providing appropriate support, highlighting crucial limitations in applying LLMs to mental health contexts. Our findings underscore the need for more sophisticated approaches in developing and evaluating LLMs for sensitive psychological applications.
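Summary: We present a comprehensive evaluation framework for assessing the capabilities of Large Language Models (LLMs) in suicide prevention, focusing on two critical aspects: the Identification of Implicit Suicidal ideation (IIS) and the Provision of Appropriate Supportive responses (PAS). We introduce \ourdata, a novel dataset of 1,308 test cases built upon psychological frameworks including D/S-IAT and Negative Automatic Thinking, alongside real-world scenarios. Through extensive experiments with 8 widely used LLMs under different contextual settings, we find that current models struggle significantly with detecting implicit suicidal ideation and providing appropriate support, highlighting crucial limitations in applying LLMs to mental health contexts. Our findings underscore the need for more sophisticated approaches to developing and evaluating LLMs for sensitive psychological applications.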

Title: Scaling LLM Pre-training with Vocabulary Curriculum

Authors: Fangyuan Yu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.17910
Pdf URL: https://arxiv.org/pdf/2502.17910
Copy Paste: [[2502.17910]] Scaling LLM Pre-training with Vocabulary Curriculum(https://arxiv.org/abs/2502.17910)
Keywords: language model, gpt, llm
Abstract: Modern language models rely on static vocabularies, fixed before pretraining, in contrast to the adaptive vocabulary acquisition observed in human language learning. To bridge this gap, we introduce vocabulary curriculum learning, an approach that improves pretraining efficiency with log-linear scaling gains relative to vocabulary size. Our method alternates between entropy-guided vocabulary expansion and model optimization, enabling models to learn transferable representations across diverse tokenization granularities. This approach naturally gives rise to an optimal computation allocation pattern: longer tokens capture predictable content, while shorter tokens focus on more complex, harder-to-predict contexts. Experiments on small-scale GPT models demonstrate improved scaling efficiency, reinforcing the effectiveness of dynamic tokenization. We release our code to support further research and plan to extend our experiments to larger models and diverse domains.
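Summary: Modern language models rely on static vocabularies that are fixed before pretraining, in contrast to the adaptive vocabulary acquisition observed in human language learning. To bridge this gap, we introduce vocabulary curriculum learning, an approach that improves pretraining efficiency with log-linear scaling gains relative to vocabulary size. Our method alternates between entropy-guided vocabulary expansion and model optimization, enabling models to learn transferable representations across diverse tokenization granularities. This approach naturally gives rise to an optimal computation allocation pattern: longer tokens capture predictable content, while shorter tokens focus on more complex, harder-to-predict contexts. Experiments on small-scale GPT models show improved scaling efficiency, reinforcing the effectiveness of dynamic tokenization. We release our code to support further research and plan to extend our experiments to larger models and diverse domains.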

Title: FACT-AUDIT: An Adaptive Multi-Agent Framework for Dynamic Fact-Checking Evaluation of Large Language Models

Authors: Hongzhan Lin, Yang Deng, Yuxuan Gu, Wenxuan Zhang, Jing Ma, See-Kiong Ng, Tat-Seng Chua
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.17924
Pdf URL: https://arxiv.org/pdf/2502.17924
Copy Paste: [[2502.17924]] FACT-AUDIT: An Adaptive Multi-Agent Framework for Dynamic Fact-Checking Evaluation of Large Language Models(https://arxiv.org/abs/2502.17924)
Keywords: language model, llm, agent
Abstract: Large Language Models (LLMs) have significantly advanced fact-checking studies. However, existing automated fact-checking evaluation methods rely on static datasets and classification metrics, which fail to automatically evaluate the justification production and uncover the nuanced limitations of LLMs in fact-checking. In this work, we introduce FACT-AUDIT, an agent-driven framework that adaptively and dynamically assesses LLMs' fact-checking capabilities. Leveraging importance sampling principles and multi-agent collaboration, FACT-AUDIT generates adaptive and scalable datasets, performs iterative model-centric evaluations, and updates assessments based on model-specific responses. By incorporating justification production alongside verdict prediction, this framework provides a comprehensive and evolving audit of LLMs' factual reasoning capabilities, to investigate their trustworthiness. Extensive experiments demonstrate that FACT-AUDIT effectively differentiates among state-of-the-art LLMs, providing valuable insights into model strengths and limitations in model-centric fact-checking analysis.
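Summary: Large Language Models (LLMs) have significantly advanced fact-checking research. However, existing automated fact-checking evaluation methods rely on static datasets and classification metrics, which fail to automatically evaluate justification production or uncover the nuanced limitations of LLMs in fact-checking. In this work, we introduce FACT-AUDIT, an agent-driven framework that adaptively and dynamically assesses LLMs' fact-checking capabilities. Leveraging importance sampling principles and multi-agent collaboration, FACT-AUDIT generates adaptive and scalable datasets, performs iterative model-centric evaluations, and updates assessments based on model-specific responses. By incorporating justification production alongside verdict prediction, the framework provides a comprehensive and evolving audit of LLMs' factual reasoning capabilities in order to investigate their trustworthiness. Extensive experiments show that FACT-AUDIT effectively differentiates among state-of-the-art LLMs, providing valuable insights into model strengths and limitations in model-centric fact-checking analysis.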

Title: Advantage-Guided Distillation for Preference Alignment in Small Language Models

Authors: Shiping Gao, Fanqi Wan, Jiajian Guo, Xiaojun Quan, Qifan Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.17927
Pdf URL: https://arxiv.org/pdf/2502.17927
Copy Paste: [[2502.17927]] Advantage-Guided Distillation for Preference Alignment in Small Language Models(https://arxiv.org/abs/2502.17927)
Keywords: language model, llm
Abstract: Alignment techniques enable Large Language Models (LLMs) to generate outputs that align with human preferences and play a crucial role in their effectiveness. However, their impact often diminishes when applied to Small Language Models (SLMs), likely due to the limited capacity of these models. Instead of directly applying existing alignment techniques to SLMs, we propose to utilize a well-aligned teacher LLM to guide the alignment process for these models, thereby facilitating the transfer of the teacher's knowledge of human preferences to the student model. To achieve this, we first explore a straightforward approach, Dual-Constrained Knowledge Distillation (DCKD), that employs knowledge distillation with two KL-divergence constraints from the aligned teacher to the unaligned student. To further enhance the student's ability to distinguish between preferred and dispreferred responses, we then propose Advantage-Guided Distillation for Preference Alignment (ADPA), which leverages an advantage function from the aligned teacher to deliver more nuanced, distribution-level reward signals for the student's alignment. Our experimental results show that these two approaches appreciably improve the alignment of SLMs and narrow the performance gap with larger counterparts. Among them, ADPA demonstrates superior performance and achieves even greater effectiveness when integrated with DCKD. Our code is available at this https URL.
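Summary: Alignment techniques enable Large Language Models (LLMs) to generate outputs that align with human preferences and play a crucial role in their effectiveness. However, their impact often diminishes when applied to Small Language Models (SLMs), likely because of the limited capacity of these models. Instead of directly applying existing alignment techniques to SLMs, we propose using a well-aligned teacher LLM to guide the alignment process for these models, thereby transferring the teacher's knowledge of human preferences to the student model. To achieve this, we first explore a straightforward approach, Dual-Constrained Knowledge Distillation (DCKD), which applies knowledge distillation with two KL-divergence constraints from the aligned teacher to the unaligned student. To further enhance the student's ability to distinguish between preferred and dispreferred responses, we then propose Advantage-Guided Distillation for Preference Alignment (ADPA), which leverages an advantage function from the aligned teacher to deliver more nuanced, distribution-level reward signals for the student's alignment. Our experimental results show that these two approaches appreciably improve the alignment of SLMs and narrow the performance gap with larger counterparts. Of the two, ADPA demonstrates superior performance and is even more effective when integrated with DCKD. Our code is available at this https URL.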
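
The DCKD description above mentions knowledge distillation with two KL-divergence constraints from teacher to student. A toy sketch of what such a loss could look like follows. It rests on an assumption that the abstract does not state: that one KL term is computed on a preferred response and the other on a dispreferred response. The function names, the `weight` parameter, and the single-position logits are likewise illustrative.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def dual_kl_loss(teacher_pref, student_pref, teacher_disp, student_disp, weight=1.0):
    """Sum of two teacher-to-student KL terms (assumed pairing, see lead-in).

    Each argument is a list of next-token logits at one position.
    """
    kl_pref = kl_divergence(softmax(teacher_pref), softmax(student_pref))
    kl_disp = kl_divergence(softmax(teacher_disp), softmax(student_disp))
    return kl_pref + weight * kl_disp


if __name__ == "__main__":
    teacher = [2.0, 0.5, -1.0]
    student = [0.1, 0.2, 0.3]
    print(dual_kl_loss(teacher, student, teacher, student))
```

The loss is zero when the student's distributions match the teacher's on both responses and grows as they diverge; a real implementation would average over token positions and batches.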

Title: CaseGen: A Benchmark for Multi-Stage Legal Case Documents Generation

Authors: Haitao Li, Jiaying Ye, Yiran Hu, Jia Chen, Qingyao Ai, Yueyue Wu, Junjie Chen, Yifan Chen, Cheng Luo, Quan Zhou, Yiqun Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.17943
Pdf URL: https://arxiv.org/pdf/2502.17943
Copy Paste: [[2502.17943]] CaseGen: A Benchmark for Multi-Stage Legal Case Documents Generation(https://arxiv.org/abs/2502.17943)
Keywords: language model, llm
Abstract: Legal case documents play a critical role in judicial proceedings. As the number of cases continues to rise, the reliance on manual drafting of legal case documents is facing increasing pressure and challenges. The development of large language models (LLMs) offers a promising solution for automating document generation. However, existing benchmarks fail to fully capture the complexities involved in drafting legal case documents in real-world scenarios. To address this gap, we introduce CaseGen, a benchmark for multi-stage legal case document generation in the Chinese legal domain. CaseGen is based on 500 real case samples annotated by legal experts and covers seven essential case sections. It supports four key tasks: drafting defense statements, writing trial facts, composing legal reasoning, and generating judgment results. To the best of our knowledge, CaseGen is the first benchmark designed to evaluate LLMs in the context of legal case document generation. To ensure an accurate and comprehensive evaluation, we design an LLM-as-a-judge evaluation framework and validate its effectiveness through human annotations. We evaluate several widely used general-domain LLMs and legal-specific LLMs, highlighting their limitations in case document generation and pinpointing areas for potential improvement. This work marks a step toward a more effective framework for automating legal case document drafting, paving the way for the reliable application of AI in the legal field. The dataset and code are publicly available at this https URL.
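Summary: Legal case documents play a critical role in judicial proceedings. As the number of cases continues to rise, reliance on manual drafting of legal case documents faces increasing pressure and challenges. The development of large language models (LLMs) offers a promising solution for automating document generation. However, existing benchmarks fail to fully capture the complexities involved in drafting legal case documents in real-world scenarios. To address this gap, we introduce CaseGen, a benchmark for multi-stage legal case document generation in the Chinese legal domain. CaseGen is based on 500 real case samples annotated by legal experts and covers seven essential case sections. It supports four key tasks: drafting defense statements, writing trial facts, composing legal reasoning, and generating judgment results. To the best of our knowledge, CaseGen is the first benchmark designed to evaluate LLMs on legal case document generation. To ensure an accurate and comprehensive evaluation, we design an LLM-as-a-judge evaluation framework and validate its effectiveness through human annotations. We evaluate several widely used general-domain and legal-specific LLMs, highlighting their limitations in case document generation and pinpointing areas for potential improvement. This work marks a step toward a more effective framework for automating legal case document drafting, paving the way for the reliable application of AI in the legal field. The dataset and code are publicly available at this https URL.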

Title: Assessing Large Language Models in Agentic Multilingual National Bias

Authors: Qianying Liu, Katrina Qiyao Wang, Fei Cheng, Sadao Kurohashi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.17945
Pdf URL: https://arxiv.org/pdf/2502.17945
Copy Paste: [[2502.17945]] Assessing Large Language Models in Agentic Multilingual National Bias(https://arxiv.org/abs/2502.17945)
Keywords: language model, gpt, llm, prompt, chain-of-thought, agent
Abstract: Large Language Models have garnered significant attention for their capabilities in multilingual natural language processing, while studies on risks associated with cross biases are limited to immediate context preferences. Cross-language disparities in reasoning-based recommendations remain largely unexplored, with a lack of even descriptive analysis. This study is the first to address this gap. We test LLMs' applicability and capability in providing personalized advice across three key scenarios: university applications, travel, and relocation. We investigate multilingual bias in state-of-the-art LLMs by analyzing their responses to decision-making tasks across multiple languages. We quantify bias in model-generated scores and assess the impact of demographic factors and reasoning strategies (e.g., Chain-of-Thought prompting) on bias patterns. Our findings reveal that local language bias is prevalent across different tasks, with GPT-4 and Sonnet reducing bias for English-speaking countries compared to GPT-3.5 but failing to achieve robust multilingual alignment, highlighting broader implications for multilingual AI agents and applications such as education.
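Summary: Large Language Models have attracted significant attention for their capabilities in multilingual natural language processing, while studies of the risks associated with cross biases are limited to immediate context preferences. Cross-language disparities in reasoning-based recommendations remain largely unexplored, and even descriptive analysis is lacking. This study is the first to address this gap. We test LLMs' applicability and capability in providing personalized advice across three key scenarios: university applications, travel, and relocation. We investigate multilingual bias in state-of-the-art LLMs by analyzing their responses to decision-making tasks across multiple languages. We quantify bias in model-generated scores and assess the impact of demographic factors and reasoning strategies (e.g., Chain-of-Thought prompting) on bias patterns. Our findings show that local language bias is prevalent across different tasks, with GPT-4 and Sonnet reducing bias for English-speaking countries compared to GPT-3.5 but failing to achieve robust multilingual alignment, highlighting broader implications for multilingual AI agents and applications such as education.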

Title: DeepSeek-R1 Outperforms Gemini 2.0 Pro, OpenAI o1, and o3-mini in Bilingual Complex Ophthalmology Reasoning

Authors: Pusheng Xu, Yue Wu, Kai Jin, Xiaolan Chen, Mingguang He, Danli Shi
Subjects: cs.CL, cs.AI, cs.PF
Abstract URL: https://arxiv.org/abs/2502.17947
Pdf URL: https://arxiv.org/pdf/2502.17947
Copy Paste: [[2502.17947]] DeepSeek-R1 Outperforms Gemini 2.0 Pro, OpenAI o1, and o3-mini in Bilingual Complex Ophthalmology Reasoning(https://arxiv.org/abs/2502.17947)
Keywords: language model, llm
Abstract: Purpose: To evaluate the accuracy and reasoning ability of DeepSeek-R1 and three other recently released large language models (LLMs) in bilingual complex ophthalmology cases. Methods: A total of 130 multiple-choice questions (MCQs) related to diagnosis (n = 39) and management (n = 91) were collected from the Chinese ophthalmology senior professional title examination and categorized into six topics. These MCQs were translated into English using DeepSeek-R1. The responses of DeepSeek-R1, Gemini 2.0 Pro, OpenAI o1 and o3-mini were generated under default configurations between February 15 and February 20, 2025. Accuracy was calculated as the proportion of correctly answered questions, with omissions and extra answers considered incorrect. Reasoning ability was evaluated by analyzing reasoning logic and the causes of reasoning errors. Results: DeepSeek-R1 demonstrated the highest overall accuracy, achieving 0.862 in Chinese MCQs and 0.808 in English MCQs. Gemini 2.0 Pro, OpenAI o1, and OpenAI o3-mini attained accuracies of 0.715, 0.685, and 0.692 in Chinese MCQs (all P<0.001 compared with DeepSeek-R1), and 0.746 (P=0.115), 0.723 (P=0.027), and 0.577 (P<0.001) in English MCQs, respectively. DeepSeek-R1 achieved the highest accuracy across five topics in both Chinese and English MCQs. It also excelled in management questions conducted in Chinese (all P<0.05). Reasoning ability analysis showed that the four LLMs shared similar reasoning logic. Ignoring key positive history, ignoring key positive signs, misinterpreting medical data, and being too aggressive were the most common causes of reasoning errors. Conclusion: DeepSeek-R1 demonstrated superior performance in bilingual complex ophthalmology reasoning tasks compared with three other state-of-the-art LLMs. While its clinical applicability remains challenging, it shows promise for supporting diagnosis and clinical decision-making.
摘要：目的：评估DeepSeek-R1和其他三个最近发布的大型语言模型（LLM）在双语复杂的眼科案例中的准确性和推理能力。方法：从中国眼科高级专业冠军考试中收集了与诊断（n = 39）和管理（n = 91）有关的130个多项选择问题（MCQ）（n = 39）（n = 91），并将其分为六个主题。这些MCQ使用DeepSeek-R1翻译成英文。 DeepSeek-R1，Gemini 2.0 Pro，OpenAI O1和O3-Mini的响应是在2月15日至2025年2月20日之间在默认配置下产生的。准确性的计算为正确回答的问题的比例，并以遗漏和额外的答案被认为是不正确的。通过分析推理逻辑和推理错误的原因来评估推理能力。结果：DeepSeek-R1表现出最高的整体准确性，在中国MCQ中达到0.862，英语MCQ实现了0.808。 Gemini 2.0 Pro，OpenAI O1和OpenAI O3-Mini的中文MCQ的精度为0.715、0.685和0.692（与DeepSeek-R1相比，所有P <0.001）和0.746（P = 0.115），0.723（p = 0.723（P = 0.027）（P = 0.027），分别为英语MCQ和0.577（p <0.001）。 DeepSeek-R1在中文和英语MCQ中取得了五个主题的最高准确性。它在以中文进行的管理问题上也表现出色（所有p <0.05）。推理能力分析表明，四个LLM共享了相似的推理逻辑。忽略关键的积极历史，忽略关键的积极迹象，误解医学数据以及过于侵略性是推理错误的最常见原因。结论：DeepSeek-R1在双语复杂的眼科推理任务中表现出优异的表现，而不是其他三个最先进的LLM。尽管其临床适用性仍然具有挑战性，但它显示出支持诊断和临床决策的希望。
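
The scoring rule stated in the abstract above (a question counts as correct only if nothing is omitted and nothing extra is selected) can be sketched as an exact-match check. This is a minimal illustration, not the authors' evaluation code; the function name, the letter-string encoding of answers, and the toy data are assumptions.

```python
def mcq_accuracy(predictions, answer_keys):
    """Proportion of questions answered exactly right.

    Each answer is a string of option letters, e.g. "ACD" (assumed encoding).
    A prediction with a missing option (omission) or an additional option
    (extra answer) is counted as incorrect.
    """
    if len(predictions) != len(answer_keys):
        raise ValueError("predictions and answer_keys must have the same length")
    correct = sum(
        1 for pred, key in zip(predictions, answer_keys) if set(pred) == set(key)
    )
    return correct / len(answer_keys)


# Hypothetical toy data: question 2 has an omission, question 4 an extra option.
keys = ["A", "BC", "D", "ACD"]
preds = ["A", "B", "D", "ABCD"]
print(mcq_accuracy(preds, keys))
```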

Title: Language Models' Factuality Depends on the Language of Inquiry

Authors: Tushar Aggarwal, Kumar Tanmay, Ayush Agrawal, Kumar Ayush, Hamid Palangi, Paul Pu Liang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.17955
Pdf URL: https://arxiv.org/pdf/2502.17955
Copy Paste: [[2502.17955]] Language Models' Factuality Depends on the Language of Inquiry(https://arxiv.org/abs/2502.17955)
Keywords: language model
Abstract: Multilingual language models (LMs) are expected to recall factual knowledge consistently across languages, yet they often fail to transfer knowledge between languages even when they possess the correct information in one of the languages. For example, we find that an LM may correctly identify Rashed Al Shashai as being from Saudi Arabia when asked in Arabic, but consistently fails to do so when asked in English or Swahili. To systematically investigate this limitation, we introduce a benchmark of 10,000 country-related facts across 13 languages and propose three novel metrics (Factual Recall Score, Knowledge Transferability Score, and Cross-Lingual Factual Knowledge Transferability Score) to quantify factual recall and knowledge transferability in LMs across different languages. Our results reveal fundamental weaknesses in today's state-of-the-art LMs, particularly in cross-lingual generalization where models fail to transfer knowledge effectively across different languages, leading to inconsistent performance sensitive to the language used. Our findings emphasize the need for LMs to recognize language-specific factual reliability and leverage the most trustworthy information across languages. We release our benchmark and evaluation framework to drive future research in multilingual knowledge transfer.
摘要：预计多语言语言模型（LMS）将跨语言始终如一地回顾事实知识，但是即使在其中一种语言中拥有正确的信息，它们也经常在语言之间转移知识。例如，我们发现，当用阿拉伯语询问LM时，LM可以正确地识别出来自沙特阿拉伯的Rash Al Shashai，但在用英语或斯瓦希里语询问时，始终未能这样做。为了系统地调查这一限制，我们介绍了13种语言的10,000个与国家相关的事实的基准，并提出了三个新颖的指标：事实召回得分，知识转移性得分和跨语言的事实知识转移性得分，以量化事实召回和知识转移性。 LMS跨不同语言。我们的结果揭示了当今最先进的LMS中的根本弱点，尤其是在跨语言概括中，模型无法有效地跨不同语言转移知识，从而导致对所使用语言敏感的性能不一致。我们的发现强调了LMS必须识别特定语言的事实可靠性并利用跨语言最值得信赖的信息。我们发布基准和评估框架，以推动多语言知识转移的未来研究。

Title: Towards Better Understanding of Program-of-Thought Reasoning in Cross-Lingual and Multilingual Environments

Authors: Patomporn Payoungkhamdee, Pume Tuchinda, Jinheon Baek, Samuel Cahyawijaya, Can Udomcharoenchaikit, Potsawee Manakul, Peerat Limkonchotiwat, Ekapol Chuangsuwanich, Sarana Nutanong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.17956
Pdf URL: https://arxiv.org/pdf/2502.17956
Copy Paste: [[2502.17956]] Towards Better Understanding of Program-of-Thought Reasoning in Cross-Lingual and Multilingual Environments(https://arxiv.org/abs/2502.17956)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Multi-step reasoning is essential for large language models (LLMs), yet multilingual performance remains challenging. While Chain-of-Thought (CoT) prompting improves reasoning, it struggles with non-English languages due to the entanglement of reasoning and execution. Program-of-Thought (PoT) prompting separates reasoning from execution, offering a promising alternative but shifting the challenge to generating programs from non-English questions. We propose a framework to evaluate PoT by separating multilingual reasoning from code execution to examine (i) the impact of fine-tuning on question-reasoning alignment and (ii) how reasoning quality affects answer correctness. Our findings demonstrate that PoT fine-tuning substantially enhances multilingual reasoning, outperforming CoT fine-tuned models. We further demonstrate a strong correlation between reasoning quality (measured through code quality) and answer accuracy, highlighting its potential as a test-time performance improvement heuristic.
摘要：多步性推理对于大语言模型（LLM）至关重要，但是多语言性能仍然具有挑战性。虽然对经营链（COT）提示推理的推理，但由于推理和执行的纠缠而与非英语语言斗争。促使经营计划（POT）促使推理与执行区分开，提供了有希望的替代方案，但将挑战转移到从非英语问题中产生程序。我们提出了一个框架来评估POT，通过将多语言推理与代码执行分开，以检查（i）微调对问题策划对齐的影响以及（ii）推理质量如何影响正确性。我们的发现表明，Pot微调基本上增强了多语言推理，表现优于COT微调模型。我们进一步证明了推理质量（通过代码质量衡量）和回答准确性之间的密切相关性，从而突出了其作为测试时间性能改进的启发式的潜力。

Title: On Synthetic Data Strategies for Domain-Specific Generative Retrieval

Authors: Haoyang Wen, Jiang Guo, Yi Zhang, Jiarong Jiang, Zhiguo Wang
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2502.17957
Pdf URL: https://arxiv.org/pdf/2502.17957
Copy Paste: [[2502.17957]] On Synthetic Data Strategies for Domain-Specific Generative Retrieval(https://arxiv.org/abs/2502.17957)
Keywords: llm
Abstract: This paper investigates synthetic data generation strategies in developing generative retrieval models for domain-specific corpora, thereby addressing the scalability challenges inherent in manually annotating in-domain queries. We study the data strategies for a two-stage training framework: in the first stage, which focuses on learning to decode document identifiers from queries, we investigate LLM-generated queries across multiple granularities (e.g., chunks, sentences) and domain-relevant search constraints that can better capture nuanced relevancy signals. In the second stage, which aims to refine document ranking through preference learning, we explore the strategies for mining hard negatives based on the initial model's predictions. Experiments on public datasets over diverse domains demonstrate the effectiveness of our synthetic data generation and hard negative sampling approach.
摘要：本文研究了合成数据生成策略，以开发针对域特异性语料库的生成检索模型，从而解决了手动注释内域查询所固有的可扩展性挑战。我们研究了两个阶段培训框架的数据策略：在第一阶段，该阶段着重于学习从查询中解码文档标识符，我们研究了跨多个粒度（例如，块，句子）和域相关的搜索约束的LLM生成的查询可以更好地捕获细微的相关性信号。在第二阶段，旨在通过偏好学习来完善文档排名，我们根据初始模型的预测探讨了挖掘艰苦负面因素的策略。在公共数据集上进行的有关不同领域的实验证明了我们的合成数据生成和硬采样方法的有效性。

Title: Unveiling the Key Factors for Distilling Chain-of-Thought Reasoning

Authors: Xinghao Chen, Zhijing Sun, Wenjin Guo, Miaoran Zhang, Yanjun Chen, Yirong Sun, Hui Su, Yijie Pan, Dietrich Klakow, Wenjie Li, Xiaoyu Shen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.18001
Pdf URL: https://arxiv.org/pdf/2502.18001
Copy Paste: [[2502.18001]] Unveiling the Key Factors for Distilling Chain-of-Thought Reasoning(https://arxiv.org/abs/2502.18001)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Large Language Models (LLMs) excel in reasoning tasks through Chain-of-Thought (CoT) prompting. However, CoT prompting greatly increases computational demands, which has prompted growing interest in distilling CoT capabilities into Small Language Models (SLMs). This study systematically examines the factors influencing CoT distillation, including the choice of granularity, format and teacher model. Through experiments involving four teacher models and seven student models across seven mathematical and commonsense reasoning datasets, we uncover three key findings: (1) Unlike LLMs, SLMs exhibit a non-monotonic relationship with granularity, with stronger models benefiting from finer-grained reasoning and weaker models performing better with simpler CoT supervision; (2) CoT format significantly impacts LLMs but has minimal effect on SLMs, likely due to their reliance on supervised fine-tuning rather than pretraining preferences; (3) Stronger teacher models do NOT always produce better student models, as diversity and complexity in CoT supervision can outweigh accuracy alone. These findings emphasize the need to tailor CoT strategies to the specific student model, offering actionable insights for optimizing CoT distillation in SLMs. The code and datasets are available at this https URL.
摘要：大型语言模型（LLMS）在推理任务（COT）提示中表现出色。但是，COT促使COT极大地增加了计算需求，这引发了人们对将COT功能提炼成小语言模型（SLM）的日益兴趣。这项研究系统地研究了影响COT蒸馏的因素，包括选择粒度，格式和教师模型。通过涉及七个数学和常识性推理数据集中的四个教师模型和七个学生模型的实验，我们发现了三个关键发现：（1）与LLMS不同，SLM与粒度表现出非单调关系，并且更强的模型受益于良好的元素推理和更加元素的推理和更强的模型。通过更简单的COT监督，较弱的模型表现更好；（2）COT格式显着影响LLM，但对SLM的影响很小，这可能是由于它们依赖于监督的微调而不是训练偏好的偏好；（3）更强大的教师模型并不总是会产生更好的学生模型，因为COT监督的多样性和复杂性仅能超过准确性。这些发现强调了为特定的学生模型量身定制COT策略的必要性，提供了可行的见解，以优化SLM中的COT蒸馏。代码和数据集可在此HTTPS URL上找到。

Title: Verdict: A Library for Scaling Judge-Time Compute

Authors: Nimit Kalra, Leonard Tang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.18018
Pdf URL: https://arxiv.org/pdf/2502.18018
Copy Paste: [[2502.18018]] Verdict: A Library for Scaling Judge-Time Compute(https://arxiv.org/abs/2502.18018)
Keywords: llm, hallucination, prompt
Abstract: The use of LLMs as automated judges ("LLM-as-a-judge") is now widespread, yet standard judges suffer from a multitude of reliability issues. To address these challenges, we introduce Verdict, an open-source library for scaling judge-time compute to enhance the accuracy, reliability, and interpretability of automated evaluators. Verdict leverages the composition of modular reasoning units -- such as verification, debate, and aggregation -- and increased inference-time compute to improve LLM judge quality. Across a variety of challenging tasks such as content moderation, fact-checking, and hallucination detection, Verdict judges achieve state-of-the-art (SOTA) or near-SOTA performance, surpassing orders-of-magnitude larger fine-tuned judges, prompted judges, and reasoning models. Ultimately, we hope Verdict serves as a useful framework for researchers and practitioners building scalable, interpretable, and reliable LLM-based evaluators.
摘要：现在，使用LLMS作为自动法官（“ LLM-AS-A-Gudge”）已广泛，但标准法官却遭受了多种可靠性问题的困扰。为了应对这些挑战，我们介绍了判决，这是一个开源库，用于扩展法官时间计算，以增强自动化评估者的准确性，可靠性和解释性。判决利用模块化推理单元的组成（例如验证，辩论和聚合），并增加推理时间计算以提高LLM法官质量。在各种具有挑战性的任务中，例如内容审核，事实检查和幻觉检测，判决法官达到了最先进的（SOTA）或近距离表现，超过了较大的较大的微调法官，提示法官和推理模型。最终，我们希望判决为研究人员和从业人员建立可扩展，可解释且可靠的基于LLM的评估者的有用框架。

Title: AfroXLMR-Comet: Multilingual Knowledge Distillation with Attention Matching for Low-Resource languages

Authors: Joshua Sakthivel Raju, Sanjay S, Jaskaran Singh Walia, Srinivas Raghav, Vukosi Marivate
Subjects: cs.CL, cs.AI, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2502.18020
Pdf URL: https://arxiv.org/pdf/2502.18020
Copy Paste: [[2502.18020]] AfroXLMR-Comet: Multilingual Knowledge Distillation with Attention Matching for Low-Resource languages(https://arxiv.org/abs/2502.18020)
Keywords: language model
Abstract: Language model compression through knowledge distillation has emerged as a promising approach for deploying large language models in resource-constrained environments. However, existing methods often struggle to maintain performance when distilling multilingual models, especially for low-resource languages. In this paper, we present a novel hybrid distillation approach that combines traditional knowledge distillation with a simplified attention matching mechanism, specifically designed for multilingual contexts. Our method introduces an extremely compact student model architecture, significantly smaller than conventional multilingual models. We evaluate our approach on five African languages: Kinyarwanda, Swahili, Hausa, Igbo, and Yoruba. The distilled student model, AfroXLMR-Comet, successfully captures both the output distribution and internal attention patterns of a larger teacher model (AfroXLMR-Large) while reducing the model size by over 85%. Experimental results demonstrate that our hybrid approach achieves competitive performance compared to the teacher model, maintaining an accuracy within 85% of the original model's performance while requiring substantially fewer computational resources. Our work provides a practical framework for deploying efficient multilingual models in resource-constrained environments, particularly benefiting applications involving African languages.
摘要：通过知识蒸馏的语言模型压缩已成为在资源受限环境中部署大型语言模型的一种有前途的方法。但是，现有的方法通常在蒸馏多语言模型时通常难以保持性能，尤其是对于低资源语言。在本文中，我们提出了一种新型的混合蒸馏方法，该方法将传统知识蒸馏与简化的注意匹配机制相结合，该机制是专门为多语言环境设计的。我们的方法引入了非常紧凑的学生模型架构，比传统的多语言模型要小得多。我们评估了五种非洲语言的方法：Kinyarwanda，Swahili，Hausa，Igbo和Yoruba。蒸馏学生模型； Afroxlmr-comp成功地捕获了较大的教师模型（Afroxlmr-large）的输出分布和内部注意力模式，同时将模型大小降低了85％以上。实验结果表明，与教师模型相比，我们的混合方法可以实现竞争性能，在原始模型绩效的85％之内保持准确性，同时需要更少的计算资源。我们的工作为在资源受限的环境中部署有效的多语言模型提供了一个实用的框架，尤其是受益于涉及非洲语言的应用程序。
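
To make the idea of combining output-distribution distillation with attention matching concrete, here is a rough sketch of one common way such a hybrid loss is built: a temperature-softened KL term on the logits plus a mean-squared-error term on attention maps. The abstract does not give the loss, so the weighting `alpha`, the temperature, the MSE choice, and all function names are assumptions rather than the paper's formulation.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def attention_mse(student_attn, teacher_attn):
    """Mean squared error between two attention maps (lists of rows)."""
    diffs = [
        (s - t) ** 2
        for s_row, t_row in zip(student_attn, teacher_attn)
        for s, t in zip(s_row, t_row)
    ]
    return sum(diffs) / len(diffs)


def hybrid_loss(student_logits, teacher_logits, student_attn, teacher_attn,
                alpha=0.5, temperature=2.0):
    """Weighted sum of the two terms; alpha is a hypothetical mixing weight."""
    kd = kd_loss(student_logits, teacher_logits, temperature)
    attn = attention_mse(student_attn, teacher_attn)
    return alpha * kd + (1 - alpha) * attn


# Toy values: the loss is zero when the student reproduces the teacher exactly.
teacher_logits = [2.0, 0.5, -1.0]
teacher_attn = [[0.7, 0.3], [0.4, 0.6]]
print(hybrid_loss(teacher_logits, teacher_logits, teacher_attn, teacher_attn))
print(hybrid_loss([0.0, 0.0, 0.0], teacher_logits, [[0.5, 0.5], [0.5, 0.5]], teacher_attn))
```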

Title: Detecting Knowledge Boundary of Vision Large Language Models by Sampling-Based Inference

Authors: Zhuo Chen, Xinyu Wang, Yong Jiang, Zhen Zhang, Xinyu Geng, Pengjun Xie, Fei Huang, Kewei Tu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.18023
Pdf URL: https://arxiv.org/pdf/2502.18023
Copy Paste: [[2502.18023]] Detecting Knowledge Boundary of Vision Large Language Models by Sampling-Based Inference(https://arxiv.org/abs/2502.18023)
Keywords: language model, llm, retrieval augmented generation
Abstract: Despite the advancements made in Visual Large Language Models (VLLMs), like text Large Language Models (LLMs), they have limitations in addressing questions that require real-time information or are knowledge-intensive. Indiscriminately adopting Retrieval Augmented Generation (RAG) techniques is an effective yet expensive way to enable models to answer queries beyond their knowledge scopes. To mitigate the dependence on retrieval and simultaneously maintain, or even improve, the performance benefits provided by retrieval, we propose a method to detect the knowledge boundary of VLLMs, allowing for more efficient use of techniques like RAG. Specifically, we propose a method with two variants that fine-tunes a VLLM on an automatically constructed dataset for boundary identification. Experimental results on various types of Visual Question Answering datasets show that our method successfully depicts a VLLM's knowledge boundary based on which we are able to reduce indiscriminate retrieval while maintaining or improving the performance. In addition, we show that the knowledge boundary identified by our method for one VLLM can be used as a surrogate boundary for other VLLMs. Code will be released at this https URL
摘要：尽管视觉大语模型（VLLM）（例如文本大语言模型（LLM））的进步在解决需要实时信息或知识密集的问题时仍存在局限性。不加选择地采用检索增强发电（RAG）技术是一种有效但昂贵的方法，使模型能够回答超出其知识范围的查询。为了减轻对检索的依赖，并同时维护甚至改善了取回所提供的性能益处，我们提出了一种检测VLLMS知识边界的方法，从而可以更有效地利用诸如RAG之类的技术。具体而言，我们提出了一种具有两个变体的方法，该方法在自动构造的数据集上微调VLLM以进行边界识别。各种视觉问题答录数据集的实验结果表明，我们的方法成功地描绘了VLLM的知识边界，我们能够在维持或改善性能的同时减少不加选择的检索。此外，我们表明，我们的方法对一个VLLM识别的知识边界可以用作其他VLLM的替代边界。代码将在此HTTPS URL上发布

Title: Harnessing Multiple Large Language Models: A Survey on LLM Ensemble

Authors: Zhijun Chen, Jingzheng Li, Pengpeng Chen, Zhuoran Li, Kai Sun, Yuankai Luo, Qianren Mao, Dingqi Yang, Hailong Sun, Philip S. Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.18036
Pdf URL: https://arxiv.org/pdf/2502.18036
Copy Paste: [[2502.18036]] Harnessing Multiple Large Language Models: A Survey on LLM Ensemble(https://arxiv.org/abs/2502.18036)
Keywords: language model, llm
Abstract: LLM Ensemble -- which involves the comprehensive use of multiple large language models (LLMs), each aimed at handling user queries during downstream inference, to benefit from their individual strengths -- has gained substantial attention recently. The widespread availability of LLMs, coupled with their varying strengths and out-of-the-box usability, has profoundly advanced the field of LLM Ensemble. This paper presents the first systematic review of recent developments in LLM Ensemble. First, we introduce our taxonomy of LLM Ensemble and discuss several related research problems. Then, we provide a more in-depth classification of the methods under the broad categories of "ensemble-before-inference, ensemble-during-inference, ensemble-after-inference", and review all relevant methods. Finally, we introduce related benchmarks and applications, summarize existing studies, and suggest several future research directions. A curated list of papers on LLM Ensemble is available at this https URL.
摘要：LLM Ensemble（涉及多种大型语言模型（LLM）的全面使用，旨在在下游推理期间处理用户查询，以使其从个人优势中受益 - 最近引起了大量关注。 LLM的广泛可用性，加上它们的各种优势和开箱即用的可用性，已深刻地推进了LLM Ensemble领域。本文介绍了对LLM Ensemble最近发展的首次系统评价。首先，我们介绍了LLM Ensemble的分类法，并讨论一些相关的研究问题。然后，我们在“集合 - 前提，合奏 - 削减，集合 - 合奏”的广泛类别下提供了更深入的分类，并查看所有相关方法。最后，我们介绍了相关的基准和应用程序，总结了现有的研究，并提出了一些未来的研究指示。在此HTTPS URL上可以找到LLM Ensemble上策划的论文列表。

Title: Towards Thinking-Optimal Scaling of Test-Time Compute for LLM Reasoning

Authors: Wenkai Yang, Shuming Ma, Yankai Lin, Furu Wei
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.18080
Pdf URL: https://arxiv.org/pdf/2502.18080
Copy Paste: [[2502.18080]] Towards Thinking-Optimal Scaling of Test-Time Compute for LLM Reasoning(https://arxiv.org/abs/2502.18080)
Keywords: language model, llm
Abstract: Recent studies have shown that making a model spend more time thinking through longer Chains of Thought (CoTs) enables it to gain significant improvements in complex reasoning tasks. While current research continues to explore the benefits of increasing test-time compute by extending the CoT lengths of Large Language Models (LLMs), we are concerned about a potential issue hidden behind the current pursuit of test-time scaling: Would excessively scaling the CoT length actually bring adverse effects to a model's reasoning performance? Our explorations on mathematical reasoning tasks reveal an unexpected finding that scaling with longer CoTs can indeed impair the reasoning performance of LLMs in certain domains. Moreover, we discover that there exists an optimal scaled length distribution that differs across different domains. Based on these insights, we propose a Thinking-Optimal Scaling strategy. Our method first uses a small set of seed data with varying response length distributions to teach the model to adopt different reasoning efforts for deep thinking. Then, the model selects its shortest correct response under different reasoning efforts on additional problems for self-improvement. Our self-improved models built upon Qwen2.5-32B-Instruct outperform other distillation-based 32B o1-like models across various math benchmarks, and achieve performance on par with QwQ-32B-Preview.
摘要：最近的研究表明，使模型花费更多的时间通过更长的思想链（COT）进行思考，从而使其能够在复杂的推理任务中获得重大改进。尽管当前的研究继续通过扩展大型语言模型（LLMS）的COT长度来探索增加测试时间计算的好处，但我们担心当前追求测试时间扩展的潜在问题：会过多地扩展COT长度实际上给模型的推理表现带来了不利影响吗？我们对数学推理任务的探索揭示了一个意外的发现，即使用更长的COTS缩放确实会损害某些域中LLM的推理性能。此外，我们发现存在最佳的缩放长度分布，在不同的域之间有所不同。基于这些见解，我们提出了一种理想的缩放策略。我们的方法首先使用一小部分种子数据，这些数据具有不同的响应长度分布，以教导该模型采用不同的推理工作进行深思熟虑。然后，该模型在不同的推理工作中选择其最短的正确响应，以解决自我完善的其他问题。我们基于QWEN2.5-32B - 教学构建的自我改进的模型在各种数学基准中都优于其他基于蒸馏的32B O1型模型，并在QWQ-32B-Preview中实现性能。
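
The selection step described in the abstract above (keep the shortest correct response among those produced under different reasoning efforts) might look roughly like the following. This is only a sketch: the effort labels, the exact-match correctness check, and measuring length in whitespace-separated tokens are assumptions, not details taken from the paper.

```python
def shortest_correct(responses_by_effort, reference_answer):
    """Return (effort, reasoning_text) for the shortest correct response.

    responses_by_effort maps an effort label to a
    (reasoning_text, final_answer) pair. Length is counted in
    whitespace-separated tokens (an assumption). Returns None when no
    response reaches the reference answer.
    """
    correct = [
        (len(text.split()), effort, text)
        for effort, (text, answer) in responses_by_effort.items()
        if answer == reference_answer
    ]
    if not correct:
        return None
    _, effort, text = min(correct)
    return effort, text


# Hypothetical samples for one problem whose reference answer is "4".
samples = {
    "low": ("2 + 2 = 5", "5"),
    "medium": ("2 + 2 is 4 because adding two twice gives four", "4"),
    "high": (
        "Let me reason slowly. Two plus two means adding two to two, "
        "which gives four, so the answer is 4",
        "4",
    ),
}
print(shortest_correct(samples, "4"))
```

In this toy case the low-effort response is the shortest overall but wrong, so the medium-effort response is the one kept as self-improvement data.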

Title: Uncertainty Quantification in Retrieval Augmented Question Answering

Authors: Laura Perez-Beltrachini, Mirella Lapata
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.18108
Pdf URL: https://arxiv.org/pdf/2502.18108
Copy Paste: [[2502.18108]] Uncertainty Quantification in Retrieval Augmented Question Answering(https://arxiv.org/abs/2502.18108)
Keywords: hallucination
Abstract: Retrieval augmented Question Answering (QA) helps QA models overcome knowledge gaps by incorporating retrieved evidence, typically a set of passages, alongside the question at test time. Previous studies show that this approach improves QA performance and reduces hallucinations, without, however, assessing whether the retrieved passages are indeed useful at answering correctly. In this work, we propose to quantify the uncertainty of a QA model via estimating the utility of the passages it is provided with. We train a lightweight neural model to predict passage utility for a target QA model and show that while simple information theoretic metrics can predict answer correctness up to a certain extent, our approach efficiently approximates or outperforms more expensive sampling-based methods. Code and data are available at this https URL.
摘要：检索增强问答答案（QA）通过合并检索到的证据（通常是一组段落）以及在测试时间的问题以及问题来帮助质量保证模型克服知识差距。先前的研究表明，这种方法可改善质量检查的性能并降低幻觉，但是，没有评估检索到的段落是否确实有助于正确回答。在这项工作中，我们建议通过估计提供的段落的效用来量化质量检查模型的不确定性。我们训练一个轻量级的神经模型，以预测目标质量质量质量检查模型的通道实用程序，并表明，尽管简单信息理论指标可以在一定程度上预测答案正确性，但我们的方法有效地近似或胜过更昂贵的基于抽样的方法。代码和数据可在此HTTPS URL上找到。

Title: LevelRAG: Enhancing Retrieval-Augmented Generation with Multi-hop Logic Planning over Rewriting Augmented Searchers

Authors: Zhuocheng Zhang, Yang Feng, Min Zhang
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2502.18139
Pdf URL: https://arxiv.org/pdf/2502.18139
Copy Paste: [[2502.18139]] LevelRAG: Enhancing Retrieval-Augmented Generation with Multi-hop Logic Planning over Rewriting Augmented Searchers(https://arxiv.org/abs/2502.18139)
Keywords: language model, gpt, llm, hallucination, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) is a crucial method for mitigating hallucinations in Large Language Models (LLMs) and integrating external knowledge into their responses. Existing RAG methods typically employ query rewriting to clarify the user intent and manage multi-hop logic, while using hybrid retrieval to expand search scope. However, the tight coupling of query rewriting to the dense retriever limits its compatibility with hybrid retrieval, impeding further RAG performance improvements. To address this challenge, we introduce a high-level searcher that decomposes complex queries into atomic queries, independent of any retriever-specific optimizations. Additionally, to harness the strengths of sparse retrievers for precise keyword retrieval, we have developed a new sparse searcher that employs Lucene syntax to enhance retrieval accuracy. Alongside web and dense searchers, these components seamlessly collaborate within our proposed method, LevelRAG. In LevelRAG, the high-level searcher orchestrates the retrieval logic, while the low-level searchers (sparse, web, and dense) refine the queries for optimal retrieval. This approach enhances both the completeness and accuracy of the retrieval process, overcoming challenges associated with current query rewriting techniques in hybrid retrieval scenarios. Empirical experiments conducted on five datasets, encompassing both single-hop and multi-hop question answering tasks, demonstrate the superior performance of LevelRAG compared to existing RAG methods. Notably, LevelRAG outperforms the state-of-the-art proprietary model, GPT4o, underscoring its effectiveness and potential impact on the RAG field.
摘要：检索增强的生成（RAG）是减轻大语言模型（LLM）中幻觉并将外部知识整合到其反应中的关键方法。现有的抹布方法通常采用查询重写来阐明用户意图并管理多跳逻辑，同时使用混合检索来扩展搜索范围。但是，查询重写与密集回猎商的紧密耦合限制了其与混合检索的兼容性，阻碍了进一步的破布性能。为了应对这一挑战，我们引入了一个高级搜索器，将复杂的查询分解为原子查询，而与任何猎犬特定的优化无关。此外，为了利用稀疏检索器的优势进行精确的关键字检索，我们开发了一个新的稀疏搜索器，该稀疏搜索器采用Lucene语法来增强此HTTP URL网络和密集搜索者的检索，这些组件在我们建议的方法中无缝协作，。在LevelRag中，高级搜索器协调检索逻辑，而低级搜索者（稀疏，网络和密集）则完善了查询以进行最佳检索。这种方法增强了检索过程的完整性和准确性，克服了与当前查询重写技术相关的挑战。在五个数据集上进行的经验实验，包括单跳和多跳的问题答案任务，证明了与现有的抹布方法相比，LevelRag的出色性能。值得注意的是，LevelRag的表现优于最先进的专有模型GPT4O，强调了其有效性和对抹布场的潜在影响。

Title: NusaAksara: A Multimodal and Multilingual Benchmark for Preserving Indonesian Indigenous Scripts

Authors: Muhammad Farid Adilazuarda, Musa Izzanardi Wijanarko, Lucky Susanto, Khumaisa Nur'aini, Derry Wijaya, Alham Fikri Aji
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.18148
Pdf URL: https://arxiv.org/pdf/2502.18148
Copy Paste: [[2502.18148]] NusaAksara: A Multimodal and Multilingual Benchmark for Preserving Indonesian Indigenous Scripts(https://arxiv.org/abs/2502.18148)
Keywords: gpt, llm
Abstract: Indonesia is rich in languages and scripts. However, most NLP progress has been made using romanized text. In this paper, we present NusaAksara, a novel public benchmark for Indonesian languages that includes their original scripts. Our benchmark covers both text and image modalities and encompasses diverse tasks such as image segmentation, OCR, transliteration, translation, and language identification. Our data is constructed by human experts through rigorous steps. NusaAksara covers 8 scripts across 7 languages, including low-resource languages not commonly seen in NLP benchmarks. Although unsupported by Unicode, the Lampung script is included in this dataset. We benchmark our data across several models, from LLMs and VLMs such as GPT-4o, Llama 3.2, and Aya 23 to task-specific systems such as PP-OCR and LangID, and show that most NLP technologies cannot handle Indonesia's local scripts, with many achieving near-zero performance.
摘要：印度尼西亚拥有丰富的语言和脚本。但是，大多数NLP的进度都是使用罗马化文本进行的。在本文中，我们介绍了Nusaaksara，这是印尼语言的新型公共基准，其中包括其原始脚本。我们的基准涵盖了文本和图像方式，并涵盖了各种任务，例如图像分割，OCR，音译，翻译和语言标识。我们的数据是由人类专家通过严格的步骤构建的。 Nusaaksara涵盖了7种语言的8个脚本，包括NLP基准中常见的低资源语言。尽管Unicode不支持，但该数据集包含Lampung脚本。我们通过LLM和VLMS（例如GPT-4O，LLAMA 3.2和AYA 23）到诸如PP-OCR和Langid等任务的系统的几种模型进行基准测试，并表明大多数NLP技术无法处理印尼本地脚本，以及许多人的性能接近零。

Title: Can LLMs Explain Themselves Counterfactually?

Authors: Zahra Dehghanighobadi, Asja Fischer, Muhammad Bilal Zafar
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.18156
Pdf URL: https://arxiv.org/pdf/2502.18156
Copy Paste: [[2502.18156]] Can LLMs Explain Themselves Counterfactually?(https://arxiv.org/abs/2502.18156)
Keywords: language model, llm, prompt
Abstract: Explanations are an important tool for gaining insights into the behavior of ML models, calibrating user trust and ensuring regulatory compliance. The past few years have seen a flurry of post-hoc methods for generating model explanations, many of which involve computing model gradients or solving specially designed optimization problems. However, owing to the remarkable reasoning abilities of Large Language Models (LLMs), self-explanation, that is, prompting the model to explain its outputs, has recently emerged as a new paradigm. In this work, we study a specific type of self-explanation, self-generated counterfactual explanations (SCEs). We design tests for measuring the efficacy of LLMs in generating SCEs. Analysis over various LLM families, model sizes, temperature settings, and datasets reveals that LLMs sometimes struggle to generate SCEs. Even when they do, their prediction often does not agree with their own counterfactual reasoning.
摘要：解释是获得ML模型行为，校准用户信任并确保监管合规性的重要工具。过去几年已经看到了一系列事后方法来生成模型解释，其中许多涉及计算模型梯度或解决特殊设计的优化问题。但是，由于大语言模型（LLMS）的显着推理能力，即自称，也就是说，促使该模型解释其输出最近已成为一种新范式。在这项工作中，我们研究了一种特定类型的自我解释，自我产生的反事实解释（SCES）。我们设计了测量LLM在生成SCE中的功效的测试。对各种LLM系列，模型尺寸，温度设置和数据集进行的分析表明，LLM有时很难生成SCE。即使他们这样做，他们的预测也常常不同意他们自己的反事实推理。

Title: SECURA: Sigmoid-Enhanced CUR Decomposition with Uninterrupted Retention and Low-Rank Adaptation in Large Language Models

Authors: Zhang Yuxuan, Li Ruizhe
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.18168
Pdf URL: https://arxiv.org/pdf/2502.18168
Copy Paste: [[2502.18168]] SECURA: Sigmoid-Enhanced CUR Decomposition with Uninterrupted Retention and Low-Rank Adaptation in Large Language Models(https://arxiv.org/abs/2502.18168)
Keywords: language model, llm
Abstract: With the rapid development of large language models (LLMs), full fine-tuning (FT) of these models has become increasingly impractical due to the high computational demands. Additionally, FT can lead to catastrophic forgetting. As an alternative, Low-Rank Adaptation (LoRA) has been proposed, which fine-tunes only a small subset of parameters, achieving similar performance to FT while significantly reducing resource requirements. However, since LoRA inherits FT's design, the issue of catastrophic forgetting remains. To address these challenges, we propose SECURA: Sigmoid-Enhanced CUR Decomposition LoRA, a novel parameter-efficient fine-tuning (PEFT) variant that mitigates catastrophic forgetting while improving fine-tuning performance. Our method introduces a new normalization technique, SigNorm, to enhance parameter retention and overall performance. SECURA has been evaluated on a variety of tasks, including mathematical problem-solving (GSM8K), challenging question-answering (CNNDM), translation (NewsDE), and complex multiple-choice reasoning (LogiQA). Experimental results show that SECURA achieves an average fine-tuning improvement of 3.59% across four multiple-choice question (MCQ) tasks and a 2.51% improvement across five question-answering (QA) tasks on models such as Gemma2 2b, Qwen2 1.5b, Qwen2 7b, Llama3 8b, and Llama3.1 8b, compared to DoRA. Moreover, SECURA demonstrates superior knowledge retention capabilities, maintaining more than 70% accuracy on basic LLM knowledge across 16 continual learning tests, outperforming Experience Replay (ER), Sequential Learning (SEQ), EWC, I-LoRA, and CUR-LoRA.
摘要：随着大型语言模型（LLM）的快速发展，由于高计算需求，这些模型完全微调（FT）变得越来越不切实际。此外，英尺可能导致灾难性遗忘。作为替代方案，已经提出了低级适应性（LORA），它仅微调一小部分参数，实现与FT相似的性能，同时显着降低了资源要求。但是，由于Lora继承了FT的设计，因此仍然存在灾难性遗忘的问题。为了应对这些挑战，我们提出了Secura：Sigmoid增强的CUR分解Lora，这是一种新型的参数有效的微调（PEFT）变体，可减轻灾难性的遗忘，同时改善微调性能。我们的方法引入了一种新的归一化技术，即符号，以增强参数保留和整体性能。已对SECURA进行了各种任务的评估，包括数学解决问题（GSM8K），具有挑战性的问答解决方案（CNNDM），翻译（NewsDE）和复杂的多项选择性推理（LogiQA）。实验结果表明，SECURA在四个多项选择问题（MCQ）任务中的平均微调提高了3.59％，并且在诸如GEMMA2 2B，QWEN2 1.5B，QWEN2 1.5B等模型上的五个问题效应（QA）任务中提高了2.51％。 QWEN 2 7B，LLAMA3 8B和LLAMA3.1 8B，与Dora相比。此外，Secura展示了优越的知识保留能力，在16个持续学习测试中保持了70％以上的精度，超过了经验重播（ER），顺序学习（SEQ），EWC，I-Lora和Cur-Lora。

Title: Problem Solved? Information Extraction Design Space for Layout-Rich Documents using LLMs

Authors: Gaye Colakoglu, Gürkan Solmaz, Jonathan Fürst
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.18179
Pdf URL: https://arxiv.org/pdf/2502.18179
Copy Paste: [[2502.18179]] Problem Solved? Information Extraction Design Space for Layout-Rich Documents using LLMs(https://arxiv.org/abs/2502.18179)
Keywords: language model, llm, prompt
Abstract: This paper defines and explores the design space for information extraction (IE) from layout-rich documents using large language models (LLMs). The three core challenges of layout-aware IE with LLMs are 1) data structuring, 2) model engagement, and 3) output refinement. Our study delves into the sub-problems within these core challenges, such as input representation, chunking, prompting, and selection of LLMs and multimodal models. It examines the outcomes of different design choices through a new layout-aware IE test suite, benchmarking against the state-of-the-art (SoA) model LayoutLMv3. The results show that the configuration from a one-factor-at-a-time (OFAT) trial achieves near-optimal results, with a 14.1-point F1-score gain over the baseline model, while full factorial exploration yields only a slightly higher 15.1-point gain at around 36x greater token usage. We demonstrate that well-configured general-purpose LLMs can match the performance of specialized models, providing a cost-effective alternative. Our test suite is freely available at this https URL.
摘要：本文使用大型语言模型（LLMS）定义并探讨了来自布局丰富文档的信息提取（IE）的设计空间。 LLM的布局意识到IE的三个核心挑战是1）数据构造，2）模型参与度和3）输出细化。我们的研究深入研究了这些核心挑战中的子问题，例如输入表示，分解，提示和选择LLMS和多模型。它通过新的布局意识IE测试套件来检查不同设计选择的结果，并针对Art-Art（SOA）模型Layoutlmv3进行了基准测试。结果表明，来自一次（OFAT）试验的配置实现了接近最佳的结果，而基线模型的14.1分F1分数获得增益，而全阶段探索仅产生稍高的15.1点。大约36倍的令牌用法。我们证明，配置良好的通用LLM可以与专业模型的性能相匹配，从而提供了具有成本效益的替代方案。我们的测试套件可在此HTTPS URL上免费获得。
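
To make the cost gap between OFAT and full factorial exploration concrete, the following sketch compares the two search strategies on a toy design space. The factor names, levels, and additive scoring function are hypothetical placeholders, not the paper's test suite; with a non-additive score, OFAT is not guaranteed to find the optimum.

```python
from itertools import product

def ofat_search(space, score):
    """Vary one factor at a time, keeping the best level found so far."""
    config = {name: levels[0] for name, levels in space.items()}
    best = score(config)
    evals = 1  # the baseline configuration
    for name, levels in space.items():
        for level in levels[1:]:
            trial = dict(config, **{name: level})
            s = score(trial)
            evals += 1
            if s > best:
                best, config = s, trial
    return config, best, evals

def full_factorial_search(space, score):
    """Evaluate every combination of levels."""
    names = list(space)
    best_cfg, best, evals = None, float("-inf"), 0
    for combo in product(*(space[n] for n in names)):
        cfg = dict(zip(names, combo))
        s = score(cfg)
        evals += 1
        if s > best:
            best_cfg, best = cfg, s
    return best_cfg, best, evals

# Hypothetical design space and made-up additive scores, for illustration only.
TOY_SPACE = {
    "input": ["plain", "markdown", "json"],
    "chunking": ["small", "medium", "large"],
    "prompt": ["zero-shot", "few-shot", "cot"],
}
TOY_WEIGHTS = {
    "plain": 0, "markdown": 2, "json": 1,
    "small": 0, "medium": 3, "large": 1,
    "zero-shot": 0, "few-shot": 2, "cot": 4,
}

def toy_score(cfg):
    return sum(TOY_WEIGHTS[level] for level in cfg.values())
```

On this toy space OFAT needs 1 + 3 * 2 = 7 evaluations while the full factorial needs 3^3 = 27, and the gap widens multiplicatively as factors are added.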

Title: Grandes modelos de lenguaje: de la predicción de palabras a la comprensión?

Authors: Carlos Gómez-Rodríguez
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2502.18205
Pdf URL: https://arxiv.org/pdf/2502.18205
Copy Paste: [[2502.18205]] Grandes modelos de lenguaje: de la predicción de palabras a la comprensión?(https://arxiv.org/abs/2502.18205)
Keywords: language model, gpt, chat
Abstract: Large language models, such as the well-known ChatGPT, have brought about an unexpected revolution in the field of artificial intelligence. On the one hand, they have numerous practical applications and enormous potential still to be explored. On the other hand, they are also the subject of debate from scientific, philosophical, and social perspectives: there are doubts about the exact mechanisms of their functioning and their actual capacity for language comprehension, and their applications raise ethical dilemmas. In this chapter, we describe how this technology has been developed and the fundamentals of its operation, allowing us to better understand its capabilities and limitations and to introduce some of the main debates surrounding its development and use. -- Los grandes modelos de lenguaje, como el conocido ChatGPT, han supuesto una inesperada revolución en el ámbito de la inteligencia artificial. Por un lado, cuentan con multitud de aplicaciones prácticas y un enorme potencial todavía por explorar. Por otro lado, son también objeto de debate, tanto desde el punto de vista científico y filosófico como social: hay dudas sobre los mecanismos exactos de su funcionamiento y su capacidad real de comprensión del lenguaje, y sus aplicaciones plantean dilemas éticos. En este capítulo describimos cómo se ha llegado a esta tecnología y los fundamentos de su funcionamiento, permitiéndonos así comprender mejor sus capacidades y limitaciones e introducir algunos de los principales debates que rodean su desarrollo y uso.
摘要：大型语言模型（例如著名的ChatGPT）在人工智能领域带来了一场意想不到的革命。一方面，他们仍有许多实际应用和巨大的潜力待探索。另一方面，它们也是从科学，哲学和社会角度来看的辩论的主题：对其功能的确切机制及其实际的语言理解能力以及其应用带来了道德困境，这是有疑问的。在本章中，我们描述了如何开发该技术以及其运营的基本原理，使我们能够更好地了解其能力和局限性，并介绍有关其开发和使用的一些主要辩论。 -- Los grandes modelos de lenguaje, como el conocido ChatGPT, han supuesto una inesperada revolución en el ámbito de la inteligencia artificial. Por un lado, cuentan con multitud de aplicaciones prácticas y un enorme potencial todavía por explorar. Por otro lado, son también objeto de debate, tanto desde el punto de vista científico y filosófico como social: hay dudas sobre los mecanismos exactos de su funcionamiento y su capacidad real de comprensión del lenguaje, y sus aplicaciones plantean dilemas éticos. En este capítulo describimos cómo se ha llegado a esta tecnología y los fundamentos de su funcionamiento, permitiéndonos así comprender mejor sus capacidades y limitaciones e introducir algunos de los principales debates que rodean su desarrollo y uso.

Title: LAG: LLM agents for Leaderboard Auto Generation on Demanding

Authors: Jian Wu, Jiayu Zhang, Dongyuan Li, Linyi Yang, Aoxiao Zhong, Renhe Jiang, Qingsong Wen, Yue Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.18209
Pdf URL: https://arxiv.org/pdf/2502.18209
Copy Paste: [[2502.18209]] LAG: LLM agents for Leaderboard Auto Generation on Demanding(https://arxiv.org/abs/2502.18209)
Keywords: language model, llm, prompt, agent
Abstract: This paper introduces Leaderboard Auto Generation (LAG), a novel and well-organized framework for automatic generation of leaderboards on a given research topic in rapidly evolving fields like Artificial Intelligence (AI). With a large number of AI papers updated daily, it becomes difficult for researchers to track every paper's proposed methods, experimental results, and settings, prompting the need for efficient automatic leaderboard construction. While large language models (LLMs) offer promise in automating this process, challenges such as multi-document summarization, leaderboard generation, and fair experimental comparison remain underexplored. LAG addresses these challenges through a systematic approach that involves paper collection, extraction and integration of experimental results, leaderboard generation, and quality evaluation. Our contributions include a comprehensive solution to the leaderboard construction problem, a reliable evaluation method, and experimental results showing the high quality of the generated leaderboards.
摘要：本文介绍了排行榜自动发电（LAG），这是一个新颖且组织良好的框架，用于在诸如人工智能（AI）之类的迅速发展领域的给定研究主题上自动生成排行榜。面对每天更新的大量AI论文，研究人员很难跟踪每个论文提出的方法，实验结果和设置，从而促使需要有效的自动排行榜构造。尽管大型语言模型（LLMS）在自动化此过程时提供了希望，但诸如多文章摘要，排行榜生成和实验公平比较之类的挑战仍在探索中。滞后通过系统的方法解决了这些挑战，该方法涉及纸张收集，实验结果提取和集成，排行榜生成和质量评估。我们的贡献包括对排行榜施工问题的全面解决方案，可靠的评估方法以及实验结果，显示了排行榜的高质量。

Title: Debt Collection Negotiations with Large Language Models: An Evaluation System and Optimizing Decision Making with Multi-Agent

Authors: Xiaofeng Wang, Zhixin Zhang, Jinguang Zheng, Yiming Ai, Rui Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.18228
Pdf URL: https://arxiv.org/pdf/2502.18228
Copy Paste: [[2502.18228]] Debt Collection Negotiations with Large Language Models: An Evaluation System and Optimizing Decision Making with Multi-Agent(https://arxiv.org/abs/2502.18228)
Keywords: language model, llm, agent
Abstract: Debt collection negotiations (DCN) are vital for managing non-performing loans (NPLs) and reducing creditor losses. Traditional methods are labor-intensive, while large language models (LLMs) offer promising automation potential. However, prior systems lacked dynamic negotiation and real-time decision-making capabilities. This paper explores LLMs in automating DCN and proposes a novel evaluation framework with 13 metrics across 4 aspects. Our experiments reveal that LLMs tend to over-concede compared to human negotiators. To address this, we propose the Multi-Agent Debt Negotiation (MADeN) framework, incorporating planning and judging modules to improve decision rationality. We also apply post-training techniques, including DPO with rejection sampling, to optimize performance. Our studies provide valuable insights for practitioners and researchers seeking to enhance efficiency and outcomes in this domain.
摘要：收债协商（DCN）对于管理不表现贷款（NPL）和减少债权人损失至关重要。传统方法是劳动密集型的，而大型语言模型（LLMS）具有有希望的自动化潜力。但是，先前的系统缺乏动态的谈判和实时决策能力。本文探讨了自动化DCN的LLM，并提出了一个新的评估框架，该框架在4个方面具有13个指标。我们的实验表明，与人类谈判者相比，LLM倾向于过度构想。为了解决这个问题，我们提出了多代理债务谈判（MADEN）框架，并结合了计划和判断模块以提高决策合理性。我们还采用培训后技术，包括带有拒绝采样的DPO来优化性能。我们的研究为寻求提高该领域效率和结果的从业者和研究人员提供了宝贵的见解。

Title: Beyond In-Distribution Success: Scaling Curves of CoT Granularity for Language Model Generalization

Authors: Ru Wang, Wei Huang, Selena Song, Haoyu Zhang, Yusuke Iwasawa, Yutaka Matsuo, Jiaxian Guo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.18273
Pdf URL: https://arxiv.org/pdf/2502.18273
Copy Paste: [[2502.18273]] Beyond In-Distribution Success: Scaling Curves of CoT Granularity for Language Model Generalization(https://arxiv.org/abs/2502.18273)
Keywords: language model, chain-of-thought
Abstract: Generalization to novel compound tasks under distribution shift is important for deploying transformer-based language models (LMs). This work investigates Chain-of-Thought (CoT) reasoning as a means to enhance OOD generalization. Through controlled experiments across several compound tasks, we reveal three key insights: (1) While QA-trained models achieve near-perfect in-distribution accuracy, their OOD performance degrades catastrophically, even with 10000k+ training examples; (2) the granularity of CoT data strongly correlates with generalization performance; finer-grained CoT data leads to better generalization; (3) CoT exhibits remarkable sample efficiency, matching QA performance with much less (even 80%) data. Theoretically, we demonstrate that compound tasks inherently permit shortcuts in Q-A data that misalign with true reasoning principles, while CoT forces internalization of valid dependency structures, and thus can achieve better generalization. Further, we show that transformer positional embeddings can amplify generalization by emphasizing subtask condition recurrence in long CoT sequences. Our combined theoretical and empirical analysis provides compelling evidence for CoT reasoning as a crucial training paradigm for enabling LM generalization under real-world distributional shifts for compound tasks.
摘要：对分配转移的新型复合任务的概括对于部署基于变压器的语言模型（LMS）很重要。这项工作调查了思想链（COT）推理，作为增强OOD概括的一种手段。通过几个复合任务的受控实验，我们揭示了三个关键见解：（1）虽然QA训练的模型达到了接近完美的分布精度，但即使有10000K+训练示例，它们的OOD性能也会降低灾难性；（2）COT数据的粒度与泛化性能密切相关；细粒的COT数据可以更好地概括。（3）COT表现出显着的样本效率，将QA性能与较少（甚至80％）的数据匹配。从理论上讲，我们证明复合任务固有地允许在Q-A数据中与真实推理原理错过的数据，而COT则迫使有效的依赖性结构内部化，从而可以实现更好的概括。此外，我们表明变压器位置嵌入可以通过强调长COT序列中的子任务条件复发来扩大概括。我们合并的理论和经验分析为COT推理提供了令人信服的证据，作为在现实世界分布转移下实现LM泛化的关键训练范式，以实现复合任务。

Title: Better Aligned with Survey Respondents or Training Data? Unveiling Political Leanings of LLMs on U.S. Supreme Court Cases

Authors: Shanshan Xu, T.Y.S.S Santosh, Yanai Elazar, Quirin Vogel, Barbara Plank, Matthias Grabmair
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.18282
Pdf URL: https://arxiv.org/pdf/2502.18282
Copy Paste: [[2502.18282]] Better Aligned with Survey Respondents or Training Data? Unveiling Political Leanings of LLMs on U.S. Supreme Court Cases(https://arxiv.org/abs/2502.18282)
Keywords: language model, llm
Abstract: The increased adoption of Large Language Models (LLMs) and their potential to shape public opinion have sparked interest in assessing these models' political leanings. Building on previous research that compared LLMs and human opinions and observed political bias in system responses, we take a step further to investigate the underlying causes of such biases by empirically examining how the values and biases embedded in training corpora shape model outputs. Specifically, we propose a method to quantitatively evaluate political leanings embedded in the large pretraining corpora. Subsequently, we investigate whether the LLMs' political leanings are more closely aligned with their pretraining corpora or with the surveyed human opinions. As a case study, we focus on probing the political leanings of LLMs in 32 U.S. Supreme Court cases, addressing contentious topics such as abortion and voting rights. Our findings reveal that LLMs strongly reflect the political leanings in their training data, and no strong correlation is observed with their alignment to human opinions as expressed in surveys. These results underscore the importance of responsible curation of training data and the need for robust evaluation metrics to ensure LLMs' alignment with human-centered values.
摘要：大型语言模型（LLM）的采用及其塑造公众舆论的潜力引发了人们对评估这些模型的政治倾向的兴趣。在比较LLM和人类意见并观察到系统反应中的政治偏见的先前研究的基础上，我们迈出了一步，通过经验研究如何嵌入在培训Corpora Shape模型输出中的价值观和偏见，来研究此类偏见的根本原因。具体而言，我们提出了一种定量评估嵌入在大型预审预周仔的政治倾向的方法。随后，我们调查了LLMS的政治倾向，他们与他们的政治倾向更加一致，他们的预处理语料库或被调查的人类意见。作为一个案例研究，我们专注于在32个美国最高法院案件中探讨LLM的政治倾向，以解决诸如堕胎和投票权之类的有争议的话题。我们的发现表明，LLMS强烈反映了他们的培训数据中的政治倾向，并且与调查中所表达的人类观点的一致性没有观察到很强的相关性。这些结果强调了负责任的培训数据策划的重要性以及对确保LLMS与以人为中心的值保持一致性的强大评估指标的需求。

Title: RefuteBench 2.0 -- Agentic Benchmark for Dynamic Evaluation of LLM Responses to Refutation Instruction

Authors: Jianhao Yan, Yun Luo, Yue Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.18308
Pdf URL: https://arxiv.org/pdf/2502.18308
Copy Paste: [[2502.18308]] RefuteBench 2.0 -- Agentic Benchmark for Dynamic Evaluation of LLM Responses to Refutation Instruction(https://arxiv.org/abs/2502.18308)
Keywords: language model, llm, long context, agent
Abstract: In the multi-turn interaction schema, large language models (LLMs) can leverage user feedback to enhance the quality and relevance of their responses. However, evaluating an LLM's ability to incorporate user refutation feedback is crucial yet challenging. In this study, we introduce RefuteBench 2.0, which significantly extends the original RefuteBench by incorporating LLM agents as refuters and evaluators, which allows for flexible and comprehensive assessment. We design both transient and persistent refutation instructions with different validity periods. Meta-evaluation shows that the LLM-based refuter could generate more human-like refutations and the evaluators could assign scores with high correlation with humans. Experimental results of various LLMs show that current models could effectively satisfy the refutation but fail to memorize the refutation information. Interestingly, we also observe that the performance of the initial task decreases as the refutations increase. Analysis of the attention scores further shows a potential weakness of current LLMs: they struggle to retain and correctly use previous information during long context dialogues. this https URL
摘要：在多转交互模式中，大语言模型（LLMS）可以利用用户反馈来提高其响应的质量和相关性。但是，评估LLM合并用户反驳反馈的能力是至关重要但充满挑战的。在这项研究中，我们介绍了反驳2.0，该研究通过将LLM代理作为反驳和评估者大大扩展了原始反驳，从而可以进行灵活而全面的评估。我们设计了有效期不同的瞬态和持续反驳说明。元评估表明，基于LLM的拒绝者可以产生更多类似人类的反驳，评估人员可以分配与人类高相关的分数。各种LLM的实验结果表明，当前模型可以有效地满足反驳，但无法记住反驳信息。有趣的是，我们还观察到，随着反驳的增加，初始任务的性能会降低。注意分数的分析进一步显示了当前LLM的潜在弱点：它们在长篇小说对话中努力保留并正确使用以前的信息。此HTTPS URL

Title: WiCkeD: A Simple Method to Make Multiple Choice Benchmarks More Challenging

Authors: Ahmed Elhady, Eneko Agirre, Mikel Artetxe
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.18316
Pdf URL: https://arxiv.org/pdf/2502.18316
Copy Paste: [[2502.18316]] WiCkeD: A Simple Method to Make Multiple Choice Benchmarks More Challenging(https://arxiv.org/abs/2502.18316)
Keywords: llm, chain-of-thought
Abstract: We introduce WiCkeD, a simple method to increase the complexity of existing multiple-choice benchmarks by randomly replacing a choice with "None of the above", a method often used in educational tests. We show that WiCkeD can be automatically applied to any existing benchmark, making it more challenging. We apply WiCkeD to 6 popular benchmarks and use it to evaluate 18 open-weight LLMs. The performance of the models drops 12.1 points on average with respect to the original versions of the datasets. When using chain-of-thought on 3 MMLU datasets, the performance drop for the WiCkeD variant is similar to the one observed when using the LLMs directly, showing that WiCkeD is also challenging for models with enhanced reasoning abilities. WiCkeD also uncovers that some models are more sensitive to the extra reasoning required, providing additional information with respect to the original benchmarks. We release our code and data at this https URL.
摘要：我们介绍了一种简单的方法，可以通过随机替换“无上述方法”的选择来提高现有多项选择基准的复杂性，这是一种经常用于教育测试的方法。我们表明，邪恶可以自动应用于任何现有的基准，使其更具挑战性。我们将邪恶的人应用于6个流行的基准测试，并使用它来评估18个开放式LLM。相对于数据集的原始版本，模型的性能平均下降了12.1点。当在3个MMLU数据集上使用经过思考的链时，邪恶变体的性能下降与直接使用LLMS时观察到的相似，这表明Wicked对于具有增强推理能力的模型而言也很具有挑战性。邪恶的还发现，某些模型对所需的额外推理更加敏感，从而提供了有关原始基准测试的其他信息。我们在此HTTPS URL上介绍代码和数据。
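
The transformation described in the abstract can be sketched in a few lines. This is a minimal illustration rather than the authors' released code: the function name is hypothetical, and placing "None of the above" as the last option is an assumption the abstract does not confirm.

```python
import random

NONE_OPTION = "None of the above"

def wicked_variant(choices, answer_idx, rng):
    """Replace one randomly chosen option with 'None of the above'.

    If the removed option was the correct answer, 'None of the above'
    becomes the new correct answer; otherwise the original answer is kept
    and its index is shifted if needed.
    """
    choices = list(choices)
    replaced = rng.randrange(len(choices))
    del choices[replaced]
    choices.append(NONE_OPTION)  # assumption: the new option goes last
    if replaced == answer_idx:
        new_answer = len(choices) - 1
    elif replaced < answer_idx:
        new_answer = answer_idx - 1
    else:
        new_answer = answer_idx
    return choices, new_answer
```

Because the replaced option is chosen uniformly, a question with four options has its correct answer removed about a quarter of the time, so a model can no longer succeed purely by picking the most plausible listed option.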

Title: Mapping of Subjective Accounts into Interpreted Clusters (MOSAIC): Topic Modelling and LLM applied to Stroboscopic Phenomenology

Authors: Romy Beauté, David J. Schwartzman, Guillaume Dumas, Jennifer Crook, Fiona Macpherson, Adam B. Barrett, Anil K. Seth
Subjects: cs.CL, q-bio.NC
Abstract URL: https://arxiv.org/abs/2502.18318
Pdf URL: https://arxiv.org/pdf/2502.18318
Copy Paste: [[2502.18318]] Mapping of Subjective Accounts into Interpreted Clusters (MOSAIC): Topic Modelling and LLM applied to Stroboscopic Phenomenology(https://arxiv.org/abs/2502.18318)
Keywords: language model, llm, hallucination
Abstract: Stroboscopic light stimulation (SLS) on closed eyes typically induces simple visual hallucinations (VHs), characterised by vivid, geometric and colourful patterns. A dataset of 862 sentences, extracted from 422 open subjective reports, was recently compiled as part of the Dreamachine programme (Collective Act, 2022), an immersive multisensory experience that combines SLS and spatial sound in a collective setting. Although open reports extend the range of reportable phenomenology, their analysis presents significant challenges, particularly in systematically identifying patterns. To address this challenge, we implemented a data-driven approach leveraging Large Language Models and Topic Modelling to uncover and interpret latent experiential topics directly from the Dreamachine's text-based reports. Our analysis confirmed the presence of simple VHs typically documented in scientific studies of SLS, while also revealing experiences of altered states of consciousness and complex hallucinations. Building on these findings, our computational approach expands the systematic study of subjective experience by enabling data-driven analyses of open-ended phenomenological reports, capturing experiences not readily identified through standard questionnaires. By revealing rich and multifaceted aspects of experiences, our study broadens our understanding of stroboscopically-induced phenomena while highlighting the potential of Natural Language Processing and Large Language Models in the emerging field of computational (neuro)phenomenology. More generally, this approach provides a practically applicable methodology for uncovering subtle hidden patterns of subjective experience across diverse research domains.
摘要：闭合眼睛上的频镜刺激（SLS）通常会诱导简单的视觉幻觉（VHS），其特征是生动，几何和彩色图案。最近从422个开放主观报告中提取的862个句子的数据集是作为Dreamachine计划（Collective Act，2022）的一部分编辑的，这是一种沉浸式的多感官体验，将SLS和空间声音结合在集体环境中。尽管开放报告扩展了可报告现象学的范围，但他们的分析提出了重大挑战，尤其是在系统地识别模式的情况下。为了应对这一挑战，我们实施了一种数据驱动的方法，利用大型语言模型和主题建模来揭示和解释潜在的体验主题，直接从Dreamachine的基于文本的报告中。我们的分析证实了SLS科学研究中通常记录的简单VHS，同时还揭示了意识状态和复杂幻觉状态的经历。在这些发现的基础上，我们的计算方法通过对开放式现象学报告进行数据驱动的分析来扩展对主观经验的系统研究，从而捕获通过标准问卷轻松确定的经验。通过揭示经验的丰富而多方面的方面，我们的研究扩大了我们对频道诱导的现象的理解，同时强调了自然语言处理和大型语言模型在新兴的计算领域（NEURO）现象学领域的潜力。更一般地，这种方法为揭示各种研究领域的主观经验的微妙隐藏模式提供了一种适用的方法。

Title: BottleHumor: Self-Informed Humor Explanation using the Information Bottleneck Principle

Authors: EunJeong Hwang, Peter West, Vered Shwartz
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.18331
Pdf URL: https://arxiv.org/pdf/2502.18331
Copy Paste: [[2502.18331]] BottleHumor: Self-Informed Humor Explanation using the Information Bottleneck Principle(https://arxiv.org/abs/2502.18331)
Keywords: language model
Abstract: Humor is prevalent in online communications and it often relies on more than one modality (e.g., cartoons and memes). Interpreting humor in multimodal settings requires drawing on diverse types of knowledge, including metaphorical, sociocultural, and commonsense knowledge. However, identifying the most useful knowledge remains an open question. We introduce BottleHumor, a method inspired by the information bottleneck principle that elicits relevant world knowledge from vision and language models, which is iteratively refined to generate an explanation of the humor in an unsupervised manner. Our experiments on three datasets confirm the advantage of our method over a range of baselines. Our method can further be adapted in the future for additional tasks that can benefit from eliciting and conditioning on relevant world knowledge and open new research avenues in this direction.
摘要：幽默在在线通信中普遍存在，它通常依赖于多种模式（例如卡通和模因）。在多模式环境中解释幽默需要利用各种类型的知识，包括隐喻，社会文化和常识知识。但是，确定最有用的知识仍然是一个悬而未决的问题。我们介绍了\ method {}，这是一种受到信息瓶颈原则启发的方法，该方法从视觉和语言模型中引起了世界知识的相关知识，这是迭代地完善的，用于以一种无聊的方式产生幽默的解释。我们在三个数据集上的实验证实了我们方法比一系列基线的优势。将来，我们的方法可以进一步适应其他任务，这些任务可以从相关的世界知识中引起和调理，并朝着这一方向开放新的研究途径。

Title: Correlating and Predicting Human Evaluations of Language Models from Natural Language Processing Benchmarks

Authors: Rylan Schaeffer, Punit Singh Koura, Binh Tang, Ranjan Subramanian, Aaditya K Singh, Todor Mihaylov, Prajjwal Bhargava, Lovish Madaan, Niladri S. Chatterji, Vedanuj Goswami, Sergey Edunov, Dieuwke Hupkes, Sanmi Koyejo, Sharan Narang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2502.18339
Pdf URL: https://arxiv.org/pdf/2502.18339
Copy Paste: [[2502.18339]] Correlating and Predicting Human Evaluations of Language Models from Natural Language Processing Benchmarks(https://arxiv.org/abs/2502.18339)
Keywords: language model, chat
Abstract: The explosion of high-performing conversational language models (LMs) has spurred a shift from classic natural language processing (NLP) benchmarks to expensive, time-consuming and noisy human evaluations - yet the relationship between these two evaluation strategies remains hazy. In this paper, we conduct a large-scale study of four Chat Llama 2 models, comparing their performance on 160 standard NLP benchmarks (e.g., MMLU, ARC, BIG-Bench Hard) against extensive human preferences on more than 11k single-turn and 2k multi-turn dialogues from over 2k human annotators. Our findings are striking: most NLP benchmarks strongly correlate with human evaluations, suggesting that cheaper, automated metrics can serve as surprisingly reliable predictors of human preferences. Three human evaluations, such as adversarial dishonesty and safety, are anticorrelated with NLP benchmarks, while two are uncorrelated. Moreover, through overparameterized linear regressions, we show that NLP scores can accurately predict human evaluations across different model scales, offering a path to reduce costly human annotation without sacrificing rigor. Overall, our results affirm the continued value of classic benchmarks and illuminate how to harness them to anticipate real-world user satisfaction - pointing to how NLP benchmarks can be leveraged to meet evaluation needs of our new era of conversational AI.
摘要：高性能对话语言模型（LMS）的爆炸激发了从经典的自然语言处理（NLP）基准转变为昂贵，耗时和嘈杂的人类评估 - 但这两种评估策略之间的关系仍然朦胧。在本文中，我们对四种Chat Llama 2型号进行了大规模研究，比较了它们在160个标准NLP基准（例如MMLU，ARC，Big Bench Hard）上的性能与超过11K单转的广泛人类偏好，并进行了比较。来自2K人类注释者的2K多转向对话。我们的发现令人惊讶：大多数NLP基准与人类评估密切相关，表明更便宜，自动化的指标可以作为人类偏好的可靠预测指标。三种人类评估，例如对抗性的不诚实和安全性，与NLP基准相关，而两项是不相关的。此外，通过过度参数的线性回归，我们表明NLP得分可以准确预测不同模型量表的人类评估，从而提供了减少昂贵的人类注释而不牺牲严格性的途径。总体而言，我们的结果肯定了经典基准测试的持续价值，并阐明了如何利用他们预测现实世界的用户满意度 - 指出如何利用NLP基准测试来满足我们对话型AI的新时代的评估需求。

Title: BRIDO: Bringing Democratic Order to Abstractive Summarization

Authors: Junhyun Lee, Harshith Goka, Hyeonmok Ko
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2502.18342
Pdf URL: https://arxiv.org/pdf/2502.18342
Copy Paste: [[2502.18342]] BRIDO: Bringing Democratic Order to Abstractive Summarization(https://arxiv.org/abs/2502.18342)
Keywords: language model, llm, hallucination
Abstract: Hallucination refers to the inaccurate, irrelevant, and inconsistent text generated from large language models (LLMs). While LLMs have shown great promise in a variety of tasks, hallucination remains a major challenge for many practical uses. In this paper, we tackle the issue of hallucination in abstractive text summarization by mitigating exposure bias. Existing models targeted at exposure bias mitigation, namely BRIO, aim for better summarization quality as measured by the ROUGE score. We propose a model that uses a similar exposure bias mitigation strategy but with a goal aligned with reducing hallucination. We conjecture that among a group of candidate outputs, ones with hallucinations will comprise the minority of the whole group. That is, candidates with less similarity with others will have a higher chance of containing hallucinated content. Our method exploits this property and uses contrastive learning, incentivizing candidates with high inter-candidate ROUGE scores. We performed experiments on the XSum and CNN/DM summarization datasets, and our method showed 6.25% and 3.82% improvement, respectively, on the consistency G-Eval score over BRIO.
摘要：幻觉是指由大语言模型（LLM）产生的不准确，无关紧要和不一致的文本。尽管LLM在各种任务中都表现出了巨大的希望，但幻觉问题仍然是许多实际用途的主要挑战。在本文中，我们通过减轻暴露偏见来解决抽象文本摘要中的幻觉问题。现有用于曝光偏见的模型，即BRIO，目的是在胭脂分数中更好地汇总质量。我们提出了一种使用类似的暴露偏置缓解策略但目标与幻觉较少保持一致的模型。我们猜想，在一组候选输出中，具有幻觉的输出将包括整个群体的少数。也就是说，与他人相似的候选人将有更高的机会包含幻觉内容。我们的方法使用此方面并利用对比度学习，激励具有高候选胭脂分数的候选人。我们在XSUM和CNN/DM汇总数据集上进行了实验，我们的方法在BRIO上的一致性G-eval评分方面分别显示了6.25％和3.82％。
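
The conjecture that hallucinated candidates are the minority suggests ranking candidates by how much they agree with one another. The sketch below shows only that ranking step, using a simplified unigram-overlap F1 as a stand-in for ROUGE; the contrastive training that the abstract builds on top of the ranking is not shown, and the function names are hypothetical.

```python
from collections import Counter

def rouge1_f1(a, b):
    """Simplified ROUGE-1 F1: unigram overlap on whitespace tokens."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    overlap = sum((ca & cb).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(ca.values())
    recall = overlap / sum(cb.values())
    return 2 * precision * recall / (precision + recall)

def rank_by_consensus(candidates):
    """Rank candidates by mean similarity to all other candidates.

    Returns (order, scores): indices from most to least agreed-upon, and
    the per-candidate mean score. Requires at least two candidates.
    """
    scores = []
    for i, cand in enumerate(candidates):
        others = [rouge1_f1(cand, o) for j, o in enumerate(candidates) if j != i]
        scores.append(sum(others) / len(others))
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return order, scores
```

Under the abstract's conjecture, candidates at the bottom of this ordering are the ones more likely to contain hallucinated content.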

Title: DBR: Divergence-Based Regularization for Debiasing Natural Language Understanding Models

Authors: Zihao Li, Ruixiang Tang, Lu Cheng, Shuaiqiang Wang, Dawei Yin, Mengnan Du
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2502.18353
Pdf URL: https://arxiv.org/pdf/2502.18353
Copy Paste: [[2502.18353]] DBR: Divergence-Based Regularization for Debiasing Natural Language Understanding Models(https://arxiv.org/abs/2502.18353)
Keywords: language model
Abstract: Pre-trained language models (PLMs) have achieved impressive results on various natural language processing tasks. However, recent research has revealed that these models often rely on superficial features and shortcuts instead of developing a genuine understanding of language, especially for natural language understanding (NLU) tasks. Consequently, the models struggle to generalize to out-of-domain data. In this work, we propose Divergence Based Regularization (DBR) to mitigate this shortcut learning behavior. Our method measures the divergence between the output distributions for original examples and examples where shortcut tokens have been masked. This process prevents the model's predictions from being overly influenced by shortcut features or biases. We evaluate our model on three NLU tasks and find that it improves out-of-domain performance with little loss of in-domain accuracy. Our results demonstrate that reducing the reliance on shortcuts and superficial features can enhance the generalization ability of large pre-trained language models.
摘要：预训练的语言模型（PLM）在各种自然语言处理任务上取得了令人印象深刻的结果。但是，最近的研究表明，这些模型通常依赖于表面特征和快捷方式，而不是对语言的真正理解，尤其是对于自然语言理解（NLU）任务。因此，模型努力将其推广到室外数据。在这项工作中，我们提出基于差异的正则化（DBR）来减轻这种快捷方式学习行为。我们的方法衡量了原始示例的输出分布与已掩盖快捷方式令牌的示例之间的差异。此过程阻止模型的预测不受捷径特征或偏见的影响。我们在三个NLU任务上评估了我们的模型，并发现它可以改善室外性能，而损失域的准确性几乎没有。我们的结果表明，减少对快捷方式和表面特征的依赖可以增强大型预训练的语言模型的概括能力。
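
One way to read the regularizer described above is as a task loss plus a penalty on how far the prediction moves when shortcut tokens are masked. The sketch below assumes a Jensen-Shannon divergence and a weighting factor `lam`; the abstract does not state which divergence or weighting the authors use, so both are illustrative choices.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for strictly positive distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def js_divergence(p, q):
    """Symmetric Jensen-Shannon divergence between p and q."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def dbr_style_loss(logits_original, logits_masked, label, lam=1.0):
    """Cross-entropy on the original input plus a divergence penalty.

    logits_masked are the model's logits for the same example with
    suspected shortcut tokens masked out.
    """
    p = softmax(logits_original)
    q = softmax(logits_masked)
    cross_entropy = -math.log(p[label])
    return cross_entropy + lam * js_divergence(p, q)
```

When masking the shortcut tokens leaves the prediction unchanged, the penalty is zero and the loss reduces to ordinary cross-entropy; the more the prediction depends on the masked tokens, the larger the penalty.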

Title: Monte Carlo Temperature: a robust sampling strategy for LLM's uncertainty quantification methods

Authors: Nicola Cecere, Andrea Bacciu, Ignacio Fernández Tobías, Amin Mantrach
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.18389
Pdf URL: https://arxiv.org/pdf/2502.18389
Copy Paste: [[2502.18389]] Monte Carlo Temperature: a robust sampling strategy for LLM's uncertainty quantification methods(https://arxiv.org/abs/2502.18389)
Keywords: language model, llm
Abstract: Uncertainty quantification (UQ) in Large Language Models (LLMs) is essential for their safe and reliable deployment, particularly in critical applications where incorrect outputs can have serious consequences. Current UQ methods typically rely on querying the model multiple times using non-zero temperature sampling to generate diverse outputs for uncertainty estimation. However, the impact of selecting a given temperature parameter is understudied, and our analysis reveals that temperature plays a fundamental role in the quality of uncertainty estimates. The conventional approach of identifying optimal temperature values requires expensive hyperparameter optimization (HPO) that must be repeated for each new model-dataset combination. We propose Monte Carlo Temperature (MCT), a robust sampling strategy that eliminates the need for temperature calibration. Our analysis reveals that: 1) MCT provides more robust uncertainty estimates across a wide range of temperatures, 2) MCT improves the performance of UQ methods by replacing fixed-temperature strategies that do not rely on HPO, and 3) MCT achieves statistical parity with oracle temperatures, which represent the ideal outcome of a well-tuned but computationally expensive HPO process. These findings demonstrate that effective UQ can be achieved without the computational burden of temperature parameter calibration.
摘要：大语言模型（LLM）中的不确定性量化（UQ）对于其安全可靠的部署至关重要，尤其是在不正确的产出可能会带来严重后果的关键应用程序中。当前的UQ方法通常依赖于使用非零温度采样多次查询模型，以生成不同的输出以进行不确定性估计。但是，选择给定的温度参数的影响被研究了，我们的分析表明，温度在不确定性估计的质量中起着基本作用。识别最佳温度值的常规方法需要昂贵的高参数优化（HPO），必须重复每个新的模型数据组组合。我们提出了蒙特卡洛温度（MCT），这是一种强大的采样策略，消除了对温度校准的需求。我们的分析表明：1）MCT通过替换不依赖HPO的固定温度策略来提高UQ方法的性能，提供更高的不确定性估计，2）MCT提高了UQ方法的性能，3）MCT具有统计率甲骨文温度是一个调整良好但计算昂贵的HPO过程的理想结果。这些发现表明，如果没有温度参数校准的计算负担，就可以实现有效的UQ。
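
The abstract does not spell out how MCT draws temperatures, so the following is only one plausible reading: instead of fixing a single temperature, draw a fresh temperature for every sampled output and estimate uncertainty from the spread of the samples. The uniform range, the toy logits interface, and the entropy-based estimate are all assumptions for illustration.

```python
import math
import random
from collections import Counter

def sample_with_temperature(logits, temperature, rng):
    """Sample one index from softmax(logits / temperature)."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

def mct_samples(logits, n_samples, t_low, t_high, rng):
    """Draw each sample at its own temperature, uniform in [t_low, t_high].

    t_low must be greater than zero.
    """
    samples = []
    for _ in range(n_samples):
        temperature = rng.uniform(t_low, t_high)
        samples.append(sample_with_temperature(logits, temperature, rng))
    return samples

def empirical_entropy(samples):
    """Entropy (in nats) of the empirical distribution of the samples."""
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * math.log(c / n) for c in counts.values())
```

The appeal of this design is that no single temperature has to be tuned per model and dataset; the uncertainty estimate averages over a range of temperatures instead.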

Title: KiRAG: Knowledge-Driven Iterative Retriever for Enhancing Retrieval-Augmented Generation

Authors: Jinyuan Fang, Zaiqiao Meng, Craig Macdonald
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.18397
Pdf URL: https://arxiv.org/pdf/2502.18397
Copy Paste: [[2502.18397]] KiRAG: Knowledge-Driven Iterative Retriever for Enhancing Retrieval-Augmented Generation(https://arxiv.org/abs/2502.18397)
Keywords: retrieval-augmented generation, chain-of-thought
Abstract: Iterative retrieval-augmented generation (iRAG) models offer an effective approach for multi-hop question answering (QA). However, their retrieval process faces two key challenges: (1) it can be disrupted by irrelevant documents or factually inaccurate chains of thought; (2) their retrievers are not designed to dynamically adapt to the evolving information needs in multi-step reasoning, making it difficult to identify and retrieve the missing information required at each iterative step. Therefore, we propose KiRAG, which uses a knowledge-driven iterative retriever model to enhance the retrieval process of iRAG. Specifically, KiRAG decomposes documents into knowledge triples and performs iterative retrieval with these triples to enable a factually reliable retrieval process. Moreover, KiRAG integrates reasoning into the retrieval process to dynamically identify and retrieve knowledge that bridges information gaps, effectively adapting to the evolving information needs. Empirical results show that KiRAG significantly outperforms existing iRAG models, with an average improvement of 9.40% in R@3 and 5.14% in F1 on multi-hop QA.
摘要：迭代检索型生成（IRAG）模型为多跳问答（QA）提供了有效的方法。但是，他们的检索过程面临两个主要挑战：（1）它可能会被无关的文件或实际不准确的思想链所破坏；（2）他们的猎犬并非旨在动态地适应多步推理中不断发展的信息需求，因此很难在每个迭代步骤中识别和检索所需的丢失信息。因此，我们提出了基拉格（Kirag），它使用知识驱动的迭代猎犬模型来增强IRAG的检索过程。具体而言，Kirag将文档分解为知识三元，并通过这些三元组进行迭代检索，以实现实际上可靠的检索过程。此外，基拉格将推理集成到检索过程中，以动态识别和检索弥合信息差距的知识，有效地适应不断发展的信息需求。经验结果表明，基拉格（Kirag）的表现明显胜过现有的IRAG模型，在多跳QA上，R@3的平均提高了9.40％，F1的平均提高为5.14％。

Title: AgentRM: Enhancing Agent Generalization with Reward Modeling

Authors: Yu Xia, Jingru Fan, Weize Chen, Siyu Yan, Xin Cong, Zhong Zhang, Yaxi Lu, Yankai Lin, Zhiyuan Liu, Maosong Sun
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.18407
Pdf URL: https://arxiv.org/pdf/2502.18407
Copy Paste: [[2502.18407]] AgentRM: Enhancing Agent Generalization with Reward Modeling(https://arxiv.org/abs/2502.18407)
Keywords: llm, agent
Abstract: Existing LLM-based agents have achieved strong performance on held-in tasks, but their generalizability to unseen tasks remains poor. Hence, some recent work focuses on fine-tuning the policy model with more diverse tasks to improve generalizability. In this work, we find that fine-tuning a reward model to guide the policy model is more robust than directly fine-tuning the policy model. Based on this finding, we propose AgentRM, a generalizable reward model, to guide the policy model for effective test-time search. We comprehensively investigate three approaches to construct the reward model, including explicit reward modeling, implicit reward modeling and LLM-as-a-judge. We then use AgentRM to guide the answer generation with Best-of-N sampling and step-level beam search. On nine agent tasks of four types, AgentRM enhances the base policy model by 8.8 points on average, surpassing the top general agent by 4.0. Moreover, it demonstrates weak-to-strong generalization, yielding a greater improvement of 12.6 on the LLaMA-3-70B policy model. As for specializability, AgentRM can also boost a fine-tuned policy model and outperform the top specialized agent by 11.4 on three held-in tasks. Further analysis verifies its effectiveness in test-time scaling. Code will be released to facilitate research in this area.

Title: GLEAN: Generalized Category Discovery with Diverse and Quality-Enhanced LLM Feedback

Authors: Henry Peng Zou, Siffi Singh, Yi Nian, Jianfeng He, Jason Cai, Saab Mansour, Hang Su
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2502.18414
Pdf URL: https://arxiv.org/pdf/2502.18414
Copy Paste: [[2502.18414]] GLEAN: Generalized Category Discovery with Diverse and Quality-Enhanced LLM Feedback(https://arxiv.org/abs/2502.18414)
Keywords: llm
Abstract: Generalized Category Discovery (GCD) is a practical and challenging open-world task that aims to recognize both known and novel categories in unlabeled data using limited labeled data from known categories. Due to the lack of supervision, previous GCD methods face significant challenges, such as difficulty in rectifying errors for confusing instances, and inability to effectively uncover and leverage the semantic meanings of discovered clusters. Therefore, additional annotations are usually required for real-world applicability. However, human annotation is extremely costly and inefficient. To address these issues, we propose GLEAN, a unified framework for generalized category discovery that actively learns from diverse and quality-enhanced LLM feedback. Our approach leverages three different types of LLM feedback to: (1) improve instance-level contrastive features, (2) generate category descriptions, and (3) align uncertain instances with LLM-selected category descriptions. Extensive experiments demonstrate the superior performance of GLEAN over state-of-the-art models across diverse datasets, metrics, and supervision settings. Our code is available at this https URL.

Title: Compressing Language Models for Specialized Domains

Authors: Miles Williams, George Chrysostomou, Vitor Jeronymo, Nikolaos Aletras
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.18424
Pdf URL: https://arxiv.org/pdf/2502.18424
Copy Paste: [[2502.18424]] Compressing Language Models for Specialized Domains(https://arxiv.org/abs/2502.18424)
Keywords: language model
Abstract: Compression techniques such as pruning and quantization offer a solution for more efficient deployment of language models (LMs), albeit with small drops in benchmark performance. However, general-purpose LM compression methods can negatively affect performance in specialized domains (e.g., biomedical or legal). Recent work has sought to address this, yet requires computationally expensive full-parameter fine-tuning. To this end, we propose cross-calibration, a novel training-free approach for improving the domain performance of compressed LMs. Our approach effectively leverages Hessian-based sensitivity to identify weights that are influential for both in-domain and general performance. Through extensive experimentation, we demonstrate that cross-calibration substantially outperforms existing approaches on domain-specific tasks, without compromising general performance. Notably, these gains come without additional computational overhead, displaying remarkable potential for extracting domain-specialized compressed models from general-purpose LMs.

Title: TextGames: Learning to Self-Play Text-Based Puzzle Games via Language Model Reasoning

Authors: Frederikus Hudi, Genta Indra Winata, Ruochen Zhang, Alham Fikri Aji
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.18431
Pdf URL: https://arxiv.org/pdf/2502.18431
Copy Paste: [[2502.18431]] TextGames: Learning to Self-Play Text-Based Puzzle Games via Language Model Reasoning(https://arxiv.org/abs/2502.18431)
Keywords: language model, llm
Abstract: Reasoning is a fundamental capability of large language models (LLMs), enabling them to comprehend, analyze, and solve complex problems. In this paper, we introduce TextGames, an innovative benchmark specifically crafted to assess LLMs through demanding text-based games that require advanced skills in pattern recognition, spatial awareness, arithmetic, and logical reasoning. Our analysis probes LLMs' performance in both single-turn and multi-turn reasoning, and their abilities in leveraging feedback to correct subsequent answers through self-reflection. Our findings reveal that, although LLMs exhibit proficiency in addressing most easy and medium-level problems, they face significant challenges with more difficult tasks. In contrast, humans are capable of solving all tasks when given sufficient time. Moreover, we observe that LLMs show improved performance in multi-turn predictions through self-reflection, yet they still struggle with sequencing, counting, and following complex rules consistently. Additionally, models optimized for reasoning outperform pre-trained LLMs that prioritize instruction following, highlighting the crucial role of reasoning skills in addressing highly complex problems.

Title: Reversal Blessing: Thinking Backward May Outpace Thinking Forward in Multi-choice Questions

Authors: Yizhe Zhang, Richard Bai, Zijin Gu, Ruixiang Zhang, Jiatao Gu, Emmanuel Abbe, Samy Bengio, Navdeep Jaitly
Subjects: cs.CL, cs.IT, cs.LG
Abstract URL: https://arxiv.org/abs/2502.18435
Pdf URL: https://arxiv.org/pdf/2502.18435
Copy Paste: [[2502.18435]] Reversal Blessing: Thinking Backward May Outpace Thinking Forward in Multi-choice Questions(https://arxiv.org/abs/2502.18435)
Keywords: language model, llm
Abstract: Language models usually use left-to-right (L2R) autoregressive factorization. However, L2R factorization may not always be the best inductive bias. Therefore, we investigate whether alternative factorizations of the text distribution could be beneficial in some tasks. We investigate right-to-left (R2L) training as a compelling alternative, focusing on multiple-choice questions (MCQs) as a test bed for knowledge extraction and reasoning. Through extensive experiments across various model sizes (2B-8B parameters) and training datasets, we find that R2L models can significantly outperform L2R models on several MCQ benchmarks, including logical reasoning, commonsense understanding, and truthfulness assessment tasks. Our analysis reveals that this performance difference may be fundamentally linked to multiple factors including calibration, computability and directional conditional entropy. We ablate the impact of these factors through controlled simulation studies using arithmetic tasks, where the impacting factors can be better disentangled. Our work demonstrates that exploring alternative factorizations of the text distribution can lead to improvements in LLM capabilities and provides theoretical insights into optimal factorization towards approximating human language distribution, and when each reasoning order might be more advantageous.
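For reference, the two factorizations contrasted in the abstract can be written out for a sequence of $T$ tokens $x_1, \dots, x_T$. The notation here is an assumption for illustration and is not taken from the paper. Both identities follow from the chain rule and are exact for the true distribution; a learned model only approximates each conditional, which is why the two orderings can behave differently in practice.

```latex
% Left-to-right (L2R): each token is conditioned on its prefix.
p(x_1, \dots, x_T) = \prod_{t=1}^{T} p\left(x_t \mid x_1, \dots, x_{t-1}\right)

% Right-to-left (R2L): each token is conditioned on its suffix.
p(x_1, \dots, x_T) = \prod_{t=1}^{T} p\left(x_t \mid x_{t+1}, \dots, x_T\right)
```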

Title: olmOCR: Unlocking Trillions of Tokens in PDFs with Vision Language Models

Authors: Jake Poznanski, Jon Borchardt, Jason Dunkelberger, Regan Huff, Daniel Lin, Aman Rangapur, Christopher Wilhelm, Kyle Lo, Luca Soldaini
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.18443
Pdf URL: https://arxiv.org/pdf/2502.18443
Copy Paste: [[2502.18443]] olmOCR: Unlocking Trillions of Tokens in PDFs with Vision Language Models(https://arxiv.org/abs/2502.18443)
Keywords: language model, llm
Abstract: PDF documents have the potential to provide trillions of novel, high-quality tokens for training language models. However, these documents come in a diversity of types with differing formats and visual layouts that pose a challenge when attempting to extract and faithfully represent the underlying content for language model use. We present olmOCR, an open-source Python toolkit for processing PDFs into clean, linearized plain text in natural reading order while preserving structured content like sections, tables, lists, equations, and more. Our toolkit runs a fine-tuned 7B vision language model (VLM) trained on a sample of 260,000 pages from over 100,000 crawled PDFs with diverse properties, including graphics, handwritten text and poor quality scans. olmOCR is optimized for large-scale batch processing, able to scale flexibly to different hardware setups and convert a million PDF pages for only $190 USD. We release all components of olmOCR including VLM weights, data and training code, as well as inference code built on serving frameworks including vLLM and SGLang.

Title: Disambiguate First Parse Later: Generating Interpretations for Ambiguity Resolution in Semantic Parsing

Authors: Irina Saparina, Mirella Lapata
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.18448
Pdf URL: https://arxiv.org/pdf/2502.18448
Copy Paste: [[2502.18448]] Disambiguate First Parse Later: Generating Interpretations for Ambiguity Resolution in Semantic Parsing(https://arxiv.org/abs/2502.18448)
Keywords: llm
Abstract: Handling ambiguity and underspecification is an important challenge in natural language interfaces, particularly for tasks like text-to-SQL semantic parsing. We propose a modular approach that resolves ambiguity using natural language interpretations before mapping these to logical forms (e.g., SQL queries). Although LLMs excel at parsing unambiguous utterances, they show strong biases for ambiguous ones, typically predicting only preferred interpretations. We constructively exploit this bias to generate an initial set of preferred disambiguations and then apply a specialized infilling model to identify and generate missing interpretations. To train the infilling model, we introduce an annotation method that uses SQL execution to validate different meanings. Our approach improves interpretation coverage and generalizes across datasets with different annotation styles, database structures, and ambiguity types.

Title: FRIDA to the Rescue! Analyzing Synthetic Data Effectiveness in Object-Based Common Sense Reasoning for Disaster Response

Authors: Mollie Shichman, Claire Bonial, Austin Blodgett, Taylor Hudson, Francis Ferraro, Rachel Rudinger
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.18452
Pdf URL: https://arxiv.org/pdf/2502.18452
Copy Paste: [[2502.18452]] FRIDA to the Rescue! Analyzing Synthetic Data Effectiveness in Object-Based Common Sense Reasoning for Disaster Response(https://arxiv.org/abs/2502.18452)
Keywords: language model, llm, agent
Abstract: Large Language Models (LLMs) have the potential for substantial common sense reasoning. However, these capabilities are often emergent in larger models. This means smaller models that can be run locally are less helpful and capable with respect to certain reasoning tasks. To meet our problem space requirements, we fine-tune smaller LLMs to disaster domains, as these domains involve complex and low-frequency physical common sense knowledge. We introduce a pipeline to create Field Ready Instruction Decoding Agent (FRIDA) models, where domain experts and linguists combine their knowledge to make high-quality seed data that is used to generate synthetic data for fine-tuning. We create a set of 130 seed instructions for synthetic generation, a synthetic dataset of 25000 instructions, and 119 evaluation instructions relating to both general and earthquake-specific object affordances. We fine-tune several LLaMa and Mistral instruction-tuned models and find that FRIDA models outperform their base models at a variety of sizes. We then run an ablation study to understand which kinds of synthetic data most affect performance and find that training physical state and object function common sense knowledge alone improves over FRIDA models trained on all data. We conclude that the FRIDA pipeline is capable of instilling general common sense, but needs to be augmented with information retrieval for specific domain knowledge.

Title: DRAMA: Diverse Augmentation from Large Language Models to Smaller Dense Retrievers

Authors: Xueguang Ma, Xi Victoria Lin, Barlas Oguz, Jimmy Lin, Wen-tau Yih, Xilun Chen
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2502.18460
Pdf URL: https://arxiv.org/pdf/2502.18460
Copy Paste: [[2502.18460]] DRAMA: Diverse Augmentation from Large Language Models to Smaller Dense Retrievers(https://arxiv.org/abs/2502.18460)
Keywords: language model, llm
Abstract: Large language models (LLMs) have demonstrated strong effectiveness and robustness when fine-tuned as dense retrievers. However, their large parameter size brings significant inference-time computational challenges, including high encoding costs for large-scale corpora and increased query latency, limiting their practical deployment. While smaller retrievers offer better efficiency, they often fail to generalize effectively with limited supervised fine-tuning data. In this work, we introduce DRAMA, a training framework that leverages LLMs to train smaller generalizable dense retrievers. In particular, we adopt pruned LLMs as the backbone and train on diverse LLM-augmented data in a single-stage contrastive learning setup. Experiments show that DRAMA offers better multilingual and long-context capabilities than traditional encoder-based retrievers, and achieves strong performance across multiple tasks and languages. These results highlight the potential of connecting the training of smaller retrievers with the growing advancements in LLMs, bridging the gap between efficiency and generalization.
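The abstract mentions contrastive learning without giving the objective. The sketch below assumes the in-batch-negative (InfoNCE-style) loss that is common for dense retrievers, which may differ from what DRAMA actually uses. The function name, the temperature value, and the toy embeddings are all hypothetical.

```python
import math
from typing import Sequence

def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))

def in_batch_contrastive_loss(
    queries: Sequence[Sequence[float]],
    passages: Sequence[Sequence[float]],
    temperature: float = 0.05,
) -> float:
    """Mean cross-entropy where passages[i] is the positive for queries[i]
    and every other passage in the batch serves as a negative."""
    losses = []
    for i, q in enumerate(queries):
        logits = [dot(q, p) / temperature for p in passages]
        # Numerically stable log-sum-exp over all passages in the batch.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)

if __name__ == "__main__":
    q = [[1.0, 0.0], [0.0, 1.0]]
    aligned = [[1.0, 0.0], [0.0, 1.0]]
    swapped = [[0.0, 1.0], [1.0, 0.0]]
    # The loss is lower when each query is closest to its own passage.
    print(in_batch_contrastive_loss(q, aligned), in_batch_contrastive_loss(q, swapped))
```

In a real setup the embeddings would come from the retriever's encoder and the loss would be minimized by gradient descent; the pure-Python version only makes the objective explicit.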