2025-02-27

Title: MixLLM: Dynamic Routing in Mixed Large Language Models

Authors: Xinyuan Wang, Yanchi Liu, Wei Cheng, Xujiang Zhao, Zhengzhang Chen, Wenchao Yu, Yanjie Fu, Haifeng Chen
Subjects: cs.CL, cs.AI, cs.DB, cs.IR
Abstract URL: https://arxiv.org/abs/2502.18482
Pdf URL: https://arxiv.org/pdf/2502.18482
Copy Paste: [[2502.18482]] MixLLM: Dynamic Routing in Mixed Large Language Models(https://arxiv.org/abs/2502.18482)
Keywords: language model, gpt, llm
Abstract: Large Language Models (LLMs) have recently shown potential for artificial general intelligence; however, their usage is costly and comes with high response latency. Given mixed LLMs with their own strengths and weaknesses, LLM routing aims to identify the most suitable model for each query in the stream to maximize response quality and minimize cost and latency. However, the challenges involve: (1) dynamic trade-offs among quality, cost, and latency; (2) enabling continual learning in deployed systems; and (3) navigating a varying (e.g., new LLM addition or old LLM removal) set of LLM candidates over time. To bridge these gaps, we develop MixLLM, a dynamic contextual-bandit-based routing system for query-LLM assignment. Specifically, we first leverage query tags to enhance query embeddings for the routing task. Next, we design lightweight prediction models to estimate the response qualities and costs of queries over LLMs. We then devise a meta-decision maker to choose the query-LLM assignments that best trade off response quality, cost, and latency. Finally, the system benefits from continual training, allowing it to adapt to evolving queries and user feedback over time. Our extensive experiments show that MixLLM achieves the best trade-offs in response quality, cost, and latency (97.25% of GPT-4's quality at 24.18% of the cost under the time constraint).
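Sketch: the routing step described in the abstract can be illustrated roughly as follows. This is a minimal sketch and not the authors' implementation: the `Candidate` fields, the linear quality-minus-cost score, the `cost_weight` parameter, and the fallback to the fastest model are all assumptions made for illustration. The predicted values would come from the lightweight prediction models the abstract mentions, which are not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str                 # hypothetical model label
    predicted_quality: float  # assumed output of a quality predictor, in [0, 1]
    predicted_cost: float     # assumed output of a cost predictor
    expected_latency: float   # assumed latency estimate, in seconds


def route(candidates, cost_weight, max_latency):
    """Pick the candidate with the best quality-minus-weighted-cost score
    among those that satisfy the latency constraint (illustrative rule only)."""
    feasible = [c for c in candidates if c.expected_latency <= max_latency]
    if not feasible:
        # Assumption: fall back to the fastest model when nothing is feasible.
        return min(candidates, key=lambda c: c.expected_latency)
    return max(
        feasible,
        key=lambda c: c.predicted_quality - cost_weight * c.predicted_cost,
    )
```

Raising `cost_weight` shifts assignments toward cheaper models, which is one simple way to expose the quality/cost trade-off as a single knob.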

Title: FactReasoner: A Probabilistic Approach to Long-Form Factuality Assessment for Large Language Models

Authors: Radu Marinescu, Debarun Bhattacharjya, Junkyu Lee, Tigran Tchrakian, Javier Carnerero Cano, Yufang Hou, Elizabeth Daly, Alessandra Pascale
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.18573
Pdf URL: https://arxiv.org/pdf/2502.18573
Copy Paste: [[2502.18573]] FactReasoner: A Probabilistic Approach to Long-Form Factuality Assessment for Large Language Models(https://arxiv.org/abs/2502.18573)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) have demonstrated vast capabilities on generative tasks in recent years, yet they struggle with guaranteeing the factual correctness of the generated content. This makes these models unreliable in realistic situations where factually accurate responses are expected. In this paper, we propose FactReasoner, a new factuality assessor that relies on probabilistic reasoning to assess the factuality of a long-form generated response. Specifically, FactReasoner decomposes the response into atomic units, retrieves relevant contexts for them from an external knowledge source, and constructs a joint probability distribution over the atoms and contexts using probabilistic encodings of the logical relationships (entailment, contradiction) between the textual utterances corresponding to the atoms and contexts. FactReasoner then computes the posterior probability of whether atomic units in the response are supported by the retrieved contexts. Our experiments on labeled and unlabeled benchmark datasets demonstrate clearly that FactReasoner improves considerably over state-of-the-art prompt-based approaches in terms of both factual precision and recall.

Title: Scalable Best-of-N Selection for Large Language Models via Self-Certainty

Authors: Zhewei Kang, Xuandong Zhao, Dawn Song
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.18581
Pdf URL: https://arxiv.org/pdf/2502.18581
Copy Paste: [[2502.18581]] Scalable Best-of-N Selection for Large Language Models via Self-Certainty(https://arxiv.org/abs/2502.18581)
Keywords: language model, llm, chain-of-thought
Abstract: Best-of-N selection is a key technique for improving the reasoning performance of Large Language Models (LLMs) through increased test-time computation. Current state-of-the-art methods often employ computationally intensive reward models for response evaluation and selection. Reward-free alternatives, like self-consistency and universal self-consistency, are limited in their ability to handle open-ended generation tasks or scale effectively. To address these limitations, we propose self-certainty, a novel and efficient metric that leverages the inherent probability distribution of LLM outputs to estimate response quality without requiring external reward models. We hypothesize that higher distributional self-certainty, aggregated across multiple samples, correlates with improved response accuracy, as it reflects greater confidence in the generated output. Through extensive experiments on various reasoning tasks, we demonstrate that self-certainty (1) scales effectively with increasing sample size $N$, akin to reward models but without the computational overhead; (2) complements chain-of-thought, improving reasoning performance beyond greedy decoding; and (3) generalizes to open-ended tasks where traditional self-consistency methods fall short. Our findings establish self-certainty as a practical and efficient way to improve LLM reasoning capabilities. The code is available at this https URL
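Sketch: a reward-free Best-of-N selector driven by the model's own output distributions might look like the following. The abstract does not give the formula, so this sketch assumes one plausible distributional confidence measure, the KL divergence from a uniform distribution to each next-token distribution, averaged over positions. The function names and the input layout (a list of per-token probability lists) are hypothetical.

```python
import math


def kl_from_uniform(probs):
    """KL(U || p) for one next-token distribution p over a vocabulary of size V.
    Every probability must be strictly positive."""
    v = len(probs)
    return sum((1.0 / v) * math.log((1.0 / v) / p) for p in probs)


def self_certainty(token_distributions):
    """Average divergence from uniform over all generated positions
    (an assumed confidence score, higher means more peaked distributions)."""
    total = sum(kl_from_uniform(p) for p in token_distributions)
    return total / len(token_distributions)


def best_of_n(samples):
    """samples: list of (text, token_distributions) pairs.
    Returns the text whose distributions are most confident on average."""
    return max(samples, key=lambda s: self_certainty(s[1]))[0]
```

No external reward model is called: the only inputs are the distributions the generator already produced while sampling each candidate.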

Title: Chain of Draft: Thinking Faster by Writing Less

Authors: Silei Xu, Wenhao Xie, Lingxiao Zhao, Pengcheng He
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.18600
Pdf URL: https://arxiv.org/pdf/2502.18600
Copy Paste: [[2502.18600]] Chain of Draft: Thinking Faster by Writing Less(https://arxiv.org/abs/2502.18600)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Large Language Models (LLMs) have demonstrated remarkable performance in solving complex reasoning tasks through mechanisms like Chain-of-Thought (CoT) prompting, which emphasizes verbose, step-by-step reasoning. However, humans typically employ a more efficient strategy: drafting concise intermediate thoughts that capture only essential information. In this work, we propose Chain of Draft (CoD), a novel paradigm inspired by human cognitive processes, where LLMs generate minimalistic yet informative intermediate reasoning outputs while solving tasks. By reducing verbosity and focusing on critical insights, CoD matches or surpasses CoT in accuracy while using as little as 7.6% of the tokens, significantly reducing cost and latency across various reasoning tasks.

Title: Steered Generation via Gradient Descent on Sparse Features

Authors: Sumanta Bhattacharyya, Pedram Rooshenas
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.18644
Pdf URL: https://arxiv.org/pdf/2502.18644
Copy Paste: [[2502.18644]] Steered Generation via Gradient Descent on Sparse Features(https://arxiv.org/abs/2502.18644)
Keywords: language model, llm
Abstract: Large language models (LLMs) encode a diverse range of linguistic features within their latent representations, which can be harnessed to steer their output toward specific target characteristics. In this paper, we modify the internal structure of LLMs by training sparse autoencoders to learn a sparse representation of the query embedding, allowing precise control over the model's attention distribution. We demonstrate that manipulating this sparse representation effectively transforms the output toward different stylistic and cognitive targets. Specifically, in an educational setting, we show that the cognitive complexity of LLM-generated feedback can be systematically adjusted by modifying the encoded query representation at a specific layer. To achieve this, we guide the learned sparse embedding toward the representation of samples from the desired cognitive complexity level, using gradient-based optimization in the latent space.

Title: Single- vs. Dual-Prompt Dialogue Generation with LLMs for Job Interviews in Human Resources

Authors: Joachim De Baer, A. Seza Doğruöz, Thomas Demeester, Chris Develder
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.18650
Pdf URL: https://arxiv.org/pdf/2502.18650
Copy Paste: [[2502.18650]] Single- vs. Dual-Prompt Dialogue Generation with LLMs for Job Interviews in Human Resources(https://arxiv.org/abs/2502.18650)
Keywords: language model, gpt, llm, prompt, agent
Abstract: Optimizing language models for use in conversational agents requires large quantities of example dialogues. Increasingly, these dialogues are synthetically generated by using powerful large language models (LLMs), especially in domains where authentic human data is difficult to obtain. One such domain is human resources (HR). In this context, we compare two LLM-based dialogue generation methods for the use case of generating HR job interviews, and assess whether one method generates higher-quality dialogues that are more challenging to distinguish from genuine human discourse. The first method uses a single prompt to generate the complete interview dialogue. The second method uses two agents that converse with each other. To evaluate dialogue quality under each method, we ask a judge LLM to determine whether AI was used for interview generation, using pairwise interview comparisons. We demonstrate that despite a sixfold increase in token cost, interviews generated with the dual-prompt method achieve a win rate up to ten times higher than those generated with the single-prompt method. This difference remains consistent regardless of whether GPT-4o or Llama 3.3 70B is used for either interview generation or judging quality.

Title: Enhancing Text Classification with a Novel Multi-Agent Collaboration Framework Leveraging BERT

Authors: Hediyeh Baban, Sai A Pidapar, Aashutosh Nema, Sichen Lu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.18653
Pdf URL: https://arxiv.org/pdf/2502.18653
Copy Paste: [[2502.18653]] Enhancing Text Classification with a Novel Multi-Agent Collaboration Framework Leveraging BERT(https://arxiv.org/abs/2502.18653)
Keywords: agent
Abstract: We introduce a novel multi-agent collaboration framework designed to enhance the accuracy and robustness of text classification models. Leveraging BERT as the primary classifier, our framework dynamically escalates low-confidence predictions to a specialized multi-agent system comprising Lexical, Contextual, Logic, Consensus, and Explainability agents. This collaborative approach allows for comprehensive analysis and consensus-driven decision-making, significantly improving classification performance across diverse text classification tasks. Empirical evaluations on benchmark datasets demonstrate that our framework achieves a 5.5% increase in accuracy compared to standard BERT-based classifiers, underscoring its effectiveness and academic novelty in advancing multi-agent systems within natural language processing.
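Sketch: the confidence-based escalation described in the abstract could be wired up roughly as below. This is an illustrative sketch only. The `threshold` value of 0.8 and the `escalate` callable, which stands in for the multi-agent system, are assumptions; the abstract does not specify how confidence is measured or how the agents reach consensus.

```python
def classify_with_escalation(probs, escalate, threshold=0.8):
    """probs: dict mapping label -> probability from the primary classifier.
    escalate: callable taking probs and returning a label; a stand-in for
    the multi-agent system, whose internals are not modeled here."""
    label, confidence = max(probs.items(), key=lambda kv: kv[1])
    if confidence >= threshold:
        # Confident enough: keep the primary classifier's prediction.
        return label, "primary"
    # Low confidence: hand the example to the fallback system.
    return escalate(probs), "escalated"
```

Only low-confidence inputs pay for the more expensive second stage, so the threshold controls how often that stage runs.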

Title: Discriminative Finetuning of Generative Large Language Models without Reward Models and Preference Data

Authors: Siqi Guo, Ilgee Hong, Vicente Balmaseda, Tuo Zhao, Tianbao Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.18679
Pdf URL: https://arxiv.org/pdf/2502.18679
Copy Paste: [[2502.18679]] Discriminative Finetuning of Generative Large Language Models without Reward Models and Preference Data(https://arxiv.org/abs/2502.18679)
Keywords: language model, llm
Abstract: Supervised fine-tuning (SFT) followed by preference optimization (PO), denoted by SFT$\rightarrow$PO, has become the standard for improving pretrained large language models (LLMs), with PO demonstrating significant performance gains. However, PO methods rely on either human-labeled preference data or a strong reward model to generate preference data. Can we fine-tune LLMs without preference data or reward models while achieving performance competitive with SFT$\rightarrow$PO? We address this question by introducing Discriminative Fine-Tuning (DFT), a novel approach that eliminates the need for preference data. Unlike SFT, which employs a generative approach and overlooks negative data, DFT adopts a discriminative paradigm that increases the probability of positive answers while suppressing potentially negative ones, shifting from token prediction to data prediction. Our contributions include: (i) a discriminative probabilistic framework for fine-tuning LLMs by explicitly modeling the discriminative likelihood of an answer among all possible outputs given an input; (ii) efficient algorithms to optimize this discriminative likelihood; and (iii) extensive experiments demonstrating DFT's effectiveness, achieving performance better than SFT and comparable to if not better than SFT$\rightarrow$PO. The code can be found at this https URL.

Title: MPO: An Efficient Post-Processing Framework for Mixing Diverse Preference Alignment

Authors: Tianze Wang, Dongnan Gui, Yifan Hu, Shuhang Lin, Linjun Zhang
Subjects: cs.CL, cs.LG, stat.ME
Abstract URL: https://arxiv.org/abs/2502.18699
Pdf URL: https://arxiv.org/pdf/2502.18699
Copy Paste: [[2502.18699]] MPO: An Efficient Post-Processing Framework for Mixing Diverse Preference Alignment(https://arxiv.org/abs/2502.18699)
Keywords: language model, llm
Abstract: Reinforcement Learning from Human Feedback (RLHF) has shown promise in aligning large language models (LLMs). Yet its reliance on a singular reward model often overlooks the diversity of human preferences. Recent approaches address this limitation by leveraging multi-dimensional feedback to fine-tune corresponding reward models and train LLMs using reinforcement learning. However, the process is costly and unstable, especially given the competing and heterogeneous nature of human preferences. In this paper, we propose Mixing Preference Optimization (MPO), a post-processing framework for aggregating single-objective policies as an alternative to both multi-objective RLHF (MORLHF) and MaxMin-RLHF. MPO avoids alignment from scratch. Instead, it log-linearly combines existing policies into a unified one with the weight of each policy computed via a batch stochastic mirror descent. Empirical results demonstrate that MPO achieves balanced performance across diverse preferences, outperforming or matching existing models with significantly reduced computational costs.
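Sketch: a log-linear combination of existing policies, as mentioned in the abstract, can be illustrated on a small discrete output set. This sketch assumes the weights are already given; the batch stochastic mirror descent that computes them in the paper is not implemented here. The function name and the dictionary representation of a policy are hypothetical.

```python
import math


def log_linear_mix(policies, weights):
    """Combine policies as pi(y) proportional to prod_k pi_k(y) ** w_k.
    policies: list of dicts mapping each candidate output to its probability.
    All policies must share the same keys, with strictly positive values."""
    outputs = list(policies[0].keys())
    log_scores = {
        y: sum(w * math.log(p[y]) for p, w in zip(policies, weights))
        for y in outputs
    }
    # Shift by the maximum before exponentiating, for numerical stability.
    m = max(log_scores.values())
    unnorm = {y: math.exp(s - m) for y, s in log_scores.items()}
    z = sum(unnorm.values())
    return {y: u / z for y, u in unnorm.items()}
```

With weights (1, 0) the mixture reduces to the first policy, and intermediate weights interpolate between the policies in log space without any retraining.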

Title: Random Forest-of-Thoughts: Uncertainty-aware Reasoning for Computational Social Science

Authors: Xiaohua Wu, Xiaohui Tao, Wenjie Wu, Yuefeng Li, Lin Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.18729
Pdf URL: https://arxiv.org/pdf/2502.18729
Copy Paste: [[2502.18729]] Random Forest-of-Thoughts: Uncertainty-aware Reasoning for Computational Social Science(https://arxiv.org/abs/2502.18729)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Social surveys in computational social science are carefully designed around elaborate domain theories so that they can effectively reflect the interviewee's deep thoughts without concealing their true feelings. The candidate questionnaire options depend heavily on the interviewee's previous answer, which adds to the complexity of social survey analysis and to the time and expertise it requires. The ability of large language models (LLMs) to perform complex reasoning is enhanced by prompting methods such as Chain-of-Thought (CoT), but it is still confined to left-to-right decision-making processes or limited paths during inference. This means they can fall short in problems that require exploration and uncertainty searching. In response, a novel large language model prompting method, called Random Forest of Thoughts (RFoT), is proposed for generating uncertainty reasoning to fit the area of computational social science. RFoT allows LLMs to perform deliberate decision-making by generating a diverse thought space and randomly selecting the sub-thoughts to build the forest of thoughts. It can extend the exploration and prediction of overall performance, benefiting from the extensive research space of response. The method is applied to optimize computational social science analysis on two datasets covering a spectrum of social survey analysis problems. Our experiments show that RFoT significantly enhances language models' abilities on two novel social survey analysis problems requiring non-trivial reasoning.

Title: Automatic Prompt Optimization via Heuristic Search: A Survey

Authors: Wendi Cui, Jiaxin Zhang, Zhuohang Li, Hao Sun, Damien Lopez, Kamalika Das, Bradley A. Malin, Sricharan Kumar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.18746
Pdf URL: https://arxiv.org/pdf/2502.18746
Copy Paste: [[2502.18746]] Automatic Prompt Optimization via Heuristic Search: A Survey(https://arxiv.org/abs/2502.18746)
Keywords: language model, llm, prompt
Abstract: Recent advances in Large Language Models have led to remarkable achievements across a variety of Natural Language Processing tasks, making prompt engineering increasingly central to guiding model outputs. While manual methods can be effective, they typically rely on intuition and do not automatically refine prompts over time. In contrast, automatic prompt optimization employing heuristic-based search algorithms can systematically explore and improve prompts with minimal human oversight. This survey proposes a comprehensive taxonomy of these methods, categorizing them by where optimization occurs, what is optimized, what criteria drive the optimization, which operators generate new prompts, and which iterative search algorithms are applied. We further highlight specialized datasets and tools that support and accelerate automated prompt refinement. We conclude by discussing key open challenges pointing toward future opportunities for more robust and versatile LLM applications.

Title: Plutus: Benchmarking Large Language Models in Low-Resource Greek Finance

Authors: Xueqing Peng, Triantafillos Papadopoulos, Efstathia Soufleri, Polydoros Giannouris, Ruoyu Xiang, Yan Wang, Lingfei Qian, Jimin Huang, Qianqian Xie, Sophia Ananiadou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.18772
Pdf URL: https://arxiv.org/pdf/2502.18772
Copy Paste: [[2502.18772]] Plutus: Benchmarking Large Language Models in Low-Resource Greek Finance(https://arxiv.org/abs/2502.18772)
Keywords: language model, llm
Abstract: Despite Greece's pivotal role in the global economy, large language models (LLMs) remain underexplored for the Greek financial context due to the linguistic complexity of Greek and the scarcity of domain-specific datasets. Previous efforts in multilingual financial natural language processing (NLP) have exposed considerable performance disparities, yet no dedicated Greek financial benchmarks or Greek-specific financial LLMs have been developed until now. To bridge this gap, we introduce Plutus-ben, the first Greek Financial Evaluation Benchmark, and Plutus-8B, the pioneering Greek Financial LLM, fine-tuned with Greek domain-specific data. Plutus-ben addresses five core financial NLP tasks in Greek: numeric and textual named entity recognition, question answering, abstractive summarization, and topic classification, thereby facilitating systematic and reproducible LLM assessments. To underpin these tasks, we present three novel, high-quality Greek financial datasets, thoroughly annotated by expert native Greek speakers and augmented by two existing resources. Our comprehensive evaluation of 22 LLMs on Plutus-ben reveals that Greek financial NLP remains challenging due to linguistic complexity, domain-specific terminology, and financial reasoning gaps. These findings underscore the limitations of cross-lingual transfer, the necessity for financial expertise in Greek-trained models, and the challenges of adapting financial LLMs to Greek text. We release Plutus-ben, Plutus-8B, and all associated datasets publicly to promote reproducible research and advance Greek financial NLP, fostering broader multilingual inclusivity in finance.

Title: Active Few-Shot Learning for Text Classification

Authors: Saeed Ahmadnia, Arash Yousefi Jordehi, Mahsa Hosseini Khasheh Heyran, Seyed Abolghasem Mirroshandel, Owen Rambow, Cornelia Caragea
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.18782
Pdf URL: https://arxiv.org/pdf/2502.18782
Copy Paste: [[2502.18782]] Active Few-Shot Learning for Text Classification(https://arxiv.org/abs/2502.18782)
Keywords: language model, llm
Abstract: The rise of Large Language Models (LLMs) has boosted the use of Few-Shot Learning (FSL) methods in natural language processing, achieving acceptable performance even when working with limited training data. The goal of FSL is to effectively utilize a small number of annotated samples in the learning process. However, the performance of FSL suffers when unsuitable support samples are chosen. This problem arises due to the heavy reliance on a limited number of support samples, which hampers consistent performance improvement even when more support samples are added. To address this challenge, we propose an active learning-based instance selection mechanism that identifies effective support instances from the unlabeled pool and can work with different LLMs. Our experiments on five tasks show that our method frequently improves the performance of FSL. We make our implementation available on GitHub.

Title: Seeing the Forest for the Trees: A Large Scale, Continuously Updating Meta-Analysis of Frontier LLMs

Authors: Jungsoo Park, Junmo Kang, Gabriel Stanovsky, Alan Ritter
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.18791
Pdf URL: https://arxiv.org/pdf/2502.18791
Copy Paste: [[2502.18791]] Seeing the Forest for the Trees: A Large Scale, Continuously Updating Meta-Analysis of Frontier LLMs(https://arxiv.org/abs/2502.18791)
Keywords: llm, chain-of-thought
Abstract: The surge of LLM studies makes synthesizing their findings challenging. Meta-analysis can uncover important trends across studies, but its use is limited by the time-consuming nature of manual data extraction. Our study presents a semi-automated approach for meta-analysis that accelerates data extraction using LLMs. It automatically identifies relevant arXiv papers, extracts experimental results and related attributes, and organizes them into a structured dataset. We conduct a comprehensive meta-analysis of frontier LLMs using an automatically extracted dataset, reducing the effort of paper surveying and data extraction by more than 93% compared to manual approaches. We validate our dataset by showing that it reproduces key findings from a recent manual meta-analysis about Chain-of-Thought (CoT), and also uncovers new insights that go beyond it, showing for example that in-context examples benefit multimodal tasks but offer limited gains in mathematical tasks compared to CoT. Our automatically updatable dataset enables continuous tracking of target models by extracting evaluation studies as new data becomes available. Through our scientific artifacts and empirical analysis, we provide novel insights into LLMs while facilitating ongoing meta-analyses of their behavior.

Title: Anything Goes? A Crosslinguistic Study of (Im)possible Language Learning in LMs

Authors: Xiulin Yang, Tatsuya Aoyama, Yuekun Yao, Ethan Wilcox
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.18795
Pdf URL: https://arxiv.org/pdf/2502.18795
Copy Paste: [[2502.18795]] Anything Goes? A Crosslinguistic Study of (Im)possible Language Learning in LMs(https://arxiv.org/abs/2502.18795)
Keywords: language model, gpt, llm
Abstract: Do LLMs offer insights into human language learning? A common argument against this idea is that because their architecture and training paradigm are so vastly different from humans, LLMs can learn arbitrary inputs as easily as natural languages. In this paper, we test this claim by training LMs to model impossible and typologically unattested languages. Unlike previous work, which has focused exclusively on English, we conduct experiments on 12 natural languages from 4 language families. Our results show that while GPT-2 small can primarily distinguish attested languages from their impossible counterparts, it does not achieve perfect separation between all the attested languages and all the impossible ones. We further test whether GPT-2 small distinguishes typologically attested from unattested languages with different NP orders by manipulating word order based on Greenberg's Universal 20. We find that the model's perplexity scores do not distinguish attested vs. unattested word orders, as long as the unattested variants maintain constituency structure. These findings suggest that language models exhibit some human-like inductive biases, though these biases are weaker than those found in human learners.

Title: ANPMI: Assessing the True Comprehension Capabilities of LLMs for Multiple Choice Questions

Authors: Gyeongje Cho, Yeonkyoung So, Jaejin Lee
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.18798
Pdf URL: https://arxiv.org/pdf/2502.18798
Copy Paste: [[2502.18798]] ANPMI: Assessing the True Comprehension Capabilities of LLMs for Multiple Choice Questions(https://arxiv.org/abs/2502.18798)
Keywords: language model, llm, prompt
Abstract: Multiple-choice benchmarks, consisting of various prompts and choices, are among the most widely used methods to assess a language model's natural language understanding capability. Given a specific prompt, we typically compute $P(Choice|Prompt)$ to evaluate how likely a language model is to generate the correct choice compared to incorrect ones. However, we observe that performance measured using this approach reflects not only the model's comprehension of the prompt but also its inherent biases for certain choices regardless of the prompt. This issue makes it challenging to accurately measure a model's natural language understanding, as models may select the answer without fully understanding the prompt. To address this limitation, we propose a novel metric called ANPMI, which normalizes Pointwise Mutual Information (PMI) by $-\log P(Choice)$. ANPMI provides a more accurate assessment of the model's natural language understanding by ensuring that it is challenging to answer a question without properly understanding the prompt.
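Sketch: the metric in the abstract, PMI normalized by $-\log P(Choice)$, can be computed from two log-probabilities per choice. The sketch below takes PMI in its standard form, $\log P(Choice|Prompt) - \log P(Choice)$. How $P(Choice)$ is obtained (for example, by scoring the choice without the prompt) is an assumption here, and the function names are hypothetical.

```python
def anpmi(logp_choice_given_prompt, logp_choice):
    """PMI(prompt, choice) divided by -log P(choice).
    logp_choice must be strictly negative, i.e. P(choice) < 1."""
    pmi = logp_choice_given_prompt - logp_choice
    return pmi / (-logp_choice)


def pick_choice(scores):
    """scores: dict mapping choice -> (log P(choice | prompt), log P(choice)).
    Returns the choice with the highest ANPMI."""
    return max(scores, key=lambda c: anpmi(*scores[c]))
```

A choice the model favors regardless of the prompt scores near 0, while a choice made certain by the prompt scores 1, so prompt-independent bias no longer inflates the ranking.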

Title: Language Models Grow Less Humanlike beyond Phase Transition

Authors: Tatsuya Aoyama, Ethan Wilcox
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.18802
Pdf URL: https://arxiv.org/pdf/2502.18802
Copy Paste: [[2502.18802]] Language Models Grow Less Humanlike beyond Phase Transition(https://arxiv.org/abs/2502.18802)
Keywords: language model
Abstract: LMs' alignment with human reading behavior (i.e. psychometric predictive power; PPP) is known to improve during pretraining up to a tipping point, beyond which it either plateaus or degrades. Various factors, such as word frequency, recency bias in attention, and context size, have been theorized to affect PPP, yet there is no current account that explains why such a tipping point exists, and how it interacts with LMs' pretraining dynamics more generally. We hypothesize that the underlying factor is a pretraining phase transition, characterized by the rapid emergence of specialized attention heads. We conduct a series of correlational and causal experiments to show that such a phase transition is responsible for the tipping point in PPP. We then show that, rather than producing attention patterns that contribute to the degradation in PPP, phase transitions alter the subsequent learning dynamics of the model, such that further training keeps damaging PPP.

Title: Judge as A Judge: Improving the Evaluation of Retrieval-Augmented Generation through the Judge-Consistency of Large Language Models

Authors: Shuliang Liu, Xinze Li, Zhenghao Liu, Yukun Yan, Cheng Yang, Zheni Zeng, Zhiyuan Liu, Maosong Sun, Ge Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.18817
Pdf URL: https://arxiv.org/pdf/2502.18817
Copy Paste: [[2502.18817]] Judge as A Judge: Improving the Evaluation of Retrieval-Augmented Generation through the Judge-Consistency of Large Language Models(https://arxiv.org/abs/2502.18817)
Keywords: language model, llm, hallucination, prompt, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) has proven its effectiveness in alleviating hallucinations for Large Language Models (LLMs). However, existing automated evaluation metrics cannot fairly evaluate the outputs generated by RAG models during training and evaluation. LLM-based judgment models provide the potential to produce high-quality judgments, but they are highly sensitive to evaluation prompts, leading to inconsistencies when judging the output of RAG models. This paper introduces the Judge-Consistency (ConsJudge) method, which aims to enhance LLMs to generate more accurate evaluations for RAG models. Specifically, ConsJudge prompts LLMs to generate different judgments based on various combinations of judgment dimensions, utilizes judge-consistency to evaluate these judgments, and selects the accepted and rejected judgments for DPO training. Our experiments show that ConsJudge can effectively provide more accurate judgments for optimizing RAG models across various RAG models and datasets. Further analysis reveals that judgments generated by ConsJudge have a high agreement with the superior LLM. All code is available at this https URL.

Title: Evidence-Driven Marker Extraction for Social Media Suicide Risk Detection

Authors: Carter Adams, Caleb Carter, Jackson Simmons
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.18823
Pdf URL: https://arxiv.org/pdf/2502.18823
Copy Paste: [[2502.18823]] Evidence-Driven Marker Extraction for Social Media Suicide Risk Detection(https://arxiv.org/abs/2502.18823)
Keywords: language model, llm, prompt
Abstract: Early detection of suicide risk from social media text is crucial for timely intervention. While Large Language Models (LLMs) offer promising capabilities in this domain, challenges remain in terms of interpretability and computational efficiency. This paper introduces Evidence-Driven LLM (ED-LLM), a novel approach for clinical marker extraction and suicide risk classification. ED-LLM employs a multi-task learning framework, jointly training a Mistral-7B based model to identify clinical marker spans and classify suicide risk levels. This evidence-driven strategy enhances interpretability by explicitly highlighting textual evidence supporting risk assessments. Evaluated on the CLPsych datasets, ED-LLM demonstrates competitive performance in risk classification and superior capability in clinical marker span identification compared to baselines including fine-tuned LLMs, traditional machine learning, and prompt-based methods. The results highlight the effectiveness of multi-task learning for interpretable and efficient LLM-based suicide risk assessment, paving the way for clinically relevant applications.

Title: Sliding Window Attention Training for Efficient Large Language Models

Authors: Zichuan Fu, Wentao Song, Yejing Wang, Xian Wu, Yefeng Zheng, Yingying Zhang, Derong Xu, Xuetao Wei, Tong Xu, Xiangyu Zhao
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.18845
Pdf URL: https://arxiv.org/pdf/2502.18845
Copy Paste: [[2502.18845]] Sliding Window Attention Training for Efficient Large Language Models(https://arxiv.org/abs/2502.18845)
Keywords: language model, llm
Abstract: Recent advances in transformer-based Large Language Models (LLMs) have demonstrated remarkable capabilities across various tasks. However, their quadratic computational complexity concerning sequence length remains a significant bottleneck for processing long documents. As a result, many efforts like sparse attention and state space models have been proposed to improve the efficiency of LLMs over long sequences. Though effective, these approaches compromise the performance or introduce structural complexity. This calls for a simple yet efficient model that preserves the fundamental Transformer architecture. To this end, we introduce SWAT, which enables efficient long-context handling via Sliding Window Attention Training. This paper first attributes the inefficiency of Transformers to the attention sink phenomenon resulting from the high variance of softmax operation. Then, we replace softmax with the sigmoid function and utilize a balanced ALiBi and Rotary Position Embedding for efficient information compression and retention. Experiments demonstrate that SWAT achieves SOTA performance compared with state-of-the-art linear recurrent architectures on eight benchmarks. Code is available at this https URL.
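The two ingredients named in the abstract, a sliding attention window and sigmoid scores in place of softmax, can be sketched as follows. This is an illustrative single-head version in plain Python; the function name is hypothetical, and the balanced ALiBi and Rotary Position Embedding components are deliberately omitted because the abstract does not specify them.

```python
import math

def sliding_window_sigmoid_attention(q, k, v, window):
    """Causal sliding-window attention with sigmoid weights.

    q, k, v are lists of equal-length vectors (one per position).
    Position i attends to itself and the previous window - 1 positions.
    Weights are independent sigmoids, so they are not normalized to sum to 1.
    """
    dim = len(q[0])
    outputs = []
    for i, q_i in enumerate(q):
        acc = [0.0] * len(v[0])
        for j in range(max(0, i - window + 1), i + 1):
            score = sum(a * b for a, b in zip(q_i, k[j])) / math.sqrt(dim)
            weight = 1.0 / (1.0 + math.exp(-score))
            for t, v_jt in enumerate(v[j]):
                acc[t] += weight * v_jt
        outputs.append(acc)
    return outputs

# Toy input: four positions, two dimensions, window of two.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
out = sliding_window_sigmoid_attention(x, x, x, window=2)
print(len(out), len(out[0]))
```

Because each position reads only a fixed number of keys, the cost grows linearly with sequence length for a fixed window, which is the efficiency argument the abstract makes.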

Title: A Causal Lens for Evaluating Faithfulness Metrics

Authors: Kerem Zaman, Shashank Srivastava
Subjects: cs.CL, cs.AI, cs.LG, stat.ME
Abstract URL: https://arxiv.org/abs/2502.18848
Pdf URL: https://arxiv.org/pdf/2502.18848
Copy Paste: [[2502.18848]] A Causal Lens for Evaluating Faithfulness Metrics(https://arxiv.org/abs/2502.18848)
Keywords: language model, llm, chain-of-thought
Abstract: Large Language Models (LLMs) offer natural language explanations as an alternative to feature attribution methods for model interpretability. However, despite their plausibility, they may not reflect the model's internal reasoning faithfully, which is crucial for understanding the model's true decision-making processes. Although several faithfulness metrics have been proposed, a unified evaluation framework remains absent. To address this gap, we present Causal Diagnosticity, a framework to evaluate faithfulness metrics for natural language explanations. Our framework employs the concept of causal diagnosticity, and uses model-editing methods to generate faithful-unfaithful explanation pairs. Our benchmark includes four tasks: fact-checking, analogy, object counting, and multi-hop reasoning. We evaluate a variety of faithfulness metrics, including post-hoc explanation and chain-of-thought-based methods. We find that all tested faithfulness metrics often fail to surpass a random baseline. Our work underscores the need for improved metrics and more reliable interpretability methods in LLMs.

Title: Learning to Align Multi-Faceted Evaluation: A Unified and Robust Framework

Authors: Kaishuai Xu, Tiezheng Yu, Wenjun Hou, Yi Cheng, Liangyou Li, Xin Jiang, Lifeng Shang, Qun Liu, Wenjie Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.18874
Pdf URL: https://arxiv.org/pdf/2502.18874
Copy Paste: [[2502.18874]] Learning to Align Multi-Faceted Evaluation: A Unified and Robust Framework(https://arxiv.org/abs/2502.18874)
Keywords: language model, gpt, llm
Abstract: Large Language Models (LLMs) are being used more and more extensively for automated evaluation in various scenarios. Previous studies have attempted to fine-tune open-source LLMs to replicate the evaluation explanations and judgments of powerful proprietary models, such as GPT-4. However, these methods are largely limited to text-based analyses under predefined general criteria, resulting in reduced adaptability for unseen instructions and demonstrating instability in evaluating adherence to quantitative and structural constraints. To address these limitations, we propose a novel evaluation framework, ARJudge, that adaptively formulates evaluation criteria and synthesizes both text-based and code-driven analyses to evaluate LLM responses. ARJudge consists of two components: a fine-tuned Analyzer that generates multi-faceted evaluation analyses and a tuning-free Refiner that combines and refines all analyses to make the final judgment. We construct a Composite Analysis Corpus that integrates tasks for evaluation criteria generation alongside text-based and code-driven analysis generation to train the Analyzer. Our results demonstrate that ARJudge outperforms existing fine-tuned evaluators in effectiveness and robustness. Furthermore, it demonstrates the importance of multi-faceted evaluation and code-driven analyses in enhancing evaluation capabilities.

Title: Learning to Generate Structured Output with Schema Reinforcement Learning

Authors: Yaxi Lu, Haolun Li, Xin Cong, Zhong Zhang, Yesai Wu, Yankai Lin, Zhiyuan Liu, Fangming Liu, Maosong Sun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.18878
Pdf URL: https://arxiv.org/pdf/2502.18878
Copy Paste: [[2502.18878]] Learning to Generate Structured Output with Schema Reinforcement Learning(https://arxiv.org/abs/2502.18878)
Keywords: language model, llm
Abstract: This study investigates the structured generation capabilities of large language models (LLMs), focusing on producing valid JSON outputs against a given schema. Despite the widespread use of JSON in integrating language models with programs, there is a lack of comprehensive analysis and benchmarking of these capabilities. We explore various aspects of JSON generation, such as structure understanding, escaping, and natural language description, to determine how to assess and enable LLMs to generate valid responses. Building upon this, we propose SchemaBench, which features around 40K different JSON schemas, to obtain and assess models' abilities in generating valid JSON. We find that the latest LLMs are still struggling to generate a valid JSON string. Moreover, we demonstrate that incorporating reinforcement learning with a Fine-grained Schema Validator can further enhance models' understanding of JSON schema, leading to improved performance. Our models demonstrate significant improvement in both generating JSON outputs and downstream tasks.
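To make the idea of a fine-grained validation signal concrete, here is a small sketch that scores a model output by the fraction of schema checks it passes instead of a single pass/fail bit. The abstract does not describe how the authors' validator works, so the function `schema_reward`, the scoring rule, and the tiny subset of JSON Schema handled here (`required` and flat `properties` types) are all assumptions made for illustration.

```python
import json

# Mapping from a few JSON Schema type names to Python types (subset only).
TYPE_MAP = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "object": dict,
    "array": list,
}

def _type_ok(value, type_name):
    expected = TYPE_MAP.get(type_name)
    if expected is None:
        return True  # unknown or missing type: nothing to check
    if isinstance(value, bool) and type_name in ("integer", "number"):
        return False  # bool is a subclass of int in Python; JSON treats it separately
    return isinstance(value, expected)

def schema_reward(output_text, schema):
    """Return the fraction of checks passed, or 0.0 if the text is not a JSON object."""
    try:
        data = json.loads(output_text)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(data, dict):
        return 0.0
    checks = [key in data for key in schema.get("required", [])]
    for key, spec in schema.get("properties", {}).items():
        if key in data:
            checks.append(_type_ok(data[key], spec.get("type")))
    return sum(checks) / len(checks) if checks else 1.0

schema = {
    "type": "object",
    "required": ["name", "age"],
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
}
print(schema_reward('{"name": "Ada", "age": 36}', schema))
print(schema_reward('{"name": "Ada", "age": "36"}', schema))
```

A graded score like this gives a learning signal even when the output is only partly correct, which is one plausible reason to prefer a fine-grained validator over a binary validity check.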

Title: On Pruning State-Space LLMs

Authors: Tamer Ghattas, Michael Hassid, Roy Schwartz
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2502.18886
Pdf URL: https://arxiv.org/pdf/2502.18886
Copy Paste: [[2502.18886]] On Pruning State-Space LLMs(https://arxiv.org/abs/2502.18886)
Keywords: llm
Abstract: Recent work proposed state-space models (SSMs) as an efficient alternative to transformer-based LLMs. Can these models be pruned to further reduce their computation costs? We adapt several pruning methods to the SSM structure, and apply them to four SSM-based LLMs across multiple tasks. We find that such models are quite robust to some pruning methods (e.g., WANDA), while other methods lead to fast performance degradation.
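For readers unfamiliar with WANDA, the sketch below shows its scoring rule on an ordinary linear layer: each weight is scored by its magnitude times the L2 norm of the matching input feature over a calibration batch, and the lowest-scoring weights in each output row are zeroed. This is the standard formulation for a dense layer, not the SSM adaptation the abstract refers to; the function name and the toy numbers are hypothetical.

```python
import math

def wanda_prune(weight, activations, sparsity):
    """Zero the lowest-scoring weights in each output row.

    weight:      rows are output neurons, columns are input features.
    activations: rows are calibration tokens, columns are input features.
    sparsity:    fraction of weights to remove per row, in [0, 1].
    """
    n_in = len(weight[0])
    # L2 norm of each input feature across the calibration tokens.
    norms = [math.sqrt(sum(row[j] ** 2 for row in activations)) for j in range(n_in)]
    n_prune = int(n_in * sparsity)
    pruned = []
    for w_row in weight:
        scores = [abs(w) * n for w, n in zip(w_row, norms)]
        drop = set(sorted(range(n_in), key=scores.__getitem__)[:n_prune])
        pruned.append([0.0 if j in drop else w for j, w in enumerate(w_row)])
    return pruned

# Toy layer: the largest weight (2.0) sits on a nearly silent input feature.
weight = [[1.0, -0.1, 0.5, 2.0]]
activations = [[0.1, 10.0, 1.0, 0.01], [0.1, 10.0, 1.0, 0.01]]
print(wanda_prune(weight, activations, sparsity=0.5))
```

In the toy example the weight of 2.0 is removed while the weight of -0.1 survives, because the activation norm outweighs raw magnitude; plain magnitude pruning would make the opposite choice.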

Title: From Hours to Minutes: Lossless Acceleration of Ultra Long Sequence Generation up to 100K Tokens

Authors: Tong Wu, Junzhe Shen, Zixia Jia, Yuxuan Wang, Zilong Zheng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.18890
Pdf URL: https://arxiv.org/pdf/2502.18890
Copy Paste: [[2502.18890]] From Hours to Minutes: Lossless Acceleration of Ultra Long Sequence Generation up to 100K Tokens(https://arxiv.org/abs/2502.18890)
Keywords: language model, llm
Abstract: Generating ultra-long sequences with large language models (LLMs) has become increasingly crucial but remains a highly time-intensive task, particularly for sequences up to 100K tokens. While traditional speculative decoding methods exist, simply extending their generation limits fails to accelerate the process and can be detrimental. Through an in-depth analysis, we identify three major challenges hindering efficient generation: frequent model reloading, dynamic key-value (KV) management and repetitive generation. To address these issues, we introduce TOKENSWIFT, a novel framework designed to substantially accelerate the generation process of ultra-long sequences while maintaining the target model's inherent quality. Experimental results demonstrate that TOKENSWIFT achieves over 3 times speedup across models of varying scales (1.5B, 7B, 8B, 14B) and architectures (MHA, GQA). This acceleration translates to hours of time savings for ultra-long sequence generation, establishing TOKENSWIFT as a scalable and effective solution at unprecedented lengths. Code can be found at this https URL.

Title: END: Early Noise Dropping for Efficient and Effective Context Denoising

Authors: Hongye Jin, Pei Chen, Jingfeng Yang, Zhengyang Wang, Meng Jiang, Yifan Gao, Binxuan Huang, Xinyang Zhang, Zheng Li, Tianyi Liu, Huasheng Li, Bing Yin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.18915
Pdf URL: https://arxiv.org/pdf/2502.18915
Copy Paste: [[2502.18915]] END: Early Noise Dropping for Efficient and Effective Context Denoising(https://arxiv.org/abs/2502.18915)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Large Language Models (LLMs) have demonstrated remarkable performance across a wide range of natural language processing tasks. However, they are often distracted by irrelevant or noisy context in input sequences that degrades output quality. This problem affects both long- and short-context scenarios, such as retrieval-augmented generation, table question-answering, and in-context learning. We reveal that LLMs can implicitly identify whether input sequences contain useful information at early layers, prior to token generation. Leveraging this insight, we introduce Early Noise Dropping (END), a novel approach to mitigate this issue without fine-tuning the LLMs. END segments input sequences into chunks and employs a linear prober on the early layers of LLMs to differentiate between informative and noisy chunks. By discarding noisy chunks early in the process, END preserves critical information, reduces distraction, and lowers computational overhead. Extensive experiments demonstrate that END significantly improves both performance and efficiency across different LLMs on multiple evaluation datasets. Furthermore, by investigating LLMs' implicit understanding of the input with the prober, this work also deepens understanding of how LLMs reason with contexts internally.
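The chunk-then-probe step described in the abstract can be sketched as a logistic probe applied to one feature vector per chunk. Everything concrete here is an assumption for illustration: the function name, the use of a single pooled feature vector as a stand-in for early-layer hidden states, the 0.5 threshold, and the toy probe parameters, which in practice would be learned.

```python
import math

def drop_noisy_chunks(chunks, chunk_features, probe_weights, probe_bias, threshold=0.5):
    """Keep chunks whose probe probability of being informative meets the threshold.

    chunk_features holds one vector per chunk and stands in for hidden states
    taken from an early layer of the model (hypothetical here).
    """
    kept = []
    for chunk, features in zip(chunks, chunk_features):
        logit = sum(w * x for w, x in zip(probe_weights, features)) + probe_bias
        prob_informative = 1.0 / (1.0 + math.exp(-logit))
        if prob_informative >= threshold:
            kept.append(chunk)
    return kept

# Toy example with made-up features and probe parameters.
chunks = ["relevant passage", "unrelated passage", "partly relevant passage"]
features = [[2.0, 0.0], [-2.0, 0.0], [0.5, 0.5]]
print(drop_noisy_chunks(chunks, features, probe_weights=[1.0, 1.0], probe_bias=0.0))
```

Only the surviving chunks would be passed on to the remaining layers, which is where the reduction in computational overhead mentioned in the abstract would come from.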

Title: Kanana: Compute-efficient Bilingual Language Models

Authors: Kanana LLM Team: Yunju Bak, Hojin Lee, Minho Ryu, Jiyeon Ham, Seungjae Jung, Daniel Wontae Nam, Taegyeong Eo, Donghun Lee, Doohae Jung, Boseop Kim, Nayeon Kim, Jaesun Park, Hyunho Kim, Hyunwoong Ko, Changmin Lee, Kyoung-Woon On, Seulye Baeg, Junrae Cho, Sunghee Jung, Jieun Kang, EungGyun Kim, Eunhwa Kim, Byeongil Ko, Daniel Lee, Minchul Lee, Miok Lee, Shinbok Lee, Gaeun Seo
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2502.18934
Pdf URL: https://arxiv.org/pdf/2502.18934
Copy Paste: [[2502.18934]] Kanana: Compute-efficient Bilingual Language Models(https://arxiv.org/abs/2502.18934)
Keywords: language model, retrieval augmented generation
Abstract: We introduce Kanana, a series of bilingual language models that demonstrate exceeding performance in Korean and competitive performance in English. The computational cost of Kanana is significantly lower than that of state-of-the-art models of similar size. The report details the techniques employed during pre-training to achieve compute-efficient yet competitive models, including high quality data filtering, staged pre-training, depth up-scaling, and pruning and distillation. Furthermore, the report outlines the methodologies utilized during the post-training of the Kanana models, encompassing supervised fine-tuning and preference optimization, aimed at enhancing their capability for seamless interaction with users. Lastly, the report elaborates on plausible approaches used for language model adaptation to specific scenarios, such as embedding, retrieval augmented generation, and function calling. The Kanana model series spans from 2.1B to 32.5B parameters with 2.1B models (base, instruct, embedding) publicly released to promote research on Korean language models.

Title: JailBench: A Comprehensive Chinese Security Assessment Benchmark for Large Language Models

Authors: Shuyi Liu, Simiao Cui, Haoran Bu, Yuming Shang, Xi Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.18935
Pdf URL: https://arxiv.org/pdf/2502.18935
Copy Paste: [[2502.18935]] JailBench: A Comprehensive Chinese Security Assessment Benchmark for Large Language Models(https://arxiv.org/abs/2502.18935)
Keywords: language model, gpt, llm, prompt, chat
Abstract: Large language models (LLMs) have demonstrated remarkable capabilities across various applications, highlighting the urgent need for comprehensive safety evaluations. In particular, the enhanced Chinese language proficiency of LLMs, combined with the unique characteristics and complexity of Chinese expressions, has driven the emergence of Chinese-specific benchmarks for safety assessment. However, these benchmarks generally fall short in effectively exposing LLM safety vulnerabilities. To address the gap, we introduce JailBench, the first comprehensive Chinese benchmark for evaluating deep-seated vulnerabilities in LLMs, featuring a refined hierarchical safety taxonomy tailored to the Chinese context. To improve generation efficiency, we employ a novel Automatic Jailbreak Prompt Engineer (AJPE) framework for JailBench construction, which incorporates jailbreak techniques to enhance assessing effectiveness and leverages LLMs to automatically scale up the dataset through context-learning. The proposed JailBench is extensively evaluated over 13 mainstream LLMs and achieves the highest attack success rate against ChatGPT compared to existing Chinese benchmarks, underscoring its efficacy in identifying latent vulnerabilities in LLMs, as well as illustrating the substantial room for improvement in the security and trustworthiness of LLMs within the Chinese context. Our benchmark is publicly available at this https URL.

Title: MathTutorBench: A Benchmark for Measuring Open-ended Pedagogical Capabilities of LLM Tutors

Authors: Jakub Macina, Nico Daheim, Ido Hakimi, Manu Kapur, Iryna Gurevych, Mrinmaya Sachan
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.18940
Pdf URL: https://arxiv.org/pdf/2502.18940
Copy Paste: [[2502.18940]] MathTutorBench: A Benchmark for Measuring Open-ended Pedagogical Capabilities of LLM Tutors(https://arxiv.org/abs/2502.18940)
Keywords: llm
Abstract: Evaluating the pedagogical capabilities of AI-based tutoring models is critical for making guided progress in the field. Yet, we lack a reliable, easy-to-use, and simple-to-run evaluation that reflects the pedagogical abilities of models. To fill this gap, we present MathTutorBench, an open-source benchmark for holistic tutoring model evaluation. MathTutorBench contains a collection of datasets and metrics that broadly cover tutor abilities as defined by learning sciences research in dialog-based teaching. To score the pedagogical quality of open-ended teacher responses, we train a reward model and show it can discriminate expert from novice teacher responses with high accuracy. We evaluate a wide set of closed- and open-weight models on MathTutorBench and find that subject expertise, indicated by solving ability, does not immediately translate to good teaching. Rather, pedagogy and subject expertise appear to form a trade-off that is navigated by the degree of tutoring specialization of the model. Furthermore, tutoring appears to become more challenging in longer dialogs, where simpler questioning strategies begin to fail. We release the benchmark, code, and leaderboard openly to enable rapid benchmarking of future models.

Title: Know You First and Be You Better: Modeling Human-Like User Simulators via Implicit Profiles

Authors: Kuang Wang, Xianfei Li, Shenghao Yang, Li Zhou, Feng Jiang, Haizhou Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.18968
Pdf URL: https://arxiv.org/pdf/2502.18968
Copy Paste: [[2502.18968]] Know You First and Be You Better: Modeling Human-Like User Simulators via Implicit Profiles(https://arxiv.org/abs/2502.18968)
Keywords: language model, llm
Abstract: User simulators are crucial for replicating human interactions with dialogue systems, supporting both collaborative training and automatic evaluation, especially for large language models (LLMs). However, existing simulators often rely solely on text utterances, missing implicit user traits such as personality, speaking style, and goals. In contrast, persona-based methods lack generalizability, as they depend on predefined profiles of famous individuals or archetypes. To address these challenges, we propose User Simulator with implicit Profiles (USP), a framework that infers implicit user profiles from human-machine conversations and uses them to generate more personalized and realistic dialogues. We first develop an LLM-driven extractor with a comprehensive profile schema. Then, we refine the simulation through conditional supervised fine-tuning and reinforcement learning with cycle consistency, optimizing it at both the utterance and conversation levels. Finally, we adopt a diverse profile sampler to capture the distribution of real-world user profiles. Experimental results demonstrate that USP outperforms strong baselines in terms of authenticity and diversity while achieving comparable performance in consistency. Furthermore, dynamic multi-turn evaluations based on USP strongly align with mainstream benchmarks, demonstrating its effectiveness in real-world applications.

Title: Low-Confidence Gold: Refining Low-Confidence Samples for Efficient Instruction Tuning

Authors: Hongyi Cai, Jie Li, Wenzhen Dong
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.18978
Pdf URL: https://arxiv.org/pdf/2502.18978
Copy Paste: [[2502.18978]] Low-Confidence Gold: Refining Low-Confidence Samples for Efficient Instruction Tuning(https://arxiv.org/abs/2502.18978)
Keywords: language model
Abstract: The effectiveness of instruction fine-tuning for Large Language Models is fundamentally constrained by the quality and efficiency of training datasets. This work introduces Low-Confidence Gold (LCG), a novel filtering framework that employs centroid-based clustering and confidence-guided selection for identifying valuable instruction pairs. Through a semi-supervised approach using a lightweight classifier trained on representative samples, LCG curates high-quality subsets while preserving data diversity. Experimental evaluation demonstrates that models fine-tuned on LCG-filtered subsets of 6K samples achieve superior performance compared to existing methods, with substantial improvements on MT-bench and consistent gains across comprehensive evaluation metrics. The framework's efficacy while maintaining model performance establishes a promising direction for efficient instruction tuning.

Title: PEToolLLM: Towards Personalized Tool Learning in Large Language Models

Authors: Qiancheng Xu, Yongqi Li, Heming Xia, Fan Liu, Min Yang, Wenjie Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.18980
Pdf URL: https://arxiv.org/pdf/2502.18980
Copy Paste: [[2502.18980]] PEToolLLM: Towards Personalized Tool Learning in Large Language Models(https://arxiv.org/abs/2502.18980)
Keywords: language model, llm
Abstract: Tool learning has emerged as a promising direction by extending Large Language Models' (LLMs) capabilities with external tools. Existing tool learning studies primarily focus on the general-purpose tool-use capability, which addresses explicit user requirements in instructions. However, they overlook the importance of personalized tool-use capability, leading to an inability to handle implicit user preferences. To address this limitation, we first formulate the task of personalized tool learning, which integrates users' interaction history for personalized tool usage. To fill the gap of missing benchmarks, we construct PEToolBench, featuring diverse user preferences reflected in interaction history under three distinct personalized settings, and encompassing a wide range of tool-use scenarios. Moreover, we propose a framework, PEToolLLaMA, to adapt LLMs to the personalized tool learning task, which is trained through supervised fine-tuning and direct preference optimization. Extensive experiments on PEToolBench demonstrate the superiority of PEToolLLaMA over existing LLMs.

Title: GenTool: Enhancing Tool Generalization in Language Models through Zero-to-One and Weak-to-Strong Simulation

Authors: Jie He, Jennifer Neville, Mengting Wan, Longqi Yang, Hui Liu, Xiaofeng Xu, Xia Song, Jeff Z. Pan, Pei Zhou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.18990
Pdf URL: https://arxiv.org/pdf/2502.18990
Copy Paste: [[2502.18990]] GenTool: Enhancing Tool Generalization in Language Models through Zero-to-One and Weak-to-Strong Simulation(https://arxiv.org/abs/2502.18990)
Keywords: language model, gpt, llm
Abstract: Large Language Models (LLMs) can enhance their capabilities as AI assistants by integrating external tools, allowing them to access a wider range of information. While recent LLMs are typically fine-tuned with tool usage examples during supervised fine-tuning (SFT), questions remain about whether they can develop robust tool-usage skills and effectively generalize to unseen queries and tools. In this work, we present GenTool, a novel training framework that prepares LLMs for diverse generalization challenges in tool utilization. Our approach addresses two fundamental dimensions critical for real-world applications: Zero-to-One Generalization, enabling the model to address queries initially lacking a suitable tool by adopting and utilizing one when it becomes available, and Weak-to-Strong Generalization, allowing models to leverage enhanced versions of existing tools to solve queries. To achieve this, we develop synthetic training data simulating these two dimensions of tool usage and introduce a two-stage fine-tuning approach: optimizing tool ranking, then refining tool selection. Through extensive experiments across four generalization scenarios, we demonstrate that our method significantly enhances the tool-usage capabilities of LLMs ranging from 1B to 8B parameters, achieving performance that surpasses GPT-4o. Furthermore, our analysis also provides valuable insights into the challenges LLMs encounter in tool generalization.
摘要：大型语言模型（LLMS）可以通过集成外部工具，使他们访问更广泛的信息来增强他们作为AI助手的能力。虽然最近的LLM通常在监督微调（SFT）期间对工具使用示例进行微调，但有关其发展强大的工具使用技能的能力的问题仍然存在，并且可以有效地概括以看不见的查询和工具。在这项工作中，我们提出了Gentool，这是一个新颖的培训框架，为工具利用中的各种泛化挑战做准备。我们的方法解决了对现实世界应用至关重要的两个基本尺寸：零对一个概括，使该模型能够通过在可用时采用和利用一个方法来解决最初缺少合适工具的查询，并使用该工具，以及较弱的概括，使模型能够利用现有工具的增强版本来解决Querve Queries。为了实现这一目标，我们开发了模拟工具使用的这两个维度的合成训练数据，并引入了两阶段的微调方法：优化工具排名，然后提炼工具选择。通过在四种概括方案中进行的广泛实验，我们证明我们的方法显着增强了LLMS的工具使用能力，范围从1B到8B参数，实现了超过GPT-4O的性能。此外，我们的分析还为LLMS在工具概括中遇到的挑战提供了宝贵的见解。

Title: MEBench: Benchmarking Large Language Models for Cross-Document Multi-Entity Question Answering

Authors: Teng Lin
Subjects: cs.CL, cs.DB
Abstract URL: https://arxiv.org/abs/2502.18993
Pdf URL: https://arxiv.org/pdf/2502.18993
Copy Paste: [[2502.18993]] MEBench: Benchmarking Large Language Models for Cross-Document Multi-Entity Question Answering(https://arxiv.org/abs/2502.18993)
Keywords: language model, gpt, llm, retrieval-augmented generation
Abstract: Multi-entity question answering (MEQA) poses significant challenges for large language models (LLMs) and retrieval-augmented generation (RAG) systems, which frequently struggle to consolidate scattered information across diverse documents. While existing methods excel at single-document comprehension, they often struggle with cross-document aggregation, particularly when resolving entity-dense questions like "What is the distribution of ACM Fellows among various fields of study?", which require integrating entity-centric insights from heterogeneous sources (e.g., Wikipedia pages). To address this gap, we introduce MEBench, a novel multi-document, multi-entity benchmark designed to systematically evaluate LLMs' capacity to retrieve, consolidate, and reason over fragmented information. Our benchmark comprises 4,780 questions systematically categorized into three primary categories, further divided into eight distinct types, ensuring broad coverage of real-world multi-entity reasoning scenarios. Our experiments on state-of-the-art LLMs (e.g., GPT-4, Llama-3) and RAG pipelines reveal critical limitations: even advanced models achieve only 59% accuracy on MEBench. Our benchmark emphasizes the importance of completeness and factual precision of information extraction in MEQA tasks, using the Entity-Attributed F1 (EA-F1) metric for granular evaluation of entity-level correctness and attribution validity. MEBench not only highlights systemic weaknesses in current LLM frameworks but also provides a foundation for advancing robust, entity-aware QA architectures.
摘要：多实体问答（MEQA）代表了大型语言模型（LLM）和检索型发电（RAG）系统的重大挑战，这些系统经常努力巩固各种文档跨不同文档的分散信息。尽管现有的方法在单案纪录理解方面表现出色，但他们经常在跨文档的聚合中挣扎，尤其是在解决实体密集的问题时，例如“在各个研究领域中ACM研究员的分布是什么？”，哪些需要整合以实体为中心的洞察力（例如，Wikipedia页面）。为了解决这一差距，我们介绍了Mebench，这是一种新型的多文章，多实体基准，旨在系统地评估LLMS检索，合并和理由而不是碎片信息的能力。我们的基准分别包括4,780个问题，这些问题被系统地分为三个类别，进一步分为八种不同的类型，确保了对现实世界中多性推理方案的广泛报道。我们对最先进的LLM（例如GPT-4，Llama-3）和RAG管道的实验揭示了关键的局限性：即使是高级模型，也只能在MEBENCH上获得59％的精度。我们的基准强调了MEQA任务中完整性和信息提取的事实精度的重要性，使用实体归属的F1（EA-F1）度量，用于实体级别的正确性和归因有效性的粒状评估。 Mebench不仅强调了当前LLM框架中的系统性弱点，而且还为推进强大的实体意识QA架构提供了基础。

Title: Binary Neural Networks for Large Language Model: A Survey

Authors: Liangdong Liu, Zhitong Zheng, Cong Wang, Tianhuang Su, Zhenyu Yang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.19008
Pdf URL: https://arxiv.org/pdf/2502.19008
Copy Paste: [[2502.19008]] Binary Neural Networks for Large Language Model: A Survey(https://arxiv.org/abs/2502.19008)
Keywords: language model, gpt, llm
Abstract: Large language models (LLMs) such as GPT-4 and Llama have wide applications in the field of natural language processing (NLP). However, with the exponential growth of model parameter sizes, LLMs bring significant resource overheads. Low-bit quantization, as a key technique, reduces memory usage and computational demands by decreasing the bit-width of model parameters, activations, and gradients. Previous quantization methods for LLMs have largely employed Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT). PTQ does not require any retraining of the original model, while QAT involves optimizing precision during training to achieve the best quantization parameters. The BitNet team proposed a radically different approach, where quantization is performed from the start of model training, utilizing low-precision binary weights during the training process. This approach has led to the emergence of many binary quantization techniques for large language models. This paper provides a comprehensive review of these binary quantization techniques. Specifically, we introduce binary quantization techniques in deep neural networks and further explore their application to LLMs, reviewing their various contributions, implementations, and applications.
摘要：大型语言模型（LLM）在自然语言处理（NLP）（例如GPT-4和Llama）领域中具有广泛的应用。但是，随着模型参数大小的指数增长，LLMS带来了重要的资源开销。低位量化是一种关键技术，可通过降低模型参数，激活和梯度的位宽度来减少内存使用和计算需求。 LLM的先前量化方法在很大程度上采用了训练后量化（PTQ）和量化感知培训（QAT）。 PTQ不需要对原始模型进行任何重新培训，而QAT涉及在训练过程中优化精度以实现最佳量化参数。 Bitnet团队提出了一种根本不同的方法，其中从模型训练开始进行量化，在训练过程中利用低精确的二进制重量。这种方法导致了大型语言模型的许多二进制量化技术的出现。本文对这些二进制量化技术进行了全面综述。具体来说，我们将在深层神经网络中介绍二进制量化技术，并进一步探索其对LLM的应用，以审查其各种贡献，实现和应用程序。
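
Note (illustration, not from the paper): the binary weights mentioned in the abstract can be pictured with a minimal sketch of sign-based binarization with a single scaling factor, in the style popularized by XNOR-Net. Whether any surveyed method, BitNet included, uses exactly this scheme is an assumption; the function names `binarize` and `dequantize` are hypothetical.

```python
def binarize(weights):
    """Map each weight to +1 or -1 and keep one scale, alpha = mean(|w|)."""
    if not weights:
        raise ValueError("weights must be non-empty")
    alpha = sum(abs(w) for w in weights) / len(weights)
    signs = [1 if w >= 0 else -1 for w in weights]
    return signs, alpha


def dequantize(signs, alpha):
    """Reconstruct approximate full-precision weights from signs and scale."""
    return [alpha * s for s in signs]


# Toy weight vector (made-up values, for illustration only).
weights = [0.5, -1.5, 0.25, -0.75]
signs, alpha = binarize(weights)
approx = dequantize(signs, alpha)
```

Each weight then needs one bit plus a shared floating-point scale, which is where the memory saving comes from; the cost is the reconstruction error between `weights` and `approx`.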

Title: MathClean: A Benchmark for Synthetic Mathematical Data Cleaning

Authors: Hao Liang, Meiyi Qiang, Yuying Li, Zefeng He, Yongzhen Guo, Zhengzhou Zhu, Wentao Zhang, Bin Cui
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.19058
Pdf URL: https://arxiv.org/pdf/2502.19058
Copy Paste: [[2502.19058]] MathClean: A Benchmark for Synthetic Mathematical Data Cleaning(https://arxiv.org/abs/2502.19058)
Keywords: language model, gpt, llm
Abstract: With the rapid development of large language models (LLMs), the quality of training data has become crucial. Among the various types of training data, mathematical data plays a key role in enabling LLMs to acquire strong reasoning abilities. While high-quality open-source data is important, it is often insufficient for pre-training, necessitating the addition of synthetic math problems. However, synthetic math questions and answers can introduce inaccuracies, which may degrade both the training data and web data. Therefore, an effective method for cleaning synthetic math data is essential. In this paper, we propose the MathClean benchmark to evaluate the effectiveness of math data cleaning models. The MathClean benchmark consists of 2,000 correct questions and 2,000 erroneous questions, with an additional 2,000 correct and erroneous answers sourced from augmented data based on GSM8K and MATH. Moreover, we annotate error types for each question or answer, so that the benchmark can assess whether models correctly identify the error categories, for future improvements. Finally, we present comprehensive evaluations using state-of-the-art (SOTA) models. Our results demonstrate that even strong models like GPT-o1 and DeepSeek-R1 perform poorly on this benchmark, highlighting the utility of MathClean. Our code and data are available at this https URL.
摘要：随着大语言模型（LLM）的快速发展，培训数据的质量变得至关重要。在各种类型的培训数据中，数学数据在使LLMS能够获得强大的推理能力方面起着关键作用。尽管高质量的开源数据很重要，但通常不足以进行预训练，因此需要增加合成数学问题。但是，合成数学问题和答案可能会引入不准确性，这可能会降低培训数据和Web数据。因此，一种清洁合成数学数据的有效方法至关重要。在本文中，我们提出了MathClean基准测试，以评估数学数据清洁模型的有效性。 MathClean基准测试由2,000个正确的问题和2,000个错误问题组成，另外2,000个正确和错误的答案来自基于GSM8K和数学的增强数据。此外，我们还注释每个问题或答案的错误类型，因为它可以评估模型是否可以正确识别未来改进的错误类别。最后，我们使用最先进（SOTA）模型进行了全面的评估。我们的结果表明，即使像GPT-O1和DeepSeek-R1这样的强大模型在此基准上的表现差，强调了MathClean的实用性。我们的代码和数据可在此HTTPS URL上找到。

Title: Can Large Language Models Outperform Non-Experts in Poetry Evaluation? A Comparative Study Using the Consensual Assessment Technique

Authors: Piotr Sawicki, Marek Grześ, Dan Brown, Fabrício Góes
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.19064
Pdf URL: https://arxiv.org/pdf/2502.19064
Copy Paste: [[2502.19064]] Can Large Language Models Outperform Non-Experts in Poetry Evaluation? A Comparative Study Using the Consensual Assessment Technique(https://arxiv.org/abs/2502.19064)
Keywords: language model, gpt, llm
Abstract: The Consensual Assessment Technique (CAT) evaluates creativity through holistic expert judgments. We investigate the use of two advanced Large Language Models (LLMs), Claude-3-Opus and GPT-4o, to evaluate poetry using a methodology inspired by the CAT. Using a dataset of 90 poems, we found that these LLMs can surpass the results achieved by non-expert human judges at matching a ground truth based on publication venue, particularly when assessing smaller subsets of poems. Claude-3-Opus performed slightly better than GPT-4o. We show that LLMs are viable tools for accurately assessing poetry, paving the way for their broader application in other creative domains.
摘要：共识评估技术（CAT）通过整体专家判断评估创造力。我们研究了两个先进的大型语言模型（LLMS），Claude-3-Opus和GPT-4O，以通过猫启发的方法来评估诗歌。使用90首诗的数据集，我们发现这些LLM可以超过非专家人类法官在基于出版物场地的地面真理匹配的非专家法官所取得的结果，尤其是在评估较小的诗歌子集时。 Claude-3-Opus的性能比GPT-4O表现出色。我们表明，LLM是准确评估诗歌的可行工具，为其在其他创意领域的更广泛应用铺平了道路。

Title: Improving the quality of Web-mined Parallel Corpora of Low-Resource Languages using Debiasing Heuristics

Authors: Aloka Fernando, Surangika Ranathunga, Nisansa de Silva
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.19074
Pdf URL: https://arxiv.org/pdf/2502.19074
Copy Paste: [[2502.19074]] Improving the quality of Web-mined Parallel Corpora of Low-Resource Languages using Debiasing Heuristics(https://arxiv.org/abs/2502.19074)
Keywords: language model
Abstract: Parallel Data Curation (PDC) techniques aim to filter out noisy parallel sentences from web-mined corpora. Prior research has demonstrated that ranking sentence pairs using similarity scores on sentence embeddings derived from Pre-trained Multilingual Language Models (multiPLMs), and training NMT systems with the top-ranked samples, yields better NMT performance than training on the full dataset. However, previous research has also shown that the choice of multiPLM significantly impacts the ranking quality. This paper investigates the reasons behind this disparity across multiPLMs. Using the web-mined corpora CCMatrix and CCAligned for En→Si, En→Ta and Si→Ta, we show that different multiPLMs (LASER3, XLM-R, and LaBSE) are biased towards certain types of sentences, which allows noisy sentences to creep into the top-ranked samples. We show that by employing a series of heuristics, this noise can be removed to a certain extent. This improves the results of NMT systems trained with web-mined corpora and reduces the disparity across multiPLMs.
摘要：并行数据策划（PDC）技术旨在从网络挖掘的语料库中滤除带噪声的平行句对。先前的研究表明，利用预训练多语言模型（multiPLM）得到的句子嵌入计算相似度分数，对句对进行排序，并用排名靠前的样本训练NMT系统，其性能优于使用完整数据集训练的系统。然而，先前的研究也表明，multiPLM的选择会显著影响排序质量。本文研究了不同multiPLM之间产生这种差异的原因。我们使用网络挖掘语料库CCMatrix和CCAligned，针对En→Si、En→Ta和Si→Ta，表明不同的multiPLM（LASER3、XLM-R和LaBSE）偏向某些类型的句子，使带噪声的句子混入排名靠前的样本。我们表明，通过采用一系列启发式方法，可以在一定程度上消除这种噪声。这改善了用网络挖掘语料库训练的NMT系统的结果，并缩小了不同multiPLM之间的差异。
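
Note (illustration, not from the paper): the similarity-based ranking step described in the abstract can be sketched as follows, assuming sentence embeddings have already been computed by some multiPLM. The toy 2-dimensional vectors and the names `cosine` and `top_k_pairs` are hypothetical, and the paper's debiasing heuristics are not reproduced here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k_pairs(pairs, k):
    """Rank (pair_id, src_vec, tgt_vec) triples by similarity; keep the best k ids."""
    scored = [(cosine(src, tgt), pair_id) for pair_id, src, tgt in pairs]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [pair_id for _, pair_id in scored[:k]]


# Toy embeddings standing in for multiPLM sentence vectors.
pairs = [
    ("a", [1.0, 0.0], [1.0, 0.0]),  # well-aligned pair
    ("b", [1.0, 0.0], [0.0, 1.0]),  # unrelated pair
    ("c", [1.0, 1.0], [1.0, 0.0]),  # partially aligned pair
]
selected = top_k_pairs(pairs, 2)
```

The abstract's point is that this score alone can let noisy pairs through, since each multiPLM favors certain sentence types; any extra filtering would be applied before or after the ranking shown here.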

Title: Sparse Brains are Also Adaptive Brains: Cognitive-Load-Aware Dynamic Activation for LLMs

Authors: Yiheng Yang, Yujie Wang, Chi Ma, Lei Yu, Emmanuele Chersoni, Chu-Ren Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.19078
Pdf URL: https://arxiv.org/pdf/2502.19078
Copy Paste: [[2502.19078]] Sparse Brains are Also Adaptive Brains: Cognitive-Load-Aware Dynamic Activation for LLMs(https://arxiv.org/abs/2502.19078)
Keywords: language model, llm
Abstract: Dense large language models (LLMs) face critical efficiency bottlenecks as they rigidly activate all parameters regardless of input complexity. While existing sparsity methods (static pruning or dynamic activation) address this partially, they either lack adaptivity to contextual or model structural demands or incur prohibitive computational overhead. Inspired by the human brain's dual-process mechanisms - predictive coding (N400) for backbone sparsity and structural reanalysis (P600) for complex context - we propose CLADA, a Cognitive-Load-Aware Dynamic Activation framework that synergizes statistical sparsity with semantic adaptability. Our key insight is that LLM activations exhibit two complementary patterns: 1) global statistical sparsity driven by sequence-level prefix information, and 2) local semantic adaptability modulated by cognitive load metrics (e.g., surprisal and entropy). CLADA employs a hierarchical thresholding strategy: a baseline from offline error-controlled optimization ensures 40%+ sparsity, dynamically adjusted by real-time cognitive signals. Evaluations across six mainstream LLMs and nine benchmarks demonstrate that CLADA achieves ~20% average speedup with <2% accuracy drop, outperforming Griffin (5%+ degradation) and TT (negligible speedup). Crucially, we establish the first formal connection between neurolinguistic event-related potential (ERP) components and LLM efficiency mechanisms through multi-level regression analysis (R^2 = 0.17 for sparsity-adaptation synergy). Requiring no retraining or architectural changes, CLADA offers a deployable solution for resource-aware LLM inference while advancing biologically-inspired AI design. Our code is available at this https URL.
摘要：密集的大型语言模型（LLM）面临严重的效率瓶颈，因为无论输入复杂度如何，它们都会僵化地激活所有参数。现有的稀疏方法（静态剪枝或动态激活）只能部分解决这一问题：它们要么缺乏对上下文或模型结构需求的适应性，要么带来过高的计算开销。受人脑双重加工机制的启发（用于骨干稀疏性的预测编码（N400）和用于复杂上下文的结构再分析（P600）），我们提出CLADA，即认知负荷感知动态激活（Cognitive-Load-Aware Dynamic Activation）框架，将统计稀疏性与语义适应性相结合。我们的关键见解是，LLM激活表现出两种互补模式：1）由序列级前缀信息驱动的全局统计稀疏性；2）由认知负荷指标（例如惊异度和熵）调节的局部语义适应性。CLADA采用分层阈值策略：由离线误差控制优化得到的基线可确保40%以上的稀疏性，并由实时认知信号动态调整。对六个主流LLM和九个基准的评估表明，CLADA平均加速约20%，准确率下降不到2%，优于Griffin（5%以上的性能下降）和TT（加速可忽略）。至关重要的是，我们通过多层回归分析（稀疏性与适应性协同的R^2 = 0.17）首次建立了神经语言学事件相关电位（ERP）成分与LLM效率机制之间的正式联系。CLADA无需重新训练或更改架构，为资源感知的LLM推理提供了可部署的解决方案，同时推进了受生物学启发的AI设计。我们的代码可在此 https URL 获取。
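
Note (illustration, not from the paper): the two cognitive load metrics named in the abstract, surprisal and entropy, have standard information-theoretic definitions, sketched below in bits. How CLADA turns these signals into threshold adjustments is not stated in the abstract and is not modeled here; the probability values are made up.

```python
import math


def surprisal(prob):
    """Surprisal of an observed token with model probability `prob`, in bits."""
    if not 0.0 < prob <= 1.0:
        raise ValueError("prob must be in (0, 1]")
    return -math.log2(prob)


def entropy(dist):
    """Shannon entropy of a next-token distribution, in bits."""
    return -sum(p * math.log2(p) for p in dist if p > 0)


# Toy next-token distribution over three candidate tokens.
next_token_dist = [0.5, 0.25, 0.25]
h = entropy(next_token_dist)
s = surprisal(0.25)  # surprisal if one of the 0.25-probability tokens is observed
```

High surprisal means the observed token was unexpected; high entropy means the model was uncertain before seeing it. Both are cheap to compute from the output distribution at each decoding step.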

Title: LongEval: A Comprehensive Analysis of Long-Text Generation Through a Plan-based Paradigm

Authors: Siwei Wu, Yizhi Li, Xingwei Qu, Rishi Ravikumar, Yucheng Li, Tyler Loakman, Shanghaoran Quan, Xiaoyong Wei, Riza Batista-Navarro, Chenghua Lin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.19103
Pdf URL: https://arxiv.org/pdf/2502.19103
Copy Paste: [[2502.19103]] LongEval: A Comprehensive Analysis of Long-Text Generation Through a Plan-based Paradigm(https://arxiv.org/abs/2502.19103)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have achieved remarkable success in various natural language processing tasks, yet their ability to generate long-form content remains poorly understood and evaluated. Our analysis reveals that current LLMs struggle with length requirements and information density in long-text generation, with performance deteriorating as text length increases. To quantitatively locate such performance degradation and provide further insights on model development, we present LongEval, a benchmark that evaluates long-text generation through both direct and plan-based generation paradigms, inspired by cognitive and linguistic writing models. The comprehensive experiments in this work reveal interesting findings: for example, while model size correlates with generation ability, a small-scale model (e.g., LongWriter) that is well-trained on long texts has comparable performance. All code and datasets are released at this https URL.
摘要：大型语言模型（LLMS）在各种自然语言处理任务中取得了巨大的成功，但它们产生的长格式内容的能力仍然很少了解和评估。我们的分析表明，当前的LLM与长文本生成的长度要求和信息密度斗争，随着文本长度的增加，性能恶化。为了定量定位这种性能降级并提供了对模型开发的进一步见解，我们提出了长寿，这是一种基准，它通过认知和语言写作模型的启发，通过直接和基于计划的生成范式评估长文本生成。这项工作中的全面实验揭示了有趣的发现，例如模型大小与生成能力相关，但在长文本上进行了良好训练的小型模型（例如，长作者）具有可比的性能。所有代码和数据集均在此HTTPS URL中发布。

Title: Evaluating Gender Bias in German Machine Translation

Authors: Michelle Kappl
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.19104
Pdf URL: https://arxiv.org/pdf/2502.19104
Copy Paste: [[2502.19104]] Evaluating Gender Bias in German Machine Translation(https://arxiv.org/abs/2502.19104)
Keywords: language model, llm
Abstract: We present WinoMTDE, a new gender bias evaluation test set designed to assess occupational stereotyping and underrepresentation in German machine translation (MT) systems. Building on the automatic evaluation method introduced by arXiv:1906.00591v1 [cs.CL], we extend the approach to German, a language with grammatical gender. The WinoMTDE dataset comprises 288 German sentences that are balanced with regard to gender and stereotype, the latter annotated using German labor statistics. We conduct a large-scale evaluation of five widely used MT systems and a large language model. Our results reveal persistent bias in most models, with the LLM outperforming traditional systems. The dataset and evaluation code are publicly available under this https URL.
摘要：我们提出了Winomtde，这是一种新的性别偏见评估测试集，旨在评估德国机器翻译（MT）系统中职业刻板印象和代表性不足。在Arxiv引入的自动评估方法的基础上：1906.00591V1 [CS.CL]，我们将方法扩展到了德语，这是一种具有语法性别的语言。 Winomtde数据集包含288个关于性别和刻板印象平衡的德国句子，这些句子使用德国劳动统计进行了注释。我们对五个广泛使用的MT系统和大型语言模型进行了大规模评估。我们的结果表明，在大多数模型中，LLM的表现都优于传统系统。数据集和评估代码在此HTTPS URL下公开可用。

Title: Conformal Linguistic Calibration: Trading-off between Factuality and Specificity

Authors: Zhengping Jiang, Anqi Liu, Benjamin Van Durme
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2502.19110
Pdf URL: https://arxiv.org/pdf/2502.19110
Copy Paste: [[2502.19110]] Conformal Linguistic Calibration: Trading-off between Factuality and Specificity(https://arxiv.org/abs/2502.19110)
Keywords: language model, prompt
Abstract: Language model outputs are not always reliable; this prompts research into methods for adapting model responses based on uncertainty. Common approaches include abstention, where models refrain from generating responses when uncertain, and linguistic calibration, where models hedge their statements using uncertainty quantifiers. However, abstention can withhold valuable information, while linguistically calibrated responses are often challenging to leverage in downstream tasks. We propose a unifying view of both approaches, Conformal Linguistic Calibration (CLC), reinterpreting linguistic calibration as answer set prediction. We begin by presenting a unified framework that connects abstention and linguistic calibration through the lens of linguistic pragmatics. We then describe an implementation that allows for controlling the level of imprecision in model responses. Experimental results show that our method produces calibrated outputs with conformal guarantees on factual accuracy. Furthermore, our approach enables fine-tuning models to perform uncertainty-aware adaptive claim rewriting, offering a controllable balance between factuality and specificity.
摘要：语言模型输出并不总是可靠的；这促使研究基于不确定性调整模型响应的方法。常见方法包括：\ emph {弃权}，其中模型在不确定时避免产生响应；和\ emph {语言校准}，其中模型使用不确定性量词对冲他们的陈述。但是，弃权可以扣留有价值的信息，而语言校准的响应通常挑战在下游任务中利用。我们提出了两种方法的统一视图，共形语言校准（CLC），将语言校准重新解释为答案集预测。我们首先提出一个统一的框架，该框架通过语言务实的镜头连接弃权和语言校准。然后，我们描述了一个实现，该实现允许控制模型响应中不精确的水平。实验结果表明，我们的方法在事实准确性上产生具有共形保证的校准输出。此外，我们的方法使微调模型能够执行不确定性感知的自适应主张重写，从而在事实和特殊性之间提供可控制的平衡。
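
Note (illustration, not from the paper): the abstract frames linguistic calibration as answer set prediction with conformal guarantees. A minimal sketch of generic split conformal prediction, the standard recipe behind such guarantees, is given below. It is not the CLC procedure itself; the nonconformity scores, the candidate answers, and the names `conformal_threshold` and `answer_set` are all hypothetical.

```python
import math


def conformal_threshold(cal_scores, alpha):
    """Split-conformal threshold: the ceil((n + 1) * (1 - alpha))-th smallest score."""
    n = len(cal_scores)
    if n == 0:
        raise ValueError("need at least one calibration score")
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: every candidate is kept.
        return math.inf
    return sorted(cal_scores)[rank - 1]


def answer_set(candidate_scores, threshold):
    """Keep every candidate whose nonconformity score is within the threshold."""
    return [ans for ans, score in candidate_scores.items() if score <= threshold]


# Nonconformity scores of the true answers on a held-out calibration set (toy values).
cal_scores = [0.1, 0.4, 0.2, 0.9, 0.3, 0.6, 0.5]
threshold = conformal_threshold(cal_scores, alpha=0.25)

# Candidate answers for a new query, lower score = more conforming (toy values).
candidates = {"Paris": 0.2, "a city in France": 0.05, "Lyon": 0.8}
kept = answer_set(candidates, threshold)
```

A smaller `alpha` raises the threshold and so admits more, or less specific, answers. This is one way to picture the trade-off between factuality and specificity that the abstract describes.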

Title: Self-Memory Alignment: Mitigating Factual Hallucinations with Generalized Improvement

Authors: Siyuan Zhang, Yichi Zhang, Yinpeng Dong, Hang Su
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.19127
Pdf URL: https://arxiv.org/pdf/2502.19127
Copy Paste: [[2502.19127]] Self-Memory Alignment: Mitigating Factual Hallucinations with Generalized Improvement(https://arxiv.org/abs/2502.19127)
Keywords: language model, llm, hallucination
Abstract: Large Language Models (LLMs) often struggle to align their responses with objective facts, resulting in factual hallucinations, which can be difficult to detect and can mislead users who lack relevant knowledge. While post-training techniques have been employed to mitigate the issue, existing methods usually suffer from poor generalization and trade-offs among different capabilities. In this paper, we propose to address it by directly augmenting the LLM's fundamental ability to precisely leverage its existing memory--the knowledge acquired from pre-training data. We introduce self-memory alignment (SMA), which fine-tunes the model on self-generated responses to precise and simple factual questions through preference optimization. Furthermore, we construct FactualBench, a comprehensive and precise factual QA dataset containing 181k Chinese examples spanning 21 domains, to facilitate both evaluation and training. Extensive experiments show that SMA significantly improves LLMs' overall performance, with consistent enhancement across various benchmarks concerning factuality, as well as helpfulness and comprehensive skills.
摘要：大型语言模型（LLMS）通常很难使他们的回答与客观事实保持一致，从而导致事实幻觉问题，这可能很难在没有相关知识的情况下检测和误导用户。尽管已采用后培训技术来减轻问题，但现有方法通常会遭受不同能力的概括和权衡的差。在本文中，我们建议通过直接增强LLM精确利用其现有记忆的基本能力来解决它 - 从培训前数据中获得的知识。我们介绍了自我内存一致性（SMA），它通过首选优化对精确而简单的事实问题进行自我生成的响应微调模型。此外，我们构建了FactualBench，这是一个全面而精确的事实QA数据集，其中包含181K中国数据跨越21个域，以促进评估和培训。广泛的实验表明，SMA显着提高了LLMS的整体性能，并且在各种基准方面都始终如一地提高有关事实的基准，以及有益的和全面的技能。

Title: Amulet: ReAlignment During Test Time for Personalized Preference Adaptation of LLMs

Authors: Zhaowei Zhang, Fengshuo Bai, Qizhi Chen, Chengdong Ma, Mingzhi Wang, Haoran Sun, Zilong Zheng, Yaodong Yang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2502.19148
Pdf URL: https://arxiv.org/pdf/2502.19148
Copy Paste: [[2502.19148]] Amulet: ReAlignment During Test Time for Personalized Preference Adaptation of LLMs(https://arxiv.org/abs/2502.19148)
Keywords: language model, llm, prompt
Abstract: How to align large language models (LLMs) with user preferences from a static general dataset has been frequently studied. However, user preferences are usually personalized, changing, and diverse regarding culture, values, or time. This leads to the problem that the actual user preferences often do not coincide with those trained by the model developers in the practical use of LLMs. Since we cannot collect enough data and retrain for every demand, researching efficient real-time preference adaptation methods based on the backbone LLMs during test time is important. To this end, we introduce Amulet, a novel, training-free framework that formulates the decoding process of every token as a separate online learning problem with the guidance of simple user-provided prompts, thus enabling real-time optimization to satisfy users' personalized preferences. To reduce the computational cost brought by this optimization process for each token, we additionally provide a closed-form solution for each iteration step of the optimization process, thereby reducing the computational time cost to a negligible level. The detailed experimental results demonstrate that Amulet can achieve significant performance improvements in rich settings with combinations of different LLMs, datasets, and user preferences, while maintaining acceptable computational efficiency.
摘要：如何将大型语言模型（LLMS）与静态一般数据集的用户偏好保持一致。但是，在文化，价值或时间上，用户偏好通常是个性化的，变化和多样化的。这导致了一个问题，即实际用户的偏好通常与模型开发人员在LLMS实际使用中培训的人不一致。由于我们无法收集足够的数据并针对每个需求进行重新培训，因此在测试时间内根据基本LLM的有效的实时偏好适应方法很重要。为此，我们介绍了Amulet，这是一个新颖的，无训练的框架，通过提供简单的用户提供的提示的指导，将每个令牌的解码过程作为一个单独的在线学习问题，从而实现实时优化，以满足用户的个性化偏好。为了减少每个代币的优化过程带来的计算成本，我们还为优化过程的每个迭代步骤提供了封闭形式的解决方案，从而将计算时间成本降低到可忽略的水平。详细的实验结果表明，护身符可以与不同的LLM，数据集和用户偏好的组合在丰富的设置中实现重大的性能改进，同时保持可接受的计算效率。

Title: When Personalization Meets Reality: A Multi-Faceted Analysis of Personalized Preference Learning

Authors: Yijiang River Dong, Tiancheng Hu, Yinhong Liu, Ahmet Üstün, Nigel Collier
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.19158
Pdf URL: https://arxiv.org/pdf/2502.19158
Copy Paste: [[2502.19158]] When Personalization Meets Reality: A Multi-Faceted Analysis of Personalized Preference Learning(https://arxiv.org/abs/2502.19158)
Keywords: language model, llm
Abstract: While Reinforcement Learning from Human Feedback (RLHF) is widely used to align Large Language Models (LLMs) with human preferences, it typically assumes homogeneous preferences across users, overlooking diverse human values and minority viewpoints. Although personalized preference learning addresses this by tailoring separate preferences for individual users, the field lacks standardized methods to assess its effectiveness. We present a multi-faceted evaluation framework that measures not only performance but also fairness, unintended effects, and adaptability across varying levels of preference divergence. Through extensive experiments comparing eight personalization methods across three preference datasets, we demonstrate that performance differences between methods could reach 36% when users strongly disagree, and personalization can introduce up to 20% safety misalignment. These findings highlight the critical need for holistic evaluation approaches to advance the development of more effective and inclusive preference learning systems.
摘要：虽然从人类反馈中学习（RLHF）的强化学习被广泛用于使大语言模型（LLMS）与人类偏好保持一致，但它通常假定用户之间的同质偏好，忽略了各种人类价值观和少数群体的观点。尽管个性化的偏好学习通过为个别用户定制单独的偏好来解决这一问题，但该领域缺乏评估其有效性的标准化方法。我们提出了一个多方面的评估框架，该框架不仅可以衡量性能，而且还衡量公平性，意外的效果以及各种偏好差异的适应性。通过比较三个偏好数据集的八种个性化方法的广泛实验，我们证明，当用户强烈不同意时，方法之间的性能差异可能达到36％，并且个性化最多可以引入20％的安全失控。这些发现突出了对整体评估方法的关键需求，以推动开发更有效和包容性的偏好学习系统。

Title: Detecting Linguistic Indicators for Stereotype Assessment with Large Language Models

Authors: Rebekka Görge, Michael Mock, Héctor Allende-Cid
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.19160
Pdf URL: https://arxiv.org/pdf/2502.19160
Copy Paste: [[2502.19160]] Detecting Linguistic Indicators for Stereotype Assessment with Large Language Models(https://arxiv.org/abs/2502.19160)
Keywords: language model, gpt, llm, prompt
Abstract: Social categories and stereotypes are embedded in language and can introduce data bias into Large Language Models (LLMs). Despite safeguards, these biases often persist in model behavior, potentially leading to representational harm in outputs. While sociolinguistic research provides valuable insights into the formation of stereotypes, NLP approaches for stereotype detection rarely draw on this foundation and often lack objectivity, precision, and interpretability. To fill this gap, in this work we propose a new approach that detects and quantifies the linguistic indicators of stereotypes in a sentence. We derive linguistic indicators from the Social Category and Stereotype Communication (SCSC) framework that indicate strong social category formulation and stereotyping in language, and use them to build a categorization scheme. To automate this approach, we instruct different LLMs using in-context learning to apply the approach to a sentence, where the LLM examines the linguistic properties and provides a basis for a fine-grained assessment. Based on an empirical evaluation of the importance of different linguistic indicators, we learn a scoring function that measures the linguistic indicators of a stereotype. Our annotations of stereotyped sentences show that these indicators are present in these sentences and explain the strength of a stereotype. In terms of model performance, our results show that the models generally perform well in detecting and classifying linguistic indicators of category labels used to denote a category, but sometimes struggle to correctly evaluate the associated behaviors and characteristics. Using more few-shot examples within the prompts significantly improves performance. Model performance increases with size, as Llama-3.3-70B-Instruct and GPT-4 achieve comparable results that surpass those of Mixtral-8x7B-Instruct, GPT-4-mini and Llama-3.1-8B-Instruct.
摘要：社会类别和刻板印象嵌入了语言中，可以将数据偏见引入大型语言模型（LLMS）中。尽管有保障，但这些偏见通常会持续存在于模型行为上，可能导致产出的代表性危害。尽管社会语言学研究为刻板印象的形成提供了宝贵的见解，但NLP的刻板印象检测方法很少借鉴这一基础，并且通常缺乏客观性，精度和解释性。为了填补这一空白，在这项工作中，我们提出了一种新方法，该方法检测和量化了句子中刻板印象的语言指标。我们从社会类别和刻板印象通信（SCSC）框架中得出语言指标，这些框架表明了强大的社会类别表述和语言刻板印象，并使用它们来构建分类方案。为了自动化这种方法，我们使用内部文化学习的不同LLM将方法应用于句子，在该句子中，LLM检查了语言属性，并为精细粒度评估提供了基础。基于对不同语言指标的重要性的经验评估，我们学习了一个评分函数，以衡量刻板印象的语言指标。我们对刻板印象的注释表明，这些句子中存在这些指标，并解释了刻板印象的强度。在模型性能方面，我们的结果表明，这些模型通常在检测和分类用于表示类别的类别标签的语言指标方面表现良好，但有时很难正确评估相关的行为和特征。在提示中使用更多示例，可以显着提高性能。模型性能随着尺寸而增加，因为Llama-3.3-70B-Instruct和GPT-4实现了可比的结果，这些结果超过了Mixtral-8x7b-Instruct，GPT-4-Mini和Llama-3.1-8B教学。

Title: TestNUC: Enhancing Test-Time Computing Approaches through Neighboring Unlabeled Data Consistency

Authors: Henry Peng Zou, Zhengyao Gu, Yue Zhou, Yankai Chen, Weizhi Zhang, Liancheng Fang, Yibo Wang, Yangning Li, Kay Liu, Philip S. Yu
Subjects: cs.CL, cs.AI, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2502.19163
Pdf URL: https://arxiv.org/pdf/2502.19163
Copy Paste: [[2502.19163]] TestNUC: Enhancing Test-Time Computing Approaches through Neighboring Unlabeled Data Consistency(https://arxiv.org/abs/2502.19163)
Keywords: language model, prompt
Abstract: Test-time computing approaches, which leverage additional computational resources during inference, have been proven effective in enhancing large language model performance. This work introduces a novel, linearly scaling approach, TestNUC, that improves test-time predictions by leveraging the local consistency of neighboring unlabeled data: it classifies an input instance by considering not only the model's prediction on that instance but also its predictions on neighboring unlabeled instances. We evaluate TestNUC across eight diverse datasets, spanning intent classification, topic mining, domain discovery, and emotion detection, demonstrating its consistent superiority over baseline methods such as standard prompting and self-consistency. Furthermore, TestNUC can be seamlessly integrated with existing test-time computing approaches, substantially boosting their performance. Our analysis reveals that TestNUC scales effectively with increasing amounts of unlabeled data and performs robustly across different embedding models, making it practical for real-world applications. Our code is available at this https URL.
摘要：在推断过程中利用其他计算资源的测试时间计算方法已被证明有效增强大型语言模型性能。这项工作介绍了一种新颖的线性缩放方法，即TestNUC，通过利用相邻未标记的数据的局部一致性来改善测试时间预测，这不仅考虑了模型对该实例的预测，还可以在相邻的未标记实例上考虑输入实例。我们评估了八个不同数据集的TestNUC，涵盖了意图分类，主题挖掘，域发现和情感检测，并证明了其一致的优越性，而不是标准提示和自相连等基线方法。此外，可以将TestNUC与现有的测试时间计算方法无缝集成，从而大大提高其性能。我们的分析表明，测试NUC量表可以有效地随着未标记的数据量增加，并在不同的嵌入模型中执行稳健性，从而使其对于现实世界应用而实用。我们的代码可在此HTTPS URL上找到。
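
Note (illustration, not from the paper): the idea of combining a model's prediction on an input with its predictions on neighboring unlabeled instances can be sketched as a nearest-neighbor vote. The unweighted majority vote, the cosine-similarity neighbor search, the toy embeddings and labels, and the name `neighbor_vote` are all assumptions; the paper's actual aggregation rule may differ.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def neighbor_vote(query_vec, query_pred, pool, k):
    """Majority vote over the query's own prediction and its k nearest neighbors.

    `pool` holds (embedding, predicted_label) pairs for unlabeled instances,
    where each label is the model's prediction, not a gold label.
    """
    ranked = sorted(pool, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    votes = [query_pred] + [label for _, label in ranked[:k]]
    # On a tie, Counter keeps first-seen order, so the query's own prediction wins.
    return Counter(votes).most_common(1)[0][0]


# Toy unlabeled pool with model predictions attached.
pool = [
    ([0.9, 0.1], "refund"),
    ([0.8, 0.2], "refund"),
    ([0.0, 1.0], "billing"),
    ([0.1, 0.9], "billing"),
]
final_label = neighbor_vote([1.0, 0.0], "billing", pool, k=2)
```

In this toy case the model's own guess, "billing", is overruled by its two closest neighbors, both predicted as "refund".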

Title: MEDDxAgent: A Unified Modular Agent Framework for Explainable Automatic Differential Diagnosis

Authors: Daniel Rose, Chia-Chien Hung, Marco Lepri, Israa Alqassem, Kiril Gashteovski, Carolin Lawrence
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.19175
Pdf URL: https://arxiv.org/pdf/2502.19175
Copy Paste: [[2502.19175]] MEDDxAgent: A Unified Modular Agent Framework for Explainable Automatic Differential Diagnosis(https://arxiv.org/abs/2502.19175)
Keywords: language model, llm, agent
Abstract: Differential Diagnosis (DDx) is a fundamental yet complex aspect of clinical decision-making, in which physicians iteratively refine a ranked list of possible diseases based on symptoms, antecedents, and medical knowledge. While recent advances in large language models have shown promise in supporting DDx, existing approaches face key limitations, including single-dataset evaluations, isolated optimization of components, unrealistic assumptions about complete patient profiles, and single-attempt diagnosis. We introduce a Modular Explainable DDx Agent (MEDDxAgent) framework designed for interactive DDx, where diagnostic reasoning evolves through iterative learning, rather than assuming a complete patient profile is accessible. MEDDxAgent integrates three modular components: (1) an orchestrator (DDxDriver), (2) a history-taking simulator, and (3) two specialized agents for knowledge retrieval and diagnosis strategy. To ensure robust evaluation, we introduce a comprehensive DDx benchmark covering respiratory, skin, and rare diseases. We analyze single-turn diagnostic approaches and demonstrate the importance of iterative refinement when patient profiles are not available at the outset. Our broad evaluation demonstrates that MEDDxAgent achieves over 10% accuracy improvements in interactive DDx across both large and small LLMs, while offering critical explainability into its diagnostic reasoning process.

Title: BIG-Bench Extra Hard

Authors: Mehran Kazemi, Bahare Fatemi, Hritik Bansal, John Palowitch, Chrysovalantis Anastasiou, Sanket Vaibhav Mehta, Lalit K. Jain, Virginia Aglietti, Disha Jindal, Peter Chen, Nishanth Dikkala, Gladys Tyen, Xin Liu, Uri Shalit, Silvia Chiappa, Kate Olszewska, Yi Tay, Vinh Q. Tran, Quoc V. Le, Orhan Firat
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.19187
Pdf URL: https://arxiv.org/pdf/2502.19187
Copy Paste: [[2502.19187]] BIG-Bench Extra Hard(https://arxiv.org/abs/2502.19187)
Keywords: language model, llm
Abstract: Large language models (LLMs) are increasingly deployed in everyday applications, demanding robust general reasoning capabilities and a diverse reasoning skillset. However, current LLM reasoning benchmarks predominantly focus on mathematical and coding abilities, leaving a gap in evaluating broader reasoning proficiencies. One particular exception is the BIG-Bench dataset, which has served as a crucial benchmark for evaluating the general reasoning capabilities of LLMs, thanks to its diverse set of challenging tasks that allowed for a comprehensive assessment of general reasoning across various skills within a unified framework. However, recent advances in LLMs have led to saturation on BIG-Bench and its harder version, BIG-Bench Hard (BBH). State-of-the-art models achieve near-perfect scores on many tasks in BBH, thus diminishing its utility. To address this limitation, we introduce BIG-Bench Extra Hard (BBEH), a new benchmark designed to push the boundaries of LLM reasoning evaluation. BBEH replaces each task in BBH with a novel task that probes a similar reasoning capability but exhibits significantly increased difficulty. We evaluate various models on BBEH and observe a (harmonic) average accuracy of 9.8% for the best general-purpose model and 44.8% for the best reasoning-specialized model, indicating substantial room for improvement and highlighting the ongoing challenge of achieving robust general reasoning in LLMs. We release BBEH publicly at: this https URL.
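The BBEH abstract reports a harmonic rather than arithmetic average across tasks. As a minimal sketch of why that choice matters, the example below compares the two on made-up per-task accuracies. The numbers are hypothetical, and a plain harmonic mean is assumed; the benchmark's exact aggregation may differ.

```python
from statistics import harmonic_mean, mean

# Hypothetical per-task accuracies in percent (illustrative only, not from the paper).
task_accuracy = [60.0, 45.0, 5.0, 30.0]

arithmetic = mean(task_accuracy)
harmonic = harmonic_mean(task_accuracy)

# The harmonic mean is pulled toward the weakest task, so a model cannot
# hide a near-zero score on one task behind strong scores on others.
print(f"arithmetic mean: {arithmetic:.1f}")
print(f"harmonic mean:   {harmonic:.1f}")
```

A single weak task drags the harmonic mean well below the arithmetic mean, which is why it suits a benchmark meant to reward uniformly strong reasoning.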

Title: LiGT: Layout-infused Generative Transformer for Visual Question Answering on Vietnamese Receipts

Authors: Thanh-Phong Le, Trung Le Chi Phan, Nghia Hieu Nguyen, Kiet Van Nguyen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.19202
Pdf URL: https://arxiv.org/pdf/2502.19202
Copy Paste: [[2502.19202]] LiGT: Layout-infused Generative Transformer for Visual Question Answering on Vietnamese Receipts(https://arxiv.org/abs/2502.19202)
Keywords: language model
Abstract: Purpose: Document Visual Question Answering (document VQA) challenges multimodal systems to holistically handle textual, layout, and visual modalities to provide appropriate answers. Document VQA has gained popularity in recent years due to the increasing amount of documents and the high demand for digitization. Nonetheless, most document VQA datasets are developed in high-resource languages such as English. Methods: In this paper, we present ReceiptVQA (Receipt Visual Question Answering), the first large-scale document VQA dataset in Vietnamese dedicated to receipts, a document type with high commercial potential. The dataset encompasses 9,000+ receipt images and 60,000+ manually annotated question-answer pairs. In addition to our study, we introduce LiGT (Layout-infused Generative Transformer), a layout-aware encoder-decoder architecture designed to leverage embedding layers of language models to operate layout embeddings, minimizing the use of additional neural modules. Results: Experiments on ReceiptVQA show that our architecture yielded promising performance, achieving competitive results compared with outstanding baselines. Furthermore, through analyzing experimental results, we found evident patterns that employing encoder-only model architectures has considerable disadvantages in comparison to architectures that can generate answers. We also observed that it is necessary to combine multiple modalities to tackle our dataset, despite the critical role of semantic understanding from language models. Conclusion: We hope that our work will encourage and facilitate future development in Vietnamese document VQA, contributing to a diverse multimodal research community in the Vietnamese language.

Title: FaithUn: Toward Faithful Forgetting in Language Models by Investigating the Interconnectedness of Knowledge

Authors: Nakyeong Yang, Minsung Kim, Seunghyun Yoon, Joongbo Shin, Kyomin Jung
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.19207
Pdf URL: https://arxiv.org/pdf/2502.19207
Copy Paste: [[2502.19207]] FaithUn: Toward Faithful Forgetting in Language Models by Investigating the Interconnectedness of Knowledge(https://arxiv.org/abs/2502.19207)
Keywords: language model
Abstract: Various studies have attempted to remove sensitive or private knowledge from a language model to prevent its unauthorized exposure. However, prior studies have overlooked the complex and interconnected nature of knowledge, where related knowledge must be carefully examined. Specifically, they have failed to evaluate whether an unlearning method faithfully erases interconnected knowledge that should be removed while retaining knowledge that appears relevant but exists in a completely different context. To resolve this problem, we first define a new concept called superficial unlearning, which refers to the phenomenon where an unlearning method either fails to erase the interconnected knowledge it should remove or unintentionally erases irrelevant knowledge. Based on the definition, we introduce a new benchmark, FaithUn, to analyze and evaluate the faithfulness of unlearning in real-world knowledge QA settings. Furthermore, we propose a novel unlearning method, KLUE, which updates only knowledge-related neurons to achieve faithful unlearning. KLUE identifies knowledge neurons using an explainability method and updates only those neurons using selected unforgotten samples. Experimental results demonstrate that widely used unlearning methods fail to ensure faithful unlearning, while our method shows significant effectiveness in real-world QA unlearning.

Title: Bi'an: A Bilingual Benchmark and Model for Hallucination Detection in Retrieval-Augmented Generation

Authors: Zhouyu Jiang, Mengshu Sun, Zhiqiang Zhang, Lei Liang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.19209
Pdf URL: https://arxiv.org/pdf/2502.19209
Copy Paste: [[2502.19209]] Bi'an: A Bilingual Benchmark and Model for Hallucination Detection in Retrieval-Augmented Generation(https://arxiv.org/abs/2502.19209)
Keywords: language model, llm, hallucination, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) effectively reduces hallucinations in Large Language Models (LLMs) but can still produce inconsistent or unsupported content. Although LLM-as-a-Judge is widely used for RAG hallucination detection due to its implementation simplicity, it faces two main challenges: the absence of comprehensive evaluation benchmarks and the lack of domain-optimized judge models. To bridge these gaps, we introduce Bi'an, a novel framework featuring a bilingual benchmark dataset and lightweight judge models. The dataset supports rigorous evaluation across multiple RAG scenarios, while the judge models are fine-tuned from compact open-source LLMs. Extensive experimental evaluations on Bi'anBench show that our 14B model outperforms baseline models with over five times larger parameter scales and rivals state-of-the-art closed-source LLMs. We will release our data and models soon at this https URL.

Title: Negation-Induced Forgetting in LLMs

Authors: Francesca Capuano, Ellen Boschert, Barbara Kaup
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.19211
Pdf URL: https://arxiv.org/pdf/2502.19211
Copy Paste: [[2502.19211]] Negation-Induced Forgetting in LLMs(https://arxiv.org/abs/2502.19211)
Keywords: language model, gpt, llm, chat
Abstract: The study explores whether Large Language Models (LLMs) exhibit negation-induced forgetting (NIF), a cognitive phenomenon observed in humans where negating incorrect attributes of an object or event leads to diminished recall of this object or event compared to affirming correct attributes (Mayo et al., 2014; Zang et al., 2023). We adapted Zang et al.'s (2023) experimental framework to test this effect in ChatGPT-3.5, GPT-4o mini and Llama3-70b-instruct. Our results show that ChatGPT-3.5 exhibits NIF, with negated information being less likely to be recalled than affirmed information. GPT-4o mini showed a marginally significant NIF effect, while LLaMA-3-70B did not exhibit NIF. The findings provide initial evidence of negation-induced forgetting in some LLMs, suggesting that similar cognitive biases may emerge in these models. This work is a preliminary step in understanding how memory-related phenomena manifest in LLMs.

Title: Two Heads Are Better Than One: Dual-Model Verbal Reflection at Inference-Time

Authors: Jiazheng Li, Yuxiang Zhou, Junru Lu, Gladys Tyen, Lin Gui, Cesare Aloisi, Yulan He
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.19230
Pdf URL: https://arxiv.org/pdf/2502.19230
Copy Paste: [[2502.19230]] Two Heads Are Better Than One: Dual-Model Verbal Reflection at Inference-Time(https://arxiv.org/abs/2502.19230)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) often struggle with complex reasoning scenarios. While preference optimization methods enhance reasoning performance through training, they often lack transparency in why one reasoning outcome is preferred over another. Verbal reflection techniques improve explainability but are limited in LLMs' critique and refinement capacity. To address these challenges, we introduce a contrastive reflection synthesis pipeline that enhances the accuracy and depth of LLM-generated reflections. We further propose a dual-model reasoning framework within a verbal reinforcement learning paradigm, decoupling inference-time self-reflection into specialized, trained models for reasoning critique and refinement. Extensive experiments show that our framework outperforms traditional preference optimization methods across all evaluation metrics. Our findings also show that "two heads are better than one", demonstrating that a collaborative Reasoner-Critic model achieves superior reasoning performance and transparency, compared to single-model approaches.

Title: Between Circuits and Chomsky: Pre-pretraining on Formal Languages Imparts Linguistic Biases

Authors: Michael Y. Hu, Jackson Petty, Chuan Shi, William Merrill, Tal Linzen
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.19249
Pdf URL: https://arxiv.org/pdf/2502.19249
Copy Paste: [[2502.19249]] Between Circuits and Chomsky: Pre-pretraining on Formal Languages Imparts Linguistic Biases(https://arxiv.org/abs/2502.19249)
Keywords: language model
Abstract: Pretraining language models on formal languages can improve their acquisition of natural language, but it is unclear which features of the formal language impart an inductive bias that leads to effective transfer. Drawing on insights from linguistics and complexity theory, we hypothesize that effective transfer occurs when the formal language both captures dependency structures in natural language and remains within the computational limitations of the model architecture. Focusing on transformers, we find that formal languages with both these properties enable language models to achieve lower loss on natural language and better linguistic generalization compared to other languages. In fact, pre-pretraining, or training on formal-then-natural language, reduces loss more efficiently than the same amount of natural language. For a 1B-parameter language model trained on roughly 1.6B tokens of natural language, pre-pretraining achieves the same loss and better linguistic generalization with a 33% smaller token budget. We also give mechanistic evidence of cross-task transfer from formal to natural language: attention heads acquired during formal language pretraining remain crucial for the model's performance on syntactic evaluations.

Title: Disentangled VAD Representations via a Variational Framework for Political Stance Detection

Authors: Beiyu Xu, Zhiwei Liu, Sophia Ananiadou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.19276
Pdf URL: https://arxiv.org/pdf/2502.19276
Copy Paste: [[2502.19276]] Disentangled VAD Representations via a Variational Framework for Political Stance Detection(https://arxiv.org/abs/2502.19276)
Keywords: gpt
Abstract: The stance detection task aims to categorise the stance regarding specified targets. Current methods face challenges in effectively integrating sentiment information for stance detection. Moreover, the role of highly granular sentiment labelling in stance detection has been largely overlooked. This study presents a novel stance detection framework utilizing a variational autoencoder (VAE) to disentangle latent emotional features (valence, arousal, and dominance; VAD) from political discourse on social media. This approach addresses limitations in current methods, particularly in in-target and cross-target stance detection scenarios. This research uses an advanced emotional annotation tool to annotate seven-class sentiment labels for P-STANCE. Evaluations on benchmark datasets, including P-STANCE and SemEval-2016, reveal that PoliStance-VAE achieves state-of-the-art performance, surpassing models like BERT, BERTweet, and GPT-4o. PoliStance-VAE offers a robust and interpretable solution for stance detection, demonstrating the effectiveness of integrating nuanced emotional representations. This framework paves the way for advancements in natural language processing tasks, particularly those requiring detailed emotional understanding.

Title: CritiQ: Mining Data Quality Criteria from Human Preferences

Authors: Honglin Guo, Kai Lv, Qipeng Guo, Tianyi Liang, Zhiheng Xi, Demin Song, Qiuyinzhe Zhang, Yu Sun, Kai Chen, Xipeng Qiu, Tao Gui
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.19279
Pdf URL: https://arxiv.org/pdf/2502.19279
Copy Paste: [[2502.19279]] CritiQ: Mining Data Quality Criteria from Human Preferences(https://arxiv.org/abs/2502.19279)
Keywords: language model, prompt, agent
Abstract: Language models depend heavily on high-quality data for optimal performance. Existing approaches rely on manually designed heuristics, the perplexity of existing models, training classifiers, or careful prompt engineering, which require significant expert experience and human annotation effort while introducing biases. We introduce CritiQ, a novel data selection method that automatically mines criteria from human preferences for data quality with only about 30 human-annotated pairs and performs efficient data selection. The main component, CritiQ Flow, employs a manager agent to evolve quality criteria and worker agents to make pairwise judgments. We build a knowledge base that extracts quality criteria from previous work to boost CritiQ Flow. Compared to perplexity- and classifier-based methods, verbal criteria are more interpretable and possess reusable value. After deriving the criteria, we train the CritiQ Scorer to give quality scores and perform efficient data selection. We demonstrate the effectiveness of our method in the code, math, and logic domains, achieving high accuracy on human-annotated test sets. To validate the quality of the selected data, we continually train Llama 3.1 models and observe improved performance on downstream tasks compared to uniform sampling. Ablation studies validate the benefits of the knowledge base and the reflection process. We analyze how criteria evolve and the effectiveness of majority voting.

Title: Shh, don't say that! Domain Certification in LLMs

Authors: Cornelius Emde, Alasdair Paren, Preetham Arvind, Maxime Kayser, Tom Rainforth, Thomas Lukasiewicz, Bernard Ghanem, Philip H.S. Torr, Adel Bibi
Subjects: cs.CL, cs.AI, cs.CR, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2502.19320
Pdf URL: https://arxiv.org/pdf/2502.19320
Copy Paste: [[2502.19320]] Shh, don't say that! Domain Certification in LLMs(https://arxiv.org/abs/2502.19320)
Keywords: language model, llm
Abstract: Large language models (LLMs) are often deployed to perform constrained tasks with narrow domains. For example, customer support bots can be built on top of LLMs, relying on their broad language understanding and capabilities to enhance performance. However, these LLMs are adversarially susceptible, potentially generating outputs outside the intended domain. To formalize, assess, and mitigate this risk, we introduce domain certification, a guarantee that accurately characterizes the out-of-domain behavior of language models. We then propose a simple yet effective approach, which we call VALID, that provides adversarial bounds as a certificate. Finally, we evaluate our method across a diverse set of datasets, demonstrating that it yields meaningful certificates, which bound the probability of out-of-domain samples tightly with minimum penalty to refusal behavior.

Title: Agentic Reward Modeling: Integrating Human Preferences with Verifiable Correctness Signals for Reliable Reward Systems

Authors: Hao Peng, Yunjia Qi, Xiaozhi Wang, Zijun Yao, Bin Xu, Lei Hou, Juanzi Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.19328
Pdf URL: https://arxiv.org/pdf/2502.19328
Copy Paste: [[2502.19328]] Agentic Reward Modeling: Integrating Human Preferences with Verifiable Correctness Signals for Reliable Reward Systems(https://arxiv.org/abs/2502.19328)
Keywords: language model, llm, agent
Abstract: Reward models (RMs) are crucial for the training and inference-time scaling up of large language models (LLMs). However, existing reward models primarily focus on human preferences, neglecting verifiable correctness signals which have shown strong potential in training LLMs. In this paper, we propose agentic reward modeling, a reward system that combines reward models with verifiable correctness signals from different aspects to provide reliable rewards. We empirically implement a reward agent, named RewardAgent, that combines human preference rewards with two verifiable signals: factuality and instruction following, to provide more reliable rewards. We conduct comprehensive experiments on existing reward model benchmarks and inference time best-of-n searches on real-world downstream tasks. RewardAgent significantly outperforms vanilla reward models, demonstrating its effectiveness. We further construct training preference pairs using RewardAgent and train an LLM with the DPO objective, achieving superior performance on various NLP benchmarks compared to conventional reward models. Our codes are publicly released to facilitate further research (this https URL).
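The RewardAgent abstract describes combining a human-preference reward with verifiable signals and using the result in best-of-n search. The sketch below illustrates only that general idea and is not the paper's implementation: the `Candidate` fields, the weights, and the additive combination are all hypothetical assumptions.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Candidate:
    text: str
    preference: float          # hypothetical reward-model score in [0, 1]
    factual: bool              # hypothetical factuality check result
    follows_instruction: bool  # hypothetical instruction-following check result


def combined_reward(c: Candidate, w_pref: float = 1.0, w_check: float = 1.0) -> float:
    """Assumed additive mix of a preference score and two pass/fail checks."""
    checks = float(c.factual) + float(c.follows_instruction)
    return w_pref * c.preference + w_check * checks


def best_of_n(candidates: Sequence[Candidate]) -> Candidate:
    """Best-of-n search: score every sampled response and keep the top one."""
    return max(candidates, key=combined_reward)


samples = [
    Candidate("fluent but wrong", preference=0.9, factual=False, follows_instruction=True),
    Candidate("plain but correct", preference=0.6, factual=True, follows_instruction=True),
]
print(best_of_n(samples).text)
```

With these made-up scores the factually correct response wins even though the preference score alone would have picked the other one, which is the failure mode the abstract attributes to preference-only reward models.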

Title: Evaluating LLMs and Pre-trained Models for Text Summarization Across Diverse Datasets

Authors: Tohida Rehman, Soumabha Ghosh, Kuntal Das, Souvik Bhattacharjee, Debarshi Kumar Sanyal, Samiran Chattopadhyay
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.19339
Pdf URL: https://arxiv.org/pdf/2502.19339
Copy Paste: [[2502.19339]] Evaluating LLMs and Pre-trained Models for Text Summarization Across Diverse Datasets(https://arxiv.org/abs/2502.19339)
Keywords: language model, llm
Abstract: Text summarization plays a crucial role in natural language processing by condensing large volumes of text into concise and coherent summaries. As digital content continues to grow rapidly and the demand for effective information retrieval increases, text summarization has become a focal point of research in recent years. This study offers a thorough evaluation of four leading pre-trained and open-source large language models (BART, FLAN-T5, LLaMA-3-8B, and Gemma-7B) across five diverse datasets: CNN/DM, Gigaword, News Summary, XSum, and BBC News. The evaluation employs widely recognized automatic metrics, including ROUGE-1, ROUGE-2, ROUGE-L, BERTScore, and METEOR, to assess the models' capabilities in generating coherent and informative summaries. The results reveal the comparative strengths and limitations of these models in processing various text types.
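Of the metrics listed in the summarization abstract, ROUGE-1 is the simplest: it measures clipped unigram overlap between a candidate summary and a reference. The following is a minimal sketch assuming lowercase whitespace tokenization and no stemming; the function name `rouge1_f1` is illustrative, and published evaluations normally use an established ROUGE package with its own tokenization.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """ROUGE-1 F1 from clipped unigram counts (simplified tokenization)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count per token (clipping).
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1("the cat", "the cat sat")
print(f"ROUGE-1 F1: {score:.2f}")
```

Here precision is 1 and recall is 2/3, so the F1 score is 0.8; ROUGE-2 follows the same pattern with bigrams in place of unigrams.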

Title: Controlled Diversity: Length-optimized Natural Language Generation

Authors: Diana Marie Schenke, Timo Baumann
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.19347
Pdf URL: https://arxiv.org/pdf/2502.19347
Copy Paste: [[2502.19347]] Controlled Diversity: Length-optimized Natural Language Generation(https://arxiv.org/abs/2502.19347)
Keywords: llm
Abstract: LLMs are not generally able to adjust the length of their outputs based on strict length requirements, a capability that would improve their usefulness in applications that require adherence to diverse user and system requirements. We present an approach to train LLMs to acquire this capability by augmenting existing data and applying existing fine-tuning techniques, which we compare based on the trained models' adherence to the length requirement and overall response quality relative to the baseline model. Our results demonstrate that these techniques can be successfully applied to train LLMs to adhere to length requirements, with the trained models generating texts which better align to the length requirements. Our results indicate that our method may change the response quality when using training data that was not generated by the baseline model. This allows simultaneous alignment to another training objective in certain scenarios, but is undesirable otherwise. Training on a dataset containing the model's own responses eliminates this issue.

Title: Can Large Language Models Detect Errors in Long Chain-of-Thought Reasoning?

Authors: Yancheng He, Shilong Li, Jiaheng Liu, Weixun Wang, Xingyuan Bu, Ge Zhang, Zhongyuan Peng, Zhaoxiang Zhang, Wenbo Su, Bo Zheng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.19361
Pdf URL: https://arxiv.org/pdf/2502.19361
Copy Paste: [[2502.19361]] Can Large Language Models Detect Errors in Long Chain-of-Thought Reasoning?(https://arxiv.org/abs/2502.19361)
Keywords: language model, llm, chain-of-thought
Abstract: Recently, o1-like models have drawn significant attention; these models produce long Chain-of-Thought (CoT) reasoning steps to improve the reasoning abilities of existing Large Language Models (LLMs). In this paper, to understand the qualities of these long CoTs and measure the critique abilities of existing LLMs on them, we introduce DeltaBench, which includes long CoTs generated by different o1-like models (e.g., QwQ, DeepSeek-R1) for different reasoning tasks (e.g., Math, Code, General Reasoning), to measure the ability to detect errors in long CoT reasoning. Based on DeltaBench, we first perform fine-grained analysis of the generated long CoTs to discover the effectiveness and efficiency of different o1-like models. Then, we conduct extensive evaluations of existing process reward models (PRMs) and critic models on detecting the errors of each annotated process, aiming to investigate the boundaries and limitations of existing PRMs and critic models. Finally, we hope that DeltaBench can guide developers to better understand the long CoT reasoning abilities of their models.

Title: DataMan: Data Manager for Pre-training Large Language Models

Authors: Ru Peng, Kexin Yang, Yawen Zeng, Junyang Lin, Dayiheng Liu, Junbo Zhao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.19363
Pdf URL: https://arxiv.org/pdf/2502.19363
Copy Paste: [[2502.19363]] DataMan: Data Manager for Pre-training Large Language Models(https://arxiv.org/abs/2502.19363)
Keywords: language model, llm, prompt
Abstract: The performance emergence of large language models (LLMs) driven by data scaling laws makes the selection of pre-training data increasingly important. However, existing methods rely on limited heuristics and human intuition, lacking comprehensive and clear guidelines. To address this, we are inspired by "reverse thinking", prompting LLMs to self-identify which criteria benefit their performance. As their pre-training capabilities are related to perplexity (PPL), we derive 14 quality criteria from the causes of text perplexity anomalies and introduce 15 common application domains to support domain mixing. In this paper, we train a Data Manager (DataMan) to learn quality ratings and domain recognition from pointwise rating, and use it to annotate a 447B token pre-training corpus with 14 quality ratings and domain type. Our experiments validate our approach, using DataMan to select 30B tokens to train a 1.3B-parameter language model, demonstrating significant improvements in in-context learning (ICL), perplexity, and instruction-following ability over the state-of-the-art baseline. The best-performing model, based on the Overall Score l=5, surpasses a model trained with 50% more data using uniform sampling. We continue pre-training with high-rated, domain-specific data annotated by DataMan to enhance domain-specific ICL performance and thus verify DataMan's domain mixing ability. Our findings emphasize the importance of quality ranking, the complementary nature of quality criteria, and their low correlation with perplexity, analyzing misalignment between PPL and ICL performance. We also thoroughly analyzed our pre-training dataset, examining its composition, the distribution of quality ratings, and the original document sources.
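The DataMan abstract leans on perplexity (PPL) as the signal from which its quality criteria are derived. As a reminder of the standard definition, perplexity is the exponential of the average negative log-likelihood per token. The sketch below assumes natural-log token probabilities are already available; the function name and sample values are illustrative and not taken from the paper.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """exp of the mean negative log-probability over the tokens."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# Hypothetical text where the model assigns probability 0.25 to every token.
log_probs = [math.log(0.25)] * 4
print(f"perplexity: {perplexity(log_probs):.2f}")
```

A model that assigns probability 0.25 to each token has perplexity 4, i.e. it is as uncertain as a uniform choice among four tokens; anomalously high or low values on a document are the kind of perplexity anomaly the abstract refers to.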

Title: Code to Think, Think to Code: A Survey on Code-Enhanced Reasoning and Reasoning-Driven Code Intelligence in LLMs

Authors: Dayu Yang, Tianyang Liu, Daoan Zhang, Antoine Simoulin, Xiaoyi Liu, Yuwei Cao, Zhaopu Teng, Xin Qian, Grey Yang, Jiebo Luo, Julian McAuley
Subjects: cs.CL, cs.AI, cs.LG, cs.SE
Abstract URL: https://arxiv.org/abs/2502.19411
Pdf URL: https://arxiv.org/pdf/2502.19411
Copy Paste: [[2502.19411]] Code to Think, Think to Code: A Survey on Code-Enhanced Reasoning and Reasoning-Driven Code Intelligence in LLMs(https://arxiv.org/abs/2502.19411)
Keywords: language model, llm
Abstract: In large language models (LLMs), code and reasoning reinforce each other: code offers an abstract, modular, and logic-driven structure that supports reasoning, while reasoning translates high-level goals into smaller, executable steps that drive more advanced code intelligence. In this study, we examine how code serves as a structured medium for enhancing reasoning: it provides verifiable execution paths, enforces logical decomposition, and enables runtime validation. We also explore how improvements in reasoning have transformed code intelligence from basic completion to advanced capabilities, enabling models to address complex software engineering tasks through planning and debugging. Finally, we identify key challenges and propose future research directions to strengthen this synergy, ultimately improving LLMs' performance in both areas.
摘要：在大型语言模型（LLMS）中，代码和推理相互加强：代码提供了一个抽象，模块化和逻辑驱动的结构，支持推理，而推理将高级目标转化为较小的可执行步骤，可驱动更高级的代码智能。在这项研究中，我们研究了代码如何用作增强推理的结构化介质：它提供可验证的执行路径，实施逻辑分解并启用运行时验证。我们还探讨了推理的改进如何使代码智能从基本完成转变为高级功能，从而使模型能够通过计划和调试来解决复杂的软件工程任务。最后，我们确定了关键的挑战，并提出了未来的研究指示，以增强这种协同作用，最终提高了LLM在这两个领域的表现。

Title: The Mighty ToRR: A Benchmark for Table Reasoning and Robustness

Authors: Shir Ashury-Tahan, Yifan Mai, Rajmohan C, Ariel Gera, Yotam Perlitz, Asaf Yehudai, Elron Bandel, Leshem Choshen, Eyal Shnarch, Percy Liang, Michal Shmueli-Scheuer
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.19412
Pdf URL: https://arxiv.org/pdf/2502.19412
Copy Paste: [[2502.19412]] The Mighty ToRR: A Benchmark for Table Reasoning and Robustness(https://arxiv.org/abs/2502.19412)
Keywords: prompt
Abstract: Despite its real-world significance, model performance on tabular data remains underexplored, leaving uncertainty about which model to rely on and which prompt configuration to adopt. To address this gap, we create ToRR, a benchmark for Table Reasoning and Robustness, that measures model performance and robustness on table-related tasks. The benchmark includes 10 datasets that cover different types of table reasoning capabilities across varied domains. ToRR goes beyond model performance rankings, and is designed to reflect whether models can handle tabular data consistently and robustly, across a variety of common table representation formats. We present a leaderboard as well as comprehensive analyses of the results of leading models over ToRR. Our results reveal a striking pattern of brittle model behavior, where even strong models are unable to perform robustly on tabular data tasks. Although no specific table format leads to consistently better performance, we show that testing over multiple formats is crucial for reliably estimating model capabilities. Moreover, we show that the reliability boost from testing multiple prompts can be equivalent to adding more test examples. Overall, our findings show that table understanding and reasoning tasks remain a significant challenge.
摘要：尽管具有现实世界的意义，但表格数据上的模型性能仍未得到充分兴奋，却留下了不确定性依赖哪种模型以及采用哪种及时配置。为了解决这一差距，我们创建了Torr，这是表格推理和鲁棒性的基准，可以衡量与桌子相关的任务上的模型性能和鲁棒性。该基准包括10个数据集，这些数据集涵盖了各种域的不同类型的表推理功能。 Torr超出了模型性能排名，旨在反映模型是否可以在各种常见的表表示格式上始终如一地处理表格数据。我们介绍了一个排行榜以及对Torr领先模型结果的全面分析。我们的结果揭示了脆性模型行为的惊人模式，即使强大的模型也无法在表格数据任务上执行。尽管没有特定的表格格式会始终如一地提高性能，但我们表明，对多种格式进行测试对于可靠地估计模型功能至关重要。此外，我们表明，测试多个提示的可靠性提高可能相当于添加更多的测试示例。总体而言，我们的发现表明，桌子理解和推理任务仍然是一个重大挑战。

Title: Norm Growth and Stability Challenges in Localized Sequential Knowledge Editing

Authors: Akshat Gupta, Christine Fang, Atahan Ozdemir, Maochuan Lu, Ahmed Alaa, Thomas Hartvigsen, Gopala Anumanchipalli
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.19416
Pdf URL: https://arxiv.org/pdf/2502.19416
Copy Paste: [[2502.19416]] Norm Growth and Stability Challenges in Localized Sequential Knowledge Editing(https://arxiv.org/abs/2502.19416)
Keywords: language model, llm
Abstract: This study investigates the impact of localized updates to large language models (LLMs), specifically in the context of knowledge editing - a task aimed at incorporating or modifying specific facts without altering broader model capabilities. We first show that across different post-training interventions like continuous pre-training, full fine-tuning and LoRA-based fine-tuning, the Frobenius norm of the updated matrices always increases. This increasing norm is especially detrimental for localized knowledge editing, where only a subset of matrices is updated in a model. We reveal a consistent phenomenon across various editing techniques, including fine-tuning, hypernetwork-based approaches, and locate-and-edit methods: the norm of the updated matrix invariably increases with successive updates. Such growth disrupts model balance, particularly when isolated matrices are updated while the rest of the model remains static, leading to potential instability and degradation of downstream performance. Upon deeper investigation of the intermediate activation vectors, we find that the norm of internal activations decreases and is accompanied by shifts in the subspaces occupied by these activations, which shows that these activation vectors now occupy completely different regions in the representation space compared to the unedited model. With our paper, we highlight the technical challenges with continuous and localized sequential knowledge editing and their implications for maintaining model stability and utility.
摘要：这项研究调查了本地化更新对大语言模型（LLM）的影响，特别是在知识编辑的背景下 - 旨在纳入或修改特定事实而不改变更广泛模型功能的任务。我们首先表明，在不同的训练后干预措施中，例如连续训练，完整的微调和基于洛拉的微调，更新的矩阵的Frobenius规范总是会增加。这种增加的规范对本地化知识编辑尤其有害，在模型中只能更新一部分矩阵。我们揭示了各种编辑技术的一致现象，包括微调，基于超网络的方法和定位和编辑方法：更新的矩阵的规范总是随着连续的更新而增加。这种增长破坏了模型平衡，尤其是在更新隔离矩阵的情况下，而其余模型仍保持静态，从而导致潜在的不稳定性和下游性能的降解。在对中间激活向量进行更深入的研究后，我们发现内部激活的规范降低了，并伴随着这些激活所占据的子空间的变化，这表明这些激活向量现在在表示空间中占据完全不同的区域，与未编辑的模型相比。在我们的论文中，我们通过连续和局部的顺序知识编辑及其对维护模型稳定性和实用性的影响来强调技术挑战。