2025-01-23

Title: Human-like conceptual representations emerge from language prediction

Authors: Ningyu Xu, Qi Zhang, Chao Du, Qiang Luo, Xipeng Qiu, Xuanjing Huang, Menghan Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.12547
Pdf URL: https://arxiv.org/pdf/2501.12547
Copy Paste: [[2501.12547]] Human-like conceptual representations emerge from language prediction(https://arxiv.org/abs/2501.12547)
Keywords: language model, llm
Abstract: Recent advances in large language models (LLMs) provide a new opportunity to address the long-standing question of how concepts are represented and organized in the mind, which is central to unravelling the nature of human cognition. Here, we reframed the classic reverse dictionary task to simulate human concept inference in context and investigated the emergence of human-like conceptual representations within LLMs. We found that LLMs were able to infer concepts from definitional descriptions and construct representation spaces that converge towards a shared, context-independent structure. These representations effectively predicted human behavioural judgments and aligned well with neural activity patterns in the human brain, offering evidence for biological plausibility. These findings demonstrate that human-like conceptual representations and organization can naturally emerge from language prediction, even without real-world grounding. Our work supports the view that LLMs serve as valuable tools for understanding complex human cognition and paves the way for better alignment between artificial and human intelligence.

Title: O1-Pruner: Length-Harmonizing Fine-Tuning for O1-Like Reasoning Pruning

Authors: Haotian Luo, Li Shen, Haiying He, Yibo Wang, Shiwei Liu, Wei Li, Naiqiang Tan, Xiaochun Cao, Dacheng Tao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.12570
Pdf URL: https://arxiv.org/pdf/2501.12570
Copy Paste: [[2501.12570]] O1-Pruner: Length-Harmonizing Fine-Tuning for O1-Like Reasoning Pruning(https://arxiv.org/abs/2501.12570)
Keywords: llm
Abstract: Recently, long-thought reasoning LLMs, such as OpenAI's O1, have adopted extended reasoning processes similar to how humans ponder over complex problems. This reasoning paradigm significantly enhances the model's problem-solving abilities and has achieved promising results. However, the long-thought reasoning process leads to a substantial increase in inference time. A pressing challenge is reducing the inference overhead of long-thought LLMs while ensuring accuracy. In this paper, we experimentally demonstrate that long-thought reasoning models struggle to effectively allocate token budgets based on problem difficulty and reasoning redundancies. To address this, we propose Length-Harmonizing Fine-Tuning (O1-Pruner), which aims to minimize reasoning overhead while maintaining accuracy. This fine-tuning method first estimates the LLM's baseline performance through pre-sampling and then uses RL-style fine-tuning to encourage the model to generate shorter reasoning processes under accuracy constraints. This allows the model to achieve efficient reasoning with lower redundancy while maintaining accuracy. Experiments on various mathematical reasoning benchmarks show that O1-Pruner not only significantly reduces inference overhead but also achieves higher accuracy, providing a novel and promising solution to this challenge. Our code is coming soon at this https URL.

Title: T2ISafety: Benchmark for Assessing Fairness, Toxicity, and Privacy in Image Generation

Authors: Lijun Li, Zhelun Shi, Xuhao Hu, Bowen Dong, Yiran Qin, Xihui Liu, Lu Sheng, Jing Shao
Subjects: cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2501.12612
Pdf URL: https://arxiv.org/pdf/2501.12612
Copy Paste: [[2501.12612]] T2ISafety: Benchmark for Assessing Fairness, Toxicity, and Privacy in Image Generation(https://arxiv.org/abs/2501.12612)
Keywords: gpt, prompt
Abstract: Text-to-image (T2I) models have rapidly advanced, enabling the generation of high-quality images from text prompts across various domains. However, these models present notable safety concerns, including the risk of generating harmful, biased, or private content. Current research on assessing T2I safety remains in its early stages. While some efforts have been made to evaluate models on specific safety dimensions, many critical risks remain unexplored. To address this gap, we introduce T2ISafety, a safety benchmark that evaluates T2I models across three key domains: toxicity, fairness, and privacy. We build a detailed hierarchy of 12 tasks and 44 categories based on these three domains, and meticulously collect 70K corresponding prompts. Based on this taxonomy and prompt set, we build a large-scale T2I dataset with 68K manually annotated images and train an evaluator capable of detecting critical risks that previous work has failed to identify, including risks that even ultra-large proprietary models like GPTs cannot correctly detect. We evaluate 12 prominent diffusion models on T2ISafety and reveal several concerns, including persistent issues with racial fairness, a tendency to generate toxic content, and significant variation in privacy protection across the models, even with defense methods like concept erasing. Data and evaluator are released under this https URL.

Title: Distillation Quantification for Large Language Models

Authors: Sunbowen Lee, Junting Zhou, Chang Ao, Kaige Li, Xinrun Du, Sirui He, Jiaheng Liu, Min Yang, Zhoufutu Wen, Shiwen Ni
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.12619
Pdf URL: https://arxiv.org/pdf/2501.12619
Copy Paste: [[2501.12619]] Distillation Quantification for Large Language Models(https://arxiv.org/abs/2501.12619)
Keywords: language model, llm
Abstract: Model distillation is a technique for transferring knowledge from large language models (LLMs) to smaller ones, aiming to create resource-efficient yet high-performing models. However, excessive distillation can lead to homogenization, reducing diversity among models and impairing their ability to robustly handle complex or novel tasks. These limitations underscore the need to systematically quantify the distillation process and its impact. In this work, we propose a framework to evaluate and quantify model distillation. Our method addresses two key aspects: (1) Identifying identity cognition contradictions to assess discrepancies in how models perceive and represent identity-related information, and (2) Analyzing multi-granularity response similarities across models to measure the extent of homogenization. Experimental results demonstrate two key insights: (1) Well-known closed-source and open-source LLMs usually exhibit high distillation degrees, except for Claude, Doubao, and Gemini. (2) Base LLMs show higher distillation degrees compared to aligned LLMs. By offering a systematic approach to improve the transparency of LLM data distillation, we call for LLMs with more independent development and more transparent technical reports to improve LLMs' robustness and safety. The code and data are available under this https URL.

Title: The potential -- and the pitfalls -- of using pre-trained language models as cognitive science theories

Authors: Raj Sanjay Shah, Sashank Varma
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.12651
Pdf URL: https://arxiv.org/pdf/2501.12651
Copy Paste: [[2501.12651]] The potential -- and the pitfalls -- of using pre-trained language models as cognitive science theories(https://arxiv.org/abs/2501.12651)
Keywords: language model
Abstract: Many studies have evaluated the cognitive alignment of Pre-trained Language Models (PLMs), i.e., their correspondence to adult performance across a range of cognitive domains. Recently, the focus has expanded to the developmental alignment of these models: identifying phases during training where improvements in model performance track improvements in children's thinking over development. However, there are many challenges to the use of PLMs as cognitive science theories, including different architectures, different training data modalities and scales, and limited model interpretability. In this paper, we distill lessons learned from treating PLMs, not as engineering artifacts but as cognitive science and developmental science models. We review assumptions used by researchers to map measures of PLM performance to measures of human performance. We identify potential pitfalls of this approach to understanding human thinking, and we end by enumerating criteria for using PLMs as credible accounts of cognition and cognitive development.

Title: Extracting General-use Transformers for Low-resource Languages via Knowledge Distillation

Authors: Jan Christian Blaise Cruz, Alham Fikri Aji
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.12660
Pdf URL: https://arxiv.org/pdf/2501.12660
Copy Paste: [[2501.12660]] Extracting General-use Transformers for Low-resource Languages via Knowledge Distillation(https://arxiv.org/abs/2501.12660)
Keywords: language model
Abstract: In this paper, we propose the use of simple knowledge distillation to produce smaller and more efficient single-language transformers from Massively Multilingual Transformers (MMTs) to alleviate the tradeoffs associated with using such models in low-resource settings. Using Tagalog as a case study, we show that these smaller single-language models perform on par with strong baselines in a variety of benchmark tasks in a much more efficient manner. Furthermore, we investigate additional steps during the distillation process that improve the soft-supervision of the target language, and provide a number of analyses and ablations to show the efficacy of the proposed method.

Title: Training Dialogue Systems by AI Feedback for Improving Overall Dialogue Impression

Authors: Kai Yoshida, Masahiro Mizukami, Seiya Kawano, Canasai Kruengkrai, Hiroaki Sugiyama, Koichiro Yoshino
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.12698
Pdf URL: https://arxiv.org/pdf/2501.12698
Copy Paste: [[2501.12698]] Training Dialogue Systems by AI Feedback for Improving Overall Dialogue Impression(https://arxiv.org/abs/2501.12698)
Keywords: language model, llm, prompt
Abstract: To improve user engagement during conversations with dialogue systems, we must improve individual dialogue responses and dialogue impressions such as consistency, personality, and empathy throughout the entire dialogue. While such dialogue systems have been developing rapidly with the help of large language models (LLMs), reinforcement learning from AI feedback (RLAIF) has attracted attention as a way to align LLM-based dialogue models with such dialogue impressions. In RLAIF, a reward model based on another LLM is used to create a training signal for an LLM-based dialogue model using zero-shot/few-shot prompting techniques. However, evaluating an entire dialogue only by prompting LLMs is challenging. In this study, we used supervised fine-tuning (SFT) of LLMs to prepare reward models corresponding to 12 metrics related to the impression of the entire dialogue for evaluating dialogue responses. We tuned our dialogue models using the reward model signals as feedback to improve the impression of the system. The results of automatic and human evaluations showed that tuning the dialogue model using our reward model corresponding to dialogue impression improved the evaluation of individual metrics and the naturalness of the dialogue response.

Title: EvidenceMap: Unleashing the Power of Small Language Models with Evidence Analysis for Biomedical Question Answering

Authors: Chang Zong, Jian Wan, Lei Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.12746
Pdf URL: https://arxiv.org/pdf/2501.12746
Copy Paste: [[2501.12746]] EvidenceMap: Unleashing the Power of Small Language Models with Evidence Analysis for Biomedical Question Answering(https://arxiv.org/abs/2501.12746)
Keywords: language model, llm
Abstract: Current LLM-based approaches improve question answering performance by leveraging the internal reasoning abilities of models or incorporating external knowledge. However, when humans address professional problems, it is essential to explicitly analyze the multifaceted relationships from multiple pieces and diverse sources of evidence to achieve better answers. In this study, we propose a novel generative question answering framework for the biomedical domain, named EvidenceMap, which explicitly learns and incorporates evidence analysis with small language models (SLMs). The framework describes an evidence map for each question and fully utilizes an SLM to derive the representation of the supportive evaluation, the logical correlation, and the summarization of the related evidence, which facilitates an analysis-augmented generation with another SLM in an autoregressive way. Extensive experiments have shown that introducing an evidence analysis learning process can significantly outperform larger models and popular LLM reasoning methods.

Title: NExtLong: Toward Effective Long-Context Training without Long Documents

Authors: Chaochen Gao, Xing Wu, Zijia Lin, Debing Zhang, Songlin Hu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.12766
Pdf URL: https://arxiv.org/pdf/2501.12766
Copy Paste: [[2501.12766]] NExtLong: Toward Effective Long-Context Training without Long Documents(https://arxiv.org/abs/2501.12766)
Keywords: language model, llm
Abstract: Large language models (LLMs) with extended context windows have made significant strides, yet training them remains a challenge due to the scarcity of long documents. Existing methods tend to synthesize long-context data but lack a clear mechanism to reinforce the long-range dependency modeling. To address this limitation, we propose NExtLong, a novel framework for synthesizing long-context data through Negative document Extension. NExtLong decomposes a document into multiple meta-chunks and extends the context by interleaving hard negative distractors retrieved from pretraining corpora. This approach compels the model to discriminate long-range dependent context from distracting content, enhancing its ability to model long-range dependencies. Extensive experiments demonstrate that NExtLong achieves significant performance improvements on the HELMET and RULER benchmarks compared to existing long-context synthesis approaches and leading models, which are trained on non-synthetic long documents. These findings highlight NExtLong's ability to reduce reliance on non-synthetic long documents, making it an effective framework for developing advanced long-context LLMs.
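
Illustrative sketch (not the authors' code): the NExtLong abstract describes splitting a document into meta-chunks and interleaving hard negative distractors retrieved from a corpus. The Python below shows that general idea only. The function names, the word-overlap (Jaccard) retriever, the fixed chunk size, and the choice to place distractors directly after each chunk are all assumptions made for illustration; the paper's actual chunking, retrieval, and ordering may differ.

```python
from typing import List


def split_into_chunks(text: str, chunk_words: int) -> List[str]:
    """Split text into consecutive chunks of at most `chunk_words` words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_words])
            for i in range(0, len(words), chunk_words)]


def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity; a toy stand-in for a real retriever."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def extend_with_negatives(document: str, corpus: List[str],
                          chunk_words: int = 4, k: int = 1) -> List[str]:
    """Interleave each chunk with its k most similar corpus passages.

    Similar-but-unrelated passages play the role of hard negatives
    (assumption: similarity alone is a reasonable proxy for hardness).
    """
    extended: List[str] = []
    for chunk in split_into_chunks(document, chunk_words):
        extended.append(chunk)
        ranked = sorted(corpus, key=lambda p: jaccard(chunk, p), reverse=True)
        extended.extend(ranked[:k])
    return extended


if __name__ == "__main__":
    doc = "the cat sat on the mat and the dog ran off"
    pool = ["a cat sat on a sofa", "stock prices fell sharply", "the dog ran home"]
    for piece in extend_with_negatives(doc, pool):
        print(piece)
```

The returned list alternates original chunks with distractors, so joining it yields a longer synthetic context in which the original chunks are separated by distracting text.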

Title: LLMs as Repositories of Factual Knowledge: Limitations and Solutions

Authors: Seyed Mahed Mousavi, Simone Alghisi, Giuseppe Riccardi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.12774
Pdf URL: https://arxiv.org/pdf/2501.12774
Copy Paste: [[2501.12774]] LLMs as Repositories of Factual Knowledge: Limitations and Solutions(https://arxiv.org/abs/2501.12774)
Keywords: language model, llm, prompt
Abstract: LLMs' sources of knowledge are data snapshots containing factual information about entities collected at different timestamps and from different media types (e.g. wikis, social media, etc.). Such unstructured knowledge is subject to change due to updates through time from past to present. Equally important are the inconsistencies and inaccuracies occurring in different information sources. Consequently, the model's knowledge about an entity may be perturbed while training over the sequence of snapshots or at inference time, resulting in inconsistent and inaccurate model performance. In this work, we study the appropriateness of Large Language Models (LLMs) as repositories of factual knowledge. We consider twenty-four state-of-the-art LLMs that are either closed-, partially (weights), or fully (weight and training data) open-source. We evaluate their reliability in responding to time-sensitive factual questions in terms of accuracy and consistency when prompts are perturbed. We further evaluate the effectiveness of state-of-the-art methods to improve LLMs' accuracy and consistency. We then propose "ENtity-Aware Fine-tuning" (ENAF), a soft neurosymbolic approach aimed at providing a structured representation of entities during fine-tuning to improve the model's performance.

Title: Generating Diverse Q&A Benchmarks for RAG Evaluation with DataMorgana

Authors: Simone Filice, Guy Horowitz, David Carmel, Zohar Karnin, Liane Lewin-Eytan, Yoelle Maarek
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2501.12789
Pdf URL: https://arxiv.org/pdf/2501.12789
Copy Paste: [[2501.12789]] Generating Diverse Q&A Benchmarks for RAG Evaluation with DataMorgana(https://arxiv.org/abs/2501.12789)
Keywords: llm, retrieval-augmented generation
Abstract: Evaluating Retrieval-Augmented Generation (RAG) systems, especially in domain-specific contexts, requires benchmarks that address the distinctive requirements of the applicative scenario. Since real data can be hard to obtain, a common strategy is to use LLM-based methods to generate synthetic data. Existing solutions are general purpose: given a document, they generate a question to build a Q&A pair. However, although the generated questions can be individually good, they are typically not diverse enough to reasonably cover the different ways real end-users can interact with the RAG system. We introduce here DataMorgana, a tool for generating highly customizable and diverse synthetic Q&A benchmarks tailored to RAG applications. DataMorgana enables detailed configurations of user and question categories and provides control over their distribution within the benchmark. It uses a lightweight two-stage process, ensuring efficiency and fast iterations, while generating benchmarks that reflect the expected traffic. We conduct a thorough line of experiments, showing quantitatively and qualitatively that DataMorgana surpasses existing tools and approaches in producing lexically, syntactically, and semantically diverse question sets across domain-specific and general-knowledge corpora. DataMorgana will be made available to selected teams in the research community, as first beta testers, in the context of the upcoming SIGIR'2025 LiveRAG challenge to be announced in early February 2025.

Title: Open or Closed LLM for Lesser-Resourced Languages? Lessons from Greek

Authors: John Pavlopoulos, Juli Bakagianni, Kanella Pouli, Maria Gavriilidou
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2501.12826
Pdf URL: https://arxiv.org/pdf/2501.12826
Copy Paste: [[2501.12826]] Open or Closed LLM for Lesser-Resourced Languages? Lessons from Greek(https://arxiv.org/abs/2501.12826)
Keywords: language model, gpt, llm
Abstract: Natural Language Processing (NLP) for lesser-resourced languages faces persistent challenges, including limited datasets, inherited biases from high-resource languages, and the need for domain-specific solutions. This study addresses these gaps for Modern Greek through three key contributions. First, we evaluate the performance of open-source (Llama-70b) and closed-source (GPT-4o mini) large language models (LLMs) on seven core NLP tasks with dataset availability, revealing task-specific strengths, weaknesses, and parity in their performance. Second, we expand the scope of Greek NLP by reframing Authorship Attribution as a tool to assess potential data usage by LLMs in pre-training, with high 0-shot accuracy suggesting ethical implications for data provenance. Third, we showcase a legal NLP case study, where a Summarize, Translate, and Embed (STE) methodology outperforms the traditional TF-IDF approach for clustering long legal texts. Together, these contributions provide a roadmap to advance NLP in lesser-resourced languages, bridging gaps in model evaluation, task innovation, and real-world impact.

Title: Adaptive Retrieval Without Self-Knowledge? Bringing Uncertainty Back Home

Authors: Viktor Moskvoretskii, Maria Lysyuk, Mikhail Salnikov, Nikolay Ivanov, Sergey Pletenev, Daria Galimzianova, Nikita Krayko, Vasily Konovalov, Irina Nikishina, Alexander Panchenko
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2501.12835
Pdf URL: https://arxiv.org/pdf/2501.12835
Copy Paste: [[2501.12835]] Adaptive Retrieval Without Self-Knowledge? Bringing Uncertainty Back Home(https://arxiv.org/abs/2501.12835)
Keywords: language model, llm, hallucination, retrieval augmented generation
Abstract: Retrieval Augmented Generation (RAG) improves the correctness of Question Answering (QA) and addresses hallucinations in Large Language Models (LLMs), yet it greatly increases computational costs. Besides, RAG is not always needed, as it may introduce irrelevant information. Recent adaptive retrieval methods integrate LLMs' intrinsic knowledge with external information by appealing to LLM self-knowledge, but they often neglect efficiency evaluations and comparisons with uncertainty estimation techniques. We bridge this gap by conducting a comprehensive analysis of 35 adaptive retrieval methods, including 8 recent approaches and 27 uncertainty estimation techniques, across 6 datasets using 10 metrics for QA performance, self-knowledge, and efficiency. Our findings show that uncertainty estimation techniques often outperform complex pipelines in terms of efficiency and self-knowledge, while maintaining comparable QA performance.
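
Illustrative sketch (not taken from the paper): one simple family of uncertainty estimators scores an answer by the entropy of the model's per-token output distributions and triggers retrieval only when that score is high. The Python below shows this idea under stated assumptions: the function names, the use of mean token entropy specifically, and the threshold value are hypothetical, and the abstract does not say which of the 27 evaluated techniques work this way.

```python
import math
from typing import Sequence


def mean_token_entropy(token_distributions: Sequence[Sequence[float]]) -> float:
    """Average Shannon entropy (in nats) over per-token probability distributions."""
    if not token_distributions:
        return 0.0
    entropies = [
        -sum(p * math.log(p) for p in dist if p > 0.0)
        for dist in token_distributions
    ]
    return sum(entropies) / len(entropies)


def should_retrieve(token_distributions: Sequence[Sequence[float]],
                    threshold: float = 0.5) -> bool:
    """Retrieve external documents only when the model looks uncertain.

    `threshold` is a hypothetical value that would be tuned per dataset.
    """
    return mean_token_entropy(token_distributions) > threshold


if __name__ == "__main__":
    confident = [[0.99, 0.01], [0.98, 0.02]]
    uncertain = [[0.5, 0.5], [0.4, 0.6]]
    print(should_retrieve(confident))  # low entropy: answer without retrieval
    print(should_retrieve(uncertain))  # high entropy: call the retriever
```

A gate like this costs one pass over probabilities the model already produced, which is the kind of efficiency advantage over multi-step pipelines that the abstract reports.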

Title: ACEBench: Who Wins the Match Point in Tool Learning?

Authors: Chen Chen, Xinlong Hao, Weiwen Liu, Xu Huang, Xingshan Zeng, Shuai Yu, Dexun Li, Shuai Wang, Weinan Gan, Yuefeng Huang, Xinzhi Wang, Defu Lian, Baoqun Yin, Yasheng Wang, Wu Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.12851
Pdf URL: https://arxiv.org/pdf/2501.12851
Copy Paste: [[2501.12851]] ACEBench: Who Wins the Match Point in Tool Learning?(https://arxiv.org/abs/2501.12851)
Keywords: language model, llm, agent
Abstract: Large language models (LLMs) have demonstrated significant potential in decision-making and reasoning, especially when combined with various tools to effectively solve complex problems. However, existing evaluation systems for assessing LLM function calling capabilities have several limitations: (1) limited evaluation scenarios, lacking assessments in real multi-turn dialogue contexts; (2) narrow evaluation dimensions, lacking detailed assessments for fine-grained function calls; (3) relying on LLMs or real API executions for result evaluation, which introduces significant overhead. To address these issues, we propose a comprehensive evaluation system named ACEBench. This system is meticulously designed to encompass a wide spectrum of function calling scenarios. Moreover, it categorizes these scenarios into three primary types according to the evaluation methodology: Normal, Special, and Agent. Normal evaluates function calls in basic scenarios; Special evaluates function calls in scenarios with vague or incomplete instructions; Agent introduces multi-agent interactions to simulate function calling evaluation in real-world multi-turn interactions. We conducted extensive experiments on ACEBench, analyzing various LLMs in-depth and performing a more granular analysis of error causes across different data types.

Title: WisdomBot: Tuning Large Language Models with Artificial Intelligence Knowledge

Authors: Jingyuan Chen, Tao Wu, Wei Ji, Fei Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.12877
Pdf URL: https://arxiv.org/pdf/2501.12877
Copy Paste: [[2501.12877]] WisdomBot: Tuning Large Language Models with Artificial Intelligence Knowledge(https://arxiv.org/abs/2501.12877)
Keywords: language model, llm
Abstract: Large language models (LLMs) have emerged as powerful tools in natural language processing (NLP), showing a promising future for artificial general intelligence (AGI). Despite their notable performance in the general domain, LLMs have remained suboptimal in the field of education, owing to the unique challenges presented by this domain, such as the need for more specialized knowledge, the requirement for personalized learning experiences, and the necessity for concise explanations of complex concepts. To address these issues, this paper presents a novel LLM for education named WisdomBot, which combines the power of LLMs with educational theories, enabling their seamless integration into educational contexts. To be specific, we harness self-instructed knowledge concepts and instructions under the guidance of Bloom's Taxonomy as training data. To further enhance the accuracy and professionalism of the model's responses to factual questions, we introduce two key enhancements during inference, i.e., local knowledge base retrieval augmentation and search engine retrieval augmentation. We substantiate the effectiveness of our approach by applying it to several Chinese LLMs, thereby showcasing that the fine-tuned models can generate more reliable and professional responses.

Title: Test-Time Preference Optimization: On-the-Fly Alignment via Iterative Textual Feedback

Authors: Yafu Li, Xuyang Hu, Xiaoye Qu, Linjie Li, Yu Cheng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.12895
Pdf URL: https://arxiv.org/pdf/2501.12895
Copy Paste: [[2501.12895]] Test-Time Preference Optimization: On-the-Fly Alignment via Iterative Textual Feedback(https://arxiv.org/abs/2501.12895)
Keywords: language model, llm
Abstract: Large language models (LLMs) demonstrate impressive performance but lack the flexibility to adapt to human preferences quickly without retraining. In this work, we introduce Test-time Preference Optimization (TPO), a framework that aligns LLM outputs with human preferences during inference, removing the need to update model parameters. Rather than relying on purely numerical rewards, TPO translates reward signals into textual critiques and uses them as textual rewards to iteratively refine its response. Evaluations on benchmarks covering instruction following, preference alignment, safety, and mathematics reveal that TPO progressively improves alignment with human preferences. Notably, after only a few TPO steps, the initially unaligned Llama-3.1-70B-SFT model can surpass the aligned counterpart, Llama-3.1-70B-Instruct. Furthermore, TPO scales efficiently with both the search width and depth during inference. Through case studies, we illustrate how TPO exploits the innate capacity of LLMs to interpret and act upon reward signals. Our findings establish TPO as a practical, lightweight alternative for test-time preference optimization, achieving alignment on the fly. Our code is publicly available at this https URL.
摘要：大型语言模型 (LLM) 表现出色，但缺乏灵活性，无法在不重新训练的情况下快速适应人类偏好。在这项工作中，我们引入了测试时偏好优化 (TPO)，这是一个在推理过程中将 LLM 输出与人类偏好对齐的框架，从而无需更新模型参数。TPO 不依赖纯数字奖励，而是将奖励信号转化为文本批评，并将其用作文本奖励来迭代改进其响应。对涵盖指令遵循、偏好对齐、安全性和数学的基准的评估表明，TPO 逐步改善了与人类偏好的一致性。值得注意的是，仅经过几个 TPO 步骤，最初未对齐的 Llama-3.1-70B-SFT 模型就可以超越对齐的对应模型 Llama-3.1-70B-Instruct。此外，TPO 在推理过程中可以有效地扩展搜索宽度和深度。通过案例研究，我们说明了 TPO 如何利用 LLM 解释和根据奖励信号采取行动的先天能力。我们的研究结果表明，TPO 是一种实用的轻量级测试时偏好优化替代方案，可实现动态对齐。我们的代码已公开发布。
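
Illustrative sketch (not the authors' code): the abstract describes turning reward signals into textual critiques and refining the response iteratively, with a search width and depth. The loop below is one way such a procedure could be organized. Every name here (`tpo_loop`, `generate`, `reward`, `critique`, `revise`) is hypothetical, and critiquing only the current best candidate is an assumption, not something the abstract specifies.

```python
from typing import Callable, List


def tpo_loop(
    prompt: str,
    generate: Callable[[str], str],            # hypothetical: samples one response
    reward: Callable[[str], float],            # hypothetical: numerical reward model
    critique: Callable[[str, str], str],       # hypothetical: reward -> textual critique
    revise: Callable[[str, str, str], str],    # hypothetical: rewrite using the critique
    width: int = 2,
    depth: int = 2,
) -> str:
    """Refine a response at inference time without updating any model weights."""
    # Initial pool: `width` independent samples.
    candidates: List[str] = [generate(prompt) for _ in range(width)]
    for _ in range(depth):
        best = max(candidates, key=reward)
        # Assumption: the critique is written about the current best candidate.
        feedback = critique(prompt, best)
        # Keep the best so the top reward never decreases, then add revisions.
        candidates = [best] + [revise(prompt, best, feedback) for _ in range(width)]
    return max(candidates, key=reward)
```

With toy stand-ins (reward = string length, revision = append "!"), each of the `depth` rounds lengthens the best response by one character.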

Title: Architectural Fusion Through Contextual Partitioning in Large Language Models: A Novel Approach to Parameterized Knowledge Integration

Authors: Offa Kingsleigh, Alfred Abercrombie, David Woolstencroft, Beorhtric Meadowcroft, Marcus Irvin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.12901
Pdf URL: https://arxiv.org/pdf/2501.12901
Copy Paste: [[2501.12901]] Architectural Fusion Through Contextual Partitioning in Large Language Models: A Novel Approach to Parameterized Knowledge Integration(https://arxiv.org/abs/2501.12901)
Keywords: language model
Abstract: Contextual Partitioning introduces an innovative approach to enhancing the architectural design of large-scale computational models through the dynamic segmentation of parameters into context-aware regions. This methodology emphasizes the importance of task-specific specialization, achieved through adaptive parameter allocation mechanisms that align with the linguistic features of input data. Experimental evaluations demonstrated substantial improvements in accuracy, perplexity, and contextual coherence across a variety of linguistic tasks, highlighting the adaptability and scalability of the proposed framework. By reducing redundancy and enhancing computational efficiency, Contextual Partitioning not only streamlines model operations but also expands the scope of applications for advanced language processing systems. The approach operates autonomously, requiring no external fine-tuning, thereby addressing a significant limitation in conventional parameter optimization techniques. Empirical results demonstrate the effectiveness of gradient-driven segmentation, enabling models to dynamically recalibrate and specialize in response to task-specific demands. Furthermore, resource utilization metrics reveal notable reductions in memory usage and training times, confirming the efficiency of the approach. Observations from qualitative analyses illustrate improved contextual coherence and logical flow in generated outputs, reinforcing the practical value of this technique. The findings collectively demonstrate the potential for Contextual Partitioning to redefine the scalability and adaptability of computational language architectures in diverse and complex domains.
摘要：上下文分区引入了一种创新方法，通过将参数动态分割为上下文感知区域来增强大规模计算模型的架构设计。该方法强调了任务特定专业化的重要性，通过与输入数据的语言特征相一致的自适应参数分配机制实现。实验评估表明，在各种语言任务中，准确性、困惑度和上下文连贯性都有了显著的提高，凸显了所提框架的适应性和可扩展性。通过减少冗余和提高计算效率，上下文分区不仅简化了模型操作，还扩大了高级语言处理系统的应用范围。该方法自主运行，不需要外部微调，从而解决了传统参数优化技术的一个重大限制。实证结果证明了梯度驱动分割的有效性，使模型能够根据任务特定需求动态重新校准和专业化。此外，资源利用率指标显示内存使用量和训练时间显著减少，证实了该方法的效率。定性分析的观察结果表明，生成的输出中上下文连贯性和逻辑流程得到了改善，从而增强了该技术的实用价值。这些发现共同证明了上下文分区有潜力重新定义计算语言架构在多样化和复杂领域的可扩展性和适应性。

Title: FilmAgent: A Multi-Agent Framework for End-to-End Film Automation in Virtual 3D Spaces

Authors: Zhenran Xu, Longyue Wang, Jifang Wang, Zhouyi Li, Senbao Shi, Xue Yang, Yiyu Wang, Baotian Hu, Jun Yu, Min Zhang
Subjects: cs.CL, cs.GR, cs.MA
Abstract URL: https://arxiv.org/abs/2501.12909
Pdf URL: https://arxiv.org/pdf/2501.12909
Copy Paste: [[2501.12909]] FilmAgent: A Multi-Agent Framework for End-to-End Film Automation in Virtual 3D Spaces(https://arxiv.org/abs/2501.12909)
Keywords: gpt, llm, hallucination, agent
Abstract: Virtual film production requires intricate decision-making processes, including scriptwriting, virtual cinematography, and precise actor positioning and actions. Motivated by recent advances in automated decision-making with language agent-based societies, this paper introduces FilmAgent, a novel LLM-based multi-agent collaborative framework for end-to-end film automation in our constructed 3D virtual spaces. FilmAgent simulates various crew roles, including directors, screenwriters, actors, and cinematographers, and covers key stages of a film production workflow: (1) idea development transforms brainstormed ideas into structured story outlines; (2) scriptwriting elaborates on dialogue and character actions for each scene; (3) cinematography determines the camera setups for each shot. A team of agents collaborates through iterative feedback and revisions, thereby verifying intermediate scripts and reducing hallucinations. We evaluate the generated videos on 15 ideas and 4 key aspects. Human evaluation shows that FilmAgent outperforms all baselines across all aspects and scores 3.98 out of 5 on average, showing the feasibility of multi-agent collaboration in filmmaking. Further analysis reveals that FilmAgent, despite using the less advanced GPT-4o model, surpasses the single-agent o1, showing the advantage of a well-coordinated multi-agent system. Lastly, we discuss the complementary strengths and weaknesses of OpenAI's text-to-video model Sora and our FilmAgent in filmmaking.
摘要：虚拟电影制作需要复杂的决策过程，包括剧本创作、虚拟摄影以及演员的精确定位和动作。受语言代理社会在自动决策方面的最新进展的启发，本文介绍了 FilmAgent，这是一种基于 LLM 的新型多代理协作框架，用于在我们构建的 3D 虚拟空间中实现端到端电影自动化。FilmAgent 模拟了各种剧组角色，包括导演、编剧、演员和摄影师，并涵盖了电影制作工作流程的关键阶段：（1）创意开发将头脑风暴的想法转化为结构化的故事大纲；（2）剧本创作详细阐述每个场景的对话和角色动作；（3）摄影确定每个镜头的摄像机设置。代理团队通过迭代反馈和修订进行协作，从而验证中间脚本并减少幻觉。我们根据 15 个想法和 4 个关键方面评估生成的视频。人工评估显示，FilmAgent 在各个方面均优于所有基线，平均得分为 3.98（满分 5 分），表明多智能体协作在电影制作中的可行性。进一步分析表明，尽管 FilmAgent 使用的是不太先进的 GPT-4o 模型，但它却超越了单智能体 o1，显示出协调良好的多智能体系统的优势。最后，我们讨论了 OpenAI 的文本转视频模型 Sora 和我们的 FilmAgent 在电影制作中的互补优势和劣势。

Title: DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Authors: DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z.F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Honghui Ding, Huajian Xin, Huazuo Gao, Hui Qu, Hui Li, Jianzhong Guo, Jiashi Li, Jiawei Wang, Jingchang Chen, Jingyang Yuan, Junjie Qiu, Junlong Li, J.L. Cai, Jiaqi Ni, Jian Liang, Jin Chen, Kai Dong, Kai Hu, Kaige Gao, Kang Guan, Kexin Huang, Kuai Yu, Lean Wang, Lecong Zhang, Liang Zhao, Litong Wang, Liyue Zhang, Lei Xu, Leyi Xia, Mingchuan Zhang, Minghua Zhang, Minghui Tang, Meng Li, Miaojun Wang, Mingming Li, Ning Tian, Panpan Huang, Peng Zhang, Qiancheng Wang, Qinyu Chen, Qiushi Du, Ruiqi Ge, Ruisong Zhang, Ruizhe Pan, Runji Wang, R.J. Chen, R.L. Jin, Ruyi Chen, Shanghao Lu, Shangyan Zhou, Shanhuang Chen, Shengfeng Ye, Shiyu Wang, Shuiping Yu, Shunfeng Zhou, Shuting Pan, S.S. Li
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2501.12948
Pdf URL: https://arxiv.org/pdf/2501.12948
Copy Paste: [[2501.12948]] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning(https://arxiv.org/abs/2501.12948)
Keywords: llm
Abstract: We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities. Through RL, DeepSeek-R1-Zero naturally emerges with numerous powerful and intriguing reasoning behaviors. However, it encounters challenges such as poor readability and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates multi-stage training and cold-start data before RL. DeepSeek-R1 achieves performance comparable to OpenAI-o1-1217 on reasoning tasks. To support the research community, we open-source DeepSeek-R1-Zero, DeepSeek-R1, and six dense models (1.5B, 7B, 8B, 14B, 32B, 70B) distilled from DeepSeek-R1 based on Qwen and Llama.
摘要：我们介绍了第一代推理模型 DeepSeek-R1-Zero 和 DeepSeek-R1。DeepSeek-R1-Zero 是一种通过大规模强化学习 (RL) 训练的模型，无需监督微调 (SFT) 作为初步步骤，它展示了卓越的推理能力。通过 RL，DeepSeek-R1-Zero 自然而然地呈现出许多强大而有趣的推理行为。然而，它面临着可读性差和语言混合等挑战。为了解决这些问题并进一步提高推理性能，我们推出了 DeepSeek-R1，它在 RL 之前结合了多阶段训练和冷启动数据。DeepSeek-R1 在推理任务上实现了与 OpenAI-o1-1217 相当的性能。为了支持研究社区，我们开源了 DeepSeek-R1-Zero、DeepSeek-R1 以及基于 Qwen 和 Llama 从 DeepSeek-R1 提炼出的六个密集模型（1.5B、7B、8B、14B、32B、70B）。

Title: Efficient Prompt Compression with Evaluator Heads for Long-Context Transformer Inference

Authors: Weizhi Fei, Xueyan Niu, Guoqing Xie, Yingqing Liu, Bo Bai, Wei Han
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.12959
Pdf URL: https://arxiv.org/pdf/2501.12959
Copy Paste: [[2501.12959]] Efficient Prompt Compression with Evaluator Heads for Long-Context Transformer Inference(https://arxiv.org/abs/2501.12959)
Keywords: language model, llm, prompt
Abstract: Although applications involving long-context inputs are crucial for the effective utilization of large language models (LLMs), they also result in increased computational costs and reduced performance. To address this challenge, we propose an efficient, training-free prompt compression method that retains key information within compressed prompts. We identify specific attention heads in transformer-based LLMs, which we designate as evaluator heads, that are capable of selecting tokens in long inputs that are most significant for inference. Building on this discovery, we develop EHPC, an Evaluator Head-based Prompt Compression method, which enables LLMs to rapidly "skim through" input prompts by leveraging only the first few layers with evaluator heads during the pre-filling stage, subsequently passing only the important tokens to the model for inference. EHPC achieves state-of-the-art results across two mainstream benchmarks: prompt compression and long-context inference acceleration. Consequently, it effectively reduces the complexity and costs associated with commercial API calls. We further demonstrate that EHPC attains competitive results compared to key-value cache-based acceleration methods, thereby highlighting its potential to enhance the efficiency of LLMs for long-context tasks.
摘要：尽管涉及长上下文输入的应用对于有效利用大型语言模型 (LLM) 至关重要，但它们也会导致计算成本增加和性能下降。为了应对这一挑战，我们提出了一种高效、无需训练的提示压缩方法，该方法可在压缩提示中保留关键信息。我们在基于 Transformer 的 LLM 中识别特定的注意头，我们将其指定为评估头，它们能够在长输入中选择对推理最重要的标记。基于这一发现，我们开发了一种基于评估头的提示压缩方法 EHPC，该方法使 LLM 能够通过在预填充阶段仅利用具有评估头的前几层来快速“浏览”输入提示，随后仅将重要的标记传递给模型进行推理。EHPC 在两个主流基准测试中取得了最先进的结果：提示压缩和长上下文推理加速。因此，它有效地降低了与商业 API 调用相关的复杂性和成本。我们进一步证明，与基于键值缓存的加速方法相比，EHPC 获得了具有竞争力的结果，从而凸显了其在提高 LLM 执行长上下文任务的效率方面的潜力。
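
Illustrative sketch (not the authors' code): the abstract says that only the important tokens are passed to the model after a skim over the prompt. Assuming per-token importance scores are already available (how evaluator heads produce them is not described in the abstract), the selection step could look like the following. `compress_prompt`, `scores`, and `keep` are hypothetical names.

```python
from typing import List, Sequence


def compress_prompt(tokens: Sequence[str], scores: Sequence[float], keep: int) -> List[str]:
    """Keep the `keep` highest-scoring tokens, preserving their original order."""
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    # Rank token positions by score, highest first, and take the top `keep`.
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:keep]
    # Re-sort the surviving positions so the compressed prompt reads in order.
    return [tokens[i] for i in sorted(top)]
```

Preserving the original order is a design choice made here so that the compressed prompt remains a subsequence of the input.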

Title: OnionEval: An Unified Evaluation of Fact-conflicting Hallucination for Small-Large Language Models

Authors: Chongren Sun, Yuran Li, Di Wu, Benoit Boulet
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.12975
Pdf URL: https://arxiv.org/pdf/2501.12975
Copy Paste: [[2501.12975]] OnionEval: An Unified Evaluation of Fact-conflicting Hallucination for Small-Large Language Models(https://arxiv.org/abs/2501.12975)
Keywords: language model, llm, hallucination, chain-of-thought
Abstract: Large Language Models (LLMs) are highly capable but require significant computational resources for both training and inference. Within the LLM family, smaller models (those with fewer than 10 billion parameters) also perform well across various tasks. However, these smaller models share similar limitations to their larger counterparts, including the tendency to hallucinate. Despite the existence of many benchmarks to evaluate hallucination in LLMs, few have specifically focused on small LLMs (SLLMs). Additionally, SLLMs show widely varying performance across different benchmarks. In this paper, we introduce OnionEval, a multi-layer structured framework with a specific metric called the context-influence score (CI), designed to effectively assess the fact-conflicting hallucination tendencies of small LLMs across different contextual levels. Our experimental results reveal a key feature of SLLMs: they excel in factual analysis but face challenges with context reasoning. Further investigation shows that a simple Chain-of-Thought strategy can significantly reduce these limitations, improving the practical usefulness of SLLMs in real-world applications.
摘要：大型语言模型 (LLM) 功能强大，但需要大量计算资源进行训练和推理。在 LLM 家族中，较小的模型（参数少于 100 亿的模型）在各种任务中也表现良好。然而，这些较小的模型与较大的模型具有类似的局限性，包括产生幻觉的倾向。尽管存在许多用于评估 LLM 中幻觉的基准，但很少有专门针对小型 LLM（SLLM）的基准。此外，SLLM 在不同的基准上表现出很大的差异。在本文中，我们介绍了 OnionEval，这是一个多层结构化框架，具有称为上下文影响分数 (CI) 的特定指标，旨在有效评估不同上下文级别的小型 LLM 中与事实相冲突的幻觉倾向。我们的实验结果揭示了 SLLM 的一个关键特征：它们在事实分析方面表现出色，但在上下文推理方面面临挑战。进一步的研究表明，简单的思维链策略可以显著减少这些限制，从而提高 SLLM 在实际应用中的实用性。

Title: Implicit Causality-biases in humans and LLMs as a tool for benchmarking LLM discourse capabilities

Authors: Florian Kankowski, Torgrim Solstad, Sina Zarriess, Oliver Bott
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.12980
Pdf URL: https://arxiv.org/pdf/2501.12980
Copy Paste: [[2501.12980]] Implicit Causality-biases in humans and LLMs as a tool for benchmarking LLM discourse capabilities(https://arxiv.org/abs/2501.12980)
Keywords: llm
Abstract: In this paper, we compare data generated with mono- and multilingual LLMs spanning a range of model sizes with data provided by human participants in an experimental setting investigating well-established discourse biases. Beyond the comparison as such, we aim to develop a benchmark to assess the capabilities of LLMs with discourse biases as a robust proxy for more general discourse understanding capabilities. More specifically, we investigated Implicit Causality verbs, for which psycholinguistic research has found participants to display biases with regard to three phenomena: the establishment of (i) coreference relations (Experiment 1) and (ii) coherence relations (Experiment 2), and (iii) the use of particular referring expressions (Experiments 3 and 4). With regard to coreference biases we found only the largest monolingual LLM (German Bloom 6.4B) to display more human-like biases. For coherence relations, no LLM displayed the explanation bias usually found for humans. For referring expressions, all LLMs displayed a preference for referring to subject arguments with simpler forms than to objects. However, no bias effect on referring expression was found, as opposed to recent studies investigating human biases.
摘要：在本文中，我们将不同模型规模的单语和多语 LLM 生成的数据与人类参与者在研究已确立的话语偏见的实验环境中提供的数据进行比较。除了进行比较之外，我们还旨在开发一个基准来评估 LLM 的能力，以话语偏见作为更一般的话语理解能力的可靠代理。更具体地说，我们研究了隐性因果动词，心理语言学研究发现参与者在三种现象方面表现出偏见：建立 (i) 共指关系（实验 1）、(ii) 连贯关系（实验 2）和 (iii) 使用特定的指称表达（实验 3 和 4）。关于共指偏见，我们发现只有最大的单语 LLM (German Bloom 6.4B) 表现出更像人类的偏见。对于连贯关系，没有一个 LLM 表现出通常在人类身上发现的解释偏见。对于指称表达，所有 LLM 都表现出倾向于用比指称宾语更简单的形式来指称主语论元。然而，与最近研究人类偏见的研究相反，并未发现对指称表达的偏见效应。

Title: Pairwise RM: Perform Best-of-N Sampling with Knockout Tournament

Authors: Yantao Liu, Zijun Yao, Rui Min, Yixin Cao, Lei Hou, Juanzi Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.13007
Pdf URL: https://arxiv.org/pdf/2501.13007
Copy Paste: [[2501.13007]] Pairwise RM: Perform Best-of-N Sampling with Knockout Tournament(https://arxiv.org/abs/2501.13007)
Keywords: language model, llm
Abstract: Best-of-N (BoN) sampling, a common strategy for test-time scaling of Large Language Models (LLMs), relies on reward models to select the best candidate solution from multiple generations. However, traditional reward models often assign arbitrary and inconsistent scores, limiting their effectiveness. To address this, we propose a Pairwise Reward Model (Pairwise RM) combined with a knockout tournament for BoN sampling. Instead of assigning absolute scores, given one math problem, Pairwise RM evaluates two candidate solutions' correctness simultaneously. This approach eliminates the need for arbitrary scoring and enables cross-validation of solutions through parallel comparison. In the knockout tournament, Pairwise RM conducts pairwise comparisons between candidate solutions and eliminates the incorrect ones iteratively. We construct a large-scale dataset of 443K pairwise comparisons derived from NumiaMath and annotated using gemini-1.5-flash, and train the Pairwise RM via supervised fine-tuning. Experiments on MATH-500 and the Olympiad Bench demonstrate significant improvements over traditional discriminative reward models. Moreover, a 40% to 60% relative improvement is achieved on the top 50% challenging problems.
摘要：最佳 N (BoN) 抽样是大型语言模型 (LLM) 测试时扩展的常用策略，它依靠奖励模型从多个生成结果中选择最佳候选解决方案。然而，传统的奖励模型通常会分配任意且不一致的分数，从而限制了它们的有效性。为了解决这个问题，我们提出了一个成对奖励模型 (Pairwise RM)，并结合淘汰赛进行 BoN 抽样。对于一个数学问题，Pairwise RM 不会分配绝对分数，而是同时评估两个候选解决方案的正确性。这种方法消除了任意评分的需要，并通过并行比较实现了解决方案的交叉验证。在淘汰赛中，Pairwise RM 在候选解决方案之间进行成对比较，并迭代地消除不正确的解决方案。我们构建了一个包含 443K 成对比较的大规模数据集，该数据集源自 NumiaMath 并使用 gemini-1.5-flash 标注，并通过监督微调训练 Pairwise RM。在 MATH-500 和 Olympiad Bench 上的实验表明，与传统的判别奖励模型相比，该模型有显著的改进。在前 50% 的挑战性问题上，该模型实现了 40% 到 60% 的相对改进。
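
Illustrative sketch (not the authors' code): the knockout tournament described in the abstract can be pictured as single elimination over the N candidates, with a pairwise judge deciding each match. In the sketch below, `prefer` is a hypothetical stand-in for the Pairwise RM. The pairing order and the bye for an odd candidate are assumptions, since the abstract does not specify them.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def knockout_best_of_n(candidates: List[T], prefer: Callable[[T, T], T]) -> T:
    """Return the survivor of a single-elimination tournament over the candidates."""
    if not candidates:
        raise ValueError("need at least one candidate")
    pool = list(candidates)
    while len(pool) > 1:
        next_round: List[T] = []
        # Compare adjacent pairs; the preferred candidate of each pair advances.
        for i in range(0, len(pool) - 1, 2):
            next_round.append(prefer(pool[i], pool[i + 1]))
        # Assumption: with an odd pool, the last candidate advances without a match.
        if len(pool) % 2 == 1:
            next_round.append(pool[-1])
        pool = next_round
    return pool[0]
```

A tournament of this shape needs N - 1 pairwise comparisons to select one winner from N candidates.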

Title: Does Table Source Matter? Benchmarking and Improving Multimodal Scientific Table Understanding and Reasoning

Authors: Bohao Yang, Yingji Zhang, Dong Liu, André Freitas, Chenghua Lin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.13042
Pdf URL: https://arxiv.org/pdf/2501.13042
Copy Paste: [[2501.13042]] Does Table Source Matter? Benchmarking and Improving Multimodal Scientific Table Understanding and Reasoning(https://arxiv.org/abs/2501.13042)
Keywords: language model, llm
Abstract: Recent large language models (LLMs) have advanced table understanding capabilities but rely on converting tables into text sequences. While multimodal large language models (MLLMs) enable direct visual processing, they face limitations in handling scientific tables due to fixed input image resolutions and insufficient numerical reasoning capabilities. We present a comprehensive framework for multimodal scientific table understanding and reasoning with dynamic input image resolutions. Our framework consists of three key components: (1) MMSci-Pre, a domain-specific table structure learning dataset of 52K scientific table structure recognition samples, (2) MMSci-Ins, an instruction tuning dataset with 12K samples across three table-based tasks, and (3) MMSci-Eval, a benchmark with 3,114 testing samples specifically designed to evaluate numerical reasoning capabilities. Extensive experiments demonstrate that our domain-specific approach with 52K scientific table images achieves superior performance compared to 150K general-domain tables, highlighting the importance of data quality over quantity. Our proposed table-based MLLMs with dynamic input resolutions show significant improvements in both general table understanding and numerical reasoning capabilities, with strong generalisation to held-out datasets. Our code and data are publicly available.
摘要：最近的大型语言模型 (LLM) 提升了表格理解能力，但依赖于将表格转换为文本序列。虽然多模态大型语言模型 (MLLM) 支持直接的视觉处理，但由于输入图像分辨率固定且数值推理能力不足，它们在处理科学表格时面临限制。我们提出了一个全面的框架，用于在动态输入图像分辨率下理解和推理多模态科学表格。我们的框架由三个关键组件组成：(1) MMSci-Pre，一个包含 52K 科学表格结构识别样本的领域特定表格结构学习数据集，(2) MMSci-Ins，一个包含三个基于表格的任务的 12K 样本的指令调整数据集，以及 (3) MMSci-Eval，一个包含 3,114 个测试样本的基准，专门用于评估数值推理能力。大量实验表明，与 150K 通用领域表格相比，我们针对 52K 科学表格图像的领域特定方法实现了卓越的性能，凸显了数据质量而非数量的重要性。我们提出的具有动态输入分辨率的基于表格的 MLLM 在一般表格理解和数值推理能力方面均表现出显著的改进，并且对保留数据集具有很强的泛化能力。我们的代码和数据已公开发布。

Title: Autonomy-of-Experts Models

Authors: Ang Lv, Ruobing Xie, Yining Qian, Songhao Wu, Xingwu Sun, Zhanhui Kang, Di Wang, Rui Yan
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2501.13074
Pdf URL: https://arxiv.org/pdf/2501.13074
Copy Paste: [[2501.13074]] Autonomy-of-Experts Models(https://arxiv.org/abs/2501.13074)
Keywords: language model
Abstract: Mixture-of-Experts (MoE) models mostly use a router to assign tokens to specific expert modules, activating only partial parameters and often outperforming dense models. We argue that the separation between the router's decision-making and the experts' execution is a critical yet overlooked issue, leading to suboptimal expert selection and ineffective learning. To address this, we propose Autonomy-of-Experts (AoE), a novel MoE paradigm in which experts autonomously select themselves to process inputs. AoE is based on the insight that an expert is aware of its own capacity to effectively process a token, an awareness reflected in the scale of its internal activations. In AoE, routers are removed; instead, experts pre-compute internal activations for inputs and are ranked based on their activation norms. Only the top-ranking experts proceed with the forward pass, while the others abort. The overhead of pre-computing activations is reduced through a low-rank weight factorization. This self-evaluating-then-partner-comparing approach ensures improved expert selection and effective learning. We pre-train language models having 700M up to 4B parameters, demonstrating that AoE outperforms traditional MoE models with comparable efficiency.
摘要：混合专家 (MoE) 模型主要使用路由器将 token 分配给特定的专家模块，仅激活部分参数，并且通常优于密集模型。我们认为，路由器的决策与专家的执行之间的分离是一个关键但被忽视的问题，导致专家选择不理想和学习无效。为了解决这个问题，我们提出了专家自主 (AoE)，这是一种新颖的 MoE 范式，其中专家自主选择自己来处理输入。AoE 基于这样的见解：专家意识到自己有效处理 token 的能力，这种意识反映在其内部激活的规模中。在 AoE 中，路由器被删除；相反，专家预先计算输入的内部激活，并根据其激活规范进行排名。只有排名靠前的专家才能继续前向传递，而其他专家则中止。通过低秩权重分解减少了预先计算激活的开销。这种先自我评估再与合作伙伴比较的方法可确保改进专家选择和有效学习。我们预先训练了具有 7 亿至 40 亿个参数的语言模型，证明了 AoE 的表现优于具有同等效率的传统 MoE 模型。
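
Illustrative sketch (not the authors' code): the abstract states that experts pre-compute internal activations and are ranked by their activation norms, with only the top-ranking experts continuing the forward pass. The selection step could be sketched as below. Which activation is measured and which norm is used are assumptions here (a single linear projection and the L2 norm), and `select_experts_by_norm` and `projections` are hypothetical names.

```python
import math
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def matvec(w: Matrix, x: Sequence[float]) -> List[float]:
    """Multiply matrix w (rows x cols) by vector x."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def select_experts_by_norm(x: Sequence[float], projections: Sequence[Matrix], top_k: int) -> List[int]:
    """Return indices of the top_k experts whose activation for x has the largest L2 norm."""
    # Each expert computes its own activation; no separate router is consulted.
    norms = [math.sqrt(sum(a * a for a in matvec(w, x))) for w in projections]
    ranked = sorted(range(len(norms)), key=lambda i: norms[i], reverse=True)
    # Only these experts would continue the forward pass; the rest abort.
    return sorted(ranked[:top_k])
```

The low-rank weight factorization mentioned in the abstract would make each `matvec` cheaper; it is omitted here to keep the sketch short.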

Title: Refining Input Guardrails: Enhancing LLM-as-a-Judge Efficiency Through Chain-of-Thought Fine-Tuning and Alignment

Authors: Melissa Kazemi Rad, Huy Nghiem, Andy Luo, Sahil Wadhwa, Mohammad Sorower, Stephen Rawls
Subjects: cs.CL, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2501.13080
Pdf URL: https://arxiv.org/pdf/2501.13080
Copy Paste: [[2501.13080]] Refining Input Guardrails: Enhancing LLM-as-a-Judge Efficiency Through Chain-of-Thought Fine-Tuning and Alignment(https://arxiv.org/abs/2501.13080)
Keywords: language model, llm, chain-of-thought, agent
Abstract: Large Language Models (LLMs) have demonstrated powerful capabilities that render them valuable in different applications, including conversational AI products. It is paramount to ensure the security and reliability of these products by mitigating their vulnerabilities towards malicious user interactions, which can lead to the exposure of great risks and reputational repercussions. In this work, we present a comprehensive study on the efficacy of fine-tuning and aligning Chain-of-Thought (CoT) responses of different LLMs that serve as input moderation guardrails. We systematically explore various tuning methods by leveraging a small set of training data to adapt these models as proxy defense mechanisms to detect malicious inputs and provide a reasoning for their verdicts, thereby preventing the exploitation of conversational agents. We rigorously evaluate the efficacy and robustness of different tuning strategies to generalize across diverse adversarial and malicious query types. Our experimental results outline the potential of alignment processes tailored to a varied range of harmful input queries, even with constrained data resources. These techniques significantly enhance the safety of conversational AI systems and provide a feasible framework for deploying more secure and trustworthy AI-driven interactions.
摘要：大型语言模型 (LLM) 已展示出强大的功能，使其在不同应用（包括对话式 AI 产品）中具有价值。至关重要的是要通过减轻这些产品对恶意用户交互的脆弱性来确保其安全性和可靠性，因为恶意用户交互可能会导致巨大的风险和声誉影响。在这项工作中，我们对微调和对齐不同 LLM 的思路链 (CoT) 响应的有效性进行了全面研究，这些响应可作为输入审核护栏。我们系统地探索了各种调整方法，利用一小组训练数据来调整这些模型作为代理防御机制，以检测恶意输入并为它们的判决提供推理，从而防止对话代理被利用。我们严格评估不同调整策略的有效性和稳健性，以在各种对抗性和恶意查询类型中推广。我们的实验结果概述了即使在数据资源受限的情况下，针对各种有害输入查询量身定制的对齐过程的潜力。这些技术显著增强了对话式人工智能系统的安全性，并为部署更安全、更可信的人工智能驱动交互提供了可行的框架。