2025-02-20

Title: Private Text Generation by Seeding Large Language Model Prompts

Authors: Supriya Nagesh, Justin Y. Chen, Nina Mishra, Tal Wagner
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.13193
Pdf URL: https://arxiv.org/pdf/2502.13193
Copy Paste: [[2502.13193]] Private Text Generation by Seeding Large Language Model Prompts(https://arxiv.org/abs/2502.13193)
Keywords: language model, llm, prompt
Abstract: We explore how private synthetic text can be generated by suitably prompting a large language model (LLM). This addresses a challenge for organizations like hospitals, which hold sensitive text data like patient medical records, and wish to share it in order to train machine learning models for medical tasks, while preserving patient privacy. Methods that rely on training or finetuning a model may be out of reach, either due to API limits of third-party LLMs, or due to ethical and legal prohibitions on sharing the private data with the LLM itself. We propose Differentially Private Keyphrase Prompt Seeding (DP-KPS), a method that generates a private synthetic text corpus from a sensitive input corpus, by accessing an LLM only through privatized prompts. It is based on seeding the prompts with private samples from a distribution over phrase embeddings, thus capturing the input corpus while achieving requisite output diversity and maintaining differential privacy. We evaluate DP-KPS on downstream ML text classification tasks, and show that the corpora it generates preserve much of the predictive power of the original ones. Our findings offer hope that institutions can reap ML insights by privately sharing data with simple prompts and little compute.
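Summary: The paper studies how to generate private synthetic text by suitably prompting an LLM, motivated by organizations such as hospitals that hold sensitive text and want to share it for ML training while protecting patient privacy. Because training or fine-tuning may be ruled out by API limits or by ethical and legal bans on sharing private data with the LLM, the authors propose DP-KPS, which accesses the LLM only through privatized prompts seeded with private samples from a distribution over phrase embeddings. On downstream text classification tasks, the generated corpora retain much of the predictive power of the originals.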

Title: Thinking Outside the (Gray) Box: A Context-Based Score for Assessing Value and Originality in Neural Text Generation

Authors: Giorgio Franceschelli, Mirco Musolesi
Subjects: cs.CL, cs.AI, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2502.13207
Pdf URL: https://arxiv.org/pdf/2502.13207
Copy Paste: [[2502.13207]] Thinking Outside the (Gray) Box: A Context-Based Score for Assessing Value and Originality in Neural Text Generation(https://arxiv.org/abs/2502.13207)
Keywords: language model
Abstract: Despite the increasing use of large language models for creative tasks, their outputs often lack diversity. Common solutions, such as sampling at higher temperatures, can compromise the quality of the results. Drawing on information theory, we propose a context-based score to quantitatively evaluate value and originality. This score incentivizes accuracy and adherence to the request while fostering divergence from the learned distribution. We propose using our score as a reward in a reinforcement learning framework to fine-tune large language models for maximum performance. We validate our strategy through experiments in poetry generation and math problem solving, demonstrating that it enhances the value and originality of the generated solutions.
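Summary: Although LLMs are increasingly used for creative tasks, their outputs often lack diversity, and common remedies such as sampling at higher temperatures can hurt quality. Drawing on information theory, the authors propose a context-based score that quantifies value and originality, rewarding accuracy and adherence to the request while encouraging divergence from the learned distribution. Used as a reward for reinforcement-learning fine-tuning, the score improves the value and originality of outputs in poetry generation and math problem solving.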

Title: SearchRAG: Can Search Engines Be Helpful for LLM-based Medical Question Answering?

Authors: Yucheng Shi, Tianze Yang, Canyu Chen, Quanzheng Li, Tianming Liu, Xiang Li, Ninghao Liu
Subjects: cs.CL, cs.AI, cs.IR, cs.IT
Abstract URL: https://arxiv.org/abs/2502.13233
Pdf URL: https://arxiv.org/pdf/2502.13233
Copy Paste: [[2502.13233]] SearchRAG: Can Search Engines Be Helpful for LLM-based Medical Question Answering?(https://arxiv.org/abs/2502.13233)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Large Language Models (LLMs) have shown remarkable capabilities in general domains but often struggle with tasks requiring specialized knowledge. Conventional Retrieval-Augmented Generation (RAG) techniques typically retrieve external information from static knowledge bases, which can be outdated or incomplete, missing fine-grained clinical details essential for accurate medical question answering. In this work, we propose SearchRAG, a novel framework that overcomes these limitations by leveraging real-time search engines. Our method employs synthetic query generation to convert complex medical questions into search-engine-friendly queries and utilizes uncertainty-based knowledge selection to filter and incorporate the most relevant and informative medical knowledge into the LLM's input. Experimental results demonstrate that our method significantly improves response accuracy in medical question answering tasks, particularly for complex questions requiring detailed and up-to-date knowledge.
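Summary: LLMs perform well in general domains but struggle on tasks that need specialized knowledge, and conventional RAG retrieves from static knowledge bases that may be outdated, incomplete, or missing fine-grained clinical detail. SearchRAG instead uses real-time search engines: synthetic query generation turns complex medical questions into search-engine-friendly queries, and uncertainty-based knowledge selection filters the results that enter the LLM's input. Experiments show significantly higher accuracy on medical question answering, especially for complex questions that require detailed, up-to-date knowledge.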

Title: When People are Floods: Analyzing Dehumanizing Metaphors in Immigration Discourse with Large Language Models

Authors: Julia Mendelsohn, Ceren Budak
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2502.13246
Pdf URL: https://arxiv.org/pdf/2502.13246
Copy Paste: [[2502.13246]] When People are Floods: Analyzing Dehumanizing Metaphors in Immigration Discourse with Large Language Models(https://arxiv.org/abs/2502.13246)
Keywords: language model
Abstract: Metaphor, discussing one concept in terms of another, is abundant in politics and can shape how people understand important issues. We develop a computational approach to measure metaphorical language, focusing on immigration discourse on social media. Grounded in qualitative social science research, we identify seven concepts evoked in immigration discourse (e.g. "water" or "vermin"). We propose and evaluate a novel technique that leverages both word-level and document-level signals to measure metaphor with respect to these concepts. We then study the relationship between metaphor, political ideology, and user engagement in 400K US tweets about immigration. While conservatives tend to use dehumanizing metaphors more than liberals, this effect varies widely across concepts. Moreover, creature-related metaphor is associated with more retweets, especially for liberal authors. Our work highlights the potential for computational methods to complement qualitative approaches in understanding subtle and implicit language in political discourse.
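Summary: Metaphor, which discusses one concept in terms of another, is abundant in politics and shapes how people understand important issues. The authors develop a computational method that combines word-level and document-level signals to measure metaphor with respect to seven concepts evoked in immigration discourse (e.g. "water" or "vermin"), and apply it to 400K US tweets about immigration. Conservatives tend to use dehumanizing metaphors more than liberals, though the effect varies widely across concepts, and creature-related metaphor is associated with more retweets, especially for liberal authors.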

Title: Grounding LLM Reasoning with Knowledge Graphs

Authors: Alfonso Amayuelas, Joy Sain, Simerjot Kaur, Charese Smiley
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.13247
Pdf URL: https://arxiv.org/pdf/2502.13247
Copy Paste: [[2502.13247]] Grounding LLM Reasoning with Knowledge Graphs(https://arxiv.org/abs/2502.13247)
Keywords: language model, llm, chain-of-thought, tree-of-thought, agent
Abstract: Knowledge Graphs (KGs) are valuable tools for representing relationships between entities in a structured format. Traditionally, these knowledge bases are queried to extract specific information. However, question-answering (QA) over such KGs poses a challenge due to the intrinsic complexity of natural language compared to the structured format and the size of these graphs. Despite these challenges, the structured nature of KGs can provide a solid foundation for grounding the outputs of Large Language Models (LLMs), offering organizations increased reliability and control. Recent advancements in LLMs have introduced reasoning methods at inference time to improve their performance and maximize their capabilities. In this work, we propose integrating these reasoning strategies with KGs to anchor every step or "thought" of the reasoning chains in KG data. Specifically, we evaluate both agentic and automated search methods across several reasoning strategies, including Chain-of-Thought (CoT), Tree-of-Thought (ToT), and Graph-of-Thought (GoT), using GRBench, a benchmark dataset for graph reasoning with domain-specific graphs. Our experiments demonstrate that this approach consistently outperforms baseline models, highlighting the benefits of grounding LLM reasoning processes in structured KG data.
Summary: Knowledge graphs (KGs) represent relationships between entities in a structured format, but question answering over them is hard because of the complexity of natural language relative to the structured format and the size of the graphs. The authors integrate inference-time reasoning strategies with KGs so that every step or "thought" of a reasoning chain is anchored in KG data. They evaluate agentic and automated search methods across Chain-of-Thought (CoT), Tree-of-Thought (ToT), and Graph-of-Thought (GoT) on GRBench, and report consistent gains over baseline models.

Title: Neural Attention Search

Authors: Difan Deng, Marius Lindauer
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.13251
Pdf URL: https://arxiv.org/pdf/2502.13251
Copy Paste: [[2502.13251]] Neural Attention Search(https://arxiv.org/abs/2502.13251)
Keywords: language model
Abstract: We present Neural Attention Search (NAtS), a framework that automatically evaluates the importance of each token within a sequence and determines if the corresponding token can be dropped after several steps. This approach can efficiently reduce the KV cache sizes required by transformer-based models during inference and thus reduce inference costs. In this paper, we design a search space that contains three token types: (i) Global Tokens will be preserved and queried by all the following tokens. (ii) Local Tokens survive until the next global token appears. (iii) Sliding Window Tokens have an impact on the inference of a fixed size of the next following tokens. Similar to the One-Shot Neural Architecture Search approach, this token-type information can be learned jointly with the architecture weights via a learnable attention mask. Experiments on both training a new transformer from scratch and fine-tuning existing large language models show that NAtS can efficiently reduce the KV cache size required for the models while maintaining the models' performance.
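Summary: Neural Attention Search (NAtS) automatically evaluates the importance of each token in a sequence and decides whether it can be dropped after several steps, which shrinks the KV cache and lowers inference cost for transformer-based models. The search space has three token types: (i) global tokens, preserved and queried by all following tokens; (ii) local tokens, which survive until the next global token appears; and (iii) sliding window tokens, which affect only a fixed number of following tokens. As in One-Shot Neural Architecture Search, the token types are learned jointly with the architecture weights through a learnable attention mask. Experiments on training from scratch and on fine-tuning existing LLMs show smaller KV caches with maintained performance.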
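The three token types described above can be illustrated with a small attention-mask builder. This is a minimal sketch under stated assumptions, not the authors' implementation: the type labels `"G"`, `"L"`, `"W"`, the function name `build_mask`, and the `window` parameter are hypothetical, the token types are given by hand rather than learned, and a local token is assumed to stay visible up to and including the next global token.

```python
from typing import List

# Hypothetical labels for the three token types named in the abstract.
GLOBAL, LOCAL, WINDOW = "G", "L", "W"


def build_mask(types: List[str], window: int) -> List[List[bool]]:
    """Return mask[i][j] == True when query position i may attend to key j."""
    n = len(types)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):              # query position
        for j in range(i + 1):      # key position (causal: j <= i)
            if j == i:
                mask[i][j] = True   # every token attends to itself
            elif types[j] == GLOBAL:
                mask[i][j] = True   # global tokens are kept for all later queries
            elif types[j] == LOCAL:
                # Assumption: a local token is dropped once a global token
                # has appeared strictly between it and the query.
                mask[i][j] = GLOBAL not in types[j + 1:i]
            elif types[j] == WINDOW:
                # Sliding-window tokens only reach the next `window - 1` positions.
                mask[i][j] = (i - j) < window
    return mask


types = ["G", "L", "W", "G", "L"]
mask = build_mask(types, window=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Under this mask, any key that no later query can attend to could be evicted from the KV cache, which is where the memory saving would come from.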

Title: Multilingual Language Model Pretraining using Machine-translated Data

Authors: Jiayi Wang, Yao Lu, Maurice Weber, Max Ryabinin, David Adelani, Yihong Chen, Raphael Tang, Pontus Stenetorp
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.13252
Pdf URL: https://arxiv.org/pdf/2502.13252
Copy Paste: [[2502.13252]] Multilingual Language Model Pretraining using Machine-translated Data(https://arxiv.org/abs/2502.13252)
Keywords: language model, llm
Abstract: High-resource languages such as English enable the pretraining of high-quality large language models (LLMs). The same cannot be said for most other languages, as LLMs still underperform for non-English languages, likely due to a gap in the quality and diversity of the available multilingual pretraining corpora. In this work, we find that machine-translated texts from a single high-quality source language can contribute significantly to the pretraining quality of multilingual LLMs. We translate FineWeb-Edu, a high-quality English web dataset, into nine languages, resulting in a 1.7-trillion-token dataset, which we call TransWebEdu, and pretrain a 1.3B-parameter model, TransWebLLM, from scratch on this dataset. Across nine non-English reasoning tasks, we show that TransWebLLM matches or outperforms state-of-the-art multilingual models trained using closed data, such as Llama3.2, Qwen2.5, and Gemma, despite using an order of magnitude less data. We demonstrate that adding less than 5% of TransWebEdu as domain-specific pretraining data sets a new state-of-the-art in Arabic, Italian, Indonesian, Swahili, and Welsh understanding and commonsense reasoning tasks. To promote reproducibility, we release our corpus, models, and training pipeline under Open Source Initiative-approved licenses.
Summary: High-resource languages such as English enable the pretraining of high-quality LLMs, while LLMs still underperform in most other languages, likely because of gaps in the quality and diversity of multilingual pretraining corpora. The authors translate FineWeb-Edu, a high-quality English web dataset, into nine languages to produce the 1.7-trillion-token TransWebEdu, and pretrain a 1.3B-parameter model, TransWebLLM, from scratch on it. Across nine non-English reasoning tasks, TransWebLLM matches or outperforms multilingual models trained on closed data, such as Llama3.2, Qwen2.5, and Gemma, despite using an order of magnitude less data. The corpus, models, and training pipeline are released under Open Source Initiative-approved licenses.

Title: HumT DumT: Measuring and controlling human-like language in LLMs

Authors: Myra Cheng, Sunny Yu, Dan Jurafsky
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2502.13259
Pdf URL: https://arxiv.org/pdf/2502.13259
Copy Paste: [[2502.13259]] HumT DumT: Measuring and controlling human-like language in LLMs(https://arxiv.org/abs/2502.13259)
Keywords: llm
Abstract: Should LLMs generate language that makes them seem human? Human-like language might improve user experience, but might also lead to overreliance and stereotyping. Assessing these potential impacts requires a systematic way to measure human-like tone in LLM outputs. We introduce HumT and SocioT, metrics for human-like tone and other dimensions of social perceptions in text data based on relative probabilities from an LLM. By measuring HumT across preference and usage datasets, we find that users prefer less human-like outputs from LLMs. HumT also offers insights into the impacts of anthropomorphism: human-like LLM outputs are highly correlated with warmth, social closeness, femininity, and low status, which are closely linked to the aforementioned harms. We introduce DumT, a method using HumT to systematically control and reduce the degree of human-like tone while preserving model performance. DumT offers a practical approach for mitigating risks associated with anthropomorphic language generation.
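Summary: Should LLMs generate language that makes them seem human? Human-like language may improve user experience but may also lead to overreliance and stereotyping. The authors introduce HumT and SocioT, metrics for human-like tone and other dimensions of social perception based on relative probabilities from an LLM, and find that users prefer less human-like outputs. Human-like outputs correlate strongly with warmth, social closeness, femininity, and low status. DumT uses HumT to systematically control and reduce human-like tone while preserving model performance.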

Title: Stepwise Perplexity-Guided Refinement for Efficient Chain-of-Thought Reasoning in Large Language Models

Authors: Yingqian Cui, Pengfei He, Jingying Zeng, Hui Liu, Xianfeng Tang, Zhenwei Dai, Yan Han, Chen Luo, Jing Huang, Zhen Li, Suhang Wang, Yue Xing, Jiliang Tang, Qi He
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.13260
Pdf URL: https://arxiv.org/pdf/2502.13260
Copy Paste: [[2502.13260]] Stepwise Perplexity-Guided Refinement for Efficient Chain-of-Thought Reasoning in Large Language Models(https://arxiv.org/abs/2502.13260)
Keywords: language model, llm, chain-of-thought
Abstract: Chain-of-Thought (CoT) reasoning, which breaks down complex tasks into intermediate reasoning steps, has significantly enhanced the performance of large language models (LLMs) on challenging tasks. However, the detailed reasoning process in CoT often incurs long generation times and high computational costs, partly due to the inclusion of unnecessary steps. To address this, we propose a method to identify critical reasoning steps using perplexity as a measure of their importance: a step is deemed critical if its removal causes a significant increase in perplexity. Our method enables models to focus solely on generating these critical steps. This can be achieved through two approaches: refining demonstration examples in few-shot CoT or fine-tuning the model using selected examples that include only critical steps. Comprehensive experiments validate the effectiveness of our method, which achieves a better balance between the reasoning accuracy and efficiency of CoT.
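Summary: Chain-of-Thought (CoT) reasoning improves LLM performance on challenging tasks, but detailed reasoning incurs long generation times and high computational cost, partly because of unnecessary steps. The authors identify critical reasoning steps by perplexity: a step is critical if removing it causes a significant increase in perplexity. Models are then steered to generate only those steps, either by refining few-shot CoT demonstrations or by fine-tuning on examples that contain only critical steps, which yields a better balance between reasoning accuracy and efficiency.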
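The criterion above (a step is critical if removing it raises perplexity significantly) can be sketched as a leave-one-out loop. This is an illustration under stated assumptions rather than the paper's procedure: `critical_steps`, the absolute `threshold`, and `toy_perplexity` are hypothetical, and in practice the perplexity callable would query a language model instead of the hand-written stand-in used here.

```python
from typing import Callable, List, Sequence


def critical_steps(
    steps: Sequence[str],
    perplexity: Callable[[Sequence[str]], float],
    threshold: float,
) -> List[str]:
    """Keep the steps whose removal raises perplexity by more than `threshold`."""
    base = perplexity(steps)
    kept = []
    for k, step in enumerate(steps):
        without = list(steps[:k]) + list(steps[k + 1:])  # chain with step k removed
        if perplexity(without) - base > threshold:
            kept.append(step)
    return kept


def toy_perplexity(steps: Sequence[str]) -> float:
    # Hypothetical stand-in for a model call: perplexity rises by 5.0
    # for each arithmetic step that is missing from the chain.
    needed = {"2 + 3 = 5", "5 * 4 = 20"}
    return 10.0 + 5.0 * len(needed - set(steps))


chain = ["Let us think.", "2 + 3 = 5", "5 * 4 = 20", "So the answer is 20."]
print(critical_steps(chain, toy_perplexity, threshold=1.0))
```

The pruned chain could then serve as a shorter few-shot demonstration or as a fine-tuning target, the two uses the abstract mentions.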

Title: REALTALK: A 21-Day Real-World Dataset for Long-Term Conversation

Authors: Dong-Ho Lee, Adyasha Maharana, Jay Pujara, Xiang Ren, Francesco Barbieri
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.13270
Pdf URL: https://arxiv.org/pdf/2502.13270
Copy Paste: [[2502.13270]] REALTALK: A 21-Day Real-World Dataset for Long-Term Conversation(https://arxiv.org/abs/2502.13270)
Keywords: llm, chat
Abstract: Long-term, open-domain dialogue capabilities are essential for chatbots aiming to recall past interactions and demonstrate emotional intelligence (EI). Yet, most existing research relies on synthetic, LLM-generated data, leaving open questions about real-world conversational patterns. To address this gap, we introduce REALTALK, a 21-day corpus of authentic messaging app dialogues, providing a direct benchmark against genuine human interactions. We first conduct a dataset analysis, focusing on EI attributes and persona consistency to understand the unique challenges posed by real-world dialogues. By comparing with LLM-generated conversations, we highlight key differences, including diverse emotional expressions and variations in persona stability that synthetic dialogues often fail to capture. Building on these insights, we introduce two benchmark tasks: (1) persona simulation where a model continues a conversation on behalf of a specific user given prior dialogue context; and (2) memory probing where a model answers targeted questions requiring long-term memory of past interactions. Our findings reveal that models struggle to simulate a user solely from dialogue history, while fine-tuning on specific user chats improves persona emulation. Additionally, existing models face significant challenges in recalling and leveraging long-term context within real-world conversations.
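Summary: Long-term, open-domain dialogue ability is essential for chatbots that aim to recall past interactions and show emotional intelligence (EI), yet most research relies on synthetic, LLM-generated data. REALTALK is a 21-day corpus of authentic messaging app dialogues. Compared with LLM-generated conversations, it shows more diverse emotional expression and more variation in persona stability. The authors define two benchmark tasks, persona simulation and memory probing, and find that models struggle to simulate a user from dialogue history alone, that fine-tuning on a specific user's chats improves persona emulation, and that recalling and using long-term context remains difficult.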

Title: Understanding and Tackling Label Errors in Individual-Level Nature Language Understanding

Authors: Yunpeng Xiao, Youpeng Zhao, Kai Shu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.13297
Pdf URL: https://arxiv.org/pdf/2502.13297
Copy Paste: [[2502.13297]] Understanding and Tackling Label Errors in Individual-Level Nature Language Understanding(https://arxiv.org/abs/2502.13297)
Keywords: language model, gpt
Abstract: Natural language understanding (NLU) is a task that enables machines to understand human language. Some tasks, such as stance detection and sentiment analysis, are closely related to individual subjective perspectives, thus termed individual-level NLU. Previously, these tasks have often been simplified to text-level NLU tasks, ignoring individual factors. This not only makes inference difficult and unexplainable but often results in a large number of label errors when creating datasets. To address the above limitations, we propose a new NLU annotation guideline based on individual-level factors. Specifically, we incorporate other posts by the same individual and then annotate individual subjective perspectives after considering all individual posts. We use this guideline to expand and re-annotate the stance detection and topic-based sentiment analysis datasets. We find that error rates in the samples were as high as 31.7% and 23.3%. We further use large language models to conduct experiments on the re-annotation datasets and find that the large language models perform well on both datasets after adding individual factors. Both GPT-4o and Llama3-70B can achieve an accuracy greater than 87% on the re-annotation datasets. We also verify the effectiveness of individual factors through ablation studies. We call on future researchers to add individual factors when creating such datasets. Our re-annotation dataset can be found at this https URL
Summary: Some natural language understanding (NLU) tasks, such as stance detection and sentiment analysis, depend on individual subjective perspectives and are termed individual-level NLU. Treating them as text-level tasks makes inference difficult and unexplainable and introduces many label errors. The authors propose an annotation guideline that takes an individual's other posts into account, use it to expand and re-annotate stance detection and topic-based sentiment analysis datasets, and find sample error rates as high as 31.7% and 23.3%. With individual factors added, GPT-4o and Llama3-70B both exceed 87% accuracy on the re-annotated datasets.

Title: Improving Multi-turn Task Completion in Task-Oriented Dialog Systems via Prompt Chaining and Fine-Grained Feedback

Authors: Moghis Fereidouni, Md Sajid Ahmed, Adib Mosharrof, A.B. Siddique
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.13298
Pdf URL: https://arxiv.org/pdf/2502.13298
Copy Paste: [[2502.13298]] Improving Multi-turn Task Completion in Task-Oriented Dialog Systems via Prompt Chaining and Fine-Grained Feedback(https://arxiv.org/abs/2502.13298)
Keywords: language model, llm, prompt
Abstract: Task-oriented dialog (TOD) systems facilitate users in accomplishing complex, multi-turn tasks through natural language. While traditional approaches rely on extensive fine-tuning and annotated data for each domain, instruction-tuned large language models (LLMs) offer a more flexible alternative. However, LLMs struggle to reliably handle multi-turn task completion, particularly with accurately generating API calls and adapting to new domains without explicit demonstrations. To address these challenges, we propose RealTOD, a novel framework that enhances TOD systems through prompt chaining and fine-grained feedback mechanisms. Prompt chaining enables zero-shot domain adaptation via a two-stage prompting strategy, eliminating the need for human-curated demonstrations. Meanwhile, the fine-grained feedback mechanism improves task completion by verifying API calls against domain schemas and providing precise corrective feedback when errors are detected. We conduct extensive experiments on the SGD and BiTOD benchmarks using four LLMs. RealTOD improves API accuracy, surpassing AutoTOD by 37.74% on SGD and SimpleTOD by 11.26% on BiTOD. Human evaluations further confirm that LLMs integrated with RealTOD achieve superior task completion, fluency, and informativeness compared to existing methods.
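Summary: Task-oriented dialog (TOD) systems help users complete complex, multi-turn tasks through natural language. Instruction-tuned LLMs are more flexible than per-domain fine-tuning, but they struggle to generate accurate API calls and to adapt to new domains without explicit demonstrations. RealTOD addresses this with prompt chaining, a two-stage prompting strategy that enables zero-shot domain adaptation, and fine-grained feedback, which verifies API calls against domain schemas and returns precise corrective feedback when errors are detected. On SGD and BiTOD with four LLMs, RealTOD surpasses AutoTOD by 37.74% on SGD and SimpleTOD by 11.26% on BiTOD in API accuracy, and human evaluation confirms better task completion, fluency, and informativeness.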
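The feedback mechanism described above (verify an API call against a domain schema, then return precise corrective feedback) might look roughly like the following. This is a hedged sketch, not RealTOD's code: the schema layout, the `check_api_call` function, the feedback wording, and the `FindRestaurants` example are all assumptions made for illustration.

```python
from typing import Any, Dict, List


def check_api_call(call: Dict[str, Any], schema: Dict[str, Dict[str, Any]]) -> List[str]:
    """Return corrective feedback messages; an empty list means the call is valid."""
    method = call.get("method")
    if method not in schema:
        return [f"Unknown method '{method}'. Valid methods: {sorted(schema)}."]
    spec = schema[method]
    args = call.get("args", {})
    feedback = []
    # Every required slot must be present.
    for name in spec["required"]:
        if name not in args:
            feedback.append(f"Missing required slot '{name}' for {method}.")
    # No slot outside the schema may be used.
    allowed = set(spec["required"]) | set(spec.get("optional", []))
    for name in args:
        if name not in allowed:
            feedback.append(f"Unexpected slot '{name}' for {method}.")
    return feedback


# Hypothetical domain schema and a faulty call produced by a model.
schema = {"FindRestaurants": {"required": ["city", "cuisine"], "optional": ["price_range"]}}
bad_call = {"method": "FindRestaurants", "args": {"city": "Oakland", "food": "thai"}}
for message in check_api_call(bad_call, schema):
    print(message)
```

In a dialog loop, the returned messages would be appended to the next prompt so the model can repair its call; an empty list means the call can be executed as is.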

Title: Evaluating and Enhancing Out-of-Domain Generalization of Task-Oriented Dialog Systems for Task Completion without Turn-level Dialog Annotations

Authors: Adib Mosharrof, Moghis Fereidouni, A.B. Siddique
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.13310
Pdf URL: https://arxiv.org/pdf/2502.13310
Copy Paste: [[2502.13310]] Evaluating and Enhancing Out-of-Domain Generalization of Task-Oriented Dialog Systems for Task Completion without Turn-level Dialog Annotations(https://arxiv.org/abs/2502.13310)
Keywords: language model, llm, prompt
Abstract: Traditional task-oriented dialog (ToD) systems rely heavily on labor-intensive turn-level annotations, such as dialogue states and policy labels, for training. This work explores whether large language models (LLMs) can be fine-tuned solely on natural language dialogs to perform ToD tasks, without requiring such annotations. We evaluate their ability to generalize to unseen domains and compare their performance with models trained on fully annotated data. Through extensive experiments with three open-source LLMs of varying sizes and two diverse ToD datasets, we find that models fine-tuned without turn-level annotations generate coherent and contextually appropriate responses. However, their task completion performance - measured by accurate execution of API calls - remains suboptimal, with the best models achieving only around 53% success in unseen domains. To improve task completion, we propose ZeroToD, a framework that incorporates a schema augmentation mechanism to enhance API call accuracy and overall task completion rates, particularly in out-of-domain settings. We also compare ZeroToD with fine-tuning-free alternatives, such as prompting off-the-shelf LLMs, and find that our framework enables smaller, fine-tuned models that outperform large-scale proprietary LLMs in task completion. Additionally, a human study evaluating informativeness, fluency, and task completion confirms our empirical findings. These findings suggest the feasibility of developing cost-effective, scalable, and zero-shot generalizable ToD systems for real-world applications.
Summary: Traditional task-oriented dialog (ToD) systems rely on labor-intensive turn-level annotations such as dialogue states and policy labels. This work asks whether LLMs can be fine-tuned on natural language dialogs alone. Across three open-source LLMs and two ToD datasets, such models produce coherent, contextually appropriate responses, but task completion, measured by accurate execution of API calls, reaches only around 53% in unseen domains. ZeroToD adds a schema augmentation mechanism that improves API call accuracy and task completion, especially out of domain, and lets smaller fine-tuned models outperform large proprietary LLMs. A human study confirms the findings.

Title: Training Turn-by-Turn Verifiers for Dialogue Tutoring Agents: The Curious Case of LLMs as Your Coding Tutors

Authors: Jian Wang, Yinpei Dai, Yichi Zhang, Ziqiao Ma, Wenjie Li, Joyce Chai
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.13311
Pdf URL: https://arxiv.org/pdf/2502.13311
Copy Paste: [[2502.13311]] Training Turn-by-Turn Verifiers for Dialogue Tutoring Agents: The Curious Case of LLMs as Your Coding Tutors(https://arxiv.org/abs/2502.13311)
Keywords: language model, llm, agent
Abstract: Intelligent tutoring agents powered by large language models (LLMs) have been increasingly explored to deliver personalized guidance in areas such as language learning and science education. However, their capabilities in guiding users to solve complex real-world tasks remain underexplored. To address this limitation, in this work, we focus on coding tutoring, a challenging problem that requires tutors to proactively guide students toward completing predefined coding tasks. We propose a novel agent workflow, Trace-and-Verify (TRAVER), which combines knowledge tracing to estimate a student's knowledge state and turn-by-turn verification to ensure effective guidance toward task completion. We introduce DICT, an automatic evaluation protocol that assesses tutor agents holistically using controlled student simulation and code generation tests. Extensive experiments reveal the challenges of coding tutoring and demonstrate that TRAVER achieves a significantly higher success rate. Although we use code tutoring as an example in this paper, our results and findings can be extended beyond coding, providing valuable insights into advancing tutoring agents for a variety of tasks.
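Summary: LLM-powered tutoring agents have been explored for personalized guidance in areas such as language learning and science education, but their ability to guide users through complex real-world tasks is underexplored. The authors focus on coding tutoring, where a tutor must proactively guide students toward completing predefined coding tasks. They propose Trace-and-Verify (TRAVER), an agent workflow that combines knowledge tracing, to estimate the student's knowledge state, with turn-by-turn verification, and DICT, an automatic evaluation protocol based on controlled student simulation and code generation tests. Experiments show that TRAVER achieves a significantly higher success rate.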

Title: Elucidating Mechanisms of Demographic Bias in LLMs for Healthcare

Authors: Hiba Ahsan, Arnab Sen Sharma, Silvio Amir, David Bau, Byron C. Wallace
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.13319
Pdf URL: https://arxiv.org/pdf/2502.13319
Copy Paste: [[2502.13319]] Elucidating Mechanisms of Demographic Bias in LLMs for Healthcare(https://arxiv.org/abs/2502.13319)
Keywords: llm
Abstract: We know from prior work that LLMs encode social biases, and that this manifests in clinical tasks. In this work we adopt tools from mechanistic interpretability to unveil sociodemographic representations and biases within LLMs in the context of healthcare. Specifically, we ask: Can we identify activations within LLMs that encode sociodemographic information (e.g., gender, race)? We find that gender information is highly localized in middle MLP layers and can be reliably manipulated at inference time via patching. Such interventions can surgically alter generated clinical vignettes for specific conditions, and also influence downstream clinical predictions which correlate with gender, e.g., patient risk of depression. We find that representation of patient race is somewhat more distributed, but can also be intervened upon, to a degree. To our knowledge, this is the first application of mechanistic interpretability methods to LLMs for healthcare.
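Summary: Prior work shows that LLMs encode social biases that manifest in clinical tasks. Using tools from mechanistic interpretability, the authors ask whether activations that encode sociodemographic information (e.g., gender, race) can be identified within LLMs. Gender information is highly localized in middle MLP layers and can be reliably manipulated at inference time via patching, which alters generated clinical vignettes and downstream predictions such as patient risk of depression. Representation of patient race is somewhat more distributed but can also be intervened upon to a degree.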

Title: Language Models Can Predict Their Own Behavior

Authors: Dhananjay Ashok, Jonathan May
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.13329
Pdf URL: https://arxiv.org/pdf/2502.13329
Copy Paste: [[2502.13329]] Language Models Can Predict Their Own Behavior(https://arxiv.org/abs/2502.13329)
Keywords: language model, prompt, chain-of-thought
Abstract: Autoregressive Language Models output text by sequentially predicting the next token to generate, with modern methods like Chain-of-Thought (CoT) prompting achieving state-of-the-art reasoning capabilities by scaling the number of generated tokens. However, are there times when we can infer how the model will behave (e.g. abstain from answering a question) early in the computation, making generation unnecessary? We show that internal representation of input tokens alone can often precisely predict, not just the next token, but eventual behavior over the entire output sequence. We leverage this capacity and learn probes on internal states to create early warning (and exit) systems. Specifically, if the probes can confidently estimate the way the LM is going to behave, then the system will avoid generating tokens altogether and return the estimated behavior instead. On 27 text classification datasets spanning five different tasks, we apply this method to estimate the eventual answer of an LM under CoT prompting, reducing inference costs by 65% (average) while suffering an accuracy loss of no more than 1.4% (worst case). We demonstrate the potential of this method to pre-emptively identify when a model will abstain from answering a question, fail to follow output format specifications, or give a low-confidence response. We explore the limits of this capability, showing that probes generalize to unseen datasets, but perform worse when LM outputs are longer and struggle to predict properties that require access to knowledge that the models themselves lack. Encouragingly, performance scales with model size, suggesting applicability to the largest of models
摘要：自回归语言模型通过顺序预测下一个要生成的令牌来输出文本，并采用现代方法（例如Thequiend（COT）（COT））来提示通过扩展生成的令牌数量来提示实现最新推理能力。但是，有时我们是否可以推断该模型在计算早期的表现（例如，弃权回答问题），从而使一代不必要？我们表明，仅输入令牌的内部表示通常可以精确地预测整个输出序列上的下一个令牌，而是最终的行为。我们利用这种能力并学习对内部状态的探针来创建预警（和退出）系统。具体而言，如果探针可以自信地估计LM的行为方式，那么系统将避免完全生成令牌并返回估计的行为。在跨越五个不同任务的27个文本分类数据集中，我们应用此方法来估算COT提示下LM的最终答案，将推理成本降低65％（平均），而精度损失不超过1.4％（最坏的情况）。我们证明了这种方法的潜力，可以预先确定何时将拒绝回答问题，未能遵循输出格式规格或给出低信心的响应。我们探索了此功能的限制，表明探针将其推广到看不见的数据集，但是当LM输出更长并难以预测需要访问模型本身所缺乏的知识的属性时，效果更糟。令人鼓舞的是，具有模型大小的性能尺度，表明适用于最大的型号
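The early-exit rule described in this abstract (return the probe's estimate when it is confident, otherwise fall back to full generation) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `early_exit_predict`, the `threshold` value, and the idea that the probe yields one probability per candidate label are all assumptions made for the example.

```python
from typing import Callable, Sequence, Tuple


def early_exit_predict(
    probe_probs: Sequence[float],
    labels: Sequence[str],
    generate_fn: Callable[[], str],
    threshold: float = 0.9,  # hypothetical confidence cut-off
) -> Tuple[str, bool]:
    """Return (answer, exited_early).

    probe_probs: per-label probabilities from a probe on the model's
                 internal states (assumed to be computed elsewhere).
    generate_fn: zero-argument callable that runs full CoT generation.
    """
    # Index of the label the probe considers most likely.
    best = max(range(len(probe_probs)), key=lambda i: probe_probs[i])
    if probe_probs[best] >= threshold:
        # Confident probe: skip generation and return its estimate.
        return labels[best], True
    # Otherwise pay for the full generation.
    return generate_fn(), False
```

Raising the threshold trades inference savings for accuracy: fewer inputs exit early, but the ones that do are less likely to disagree with what the full generation would have produced.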

Title: Language Models are Few-Shot Graders

Authors: Chenyan Zhao, Mariana Silva, Seth Poulsen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.13337
Pdf URL: https://arxiv.org/pdf/2502.13337
Copy Paste: [[2502.13337]] Language Models are Few-Shot Graders(https://arxiv.org/abs/2502.13337)
Keywords: language model, gpt, llm, prompt, retrieval-augmented generation
Abstract: Evaluating student work is a critical component of effective student learning, and automating this process can significantly reduce the workload on human graders. Automatic Short Answer Grading (ASAG) systems, enabled by advancements in Large Language Models (LLMs), offer a promising solution for assessing and providing instant feedback for open-ended student responses. In this paper, we present an ASAG pipeline leveraging state-of-the-art LLMs. Our new LLM-based ASAG pipeline achieves better performance than existing custom-built models on the same datasets. We also compare the grading performance of three OpenAI models: GPT-4, GPT-4o, and o1-preview. Our results demonstrate that GPT-4o achieves the best balance between accuracy and cost-effectiveness. On the other hand, o1-preview, despite higher accuracy, exhibits a larger variance in error that makes it less practical for classroom use. We investigate the effects of incorporating instructor-graded examples into prompts using no examples, random selection, and Retrieval-Augmented Generation (RAG)-based selection strategies. Our findings indicate that providing graded examples enhances grading accuracy, with RAG-based selection outperforming random selection. Additionally, integrating grading rubrics improves accuracy by offering a structured standard for evaluation.
摘要：为学生工作进行评估是有效学生学习的关键组成部分，自动化其过程可以大大减少人类分级的工作量。由大语言模型（LLMS）的进步启用的自动简短答案分级（ASAG）系统，为评估和为开放式学生提供即时反馈提供了有希望的解决方案。在本文中，我们提出了一条ASAG管道，利用了最先进的LLM。我们的新基于LLM的ASAG管道比同一数据集上的现有定制型号取得更好的性能。我们还比较了三种OpenAI型号的评分性能：GPT-4，GPT-4O和O1-preview。我们的结果表明，GPT-4O在准确性和成本效益之间取得了最佳平衡。另一方面，尽管准确性较高，但O1-preiview表现出更大的错误差异，从而使其在课堂使用方面的实用性降低。我们调查了将教师毕业的示例纳入提示的效果，没有随机选择和基于检索的基于基于的（RAG）的选择策略。我们的发现表明，提供分级示例可以提高评分精度，基于抹布的选择优于随机选择。此外，通过提供评估的结构化标准，整合分级标准可以提高准确性。

Title: Craw4LLM: Efficient Web Crawling for LLM Pretraining

Authors: Shi Yu, Zhiyuan Liu, Chenyan Xiong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.13347
Pdf URL: https://arxiv.org/pdf/2502.13347
Copy Paste: [[2502.13347]] Craw4LLM: Efficient Web Crawling for LLM Pretraining(https://arxiv.org/abs/2502.13347)
Keywords: language model, llm
Abstract: Web crawl is a main source of large language models' (LLMs) pretraining data, but the majority of crawled web pages are discarded in pretraining due to low data quality. This paper presents Crawl4LLM, an efficient web crawling method that explores the web graph based on the preference of LLM pretraining. Specifically, it leverages the influence of a webpage in LLM pretraining as the priority score of the web crawler's scheduler, replacing the standard graph connectivity based priority. Our experiments on a web graph containing 900 million webpages from a commercial search engine's index demonstrate the efficiency of Crawl4LLM in obtaining high-quality pretraining data. With just 21% of URLs crawled, LLMs pretrained on Crawl4LLM data reach the same downstream performance as previous crawls, significantly reducing the crawling waste and alleviating the burdens on websites. Our code is publicly available at this https URL.
摘要：Web Crawl是大型语言模型（LLMS）预处理数据的主要来源，但是由于数据质量较低，大多数爬行的网页都被丢弃了。本文介绍了Crawl4LLM，这是一种有效的Web爬行方法，该方法基于LLM预处理探索Web图。具体而言，它利用LLM预处理中的网页作为Web搜寻器调度程序的优先级分数的影响，以取代基于标准图的优先级。我们在包含来自商业搜索引擎索引的9亿网页的Web图上进行的实验，证明了Crawl4LLM在获取高质量预读取数据方面的效率。仅21％的URL爬行，在爬网4LLM数据上预处理的LLM与以前的爬网的下游性能相同，从而大大减少了爬行废物并减轻了网站上的负担。我们的代码在此HTTPS URL上公开可用。
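The scheduling idea in this abstract, where the crawl frontier is ordered by a pretraining-influence score rather than by graph connectivity, can be sketched with a priority queue. The sketch below is only an illustration under assumptions: `priority_crawl`, the in-memory `out_links` graph, and the `score` callable are hypothetical stand-ins, and the abstract does not say how the influence score is actually computed.

```python
import heapq
from typing import Callable, Dict, Iterable, List


def priority_crawl(
    seeds: Iterable[str],
    out_links: Dict[str, List[str]],  # toy web graph: url -> linked urls
    score: Callable[[str], float],    # assumed influence scorer, higher is better
    budget: int,
) -> List[str]:
    """Crawl up to `budget` pages, always expanding the highest-scored URL."""
    seeds = list(seeds)
    # heapq is a min-heap, so scores are negated to pop the best URL first.
    frontier = [(-score(url), url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    crawled: List[str] = []
    while frontier and len(crawled) < budget:
        _, url = heapq.heappop(frontier)
        crawled.append(url)
        for nxt in out_links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(frontier, (-score(nxt), nxt))
    return crawled
```

Swapping `score` for an in-degree count would recover a connectivity-based scheduler, which makes the two policies easy to compare on the same toy graph.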

Title: Event Segmentation Applications in Large Language Model Enabled Automated Recall Assessments

Authors: Ryan A. Panela (1,2), Alex J. Barnett (2,3), Morgan D. Barense (1,2), Björn Herrmann (1,2) ((1) Rotman Research Institute, Baycrest Academy for Research and Education, (2) Department of Psychology, University of Toronto, (3) Department of Neurology and Neurosurgery, Montreal Neurological Institute and Hospital, McGill University)
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.13349
Pdf URL: https://arxiv.org/pdf/2502.13349
Copy Paste: [[2502.13349]] Event Segmentation Applications in Large Language Model Enabled Automated Recall Assessments(https://arxiv.org/abs/2502.13349)
Keywords: language model, llm, chat
Abstract: Understanding how individuals perceive and recall information in their natural environments is critical to understanding potential failures in perception (e.g., sensory loss) and memory (e.g., dementia). Event segmentation, the process of identifying distinct events within dynamic environments, is central to how we perceive, encode, and recall experiences. This cognitive process not only influences moment-to-moment comprehension but also shapes event-specific memory. Despite the importance of event segmentation and event memory, current research methodologies rely heavily on human judgements for assessing segmentation patterns and recall ability, which are subjective and time-consuming. A few approaches have been introduced to automate event segmentation and recall scoring, but validity with human responses and ease of implementation require further advancements. To address these concerns, we leverage Large Language Models (LLMs) to automate event segmentation and assess recall, employing chat completion and text-embedding models, respectively. We validated these models against human annotations and determined that LLMs can accurately identify event boundaries, and that human event segmentation is more consistent with LLMs than among humans themselves. Using this framework, we advanced an automated approach for recall assessments which revealed that semantic similarity between segmented narrative events and participant recall can estimate recall performance. Our findings demonstrate that LLMs can effectively simulate human segmentation patterns and provide recall evaluations that are a scalable alternative to manual scoring. This research opens novel avenues for studying the intersection between perception, memory, and cognitive impairment using methodologies driven by artificial intelligence.
摘要：了解个人在自然环境中的感知和回忆信息对于理解感知中的潜在失败至关重要（例如，感觉丧失）和记忆（例如痴呆症）。事件细分是在动态环境中识别不同事件的过程，对于我们如何看待，编码和召回经验是至关重要的。这种认知过程不仅影响力矩的理解力，而且影响事件特定的内存。尽管事件细分和事件记忆非常重要，但当前的研究方法很大程度上依赖于人类判断来评估分割模式和召回能力，这些模式和召回能力是主观且耗时的。已经引入了一些方法来自动化事件细分和召回评分，但是人类响应和易于实施的有效性需要进一步的进步。为了解决这些问题，我们利用大型语言模型（LLMS）分别采用聊天完成和文本插入模型来自动化事件细分并评估召回。我们针对人类注释验证了这些模型，并确定LLM可以准确地识别事件边界，并且人类事件分割与LLMS比人类本身更一致。使用此框架，我们提出了一种自动化方法进行召回评估，该方法揭示了分段叙事事件和参与者召回之间的语义相似性可以估计召回性能。我们的发现表明，LLM可以有效地模拟人类分割模式，并提供回忆评估，这些评估是手动评分的可扩展替代方法。这项研究开辟了新的途径，用于研究通过人工智能驱动的方法学研究感知，记忆和认知障碍之间的交集。

Title: Bridging the Editing Gap in LLMs: FineEdit for Precise and Targeted Text Modifications

Authors: Yiming Zeng, Wanhao Yu, Zexin Li, Tao Ren, Yu Ma, Jinghan Cao, Xiyan Chen, Tingting Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.13358
Pdf URL: https://arxiv.org/pdf/2502.13358
Copy Paste: [[2502.13358]] Bridging the Editing Gap in LLMs: FineEdit for Precise and Targeted Text Modifications(https://arxiv.org/abs/2502.13358)
Keywords: language model, gpt, llm, chat
Abstract: Large Language Models (LLMs) have transformed natural language processing, yet they still struggle with direct text editing tasks that demand precise, context-aware modifications. While models like ChatGPT excel in text generation and analysis, their editing abilities often fall short, addressing only superficial issues rather than deeper structural or logical inconsistencies. In this work, we introduce a dual approach to enhance LLMs' editing performance. First, we present InstrEditBench, a high-quality benchmark dataset comprising over 20,000 structured editing tasks spanning Wiki articles, LaTeX documents, code, and database Domain-specific Languages (DSL). InstrEditBench is generated using an innovative automated workflow that accurately identifies and evaluates targeted edits, ensuring that modifications adhere strictly to specified instructions without altering unrelated content. Second, we propose FineEdit, a specialized model trained on this curated benchmark. Experimental results demonstrate that FineEdit achieves significant improvements of around 10% compared with Gemini on direct editing tasks, convincingly validating its effectiveness.
摘要：大型语言模型（LLM）改变了自然语言处理，但他们仍然在直接的文本编辑任务中挣扎，这些任务需要精确的上下文感知的修改。尽管Chatgpt等模型在文本生成和分析中表现出色，但它们的编辑能力通常会缺乏，仅解决了肤浅的问题，而不是更深层的结构或逻辑上的不一致之处。在这项工作中，我们引入了一种双重方法来增强LLMS编辑性能。首先，我们提出了InstereditBench，这是一个高质量的基准数据集，其中包括20,000多个结构化编辑任务，涵盖Wiki文章，乳胶文档，代码和数据库域特异性语言（DSL）。使用创新的自动化工作流程生成了InsteredItbench，该工作流程准确地识别和评估了目标编辑，从而确保修改严格遵循而无需更改无关内容的指定指令。其次，我们提出了Fineedit，这是一种专门的模型，该模型在此精选的基准测试中训练。实验结果表明，与双子座在直接编辑任务上相比，FineDit在{10 \％}方面取得了重大改进，令人信服地验证了其有效性。

Title: RGAR: Recurrence Generation-augmented Retrieval for Factual-aware Medical Question Answering

Authors: Sichu Liang, Linhai Zhang, Hongyu Zhu, Wenwen Wang, Yulan He, Deyu Zhou
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.13361
Pdf URL: https://arxiv.org/pdf/2502.13361
Copy Paste: [[2502.13361]] RGAR: Recurrence Generation-augmented Retrieval for Factual-aware Medical Question Answering(https://arxiv.org/abs/2502.13361)
Keywords: language model, gpt, llm, retrieval-augmented generation
Abstract: Medical question answering requires extensive access to specialized conceptual knowledge. The current paradigm, Retrieval-Augmented Generation (RAG), acquires expert medical knowledge through large-scale corpus retrieval and uses this knowledge to guide a general-purpose large language model (LLM) for generating answers. However, existing retrieval approaches often overlook the importance of factual knowledge, which limits the relevance of retrieved conceptual knowledge and restricts its applicability in real-world scenarios, such as clinical decision-making based on Electronic Health Records (EHRs). This paper introduces RGAR, a recurrence generation-augmented retrieval framework that retrieves both relevant factual and conceptual knowledge from dual sources (i.e., EHRs and the corpus), allowing them to interact and refine each other. Through extensive evaluation across three factual-aware medical question answering benchmarks, RGAR establishes a new state-of-the-art performance among medical RAG systems. Notably, the Llama-3.1-8B-Instruct model with RGAR surpasses the considerably larger, RAG-enhanced GPT-3.5. Our findings demonstrate the benefit of extracting factual knowledge for retrieval, which consistently yields improved generation quality.
摘要：医疗问题回答需要广泛访问专业的概念知识。当前的范式，检索型发电（RAG）通过大规模的语料库检索获得了专业知识知识，并使用这些知识来指导通用大型语言模型（LLM）来生成答案。但是，现有的检索方法通常忽略事实知识的重要性，这限制了检索概念知识的相关性并限制了其在现实情况下的适用性，例如基于电子健康记录（EHRS）的临床决策。本文介绍了RGAR，这是一个复发生成的检索框架，从双重来源（即EHRS和语料库）从相关的事实和概念知识中检索了相关的事实和概念知识，允许它们相互作用并相互作用。通过对三个事实感知的医学问题的广泛评估，RGAR在医学抹布系统中建立了新的最新性能。值得注意的是，带有RGAR的Llama-3.1-8B教学模型超过了较大的RAG增强GPT-3.5。我们的发现证明了提取事实知识以获取检索的好处，从而始终如一地提高了发电质量。

Title: Reducing Hallucinations in Language Model-based SPARQL Query Generation Using Post-Generation Memory Retrieval

Authors: Aditya Sharma, Luis Lara, Amal Zouaq, Christopher J. Pal
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.13369
Pdf URL: https://arxiv.org/pdf/2502.13369
Copy Paste: [[2502.13369]] Reducing Hallucinations in Language Model-based SPARQL Query Generation Using Post-Generation Memory Retrieval(https://arxiv.org/abs/2502.13369)
Keywords: language model, llm, hallucination
Abstract: The ability to generate SPARQL queries from natural language questions is crucial for ensuring efficient and accurate retrieval of structured data from knowledge graphs (KG). While large language models (LLMs) have been widely adopted for SPARQL query generation, they are often susceptible to hallucinations and out-of-distribution errors when producing KG elements like Uniform Resource Identifiers (URIs) based on internal parametric knowledge. This often results in content that appears plausible but is factually incorrect, posing significant challenges for their use in real-world information retrieval (IR) applications. This has led to increased research aimed at detecting and mitigating such errors. In this paper, we introduce PGMR (Post-Generation Memory Retrieval), a modular framework that incorporates a non-parametric memory module to retrieve KG elements and enhance LLM-based SPARQL query generation. Our experimental results indicate that PGMR consistently delivers strong performance across diverse datasets, data distributions, and LLMs. Notably, PGMR significantly mitigates URI hallucinations, nearly eliminating the problem in several scenarios.
摘要：从自然语言问题中生成SPARQL查询的能力对于确保从知识图（KG）中有效而准确地检索结构化数据至关重要。尽管大型语言模型（LLM）已被广泛用于SPARQL查询产生，但在产生基于内部参数知识（统一资源标识符（URI））时，它们通常容易受到幻觉和分布错误的影响。这通常会导致内容看起来很合理，但实际上是不正确的，这对它们在现实世界信息检索（IR）应用中的使用构成了重大挑战。这导致了旨在检测和减轻此类错误的研究增加。在本文中，我们介绍了PGMR（后期内存检索），该模块化框架结合了非参数存储模块以检索KG元素并增强基于LLM的SPARQL查询生成。我们的实验结果表明，PGMR始终在不同的数据集，数据分布和LLMS中提供强大的性能。值得注意的是，PGMR大大减轻了URI幻觉，几乎可以消除了几种情况。

Title: Task-agnostic Prompt Compression with Context-aware Sentence Embedding and Reward-guided Task Descriptor

Authors: Barys Liskavets, Shuvendu Roy, Maxim Ushakov, Mark Klibanov, Ali Etemad, Shane Luke
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.13374
Pdf URL: https://arxiv.org/pdf/2502.13374
Copy Paste: [[2502.13374]] Task-agnostic Prompt Compression with Context-aware Sentence Embedding and Reward-guided Task Descriptor(https://arxiv.org/abs/2502.13374)
Keywords: language model, llm, prompt
Abstract: The rise of Large Language Models (LLMs) has led to significant interest in prompt compression, a technique aimed at reducing the length of input prompts while preserving critical information. However, the prominent approaches in prompt compression often require explicit questions or handcrafted templates for compression, limiting their generalizability. We propose Task-agnostic Prompt Compression (TPC), a novel framework that generalizes compression across tasks and domains without requiring input questions or templates. TPC generates a context-relevant task description using a task descriptor trained on a curated dataset of context and query pairs, and fine-tuned via reinforcement learning with a reward function designed to capture the most relevant information. The task descriptor is then utilized to compute the relevance of each sentence in the prompt to generate the compressed prompt. We introduce 3 model sizes (Base, Large, and Huge), where the largest model outperforms the existing state-of-the-art methods on LongBench and ZeroSCROLLS benchmarks, and our smallest model performs comparably to the existing solutions while being considerably smaller.
摘要：大型语言模型（LLM）的兴起引起了人们对迅速压缩的浓厚兴趣，该技术旨在减少输入提示的长度，同时保留关键信息。但是，迅速压缩中的突出方法通常需要明确的问题或手工制作的模板进行压缩，从而限制了它们的普遍性。我们提出了任务不合时宜的提示压缩（TPC），这是一个新颖的框架，可在不需要输入问题或模板的情况下跨越任务和域的压缩。 TPC使用在策划的上下文和查询对数据集中训练的任务描述符生成了相关的任务描述，并通过奖励功能通过强化学习进行微调，旨在捕获最相关的信息。然后使用任务描述符来计算在提示中生成压缩提示的每个句子的相关性。我们介绍了3种模型尺寸（基础，大和巨大），其中最大的型号的表现优于Longbench和ZerosCrolls基准的现有最新方法，而我们最小的型号的性能与现有的解决方案相当，同时又小得多。
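The final step described in this abstract, scoring each sentence for relevance and building the compressed prompt from the most relevant ones, can be illustrated with a small selection routine. This is a sketch under assumptions, not the paper's method: the relevance scores are taken as given (in the paper they come from the learned task descriptor), and `compress_prompt` with its fixed `keep` budget is a hypothetical simplification.

```python
from typing import Sequence


def compress_prompt(
    sentences: Sequence[str],
    relevance: Sequence[float],  # assumed precomputed, one score per sentence
    keep: int,                   # hypothetical budget: number of sentences to retain
) -> str:
    """Keep the `keep` most relevant sentences, preserving their original order."""
    if len(sentences) != len(relevance):
        raise ValueError("need exactly one relevance score per sentence")
    # Indices of the top-scoring sentences.
    ranked = sorted(range(len(sentences)), key=lambda i: relevance[i], reverse=True)
    chosen = sorted(ranked[:keep])  # restore document order
    return " ".join(sentences[i] for i in chosen)
```

Restoring the original order after ranking keeps the compressed prompt readable, since sentences that survive still appear in the sequence the author wrote them.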

Title: MM-Verify: Enhancing Multimodal Reasoning with Chain-of-Thought Verification

Authors: Linzhuang Sun, Hao Liang, Jingxuan Wei, Bihui Yu, Tianpeng Li, Fan Yang, Zenan Zhou, Wentao Zhang
Subjects: cs.CL, cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2502.13383
Pdf URL: https://arxiv.org/pdf/2502.13383
Copy Paste: [[2502.13383]] MM-Verify: Enhancing Multimodal Reasoning with Chain-of-Thought Verification(https://arxiv.org/abs/2502.13383)
Keywords: language model, gpt, llm, chain-of-thought
Abstract: According to the Test-Time Scaling, the integration of External Slow-Thinking with the Verify mechanism has been demonstrated to enhance multi-round reasoning in large language models (LLMs). However, in the multimodal (MM) domain, there is still a lack of a strong MM-Verifier. In this paper, we introduce MM-Verifier and MM-Reasoner to enhance multimodal reasoning through longer inference and more robust verification. First, we propose a two-step MM verification data synthesis method, which combines a simulation-based tree search with verification and uses rejection sampling to generate high-quality Chain-of-Thought (COT) data. This data is then used to fine-tune the verification model, MM-Verifier. Additionally, we present a more efficient method for synthesizing MMCOT data, bridging the gap between text-based and multimodal reasoning. The synthesized data is used to fine-tune MM-Reasoner. Our MM-Verifier outperforms all larger models on the MathCheck, MathVista, and MathVerse benchmarks. Moreover, MM-Reasoner demonstrates strong effectiveness and scalability, with performance improving as data size increases. Finally, our approach achieves strong performance when combining MM-Reasoner and MM-Verifier, reaching an accuracy of 65.3 on MathVista, surpassing GPT-4o (63.8) with 12 rollouts.
摘要：根据测试时间缩放，已经证明了外部缓慢思考与验证机制的集成可以增强大语言模型（LLMS）中的多轮推理。但是，在多模式（mm）域中，仍然缺乏强大的MM佛教剂。在本文中，我们介绍了MM佛教和MM-Reasoner，以通过更长的推理和更强大的验证来增强多模式推理。首先，我们提出了一个两步的MM验证数据合成方法，该方法将基于仿真的树搜索与验证结合在一起，并使用拒绝采样来生成高质量的思想链（COT）数据。然后，该数据用于微调验证模型MM佛教符。此外，我们提出了一种更有效的方法来合成MMCOT数据，从而弥合了基于文本和多模式推理之间的差距。合成的数据用于微调MM-Reasoner。我们的MM佛罗里达人在Mathcheck，Mathvista和Mathverse Benchmarks上的表现优于所有较大模型。此外，MM-Reasoner表现出强大的有效性和可扩展性，随着数据规模的增加，性能的提高。最后，我们的方法在组合MM-Reasoner和MM佛教徒时取得了强大的性能，在Mathvista上的准确度达到65.3，超过GPT-4O（63.8），并进行了12次推广。

Title: Prompting a Weighting Mechanism into LLM-as-a-Judge in Two-Step: A Case Study

Authors: Wenwen Xie, Gray Gwizdz, Dongji Feng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.13396
Pdf URL: https://arxiv.org/pdf/2502.13396
Copy Paste: [[2502.13396]] Prompting a Weighting Mechanism into LLM-as-a-Judge in Two-Step: A Case Study(https://arxiv.org/abs/2502.13396)
Keywords: language model, llm, prompt
Abstract: While Large Language Models (LLMs) have emerged as promising tools for evaluating Natural Language Generation (NLG) tasks, their effectiveness is limited by their inability to appropriately weigh the importance of different topics, often overemphasizing minor details while undervaluing critical information, leading to misleading assessments. Our work proposes an efficient prompt design mechanism to address this specific limitation and provides a case study. Through strategic prompt engineering that incorporates explicit importance weighting mechanisms, we enhance the ability of LLM-as-a-Judge to prioritize relevant information effectively, as demonstrated by an average improvement of 6% in the Human Alignment Rate (HAR) metric.
摘要：尽管大型语言模型（LLM）已成为评估自然语言生成（NLG）任务的有前途的工具，但它们的有效性受到无法适当权衡不同主题的重要性的限制，通常会过分强调次要细节，同时低估了关键信息，从而导致了误解。评估。我们的工作提出了一种有效的及时设计机制来解决此特定限制并提供案例研究。通过结合了重要性加权机制的战略及时工程，我们使用LLM-AS-A-Gudge能力来增强相关信息的优先级，如人类对齐率（HAR）指标平均6％的平均提高所证明。

Title: Detecting LLM Fact-conflicting Hallucinations Enhanced by Temporal-logic-based Reasoning

Authors: Ningke Li, Yahui Song, Kailong Wang, Yuekang Li, Ling Shi, Yi Liu, Haoyu Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.13416
Pdf URL: https://arxiv.org/pdf/2502.13416
Copy Paste: [[2502.13416]] Detecting LLM Fact-conflicting Hallucinations Enhanced by Temporal-logic-based Reasoning(https://arxiv.org/abs/2502.13416)
Keywords: language model, llm, hallucination, prompt
Abstract: Large language models (LLMs) face the challenge of hallucinations -- outputs that seem coherent but are actually incorrect. A particularly damaging type is fact-conflicting hallucination (FCH), where generated content contradicts established facts. Addressing FCH presents three main challenges: 1) Automatically constructing and maintaining large-scale benchmark datasets is difficult and resource-intensive; 2) Generating complex and efficient test cases that the LLM has not been trained on -- especially those involving intricate temporal features -- is challenging, yet crucial for eliciting hallucinations; and 3) Validating the reasoning behind LLM outputs is inherently difficult, particularly with complex logical relationships, as it requires transparency in the model's decision-making process. This paper presents Drowzee, an innovative end-to-end metamorphic testing framework that utilizes temporal logic to identify fact-conflicting hallucinations (FCH) in large language models (LLMs). Drowzee builds a comprehensive factual knowledge base by crawling sources like Wikipedia and uses automated temporal-logic reasoning to convert this knowledge into a large, extensible set of test cases with ground truth answers. LLMs are tested using these cases through template-based prompts, which require them to generate both answers and reasoning steps. To validate the reasoning, we propose two semantic-aware oracles that compare the semantic structure of LLM outputs to the ground truths. Across nine LLMs in nine different knowledge domains, experimental results show that Drowzee effectively identifies rates of non-temporal-related hallucinations ranging from 24.7% to 59.8%, and rates of temporal-related hallucinations ranging from 16.7% to 39.2%.
摘要：大型语言模型（LLMS）面临着幻觉的挑战 - 看起来似乎连贯但实际上是不正确的输出。一种特别有害的类型是事实冲突的幻觉（FCH），其中产生的内容与已建立的事实相矛盾。解决FCH提出了三个主要挑战：1）自动构建和维护大规模基准数据集很困难且资源密集； 2）生成尚未对LLM进行培训的复杂和高效的测试用例，尤其是那些涉及复杂时间特征的培训 - 具有挑战性，但对于引起幻觉而言至关重要； 3）验证LLM输出背后的推理本质上是困难的，尤其是在复杂的逻辑关系中，因为它需要模型的决策过程中的透明度。本文介绍了Drowzee，这是一种创新的端到端变质测试框架，该框架利用时间逻辑来识别大语言模型（LLMS）中的事实冲突幻觉（FCH）。 Drowzee通过爬行诸如Wikipedia之类的来源来建立一个全面的事实知识基础，并使用自动化的时间逻辑推理将这些知识转换为带有地面真相答案的大型，可扩展的测试用例。使用这些情况通过基于模板的提示对LLM进行测试，这要求它们同时生成答案和推理步骤。为了验证推理，我们提出了两个语义感知的甲壳，将LLM输出的语义结构与地面真相进行了比较。在九个不同知识领域的9个LLM中，实验结果表明，Drowzee有效地识别了与周期性相关的幻觉的发生率，范围为24.7％至59.8％，与时间相关的幻觉范围从16.7％到39.2％。

Title: RLTHF: Targeted Human Feedback for LLM Alignment

Authors: Yifei Xu, Tusher Chakraborty, Emre Kıcıman, Bibek Aryal, Eduardo Rodrigues, Srinagesh Sharma, Roberto Estevao, Maria Angels de Luis Balaguer, Jessica Wolk, Rafael Padilha, Leonardo Nunes, Shobana Balakrishnan, Songwu Lu, Ranveer Chandra
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.13417
Pdf URL: https://arxiv.org/pdf/2502.13417
Copy Paste: [[2502.13417]] RLTHF: Targeted Human Feedback for LLM Alignment(https://arxiv.org/abs/2502.13417)
Keywords: language model, llm
Abstract: Fine-tuning large language models (LLMs) to align with user preferences is challenging due to the high cost of quality human annotations in Reinforcement Learning from Human Feedback (RLHF) and the generalizability limitations of AI Feedback. To address these challenges, we propose RLTHF, a human-AI hybrid framework that combines LLM-based initial alignment with selective human annotations to achieve full-human annotation alignment with minimal effort. RLTHF identifies hard-to-annotate samples mislabeled by LLMs using a reward model's reward distribution and iteratively enhances alignment by integrating strategic human corrections while leveraging LLM's correctly labeled samples. Evaluations on HH-RLHF and TL;DR datasets show that RLTHF reaches full-human annotation-level alignment with only 6-7% of the human annotation effort. Furthermore, models trained on RLTHF's curated datasets for downstream tasks outperform those trained on fully human-annotated datasets, underscoring the effectiveness of RLTHF's strategic data curation.
摘要：通过对用户偏好保持一致的微调大语言模型（LLMS）由于从人类反馈（RLHF）学习的高质量注释和AI反馈的普遍性限制而具有挑战性。为了应对这些挑战，我们提出了RLTHF，RLTHF是一种人类混合框架，将基于LLM的初始对齐与选择性的人类注释相结合，以实现全人类注释的一致性和最小的努力。 RLTHF使用奖励模型的奖励分布标识了LLMS错误标记的难以宣布的样品，并通过整合战略性人类校正来迭代地增强对齐方式，同时利用LLM正确标记的样品。对HH-RLHF和TL; DR数据集的评估表明，RLTHF仅以人类注释工作的6-7％达到全人类注释级别对齐。此外，在RLTHF的策划数据集中训练的模型以优于在完全被人类注销的数据集中训练的那些训练的模型，从而强调了RLTHF战略数据策略的有效性。

Title: TabSD: Large Free-Form Table Question Answering with SQL-Based Table Decomposition

Authors: Yuxiang Wang, Junhao Gan, Jianzhong Qi
Subjects: cs.CL, cs.AI, cs.DB
Abstract URL: https://arxiv.org/abs/2502.13422
Pdf URL: https://arxiv.org/pdf/2502.13422
Copy Paste: [[2502.13422]] TabSD: Large Free-Form Table Question Answering with SQL-Based Table Decomposition(https://arxiv.org/abs/2502.13422)
Keywords: language model, llm
Abstract: Question answering on free-form tables (TableQA) is challenging due to the absence of predefined schemas and the presence of noise in large tables. While Large Language Models (LLMs) have shown promise in TableQA, they struggle with large free-form tables and noise sensitivity. To address these challenges, we propose TabSD, a SQL-based decomposition model that enhances LLMs' ability to process large free-form tables. TabSD generates SQL queries to guide the table decomposition, removes noise, and processes sub-tables for better answer generation. Additionally, SQL Verifier refines SQL outputs to enhance decomposition accuracy. We introduce two TableQA datasets with large free-form tables, SLQA and SEQA, which consist solely of large free-form tables and will be publicly available. Experimental results on four benchmark datasets demonstrate that TabSD outperforms the best existing baseline models by 23.07%, 2.84%, 23.24% and 9.32% in accuracy, respectively, highlighting its effectiveness in handling large and noisy free-form tables.
摘要：由于缺乏预定义的模式和大表格中的噪声，因此对自由形式表（TableQa）的回答（TableQA）是具有挑战性的。尽管大型语言模型（LLM）在TableQA中表现出了希望，但它们在大型自由桌子和噪音敏感性方面挣扎。为了应对这些挑战，我们提出了TABD，这是一种基于SQL的分解模型，可增强LLMS处理大型自由式表的能力。 TABSD生成SQL查询，以指导表分解，删除噪声和处理子桌，以更好地回答生成。此外，SQL验证者会优化SQL输出以提高分解精度。我们介绍了两个带有大型自由式表的TableQA数据集，SLQA和SEQA，它们仅由大型自由桌组成，将公开使用。四个基准数据集的实验结果表明，TABSD的表现分别超过了最佳的基线模型，准确性分别高于23.07％，2.84％，23.24％和9.32％的精度，突出了其在处理大型和嘈杂的自由形式中的有效性。

Title: MCTS-KBQA: Monte Carlo Tree Search for Knowledge Base Question Answering

Authors: Guanming Xiong, Haochen Li, Wen Zhao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.13428
Pdf URL: https://arxiv.org/pdf/2502.13428
Copy Paste: [[2502.13428]] MCTS-KBQA: Monte Carlo Tree Search for Knowledge Base Question Answering(https://arxiv.org/abs/2502.13428)
Keywords: language model, llm, prompt, agent
Abstract: This study explores how to enhance the reasoning capabilities of large language models (LLMs) in knowledge base question answering (KBQA) by leveraging Monte Carlo Tree Search (MCTS). Semantic parsing-based KBQA methods are particularly challenging as these approaches require locating elements from knowledge bases and generating logical forms, demanding not only extensive annotated data but also strong reasoning capabilities. Although recent approaches leveraging LLMs as agents have demonstrated considerable potential, these studies are inherently constrained by their linear decision-making processes. To address this limitation, we propose an MCTS-based framework that enhances LLMs' reasoning capabilities through tree search methodology. We design a step-wise reward mechanism that requires only direct prompting of open-source instruction LLMs without additional fine-tuning. Experimental results demonstrate that our approach significantly outperforms linear decision-making methods, particularly in low-resource scenarios. Additionally, we contribute new data resources to the KBQA community by annotating intermediate reasoning processes for existing question-SPARQL datasets using distant supervision. Experimental results on the extended dataset demonstrate that our method achieves comparable performance to fully supervised models while using significantly less training data.
摘要：这项研究探讨了如何通过利用蒙特卡洛树搜索（MCT）来增强知识基础问题答案（KBQA）中大语言模型（LLM）的推理能力。基于语义解析的KBQA方法特别具有挑战性，因为这些方法需要从知识库中找到要素并产生逻辑形式，不仅要求大量的注释数据，而且要求强大的推理能力。尽管最近利用LLM作为代理的方法表现出了很大的潜力，但这些研究本质上受其线性决策过程的限制。为了解决此限制，我们提出了一个基于MCTS的框架，该框架通过树搜索方法来增强LLMS的推理功能。我们设计了经过精心设计的逐步奖励机制，该机制只需要直接提示开源指令LLM，而无需进行其他微调。实验结果表明，我们的方法显着优于线性决策方法，尤其是在低资源场景中。此外，我们通过使用遥远的监督为现有问题 - sparql数据集的中间推理过程注释中间推理过程，为KBQA社区贡献新的数据资源。扩展数据集的实验结果表明，我们的方法可以实现与完全监督模型相当的性能，同时使用较少的训练数据。

Title: The Self-Improvement Paradox: Can Language Models Bootstrap Reasoning Capabilities without External Scaffolding?

Authors: Yutao Sun, Mingshuai Chen, Tiancheng Zhao, Ruochen Xu, Zilun Zhang, Jianwei Yin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.13441
Pdf URL: https://arxiv.org/pdf/2502.13441
Copy Paste: [[2502.13441]] The Self-Improvement Paradox: Can Language Models Bootstrap Reasoning Capabilities without External Scaffolding?(https://arxiv.org/abs/2502.13441)
Keywords: language model, llm, prompt
Abstract: Self-improving large language models (LLMs) -- i.e., to improve the performance of an LLM by fine-tuning it with synthetic data generated by itself -- is a promising way to advance the capabilities of LLMs while avoiding extensive supervision. Existing approaches to self-improvement often rely on external supervision signals in the form of seed data and/or assistance from third-party models. This paper presents Crescent -- a simple yet effective framework for generating high-quality synthetic question-answer data in a fully autonomous manner. Crescent first elicits the LLM to generate raw questions via a bait prompt, then diversifies these questions leveraging a rejection sampling-based self-deduplication, and finally feeds the questions to the LLM and collects the corresponding answers by means of majority voting. We show that Crescent sheds light on the potential of true self-improvement with zero external supervision signals for math reasoning; in particular, Crescent-generated question-answer pairs suffice to (i) improve the reasoning capabilities of an LLM while preserving its general performance (especially in the 0-shot setting); and (ii) distil LLM knowledge to weaker models more effectively than existing methods based on seed-dataset augmentation.
摘要：自我提出的大语言模型（LLMS） - 即，通过用自身生成的合成数据对LLM进行微调来提高LLM的性能 - 是提高LLMS同时避免进行广泛监督的能力的一种有希望的方法。现有的自我完善方法通常以种子数据和/或第三方模型的帮助形式依赖外部监督信号。本文介绍了新月 - 一个简单而有效的框架，用于以完全自主的方式生成高质量的综合问题解答数据。 Crescent首先引起LLM通过诱饵提示引起原始问题，然后多样化这些问题，利用基于拒绝抽样的自我删除，并最终将问题提供给LLM，并通过多数投票来收集相应的答案。我们表明，新月形阐明了真正的自我完善的潜力，而零外部监督信号用于数学推理。特别是，新月生成的问题解答足以（i）提高LLM的推理能力，同时保留其一般性能（尤其是在0-Shot设置中）；（ii）比基于种子数据库增强的现有方法更有效地了解弱模型的知识。
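The last stage of the pipeline in this abstract, collecting an answer for each generated question by majority voting over the model's own samples, can be sketched as below. This is an illustrative outline only: `majority_vote_answer`, the `sample_answer` callable standing in for one LLM call, and the default of five samples are assumptions, and the bait-prompt and deduplication stages are not modeled.

```python
from collections import Counter
from typing import Callable, Tuple


def majority_vote_answer(
    question: str,
    sample_answer: Callable[[str], str],  # hypothetical: one sampled LLM answer
    n_samples: int = 5,                   # assumed number of samples per question
) -> Tuple[str, float]:
    """Sample several answers and return (most common answer, agreement rate)."""
    answers = [sample_answer(question) for _ in range(n_samples)]
    # most_common(1) yields the single most frequent answer with its count.
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / n_samples
```

The agreement rate is returned alongside the answer so that a caller could, for instance, discard question-answer pairs on which the samples disagree too much; that filtering step is a possible design choice, not something the abstract states.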

Title: TreeCut: A Synthetic Unanswerable Math Word Problem Dataset for LLM Hallucination Evaluation

Authors: Jialin Ouyang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.13442
Pdf URL: https://arxiv.org/pdf/2502.13442
Copy Paste: [[2502.13442]] TreeCut: A Synthetic Unanswerable Math Word Problem Dataset for LLM Hallucination Evaluation(https://arxiv.org/abs/2502.13442)
Keywords: language model, gpt, llm, hallucination
Abstract: Large language models (LLMs) now achieve near-human performance on standard math word problem benchmarks (e.g., GSM8K), yet their true reasoning ability remains disputed. A key concern is that models often produce confident, yet unfounded, answers to unanswerable problems. We introduce TreeCut, a synthetic dataset that systematically generates infinite unanswerable math word problems and their answerable counterparts, by representing each question as a tree and removing chosen necessary conditions. Experiments show that TreeCut effectively induces hallucinations in large language models, including GPT-4o and o3-mini, with rates of 61% and 42% in their respective worst-case scenarios. Further analysis highlights that deeper or more complex trees, composite item names, and removing a necessary condition near the middle of a path all increase the likelihood of hallucinations, underscoring the persistent challenges LLMs face in identifying unanswerable math problems.
摘要：现在，大型语言模型（LLMS）在标准数学单词问题基准（例如GSM8K）上实现了近乎人类的性能，但它们的真正推理能力仍然存在争议。一个关键问题是，模型通常会产生对无法回答问题的自信，但毫无根据的答案。我们介绍了Treecut，这是一个合成数据集，它通过将每个问题表示为树并删除所选的必要条件，从而系统地生成无限的无法接触的数学单词问题及其可回答的对应物。实验表明，在包括GPT-4O和O3-Mini在内的大语言模型中有效诱导了幻觉，其比率为61％和42％。进一步的分析强调，在路径中间附近的更深或更复杂的树木，复合项目名称以及消除必要的条件都增加了幻觉的可能性，从而强调了LLMS在识别无法接近的数学问题方面面临的持续挑战。

Title: ThinkGuard: Deliberative Slow Thinking Leads to Cautious Guardrails

Authors: Xiaofei Wen, Wenxuan Zhou, Wenjie Jacky Mo, Muhao Chen
Subjects: cs.CL, cs.AI, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2502.13458
Pdf URL: https://arxiv.org/pdf/2502.13458
Copy Paste: [[2502.13458]] ThinkGuard: Deliberative Slow Thinking Leads to Cautious Guardrails(https://arxiv.org/abs/2502.13458)
Keywords: language model, llm
Abstract: Ensuring the safety of large language models (LLMs) is critical as they are deployed in real-world applications. Existing guardrails rely on rule-based filtering or single-pass classification, limiting their ability to handle nuanced safety violations. To address this, we propose ThinkGuard, a critique-augmented guardrail model that distills knowledge from high-capacity LLMs by generating structured critiques alongside safety labels. Fine-tuning on critique-augmented data captures a deliberative thinking ability that drastically enhances the guardrail's cautiousness and interpretability. Evaluated on multiple safety benchmarks, ThinkGuard achieves the highest average F1 and AUPRC, outperforming all baselines. Compared to LLaMA Guard 3, ThinkGuard improves accuracy by 16.1% and macro F1 by 27.0%. Moreover, it surpasses label-only fine-tuned models, confirming that structured critiques enhance both classification precision and nuanced safety reasoning while maintaining computational efficiency.

Title: Estimating Commonsense Plausibility through Semantic Shifts

Authors: Wanqing Cui, Keping Bi, Jiafeng Guo, Xueqi Cheng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.13464
Pdf URL: https://arxiv.org/pdf/2502.13464
Copy Paste: [[2502.13464]] Estimating Commonsense Plausibility through Semantic Shifts(https://arxiv.org/abs/2502.13464)
Keywords: language model, llm
Abstract: Commonsense plausibility estimation is critical for evaluating language models (LMs), yet existing generative approaches, which rely on likelihoods or verbalized judgments, struggle with fine-grained discrimination. In this paper, we propose ComPaSS, a novel discriminative framework that quantifies commonsense plausibility by measuring semantic shifts when augmenting sentences with commonsense-related information. Plausible augmentations induce minimal shifts in semantics, while implausible ones result in substantial deviations. Evaluations on two types of fine-grained commonsense plausibility estimation tasks across different backbones, including LLMs and vision-language models (VLMs), show that ComPaSS consistently outperforms baselines. This demonstrates the advantage of discriminative approaches over generative methods in fine-grained commonsense plausibility evaluation. Experiments also show that (1) VLMs, when integrated with ComPaSS, outperform LMs on vision-grounded commonsense tasks, and (2) contrastive pre-training sharpens backbone models' ability to capture semantic nuances, thereby further enhancing ComPaSS.

Title: Towards Lightweight, Adaptive and Attribute-Aware Multi-Aspect Controllable Text Generation with Large Language Models

Authors: Chenyu Zhu, Yefeng Liu, Chenyang Lyu, Xue Yang, Guanhua Chen, Longyue Wang, Weihua Luo, Kaifu Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.13474
Pdf URL: https://arxiv.org/pdf/2502.13474
Copy Paste: [[2502.13474]] Towards Lightweight, Adaptive and Attribute-Aware Multi-Aspect Controllable Text Generation with Large Language Models(https://arxiv.org/abs/2502.13474)
Keywords: language model
Abstract: Multi-aspect controllable text generation aims to control text generation in attributes from multiple aspects, making it a complex but powerful task in natural language processing. Supervised fine-tuning methods are often employed for this task due to their simplicity and effectiveness. However, they still have some limitations: low rank adaptation (LoRA) only fine-tunes a few parameters and has suboptimal control effects, while full fine-tuning (FFT) requires significant computational resources and is susceptible to overfitting, particularly when data is limited. Moreover, existing works typically train multi-aspect controllable text generation models using only single-aspect annotated data, which results in discrepancies in data distribution; at the same time, accurately generating text with specific attributes is a challenge that requires strong attribute-aware capabilities. To address these limitations, we propose a lightweight, adaptive and attribute-aware framework for multi-aspect controllable text generation. Our framework can dynamically adjust model parameters according to different aspects of data to achieve controllable text generation, aiming to optimize performance across multiple aspects. Experimental results show that our framework outperforms other strong baselines, achieves state-of-the-art performance, adapts well to data discrepancies, and is more accurate in attribute perception.

Title: LLM should think and action as a human

Authors: Haun Leung, ZiNan Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.13475
Pdf URL: https://arxiv.org/pdf/2502.13475
Copy Paste: [[2502.13475]] LLM should think and action as a human(https://arxiv.org/abs/2502.13475)
Keywords: language model, llm, prompt, chat
Abstract: Training large language models to serve as chat assistants has become popular lately, but conversations between a user and a chat assistant often involve prompts that require multiple turns. Multi-turn conversations raise several issues: the chat assistant's responses are prone to errors and may fail to help users achieve their goals; it is difficult for the chat assistant to respond to the same command or request with different processes based on actual needs; and the chat assistant needs to use tools, but the current approach is neither elegant nor efficient, and the number of tool calls that can be supported is limited. The main reason for these issues is that large language models lack human-like thinking ability, reasoning and planning ability, and the ability to execute plans. To solve these issues, we propose a thinking method based on a built-in chain of thought: in a multi-turn conversation, for each user prompt, the large language model thinks based on elements such as chat history, thinking context, action calls, memory, and knowledge, reasons and plans in detail, and acts according to the plan. We also explore how the large language model can enhance its thinking ability through this method: we collect training datasets according to the thinking method and fine-tune the large language model through supervised learning; we then train a consistency reward model and use it as the reward function to fine-tune the large language model with reinforcement learning, so that the reinforced model produces outputs that follow this way of thinking. Our experimental results show that the reasoning and planning abilities of the large language model are enhanced, and that the issues in multi-turn conversations are solved.

Title: Transferring Textual Preferences to Vision-Language Understanding through Model Merging

Authors: Chen-An Li, Tzu-Han Lin, Yun-Nung Chen, Hung-yi Lee
Subjects: cs.CL, cs.AI, cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2502.13487
Pdf URL: https://arxiv.org/pdf/2502.13487
Copy Paste: [[2502.13487]] Transferring Textual Preferences to Vision-Language Understanding through Model Merging(https://arxiv.org/abs/2502.13487)
Keywords: language model
Abstract: Large vision-language models (LVLMs) perform outstandingly across various multimodal tasks. However, their ability to evaluate generated content remains limited, and training vision-language reward models (VLRMs) with preference data is computationally expensive. This paper explores a training-free alternative by merging text-based reward models (RMs) with LVLMs to create VLRMs. Our approach shows that integrating these models leads to improved performance over LVLMs' scoring and text-based RMs, offering an efficient method for incorporating textual preferences into LVLMs.

Title: What are Models Thinking about? Understanding Large Language Model Hallucinations "Psychology" through Model Inner State Analysis

Authors: Peiran Wang, Yang Liu, Yunfei Lu, Jue Hong, Ye Wu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.13490
Pdf URL: https://arxiv.org/pdf/2502.13490
Copy Paste: [[2502.13490]] What are Models Thinking about? Understanding Large Language Model Hallucinations "Psychology" through Model Inner State Analysis(https://arxiv.org/abs/2502.13490)
Keywords: language model, llm, hallucination, prompt
Abstract: Large language model (LLM) systems suffer from the models' unstable ability to generate valid and factual content, which results in hallucinations. Current hallucination detection methods rely heavily on out-of-model information sources, such as RAG, to assist detection, and thus add substantial latency. Recently, the internal states of LLM inference have been widely used in numerous research works, such as prompt injection detection. Considering the interpretability of LLM internal states and the fact that they do not require external information sources, we introduce such states into LLM hallucination detection. In this paper, we systematically analyze the revealing features of different internal states during the forward pass of inference and comprehensively evaluate their ability in hallucination detection. Specifically, we divide the forward process of a large language model into three stages (understanding, query, and generation) and extract the internal state from each stage. By analyzing these states, we provide a deep understanding of why hallucinated content is generated and what happens in the internal state of the models. Then, we introduce these internal states into hallucination detection and conduct comprehensive experiments to discuss the advantages and limitations.

Title: Towards Geo-Culturally Grounded LLM Generations

Authors: Piyawat Lertvittayakumjorn, David Kinney, Vinodkumar Prabhakaran, Donald Martin, Sunipa Dev
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.13497
Pdf URL: https://arxiv.org/pdf/2502.13497
Copy Paste: [[2502.13497]] Towards Geo-Culturally Grounded LLM Generations(https://arxiv.org/abs/2502.13497)
Keywords: language model, llm, retrieval augmented generation
Abstract: Generative large language models (LLMs) have been demonstrated to have gaps in diverse, cultural knowledge across the globe. We investigate the effect of retrieval augmented generation and search-grounding techniques on the ability of LLMs to display familiarity with a diverse range of national cultures. Specifically, we compare the performance of standard LLMs, LLMs augmented with retrievals from a bespoke knowledge base (i.e., KB grounding), and LLMs augmented with retrievals from a web search (i.e., search grounding) on a series of cultural familiarity benchmarks. We find that search grounding significantly improves the LLM performance on multiple-choice benchmarks that test propositional knowledge (e.g., the norms, artifacts, and institutions of national cultures), while KB grounding's effectiveness is limited by inadequate knowledge base coverage and a suboptimal retriever. However, search grounding also increases the risk of stereotypical judgments by language models, while failing to improve evaluators' judgments of cultural familiarity in a human evaluation with adequate statistical power. These results highlight the distinction between propositional knowledge about a culture and open-ended cultural fluency when it comes to evaluating the cultural familiarity of generative LLMs.

Title: PLDR-LLMs Learn A Generalizable Tensor Operator That Can Replace Its Own Deep Neural Net At Inference

Authors: Burc Gokden
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.13502
Pdf URL: https://arxiv.org/pdf/2502.13502
Copy Paste: [[2502.13502]] PLDR-LLMs Learn A Generalizable Tensor Operator That Can Replace Its Own Deep Neural Net At Inference(https://arxiv.org/abs/2502.13502)
Keywords: language model, llm
Abstract: We show that the Large Language Model from Power Law Decoder Representations (PLDR-LLM) is a foundational model whose deductive outputs are invariant tensors up to a small perturbation. PLDR-LLM learns a singularity condition for the deductive outputs that enables the once-inferred energy-curvature tensor $\mathbf{G}_{LM}$ to replace the deep neural network of power law graph attention (PLGA) that generates the deductive outputs at inference. We demonstrate that a cache for $\mathbf{G}_{LM}$ (G-cache) and a KV-cache can be implemented in a straightforward manner to improve inference time. The invariance and generalizable nature of the deductive outputs holds at very high fidelity: the deductive outputs have the same RMSE and determinant values up to 15 decimal places after caching, and zero-shot benchmark scores remain unchanged. Ablation studies show that learned deductive outputs have loss and accuracy characteristics distinct from those of models pretrained with transferred, randomly initialized, or identity tensors as a constant tensor operator, and that an LLM with scaled dot-product attention (SDPA) is a special case of PLDR-LLM in which $\mathbf{G}_{LM}$ is predefined as identity. The observed invariance characteristic introduces a novel asymmetry between the training and inference phases with caching. We outline the common characteristics observed in the deductive outputs for the learned singularity condition. We provide an implementation of a training and inference framework for PLDR-LLM with KV-cache and G-cache.

Title: Unlocking Multimodal Integration in EHRs: A Prompt Learning Framework for Language and Time Series Fusion

Authors: Shuai Niu, Jing Ma, Hongzhan Lin, Liang Bai, Zhihua Wang, Wei Bi, Yida Xu, Guo Li, Xian Yang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.13509
Pdf URL: https://arxiv.org/pdf/2502.13509
Copy Paste: [[2502.13509]] Unlocking Multimodal Integration in EHRs: A Prompt Learning Framework for Language and Time Series Fusion(https://arxiv.org/abs/2502.13509)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) have shown remarkable performance in vision-language tasks, but their application in the medical field remains underexplored, particularly for integrating structured time series data with unstructured clinical notes. In clinical practice, dynamic time series data such as lab test results capture critical temporal patterns, while clinical notes provide rich semantic context. Merging these modalities is challenging due to the inherent differences between continuous signals and discrete text. To bridge this gap, we introduce ProMedTS, a novel self-supervised multimodal framework that employs prompt-guided learning to unify these heterogeneous data types. Our approach leverages lightweight anomaly detection to generate anomaly captions that serve as prompts, guiding the encoding of raw time series data into informative embeddings. These embeddings are aligned with textual representations in a shared latent space, preserving fine-grained temporal nuances alongside semantic insights. Furthermore, our framework incorporates tailored self-supervised objectives to enhance both intra- and inter-modal alignment. We evaluate ProMedTS on disease diagnosis tasks using real-world datasets, and the results demonstrate that our method consistently outperforms state-of-the-art approaches.

Title: Shall Your Data Strategy Work? Perform a Swift Study

Authors: Minlong Peng, Jingyi Yang, Zhongjun He, Hua Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.13514
Pdf URL: https://arxiv.org/pdf/2502.13514
Copy Paste: [[2502.13514]] Shall Your Data Strategy Work? Perform a Swift Study(https://arxiv.org/abs/2502.13514)
Keywords: chain-of-thought
Abstract: This work presents a swift method to assess the efficacy of particular types of instruction-tuning data, utilizing just a handful of probe examples and eliminating the need for model retraining. This method employs the idea of gradient-based data influence estimation, analyzing the gradient projections of probe examples from the chosen strategy onto evaluation examples to assess its advantages. Building upon this method, we conducted three swift studies to investigate the potential of Chain-of-thought (CoT) data, query clarification data, and response evaluation data in enhancing model generalization. Subsequently, we embarked on a validation study to corroborate the findings of these swift studies. In this validation study, we developed training datasets tailored to each studied strategy and compared model performance with and without the use of these datasets. The results of the validation study aligned with the findings of the swift studies, validating the efficacy of our proposed method.
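
The gradient-projection idea can be illustrated on a toy model. The sketch below is not the paper's procedure: it assumes a linear model with squared loss so that gradients can be written by hand, and the names `grad`, `mean_grad`, and `projection` are hypothetical. Because the loss gradient points uphill, a positive projection of the probe gradient onto the evaluation gradient means that, to first order, a descent step on the probe examples also lowers the evaluation loss.

```python
import math

def grad(w, x, y):
    """Gradient of 0.5 * (w.x - y)^2 with respect to w."""
    err = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [err * xi for xi in x]

def mean_grad(w, examples):
    """Average the per-example gradients over a list of (x, y) pairs."""
    grads = [grad(w, x, y) for x, y in examples]
    return [sum(col) / len(grads) for col in zip(*grads)]

def projection(g_probe, g_eval):
    """Scalar projection of the probe gradient onto the evaluation gradient."""
    norm = math.sqrt(sum(g * g for g in g_eval))
    if norm == 0.0:
        return 0.0
    return sum(p * e for p, e in zip(g_probe, g_eval)) / norm

w = [0.0, 0.0]                          # current (untrained) weights
eval_set = [([1.0, 0.0], 1.0)]          # evaluation examples
helpful_probe = [([2.0, 0.0], 2.0)]     # exercises the same feature
unrelated_probe = [([0.0, 1.0], 1.0)]   # exercises an orthogonal feature

g_eval = mean_grad(w, eval_set)
helpful_score = projection(mean_grad(w, helpful_probe), g_eval)
unrelated_score = projection(mean_grad(w, unrelated_probe), g_eval)
```

In this toy setup the helpful probe gets a positive score and the unrelated probe scores zero, and no retraining is needed to rank them.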

Title: Activation-aware Probe-Query: Effective Key-Value Retrieval for Long-Context LLMs Inference

Authors: Qingfa Xiao, Jiachuan Wang, Haoyang Li, Cheng Deng, Jiaqi Tang, Shuangyin Li, Yongqi Zhang, Jun Wang, Lei Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.13542
Pdf URL: https://arxiv.org/pdf/2502.13542
Copy Paste: [[2502.13542]] Activation-aware Probe-Query: Effective Key-Value Retrieval for Long-Context LLMs Inference(https://arxiv.org/abs/2502.13542)
Keywords: language model, llm, long context
Abstract: Recent advances in large language models (LLMs) have showcased exceptional performance in long-context tasks, while facing significant inference efficiency challenges with limited GPU memory. Existing solutions first proposed the sliding-window approach to accumulate a set of historical key-value (KV) pairs for reuse; later improvements selectively retain subsets of these pairs at each step. However, because attention is sparsely distributed across a long context, it is hard to identify and recall relevant KV pairs, as the attention is distracted by massive numbers of candidate pairs. Additionally, we found it promising to select representative tokens as a probe-Query in each sliding window to effectively represent the entire context, an approach overlooked by existing methods. Thus, we propose ActQKV, a training-free, Activation-aware approach that dynamically determines the probe-Query and leverages it to retrieve the relevant KV pairs for inference. Specifically, ActQKV monitors a token-level indicator, Activation Bias, within each context window, enabling the proper construction of the probe-Query for retrieval at the pre-filling stage. To accurately recall the relevant KV pairs and minimize the irrelevant ones, we design a dynamic KV cut-off mechanism guided by information density across layers at the decoding stage. Experiments on the Long-Bench and $\infty$ Benchmarks demonstrate its state-of-the-art performance with competitive inference quality and resource efficiency.

Title: From Sub-Ability Diagnosis to Human-Aligned Generation: Bridging the Gap for Text Length Control via MARKERGEN

Authors: Peiwen Yuan, Chuyi Tan, Shaoxiong Feng, Yiwei Li, Xinglin Wang, Yueqi Zhang, Jiayi Shi, Boyuan Pan, Yao Hu, Kan Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.13544
Pdf URL: https://arxiv.org/pdf/2502.13544
Copy Paste: [[2502.13544]] From Sub-Ability Diagnosis to Human-Aligned Generation: Bridging the Gap for Text Length Control via MARKERGEN(https://arxiv.org/abs/2502.13544)
Keywords: language model, llm
Abstract: Despite the rapid progress of large language models (LLMs), their length-controllable text generation (LCTG) ability remains below expectations, posing a major limitation for practical applications. Existing methods mainly focus on end-to-end training to reinforce adherence to length constraints. However, the lack of decomposition and targeted enhancement of LCTG sub-abilities restricts further improvement. To bridge this gap, we conduct a bottom-up decomposition of LCTG sub-abilities with human patterns as reference and perform a detailed error analysis. On this basis, we propose MarkerGen, a simple yet effective plug-and-play approach that: (1) mitigates LLM fundamental deficiencies via external tool integration; (2) conducts explicit length modeling with dynamically inserted markers; (3) employs a three-stage generation scheme to better align length constraints while maintaining content quality. Experiments demonstrate that MarkerGen significantly improves LCTG across various settings, exhibiting outstanding effectiveness and generalizability.

Title: Detecting Linguistic Bias in Government Documents Using Large language Models

Authors: Milena de Swart, Floris den Hengst, Jieying Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.13548
Pdf URL: https://arxiv.org/pdf/2502.13548
Copy Paste: [[2502.13548]] Detecting Linguistic Bias in Government Documents Using Large language Models(https://arxiv.org/abs/2502.13548)
Keywords: language model
Abstract: This paper addresses the critical need for detecting bias in government documents, an underexplored area with significant implications for governance. Existing methodologies often overlook the unique context and far-reaching impacts of governmental documents, potentially obscuring embedded biases that shape public policy and citizen-government interactions. To bridge this gap, we introduce the Dutch Government Data for Bias Detection (DGDB), a dataset sourced from the Dutch House of Representatives and annotated for bias by experts. We fine-tune several BERT-based models on this dataset and compare their performance with that of generative language models. Additionally, we conduct a comprehensive error analysis that includes explanations of the models' predictions. Our findings demonstrate that fine-tuned models achieve strong performance and significantly outperform generative language models, indicating the effectiveness of DGDB for bias detection. This work underscores the importance of labeled datasets for bias detection in various languages and contributes to more equitable governance practices.

Title: STaR-SQL: Self-Taught Reasoner for Text-to-SQL

Authors: Mingqian He, Yongliang Shen, Wenqi Zhang, Qiuying Peng, Jun Wang, Weiming Lu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.13550
Pdf URL: https://arxiv.org/pdf/2502.13550
Copy Paste: [[2502.13550]] STaR-SQL: Self-Taught Reasoner for Text-to-SQL(https://arxiv.org/abs/2502.13550)
Keywords: language model, gpt, llm, prompt, chain-of-thought, agent
Abstract: Generating step-by-step "chain-of-thought" rationales has proven effective for improving the performance of large language models on complex reasoning tasks. However, applying such techniques to structured tasks, such as text-to-SQL, remains largely unexplored. In this paper, we introduce Self-Taught Reasoner for text-to-SQL (STaR-SQL), a novel approach that reframes SQL query generation as a reasoning-driven process. Our method prompts the LLM to produce detailed reasoning steps for SQL queries and fine-tunes it on rationales that lead to correct outcomes. Unlike traditional methods, STaR-SQL dedicates additional test-time computation to reasoning, thereby positioning LLMs as spontaneous reasoners rather than mere prompt-based agents. To further scale the inference process, we incorporate an outcome-supervised reward model (ORM) as a verifier, which enhances SQL query accuracy. Experimental results on the challenging Spider benchmark demonstrate that STaR-SQL significantly improves text-to-SQL performance, achieving an execution accuracy of 86.6%. This surpasses a few-shot baseline by 31.6% and a baseline fine-tuned to predict answers directly by 18.0%. Additionally, STaR-SQL outperforms agent-like prompting methods that leverage more powerful yet closed-source models such as GPT-4. These findings underscore the potential of reasoning-augmented training for structured tasks and open the door to extending self-improving reasoning models to text-to-SQL generation and beyond.
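
Keeping only rationales whose SQL leads to a correct outcome can be sketched as an execution-match filter. This is a hedged illustration using Python's standard `sqlite3` module, not the authors' pipeline; the helper names `execution_match` and `keep_correct`, the toy `singer` table, and the sample rationales are all hypothetical.

```python
import sqlite3

def execution_match(conn, predicted_sql, gold_sql):
    """True if the predicted query runs and returns the same rows as the gold query."""
    try:
        predicted = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False  # invalid SQL counts as an incorrect outcome
    gold = conn.execute(gold_sql).fetchall()
    return sorted(predicted) == sorted(gold)  # ignore row order

def keep_correct(conn, samples, gold_sql):
    """Filter sampled (rationale, sql) pairs down to those that execute correctly."""
    return [s for s in samples if execution_match(conn, s["sql"], gold_sql)]

# Toy in-memory database standing in for a benchmark schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO singer VALUES (?, ?)",
                 [("Ann", 30), ("Bo", 45), ("Cy", 52)])

samples = [
    {"rationale": "Filter singers older than 40, then return their names.",
     "sql": "SELECT name FROM singer WHERE age > 40"},
    {"rationale": "Return every name.",
     "sql": "SELECT name FROM singer"},
    {"rationale": "Uses a misspelled table name.",
     "sql": "SELECT name FROM singers WHERE age > 40"},
]
gold = "SELECT name FROM singer WHERE age > 40"
kept = keep_correct(conn, samples, gold)
```

Only the surviving rationale-query pairs would then be used as fine-tuning data; comparing sorted rows is a simplification that treats result order as irrelevant.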

Title: PRIV-QA: Privacy-Preserving Question Answering for Cloud Large Language Models

Authors: Guangwei Li, Yuansen Zhang, Yinggui Wang, Shoumeng Yan, Lei Wang, Tao Wei
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.13564
Pdf URL: https://arxiv.org/pdf/2502.13564
Copy Paste: [[2502.13564]] PRIV-QA: Privacy-Preserving Question Answering for Cloud Large Language Models(https://arxiv.org/abs/2502.13564)
Keywords: language model, llm
Abstract: The rapid development of large language models (LLMs) is redefining the landscape of human-computer interaction, and their integration into various user-service applications is becoming increasingly prevalent. However, transmitting user data to cloud-based LLMs presents significant risks of data breaches and unauthorized access to personal identification information. In this paper, we propose a privacy preservation pipeline for protecting privacy and sensitive information during interactions between users and LLMs in practical LLM usage scenarios. We construct SensitiveQA, the first privacy open-ended question-answering dataset. It comprises 57k interactions in Chinese and English, encompassing a diverse range of user-sensitive information within the conversations. Our proposed solution employs a multi-stage strategy aimed at preemptively securing user information while simultaneously preserving the response quality of cloud-based LLMs. Experimental validation underscores our method's efficacy in balancing privacy protection with maintaining robust interaction quality. The code and dataset are available at this https URL.

Title: Extracting Social Connections from Finnish Karelian Refugee Interviews Using LLMs

Authors: Joonatan Laato, Jenna Kanerva, John Loehr, Virpi Lummaa, Filip Ginter
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.13566
Pdf URL: https://arxiv.org/pdf/2502.13566
Copy Paste: [[2502.13566]] Extracting Social Connections from Finnish Karelian Refugee Interviews Using LLMs(https://arxiv.org/abs/2502.13566)
Keywords: gpt, llm
Abstract: We performed a zero-shot information extraction study on a historical collection of 89,339 brief Finnish-language interviews of refugee families relocated post-WWII from Finnish Eastern Karelia. Our research objective is two-fold. First, we aim to extract social organizations and hobbies from the free text of the interviews, separately for each family member. These can act as a proxy variable indicating the degree of social integration of refugees in their new environment. Second, we aim to evaluate several alternative ways to approach this task, comparing a number of generative models and a supervised learning approach, to gain a broader insight into the relative merits of these different approaches and their applicability in similar studies. We find that the best generative model (GPT-4) is roughly on par with human performance, at an F-score of 88.8%. Interestingly, the best open generative model (Llama-3-70B-Instruct) reaches almost the same performance, at 87.7% F-score, demonstrating that open models are becoming a viable alternative for some practical tasks even on non-English data. Additionally, we test a supervised learning alternative, where we fine-tune a Finnish BERT model (FinBERT) using GPT-4 generated training data. With this method, we achieved an F-score of 84.1% with as few as 6K interviews, rising to 86.3% with 30K interviews. Such an approach would be particularly appealing in cases where the computational resources are limited, or there is a substantial mass of data to process.
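
The F-scores quoted above summarize how well extracted items match a reference annotation. The sketch below shows one common way to compute such a score, assuming exact-match, micro-averaged counting over per-person label sets. The abstract does not state the paper's matching criteria, and the function name and sample data are hypothetical.

```python
def f_score(predicted, gold):
    """Micro-averaged precision, recall, and F1 over per-person label sets.

    Both arguments map a person identifier to a set of extracted items.
    Exact string match is an assumption made for this illustration.
    """
    tp = fp = fn = 0
    for person in set(predicted) | set(gold):
        pred_items = predicted.get(person, set())
        gold_items = gold.get(person, set())
        tp += len(pred_items & gold_items)   # items found and correct
        fp += len(pred_items - gold_items)   # items found but not in the reference
        fn += len(gold_items - pred_items)   # reference items that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical annotations for one interview.
gold = {"father": {"choir", "hunting club"}, "mother": {"sewing circle"}}
predicted = {"father": {"choir"}, "mother": {"sewing circle", "gymnastics"}}
precision, recall, f1 = f_score(predicted, gold)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```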

Title: Don't Stop the Multi-Party! On Generating Synthetic Multi-Party Conversations with Constraints

Authors: Nicolò Penzo, Marco Guerini, Bruno Lepri, Goran Glavaš, Sara Tonelli
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.13592
Pdf URL: https://arxiv.org/pdf/2502.13592
Copy Paste: [[2502.13592]] Don't Stop the Multi-Party! On Generating Synthetic Multi-Party Conversations with Constraints(https://arxiv.org/abs/2502.13592)
Keywords: language model, llm
Abstract: Multi-Party Conversations (MPCs) are widely studied across disciplines, with social media as a primary data source due to its accessibility. However, these datasets raise privacy concerns and often reflect platform-specific properties. For example, interactions between speakers may be limited due to rigid platform structures (e.g., threads, tree-like discussions), which yield overly simplistic interaction patterns (e.g., as a consequence of "reply-to" links). This work explores the feasibility of generating diverse MPCs with instruction-tuned Large Language Models (LLMs) by providing deterministic constraints such as dialogue structure and participants' stance. We investigate two complementary strategies of leveraging LLMs in this context: (i) LLMs as MPC generators, where we task the LLM to generate a whole MPC at once, and (ii) LLMs as MPC parties, where the LLM generates one turn of the conversation at a time, given the conversation history. We next introduce an analytical framework to evaluate compliance with the constraints, content quality, and interaction complexity for both strategies. Finally, we assess the quality of obtained MPCs via human annotation and LLM-as-a-judge evaluations. We find stark differences among LLMs, with only some being able to generate high-quality MPCs. We also find that turn-by-turn generation yields better conformance to constraints and higher linguistic variability than generating MPCs in one pass. Nonetheless, our structural and qualitative evaluation indicates that both generation strategies can yield high-quality MPCs.

Title: MMTEB: Massive Multilingual Text Embedding Benchmark

Authors: Kenneth Enevoldsen, Isaac Chung, Imene Kerboua, Márton Kardos, Ashwin Mathur, David Stap, Jay Gala, Wissam Siblini, Dominik Krzemiński, Genta Indra Winata, Saba Sturua, Saiteja Utpala, Mathieu Ciancone, Marion Schaeffer, Gabriel Sequeira, Diganta Misra, Shreeya Dhakal, Jonathan Rystrøm, Roman Solomatin, Ömer Çağatan, Akash Kundu, Martin Bernstorff, Shitao Xiao, Akshita Sukhlecha, Bhavish Pahwa, Rafał Poświata, Kranthi Kiran GV, Shawon Ashraf, Daniel Auras, Björn Plüster, Jan Philipp Harries, Loïc Magne, Isabelle Mohr, Mariya Hendriksen, Dawei Zhu, Hippolyte Gisserot-Boukhlef, Tom Aarsen, Jan Kostkan, Konrad Wojtasik, Taemin Lee, Marek Šuppa, Crystina Zhang, Roberta Rocca, Mohammed Hamdy, Andrianos Michail, John Yang, Manuel Faysse, Aleksei Vatolin, Nandan Thakur, Manan Dey, Dipam Vasani, Pranjal Chitale, Simone Tedeschi, Nguyen Tai, Artem Snegirev, Michael Günther, Mengzhou Xia, Weijia Shi, Xing Han Lù, Jordan Clive, Gayatri Krishnakumar, Anna Maksimova, Silvan Wehrli, Maria Tikhonova, Henil Panchal, Aleksandr Abramov, Malte Ostendorff, Zheng Liu, Simon Clematide, Lester James Miranda, Alena Fenogenova, Guangyu Song, Ruqiya Bin Safi, Wen-Ding Li, Alessia Borghini, Federico Cassano, Hongjin Su, Jimmy Lin, Howard Yen, Lasse Hansen, Sara Hooker, Chenghao Xiao, Vaibhav Adlakha, Orion Weller, Siva Reddy, Niklas Muennighoff
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2502.13595
Pdf URL: https://arxiv.org/pdf/2502.13595
Copy Paste: [[2502.13595]] MMTEB: Massive Multilingual Text Embedding Benchmark(https://arxiv.org/abs/2502.13595)
Keywords: language model, llm
Abstract: Text embeddings are typically evaluated on a limited set of tasks, which are constrained by language, domain, and task diversity. To address these limitations and provide a more comprehensive evaluation, we introduce the Massive Multilingual Text Embedding Benchmark (MMTEB) - a large-scale, community-driven expansion of MTEB, covering over 500 quality-controlled evaluation tasks across 250+ languages. MMTEB includes a diverse set of challenging, novel tasks such as instruction following, long-document retrieval, and code retrieval, representing the largest multilingual collection of evaluation tasks for embedding models to date. Using this collection, we develop several highly multilingual benchmarks, which we use to evaluate a representative set of models. We find that while large language models (LLMs) with billions of parameters can achieve state-of-the-art performance on certain language subsets and task categories, the best-performing publicly available model is multilingual-e5-large-instruct with only 560 million parameters. To facilitate accessibility and reduce computational cost, we introduce a novel downsampling method based on inter-task correlation, ensuring a diverse selection while preserving relative model rankings. Furthermore, we optimize tasks such as retrieval by sampling hard negatives, creating smaller but effective splits. These optimizations allow us to introduce benchmarks that drastically reduce computational demands. For instance, our newly introduced zero-shot English benchmark maintains a ranking order similar to the full-scale version but at a fraction of the computational cost.

Title: Efficient Safety Retrofitting Against Jailbreaking for LLMs

Authors: Dario Garcia-Gasulla, Anna Arias-Duart, Adrian Tormos, Daniel Hinjos, Oscar Molina-Sedano, Ashwin Kumar Gururajan, Maria Eugenia Cardello
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.13603
Pdf URL: https://arxiv.org/pdf/2502.13603
Copy Paste: [[2502.13603]] Efficient Safety Retrofitting Against Jailbreaking for LLMs(https://arxiv.org/abs/2502.13603)
Keywords: llm
Abstract: Direct Preference Optimization (DPO) is an efficient alignment technique that steers LLMs towards preferable outputs by training on preference data, bypassing the need for explicit reward models. Its simplicity enables easy adaptation to various domains and safety requirements. This paper examines DPO's effectiveness in model safety against jailbreaking attacks while minimizing data requirements and training costs. We introduce Egida, a dataset expanded from multiple sources, which includes 27 different safety topics and 18 different attack styles, complemented with synthetic and human labels. This data is used to boost the safety of state-of-the-art LLMs (Llama-3.1-8B/70B-Instruct, Qwen-2.5-7B/72B-Instruct) across topics and attack styles. In addition to safety evaluations, we assess their post-alignment performance degradation in general-purpose tasks and their tendency toward over-refusal. Following the proposed methodology, trained models reduce their Attack Success Rate by 10%-30%, using small training efforts (2,000 samples) with low computational cost ($3 for 8B models, $20 for 72B models). Safety-aligned models generalize to unseen topics and attack styles, with the most successful attack style reaching a success rate around 5%. Size and family are found to strongly influence model malleability towards safety, pointing at the importance of pre-training choices. To validate our findings, a large independent assessment of human preference agreement with Llama-Guard-3-8B is conducted by the authors and the associated dataset Egida-HSafe is released. Overall, this study illustrates how affordable and accessible it is to enhance LLM safety using DPO while outlining its current limitations. All datasets and models are released to enable reproducibility and further research.
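
For readers unfamiliar with DPO, the following is a minimal sketch of the standard per-example DPO objective, which compares how much the policy and a frozen reference model prefer the chosen response over the rejected one. It illustrates the general technique only and is not the training code of this paper; the function name, the `beta` default, and the sample log-probabilities are assumptions.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log(sigmoid(beta * preference margin)).

    Each argument is the summed log-probability of a full response under the
    trainable policy or the frozen reference model. beta=0.1 is an assumed
    illustrative value, not one taken from the abstract.
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow.
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# Hypothetical log-probabilities: the policy already prefers the safe answer
# slightly more than the reference does, so the loss drops below log(2).
print(dpo_loss(-12.0, -15.0, -13.0, -14.0))
```

When the policy and the reference assign identical log-probabilities, the margin is zero and the loss equals log 2, which is the starting point of training if the policy is initialized from the reference.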

Title: BeamLoRA: Beam-Constraint Low-Rank Adaptation

Authors: Naibin Gu, Zhenyu Zhang, Xiyu Liu, Peng Fu, Zheng Lin, Shuohuan Wang, Yu Sun, Hua Wu, Weiping Wang, Haifeng Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.13604
Pdf URL: https://arxiv.org/pdf/2502.13604
Copy Paste: [[2502.13604]] BeamLoRA: Beam-Constraint Low-Rank Adaptation(https://arxiv.org/abs/2502.13604)
Keywords: language model
Abstract: Due to the demand for efficient fine-tuning of large language models, Low-Rank Adaptation (LoRA) has been widely adopted as one of the most effective parameter-efficient fine-tuning methods. Nevertheless, while LoRA improves efficiency, there remains room for improvement in accuracy. Herein, we adopt a novel perspective to assess the characteristics of LoRA ranks. The results reveal that different ranks within the LoRA modules not only exhibit varying levels of importance but also evolve dynamically throughout the fine-tuning process, which may limit the performance of LoRA. Based on these findings, we propose BeamLoRA, which conceptualizes each LoRA module as a beam where each rank naturally corresponds to a potential sub-solution, and the fine-tuning process becomes a search for the optimal sub-solution combination. BeamLoRA dynamically eliminates underperforming sub-solutions while expanding the parameter space for promising ones, enhancing performance with a fixed rank. Extensive experiments across three base models and 12 datasets spanning math reasoning, code generation, and commonsense reasoning demonstrate that BeamLoRA consistently enhances the performance of LoRA, surpassing the other baseline methods.
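
As background for the "ranks as sub-solutions" view, the standard LoRA update can be written as a sum of rank-one terms, one per rank. The notation below is the usual LoRA formulation with the scaling factor omitted; it is an illustrative reading and not an equation taken from the BeamLoRA paper.

```latex
% Standard LoRA: frozen weight W_0 plus a trainable low-rank update.
% B has columns b_i, A has rows a_i^T; r is the rank.
W = W_0 + \Delta W, \qquad
\Delta W = B A = \sum_{i=1}^{r} b_i a_i^{\top},
\qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k}.
```

Each term $b_i a_i^{\top}$ is the contribution of a single rank, which is the unit the abstract describes as being pruned or expanded during fine-tuning.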

Title: Complex Ontology Matching with Large Language Model Embeddings

Authors: Guilherme Sousa, Rinaldo Lima, Cassia Trojahn
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.13619
Pdf URL: https://arxiv.org/pdf/2502.13619
Copy Paste: [[2502.13619]] Complex Ontology Matching with Large Language Model Embeddings(https://arxiv.org/abs/2502.13619)
Keywords: language model, llm
Abstract: Ontology matching, and more broadly knowledge graph matching, is a challenging task in which expressiveness has not been fully addressed. Despite the increasing use of embeddings and language models for this task, approaches for generating expressive correspondences still do not take full advantage of these models, in particular, large language models (LLMs). This paper proposes to integrate LLMs into an approach for generating expressive correspondences based on alignment need and ABox-based relation discovery. The generation of correspondences is performed by matching similar surroundings of instance sub-graphs. The integration of LLMs results in different architectural modifications, including label similarity, sub-graph matching, and entity matching. The performance of word embeddings, sentence embeddings, and LLM-based embeddings was compared. The results demonstrate that integrating LLMs surpasses all other models, enhancing the baseline version of the approach with a 45% increase in F-measure.

Title: REFIND: Retrieval-Augmented Factuality Hallucination Detection in Large Language Models

Authors: DongGeon Lee, Hwanjo Yu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.13622
Pdf URL: https://arxiv.org/pdf/2502.13622
Copy Paste: [[2502.13622]] REFIND: Retrieval-Augmented Factuality Hallucination Detection in Large Language Models(https://arxiv.org/abs/2502.13622)
Keywords: language model, llm, hallucination
Abstract: Hallucinations in large language model (LLM) outputs severely limit their reliability in knowledge-intensive tasks such as question answering. To address this challenge, we introduce REFIND (Retrieval-augmented Factuality hallucINation Detection), a novel framework that detects hallucinated spans within LLM outputs by directly leveraging retrieved documents. As part of REFIND, we propose the Context Sensitivity Ratio (CSR), a novel metric that quantifies the sensitivity of LLM outputs to retrieved evidence. This innovative approach enables REFIND to efficiently and accurately detect hallucinations, setting it apart from existing methods. In the evaluation, REFIND demonstrated robustness across nine languages, including low-resource settings, and significantly outperformed baseline models, achieving superior IoU scores in identifying hallucinated spans. This work highlights the effectiveness of quantifying context sensitivity for hallucination detection, thereby paving the way for more reliable and trustworthy LLM applications across diverse languages.
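
The IoU score mentioned above measures overlap between predicted and reference hallucinated spans. A minimal character-level sketch follows, assuming spans are half-open `(start, end)` character offsets and that two empty span sets count as a perfect match. Both conventions are assumptions, since the abstract does not spell out the exact protocol.

```python
def span_iou(predicted_spans, gold_spans):
    """Character-level intersection-over-union between two lists of spans.

    Spans are (start, end) offsets with `end` exclusive. Returning 1.0 when
    both inputs are empty is an assumed convention for this sketch.
    """
    def to_indices(spans):
        # Expand every span into the set of character positions it covers.
        return {i for start, end in spans for i in range(start, end)}

    pred = to_indices(predicted_spans)
    gold = to_indices(gold_spans)
    if not pred and not gold:
        return 1.0
    return len(pred & gold) / len(pred | gold)


# Hypothetical offsets: the prediction covers characters 0-9, the reference 5-14.
print(span_iou([(0, 10)], [(5, 15)]))
```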

Title: Qorgau: Evaluating LLM Safety in Kazakh-Russian Bilingual Contexts

Authors: Maiya Goloburda, Nurkhan Laiyk, Diana Turmakhan, Yuxia Wang, Mukhammed Togmanov, Jonibek Mansurov, Askhat Sametov, Nurdaulet Mukhituly, Minghan Wang, Daniil Orel, Zain Muhammad Mujahid, Fajri Koto, Timothy Baldwin, Preslav Nakov
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.13640
Pdf URL: https://arxiv.org/pdf/2502.13640
Copy Paste: [[2502.13640]] Qorgau: Evaluating LLM Safety in Kazakh-Russian Bilingual Contexts(https://arxiv.org/abs/2502.13640)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) are known to have the potential to generate harmful content, posing risks to users. While significant progress has been made in developing taxonomies for LLM risks and safety evaluation prompts, most studies have focused on monolingual contexts, primarily in English. However, language- and region-specific risks in bilingual contexts are often overlooked, and core findings can diverge from those in monolingual settings. In this paper, we introduce Qorgau, a novel dataset specifically designed for safety evaluation in Kazakh and Russian, reflecting the unique bilingual context in Kazakhstan, where both Kazakh (a low-resource language) and Russian (a high-resource language) are spoken. Experiments with both multilingual and language-specific LLMs reveal notable differences in safety performance, emphasizing the need for tailored, region-specific datasets to ensure the responsible and safe deployment of LLMs in countries like Kazakhstan. Warning: this paper contains example data that may be offensive, harmful, or biased.

Title: D.Va: Validate Your Demonstration First Before You Use It

Authors: Qi Zhang, Zhiqing Xiao, Ruixuan Xiao, Lirong Gao, Junbo Zhao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.13646
Pdf URL: https://arxiv.org/pdf/2502.13646
Copy Paste: [[2502.13646]] D.Va: Validate Your Demonstration First Before You Use It(https://arxiv.org/abs/2502.13646)
Keywords: language model, llm
Abstract: In-context learning (ICL) has demonstrated significant potential in enhancing the capabilities of large language models (LLMs) during inference. It's well-established that ICL heavily relies on selecting effective demonstrations to generate outputs that better align with the expected results. As for demonstration selection, previous approaches have typically relied on intuitive metrics to evaluate the effectiveness of demonstrations, which often results in limited robustness and poor cross-model generalization capabilities. To tackle these challenges, we propose a novel method, Demonstration Validation (D.Va), which integrates a demonstration validation perspective into this field. By introducing the demonstration validation mechanism, our method effectively identifies demonstrations that are both effective and highly generalizable. D.Va surpasses all existing demonstration selection techniques across both natural language understanding (NLU) and natural language generation (NLG) tasks. Additionally, we demonstrate the robustness and generalizability of our approach across various language models with different retrieval models.

Title: Instruction Tuning on Public Government and Cultural Data for Low-Resource Language: a Case Study in Kazakh

Authors: Nurkhan Laiyk, Daniil Orel, Rituraj Joshi, Maiya Goloburda, Yuxia Wang, Preslav Nakov, Fajri Koto
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.13647
Pdf URL: https://arxiv.org/pdf/2502.13647
Copy Paste: [[2502.13647]] Instruction Tuning on Public Government and Cultural Data for Low-Resource Language: a Case Study in Kazakh(https://arxiv.org/abs/2502.13647)
Keywords: gpt, llm
Abstract: Instruction tuning in low-resource languages remains underexplored due to limited text data, particularly in government and cultural domains. To address this, we introduce and open-source a large-scale (10,600 samples) instruction-following (IFT) dataset, covering key institutional and cultural knowledge relevant to Kazakhstan. Our dataset enhances LLMs' understanding of procedural, legal, and structural governance topics. We employ LLM-assisted data generation, comparing open-weight and closed-weight models for dataset construction, and select GPT-4o as the backbone. Each entity of our dataset undergoes full manual verification to ensure high quality. We also show that fine-tuning Qwen, Falcon, and Gemma on our dataset leads to consistent performance improvements in both multiple-choice and generative tasks, demonstrating the potential of LLM-assisted instruction tuning for low-resource languages.

Title: Reliability Across Parametric and External Knowledge: Understanding Knowledge Handling in LLMs

Authors: Youna Kim, Minjoon Choi, Sungmin Cho, Hyuhng Joon Kim, Sang-goo Lee, Taeuk Kim
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.13648
Pdf URL: https://arxiv.org/pdf/2502.13648
Copy Paste: [[2502.13648]] Reliability Across Parametric and External Knowledge: Understanding Knowledge Handling in LLMs(https://arxiv.org/abs/2502.13648)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) enhance their problem-solving capability by leveraging both parametric and external knowledge. Beyond leveraging external knowledge to improve response accuracy, they require key capabilities for reliable knowledge-handling: resolving conflicts between knowledge sources, avoiding distraction from uninformative external knowledge, and abstaining when sufficient knowledge is unavailable. Prior studies have examined these scenarios in isolation or with limited scope. To systematically evaluate these capabilities, we introduce a comprehensive framework for analyzing knowledge-handling based on two key dimensions: the presence of parametric knowledge and the informativeness of external knowledge. Through analysis, we identify biases in knowledge utilization and examine how the ability to handle one scenario impacts performance in others. Furthermore, we demonstrate that training on data constructed based on the knowledge-handling scenarios improves LLMs' reliability in integrating and utilizing knowledge.

Title: C2T: A Classifier-Based Tree Construction Method in Speculative Decoding

Authors: Feiye Huo, Jianchao Tan, Kefeng Zhang, Xunliang Cai, Shengli Sun
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.13652
Pdf URL: https://arxiv.org/pdf/2502.13652
Copy Paste: [[2502.13652]] C2T: A Classifier-Based Tree Construction Method in Speculative Decoding(https://arxiv.org/abs/2502.13652)
Keywords: language model, llm
Abstract: The growing scale of Large Language Models (LLMs) has exacerbated inference latency and computational costs. Speculative decoding methods, which aim to mitigate these issues, often face inefficiencies in the construction of token trees and the verification of candidate tokens. Existing strategies, including chain mode, static tree, and dynamic tree approaches, have limitations in accurately preparing candidate token trees for verification. We propose a novel method named C2T that adopts a lightweight classifier to generate and prune token trees dynamically. Our classifier considers additional feature variables beyond the commonly used joint probability to predict the confidence score for each draft token to determine whether it is the candidate token for verification. This method outperforms state-of-the-art (SOTA) methods such as EAGLE-2 on multiple benchmarks, by reducing the total number of candidate tokens by 25% while maintaining or even improving the acceptance length.
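
The "commonly used joint probability" criterion that the abstract contrasts with C2T's classifier can be sketched as follows: each draft token's joint probability is the product of conditional probabilities along its path from the root, and tokens below a threshold are dropped. This shows only the baseline heuristic, not C2T itself; the flat node layout, function name, and threshold are hypothetical.

```python
def prune_by_joint_probability(nodes, threshold):
    """Keep draft tokens whose path joint probability meets the threshold.

    `nodes` is a list of (token, conditional_prob, parent_index) tuples in
    which parents appear before their children and roots use parent_index
    None. This data layout is an assumption made for the illustration.
    """
    joint = []   # joint[i] = product of conditional probs from root to node i
    kept = []
    for token, prob, parent in nodes:
        parent_joint = 1.0 if parent is None else joint[parent]
        joint.append(prob * parent_joint)
        # A child's joint probability never exceeds its parent's, so every
        # kept token also has all of its ancestors kept.
        if joint[-1] >= threshold:
            kept.append(token)
    return kept


# Hypothetical draft tree with two root candidates.
tree = [
    ("the", 0.6, None),
    ("a", 0.3, None),
    ("cat", 0.5, 0),   # joint 0.30
    ("dog", 0.2, 0),   # joint 0.12
    ("bird", 0.9, 1),  # joint 0.27
]
print(prune_by_joint_probability(tree, threshold=0.25))
```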

Title: Refining Sentence Embedding Model through Ranking Sentences Generation with Large Language Models

Authors: Liyang He, Chenglong Liu, Rui Li, Zhenya Huang, Shulan Ruan, Jun Zhou, Enhong Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.13656
Pdf URL: https://arxiv.org/pdf/2502.13656
Copy Paste: [[2502.13656]] Refining Sentence Embedding Model through Ranking Sentences Generation with Large Language Models(https://arxiv.org/abs/2502.13656)
Keywords: language model, llm
Abstract: Sentence embedding is essential for many NLP tasks, with contrastive learning methods achieving strong performance using annotated datasets like NLI. Yet, the reliance on manual labels limits scalability. Recent studies leverage large language models (LLMs) to generate sentence pairs, reducing annotation dependency. However, they overlook ranking information crucial for fine-grained semantic distinctions. To tackle this challenge, we propose a method for controlling the generation direction of LLMs in the latent space. Unlike unconstrained generation, the controlled approach ensures meaningful semantic divergence. Then, we refine an existing sentence embedding model by integrating ranking information and semantic information. Experiments on multiple benchmarks demonstrate that our method achieves new SOTA performance with a modest cost in ranking sentence synthesis.

Title: SCOPE: A Self-supervised Framework for Improving Faithfulness in Conditional Text Generation

Authors: Song Duong, Florian Le Bronnec, Alexandre Allauzen, Vincent Guigue, Alberto Lumbreras, Laure Soulier, Patrick Gallinari
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.13674
Pdf URL: https://arxiv.org/pdf/2502.13674
Copy Paste: [[2502.13674]] SCOPE: A Self-supervised Framework for Improving Faithfulness in Conditional Text Generation(https://arxiv.org/abs/2502.13674)
Keywords: language model, llm, hallucination
Abstract: Large Language Models (LLMs), when used for conditional text generation, often produce hallucinations, i.e., information that is unfaithful or not grounded in the input context. This issue arises in typical conditional text generation tasks, such as text summarization and data-to-text generation, where the goal is to produce fluent text based on contextual input. When fine-tuned on specific domains, LLMs struggle to provide faithful answers to a given context, often adding information or generating errors. One underlying cause of this issue is that LLMs rely on statistical patterns learned from their training data. This reliance can interfere with the model's ability to stay faithful to a provided context, leading to the generation of ungrounded information. We build upon this observation and introduce a novel self-supervised method for generating a training set of unfaithful samples. We then refine the model using a training process that encourages the generation of grounded outputs over unfaithful ones, drawing on preference-based training. Our approach leads to significantly more grounded text generation, outperforming existing self-supervised techniques in faithfulness, as evaluated through automatic metrics, LLM-based assessments, and human evaluations.

Title: Is This Collection Worth My LLM's Time? Automatically Measuring Information Potential in Text Corpora

Authors: Tristan Karch, Luca Engel, Philippe Schwaller, Frédéric Kaplan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.13691
Pdf URL: https://arxiv.org/pdf/2502.13691
Copy Paste: [[2502.13691]] Is This Collection Worth My LLM's Time? Automatically Measuring Information Potential in Text Corpora(https://arxiv.org/abs/2502.13691)
Keywords: language model, llm
Abstract: As large language models (LLMs) converge towards similar capabilities, the key to advancing their performance lies in identifying and incorporating valuable new information sources. However, evaluating which text collections are worth the substantial investment required for digitization, preprocessing, and integration into LLM systems remains a significant challenge. We present a novel approach to this challenge: an automated pipeline that evaluates the potential information gain from text collections without requiring model training or fine-tuning. Our method generates multiple choice questions (MCQs) from texts and measures an LLM's performance both with and without access to the source material. The performance gap between these conditions serves as a proxy for the collection's information potential. We validate our approach using three strategically selected datasets: EPFL PhD manuscripts (likely containing novel specialized knowledge), Wikipedia articles (presumably part of training data), and a synthetic baseline dataset. Our results demonstrate that this method effectively identifies collections containing valuable novel information, providing a practical tool for prioritizing data acquisition and integration efforts.
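
The proxy described above is the difference between an LLM's multiple-choice accuracy with and without access to the source text. A minimal sketch of that gap follows, assuming answers are recorded as option letters; the function names and sample answers are hypothetical, and the step that generates the MCQs is not shown.

```python
def accuracy(answers, answer_key):
    """Fraction of multiple-choice answers that match the key."""
    if len(answers) != len(answer_key):
        raise ValueError("answers and answer_key must have the same length")
    correct = sum(a == k for a, k in zip(answers, answer_key))
    return correct / len(answer_key)


def information_potential(answers_without_source, answers_with_source, answer_key):
    """Accuracy gained when the model can read the source material.

    A large positive gap suggests the collection holds information the model
    does not already have; a gap near zero suggests it is already known.
    """
    return (accuracy(answers_with_source, answer_key)
            - accuracy(answers_without_source, answer_key))


# Hypothetical results on four generated questions.
key = ["A", "C", "B", "D"]
closed_book = ["A", "B", "B", "A"]   # 2 of 4 correct
open_book = ["A", "C", "B", "A"]     # 3 of 4 correct
print(information_potential(closed_book, open_book, key))
```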

Title: Direct Value Optimization: Improving Chain-of-Thought Reasoning in LLMs with Refined Values

Authors: Hongbo Zhang, Han Cui, Guangsheng Bao, Linyi Yang, Jun Wang, Yue Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.13723
Pdf URL: https://arxiv.org/pdf/2502.13723
Copy Paste: [[2502.13723]] Direct Value Optimization: Improving Chain-of-Thought Reasoning in LLMs with Refined Values(https://arxiv.org/abs/2502.13723)
Keywords: language model, llm, chain-of-thought
Abstract: We introduce Direct Value Optimization (DVO), an innovative reinforcement learning framework for enhancing large language models in complex reasoning tasks. Unlike traditional methods relying on preference labels, DVO utilizes value signals at individual reasoning steps, optimizing models via a mean squared error loss. The key benefit of DVO lies in its fine-grained supervision, circumventing the need for labor-intensive human annotations. Target values within the DVO are estimated using either Monte Carlo Tree Search or an outcome value model. Our empirical analysis on both mathematical and commonsense reasoning tasks shows that DVO consistently outperforms existing offline preference optimization techniques, even with fewer training steps. These findings underscore the importance of value signals in advancing reasoning capabilities and highlight DVO as a superior methodology under scenarios lacking explicit human preference information.
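
The core of the objective described above is a mean squared error between the value assigned to each reasoning step and a target value estimated by tree search or an outcome value model. The sketch below shows only that loss term. How DVO derives per-step values from the policy is not stated in the abstract, so the inputs here are plain numbers and all names are hypothetical.

```python
def stepwise_value_loss(predicted_values, target_values):
    """Mean squared error between per-step values and their targets.

    `predicted_values[i]` is the value assigned to reasoning step i and
    `target_values[i]` is its estimated target (for example, from Monte
    Carlo Tree Search rollouts).
    """
    if len(predicted_values) != len(target_values):
        raise ValueError("one target value is required per reasoning step")
    if not predicted_values:
        raise ValueError("at least one reasoning step is required")
    squared_errors = [(p - t) ** 2 for p, t in zip(predicted_values, target_values)]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical three-step solution whose values are each off by 0.1.
print(stepwise_value_loss([0.5, 0.7, 0.9], [0.4, 0.8, 1.0]))
```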

Title: Adapting Large Language Models for Time Series Modeling via a Novel Parameter-efficient Adaptation Method

Authors: Juyuan Zhang, Wei Zhu, Jiechao Gao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.13725
Pdf URL: https://arxiv.org/pdf/2502.13725
Copy Paste: [[2502.13725]] Adapting Large Language Models for Time Series Modeling via a Novel Parameter-efficient Adaptation Method(https://arxiv.org/abs/2502.13725)
Keywords: language model, llm, prompt
Abstract: Time series modeling is important in many real-world applications and has been extensively studied. While pre-trained foundation models have made impressive strides in the fields of natural language processing (NLP) and computer vision (CV), their development in time series domains has been constrained by data sparsity. A series of recent studies have demonstrated that large language models (LLMs) possess robust pattern recognition and reasoning abilities over complex sequences of tokens. However, the current literature has yet to strike a good balance between (a) effectively aligning the time series and natural language modalities, and (b) maintaining inference efficiency. To address these issues, we propose the Time-LlaMA framework. Time-LlaMA first converts the time series input into token embeddings through a linear tokenization mechanism. Second, the time series token embeddings are aligned with the text prompts. Third, to further adapt the LLM backbone for time series modeling, we have developed a dynamic low-rank adaptation technique (D-LoRA). D-LoRA dynamically chooses the most suitable LoRA modules at each layer of the Transformer backbone for each time series input, enhancing the model's predictive capabilities. Our experimental results on an extensive collection of challenging real-world time series tasks confirm that our proposed method achieves state-of-the-art (SOTA) performance.

Title: Enhancing Input-Label Mapping in In-Context Learning with Contrastive Decoding

Authors: Keqin Peng, Liang Ding, Yuanxin Ouyang, Meng Fang, Yancheng Yuan, Dacheng Tao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.13738
Pdf URL: https://arxiv.org/pdf/2502.13738
Copy Paste: [[2502.13738]] Enhancing Input-Label Mapping in In-Context Learning with Contrastive Decoding(https://arxiv.org/abs/2502.13738)
Keywords: language model, llm
Abstract: Large language models (LLMs) excel at a range of tasks through in-context learning (ICL), where only a few task examples guide their predictions. However, prior research highlights that LLMs often overlook input-label mapping information in ICL, relying more on their pre-trained knowledge. To address this issue, we introduce In-Context Contrastive Decoding (ICCD), a novel method that emphasizes input-label mapping by contrasting the output distributions between positive and negative in-context examples. Experiments on 7 natural language understanding (NLU) tasks show that our ICCD method brings consistent and significant improvements (up to +2.1 on average) across LLMs at 6 different scales without requiring additional training. Our approach is versatile and enhances performance with various demonstration selection methods, which demonstrates its broad applicability and effectiveness. The code and scripts will be publicly released.
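The contrast between output distributions can be illustrated with a small sketch. The abstract does not give the combination rule, so the form used here, `pos + alpha * (pos - neg)`, is an assumption borrowed from common contrastive decoding setups; the logits, the labels, and `alpha` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_logits(pos_logits, neg_logits, alpha=1.0):
    """Boost what the positive demonstrations add relative to the negative ones (assumed rule)."""
    return [p + alpha * (p - n) for p, n in zip(pos_logits, neg_logits)]


labels = ["A", "B"]
pos_logits = [2.0, 1.5]  # logits given correct input-label demonstrations
neg_logits = [2.0, 0.5]  # logits given demonstrations with a broken input-label mapping

adjusted = contrastive_logits(pos_logits, neg_logits)
probs = softmax(adjusted)
prediction = labels[probs.index(max(probs))]
print(adjusted, prediction)
```

In this toy case the prior favours label A under both contexts, so plain decoding would pick A; the contrast isolates the evidence contributed by the correct mapping and flips the prediction to B.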

Title: SCALAR: Scientific Citation-based Live Assessment of Long-context Academic Reasoning

Authors: Renxi Wang, Honglin Mu, Liqun Ma, Lizhi Lin, Yunlong Feng, Timothy Baldwin, Xudong Han, Haonan Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.13753
Pdf URL: https://arxiv.org/pdf/2502.13753
Copy Paste: [[2502.13753]] SCALAR: Scientific Citation-based Live Assessment of Long-context Academic Reasoning(https://arxiv.org/abs/2502.13753)
Keywords: language model, llm
Abstract: Evaluating large language models' (LLMs) long-context understanding capabilities remains challenging. We present SCALAR (Scientific Citation-based Live Assessment of Long-context Academic Reasoning), a novel benchmark that leverages academic papers and their citation networks. SCALAR features automatic generation of high-quality ground truth labels without human annotation, controllable difficulty levels, and a dynamic updating mechanism that prevents data contamination. Using ICLR 2025 papers, we evaluate 8 state-of-the-art LLMs, revealing key insights about their capabilities and limitations in processing long scientific documents across different context lengths and reasoning types. Our benchmark provides a reliable and sustainable way to track progress in long-context understanding as LLM capabilities evolve.

Title: GIMMICK -- Globally Inclusive Multimodal Multitask Cultural Knowledge Benchmarking

Authors: Florian Schneider, Carolin Holtermann, Chris Biemann, Anne Lauscher
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.13766
Pdf URL: https://arxiv.org/pdf/2502.13766
Copy Paste: [[2502.13766]] GIMMICK -- Globally Inclusive Multimodal Multitask Cultural Knowledge Benchmarking(https://arxiv.org/abs/2502.13766)
Keywords: language model, llm
Abstract: Large Vision-Language Models (LVLMs) have recently gained attention due to their distinctive performance and broad applicability. While it has been previously shown that their efficacy in usage scenarios involving non-Western contexts falls short, existing studies are limited in scope, covering just a narrow range of cultures, focusing exclusively on a small number of cultural aspects, or evaluating a limited selection of models on a single task only. Towards globally inclusive LVLM research, we introduce GIMMICK, an extensive multimodal benchmark designed to assess a broad spectrum of cultural knowledge across 144 countries representing six global macro-regions. GIMMICK comprises six tasks built upon three new datasets that span 728 unique cultural events or facets on which we evaluated 20 LVLMs and 11 LLMs, including five proprietary and 26 open-weight models of all sizes. We systematically examine (1) regional cultural biases, (2) the influence of model size, (3) input modalities, and (4) external cues. Our analyses reveal strong biases toward Western cultures across models and tasks and highlight strong correlations between model size and performance, as well as the effectiveness of multimodal input and external geographic cues. We further find that models have more knowledge of tangible than intangible aspects (e.g., food vs. rituals) and that they excel in recognizing broad cultural origins but struggle with a more nuanced understanding.

Title: VITAL: A New Dataset for Benchmarking Pluralistic Alignment in Healthcare

Authors: Anudeex Shetty, Amin Beheshti, Mark Dras, Usman Naseem
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.13775
Pdf URL: https://arxiv.org/pdf/2502.13775
Copy Paste: [[2502.13775]] VITAL: A New Dataset for Benchmarking Pluralistic Alignment in Healthcare(https://arxiv.org/abs/2502.13775)
Keywords: language model, llm
Abstract: Alignment techniques have become central to ensuring that Large Language Models (LLMs) generate outputs consistent with human values. However, existing alignment paradigms often model an averaged or monolithic preference, failing to account for the diversity of perspectives across cultures, demographics, and communities. This limitation is particularly critical in health-related scenarios, where plurality is essential due to the influence of culture, religion, personal values, and conflicting opinions. Despite progress in pluralistic alignment, no prior work has focused on health, likely due to the lack of publicly available datasets. To address this gap, we introduce VITAL, a new benchmark dataset comprising 13.1K value-laden situations and 5.4K multiple-choice questions focused on health, designed to assess and benchmark pluralistic alignment methodologies. Through extensive evaluation of eight LLMs of varying sizes, we demonstrate that existing pluralistic alignment techniques fall short in effectively accommodating diverse healthcare beliefs, underscoring the need for tailored AI alignment in specific domains. This work highlights the limitations of current approaches and lays the groundwork for developing health-specific alignment solutions.

Title: EHOP: A Dataset of Everyday NP-Hard Optimization Problems

Authors: Alex Duchnowski, Ellie Pavlick, Alexander Koller
Subjects: cs.CL, cs.CC
Abstract URL: https://arxiv.org/abs/2502.13776
Pdf URL: https://arxiv.org/pdf/2502.13776
Copy Paste: [[2502.13776]] EHOP: A Dataset of Everyday NP-Hard Optimization Problems(https://arxiv.org/abs/2502.13776)
Keywords: llm, prompt
Abstract: We introduce the dataset of Everyday Hard Optimization Problems (EHOP), a collection of NP-hard optimization problems expressed in natural language. EHOP includes problem formulations that could be found in computer science textbooks, versions that are dressed up as problems that could arise in real life, and variants of well-known problems with inverted rules. We find that state-of-the-art LLMs, across multiple prompting strategies, systematically solve textbook problems more accurately than their real-life and inverted counterparts. We argue that this constitutes evidence that LLMs adapt solutions seen during training, rather than leveraging reasoning abilities that would enable them to generalize to novel problems.

Title: Translation in the Hands of Many: Centering Lay Users in Machine Translation Interactions

Authors: Beatrice Savoldi, Alan Ramponi, Matteo Negri, Luisa Bentivogli
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2502.13780
Pdf URL: https://arxiv.org/pdf/2502.13780
Copy Paste: [[2502.13780]] Translation in the Hands of Many: Centering Lay Users in Machine Translation Interactions(https://arxiv.org/abs/2502.13780)
Keywords: language model, llm
Abstract: Converging societal and technical factors have transformed language technologies into user-facing applications employed across languages. Machine Translation (MT) has become a global tool, with cross-lingual services now also supported by dialogue systems powered by multilingual Large Language Models (LLMs). This accessibility has expanded MT's reach to a vast base of lay users, often with little to no expertise in the languages or the technology itself. Despite this, the understanding of MT consumed by this diverse group of users -- their needs, experiences, and interactions with these systems -- remains limited. This paper traces the shift in MT user profiles, focusing on non-expert users and how their engagement with these systems may change with LLMs. We identify three key factors -- usability, trust, and literacy -- that shape these interactions and must be addressed to align MT with user needs. By exploring these dimensions, we offer insights to guide future MT with a user-centered approach.

Title: From Tools to Teammates: Evaluating LLMs in Multi-Session Coding Interactions

Authors: Nathanaël Carraz Rakotonirina, Mohammed Hamdy, Jon Ander Campos, Lucas Weber, Alberto Testoni, Marzieh Fadaee, Sandro Pezzelle, Marco Del Tredici
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.13791
Pdf URL: https://arxiv.org/pdf/2502.13791
Copy Paste: [[2502.13791]] From Tools to Teammates: Evaluating LLMs in Multi-Session Coding Interactions(https://arxiv.org/abs/2502.13791)
Keywords: language model, gpt, llm
Abstract: Large Language Models (LLMs) are increasingly used in working environments for a wide range of tasks, excelling at solving individual problems in isolation. However, are they also able to effectively collaborate over long-term interactions? To investigate this, we introduce MemoryCode, a synthetic multi-session dataset designed to test LLMs' ability to track and execute simple coding instructions amid irrelevant information, simulating a realistic setting. While all the models we tested handle isolated instructions well, even the performance of state-of-the-art models like GPT-4o deteriorates when instructions are spread across sessions. Our analysis suggests this is due to their failure to retrieve and integrate information over long instruction chains. Our results highlight a fundamental limitation of current LLMs, restricting their ability to collaborate effectively in long interactions.

Title: Inner Thinking Transformer: Leveraging Dynamic Depth Scaling to Foster Adaptive Internal Thinking

Authors: Yilong Chen, Junyuan Shang, Zhenyu Zhang, Yanxi Xie, Jiawei Sheng, Tingwen Liu, Shuohuan Wang, Yu Sun, Hua Wu, Haifeng Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.13842
Pdf URL: https://arxiv.org/pdf/2502.13842
Copy Paste: [[2502.13842]] Inner Thinking Transformer: Leveraging Dynamic Depth Scaling to Foster Adaptive Internal Thinking(https://arxiv.org/abs/2502.13842)
Keywords: language model, llm
Abstract: Large language models (LLMs) face inherent performance bottlenecks under parameter constraints, particularly in processing critical tokens that demand complex reasoning. Empirical analysis reveals that challenging tokens induce abrupt gradient spikes across layers, exposing architectural stress points in standard Transformers. Building on this insight, we propose Inner Thinking Transformer (ITT), which reimagines layer computations as implicit thinking steps. ITT dynamically allocates computation through Adaptive Token Routing, iteratively refines representations via Residual Thinking Connections, and distinguishes reasoning phases using Thinking Step Encoding. ITT enables deeper processing of critical tokens without parameter expansion. Evaluations across 162M-466M parameter models show ITT achieves 96.5% performance of a 466M Transformer using only 162M parameters, reduces training data by 43.2%, and outperforms Transformer/Loop variants in 11 benchmarks. By enabling elastic computation allocation during inference, ITT balances performance and efficiency through architecture-aware optimization of implicit thinking pathways.

Title: DH-RAG: A Dynamic Historical Context-Powered Retrieval-Augmented Generation Method for Multi-Turn Dialogue

Authors: Feiyuan Zhang, Dezhi Zhu, James Ming, Yilun Jin, Di Chai, Liu Yang, Han Tian, Zhaoxin Fan, Kai Chen
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.13847
Pdf URL: https://arxiv.org/pdf/2502.13847
Copy Paste: [[2502.13847]] DH-RAG: A Dynamic Historical Context-Powered Retrieval-Augmented Generation Method for Multi-Turn Dialogue(https://arxiv.org/abs/2502.13847)
Keywords: retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) systems have shown substantial benefits in applications such as question answering and multi-turn dialogue \citep{lewis2020retrieval}. However, traditional RAG methods, while leveraging static knowledge bases, often overlook the potential of dynamic historical information in ongoing conversations. To bridge this gap, we introduce DH-RAG, a Dynamic Historical Context-Powered Retrieval-Augmented Generation Method for Multi-Turn Dialogue. DH-RAG is inspired by human cognitive processes that utilize both long-term memory and immediate historical context in conversational responses \citep{stafford1987conversational}. DH-RAG is structured around two principal components: a History-Learning based Query Reconstruction Module, designed to generate effective queries by synthesizing current and prior interactions, and a Dynamic History Information Updating Module, which continually refreshes historical context throughout the dialogue. The center of DH-RAG is a Dynamic Historical Information database, which is further refined by three strategies within the Query Reconstruction Module: Historical Query Clustering, Hierarchical Matching, and Chain of Thought Tracking. Experimental evaluations show that DH-RAG significantly surpasses conventional models on several benchmarks, enhancing response relevance, coherence, and dialogue quality.

Title: DataSciBench: An LLM Agent Benchmark for Data Science

Authors: Dan Zhang, Sining Zhoubian, Min Cai, Fengzu Li, Lekang Yang, Wei Wang, Tianjiao Dong, Ziniu Hu, Jie Tang, Yisong Yue
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.13897
Pdf URL: https://arxiv.org/pdf/2502.13897
Copy Paste: [[2502.13897]] DataSciBench: An LLM Agent Benchmark for Data Science(https://arxiv.org/abs/2502.13897)
Keywords: language model, llm, prompt, agent
Abstract: This paper presents DataSciBench, a comprehensive benchmark for evaluating Large Language Model (LLM) capabilities in data science. Recent related benchmarks have primarily focused on single tasks, easily obtainable ground truth, and straightforward evaluation metrics, which limits the scope of tasks that can be evaluated. In contrast, DataSciBench is constructed based on a more comprehensive and curated collection of natural and challenging prompts for uncertain ground truth and evaluation metrics. We develop a semi-automated pipeline for generating ground truth (GT) and validating evaluation metrics. This pipeline utilizes and implements an LLM-based self-consistency and human verification strategy to produce accurate GT by leveraging collected prompts, predefined task types, and aggregate functions (metrics). Furthermore, we propose an innovative Task - Function - Code (TFC) framework to assess each code execution outcome based on precisely defined metrics and programmatic rules. Our experimental framework involves testing 6 API-based models, 8 open-source general models, and 9 open-source code generation models using the diverse set of prompts we have gathered. This approach aims to provide a more comprehensive and rigorous evaluation of LLMs in data science, revealing their strengths and weaknesses. Experimental results demonstrate that API-based models outperform open-sourced models on all metrics and Deepseek-Coder-33B-Instruct achieves the highest score among open-sourced models. We release all code and data at this https URL.

Title: How Do LLMs Perform Two-Hop Reasoning in Context?

Authors: Tianyu Guo, Hanlin Zhu, Ruiqi Zhang, Jiantao Jiao, Song Mei, Michael I. Jordan, Stuart Russell
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.13913
Pdf URL: https://arxiv.org/pdf/2502.13913
Copy Paste: [[2502.13913]] How Do LLMs Perform Two-Hop Reasoning in Context?(https://arxiv.org/abs/2502.13913)
Keywords: language model, llm
Abstract: "Socrates is human. All humans are mortal. Therefore, Socrates is mortal." This classical example demonstrates two-hop reasoning, where a conclusion logically follows from two connected premises. While transformer-based Large Language Models (LLMs) can perform two-hop reasoning, they tend to collapse to random guessing when faced with distracting premises. To understand the underlying mechanism, we train a three-layer transformer on synthetic two-hop reasoning tasks. The training dynamics show two stages: a slow learning phase, where the 3-layer transformer performs random guessing like LLMs, followed by an abrupt phase transition, where the 3-layer transformer suddenly reaches 100% accuracy. Through reverse engineering, we explain the inner mechanisms for how models learn to randomly guess between distractions initially, and how they learn to ignore distractions eventually. We further propose a three-parameter model that supports the causal claims linking the mechanisms to the training dynamics of the transformer. Finally, experiments on LLMs suggest that the discovered mechanisms generalize across scales. Our methodologies provide new perspectives for the scientific understanding of LLMs and our findings provide new insights into how reasoning emerges during training.
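A synthetic two-hop task with distracting premises can be sketched in a few lines. The abstract does not specify the format of the authors' tasks, so everything here is an assumption for illustration: entities are single letters, each premise is a pair `(x, y)` read as "x is y", and a second, unrelated chain serves as the distraction.

```python
import random


def make_two_hop_example(rng, entities):
    """Build a target chain a -> b -> c plus a distractor chain d -> e -> f."""
    a, b, c, d, e, f = rng.sample(entities, 6)
    premises = [(a, b), (b, c), (d, e), (e, f)]
    rng.shuffle(premises)  # hide which premises belong to the target chain
    return premises, a, c


def solve_two_hop(premises, query):
    """Reference solver: follow two links starting from the query entity."""
    link = dict(premises)
    return link[link[query]]


rng = random.Random(0)
premises, query, answer = make_two_hop_example(rng, list("ABCDEFGH"))
print(premises, query, answer, solve_two_hop(premises, query))
```

A model that cannot tell the two chains apart would pick between the end points `c` and `f` at chance, which corresponds to the random-guessing regime the abstract describes.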

Title: TESS 2: A Large-Scale Generalist Diffusion Language Model

Authors: Jaesung Tae, Hamish Ivison, Sachin Kumar, Arman Cohan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.13917
Pdf URL: https://arxiv.org/pdf/2502.13917
Copy Paste: [[2502.13917]] TESS 2: A Large-Scale Generalist Diffusion Language Model(https://arxiv.org/abs/2502.13917)
Keywords: language model
Abstract: We introduce TESS 2, a general instruction-following diffusion language model that outperforms contemporary instruction-tuned diffusion models, and matches and sometimes exceeds strong autoregressive (AR) models. We train TESS 2 by first adapting a strong AR model via continued pretraining with the usual cross-entropy as diffusion loss, and then performing further instruction tuning. We find that adaptation training as well as the choice of the base model are crucial for training good instruction-following diffusion models. We further propose reward guidance, a novel and modular inference-time guidance procedure to align model outputs without needing to train the underlying model. Finally, we show that TESS 2 further improves with increased inference-time compute, highlighting the utility of diffusion LMs in having fine-grained controllability over the amount of compute used at inference time. Code and models are available at this https URL.

Title: LongPO: Long Context Self-Evolution of Large Language Models through Short-to-Long Preference Optimization

Authors: Guanzheng Chen, Xin Li, Michael Qizhe Shieh, Lidong Bing
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2502.13922
Pdf URL: https://arxiv.org/pdf/2502.13922
Copy Paste: [[2502.13922]] LongPO: Long Context Self-Evolution of Large Language Models through Short-to-Long Preference Optimization(https://arxiv.org/abs/2502.13922)
Keywords: language model, gpt, llm, long context
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities through pretraining and alignment. However, superior short-context LLMs may underperform in long-context scenarios due to insufficient long-context alignment. This alignment process remains challenging due to the impracticality of human annotation for extended contexts and the difficulty in balancing short- and long-context performance. To address these challenges, we introduce LongPO, which enables short-context LLMs to self-evolve to excel on long-context tasks by internally transferring short-context capabilities. LongPO harnesses LLMs to learn from self-generated short-to-long preference data, comprising paired responses generated for identical instructions with long-context inputs and their compressed short-context counterparts, respectively. This preference reveals capabilities and potentials of LLMs cultivated during short-context alignment that may be diminished in under-aligned long-context scenarios. Additionally, LongPO incorporates a short-to-long KL constraint to mitigate short-context performance decline during long-context alignment. When applied to Mistral-7B-Instruct-v0.2 from 128K to 512K context lengths, LongPO fully retains short-context performance and largely outperforms naive SFT and DPO in both long- and short-context tasks. Specifically, LongPO-trained models can achieve results on long-context benchmarks comparable to, or even surpassing, those of superior LLMs (e.g., GPT-4-128K) that involve extensive long-context annotation and larger parameter scales.

Title: Beyond Single Frames: Can LMMs Comprehend Temporal and Contextual Narratives in Image Sequences?

Authors: Xiaochen Wang, Heming Xia, Jialin Song, Longyu Guan, Yixin Yang, Qingxiu Dong, Weiyao Luo, Yifan Pu, Yiru Wang, Xiangdi Meng, Wenjie Li, Zhifang Sui
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.13925
Pdf URL: https://arxiv.org/pdf/2502.13925
Copy Paste: [[2502.13925]] Beyond Single Frames: Can LMMs Comprehend Temporal and Contextual Narratives in Image Sequences?(https://arxiv.org/abs/2502.13925)
Keywords: gpt, llm
Abstract: Large Multimodal Models (LMMs) have achieved remarkable success across various visual-language tasks. However, existing benchmarks predominantly focus on single-image understanding, leaving the analysis of image sequences largely unexplored. To address this limitation, we introduce StripCipher, a comprehensive benchmark designed to evaluate capabilities of LMMs to comprehend and reason over sequential images. StripCipher comprises a human-annotated dataset and three challenging subtasks: visual narrative comprehension, contextual frame prediction, and temporal narrative reordering. Our evaluation of 16 state-of-the-art LMMs, including GPT-4o and Qwen2.5VL, reveals a significant performance gap compared to human capabilities, particularly in tasks that require reordering shuffled sequential images. For instance, GPT-4o achieves only 23.93% accuracy in the reordering subtask, which is 56.07% lower than human performance. Further quantitative analysis discusses several factors, such as the input format of images, that affect the performance of LLMs in sequential understanding, underscoring the fundamental challenges that remain in the development of LMMs.

Title: Why Safeguarded Ships Run Aground? Aligned Large Language Models' Safety Mechanisms Tend to Be Anchored in The Template Region

Authors: Chak Tou Leong, Qingyu Yin, Jian Wang, Wenjie Li
Subjects: cs.CL, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2502.13946
Pdf URL: https://arxiv.org/pdf/2502.13946
Copy Paste: [[2502.13946]] Why Safeguarded Ships Run Aground? Aligned Large Language Models' Safety Mechanisms Tend to Be Anchored in The Template Region(https://arxiv.org/abs/2502.13946)
Keywords: language model, llm
Abstract: The safety alignment of large language models (LLMs) remains vulnerable, as their initial behavior can be easily jailbroken by even relatively simple attacks. Since infilling a fixed template between the input instruction and initial model output is a common practice for existing LLMs, we hypothesize that this template is a key factor behind their vulnerabilities: LLMs' safety-related decision-making overly relies on the aggregated information from the template region, which largely influences these models' safety behavior. We refer to this issue as template-anchored safety alignment. In this paper, we conduct extensive experiments and verify that template-anchored safety alignment is widespread across various aligned LLMs. Our mechanistic analyses demonstrate how it leads to models' susceptibility when encountering inference-time jailbreak attacks. Furthermore, we show that detaching safety mechanisms from the template region is promising in mitigating vulnerabilities to jailbreak attacks. We encourage future research to develop more robust safety alignment techniques that reduce reliance on the template region.
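Summary: The safety alignment of large language models (LLMs) remains fragile, since even relatively simple attacks can jailbreak their initial behavior. Because infilling a fixed template between the input instruction and the initial model output is common practice for existing LLMs, we hypothesize that this template is a key factor behind their vulnerabilities: LLMs' safety-related decision-making relies too heavily on the aggregated information from the template region, which largely influences these models' safety behavior. We call this issue template-anchored safety alignment. In this paper, we conduct extensive experiments and verify that template-anchored safety alignment is widespread across various aligned LLMs. Our mechanistic analyses show how it makes models susceptible to inference-time jailbreak attacks. We also show that detaching safety mechanisms from the template region is a promising way to mitigate vulnerability to jailbreak attacks. We encourage future research to develop more robust safety alignment techniques that reduce reliance on the template region.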

Title: RAG-Gym: Optimizing Reasoning and Search Agents with Process Supervision

Authors: Guangzhi Xiong, Qiao Jin, Xiao Wang, Yin Fang, Haolin Liu, Yifan Yang, Fangyuan Chen, Zhixing Song, Dengyu Wang, Minjia Zhang, Zhiyong Lu, Aidong Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.13957
Pdf URL: https://arxiv.org/pdf/2502.13957
Copy Paste: [[2502.13957]] RAG-Gym: Optimizing Reasoning and Search Agents with Process Supervision(https://arxiv.org/abs/2502.13957)
Keywords: llm, prompt, retrieval-augmented generation, agent
Abstract: Retrieval-augmented generation (RAG) has shown great potential for knowledge-intensive tasks, but its traditional architectures rely on static retrieval, limiting their effectiveness for complex questions that require sequential information-seeking. While agentic reasoning and search offer a more adaptive approach, most existing methods depend heavily on prompt engineering. In this work, we introduce RAG-Gym, a unified optimization framework that enhances information-seeking agents through fine-grained process supervision at each search step. We also propose ReSearch, a novel agent architecture that synergizes answer reasoning and search query generation within the RAG-Gym framework. Experiments on four challenging datasets show that RAG-Gym improves performance by up to 25.6% across various agent architectures, with ReSearch consistently outperforming existing baselines. Further analysis highlights the effectiveness of advanced LLMs as process reward judges and the transferability of trained reward models as verifiers for different LLMs. Additionally, we examine the scaling properties of training and inference in agentic RAG. The project homepage is available at this https URL.
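Summary: Retrieval-augmented generation (RAG) has shown great potential for knowledge-intensive tasks, but its traditional architectures rely on static retrieval, which limits their effectiveness on complex questions that require sequential information-seeking. Although agentic reasoning and search offer a more adaptive approach, most existing methods depend heavily on prompt engineering. In this work, we introduce RAG-Gym, a unified optimization framework that enhances information-seeking agents through fine-grained process supervision at each search step. We also propose ReSearch, a novel agent architecture that synergizes answer reasoning and search query generation within the RAG-Gym framework. Experiments on four challenging datasets show that RAG-Gym improves performance by up to 25.6% across various agent architectures, with ReSearch consistently outperforming existing baselines. Further analysis highlights the effectiveness of advanced LLMs as process reward judges and the transferability of trained reward models as verifiers for different LLMs. We also examine the scaling properties of training and inference in agentic RAG. The project homepage is available at this https URL.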

Title: LIDDiA: Language-based Intelligent Drug Discovery Agent

Authors: Reza Averly, Frazier N. Baker, Xia Ning
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.13959
Pdf URL: https://arxiv.org/pdf/2502.13959
Copy Paste: [[2502.13959]] LIDDiA: Language-based Intelligent Drug Discovery Agent(https://arxiv.org/abs/2502.13959)
Keywords: language model, agent
Abstract: Drug discovery is a long, expensive, and complex process, relying heavily on human medicinal chemists, who can spend years searching the vast space of potential therapies. Recent advances in artificial intelligence for chemistry have sought to expedite individual drug discovery tasks; however, there remains a critical need for an intelligent agent that can navigate the drug discovery process. Towards this end, we introduce LIDDiA, an autonomous agent capable of intelligently navigating the drug discovery process in silico. By leveraging the reasoning capabilities of large language models, LIDDiA serves as a low-cost and highly-adaptable tool for autonomous drug discovery. We comprehensively examine LIDDiA, demonstrating that (1) it can generate molecules meeting key pharmaceutical criteria on over 70% of 30 clinically relevant targets, (2) it intelligently balances exploration and exploitation in the chemical space, and (3) it can identify promising novel drug candidates on EGFR, a critical target for cancers.
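Summary: Drug discovery is a long, expensive, and complex process that relies heavily on human medicinal chemists, who can spend years searching the vast space of potential therapies. Recent advances in artificial intelligence for chemistry have sought to expedite individual drug discovery tasks; however, there remains a critical need for an intelligent agent that can navigate the drug discovery process. To this end, we introduce LIDDiA, an autonomous agent capable of intelligently navigating the drug discovery process in silico. By leveraging the reasoning capabilities of large language models, LIDDiA serves as a low-cost and highly adaptable tool for autonomous drug discovery. We comprehensively examine LIDDiA, demonstrating that (1) it can generate molecules meeting key pharmaceutical criteria on over 70% of 30 clinically relevant targets, (2) it intelligently balances exploration and exploitation in the chemical space, and (3) it can identify promising novel drug candidates on EGFR, a critical target for cancers.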

Title: Is That Your Final Answer? Test-Time Scaling Improves Selective Question Answering

Authors: William Jurayj, Jeffrey Cheng, Benjamin Van Durme
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.13962
Pdf URL: https://arxiv.org/pdf/2502.13962
Copy Paste: [[2502.13962]] Is That Your Final Answer? Test-Time Scaling Improves Selective Question Answering(https://arxiv.org/abs/2502.13962)
Keywords: language model
Abstract: Scaling the test-time compute of large language models has demonstrated impressive performance on reasoning benchmarks. However, existing evaluations of test-time scaling make the strong assumption that a reasoning system should always give an answer to any question provided. This overlooks concerns about whether a model is confident in its answer, and whether it is appropriate to always provide a response. To address these concerns, we extract confidence scores during reasoning for thresholding model responses. We find that increasing compute budget at inference time not only helps models answer more questions correctly, but also increases confidence in correct responses. We then extend the current paradigm of zero-risk responses during evaluation by considering settings with non-zero levels of response risk, and suggest a recipe for reporting evaluations under these settings.
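Summary: Scaling the test-time compute of large language models has demonstrated impressive performance on reasoning benchmarks. However, existing evaluations of test-time scaling make the strong assumption that a reasoning system should always give an answer to any question provided. This overlooks concerns about whether a model is confident in its answer, and whether it is appropriate to always provide a response. To address these concerns, we extract confidence scores during reasoning and use them to threshold model responses. We find that increasing the compute budget at inference time not only helps models answer more questions correctly, but also increases confidence in correct responses. We then extend the current paradigm of zero-risk responses during evaluation by considering settings with non-zero levels of response risk, and suggest a recipe for reporting evaluations under these settings.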

Title: MuDAF: Long-Context Multi-Document Attention Focusing through Contrastive Learning on Attention Heads

Authors: Weihao Liu, Ning Wu, Shiping Yang, Wenbiao Ding, Shining Liang, Ming Gong, Dongmei Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.13963
Pdf URL: https://arxiv.org/pdf/2502.13963
Copy Paste: [[2502.13963]] MuDAF: Long-Context Multi-Document Attention Focusing through Contrastive Learning on Attention Heads(https://arxiv.org/abs/2502.13963)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) frequently show distracted attention due to irrelevant information in the input, which severely impairs their long-context capabilities. Inspired by recent studies on the effectiveness of retrieval heads in long-context factuality, we aim at addressing this distraction issue through improving such retrieval heads directly. We propose Multi-Document Attention Focusing (MuDAF), a novel method that explicitly optimizes the attention distribution at the head level through contrastive learning. According to the experimental results, MuDAF can significantly improve the long-context question answering performance of LLMs, especially in multi-document question answering. Extensive evaluations on retrieval scores and attention visualizations show that MuDAF possesses great potential in making attention heads more focused on relevant information and reducing attention distractions.
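Summary: Large language models (LLMs) frequently show distracted attention due to irrelevant information in the input, which severely impairs their long-context capabilities. Inspired by recent studies on the effectiveness of retrieval heads in long-context factuality, we aim to address this distraction issue by improving such retrieval heads directly. We propose Multi-Document Attention Focusing (MuDAF), a novel method that explicitly optimizes the attention distribution at the head level through contrastive learning. According to the experimental results, MuDAF can significantly improve the long-context question answering performance of LLMs, especially in multi-document question answering. Extensive evaluations of retrieval scores and attention visualizations show that MuDAF has great potential to make attention heads more focused on relevant information and to reduce attention distractions.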