2025-03-18

Title: Explainable Sentiment Analysis with DeepSeek-R1: Performance, Efficiency, and Few-Shot Learning

Authors: Donghao Huang, Zhaoxia Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.11655
Pdf URL: https://arxiv.org/pdf/2503.11655
Copy Paste: [[2503.11655]] Explainable Sentiment Analysis with DeepSeek-R1: Performance, Efficiency, and Few-Shot Learning(https://arxiv.org/abs/2503.11655)
Keywords: language model, gpt, llm, prompt
Abstract: Recent advancements in large language models (LLMs) have significantly enhanced sentiment analysis capabilities. However, the trade-offs between model performance, efficiency, and explainability of some latest models remain underexplored. This study presents the first comprehensive evaluation of the DeepSeek-R1 series of models, reasoning open-source LLMs, for sentiment analysis, comparing them against OpenAI's GPT-4 and GPT-4-mini. We systematically analyze their performance under few-shot prompting conditions, scaling up to 50-shot configurations to assess in-context learning effectiveness. Our experiments reveal that DeepSeek-R1 demonstrates competitive accuracy, particularly in multi-class sentiment tasks, while offering enhanced interpretability through its detailed reasoning process. Additionally, we highlight the impact of increasing few-shot examples on model performance and discuss key trade-offs between explainability and computational efficiency.
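The k-shot prompting setup this abstract describes can be pictured with a minimal sketch. The prompt wording, the `build_few_shot_prompt` helper, and the demonstration reviews are all hypothetical and are not taken from the paper; the sketch only shows how the number of in-context examples is varied.

```python
from typing import List, Tuple


def build_few_shot_prompt(examples: List[Tuple[str, str]], query: str, k: int) -> str:
    """Assemble a k-shot sentiment prompt from labeled (text, label) pairs."""
    if k < 0:
        raise ValueError("k must be non-negative")
    shots = examples[:k]  # use the first k demonstrations
    blocks = ["Classify the sentiment of each review."]
    for text, label in shots:
        blocks.append(f"Review: {text}\nSentiment: {label}")
    # The query is left unlabeled so the model completes the final line.
    blocks.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(blocks)


# Hypothetical demonstrations, for illustration only.
demo = [
    ("The plot was gripping.", "positive"),
    ("I want my money back.", "negative"),
    ("It was fine, nothing special.", "neutral"),
]
prompt = build_few_shot_prompt(demo, "Loved every minute.", k=2)
print(prompt)
```

Raising `k` (the abstract goes up to 50) only lengthens the demonstration section; the query block stays the same.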

Title: TRUTH DECAY: Quantifying Multi-Turn Sycophancy in Language Models

Authors: Joshua Liu, Aarav Jain, Soham Takuri, Srihan Vege, Aslihan Akalin, Kevin Zhu, Sean O'Brien, Vasu Sharma
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.11656
Pdf URL: https://arxiv.org/pdf/2503.11656
Copy Paste: [[2503.11656]] TRUTH DECAY: Quantifying Multi-Turn Sycophancy in Language Models(https://arxiv.org/abs/2503.11656)
Keywords: language model, prompt
Abstract: Rapid improvements in large language models have unveiled a critical challenge in human-AI interaction: sycophancy. In this context, sycophancy refers to the tendency of models to excessively agree with or flatter users, often at the expense of factual accuracy. While previous studies have primarily analyzed this behavior in single-turn interactions, its persistence and evolution in multi-step conversations remain largely unexplored. We introduce TRUTH DECAY, a benchmark specifically designed to evaluate sycophancy in extended dialogues, where language models must navigate iterative user feedback, challenges, and persuasion. We prompt models to elicit four types of sycophantic biases. We then propose and test sycophancy reduction strategies, evaluating their effectiveness beyond single-step interactions.

Title: Automating Mathematical Proof Generation Using Large Language Model Agents and Knowledge Graphs

Authors: Vincent Li, Yule Fu, Tim Knappe, Kevin Han, Kevin Zhu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.11657
Pdf URL: https://arxiv.org/pdf/2503.11657
Copy Paste: [[2503.11657]] Automating Mathematical Proof Generation Using Large Language Model Agents and Knowledge Graphs(https://arxiv.org/abs/2503.11657)
Keywords: language model, llm, agent
Abstract: Large Language Models have demonstrated remarkable capabilities in natural language processing tasks, including mathematical problem-solving that requires multi-step logical reasoning. However, challenges persist in automating the identification of key mathematical concepts, understanding their interrelations, and formalizing proofs within a rigorous framework. We present a novel framework that leverages knowledge graphs to augment LLMs to construct and formalize mathematical proofs. Our results demonstrate significant performance improvements across multiple datasets when using knowledge graphs, achieving up to a 34% success rate on the MUSTARDSAUCE dataset with o1-mini and consistently outperforming baseline approaches by 2-11% across different models. We show how this approach bridges the gap between natural language understanding and formal logic proof systems and achieves elevated results for foundation models over baseline.

Title: LogitLens4LLMs: Extending Logit Lens Analysis to Modern Large Language Models

Authors: Zhenyu Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.11667
Pdf URL: https://arxiv.org/pdf/2503.11667
Copy Paste: [[2503.11667]] LogitLens4LLMs: Extending Logit Lens Analysis to Modern Large Language Models(https://arxiv.org/abs/2503.11667)
Keywords: language model, llm
Abstract: This paper introduces LogitLens4LLMs, a toolkit that extends the Logit Lens technique to modern large language models. While Logit Lens has been a crucial method for understanding internal representations of language models, it was previously limited to earlier model architectures. Our work overcomes the limitations of existing implementations, enabling the technique to be applied to state-of-the-art architectures (such as Qwen-2.5 and Llama-3.1) while automating key analytical workflows. By developing component-specific hooks to capture both attention mechanisms and MLP outputs, our implementation achieves full compatibility with the HuggingFace transformer library while maintaining low inference overhead. The toolkit provides both interactive exploration and batch processing capabilities, supporting large-scale layer-wise analyses. Through open-sourcing our implementation, we aim to facilitate deeper investigations into the internal mechanisms of large-scale language models. The toolkit is openly available at this https URL.
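For readers unfamiliar with the Logit Lens idea, here is a toy sketch of the core step: project each layer's hidden state through the output (unembedding) matrix and read off the most likely token. It is not the LogitLens4LLMs API. The three-word vocabulary, the matrix, and the hidden states are invented, and the final layer normalization that real implementations usually apply before unembedding is omitted.

```python
import math
from typing import List, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logit_lens(hidden_states: List[List[float]],
               unembed: List[List[float]],
               vocab: List[str]) -> List[Tuple[str, float]]:
    """Return the top token and its probability for each layer's hidden vector."""
    readout = []
    for hidden in hidden_states:
        # One logit per vocabulary row: dot product of the row with the hidden vector.
        logits = [sum(w * x for w, x in zip(row, hidden)) for row in unembed]
        probs = softmax(logits)
        best = max(range(len(vocab)), key=probs.__getitem__)
        readout.append((vocab[best], probs[best]))
    return readout


# Hypothetical toy values: 3 "layers", hidden size 2, vocabulary of 3 tokens.
vocab = ["cat", "dog", "fish"]
unembed = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
hidden_states = [[0.1, 0.0], [0.2, 1.5], [0.0, 3.0]]

readout = logit_lens(hidden_states, unembed, vocab)
for layer, (token, prob) in enumerate(readout):
    print(f"layer {layer}: {token} ({prob:.2f})")
```

In this toy setup the prediction switches between the first and second layer and then grows more confident, which is the kind of layer-wise trajectory the technique is meant to expose.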

Title: reWordBench: Benchmarking and Improving the Robustness of Reward Models with Transformed Inputs

Authors: Zhaofeng Wu, Michihiro Yasunaga, Andrew Cohen, Yoon Kim, Asli Celikyilmaz, Marjan Ghazvininejad
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.11751
Pdf URL: https://arxiv.org/pdf/2503.11751
Copy Paste: [[2503.11751]] reWordBench: Benchmarking and Improving the Robustness of Reward Models with Transformed Inputs(https://arxiv.org/abs/2503.11751)
Keywords: chat
Abstract: Reward models have become a staple in modern NLP, serving as not only a scalable text evaluator, but also an indispensable component in many alignment recipes and inference-time algorithms. However, while recent reward models increase performance on standard benchmarks, this may partly be due to overfitting effects, which would confound an understanding of their true capability. In this work, we scrutinize the robustness of reward models and the extent of such overfitting. We build **reWordBench**, which systematically transforms reward model inputs in meaning- or ranking-preserving ways. We show that state-of-the-art reward models suffer from substantial performance degradation even with minor input transformations, sometimes dropping to significantly below-random accuracy, suggesting brittleness. To improve reward model robustness, we propose to explicitly train them to assign similar scores to paraphrases, and find that this approach also improves robustness to other distinct kinds of transformations. For example, our robust reward model reduces such degradation by roughly half for the Chat Hard subset in RewardBench. Furthermore, when used in alignment, our robust reward models demonstrate better utility and lead to higher-quality outputs, winning in up to 59% of instances against a standardly trained RM.
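The robustness check described in this abstract amounts to comparing a reward model's ranking accuracy before and after a meaning-preserving transformation. The sketch below assumes a deliberately brittle, hypothetical `toy_reward` function and uses upper-casing as a stand-in transformation; neither is taken from reWordBench itself.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (chosen response, rejected response)


def ranking_accuracy(reward: Callable[[str], float], pairs: List[Pair]) -> float:
    """Fraction of pairs where the chosen response scores strictly higher."""
    correct = sum(1 for chosen, rejected in pairs if reward(chosen) > reward(rejected))
    return correct / len(pairs)


def transform_pairs(pairs: List[Pair], transform: Callable[[str], str]) -> List[Pair]:
    """Apply the same transformation to both sides, preserving the ranking label."""
    return [(transform(chosen), transform(rejected)) for chosen, rejected in pairs]


def toy_reward(text: str) -> float:
    """Hypothetical brittle scorer: counts case-sensitive polite cue words."""
    cues = ("please", "thanks", "sorry")
    return float(sum(text.count(cue) for cue in cues))


# Invented preference pairs, for illustration only.
pairs = [
    ("thanks for asking, here is the answer", "no"),
    ("sorry, I cannot verify that claim", "whatever"),
]

original_acc = ranking_accuracy(toy_reward, pairs)
upper_acc = ranking_accuracy(toy_reward, transform_pairs(pairs, str.upper))
print(original_acc, upper_acc)
```

Because the toy scorer keys on lowercase surface forms, a change that leaves the meaning intact erases every correct ranking. This mirrors, in exaggerated form, the brittleness the abstract reports.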

Title: Key, Value, Compress: A Systematic Exploration of KV Cache Compression Techniques

Authors: Neusha Javidnia, Bita Darvish Rouhani, Farinaz Koushanfar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.11816
Pdf URL: https://arxiv.org/pdf/2503.11816
Copy Paste: [[2503.11816]] Key, Value, Compress: A Systematic Exploration of KV Cache Compression Techniques(https://arxiv.org/abs/2503.11816)
Keywords: language model, llm
Abstract: Large language models (LLMs) have demonstrated exceptional capabilities in generating text, images, and video content. However, as context length grows, the computational cost of attention increases quadratically with the number of tokens, presenting significant efficiency challenges. This paper presents an analysis of various Key-Value (KV) cache compression strategies, offering a comprehensive taxonomy that categorizes these methods by their underlying principles and implementation techniques. Furthermore, we evaluate their impact on performance and inference latency, providing critical insights into their effectiveness. Our findings highlight the trade-offs involved in KV cache compression and its influence on handling long-context scenarios, paving the way for more efficient LLM implementations.
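A back-of-the-envelope calculation shows why long contexts motivate KV cache compression. The sketch uses the standard sizing formula (keys plus values, per layer, per head, per token). The model dimensions are a hypothetical configuration chosen for round numbers, not figures from the paper.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2, batch_size: int = 1) -> int:
    """Memory for cached keys and values; the leading 2 covers K and V."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value * batch_size


def attention_score_count(seq_len: int) -> int:
    """Query-key scores per head for full self-attention over seq_len tokens."""
    return seq_len * seq_len


# Hypothetical configuration: 32 layers, 32 KV heads, head size 128, 16-bit values.
size = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096)
print(f"{size / 2**30:.1f} GiB of KV cache for one 4096-token sequence")
print(attention_score_count(8192) // attention_score_count(4096), "x more scores at 2x length")
```

The cache grows linearly with the number of tokens, while the count of attention scores grows quadratically; compression methods target the former to make long contexts fit in memory.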

Title: Bridging the LLM Accessibility Divide? Performance, Fairness, and Cost of Closed versus Open LLMs for Automated Essay Scoring

Authors: Kezia Oketch, John P. Lalor, Yi Yang, Ahmed Abbasi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.11827
Pdf URL: https://arxiv.org/pdf/2503.11827
Copy Paste: [[2503.11827]] Bridging the LLM Accessibility Divide? Performance, Fairness, and Cost of Closed versus Open LLMs for Automated Essay Scoring(https://arxiv.org/abs/2503.11827)
Keywords: language model, gpt, llm
Abstract: Closed large language models (LLMs) such as GPT-4 have set state-of-the-art results across a number of NLP tasks and have become central to NLP and machine learning (ML)-driven solutions. Closed LLMs' performance and wide adoption have sparked considerable debate about their accessibility in terms of availability, cost, and transparency. In this study, we perform a rigorous comparative analysis of nine leading LLMs, spanning closed, open, and open-source LLM ecosystems, across text assessment and generation tasks related to automated essay scoring. Our findings reveal that for few-shot learning-based assessment of human-generated essays, open LLMs such as Llama 3 and Qwen2.5 perform comparably to GPT-4 in terms of predictive performance, with no significant differences in disparate impact scores when considering age- or race-related fairness. Moreover, Llama 3 offers a substantial cost advantage, being up to 37 times more cost-efficient than GPT-4. For generative tasks, we find that essays generated by top open LLMs are comparable to those generated by closed LLMs in terms of their semantic composition/embeddings and ML-assessed scores. Our findings challenge the dominance of closed LLMs and highlight the democratizing potential of open LLMs, suggesting they can effectively bridge accessibility divides while maintaining competitive performance and fairness.
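The disparate impact scores mentioned in this abstract are commonly computed as a ratio of favorable-outcome rates between two groups. The sketch below uses that common definition, with invented binary outcomes; the paper's exact scoring procedure is not given here and may differ.

```python
from typing import Sequence


def favorable_rate(outcomes: Sequence[int]) -> float:
    """Share of favorable outcomes (1 = favorable, 0 = unfavorable)."""
    return sum(outcomes) / len(outcomes)


def disparate_impact(protected: Sequence[int], reference: Sequence[int]) -> float:
    """Ratio of favorable rates: protected group over reference group."""
    reference_rate = favorable_rate(reference)
    if reference_rate == 0:
        raise ValueError("reference group has no favorable outcomes")
    return favorable_rate(protected) / reference_rate


# Invented outcomes, e.g. whether an essay received a passing score.
group_a = [1, 0, 1, 1]        # favorable rate 0.75
group_b = [1, 1, 1, 1, 0]     # favorable rate 0.80

di = disparate_impact(group_a, group_b)
print(f"disparate impact ratio: {di:.4f}")
```

A ratio near 1 indicates similar treatment of the two groups; a widely used rule of thumb flags ratios below 0.8.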

Title: A Transformer and Prototype-based Interpretable Model for Contextual Sarcasm Detection

Authors: Ximing Wen, Rezvaneh Rezapour
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.11838
Pdf URL: https://arxiv.org/pdf/2503.11838
Copy Paste: [[2503.11838]] A Transformer and Prototype-based Interpretable Model for Contextual Sarcasm Detection(https://arxiv.org/abs/2503.11838)
Keywords: language model
Abstract: Sarcasm detection, with its figurative nature, poses unique challenges for affective systems designed to perform sentiment analysis. While these systems typically perform well at identifying direct expressions of emotion, they struggle with sarcasm's inherent contradiction between literal and intended sentiment. Since transformer-based language models (LMs) are known for their efficient ability to capture contextual meanings, we propose a method that leverages LMs and prototype-based networks, enhanced by sentiment embeddings to conduct interpretable sarcasm detection. Our approach is intrinsically interpretable without extra post-hoc interpretability techniques. We test our model on three public benchmark datasets and show that our model outperforms the current state-of-the-art. At the same time, the prototypical layer enhances the model's inherent interpretability by generating explanations through similar examples in the reference time. Furthermore, we demonstrate the effectiveness of incongruity loss in the ablation study, which we construct using sentiment prototypes.

Title: OpeNLGauge: An Explainable Metric for NLG Evaluation with Open-Weights LLMs

Authors: Ivan Kartáč, Mateusz Lango, Ondřej Dušek
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.11858
Pdf URL: https://arxiv.org/pdf/2503.11858
Copy Paste: [[2503.11858]] OpeNLGauge: An Explainable Metric for NLG Evaluation with Open-Weights LLMs(https://arxiv.org/abs/2503.11858)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have demonstrated great potential as evaluators of NLG systems, allowing for high-quality, reference-free, and multi-aspect assessments. However, existing LLM-based metrics suffer from two major drawbacks: reliance on proprietary models to generate training data or perform evaluations, and a lack of fine-grained, explanatory feedback. In this paper, we introduce OpeNLGauge, a fully open-source, reference-free NLG evaluation metric that provides accurate explanations based on error spans. OpeNLGauge is available as a two-stage ensemble of larger open-weight LLMs, or as a small fine-tuned evaluation model, with confirmed generalizability to unseen tasks, domains and aspects. Our extensive meta-evaluation shows that OpeNLGauge achieves competitive correlation with human judgments, outperforming state-of-the-art models on certain tasks while maintaining full reproducibility and providing explanations more than twice as accurate.

Title: GPT's Devastated and LLaMA's Content: Emotion Representation Alignment in LLMs for Keyword-based Generation

Authors: Shadab Choudhury, Asha Kumar, Lara J. Martin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.11881
Pdf URL: https://arxiv.org/pdf/2503.11881
Copy Paste: [[2503.11881]] GPT's Devastated and LLaMA's Content: Emotion Representation Alignment in LLMs for Keyword-based Generation(https://arxiv.org/abs/2503.11881)
Keywords: language model, gpt, llm
Abstract: In controlled text generation using large language models (LLMs), gaps arise between the language model's interpretation and human expectations. We look at the problem of controlling emotions in keyword-based sentence generation for both GPT-4 and LLaMA-3. We selected four emotion representations: Words, Valence-Arousal-Dominance (VAD) dimensions expressed in both Lexical and Numeric forms, and Emojis. Our human evaluation looked at the Human-LLM alignment for each representation, as well as the accuracy and realism of the generated sentences. While representations like VAD break emotions into easy-to-compute components, our findings show that people agree more with how LLMs generate when conditioned on English words (e.g., "angry") rather than VAD scales. This difference is especially visible when comparing Numeric VAD to words. However, we found that converting the originally-numeric VAD scales to Lexical scales (e.g., +4.0 becomes "High") dramatically improved agreement. Furthermore, the perception of how much a generated sentence conveys an emotion is highly dependent on the LLM, representation type, and which emotion it is.

Title: Resolving UnderEdit & OverEdit with Iterative & Neighbor-Assisted Model Editing

Authors: Bhiman Kumar Baghel, Scott M. Jordan, Zheyuan Ryan Shi, Xiang Lorraine Li
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.11895
Pdf URL: https://arxiv.org/pdf/2503.11895
Copy Paste: [[2503.11895]] Resolving UnderEdit & OverEdit with Iterative & Neighbor-Assisted Model Editing(https://arxiv.org/abs/2503.11895)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) are used in various downstream language tasks, making it crucial to keep their knowledge up-to-date, but both retraining and fine-tuning the model can be costly. Model editing offers an efficient and effective alternative through a single update to only a key subset of model parameters. While efficient, these methods are not perfect. Sometimes knowledge edits are unsuccessful, i.e., UnderEdit, or the edit contaminates neighboring knowledge that should remain unchanged, i.e., OverEdit. To address these limitations, we propose iterative model editing, based on our hypothesis that a single parameter update is often insufficient, to mitigate UnderEdit, and neighbor-assisted model editing, which incorporates neighboring knowledge during editing to minimize OverEdit. Extensive experiments demonstrate that our methods effectively reduce UnderEdit by up to 38 percentage points and OverEdit by up to 6 percentage points across multiple model editing algorithms, LLMs, and benchmark datasets.

Title: LLMs for Translation: Historical, Low-Resourced Languages and Contemporary AI Models

Authors: Merve Tekgurler
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.11898
Pdf URL: https://arxiv.org/pdf/2503.11898
Copy Paste: [[2503.11898]] LLMs for Translation: Historical, Low-Resourced Languages and Contemporary AI Models(https://arxiv.org/abs/2503.11898)
Keywords: language model, gpt, llm
Abstract: Large Language Models (LLMs) have demonstrated remarkable adaptability in performing various tasks, including machine translation (MT), without explicit training. Models such as OpenAI's GPT-4 and Google's Gemini are frequently evaluated on translation benchmarks and utilized as translation tools due to their high performance. This paper examines Gemini's performance in translating an 18th-century Ottoman Turkish manuscript, Prisoner of the Infidels: The Memoirs of Osman Agha of Timisoara, into English. The manuscript recounts the experiences of Osman Agha, an Ottoman subject who spent 11 years as a prisoner of war in Austria, and includes his accounts of warfare and violence. Our analysis reveals that Gemini's safety mechanisms flagged between 14 and 23 percent of the manuscript as harmful, resulting in untranslated passages. These safety settings, while effective in mitigating potential harm, hinder the model's ability to provide complete and accurate translations of historical texts. Through real historical examples, this study highlights the inherent challenges and limitations of current LLM safety implementations in the handling of sensitive and context-rich materials. These real-world instances underscore potential failures of LLMs in contemporary translation scenarios where accurate and comprehensive translations are crucial, for example when translating the accounts of modern victims of war for legal proceedings or humanitarian documentation.

Title: LAG-MMLU: Benchmarking Frontier LLM Understanding in Latvian and Giriama

Authors: Naome A. Etori, Kevin Lu, Randu Karisa, Arturs Kanepajs
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.11911
Pdf URL: https://arxiv.org/pdf/2503.11911
Copy Paste: [[2503.11911]] LAG-MMLU: Benchmarking Frontier LLM Understanding in Latvian and Giriama(https://arxiv.org/abs/2503.11911)
Keywords: language model, llm
Abstract: As large language models (LLMs) rapidly advance, evaluating their performance is critical. LLMs are trained on multilingual data, but their reasoning abilities are mainly evaluated using English datasets. Hence, robust evaluation frameworks are needed using high-quality non-English datasets, especially for low-resource languages (LRLs). This study evaluates eight state-of-the-art (SOTA) LLMs on Latvian and Giriama using a Massive Multitask Language Understanding (MMLU) subset curated with native speakers for linguistic and cultural relevance. Giriama is benchmarked for the first time. Our evaluation shows that OpenAI's o1 model outperforms others across all languages, scoring 92.8% in English, 88.8% in Latvian, and 70.8% in Giriama on 0-shot tasks. Mistral-large (35.6%) and Llama-70B IT (41%) have weak performance on both Latvian and Giriama. Our results underscore the need for localized benchmarks and human evaluations in advancing cultural AI contextualization.

Title: REGEN: A Dataset and Benchmarks with Natural Language Critiques and Narratives

Authors: Kun Su, Krishna Sayana, Hubert Pham, James Pine, Yuri Vasilevski, Raghavendra Vasudeva, Marialena Kyriakidi, Liam Hebert, Ambarish Jash, Anushya Subbiah, Sukhdeep Sodhi
Subjects: cs.CL, cs.AI, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2503.11924
Pdf URL: https://arxiv.org/pdf/2503.11924
Copy Paste: [[2503.11924]] REGEN: A Dataset and Benchmarks with Natural Language Critiques and Narratives(https://arxiv.org/abs/2503.11924)
Keywords: language model, llm
Abstract: This paper introduces a novel dataset REGEN (Reviews Enhanced with GEnerative Narratives), designed to benchmark the conversational capabilities of recommender Large Language Models (LLMs), addressing the limitations of existing datasets that primarily focus on sequential item prediction. REGEN extends the Amazon Product Reviews dataset by inpainting two key natural language features: (1) user critiques, representing user "steering" queries that lead to the selection of a subsequent item, and (2) narratives, rich textual outputs associated with each recommended item taking into account prior context. The narratives include product endorsements, purchase explanations, and summaries of user preferences. Further, we establish an end-to-end modeling benchmark for the task of conversational recommendation, where models are trained to generate both recommendations and corresponding narratives conditioned on user history (items and critiques). For this joint task, we introduce a modeling framework LUMEN (LLM-based Unified Multi-task Model with Critiques, Recommendations, and Narratives) which uses an LLM as a backbone for critiquing, retrieval and generation. We also evaluate the dataset's quality using standard auto-rating techniques and benchmark it by training both traditional and LLM-based recommender models. Our results demonstrate that incorporating critiques enhances recommendation quality by enabling the recommender to learn language understanding and integrate it with recommendation signals. Furthermore, LLMs trained on our dataset effectively generate both recommendations and contextual narratives, achieving performance comparable to state-of-the-art recommenders and language models.

Title: Integration of Explainable AI Techniques with Large Language Models for Enhanced Interpretability for Sentiment Analysis

Authors: Thivya Thogesan, Anupiya Nugaliyadde, Kok Wai Wong
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.11948
Pdf URL: https://arxiv.org/pdf/2503.11948
Copy Paste: [[2503.11948]] Integration of Explainable AI Techniques with Large Language Models for Enhanced Interpretability for Sentiment Analysis(https://arxiv.org/abs/2503.11948)
Keywords: language model, llm
Abstract: Interpretability remains a key difficulty in sentiment analysis with Large Language Models (LLMs), particularly in high-stakes applications where it is crucial to comprehend the rationale behind forecasts. This research addresses this by introducing a technique that applies SHAP (Shapley Additive Explanations) by breaking down LLMs into components such as the embedding layer, encoder, decoder, and attention layer to provide a layer-by-layer understanding of sentiment prediction. The approach offers a clearer overview of how models interpret and categorise sentiment by breaking down LLMs into these parts. The method is evaluated using the Stanford Sentiment Treebank (SST-2) dataset, which shows how different sentences affect different layers. The effectiveness of layer-wise SHAP analysis in clarifying sentiment-specific token attributions is demonstrated by experimental evaluations, which provide a notable enhancement over current whole-model explainability techniques. These results highlight how the suggested approach could improve the reliability and transparency of LLM-based sentiment analysis in crucial applications.
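The Shapley attribution that SHAP builds on can be computed exactly for a tiny example. The sketch below averages each token's marginal contribution over all orderings for a hypothetical `toy_sentiment` scorer. It illustrates token-level attribution only and is neither the SHAP library nor the paper's layer-wise procedure.

```python
from itertools import permutations
from math import factorial
from typing import Callable, Dict, FrozenSet, List


def shapley_values(players: List[str],
                   value: Callable[[FrozenSet[str]], float]) -> Dict[str, float]:
    """Exact Shapley values: mean marginal contribution over all player orderings."""
    totals = {player: 0.0 for player in players}
    for order in permutations(players):
        coalition: FrozenSet[str] = frozenset()
        for player in order:
            extended = coalition | {player}
            totals[player] += value(extended) - value(coalition)
            coalition = extended
    orderings = factorial(len(players))
    return {player: total / orderings for player, total in totals.items()}


def toy_sentiment(tokens: FrozenSet[str]) -> float:
    """Hypothetical scorer: 'good' adds 1, but 'not' together with 'good' subtracts 2."""
    score = 0.0
    if "good" in tokens:
        score += 1.0
    if "not" in tokens and "good" in tokens:
        score -= 2.0
    return score


phi = shapley_values(["not", "good", "movie"], toy_sentiment)
print(phi)
```

The attributions sum to the score of the full sentence, and the interaction between "not" and "good" is split between the two tokens, while the uninvolved token receives zero. Exact enumeration scales factorially, which is why practical tools approximate it.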

Title: HInter: Exposing Hidden Intersectional Bias in Large Language Models

Authors: Badr Souani, Ezekiel Soremekun, Mike Papadakis, Setsuko Yokoyama, Sudipta Chattopadhyay, Yves Le Traon
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.11962
Pdf URL: https://arxiv.org/pdf/2503.11962
Copy Paste: [[2503.11962]] HInter: Exposing Hidden Intersectional Bias in Large Language Models(https://arxiv.org/abs/2503.11962)
Keywords: language model, gpt, llm
Abstract: Large Language Models (LLMs) may portray discrimination towards certain individuals, especially those characterized by multiple attributes (aka intersectional bias). Discovering intersectional bias in LLMs is challenging, as it involves complex inputs on multiple attributes (e.g., race and gender). To address this challenge, we propose HInter, a test technique that synergistically combines mutation analysis, dependency parsing and metamorphic oracles to automatically detect intersectional bias in LLMs. HInter generates test inputs by systematically mutating sentences using multiple mutations, validates inputs via a dependency invariant and detects biases by checking the LLM response on the original and mutated sentences. We evaluate HInter using six LLM architectures and 18 LLM models (GPT3.5, Llama2, BERT, etc.) and find that 14.61% of the inputs generated by HInter expose intersectional bias. Results also show that our dependency invariant reduces false positives (incorrect test inputs) by an order of magnitude. Finally, we observed that 16.62% of intersectional bias errors are hidden, meaning that their corresponding atomic cases do not trigger biases. Overall, this work emphasizes the importance of testing LLMs for intersectional bias.
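The metamorphic check and the notion of a hidden intersectional bias can be sketched in a few lines. The `toy_model`, the word-level `mutate` helper, and the example sentence are hypothetical, and the dependency-invariant validation step that HInter applies to mutated inputs is left out.

```python
from typing import Callable, Dict


def mutate(sentence: str, replacements: Dict[str, str]) -> str:
    """Replace whole words according to the mapping; other words are kept."""
    return " ".join(replacements.get(word, word) for word in sentence.split())


def is_hidden_intersectional_bias(model: Callable[[str], str], sentence: str,
                                  mutation_a: Dict[str, str],
                                  mutation_b: Dict[str, str]) -> bool:
    """True when only the combined mutation changes the model's output."""
    base = model(sentence)
    atomic_a_changes = model(mutate(sentence, mutation_a)) != base
    atomic_b_changes = model(mutate(sentence, mutation_b)) != base
    combined_changes = model(mutate(sentence, {**mutation_a, **mutation_b})) != base
    return combined_changes and not (atomic_a_changes or atomic_b_changes)


def toy_model(sentence: str) -> str:
    """Hypothetical biased classifier: reacts only to one attribute combination."""
    words = set(sentence.split())
    return "negative" if {"elderly", "woman"} <= words else "neutral"


sentence = "the young man applied for the loan"
hidden = is_hidden_intersectional_bias(
    toy_model, sentence, {"young": "elderly"}, {"man": "woman"}
)
print(hidden)
```

The metamorphic oracle needs no ground-truth label: it only requires that the prediction stay the same when protected attributes change. A case counts as hidden when neither single-attribute mutation reveals the problem on its own.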

Title: No LLM is Free From Bias: A Comprehensive Study of Bias Evaluation in Large Language models

Authors: Charaka Vinayak Kumar, Ashok Urlana, Gopichand Kanumolu, Bala Mallikarjunarao Garlapati, Pruthwik Mishra
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.11985
Pdf URL: https://arxiv.org/pdf/2503.11985
Copy Paste: [[2503.11985]] No LLM is Free From Bias: A Comprehensive Study of Bias Evaluation in Large Language models(https://arxiv.org/abs/2503.11985)
Keywords: language model, llm, prompt
Abstract: Advancements in Large Language Models (LLMs) have increased the performance of different natural language understanding as well as generation tasks. Although LLMs have breached the state-of-the-art performance in various tasks, they often reflect different forms of bias present in the training data. In the light of this perceived limitation, we provide a unified evaluation of benchmarks using a set of representative LLMs that cover different forms of biases starting from physical characteristics to socio-economic categories. Moreover, we propose five prompting approaches to carry out the bias detection task across different aspects of bias. Further, we formulate three research questions to gain valuable insight in detecting biases in LLMs using different approaches and evaluation metrics across benchmarks. The results indicate that each of the selected LLMs suffer from one or the other form of bias with the LLaMA3.1-8B model being the least biased. Finally, we conclude the paper with the identification of key challenges and possible future directions.

Title: Applications of Large Language Model Reasoning in Feature Generation

Authors: Dharani Chandra
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.11989
Pdf URL: https://arxiv.org/pdf/2503.11989
Copy Paste: [[2503.11989]] Applications of Large Language Model Reasoning in Feature Generation(https://arxiv.org/abs/2503.11989)
Keywords: language model, llm, hallucination, retrieval-augmented generation
Abstract: Large Language Models (LLMs) have revolutionized natural language processing through their state-of-the-art reasoning capabilities. This paper explores the convergence of LLM reasoning techniques and feature generation for machine learning tasks. We examine four key reasoning approaches: Chain of Thought, Tree of Thoughts, Retrieval-Augmented Generation, and Thought Space Exploration. Our analysis reveals how these approaches can be used to identify effective feature generation rules without having to manually specify search spaces. The paper categorizes LLM-based feature generation methods across various domains including finance, healthcare, and text analytics. In healthcare, LLMs can extract key information from clinical notes and radiology reports, enabling more efficient data utilization. In finance, LLMs facilitate text generation, summarization, and entity extraction from complex documents. We analyze evaluation methodologies for assessing feature quality and downstream performance, with particular attention to OCTree's decision tree reasoning approach, which provides language-based feedback for iterative improvements. Current challenges include hallucination, computational efficiency, and domain adaptation. As of March 2025, emerging approaches include inference-time compute scaling, reinforcement learning, and supervised fine-tuning with model distillation. Future directions point toward multimodal feature generation, self-improving systems, and neuro-symbolic approaches. This paper provides a detailed overview of an emerging field that promises to automate and enhance feature engineering through language model reasoning.

Title: TLUE: A Tibetan Language Understanding Evaluation Benchmark

Authors: Fan Gao, Cheng Huang, Nyima Tashi, Xiangxiang Wang, Thupten Tsering, Ban Ma-bao, Renzeg Duojie, Gadeng Luosang, Rinchen Dongrub, Dorje Tashi, Xiao Feng, Yongbin Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.12051
Pdf URL: https://arxiv.org/pdf/2503.12051
Copy Paste: [[2503.12051]] TLUE: A Tibetan Language Understanding Evaluation Benchmark(https://arxiv.org/abs/2503.12051)
Keywords: language model, llm
Abstract: Large language models (LLMs) have made tremendous progress in recent years, but low-resource languages, such as Tibetan, remain significantly underrepresented in their evaluation. Despite Tibetan being spoken by over seven million people, it has largely been neglected in the development and assessment of LLMs. To address this gap, we present TLUE (A Tibetan Language Understanding Evaluation Benchmark), the first large-scale benchmark for assessing LLMs' capabilities in Tibetan. TLUE comprises two major components: (1) a comprehensive multi-task understanding benchmark spanning 5 domains and 67 subdomains, and (2) a safety benchmark covering 7 subdomains. We evaluate a diverse set of state-of-the-art LLMs. Experimental results demonstrate that most LLMs perform below the random baseline, highlighting the considerable challenges LLMs face in processing Tibetan, a low-resource language. TLUE provides an essential foundation for driving future research and progress in Tibetan language understanding and underscores the need for greater inclusivity in LLM development.

Title: Information-Guided Identification of Training Data Imprint in (Proprietary) Large Language Models

Authors: Abhilasha Ravichander, Jillian Fisher, Taylor Sorensen, Ximing Lu, Yuchen Lin, Maria Antoniak, Niloofar Mireshghallah, Chandra Bhagavatula, Yejin Choi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.12072
Pdf URL: https://arxiv.org/pdf/2503.12072
Copy Paste: [[2503.12072]] Information-Guided Identification of Training Data Imprint in (Proprietary) Large Language Models(https://arxiv.org/abs/2503.12072)
Keywords: language model, gpt, llm
Abstract: High-quality training data has proven crucial for developing performant large language models (LLMs). However, commercial LLM providers disclose few, if any, details about the data used for training. This lack of transparency creates multiple challenges: it limits external oversight and inspection of LLMs for issues such as copyright infringement, it undermines the agency of data authors, and it hinders scientific research on critical issues such as data contamination and data selection. How can we recover what training data is known to LLMs? In this work, we demonstrate a new method to identify training data known to proprietary LLMs like GPT-4 without requiring any access to model weights or token probabilities, by using information-guided probes. Our work builds on a key observation: text passages with high surprisal are good search material for memorization probes. By evaluating a model's ability to successfully reconstruct high-surprisal tokens in text, we can identify a surprising number of texts memorized by LLMs.
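The "high surprisal" criterion mentioned in the abstract can be illustrated with a minimal sketch. Surprisal is the standard quantity -log2(p). Everything else here is an assumption made for illustration: the token probabilities are invented, the function names are hypothetical, and the abstract does not say how the authors obtain probabilities or select tokens.

```python
import math

def surprisal_bits(prob: float) -> float:
    """Surprisal of an event with probability `prob`, measured in bits."""
    return -math.log2(prob)

def high_surprisal_tokens(tokens, probs, k=2):
    """Return the k tokens with the highest surprisal, kept in text order."""
    scored = [(surprisal_bits(p), i) for i, p in enumerate(probs)]
    top = sorted(scored, reverse=True)[:k]
    keep = sorted(i for _, i in top)
    return [tokens[i] for i in keep]

# Invented probabilities, as if assigned by some reference language model.
tokens = ["the", "quokka", "sat", "on", "the", "zeppelin"]
probs = [0.5, 0.001, 0.1, 0.4, 0.6, 0.002]
print(high_surprisal_tokens(tokens, probs))
```

Tokens selected this way are the ones a probe would presumably mask and ask the model under test to reconstruct.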

Title: Large Language Models in Legislative Content Analysis: A Dataset from the Polish Parliament

Authors: Arkadiusz Bryłkowski, Jakub Klikowski
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.12100
Pdf URL: https://arxiv.org/pdf/2503.12100
Copy Paste: [[2503.12100]] Large Language Models in Legislative Content Analysis: A Dataset from the Polish Parliament(https://arxiv.org/abs/2503.12100)
Keywords: language model, llm
Abstract: Large language models (LLMs) are among the best methods for processing natural language, partly due to their versatility. At the same time, domain-specific LLMs are more practical in real-life applications. This work introduces a novel natural language dataset created from data acquired from official legislative authorities' websites. The study focuses on formulating three natural language processing (NLP) tasks to evaluate the effectiveness of LLMs on legislative content analysis within the context of the Polish legal system. Key findings highlight the potential of LLMs in automating and enhancing legislative content analysis while emphasizing specific challenges, such as understanding legal context. The research contributes to the advancement of NLP in the legal field, particularly in the Polish language. It has been demonstrated that even commonly accessible data can be practically utilized for legislative content analysis.

Title: RECSIP: REpeated Clustering of Scores Improving the Precision

Authors: André Schamschurko, Nenad Petrovic, Alois Christian Knoll
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.12108
Pdf URL: https://arxiv.org/pdf/2503.12108
Copy Paste: [[2503.12108]] RECSIP: REpeated Clustering of Scores Improving the Precision(https://arxiv.org/abs/2503.12108)
Keywords: language model, gpt, llm
Abstract: The latest research on Large Language Models (LLMs) has demonstrated significant advancement in the field of Natural Language Processing (NLP). However, despite this progress, these models still lack reliability. This is due to the stochastic architecture of LLMs, which presents a challenge for users attempting to ascertain the reliability of a model's response. Unreliable responses may cause serious harm in high-risk environments or expensive failures in industrial contexts. Therefore, we introduce the framework REpeated Clustering of Scores Improving the Precision (RECSIP), which focuses on improving the precision of LLMs by asking multiple models in parallel, then scoring and clustering their responses to ensure higher reliability of the response. The evaluation of our reference implementation recsip on the benchmark MMLU-Pro using the models GPT-4o, Claude and Gemini shows an overall increase of 5.8 percentage points compared to the best single model used.
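The idea of asking several models and grouping their responses can be sketched in a few lines. This is only a simplified stand-in, not the RECSIP framework itself: exact-match grouping of normalized answers replaces the paper's scoring and clustering, the repetition step is left out, and the function names and sample answers are hypothetical.

```python
from collections import Counter

def normalize(answer: str) -> str:
    """Lowercase, collapse whitespace, and drop surrounding spaces and periods."""
    return " ".join(answer.lower().split()).strip(" .")

def cluster_and_pick(answers):
    """Group identical normalized answers; return (largest group, agreement share)."""
    clusters = Counter(normalize(a) for a in answers)
    best, count = clusters.most_common(1)[0]
    return best, count / len(answers)

# Hypothetical multiple-choice answers from three different models.
best, share = cluster_and_pick(["B", "b.", "C"])
print(best, round(share, 2))
```

A low agreement share could be used as a signal to ask again or to abstain, which is the kind of reliability check the abstract motivates.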

Title: MT-RewardTree: A Comprehensive Framework for Advancing LLM-Based Machine Translation via Reward Modeling

Authors: Zhaopeng Feng, Jiahan Ren, Jiayuan Su, Jiamei Zheng, Zhihang Tang, Hongwei Wang, Zuozhu Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.12123
Pdf URL: https://arxiv.org/pdf/2503.12123
Copy Paste: [[2503.12123]] MT-RewardTree: A Comprehensive Framework for Advancing LLM-Based Machine Translation via Reward Modeling(https://arxiv.org/abs/2503.12123)
Keywords: language model, llm
Abstract: Process reward models (PRMs) have shown success in complex reasoning tasks for large language models (LLMs). However, their application to machine translation (MT) remains underexplored due to the lack of systematic methodologies and evaluation benchmarks. To address this gap, we introduce MT-RewardTree, a comprehensive framework for constructing, evaluating, and deploying process reward models in MT. Unlike traditional vanilla preference pair construction, we propose a novel method for automatically generating token-level preference pairs using approximate Monte Carlo Tree Search (MCTS), which mitigates the prohibitive cost of human annotation for fine-grained steps. Then, we establish the first MT-specific reward model benchmark and provide a systematic comparison of different reward modeling architectures, revealing that token-level supervision effectively captures fine-grained preferences. Experimental results demonstrate that our MT-PRM-Qwen-2.5-3B achieves state-of-the-art performance in both token-level and sequence-level evaluation given the same input prefix. Furthermore, we showcase practical applications where PRMs enable test-time alignment for LLMs without additional alignment training and significantly improve performance in hypothesis ensembling. Our work provides valuable insights into the role of reward models in MT research. Our code and data are released at this https URL.

Title: Seeing Sarcasm Through Different Eyes: Analyzing Multimodal Sarcasm Perception in Large Vision-Language Models

Authors: Junjie Chen, Xuyang Liu, Subin Huang, Linfeng Zhang, Hang Yu
Subjects: cs.CL, cs.MM, cs.SI
Abstract URL: https://arxiv.org/abs/2503.12149
Pdf URL: https://arxiv.org/pdf/2503.12149
Copy Paste: [[2503.12149]] Seeing Sarcasm Through Different Eyes: Analyzing Multimodal Sarcasm Perception in Large Vision-Language Models(https://arxiv.org/abs/2503.12149)
Keywords: language model, prompt
Abstract: With the advent of large vision-language models (LVLMs) demonstrating increasingly human-like abilities, a pivotal question emerges: do different LVLMs interpret multimodal sarcasm differently, and can a single model grasp sarcasm from multiple perspectives like humans? To explore this, we introduce an analytical framework using systematically designed prompts on existing multimodal sarcasm datasets. Evaluating 12 state-of-the-art LVLMs over 2,409 samples, we examine interpretive variations within and across models, focusing on confidence levels, alignment with dataset labels, and recognition of ambiguous "neutral" cases. Our findings reveal notable discrepancies -- across LVLMs and within the same model under varied prompts. While classification-oriented prompts yield higher internal consistency, models diverge markedly when tasked with interpretive reasoning. These results challenge binary labeling paradigms by highlighting sarcasm's subjectivity. We advocate moving beyond rigid annotation schemes toward multi-perspective, uncertainty-aware modeling, offering deeper insights into multimodal sarcasm comprehension. Our code and data are available at: this https URL

Title: Improving LLM-based Document-level Machine Translation with Multi-Knowledge Fusion

Authors: Bin Liu, Xinglin Lyu, Junhui Li, Daimeng Wei, Min Zhang, Shimin Tao, Hao Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.12152
Pdf URL: https://arxiv.org/pdf/2503.12152
Copy Paste: [[2503.12152]] Improving LLM-based Document-level Machine Translation with Multi-Knowledge Fusion(https://arxiv.org/abs/2503.12152)
Keywords: language model, gpt, llm, prompt
Abstract: Recent studies on prompting large language models (LLMs) for document-level machine translation (DMT) primarily focus on the inter-sentence context by flattening the source document into a long sequence. This approach relies solely on the sequence of sentences within the document. However, the complexity of document-level sequences is greater than that of shorter sentence-level sequences, which may limit an LLM's ability in DMT when only this single-source knowledge is used. In this paper, we propose an enhanced approach that incorporates multiple sources of knowledge, including both document summarization and entity translation, to improve the performance of LLM-based DMT. Given a source document, we first obtain its summarization and translation of entities via LLM as the additional knowledge. We then utilize LLMs to generate two translations of the source document by fusing these two single knowledge sources, respectively. Finally, recognizing that different sources of knowledge may aid or hinder the translation of different sentences, we refine and rank the translations by leveraging a multi-knowledge fusion strategy to ensure the best results. Experimental results in eight document-level translation tasks show that our approach achieves an average improvement of 0.8, 0.6, and 0.4 COMET scores over the baseline without extra knowledge for LLaMA3-8B-Instruct, Mistral-Nemo-Instruct, and GPT-4o-mini, respectively.

Title: PLM: Efficient Peripheral Language Models Hardware-Co-Designed for Ubiquitous Computing

Authors: Cheng Deng, Luoyang Sun, Jiwen Jiang, Yongcheng Zeng, Xinjian Wu, Wenxin Zhao, Qingfa Xiao, Jiachuan Wang, Lei Chen, Lionel M. Ni, Haifeng Zhang, Jun Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.12167
Pdf URL: https://arxiv.org/pdf/2503.12167
Copy Paste: [[2503.12167]] PLM: Efficient Peripheral Language Models Hardware-Co-Designed for Ubiquitous Computing(https://arxiv.org/abs/2503.12167)
Keywords: language model, llm
Abstract: While scaling laws have been continuously validated in large language models (LLMs) with increasing model parameters, the inherent tension between the inference demands of LLMs and the limited resources of edge devices poses a critical challenge to the development of edge intelligence. Recently, numerous small language models have emerged, aiming to distill the capabilities of LLMs into smaller footprints. However, these models often retain the fundamental architectural principles of their larger counterparts, still imposing considerable strain on the storage and bandwidth capacities of edge devices. In this paper, we introduce the PLM, a Peripheral Language Model, developed through a co-design process that jointly optimizes model architecture and edge system constraints. The PLM utilizes a Multi-head Latent Attention mechanism and employs the squared ReLU activation function to encourage sparsity, thereby reducing peak memory footprint during inference. During training, we collect and reorganize open-source datasets, implement a multi-phase training strategy, and empirically investigate the Warmup-Stable-Decay-Constant (WSDC) learning rate scheduler. Additionally, we incorporate Reinforcement Learning from Human Feedback (RLHF) by adopting the ARIES preference learning approach. Following a two-phase SFT process, this method yields performance gains of 2% in general tasks, 9% in the GSM8K task, and 11% in coding tasks. In addition to its novel architecture, evaluation results demonstrate that PLM outperforms existing small language models trained on publicly available data while maintaining the lowest number of activated parameters. Furthermore, deployment across various edge devices, including consumer-grade GPUs, mobile phones, and Raspberry Pis, validates PLM's suitability for peripheral applications. The PLM series models are publicly available at this https URL.
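The squared ReLU activation named in the abstract is simply max(0, x) squared. The short sketch below shows it on a handful of invented pre-activation values and measures the fraction of exact zeros in the output; the `sparsity` helper is a hypothetical illustration and says nothing about how PLM itself exploits sparsity.

```python
def squared_relu(x: float) -> float:
    """Squared ReLU: max(0, x) squared."""
    r = max(0.0, x)
    return r * r

def sparsity(values) -> float:
    """Fraction of entries that are exactly zero."""
    return sum(1 for v in values if v == 0.0) / len(values)

# Invented pre-activation values for illustration.
pre = [-2.0, -0.5, 0.0, 0.5, 3.0]
post = [squared_relu(v) for v in pre]
print(post)            # [0.0, 0.0, 0.0, 0.25, 9.0]
print(sparsity(post))  # 0.6
```

Every non-positive input maps to an exact zero, so the corresponding downstream computation can in principle be skipped.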

Title: Interpretation Gaps in LLM-Assisted Comprehension of Privacy Documents

Authors: Rinku Dewri
Subjects: cs.CL, cs.CY, cs.IR
Abstract URL: https://arxiv.org/abs/2503.12225
Pdf URL: https://arxiv.org/pdf/2503.12225
Copy Paste: [[2503.12225]] Interpretation Gaps in LLM-Assisted Comprehension of Privacy Documents(https://arxiv.org/abs/2503.12225)
Keywords: language model, llm
Abstract: This article explores the gaps that can manifest when using a large language model (LLM) to obtain simplified interpretations of data practices from a complex privacy policy. We exemplify these gaps to showcase issues in accuracy, completeness, clarity and representation, while advocating for continued research to realize an LLM's true potential in revolutionizing privacy management through personal assistants and automated compliance checking.

Title: Integrating Chain-of-Thought and Retrieval Augmented Generation Enhances Rare Disease Diagnosis from Clinical Notes

Authors: Da Wu, Zhanliang Wang, Quan Nguyen, Kai Wang
Subjects: cs.CL, cs.AI, q-bio.GN, q-bio.QM
Abstract URL: https://arxiv.org/abs/2503.12286
Pdf URL: https://arxiv.org/pdf/2503.12286
Copy Paste: [[2503.12286]] Integrating Chain-of-Thought and Retrieval Augmented Generation Enhances Rare Disease Diagnosis from Clinical Notes(https://arxiv.org/abs/2503.12286)
Keywords: language model, gpt, llm, prompt, retrieval augmented generation, chain-of-thought
Abstract: Background: Several studies show that large language models (LLMs) struggle with phenotype-driven gene prioritization for rare diseases. These studies typically use Human Phenotype Ontology (HPO) terms to prompt foundation models like GPT and LLaMA to predict candidate genes. However, in real-world settings, foundation models are not optimized for domain-specific tasks like clinical diagnosis, and inputs are unstructured clinical notes rather than standardized terms. How LLMs can be instructed to predict candidate genes or disease diagnosis from unstructured clinical notes remains a major challenge. Methods: We introduce RAG-driven CoT and CoT-driven RAG, two methods that combine Chain-of-Thought (CoT) and Retrieval Augmented Generation (RAG) to analyze clinical notes. A five-question CoT protocol mimics expert reasoning, while RAG retrieves data from sources like HPO and OMIM (Online Mendelian Inheritance in Man). We evaluated these approaches on rare disease datasets, including 5,980 Phenopacket-derived notes, 255 literature-based narratives, and 220 in-house clinical notes from Children's Hospital of Philadelphia. Results: We found that recent foundation models, including Llama 3.3-70B-Instruct and DeepSeek-R1-Distill-Llama-70B, outperformed earlier versions such as Llama 2 and GPT-3.5. We also showed that RAG-driven CoT and CoT-driven RAG both outperform foundation models in candidate gene prioritization from clinical notes; in particular, both methods with a DeepSeek backbone resulted in a top-10 gene accuracy of over 40% on Phenopacket-derived clinical notes. RAG-driven CoT works better for high-quality notes, where early retrieval can anchor the subsequent reasoning steps in domain-specific evidence, while CoT-driven RAG has an advantage when processing lengthy and noisy notes.

Title: The Lucie-7B LLM and the Lucie Training Dataset: Open resources for multilingual language generation

Authors: Olivier Gouvert, Julie Hunter, Jérôme Louradour, Christophe Cerisara, Evan Dufraisse, Yaya Sy, Laura Rivière, Jean-Pierre Lorré, OpenLLM-France community
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.12294
Pdf URL: https://arxiv.org/pdf/2503.12294
Copy Paste: [[2503.12294]] The Lucie-7B LLM and the Lucie Training Dataset: Open resources for multilingual language generation(https://arxiv.org/abs/2503.12294)
Keywords: language model, llm
Abstract: We present both the Lucie Training Dataset and the Lucie-7B foundation model. The Lucie Training Dataset is a multilingual collection of textual corpora centered around French and designed to offset anglo-centric biases found in many datasets for large language model pretraining. Its French data is pulled not only from traditional web sources, but also from French cultural heritage documents, filling an important gap in modern datasets. Beyond French, which makes up the largest share of the data, we added documents to support several other European languages, including English, Spanish, German, and Italian. Apart from its value as a resource for French language and culture, an important feature of this dataset is that it prioritizes data rights by minimizing copyrighted material. In addition, building on the philosophy of past open projects, it is redistributed in the form used for training and its processing is described on Hugging Face and GitHub. The Lucie-7B foundation model is trained on equal amounts of data in French and English -- roughly 33% each -- in an effort to better represent cultural aspects of French-speaking communities. We also describe two instruction fine-tuned models, Lucie-7B-Instruct-v1.1 and Lucie-7B-Instruct-human-data, which we release as demonstrations of Lucie-7B in use. These models achieve promising results compared to state-of-the-art models, demonstrating that an open approach prioritizing data rights can still deliver strong performance. We see these models as an initial step toward developing more performant, aligned models in the near future. Model weights for Lucie-7B and the Lucie instruct models, along with intermediate checkpoints for the former, are published on Hugging Face, while model training and data preparation code is available on GitHub. This makes Lucie-7B one of the first OSI compliant language models according to the new OSI definition.

Title: SVD-LLM V2: Optimizing Singular Value Truncation for Large Language Model Compression

Authors: Xin Wang, Samiul Alam, Zhongwei Wan, Hui Shen, Mi Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.12340
Pdf URL: https://arxiv.org/pdf/2503.12340
Copy Paste: [[2503.12340]] SVD-LLM V2: Optimizing Singular Value Truncation for Large Language Model Compression(https://arxiv.org/abs/2503.12340)
Keywords: language model, llm
Abstract: Despite significant advancements, the practical deployment of Large Language Models (LLMs) is often hampered by their immense sizes, highlighting the need for effective compression techniques. Singular Value Decomposition (SVD) is a promising LLM compression technique. However, existing SVD-based compression methods fall short in reducing truncation losses, leading to less competitive performance in compressed models. In this work, we introduce SVD-LLM V2, an SVD-based LLM compression method that optimizes singular value truncation in SVD compression with two techniques. First, SVD-LLM V2 proposes to use theoretical truncation loss of weight matrices to assign a unique compression ratio to each weight matrix at different layers to accommodate weight redundancy heterogeneity. Second, SVD-LLM V2 proposes loss-optimized weight truncation to ensure that the truncated singular values result in a lower and more stable truncation loss in practice. We evaluate SVD-LLM V2 on ten datasets and five LLMs at various scales. Our results show SVD-LLM V2 outperforms state-of-the-art SVD-based LLM compression methods. Our code is available at this https URL.
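For readers unfamiliar with SVD compression, the baseline idea that the abstract builds on can be sketched as follows. This is plain rank-k truncation of one random matrix using NumPy, not SVD-LLM V2: the per-matrix compression ratios and the loss-optimized truncation described above are not implemented, and the matrix shape and rank are arbitrary choices for illustration.

```python
import numpy as np

def truncate_svd(W: np.ndarray, rank: int):
    """Factor W into A @ B, keeping only the top `rank` singular values."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # shape (m, rank), columns scaled by singular values
    B = Vt[:rank, :]             # shape (rank, n)
    return A, B

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 48))   # stand-in for a weight matrix
A, B = truncate_svd(W, rank=8)

# Frobenius-norm truncation loss; by the Eckart-Young theorem it equals
# the root of the sum of squared discarded singular values.
err = np.linalg.norm(W - A @ B)
S = np.linalg.svd(W, compute_uv=False)
print(err, np.sqrt(np.sum(S[8:] ** 2)))
print(W.size, A.size + B.size)      # parameters before and after: 3072 and 896
```

The truncation loss printed here is the quantity that a method assigning different ranks to different matrices would try to balance across layers.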

Title: General Table Question Answering via Answer-Formula Joint Generation

Authors: Zhongyuan Wang, Richong Zhang, Zhijie Nie
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.12345
Pdf URL: https://arxiv.org/pdf/2503.12345
Copy Paste: [[2503.12345]] General Table Question Answering via Answer-Formula Joint Generation(https://arxiv.org/abs/2503.12345)
Keywords: language model, llm, prompt
Abstract: Advanced table question answering (TableQA) methods prompt large language models (LLMs) to generate answer text, SQL queries, Python code, or custom operations, which markedly improves performance on complex reasoning problems in the TableQA task. However, these methods lack the versatility to cope with specific question types or table structures. In contrast, the Spreadsheet Formula, the widely used and well-defined operation language for tabular data, has not been thoroughly explored to solve TableQA. In this paper, we make the first attempt to use Formula as the logical form for solving complex reasoning on tables with different structures. Specifically, we construct a large Formula-annotated TableQA dataset, FormulaQA, from existing datasets. In addition, we propose TabAF, a general table answering framework to solve multiple types of tasks over multiple types of tables simultaneously. Unlike existing methods, TabAF decodes answers and Formulas with a single LLM backbone, demonstrating great versatility and generalization. TabAF based on Llama3.1-70B achieves new state-of-the-art performance on WikiTableQuestion, HiTab, and TabFact.

Title: Synthesizing Privacy-Preserving Text Data via Finetuning without Finetuning Billion-Scale LLMs

Authors: Bowen Tan, Zheng Xu, Eric Xing, Zhiting Hu, Shanshan Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.12347
Pdf URL: https://arxiv.org/pdf/2503.12347
Copy Paste: [[2503.12347]] Synthesizing Privacy-Preserving Text Data via Finetuning without Finetuning Billion-Scale LLMs(https://arxiv.org/abs/2503.12347)
Keywords: language model, llm, prompt
Abstract: Synthetic data offers a promising path to train models while preserving data privacy. Differentially private (DP) finetuning of large language models (LLMs) as data generators is effective, but is impractical when computation resources are limited. Meanwhile, prompt-based methods such as private evolution depend heavily on manual prompts and use private information ineffectively in their iterative data selection process. To overcome these limitations, we propose CTCL (Data Synthesis with ConTrollability and CLustering), a novel framework for generating privacy-preserving synthetic data without extensive prompt engineering or billion-scale LLM finetuning. CTCL pretrains a lightweight 140M conditional generator and a clustering-based topic model on large-scale public data. To further adapt to the private domain, the generator is DP finetuned on private data for fine-grained textual information, while the topic model extracts a DP histogram representing distributional information. The DP generator then samples according to the DP histogram to synthesize a desired number of data examples. Evaluation across five diverse domains demonstrates the effectiveness of our framework, particularly in the strong privacy regime. Systematic ablation validates the design of each framework component and highlights the scalability of our approach.
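The "DP histogram, then sample" step can be illustrated with a textbook mechanism. The sketch below adds Laplace noise of scale 1/epsilon to per-topic counts, which gives epsilon-DP under the assumption that adding or removing one record changes exactly one count by one. The abstract does not state which noise mechanism, sensitivity, or privacy budget CTCL uses, so those choices, the function name, and the counts are all assumptions.

```python
import numpy as np

def dp_histogram(counts, epsilon: float, rng: np.random.Generator) -> np.ndarray:
    """Noisy, normalized histogram via the Laplace mechanism (sensitivity assumed 1)."""
    counts = np.asarray(counts, dtype=float)
    noisy = counts + rng.laplace(loc=0.0, scale=1.0 / epsilon, size=counts.shape)
    noisy = np.clip(noisy, 0.0, None)     # post-processing keeps the DP guarantee
    total = noisy.sum()
    if total == 0.0:                      # degenerate case: fall back to uniform
        return np.full(counts.shape, 1.0 / counts.size)
    return noisy / total

rng = np.random.default_rng(0)
probs = dp_histogram([120, 30, 50], epsilon=1.0, rng=rng)   # invented topic counts
topics = rng.choice(len(probs), size=10, p=probs)           # topic index per synthetic example
print(probs, topics)
```

Each sampled topic index would then condition the generator, so the synthetic set follows the private topic distribution only through its noisy histogram.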
摘要：合成数据为训练模型提供了有希望的途径，同时保留了数据隐私。大型语言模型（LLMS）作为数据生成器的差异私有（DP）填充是有效的，但是当计算资源受到限制时是不切实际的。同时，诸如私有进化之类的基于及时的方法在很大程度上取决于手动提示，并在其迭代数据选择过程中无效地使用私人信息。为了克服这些局限性，我们提出了CTCL（具有可控性和聚类的数据综合），这是一个新的框架，用于生成隐私保护的合成数据，而无需大量的及时工程或十亿个LLM Finetuning。 CTCL预处理14000万轻量化的条件发生器和基于聚类的大规模公共数据的主题模型。为了进一步适应私人域，在私人数据上对生成器进行了DP列表，以获取细粒度的文本信息，而主题模型则提取代表分布信息的DP直方图。然后，根据DP直方图，DP发电机进行采样，以合成所需数量的数据示例。跨五个不同领域的评估证明了我们框架的有效性，尤其是在强大的隐私制度中。系统的消融验证了每个框架组件的设计，并突出了我们方法的可扩展性。
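The histogram-then-sample step described in the abstract above can be illustrated with a small sketch. It is not the authors' implementation: the Laplace mechanism, the sensitivity of 1 (each private example falls in exactly one topic), the clipping of negative counts, and the names `dp_histogram` and `allocate_samples` are all assumptions made for illustration, and the topic counts are invented.

```python
import random


def dp_histogram(counts, epsilon, rng):
    """Return a noisy, normalized topic histogram (Laplace mechanism, assumed sensitivity 1)."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    # The difference of two Exp(rate=epsilon) draws is Laplace noise with scale 1/epsilon.
    noisy = [c + rng.expovariate(epsilon) - rng.expovariate(epsilon) for c in counts]
    clipped = [max(0.0, v) for v in noisy]  # post-processing, does not affect the DP guarantee
    total = sum(clipped)
    if total == 0:
        return [1.0 / len(counts)] * len(counts)  # fall back to a uniform histogram
    return [v / total for v in clipped]


def allocate_samples(weights, n, rng):
    """Decide how many of the n synthetic examples to request per topic."""
    topics = rng.choices(range(len(weights)), weights=weights, k=n)
    return [topics.count(t) for t in range(len(weights))]


rng = random.Random(0)
private_topic_counts = [50, 30, 20]  # hypothetical counts, not from the paper
weights = dp_histogram(private_topic_counts, epsilon=1.0, rng=rng)
print(allocate_samples(weights, n=10, rng=rng))
```

In a full pipeline, each per-topic count would then be passed to the conditional generator; that part is omitted here.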

Title: Understanding Common Ground Misalignment in Goal-Oriented Dialog: A Case-Study with Ubuntu Chat Logs

Authors: Rupak Sarkar, Neha Srikanth, Taylor Hudson, Rachel Rudinger, Claire Bonial, Philip Resnik
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.12370
Pdf URL: https://arxiv.org/pdf/2503.12370
Copy Paste: [[2503.12370]] Understanding Common Ground Misalignment in Goal-Oriented Dialog: A Case-Study with Ubuntu Chat Logs(https://arxiv.org/abs/2503.12370)
Keywords: llm, chat
Abstract: While it is commonly accepted that maintaining common ground plays a role in conversational success, little prior research exists connecting conversational grounding to success in task-oriented conversations. We study failures of grounding in the Ubuntu IRC dataset, where participants use text-only communication to resolve technical issues. We find that disruptions in conversational flow often stem from a misalignment in common ground, driven by a divergence in beliefs and assumptions held by participants. These disruptions, which we call conversational friction, significantly correlate with task success. We find that although LLMs can identify overt cases of conversational friction, they struggle with subtler and more context-dependent instances requiring pragmatic or domain-specific reasoning.
摘要：尽管人们普遍认为，保持共同基础在对话成功中起着作用，但几乎没有研究将对话基础与以任务为导向的对话的成功联系起来。我们研究了Ubuntu IRC数据集中接地的失败，参与者使用仅文本通信来解决技术问题。我们发现，对话流的破坏通常源于共同基础的错位，这是由于参与者所拥有的信念和假设的分歧所致。这些中断，我们称之为对话摩擦，与任务成功显着相关。我们发现，尽管LLM可以识别出明显的会话摩擦案例，但它们与细节和更依赖上下文有关的实例挣扎，需要实用或特定于领域的推理。

Title: HKCanto-Eval: A Benchmark for Evaluating Cantonese Language Understanding and Cultural Comprehension in LLMs

Authors: Tsz Chung Cheng, Chung Shing Cheng, Chaak Ming Lau, Eugene Tin-Ho Lam, Chun Yat Wong, Hoi On Yu, Cheuk Hei Chong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.12440
Pdf URL: https://arxiv.org/pdf/2503.12440
Copy Paste: [[2503.12440]] HKCanto-Eval: A Benchmark for Evaluating Cantonese Language Understanding and Cultural Comprehension in LLMs(https://arxiv.org/abs/2503.12440)
Keywords: language model, llm
Abstract: The ability of language models to comprehend and interact in diverse linguistic and cultural landscapes is crucial. The Cantonese language used in Hong Kong presents unique challenges for natural language processing due to its rich cultural nuances and lack of dedicated evaluation datasets. The HKCanto-Eval benchmark addresses this gap by evaluating the performance of large language models (LLMs) on Cantonese language understanding tasks, extending to English and Written Chinese for cross-lingual evaluation. HKCanto-Eval integrates cultural and linguistic nuances intrinsic to Hong Kong, providing a robust framework for assessing language models in realistic scenarios. Additionally, the benchmark includes questions designed to tap into the underlying linguistic metaknowledge of the models. Our findings indicate that while proprietary models generally outperform open-weight models, significant limitations remain in handling Cantonese-specific linguistic and cultural knowledge, highlighting the need for more targeted training data and evaluation methods. The code can be accessed at this https URL
摘要：语言模型在多种语言和文化景观中理解和相互作用的能力至关重要。香港使用的粤语由于其丰富的文化细微差别和缺乏专用的评估数据集而面临自然语言处理的独特挑战。 HKCANTO-EVAL基准测试通过评估大语言模型（LLMS）在广东语言理解任务上的性能（LLMS）的表现，扩展到英语和书面中文进行跨语言评估。 hkcanto-eval融合了香港固有的文化和语言弊端，为在现实情况下评估语言模型提供了一个强大的框架。此外，基准还包括旨在利用模型的潜在语言元的问题。我们的发现表明，尽管专有模型通常超过开放权重模型，但在处理广东话特定的语言和文化知识方面仍然存在重大局限性，强调了对更有针对性的培训数据和评估方法的需求。可以通过此HTTPS URL访问代码

Title: CAKE: Cascading and Adaptive KV Cache Eviction with Layer Preferences

Authors: Ziran Qin, Yuchen Cao, Mingbao Lin, Wen Hu, Shixuan Fan, Ke Cheng, Weiyao Lin, Jianguo Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.12491
Pdf URL: https://arxiv.org/pdf/2503.12491
Copy Paste: [[2503.12491]] CAKE: Cascading and Adaptive KV Cache Eviction with Layer Preferences(https://arxiv.org/abs/2503.12491)
Keywords: language model, llm
Abstract: Large language models (LLMs) excel at processing long sequences, boosting demand for key-value (KV) caching. While recent efforts to evict KV cache have alleviated the inference burden, they often fail to allocate resources rationally across layers with different attention patterns. In this paper, we introduce Cascading and Adaptive KV cache Eviction (CAKE), a novel approach that frames KV cache eviction as a "cake-slicing problem." CAKE assesses layer-specific preferences by considering attention dynamics in both spatial and temporal dimensions, allocates rational cache size for layers accordingly, and manages memory constraints in a cascading manner. This approach enables a global view of cache allocation, adaptively distributing resources across diverse attention mechanisms while maintaining memory budgets. CAKE also employs a new eviction indicator that considers the shifting importance of tokens over time, addressing limitations in existing methods that overlook temporal dynamics. Comprehensive experiments on LongBench and NeedleBench show that CAKE maintains model performance with only 3.2% of the KV cache and consistently outperforms current baselines across various models and memory constraints, particularly in low-memory settings. Additionally, CAKE achieves over 10x speedup in decoding latency compared to full cache when processing contexts of 128K tokens with FlashAttention-2. Our code is available at this https URL.
摘要：大型语言模型（LLMS）在处理长序列方面表现出色，从而提高对键值（KV）缓存的需求。尽管最近驱逐KV缓存的努力减轻了推论负担，但他们通常无法在具有不同注意力模式的层次上合理地分配资源。在本文中，我们介绍了级联和自适应KV缓存驱逐（蛋糕），这是一种新颖的方法，将KV缓存驱逐作为“蛋糕切片问题”。 Cake通过考虑空间和时间维度的注意力动态来评估层特定的偏好，对层次分配理性的缓存大小，并以级联的方式管理内存约束。这种方法可以使全球性的缓存分配观点，在维持记忆预算的同时，将资源自适应地分配资源。 Cake还采用了一个新的驱逐指标，该指标认为令牌随着时间的流逝的重要性，解决了忽略时间动态的现有方法的局限性。在Longbench和Needleanch上进行的综合实验表明，Cake只有3.2％的KV缓存维护模型性能，并且在各种型号和内存约束上，尤其是在低内存设置中，始终超过当前基线。此外，与完整的缓存相比，在处理128K令牌的情况下，蛋糕在解码潜伏期的解码延迟中达到了超过10倍的速度。我们的代码可在此HTTPS URL上找到。

Title: EXAONE Deep: Reasoning Enhanced Language Models

Authors: LG AI Research, Kyunghoon Bae, Eunbi Choi, Kibong Choi, Stanley Jungkyu Choi, Yemuk Choi, Seokhee Hong, Junwon Hwang, Hyojin Jeon, Kijeong Jeon, Gerrard Jeongwon Jo, Hyunjik Jo, Jiyeon Jung, Hyosang Kim, Joonkee Kim, Seonghwan Kim, Soyeon Kim, Sunkyoung Kim, Yireun Kim, Yongil Kim, Youchul Kim, Edward Hwayoung Lee, Haeju Lee, Honglak Lee, Jinsik Lee, Kyungmin Lee, Sangha Park, Yongmin Park, Sihoon Yang, Heuiyeen Yeen, Sihyuk Yi, Hyeongu Yun
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.12524
Pdf URL: https://arxiv.org/pdf/2503.12524
Copy Paste: [[2503.12524]] EXAONE Deep: Reasoning Enhanced Language Models(https://arxiv.org/abs/2503.12524)
Keywords: language model
Abstract: We present the EXAONE Deep series, which exhibits superior capabilities in various reasoning tasks, including math and coding benchmarks. We train our models mainly on a reasoning-specialized dataset that incorporates long streams of thought processes. Evaluation results show that our smaller models, EXAONE Deep 2.4B and 7.8B, outperform other models of comparable size, while the largest model, EXAONE Deep 32B, demonstrates competitive performance against leading open-weight models. All EXAONE Deep models are openly available for research purposes and can be downloaded from this https URL
摘要：我们提出了Exaone Deep系列，该系列在包括数学和编码基准在内的各种推理任务中表现出卓越的功能。我们主要在结合了长长的思考过程的推理专题数据集上训练模型。评估结果表明，我们较小的模型，Exaone Deep 2.4b和7.8b，超过其他大小的模型，而最大的模型Exaone Deep 32B则表明了与领先的开放重量模型相对的竞争性能。所有Exaone Deep模型均可用于研究目的，可以从此HTTPS URL下载

Title: Investigating Human-Aligned Large Language Model Uncertainty

Authors: Kyle Moore, Jesse Roberts, Daryl Watson, Pamela Wisniewski
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.12528
Pdf URL: https://arxiv.org/pdf/2503.12528
Copy Paste: [[2503.12528]] Investigating Human-Aligned Large Language Model Uncertainty(https://arxiv.org/abs/2503.12528)
Keywords: language model
Abstract: Recent work has sought to quantify large language model uncertainty to facilitate model control and modulate user trust. Previous works focus on measures of uncertainty that are theoretically grounded or reflect the average overt behavior of the model. In this work, we investigate a variety of uncertainty measures in order to identify measures that correlate with human group-level uncertainty. We find that Bayesian measures and a variation on entropy measures, top-k entropy, tend to agree with human behavior as a function of model size. We find that some strong measures decrease in human-similarity with model size, but, by multiple linear regression, we find that combining multiple uncertainty measures provides comparable human-alignment with reduced size-dependency.
摘要：最近的工作试图量化大型语言模型不确定性，以促进模型控制并调节用户信任。先前的作品着重于理论上基础或反映模型平均公开行为的不确定性度量。在这项工作中，我们研究了各种不确定性度量，以确定与人类级别不确定性相关的措施。我们发现，贝叶斯的措施和熵措施的变化，Top-K熵倾向于与人类的行为同意于模型大小的函数。我们发现，与模型大小相似的某些强度措施减少了，但是，通过多个线性回归，我们发现将多种不确定性度量相结合提供了可比的人类对准与尺寸依赖性降低。
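The abstract above names top-k entropy without defining it. One plausible reading, sketched below, is the Shannon entropy of the k most probable outcomes after renormalizing them to sum to 1. The renormalization, the use of nats, and the function name `top_k_entropy` are assumptions, and the example distributions are invented.

```python
import math


def top_k_entropy(probs, k):
    """Shannon entropy (in nats) of the k largest probabilities, renormalized to sum to 1."""
    if k < 1:
        raise ValueError("k must be at least 1")
    top = sorted(probs, reverse=True)[:k]
    total = sum(top)
    if total <= 0:
        raise ValueError("top-k probabilities must have positive mass")
    return -sum((p / total) * math.log(p / total) for p in top if p > 0)


# Hypothetical next-token distributions (not taken from the paper).
confident = [0.90, 0.05, 0.03, 0.02]
uncertain = [0.25, 0.25, 0.25, 0.25]
print(top_k_entropy(confident, k=2))
print(top_k_entropy(uncertain, k=2))
```

Restricting the sum to the top k outcomes keeps the long tail of a large vocabulary from dominating the measure.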

Title: Basic Category Usage in Vision Language Models

Authors: Hunter Sawyer, Jesse Roberts, Kyle Moore
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.12530
Pdf URL: https://arxiv.org/pdf/2503.12530
Copy Paste: [[2503.12530]] Basic Category Usage in Vision Language Models(https://arxiv.org/abs/2503.12530)
Keywords: language model
Abstract: The field of psychology has long recognized a basic level of categorization that humans use when labeling visual stimuli, a term coined by Rosch in 1976. This level of categorization has been found to be used most frequently, to have higher information density, and to aid in visual language tasks with priming in humans. Here, we investigate basic level categorization in two recently released, open-source vision-language models (VLMs). This paper demonstrates that Llama 3.2 Vision Instruct (11B) and Molmo 7B-D both prefer basic level categorization consistent with human behavior. Moreover, the models' preferences are consistent with nuanced human behaviors like the biological versus non-biological basic level effects and the well-established expert basic level shift, further suggesting that VLMs acquire cognitive categorization behaviors from the human data on which they are trained.
摘要：长期以来，心理学领域已经认识到，人类在标记视觉刺激时使用的基本分类水平，这是Rosch在1976年创造的一个术语。已经发现，这种分类水平最常使用，以具有更高的信息密度，并有助于在人类中启动的视觉语言任务。在这里，我们研究了两个最近发布的开源视觉语言模型（VLM）中的基本级别分类。本文表明，Llama 3.2视觉指导（11b）和Molmo 7b-D都倾向于与人类行为一致的基本水平分类。此外，模型的偏好与细微的人类行为一致，例如生物学和非生物学基本水平效应以及良好的专家基本水平变化，进一步表明VLMS从训练的人类数据中获得了认知分类行为。

Title: From Guessing to Asking: An Approach to Resolving the Persona Knowledge Gap in LLMs during Multi-Turn Conversations

Authors: Sarvesh Baskar, Tanmay Tulsidas Verelakar, Srinivasan Parthasarathy, Manas Gaur
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.12556
Pdf URL: https://arxiv.org/pdf/2503.12556
Copy Paste: [[2503.12556]] From Guessing to Asking: An Approach to Resolving the Persona Knowledge Gap in LLMs during Multi-Turn Conversations(https://arxiv.org/abs/2503.12556)
Keywords: language model, llm
Abstract: In multi-turn dialogues, large language models (LLMs) face the critical challenge of ensuring coherence while adapting to user-specific information. This study introduces the persona knowledge gap, the discrepancy between a model's internal understanding and the knowledge required for coherent, personalized conversations. While prior research has recognized these gaps, computational methods for their identification and resolution remain underexplored. We propose Conversation Preference Elicitation and Recommendation (CPER), a novel framework that dynamically detects and resolves persona knowledge gaps using intrinsic uncertainty quantification and feedback-driven refinement. CPER consists of three key modules: a Contextual Understanding Module for preference extraction, a Dynamic Feedback Module for measuring uncertainty and refining persona alignment, and a Persona-Driven Response Generation module for adapting responses based on accumulated user context. We evaluate CPER on two real-world datasets: CCPE-M for preferential movie recommendations and ESConv for mental health support. Using A/B testing, human evaluators preferred CPER's responses 42% more often than baseline models in CCPE-M and 27% more often in ESConv. A qualitative human evaluation confirms that CPER's responses are preferred for maintaining contextual relevance and coherence, particularly in longer (12+ turn) conversations.
摘要：在多转话的对话中，大型语言模型（LLM）面临着确保连贯性的关键挑战，同时适应了特定于用户的信息。这项研究介绍了角色知识差距，模型的内部理解与连贯的个性化对话所需的知识之间的差异。尽管先前的研究已经识别出这些差距，但其识别和解决方案的计算方法仍未得到充实。我们提出了对话偏好启发和推荐（CPER），这是一个新型框架，该框架使用固有的不确定性量化和反馈驱动的改进来动态检测和解决角色知识差距。 CPER由三个关键模块组成：一个用于偏好提取的上下文理解模块，用于测量不确定性和精炼角色对准的动态反馈模块，以及通过累积用户上下文来调整响应的角色驱动的响应生成模块。我们在两个现实世界数据集上评估了CPER：CCPE-M，以获得优先电影建议，以及用于心理健康支持的eSconv。使用A/B测试，人类评估人员比CCPE-M中的基线模型更偏爱CPER的反应，而在ESCONV中的频率高27％。定性的人类评估证实，CPER的反应是维持上下文相关性和连贯性的首选，尤其是在更长的（12个以上）对话中。

Title: RaSA: Rank-Sharing Low-Rank Adaptation

Authors: Zhiwei He, Zhaopeng Tu, Xing Wang, Xingyu Chen, Zhijie Wang, Jiahao Xu, Tian Liang, Wenxiang Jiao, Zhuosheng Zhang, Rui Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.12576
Pdf URL: https://arxiv.org/pdf/2503.12576
Copy Paste: [[2503.12576]] RaSA: Rank-Sharing Low-Rank Adaptation(https://arxiv.org/abs/2503.12576)
Keywords: language model, llm
Abstract: Low-rank adaptation (LoRA) has been prominently employed for parameter-efficient fine-tuning of large language models (LLMs). However, the limited expressive capacity of LoRA, stemming from the low-rank constraint, has been recognized as a bottleneck, particularly in rigorous tasks like code generation and mathematical reasoning. To address this limitation, we introduce Rank-Sharing Low-Rank Adaptation (RaSA), an innovative extension that enhances the expressive capacity of LoRA by leveraging partial rank sharing across layers. By forming a shared rank pool and applying layer-specific weighting, RaSA effectively increases the number of ranks without augmenting parameter overhead. Our theoretically grounded and empirically validated approach demonstrates that RaSA not only maintains the core advantages of LoRA but also significantly boosts performance in challenging code and math tasks. Code, data and scripts are available at: this https URL.
摘要：低级适应性（LORA）已突出用于大型语言模型（LLMS）的参数效率微调。但是，源自低级约束的洛拉的表达能力有限，被认为是瓶颈，尤其是在诸如代码生成和数学推理之类的严格任务中。为了解决这一限制，我们引入了排名共享的低级适应（RASA），这是一种创新的扩展，通过利用跨层的部分排名共享来增强洛拉的表现能力。通过形成共享的等级池并应用特定层的加权，RASA有效地增加了排名数量而无需增强参数开销。我们理论上扎根和经验验证的方法表明，RASA不仅保持了洛拉的核心优势，而且还大大提高了具有挑战性的代码和数学任务中的性能。代码，数据和脚本可用：此HTTPS URL。

Title: UniBERTs: Adversarial Training for Language-Universal Representations

Authors: Andrei-Marius Avram, Marian Lupaşcu, Dumitru-Clementin Cercel, Ionuţ Mironică, Ştefan Trăuşan-Matu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.12608
Pdf URL: https://arxiv.org/pdf/2503.12608
Copy Paste: [[2503.12608]] UniBERTs: Adversarial Training for Language-Universal Representations(https://arxiv.org/abs/2503.12608)
Keywords: language model
Abstract: This paper presents UniBERT, a compact multilingual language model that leverages an innovative training framework integrating three components: masked language modeling, adversarial training, and knowledge distillation. Pre-trained on a meticulously curated Wikipedia corpus spanning 107 languages, UniBERT is designed to reduce the computational demands of large-scale models while maintaining competitive performance across various natural language processing tasks. Comprehensive evaluations on four tasks -- named entity recognition, natural language inference, question answering, and semantic textual similarity -- demonstrate that our multilingual training strategy enhanced by an adversarial objective significantly improves cross-lingual generalization. Specifically, UniBERT models show an average relative improvement of 7.72% over traditional baselines, which achieved an average relative improvement of only 1.17%, with statistical analysis confirming the significance of these gains (p-value = 0.0181). This work highlights the benefits of combining adversarial training and knowledge distillation to build scalable and robust language models, thereby advancing the field of multilingual and cross-lingual natural language processing.
摘要：本文介绍了Unibert是一种紧凑的多语言模型，该模型利用创新的培训框架整合了三个组成部分：蒙版语言建模，对抗性培训和知识蒸馏。 Unibert旨在减少大规模模型的计算需求，同时保持各种自然语言处理任务的竞争性能，从而在精心策划的Wikipedia语料库中进行了预先培训的Wikipedia语料库。对四个任务的全面评估 - 指定实体识别，自然语言推论，问题答案和语义文本相似性 - 表明，我们的多语言培训策略通过对抗性客观增强了，可以显着改善跨语性的概括。具体而言，UNIBERT模型的平均相对改善比传统基线的平均相对改善为7.72％，这仅达到了仅1.17％的平均相对改善，统计分析证实了这些收益的重要性（P值= 0.0181）。这项工作突出了结合对抗性训练和知识蒸馏以构建可扩展和强大的语言模型的好处，从而推进了多语言和跨语性自然语言处理的领域。

Title: Plausibility Vaccine: Injecting LLM Knowledge for Event Plausibility

Authors: Jacob Chmura, Jonah Dauvet, Sebastian Sabry
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.12667
Pdf URL: https://arxiv.org/pdf/2503.12667
Copy Paste: [[2503.12667]] Plausibility Vaccine: Injecting LLM Knowledge for Event Plausibility(https://arxiv.org/abs/2503.12667)
Keywords: language model, llm, prompt
Abstract: Despite advances in language modelling, distributional methods that build semantic representations from co-occurrences fail to discriminate between plausible and implausible events. In this work, we investigate how plausibility prediction can be improved by injecting latent knowledge prompted from large language models using parameter-efficient fine-tuning. We train 12 task adapters to learn various physical properties and association measures and perform adapter fusion to compose latent semantic knowledge from each task on top of pre-trained ALBERT embeddings. We automate auxiliary task data generation, which enables us to scale our approach and fine-tune our learned representations across two plausibility datasets. Our code is available at this https URL.
摘要：尽管语言建模方面取得了进步，但构建同时发生的语义表示的分布方法未能区分合理和令人难以置信的事件。在这项工作中，我们研究了如何通过使用参数有效的微调从大语言模型中注入潜在知识来改善合理性预测。我们训练12个任务适配器学习各种物理属性和关联度量，并执行适配器融合，以在预先训练的Albert嵌入方式上从每个任务中构成潜在的语义知识。我们可以自动化辅助任务数据生成，这使我们能够扩展方法并在两个合理性数据集中微调我们学习的表示形式。我们的代码可在此HTTPS URL上找到。

Title: RAG-RL: Advancing Retrieval-Augmented Generation via RL and Curriculum Learning

Authors: Jerry Huang, Siddarth Madala, Risham Sidhu, Cheng Niu, Julia Hockenmaier, Tong Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.12759
Pdf URL: https://arxiv.org/pdf/2503.12759
Copy Paste: [[2503.12759]] RAG-RL: Advancing Retrieval-Augmented Generation via RL and Curriculum Learning(https://arxiv.org/abs/2503.12759)
Keywords: language model, retrieval-augmented generation
Abstract: Recent research highlights the challenges retrieval models face in retrieving useful contexts and the limitations of generation models in effectively utilizing those contexts in retrieval-augmented generation (RAG) settings. To address these challenges, we introduce RAG-RL, the first reasoning language model (RLM) specifically trained for RAG. RAG-RL demonstrates that stronger answer generation models can identify relevant contexts within larger sets of retrieved information -- thereby alleviating the burden on retrievers -- while also being able to utilize those contexts more effectively. Moreover, we show that curriculum design in the reinforcement learning (RL) post-training process is a powerful approach to enhancing model performance. We benchmark our method on two open-domain question-answering datasets and achieve state-of-the-art results, surpassing previous SOTA generative reader models. In addition, we offer empirical insights into various curriculum learning strategies, providing a deeper understanding of their impact on model performance.
摘要：最近的研究突出了检索模型在检索有用的环境和生成模型的局限性方面面临的挑战，在有效利用这些环境的局限性中，在检索型发电（RAG）设置中。为了应对这些挑战，我们介绍了RAG-RL，这是专门针对RAG培训的第一个推理语言模型（RLM）。 RAG-RL表明，更强大的答案生成模型可以在较大的检索信息集中识别相关的上下文，从而减轻了回收者的负担，同时也能够更有效地利用这些上下文。此外，我们表明，加固学习（RL）培训过程中的课程设计是增强模型性能的强大方法。我们在两个开放域提问数据集上基准我们的方法，并获得最新的结果，超过了先前的SOTA生成读取器模型。此外，我们还提供了有关各种课程学习策略的经验见解，从而更深入地了解了它们对模型性能的影响。

Title: Enhancing LLM Reasoning with Iterative DPO: A Comprehensive Empirical Investigation

Authors: Songjun Tu, Jiahao Lin, Xiangyu Tian, Qichao Zhang, Linjing Li, Yuqian Fu, Nan Xu, Wei He, Xiangyuan Lan, Dongmei Jiang, Dongbin Zhao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.12854
Pdf URL: https://arxiv.org/pdf/2503.12854
Copy Paste: [[2503.12854]] Enhancing LLM Reasoning with Iterative DPO: A Comprehensive Empirical Investigation(https://arxiv.org/abs/2503.12854)
Keywords: language model, llm
Abstract: Recent advancements in post-training methodologies for large language models (LLMs) have highlighted reinforcement learning (RL) as a critical component for enhancing reasoning. However, the substantial computational costs associated with RL-based approaches have led to growing interest in alternative paradigms, such as Direct Preference Optimization (DPO). In this study, we investigate the effectiveness of DPO in facilitating self-improvement for LLMs through iterative preference-based learning. We demonstrate that a single round of DPO with coarse filtering significantly enhances mathematical reasoning performance, particularly for strong base models. Furthermore, we design an iterative enhancement framework for both the generator and the reward model (RM), enabling their mutual improvement through online interaction across multiple rounds of DPO. Finally, with simple verifiable rewards, our model DPO-VP achieves RL-level performance with significantly lower computational overhead. These findings highlight DPO as a scalable and cost-effective alternative to RL, offering a practical solution for enhancing LLM reasoning in resource-constrained situations.
摘要：大型语言模型（LLM）训练后方法的最新进展强调了强化学习（RL）是增强推理的关键组成部分。但是，与基于RL的方法相关的大量计算成本导致人们对替代范式（例如直接偏好优化（DPO））的兴趣日益增加。在这项研究中，我们研究了DPO通过基于迭代偏好的学习来促进LLM的自我改善的有效性。我们证明，具有粗滤波的一轮DPO显着提高了数学推理性能，尤其是对于强大的基本模型。此外，我们为发电机和奖励模型（RM）设计了一个迭代增强框架，从而通过多个DPO进行在线互动，从而使它们相互改进。最后，有了简单的可验证奖励，我们的模型DPO-VP以明显较低的计算开销来实现RL级别的性能。这些发现突出了DPO是RL的可扩展性且具有成本效益的替代品，它提供了一种实用解决方案，可增强资源约束情况下的LLM推理。

Title: nvBench 2.0: A Benchmark for Natural Language to Visualization under Ambiguity

Authors: Tianqi Luo, Chuhan Huang, Leixian Shen, Boyan Li, Shuyu Shen, Wei Zeng, Nan Tang, Yuyu Luo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.12880
Pdf URL: https://arxiv.org/pdf/2503.12880
Copy Paste: [[2503.12880]] nvBench 2.0: A Benchmark for Natural Language to Visualization under Ambiguity(https://arxiv.org/abs/2503.12880)
Keywords: language model, llm
Abstract: Natural Language to Visualization (NL2VIS) enables users to create visualizations from natural language queries, making data insights more accessible. However, NL2VIS faces challenges in interpreting ambiguous queries, as users often express their visualization needs in imprecise language. To address this challenge, we introduce nvBench 2.0, a new benchmark designed to evaluate NL2VIS systems in scenarios involving ambiguous queries. nvBench 2.0 includes 7,878 natural language queries and 24,076 corresponding visualizations, derived from 780 tables across 153 domains. It is built using a controlled ambiguity-injection pipeline that generates ambiguous queries through a reverse-generation workflow. By starting with unambiguous seed visualizations and selectively injecting ambiguities, the pipeline yields multiple valid interpretations for each query, with each ambiguous query traceable to its corresponding visualization through step-wise reasoning paths. We evaluate various Large Language Models (LLMs) on their ability to perform ambiguous NL2VIS tasks using nvBench 2.0. We also propose Step-NL2VIS, an LLM-based model trained on nvBench 2.0, which enhances performance in ambiguous scenarios through step-wise preference optimization. Our results show that Step-NL2VIS outperforms all baselines, setting a new state-of-the-art for ambiguous NL2VIS tasks.
摘要：自然语言到可视化（NL2VIS）使用户能够从自然语言查询中创建可视化，从而使数据见解更容易访问。但是，NL2VI在解释模棱两可的查询方面面临挑战，因为用户经常以不精确的语言表达其可视化需求。为了应对这一挑战，我们引入了NVBench 2.0，这是一种新的基准测试，旨在评估涉及模棱两可查询的情况下的NL2VIS系统。 NVBENCH 2.0包括7,878个自然语言查询和24,076个相应的可视化，这些可视化量来自153个域中的780个表。它是使用受控的歧义注射管道构建的，该管道通过反向生成工作流来生成模棱两可的查询。通过明确的种子可视化和选择性地注入歧义，该管道可以为每个查询产生多个有效的解释，每个歧义查询都可以追溯到其通过逐步推理路径相应可视化的可视化。我们评估了各种大型语言模型（LLMS）使用NVBench 2.0执行模棱两可的NL2VIS任务的能力。我们还提出了STEP-NL2VIS，这是一种基于LLM的模型，该模型在NVBench 2.0上训练，可以通过逐步偏好优化在模棱两可的方案中提高性能。我们的结果表明，Step-NL2Vis的表现优于所有基准，为模棱两可的NL2VIS任务设定了新的最新技术。

Title: HICD: Hallucination-Inducing via Attention Dispersion for Contrastive Decoding to Mitigate Hallucinations in Large Language Models

Authors: Xinyan Jiang, Hang Ye, Yongxin Zhu, Xiaoying Zheng, Zikang Chen, Jun Gong
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.12908
Pdf URL: https://arxiv.org/pdf/2503.12908
Copy Paste: [[2503.12908]] HICD: Hallucination-Inducing via Attention Dispersion for Contrastive Decoding to Mitigate Hallucinations in Large Language Models(https://arxiv.org/abs/2503.12908)
Keywords: language model, llm, hallucination
Abstract: Large Language Models (LLMs) often generate hallucinations, producing outputs that are contextually inaccurate or factually incorrect. We introduce HICD, a novel method designed to induce hallucinations for contrastive decoding to mitigate hallucinations. Unlike existing contrastive decoding methods, HICD selects attention heads crucial to the model's prediction as inducing heads, then induces hallucinations by dispersing attention of these inducing heads and compares the hallucinated outputs with the original outputs to obtain the final result. Our approach significantly improves performance on tasks requiring contextual faithfulness, such as context completion, reading comprehension, and question answering. It also improves factuality in tasks requiring accurate knowledge recall. We demonstrate that our inducing heads selection and attention dispersion method leads to more "contrast-effective" hallucinations for contrastive decoding, outperforming other hallucination-inducing methods. Our findings provide a promising strategy for reducing hallucinations by inducing hallucinations in a controlled manner, enhancing the performance of LLMs in a wide range of tasks.
摘要：大型语言模型（LLMS）通常会产生幻觉，产生上下文不准确或事实不正确的输出。我们介绍了HICD，这是一种旨在诱导幻觉来对比解码以减轻幻觉的幻觉。与现有的对比解码方法不同，HICD选择了注意力头对模型的预测至关重要，因为诱导头部诱导了幻觉，从而诱导了幻觉，从而将这些引起的头部的注意力分散，并将幻觉的输出与原始输出进行比较，以获得最终的结果。我们的方法大大提高了需要上下文忠诚的任务的绩效，例如上下文完成，阅读理解和问答。它还改善了需要准确知识回忆的任务中的事实。我们证明，我们的诱导头部选择和注意分散方法会导致对比度解码的“对比度有效”的幻觉，表现优于其他引起幻觉的方法。我们的发现提供了一种有希望的策略，可以通过以受控方式诱导幻觉来减少幻觉，从而在各种任务中提高LLM的性能。
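The abstract above says the hallucinated outputs are compared with the original outputs but does not state the scoring rule. The sketch below uses a form common in the contrastive decoding literature, (1 + alpha) * log p_original - alpha * log p_induced, purely as an assumption. The function names, the `alpha` parameter, and the logits are hypothetical, and the attention dispersion step that would produce the induced logits is not modeled.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def contrastive_scores(orig_logits, induced_logits, alpha=1.0):
    """Boost tokens the original pass prefers more strongly than the hallucination-induced pass."""
    lo = log_softmax(orig_logits)
    li = log_softmax(induced_logits)
    return [(1 + alpha) * a - alpha * b for a, b in zip(lo, li)]


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


# Hypothetical logits over a 3-token vocabulary (not from the paper).
original = [2.0, 1.9, 0.1]  # pass with normal attention
induced = [2.0, 0.5, 0.1]   # pass with dispersed attention on the inducing heads
print(argmax(original))
print(argmax(contrastive_scores(original, induced)))
```

In this toy case token 1 loses most of its support once attention is dispersed, so the contrast promotes it over token 0, which both passes favor equally.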

Title: ThinkPatterns-21k: A Systematic Study on the Impact of Thinking Patterns in LLMs

Authors: Pengcheng Wen, Jiaming Ji, Chi-Min Chan, Juntao Dai, Donghai Hong, Yaodong Yang, Sirui Han, Yike Guo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.12918
Pdf URL: https://arxiv.org/pdf/2503.12918
Copy Paste: [[2503.12918]] ThinkPatterns-21k: A Systematic Study on the Impact of Thinking Patterns in LLMs(https://arxiv.org/abs/2503.12918)
Keywords: language model, llm
Abstract: Large language models (LLMs) have demonstrated enhanced performance through the "Thinking then Responding" paradigm, where models generate internal thoughts before final responses (aka System 2 thinking). However, existing research lacks a systematic understanding of the mechanisms underlying how thinking patterns affect performance across model sizes. In this work, we conduct a comprehensive analysis of the impact of various thinking types on model performance and introduce ThinkPatterns-21k, a curated dataset comprising 21k instruction-response pairs (QA) collected from existing instruction-following datasets with five thinking types. For each pair, we augment it with five distinct internal thinking patterns: one unstructured thinking (monologue) and four structured variants (decomposition, self-ask, self-debate and self-critic), while maintaining the same instruction and response. Through extensive evaluation across different model sizes (3B-32B parameters), we have two key findings: (1) smaller models (<30B parameters) can benefit from most structured thinking patterns, while larger models (32B) with structured thinking like decomposition would degrade performance, and (2) unstructured monologue demonstrates broad effectiveness across different model sizes. Finally, we release all of our datasets, checkpoints, and training logs for the diverse thinking patterns to support reproducibility, aiming to facilitate further research in this direction.
摘要：大型语言模型（LLM）通过“先思考后响应”（Thinking then Responding）范式证明了增强的性能，其中模型在最终响应之前会产生内部思想（又称，系统2思维）。但是，现有的研究缺乏对思维方式如何影响模型大小的性能的系统的系统理解。在这项工作中，我们对各种思维类型对模型性能的影响进行了全面分析，并介绍了ThinkPatterns-21K，这是一个策划的数据集，其中包括从现有的指令遵守数据集中收集的21K指令 - 响应对（QA），该数据集使用五种思维类型。对于每对，我们都会使用五种不同的内部思维模式来扩展它：一种非结构化思维（独白）和四个结构化变体（分解，自我掩盖，自我掩盖和自我批评），同时保持相同的指导和响应。通过跨不同模型大小（3B-32B参数）进行广泛的评估，我们有两个关键的发现：（1）较小的模型（<30b参数）可以从大多数结构化思维模式中受益，而具有分解等结构化思维的较大模型（32B）会降低性能，并且（2）非结构化的独白范围跨不同模型均表明不同模型的效率广泛。最后，我们发布了所有数据集，检查站，各种思维模式的培训日志，以促进该方向的进一步研究。

Title: HiDe-LLaVA: Hierarchical Decoupling for Continual Instruction Tuning of Multimodal Large Language Model

Authors: Haiyang Guo, Fanhu Zeng, Ziwei Xiang, Fei Zhu, Da-Han Wang, Xu-Yao Zhang, Cheng-Lin Liu
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2503.12941
Pdf URL: https://arxiv.org/pdf/2503.12941
Copy Paste: [[2503.12941]] HiDe-LLaVA: Hierarchical Decoupling for Continual Instruction Tuning of Multimodal Large Language Model(https://arxiv.org/abs/2503.12941)
Keywords: language model, llm
Abstract: Instruction tuning is widely used to improve a pre-trained Multimodal Large Language Model (MLLM) by training it on curated task-specific datasets, enabling better comprehension of human instructions. However, it is infeasible to collect all possible instruction datasets simultaneously in real-world scenarios. Thus, equipping MLLMs with continual instruction tuning is essential for maintaining their adaptability. However, existing methods often trade off memory efficiency for performance gains, significantly compromising overall efficiency. In this paper, we propose a task-specific expansion and task-general fusion framework based on the variations in Centered Kernel Alignment (CKA) similarity across different model layers when trained on diverse datasets. Furthermore, we analyze the information leakage present in the existing benchmark and propose a new and more challenging benchmark to rationally evaluate the performance of different methods. Comprehensive experiments showcase a significant performance improvement of our method compared to existing state-of-the-art methods. Our code will be publicly available.
摘要：指令调整被广泛用于改善预训练的多模式大语言模型（MLLM），通过在精选的特定任务数据集中训练它，从而更好地理解人类指令。但是，在实际情况下同时收集所有可能的指令数据集是不可行的。因此，启用具有连续指导调整的MLLM对于保持其适应性至关重要。但是，现有方法通常将记忆效率换成绩效提高，从而极大地损害了总体效率。在本文中，我们提出了一个特定于任务的扩展和任务融合框架，基于对不同模型层进行培训的不同模型层的中心内核比对（CKA）相似性的变化。此外，我们分析了现有基准中存在的信息泄漏，并提出了一种新的，更具挑战性的基准，以合理地评估不同方法的性能。与现有的最新方法相比，全面的实验展示了我们方法的显着性能提高。我们的代码将公开可用。

Title: A Multi-Stage Framework with Taxonomy-Guided Reasoning for Occupation Classification Using Large Language Models

Authors: Palakorn Achananuparp, Ee-Peng Lim
Subjects: cs.CL, cs.AI, cs.SI
Abstract URL: https://arxiv.org/abs/2503.12989
Pdf URL: https://arxiv.org/pdf/2503.12989
Copy Paste: [[2503.12989]] A Multi-Stage Framework with Taxonomy-Guided Reasoning for Occupation Classification Using Large Language Models(https://arxiv.org/abs/2503.12989)
Keywords: language model, llm
Abstract: Automatically annotating job data with standardized occupations from taxonomies, known as occupation classification, is crucial for labor market analysis. However, this task is often hindered by data scarcity and the challenges of manual annotations. While large language models (LLMs) hold promise due to their extensive world knowledge and in-context learning capabilities, their effectiveness depends on their knowledge of occupational taxonomies, which remains unclear. In this study, we assess the ability of LLMs to generate precise taxonomic entities from a taxonomy, highlighting their limitations. To address these challenges, we propose a multi-stage framework consisting of inference, retrieval, and reranking stages, which integrates taxonomy-guided reasoning examples to enhance performance by aligning outputs with taxonomic knowledge. Evaluations on a large-scale dataset show significant improvements in classification accuracy. Furthermore, we demonstrate the framework's adaptability for multi-label skill classification. Our results indicate that the framework outperforms existing LLM-based methods, offering a practical and scalable solution for occupation classification and related tasks across LLMs.

Title: Overview of the NTCIR-18 Automatic Evaluation of LLMs (AEOLLM) Task

Authors: Junjie Chen, Haitao Li, Zhumin Chu, Yiqun Liu, Qingyao Ai
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.13038
Pdf URL: https://arxiv.org/pdf/2503.13038
Copy Paste: [[2503.13038]] Overview of the NTCIR-18 Automatic Evaluation of LLMs (AEOLLM) Task(https://arxiv.org/abs/2503.13038)
Keywords: language model, llm
Abstract: In this paper, we provide an overview of the NTCIR-18 Automatic Evaluation of LLMs (AEOLLM) task. As large language models (LLMs) grow popular in both academia and industry, how to effectively evaluate the capacity of LLMs becomes an increasingly critical but still challenging issue. Existing methods can be divided into two types: manual evaluation, which is expensive, and automatic evaluation, which faces many limitations, including task format (most tasks are multiple-choice questions) and evaluation criteria (dominated by reference-based metrics). To advance the innovation of automatic evaluation, we propose the AEOLLM task, which focuses on generative tasks and encourages reference-free methods. In addition, we set up diverse subtasks such as dialogue generation, text expansion, summary generation, and non-factoid question answering to comprehensively test different methods. This year, we received 48 runs from 4 teams in total. This paper describes the background of the task, the data set, the evaluation measures, and the evaluation results.

Title: A Framework to Assess Multilingual Vulnerabilities of LLMs

Authors: Likai Tang, Niruth Bogahawatta, Yasod Ginige, Jiarui Xu, Shixuan Sun, Surangika Ranathunga, Suranga Seneviratne
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.13081
Pdf URL: https://arxiv.org/pdf/2503.13081
Copy Paste: [[2503.13081]] A Framework to Assess Multilingual Vulnerabilities of LLMs(https://arxiv.org/abs/2503.13081)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) are acquiring a wider range of capabilities, including understanding and responding in multiple languages. While they undergo safety training to prevent them from answering illegal questions, imbalances in training data and human evaluation resources can make these models more susceptible to attacks in low-resource languages (LRL). This paper proposes a framework to automatically assess the multilingual vulnerabilities of commonly used LLMs. Using our framework, we evaluated six LLMs across eight languages representing varying levels of resource availability. We validated the assessments generated by our automated framework through human evaluation in two languages, demonstrating that the framework's results align with human judgments in most cases. Our findings reveal vulnerabilities in LRL; however, these may pose minimal risk as they often stem from the model's poor performance, resulting in incoherent responses.

Title: ClusComp: A Simple Paradigm for Model Compression and Efficient Finetuning

Authors: Baohao Liao, Christian Herold, Seyyed Hadi Hashemi, Stefan Vasilev, Shahram Khadivi, Christof Monz
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.13089
Pdf URL: https://arxiv.org/pdf/2503.13089
Copy Paste: [[2503.13089]] ClusComp: A Simple Paradigm for Model Compression and Efficient Finetuning(https://arxiv.org/abs/2503.13089)
Keywords: language model, llm
Abstract: As large language models (LLMs) scale, model compression is crucial for edge deployment and accessibility. Weight-only quantization reduces model size but suffers from performance degradation at lower bit widths. Moreover, standard finetuning is incompatible with quantized models, and alternative methods often fall short of full finetuning. In this paper, we propose ClusComp, a simple yet effective compression paradigm that clusters weight matrices into codebooks and finetunes them block-by-block. ClusComp (1) achieves superior performance in 2-4 bit quantization, (2) pushes compression to 1-bit while outperforming ultra-low-bit methods with minimal finetuning, and (3) enables efficient finetuning, even surpassing existing quantization-based approaches and rivaling full FP16 finetuning. Notably, ClusComp supports compression and finetuning of 70B LLMs on a single A6000-48GB GPU.
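The abstract above says only that ClusComp "clusters weight matrices into codebooks"; it does not give the clustering details. As a rough illustration of the general idea (not the authors' method), the sketch below quantizes a small weight matrix with plain scalar k-means, so each weight is stored as an index into a shared codebook. The function names, the scalar (rather than vector) clustering, and the toy weights are all assumptions made for this example.

```python
import random

def kmeans_1d(values, k, iters=25, seed=0):
    """Plain Lloyd's k-means on scalars; returns (codebook, codes)."""
    rng = random.Random(seed)
    codebook = rng.sample(values, k)  # initial centroids; requires k <= len(values)

    def assign(v):
        # Index of the nearest codebook entry.
        return min(range(k), key=lambda j: (v - codebook[j]) ** 2)

    for _ in range(iters):
        codes = [assign(v) for v in values]
        for j in range(k):
            members = [v for v, c in zip(values, codes) if c == j]
            if members:  # leave empty clusters where they are
                codebook[j] = sum(members) / len(members)
    codes = [assign(v) for v in values]  # final assignment against the final codebook
    return codebook, codes

def compress(matrix, k):
    """Replace every weight by the index of its nearest codebook entry."""
    flat = [w for row in matrix for w in row]
    codebook, codes = kmeans_1d(flat, k)
    cols = len(matrix[0])
    code_rows = [codes[i:i + cols] for i in range(0, len(codes), cols)]
    return codebook, code_rows

def decompress(codebook, code_rows):
    """Rebuild an approximate weight matrix from the codebook and indices."""
    return [[codebook[c] for c in row] for row in code_rows]

# Hypothetical toy weights: two tight groups, so two codebook entries suffice.
weights = [[0.10, 0.11], [0.90, 0.91]]
codebook, code_rows = compress(weights, k=2)
approx = decompress(codebook, code_rows)
```

With a codebook of k entries, each stored index needs about log2(k) bits, which is the sense in which a codebook acts as a low-bit representation in this sketch.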

Title: Who Wrote This? Identifying Machine vs Human-Generated Text in Hausa

Authors: Babangida Sani, Aakansha Soy, Sukairaj Hafiz Imam, Ahmad Mustapha, Lukman Jibril Aliyu, Idris Abdulmumin, Ibrahim Said Ahmad, Shamsuddeen Hassan Muhammad
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.13101
Pdf URL: https://arxiv.org/pdf/2503.13101
Copy Paste: [[2503.13101]] Who Wrote This? Identifying Machine vs Human-Generated Text in Hausa(https://arxiv.org/abs/2503.13101)
Keywords: language model, llm
Abstract: The advancement of large language models (LLMs) has allowed them to be proficient in various tasks, including content generation. However, their unregulated usage can lead to malicious activities such as plagiarism and generating and spreading fake news, especially for low-resource languages. Most existing machine-generated text detectors are trained on high-resource languages such as English and French. In this study, we developed the first large-scale detector that can distinguish between human- and machine-generated content in Hausa. We scraped seven Hausa-language media outlets for the human-generated text and used the Gemini-2.0 Flash model to automatically generate the corresponding Hausa-language articles based on the human-generated article headlines. We fine-tuned four pre-trained Afri-centric models (AfriTeVa, AfriBERTa, AfroXLMR, and AfroXLMR-76L) on the resulting dataset and assessed their performance using accuracy and F1-score metrics. AfroXLMR achieved the highest performance with an accuracy of 99.23% and an F1 score of 99.21%, demonstrating its effectiveness for Hausa text detection. Our dataset is made publicly available to enable further research.

Title: REPA: Russian Error Types Annotation for Evaluating Text Generation and Judgment Capabilities

Authors: Alexander Pugachev, Alena Fenogenova, Vladislav Mikhailov, Ekaterina Artemova
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.13102
Pdf URL: https://arxiv.org/pdf/2503.13102
Copy Paste: [[2503.13102]] REPA: Russian Error Types Annotation for Evaluating Text Generation and Judgment Capabilities(https://arxiv.org/abs/2503.13102)
Keywords: language model, llm
Abstract: Recent advances in large language models (LLMs) have introduced the novel paradigm of using LLMs as judges, where an LLM evaluates and scores the outputs of another LLM, which often correlates highly with human preferences. However, the use of LLM-as-a-judge has been primarily studied in English. In this paper, we evaluate this framework in Russian by introducing the Russian Error tyPes Annotation dataset (REPA), a dataset of 1k user queries and 2k LLM-generated responses. Human annotators labeled each response pair expressing their preferences across ten specific error types, as well as selecting an overall preference. We rank six generative LLMs across the error types using three rating systems based on human preferences. We also evaluate responses using eight LLM judges in zero-shot and few-shot settings. We describe the results of analyzing the judges and position and length biases. Our findings reveal a notable gap between LLM judge performance in Russian and English. However, rankings based on human and LLM preferences show partial alignment, suggesting that while current LLM judges struggle with fine-grained evaluation in Russian, there is potential for improvement.

Title: Code-Driven Inductive Synthesis: Enhancing Reasoning Abilities of Large Language Models with Sequences

Authors: Kedi Chen, Zhikai Lei, Fan Zhang, Yinqi Zhang, Qin Chen, Jie Zhou, Liang He, Qipeng Guo, Kai Chen, Wei Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.13109
Pdf URL: https://arxiv.org/pdf/2503.13109
Copy Paste: [[2503.13109]] Code-Driven Inductive Synthesis: Enhancing Reasoning Abilities of Large Language Models with Sequences(https://arxiv.org/abs/2503.13109)
Keywords: language model
Abstract: Large language models have made remarkable progress in reasoning capabilities. Existing works focus mainly on deductive reasoning tasks (e.g., code and math), while inductive reasoning, another mode of reasoning that aligns better with human learning, is not well studied. We attribute this to the fact that obtaining high-quality process supervision data is challenging for inductive reasoning. To this end, we take the novel step of employing number sequences as the source of inductive reasoning data. We package sequences into algorithmic problems to find the general term of each sequence through a code solution. In this way, we can verify whether the code solution holds for any term in the current sequence, and inject case-based supervision signals by using code unit tests. We build a synthetic data pipeline for sequences and form a training dataset, CodeSeq. Experimental results show that the models tuned with CodeSeq improve on both code and comprehensive reasoning benchmarks.
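The verification step described above, checking a code solution for a sequence's general term against known terms, can be sketched as a small unit-test style checker. This is a minimal illustration under assumptions: the sequence (perfect squares), the function names, and the failure-report format are hypothetical and are not taken from the CodeSeq pipeline.

```python
from typing import Callable, List, Tuple

def check_general_term(candidate: Callable[[int], int],
                       terms: List[int]) -> List[Tuple[int, int, int]]:
    """Return (index, expected, got) for every 1-indexed term the candidate gets wrong."""
    failures = []
    for n, expected in enumerate(terms, start=1):
        got = candidate(n)
        if got != expected:
            failures.append((n, expected, got))
    return failures

# Hypothetical sequence: the first five perfect squares.
squares = [1, 4, 9, 16, 25]

def correct_term(n: int) -> int:
    # General term a(n) = n^2.
    return n * n

def wrong_term(n: int) -> int:
    # Agrees with the sequence only at n = 1.
    return 2 * n - 1

passed = check_general_term(correct_term, squares) == []
feedback = check_general_term(wrong_term, squares)
```

Each failing case names the term that breaks the candidate formula, which is one simple way a per-case signal could be derived from unit tests.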

Title: Improving Complex Reasoning with Dynamic Prompt Corruption: A soft prompt Optimization Approach

Authors: Sinan Fan, Liang Xie, Chen Shen, Ge Teng, Xiaosong Yuan, Xiaofeng Zhang, Chenxi Huang, Wenxiao Wang, Xiaofei He, Jieping Ye
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.13208
Pdf URL: https://arxiv.org/pdf/2503.13208
Copy Paste: [[2503.13208]] Improving Complex Reasoning with Dynamic Prompt Corruption: A soft prompt Optimization Approach(https://arxiv.org/abs/2503.13208)
Keywords: language model, llm, prompt
Abstract: Prompt-tuning (PT) for large language models (LLMs) can facilitate the performance on various conventional NLP tasks with significantly fewer trainable parameters. However, our investigation reveals that PT provides limited improvement and may even degrade the primitive performance of LLMs on complex reasoning tasks. Such a phenomenon suggests that soft prompts can positively impact certain instances while negatively affecting others, particularly during the later phases of reasoning. To address these challenges, we first identify an information accumulation within the soft prompts. Through detailed analysis, we demonstrate that this phenomenon is often accompanied by erroneous information flow patterns in the deeper layers of the model, which ultimately lead to incorrect reasoning outcomes. We propose a novel method called Dynamic Prompt Corruption (DPC) to take better advantage of soft prompts in complex reasoning tasks, which dynamically adjusts the influence of soft prompts based on their impact on the reasoning process. Specifically, DPC consists of two stages: Dynamic Trigger and Dynamic Corruption. First, Dynamic Trigger measures the impact of soft prompts, identifying whether it is beneficial or detrimental. Then, Dynamic Corruption mitigates the negative effects of soft prompts by selectively masking key tokens that interfere with the reasoning process. We validate the proposed approach through extensive experiments on various LLMs and reasoning tasks, including GSM8K, MATH, and AQuA. Experimental results demonstrate that DPC can consistently enhance the performance of PT, achieving 4%-8% accuracy gains compared to vanilla prompt tuning, highlighting the effectiveness of our approach and its potential to enhance complex reasoning in LLMs.

Title: Can Language Models Follow Multiple Turns of Entangled Instructions?

Authors: Chi Han
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.13222
Pdf URL: https://arxiv.org/pdf/2503.13222
Copy Paste: [[2503.13222]] Can Language Models Follow Multiple Turns of Entangled Instructions?(https://arxiv.org/abs/2503.13222)
Keywords: language model, gpt, llm
Abstract: Despite significant achievements in improving the instruction-following capabilities of large language models (LLMs), the ability to process multiple potentially entangled or conflicting instructions remains a considerable challenge. Real-world scenarios often require consistency across multiple instructions over time, such as secret privacy, personal preferences, and prioritization, which demand sophisticated abilities to integrate multiple turns and carefully balance competing objectives when instructions intersect or conflict. This work presents a systematic investigation of LLMs' capabilities in handling multiple turns of instructions, covering three levels of difficulty: (1) retrieving information from instructions, (2) tracking and reasoning across turns, and (3) resolving conflicts among instructions. We construct MultiTurnInstruct with around 1.1K high-quality multi-turn conversations through a human-in-the-loop approach, yielding nine capability categories, including statics and dynamics, reasoning, and multitasking. Our findings reveal an intriguing trade-off between different capabilities. While GPT models demonstrate superior memorization, they show reduced effectiveness in privacy-protection tasks requiring selective information withholding. Larger models exhibit stronger reasoning capabilities but still struggle with resolving conflicting instructions. Importantly, these performance gaps cannot be attributed solely to information loss, as models demonstrate strong BLEU scores on memorization tasks but their attention mechanisms fail to integrate multiple related instructions effectively. These findings highlight critical areas for improvement in complex real-world tasks involving multi-turn instructions.

Title: TablePilot; Recommending Human-Preferred Tabular Data Analysis with Large Language Models

Authors: Deyin Yi, Yihao Liu, Lang Cao, Mengyu Zhou, Haoyu Dong, Shi Han, Dongmei Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.13262
Pdf URL: https://arxiv.org/pdf/2503.13262
Copy Paste: [[2503.13262]] TablePilot; Recommending Human-Preferred Tabular Data Analysis with Large Language Models(https://arxiv.org/abs/2503.13262)
Keywords: language model, gpt
Abstract: Tabular data analysis is crucial in many scenarios, yet efficiently identifying the most relevant data analysis queries and results for a new table remains a significant challenge. The complexity of tabular data, diverse analytical operations, and the demand for high-quality analysis make the process tedious. To address these challenges, we aim to recommend query-code-result triplets tailored for new tables in tabular data analysis workflows. In this paper, we present TablePilot, a pioneering tabular data analysis framework leveraging large language models to autonomously generate comprehensive and superior analytical results without relying on user profiles or prior interactions. The framework incorporates key designs in analysis preparation and analysis optimization to enhance accuracy. Additionally, we propose Rec-Align, a novel method to further improve recommendation quality and better align with human preferences. Experiments on DART, a dataset specifically designed for comprehensive tabular data analysis recommendation, demonstrate the effectiveness of our framework. Based on GPT-4o, the tuned TablePilot achieves 77.0% top-5 recommendation recall. Human evaluations further highlight its effectiveness in optimizing tabular data analysis workflows.

Title: LLM-Match: An Open-Sourced Patient Matching Model Based on Large Language Models and Retrieval-Augmented Generation

Authors: Xiaodi Li, Shaika Chowdhury, Chung Il Wi, Maria Vassilaki, Ken Liu, Terence T Sio, Owen Garrick, Young J Juhn, James R Cerhan, Cui Tao, Nansu Zong
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.13281
Pdf URL: https://arxiv.org/pdf/2503.13281
Copy Paste: [[2503.13281]] LLM-Match: An Open-Sourced Patient Matching Model Based on Large Language Models and Retrieval-Augmented Generation(https://arxiv.org/abs/2503.13281)
Keywords: language model, gpt, llm, prompt, retrieval-augmented generation
Abstract: Patient matching is the process of linking patients to appropriate clinical trials by accurately identifying and matching their medical records with trial eligibility criteria. We propose LLM-Match, a novel framework for patient matching leveraging fine-tuned open-source large language models. Our approach consists of four key components. First, a retrieval-augmented generation (RAG) module extracts relevant patient context from a vast pool of electronic health records (EHRs). Second, a prompt generation module constructs input prompts by integrating trial eligibility criteria (both inclusion and exclusion criteria), patient context, and system instructions. Third, a fine-tuning module with a classification head optimizes the model parameters using structured prompts and ground-truth labels. Fourth, an evaluation module assesses the fine-tuned model's performance on the testing datasets. We evaluated LLM-Match on four open datasets, n2c2, SIGIR, TREC 2021, and TREC 2022, using open-source models, comparing it against TrialGPT, Zero-Shot, and GPT-4-based closed models. LLM-Match outperformed all baselines.

Title: A Survey on Transformer Context Extension: Approaches and Evaluation

Authors: Yijun Liu, Jinzheng Yu, Yang Xu, Zhongyang Li, Qingfu Zhu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.13299
Pdf URL: https://arxiv.org/pdf/2503.13299
Copy Paste: [[2503.13299]] A Survey on Transformer Context Extension: Approaches and Evaluation(https://arxiv.org/abs/2503.13299)
Keywords: language model, llm, long context
Abstract: Large language models (LLMs) based on Transformer have been widely applied in the field of natural language processing (NLP), demonstrating strong performance, particularly in handling short text tasks. However, when it comes to long context scenarios, the performance of LLMs degrades due to some challenges. To alleviate this phenomenon, a number of works have been proposed recently. In this survey, we first list the challenges of applying pre-trained LLMs to process long contexts. We then systematically review the approaches related to long context and propose our taxonomy, categorizing them into four main types: positional encoding, context compression, retrieval augmented, and attention pattern. In addition to the approaches, we focus on the evaluation of long context, organizing relevant data, tasks, and metrics based on existing long context benchmarks. Finally, we summarize unresolved issues in the long context domain and put forward our views on future developments.

Title: Computation Mechanism Behind LLM Position Generalization

Authors: Chi Han, Heng Ji
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.13305
Pdf URL: https://arxiv.org/pdf/2503.13305
Copy Paste: [[2503.13305]] Computation Mechanism Behind LLM Position Generalization(https://arxiv.org/abs/2503.13305)
Keywords: language model, llm
Abstract: Most written natural languages are composed of sequences of words and sentences. Similar to humans, large language models (LLMs) exhibit flexibility in handling textual positions - a phenomenon we term position generalization. They can understand texts with position perturbations and generalize to longer texts than those encountered during training with the latest techniques. These phenomena suggest that LLMs handle positions tolerantly, but how LLMs computationally process positional relevance remains largely unexplored. This work connects the linguistic phenomenon with LLMs' computational mechanisms. We show how LLMs enforce certain computational mechanisms for the aforementioned tolerance in position perturbations. Despite the complex design of the self-attention mechanism, this work reveals that LLMs learn a counterintuitive disentanglement of attention logits. Their values show a 0.959 linear correlation with an approximation of the arithmetic sum of positional relevance and semantic importance. Furthermore, we identify a prevalent pattern in intermediate features, which we prove theoretically enables this effect. The pattern, which is different from how randomly initialized parameters would behave, suggests that it is a learned behavior rather than a natural result of the model architecture. Based on these findings, we provide computational explanations and criteria for LLMs' position flexibilities. This work takes a pioneering step in linking position generalization with modern LLMs' internal mechanisms.

Title: Reliable and Efficient Amortized Model-based Evaluation

Authors: Sang Truong, Yuheng Tu, Percy Liang, Bo Li, Sanmi Koyejo
Subjects: cs.CL, cs.AI, cs.LG, stat.AP
Abstract URL: https://arxiv.org/abs/2503.13335
Pdf URL: https://arxiv.org/pdf/2503.13335
Copy Paste: [[2503.13335]] Reliable and Efficient Amortized Model-based Evaluation(https://arxiv.org/abs/2503.13335)
Keywords: language model, llm
Abstract: Comprehensive evaluations of language models (LM) during both development and deployment phases are necessary because these models possess numerous capabilities (e.g., mathematical reasoning, legal support, or medical diagnosis) as well as safety risks (e.g., racial bias, toxicity, or misinformation). The average score across a wide range of benchmarks provides a signal that helps guide the use of these LMs in practice. Currently, holistic evaluations are costly due to the large volume of benchmark questions, making frequent evaluations impractical. A popular attempt to lower the cost is to compute the average score on a subset of the benchmark. This approach, unfortunately, often renders an unreliable measure of LM performance because the average score is often confounded with the difficulty of the questions in the benchmark subset. Item response theory (IRT) was designed to address this challenge, providing a reliable measurement by carefully controlling for question difficulty. Unfortunately, question difficulty is expensive to estimate. Facing this challenge, we train a model that predicts question difficulty from its content, enabling a reliable measurement at a fraction of the cost. In addition, we leverage this difficulty predictor to further improve the evaluation efficiency through training a question generator given a difficulty level. This question generator is essential in adaptive testing, where, instead of using a random subset of the benchmark questions, informative questions are adaptively chosen based on the current estimation of LLM performance. Experiments on 22 common natural language benchmarks and 172 LMs show that this approach is more reliable and efficient compared to current common practice.
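To make the point about subset averages concrete, the sketch below uses the one-parameter (Rasch) IRT model, in which the probability of a correct answer is sigmoid(theta - b) for ability theta and question difficulty b. The abstract does not say which IRT variant the authors use, so the Rasch form, the bisection-based estimator, and the toy difficulties here are assumptions chosen for illustration only.

```python
import math

def p_correct(theta: float, b: float) -> float:
    """Rasch model: probability that ability theta answers a question of difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def estimate_ability(responses, difficulties, lo=-6.0, hi=6.0, iters=60):
    """Maximum-likelihood ability under the Rasch model, found by bisection.

    The likelihood is maximized where the observed number of correct answers
    equals the expected number; that difference decreases as theta grows.
    If all answers are right (or all wrong) the estimate runs to the search bound.
    """
    def score(theta):
        return sum(responses) - sum(p_correct(theta, b) for b in difficulties)

    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if score(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Hypothetical subsets: the same model answers an easy subset and a hard subset.
easy_b = [-2.0, -1.5, -1.0, -0.5]
hard_b = [0.5, 1.0, 1.5, 2.0]
easy_resp = [1, 1, 1, 0]   # raw accuracy 0.75
hard_resp = [0, 0, 0, 1]   # raw accuracy 0.25

theta_easy = estimate_ability(easy_resp, easy_b)
theta_hard = estimate_ability(hard_resp, hard_b)
```

The raw averages differ widely between the two subsets, while the two ability estimates land close to each other, because the estimator accounts for how hard each question is.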

Title: Valid Text-to-SQL Generation with Unification-based DeepStochLog

Authors: Ying Jiao, Luc De Raedt, Giuseppe Marra
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.13342
Pdf URL: https://arxiv.org/pdf/2503.13342
Copy Paste: [[2503.13342]] Valid Text-to-SQL Generation with Unification-based DeepStochLog(https://arxiv.org/abs/2503.13342)
Keywords: language model
Abstract: Large language models have been used to translate natural language questions to SQL queries. Without hard constraints on syntax and database schema, they occasionally produce invalid queries that are not executable. These failures limit the usage of these systems in real-life scenarios. We propose a neurosymbolic framework that imposes SQL syntax and schema constraints with unification-based definite clause grammars and thus guarantees the generation of valid queries. Our framework also builds a bi-directional interface to language models to leverage their natural language understanding abilities. The evaluation results on a subset of SQL grammars show that all our output queries are valid. This work is the first step towards extending language models with unification-based grammars. We demonstrate this extension enhances the validity, execution accuracy, and ground truth alignment of the underlying language model by a large margin. Our code is available at this https URL.

Title: Aligned Probing: Relating Toxic Behavior and Model Internals

Authors: Andreas Waldis, Vagrant Gautam, Anne Lauscher, Dietrich Klakow, Iryna Gurevych
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.13390
Pdf URL: https://arxiv.org/pdf/2503.13390
Copy Paste: [[2503.13390]] Aligned Probing: Relating Toxic Behavior and Model Internals(https://arxiv.org/abs/2503.13390)
Keywords: language model, prompt
Abstract: We introduce aligned probing, a novel interpretability framework that aligns the behavior of language models (LMs), based on their outputs, and their internal representations (internals). Using this framework, we examine over 20 OLMo, Llama, and Mistral models, bridging behavioral and internal perspectives for toxicity for the first time. Our results show that LMs strongly encode information about the toxicity level of inputs and subsequent outputs, particularly in lower layers. Focusing on how unique LMs differ offers both correlative and causal evidence that they generate less toxic output when strongly encoding information about the input toxicity. We also highlight the heterogeneity of toxicity, as model behavior and internals vary across unique attributes such as Threat. Finally, four case studies analyzing detoxification, multi-prompt evaluations, model quantization, and pre-training dynamics underline the practical impact of aligned probing with further concrete insights. Our findings contribute to a more holistic understanding of LMs, both within and beyond the context of toxicity.

Title: Using the Tools of Cognitive Science to Understand Large Language Models at Different Levels of Analysis

Authors: Alexander Ku, Declan Campbell, Xuechunzi Bai, Jiayi Geng, Ryan Liu, Raja Marjieh, R. Thomas McCoy, Andrew Nam, Ilia Sucholutsky, Veniamin Veselovsky, Liyi Zhang, Jian-Qiao Zhu, Thomas L. Griffiths
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.13401
Pdf URL: https://arxiv.org/pdf/2503.13401
Copy Paste: [[2503.13401]] Using the Tools of Cognitive Science to Understand Large Language Models at Different Levels of Analysis(https://arxiv.org/abs/2503.13401)
Keywords: language model
Abstract: Modern artificial intelligence systems, such as large language models, are increasingly powerful but also increasingly hard to understand. Recognizing this problem as analogous to the historical difficulties in understanding the human mind, we argue that methods developed in cognitive science can be useful for understanding large language models. We propose a framework for applying these methods based on Marr's three levels of analysis. By revisiting established cognitive science techniques relevant to each level and illustrating their potential to yield insights into the behavior and internal organization of large language models, we aim to provide a toolkit for making sense of these new kinds of minds.
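Abstract (translated from Chinese): Modern artificial intelligence systems, such as large language models, are increasingly powerful but also increasingly hard to understand. Recognizing that this problem resembles the historical difficulty of understanding the human mind, we argue that methods developed in cognitive science can be useful for understanding large language models. We propose a framework for applying these methods based on Marr's three levels of analysis. By revisiting established cognitive science techniques relevant to each level and illustrating their potential to yield insights into the behavior and internal organization of large language models, we aim to provide a toolkit for making sense of these new kinds of minds.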

Title: DLPO: Towards a Robust, Efficient, and Generalizable Prompt Optimization Framework from a Deep-Learning Perspective

Authors: Dengyun Peng, Yuhang Zhou, Qiguang Chen, Jinhao Liu, Jingjing Chen, Libo Qin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.13413
Pdf URL: https://arxiv.org/pdf/2503.13413
Copy Paste: [[2503.13413]] DLPO: Towards a Robust, Efficient, and Generalizable Prompt Optimization Framework from a Deep-Learning Perspective(https://arxiv.org/abs/2503.13413)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) have achieved remarkable success across diverse tasks, largely driven by well-designed prompts. However, crafting and selecting such prompts often requires considerable human effort, significantly limiting its scalability. To mitigate this, recent studies have explored automated prompt optimization as a promising solution. Despite these efforts, existing methods still face critical challenges in robustness, efficiency, and generalization. To systematically address these challenges, we first conduct an empirical analysis to identify the limitations of current reflection-based prompt optimization paradigm. Building on these insights, we propose 7 innovative approaches inspired by traditional deep learning paradigms for prompt optimization (DLPO), seamlessly integrating these concepts into text-based gradient optimization. Through these advancements, we progressively tackle the aforementioned challenges and validate our methods through extensive experimentation. We hope our study not only provides valuable guidance for future research but also offers a comprehensive understanding of the challenges and potential solutions in prompt optimization. Our code is available at this https URL.
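Abstract (translated from Chinese): Large Language Models (LLMs) have achieved remarkable success across diverse tasks, driven largely by well-designed prompts. Crafting and selecting such prompts, however, often requires considerable human effort, which significantly limits scalability. To mitigate this, recent studies have explored automated prompt optimization as a promising solution. Despite these efforts, existing methods still face critical challenges in robustness, efficiency, and generalization. To address these challenges systematically, we first conduct an empirical analysis to identify the limitations of the current reflection-based prompt optimization paradigm. Building on these insights, we propose 7 approaches inspired by traditional deep learning paradigms for prompt optimization (DLPO), integrating these concepts into text-based gradient optimization. Through these advancements, we progressively tackle the aforementioned challenges and validate our methods through extensive experimentation. We hope our study not only provides valuable guidance for future research but also offers a comprehensive understanding of the challenges and potential solutions in prompt optimization. Our code is available at this https URL.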

Title: SuperBPE: Space Travel for Language Models

Authors: Alisa Liu, Jonathan Hayase, Valentin Hofmann, Sewoong Oh, Noah A. Smith, Yejin Choi
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2503.13423
Pdf URL: https://arxiv.org/pdf/2503.13423
Copy Paste: [[2503.13423]] SuperBPE: Space Travel for Language Models(https://arxiv.org/abs/2503.13423)
Keywords: language model
Abstract: The assumption across nearly all language model (LM) tokenization schemes is that tokens should be subwords, i.e., contained within word boundaries. While providing a seemingly reasonable inductive bias, is this common practice limiting the potential of modern LMs? Whitespace is not a reliable delimiter of meaning, as evidenced by multi-word expressions (e.g., "by the way"), crosslingual variation in the number of words needed to express a concept (e.g., "spacesuit helmet" in German is "raumanzughelm"), and languages that do not use whitespace at all (e.g., Chinese). To explore the potential of tokenization beyond subwords, we introduce a "superword" tokenizer, SuperBPE, which incorporates a simple pretokenization curriculum into the byte-pair encoding (BPE) algorithm to first learn subwords, then superwords that bridge whitespace. This brings dramatic improvements in encoding efficiency: when fixing the vocabulary size to 200k, SuperBPE encodes a fixed piece of text with up to 33% fewer tokens than BPE on average. In experiments, we pretrain 8B transformer LMs from scratch while fixing the model size, vocabulary size, and train compute, varying *only* the algorithm for learning the vocabulary. Our model trained with SuperBPE achieves an average +4.0% absolute improvement over the BPE baseline across 30 downstream tasks (including +8.2% on MMLU), while simultaneously requiring 27% less compute at inference time. In analysis, we find that SuperBPE results in segmentations of text that are more uniform in per-token difficulty. Qualitatively, this may be because SuperBPE tokens often capture common multi-word expressions that function semantically as a single unit. SuperBPE is a straightforward, local modification to tokenization that improves both encoding efficiency and downstream performance, yielding better language models overall.
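Abstract (translated from Chinese): Nearly all language model (LM) tokenization schemes assume that tokens should be subwords, i.e., contained within word boundaries. This provides a seemingly reasonable inductive bias, but does the practice limit the potential of modern LMs? Whitespace is not a reliable delimiter of meaning, as shown by multi-word expressions (e.g., "by the way"), crosslingual variation in the number of words needed to express a concept (e.g., "spacesuit helmet" in German is "raumanzughelm"), and languages that do not use whitespace at all (e.g., Chinese). To explore tokenization beyond subwords, we introduce a "superword" tokenizer, SuperBPE, which incorporates a simple pretokenization curriculum into the byte-pair encoding (BPE) algorithm to first learn subwords and then superwords that bridge whitespace. This brings dramatic improvements in encoding efficiency: with the vocabulary size fixed at 200k, SuperBPE encodes a fixed piece of text with up to 33% fewer tokens than BPE on average. In experiments, we pretrain 8B transformer LMs from scratch while fixing the model size, vocabulary size, and train compute, varying *only* the algorithm for learning the vocabulary. Our model trained with SuperBPE achieves an average +4.0% absolute improvement over the BPE baseline across 30 downstream tasks (including +8.2% on MMLU), while requiring 27% less compute at inference time. In analysis, we find that SuperBPE produces segmentations of text that are more uniform in per-token difficulty. Qualitatively, this may be because SuperBPE tokens often capture common multi-word expressions that function semantically as a single unit. SuperBPE is a straightforward, local modification to tokenization that improves both encoding efficiency and downstream performance, yielding better language models overall.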
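
The two-stage idea in the SuperBPE abstract (learn subwords first, then superwords that bridge whitespace) can be illustrated with a toy character-level BPE. This is only a sketch under stated assumptions, not the authors' implementation: the function name `learn_bpe`, the pretokenization regex, the merge budgets, and the example text are all hypothetical choices made for illustration.

```python
import re
from collections import Counter


def learn_bpe(sequences, num_merges):
    """Learn up to num_merges BPE merges over a list of symbol sequences."""
    seqs = [list(s) for s in sequences]
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs; pairs never span two sequences.
        pairs = Counter()
        for seq in seqs:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break
        # Most frequent pair, with a deterministic tie-break on the pair itself.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        merged = best[0] + best[1]
        new_seqs = []
        for seq in seqs:
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and (seq[i], seq[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            new_seqs.append(out)
        seqs = new_seqs
    return merges, seqs


text = "by the way by the way by the way by the way"

# Stage 1 (subwords): whitespace pretokenization keeps each word in its own
# sequence, so no merge can cross a word boundary.
words = re.findall(r" ?\S+", text)
_, word_seqs = learn_bpe(words, num_merges=20)
subword_tokens = [tok for seq in word_seqs for tok in seq]

# Stage 2 (superwords): drop the pretokenization and keep merging over the
# flattened token stream, so merges may now bridge whitespace.
_, (superword_tokens,) = learn_bpe([subword_tokens], num_merges=2)

print(len(subword_tokens), subword_tokens)
print(len(superword_tokens), superword_tokens)
```

On this repetitive toy input the second stage yields fewer tokens for the same text, which is the encoding-efficiency effect the abstract describes; the size of the reduction here says nothing about the 33% figure reported for real corpora.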

Title: Faithfulness of LLM Self-Explanations for Commonsense Tasks: Larger Is Better, and Instruction-Tuning Allows Trade-Offs but Not Pareto Dominance

Authors: Noah Y. Siegel, Nicolas Heess, Maria Perez-Ortiz, Oana-Maria Camburu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.13445
Pdf URL: https://arxiv.org/pdf/2503.13445
Copy Paste: [[2503.13445]] Faithfulness of LLM Self-Explanations for Commonsense Tasks: Larger Is Better, and Instruction-Tuning Allows Trade-Offs but Not Pareto Dominance(https://arxiv.org/abs/2503.13445)
Keywords: language model, llm, prompt
Abstract: As large language models (LLMs) become increasingly capable, ensuring that their self-generated explanations are faithful to their internal decision-making process is critical for safety and oversight. In this work, we conduct a comprehensive counterfactual faithfulness analysis across 62 models from 8 families, encompassing both pretrained and instruction-tuned variants and significantly extending prior studies of counterfactual tests. We introduce phi-CCT, a simplified variant of the Correlational Counterfactual Test, which avoids the need for token probabilities while explaining most of the variance of the original test. Our findings reveal clear scaling trends: larger models are consistently more faithful on our metrics. However, when comparing instruction-tuned and human-imitated explanations, we find that observed differences in faithfulness can often be attributed to explanation verbosity, leading to shifts along the true-positive/false-positive Pareto frontier. While instruction-tuning and prompting can influence this trade-off, we find limited evidence that they fundamentally expand the frontier of explanatory faithfulness beyond what is achievable with pretrained models of comparable size. Our analysis highlights the nuanced relationship between instruction-tuning, verbosity, and the faithful representation of model decision processes.
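Abstract (translated from Chinese): As large language models (LLMs) become increasingly capable, ensuring that their self-generated explanations are faithful to their internal decision-making process is critical for safety and oversight. In this work, we conduct a comprehensive counterfactual faithfulness analysis across 62 models from 8 families, covering both pretrained and instruction-tuned variants and significantly extending prior studies of counterfactual tests. We introduce phi-CCT, a simplified variant of the Correlational Counterfactual Test, which avoids the need for token probabilities while explaining most of the variance of the original test. Our findings reveal clear scaling trends: larger models are consistently more faithful on our metrics. However, when comparing instruction-tuned and human-imitated explanations, we find that observed differences in faithfulness can often be attributed to explanation verbosity, leading to shifts along the true-positive/false-positive Pareto frontier. While instruction-tuning and prompting can influence this trade-off, we find limited evidence that they fundamentally expand the frontier of explanatory faithfulness beyond what is achievable with pretrained models of comparable size. Our analysis highlights the nuanced relationship between instruction-tuning, verbosity, and the faithful representation of model decision processes.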

Title: MetaScale: Test-Time Scaling with Evolving Meta-Thoughts

Authors: Qin Liu, Wenxuan Zhou, Nan Xu, James Y. Huang, Fei Wang, Sheng Zhang, Hoifung Poon, Muhao Chen
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.13447
Pdf URL: https://arxiv.org/pdf/2503.13447
Copy Paste: [[2503.13447]] MetaScale: Test-Time Scaling with Evolving Meta-Thoughts(https://arxiv.org/abs/2503.13447)
Keywords: language model, gpt, llm
Abstract: One critical challenge for large language models (LLMs) for making complex reasoning is their reliance on matching reasoning patterns from training data, instead of proactively selecting the most appropriate cognitive strategy to solve a given task. Existing approaches impose fixed cognitive structures that enhance performance in specific tasks but lack adaptability across diverse scenarios. To address this limitation, we introduce METASCALE, a test-time scaling framework based on meta-thoughts -- adaptive thinking strategies tailored to each task. METASCALE initializes a pool of candidate meta-thoughts, then iteratively selects and evaluates them using a multi-armed bandit algorithm with upper confidence bound selection, guided by a reward model. To further enhance adaptability, a genetic algorithm evolves high-reward meta-thoughts, refining and extending the strategy pool over time. By dynamically proposing and optimizing meta-thoughts at inference time, METASCALE improves both accuracy and generalization across a wide range of tasks. Experimental results demonstrate that MetaScale consistently outperforms standard inference approaches, achieving an 11% performance gain in win rate on Arena-Hard for GPT-4o, surpassing o1-mini by 0.9% under style control. Notably, METASCALE scales more effectively with increasing sampling budgets and produces more structured, expert-level responses.
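Abstract (translated from Chinese): A critical challenge for large language models (LLMs) in complex reasoning is their reliance on matching reasoning patterns from training data, instead of proactively selecting the most appropriate cognitive strategy for a given task. Existing approaches impose fixed cognitive structures that enhance performance on specific tasks but lack adaptability across diverse scenarios. To address this limitation, we introduce MetaScale, a test-time scaling framework based on meta-thoughts, adaptive thinking strategies tailored to each task. MetaScale initializes a pool of candidate meta-thoughts, then iteratively selects and evaluates them using a multi-armed bandit algorithm with upper confidence bound selection, guided by a reward model. To further enhance adaptability, a genetic algorithm evolves high-reward meta-thoughts, refining and extending the strategy pool over time. By dynamically proposing and optimizing meta-thoughts at inference time, MetaScale improves both accuracy and generalization across a wide range of tasks. Experimental results show that MetaScale consistently outperforms standard inference approaches, achieving an 11% gain in win rate on Arena-Hard for GPT-4o and surpassing o1-mini by 0.9% under style control. Notably, MetaScale scales more effectively with increasing sampling budgets and produces more structured, expert-level responses.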
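
The selection step named in the MetaScale abstract, a multi-armed bandit with upper confidence bound selection, can be sketched with the textbook UCB1 rule. This is a generic illustration rather than the paper's procedure: the three "arms" stand in for candidate meta-thoughts, and `toy_reward` is a hypothetical Bernoulli stand-in for a reward model. The exploration constant, budget, and reward probabilities are assumptions chosen only for the example.

```python
import math
import random


def ucb1_select(counts, totals, t, c=math.sqrt(2)):
    """Return the index of the arm with the highest upper confidence bound."""
    # Try every arm once before relying on the confidence bounds.
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    # Score = empirical mean reward + exploration bonus that shrinks with pulls.
    scores = [
        totals[arm] / counts[arm] + c * math.sqrt(math.log(t) / counts[arm])
        for arm in range(len(counts))
    ]
    return max(range(len(counts)), key=scores.__getitem__)


def run_bandit(reward_fn, n_arms, budget, seed=0):
    """Spend `budget` pulls across `n_arms` arms using UCB1 selection."""
    rng = random.Random(seed)
    counts = [0] * n_arms    # number of times each arm was selected
    totals = [0.0] * n_arms  # summed reward per arm
    for t in range(1, budget + 1):
        arm = ucb1_select(counts, totals, t)
        reward = reward_fn(arm, rng)
        counts[arm] += 1
        totals[arm] += reward
    return counts, totals


def toy_reward(arm, rng):
    """Hypothetical reward model: Bernoulli reward with a fixed mean per arm."""
    means = [0.2, 0.5, 0.8]
    return 1.0 if rng.random() < means[arm] else 0.0


counts, totals = run_bandit(toy_reward, n_arms=3, budget=300)
print(counts, totals)
```

The exploration bonus shrinks as an arm is sampled more often, so the sampling budget drifts toward arms with higher observed reward while still occasionally revisiting the others. The genetic-algorithm step mentioned in the abstract, which adds new candidates to the pool, is not modeled here.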