2025-02-05

Title: Large Language Models' Accuracy in Emulating Human Experts' Evaluation of Public Sentiments about Heated Tobacco Products on Social Media

Authors: Kwanho Kim, Soojong Kim
Subjects: cs.CL, cs.CY, cs.SI
Abstract URL: https://arxiv.org/abs/2502.01658
Pdf URL: https://arxiv.org/pdf/2502.01658
Copy Paste: [[2502.01658]] Large Language Models' Accuracy in Emulating Human Experts' Evaluation of Public Sentiments about Heated Tobacco Products on Social Media(https://arxiv.org/abs/2502.01658)
Keywords: language model, gpt, llm
Abstract: Sentiment analysis of alternative tobacco products on social media is important for tobacco control research. Large Language Models (LLMs) can help streamline the labor-intensive human sentiment analysis process. This study examined the accuracy of LLMs in replicating human sentiment evaluation of social media messages about heated tobacco products (HTPs). The research used GPT-3.5 and GPT-4 Turbo to classify 500 Facebook and 500 Twitter messages, including anti-HTPs, pro-HTPs, and neutral messages. The models evaluated each message up to 20 times, and their majority label was compared to human evaluators. Results showed that GPT-3.5 accurately replicated human sentiment 61.2% of the time for Facebook messages and 57.0% for Twitter messages. GPT-4 Turbo performed better, with 81.7% accuracy for Facebook and 77.0% for Twitter. Using three response instances, GPT-4 Turbo achieved 99% of the accuracy of twenty instances. GPT-4 Turbo also had higher accuracy for anti- and pro-HTPs messages compared to neutral ones. Misclassifications by GPT-3.5 often involved anti- or pro-HTPs messages being labeled as neutral or irrelevant, while GPT-4 Turbo showed improvements across all categories. In conclusion, LLMs can be used for sentiment analysis of HTP-related social media messages, with GPT-4 Turbo reaching around 80% accuracy compared to human experts. However, there's a risk of misrepresenting overall sentiment due to differences in accuracy across sentiment categories.
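The majority-label comparison described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the label strings, and the toy data are all assumptions, and ties are broken by whichever label appeared first.

```python
from collections import Counter

def majority_label(labels):
    """Return the most frequent label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]

def accuracy_vs_human(model_runs, human_labels, k=None):
    """Share of messages whose majority label over the first k runs
    matches the human label (k=None uses every run)."""
    hits = 0
    for runs, human in zip(model_runs, human_labels):
        votes = runs if k is None else runs[:k]
        hits += majority_label(votes) == human
    return hits / len(human_labels)

# Toy data (hypothetical): three messages, three model runs each.
runs = [
    ["anti", "anti", "neutral"],
    ["pro", "neutral", "neutral"],
    ["neutral", "pro", "pro"],
]
human = ["anti", "pro", "pro"]
print(accuracy_vs_human(runs, human))       # majority over all runs
print(accuracy_vs_human(runs, human, k=1))  # first run only
```

Varying `k` is one way to reproduce the kind of comparison the abstract reports between three and twenty response instances.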

Title: Speculative Ensemble: Fast Large Language Model Ensemble via Speculation

Authors: Jiale Fu, Yuchu Jiang, Junkai Chen, Jiaming Fan, Xin Geng, Xu Yang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.01662
Pdf URL: https://arxiv.org/pdf/2502.01662
Copy Paste: [[2502.01662]] Speculative Ensemble: Fast Large Language Model Ensemble via Speculation(https://arxiv.org/abs/2502.01662)
Keywords: language model, llm
Abstract: Ensemble methods enhance Large Language Models (LLMs) by combining multiple models but suffer from high computational costs. In this paper, we introduce Speculative Ensemble (SE), a novel framework that accelerates LLM ensembles without sacrificing performance, inspired by Speculative Decoding, where a small proposal model generates tokens sequentially and a larger target model verifies them in parallel. Our approach builds on two key insights: (1) the verification distribution can be the ensemble distribution of both the proposal and target models, and (2) alternating each model as the proposer and verifier can further enhance efficiency. We generalize this method to ensembles with n models and theoretically prove that SE is never slower than a standard ensemble, typically achieving faster speed. Extensive experiments demonstrate speed improvements of 1.11x-2.23x over standard ensemble techniques without compromising generation quality. Our code is available at this https URL
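Insight (1) above can be illustrated with the standard speculative-sampling acceptance rule, using the ensemble distribution as the verification target. The sketch below is an assumption-laden toy, not the paper's implementation: the weighted-average ensemble, the function names, and the three-token vocabulary are all hypothetical.

```python
import random

def ensemble(proposal, other, weight=0.5):
    """Weighted average of two next-token distributions; `weight` is the
    share given to the proposal model (an assumed ensemble rule)."""
    return [weight * q + (1 - weight) * p for q, p in zip(proposal, other)]

def residual(target, proposal):
    """Normalized max(0, target - proposal), used to resample after a rejection."""
    diff = [max(0.0, t - q) for t, q in zip(target, proposal)]
    total = sum(diff)
    return [d / total for d in diff] if total > 0 else list(target)

def verify_token(token, proposal, target, rng):
    """Keep the proposed token with probability min(1, target/proposal);
    otherwise draw a replacement from the residual distribution."""
    accept_prob = min(1.0, target[token] / proposal[token])
    if rng.random() < accept_prob:
        return token, True
    vocab = range(len(target))
    return rng.choices(vocab, weights=residual(target, proposal))[0], False

q = [0.7, 0.2, 0.1]      # proposal model (toy values)
p = [0.1, 0.6, 0.3]      # second model (toy values)
target = ensemble(q, p)  # verification distribution = ensemble
print(verify_token(1, q, target, random.Random(0)))
```

Under this averaging rule the ratio target/proposal is at least `weight` for every token, so a proposed token is accepted with probability no lower than the proposer's share of the ensemble.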

Title: Explainable AI for Sentiment Analysis of Human Metapneumovirus (HMPV) Using XLNet

Authors: Md. Shahriar Hossain Apu, Md Saiful Islam, Tanjim Taharat Aurpa
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.01663
Pdf URL: https://arxiv.org/pdf/2502.01663
Copy Paste: [[2502.01663]] Explainable AI for Sentiment Analysis of Human Metapneumovirus (HMPV) Using XLNet(https://arxiv.org/abs/2502.01663)
Keywords: prompt
Abstract: In 2024, the outbreak of Human Metapneumovirus (HMPV) in China, which later spread to the UK and other countries, raised significant public concern. While HMPV typically causes mild symptoms, its effects on vulnerable individuals prompted health authorities to emphasize preventive measures. This paper explores how sentiment analysis can enhance our understanding of public reactions to HMPV by analyzing social media data. We apply transformer models, particularly XLNet, achieving 93.50% accuracy in sentiment classification. Additionally, we use explainable AI (XAI) through SHAP to improve model transparency.

Title: Benchmark on Peer Review Toxic Detection: A Challenging Task with a New Dataset

Authors: Man Luo, Bradley Peterson, Rafael Gan, Hari Ramalingame, Navya Gangrade, Ariadne Dimarogona, Imon Banerjee, Phillip Howard
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2502.01676
Pdf URL: https://arxiv.org/pdf/2502.01676
Copy Paste: [[2502.01676]] Benchmark on Peer Review Toxic Detection: A Challenging Task with a New Dataset(https://arxiv.org/abs/2502.01676)
Keywords: language model, gpt, llm, prompt
Abstract: Peer review is crucial for advancing and improving science through constructive criticism. However, toxic feedback can discourage authors and hinder scientific progress. This work explores an important but underexplored area: detecting toxicity in peer reviews. We first define toxicity in peer reviews across four distinct categories and curate a dataset of peer reviews from the OpenReview platform, annotated by human experts according to these definitions. Leveraging this dataset, we benchmark a variety of models, including a dedicated toxicity detection model, a sentiment analysis model, several open-source large language models (LLMs), and two closed-source LLMs. Our experiments explore the impact of different prompt granularities, from coarse to fine-grained instructions, on model performance. Notably, state-of-the-art LLMs like GPT-4 exhibit low alignment with human judgments under simple prompts but achieve improved alignment with detailed instructions. Moreover, the model's confidence score is a good indicator of better alignment with human judgments. For example, GPT-4 achieves a Cohen's Kappa score of 0.56 with human judgments, which increases to 0.63 when using only predictions with a confidence score higher than 95%. Overall, our dataset and benchmarks underscore the need for continued research to enhance toxicity detection capabilities of LLMs. By addressing this issue, our work aims to contribute to a healthy and responsible environment for constructive academic discourse and scientific collaboration.
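The agreement measure in this abstract, Cohen's Kappa computed on all predictions and then only on predictions above a confidence threshold, can be sketched as below. The helper names and the toy labels and confidences are assumptions for illustration; they are not taken from the paper's dataset.

```python
from collections import Counter

def cohen_kappa(a, b):
    """Cohen's Kappa: (observed - expected agreement) / (1 - expected agreement)."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    labels = set(a) | set(b)
    expected = sum(count_a[l] * count_b[l] for l in labels) / (n * n)
    if expected == 1.0:  # both raters used a single identical label
        return 1.0
    return (observed - expected) / (1 - expected)

def kappa_above_confidence(human, model, confidence, threshold):
    """Kappa restricted to predictions whose confidence exceeds the threshold."""
    kept = [(h, m) for h, m, c in zip(human, model, confidence) if c > threshold]
    if not kept:
        raise ValueError("no predictions above the threshold")
    h, m = zip(*kept)
    return cohen_kappa(h, m)

# Toy data (hypothetical): 1 = toxic, 0 = not toxic.
human = [1, 1, 0, 0, 1, 0]
model = [1, 0, 0, 0, 1, 1]
conf = [0.99, 0.60, 0.97, 0.98, 0.96, 0.70]
print(cohen_kappa(human, model))
print(kappa_above_confidence(human, model, conf, 0.95))
```

In this toy set the two disagreements both carry low confidence, so filtering raises the score, which mirrors the direction of the effect the abstract reports.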

Title: LLM-Powered Benchmark Factory: Reliable, Generic, and Efficient

Authors: Peiwen Yuan, Shaoxiong Feng, Yiwei Li, Xinglin Wang, Yueqi Zhang, Jiayi Shi, Chuyi Tan, Boyuan Pan, Yao Hu, Kan Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.01683
Pdf URL: https://arxiv.org/pdf/2502.01683
Copy Paste: [[2502.01683]] LLM-Powered Benchmark Factory: Reliable, Generic, and Efficient(https://arxiv.org/abs/2502.01683)
Keywords: language model, llm, prompt
Abstract: The rapid advancement of large language models (LLMs) has led to a surge in both model supply and application demands. To facilitate effective matching between them, reliable, generic and efficient benchmark generators are widely needed. However, human annotators are constrained by inefficiency, and current LLM benchmark generators not only lack generalizability but also struggle with limited reliability, as they lack a comprehensive evaluation framework for validation and optimization. To fill this gap, we first propose an automated and unbiased evaluation framework, structured around four dimensions and ten criteria. Under this framework, we carefully analyze the advantages and weaknesses of directly prompting LLMs as generic benchmark generators. To enhance the reliability, we introduce a series of methods to address the identified weaknesses and integrate them as BenchMaker. Experiments across multiple LLMs and tasks confirm that BenchMaker achieves superior or comparable performance to human-annotated benchmarks on all metrics, highlighting its generalizability and reliability. More importantly, it delivers highly consistent evaluation results across 12 LLMs (0.967 Pearson correlation against MMLU-Pro), while taking only $0.005 and 0.38 minutes per sample.

Title: Agent-Based Uncertainty Awareness Improves Automated Radiology Report Labeling with an Open-Source Large Language Model

Authors: Hadas Ben-Atya, Naama Gavrielov, Zvi Badash, Gili Focht, Ruth Cytter-Kuint, Talar Hagopian, Dan Turner, Moti Freiman
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.01691
Pdf URL: https://arxiv.org/pdf/2502.01691
Copy Paste: [[2502.01691]] Agent-Based Uncertainty Awareness Improves Automated Radiology Report Labeling with an Open-Source Large Language Model(https://arxiv.org/abs/2502.01691)
Keywords: language model, llm, prompt, agent
Abstract: Reliable extraction of structured data from radiology reports using Large Language Models (LLMs) remains challenging, especially for complex, non-English texts like Hebrew. This study introduces an agent-based uncertainty-aware approach to improve the trustworthiness of LLM predictions in medical applications. We analyzed 9,683 Hebrew radiology reports from Crohn's disease patients (from 2010 to 2023) across three medical centers. A subset of 512 reports was manually annotated for six gastrointestinal organs and 15 pathological findings, while the remaining reports were automatically annotated using HSMP-BERT. Structured data extraction was performed using Llama 3.1 (Llama 3-8b-instruct) with Bayesian Prompt Ensembles (BayesPE), which employed six semantically equivalent prompts to estimate uncertainty. An Agent-Based Decision Model integrated multiple prompt outputs into five confidence levels for calibrated uncertainty and was compared against three entropy-based models. Performance was evaluated using accuracy, F1 score, precision, recall, and Cohen's Kappa before and after filtering high-uncertainty cases. The agent-based model outperformed the baseline across all metrics, achieving an F1 score of 0.3967, recall of 0.6437, and Cohen's Kappa of 0.3006. After filtering high-uncertainty cases (greater than or equal to 0.5), the F1 score improved to 0.4787, and Kappa increased to 0.4258. Uncertainty histograms demonstrated clear separation between correct and incorrect predictions, with the agent-based model providing the most well-calibrated uncertainty estimates. By incorporating uncertainty-aware prompt ensembles and an agent-based decision model, this approach enhances the performance and reliability of LLMs in structured data extraction from radiology reports, offering a more interpretable and trustworthy solution for high-stakes medical applications.
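As a rough picture of the entropy-based side of this comparison, the sketch below scores uncertainty as the normalized entropy of the labels returned by six prompts and drops cases at or above 0.5. It is not the paper's agent-based decision model or BayesPE itself; the unweighted vote, the function names, and the example labels are assumptions.

```python
import math
from collections import Counter

def vote_entropy(labels, num_classes=2):
    """Entropy of the label distribution across prompts, scaled to [0, 1]."""
    n = len(labels)
    h = -sum((c / n) * math.log(c / n) for c in Counter(labels).values())
    return h / math.log(num_classes)

def filter_confident(cases, threshold=0.5):
    """Keep (case_id, majority_label) for cases with uncertainty below the threshold."""
    kept = []
    for case_id, labels in cases.items():
        if vote_entropy(labels) < threshold:
            kept.append((case_id, Counter(labels).most_common(1)[0][0]))
    return kept

# Hypothetical outputs of six semantically equivalent prompts per report.
cases = {
    "report_a": ["present"] * 6,
    "report_b": ["present"] * 5 + ["absent"],
    "report_c": ["present"] * 3 + ["absent"] * 3,
}
for case_id, labels in cases.items():
    print(case_id, round(vote_entropy(labels), 3))
print(filter_confident(cases))
```

With six binary votes, a single dissenting prompt already gives a normalized entropy of about 0.65, so a 0.5 cutoff keeps only unanimous cases in this simplified setup.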

Title: BARE: Combining Base and Instruction-Tuned Language Models for Better Synthetic Data Generation

Authors: Alan Zhu, Parth Asawa, Jared Quincy Davis, Lingjiao Chen, Ion Stoica, Joseph E. Gonzalez, Matei Zaharia
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.01697
Pdf URL: https://arxiv.org/pdf/2502.01697
Copy Paste: [[2502.01697]] BARE: Combining Base and Instruction-Tuned Language Models for Better Synthetic Data Generation(https://arxiv.org/abs/2502.01697)
Keywords: language model, llm, prompt
Abstract: As the demand for high-quality data in model training grows, researchers and developers are increasingly generating synthetic data to tune and train LLMs. A common assumption about synthetic data is that sampling from instruct-tuned models is sufficient; however, these models struggle to produce diverse outputs, a key requirement for generalization. Despite various prompting methods, in this work we show that achieving meaningful diversity from instruct-tuned models remains challenging. In contrast, we find base models without post-training exhibit greater diversity, but are less capable at instruction following and hence of lower quality. Leveraging this insight, we propose Base-Refine (BARE), a synthetic data generation method that combines the diversity of base models with the quality of instruct-tuned models through a two-stage process. With minimal few-shot examples and curation, BARE generates diverse and high-quality datasets, improving downstream task performance. We show that fine-tuning with as few as 1,000 BARE-generated samples can reach performance comparable to the best similarly sized models on LiveCodeBench tasks. Furthermore, fine-tuning with BARE-generated data achieves a 101% improvement over instruct-only data on GSM8K and an 18.4% improvement over SOTA methods on RAFT.

Title: Evaluation of Large Language Models via Coupled Token Generation

Authors: Nina Corvelo Benz, Stratis Tsirtsis, Eleni Straitouri, Ivi Chatzi, Ander Artola Velasco, Suhas Thejaswi, Manuel Gomez-Rodriguez
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.01754
Pdf URL: https://arxiv.org/pdf/2502.01754
Copy Paste: [[2502.01754]] Evaluation of Large Language Models via Coupled Token Generation(https://arxiv.org/abs/2502.01754)
Keywords: language model, prompt, chat
Abstract: State-of-the-art large language models rely on randomization to respond to a prompt. As an immediate consequence, a model may respond differently to the same prompt if asked multiple times. In this work, we argue that the evaluation and ranking of large language models should control for the randomization underpinning their functioning. Our starting point is the development of a causal model for coupled autoregressive generation, which allows different large language models to sample responses with the same source of randomness. Building upon our causal model, we first show that, on evaluations based on benchmark datasets, coupled autoregressive generation leads to the same conclusions as vanilla autoregressive generation but using provably fewer samples. However, we further show that, on evaluations based on (human) pairwise comparisons, coupled and vanilla autoregressive generation can surprisingly lead to different rankings when comparing more than two models, even with an infinite amount of samples. This suggests that the apparent advantage of a model over others in existing evaluation protocols may not be genuine but rather confounded by the randomness inherent to the generation process. To illustrate and complement our theoretical results, we conduct experiments with several large language models from the Llama family. We find that, across multiple knowledge areas from the popular MMLU benchmark dataset, coupled autoregressive generation requires up to 40% fewer samples to reach the same conclusions as vanilla autoregressive generation. Further, using data from the LMSYS Chatbot Arena platform, we find that the win rates derived from a strong large language model's pairwise comparisons of responses to prompts differ under coupled and vanilla autoregressive generation.
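One simple way to picture "sampling with the same source of randomness" is to feed the same uniform draw to each model's inverse-CDF sampler. The abstract does not say which coupling mechanism the paper's causal model uses, so the code below is only an assumed stand-in; the function names and toy distributions are hypothetical.

```python
import random
from itertools import accumulate

def sample_with_noise(probs, u):
    """Inverse-CDF sampling: first index whose cumulative probability exceeds u."""
    for i, c in enumerate(accumulate(probs)):
        if u < c:
            return i
    return len(probs) - 1  # guard against floating-point round-off

def coupled_samples(dist_a, dist_b, n, seed=0):
    """Draw n token pairs, reusing one uniform draw per step for both models."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        u = rng.random()  # shared noise
        pairs.append((sample_with_noise(dist_a, u), sample_with_noise(dist_b, u)))
    return pairs

model_a = [0.6, 0.3, 0.1]  # toy next-token distributions
model_b = [0.5, 0.4, 0.1]
pairs = coupled_samples(model_a, model_b, 1000)
agreement = sum(a == b for a, b in pairs) / len(pairs)
print(agreement)
```

With shared noise, two similar distributions disagree only when the draw falls in the small region where their cumulative curves differ, which is the intuition behind needing fewer samples to compare them.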

Title: On Bob Dylan: A Computational Perspective

Authors: Prashant Garg
Subjects: cs.CL, cs.AI, cs.IR, cs.SI
Abstract URL: https://arxiv.org/abs/2502.01772
Pdf URL: https://arxiv.org/pdf/2502.01772
Copy Paste: [[2502.01772]] On Bob Dylan: A Computational Perspective(https://arxiv.org/abs/2502.01772)
Keywords: language model
Abstract: Cass Sunstein's essay 'On Bob Dylan' describes Dylan's 'dishabituating' style -- a constant refusal to conform to expectation and a penchant for reinventing his musical and lyrical identity. In this paper, I extend Sunstein's observations through a large-scale computational analysis of Dylan's lyrics from 1962 to 2012. Using o3-mini-high (a large language model), I extract concept-to-concept relationships from the lyrics and construct directed knowledge graphs that capture Dylan's thematic structure. I then quantify shifts in sentiment, metaphorical expression, thematic diversity, and network complexity over time. The results indicate that Dylan's lyrics increasingly rely on metaphor, display an evolving sentiment profile, and exhibit heightened dishabituation -- measured here as a growing variance in the network centrality of key concepts. I also find that references to movement, protest, and mythic imagery fluctuate in ways that align with well-known phases of Dylan's career, reflecting the dynamic and unpredictable quality of his art. These findings not only deepen our empirical understanding of Sunstein's thesis but also introduce a novel computational method for analyzing an artist's evolution, offering broader applicability to the study of cultural and creative change.

Title: SelfCheckAgent: Zero-Resource Hallucination Detection in Generative Large Language Models

Authors: Diyana Muhammed, Gollam Rabby, Sören Auer
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2502.01812
Pdf URL: https://arxiv.org/pdf/2502.01812
Copy Paste: [[2502.01812]] SelfCheckAgent: Zero-Resource Hallucination Detection in Generative Large Language Models(https://arxiv.org/abs/2502.01812)
Keywords: language model, gpt, llm, hallucination, chain-of-thought, agent
Abstract: Detecting hallucinations in Large Language Models (LLMs) remains a critical challenge for their reliable deployment in real-world applications. To address this, we introduce SelfCheckAgent, a novel framework integrating three different agents: the Symbolic Agent, the Specialized Detection Agent, and the Contextual Consistency Agent. These agents provide a robust multi-dimensional approach to hallucination detection. Notable results include the Contextual Consistency Agent leveraging Llama 3.1 with Chain-of-Thought (CoT) to achieve outstanding performance on the WikiBio dataset, scoring 93.64% on NonFactual hallucination detection, 70.26% on Factual, and 78.48% on Ranking. On the AIME dataset, GPT-4o with CoT excels in NonFactual detection with 94.89% but reveals trade-offs in Factual with 30.58% and Ranking with 30.68%, underscoring the difficulty of hallucination detection in complex mathematical domains. The framework also incorporates a triangulation strategy, which reinforces the strengths of SelfCheckAgent, yielding significant improvements in real-world hallucination identification. The comparative analysis demonstrates SelfCheckAgent's applicability across diverse domains, positioning it as a crucial advancement for trustworthy LLMs. These findings highlight the potential of consistency-driven methodologies in detecting hallucinations in LLMs.

Title: Latent Lexical Projection in Large Language Models: A Novel Approach to Implicit Representation Refinement

Authors: Ziad Shaker, Brendan Ashdown, Hugo Fitzalan, Alistair Heathcote, Jocasta Huntington
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.01882
Pdf URL: https://arxiv.org/pdf/2502.01882
Copy Paste: [[2502.01882]] Latent Lexical Projection in Large Language Models: A Novel Approach to Implicit Representation Refinement(https://arxiv.org/abs/2502.01882)
Keywords: language model
Abstract: Generating semantically coherent text requires a robust internal representation of linguistic structures, which traditional embedding techniques often fail to capture adequately. A novel approach, Latent Lexical Projection (LLP), is introduced to refine lexical representations through a structured transformation into a latent space, thereby enhancing the alignment between input embeddings and their contextual meanings. The method integrates an optimized projection mechanism within an existing language model architecture, enabling more accurate token selection while maintaining syntactic integrity. Evaluations across multiple benchmarks indicate a reduction in perplexity and an increase in BLEU scores, suggesting improvements in predictive accuracy and fluency. The analysis of lexical diversity reveals a more varied vocabulary in generated text, addressing common issues of redundancy and repetitive phrase structures. Further assessments of entropy distributions demonstrate a decline in uncertainty during decoding, reflecting enhanced confidence in word selection. Additionally, long-range dependency retention exhibits measurable gains, with increased classification accuracy at extended token distances. Computational efficiency remains within manageable constraints, despite the added projection mechanism, highlighting the practicality of LLP for integration into existing architectures.

Title: Conceptual Metaphor Theory as a Prompting Paradigm for Large Language Models

Authors: Oliver Kramer
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.01901
Pdf URL: https://arxiv.org/pdf/2502.01901
Copy Paste: [[2502.01901]] Conceptual Metaphor Theory as a Prompting Paradigm for Large Language Models(https://arxiv.org/abs/2502.01901)
Keywords: language model, llm, prompt
Abstract: We introduce Conceptual Metaphor Theory (CMT) as a framework for enhancing large language models (LLMs) through cognitive prompting in complex reasoning tasks. CMT leverages metaphorical mappings to structure abstract reasoning, improving models' ability to process and explain intricate concepts. By incorporating CMT-based prompts, we guide LLMs toward more structured and human-like reasoning patterns. To evaluate this approach, we compare four native models (Llama3.2, Phi3, Gemma2, and Mistral) against their CMT-augmented counterparts on benchmark tasks spanning domain-specific reasoning, creative insight, and metaphor interpretation. Responses were automatically evaluated using the Llama3.3 70B model. Experimental results indicate that CMT prompting significantly enhances reasoning accuracy, clarity, and metaphorical coherence, outperforming baseline models across all evaluated tasks.

Title: PANDAS: Improving Many-shot Jailbreaking via Positive Affirmation, Negative Demonstration, and Adaptive Sampling

Authors: Avery Ma, Yangchen Pan, Amir-massoud Farahmand
Subjects: cs.CL, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2502.01925
Pdf URL: https://arxiv.org/pdf/2502.01925
Copy Paste: [[2502.01925]] PANDAS: Improving Many-shot Jailbreaking via Positive Affirmation, Negative Demonstration, and Adaptive Sampling(https://arxiv.org/abs/2502.01925)
Keywords: language model, llm, prompt
Abstract: Many-shot jailbreaking circumvents the safety alignment of large language models by exploiting their ability to process long input sequences. To achieve this, the malicious target prompt is prefixed with hundreds of fabricated conversational turns between the user and the model. These fabricated exchanges are randomly sampled from a pool of malicious questions and responses, making it appear as though the model has already complied with harmful instructions. In this paper, we present PANDAS: a hybrid technique that improves many-shot jailbreaking by modifying these fabricated dialogues with positive affirmations, negative demonstrations, and an optimized adaptive sampling method tailored to the target prompt's topic. Extensive experiments on AdvBench and HarmBench, using state-of-the-art LLMs, demonstrate that PANDAS significantly outperforms baseline methods in long-context scenarios. Through an attention analysis, we provide insights on how long-context vulnerabilities are exploited and show how PANDAS further improves upon many-shot jailbreaking.

Title: Can LLMs Maintain Fundamental Abilities under KV Cache Compression?

Authors: Xiang Liu, Zhenheng Tang, Hong Chen, Peijie Dong, Zeyu Li, Xiuze Zhou, Bo Li, Xuming Hu, Xiaowen Chu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.01941
Pdf URL: https://arxiv.org/pdf/2502.01941
Copy Paste: [[2502.01941]] Can LLMs Maintain Fundamental Abilities under KV Cache Compression?(https://arxiv.org/abs/2502.01941)
Keywords: language model, llm
Abstract: This paper investigates an under-explored challenge in large language models (LLMs): the impact of KV cache compression methods on LLMs' fundamental capabilities. While existing methods achieve impressive compression ratios on long-context benchmarks, their effects on core model capabilities remain understudied. We present a comprehensive empirical study evaluating prominent KV cache compression methods across diverse tasks, spanning world knowledge, commonsense reasoning, arithmetic reasoning, code generation, safety, and long-context understanding and generation. Our analysis reveals that KV cache compression methods exhibit task-specific performance degradation. Arithmetic reasoning tasks prove particularly sensitive to aggressive compression, with different methods showing performance drops of 17.4%-43.3%. Notably, the DeepSeek R1 Distill model exhibits more robust compression tolerance compared to instruction-tuned models, showing only 9.67%-25.53% performance degradation. Based on our analysis of attention patterns and cross-task compression performance, we propose ShotKV, a novel compression approach that distinctly handles prefill and decoding phases while maintaining shot-level semantic coherence. Empirical results show that ShotKV achieves 9%-18% performance improvements on long-context generation tasks under aggressive compression ratios.

Title: Token Cleaning: Fine-Grained Data Selection for LLM Supervised Fine-Tuning

Authors: Jinlong Pang, Na Di, Zhaowei Zhu, Jiaheng Wei, Hao Cheng, Chen Qian, Yang Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.01968
Pdf URL: https://arxiv.org/pdf/2502.01968
Copy Paste: [[2502.01968]] Token Cleaning: Fine-Grained Data Selection for LLM Supervised Fine-Tuning(https://arxiv.org/abs/2502.01968)
Keywords: language model, llm
Abstract: Recent studies show that in supervised fine-tuning (SFT) of large language models (LLMs), data quality matters more than quantity. While most data cleaning methods concentrate on filtering entire samples, the quality of individual tokens within a sample can vary significantly. After pre-training, even in high-quality samples, patterns or phrases that are not task-related can be redundant or uninformative. Continuing to fine-tune on these patterns may offer limited benefit and even degrade downstream task performance. In this paper, we investigate token quality from a noisy-label perspective and propose a generic token cleaning pipeline for SFT tasks. Our method filters out uninformative tokens while preserving those carrying key task-specific information. Specifically, we first evaluate token quality by examining the influence of model updates on each token, then apply a threshold-based separation. The token influence can be measured in a single pass with a fixed reference model or iteratively with self-evolving reference models. The benefits and limitations of both methods are analyzed theoretically by error upper bounds. Extensive experiments show that our framework consistently improves performance across multiple downstream tasks.
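To make the threshold-based separation described in the abstract above more concrete, here is a minimal Python sketch. It assumes that token influence is approximated by the drop in per-token loss between a base model and a reference model; that scoring rule, the function names `token_mask` and `masked_mean_loss`, and the toy loss values are illustrative assumptions, not the authors' implementation.

```python
from typing import List


def token_mask(base_losses: List[float], ref_losses: List[float],
               threshold: float) -> List[int]:
    """Return 1 for tokens kept in the SFT loss and 0 for tokens masked out."""
    if len(base_losses) != len(ref_losses):
        raise ValueError("loss lists must have the same length")
    # Assumed influence score: how much the reference model lowers each token's loss.
    scores = [b - r for b, r in zip(base_losses, ref_losses)]
    return [1 if s > threshold else 0 for s in scores]


def masked_mean_loss(losses: List[float], mask: List[int]) -> float:
    """Average the loss over kept tokens only (0.0 if nothing is kept)."""
    kept = [loss for loss, m in zip(losses, mask) if m]
    return sum(kept) / len(kept) if kept else 0.0


# Hypothetical per-token losses for a four-token response.
base = [2.0, 0.3, 1.5, 0.2]
ref = [0.5, 0.25, 0.4, 0.2]
mask = token_mask(base, ref, threshold=0.5)
print(mask, masked_mean_loss(base, mask))
```

In this toy run only the first and third tokens clear the threshold, so only they would contribute to the fine-tuning loss.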

Title: CITER: Collaborative Inference for Efficient Large Language Model Decoding with Token-Level Routing

Authors: Wenhao Zheng, Yixiao Chen, Weitong Zhang, Souvik Kundu, Yun Li, Zhengzhong Liu, Eric P. Xing, Hongyi Wang, Huaxiu Yao
Subjects: cs.CL, cs.AI, cs.LG, cs.PF
Abstract URL: https://arxiv.org/abs/2502.01976
Pdf URL: https://arxiv.org/pdf/2502.01976
Copy Paste: [[2502.01976]] CITER: Collaborative Inference for Efficient Large Language Model Decoding with Token-Level Routing(https://arxiv.org/abs/2502.01976)
Keywords: language model, llm
Abstract: Large language models have achieved remarkable success in various tasks but suffer from high computational costs during inference, limiting their deployment in resource-constrained applications. To address this issue, we propose a novel CITER (Collaborative Inference with Token-lEvel Routing) framework that enables efficient collaboration between small and large language models (SLMs & LLMs) through a token-level routing strategy. Specifically, CITER routes non-critical tokens to an SLM for efficiency and routes critical tokens to an LLM for generalization quality. We formulate router training as a policy optimization, where the router receives rewards based on both the quality of predictions and the inference costs of generation. This allows the router to learn to predict token-level routing scores and make routing decisions based on both the current token and the future impact of its decisions. To further accelerate the reward evaluation process, we introduce a shortcut that significantly reduces the cost of reward estimation and improves the practicality of our approach. Extensive experiments on five benchmark datasets demonstrate that CITER reduces the inference costs while preserving high-quality generation, offering a promising solution for real-time and resource-constrained applications.
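The token-level routing idea in the abstract above can be illustrated with a short Python sketch. Everything here is hypothetical: the `route_decode` function, the toy models, the router, and the convention that a high router score means the small model suffices are assumptions for illustration. The sketch shows only the decoding loop, not the policy-optimization training of the router.

```python
from typing import Callable, List, Tuple

TokenFn = Callable[[List[str]], str]    # maps a context to the next token
ScoreFn = Callable[[List[str]], float]  # maps a context to a routing score


def route_decode(prompt: List[str], slm: TokenFn, llm: TokenFn, router: ScoreFn,
                 threshold: float, max_new_tokens: int) -> Tuple[List[str], List[str]]:
    """Generate tokens one at a time, picking a model per token."""
    tokens = list(prompt)
    choices: List[str] = []
    for _ in range(max_new_tokens):
        # Assumed convention: a high score means the small model is good enough.
        if router(tokens) >= threshold:
            tokens.append(slm(tokens))
            choices.append("slm")
        else:
            tokens.append(llm(tokens))
            choices.append("llm")
    return tokens, choices


# Toy stand-ins for the two models and the router (hypothetical).
def toy_slm(tokens: List[str]) -> str:
    return "the"


def toy_llm(tokens: List[str]) -> str:
    return "answer"


def toy_router(tokens: List[str]) -> float:
    return 0.9 if len(tokens) % 2 == 0 else 0.1


tokens, choices = route_decode(["q"], toy_slm, toy_llm, toy_router,
                               threshold=0.5, max_new_tokens=3)
print(tokens, choices)
```

The share of `"llm"` entries in `choices` is a rough proxy for the inference cost that such a router would try to reduce.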

Title: Gradient-Regularized Latent Space Modulation in Large Language Models for Structured Contextual Synthesis

Authors: Derek Yotheringhay, Beatrix Nightingale, Maximilian Featherstone, Edmund Worthington, Hugo Ashdown
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.01979
Pdf URL: https://arxiv.org/pdf/2502.01979
Copy Paste: [[2502.01979]] Gradient-Regularized Latent Space Modulation in Large Language Models for Structured Contextual Synthesis(https://arxiv.org/abs/2502.01979)
Keywords: language model
Abstract: Generating structured textual content requires mechanisms that enforce coherence, stability, and adherence to predefined constraints while maintaining semantic fidelity. Conventional approaches often rely on rule-based heuristics or fine-tuning strategies that lack flexibility and generalizability across diverse tasks. The incorporation of Gradient-Regularized Latent Space Modulation (GRLSM) introduces a novel paradigm for guiding text generation through the application of structured constraints within the latent space. The integration of gradient-based regularization mitigates abrupt variations in latent representations, ensuring a smoother encoding process that enhances structural consistency and logical progression within generated sequences. Comparative evaluations demonstrate that latent space modulation leads to a reduction in perplexity, increased coherence scores, and improved structural alignment across multiple domains. Stability assessments further indicate that the imposition of spectral norm constraints facilitates more controlled variations in generated text, preserving semantic consistency under input perturbations. Empirical results confirm that structured latent space constraints not only refine the organization of generated outputs but also enhance interpretability through more predictable and reliable synthesis patterns. Performance metrics illustrate that the GRLSM framework substantially reduces structural inconsistencies while preserving the generative flexibility inherent in neural models.

Title: Can LLMs Assist Annotators in Identifying Morality Frames? -- Case Study on Vaccination Debate on Social Media

Authors: Tunazzina Islam, Dan Goldwasser
Subjects: cs.CL, cs.AI, cs.CY, cs.HC, cs.SI
Abstract URL: https://arxiv.org/abs/2502.01991
Pdf URL: https://arxiv.org/pdf/2502.01991
Copy Paste: [[2502.01991]] Can LLMs Assist Annotators in Identifying Morality Frames? -- Case Study on Vaccination Debate on Social Media(https://arxiv.org/abs/2502.01991)
Keywords: language model, llm
Abstract: Nowadays, social media is pivotal in shaping public discourse, especially on polarizing issues like vaccination, where diverse moral perspectives influence individual opinions. In NLP, data scarcity and the complexity of psycholinguistic tasks such as identifying morality frames make relying solely on human annotators costly, time-consuming, and prone to inconsistency due to cognitive load. To address these issues, we leverage large language models (LLMs), which are adept at adapting to new tasks through few-shot learning, utilizing a handful of in-context examples coupled with explanations that connect examples to task principles. Our research explores LLMs' potential to assist human annotators in identifying morality frames within vaccination debates on social media. We employ a two-step process: generating concepts and explanations with LLMs, followed by human evaluation using a "think-aloud" tool. Our study shows that integrating LLMs into the annotation process enhances accuracy, reduces task difficulty, and lowers cognitive load, suggesting a promising avenue for human-AI collaboration in complex psycholinguistic tasks.

Title: Wavelet-based Positional Representation for Long Context

Authors: Yui Oka, Taku Hasegawa, Kyosuke Nishida, Kuniko Saito
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.02004
Pdf URL: https://arxiv.org/pdf/2502.02004
Copy Paste: [[2502.02004]] Wavelet-based Positional Representation for Long Context(https://arxiv.org/abs/2502.02004)
Keywords: language model, long context
Abstract: In the realm of large-scale language models, a significant challenge arises when extrapolating sequences beyond the maximum allowable length. This is because the model's position embedding mechanisms are limited to positions encountered during training, thus preventing effective representation of positions in longer sequences. We analyzed conventional position encoding methods for long contexts and found the following characteristics. (1) When the representation dimension is regarded as the time axis, Rotary Position Embedding (RoPE) can be interpreted as a restricted wavelet transform using Haar-like wavelets. However, because it uses only a fixed scale parameter, it does not fully exploit the advantages of wavelet transforms, which capture the fine movements of non-stationary signals using multiple scales (window sizes). This limitation could explain why RoPE performs poorly in extrapolation. (2) Previous research as well as our own analysis indicates that Attention with Linear Biases (ALiBi) functions similarly to windowed attention, using windows of varying sizes. However, it has limitations in capturing deep dependencies because it restricts the receptive field of the model. From these insights, we propose a new position representation method that captures multiple scales (i.e., window sizes) by leveraging wavelet transforms without limiting the model's attention field. Experimental results show that this new method improves the performance of the model in both short and long contexts. In particular, our method allows extrapolation of position information without limiting the model's attention field.
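As background for the abstract's remark that ALiBi behaves like windowed attention, the following Python sketch builds the ALiBi bias that is added to causal attention scores. It follows the commonly published ALiBi formulation (head slopes in a geometric sequence, valid when the number of heads is a power of two) and is not the wavelet-based method proposed in the paper. The function names are illustrative.

```python
from typing import List


def alibi_slopes(num_heads: int) -> List[float]:
    """Head-specific slopes 2^(-8/n), 2^(-16/n), ... (num_heads a power of two)."""
    return [2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)]


def alibi_bias(seq_len: int, slope: float) -> List[List[float]]:
    """Bias added to attention logits: slope * (j - i) for keys j <= query i."""
    neg_inf = float("-inf")  # future positions are masked out (causal attention)
    return [
        [slope * (j - i) if j <= i else neg_inf for j in range(seq_len)]
        for i in range(seq_len)
    ]


slopes = alibi_slopes(8)
bias = alibi_bias(3, slopes[0])
for row in bias:
    print(row)
```

Because the penalty grows linearly with distance, heads with large slopes effectively attend to a short window and heads with small slopes to a longer one, which matches the "windows of varying sizes" reading in the abstract.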

Title: Reasoning Bias of Next Token Prediction Training

Authors: Pengxiao Lin, Zhongwang Zhang, Zhi-Qin John Xu
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2502.02007
Pdf URL: https://arxiv.org/pdf/2502.02007
Copy Paste: [[2502.02007]] Reasoning Bias of Next Token Prediction Training(https://arxiv.org/abs/2502.02007)
Keywords: language model, llm
Abstract: Since the inception of Large Language Models (LLMs), the quest to efficiently train them for superior reasoning capabilities has been a pivotal challenge. The dominant training paradigm for LLMs is based on next token prediction (NTP). An alternative methodology, Critical Token Prediction (CTP), focuses exclusively on specific critical tokens (such as the answer in a Q&A dataset), aiming to reduce overfitting to extraneous information and noise. Contrary to initial assumptions, our research reveals that despite NTP's exposure to noise during training, it surpasses CTP in reasoning ability. We attribute this counterintuitive outcome to the regularizing influence of noise on the training dynamics. Our empirical analysis shows that NTP-trained models exhibit enhanced generalization and robustness across various benchmark reasoning datasets, demonstrating greater resilience to perturbations and achieving flatter loss minima. These findings show that NTP is instrumental in fostering reasoning abilities during pretraining, whereas CTP is more effective for finetuning, thereby enriching our understanding of optimal training strategies in LLM development.
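The difference between the two training objectives named in the abstract above comes down to which tokens enter the loss. The Python sketch below assumes each objective is an average negative log-likelihood, over all target tokens for NTP and over only the tokens flagged as critical for CTP. The function names and the toy probabilities are hypothetical.

```python
import math
from typing import List


def ntp_loss(target_probs: List[float]) -> float:
    """Mean negative log-likelihood over every target token."""
    return -sum(math.log(p) for p in target_probs) / len(target_probs)


def ctp_loss(target_probs: List[float], critical: List[bool]) -> float:
    """Mean negative log-likelihood over critical tokens only."""
    picked = [p for p, is_critical in zip(target_probs, critical) if is_critical]
    if not picked:
        raise ValueError("at least one token must be marked critical")
    return -sum(math.log(p) for p in picked) / len(picked)


# Hypothetical model probabilities for the correct token at each position;
# only the last position (the answer) is treated as critical.
probs = [0.5, 0.25, 0.125]
critical = [False, False, True]
print(ntp_loss(probs), ctp_loss(probs, critical))
```

Under NTP the question and context tokens also produce gradient signal, which is the "noise" the abstract argues acts as a regularizer.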

Title: Fine-tuning Language Models for Recipe Generation: A Comparative Analysis and Benchmark Study

Authors: Anneketh Vij, Changhao Liu, Rahul Anil Nair, Theo Ho, Edward Shi, Ayan Bhowmick
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.02028
Pdf URL: https://arxiv.org/pdf/2502.02028
Copy Paste: [[2502.02028]] Fine-tuning Language Models for Recipe Generation: A Comparative Analysis and Benchmark Study(https://arxiv.org/abs/2502.02028)
Keywords: language model, llm
Abstract: This research explores the recipe generation task by fine-tuning various very small language models, with a focus on developing robust evaluation metrics and comparing different language models on the open-ended task of recipe generation. This study presents extensive experiments with multiple model architectures, ranging from T5-small (Raffel et al., 2023) and SmolLM-135M (Allal et al., 2024) to Phi-2 (Research, 2023), implementing both traditional NLP metrics and custom domain-specific evaluation metrics. Our novel evaluation framework incorporates recipe-specific metrics for assessing content quality and introduces an approach to allergen substitution. The results indicate that, while larger models generally perform better on standard metrics, the relationship between model size and recipe quality is more nuanced when considering domain-specific metrics. We find that SmolLM-360M and SmolLM-1.7B demonstrate comparable performance despite their size difference, while Phi-2 shows limitations in recipe generation despite its larger parameter count. Our comprehensive evaluation framework and allergen substitution system provide valuable insights for future work in recipe generation and broader NLG tasks that require domain expertise and safety considerations.

Title: M2R2: Mixture of Multi-Rate Residuals for Efficient Transformer Inference

Authors: Nikhil Bhendawade, Mahyar Najibi, Devang Naik, Irina Belousova
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.02040
Pdf URL: https://arxiv.org/pdf/2502.02040
Copy Paste: [[2502.02040]] M2R2: Mixture of Multi-Rate Residuals for Efficient Transformer Inference(https://arxiv.org/abs/2502.02040)
Keywords: language model, llm
Abstract: Residual transformations enhance the representational depth and expressive power of large language models (LLMs). However, applying static residual transformations across all tokens in auto-regressive generation leads to a suboptimal trade-off between inference efficiency and generation fidelity. Existing methods, including Early Exiting, Skip Decoding, and Mixture-of-Depth, address this by modulating the residual transformation based on token-level complexity. Nevertheless, these approaches predominantly consider the distance traversed by tokens through the model layers, neglecting the underlying velocity of residual evolution. We introduce Mixture of Multi-rate Residuals (M2R2), a framework that dynamically modulates residual velocity to improve early alignment, enhancing inference efficiency. Evaluations on reasoning-oriented tasks such as Koala, Self-Instruct, WizardLM, and MT-Bench show that M2R2 surpasses state-of-the-art distance-based strategies, balancing generation quality and speedup. In a self-speculative decoding setup, M2R2 achieves up to 2.8x speedups on MT-Bench, outperforming methods like 2-model speculative decoding, Medusa, LookAhead Decoding, and DEED. In Mixture-of-Experts (MoE) architectures, integrating early residual alignment with ahead-of-time expert loading into high-bandwidth memory (HBM) accelerates decoding, reduces expert-switching bottlenecks, and achieves a 2.9x speedup, making it highly effective in resource-constrained environments.

Title: Contextual Memory Reweaving in Large Language Models Using Layered Latent State Reconstruction

Authors: Frederick Dillon, Gregor Halvorsen, Simon Tattershall, Magnus Rowntree, Gareth Vanderpool
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.02046
Pdf URL: https://arxiv.org/pdf/2502.02046
Copy Paste: [[2502.02046]] Contextual Memory Reweaving in Large Language Models Using Layered Latent State Reconstruction(https://arxiv.org/abs/2502.02046)
Keywords: language model
Abstract: Memory retention challenges in deep neural architectures continue to limit the ability to process and recall extended contextual information. Token dependencies degrade as sequence length increases, leading to a decline in coherence and factual consistency across longer outputs. A structured approach is introduced to mitigate this issue through the reweaving of latent states captured at different processing layers, reinforcing token representations over extended sequences. The proposed Contextual Memory Reweaving framework incorporates a Layered Latent State Reconstruction mechanism to systematically integrate past contextual embeddings without introducing external memory modules. Experimental results demonstrate improvements in recall accuracy across a range of sequence lengths, with notable gains in the retention of rarely occurring tokens and numerical reasoning consistency. Further analysis of computational efficiency indicates that the additional processing overhead remains within acceptable thresholds, enabling scalability across different model sizes. Evaluations in long-form text generation and ambiguous query resolution highlight the capacity of memory reweaving to enhance continuity and reduce inconsistencies over extended outputs. Attention weight distributions reveal more structured allocation patterns, suggesting that reweaved latent states contribute to improved contextual awareness. The findings establish a framework for refining memory retention mechanisms in language models, addressing long-standing challenges in handling complex, multi-step reasoning tasks.

Title: ASCenD-BDS: Adaptable, Stochastic and Context-aware framework for Detection of Bias, Discrimination and Stereotyping

Authors: Rajiv Bahl, Venkatesan N, Parimal Aglawe, Aastha Sarasapalli, Bhavya Kancharla, Chaitanya kolukuluri, Harish Mohite, Japneet Hora, Kiran Kakollu, Rahul Diman, Shubham Kapale, Sri Bhagya Kathula, Vamsikrishna Motru, Yogeshwar Reddy
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2502.02072
Pdf URL: https://arxiv.org/pdf/2502.02072
Copy Paste: [[2502.02072]] ASCenD-BDS: Adaptable, Stochastic and Context-aware framework for Detection of Bias, Discrimination and Stereotyping(https://arxiv.org/abs/2502.02072)
Keywords: language model, llm
Abstract: The rapid evolution of Large Language Models (LLMs) has transformed natural language processing but raises critical concerns about biases inherent in their deployment and use across diverse linguistic and sociocultural contexts. This paper presents a framework named ASCenD-BDS (Adaptable, Stochastic and Context-aware framework for Detection of Bias, Discrimination and Stereotyping). The framework presents an approach to detecting bias, discrimination, and stereotyping across various categories such as gender, caste, age, disability, socioeconomic status, and linguistic variations, using an approach that is Adaptive, Stochastic and Context-Aware. Existing frameworks rely heavily on datasets to generate scenarios for the detection of Bias, Discrimination and Stereotyping. Examples include datasets such as Civil Comments, Wino Gender, WinoBias, BOLD, CrowS Pairs and BBQ. However, such an approach provides point solutions. As a result, these datasets provide a finite number of scenarios for assessment. The current framework overcomes this limitation through features that enable Adaptability, Stochasticity, and Context Awareness. Context awareness can be customized for any nation, culture, or sub-culture (for example, an organization's unique culture). In this paper, context awareness in the Indian context has been established. Content from the Indian Census 2011 has been leveraged to provide a common basis for categorization. A framework has been developed using Category, Sub-Category, STEM, X-Factor, and Synonym to enable the features for Adaptability, Stochasticity and Context awareness. The framework is described in detail in Section 3. Overall, more than 800 STEMs, 10 Categories, and 31 unique SubCategories were developed by a team of consultants at Saint Fox Consultancy Private Ltd. The concept has been tested in SFCLabs as part of product development.

Title: Rethinking stance detection: A theoretically-informed research agenda for user-level inference using language models

Authors: Prasanta Bhattacharya, Hong Zhang, Yiming Cao, Wei Gao, Brandon Siyuan Loh, Joseph J.P. Simons, Liang Ze Wong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.02074
Pdf URL: https://arxiv.org/pdf/2502.02074
Copy Paste: [[2502.02074]] Rethinking stance detection: A theoretically-informed research agenda for user-level inference using language models(https://arxiv.org/abs/2502.02074)
Keywords: language model, llm
Abstract: Stance detection has emerged as a popular task in natural language processing research, enabled largely by the abundance of target-specific social media data. While there has been considerable research on the development of stance detection models, datasets, and application, we highlight important gaps pertaining to (i) a lack of theoretical conceptualization of stance, and (ii) the treatment of stance at an individual- or user-level, as opposed to message-level. In this paper, we first review the interdisciplinary origins of stance as an individual-level construct to highlight relevant attributes (e.g., psychological features) that might be useful to incorporate in stance detection models. Further, we argue that recent pre-trained and large language models (LLMs) might offer a way to flexibly infer such user-level attributes and/or incorporate them in modelling stance. To better illustrate this, we briefly review and synthesize the emerging corpus of studies on using LLMs for inferring stance, and specifically on incorporating user attributes in such tasks. We conclude by proposing a four-point agenda for pursuing stance detection research that is theoretically informed, inclusive, and practically impactful.

Title: LongDPO: Unlock Better Long-form Generation Abilities for LLMs via Critique-augmented Stepwise Information

Authors: Bowen Ping, Jiali Zeng, Fandong Meng, Shuo Wang, Jie Zhou, Shanghang Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.02095
Pdf URL: https://arxiv.org/pdf/2502.02095
Copy Paste: [[2502.02095]] LongDPO: Unlock Better Long-form Generation Abilities for LLMs via Critique-augmented Stepwise Information(https://arxiv.org/abs/2502.02095)
Keywords: gpt, llm
Abstract: Long-form generation is crucial for writing academic papers and for repo-level code generation. Despite this, current models, including GPT-4o, still exhibit unsatisfactory performance. Existing methods that utilize preference learning with outcome supervision often fail to provide detailed feedback for extended contexts. This shortcoming can lead to content that does not fully satisfy query requirements, resulting in issues like length deviations and diminished quality. In this paper, we propose enhancing long-form generation by incorporating process supervision. We employ Monte Carlo Tree Search to gather stepwise preference pairs, utilizing a global memory pool to maintain consistency. To address the issue of suboptimal candidate selection, we integrate external critiques to refine and improve the quality of the preference pairs. Finally, we apply step-level DPO using the collected stepwise preference pairs. Experimental results show that our method improves length and quality on long-form generation benchmarks, with almost lossless performance on general benchmarks across various model backbones.
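The step-level DPO mentioned in the abstract above applies the DPO objective to stepwise preference pairs. As a hedged illustration, the Python sketch below computes the standard DPO loss for a single (chosen, rejected) pair from sequence log-probabilities under the policy and a frozen reference model. How LongDPO builds its pairs (tree search, memory pool, critiques) is not shown, and the function name, argument names, and numbers are placeholders.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair: -log(sigmoid(margin))."""
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # Numerically stable softplus(-margin), which equals -log(sigmoid(margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Placeholder log-probabilities: the policy has not moved from the reference...
neutral = dpo_loss(-5.0, -5.0, -5.0, -5.0)
# ...versus a policy that now prefers the chosen step over the rejected one.
improved = dpo_loss(-4.0, -6.0, -5.0, -5.0)
print(neutral, improved)
```

The loss falls as the policy raises the chosen step's likelihood relative to the rejected one, measured against the reference model.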

Title: Mass-Editing Memory with Attention in Transformers: A cross-lingual exploration of knowledge

Authors: Daniel Tamayo, Aitor Gonzalez-Agirre, Javier Hernando, Marta Villegas
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.02173
Pdf URL: https://arxiv.org/pdf/2502.02173
Copy Paste: [[2502.02173]] Mass-Editing Memory with Attention in Transformers: A cross-lingual exploration of knowledge(https://arxiv.org/abs/2502.02173)
Keywords: language model
Abstract: Recent research has explored methods for updating and modifying factual knowledge in large language models, often focusing on specific multi-layer perceptron blocks. This study expands on this work by examining the effectiveness of existing knowledge editing methods across languages and delving into the role of attention mechanisms in this process. Drawing from the insights gained, we propose Mass-Editing Memory with Attention in Transformers (MEMAT), a method that achieves significant improvements in all metrics while requiring minimal parameter modifications. MEMAT delivers a remarkable 10% increase in magnitude metrics, benefits languages not included in the training data and also demonstrates a high degree of portability. Our code and data are at this https URL.

Title: When Dimensionality Hurts: The Role of LLM Embedding Compression for Noisy Regression Tasks

Authors: Felix Drinkall, Janet B. Pierrehumbert, Stefan Zohren
Subjects: cs.CL, cs.CE, cs.LG, q-fin.CP
Abstract URL: https://arxiv.org/abs/2502.02199
Pdf URL: https://arxiv.org/pdf/2502.02199
Copy Paste: [[2502.02199]] When Dimensionality Hurts: The Role of LLM Embedding Compression for Noisy Regression Tasks(https://arxiv.org/abs/2502.02199)
Keywords: language model, llm
Abstract: Large language models (LLMs) have shown remarkable success in language modelling due to scaling laws found in model size and the hidden dimension of the model's text representation. Yet, we demonstrate that compressed representations of text can yield better performance in LLM-based regression tasks. In this paper, we compare the relative performance of embedding compression in three different signal-to-noise contexts: financial return prediction, writing quality assessment and review scoring. Our results show that compressing embeddings, in a minimally supervised manner using an autoencoder's hidden representation, can mitigate overfitting and improve performance on noisy tasks, such as financial return prediction; but that compression reduces performance on tasks that have high causal dependencies between the input and target data. Our results suggest that the success of interpretable compressed representations such as sentiment may be due to a regularising effect.

Title: Conversation AI Dialog for Medicare powered by Finetuning and Retrieval Augmented Generation

Authors: Atharva Mangeshkumar Agrawal, Rutika Pandurang Shinde, Vasanth Kumar Bhukya, Ashmita Chakraborty, Sagar Bharat Shah, Tanmay Shukla, Sree Pradeep Kumar Relangi, Nilesh Mutyam
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.02249
Pdf URL: https://arxiv.org/pdf/2502.02249
Copy Paste: [[2502.02249]] Conversation AI Dialog for Medicare powered by Finetuning and Retrieval Augmented Generation(https://arxiv.org/abs/2502.02249)
Keywords: language model, gpt, llm, chat, retrieval augmented generation, retrieval-augmented generation
Abstract: Large language models (LLMs) have shown impressive capabilities in natural language processing tasks, including dialogue generation. This research aims to conduct a novel comparative analysis of two prominent techniques, fine-tuning with LoRA (Low-Rank Adaptation) and the Retrieval-Augmented Generation (RAG) framework, in the context of doctor-patient chat conversations with multiple datasets of mixed medical domains. The analysis involves three state-of-the-art models: Llama-2, GPT, and the LSTM model. Employing real-world doctor-patient dialogues, we comprehensively evaluate the performance of models, assessing key metrics such as language quality (perplexity, BLEU score), factual accuracy (fact-checking against medical knowledge bases), adherence to medical guidelines, and overall human judgments (coherence, empathy, safety). The findings provide insights into the strengths and limitations of each approach, shedding light on their suitability for healthcare applications. Furthermore, the research investigates the robustness of the models in handling diverse patient queries, ranging from general health inquiries to specific medical conditions. The impact of domain-specific knowledge integration is also explored, highlighting the potential for enhancing LLM performance through targeted data augmentation and retrieval strategies.

Title: Evalita-LLM: Benchmarking Large Language Models on Italian

Authors: Bernardo Magnini, Roberto Zanoli, Michele Resta, Martin Cimmino, Paolo Albano, Marco Madeddu, Viviana Patti
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.02289
Pdf URL: https://arxiv.org/pdf/2502.02289
Copy Paste: [[2502.02289]] Evalita-LLM: Benchmarking Large Language Models on Italian(https://arxiv.org/abs/2502.02289)
Keywords: language model, llm, prompt
Abstract: We describe Evalita-LLM, a new benchmark designed to evaluate Large Language Models (LLMs) on Italian tasks. The distinguishing and innovative features of Evalita-LLM are the following: (i) all tasks are native Italian, avoiding issues of translating from Italian and potential cultural biases; (ii) in addition to well established multiple-choice tasks, the benchmark includes generative tasks, enabling more natural interaction with LLMs; (iii) all tasks are evaluated against multiple prompts, this way mitigating the model sensitivity to specific prompts and allowing a fairer and objective evaluation. We propose an iterative methodology, where candidate tasks and candidate prompts are validated against a set of LLMs used for development. We report experimental results from the benchmark's development phase, and provide performance statistics for several state-of-the-art LLMs.
摘要：我们描述了 Evalita-LLM，这是一种旨在评估意大利语任务的大型语言模型 (LLM) 的新基准。Evalita-LLM 的特色和创新功能如下：(i) 所有任务都是意大利语母语，避免了意大利语翻译问题和潜在的文化偏见；(ii) 除了完善的多项选择任务外，基准还包括生成任务，从而实现与 LLM 的更自然交互；(iii) 所有任务都针对多个提示进行评估，这样可以减轻模型对特定提示的敏感性，并实现更公平和客观的评估。我们提出了一种迭代方法，其中候选任务和候选提示根据用于开发的一组 LLM 进行验证。我们报告了基准开发阶段的实验结果，并为几种最先进的 LLM 提供了性能统计数据。
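
Note: the multi-prompt evaluation in point (iii) can be illustrated with a minimal sketch. It assumes a hypothetical `toy_model` callable, two made-up prompt templates, and exact-match scoring; none of these come from Evalita-LLM itself, whose actual prompts, tasks, and metrics are described in the paper.

```python
from statistics import mean
from typing import Callable, Dict, List, Tuple


def accuracy_per_prompt(
    model: Callable[[str], str],
    prompts: List[str],
    examples: List[Tuple[str, str]],
) -> Dict[str, float]:
    """Score the same examples once per prompt template (exact match)."""
    scores = {}
    for template in prompts:
        correct = sum(
            model(template.format(question=question)) == gold
            for question, gold in examples
        )
        scores[template] = correct / len(examples)
    return scores


def summarize(scores: Dict[str, float]) -> Dict[str, float]:
    """Report the mean together with the spread, so one lucky prompt cannot dominate."""
    values = list(scores.values())
    return {"mean": mean(values), "best": max(values), "worst": min(values)}


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM: its answer depends only on the prompt wording."""
    return "A" if "Domanda" in prompt else "B"


# Made-up templates and gold labels, for illustration only.
prompts = ["Domanda: {question}\nRisposta:", "{question}"]
examples = [("q1", "A"), ("q2", "A"), ("q3", "B"), ("q4", "A")]

scores = accuracy_per_prompt(toy_model, prompts, examples)
summary = summarize(scores)
print(summary)
```

Reporting the worst and best prompt next to the mean makes prompt sensitivity visible instead of hiding it inside a single number.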

Title: Boosting Multimodal Reasoning with MCTS-Automated Structured Thinking

Authors: Jinyang Wu, Mingkuan Feng, Shuai Zhang, Ruihan Jin, Feihu Che, Zengqi Wen, Jianhua Tao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.02339
Pdf URL: https://arxiv.org/pdf/2502.02339
Copy Paste: [[2502.02339]] Boosting Multimodal Reasoning with MCTS-Automated Structured Thinking(https://arxiv.org/abs/2502.02339)
Keywords: language model, gpt, llm
Abstract: Multimodal large language models (MLLMs) exhibit impressive capabilities but still face challenges in complex visual reasoning. While recent efforts attempt to enhance MLLMs' reasoning by incorporating OpenAI o1-like structured thinking through explicit search structures or teacher-guided distillation, they often struggle to balance performance and efficiency. A critical limitation is their heavy reliance on extensive data and search spaces, resulting in low-efficiency implicit insight extraction and data utilization. To address this, we propose AStar, an Automated Structured thinking paradigm for multimodal reasoning via Monte Carlo Tree Search (MCTS). AStar automatically derives high-level cognitive reasoning patterns from limited data using MCTS-powered hierarchical structures. Building on these explicit patterns, we design a unified reasoning framework that seamlessly integrates models' internal reasoning capabilities and external reasoning guidelines, enabling efficient inference with minimal tree iterations. This novel paradigm strikes a compelling balance between performance and efficiency. Extensive experiments demonstrate AStar's effectiveness, achieving superior accuracy (54.0%) on the MathVerse benchmark with a 7B backbone, surpassing GPT-4o (50.2%) while maintaining substantial data and computational efficiency.
摘要：多模态大型语言模型 (MLLM) 表现出令人印象深刻的能力，但在复杂的视觉推理方面仍面临挑战。虽然最近的努力试图通过明确的搜索结构或教师指导的提炼来整合类似 OpenAI o1 的结构化思维来增强 MLLM 的推理能力，但它们往往难以平衡性能和效率。一个关键的限制是它们严重依赖广泛的数据和搜索空间，导致隐性洞察提取和数据利用效率低下。为了解决这个问题，我们提出了 AStar，一种通过蒙特卡洛树搜索 (MCTS) 进行多模态推理的自动结构化思维范式。AStar 使用由 MCTS 驱动的层次结构从有限的数据中自动得出高级认知推理模式。基于这些显式模式，我们设计了一个统一的推理框架，无缝集成了模型的内部推理能力和外部推理指南，从而以最少的树迭代实现高效的推理。这种新范式在性能和效率之间取得了令人信服的平衡。大量实验证明了 AStar 的有效性，在具有 7B 主干的 MathVerse 基准上实现了卓越的准确率 (54.0%)，超越了 GPT-4o (50.2%)，同时保持了大量数据和计算效率。

Title: Premise-Augmented Reasoning Chains Improve Error Identification in Math reasoning with LLMs

Authors: Sagnik Mukherjee, Abhinav Chinta, Takyoung Kim, Tarun Anoop Sharma, Dilek Hakkani Tur
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.02362
Pdf URL: https://arxiv.org/pdf/2502.02362
Copy Paste: [[2502.02362]] Premise-Augmented Reasoning Chains Improve Error Identification in Math reasoning with LLMs(https://arxiv.org/abs/2502.02362)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Chain-of-Thought (CoT) prompting enhances mathematical reasoning in large language models (LLMs) by enabling detailed step-by-step solutions. However, due to the verbosity of LLMs, the resulting reasoning chains can be long, making it harder to verify the reasoning steps and trace issues resulting from dependencies between the steps that may be farther away in the sequence of steps. Importantly, mathematical reasoning allows each step to be derived from a small set of premises, which are a subset of the preceding steps in the reasoning chain. In this paper, we present a framework that identifies the premises for each step, to improve the evaluation of reasoning. We restructure conventional linear reasoning chains into Premise Augmented Reasoning Chains (PARC) by introducing premise links, resulting in a directed acyclic graph where the nodes are the steps and the edges are the premise links. Through experiments with a PARC-based dataset that we built, namely PERL (Premises and ERrors identification in LLMs), we demonstrate that LLMs can reliably identify premises within complex reasoning chains. In particular, even open-source LLMs achieve 90% recall in premise identification. We also show that PARC helps to identify errors in reasoning chains more reliably. The accuracy of error identification improves by 6% to 16% absolute when step-by-step verification is carried out in PARC under the premises. Our findings highlight the utility of premise-centric representations in addressing complex problem-solving tasks and open new avenues for improving the reliability of LLM-based reasoning evaluations.
摘要：思路链 (CoT) 提示通过提供详细的分步解决方案增强了大型语言模型 (LLM) 中的数学推理能力。然而，由于 LLM 的冗长，产生的推理链可能很长，使得验证推理步骤和追踪步骤之间可能相距较远的依赖关系所导致的问题变得更加困难。重要的是，数学推理允许从一小组前提中得出每个步骤，这些前提是推理链中前面步骤的子集。在本文中，我们提出了一个框架来识别每个步骤的前提，以改进对推理的评估。我们通过引入前提链接将传统的线性推理链重构为前提增强推理链 (PARC)，从而得到一个有向无环图，其中节点是步骤，边是前提链接。通过对我们构建的基于 PARC 的数据集 PERL（LLM 中的前提和错误识别）进行实验，我们证明了 LLM 能够可靠地识别复杂推理链中的前提。特别是，即使是开源 LLM 在前提识别中也能实现 90% 的召回率。我们还表明 PARC 有助于更可靠地识别推理链中的错误。当在 PARC 中根据前提进行分步验证时，错误识别的准确率绝对提高了 6% 到 16%。我们的研究结果强调了以前提为中心的表示在解决复杂问题解决任务中的实用性，并为提高基于 LLM 的推理评估的可靠性开辟了新途径。
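
Note: the restructuring of a linear chain into a directed acyclic graph can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Step` class, the helper names, and the toy algebra chain are all assumptions made here, and premise links are written by hand rather than identified by an LLM.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    text: str
    # Indices of earlier steps this step is derived from (the premise links).
    premises: List[int] = field(default_factory=list)


def validate_chain(chain: List[Step]) -> None:
    """Premise links may only point backwards, which guarantees the graph is acyclic."""
    for i, step in enumerate(chain):
        for p in step.premises:
            if not 0 <= p < i:
                raise ValueError(f"step {i} has invalid premise link {p}")


def verification_context(chain: List[Step], idx: int) -> List[str]:
    """Text a verifier would read for step idx: its direct premises, not the whole prefix."""
    return [chain[p].text for p in chain[idx].premises]


def ancestors(chain: List[Step], idx: int) -> List[int]:
    """All steps that step idx depends on, directly or transitively."""
    seen = set()
    stack = list(chain[idx].premises)
    while stack:
        p = stack.pop()
        if p not in seen:
            seen.add(p)
            stack.extend(chain[p].premises)
    return sorted(seen)


# Toy chain (hypothetical example): solve x + y = 10, x - y = 2.
chain = [
    Step("x + y = 10"),
    Step("x - y = 2"),
    Step("Adding both equations gives 2x = 12", premises=[0, 1]),
    Step("So x = 6", premises=[2]),
    Step("So y = 4", premises=[0, 3]),
]

validate_chain(chain)
print(verification_context(chain, 4))
print(ancestors(chain, 3))
```

Checking the last step here needs only two earlier statements instead of the full four-step prefix, which is the kind of shortening the abstract attributes to premise links.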

Title: STAIR: Improving Safety Alignment with Introspective Reasoning

Authors: Yichi Zhang, Siyuan Zhang, Yao Huang, Zeyu Xia, Zhengwei Fang, Xiao Yang, Ranjie Duan, Dong Yan, Yinpeng Dong, Jun Zhu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.02384
Pdf URL: https://arxiv.org/pdf/2502.02384
Copy Paste: [[2502.02384]] STAIR: Improving Safety Alignment with Introspective Reasoning(https://arxiv.org/abs/2502.02384)
Keywords: language model, llm, chain-of-thought
Abstract: Ensuring the safety and harmlessness of Large Language Models (LLMs) has become as critical as their performance in applications. However, existing safety alignment methods typically suffer from safety-performance trade-offs and susceptibility to jailbreak attacks, primarily due to their reliance on direct refusals for malicious queries. In this paper, we propose STAIR, a novel framework that integrates SafeTy Alignment with Introspective Reasoning. We enable LLMs to identify safety risks through step-by-step analysis by self-improving chain-of-thought (CoT) reasoning with safety awareness. STAIR first equips the model with a structured reasoning capability and then advances safety alignment via iterative preference optimization on step-level reasoning data generated using our newly proposed Safety-Informed Monte Carlo Tree Search (SI-MCTS). We further train a process reward model on this data to guide test-time searches for improved responses. Extensive experiments show that STAIR effectively mitigates harmful outputs while better preserving helpfulness, compared to instinctive alignment strategies. With test-time scaling, STAIR achieves a safety performance comparable to Claude-3.5 against popular jailbreak attacks. Relevant resources in this work are available at this https URL.
摘要：确保大型语言模型 (LLM) 的安全性和无害性已变得与其在应用中的性能同等重要。然而，现有的安全对齐方法通常会受到安全性能权衡和易受越狱攻击的影响，这主要是因为它们依赖于对恶意查询的直接拒绝。在本文中，我们提出了 STAIR，这是一个将安全对齐与自省推理相结合的新框架。我们通过自我改进的具有安全意识的思路链 (CoT) 推理，使 LLM 能够通过逐步分析来识别安全风险。STAIR 首先为模型配备结构化推理能力，然后通过对我们新提出的安全知情蒙特卡洛树搜索 (SI-MCTS) 生成的步骤级推理数据进行迭代偏好优化来推进安全对齐。我们进一步在这些数据上训练过程奖励模型，以指导测试时搜索以改进响应。大量实验表明，与本能对齐策略相比，STAIR 可以有效减轻有害输出，同时更好地保留有用性。通过测试时间扩展，STAIR 实现了与 Claude-3.5 相当的安全性能，可抵御流行的越狱攻击。本文的相关资源可在此 https URL 上找到。

Title: CoAT: Chain-of-Associated-Thoughts Framework for Enhancing Large Language Models Reasoning

Authors: Jianfeng Pan, Senyou Deng, Shaomang Huang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.02390
Pdf URL: https://arxiv.org/pdf/2502.02390
Copy Paste: [[2502.02390]] CoAT: Chain-of-Associated-Thoughts Framework for Enhancing Large Language Models Reasoning(https://arxiv.org/abs/2502.02390)
Keywords: language model, llm
Abstract: Research on LLM technologies is rapidly emerging, with most of them employing a 'fast thinking' approach to inference. Most LLMs generate the final result based solely on a single query and the LLM's reasoning capabilities. However, with the advent of OpenAI-o1, 'slow thinking' techniques have garnered increasing attention because their process is closer to the human thought process. Inspired by the human ability to constantly associate and replenish knowledge during thinking, we developed the novel Chain-of-Associated-Thoughts (CoAT) framework, which introduces an innovative synergy between the Monte Carlo Tree Search (MCTS) algorithm and a dynamic mechanism for integrating new key information, termed 'associative memory'. By combining the structured exploration capabilities of MCTS with the adaptive learning capacity of associative memory, CoAT significantly expands the LLM search space, enabling our framework to explore diverse reasoning pathways and dynamically update its knowledge base in real-time. This allows the framework to not only revisit and refine earlier inferences but also adaptively incorporate evolving information, ensuring that the final output is both accurate and comprehensive. To validate the effectiveness of our framework, we conducted extensive experiments across a range of generative and reasoning tasks. These experiments demonstrated that our framework outperforms conventional inference processes on accuracy, coherence, and diversity. The framework's ability to iteratively expand its search space while retaining contextually relevant information results.
摘要：LLM 技术研究正在迅速兴起，其中大多数采用“快速思考”方法进行推理。大多数 LLM 仅基于单个查询和 LLM 的推理能力生成最终结果。然而，随着 OpenAI-o1 的出现，“慢思考”技术越来越受到关注，因为它的过程更接近人类的思维过程。受人类在思考过程中不断联想和补充知识的能力的启发，我们开发了新颖的联想思维链 (CoAT) 框架，该框架在蒙特卡洛树搜索 (MCTS) 算法和集成新关键信息的动态机制（称为“联想记忆”）之间引入了创新的协同作用。通过将 MCTS 的结构化探索能力与联想记忆的自适应学习能力相结合，CoAT 显著扩展了 LLM 搜索空间，使我们的框架能够探索不同的推理路径并实时动态更新其知识库。这使得框架不仅可以重新审视和改进早期的推理，还可以自适应地整合不断发展的信息，确保最终输出既准确又全面。为了验证我们框架的有效性，我们在一系列生成和推理任务中进行了广泛的实验。这些实验表明，我们的框架在准确性、连贯性和多样性方面优于传统的推理过程。该框架能够迭代扩展其搜索空间，同时保留与上下文相关的信息结果。

Title: Activation-Informed Merging of Large Language Models

Authors: Amin Heyrani Nobari, Kaveh Alimohammadi, Ali ArjomandBigdeli, Akash Srivastava, Faez Ahmed, Navid Azizan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.02421
Pdf URL: https://arxiv.org/pdf/2502.02421
Copy Paste: [[2502.02421]] Activation-Informed Merging of Large Language Models(https://arxiv.org/abs/2502.02421)
Keywords: language model, llm
Abstract: Model merging, a method that combines the parameters and embeddings of multiple fine-tuned large language models (LLMs), offers a promising approach to enhance model performance across various tasks while maintaining computational efficiency. This paper introduces Activation-Informed Merging (AIM), a technique that integrates the information from the activation space of LLMs into the merging process to improve performance and robustness. AIM is designed as a flexible, complementary solution that is applicable to any existing merging method. It aims to preserve critical weights from the base model, drawing on principles from continual learning (CL) and model compression. Utilizing a task-agnostic calibration set, AIM selectively prioritizes essential weights during merging. We empirically demonstrate that AIM significantly enhances the performance of merged models across multiple benchmarks. Our findings suggest that considering the activation-space information can provide substantial advancements in the model merging strategies for LLMs with up to 40% increase in benchmark performance.
摘要：模型合并是一种将多个经过微调的大型语言模型 (LLM) 的参数和嵌入组合在一起的方法，它提供了一种有前途的方法，可以在保持计算效率的同时提高各种任务中的模型性能。本文介绍了激活信息合并 (AIM)，这是一种将 LLM 激活空间中的信息集成到合并过程中以提高性能和鲁棒性的技术。AIM 被设计为一种灵活的互补解决方案，适用于任何现有的合并方法。它旨在保留来自基础模型的关键权重，借鉴持续学习 (CL) 和模型压缩的原理。利用与任务无关的校准集，AIM 在合并过程中有选择地优先考虑基本权重。我们通过实证证明，AIM 显著提高了合并模型在多个基准测试中的性能。我们的研究结果表明，考虑激活空间信息可以为 LLM 的模型合并策略带来显着的进步，基准测试性能最多可提高 40%。

Title: Generative Psycho-Lexical Approach for Constructing Value Systems in Large Language Models

Authors: Haoran Ye, Tianze Zhang, Yuhang Xie, Liyuan Zhang, Yuanyi Ren, Xin Zhang, Guojie Song
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.02444
Pdf URL: https://arxiv.org/pdf/2502.02444
Copy Paste: [[2502.02444]] Generative Psycho-Lexical Approach for Constructing Value Systems in Large Language Models(https://arxiv.org/abs/2502.02444)
Keywords: language model, llm
Abstract: Values are core drivers of individual and collective perception, cognition, and behavior. Value systems, such as Schwartz's Theory of Basic Human Values, delineate the hierarchy and interplay among these values, enabling cross-disciplinary investigations into decision-making and societal dynamics. Recently, the rise of Large Language Models (LLMs) has raised concerns regarding their elusive intrinsic values. Despite growing efforts in evaluating, understanding, and aligning LLM values, a psychologically grounded LLM value system remains underexplored. This study addresses the gap by introducing the Generative Psycho-Lexical Approach (GPLA), a scalable, adaptable, and theoretically informed method for constructing value systems. Leveraging GPLA, we propose a psychologically grounded five-factor value system tailored for LLMs. For systematic validation, we present three benchmarking tasks that integrate psychological principles with cutting-edge AI priorities. Our results reveal that the proposed value system meets standard psychological criteria, better captures LLM values, improves LLM safety prediction, and enhances LLM alignment, when compared to the canonical Schwartz's values.
摘要：价值观是个人和集体感知、认知和行为的核心驱动因素。价值体系，例如施瓦茨的基本人类价值观理论，描绘了这些价值观之间的层次结构和相互作用，使跨学科研究决策和社会动态成为可能。最近，大型语言模型 (LLM) 的兴起引发了人们对其难以捉摸的内在价值的担忧。尽管在评估、理解和调整 LLM 价值观方面付出了越来越多的努力，但基于心理学的 LLM 价值体系仍未得到充分探索。本研究通过引入生成心理词汇方法 (GPLA) 来解决这一差距，这是一种可扩展、适应性强且理论丰富的构建价值体系的方法。利用 GPLA，我们提出了一种针对 LLM 量身定制的基于心理学的五因素价值体系。为了进行系统验证，我们提出了三个基准测试任务，将心理学原理与尖端 AI 优先事项相结合。我们的结果表明，与典型的施瓦茨价值观相比，所提出的价值体系符合标准心理标准，更好地捕捉了 LLM 价值观，提高了 LLM 安全预测，并增强了 LLM 一致性。

Title: Beyond English: Evaluating Automated Measurement of Moral Foundations in Non-English Discourse with a Chinese Case Study

Authors: Calvin Yixiang Cheng, Scott A Hale
Subjects: cs.CL, cs.SI
Abstract URL: https://arxiv.org/abs/2502.02451
Pdf URL: https://arxiv.org/pdf/2502.02451
Copy Paste: [[2502.02451]] Beyond English: Evaluating Automated Measurement of Moral Foundations in Non-English Discourse with a Chinese Case Study(https://arxiv.org/abs/2502.02451)
Keywords: language model, llm
Abstract: This study explores computational approaches for measuring moral foundations (MFs) in non-English corpora. Since most resources are developed primarily for English, cross-linguistic applications of moral foundation theory remain limited. Using Chinese as a case study, this paper evaluates the effectiveness of applying English resources to machine translated text, local language lexicons, multilingual language models, and large language models (LLMs) in measuring MFs in non-English texts. The results indicate that machine translation and local lexicon approaches are insufficient for complex moral assessments, frequently resulting in a substantial loss of cultural information. In contrast, multilingual models and LLMs demonstrate reliable cross-language performance with transfer learning, with LLMs excelling in terms of data efficiency. Importantly, this study also underscores the need for human-in-the-loop validation of automated MF assessment, as the most advanced models may overlook cultural nuances in cross-language measurements. The findings highlight the potential of LLMs for cross-language MF measurements and other complex multilingual deductive coding tasks.
摘要：本研究探索了用于测量非英语语料库中的道德基础 (MF) 的计算方法。由于大多数资源主要针对英语开发，道德基础理论的跨语言应用仍然有限。本文以中文为例，评估了将英语资源应用于机器翻译文本、本地语言词典、多语言语言模型和大型语言模型 (LLM) 在测量非英语文本中的 MF 方面的有效性。结果表明，机器翻译和本地词典方法不足以进行复杂的道德评估，经常导致大量文化信息的丢失。相比之下，多语言模型和 LLM 通过迁移学习表现出可靠的跨语言性能，其中 LLM 在数据效率方面表现出色。重要的是，这项研究还强调了对自动 MF 评估进行人机验证的必要性，因为最先进的模型可能会忽略跨语言测量中的文化细微差别。研究结果凸显了 LLM 在跨语言 MF 测量和其他复杂的多语言演绎编码任务中的潜力。

Title: SAISA: Towards Multimodal Large Language Models with Both Training and Inference Efficiency

Authors: Qianhao Yuan, Yanjiang Liu, Yaojie Lu, Hongyu Lin, Ben He, Xianpei Han, Le Sun
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2502.02458
Pdf URL: https://arxiv.org/pdf/2502.02458
Copy Paste: [[2502.02458]] SAISA: Towards Multimodal Large Language Models with Both Training and Inference Efficiency(https://arxiv.org/abs/2502.02458)
Keywords: language model, llm
Abstract: Multimodal Large Language Models (MLLMs) mainly fall into two architectures, each involving a trade-off between training and inference efficiency: embedding space alignment (e.g., LLaVA-1.5) is inefficient during inference, while cross-attention space alignment (e.g., Flamingo) is inefficient in training. In this paper, we compare these two architectures and identify the key factors for building efficient MLLMs. A primary difference between them lies in how attention is applied to visual tokens, particularly in their interactions with each other. To investigate whether attention among visual tokens is necessary, we propose a new self-attention mechanism, NAAViT (No Attention Among Visual Tokens), which eliminates this type of attention. Our pilot experiment on LLaVA-1.5 shows that attention among visual tokens is highly redundant. Based on these insights, we introduce SAISA (Self-Attention Input Space Alignment), a novel architecture that enhances both training and inference efficiency. SAISA directly aligns visual features with the input spaces of NAAViT self-attention blocks, reducing computational overhead in both self-attention blocks and feed-forward networks (FFNs). Using the same configuration as LLaVA-1.5, SAISA reduces inference FLOPs by 66% and training budget by 26%, while achieving superior performance in terms of accuracy. Comprehensive ablation studies further validate the effectiveness of SAISA across various LLMs and visual encoders. The code and model will be publicly available at this https URL.
摘要：多模态大型语言模型 (MLLM) 主要分为两种架构，每种架构都需要在训练和推理效率之间进行权衡：嵌入空间对齐（例如 LLaVA-1.5）在推理过程中效率低下，而跨注意空间对齐（例如 Flamingo）在训练中效率低下。在本文中，我们比较了这两种架构，并确定了构建高效 MLLM 的关键因素。它们之间的主要区别在于如何将注意力应用于视觉标记，特别是在它们之间的交互中。为了研究视觉标记之间的注意力是否必要，我们提出了一种新的自注意力机制 NAAViT（No Attention Among Visual Tokens），它消除了这种类型的注意力。我们在 LLaVA-1.5 上进行的初步实验表明，视觉标记之间的注意力高度冗余。基于这些见解，我们引入了 SAISA（Self-Attention Input Space Alignment），这是一种可同时提高训练和推理效率的新型架构。SAISA 将视觉特征直接与 NAAViT 自注意力块的输入空间对齐，从而减少了自注意力块和前馈网络 (FFN) 中的计算开销。使用与 LLaVA-1.5 相同的配置，SAISA 将推理 FLOP 减少了 66%，训练预算减少了 26%，同时在准确性方面实现了卓越的性能。全面的消融研究进一步验证了 SAISA 在各种 LLM 和视觉编码器中的有效性。代码和模型将在此 https URL 上公开提供。

Title: Multilingual Machine Translation with Open Large Language Models at Practical Scale: An Empirical Study

Authors: Menglong Cui, Pengzhi Gao, Wei Liu, Jian Luan, Bin Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.02481
Pdf URL: https://arxiv.org/pdf/2502.02481
Copy Paste: [[2502.02481]] Multilingual Machine Translation with Open Large Language Models at Practical Scale: An Empirical Study(https://arxiv.org/abs/2502.02481)
Keywords: language model, gpt, llm
Abstract: Large language models (LLMs) have shown continuously improving multilingual capabilities, and even small-scale open-source models have demonstrated rapid performance enhancement. In this paper, we systematically explore the abilities of open LLMs with fewer than ten billion parameters to handle multilingual machine translation (MT) tasks. We conduct comprehensive evaluations on six popular LLMs and find that models like Gemma2-9B exhibit impressive multilingual translation capabilities. We then introduce the Parallel-First Monolingual-Second (PFMS) data mixing strategy in the continual pretraining stage to further enhance the MT performance and present GemmaX2-28, a 9B model achieving top-tier multilingual translation performance across 28 languages. Specifically, GemmaX2-28 consistently outperforms the state-of-the-art (SOTA) models such as TowerInstruct and XALMA and achieves competitive performance with Google Translate and GPT-4-turbo.
摘要：大型语言模型 (LLM) 表现出不断提升的多语言能力，即使是小规模的开源模型也表现出快速的性能提升。在本文中，我们系统地探索了少于百亿个参数的开放 LLM 处理多语言机器翻译 (MT) 任务的能力。我们对六种流行的 LLM 进行了全面的评估，发现像 Gemma2-9B 这样的模型表现出了令人印象深刻的多语言翻译能力。然后，我们在持续预训练阶段引入了并行第一单语第二 (PFMS) 数据混合策略来进一步增强 MT 性能，并提出了 GemmaX2-28，这是一个在 28 种语言中实现顶级多语言翻译性能的 9B 模型。具体而言，GemmaX2-28 始终优于 TowerInstruct 和 XALMA 等最先进 (SOTA) 模型，并与谷歌翻译和 GPT-4-turbo 实现了竞争性的性能。

Title: Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search

Authors: Maohao Shen, Guangtao Zeng, Zhenting Qi, Zhang-Wei Hong, Zhenfang Chen, Wei Lu, Gregory Wornell, Subhro Das, David Cox, Chuang Gan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.02508
Pdf URL: https://arxiv.org/pdf/2502.02508
Copy Paste: [[2502.02508]] Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search(https://arxiv.org/abs/2502.02508)
Keywords: language model, llm
Abstract: Large language models (LLMs) have demonstrated remarkable reasoning capabilities across diverse domains. Recent studies have shown that increasing test-time computation enhances LLMs' reasoning capabilities. This typically involves extensive sampling at inference time guided by an external LLM verifier, resulting in a two-player system. Despite external guidance, the effectiveness of this system demonstrates the potential of a single LLM to tackle complex tasks. Thus, we pose a new research problem: Can we internalize the searching capabilities to fundamentally enhance the reasoning abilities of a single LLM? This work explores an orthogonal direction focusing on post-training LLMs for autoregressive searching (i.e., an extended reasoning process with self-reflection and self-exploration of new strategies). To achieve this, we propose the Chain-of-Action-Thought (COAT) reasoning and a two-stage training paradigm: 1) a small-scale format tuning stage to internalize the COAT reasoning format and 2) a large-scale self-improvement stage leveraging reinforcement learning. Our approach results in Satori, a 7B LLM trained on open-source models and data. Extensive empirical evaluations demonstrate that Satori achieves state-of-the-art performance on mathematical reasoning benchmarks while exhibiting strong generalization to out-of-domain tasks. Code, data, and models will be fully open-sourced.
摘要：大型语言模型 (LLM) 已在不同领域展现出卓越的推理能力。最近的研究表明，增加测试时间计算可增强 LLM 的推理能力。这通常涉及在外部 LLM 验证器指导下在推理时进行大量采样，从而形成双人系统。尽管有外部指导，但该系统的有效性证明了单个 LLM 解决复杂任务的潜力。因此，我们提出了一个新的研究问题：我们能否内化搜索能力，从根本上增强单个 LLM 的推理能力？这项工作探索了一个正交方向，重点关注训练后 LLM 的自回归搜索（即具有自我反思和自我探索新策略的扩展推理过程）。为了实现这一目标，我们提出了行动-思想链 (COAT) 推理和一个两阶段训练范式：1) 小规模格式调整阶段，以内化 COAT 推理格式；2) 利用强化学习的大规模自我改进阶段。我们的方法产生了 Satori，一个基于开源模型和数据进行训练的 7B LLM。大量的实证评估表明，Satori 在数学推理基准上实现了最先进的性能，同时在领域外的任务中表现出很强的泛化能力。代码、数据和模型将完全开源。

Title: Adaptive Self-improvement LLM Agentic System for ML Library Development

Authors: Genghan Zhang, Weixin Liang, Olivia Hsu, Kunle Olukotun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.02534
Pdf URL: https://arxiv.org/pdf/2502.02534
Copy Paste: [[2502.02534]] Adaptive Self-improvement LLM Agentic System for ML Library Development(https://arxiv.org/abs/2502.02534)
Keywords: language model, llm, agent
Abstract: ML libraries, often written in architecture-specific programming languages (ASPLs) that target domain-specific architectures, are key to efficient ML systems. However, writing these high-performance ML libraries is challenging because it requires expert knowledge of ML algorithms and the ASPL. Large language models (LLMs), on the other hand, have shown general coding capabilities. However, challenges remain when using LLMs for generating ML libraries using ASPLs because 1) this task is complicated even for experienced human programmers and 2) there are limited code examples because of the esoteric and evolving nature of ASPLs. Therefore, LLMs need complex reasoning with limited data in order to complete this task. To address these challenges, we introduce an adaptive self-improvement agentic system. In order to evaluate the effectiveness of our system, we construct a benchmark of a typical ML library and generate ASPL code with both open and closed-source LLMs on this benchmark. Our results show improvements of up to 3.9× over a baseline single LLM.
摘要：机器学习库通常以针对特定领域架构的架构特定编程语言 (ASPL) 编写，是高效机器学习系统的关键。然而，编写这些高性能机器学习库具有挑战性，因为它需要机器学习算法和 ASPL 的专业知识。另一方面，大型语言模型 (LLM) 已显示出通用的编码能力。然而，使用 LLM 生成使用 ASPL 的机器学习库时仍然存在挑战，因为 1) 即使对于经验丰富的人类程序员来说，这项任务也很复杂，2) 由于 ASPL 的深奥和不断发展，代码示例有限。因此，LLM 需要使用有限的数据进行复杂的推理才能完成此任务。为了应对这些挑战，我们引入了一个自适应的自我改进代理系统。为了评估我们系统的有效性，我们构建了一个典型机器学习库的基准，并在此基准上使用开源和闭源 LLM 生成 ASPL 代码。我们的结果显示，与基线单个 LLM 相比，改进高达 3.9×。

Title: Are Language Models Up to Sequential Optimization Problems? From Evaluation to a Hegelian-Inspired Enhancement

Authors: Soheil Abbasloo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.02573
Pdf URL: https://arxiv.org/pdf/2502.02573
Copy Paste: [[2502.02573]] Are Language Models Up to Sequential Optimization Problems? From Evaluation to a Hegelian-Inspired Enhancement(https://arxiv.org/abs/2502.02573)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities across numerous fields, presenting an opportunity to revolutionize optimization problem-solving, a crucial, ubiquitous, and complex domain. This paper explores the proficiency of LLMs in handling Sequential Optimization Problems (SOPs). We introduce WorldGen, a dynamic framework for generating unseen SOPs with controllable complexities, to evaluate LLM performance. Our initial observations reveal that while LLMs perform well on simple SOPs, their performance significantly degrades with increased complexity. Motivated by this, we revisit philosophical hypotheses on reasoning to enhance LLM performance. Inspired by the influential framework of Hegelian Dialectics, we propose ACE, demonstrating how the performance of LLMs in SOP contexts can be significantly improved without any retraining or further fine-tuning.
摘要：大型语言模型 (LLM) 已在众多领域展现出令人印象深刻的能力，为彻底改变优化问题解决这一至关重要、无处不在且复杂的领域提供了机会。本文探讨了 LLM 在处理顺序优化问题 (SOP) 方面的能力。我们引入了 WorldGen，这是一个用于生成复杂度可控的未见 SOP 的动态框架，用于评估 LLM 的性能。我们的初步观察表明，虽然 LLM 在简单的 SOP 上表现良好，但随着复杂度的增加，其性能会显著下降。受此启发，我们重新审视了关于推理的哲学假设，以提高 LLM 的性能。受黑格尔辩证法的有影响力的框架的启发，我们提出了 ACE，展示了如何在无需任何再训练或进一步微调的情况下显著提高 LLM 在 SOP 环境中的性能。

Title: A comparison of translation performance between DeepL and Supertext

Authors: Alex Flückiger, Chantal Amrhein, Tim Graf, Philippe Schläpfer, Florian Schottmann, Samuel Läubli
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.02577
Pdf URL: https://arxiv.org/pdf/2502.02577
Copy Paste: [[2502.02577]] A comparison of translation performance between DeepL and Supertext(https://arxiv.org/abs/2502.02577)
Keywords: language model, llm
Abstract: As strong machine translation (MT) systems are increasingly based on large language models (LLMs), reliable quality benchmarking requires methods that capture their ability to leverage extended context. This study compares two commercial MT systems -- DeepL and Supertext -- by assessing their performance on unsegmented texts. We evaluate translation quality across four language directions with professional translators assessing segments with full document-level context. While segment-level assessments indicate no strong preference between the systems in most cases, document-level analysis reveals a preference for Supertext in three out of four language directions, suggesting superior consistency across longer texts. We advocate for more context-sensitive evaluation methodologies to ensure that MT quality assessments reflect real-world usability. We release all evaluation data and scripts for further analysis and reproduction at this https URL.
摘要：由于强大的机器翻译 (MT) 系统越来越多地基于大型语言模型 (LLM)，可靠的质量基准测试需要能够捕捉其利用扩展上下文的能力的方法。本研究通过评估两种商业机器翻译系统 DeepL 和 Supertext 在未分段文本上的表现，对它们进行了比较。我们通过四个语言方向的翻译质量进行评估，由专业翻译人员使用完整的文档级上下文评估片段。虽然片段级评估表明在大多数情况下系统之间没有强烈的偏好，但文档级分析表明在四个语言方向中的三个方向上都偏向 Supertext，这表明在较长的文本中具有出色的一致性。我们提倡更多对上下文敏感的评估方法，以确保机器翻译质量评估反映现实世界的可用性。我们在此 https URL 上发布所有评估数据和脚本以供进一步分析和重现。