2025-01-30

Title: Tuning LLM Judges Hyperparameters

Authors: David Salinas, Omar Swelam, Frank Hutter
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2501.17178
Pdf URL: https://arxiv.org/pdf/2501.17178
Copy Paste: [[2501.17178]] Tuning LLM Judges Hyperparameters(https://arxiv.org/abs/2501.17178)
Keywords: language model, llm, prompt
Abstract: Evaluating Large Language Models (LLMs) often requires costly human annotations. To address this, LLM-based judges have been proposed, which compare the outputs of two LLMs, enabling the ranking of models without human intervention. While several approaches have been proposed, many confounding factors are present between different papers. For instance, the model, the prompt, and other hyperparameters are typically changed at the same time, making apples-to-apples comparisons challenging. In this paper, we propose to systematically analyze and tune the hyperparameters of LLM judges. To alleviate the high cost of evaluating a judge, we propose to leverage multi-objective multi-fidelity optimization, which makes it possible to find judges that trade accuracy for cost and also significantly reduces the cost of the search. Our method identifies judges that not only outperform existing benchmarks in accuracy and cost-efficiency but also utilize open-weight models, ensuring greater accessibility and reproducibility.
摘要：评估大型语言模型 (LLM) 通常需要昂贵的人工注释。为了解决这个问题，提出了基于 LLM 的评判者，它比较两个 LLM 的输出，从而可以在没有人工干预的情况下对模型进行排名。虽然已经提出了几种方法，但不同论文之间存在许多混杂因素。例如，模型、提示和其他超参数通常同时更改，使得苹果与苹果的比较具有挑战性。在本文中，我们建议系统地分析和调整 LLM 评判者的超参数。为了减轻评估评判者的高成本，我们建议利用多目标多保真度，这可以找到以准确性换取成本的评判者，并显着降低搜索成本。我们的方法识别的评判者不仅在准确性和成本效率方面优于现有基准，而且还利用开放权重模型，确保更高的可访问性和可重复性。
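
Note: the accuracy-versus-cost trade-off described in the abstract above can be illustrated with a Pareto-front filter. This is only a minimal sketch of the multi-objective part, not the paper's search procedure, and it does not cover the multi-fidelity aspect. The function name `pareto_front`, the judge names, and all numbers are hypothetical.

```python
from typing import List, Tuple

# (name, accuracy, cost): higher accuracy is better, lower cost is better.
Judge = Tuple[str, float, float]


def pareto_front(judges: List[Judge]) -> List[str]:
    """Return the names of judges that no other judge dominates."""
    front = []
    for name, acc, cost in judges:
        # A judge is dominated if another one is at least as good on both
        # objectives and strictly better on at least one of them.
        dominated = any(
            (a >= acc and c <= cost) and (a > acc or c < cost)
            for _, a, c in judges
        )
        if not dominated:
            front.append(name)
    return front


# Hypothetical judge configurations, for illustration only.
judges = [
    ("judge_a", 0.80, 10.0),
    ("judge_b", 0.75, 2.0),
    ("judge_c", 0.70, 5.0),  # worse and more expensive than judge_b
]
print(pareto_front(judges))
```

Here `judge_c` is dropped because `judge_b` is both more accurate and cheaper, while `judge_a` and `judge_b` remain as two different accuracy/cost trade-offs.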

Title: LLM Evaluation Based on Aerospace Manufacturing Expertise: Automated Generation and Multi-Model Question Answering

Authors: Beiming Liu, Zhizhuo Cui, Siteng Hu, Xiaohua Li, Haifeng Lin, Zhengxin Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.17183
Pdf URL: https://arxiv.org/pdf/2501.17183
Copy Paste: [[2501.17183]] LLM Evaluation Based on Aerospace Manufacturing Expertise: Automated Generation and Multi-Model Question Answering(https://arxiv.org/abs/2501.17183)
Keywords: language model, gpt, llm, hallucination
Abstract: Aerospace manufacturing demands exceptionally high precision in technical parameters. The remarkable performance of Large Language Models (LLMs), such as GPT-4 and QWen, in Natural Language Processing has sparked industry interest in their application to tasks including process design, material selection, and tool information retrieval. However, LLMs are prone to generating "hallucinations" in specialized domains, producing inaccurate or false information that poses significant risks to the quality of aerospace products and flight safety. This paper introduces a set of evaluation metrics tailored for LLMs in aerospace manufacturing, aiming to assess their accuracy by analyzing their performance in answering questions grounded in professional knowledge. Firstly, key information is extracted through in-depth textual analysis of classic aerospace manufacturing textbooks and guidelines. Subsequently, utilizing LLM generation techniques, we meticulously construct multiple-choice questions with multiple correct answers of varying difficulty. Following this, different LLM models are employed to answer these questions, and their accuracy is recorded. Experimental results demonstrate that the capabilities of LLMs in aerospace professional knowledge are in urgent need of improvement. This study provides a theoretical foundation and practical guidance for the application of LLMs in aerospace manufacturing, addressing a critical gap in the field.
摘要：航空航天制造对技术参数的精度要求极高。GPT-4 和 QWen 等大型语言模型 (LLM) 在自然语言处理中的出色表现引发了业界对其在工艺设计、材料选择和工具信息检索等任务中的应用兴趣。然而，LLM 容易在专业领域产生“幻觉”，产生不准确或虚假的信息，对航空航天产品质量和飞行安全构成重大风险。本文介绍了一套针对航空航天制造领域 LLM 量身定制的评估指标，旨在通过分析其在回答基于专业知识的问题方面的表现来评估其准确性。首先，通过对经典航空航天制造教科书和指南的深入文本分析来提取关键信息。随后，利用 LLM 生成技术，我们精心构造了难度各异、有多个正确答案的多项选择题。之后，使用不同的 LLM 模型来回答这些问题，并记录它们的准确性。实验结果表明 LLM 在航空航天专业知识方面的能力亟待提升。本研究为 LLM 在航空航天制造业的应用提供了理论基础和实践指导，填补了该领域的一大空白。

Title: Visualizing Uncertainty in Translation Tasks: An Evaluation of LLM Performance and Confidence Metrics

Authors: Jin Hyun Park, Utsawb Laminchhane, Umer Farooq, Uma Sivakumar, Arpan Kumar
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2501.17187
Pdf URL: https://arxiv.org/pdf/2501.17187
Copy Paste: [[2501.17187]] Visualizing Uncertainty in Translation Tasks: An Evaluation of LLM Performance and Confidence Metrics(https://arxiv.org/abs/2501.17187)
Keywords: language model, llm
Abstract: Large language models (LLMs) are increasingly utilized for machine translation, yet their predictions often exhibit uncertainties that hinder interpretability and user trust. Effectively visualizing these uncertainties can enhance the usability of LLM outputs, particularly in contexts where translation accuracy is critical. This paper addresses two primary objectives: (1) providing users with token-level insights into model confidence and (2) developing a web-based visualization tool to quantify and represent translation uncertainties. To achieve these goals, we utilized the T5 model with the WMT19 dataset for translation tasks and evaluated translation quality using established metrics such as BLEU, METEOR, and ROUGE. We introduced three novel uncertainty quantification (UQ) metrics: (1) the geometric mean of token probabilities, (2) the arithmetic mean of token probabilities, and (3) the arithmetic mean of the kurtosis of token distributions. These metrics provide a simple yet effective framework for evaluating translation performance. Our analysis revealed a linear relationship between the traditional evaluation metrics and our UQ metrics, demonstrating the validity of our approach. Additionally, we developed an interactive web-based visualization that uses a color gradient to represent token confidence. This tool offers users a clear and intuitive understanding of translation quality while providing valuable insights into model performance. Overall, we show that our UQ metrics and visualization are both robust and interpretable, offering practical tools for evaluating and accessing machine translation systems.
摘要：大型语言模型 (LLM) 越来越多地用于机器翻译，但它们的预测通常表现出不确定性，从而阻碍了可解释性和用户信任。有效地可视化这些不确定性可以增强 LLM 输出的可用性，特别是在翻译准确性至关重要的情况下。本文解决了两个主要目标：(1) 为用户提供对模型置信度的 token 级洞察；(2) 开发基于 Web 的可视化工具来量化和表示翻译不确定性。为了实现这些目标，我们将 T5 模型与 WMT19 数据集结合用于翻译任务，并使用 BLEU、METEOR 和 ROUGE 等既定指标评估翻译质量。我们引入了三个新的不确定性量化 (UQ) 指标：(1) token 概率的几何平均值、(2) token 概率的算术平均值和 (3) token 分布峰度的算术平均值。这些指标提供了一个简单而有效的框架来评估翻译性能。我们的分析揭示了传统评估指标与我们的 UQ 指标之间的线性关系，证明了我们方法的有效性。此外，我们还开发了一种基于 Web 的交互式可视化工具，使用颜色渐变来表示标记置信度。此工具可让用户清晰直观地了解翻译质量，同时提供有关模型性能的宝贵见解。总体而言，我们表明我们的 UQ 指标和可视化工具既稳健又可解释，为评估和访问机器翻译系统提供了实用工具。
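
Note: the three uncertainty metrics named in the abstract above can be sketched in a few lines. This is a rough illustration, not the authors' implementation. In particular, the abstract does not define "kurtosis of token distributions"; the sketch assumes it means the non-excess (Pearson) kurtosis of the probability vector over the vocabulary at each decoding step. All function names and inputs are hypothetical.

```python
import math
from typing import List


def arithmetic_mean(values: List[float]) -> float:
    return sum(values) / len(values)


def geometric_mean(probs: List[float]) -> float:
    """Geometric mean of the chosen-token probabilities (all must be > 0)."""
    return math.exp(sum(math.log(p) for p in probs) / len(probs))


def kurtosis(values: List[float]) -> float:
    """Pearson kurtosis m4 / m2**2; undefined when all values are equal."""
    mean = arithmetic_mean(values)
    m2 = arithmetic_mean([(v - mean) ** 2 for v in values])
    m4 = arithmetic_mean([(v - mean) ** 4 for v in values])
    if m2 == 0:
        raise ValueError("kurtosis is undefined for a constant vector")
    return m4 / (m2 ** 2)


def mean_kurtosis(step_distributions: List[List[float]]) -> float:
    """Average the per-step kurtosis over all decoding steps (assumed reading)."""
    return arithmetic_mean([kurtosis(dist) for dist in step_distributions])


# Hypothetical probabilities of the tokens the model actually emitted.
token_probs = [0.25, 1.0]
print(geometric_mean(token_probs), arithmetic_mean(token_probs))
```

The geometric mean penalizes a single low-confidence token more strongly than the arithmetic mean does, which is one reason to report both.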

Title: A Comprehensive Study on Fine-Tuning Large Language Models for Medical Question Answering Using Classification Models and Comparative Analysis

Authors: Aysegul Ucar, Soumik Nayak, Anunak Roy, Burak Taşcı, Gülay Taşcı
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.17190
Pdf URL: https://arxiv.org/pdf/2501.17190
Copy Paste: [[2501.17190]] A Comprehensive Study on Fine-Tuning Large Language Models for Medical Question Answering Using Classification Models and Comparative Analysis(https://arxiv.org/abs/2501.17190)
Keywords: language model, llm
Abstract: This paper presents an overview of the development and fine-tuning of large language models (LLMs) designed specifically for answering medical questions. We mainly aim to improve the accuracy and efficiency of providing reliable answers to medical queries. Our approach has two stages: predicting a specific label for the received medical question, and then providing a predefined answer for this label. Various models such as RoBERTa and BERT were examined and evaluated based on their ability to perform this task. The models are trained using datasets derived from 6,800 samples that were scraped from Healthline.com, with additional synthetic data. For evaluation, we conducted a comparative study using 5-fold cross-validation. To assess performance, we used metrics such as accuracy, precision, recall, and F1 score, and also recorded the training time. The LoRA Roberta-large model achieved an accuracy of 78.47%, precision of 72.91%, recall of 76.95%, and an F1 score of 73.56%. The Roberta-base model demonstrated high performance with an accuracy of 99.87%, precision of 99.81%, recall of 99.86%, and an F1 score of 99.82%. The Bert Uncased model showed strong results with an accuracy of 95.85%, precision of 94.42%, recall of 95.58%, and an F1 score of 94.72%. Lastly, the Bert Large Uncased model achieved the highest performance, with an accuracy, precision, recall, and F1 score of 100%. The results indicate the capability of the models to classify medical questions and generate accurate answers, supporting the development of improved health-related AI solutions.
摘要：本文概述了专为回答医学问题而设计的大型语言模型 (LLM) 的开发和微调。我们主要提高提供可靠医学查询答案的准确性和效率。在我们的方法中，我们有两个阶段，预测收到的医学问题的特定标签，然后为该标签提供预定义的答案。我们检查并评估了 RoBERTa 和 BERT 等各种模型的能力。这些模型使用从 Healthline.com 抓取的 6,800 个样本的数据集进行训练，并附加了合成数据。为了进行评估，我们使用 5 倍交叉验证进行了比较研究。为了评估性能，我们使用了准确度、精确度、召回率和 F1 分数等指标，并记录了训练时间。LoRA Roberta-large 模型的准确率为 78.47%，精确度为 72.91%，召回率为 76.95%，F1 分数为 73.56%。Roberta-base 模型表现出色，准确率为 99.87%，精确率为 99.81%，召回率为 99.86%，F1 得分为 99.82%。Bert Uncased 模型表现出色，准确率为 95.85%，精确率为 94.42%，召回率为 95.58%，F1 得分为 94.72%。最后，Bert Large Uncased 模型取得了最高性能，准确率、精确率、召回率和 F1 得分均为 100%。所获得的结果有助于表明模型在对医疗问题进行分类和生成准确答案方面的能力，有助于制定更好的健康相关 AI 解决方案。

Title: Aspect-Aware Decomposition for Opinion Summarization

Authors: Miao Li, Jey Han Lau, Eduard Hovy, Mirella Lapata
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2501.17191
Pdf URL: https://arxiv.org/pdf/2501.17191
Copy Paste: [[2501.17191]] Aspect-Aware Decomposition for Opinion Summarization(https://arxiv.org/abs/2501.17191)
Keywords: prompt
Abstract: Opinion summarization plays a key role in deriving meaningful insights from large-scale online reviews. To make this process more explainable and grounded, we propose a modular approach guided by review aspects which separates the tasks of aspect identification, opinion consolidation, and meta-review synthesis, enabling greater transparency and ease of inspection. We conduct extensive experiments across datasets representing scientific research, business, and product domains. Results show that our method generates more grounded summaries compared to strong baseline models, as verified through automated and human evaluations. Additionally, our modular approach, which incorporates reasoning based on review aspects, produces more informative intermediate outputs than knowledge-agnostic decomposed prompting. These intermediate outputs can also effectively support humans in summarizing opinions from large volumes of reviews.
摘要：观点总结在从大规模在线评论中获取有意义的见解方面起着关键作用。为了使这个过程更易于解释和扎实，我们提出了一种由评论方面指导的模块化方法，将方面识别、观点整合和元评论综合的任务分开，从而提高透明度和检查的简易性。我们在代表科学研究、商业和产品领域的数据集上进行了广泛的实验。结果表明，与强大的基线模型相比，我们的方法可以生成更有根据的总结，这已通过自动和人工评估得到验证。此外，我们的模块化方法结合了基于评论方面的推理，比知识无关的分解提示产生了更具信息量的中间输出。这些中间输出还可以有效地支持人类从大量评论中总结观点。

Title: Atla Selene Mini: A General Purpose Evaluation Model

Authors: Andrei Alexandru, Antonia Calvi, Henry Broomfield, Jackson Golden, Kyle Dai, Mathias Leys, Maurice Burger, Max Bartolo, Roman Engeler, Sashank Pisupati, Toby Drane, Young Sun Park
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.17195
Pdf URL: https://arxiv.org/pdf/2501.17195
Copy Paste: [[2501.17195]] Atla Selene Mini: A General Purpose Evaluation Model(https://arxiv.org/abs/2501.17195)
Keywords: language model, gpt, prompt
Abstract: We introduce Atla Selene Mini, a state-of-the-art small language model-as-a-judge (SLMJ). Selene Mini is a general-purpose evaluator that outperforms the best SLMJs and GPT-4o-mini on overall performance across 11 out-of-distribution benchmarks, spanning absolute scoring, classification, and pairwise preference tasks. It is the highest-scoring 8B generative model on RewardBench, surpassing strong baselines like GPT-4o and specialized judges. To achieve this, we develop a principled data curation strategy that augments public datasets with synthetically generated critiques and ensures high quality through filtering and dataset ablations. We train our model on a combined direct preference optimization (DPO) and supervised fine-tuning (SFT) loss, and produce a highly promptable evaluator that excels in real-world scenarios. Selene Mini shows dramatically improved zero-shot agreement with human expert evaluations on financial and medical industry datasets. It is also robust to variations in prompt format. Preliminary results indicate that Selene Mini is the top-ranking evaluator in a live, community-driven Judge Arena. We release the model weights on HuggingFace (this https URL) and Ollama to encourage widespread community adoption.
摘要：我们推出了 Atla Selene Mini，这是一款最先进的小型语言模型评判器 (SLMJ)。Selene Mini 是一款通用评估器，在 11 个分布外基准测试中的整体表现优于最佳 SLMJ 和 GPT-4o-mini，涵盖绝对评分、分类和成对偏好任务。它是 RewardBench 上得分最高的 8B 生成模型，超越了 GPT-4o 和专业评判器等强大基线。为了实现这一目标，我们开发了一种原则性的数据管理策略，该策略使用合成生成的评论来增强公共数据集，并通过过滤和数据集消融来确保高质量。我们在直接偏好优化 (DPO) 和监督微调 (SFT) 损失的组合上训练我们的模型，并生成一个在现实世界场景中表现出色的高度可提示的评估器。Selene Mini 在金融和医疗行业数据集上与人类专家评估的零样本一致性显着提高。它还能适应提示格式的变化。初步结果表明，Selene Mini 是实时、社区驱动的 Judge Arena 中排名最高的评估者。我们在 HuggingFace（此 https URL）和 Ollama 上发布了模型权重，以鼓励社区广泛采用。
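
Note: the abstract above mentions training on a combined DPO and SFT loss but does not say how the two terms are combined. The sketch below assumes a simple weighted sum, using the standard DPO objective on sequence log-probabilities and a length-normalized negative log-likelihood for the SFT term. The weight `alpha`, the value of `beta`, and all function names are assumptions for illustration, not details from the paper.

```python
import math


def dpo_loss(pol_chosen: float, pol_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair.

    Inputs are sequence log-probabilities under the policy and the frozen
    reference model. Computes -log(sigmoid(margin)) as log(1 + exp(-margin)).
    """
    margin = beta * ((pol_chosen - ref_chosen) - (pol_rejected - ref_rejected))
    return math.log1p(math.exp(-margin))


def sft_loss(pol_chosen: float, num_tokens: int) -> float:
    """Per-token negative log-likelihood of the chosen response."""
    return -pol_chosen / num_tokens


def combined_loss(pol_chosen: float, pol_rejected: float,
                  ref_chosen: float, ref_rejected: float,
                  num_tokens: int, beta: float = 0.1,
                  alpha: float = 1.0) -> float:
    """Assumed combination: DPO term plus alpha times the SFT term."""
    return (dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta)
            + alpha * sft_loss(pol_chosen, num_tokens))


# Hypothetical log-probabilities for a single training pair.
print(combined_loss(-4.0, -8.0, -5.0, -5.0, num_tokens=4))
```

When the policy assigns relatively more probability to the chosen response than the reference model does, the DPO term falls below log 2, its value at zero margin.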

Title: Improving LLM Leaderboards with Psychometrical Methodology

Authors: Denis Federiakin
Subjects: cs.CL, cs.AI, stat.AP
Abstract URL: https://arxiv.org/abs/2501.17200
Pdf URL: https://arxiv.org/pdf/2501.17200
Copy Paste: [[2501.17200]] Improving LLM Leaderboards with Psychometrical Methodology(https://arxiv.org/abs/2501.17200)
Keywords: language model, llm
Abstract: The rapid development of large language models (LLMs) has necessitated the creation of benchmarks to evaluate their performance. These benchmarks resemble human tests and surveys, as they consist of sets of questions designed to measure emergent properties in the cognitive behavior of these systems. However, unlike the well-defined traits and abilities studied in social sciences, the properties measured by these benchmarks are often vaguer and less rigorously defined. The most prominent benchmarks are often grouped into leaderboards for convenience, aggregating performance metrics and enabling comparisons between models. Unfortunately, these leaderboards typically rely on simplistic aggregation methods, such as taking the average score across benchmarks. In this paper, we demonstrate the advantages of applying contemporary psychometric methodologies - originally developed for human tests and surveys - to improve the ranking of large language models on leaderboards. Using data from the Hugging Face Leaderboard as an example, we compare the results of the conventional naive ranking approach with a psychometrically informed ranking. The findings highlight the benefits of adopting psychometric techniques for more robust and meaningful evaluation of LLM performance.
摘要：大型语言模型 (LLM) 的快速发展使得创建基准来评估其性能成为必要。这些基准类似于人工测试和调查，因为它们由一系列问题组成，旨在衡量这些系统认知行为中出现的属性。然而，与社会科学中研究的明确特征和能力不同，这些基准所衡量的属性通常更模糊，定义也不太严格。最突出的基准通常被分组到排行榜中，以方便汇总性能指标并实现模型之间的比较。不幸的是，这些排行榜通常依赖于简单的聚合方法，例如取基准的平均分数。在本文中，我们展示了应用当代心理测量方法（最初是为人工测试和调查而开发的）来提高大型语言模型在排行榜上的排名的优势。以 Hugging Face 排行榜的数据为例，我们将传统的朴素排名方法的结果与心理测量学排名进行了比较。研究结果强调了采用心理测量技术对 LLM 性能进行更稳健和更有意义的评估的好处。

Title: NUS-Emo at SemEval-2024 Task 3: Instruction-Tuning LLM for Multimodal Emotion-Cause Analysis in Conversations

Authors: Meng Luo, Han Zhang, Shengqiong Wu, Bobo Li, Hong Han, Hao Fei
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.17261
Pdf URL: https://arxiv.org/pdf/2501.17261
Copy Paste: [[2501.17261]] NUS-Emo at SemEval-2024 Task 3: Instruction-Tuning LLM for Multimodal Emotion-Cause Analysis in Conversations(https://arxiv.org/abs/2501.17261)
Keywords: language model, llm
Abstract: This paper describes the architecture of our system developed for Task 3 of SemEval-2024: Multimodal Emotion-Cause Analysis in Conversations. Our project targets the challenges of subtask 2, dedicated to Multimodal Emotion-Cause Pair Extraction with Emotion Category (MECPE-Cat), and constructs a dual-component system tailored to the unique challenges of this task. We divide the task into two subtasks: emotion recognition in conversation (ERC) and emotion-cause pair extraction (ECPE). To address these subtasks, we capitalize on the abilities of Large Language Models (LLMs), which have consistently demonstrated state-of-the-art performance across various natural language processing tasks and domains. Most importantly, we design an approach of emotion-cause-aware instruction-tuning for LLMs, to enhance the perception of the emotions with their corresponding causal rationales. Our method enables us to adeptly navigate the complexities of MECPE-Cat, achieving a weighted-average F1 score of 34.71% on the task and securing 2nd rank on the leaderboard. The code and metadata to reproduce our experiments are all made publicly available.
摘要：本文介绍了我们为 SemEval-2024 任务 3 开发的系统的架构：对话中的多模态情绪原因分析。我们的项目针对子任务 2 的挑战，致力于使用情绪类别进行多模态情绪原因对提取 (MECPE-Cat)，并构建了一个针对此任务独特挑战的双组件系统。我们将任务分为两个子任务：对话中的情绪识别 (ERC) 和情绪原因对提取 (ECPE)。为了解决这些子任务，我们利用大型语言模型 (LLM) 的能力，这些模型在各种自然语言处理任务和领域中始终表现出最先进的性能。最重要的是，我们为 LLM 设计了一种情绪原因感知指令调整方法，以增强对情绪及其相应因果原理的感知。我们的方法使我们能够熟练地应对 MECPE-Cat 的复杂性，在任务中取得了 34.71% 的加权平均 F1 分数，并在排行榜上名列第二。重现我们实验的代码和元数据均已公开。
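
Note: the weighted-average F1 score reported in the abstract above weights each class's F1 by its share of the gold labels. The sketch below shows this for plain single-label classification. It is a generic illustration and does not reproduce the pair-level scoring used by the SemEval task. The function name and the example labels are hypothetical.

```python
from collections import Counter
from typing import List


def weighted_f1(y_true: List[str], y_pred: List[str]) -> float:
    """Support-weighted average of per-class F1 scores."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / n
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        # Each class contributes in proportion to its number of gold examples.
        score += f1 * n / total
    return score


# Hypothetical emotion labels, for illustration only.
gold = ["joy", "joy", "anger", "anger"]
pred = ["joy", "anger", "anger", "anger"]
print(weighted_f1(gold, pred))
```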

Title: Tailored Truths: Optimizing LLM Persuasion with Personalization and Fabricated Statistics

Authors: Jasper Timm, Chetan Talele, Jacob Haimes
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.17273
Pdf URL: https://arxiv.org/pdf/2501.17273
Copy Paste: [[2501.17273]] Tailored Truths: Optimizing LLM Persuasion with Personalization and Fabricated Statistics(https://arxiv.org/abs/2501.17273)
Keywords: language model, gpt, llm
Abstract: Large Language Models (LLMs) are becoming increasingly persuasive, demonstrating the ability to personalize arguments in conversation with humans by leveraging their personal data. This may have serious impacts on the scale and effectiveness of disinformation campaigns. We studied the persuasiveness of LLMs in a debate setting by having humans $(n=33)$ engage with LLM-generated arguments intended to change the human's opinion. We quantified the LLM's effect by measuring human agreement with the debate's hypothesis pre- and post-debate and analyzing both the magnitude of opinion change, as well as the likelihood of an update in the LLM's direction. We compare persuasiveness across established persuasion strategies, including personalized arguments informed by user demographics and personality, appeal to fabricated statistics, and a mixed strategy utilizing both personalized arguments and fabricated statistics. We found that static arguments generated by humans and GPT-4o-mini have comparable persuasive power. However, the LLM outperformed static human-written arguments when leveraging the mixed strategy in an interactive debate setting. This approach had a $\mathbf{51\%}$ chance of persuading participants to modify their initial position, compared to $\mathbf{32\%}$ for the static human-written arguments. Our results highlight the concerning potential for LLMs to enable inexpensive and persuasive large-scale disinformation campaigns.
摘要：大型语言模型 (LLM) 的说服力越来越强，展示了利用个人数据在与人类对话中个性化论点的能力。这可能会对虚假宣传活动的规模和有效性产生严重影响。我们研究了 LLM 在辩论环境中的说服力，方法是让人类（n=33）参与 LLM 生成的旨在改变人类观点的论点。我们通过测量人类在辩论前后对辩论假设的认同程度，并分析意见变化的幅度以及 LLM 方向更新的可能性来量化 LLM 的效果。我们比较了既定说服策略的说服力，包括根据用户人口统计和个性形成的个性化论点、诉诸虚构的统计数据以及同时利用个性化论点和虚构统计数据的混合策略。我们发现人类和 GPT-4o-mini 生成的静态论点具有相当的说服力。然而，在交互式辩论环境中利用混合策略时，LLM 的表现优于静态的人类撰写的论点。这种方法有 $\mathbf{51\%}$ 的几率说服参与者修改其初始立场，而静态人为论证的几率则为 $\mathbf{32\%}$。我们的结果凸显了 LLM 实现廉价且有说服力的大规模虚假宣传活动的潜力。

Title: Mitigating Hallucinated Translations in Large Language Models with Hallucination-focused Preference Optimization

Authors: Zilu Tang, Rajen Chatterjee, Sarthak Garg
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2501.17295
Pdf URL: https://arxiv.org/pdf/2501.17295
Copy Paste: [[2501.17295]] Mitigating Hallucinated Translations in Large Language Models with Hallucination-focused Preference Optimization(https://arxiv.org/abs/2501.17295)
Keywords: language model, llm, hallucination
Abstract: Machine Translation (MT) is undergoing a paradigm shift, with systems based on fine-tuned large language models (LLMs) becoming increasingly competitive with traditional encoder-decoder models trained specifically for translation tasks. However, LLM-based systems are at a higher risk of generating hallucinations, which can severely undermine users' trust and safety. Most prior research on hallucination mitigation focuses on traditional MT models, with solutions that involve post-hoc mitigation: detecting hallucinated translations and re-translating them. While effective, this approach introduces additional complexity in deploying extra tools in production and also increases latency. To address these limitations, we propose a method that intrinsically learns to mitigate hallucinations during the model training phase. Specifically, we introduce a data creation framework to generate hallucination-focused preference datasets. Fine-tuning LLMs on these preference datasets reduces the hallucination rate by an average of 96% across five language pairs, while preserving overall translation quality. In a zero-shot setting, our approach reduces hallucinations by 89% on average across three unseen target languages.
摘要：机器翻译 (MT) 正在经历范式转变，基于经过微调的大型语言模型 (LLM) 的系统与专门为翻译任务训练的传统编码器-解码器模型相比，竞争力越来越强。然而，基于 LLM 的系统产生幻觉的风险更高，这会严重损害用户的信任和安全。大多数关于幻觉缓解的先前研究都集中在传统的 MT 模型上，解决方案涉及事后缓解 - 检测幻觉翻译并重新翻译。虽然这种方法有效，但它在生产中部署额外工具时增加了复杂性，并且还增加了延迟。为了解决这些限制，我们提出了一种在模型训练阶段从本质上学习缓解幻觉的方法。具体来说，我们引入了一个数据创建框架来生成以幻觉为重点的偏好数据集。在这些偏好数据集上微调 LLM 可将五种语言对的幻觉率平均降低 96%，同时保持整体翻译质量。在零样本设置中，我们的方法平均将三种未见过的目标语言中的幻觉减少了 89%。

Title: Memorize and Rank: Elevating Large Language Models for Clinical Diagnosis Prediction

Authors: Mingyu Derek Ma, Xiaoxuan Wang, Yijia Xiao, Anthony Cuturrufo, Vijay S Nori, Eran Halperin, Wei Wang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2501.17326
Pdf URL: https://arxiv.org/pdf/2501.17326
Copy Paste: [[2501.17326]] Memorize and Rank: Elevating Large Language Models for Clinical Diagnosis Prediction(https://arxiv.org/abs/2501.17326)
Keywords: language model, llm
Abstract: Clinical diagnosis prediction models, when provided with a patient's medical history, aim to detect potential diseases early, facilitating timely intervention and improving prognostic outcomes. However, the inherent scarcity of patient data and large disease candidate space often pose challenges in developing satisfactory models for this intricate task. The exploration of leveraging Large Language Models (LLMs) for encapsulating clinical decision processes has been limited. We introduce MERA, a clinical diagnosis prediction model that bridges pertaining natural language knowledge with medical practice. We apply hierarchical contrastive learning on a disease candidate ranking list to alleviate the large decision space issue. With concept memorization through fine-tuning, we bridge the natural language clinical knowledge with medical codes. Experimental results on MIMIC-III and IV datasets show that MERA achieves the state-of-the-art diagnosis prediction performance and dramatically elevates the diagnosis prediction capabilities of generative LMs.
摘要：临床诊断预测模型旨在通过提供患者的病史来尽早发现潜在疾病，以便及时干预并改善预后结果。然而，患者数据本身的稀缺性和疾病候选空间巨大，往往会对开发令人满意的模型来完成这一复杂任务带来挑战。利用大型语言模型 (LLM) 来封装临床决策过程的探索一直很有限。我们推出了 MERA，这是一种临床诊断预测模型，它将相关的自然语言知识与医疗实践联系起来。我们对疾病候选排名列表应用分层对比学习，以缓解决策空间大的问题。通过微调进行概念记忆，我们将自然语言临床知识与医疗代码联系起来。在 MIMIC-III 和 IV 数据集上的实验结果表明，MERA 实现了最先进的诊断预测性能，并显著提升了生成式语言模型的诊断预测能力。

Title: Inferring from Logits: Exploring Best Practices for Decoding-Free Generative Candidate Selection

Authors: Mingyu Derek Ma, Yanna Ding, Zijie Huang, Jianxi Gao, Yizhou Sun, Wei Wang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2501.17338
Pdf URL: https://arxiv.org/pdf/2501.17338
Copy Paste: [[2501.17338]] Inferring from Logits: Exploring Best Practices for Decoding-Free Generative Candidate Selection(https://arxiv.org/abs/2501.17338)
Keywords: language model
Abstract: Generative Language Models rely on autoregressive decoding to produce the output sequence token by token. Many tasks, such as preference optimization, require the model to produce task-level output consisting of multiple tokens directly by selecting candidates from a pool as predictions. Determining a task-level prediction from candidates using the ordinary token-level decoding mechanism is constrained by time-consuming decoding and by gradients that are interrupted by discrete token selection. Existing works have been using decoding-free candidate selection methods to obtain candidate probability from initial output logits over vocabulary. Though these estimation methods are widely used, they are not systematically evaluated, especially on end tasks. We introduce an evaluation of a comprehensive collection of decoding-free candidate selection approaches on a comprehensive set of tasks, including five multiple-choice QA tasks with a small candidate pool and four clinical decision tasks with a massive number of candidates, some with 10k+ options. We evaluate the estimation methods paired with a wide spectrum of foundation LMs covering different architectures, sizes and training paradigms. The results and insights from our analysis inform future model design.
摘要：生成语言模型依赖于自回归解码来逐个标记地生成输出序列。许多任务（例如偏好优化）都要求模型通过从池中选择候选作为预测来直接生成由多个标记组成的任务级输出。使用普通的标记级解码机制从候选中确定任务级预测受到耗时的解码和离散标记选择的中断梯度的限制。现有的工作一直使用无解码候选选择方法从词汇表的初始输出 logit 中获得候选概率。虽然这些估计方法被广泛使用，但它们并没有得到系统的评估，特别是在最终任务上。我们介绍了对一套全面的无解码候选选择方法的评估，这些方法包括五个具有小候选池的多项选择 QA 任务和四个具有大量候选的临床决策任务，其中一些有 10k+ 个选项。我们评估了与涵盖不同架构、大小和训练范式的广泛基础 LM 配对的估计方法。我们分析的结果和见解为未来的模型设计提供了参考。
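
Note: the idea of obtaining candidate probabilities from the initial output logits, as described in the abstract above, can be sketched as follows. The abstract does not specify the estimators it evaluates; this example assumes one simple variant that scores each candidate by the probability of its first token and renormalizes over the candidate pool. The function names, token ids, and logits are all hypothetical.

```python
import math
from typing import Dict, List, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over the vocabulary."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_candidate(
    first_step_logits: List[float],
    candidate_token_ids: Dict[str, List[int]],
) -> Tuple[str, Dict[str, float]]:
    """Pick a candidate without decoding any tokens.

    Assumption: each candidate is scored by the probability of its FIRST
    token at the first output step; scores are renormalized over the pool.
    """
    probs = softmax(first_step_logits)
    raw = {name: probs[ids[0]] for name, ids in candidate_token_ids.items()}
    total = sum(raw.values())
    scores = {name: p / total for name, p in raw.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Toy vocabulary of four tokens and two candidates given as token-id lists.
logits = [0.0, 2.0, 1.0, -1.0]
candidates = {"yes": [1, 3], "no": [2]}
best, scores = select_candidate(logits, candidates)
print(best, scores)
```

Because only one forward pass is needed and no discrete token is sampled, the scores stay differentiable with respect to the logits in a real framework, which is the motivation the abstract gives for decoding-free selection.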

Title: Context-Aware Semantic Recomposition Mechanism for Large Language Models

Authors: Richard Katrix, Quentin Carroway, Rowan Hawkesbury, Matthias Heathfield
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.17386
Pdf URL: https://arxiv.org/pdf/2501.17386
Copy Paste: [[2501.17386]] Context-Aware Semantic Recomposition Mechanism for Large Language Models(https://arxiv.org/abs/2501.17386)
Keywords: language model
Abstract: Context-aware processing mechanisms have increasingly become a critical area of exploration for improving the semantic and contextual capabilities of language generation models. The Context-Aware Semantic Recomposition Mechanism (CASRM) was introduced as a novel framework designed to address limitations in coherence, contextual adaptability, and error propagation in large-scale text generation tasks. Through the integration of dynamically generated context vectors and attention modulation layers, CASRM enhances the alignment between token-level representations and broader contextual dependencies. Experimental evaluations demonstrated significant improvements in semantic coherence across multiple domains, including technical, conversational, and narrative text. The ability to adapt to unseen domains and ambiguous inputs was evaluated using a diverse set of test scenarios, highlighting the robustness of the proposed mechanism. A detailed computational analysis revealed that while CASRM introduces additional processing overhead, the gains in linguistic precision and contextual relevance outweigh the marginal increase in complexity. The framework also successfully mitigates error propagation in sequential tasks, improving performance in dialogue continuation and multi-step text synthesis. Additional investigations into token-level attention distribution emphasized the dynamic focus shifts enabled through context-aware enhancements. The findings suggest that CASRM offers a scalable and flexible solution for integrating contextual intelligence into existing language model architectures.
摘要：上下文感知处理机制日益成为改进语言生成模型的语义和上下文能力的关键探索领域。上下文感知语义重组机制 (CASRM) 是一种新颖的框架，旨在解决大规模文本生成任务中连贯性、上下文适应性和错误传播方面的限制。通过集成动态生成的上下文向量和注意力调节层，CASRM 增强了标记级表示与更广泛的上下文依赖性之间的一致性。实验评估表明，在多个领域（包括技术、对话和叙述文本）中，语义连贯性得到了显着改善。使用一组不同的测试场景评估了适应看不见的领域和模糊输入的能力，突出了所提出的机制的稳健性。详细的计算分析表明，虽然 CASRM 引入了额外的处理开销，但语言精度和上下文相关性的提高超过了复杂性的边际增加。该框架还成功地缓解了顺序任务中的错误传播，提高了对话延续和多步骤文本合成的性能。对 token 级注意力分布的进一步研究强调了通过上下文感知增强实现的动态焦点转移。研究结果表明，CASRM 提供了一种可扩展且灵活的解决方案，可将上下文智能集成到现有的语言模型架构中。

Title: Leveraging In-Context Learning and Retrieval-Augmented Generation for Automatic Question Generation in Educational Domains

Authors: Subhankar Maity, Aniket Deroy, Sudeshna Sarkar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.17397
Pdf URL: https://arxiv.org/pdf/2501.17397
Copy Paste: [[2501.17397]] Leveraging In-Context Learning and Retrieval-Augmented Generation for Automatic Question Generation in Educational Domains(https://arxiv.org/abs/2501.17397)
Keywords: gpt, retrieval-augmented generation
Abstract: Question generation in education is a time-consuming and cognitively demanding task, as it requires creating questions that are both contextually relevant and pedagogically sound. Current automated question generation methods often generate questions that are out of context. In this work, we explore advanced techniques for automated question generation in educational contexts, focusing on In-Context Learning (ICL), Retrieval-Augmented Generation (RAG), and a novel Hybrid Model that merges both methods. We implement GPT-4 for ICL using few-shot examples and BART with a retrieval module for RAG. The Hybrid Model combines RAG and ICL to address these issues and improve question quality. Evaluation is conducted using automated metrics, followed by human evaluation metrics. Our results show that both the ICL approach and the Hybrid Model consistently outperform other methods, including baseline models, by generating more contextually accurate and relevant questions.
摘要：教育中的问题生成是一项耗时且需要认知能力的任务，因为它需要创建既与上下文相关又符合教学法的问题。当前的自动问题生成方法通常会生成脱离上下文的问题。在这项工作中，我们探索了教育环境中自动问题生成的高级技术，重点关注情境学习 (ICL)、检索增强生成 (RAG) 以及融合这两种方法的新型混合模型。我们使用少量样本为 ICL 实现 GPT-4，并使用检索模块为 RAG 实现 BART。混合模型结合了 RAG 和 ICL 来解决这些问题并提高问题质量。评估使用自动指标进行，然后是人工评估指标。我们的结果表明，ICL 方法和混合模型都通过生成更符合情境、更相关的问题，始终优于其他方法（包括基线模型）。

Title: MultiChallenge: A Realistic Multi-Turn Conversation Evaluation Benchmark Challenging to Frontier LLMs

Authors: Ved Sirdeshmukh, Kaustubh Deshpande, Johannes Mols, Lifeng Jin, Ed-Yeremai Cardona, Dean Lee, Jeremy Kritz, Willow Primack, Summer Yue, Chen Xing
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.17399
Pdf URL: https://arxiv.org/pdf/2501.17399
Copy Paste: [[2501.17399]] MultiChallenge: A Realistic Multi-Turn Conversation Evaluation Benchmark Challenging to Frontier LLMs(https://arxiv.org/abs/2501.17399)
Keywords: language model, llm
Abstract: We present MultiChallenge, a pioneering benchmark evaluating large language models (LLMs) on conducting multi-turn conversations with human users, a crucial yet underexamined capability for their applications. MultiChallenge identifies four categories of challenges in multi-turn conversations that are not only common and realistic among current human-LLM interactions, but are also challenging to all current frontier LLMs. All 4 challenges require accurate instruction-following, context allocation, and in-context reasoning at the same time. We also develop LLM as judge with instance-level rubrics to facilitate an automatic evaluation method with fair agreement with experienced human raters. Despite achieving near-perfect scores on existing multi-turn evaluation benchmarks, all frontier models have less than 50% accuracy on MultiChallenge, with the top-performing Claude 3.5 Sonnet (June 2024) achieving just a 41.4% average accuracy.
摘要：我们提出了 MultiChallenge，这是一个开创性的基准，用于评估大型语言模型 (LLM) 与人类用户进行多轮对话的能力，这是一项对其应用至关重要但尚未得到充分研究的能力。MultiChallenge 确定了多轮对话中的四类挑战，这些挑战不仅在当前的人机 LLM 交互中常见且现实，而且对所有当前前沿 LLM 都具有挑战性。所有 4 个挑战都需要同时准确遵循指令、上下文分配和上下文推理。我们还开发了具有实例级评分标准的 LLM 作为评判者，以促进与经验丰富的人类评分者公平一致的自动评估方法。尽管在现有的多轮评估基准上取得了近乎完美的分数，但所有前沿模型在 MultiChallenge 上的准确率都低于 50%，其中表现最好的 Claude 3.5 Sonnet（2024 年 6 月）的平均准确率仅为 41.4%。

Title: Actions Speak Louder than Words: Agent Decisions Reveal Implicit Biases in Language Models

Authors: Yuxuan Li, Hirokazu Shirado, Sauvik Das
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2501.17420
Pdf URL: https://arxiv.org/pdf/2501.17420
Copy Paste: [[2501.17420]] Actions Speak Louder than Words: Agent Decisions Reveal Implicit Biases in Language Models(https://arxiv.org/abs/2501.17420)
Keywords: language model, llm, prompt, agent
Abstract: While advances in fairness and alignment have helped mitigate overt biases exhibited by large language models (LLMs) when explicitly prompted, we hypothesize that these models may still exhibit implicit biases when simulating human behavior. To test this hypothesis, we propose a technique to systematically uncover such biases across a broad range of sociodemographic categories by assessing decision-making disparities among agents with LLM-generated, sociodemographically-informed personas. Using our technique, we tested six LLMs across three sociodemographic groups and four decision-making scenarios. Our results show that state-of-the-art LLMs exhibit significant sociodemographic disparities in nearly all simulations, with more advanced models exhibiting greater implicit biases despite reducing explicit biases. Furthermore, when comparing our findings to real-world disparities reported in empirical studies, we find that the biases we uncovered are directionally aligned but markedly amplified. This directional alignment highlights the utility of our technique in uncovering systematic biases in LLMs rather than random variations; moreover, the presence and amplification of implicit biases emphasizes the need for novel strategies to address these biases.
摘要：虽然公平性和一致性方面的进步有助于减轻大型语言模型 (LLM) 在明确提示时表现出的明显偏见，但我们假设这些模型在模拟人类行为时仍可能表现出隐性偏见。为了验证这一假设，我们提出了一种技术，通过评估具有 LLM 生成的、具有社会人口统计学信息的角色的代理之间的决策差异，系统地发现广泛社会人口统计学类别中的此类偏见。使用我们的技术，我们在三个社会人口统计学群体和四个决策场景中测试了六个 LLM。我们的结果表明，最先进的 LLM 在几乎所有模拟中都表现出显著的社会人口统计学差异，更先进的模型尽管减少了显性偏见，但表现出更大的隐性偏见。此外，当将我们的研究结果与实证研究中报告的现实世界差异进行比较时，我们发现的偏见在方向上是一致的，但明显被放大了。这种方向性的一致凸显了我们的技术在揭示 LLM 中的系统性偏见而不是随机变化方面的效用；此外，隐性偏见的存在和扩大凸显了采取新策略来解决这些偏见的必要性。

Title: Cross-Language Approach for Quranic QA

Authors: Islam Oshallah, Mohamed Basem, Ali Hamdi, Ammar Mohammed
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2501.17449
Pdf URL: https://arxiv.org/pdf/2501.17449
Copy Paste: [[2501.17449]] Cross-Language Approach for Quranic QA(https://arxiv.org/abs/2501.17449)
Keywords: language model
Abstract: Question answering systems face critical limitations in languages with limited resources and scarce data, making the development of robust models especially challenging. The Quranic QA system holds significant importance as it facilitates a deeper understanding of the Quran, a holy text for over a billion people worldwide. However, these systems face unique challenges, including the linguistic disparity between questions written in Modern Standard Arabic and answers found in Quranic verses written in Classical Arabic, and the small size of existing datasets, which further restricts model performance. To address these challenges, we adopt a cross-language approach by (1) Dataset Augmentation: expanding and enriching the dataset through machine translation to convert Arabic questions into English, paraphrasing questions to create linguistic diversity, and retrieving answers from an English translation of the Quran to align with multilingual training requirements; and (2) Language Model Fine-Tuning: utilizing pre-trained models such as BERT-Medium, RoBERTa-Base, DeBERTa-v3-Base, ELECTRA-Large, Flan-T5, Bloom, and Falcon to address the specific requirements of Quranic QA. Experimental results demonstrate that this cross-language approach significantly improves model performance, with RoBERTa-Base achieving the highest MAP@10 (0.34) and MRR (0.52), while DeBERTa-v3-Base excels in Recall@10 (0.50) and Precision@10 (0.24). These findings underscore the effectiveness of cross-language strategies in overcoming linguistic barriers and advancing Quranic QA systems.
摘要：在资源有限、数据稀缺的语言中，问答系统面临严重限制，这使得开发稳健的模型尤为具有挑战性。《古兰经》问答系统具有重要意义，因为它有助于更深入地理解《古兰经》，而《古兰经》是全球十多亿人的圣书。然而，这些系统面临着独特的挑战，包括用现代标准阿拉伯语写成的问题与用古典阿拉伯语写成的《古兰经》经文中的答案之间的语言差异，以及现有数据集的规模较小，这些都进一步限制了模型的性能。为了应对这些挑战，我们采用了一种跨语言方法，(1) 数据集增强：通过机器翻译扩展和丰富数据集，将阿拉伯语问题转换成英语，改写问题以创造语言多样性，并从《古兰经》的英语翻译中检索答案以满足多语言训练要求；（2）语言模型微调：利用 BERT-Medium、RoBERTa-Base、DeBERTa-v3-Base、ELECTRA-Large、Flan-T5、Bloom 和 Falcon 等预训练模型来满足《古兰经》问答的特定要求。实验结果表明，这种跨语言方法显著提高了模型性能，其中 RoBERTa-Base 实现了最高的 MAP@10（0.34）和 MRR（0.52），而 DeBERTa-v3-Base 在 Recall@10（0.50）和 Precision@10（0.24）方面表现出色。这些发现强调了跨语言策略在克服语言障碍和推进《古兰经》问答系统方面的有效性
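
The ranking metrics named in the abstract above (Precision@10, Recall@10, MRR, MAP@10) can be illustrated with a minimal sketch. The function names and the toy verse identifiers are hypothetical, and the AP@k normalization used here (dividing by min(number of relevant items, k)) is one common convention that the paper may or may not follow. MRR and MAP@10 are the means of the per-query reciprocal rank and AP@10 values.

```python
from typing import Sequence, Set


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """Fraction of the k retrieval slots filled by relevant items."""
    return sum(1 for item in ranked[:k] if item in relevant) / k


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """Fraction of all relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for item in ranked[:k] if item in relevant) / len(relevant)


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant item, or 0 if none is retrieved."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def average_precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """Mean of precision values taken at each relevant hit in the top k.

    Assumption: normalized by min(len(relevant), k); other conventions exist.
    """
    hits = 0
    total = 0.0
    for rank, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    denom = min(len(relevant), k)
    return total / denom if denom else 0.0


# Toy query: four retrieved passages, two of which are relevant.
ranked_ids = ["v3", "v7", "v1", "v9"]
relevant_ids = {"v7", "v9"}
scores = {
    "P@4": precision_at_k(ranked_ids, relevant_ids, k=4),
    "R@4": recall_at_k(ranked_ids, relevant_ids, k=4),
    "RR": reciprocal_rank(ranked_ids, relevant_ids),
    "AP@4": average_precision_at_k(ranked_ids, relevant_ids, k=4),
}
```

For this toy query the relevant passages sit at ranks 2 and 4, so precision is computed at those two positions and averaged.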

Title: DINT Transformer

Authors: Yueyang Cang, Yuhang Liu, Xiaoteng Zhang, Erlu Zhao, Li Shi
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2501.17486
Pdf URL: https://arxiv.org/pdf/2501.17486
Copy Paste: [[2501.17486]] DINT Transformer(https://arxiv.org/abs/2501.17486)
Keywords: language model
Abstract: DIFF Transformer addresses the issue of irrelevant context interference by introducing a differential attention mechanism that enhances the robustness of local attention. However, it has two critical limitations: the lack of global context modeling, which is essential for identifying globally significant tokens, and numerical instability due to the absence of strict row normalization in the attention matrix. To overcome these challenges, we propose DINT Transformer, which extends DIFF Transformer by incorporating a differential-integral mechanism. By computing global importance scores and integrating them into the attention matrix, DINT Transformer improves its ability to capture global dependencies. Moreover, the unified parameter design enforces row-normalized attention matrices, improving numerical stability. Experimental results demonstrate that DINT Transformer excels in accuracy and robustness across various practical applications, such as long-context language modeling and key information retrieval. These results position DINT Transformer as a highly effective and promising architecture.
摘要：DIFF Transformer 通过引入差分注意机制来解决不相关上下文干扰的问题，从而增强局部注意的鲁棒性。然而，它有两个关键的局限性：缺乏全局上下文建模（这对于识别全局重要标记至关重要），以及由于注意矩阵中缺乏严格的行规范化而导致的数值不稳定性。为了克服这些挑战，我们提出了 DINT Transformer，它通过结合差分积分机制扩展了 DIFF Transformer。通过计算全局重要性分数并将其集成到注意矩阵中，DINT Transformer 提高了其捕获全局依赖关系的能力。此外，统一的参数设计强制使用行规范化的注意矩阵，从而提高了数值稳定性。实验结果表明，DINT Transformer 在各种实际应用中都具有出色的准确性和鲁棒性，例如长上下文语言建模和关键信息检索。这些结果使 DINT Transformer 成为一种高效且有前途的架构。

Title: Query-Aware Learnable Graph Pooling Tokens as Prompt for Large Language Models

Authors: Wooyoung Kim, Byungyoon Park, Wooju Kim
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.17549
Pdf URL: https://arxiv.org/pdf/2501.17549
Copy Paste: [[2501.17549]] Query-Aware Learnable Graph Pooling Tokens as Prompt for Large Language Models(https://arxiv.org/abs/2501.17549)
Keywords: language model, gpt, prompt
Abstract: Graph-structured data plays a vital role in numerous domains, such as social networks, citation networks, commonsense reasoning graphs and knowledge graphs. While graph neural networks have been employed for graph processing, recent advancements have explored integrating large language models for graph-based tasks. In this paper, we propose a novel approach named Learnable Graph Pooling Token (LGPT), which addresses two limitations: the scalability issues of node-level projection and the information loss of graph-level projection. LGPT enables flexible and efficient graph representation by introducing learnable parameters that act as tokens in large language models, balancing fine-grained and global graph information. Additionally, we investigate an Early Query Fusion technique, which fuses query context before constructing the graph representation, leading to more effective graph embeddings. Our method achieves a 4.13% performance improvement on the GraphQA benchmark without training the large language model, demonstrating significant gains in handling complex textual-attributed graph data.
摘要：图结构数据在许多领域都发挥着至关重要的作用，例如社交网络、引文网络、常识推理图和知识图。虽然图神经网络已用于图处理，但最近的进展已经探索了将大型语言模型集成到基于图的任务中。在本文中，我们提出了一种名为可学习图池化标记 (LGPT) 的新方法，该方法解决了节点级投影中的可扩展性问题和图级投影中的信息丢失的限制。LGPT 通过引入可学习参数来实现灵活高效的图表示，这些参数充当大型语言模型中的标记，平衡细粒度和全局图信息。此外，我们研究了一种早期查询融合技术，该技术在构建图表示之前融合查询上下文，从而实现更有效的图嵌入。我们的方法在无需训练大型语言模型的情况下在 GraphQA 基准上实现了 4.13\% 的性能提升，在处理复杂的文本属性图数据方面表现出显着的收益。

Title: A linguistically-motivated evaluation methodology for unraveling model's abilities in reading comprehension tasks

Authors: Elie Antoine (LIS, TALEP), Frédéric Béchet (LIS, TALEP), Géraldine Damnati, Philippe Langlais (DIRO)
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.17569
Pdf URL: https://arxiv.org/pdf/2501.17569
Copy Paste: [[2501.17569]] A linguistically-motivated evaluation methodology for unraveling model's abilities in reading comprehension tasks(https://arxiv.org/abs/2501.17569)
Keywords: gpt, chat
Abstract: We introduce an evaluation methodology for reading comprehension tasks based on the intuition that certain examples, by virtue of their linguistic complexity, consistently yield lower scores regardless of model size or architecture. We capitalize on semantic frame annotation for characterizing this complexity, and study seven complexity factors that may account for models' difficulty. We first deploy this methodology on a carefully annotated French reading comprehension benchmark, showing that two of those complexity factors are indeed good predictors of models' failure, while others are less so. We further deploy our methodology on a well-studied English benchmark by using Chat-GPT as a proxy for semantic annotation. Our study reveals that fine-grained, linguistically motivated automatic evaluation of a reading comprehension task is not only possible, but helps understand models' abilities to handle specific linguistic characteristics of input examples. It also shows that current state-of-the-art models fail on some of those characteristics, which suggests that adequately handling them requires more than merely increasing model size.
摘要：我们引入了一种阅读理解任务评估方法，该方法基于以下直觉：某些示例由于其语言复杂性，无论模型大小或架构如何，都会始终获得较低的分数。我们利用语义框架注释来表征这种复杂性，并研究可能导致模型难度的七个复杂性因素。我们首先在经过精心注释的法语阅读理解基准上部署了这种方法，结果表明其中两个复杂性因素确实是模型失败的良好预测因素，而其他因素则不然。我们进一步在经过充分研究的英语基准上部署了我们的方法，使用 Chat-GPT 作为语义注释的代理。我们的研究表明，对阅读理解任务进行细粒度的语言驱动自动评估不仅是可能的，而且有助于理解模型处理输入示例的特定语言特征的能力。它还表明，当前最先进的模型在某些特征上失败了，这表明要充分处理它们需要的不仅仅是增加模型大小。

Title: CSEval: Towards Automated, Multi-Dimensional, and Reference-Free Counterspeech Evaluation using Auto-Calibrated LLMs

Authors: Amey Hengle, Aswini Kumar, Anil Bandhakavi, Tanmoy Chakraborty
Subjects: cs.CL, cs.AI, cs.CY, cs.SI
Abstract URL: https://arxiv.org/abs/2501.17581
Pdf URL: https://arxiv.org/pdf/2501.17581
Copy Paste: [[2501.17581]] CSEval: Towards Automated, Multi-Dimensional, and Reference-Free Counterspeech Evaluation using Auto-Calibrated LLMs(https://arxiv.org/abs/2501.17581)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Counterspeech has been popular as an effective approach to counter online hate speech, leading to increasing research interest in automated counterspeech generation using language models. However, this field lacks standardised evaluation protocols and robust automated evaluation metrics that align with human judgement. Current automatic evaluation methods, primarily based on similarity metrics, do not effectively capture the complex and independent attributes of counterspeech quality, such as contextual relevance, aggressiveness, or argumentative coherence. This has led to an increased dependency on labor-intensive human evaluations to assess automated counter-speech generation methods. To address these challenges, we introduce CSEval, a novel dataset and framework for evaluating counterspeech quality across four dimensions: contextual-relevance, aggressiveness, argument-coherence, and suitableness. Furthermore, we propose Auto-Calibrated COT for Counterspeech Evaluation (ACE), a prompt-based method with auto-calibrated chain-of-thoughts (CoT) for scoring counterspeech using large language models. Our experiments show that ACE outperforms traditional metrics like ROUGE, METEOR, and BertScore in correlating with human judgement, indicating a significant advancement in automated counterspeech evaluation.
摘要：反驳作为一种有效的反击网络仇恨言论的方法而广受欢迎，这导致人们对使用语言模型自动生成反驳言论的研究兴趣日益浓厚。然而，这一领域缺乏标准化的评估协议和与人类判断相一致的强大的自动评估指标。当前的自动评估方法主要基于相似性指标，无法有效捕捉反驳言论质量的复杂和独立属性，例如上下文相关性、攻击性或论证连贯性。这导致人们越来越依赖劳动密集型的人工评估来评估自动反驳言论生成方法。为了应对这些挑战，我们引入了 CSEval，这是一个新颖的数据集和框架，用于从四个维度评估反驳言论质量：上下文相关性、攻击性、论证连贯性和适用性。此外，我们提出了用于反驳言论评估的自动校准 COT (ACE)，这是一种基于提示的方法，具有自动校准的思路链 (CoT)，用于使用大型语言模型对反驳言论进行评分。我们的实验表明，ACE 在与人类判断的关联性方面优于 ROUGE、METEOR 和 BertScore 等传统指标，这表明自动反话评估取得了显著的进步。

Title: Semantic Consistency Regularization with Large Language Models for Semi-supervised Sentiment Analysis

Authors: Kunrong Li, Xinyu Liu, Zhen Chen
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2501.17598
Pdf URL: https://arxiv.org/pdf/2501.17598
Copy Paste: [[2501.17598]] Semantic Consistency Regularization with Large Language Models for Semi-supervised Sentiment Analysis(https://arxiv.org/abs/2501.17598)
Keywords: language model, llm, prompt
Abstract: Accurate sentiment analysis of texts is crucial for a variety of applications, such as understanding customer feedback, monitoring market trends, and detecting public sentiment. However, manually annotating large sentiment corpora for supervised learning is labor-intensive and time-consuming. Therefore, it is essential and effective to develop a semi-supervised method for the sentiment analysis task. Although some methods have been proposed for semi-supervised text classification, they rely on the intrinsic information within the unlabeled data and the learning capability of the NLP model, so they generalize poorly to the sentiment analysis scenario and may be prone to overfitting. Inspired by the ability of pretrained Large Language Models (LLMs) in following instructions and generating coherent text, we propose a Semantic Consistency Regularization with Large Language Models (SCR) framework for semi-supervised sentiment analysis. We introduce two prompting strategies to semantically enhance unlabeled text using LLMs. The first is Entity-based Enhancement (SCR-EE), which involves extracting entities and numerical information, and querying the LLM to reconstruct the textual information. The second is Concept-based Enhancement (SCR-CE), which directly queries the LLM with the original sentence for semantic reconstruction. Subsequently, the LLM-augmented data is utilized for a consistency loss with confidence thresholding, which preserves high-quality agreement samples to provide additional supervision signals during training. Furthermore, to fully utilize the uncertain unlabeled data samples, we propose a class re-assembling strategy inspired by the class space shrinking theorem. Experiments show our method achieves remarkable performance over prior semi-supervised methods.
摘要：准确的文本情感分析对于各种应用都至关重要，例如理解客户反馈、监测市场趋势和检测公众情绪。然而，手动注释大型情感语料库以进行监督学习是一项劳动密集型且耗时的工程。因此，开发一种半监督方法来完成情感分析任务是必要且有效的。虽然已经提出了一些用于半监督文本分类的方法，但它们依赖于未标记数据中的内在信息和 NLP 模型的学习能力，缺乏对情感分析场景的泛化能力并且容易过拟合。受预训练的大型语言模型 (LLM) 遵循指令和生成连贯文本的能力的启发，我们提出了一种用于半监督情感分析的具有大型语言模型 (SCR) 框架的语义一致性正则化。我们引入了两种提示策略，以使用 LLM 在语义上增强未标记文本。第一种是基于实体的增强（SCR-EE），它涉及提取实体和数字信息，并查询 LLM 以重建文本信息。第二种是基于概念的增强（SCR-CE），它直接使用原始句子查询 LLM 以进行语义重建。随后，利用 LLM 增强数据进行一致性损失和置信度阈值处理，这保留了高质量的一致性样本以在训练期间提供额外的监督信号。此外，为了充分利用不确定的未标记数据样本，我们提出了一种受类空间收缩定理启发的类重组策略。实验表明，我们的方法比以前的半监督方法取得了显著的效果。
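
The abstract above mentions a consistency loss with confidence thresholding but does not give its form. The sketch below shows one common variant (FixMatch-style pseudo-labeling), which is an assumption and not necessarily the SCR formulation. The function name, the 0.9 threshold, and the toy probabilities are all hypothetical.

```python
import math
from typing import List


def thresholded_consistency_loss(
    probs_original: List[List[float]],
    probs_enhanced: List[List[float]],
    threshold: float = 0.9,
) -> float:
    """Cross-entropy between pseudo-labels and predictions on enhanced text.

    probs_original: class probabilities predicted for each unlabeled text.
    probs_enhanced: class probabilities for the LLM-enhanced version of the
        same text (assumed to be computed elsewhere).
    Only samples whose original prediction reaches the confidence threshold
    contribute; the sum is divided by the full batch size (an assumption).
    """
    if not probs_original:
        return 0.0
    total = 0.0
    for p_orig, p_enh in zip(probs_original, probs_enhanced):
        confidence = max(p_orig)
        if confidence < threshold:
            continue  # low-confidence sample: no supervision signal
        pseudo_label = p_orig.index(confidence)
        total += -math.log(p_enh[pseudo_label])
    return total / len(probs_original)


# Toy batch: the first sample is confident (0.95), the second is not (0.6).
batch_original = [[0.95, 0.05], [0.6, 0.4]]
batch_enhanced = [[0.8, 0.2], [0.3, 0.7]]
loss = thresholded_consistency_loss(batch_original, batch_enhanced, threshold=0.9)
```

Here only the first sample passes the threshold, so the loss is the cross-entropy on that sample divided by the batch size of two.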

Title: Structured Context Recomposition for Large Language Models Using Probabilistic Layer Realignment

Authors: Jonathan Teel, Jocasta Cumberbatch, Raphael Benington, Quentin Baskerville
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.17617
Pdf URL: https://arxiv.org/pdf/2501.17617
Copy Paste: [[2501.17617]] Structured Context Recomposition for Large Language Models Using Probabilistic Layer Realignment(https://arxiv.org/abs/2501.17617)
Keywords: language model
Abstract: Extended sequence generation often leads to degradation in contextual consistency due to the inability of conventional self-attention mechanisms to effectively retain long-range dependencies. Existing approaches, including memory compression and retrieval-augmented conditioning, introduce computational trade-offs that either increase inference latency or impose additional storage overhead. Structured Context Recomposition (SCR) introduces a probabilistic layer realignment strategy that dynamically adjusts learned representations within transformer layers, ensuring that semantically relevant embeddings persist throughout extended transformations. The proposed method enhances coherence retention through a recursive weighting function that redistributes representational emphasis based on inferred contextual relevance rather than relying on fixed token-level attention scores. Empirical results indicate that probabilistic realignment mitigates abrupt topic shifts and logical inconsistencies, particularly in scenarios where sequences exceed standard attention window constraints. Sequence-level entropy analysis further reveals that SCR moderates representational variability without introducing excessive output regularization, allowing models to sustain generative diversity while preserving contextual alignment. Attention head deviation measurements confirm that hierarchical reweighting contributes to smoother token dependency transitions across transformer layers, reinforcing the stability of multi-turn interactions and document-level reasoning. Computational resource assessments show that while SCR incurs a moderate increase in processing time, memory overhead remains within feasible limits, making it suitable for practical deployment in autoregressive generative applications.
摘要：扩展序列生成通常会导致上下文一致性下降，因为传统的自注意力机制无法有效地保留长距离依赖关系。现有方法（包括内存压缩和检索增强条件）会引入计算权衡，从而增加推理延迟或增加存储开销。结构化上下文重组 (SCR) 引入了一种概率层重新调整策略，可动态调整转换器层内学习到的表示，确保语义相关的嵌入在整个扩展转换过程中保持不变。所提出的方法通过递归加权函数增强了连贯性保留，该函数根据推断的上下文相关性重新分配表示重点，而不是依赖于固定的 token 级注意力分数。实证结果表明，概率重新调整可缓解突然的主题转移和逻辑不一致，尤其是在序列超出标准注意力窗口约束的情况下。序列级熵分析进一步表明，SCR 可在不引入过多输出正则化的情况下缓和表征变异性，从而使模型能够在保持上下文一致性的同时维持生成多样性。注意头偏差测量证实，分层重新加权有助于更平滑地跨转换器层进行标记依赖性转换，从而增强多轮交互和文档级推理的稳定性。计算资源评估表明，虽然 SCR 会导致处理时间略有增加，但内存开销仍在可行范围内，使其适合在自回归生成应用中实际部署。

Title: In-Context Meta LoRA Generation

Authors: Yihua Shao, Minxi Yan, Yang Liu, Siyu Chen, Wenjie Chen, Xinwei Long, Ziyang Yan, Lei Li, Chenyu Zhang, Nicu Sebe, Hao Tang, Yan Wang, Hao Zhao, Mengzhu Wang, Jingcai Guo
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2501.17635
Pdf URL: https://arxiv.org/pdf/2501.17635
Copy Paste: [[2501.17635]] In-Context Meta LoRA Generation(https://arxiv.org/abs/2501.17635)
Keywords: language model, llm
Abstract: Low-rank Adaptation (LoRA) has demonstrated remarkable capabilities for task-specific fine-tuning. However, in scenarios that involve multiple tasks, training a separate LoRA model for each one results in considerable inefficiency in terms of storage and inference. Moreover, existing parameter generation methods fail to capture the correlations among these tasks, making multi-task LoRA parameter generation challenging. To address these limitations, we propose In-Context Meta LoRA (ICM-LoRA), a novel approach that efficiently achieves task-specific customization of large language models (LLMs). Specifically, we use training data from all tasks to train a tailored generator, Conditional Variational Autoencoder (CVAE). CVAE takes task descriptions as inputs and produces task-aware LoRA weights as outputs. These LoRA weights are then merged with LLMs to create task-specialized models without the need for additional fine-tuning. Furthermore, we utilize in-context meta-learning for knowledge enhancement and task mapping, to capture the relationship between tasks and parameter distributions. As a result, our method achieves more accurate LoRA parameter generation for diverse tasks using CVAE. ICM-LoRA enables more accurate LoRA parameter reconstruction than current parameter reconstruction methods and is useful for implementing task-specific enhancements of LoRA parameters. At the same time, our method occupies 283MB, only 1% of the storage compared with the original LoRA.
摘要：低秩自适应 (LoRA) 已展示出针对特定任务进行微调的卓越能力。然而，在涉及多个任务的场景中，为每个任务训练单独的 LoRA 模型会导致存储和推理方面的效率低下。此外，现有的参数生成方法无法捕捉这些任务之间的相关性，使得多任务 LoRA 参数生成具有挑战性。为了解决这些限制，我们提出了上下文元 LoRA (ICM-LoRA)，这是一种新方法，可以有效地实现大型语言模型 (LLM) 的任务特定定制。具体来说，我们使用来自所有任务的训练数据来训练定制的生成器条件变分自动编码器 (CVAE)。CVAE 将任务描述作为输入并生成任务感知的 LoRA 权重作为输出。然后将这些 LoRA 权重与 LLM 合并以创建任务专用模型，而无需额外的微调。此外，我们利用上下文元学习来增强知识和进行任务映射，以捕捉任务与参数分布之间的关系。因此，我们的方法使用 CVAE 实现了针对各种任务的更准确的 LoRA 参数生成。与当前的参数重建方法相比，ICM-LoRA 能够实现更准确的 LoRA 参数重建，并且可用于实现针对特定任务的 LoRA 参数增强。同时，我们的方法占用 283MB，与原始 LoRA 相比仅占用 1% 的存储空间。
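
The step in which LoRA weights are merged into the base model can be sketched with the standard LoRA update W' = W + (alpha / r) * B * A. The sketch below uses plain Python lists to stay dependency-free. The helper names and the tiny matrices are hypothetical stand-ins for the weights a generator such as the CVAE would produce. It says nothing about how ICM-LoRA itself generates those weights.

```python
from typing import List

Matrix = List[List[float]]


def matmul(x: Matrix, y: Matrix) -> Matrix:
    """Plain-Python matrix product of x (m x n) and y (n x p)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def merge_lora(weight: Matrix, lora_b: Matrix, lora_a: Matrix,
               alpha: float, rank: int) -> Matrix:
    """Return weight + (alpha / rank) * lora_b @ lora_a.

    weight: (d_out x d_in) frozen base weight.
    lora_b: (d_out x rank) and lora_a: (rank x d_in) low-rank factors.
    """
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy example with d_out = d_in = 2 and rank 1.
base = [[1.0, 0.0], [0.0, 1.0]]
factor_b = [[1.0], [2.0]]
factor_a = [[0.5, -1.0]]
merged = merge_lora(base, factor_b, factor_a, alpha=2.0, rank=1)
```

After merging, the adapted layer has the same shape as the base layer, so inference needs no extra adapter computation.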

Title: Tonguescape: Exploring Language Models Understanding of Vowel Articulation

Authors: Haruki Sakajo, Yusuke Sakai, Hidetaka Kamigaito, Taro Watanabe
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.17643
Pdf URL: https://arxiv.org/pdf/2501.17643
Copy Paste: [[2501.17643]] Tonguescape: Exploring Language Models Understanding of Vowel Articulation(https://arxiv.org/abs/2501.17643)
Keywords: language model
Abstract: Vowels are primarily characterized by tongue position. Humans have discovered these features of vowel articulation through their own experience and explicit objective observation such as using MRI. With this knowledge and our experience, we can explain and understand the relationship between tongue positions and vowels, and this knowledge is helpful for language learners to learn pronunciation. Since language models (LMs) are trained on a large amount of data that includes linguistic and medical fields, our preliminary studies indicate that an LM is able to explain the pronunciation mechanisms of vowels. However, it is unclear whether multi-modal LMs, such as vision LMs, align textual information with visual information. One question arises: do LMs associate real tongue positions with vowel articulation? In this study, we created video and image datasets from the existing real-time MRI dataset and investigated whether LMs can understand vowel articulation based on tongue positions using vision-based information. Our findings suggest that LMs exhibit potential for understanding vowels and tongue positions when reference examples are provided while they have difficulties without them. Our code for dataset building is available on GitHub.
摘要：元音主要通过舌头位置来表征。人类通过自己的经验和明确的客观观察（例如使用 MRI）发现了元音发音的这些特征。凭借这些知识和我们的经验，我们可以解释和理解舌头位置和元音之间的关系，这些知识有助于语言学习者学习发音。由于语言模型 (LM) 是在包括语言学和医学领域的大量数据上进行训练的，我们的初步研究表明，语言模型能够解释元音的发音机制。然而，目前尚不清楚多模态语言模型（例如视觉语言模型）是否将文本信息与视觉信息对齐。一个问题出现了：语言模型是否将真实的舌头位置与元音发音联系起来？在这项研究中，我们从现有的实时 MRI 数据集中创建了视频和图像数据集，并研究了语言模型是否可以使用基于视觉的信息根据舌头位置理解元音发音。我们的研究结果表明，当提供参考示例时，语言模型表现出理解元音和舌头位置的潜力，而如果没有这些示例，语言模型就会遇到困难。我们的数据集构建代码可在 GitHub 上找到。

Title: Exploring Vision Language Models for Multimodal and Multilingual Stance Detection

Authors: Jake Vasilakes, Carolina Scarton, Zhixue Zhao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.17654
Pdf URL: https://arxiv.org/pdf/2501.17654
Copy Paste: [[2501.17654]] Exploring Vision Language Models for Multimodal and Multilingual Stance Detection(https://arxiv.org/abs/2501.17654)
Keywords: language model
Abstract: Social media's global reach amplifies the spread of information, highlighting the need for robust Natural Language Processing tasks like stance detection across languages and modalities. Prior research predominantly focuses on text-only inputs, leaving multimodal scenarios, such as those involving both images and text, relatively underexplored. Meanwhile, the prevalence of multimodal posts has increased significantly in recent years. Although state-of-the-art Vision-Language Models (VLMs) show promise, their performance on multimodal and multilingual stance detection tasks remains largely unexamined. This paper evaluates state-of-the-art VLMs on a newly extended dataset covering seven languages and multimodal inputs, investigating their use of visual cues, language-specific performance, and cross-modality interactions. Our results show that VLMs generally rely more on text than images for stance detection and this trend persists across languages. Additionally, VLMs rely significantly more on text contained within the images than other visual content. Regarding multilinguality, the models studied tend to generate consistent predictions across languages whether they are explicitly multilingual or not, although there are outliers that are incongruous with macro F1, language support, and model size.
摘要：社交媒体的全球影响力扩大了信息的传播，凸显了对强大的自然语言处理任务的需求，例如跨语言和模态的立场检测。先前的研究主要集中在纯文本输入上，而多模态场景（例如同时涉及图像和文本的场景）相对较少被探索。与此同时，近年来多模态帖子的流行度显著增加。尽管最先进的视觉语言模型 (VLM) 前景光明，但它们在多模态和多语言立场检测任务上的表现仍未得到很大程度的检验。本文在涵盖七种语言和多模态输入的新扩展数据集上评估了最先进的 VLM，研究了它们对视觉提示的使用、特定于语言的性能以及跨模态交互。我们的结果表明，VLM 通常更依赖文本而不是图像来进行立场检测，并且这种趋势在各种语言中都存在。此外，与其他视觉内容相比，VLM 更依赖图像中包含的文本。关于多语言性，所研究的模型倾向于跨语言生成一致的预测，无论这些语言是否明确为多语言，尽管存在与宏 F1、语言支持和模型大小不一致的异常值。

Title: Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate

Authors: Yubo Wang, Xiang Yue, Wenhu Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.17703
Pdf URL: https://arxiv.org/pdf/2501.17703
Copy Paste: [[2501.17703]] Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate(https://arxiv.org/abs/2501.17703)
Keywords: language model, gpt
Abstract: Supervised Fine-Tuning (SFT) is commonly used to train language models to imitate annotated responses for given instructions. In this paper, we challenge this paradigm and propose Critique Fine-Tuning (CFT), a strategy where models learn to critique noisy responses rather than simply imitate correct ones. Inspired by human learning processes that emphasize critical thinking, CFT encourages deeper analysis and nuanced understanding, traits often overlooked by standard SFT. To validate the effectiveness of CFT, we construct a 50K-sample dataset from WebInstruct, using GPT-4o as the teacher to generate critiques in the form of (input=[query; noisy response], output=critique). CFT on this dataset yields a consistent 4-10% improvement over SFT on six math benchmarks with different base models like Qwen2.5, Qwen2.5-Math and DeepSeek-Math. We further expand to MetaMath and NuminaMath datasets and observe similar gains over SFT. Notably, our Qwen2.5-Math-CFT model, trained on just 50K samples, matches or outperforms competitive models such as AceMath and Qwen2.5-Math-Instruct, both of which use over 2M samples, on most benchmarks. Ablation studies show that CFT is robust to the source of noisy response and teacher critique model. Through these findings, we argue that critique-based training offers a more effective alternative to advance the reasoning of language models.
摘要：监督式微调 (SFT) 通常用于训练语言模型，以模仿给定指令的带注释响应。在本文中，我们挑战了这一范式，并提出了批判式微调 (CFT)，这是一种让模型学会批判嘈杂响应而不是简单地模仿正确响应的策略。受强调批判性思维的人类学习过程的启发，CFT 鼓励更深入的分析和细致的理解——这些特征经常被标准 SFT 所忽视。为了验证 CFT 的有效性，我们从 WebInstruct 构建了一个 50K 样本数据集，使用 GPT-4o 作为老师，以 (input=[query; noisy response], output=critique) 的形式生成批判。在这个数据集上，CFT 在六个数学基准测试中比 SFT 产生了 4-10% 的持续改进，这些基准测试使用了不同的基础模型，如 Qwen2.5、Qwen2.5-Math 和 DeepSeek-Math。我们进一步扩展到 MetaMath 和 NuminaMath 数据集，并观察到与 SFT 相比的类似收益。值得注意的是，我们的 Qwen2.5-Math-CFT 模型（仅使用 50K 个样本进行训练）在大多数基准测试中都与 AceMath 和 Qwen2.5-Math-Instruct 等竞争模型相当或优于它们，这两个模型都使用超过 2M 个样本。消融研究表明，CFT 对噪声响应源和教师批评模型具有很强的鲁棒性。通过这些发现，我们认为基于批评的训练提供了一种更有效的替代方法来推进语言模型的推理。

Title: RICoTA: Red-teaming of In-the-wild Conversation with Test Attempts

Authors: Eujeong Choi, Younghun Jeong, Soomin Kim, Won Ik Cho
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.17715
Pdf URL: https://arxiv.org/pdf/2501.17715
Copy Paste: [[2501.17715]] RICoTA: Red-teaming of In-the-wild Conversation with Test Attempts(https://arxiv.org/abs/2501.17715)
Keywords: language model, llm, prompt, chat, agent
Abstract: User interactions with conversational agents (CAs) evolve in the era of heavily guardrailed large language models (LLMs). As users push beyond programmed boundaries to explore and build relationships with these systems, there is a growing concern regarding the potential for unauthorized access or manipulation, commonly referred to as "jailbreaking." Moreover, with CAs that possess highly human-like qualities, users show a tendency toward initiating intimate sexual interactions or attempting to tame their chatbots. To capture and reflect these in-the-wild interactions into chatbot designs, we propose RICoTA, a Korean red teaming dataset that consists of 609 prompts challenging LLMs with in-the-wild user-made dialogues capturing jailbreak attempts. We utilize user-chatbot conversations that were self-posted on a Korean Reddit-like community, containing specific testing and gaming intentions with a social chatbot. With these prompts, we aim to evaluate LLMs' ability to identify the type of conversation and users' testing purposes to derive chatbot design implications for mitigating jailbreaking risks. Our dataset will be made publicly available via GitHub.
摘要：在大型语言模型 (LLM) 受到严格监管的时代，用户与对话代理 (CA) 的交互不断发展。随着用户突破编程界限，探索并与这些系统建立关系，人们越来越担心未经授权的访问或操纵（通常称为“越狱”）。此外，对于具有高度类似人类特质的 CA，用户表现出发起亲密性互动或试图驯服聊天机器人的倾向。为了捕捉这些野外互动并将其反映到聊天机器人设计中，我们提出了 RICoTA，这是一个韩国红队数据集，包含 609 个提示，通过野外用户制作的对话挑战 LLM，以捕捉越狱尝试。我们利用在韩国 Reddit 类社区上自行发布的用户聊天机器人对话，其中包含使用社交聊天机器人的特定测试和游戏意图。通过这些提示，我们旨在评估 LLM 识别对话类型和用户测试目的的能力，从而得出聊天机器人设计对减轻越狱风险的影响。我们的数据集将通过 GitHub 公开。

Title: Hybrid Graphs for Table-and-Text based Question Answering using LLMs

Authors: Ankush Agarwal, Ganesh S, Chaitanya Devaguptapu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.17767
Pdf URL: https://arxiv.org/pdf/2501.17767
Copy Paste: [[2501.17767]] Hybrid Graphs for Table-and-Text based Question Answering using LLMs(https://arxiv.org/abs/2501.17767)
Keywords: language model, gpt, llm
Abstract: Answering questions that require reasoning and aggregation across both structured (tables) and unstructured (raw text) data sources presents significant challenges. Current methods rely on fine-tuning and high-quality, human-curated data, which is difficult to obtain. Recent advances in Large Language Models (LLMs) have shown promising results for multi-hop question answering (QA) over single-source text data in a zero-shot setting, yet exploration into multi-source Table-Text QA remains limited. In this paper, we present a novel Hybrid Graph-based approach for Table-Text QA that leverages LLMs without fine-tuning. Our method constructs a unified Hybrid Graph from textual and tabular data, pruning information based on the input question to provide the LLM with relevant context concisely. We evaluate our approach on the challenging Hybrid-QA and OTT-QA datasets using state-of-the-art LLMs, including GPT-3.5, GPT-4, and LLaMA-3. Our method achieves the best zero-shot performance on both datasets, improving Exact Match scores by up to 10% on Hybrid-QA and 5.4% on OTT-QA. Moreover, our approach reduces token usage by up to 53% compared to the original context.
摘要：回答需要跨结构化（表格）和非结构化（原始文本）数据源进行推理和聚合的问题是一项艰巨的挑战。当前的方法依赖于微调和高质量的人工策划数据，而这些数据很难获得。大型语言模型 (LLM) 的最新进展已显示出在零样本设置下对单源文本数据的多跳问答 (QA) 的良好结果，但对多源表格文本 QA 的探索仍然有限。在本文中，我们提出了一种基于混合图的新型表格文本 QA 方法，该方法利用无需微调的 LLM。我们的方法从文本和表格数据构建统一的混合图，根据输入问题修剪信息，以简洁地为 LLM 提供相关上下文。我们使用最先进的 LLM（包括 GPT-3.5、GPT-4 和 LLaMA-3）在具有挑战性的混合 QA 和 OTT-QA 数据集上评估我们的方法。我们的方法在两个数据集上都实现了最佳的零样本性能，在 Hybrid-QA 上将精确匹配分数提高了 10%，在 OTT-QA 上将精确匹配分数提高了 5.4%。此外，与原始上下文相比，我们的方法最多可将 token 使用量减少 53%。

Title: 2SSP: A Two-Stage Framework for Structured Pruning of LLMs

Authors: Fabrizio Sandri, Elia Cunegatti, Giovanni Iacca
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2501.17771
Pdf URL: https://arxiv.org/pdf/2501.17771
Copy Paste: [[2501.17771]] 2SSP: A Two-Stage Framework for Structured Pruning of LLMs(https://arxiv.org/abs/2501.17771)
Keywords: language model, llm
Abstract: We propose a novel Two-Stage framework for Structured Pruning (2SSP) for pruning Large Language Models (LLMs), which combines two different strategies of pruning, namely Width and Depth Pruning. The first stage (Width Pruning) removes entire neurons, hence their corresponding rows and columns, aiming to preserve the connectivity among the pruned structures in the intermediate state of the Feed-Forward Networks in each Transformer block. This is done based on an importance score measuring the impact of each neuron over the output magnitude. The second stage (Depth Pruning), instead, removes entire Attention submodules. This is done by applying an iterative process that removes the Attention submodules with the minimum impact on a given metric of interest (in our case, perplexity). We also propose a novel mechanism to balance the sparsity rate of the two stages w.r.t. the desired global sparsity. We test 2SSP on four LLM families and three sparsity rates (25%, 37.5%, and 50%), measuring the resulting perplexity over three language modeling datasets as well as the performance over six downstream tasks. Our method consistently outperforms five state-of-the-art competitors over three language modeling and six downstream tasks, with an up to two-order-of-magnitude gain in terms of pruning time. The code is available at this https URL.
摘要：我们提出了一种新颖的结构化剪枝两阶段框架 (2SSP)，用于剪枝大型语言模型 (LLM)，它结合了两种不同的剪枝策略，即宽度剪枝和深度剪枝。第一阶段（宽度剪枝）删除整个神经元，从而删除它们对应的行和列，旨在保留每个 Transformer 块中前馈网络中间状态下剪枝结构之间的连通性。这是基于衡量每个神经元对输出幅度影响的重要性分数来完成的。第二阶段（深度剪枝）则删除整个注意力子模块。这是通过应用迭代过程来完成的，该过程删除对给定感兴趣指标（在我们的例子中是困惑度）影响最小的注意力子模块。我们还提出了一种新颖的机制来平衡两个阶段的稀疏率与所需的全局稀疏度。我们在四个 LLM 系列和三个稀疏率（25%、37.5% 和 50%）上测试了 2SSP，测量了三个语言建模数据集的困惑度以及六个下游任务的性能。我们的方法在三个语言建模数据集和六个下游任务中始终优于五个最先进的竞争对手，并且在剪枝时间方面最多可提高两个数量级。代码可在此 https URL 上找到。
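
The iterative Depth Pruning stage described in the abstract can be pictured with a minimal greedy sketch. This is not the authors' implementation: the function name `greedy_depth_prune`, the `metric` callback, and the toy cost table are all hypothetical, and the sketch only assumes that one Attention submodule is removed per iteration, chosen as the one whose removal yields the lowest value of the metric of interest (e.g. perplexity).

```python
from typing import Callable, FrozenSet, List


def greedy_depth_prune(
    num_blocks: int,
    num_to_remove: int,
    metric: Callable[[FrozenSet[int]], float],
) -> List[int]:
    """Greedily choose which attention submodules to remove.

    `metric(removed)` is assumed to return the metric of interest
    (lower is better, e.g. perplexity) for the model with the attention
    submodules listed in `removed` skipped.
    """
    removed: List[int] = []
    for _ in range(min(num_to_remove, num_blocks)):
        candidates = [b for b in range(num_blocks) if b not in removed]
        # Try removing each remaining submodule and keep the least harmful one.
        best = min(candidates, key=lambda b: metric(frozenset(removed + [b])))
        removed.append(best)
    return removed


# Hypothetical stand-in for perplexity: each block adds a fixed penalty
# when its attention submodule is removed. A real run would evaluate the
# pruned model on calibration data instead.
TOY_COST = [5.0, 0.5, 3.0, 0.1, 2.0]


def toy_perplexity(removed: FrozenSet[int]) -> float:
    return 10.0 + sum(TOY_COST[b] for b in removed)


chosen = greedy_depth_prune(num_blocks=5, num_to_remove=2, metric=toy_perplexity)
print(chosen)
```

With the toy penalties above, the loop first removes block 3 and then block 1, the two cheapest removals. Each iteration costs one metric evaluation per remaining block, so in practice the cost of `metric` dominates the pruning time.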

Title: Reasoning Over the Glyphs: Evaluation of LLM's Decipherment of Rare Scripts

Authors: Yu-Fei Shih, Zheng-Lin Lin, Shu-Kai Hsieh
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2501.17785
Pdf URL: https://arxiv.org/pdf/2501.17785
Copy Paste: [[2501.17785]] Reasoning Over the Glyphs: Evaluation of LLM's Decipherment of Rare Scripts(https://arxiv.org/abs/2501.17785)
Keywords: gpt, llm
Abstract: We explore the capabilities of LVLMs and LLMs in deciphering rare scripts not encoded in Unicode. We introduce a novel approach to construct a multimodal dataset of linguistic puzzles involving such scripts, utilizing a tokenization method for language glyphs. Our methods include the Picture Method for LVLMs and the Description Method for LLMs, enabling these models to tackle these challenges. We conduct experiments using prominent models, GPT-4o, Gemini, and Claude 3.5 Sonnet, on linguistic puzzles. Our findings reveal the strengths and limitations of current AI methods in linguistic decipherment, highlighting the impact of Unicode encoding on model performance and the challenges of modeling visual language tokens through descriptions. Our study advances understanding of AI's potential in linguistic decipherment and underscores the need for further research.
摘要：我们探索了 LVLM 和 LLM 在解密未使用 Unicode 编码的稀有文字方面的能力。我们介绍了一种新方法，利用语言字形的标记化方法，构建涉及此类文字的语言谜题的多模态数据集。我们的方法包括 LVLM 的图片方法和 LLM 的描述方法，使这些模型能够应对这些挑战。我们使用著名模型 GPT-4o、Gemini 和 Claude 3.5 Sonnet 对语言谜题进行实验。我们的研究结果揭示了当前 AI 方法在语言解密方面的优势和局限性，突出了 Unicode 编码对模型性能的影响以及通过描述建模视觉语言标记的挑战。我们的研究加深了对 AI 在语言解密方面的潜力的理解，并强调了进一步研究的必要性。

Title: BreezyVoice: Adapting TTS for Taiwanese Mandarin with Enhanced Polyphone Disambiguation -- Challenges and Insights

Authors: Chan-Jan Hsu, Yi-Cheng Lin, Chia-Chun Lin, Wei-Chih Chen, Ho Lam Chung, Chen-An Li, Yi-Chang Chen, Chien-Yu Yu, Ming-Ji Lee, Chien-Cheng Chen, Ru-Heng Huang, Hung-yi Lee, Da-Shan Shiu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.17790
Pdf URL: https://arxiv.org/pdf/2501.17790
Copy Paste: [[2501.17790]] BreezyVoice: Adapting TTS for Taiwanese Mandarin with Enhanced Polyphone Disambiguation -- Challenges and Insights(https://arxiv.org/abs/2501.17790)
Keywords: language model, llm
Abstract: We present BreezyVoice, a Text-to-Speech (TTS) system specifically adapted for Taiwanese Mandarin, highlighting phonetic control abilities to address the unique challenges of polyphone disambiguation in the language. Building upon CosyVoice, we incorporate a $S^{3}$ tokenizer, a large language model (LLM), an optimal-transport conditional flow matching model (OT-CFM), and a grapheme to phoneme prediction model, to generate realistic speech that closely mimics human utterances. Our evaluation demonstrates BreezyVoice's superior performance in both general and code-switching contexts, highlighting its robustness and effectiveness in generating high-fidelity speech. Additionally, we address the challenges of generalizability in modeling long-tail speakers and polyphone disambiguation. Our approach significantly enhances performance and offers valuable insights into the workings of neural codec TTS systems.
摘要：我们推出了 BreezyVoice，这是一款专门针对台湾普通话而改编的文本转语音 (TTS) 系统，其突出的语音控制能力可以解决该语言中多音字消歧的独特挑战。在 CosyVoice 的基础上，我们结合了 $S^{3}$ 标记器、大型语言模型 (LLM)、最佳传输条件流匹配模型 (OT-CFM) 和字素到音素预测模型，以生成与人类话语非常相似的逼真语音。我们的评估表明 BreezyVoice 在一般和语码转换环境中均表现出色，突出了其在生成高保真语音方面的稳健性和有效性。此外，我们还解决了建模长尾说话者和多音字消歧的通用性挑战。我们的方法显著提高了性能，并为神经编解码器 TTS 系统的工作原理提供了宝贵的见解。

Title: Learning Beyond the Surface: How Far Can Continual Pre-Training with LoRA Enhance LLMs' Domain-Specific Insight Learning?

Authors: Pouya Pezeshkpour, Estevam Hruschka
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2501.17840
Pdf URL: https://arxiv.org/pdf/2501.17840
Copy Paste: [[2501.17840]] Learning Beyond the Surface: How Far Can Continual Pre-Training with LoRA Enhance LLMs' Domain-Specific Insight Learning?(https://arxiv.org/abs/2501.17840)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have demonstrated remarkable performance on various tasks, yet their ability to extract and internalize deeper insights from domain-specific datasets remains underexplored. In this study, we investigate how continual pre-training can enhance LLMs' capacity for insight learning across three distinct forms: declarative, statistical, and probabilistic insights. Focusing on two critical domains: medicine and finance, we employ LoRA to train LLMs on two existing datasets. To evaluate each insight type, we create benchmarks to measure how well continual pre-training helps models go beyond surface-level knowledge. We also assess the impact of document modification on capturing insights. The results show that, while continual pre-training on original documents has a marginal effect, modifying documents to retain only essential information significantly enhances the insight-learning capabilities of LLMs.
摘要：大型语言模型 (LLM) 在各种任务上都表现出色，但它们从特定领域数据集中提取和内化更深层次洞察的能力仍未得到充分探索。在本研究中，我们调查了持续的预训练如何增强 LLM 在三种不同形式中的洞察学习能力：声明性、统计性和概率性洞察。我们专注于两个关键领域：医学和金融，使用 LoRA 在两个现有数据集上训练 LLM。为了评估每种洞察类型，我们创建了基准来衡量持续预训练如何帮助模型超越表面知识。我们还评估了文档修改对获取洞察的影响。结果表明，虽然对原始文档进行持续的预训练效果不大，但修改文档以仅保留基本信息可显著增强 LLM 的洞察学习能力。

Title: Improving Your Model Ranking on Chatbot Arena by Vote Rigging

Authors: Rui Min, Tianyu Pang, Chao Du, Qian Liu, Minhao Cheng, Min Lin
Subjects: cs.CL, cs.AI, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2501.17858
Pdf URL: https://arxiv.org/pdf/2501.17858
Copy Paste: [[2501.17858]] Improving Your Model Ranking on Chatbot Arena by Vote Rigging(https://arxiv.org/abs/2501.17858)
Keywords: llm, chat
Abstract: Chatbot Arena is a popular platform for evaluating LLMs by pairwise battles, where users vote for their preferred response from two randomly sampled anonymous models. While Chatbot Arena is widely regarded as a reliable LLM ranking leaderboard, we show that crowdsourced voting can be rigged to improve (or decrease) the ranking of a target model $m_{t}$. We first introduce a straightforward target-only rigging strategy that focuses on new battles involving $m_{t}$, identifying it via watermarking or a binary classifier, and exclusively voting for $m_{t}$ wins. However, this strategy is practically inefficient because there are over $190$ models on Chatbot Arena and on average only about $1\%$ of new battles will involve $m_{t}$. To overcome this, we propose omnipresent rigging strategies, exploiting the Elo rating mechanism of Chatbot Arena that any new vote on a battle can influence the ranking of the target model $m_{t}$, even if $m_{t}$ is not directly involved in the battle. We conduct experiments on around $1.7$ million historical votes from the Chatbot Arena Notebook, showing that omnipresent rigging strategies can improve model rankings by rigging only hundreds of new votes. While we have evaluated several defense mechanisms, our findings highlight the importance of continued efforts to prevent vote rigging. Our code is available at this https URL.
摘要：Chatbot Arena 是一个流行的平台，通过成对战斗来评估 LLM，用户从两个随机抽样的匿名模型中投票选出他们喜欢的答案。虽然 Chatbot Arena 被广泛认为是一个可靠的 LLM 排名排行榜，但我们表明，众包投票可以被操纵以提高（或降低）目标模型 $m_{t}$ 的排名。我们首先介绍一种简单的仅针对目标的操纵策略，该策略专注于涉及 $m_{t}$ 的新战斗，通过水印或二元分类器识别它，并专门投票给 $m_{t}$ 获胜。然而，这种策略实际上效率低下，因为 Chatbot Arena 上有超过 $190$ 个模型，平均只有大约 $1\%$ 的新战斗会涉及 $m_{t}$。为了克服这个问题，我们提出了无处不在的操纵策略，利用 Chatbot Arena 的 Elo 评级机制，任何对战斗的新投票都会影响目标模型 $m_{t}$ 的排名，即使 $m_{t}$ 不直接参与战斗。我们对 Chatbot Arena Notebook 中约 $1.7$ 百万条历史投票进行了实验，结果表明无处不在的操纵策略只需操纵数百张新投票即可提高模型排名。虽然我们已经评估了几种防御机制，但我们的研究结果强调了继续努力防止投票操纵的重要性。我们的代码可在此 https URL 上找到。
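
The claim that a vote on a battle not involving $m_{t}$ can still move its ranking follows from how Elo-style ratings work: a vote changes the ratings of the two participants, and rankings are relative. The sketch below illustrates this with the textbook online Elo update. The K-factor of 32, the model names, and the starting ratings are assumptions for illustration only; the leaderboard's actual rating computation and the paper's rigging strategies may differ.

```python
from typing import Dict


def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expected score of a player rated r_a against r_b."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))


def apply_vote(
    ratings: Dict[str, float], winner: str, loser: str, k: float = 32.0
) -> Dict[str, float]:
    """Return new ratings after one battle; only the two participants change."""
    e_w = expected_score(ratings[winner], ratings[loser])
    delta = k * (1.0 - e_w)  # points gained by the winner, lost by the loser
    updated = dict(ratings)
    updated[winner] += delta
    updated[loser] -= delta
    return updated


def rank_of(ratings: Dict[str, float], model: str) -> int:
    """Rank 1 is best; ties share the better rank."""
    return 1 + sum(1 for r in ratings.values() if r > ratings[model])


# Hypothetical ratings: the target m_t sits just below a rival.
before = {"m_t": 1000.0, "rival": 1005.0, "other": 900.0}
# A single vote in a battle that does not involve m_t at all.
after = apply_vote(before, winner="other", loser="rival")

print(rank_of(before, "m_t"), rank_of(after, "m_t"))
```

In this toy setup the rating of `m_t` never changes, yet it moves from rank 2 to rank 1 because the rival just above it loses points in a battle against a third model.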

Title: Dialogue is Better Than Monologue: Instructing Medical LLMs via Strategical Conversations

Authors: Zijie Liu, Xinyu Zhao, Jie Peng, Zhuangdi Zhu, Qingyu Chen, Xia Hu, Tianlong Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.17860
Pdf URL: https://arxiv.org/pdf/2501.17860
Copy Paste: [[2501.17860]] Dialogue is Better Than Monologue: Instructing Medical LLMs via Strategical Conversations(https://arxiv.org/abs/2501.17860)
Keywords: llm
Abstract: Current medical AI systems often fail to replicate real-world clinical reasoning, as they are predominantly trained and evaluated on static text and question-answer tasks. These tuning methods and benchmarks overlook critical aspects like evidence-based reasoning and handling distracting information. To bridge this gap, we introduce a novel benchmark that simulates real-world diagnostic scenarios, integrating noise and difficulty levels aligned with USMLE standards. Moreover, we explore dialogue-based fine-tuning, which transforms static datasets into conversational formats to better capture iterative reasoning processes. Experiments show that dialogue-tuned models outperform traditional methods, with improvements of $9.64\%$ in multi-round reasoning scenarios and $6.18\%$ in accuracy in a noisy environment. Our findings highlight dialogue tuning as a promising approach for advancing clinically aligned and robust medical AI systems.
摘要：当前的医疗 AI 系统通常无法复制现实世界的临床推理，因为它们主要在静态文本和问答任务上进行训练和评估。这些调整方法和基准忽略了关键方面，例如基于证据的推理和处理分散注意力的信息。为了弥补这一差距，我们引入了一个模拟现实世界诊断场景的新基准，整合了与 USMLE 标准一致的噪音和难度级别。此外，我们还探索了基于对话的微调，将静态数据集转换为对话格式，以更好地捕捉迭代推理过程。实验表明，对话调整模型优于传统方法，在多轮推理场景中提高了 $9.64\%$，在嘈杂环境中的准确率提高了 $6.18\%$。我们的研究结果表明，对话调整是一种有前途的方法，可以推进符合临床要求且强大的医疗 AI 系统。