2025-03-26

Title: SRMIR: Shadow Reward Models Based on Introspective Reasoning for LLM Alignment

Authors: Ruoxi Cheng, Shuirong Cao
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.18991
Pdf URL: https://arxiv.org/pdf/2503.18991
Copy Paste: [[2503.18991]] SRMIR: Shadow Reward Models Based on Introspective Reasoning for LLM Alignment(https://arxiv.org/abs/2503.18991)
Keywords: language model, llm, prompt
Abstract: Aligning large language models (LLMs) with human preferences and values is vital for their application. However, current alignment methods face three main limitations: (1) reliance on costly human annotation; (2) alignment tax; (3) shallow alignment vulnerable to jailbreak attacks. Additionally, current alignment datasets often suffer from uneven distributions, leading to overrepresentation of some topics and neglect of others. To address these issues, we propose SRMIR (Shadow Reward Models Based on Introspective Reasoning), inspired by shadow models in membership inference attacks. We first construct a balanced safety Chain of Draft (CoD) dataset across 7 harmful types using structured prompts that leverage the introspective reasoning capabilities of LLMs, then train a set of specialized reward models to guide policy optimization through Group Relative Policy Optimization (GRPO). We apply two strategies, linear combination and a categorized approach, to integrate shadow reward models for policy optimization. Comparing the two, we find that the latter achieves superior alignment despite higher computational costs. Experiments across several LLMs demonstrate that SRMIR significantly outperforms existing methods.
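
The "linear combination" strategy named in the SRMIR abstract can be sketched as below. This is a minimal illustration, not the paper's implementation: it assumes each shadow reward model returns one scalar score per sampled response, and it uses the group-relative normalization commonly associated with GRPO (reward minus group mean, divided by group standard deviation). The function names, weights, and scores are hypothetical.

```python
from statistics import mean, pstdev


def combine_rewards(scores_per_model, weights):
    """Linearly combine per-model scores into one reward per response.

    scores_per_model[m][i] is the (hypothetical) score that shadow reward
    model m assigns to sampled response i.
    """
    n_responses = len(scores_per_model[0])
    return [
        sum(w * scores[i] for w, scores in zip(weights, scores_per_model))
        for i in range(n_responses)
    ]


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Two hypothetical shadow reward models scoring four sampled responses.
scores = [[0.2, 0.8, 0.5, 0.5], [0.4, 0.6, 0.1, 0.9]]
weights = [0.5, 0.5]  # assumed equal weighting
rewards = combine_rewards(scores, weights)
advantages = group_relative_advantages(rewards)
```

Responses whose combined reward is above the group mean receive a positive advantage, the rest a negative one. The categorized approach from the abstract would instead route each prompt to the reward model for its harm category; that routing is not shown here.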

Title: LookAhead Tuning: Safer Language Models via Partial Answer Previews

Authors: Kangwei Liu, Mengru Wang, Yujie Luo, Lin Yuan, Mengshu Sun, Ningyu Zhang, Lei Liang, Zhiqiang Zhang, Jun Zhou, Huajun Chen
Subjects: cs.CL, cs.AI, cs.CV, cs.LG, cs.MM
Abstract URL: https://arxiv.org/abs/2503.19041
Pdf URL: https://arxiv.org/pdf/2503.19041
Copy Paste: [[2503.19041]] LookAhead Tuning: Safer Language Models via Partial Answer Previews(https://arxiv.org/abs/2503.19041)
Keywords: language model, llm
Abstract: Fine-tuning enables large language models (LLMs) to adapt to specific domains, but often undermines their previously established safety alignment. To mitigate the degradation of model safety during fine-tuning, we introduce LookAhead Tuning, which comprises two simple, low-resource, and effective data-driven methods that modify training data by previewing partial answer prefixes. Both methods aim to preserve the model's inherent safety mechanisms by minimizing perturbations to initial token distributions. Comprehensive experiments demonstrate that LookAhead Tuning effectively maintains model safety without sacrificing robust performance on downstream tasks. Our findings position LookAhead Tuning as a reliable and efficient solution for the safe and effective adaptation of LLMs. Code is released at this https URL.

Title: LLM-Based Insight Extraction for Contact Center Analytics and Cost-Efficient Deployment

Authors: Varsha Embar, Ritvik Shrivastava, Vinay Damodaran, Travis Mehlinger, Yu-Chung Hsiao, Karthik Raghunathan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.19090
Pdf URL: https://arxiv.org/pdf/2503.19090
Copy Paste: [[2503.19090]] LLM-Based Insight Extraction for Contact Center Analytics and Cost-Efficient Deployment(https://arxiv.org/abs/2503.19090)
Keywords: language model, llm, agent
Abstract: Large Language Models have transformed the Contact Center industry, manifesting in enhanced self-service tools, streamlined administrative processes, and augmented agent productivity. This paper delineates our system that automates call driver generation, which serves as the foundation for tasks such as topic modeling, incoming call classification, trend detection, and FAQ generation, delivering actionable insights for contact center agents and administrators to consume. We present a cost-efficient LLM system design, with 1) a comprehensive evaluation of proprietary, open-weight, and fine-tuned models, 2) cost-efficient strategies, and 3) the corresponding cost analysis when deployed in production environments.

Title: Masks and Mimicry: Strategic Obfuscation and Impersonation Attacks on Authorship Verification

Authors: Kenneth Alperin, Rohan Leekha, Adaku Uchendu, Trang Nguyen, Srilakshmi Medarametla, Carlos Levya Capote, Seth Aycock, Charlie Dagli
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.19099
Pdf URL: https://arxiv.org/pdf/2503.19099
Copy Paste: [[2503.19099]] Masks and Mimicry: Strategic Obfuscation and Impersonation Attacks on Authorship Verification(https://arxiv.org/abs/2503.19099)
Keywords: language model, llm
Abstract: The increasing use of Artificial Intelligence (AI) technologies, such as Large Language Models (LLMs), has led to nontrivial improvements in various tasks, including accurate authorship identification of documents. However, while LLMs improve such defense techniques, they also simultaneously provide a vehicle for malicious actors to launch new attack vectors. To combat this security risk, we evaluate the adversarial robustness of authorship models (specifically an authorship verification model) to potent LLM-based attacks. These attacks include untargeted methods (authorship obfuscation) and targeted methods (authorship impersonation). The objective of these attacks is to mask or mimic, respectively, the writing style of an author while preserving the original texts' semantics. Thus, we perturb an accurate authorship verification model, and achieve maximum attack success rates of 92% and 78% for obfuscation and impersonation attacks, respectively.

Title: Understanding and Improving Information Preservation in Prompt Compression for LLMs

Authors: Weronika Łajewska, Momchil Hardalov, Laura Aina, Neha Anna John, Hang Su, Lluís Màrquez
Subjects: cs.CL, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2503.19114
Pdf URL: https://arxiv.org/pdf/2503.19114
Copy Paste: [[2503.19114]] Understanding and Improving Information Preservation in Prompt Compression for LLMs(https://arxiv.org/abs/2503.19114)
Keywords: language model, llm, prompt
Abstract: Recent advancements in large language models (LLMs) have enabled their successful application to a broad range of tasks. However, in information-intensive tasks, the prompt length can grow fast, leading to increased computational requirements, performance degradation, and induced biases from irrelevant or redundant information. Recently, various prompt compression techniques have been introduced to optimize the trade-off between reducing input length and retaining performance. We propose a holistic evaluation framework that allows for in-depth analysis of prompt compression methods. We focus on three key aspects, besides compression ratio: (i) downstream task performance, (ii) grounding in the input context, and (iii) information preservation. Through this framework, we investigate state-of-the-art soft and hard compression methods, showing that they struggle to preserve key details from the original prompt, limiting their performance on complex tasks. We demonstrate that modifying soft prompting methods to better control the granularity of the compressed information can significantly improve their effectiveness: up to +23% in downstream task performance, more than +8 BERTScore points in grounding, and 2.7x more entities preserved in compression.

Title: Where is this coming from? Making groundedness count in the evaluation of Document VQA models

Authors: Armineh Nourbakhsh, Siddharth Parekh, Pranav Shetty, Zhao Jin, Sameena Shah, Carolyn Rose
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.19120
Pdf URL: https://arxiv.org/pdf/2503.19120
Copy Paste: [[2503.19120]] Where is this coming from? Making groundedness count in the evaluation of Document VQA models(https://arxiv.org/abs/2503.19120)
Keywords: hallucination
Abstract: Document Visual Question Answering (VQA) models have evolved at an impressive rate over the past few years, coming close to or matching human performance on some benchmarks. We argue that common evaluation metrics used by popular benchmarks do not account for the semantic and multimodal groundedness of a model's outputs. As a result, hallucinations and major semantic errors are treated the same way as well-grounded outputs, and the evaluation scores do not reflect the reasoning capabilities of the model. In response, we propose a new evaluation methodology that accounts for the groundedness of predictions with regard to the semantic characteristics of the output as well as the multimodal placement of the output within the input document. Our proposed methodology is parameterized in such a way that users can configure the score according to their preferences. We validate our scoring methodology using human judgment and show its potential impact on existing popular leaderboards. Through extensive analyses, we demonstrate that our proposed method produces scores that are a better indicator of a model's robustness and tends to give higher rewards to better-calibrated answers.

Title: Overcoming Vocabulary Mismatch: Vocabulary-agnostic Teacher Guided Language Modeling

Authors: Haebin Shin, Lei Ji, Xiao Liu, Yeyun Gong
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.19123
Pdf URL: https://arxiv.org/pdf/2503.19123
Copy Paste: [[2503.19123]] Overcoming Vocabulary Mismatch: Vocabulary-agnostic Teacher Guided Language Modeling(https://arxiv.org/abs/2503.19123)
Keywords: language model
Abstract: Using large teacher models to guide the training of smaller student models has become the prevailing paradigm for efficient and effective learning. However, vocabulary mismatches between teacher and student language models pose significant challenges in language modeling, resulting in divergent token sequences and output distributions. To overcome these limitations, we propose Vocabulary-agnostic Teacher Guided Language Modeling (VocAgnoLM), a novel approach that bridges the gap caused by vocabulary mismatch through two key methods: (1) Token-level Lexical Alignment, which aligns token sequences across mismatched vocabularies, and (2) Teacher Guided Loss, which leverages the teacher model's loss to guide effective student training. We demonstrate its effectiveness in language modeling with a 1B student model using various 7B teacher models with different vocabularies. Notably, with Qwen2.5-Math-Instruct, a teacher model sharing only about 6% of its vocabulary with TinyLlama, VocAgnoLM achieves a 46% performance improvement compared to naive continual pretraining. Furthermore, we demonstrate that VocAgnoLM consistently benefits from stronger teacher models, providing a robust solution to vocabulary mismatches in language modeling.
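
One plausible way to align token sequences across mismatched vocabularies is to map tokens through the character spans they cover in the shared text. The sketch below illustrates that idea only; the abstract does not specify how VocAgnoLM's Token-level Lexical Alignment works, so the overlap rule, the function names, and the example tokenizations are all assumptions.

```python
def char_spans(tokens):
    """Return the (start, end) character span of each token in the joined text."""
    spans, pos = [], 0
    for tok in tokens:
        spans.append((pos, pos + len(tok)))
        pos += len(tok)
    return spans


def align(student_tokens, teacher_tokens):
    """For each student token, list the teacher tokens whose spans overlap it."""
    s_spans = char_spans(student_tokens)
    t_spans = char_spans(teacher_tokens)
    return [
        [k for k, (t_start, t_end) in enumerate(t_spans)
         if t_start < s_end and s_start < t_end]
        for s_start, s_end in s_spans
    ]


# Hypothetical tokenizations of the same word by two different vocabularies.
student = ["un", "believ", "able"]
teacher = ["unbe", "liev", "able"]
mapping = align(student, teacher)
```

Here the second student token overlaps two teacher tokens, so a teacher-side quantity such as per-token loss would have to be aggregated over both before it could guide the student.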

Title: MIRAGE: Multimodal Immersive Reasoning and Guided Exploration for Red-Team Jailbreak Attacks

Authors: Wenhao You, Bryan Hooi, Yiwei Wang, Youke Wang, Zong Ke, Ming-Hsuan Yang, Zi Huang, Yujun Cai
Subjects: cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2503.19134
Pdf URL: https://arxiv.org/pdf/2503.19134
Copy Paste: [[2503.19134]] MIRAGE: Multimodal Immersive Reasoning and Guided Exploration for Red-Team Jailbreak Attacks(https://arxiv.org/abs/2503.19134)
Keywords: language model, llm
Abstract: While safety mechanisms have significantly progressed in filtering harmful text inputs, MLLMs remain vulnerable to multimodal jailbreaks that exploit their cross-modal reasoning capabilities. We present MIRAGE, a novel multimodal jailbreak framework that exploits narrative-driven context and role immersion to circumvent safety mechanisms in Multimodal Large Language Models (MLLMs). By systematically decomposing the toxic query into environment, role, and action triplets, MIRAGE constructs a multi-turn visual storytelling sequence of images and text using Stable Diffusion, guiding the target model through an engaging detective narrative. This process progressively lowers the model's defences and subtly guides its reasoning through structured contextual cues, ultimately eliciting harmful responses. In extensive experiments on the selected datasets with six mainstream MLLMs, MIRAGE achieves state-of-the-art performance, improving attack success rates by up to 17.5% over the best baselines. Moreover, we demonstrate that role immersion and structured semantic reconstruction can activate inherent model biases, facilitating the model's spontaneous violation of ethical safeguards. These results highlight critical weaknesses in current multimodal safety mechanisms and underscore the urgent need for more robust defences against cross-modal threats.

Title: Language Model Uncertainty Quantification with Attention Chain

Authors: Yinghao Li, Rushi Qiang, Lama Moukheiber, Chao Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.19168
Pdf URL: https://arxiv.org/pdf/2503.19168
Copy Paste: [[2503.19168]] Language Model Uncertainty Quantification with Attention Chain(https://arxiv.org/abs/2503.19168)
Keywords: language model, llm
Abstract: Accurately quantifying a large language model's (LLM) predictive uncertainty is crucial for judging the reliability of its answers. While most existing research focuses on short, directly answerable questions with closed-form outputs (e.g., multiple-choice), involving intermediate reasoning steps in LLM responses is increasingly important. This added complexity complicates uncertainty quantification (UQ) because the probabilities assigned to answer tokens are conditioned on a vast space of preceding reasoning tokens. Direct marginalization is infeasible, and the dependency inflates probability estimates, causing overconfidence in UQ. To address this, we propose UQAC, an efficient method that narrows the reasoning space to a tractable size for marginalization. UQAC iteratively constructs an "attention chain" of tokens deemed "semantically crucial" to the final answer via a backtracking procedure. Starting from the answer tokens, it uses attention weights to identify the most influential predecessors, then iterates this process until reaching the input tokens. Similarity filtering and probability thresholding further refine the resulting chain, allowing us to approximate the marginal probabilities of the answer tokens, which serve as the LLM's confidence. We validate UQAC on multiple reasoning benchmarks with advanced open-source LLMs, demonstrating that it consistently delivers reliable UQ estimates with high computational efficiency.
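
The backtracking step described in the UQAC abstract (start from the answer tokens, follow the strongest attention weights to predecessors, repeat until the input tokens are reached) can be sketched as follows. This is a simplified illustration under stated assumptions: a single aggregated attention matrix given as nested lists, a fixed `top_k` per step, and no similarity filtering or probability thresholding. All names and numbers are hypothetical.

```python
def attention_chain(attn, answer_idx, n_input, top_k=1):
    """Backtrack from answer tokens through the most-attended predecessors.

    attn[i][j] is the attention weight from token i to an earlier token j
    (so attn[i] has length i). Tokens with index < n_input are the prompt.
    Returns the sorted indices of reasoning tokens on the chain.
    """
    visited = set()
    frontier = list(answer_idx)
    while frontier:
        next_frontier = []
        for i in frontier:
            if i < n_input:  # reached the input tokens: stop backtracking here
                continue
            preds = sorted(range(i), key=lambda j: attn[i][j], reverse=True)[:top_k]
            for j in preds:
                if j not in visited:
                    visited.add(j)
                    next_frontier.append(j)
        frontier = next_frontier
    answers = set(answer_idx)
    return sorted(j for j in visited if j >= n_input and j not in answers)


# Toy sequence: tokens 0-1 are input, 2-4 are reasoning, 5 is the answer.
attn = [
    [],
    [1.0],
    [0.2, 0.8],
    [0.3, 0.3, 0.4],
    [0.1, 0.1, 0.7, 0.1],
    [0.0, 0.1, 0.1, 0.1, 0.7],
]
chain = attention_chain(attn, answer_idx=[5], n_input=2)
```

In this toy case the chain keeps reasoning tokens 2 and 4 and skips token 3, which shrinks the set of tokens that a later marginalization step would need to consider.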

Title: Evaluating Bias in LLMs for Job-Resume Matching: Gender, Race, and Education

Authors: Hayate Iso, Pouya Pezeshkpour, Nikita Bhutani, Estevam Hruschka
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.19182
Pdf URL: https://arxiv.org/pdf/2503.19182
Copy Paste: [[2503.19182]] Evaluating Bias in LLMs for Job-Resume Matching: Gender, Race, and Education(https://arxiv.org/abs/2503.19182)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) offer the potential to automate hiring by matching job descriptions with candidate resumes, streamlining recruitment processes, and reducing operational costs. However, biases inherent in these models may lead to unfair hiring practices, reinforcing societal prejudices and undermining workplace diversity. This study examines the performance and fairness of LLMs in job-resume matching tasks within the English language and U.S. context. It evaluates how factors such as gender, race, and educational background influence model decisions, providing critical insights into the fairness and reliability of LLMs in HR applications. Our findings indicate that while recent models have reduced biases related to explicit attributes like gender and race, implicit biases concerning educational background remain significant. These results highlight the need for ongoing evaluation and the development of advanced bias mitigation strategies to ensure equitable hiring practices when using LLMs in industry settings.

Title: Overtrained Language Models Are Harder to Fine-Tune

Authors: Jacob Mitchell Springer, Sachin Goyal, Kaiyue Wen, Tanishq Kumar, Xiang Yue, Sadhika Malladi, Graham Neubig, Aditi Raghunathan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.19206
Pdf URL: https://arxiv.org/pdf/2503.19206
Copy Paste: [[2503.19206]] Overtrained Language Models Are Harder to Fine-Tune(https://arxiv.org/abs/2503.19206)
Keywords: language model, llm
Abstract: Large language models are pre-trained on ever-growing token budgets under the assumption that better pre-training performance translates to improved downstream models. In this work, we challenge this assumption and show that extended pre-training can make models harder to fine-tune, leading to degraded final performance. We term this phenomenon catastrophic overtraining. For example, the instruction-tuned OLMo-1B model pre-trained on 3T tokens leads to over 2% worse performance on multiple standard LLM benchmarks than its 2.3T token counterpart. Through controlled experiments and theoretical analysis, we show that catastrophic overtraining arises from a systematic increase in the broad sensitivity of pre-trained parameters to modifications, including but not limited to fine-tuning. Our findings call for a critical reassessment of pre-training design that considers the downstream adaptability of the model.

Title: A Survey of Large Language Model Agents for Question Answering

Authors: Murong Yue
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2503.19213
Pdf URL: https://arxiv.org/pdf/2503.19213
Copy Paste: [[2503.19213]] A Survey of Large Language Model Agents for Question Answering(https://arxiv.org/abs/2503.19213)
Keywords: language model, llm, agent
Abstract: This paper surveys the development of large language model (LLM)-based agents for question answering (QA). Traditional agents face significant limitations, including substantial data requirements and difficulty in generalizing to new environments. LLM-based agents address these challenges by leveraging LLMs as their core reasoning engine. These agents achieve superior QA results compared to traditional QA pipelines and naive LLM QA systems by enabling interaction with external environments. We systematically review the design of LLM agents in the context of QA tasks, organizing our discussion across key stages: planning, question understanding, information retrieval, and answer generation. Additionally, this paper identifies ongoing challenges and explores future research directions to enhance the performance of LLM agent QA systems.

Title: SCI-IDEA: Context-Aware Scientific Ideation Using Token and Sentence Embeddings

Authors: Farhana Keya, Gollam Rabby, Prasenjit Mitra, Sahar Vahdati, Sören Auer, Yaser Jaradeh
Subjects: cs.CL, cs.DL
Abstract URL: https://arxiv.org/abs/2503.19257
Pdf URL: https://arxiv.org/pdf/2503.19257
Copy Paste: [[2503.19257]] SCI-IDEA: Context-Aware Scientific Ideation Using Token and Sentence Embeddings(https://arxiv.org/abs/2503.19257)
Keywords: language model, gpt, llm, prompt, chain-of-thought
Abstract: Every scientific discovery starts with an idea inspired by prior work, interdisciplinary concepts, and emerging challenges. Recent advancements in large language models (LLMs) trained on scientific corpora have driven interest in AI-supported idea generation. However, generating context-aware, high-quality, and innovative ideas remains challenging. We introduce SCI-IDEA, a framework that uses LLM prompting strategies and Aha Moment detection for iterative idea refinement. SCI-IDEA extracts essential facets from research publications, assessing generated ideas on novelty, excitement, feasibility, and effectiveness. Comprehensive experiments validate SCI-IDEA's effectiveness, achieving average scores of 6.84, 6.86, 6.89, and 6.84 (on a 1-10 scale) across novelty, excitement, feasibility, and effectiveness, respectively. Evaluations employed GPT-4o, GPT-4.5, DeepSeek-32B (each under 2-shot prompting), and DeepSeek-70B (3-shot prompting), with token-level embeddings used for Aha Moment detection. Similarly, it achieves scores of 6.87, 6.86, 6.83, and 6.87 using GPT-4o under 5-shot prompting, GPT-4.5 under 3-shot prompting, DeepSeek-32B under zero-shot chain-of-thought prompting, and DeepSeek-70B under 5-shot prompting with sentence-level embeddings. We also address ethical considerations such as intellectual credit, potential misuse, and balancing human creativity with AI-driven ideation. Our results highlight SCI-IDEA's potential to facilitate the structured and flexible exploration of context-aware scientific ideas, supporting innovation while maintaining ethical standards.

Title: Linguistic Blind Spots of Large Language Models

Authors: Jiali Cheng, Hadi Amiri
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.19260
Pdf URL: https://arxiv.org/pdf/2503.19260
Copy Paste: [[2503.19260]] Linguistic Blind Spots of Large Language Models(https://arxiv.org/abs/2503.19260)
Keywords: language model, llm
Abstract: Large language models (LLMs) are the foundation of many AI applications today. However, despite their remarkable proficiency in generating coherent text, questions linger regarding their ability to perform fine-grained linguistic annotation tasks, such as detecting nouns or verbs, or identifying more complex syntactic structures like clauses in input texts. These tasks require precise syntactic and semantic understanding of input text, and when LLMs underperform on specific linguistic structures, it raises concerns about their reliability for detailed linguistic analysis and whether their (even correct) outputs truly reflect an understanding of the inputs. In this paper, we empirically study the performance of recent LLMs on fine-grained linguistic annotation tasks. Through a series of experiments, we find that recent LLMs show limited efficacy in addressing linguistic queries and often struggle with linguistically complex inputs. We show that the most capable LLM (Llama3-70b) makes notable errors in detecting linguistic structures, such as misidentifying embedded clauses, failing to recognize verb phrases, and confusing complex nominals with clauses. Our results provide insights to inform future advancements in LLM design and development.

Title: PHEONA: An Evaluation Framework for Large Language Model-based Approaches to Computational Phenotyping

Authors: Sarah Pungitore, Shashank Yadav, Vignesh Subbian
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.19265
Pdf URL: https://arxiv.org/pdf/2503.19265
Copy Paste: [[2503.19265]] PHEONA: An Evaluation Framework for Large Language Model-based Approaches to Computational Phenotyping(https://arxiv.org/abs/2503.19265)
Keywords: language model, llm
Abstract: Computational phenotyping is essential for biomedical research but often requires significant time and resources, especially since traditional methods typically involve extensive manual data review. While machine learning and natural language processing advancements have helped, further improvements are needed. Few studies have explored using Large Language Models (LLMs) for these tasks despite known advantages of LLMs for text-based tasks. To facilitate further research in this area, we developed an evaluation framework, Evaluation of PHEnotyping for Observational Health Data (PHEONA), that outlines context-specific considerations. We applied and demonstrated PHEONA on concept classification, a specific task within a broader phenotyping process for Acute Respiratory Failure (ARF) respiratory support therapies. From the sample concepts tested, we achieved high classification accuracy, suggesting the potential for LLM-based methods to improve computational phenotyping processes.

Title: MARS: Memory-Enhanced Agents with Reflective Self-improvement

Authors: Xuechen Liang, Meiling Tao, Yinghui Xia, Jianhui Wang, Kun Li, Yijin Wang, Jingsong Yang, Tianyu Shi, Yuantao Wang, Miao Zhang, Xueqian Wang
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2503.19271
Pdf URL: https://arxiv.org/pdf/2503.19271
Copy Paste: [[2503.19271]] MARS: Memory-Enhanced Agents with Reflective Self-improvement(https://arxiv.org/abs/2503.19271)
Keywords: language model, llm, agent
Abstract: Large language models (LLMs) have made significant advances in the field of natural language processing, but they still face challenges such as continuous decision-making, lack of long-term memory, and limited context windows in dynamic environments. To address these issues, this paper proposes an innovative framework, Memory-Enhanced Agents with Reflective Self-improvement (MARS). The MARS framework comprises three agents: the User, the Assistant, and the Checker. By integrating iterative feedback, reflective mechanisms, and a memory optimization mechanism based on the Ebbinghaus forgetting curve, it significantly enhances the agents' capabilities in handling multi-tasking and long-span information.
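
A memory mechanism based on the Ebbinghaus forgetting curve can be illustrated with the commonly cited exponential form R = exp(-t / S), where t is the time since a memory was last accessed and S is its strength. The sketch below is an assumption-laden illustration, not the MARS implementation: the record fields, the pruning threshold, and the function names are hypothetical.

```python
import math


def retention(elapsed, strength):
    """Exponential forgetting curve: R = exp(-elapsed / strength)."""
    return math.exp(-elapsed / strength)


def prune(memories, now, threshold=0.3):
    """Keep only memories whose estimated retention is at or above threshold."""
    return [
        m for m in memories
        if retention(now - m["last_access"], m["strength"]) >= threshold
    ]


# Hypothetical memory records; a larger strength means slower forgetting.
memories = [
    {"id": "a", "last_access": 0.0, "strength": 10.0},
    {"id": "b", "last_access": 0.0, "strength": 2.0},
]
kept = prune(memories, now=5.0)
kept_ids = [m["id"] for m in kept]
```

At time 5 the strong memory still has a retention of about 0.61 and is kept, while the weak one has decayed to about 0.08 and is dropped. A fuller design would likely raise a memory's strength each time it is recalled.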

Title: CoMAC: Conversational Agent for Multi-Source Auxiliary Context with Sparse and Symmetric Latent Interactions

Authors: Junfeng Liu, Christopher T. Symons, Ranga Raju Vatsavai
Subjects: cs.CL, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2503.19274
Pdf URL: https://arxiv.org/pdf/2503.19274
Copy Paste: [[2503.19274]] CoMAC: Conversational Agent for Multi-Source Auxiliary Context with Sparse and Symmetric Latent Interactions(https://arxiv.org/abs/2503.19274)
Keywords: agent
Abstract: Recent advancements in AI-driven conversational agents have exhibited immense potential of AI applications. Effective response generation is crucial to the success of these agents. While extensive research has focused on leveraging multiple auxiliary data sources (e.g., knowledge bases and personas) to enhance response generation, existing methods often struggle to efficiently extract relevant information from these sources. There are still clear limitations in the ability to combine versatile conversational capabilities with adherence to known facts and adaptation to large variations in user preferences and belief systems, which continues to hinder the wide adoption of conversational AI tools. This paper introduces a novel method, Conversational Agent for Multi-Source Auxiliary Context with Sparse and Symmetric Latent Interactions (CoMAC), for conversation generation, which employs specialized encoding streams and post-fusion grounding networks for multiple data sources to identify relevant persona and knowledge information for the conversation. CoMAC also leverages a novel text similarity metric that allows bi-directional information sharing among multiple sources and focuses on a selective subset of meaningful words. Our experiments show that CoMAC improves the relevant persona and knowledge prediction accuracies and response generation quality significantly over two state-of-the-art methods.

Title: Machine-assisted writing evaluation: Exploring pre-trained language models in analyzing argumentative moves

Authors: Wenjuan Qin, Weiran Wang, Yuming Yang, Tao Gui
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2503.19279
Pdf URL: https://arxiv.org/pdf/2503.19279
Copy Paste: [[2503.19279]] Machine-assisted writing evaluation: Exploring pre-trained language models in analyzing argumentative moves(https://arxiv.org/abs/2503.19279)
Keywords: language model
Abstract: The study investigates the efficacy of pre-trained language models (PLMs) in analyzing argumentative moves in a longitudinal learner corpus. Prior studies on argumentative moves often rely on qualitative analysis and manual coding, limiting their efficiency and generalizability. The study aims to: 1) assess the reliability of PLMs in analyzing argumentative moves; 2) utilize PLM-generated annotations to illustrate developmental patterns and predict writing quality. A longitudinal corpus of 1643 argumentative texts from 235 English learners in China is collected and annotated into six move types: claim, data, counter-claim, counter-data, rebuttal, and non-argument. The corpus is divided into training, validation, and application sets annotated by human experts and PLMs. We use BERT as one of the implementations of PLMs. The results indicate a robust reliability of PLMs in analyzing argumentative moves, with an overall F1 score of 0.743, surpassing existing models in the field. Additionally, PLM-labeled argumentative moves effectively capture developmental patterns and predict writing quality. Over time, students exhibit an increase in the use of data and counter-claims and a decrease in non-argument moves. While low-quality texts are characterized by a predominant use of claims and data supporting only a one-sided position, mid- and high-quality texts demonstrate an integrative perspective with a higher ratio of counter-claims, counter-data, and rebuttals. This study underscores the transformative potential of integrating artificial intelligence into language education, enhancing the efficiency and accuracy of evaluating students' writing. The successful application of PLMs can catalyze the development of educational technology, promoting a more data-driven and personalized learning environment that supports diverse educational needs.

Title: Iterative Hypothesis Generation for Scientific Discovery with Monte Carlo Nash Equilibrium Self-Refining Trees

Authors: Gollam Rabby, Diyana Muhammed, Prasenjit Mitra, Sören Auer
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.19309
Pdf URL: https://arxiv.org/pdf/2503.19309
Copy Paste: [[2503.19309]] Iterative Hypothesis Generation for Scientific Discovery with Monte Carlo Nash Equilibrium Self-Refining Trees(https://arxiv.org/abs/2503.19309)
Keywords: language model, llm, prompt
Abstract: Scientific hypothesis generation is a fundamentally challenging task in research, requiring the synthesis of novel and empirically grounded insights. Traditional approaches rely on human intuition and domain expertise, while purely large language model (LLM) based methods often struggle to produce hypotheses that are both innovative and reliable. To address these limitations, we propose the Monte Carlo Nash Equilibrium Self-Refine Tree (MC-NEST), a novel framework that integrates Monte Carlo Tree Search with Nash Equilibrium strategies to iteratively refine and validate hypotheses. MC-NEST dynamically balances exploration and exploitation through adaptive sampling strategies, which prioritize high-potential hypotheses while maintaining diversity in the search space. We demonstrate the effectiveness of MC-NEST through comprehensive experiments across multiple domains, including biomedicine, social science, and computer science. MC-NEST achieves average scores of 2.65, 2.74, and 2.80 (on a 1-3 scale) for novelty, clarity, significance, and verifiability metrics on the social science, computer science, and biomedicine datasets, respectively, outperforming state-of-the-art prompt-based methods, which achieve 2.36, 2.51, and 2.52 on the same datasets. These results underscore MC-NEST's ability to generate high-quality, empirically grounded hypotheses across diverse domains. Furthermore, MC-NEST facilitates structured human-AI collaboration, ensuring that LLMs augment human creativity rather than replace it. By addressing key challenges such as iterative refinement and the exploration-exploitation balance, MC-NEST sets a new benchmark in automated hypothesis generation. Additionally, MC-NEST's ethical design enables responsible AI use, emphasizing transparency and human supervision in hypothesis generation.
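
The abstract describes balancing exploration and exploitation while searching over hypotheses. The sketch below shows only the standard UCT selection rule from Monte Carlo Tree Search as one generic way to strike that balance. It does not reproduce MC-NEST's Nash Equilibrium strategy or its adaptive sampling, and the node layout (`name`, `value`, `visits`) and the constant `c=1.4` are assumptions made for illustration.

```python
import math


def uct_score(total_value: float, visits: int, parent_visits: int, c: float = 1.4) -> float:
    """Standard UCT: mean value (exploitation) plus a visit-count bonus (exploration)."""
    if visits == 0:
        return float("inf")  # unvisited candidates are always tried first
    exploit = total_value / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore


def select_child(children, parent_visits: int):
    """Pick the child hypothesis with the highest UCT score."""
    return max(children, key=lambda ch: uct_score(ch["value"], ch["visits"], parent_visits))


# Hypothetical candidate hypotheses with accumulated scores and visit counts.
children = [
    {"name": "h1", "value": 8.0, "visits": 4},
    {"name": "h2", "value": 2.5, "visits": 1},
    {"name": "h3", "value": 0.0, "visits": 0},
]
print(select_child(children, parent_visits=5)["name"])
```

With these numbers the unvisited candidate is chosen first; once every candidate has been visited, the rarely visited `h2` outranks `h1` because its exploration bonus is larger.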

Title: Substance over Style: Evaluating Proactive Conversational Coaching Agents

Authors: Vidya Srinivas, Xuhai Xu, Xin Liu, Kumar Ayush, Isaac Galatzer-Levy, Shwetak Patel, Daniel McDuff, Tim Althoff
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.19328
Pdf URL: https://arxiv.org/pdf/2503.19328
Copy Paste: [[2503.19328]] Substance over Style: Evaluating Proactive Conversational Coaching Agents(https://arxiv.org/abs/2503.19328)
Keywords: agent
Abstract: While NLP research has made strides in conversational tasks, many approaches focus on single-turn responses with well-defined objectives or evaluation criteria. In contrast, coaching presents unique challenges: initially undefined goals that evolve through multi-turn interactions, subjective evaluation criteria, and mixed-initiative dialogue. In this work, we describe and implement five multi-turn coaching agents that exhibit distinct conversational styles, and evaluate them through a user study, collecting first-person feedback on 155 conversations. We find that users highly value core functionality, and that stylistic components in the absence of core components are viewed negatively. By comparing user feedback with third-person evaluations from health experts and an LM, we reveal significant misalignment across evaluation approaches. Our findings provide insights into the design and evaluation of conversational coaching agents and contribute toward improving human-centered NLP applications.

Title: DeCAP: Context-Adaptive Prompt Generation for Debiasing Zero-shot Question Answering in Large Language Models

Authors: Suyoung Bae, YunSeok Choi, Jee-Hyong Lee
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2503.19426
Pdf URL: https://arxiv.org/pdf/2503.19426
Copy Paste: [[2503.19426]] DeCAP: Context-Adaptive Prompt Generation for Debiasing Zero-shot Question Answering in Large Language Models(https://arxiv.org/abs/2503.19426)
Keywords: language model, llm, prompt
Abstract: While Large Language Models (LLMs) excel in zero-shot Question Answering (QA), they tend to expose biases in their internal knowledge when faced with socially sensitive questions, leading to a degradation in performance. Existing zero-shot methods are efficient but fail to consider context or prevent bias propagation in the answers. To address this, we propose DeCAP, a method for debiasing LLMs using Context-Adaptive Prompt Generation. DeCAP leverages Question Ambiguity Detection to take appropriate debiasing actions based on the context, and Neutral Answer Guidance Generation to steer the LLMs toward objective judgments about the context, minimizing the propagation of bias from their internal knowledge. Our experiments across eight LLMs show that DeCAP achieves state-of-the-art zero-shot debiased QA performance. This demonstrates DeCAP's efficacy in enhancing the fairness and accuracy of LLMs in diverse QA settings.

Title: Enhancing Small Language Models for Cross-Lingual Generalized Zero-Shot Classification with Soft Prompt Tuning

Authors: Fred Philippy, Siwen Guo, Cedric Lothritz, Jacques Klein, Tegawendé F. Bissyandé
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.19469
Pdf URL: https://arxiv.org/pdf/2503.19469
Copy Paste: [[2503.19469]] Enhancing Small Language Models for Cross-Lingual Generalized Zero-Shot Classification with Soft Prompt Tuning(https://arxiv.org/abs/2503.19469)
Keywords: language model, prompt
Abstract: In NLP, Zero-Shot Classification (ZSC) has become essential for enabling models to classify text into categories unseen during training, particularly in low-resource languages and domains where labeled data is scarce. While pretrained language models (PLMs) have shown promise in ZSC, they often rely on large training datasets or external knowledge, limiting their applicability in multilingual and low-resource scenarios. Recent approaches leveraging natural language prompts reduce the dependence on large training datasets but struggle to effectively incorporate available labeled data from related classification tasks, especially when these datasets originate from different languages or distributions. Moreover, existing prompt-based methods typically rely on manually crafted prompts in a specific language, limiting their adaptability and effectiveness in cross-lingual settings. To address these challenges, we introduce RoSPrompt, a lightweight and data-efficient approach for training soft prompts that enhance cross-lingual ZSC while ensuring robust generalization across data distribution shifts. RoSPrompt is designed for small multilingual PLMs, enabling them to leverage high-resource languages to improve performance in low-resource settings without requiring extensive fine-tuning or high computational costs. We evaluate our approach on multiple multilingual PLMs across datasets covering 106 languages, demonstrating strong cross-lingual transfer performance and robust generalization capabilities over unseen classes.

Title: KSHSeek: Data-Driven Approaches to Mitigating and Detecting Knowledge-Shortcut Hallucinations in Generative Models

Authors: Zhiwei Wang, Zhongxin Liu, Ying Li, Hongyu Sun, Meng Xu, Yuqing Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.19482
Pdf URL: https://arxiv.org/pdf/2503.19482
Copy Paste: [[2503.19482]] KSHSeek: Data-Driven Approaches to Mitigating and Detecting Knowledge-Shortcut Hallucinations in Generative Models(https://arxiv.org/abs/2503.19482)
Keywords: language model, llm, hallucination
Abstract: The emergence of large language models (LLMs) has significantly advanced the development of natural language processing (NLP), especially in text generation tasks like question answering. However, model hallucinations remain a major challenge in natural language generation (NLG) tasks due to their complex causes. We systematically expand on the causes of factual hallucinations from the perspective of knowledge shortcuts, analyzing hallucinations arising from correct and defect-free data and demonstrating that knowledge-shortcut hallucinations are prevalent in generative models. To mitigate this issue, we propose a high similarity pruning algorithm at the data preprocessing level to reduce spurious correlations in the data. Additionally, we design a specific detection method for knowledge-shortcut hallucinations to evaluate the effectiveness of our mitigation strategy. Experimental results show that our approach effectively reduces knowledge-shortcut hallucinations, particularly in fine-tuning tasks, without negatively impacting model performance in question answering. This work introduces a new paradigm for mitigating specific hallucination issues in generative models, enhancing their robustness and reliability in real-world applications.
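
The abstract proposes a high similarity pruning algorithm at the data preprocessing level but does not spell it out. One plausible reading, sketched below purely as an assumption, is to drop training samples that are too similar to samples already kept. The word-level Jaccard similarity, the 0.8 threshold, and both function names are illustrative choices, not the authors' method.

```python
def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity between two texts (case-insensitive)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def prune_high_similarity(samples, threshold: float = 0.8):
    """Greedily keep a sample only if it is below the threshold against all kept samples."""
    kept = []
    for sample in samples:
        if all(jaccard(sample, other) < threshold for other in kept):
            kept.append(sample)
    return kept


samples = [
    "the capital of France is Paris",
    "The capital of France is Paris",
    "water boils at 100 degrees Celsius",
]
print(prune_high_similarity(samples))
```

The greedy pass is quadratic in the number of kept samples, so a larger corpus would likely need an approximate method such as MinHash instead.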

Title: DomainCQA: Crafting Expert-Level QA from Domain-Specific Charts

Authors: Ling Zhong, Yujing Lu, Jing Yang, Weiming Li, Peng Wei, Yongheng Wang, Manni Duan, Qing Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.19498
Pdf URL: https://arxiv.org/pdf/2503.19498
Copy Paste: [[2503.19498]] DomainCQA: Crafting Expert-Level QA from Domain-Specific Charts(https://arxiv.org/abs/2503.19498)
Keywords: language model, llm
Abstract: Chart Question Answering (CQA) benchmarks are essential for evaluating the capability of Multimodal Large Language Models (MLLMs) to interpret visual data. However, current benchmarks focus primarily on the evaluation of general-purpose CQA but fail to adequately capture domain-specific challenges. We introduce DomainCQA, a systematic methodology for constructing domain-specific CQA benchmarks, and demonstrate its effectiveness by developing AstroChart, a CQA benchmark in the field of astronomy. Our evaluation shows that chart reasoning and combining chart information with domain knowledge for deeper analysis and summarization, rather than domain-specific knowledge, pose the primary challenge for existing MLLMs, highlighting a critical gap in current benchmarks. By providing a scalable and rigorous framework, DomainCQA enables more precise assessment and improvement of MLLMs for domain-specific applications.

Title: FLEX: A Benchmark for Evaluating Robustness of Fairness in Large Language Models

Authors: Dahyun Jung, Seungyoon Lee, Hyeonseok Moon, Chanjun Park, Heuiseok Lim
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.19540
Pdf URL: https://arxiv.org/pdf/2503.19540
Copy Paste: [[2503.19540]] FLEX: A Benchmark for Evaluating Robustness of Fairness in Large Language Models(https://arxiv.org/abs/2503.19540)
Keywords: language model, llm, prompt
Abstract: Recent advancements in Large Language Models (LLMs) have significantly enhanced interactions between users and models. These advancements concurrently underscore the need for rigorous safety evaluations due to the manifestation of social biases, which can lead to harmful societal impacts. Despite these concerns, existing benchmarks may overlook the intrinsic weaknesses of LLMs, which can generate biased responses even with simple adversarial instructions. To address this critical gap, we introduce a new benchmark, Fairness Benchmark in LLM under Extreme Scenarios (FLEX), designed to test whether LLMs can sustain fairness even when exposed to prompts constructed to induce bias. To thoroughly evaluate the robustness of LLMs, we integrate prompts that amplify potential biases into the fairness assessment. Comparative experiments between FLEX and existing benchmarks demonstrate that traditional evaluations may underestimate the inherent risks in models. This highlights the need for more stringent LLM evaluation benchmarks to guarantee safety and fairness.

Title: Scaling Laws of Synthetic Data for Language Models

Authors: Zeyu Qin, Qingxiu Dong, Xingxing Zhang, Li Dong, Xiaolong Huang, Ziyi Yang, Mahmoud Khademi, Dongdong Zhang, Hany Hassan Awadalla, Yi R. Fung, Weizhu Chen, Minhao Cheng, Furu Wei
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.19551
Pdf URL: https://arxiv.org/pdf/2503.19551
Copy Paste: [[2503.19551]] Scaling Laws of Synthetic Data for Language Models(https://arxiv.org/abs/2503.19551)
Keywords: language model, llm
Abstract: Large language models (LLMs) achieve strong performance across diverse tasks, largely driven by high-quality web data used in pre-training. However, recent studies indicate this data source is rapidly depleting. Synthetic data emerges as a promising alternative, but it remains unclear whether synthetic datasets exhibit predictable scalability comparable to raw pre-training data. In this work, we systematically investigate the scaling laws of synthetic data by introducing SynthLLM, a scalable framework that transforms pre-training corpora into diverse, high-quality synthetic datasets. Our approach achieves this by automatically extracting and recombining high-level concepts across multiple documents using a graph algorithm. Key findings from our extensive mathematical experiments on SynthLLM include: (1) SynthLLM generates synthetic data that reliably adheres to the rectified scaling law across various model sizes; (2) Performance improvements plateau near 300B tokens; and (3) Larger models approach optimal performance with fewer training tokens. For instance, an 8B model peaks at 1T tokens, while a 3B model requires 4T. Moreover, comparisons with existing synthetic data generation and augmentation methods demonstrate that SynthLLM achieves superior performance and scalability. Our findings highlight synthetic data as a scalable and reliable alternative to organic pre-training corpora, offering a viable path toward continued improvement in model performance.

Title: Context-Efficient Retrieval with Factual Decomposition

Authors: Yanhong Li, David Yunis, David McAllester, Jiawei Zhou
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2503.19574
Pdf URL: https://arxiv.org/pdf/2503.19574
Copy Paste: [[2503.19574]] Context-Efficient Retrieval with Factual Decomposition(https://arxiv.org/abs/2503.19574)
Keywords: language model, llm
Abstract: There has recently been considerable interest in incorporating information retrieval into large language models (LLMs). Retrieval from a dynamically expanding external corpus of text allows a model to incorporate current events and can be viewed as a form of episodic memory. Here we demonstrate that pre-processing the external corpus into semi-structured "atomic facts" makes retrieval more efficient. More specifically, we demonstrate that our particular form of atomic facts improves performance on various question answering tasks when the amount of retrieved text is limited. Limiting the amount of retrieval reduces the size of the context and improves inference efficiency.

Title: Distinct social-linguistic processing between humans and large audio-language models: Evidence from model-brain alignment

Authors: Hanlin Wu, Xufeng Duan, Zhenguang Cai
Subjects: cs.CL, q-bio.NC
Abstract URL: https://arxiv.org/abs/2503.19586
Pdf URL: https://arxiv.org/pdf/2503.19586
Copy Paste: [[2503.19586]] Distinct social-linguistic processing between humans and large audio-language models: Evidence from model-brain alignment(https://arxiv.org/abs/2503.19586)
Keywords: language model
Abstract: Voice-based AI development faces unique challenges in processing both linguistic and paralinguistic information. This study compares how large audio-language models (LALMs) and humans integrate speaker characteristics during speech comprehension, asking whether LALMs process speaker-contextualized language in ways that parallel human cognitive mechanisms. We compared two LALMs' (Qwen2-Audio and Ultravox 0.5) processing patterns with human EEG responses. Using surprisal and entropy metrics from the models, we analyzed their sensitivity to speaker-content incongruency across social stereotype violations (e.g., a man claiming to regularly get manicures) and biological knowledge violations (e.g., a man claiming to be pregnant). Results revealed that Qwen2-Audio exhibited increased surprisal for speaker-incongruent content and its surprisal values significantly predicted human N400 responses, while Ultravox 0.5 showed limited sensitivity to speaker characteristics. Importantly, neither model replicated the human-like processing distinction between social violations (eliciting N400 effects) and biological violations (eliciting P600 effects). These findings reveal both the potential and limitations of current LALMs in processing speaker-contextualized language, and suggest differences in social-linguistic processing mechanisms between humans and LALMs.
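
The study relies on surprisal and entropy metrics derived from the models' output distributions. As a minimal sketch of the standard definitions (surprisal of a word is -log2 of its probability; entropy is the expected surprisal over the distribution), the example below uses a made-up next-word distribution. The probabilities and variable names are illustrative and do not come from the paper.

```python
import math


def surprisal(prob: float) -> float:
    """Surprisal in bits of an outcome with probability `prob`."""
    return -math.log2(prob)


def entropy(dist) -> float:
    """Shannon entropy in bits of a {token: probability} distribution."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


# Hypothetical next-word distribution after "I regularly get a ..."
next_word = {"haircut": 0.5, "manicure": 0.25, "massage": 0.25}
print(surprisal(next_word["manicure"]))  # 2.0 bits
print(entropy(next_word))                # 1.5 bits
```

A speaker-sensitive model would be expected to assign a lower probability, and therefore a higher surprisal, to a continuation that is incongruent with the speaker.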

Title: The Greatest Good Benchmark: Measuring LLMs' Alignment with Utilitarian Moral Dilemmas

Authors: Giovanni Franco Gabriel Marraffini, Andrés Cotton, Noe Fabian Hsueh, Axel Fridman, Juan Wisznia, Luciano Del Corro
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.19598
Pdf URL: https://arxiv.org/pdf/2503.19598
Copy Paste: [[2503.19598]] The Greatest Good Benchmark: Measuring LLMs' Alignment with Utilitarian Moral Dilemmas(https://arxiv.org/abs/2503.19598)
Keywords: language model, llm
Abstract: The question of how to make decisions that maximise the well-being of all persons is highly relevant to designing language models that are beneficial to humanity and free from harm. We introduce the Greatest Good Benchmark to evaluate the moral judgments of LLMs using utilitarian dilemmas. Our analysis across 15 diverse LLMs reveals consistently encoded moral preferences that diverge from established moral theories and lay population moral standards. Most LLMs have a marked preference for impartial beneficence and rejection of instrumental harm. These findings showcase the 'artificial moral compass' of LLMs, offering insights into their moral alignment.

Title: 1.4 Million Open-Source Distilled Reasoning Dataset to Empower Large Language Model Training

Authors: Han Zhao, Haotian Wang, Yiping Peng, Sitong Zhao, Xiaoyu Tian, Shuaiting Chen, Yunjie Ji, Xiangang Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.19633
Pdf URL: https://arxiv.org/pdf/2503.19633
Copy Paste: [[2503.19633]] 1.4 Million Open-Source Distilled Reasoning Dataset to Empower Large Language Model Training(https://arxiv.org/abs/2503.19633)
Keywords: language model, llm
Abstract: AM-DeepSeek-R1-Distilled is a large-scale dataset with thinking traces for general reasoning tasks, composed of high-quality and challenging reasoning problems. These problems are collected from a multitude of open-source datasets and subjected to semantic deduplication and meticulous cleaning to eliminate test set contamination. All responses within the dataset are distilled from reasoning models (predominantly DeepSeek-R1) and have undergone rigorous verification procedures. Mathematical problems are validated by checking against reference answers, code problems are verified using test cases, and other tasks are evaluated with the aid of a reward model. The AM-Distill-Qwen-32B model, which was trained through only simple Supervised Fine-Tuning (SFT) using this batch of data, outperformed the DeepSeek-R1-Distill-Qwen-32B model on four benchmarks: AIME2024, MATH-500, GPQA-Diamond, and LiveCodeBench. Additionally, the AM-Distill-Qwen-72B model surpassed the DeepSeek-R1-Distill-Llama-70B model on all benchmarks. We are releasing these 1.4 million problems and their corresponding responses to the research community with the objective of fostering the development of powerful reasoning-oriented Large Language Models (LLMs). The dataset was published at this https URL.

Title: HausaNLP at SemEval-2025 Task 3: Towards a Fine-Grained Model-Aware Hallucination Detection

Authors: Maryam Bala, Amina Imam Abubakar, Abdulhamid Abubakar, Abdulkadir Shehu Bichi, Hafsa Kabir Ahmad, Sani Abdullahi Sani, Idris Abdulmumin, Shamsuddeen Hassan Muhamad, Ibrahim Said Ahmad
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.19650
Pdf URL: https://arxiv.org/pdf/2503.19650
Copy Paste: [[2503.19650]] HausaNLP at SemEval-2025 Task 3: Towards a Fine-Grained Model-Aware Hallucination Detection(https://arxiv.org/abs/2503.19650)
Keywords: language model, llm, hallucination
Abstract: This paper presents our findings for the Multilingual Shared Task on Hallucinations and Related Observable Overgeneration Mistakes, MU-SHROOM, which focuses on identifying hallucinations and related overgeneration errors in large language models (LLMs). The shared task involves detecting specific text spans that constitute hallucinations in the outputs generated by LLMs in 14 languages. To address this task, we aim to provide a nuanced, model-aware understanding of hallucination occurrences and severity in English. We used natural language inference and fine-tuned a ModernBERT model using a synthetic dataset of 400 samples, achieving an Intersection over Union (IoU) score of 0.032 and a correlation score of 0.422. These results indicate a moderately positive correlation between the model's confidence scores and the actual presence of hallucinations. The IoU score indicates that our model has a relatively low overlap between the predicted hallucination spans and the ground-truth annotations. The performance is unsurprising, given the intricate nature of hallucination detection. Hallucinations often manifest subtly and depend on context, which makes pinpointing their exact boundaries formidable.
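
The abstract reports an Intersection over Union (IoU) score between predicted and annotated hallucination spans. The sketch below shows one straightforward way to compute a character-level IoU for a single output, assuming spans are half-open `(start, end)` character offsets. The shared task's official scorer may differ in its details, and the function names here are hypothetical.

```python
def span_chars(spans):
    """Expand half-open (start, end) character spans into a set of character indices."""
    chars = set()
    for start, end in spans:
        chars.update(range(start, end))
    return chars


def span_iou(pred_spans, gold_spans) -> float:
    """Character-level IoU between predicted and gold hallucination spans."""
    pred, gold = span_chars(pred_spans), span_chars(gold_spans)
    if not pred and not gold:
        return 1.0  # assumption: two empty annotations count as perfect agreement
    return len(pred & gold) / len(pred | gold)


# Predicted span covers characters 0-9, gold span covers 5-14: 5 shared of 15 total.
print(span_iou([(0, 10)], [(5, 15)]))
```

Because the union grows with every misplaced character, even a prediction that finds the right region but with loose boundaries scores low, which is consistent with the abstract's remark about the difficulty of pinpointing exact boundaries.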

Title: AdaptiVocab: Enhancing LLM Efficiency in Focused Domains through Lightweight Vocabulary Adaptation

Authors: Itay Nakash, Nitay Calderon, Eyal Ben David, Elad Hoffer, Roi Reichart
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.19693
Pdf URL: https://arxiv.org/pdf/2503.19693
Copy Paste: [[2503.19693]] AdaptiVocab: Enhancing LLM Efficiency in Focused Domains through Lightweight Vocabulary Adaptation(https://arxiv.org/abs/2503.19693)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have shown impressive versatility as general purpose models. However, their broad applicability comes at a high computational cost, particularly in auto-regressive decoding where each step requires a forward pass. In domain-specific settings, general-purpose capabilities are unnecessary and can be exchanged for efficiency. In this work, we take a novel perspective on domain adaptation, reducing latency and computational costs by adapting the vocabulary to focused domains of interest. We introduce AdaptiVocab, an end-to-end approach for vocabulary adaptation, designed to enhance LLM efficiency in low-resource domains. AdaptiVocab can be applied to any tokenizer and architecture, modifying the vocabulary by replacing tokens with domain-specific n-gram-based tokens, thereby reducing the number of tokens required for both input processing and output generation. AdaptiVocab initializes new n-token embeddings using an exponentially weighted combination of existing embeddings and employs a lightweight fine-tuning phase that can be efficiently performed on a single GPU. We evaluate two 7B LLMs across three niche domains, assessing efficiency, generation quality, and end-task performance. Our results show that AdaptiVocab reduces token usage by over 25% without compromising performance.
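
The abstract states that new n-gram token embeddings are initialized from an exponentially weighted combination of existing embeddings, without giving the exact weighting. The sketch below is one possible interpretation under stated assumptions: weights decay exponentially with position in the n-gram and are normalized to sum to 1. The decay direction, the `decay` value, and the function name are all hypothetical.

```python
import math


def init_ngram_embedding(token_vectors, decay: float = 2.0):
    """Combine constituent token embeddings with normalized exponential weights.

    Assumption: weight_i is proportional to exp(-decay * i), so earlier tokens
    contribute more. The paper may weight positions differently.
    """
    raw = [math.exp(-decay * i) for i in range(len(token_vectors))]
    total = sum(raw)
    weights = [w / total for w in raw]
    dim = len(token_vectors[0])
    return [
        sum(w * vec[d] for w, vec in zip(weights, token_vectors))
        for d in range(dim)
    ]


# Two toy 2-dimensional embeddings standing in for the tokens of one n-gram.
print(init_ngram_embedding([[1.0, 0.0], [0.0, 1.0]]))
```

Starting a new token near its constituents' embeddings gives the lightweight fine-tuning phase a sensible starting point rather than random noise.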

Title: HausaNLP at SemEval-2025 Task 2: Entity-Aware Fine-tuning vs. Prompt Engineering in Entity-Aware Machine Translation

Authors: Abdulhamid Abubakar, Hamidatu Abdulkadir, Ibrahim Rabiu Abdullahi, Abubakar Auwal Khalid, Ahmad Mustapha Wali, Amina Aminu Umar, Maryam Bala, Sani Abdullahi Sani, Ibrahim Said Ahmad, Shamsuddeen Hassan Muhammad, Idris Abdulmumin, Vukosi Marivate
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.19702
Pdf URL: https://arxiv.org/pdf/2503.19702
Copy Paste: [[2503.19702]] HausaNLP at SemEval-2025 Task 2: Entity-Aware Fine-tuning vs. Prompt Engineering in Entity-Aware Machine Translation(https://arxiv.org/abs/2503.19702)
Keywords: prompt
Abstract: This paper presents our findings for SemEval 2025 Task 2, a shared task on entity-aware machine translation (EA-MT). The goal of this task is to develop translation models that can accurately translate English sentences into target languages, with a particular focus on handling named entities, which often pose challenges for MT systems. The task covers 10 target languages with English as the source. In this paper, we describe the different systems we employed, detail our results, and discuss insights gained from our experiments.

Title: Writing as a testbed for open ended agents

Authors: Sian Gooding, Lucia Lopez-Rivilla, Edward Grefenstette
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2503.19711
Pdf URL: https://arxiv.org/pdf/2503.19711
Copy Paste: [[2503.19711]] Writing as a testbed for open ended agents(https://arxiv.org/abs/2503.19711)
Keywords: gpt, llm, agent
Abstract: Open-ended tasks are particularly challenging for LLMs due to the vast solution space, demanding both expansive exploration and adaptable strategies, especially when success lacks a clear, objective definition. Writing, with its vast solution space and subjective evaluation criteria, provides a compelling testbed for studying such problems. In this paper, we investigate the potential of LLMs to act as collaborative co-writers, capable of suggesting and implementing text improvements autonomously. We analyse three prominent LLMs - Gemini 1.5 Pro, Claude 3.5 Sonnet, and GPT-4o - focusing on how their action diversity, human alignment, and iterative improvement capabilities impact overall performance. This work establishes a framework for benchmarking autonomous writing agents and, more broadly, highlights fundamental challenges and potential solutions for building systems capable of excelling in diverse open-ended domains.

Title: Gemma 3 Technical Report

Authors: Gemma Team: Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, Louis Rouillard, Thomas Mesnard, Geoffrey Cideron, Jean-bastien Grill, Sabela Ramos, Edouard Yvinec, Michelle Casbon, Etienne Pot, Ivo Penchev, Gaël Liu, Francesco Visin, Kathleen Kenealy, Lucas Beyer, Xiaohai Zhai, Anton Tsitsulin, Robert Busa-Fekete, Alex Feng, Noveen Sachdeva, Benjamin Coleman, Yi Gao, Basil Mustafa, Iain Barr, Emilio Parisotto, David Tian, Matan Eyal, Colin Cherry, Jan-Thorsten Peter, Danila Sinopalnikov, Surya Bhupatiraju, Rishabh Agarwal, Mehran Kazemi, Dan Malkin, Ravin Kumar, David Vilar, Idan Brusilovsky, Jiaming Luo, Andreas Steiner, Abe Friesen, Abhanshu Sharma, Abheesht Sharma, Adi Mayrav Gilady, Adrian Goedeckemeyer, Alaa Saade, Alex Feng, Alexander Kolesnikov, Alexei Bendebury, Alvin Abdagic, Amit Vadi, András György, André Susano Pinto, Anil Das, Ankur Bapna, Antoine Miech, Antoine Yang, Antonia Paterson, Ashish Shenoy, Ayan Chakrabarti, Bilal Piot, Bo Wu, Bobak Shahriari, Bryce Petrini, Charlie Chen, Charline Le Lan, Christopher A. Choquette-Choo, CJ Carey, Cormac Brick, Daniel Deutsch, Danielle Eisenbud, Dee Cattle, Derek Cheng, Dimitris Paparas, Divyashree Shivakumar Sreepathihalli, Doug Reid, Dustin Tran, Dustin Zelle, Eric Noland, Erwin Huizenga, Eugene Kharitonov, Frederick Liu, Gagik Amirkhanyan, Glenn Cameron, Hadi Hashemi, Hanna Klimczak-Plucińska, Harman Singh, Harsh Mehta, Harshal Tushar Lehri, Hussein Hazimeh, Ian Ballantyne, Idan Szpektor, Ivan Nardini
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.19786
Pdf URL: https://arxiv.org/pdf/2503.19786
Copy Paste: [[2503.19786]] Gemma 3 Technical Report(https://arxiv.org/abs/2503.19786)
Keywords: long context, chat
Abstract: We introduce Gemma 3, a multimodal addition to the Gemma family of lightweight open models, ranging in scale from 1 to 27 billion parameters. This version introduces vision understanding abilities, a wider coverage of languages and longer context - at least 128K tokens. We also change the architecture of the model to reduce the KV-cache memory that tends to explode with long context. This is achieved by increasing the ratio of local to global attention layers, and keeping the span on local attention short. The Gemma 3 models are trained with distillation and achieve superior performance to Gemma 2 for both pre-trained and instruction finetuned versions. In particular, our novel post-training recipe significantly improves the math, chat, instruction-following and multilingual abilities, making Gemma3-4B-IT competitive with Gemma2-27B-IT and Gemma3-27B-IT comparable to Gemini-1.5-Pro across benchmarks. We release all our models to the community.
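
The KV-cache argument in the abstract can be made concrete with a back-of-the-envelope calculation. The sketch below assumes layers repeat in a pattern of `local_per_global` local layers followed by one global layer, and that a local layer caches at most `window` tokens; the function name and every number in the usage note are hypothetical and are not taken from the report.

```python
def kv_cache_bytes(context_len, n_layers, local_per_global, window,
                   n_kv_heads, head_dim, bytes_per_value=2):
    """Estimate KV-cache size for a mix of local and global attention layers."""
    # Assumed layout: `local_per_global` local layers per one global layer.
    n_global = n_layers // (local_per_global + 1)
    n_local = n_layers - n_global
    # Each cached token stores one key and one value vector per KV head.
    per_token = 2 * n_kv_heads * head_dim * bytes_per_value
    # Global layers cache the full context; local layers only the window.
    global_bytes = n_global * context_len * per_token
    local_bytes = n_local * min(context_len, window) * per_token
    return global_bytes + local_bytes
```

Comparing `kv_cache_bytes(128_000, 48, 0, 1024, 8, 128)` (all layers global) with `kv_cache_bytes(128_000, 48, 5, 1024, 8, 128)` shows how raising the local-to-global ratio and keeping the window short shrinks the cache at long context.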

Title: SemEval-2025 Task 9: The Food Hazard Detection Challenge

Authors: Korbinian Randl, John Pavlopoulos, Aron Henriksson, Tony Lindgren, Juli Bakagianni
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.19800
Pdf URL: https://arxiv.org/pdf/2503.19800
Copy Paste: [[2503.19800]] SemEval-2025 Task 9: The Food Hazard Detection Challenge(https://arxiv.org/abs/2503.19800)
Keywords: language model
Abstract: In this challenge, we explored text-based food hazard prediction with long tail distributed classes. The task was divided into two subtasks: (1) predicting whether a web text implies one of ten food-hazard categories and identifying the associated food category, and (2) providing a more fine-grained classification by assigning a specific label to both the hazard and the product. Our findings highlight that large language model-generated synthetic data can be highly effective for oversampling long-tail distributions. Furthermore, we find that fine-tuned encoder-only, encoder-decoder, and decoder-only systems achieve comparable maximum performance across both subtasks. During this challenge, we gradually released (under CC BY-NC-SA 4.0) a novel set of 6,644 manually labeled food-incident reports.

Title: A Comparative Analysis of Word Segmentation, Part-of-Speech Tagging, and Named Entity Recognition for Historical Chinese Sources, 1900-1950

Authors: Zhao Fang, Liang-Chun Wu, Xuening Kong, Spencer Dean Stewart
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.19844
Pdf URL: https://arxiv.org/pdf/2503.19844
Copy Paste: [[2503.19844]] A Comparative Analysis of Word Segmentation, Part-of-Speech Tagging, and Named Entity Recognition for Historical Chinese Sources, 1900-1950(https://arxiv.org/abs/2503.19844)
Keywords: language model, gpt, llm
Abstract: This paper compares large language models (LLMs) and traditional natural language processing (NLP) tools for performing word segmentation, part-of-speech (POS) tagging, and named entity recognition (NER) on Chinese texts from 1900 to 1950. Historical Chinese documents pose challenges for text analysis due to their logographic script, the absence of natural word boundaries, and significant linguistic changes. Using a sample dataset from the Shanghai Library Republican Journal corpus, traditional tools such as Jieba and spaCy are compared to LLMs, including GPT-4o, Claude 3.5, and the GLM series. The results show that LLMs outperform traditional methods in all metrics, albeit at considerably higher computational costs, highlighting a trade-off between accuracy and efficiency. Additionally, LLMs better handle genre-specific challenges such as poetry and temporal variations (i.e., pre-1920 versus post-1920 texts), demonstrating that their contextual learning capabilities can advance NLP approaches to historical texts by reducing the need for domain-specific training data.

Title: Think Twice: Enhancing LLM Reasoning by Scaling Multi-round Test-time Thinking

Authors: Xiaoyu Tian, Sitong Zhao, Haotian Wang, Shuaiting Chen, Yunjie Ji, Yiping Peng, Han Zhao, Xiangang Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.19855
Pdf URL: https://arxiv.org/pdf/2503.19855
Copy Paste: [[2503.19855]] Think Twice: Enhancing LLM Reasoning by Scaling Multi-round Test-time Thinking(https://arxiv.org/abs/2503.19855)
Keywords: language model, llm, prompt
Abstract: Recent advances in large language models (LLMs), such as OpenAI-o1 and DeepSeek-R1, have demonstrated the effectiveness of test-time scaling, where extended reasoning processes substantially enhance model performance. Despite this, current models are constrained by limitations in handling long texts and reinforcement learning (RL) training efficiency. To address these issues, we propose a simple yet effective test-time scaling approach, Multi-round Thinking. This method iteratively refines model reasoning by leveraging previous answers as prompts for subsequent rounds. Extensive experiments across multiple models, including QwQ-32B and DeepSeek-R1, consistently show performance improvements on various benchmarks such as AIME 2024, MATH-500, GPQA-diamond, and LiveCodeBench. For instance, the accuracy of QwQ-32B improved from 80.3% (Round 1) to 82.1% (Round 2) on the AIME 2024 dataset, while DeepSeek-R1 showed a similar increase from 79.7% to 82.0%. These results confirm that Multi-round Thinking is a broadly applicable, straightforward approach to achieving stable enhancements in model performance, underscoring its potential for future developments in test-time scaling techniques. The key prompt: {Original question prompt} The assistant's previous answer is: {last round answer} , and please re-answer.
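
The round-by-round loop described in the abstract can be sketched as follows. The prompt wording follows the "key prompt" quoted above, but the line break between its parts, the `generate` callable (any function mapping a prompt string to a final answer string), and both function names are assumptions for illustration rather than the authors' code.

```python
def build_reprompt(question, previous_answer):
    """Fill the re-answer template with the original question and last answer."""
    return (
        f"{question}\n"
        f"The assistant's previous answer is: {previous_answer}, "
        "and please re-answer."
    )


def multi_round_thinking(question, generate, rounds=2):
    """Query the model `rounds` times, feeding each answer into the next prompt."""
    if rounds < 1:
        raise ValueError("rounds must be at least 1")
    # Round 1: the model sees only the original question.
    answer = generate(question)
    # Later rounds: the model sees the question plus its previous final answer.
    for _ in range(rounds - 1):
        answer = generate(build_reprompt(question, answer))
    return answer
```

Only the previous round's answer is carried forward here, so the prompt length stays roughly constant no matter how many rounds are run.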

Title: Scaling Evaluation-time Compute with Reasoning Models as Process Evaluators

Authors: Seungone Kim, Ian Wu, Jinu Lee, Xiang Yue, Seongyun Lee, Mingyeong Moon, Kiril Gashteovski, Carolin Lawrence, Julia Hockenmaier, Graham Neubig, Sean Welleck
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.19877
Pdf URL: https://arxiv.org/pdf/2503.19877
Copy Paste: [[2503.19877]] Scaling Evaluation-time Compute with Reasoning Models as Process Evaluators(https://arxiv.org/abs/2503.19877)
Keywords: language model, prompt, chain-of-thought
Abstract: As language model (LM) outputs get more and more natural, it is becoming more difficult than ever to evaluate their quality. Simultaneously, increasing LMs' "thinking" time through scaling test-time compute has proven an effective technique to solve challenging problems in domains such as math and code. This raises a natural question: can an LM's evaluation capability also be improved by spending more test-time compute? To answer this, we investigate employing reasoning models (LMs that natively generate long chain-of-thought reasoning) as evaluators. Specifically, we examine methods to leverage more test-time compute by (1) using reasoning models, and (2) prompting these models to evaluate not only the response as a whole (i.e., outcome evaluation) but also each step in the response separately (i.e., process evaluation). In experiments, we observe that the evaluator's performance improves monotonically when generating more reasoning tokens, similar to the trends observed in LM-based generation. Furthermore, we use these more accurate evaluators to rerank multiple generations, and demonstrate that spending more compute at evaluation time can be as effective as using more compute at generation time in improving an LM's problem-solving capability.
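
The reranking step described in the abstract can be sketched by combining an outcome score with per-step process scores. The abstract does not say how the two are aggregated; taking the minimum over steps and mixing it equally with the outcome score are assumptions made here, as are the names `rerank_by_evaluation`, `outcome_score`, and `step_score` (the last two stand in for calls to an evaluator model).

```python
def rerank_by_evaluation(candidates, outcome_score, step_score,
                         process_weight=0.5):
    """Sort candidate responses from best to worst.

    candidates: list of dicts with "answer" (str) and "steps" (list of str).
    outcome_score, step_score: callables returning a float in [0, 1].
    """
    def combined(candidate):
        steps = candidate["steps"]
        # Assumption: a reasoning chain is only as reliable as its weakest step.
        process = min(step_score(s) for s in steps) if steps else 0.0
        outcome = outcome_score(candidate["answer"])
        return (1.0 - process_weight) * outcome + process_weight * process

    return sorted(candidates, key=combined, reverse=True)
```

Taking the first element of the returned list gives a best-of-N selection in which a response with a correct final answer but a flawed intermediate step ranks below one whose steps all pass.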

Title: CausalRAG: Integrating Causal Graphs into Retrieval-Augmented Generation

Authors: Nengbo Wang, Xiaotian Han, Jagdip Singh, Jing Ma, Vipin Chaudhary
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2503.19878
Pdf URL: https://arxiv.org/pdf/2503.19878
Copy Paste: [[2503.19878]] CausalRAG: Integrating Causal Graphs into Retrieval-Augmented Generation(https://arxiv.org/abs/2503.19878)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Large language models (LLMs) have revolutionized natural language processing (NLP), particularly through Retrieval-Augmented Generation (RAG), which enhances LLM capabilities by integrating external knowledge. However, traditional RAG systems face critical limitations, including disrupted contextual integrity due to text chunking, and over-reliance on semantic similarity for retrieval. To address these issues, we propose CausalRAG, a novel framework that incorporates causal graphs into the retrieval process. By constructing and tracing causal relationships, CausalRAG preserves contextual continuity and improves retrieval precision, leading to more accurate and interpretable responses. We evaluate CausalRAG against regular RAG and graph-based RAG approaches, demonstrating its superiority across several metrics. Our findings suggest that grounding retrieval in causal reasoning provides a promising approach to knowledge-intensive tasks.