2025-03-07

Title: Vision-Language Models Struggle to Align Entities across Modalities

Authors: Iñigo Alonso, Ander Salaberria, Gorka Azkune, Jeremy Barnes, Oier Lopez de Lacalle
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.03854
Pdf URL: https://arxiv.org/pdf/2503.03854
Copy Paste: [[2503.03854]] Vision-Language Models Struggle to Align Entities across Modalities(https://arxiv.org/abs/2503.03854)
Keywords: language model, prompt, chain-of-thought
Abstract: Cross-modal entity linking refers to the ability to align entities and their attributes across different modalities. While cross-modal entity linking is a fundamental skill needed for real-world applications such as multimodal code generation, fake news detection, or scene understanding, it has not been thoroughly studied in the literature. In this paper, we introduce a new task and benchmark to address this gap. Our benchmark, MATE, consists of 5.5k evaluation instances featuring visual scenes aligned with their textual representations. To evaluate cross-modal entity linking performance, we design a question-answering task that involves retrieving one attribute of an object in one modality based on a unique attribute of that object in another modality. We evaluate state-of-the-art Vision-Language Models (VLMs) and humans on this task, and find that VLMs struggle significantly compared to humans, particularly as the number of objects in the scene increases. Our analysis also shows that, while chain-of-thought prompting can improve VLM performance, models remain far from achieving human-level proficiency. These findings highlight the need for further research in cross-modal entity linking and show that MATE is a strong benchmark to support that progress.

Title: Not-Just-Scaling Laws: Towards a Better Understanding of the Downstream Impact of Language Model Design Decisions

Authors: Emmy Liu, Amanda Bertsch, Lintang Sutawika, Lindia Tjuatja, Patrick Fernandes, Lara Marinov, Michael Chen, Shreya Singhal, Carolin Lawrence, Aditi Raghunathan, Kiril Gashteovski, Graham Neubig
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.03862
Pdf URL: https://arxiv.org/pdf/2503.03862
Copy Paste: [[2503.03862]] Not-Just-Scaling Laws: Towards a Better Understanding of the Downstream Impact of Language Model Design Decisions(https://arxiv.org/abs/2503.03862)
Keywords: language model
Abstract: Improvements in language model capabilities are often attributed to increasing model size or training data, but in some cases smaller models trained on curated data or with different architectural decisions can outperform larger ones trained on more tokens. What accounts for this? To quantify the impact of these design choices, we meta-analyze 92 open-source pretrained models across a wide array of scales, including state-of-the-art open-weights models as well as less performant models and those with less conventional design decisions. We find that by incorporating features besides model size and number of training tokens, we can achieve a relative 3-28% increase in ability to predict downstream performance compared with using scale alone. Analysis of model design decisions reveals insights into data composition, such as the trade-off between language and code tasks at 15-25% code, as well as the better performance of some architectural decisions such as choosing rotary over learned embeddings. Broadly, our framework lays a foundation for more systematic investigation of how model development choices shape final capabilities.

Title: AI for Scaling Legal Reform: Mapping and Redacting Racial Covenants in Santa Clara County

Authors: Faiz Surani, Mirac Suzgun, Vyoma Raman, Christopher D. Manning, Peter Henderson, Daniel E. Ho
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.03888
Pdf URL: https://arxiv.org/pdf/2503.03888
Copy Paste: [[2503.03888]] AI for Scaling Legal Reform: Mapping and Redacting Racial Covenants in Santa Clara County(https://arxiv.org/abs/2503.03888)
Keywords: language model
Abstract: Many jurisdictions have moved to identify and strike racial covenants from property deeds, including California, which mandated in 2021 that all counties implement such a process. Yet the scale can be overwhelming, with Santa Clara County (SCC) alone having over 24 million property deed documents, making purely manual review infeasible. We present a novel approach to addressing this pressing issue, developed through a partnership with the SCC Clerk-Recorder's Office. First, we leverage an open large language model, fine-tuned to detect racial covenants with high precision and recall. We estimate that this system reduces manual efforts by 86,500 person hours and costs less than 2% as much as a comparable off-the-shelf closed model. Second, we illustrate the County's integration of this model into responsible operational practice, including legal review and the creation of a historical registry, and release our model to assist the hundreds of jurisdictions engaged in similar efforts. Finally, our results reveal distinct periods of utilization of racial covenants, sharp geographic clustering, and the disproportionate role of a small number of developers in maintaining housing discrimination. We estimate that by 1950, one in four properties across the County were subject to racial covenants.

Title: Tec-Habilidad: Skill Classification for Bridging Education and Employment

Authors: Sabur Butt, Hector G. Ceballos, Diana P. Madera
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.03932
Pdf URL: https://arxiv.org/pdf/2503.03932
Copy Paste: [[2503.03932]] Tec-Habilidad: Skill Classification for Bridging Education and Employment(https://arxiv.org/abs/2503.03932)
Keywords: llm
Abstract: Job application and assessment processes have evolved significantly in recent years, largely due to advancements in technology and changes in the way companies operate. Skill extraction and classification remain an important component of the modern hiring process, as they provide a more objective way to evaluate candidates and automatically align their skills with the job requirements. However, to effectively evaluate the skills, the skill extraction tools must recognize varied mentions of skills on resumes, including direct mentions, implications, synonyms, acronyms, phrases, and proficiency levels, and differentiate between hard and soft skills. While tools like LLMs (Large Language Models) help extract and categorize skills from job applications, there is a lack of comprehensive datasets for evaluating the effectiveness of these models in accurately identifying and classifying skills in Spanish-language job applications. This gap hinders our ability to assess the reliability and precision of the models, which is crucial for ensuring that the selected candidates truly possess the required skills for the job. In this paper, we develop a Spanish language dataset for skill extraction and classification, provide annotation methodology to distinguish between knowledge, skill, and abilities, and provide deep learning baselines to advance robust solutions for skill classification.

Title: Performance Comparison of Large Language Models on Advanced Calculus Problems

Authors: In Hak Moon
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.03960
Pdf URL: https://arxiv.org/pdf/2503.03960
Copy Paste: [[2503.03960]] Performance Comparison of Large Language Models on Advanced Calculus Problems(https://arxiv.org/abs/2503.03960)
Keywords: language model, gpt, llm, prompt, chat
Abstract: This paper presents an in-depth analysis of the performance of seven different Large Language Models (LLMs) in solving a diverse set of advanced calculus problems. The study evaluates the accuracy, reliability, and problem-solving capabilities of ChatGPT 4o, Gemini Advanced with 1.5 Pro, Copilot Pro, Claude 3.5 Sonnet, Meta AI, Mistral AI, and Perplexity. The assessment was conducted through a series of thirty-two test problems, encompassing a total of 320 points. The problems covered various topics, from vector calculations and geometric interpretations to integral evaluations and optimization tasks. The results highlight significant trends and patterns in the models' performance, revealing both their strengths and weaknesses. For instance, models like ChatGPT 4o and Mistral AI demonstrated consistent accuracy across various problem types, indicating their robustness and reliability in mathematical problem-solving, while models such as Gemini Advanced with 1.5 Pro and Meta AI exhibited specific weaknesses, particularly in complex problems involving integrals and optimization, suggesting areas for targeted improvements. The study also underscores the importance of re-prompting in achieving accurate solutions, as seen in several instances where models initially provided incorrect answers but corrected them upon re-prompting. Overall, this research provides valuable insights into the current capabilities and limitations of LLMs in calculus. The detailed analysis of each model's performance on specific problems offers a comprehensive understanding of their strengths and areas for improvement, contributing to the ongoing development and refinement of LLM technology. The findings are particularly relevant for educators, researchers, and developers seeking to leverage LLMs for educational and practical applications in mathematics.

Title: On the Acquisition of Shared Grammatical Representations in Bilingual Language Models

Authors: Catherine Arnett, Tyler A. Chang, James A. Michaelov, Benjamin K. Bergen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.03962
Pdf URL: https://arxiv.org/pdf/2503.03962
Copy Paste: [[2503.03962]] On the Acquisition of Shared Grammatical Representations in Bilingual Language Models(https://arxiv.org/abs/2503.03962)
Keywords: language model
Abstract: While crosslingual transfer is crucial to contemporary language models' multilingual capabilities, how it occurs is not well understood. In this paper, we ask what happens to a monolingual language model when it begins to be trained on a second language. Specifically, we train small bilingual models for which we control the amount of data for each language and the order of language exposure. To find evidence of shared multilingual representations, we turn to structural priming, a method used to study grammatical representations in humans. We first replicate previous crosslingual structural priming results and find that after controlling for training data quantity and language exposure, there are asymmetrical effects across language pairs and directions. We argue that this asymmetry may shape hypotheses about human structural priming effects. We also find that structural priming effects are less robust for less similar language pairs, highlighting potential limitations of crosslingual transfer learning and shared representations for typologically diverse languages.

Title: ReasonGraph: Visualisation of Reasoning Paths

Authors: Zongqian Li, Ehsan Shareghi, Nigel Collier
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2503.03979
Pdf URL: https://arxiv.org/pdf/2503.03979
Copy Paste: [[2503.03979]] ReasonGraph: Visualisation of Reasoning Paths(https://arxiv.org/abs/2503.03979)
Keywords: language model, llm
Abstract: The reasoning processes of Large Language Models (LLMs) are challenging to analyze due to their complexity and the lack of organized visualization tools. We present ReasonGraph, a web-based platform for visualizing and analyzing LLM reasoning processes. It supports both sequential and tree-based reasoning methods while integrating with major LLM providers and over fifty state-of-the-art models. ReasonGraph incorporates an intuitive UI with meta reasoning method selection, configurable visualization parameters, and a modular framework that facilitates efficient extension. Our evaluation shows high parsing reliability, efficient processing, and strong usability across various downstream applications. By providing a unified visualization framework, ReasonGraph reduces cognitive load in analyzing complex reasoning paths, improves error detection in logical processes, and enables more effective development of LLM-based applications. The platform is open-source, promoting accessibility and reproducibility in LLM reasoning analysis.

Title: Benchmarking Large Language Models on Multiple Tasks in Bioinformatics NLP with Prompting

Authors: Jiyue Jiang, Pengan Chen, Jiuming Wang, Dongchen He, Ziqin Wei, Liang Hong, Licheng Zong, Sheng Wang, Qinze Yu, Zixian Ma, Yanyu Chen, Yimin Fan, Xiangyu Shi, Jiawei Sun, Chuan Wu, Yu Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.04013
Pdf URL: https://arxiv.org/pdf/2503.04013
Copy Paste: [[2503.04013]] Benchmarking Large Language Models on Multiple Tasks in Bioinformatics NLP with Prompting(https://arxiv.org/abs/2503.04013)
Keywords: language model, gpt, llm, prompt, chain-of-thought
Abstract: Large language models (LLMs) have become important tools in solving biological problems, offering improvements in accuracy and adaptability over conventional methods. Several benchmarks have been proposed to evaluate the performance of these LLMs. However, current benchmarks can hardly evaluate the performance of these models across diverse tasks effectively. In this paper, we introduce a comprehensive prompting-based benchmarking framework, termed Bio-benchmark, which includes 30 key bioinformatics tasks covering areas such as proteins, RNA, drugs, electronic health records, and traditional Chinese medicine. Using this benchmark, we evaluate six mainstream LLMs, including GPT-4o and Llama-3.1-70b, using 0-shot and few-shot Chain-of-Thought (CoT) settings without fine-tuning to reveal their intrinsic capabilities. To improve the efficiency of our evaluations, we demonstrate BioFinder, a new tool for extracting answers from LLM responses, which increases extraction accuracy by around 30% compared to existing methods. Our benchmark results show the biological tasks suitable for current LLMs and identify specific areas requiring enhancement. Furthermore, we propose targeted prompt engineering strategies for optimizing LLM performance in these contexts. Based on these findings, we provide recommendations for the development of more robust LLMs tailored for various biological applications. This work offers a comprehensive evaluation framework and robust tools to support the application of LLMs in bioinformatics.

Title: Uncovering inequalities in new knowledge learning by large language models across different languages

Authors: Chenglong Wang, Haoyu Tang, Xiyuan Yang, Yueqi Xie, Jina Suh, Sunayana Sitaram, Junming Huang, Yu Xie, Zhaoya Gong, Xing Xie, Fangzhao Wu
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2503.04064
Pdf URL: https://arxiv.org/pdf/2503.04064
Copy Paste: [[2503.04064]] Uncovering inequalities in new knowledge learning by large language models across different languages(https://arxiv.org/abs/2503.04064)
Keywords: language model, llm
Abstract: As large language models (LLMs) gradually become integral tools for problem solving in daily life worldwide, understanding linguistic inequality is becoming increasingly important. Existing research has primarily focused on static analyses that assess the disparities in the existing knowledge and capabilities of LLMs across languages. However, LLMs are continuously evolving, acquiring new knowledge to generate up-to-date, domain-specific responses. Investigating linguistic inequalities within this dynamic process is, therefore, also essential. In this paper, we explore inequalities in new knowledge learning by LLMs across different languages and four key dimensions: effectiveness, transferability, prioritization, and robustness. Through extensive experiments under two settings (in-context learning and fine-tuning) using both proprietary and open-source models, we demonstrate that low-resource languages consistently face disadvantages across all four dimensions. By shedding light on these disparities, we aim to raise awareness of linguistic inequalities in LLMs' new knowledge learning, fostering the development of more inclusive and equitable future LLMs.

Title: Chart-HQA: A Benchmark for Hypothetical Question Answering in Charts

Authors: Xiangnan Chen, Yuancheng Fang, Qian Xiao, Juncheng Li, Jun Lin, Siliang Tang, Yi Yang, Yueting Zhuang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.04095
Pdf URL: https://arxiv.org/pdf/2503.04095
Copy Paste: [[2503.04095]] Chart-HQA: A Benchmark for Hypothetical Question Answering in Charts(https://arxiv.org/abs/2503.04095)
Keywords: language model, llm
Abstract: Multimodal Large Language Models (MLLMs) have garnered significant attention for their strong visual-semantic understanding. Most existing chart benchmarks evaluate MLLMs' ability to parse information from charts to answer questions. However, they overlook the inherent output biases of MLLMs, where models rely on their parametric memory to answer questions rather than genuinely understanding the chart content. To address this limitation, we introduce a novel Chart Hypothetical Question Answering (HQA) task, which imposes assumptions on the same question to compel models to engage in counterfactual reasoning based on the chart content. Furthermore, we introduce HAI, a human-AI interactive data synthesis approach that leverages the efficient text-editing capabilities of LLMs alongside human expert knowledge to generate diverse and high-quality HQA data at a low cost. Using HAI, we construct Chart-HQA, a challenging benchmark synthesized from publicly available data sources. Evaluation results on 18 MLLMs of varying model sizes reveal that current models face significant generalization challenges and exhibit imbalanced reasoning performance on the HQA task.

Title: Disparities in LLM Reasoning Accuracy and Explanations: A Case Study on African American English

Authors: Runtao Zhou, Guangya Wan, Saadia Gabriel, Sheng Li, Alexander J Gates, Maarten Sap, Thomas Hartvigsen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.04099
Pdf URL: https://arxiv.org/pdf/2503.04099
Copy Paste: [[2503.04099]] Disparities in LLM Reasoning Accuracy and Explanations: A Case Study on African American English(https://arxiv.org/abs/2503.04099)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in reasoning tasks, leading to their widespread deployment. However, recent studies have highlighted concerning biases in these models, particularly in their handling of dialectal variations like African American English (AAE). In this work, we systematically investigate dialectal disparities in LLM reasoning tasks. We develop an experimental framework comparing LLM performance given Standard American English (SAE) and AAE prompts, combining LLM-based dialect conversion with established linguistic analyses. We find that LLMs consistently produce less accurate responses and simpler reasoning chains and explanations for AAE inputs compared to equivalent SAE questions, with disparities most pronounced in social science and humanities domains. These findings highlight systematic differences in how LLMs process and reason about different language varieties, raising important questions about the development and deployment of these systems in our multilingual and multidialectal world. Our code repository is publicly available at this https URL.

Title: LLMs Can Generate a Better Answer by Aggregating Their Own Responses

Authors: Zichong Li, Xinyu Feng, Yuheng Cai, Zixuan Zhang, Tianyi Liu, Chen Liang, Weizhu Chen, Haoyu Wang, Tuo Zhao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.04104
Pdf URL: https://arxiv.org/pdf/2503.04104
Copy Paste: [[2503.04104]] LLMs Can Generate a Better Answer by Aggregating Their Own Responses(https://arxiv.org/abs/2503.04104)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) have shown remarkable capabilities across tasks, yet they often require additional prompting techniques when facing complex problems. While approaches like self-correction and response selection have emerged as popular solutions, recent studies have shown these methods perform poorly when relying on the LLM itself to provide feedback or selection criteria. We argue this limitation stems from the fact that common LLM post-training procedures lack explicit supervision for discriminative judgment tasks. In this paper, we propose Generative Self-Aggregation (GSA), a novel prompting method that improves answer quality without requiring the model's discriminative capabilities. GSA first samples multiple diverse responses from the LLM, then aggregates them to obtain an improved solution. Unlike previous approaches, our method does not require the LLM to correct errors or compare response quality; instead, it leverages the model's generative abilities to synthesize a new response based on the context of multiple samples. While GSA shares similarities with the self-consistency (SC) approach for response aggregation, SC requires specific verifiable tokens to enable majority voting. In contrast, our approach is more general and can be applied to open-ended tasks. Empirical evaluation demonstrates that GSA effectively improves response quality across various tasks, including mathematical reasoning, knowledge-based problems, and open-ended generation tasks such as code synthesis and conversational responses.
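The two GSA steps described in the abstract (sample several responses, then synthesize a new one from them) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `generate` callable, the `num_samples` default, and the wording of the aggregation prompt are all assumptions made for the example.

```python
from typing import Callable, List


def generative_self_aggregation(
    question: str,
    generate: Callable[[str], str],  # hypothetical wrapper around any LLM call
    num_samples: int = 4,            # assumed default, not taken from the paper
) -> str:
    """Sample several responses, then ask the same model to write a new one."""
    # Step 1: draw multiple candidate responses (diversity would normally
    # come from sampling with a non-zero temperature inside `generate`).
    candidates: List[str] = [generate(question) for _ in range(num_samples)]

    # Step 2: show the candidates to the model as context.
    numbered = "\n\n".join(
        f"Response {i + 1}:\n{text}" for i, text in enumerate(candidates)
    )
    aggregation_prompt = (
        f"Question:\n{question}\n\n"
        f"Candidate responses:\n{numbered}\n\n"
        "Using the candidate responses above as context, "
        "write a new, improved answer."
    )

    # Step 3: the model generates a fresh answer. It is never asked to
    # rank, verify, or select among the candidates.
    return generate(aggregation_prompt)
```

Because the final step is plain generation rather than voting, a sketch like this needs no extractable final token, which is the property the abstract contrasts with self-consistency.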

Title: Uncovering Gaps in How Humans and LLMs Interpret Subjective Language

Authors: Erik Jones, Arjun Patrawala, Jacob Steinhardt
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2503.04113
Pdf URL: https://arxiv.org/pdf/2503.04113
Copy Paste: [[2503.04113]] Uncovering Gaps in How Humans and LLMs Interpret Subjective Language(https://arxiv.org/abs/2503.04113)
Keywords: language model, llm, prompt
Abstract: Humans often rely on subjective natural language to direct language models (LLMs); for example, users might instruct the LLM to write an enthusiastic blogpost, while developers might train models to be helpful and harmless using LLM-based edits. The LLM's operational semantics of such subjective phrases -- how it adjusts its behavior when each phrase is included in the prompt -- thus dictates how aligned it is with human intent. In this work, we uncover instances of misalignment between LLMs' actual operational semantics and what humans expect. Our method, TED (thesaurus error detector), first constructs a thesaurus that captures whether two phrases have similar operational semantics according to the LLM. It then elicits failures by unearthing disagreements between this thesaurus and a human-constructed reference. TED routinely produces surprising instances of misalignment; for example, Mistral 7B Instruct produces more harassing outputs when it edits text to be witty, and Llama 3 8B Instruct produces dishonest articles when instructed to make the articles enthusiastic. Our results demonstrate that humans can uncover unexpected LLM behavior by scrutinizing relationships between abstract concepts, without supervising outputs directly.

Title: Biological Sequence with Language Model Prompting: A Survey

Authors: Jiyue Jiang, Zikang Wang, Yuheng Shan, Heyan Chai, Jiayi Li, Zixian Ma, Xinrui Zhang, Yu Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.04135
Pdf URL: https://arxiv.org/pdf/2503.04135
Copy Paste: [[2503.04135]] Biological Sequence with Language Model Prompting: A Survey(https://arxiv.org/abs/2503.04135)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) have emerged as powerful tools for addressing challenges across diverse domains. Notably, recent studies have demonstrated that large language models significantly enhance the efficiency of biomolecular analysis and synthesis, attracting widespread attention from academia and medicine. In this paper, we systematically investigate the application of prompt-based methods with LLMs to biological sequences, including DNA, RNA, proteins, and drug discovery tasks. Specifically, we focus on how prompt engineering enables LLMs to tackle domain-specific problems, such as promoter sequence prediction, protein structure modeling, and drug-target binding affinity prediction, often with limited labeled data. Furthermore, our discussion highlights the transformative potential of prompting in bioinformatics while addressing key challenges such as data scarcity, multimodal fusion, and computational resource limitations. Our aim is for this paper to function both as a foundational primer for newcomers and a catalyst for continued innovation within this dynamic field of study.

Title: Ticktack : Long Span Temporal Alignment of Large Language Models Leveraging Sexagenary Cycle Time Expression

Authors: Xue Han, Qian Hu, Yitong Wang, Wenchun Gao, Lianlian Zhang, Qing Wang, Lijun Mei, Chao Deng, Junlan Feng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.04150
Pdf URL: https://arxiv.org/pdf/2503.04150
Copy Paste: [[2503.04150]] Ticktack : Long Span Temporal Alignment of Large Language Models Leveraging Sexagenary Cycle Time Expression(https://arxiv.org/abs/2503.04150)
Keywords: language model, llm
Abstract: Large language models (LLMs) suffer from temporal misalignment, especially across long spans of time. The issue arises because LLMs are trained on large amounts of data in which temporal information is rather sparse over long periods, such as thousands of years, resulting in insufficient learning or catastrophic forgetting by the LLMs. This paper proposes a methodology named "Ticktack" for addressing LLMs' long-time-span misalignment in a yearly setting. Specifically, we first propose to utilize the sexagenary year expression instead of the Gregorian year expression employed by LLMs, achieving a more uniform distribution in yearly granularity. Then, we employ polar coordinates to model the sexagenary cycle of 60 terms and the year order within each term, with additional temporal encoding to ensure LLMs understand them. Finally, we present a temporal representational alignment approach for post-training LLMs that effectively distinguishes time points with relevant knowledge, hence improving performance on time-related tasks, particularly over a long period. We also create a long-time-span benchmark for evaluation. Experimental results demonstrate the effectiveness of our proposal.
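The sexagenary representation described in the abstract above can be illustrated with a minimal sketch. It maps a Gregorian (CE) year to its position in the 60-year cycle and then to a polar-style pair. The function names, the romanized stem and branch labels, and the choice of using the cycle count as the radius and the in-cycle position as the angle are assumptions made for illustration, not the paper's encoding; the sketch also ignores the lunar new year boundary.

```python
import math

# Ten heavenly stems and twelve earthly branches (romanized labels).
STEMS = ["jia", "yi", "bing", "ding", "wu", "ji", "geng", "xin", "ren", "gui"]
BRANCHES = ["zi", "chou", "yin", "mao", "chen", "si",
            "wu", "wei", "shen", "you", "xu", "hai"]


def sexagenary(year):
    """Map a CE year to (cycle_number, index_in_cycle, name).

    Year 4 CE is taken as the start of a cycle (a jia-zi year),
    so 1984 and 2044 also start cycles.
    """
    cycle, index = divmod(year - 4, 60)
    name = f"{STEMS[index % 10]}-{BRANCHES[index % 12]}"
    return cycle, index, name


def polar(year):
    """Hypothetical polar encoding: radius = cycle count, angle = position in cycle."""
    cycle, index, _ = sexagenary(year)
    theta = 2 * math.pi * index / 60
    return cycle, theta


print(sexagenary(1984))
print(sexagenary(2024))
print(polar(2024))
```

Because every cycle reuses the same 60 labels, years separated by a multiple of 60 share an angle and differ only in the radius, which is one way to read the "more uniform distribution" the abstract mentions.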

Title: BPQA Dataset: Evaluating How Well Language Models Leverage Blood Pressures to Answer Biomedical Questions

Authors: Chi Hang, Ruiqi Deng, Lavender Yao Jiang, Zihao Yang, Anton Alyakin, Daniel Alber, Eric Karl Oermann
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.04155
Pdf URL: https://arxiv.org/pdf/2503.04155
Copy Paste: [[2503.04155]] BPQA Dataset: Evaluating How Well Language Models Leverage Blood Pressures to Answer Biomedical Questions(https://arxiv.org/abs/2503.04155)
Keywords: language model, gpt
Abstract: Clinical measurements such as blood pressures and respiration rates are critical in diagnosing and monitoring patient outcomes. They are an important component of biomedical data, which can be used to train transformer-based language models (LMs) for improving healthcare delivery. It is, however, unclear whether LMs can effectively interpret and use clinical measurements. We investigate two questions: First, can LMs effectively leverage clinical measurements to answer related medical questions? Second, how can an LM's performance be enhanced on medical question-answering (QA) tasks that involve measurements? We performed a case study on blood pressure readings (BPs), a vital sign routinely monitored by medical professionals. We evaluated the performance of four LMs (BERT, BioBERT, MedAlpaca, and GPT-3.5) on our newly developed dataset, BPQA (Blood Pressure Question Answering). BPQA contains 100 medical QA pairs that were verified by medical students and designed to rely on BPs. We found that GPT-3.5 and MedAlpaca (large and medium-sized LMs) benefit more from the inclusion of BPs than BERT and BioBERT (small LMs). Further, augmenting measurements with labels improves the performance of BioBERT and MedAlpaca (domain-specific LMs), suggesting that retrieval may be useful for improving domain-specific LMs.

Title: Measuring temporal effects of agent knowledge by date-controlled tool use

Authors: R. Patrick Xian, Qiming Cui, Stefan Bauer, Reza Abbasi-Asl
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2503.04188
Pdf URL: https://arxiv.org/pdf/2503.04188
Copy Paste: [[2503.04188]] Measuring temporal effects of agent knowledge by date-controlled tool use(https://arxiv.org/abs/2503.04188)
Keywords: language model, llm, prompt, chain-of-thought, agent
Abstract: Temporal progression is an integral part of knowledge accumulation and update. Web search is frequently adopted as grounding for agent knowledge, yet its inappropriate configuration affects the quality of agent responses. Here, we construct a tool-based out-of-sample testing framework to measure the knowledge variability of large language model (LLM) agents from distinct date-controlled tools (DCTs). We demonstrate the temporal effects of an LLM agent as a writing assistant, which can use web search to help complete scientific publication abstracts. We show that temporal effects of the search engine translate into tool-dependent agent performance but can be alleviated with base model choice and explicit reasoning instructions such as chain-of-thought prompting. Our results indicate that agent evaluation should take a dynamical view and account for the temporal influence of tools and the updates of external resources.

Title: Knowledge-Decoupled Synergetic Learning: An MLLM based Collaborative Approach to Few-shot Multimodal Dialogue Intention Recognition

Authors: Bin Chen, Yu Zhang, Hongfei Ye, Ziyi Huang, Hongyang Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.04201
Pdf URL: https://arxiv.org/pdf/2503.04201
Copy Paste: [[2503.04201]] Knowledge-Decoupled Synergetic Learning: An MLLM based Collaborative Approach to Few-shot Multimodal Dialogue Intention Recognition(https://arxiv.org/abs/2503.04201)
Keywords: language model, llm
Abstract: Few-shot multimodal dialogue intention recognition is a critical challenge in the e-commerce domain. Previous methods have primarily enhanced model classification capabilities through post-training techniques. However, our analysis reveals that training for few-shot multimodal dialogue intention recognition involves two interconnected tasks, leading to a seesaw effect in multi-task learning. This phenomenon is attributed to knowledge interference stemming from the superposition of weight matrix updates during the training process. To address these challenges, we propose Knowledge-Decoupled Synergetic Learning (KDSL), which mitigates these issues by utilizing smaller models to transform knowledge into interpretable rules, while applying the post-training of larger models. By facilitating collaboration between the large and small multimodal large language models for prediction, our approach demonstrates significant improvements. Notably, we achieve outstanding results on two real Taobao datasets, with enhancements of 6.37% and 6.28% in online weighted F1 scores compared to the state-of-the-art method, thereby validating the efficacy of our framework.

Title: FuseChat-3.0: Preference Optimization Meets Heterogeneous Model Fusion

Authors: Ziyi Yang, Fanqi Wan, Longguang Zhong, Canbin Huang, Guosheng Liang, Xiaojun Quan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.04222
Pdf URL: https://arxiv.org/pdf/2503.04222
Copy Paste: [[2503.04222]] FuseChat-3.0: Preference Optimization Meets Heterogeneous Model Fusion(https://arxiv.org/abs/2503.04222)
Keywords: language model, llm, chat
Abstract: We introduce FuseChat-3.0, a suite of large language models (LLMs) developed by integrating the strengths of heterogeneous source LLMs into more compact target LLMs. Our source models include the powerful Gemma-2-27B-it, Mistral-Large-Instruct-2407, Qwen-2.5-72B-Instruct, and Llama-3.1-70B-Instruct. For target models, we focus on three widely-used smaller variants (Llama-3.1-8B-Instruct, Gemma-2-9B-it, and Qwen-2.5-7B-Instruct) along with two ultra-compact options, Llama-3.2-3B-Instruct and Llama-3.2-1B-Instruct. To leverage the diverse capabilities of these source models, we develop a specialized data construction protocol tailored to various tasks and domains. The FuseChat-3.0 training pipeline consists of two key stages: (1) supervised fine-tuning (SFT) to align the target and source model distributions, and (2) Direct Preference Optimization (DPO) to apply preferences from multiple source LLMs to fine-tune the target model. The resulting FuseChat-3.0 models exhibit significant performance gains across tasks such as instruction following, general knowledge, mathematics, and coding. As illustrated in Figure 1, using Llama-3.1-8B-Instruct as the target model, our fusion approach achieves an average improvement of 6.8 points across 14 benchmarks. Moreover, it demonstrates remarkable gains of 37.1 points and 30.1 points on the instruction-following benchmarks AlpacaEval-2 and Arena-Hard, respectively. Our code, models, and datasets are available at this https URL.
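The second training stage named in the abstract above is Direct Preference Optimization. As a rough illustration, the sketch below computes the standard DPO loss for a single preference pair from sequence log-probabilities. The function name, argument names, and the value of `beta` are assumptions; the abstract does not say how FuseChat-3.0 builds its preference pairs from the source models, so that step is not shown.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Each argument is the summed log-probability of a full response under
    the trainable policy or the frozen reference model.
    """
    # How much more the policy prefers the chosen response than the reference does.
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    z = beta * margin
    # -log(sigmoid(z)) written as a numerically stable softplus(-z).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# A policy identical to the reference gives log(2); a policy that already
# favours the chosen response gives a smaller loss.
print(dpo_loss(-1.0, -1.0, -1.0, -1.0))
print(dpo_loss(-1.0, -5.0, -2.0, -2.0, beta=1.0))
```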

Title: Tgea: An error-annotated dataset and benchmark tasks for text generation from pretrained language models

Authors: Jie He, Bo Peng, Yi Liao, Qun Liu, Deyi Xiong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.04232
Pdf URL: https://arxiv.org/pdf/2503.04232
Copy Paste: [[2503.04232]] Tgea: An error-annotated dataset and benchmark tasks for text generation from pretrained language models(https://arxiv.org/abs/2503.04232)
Keywords: language model, gpt, prompt
Abstract: In order to deeply understand the capability of pretrained language models in text generation and conduct a diagnostic evaluation, we propose TGEA, an error-annotated dataset with multiple benchmark tasks for text generation from pretrained language models (PLMs). We use carefully selected prompt words to guide GPT-2 to generate candidate sentences, from which we select 47K for error annotation. Crowdsourced workers manually check each of these sentences and detect 12K erroneous sentences. We create an error taxonomy to cover 24 types of errors occurring in these erroneous sentences according to the nature of errors with respect to linguistics and knowledge (e.g., common sense). For each erroneous span in PLM-generated sentences, we also detect another span that is closely associated with it. Each error is hence manually labeled with comprehensive annotations, including the span of the error, the associated span, the minimal correction to the error, the type of the error, and the rationale behind the error. Apart from the fully annotated dataset, we also present a detailed description of the data collection procedure, statistics and analysis of the dataset. This is the first dataset with comprehensive annotations for PLM-generated texts, which facilitates the diagnostic evaluation of PLM-based text generation. Furthermore, we use TGEA as a benchmark dataset and propose a series of automatic diagnosis tasks, including error detection, error type classification, associated span detection, and error rationale generation, to further promote future study on the automatic error detection and correction on texts generated by pretrained language models.

Title: DiffPO: Diffusion-styled Preference Optimization for Efficient Inference-Time Alignment of Large Language Models

Authors: Ruizhe Chen, Wenhao Chai, Zhifei Yang, Xiaotian Zhang, Joey Tianyi Zhou, Tony Quek, Soujanya Poria, Zuozhu Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.04240
Pdf URL: https://arxiv.org/pdf/2503.04240
Copy Paste: [[2503.04240]] DiffPO: Diffusion-styled Preference Optimization for Efficient Inference-Time Alignment of Large Language Models(https://arxiv.org/abs/2503.04240)
Keywords: language model, llm
Abstract: Inference-time alignment provides an efficient alternative for aligning LLMs with humans. However, these approaches still face challenges, such as limited scalability due to policy-specific value functions and latency during the inference phase. In this paper, we propose a novel approach, Diffusion-styled Preference Optimization (DiffPO), which provides an efficient and policy-agnostic solution for aligning LLMs with humans. By directly performing alignment at sentence level, DiffPO avoids the time latency associated with token-level generation. Designed as a plug-and-play module, DiffPO can be seamlessly integrated with various base models to enhance their alignment. Extensive experiments on AlpacaEval 2, MT-bench, and HH-RLHF demonstrate that DiffPO achieves superior alignment performance across various settings, achieving a favorable trade-off between alignment quality and inference-time latency. Furthermore, DiffPO demonstrates model-agnostic scalability, significantly improving the performance of large models such as Llama-3-70B.

Title: On Fact and Frequency: LLM Responses to Misinformation Expressed with Uncertainty

Authors: Yana van de Sande, Gunes Açar, Thabo van Woudenberg, Martha Larson
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2503.04271
Pdf URL: https://arxiv.org/pdf/2503.04271
Copy Paste: [[2503.04271]] On Fact and Frequency: LLM Responses to Misinformation Expressed with Uncertainty(https://arxiv.org/abs/2503.04271)
Keywords: gpt, llm, prompt
Abstract: We study LLM judgments of misinformation expressed with uncertainty. Our experiments study the response of three widely used LLMs (GPT-4o, LlaMA3, DeepSeek-v2) to misinformation propositions that have been verified false and then are transformed into uncertain statements according to an uncertainty typology. Our results show that after transformation, LLMs change their fact-checking classification from false to not-false in 25% of the cases. Analysis reveals that the change cannot be explained by predictors to which humans are expected to be sensitive, i.e., modality, linguistic cues, or argumentation strategy. The exception is doxastic transformations, which use linguistic cue phrases such as "It is believed ...". To gain further insight, we prompt the LLM to make another judgment about the transformed misinformation statements that is not related to truth value. Specifically, we study LLM estimates of the frequency with which people make the uncertain statement. We find a small but significant correlation between judgment of fact and estimation of frequency.

Title: Dual-Class Prompt Generation: Enhancing Indonesian Gender-Based Hate Speech Detection through Data Augmentation

Authors: Muhammad Amien Ibrahim, Faisal, Tora Sangputra Yopie Winarto, Zefanya Delvin Sulistiya
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.04279
Pdf URL: https://arxiv.org/pdf/2503.04279
Copy Paste: [[2503.04279]] Dual-Class Prompt Generation: Enhancing Indonesian Gender-Based Hate Speech Detection through Data Augmentation(https://arxiv.org/abs/2503.04279)
Keywords: language model, prompt
Abstract: Detecting gender-based hate speech in Indonesian social media remains challenging due to limited labeled datasets. While binary hate speech classification has advanced, a more granular category like gender-targeted hate speech is understudied because of class imbalance issues. This paper addresses this gap by comparing three data augmentation techniques for Indonesian gender-based hate speech detection. We evaluate backtranslation, single-class prompt generation (using only hate speech examples), and our proposed dual-class prompt generation (using both hate speech and non-hate speech examples). Experiments show that all augmentation methods improve classification performance, with our dual-class approach achieving the best results (88.5% accuracy, 88.1% F1-score using Random Forest). Semantic similarity analysis reveals that dual-class prompt generation produces the most novel content, while t-SNE visualizations confirm that these samples occupy distinct feature space regions while maintaining class characteristics. Our findings suggest that incorporating examples from both classes helps language models generate more diverse yet representative samples, effectively addressing limited data challenges in specialized hate speech detection.

Title: Solving Word-Sense Disambiguation and Word-Sense Induction with Dictionary Examples

Authors: Tadej Škvorc, Marko Robnik-Šikonja
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.04328
Pdf URL: https://arxiv.org/pdf/2503.04328
Copy Paste: [[2503.04328]] Solving Word-Sense Disambiguation and Word-Sense Induction with Dictionary Examples(https://arxiv.org/abs/2503.04328)
Keywords: language model, llm
Abstract: Many less-resourced languages struggle with a lack of large, task-specific datasets that are required for solving relevant tasks with modern transformer-based large language models (LLMs). On the other hand, many linguistic resources, such as dictionaries, are rarely used in this context despite their large information contents. We show how LLMs can be used to extend existing language resources in less-resourced languages for two important tasks: word-sense disambiguation (WSD) and word-sense induction (WSI). We approach the two tasks through the related but much more accessible word-in-context (WiC) task where, given a pair of sentences and a target word, a classification model is tasked with predicting whether the sense of a given word differs between sentences. We demonstrate that a well-trained model for this task can distinguish between different word senses and can be adapted to solve the WSD and WSI tasks. The advantage of using the WiC task, instead of directly predicting senses, is that the WiC task does not need pre-constructed sense inventories with a sufficient number of examples for each sense, which are rarely available in less-resourced languages. We show that sentence pairs for the WiC task can be successfully generated from dictionary examples using LLMs. The resulting prediction models outperform existing models on WiC, WSD, and WSI tasks. We demonstrate our methodology on the Slovene language, where a monolingual dictionary is available, but word-sense resources are tiny.

Title: Adding Alignment Control to Language Models

Authors: Wenhong Zhu, Weinan Zhang, Rui Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.04346
Pdf URL: https://arxiv.org/pdf/2503.04346
Copy Paste: [[2503.04346]] Adding Alignment Control to Language Models(https://arxiv.org/abs/2503.04346)
Keywords: language model
Abstract: Post-training alignment has increasingly become a crucial factor in enhancing the usability of language models (LMs). However, the strength of alignment varies depending on individual preferences. This paper proposes a method to incorporate alignment control into a single model, referred to as CLM. This approach adds one identity layer preceding the initial layers and performs preference learning only on this layer to map unaligned input token embeddings into the aligned space. Experimental results demonstrate that this efficient fine-tuning method performs comparably to full fine-tuning. During inference, the input embeddings are processed through the aligned and unaligned layers, which are then merged through the interpolation coefficient. By controlling this parameter, the alignment exhibits a clear interpolation and extrapolation phenomenon.
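The inference-time merge mentioned in the abstract above can be pictured as a linear blend of two embedding vectors. The sketch below is only an assumption about the form of that merge: the function name, the argument names, and the convention that a coefficient of 0 means "unaligned" and 1 means "aligned" are all hypothetical.

```python
def merge_embeddings(unaligned, aligned, lam):
    """Blend two embedding vectors with an interpolation coefficient.

    lam = 0 returns the unaligned embedding, lam = 1 the aligned one,
    values in between interpolate, and values above 1 extrapolate past
    the aligned embedding.
    """
    return [(1 - lam) * u + lam * a for u, a in zip(unaligned, aligned)]


unaligned = [0.0, 2.0]
aligned = [1.0, 4.0]
print(merge_embeddings(unaligned, aligned, 0.5))  # halfway between the two
print(merge_embeddings(unaligned, aligned, 2.0))  # extrapolated beyond aligned
```

Exposing a single scalar like this is what makes the alignment strength adjustable at inference time without retraining.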

Title: Layer-Specific Scaling of Positional Encodings for Superior Long-Context Modeling

Authors: Zhenghua Wang, Yiran Ding, Changze Lv, Zhibo Xu, Tianlong Li, Tianyuan Shi, Xiaoqing Zheng, Xuanjing Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.04355
Pdf URL: https://arxiv.org/pdf/2503.04355
Copy Paste: [[2503.04355]] Layer-Specific Scaling of Positional Encodings for Superior Long-Context Modeling(https://arxiv.org/abs/2503.04355)
Keywords: language model, llm
Abstract: Although large language models (LLMs) have achieved significant progress in handling long-context inputs, they still suffer from the "lost-in-the-middle" problem, where crucial information in the middle of the context is often underrepresented or lost. Our extensive experiments reveal that this issue may arise from the rapid long-term decay in Rotary Position Embedding (RoPE). To address this problem, we propose a layer-specific positional encoding scaling method that assigns distinct scaling factors to each layer, slowing down the decay rate caused by RoPE to make the model pay more attention to the middle context. A specially designed genetic algorithm is employed to efficiently select the optimal scaling factors for each layer by incorporating Bezier curves to reduce the search space. Through comprehensive experimentation, we demonstrate that our method significantly alleviates the "lost-in-the-middle" problem. Our approach results in an average accuracy improvement of up to 20% on the Key-Value Retrieval dataset. Furthermore, we show that layer-specific interpolation, as opposed to uniform interpolation across all layers, enhances the model's extrapolation capabilities when combined with PI and Dynamic-NTK positional encoding schemes.
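A minimal sketch of the two ideas in the abstract above: per-layer scaling of RoPE positions, and a Bezier curve that generates all per-layer factors from a few control values so a search only has to tune those values. The function names, the choice of dividing the position by the factor (as in position interpolation), and the control values are assumptions for illustration; the abstract does not specify the exact parametrization.

```python
def rope_angles(position, dim, scale=1.0, base=10000.0):
    """Rotation angles for one position, with the position divided by a scale factor."""
    return [(position / scale) * base ** (-2 * i / dim) for i in range(dim // 2)]


def bezier_scales(control, num_layers):
    """Evaluate a 1-D Bezier curve (De Casteljau) at evenly spaced layers.

    `control` holds a handful of control values; the result has one
    scaling factor per layer.
    """
    scales = []
    for layer in range(num_layers):
        t = layer / (num_layers - 1) if num_layers > 1 else 0.0
        pts = list(control)
        while len(pts) > 1:
            pts = [(1 - t) * a + t * b for a, b in zip(pts, pts[1:])]
        scales.append(pts[0])
    return scales


# Three control values describe the factors for every layer.
layer_scales = bezier_scales([1.0, 2.0, 1.0], num_layers=3)
print(layer_scales)
for layer, s in enumerate(layer_scales):
    print(layer, rope_angles(position=8, dim=4, scale=s))
```

A larger factor makes distant tokens look closer to that layer, which slows the decay of attention with distance; the Bezier parametrization keeps the number of searched values independent of model depth.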

Title: Exploring the Multilingual NLG Evaluation Abilities of LLM-Based Evaluators

Authors: Jiayi Chang, Mingqi Gao, Xinyu Hu, Xiaojun Wan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.04360
Pdf URL: https://arxiv.org/pdf/2503.04360
Copy Paste: [[2503.04360]] Exploring the Multilingual NLG Evaluation Abilities of LLM-Based Evaluators(https://arxiv.org/abs/2503.04360)
Keywords: llm, prompt
Abstract: Previous research has shown that LLMs have potential in multilingual NLG evaluation tasks. However, existing research has not fully explored the differences in the evaluation capabilities of LLMs across different languages. To this end, this study provides a comprehensive analysis of the multilingual evaluation performance of 10 recent LLMs, spanning high-resource and low-resource languages through correlation analysis, perturbation attacks, and fine-tuning. We found that 1) excluding the reference answer from the prompt and using large-parameter LLM-based evaluators leads to better performance across various languages; 2) most LLM-based evaluators show a higher correlation with human judgments in high-resource languages than in low-resource languages; 3) in the languages where they are most sensitive to such attacks, they also tend to exhibit the highest correlation with human judgments; and 4) fine-tuning with data from a particular language yields a broadly consistent enhancement in the model's evaluation performance across diverse languages. Our findings highlight the imbalance in LLMs' evaluation capabilities across different languages and suggest that low-resource language scenarios deserve more attention.

Title: Lost in Literalism: How Supervised Training Shapes Translationese in LLMs

Authors: Yafu Li, Ronghao Zhang, Zhilin Wang, Huajian Zhang, Leyang Cui, Yongjing Yin, Tong Xiao, Yue Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.04369
Pdf URL: https://arxiv.org/pdf/2503.04369
Copy Paste: [[2503.04369]] Lost in Literalism: How Supervised Training Shapes Translationese in LLMs(https://arxiv.org/abs/2503.04369)
Keywords: language model, llm
Abstract: Large language models (LLMs) have achieved remarkable success in machine translation, demonstrating impressive performance across diverse languages. However, translationese, characterized by overly literal and unnatural translations, remains a persistent challenge in LLM-based translation systems. Despite their pre-training on vast corpora of natural utterances, LLMs exhibit translationese errors and generate unexpected unnatural translations, stemming from biases introduced during supervised fine-tuning (SFT). In this work, we systematically evaluate the prevalence of translationese in LLM-generated translations and investigate its roots during supervised training. We introduce methods to mitigate these biases, including polishing golden references and filtering unnatural training instances. Empirical evaluations demonstrate that these approaches significantly reduce translationese while improving translation naturalness, validated by human evaluations and automatic metrics. Our findings highlight the need for training-aware adjustments to optimize LLM translation outputs, paving the way for more fluent and target-language-consistent translations. We release the data and code at this https URL.

Title: Dedicated Feedback and Edit Models Empower Inference-Time Scaling for Open-Ended General-Domain Tasks

Authors: Zhilin Wang, Jiaqi Zeng, Olivier Delalleau, Daniel Egert, Ellie Evans, Hoo-Chang Shin, Felipe Soares, Yi Dong, Oleksii Kuchaiev
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.04378
Pdf URL: https://arxiv.org/pdf/2503.04378
Copy Paste: [[2503.04378]] Dedicated Feedback and Edit Models Empower Inference-Time Scaling for Open-Ended General-Domain Tasks(https://arxiv.org/abs/2503.04378)
Keywords: chat
Abstract: Inference-Time Scaling has been critical to the success of recent models such as OpenAI o1 and DeepSeek R1. However, many techniques used to train models for inference-time scaling require tasks to have answers that can be verified, limiting their application to domains such as math, coding and logical reasoning. We take inspiration from how humans make first attempts, ask for detailed feedback from others and make improvements based on such feedback across a wide spectrum of open-ended endeavors. To this end, we collect data for and train dedicated Feedback and Edit Models that are capable of performing inference-time scaling for open-ended general-domain tasks. In our setup, one model generates an initial response, a second model gives feedback on it, and a third model uses that feedback to edit the response. We show that performance on Arena Hard, a benchmark strongly predictive of Chatbot Arena Elo, can be boosted by scaling the number of initial response drafts, effective feedback and edited responses. When scaled optimally, our setup based on 70B models from the Llama 3 family can reach SoTA performance on Arena Hard at 92.7 as of 5 Mar 2025, surpassing OpenAI o1-preview-2024-09-12 with 90.4 and DeepSeek R1 with 92.3.

Title: TRACT: Regression-Aware Fine-tuning Meets Chain-of-Thought Reasoning for LLM-as-a-Judge

Authors: Cheng-Han Chiang, Hung-yi Lee, Michal Lukasik
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.04381
Pdf URL: https://arxiv.org/pdf/2503.04381
Copy Paste: [[2503.04381]] TRACT: Regression-Aware Fine-tuning Meets Chain-of-Thought Reasoning for LLM-as-a-Judge(https://arxiv.org/abs/2503.04381)
Keywords: language model, llm, chain-of-thought
Abstract: The LLM-as-a-judge paradigm uses large language models (LLMs) for automated text evaluation, where a numerical assessment is assigned by an LLM to the input text following scoring rubrics. Existing methods for LLM-as-a-judge use cross-entropy (CE) loss for fine-tuning, which neglects the numeric nature of score prediction. Recent work addresses numerical prediction limitations of LLM fine-tuning through regression-aware fine-tuning, which, however, does not consider chain-of-thought (CoT) reasoning for score prediction. In this paper, we introduce TRACT (Two-stage Regression-Aware fine-tuning with CoT), a method combining CoT reasoning with regression-aware training. TRACT consists of two stages: first, a seed LLM is fine-tuned to generate CoTs, which serve as supervision for the second-stage fine-tuning. The training objective of TRACT combines the CE loss for learning the CoT reasoning capabilities and the regression-aware loss for the score prediction. Experiments across four LLM-as-a-judge datasets and two LLMs show that TRACT significantly outperforms existing methods. Extensive ablation studies validate the importance of each component in TRACT.
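The combined objective described in the abstract above can be sketched as a cross-entropy term over the CoT tokens plus a squared-error term on the expected score. This is an assumed form for illustration only: the function names, the use of the softmax-weighted expected score as the prediction, and the `weight` parameter are hypothetical, and the abstract does not give the exact loss.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cot_cross_entropy(token_log_probs):
    """Mean negative log-likelihood of the reference CoT tokens."""
    return -sum(token_log_probs) / len(token_log_probs)


def regression_aware_loss(score_logits, score_values, target):
    """Squared error between the expected score under the model and the target."""
    probs = softmax(score_logits)
    expected = sum(p * v for p, v in zip(probs, score_values))
    return (expected - target) ** 2


def combined_loss(token_log_probs, score_logits, score_values, target, weight=1.0):
    """CE on the reasoning tokens plus a weighted regression-aware term."""
    return cot_cross_entropy(token_log_probs) + weight * regression_aware_loss(
        score_logits, score_values, target
    )


# Uniform logits over scores 1..5 give an expected score of 3.
print(regression_aware_loss([0.0] * 5, [1, 2, 3, 4, 5], target=5.0))
print(combined_loss([math.log(0.5)] * 2, [0.0] * 5, [1, 2, 3, 4, 5], target=3.0))
```

Unlike plain cross-entropy on the score token, the squared-error term penalizes a prediction of 4 less than a prediction of 1 when the target is 5, which is the numeric structure the abstract says CE neglects.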

Title: More Documents, Same Length: Isolating the Challenge of Multiple Documents in RAG

Authors: Shahar Levy, Nir Mazor, Lihi Shalmon, Michael Hassid, Gabriel Stanovsky
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.04388
Pdf URL: https://arxiv.org/pdf/2503.04388
Copy Paste: [[2503.04388]] More Documents, Same Length: Isolating the Challenge of Multiple Documents in RAG(https://arxiv.org/abs/2503.04388)
Keywords: language model, llm, long context, retrieval-augmented generation
Abstract: Retrieval-augmented generation (RAG) provides LLMs with relevant documents. Although previous studies noted that retrieving many documents can degrade performance, they did not isolate how the quantity of documents affects performance while controlling for context length. We evaluate various language models on custom datasets derived from a multi-hop QA task. We keep the context length and position of relevant information constant while varying the number of documents, and find that increasing the document count in RAG settings poses significant challenges for LLMs. Additionally, our results indicate that processing multiple documents is a separate challenge from handling long contexts. We also make the datasets and code available: this https URL.
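One way to picture the controlled setup is a context builder that fixes the total word budget and the position of the relevant document while changing only how many filler documents share the remaining space. This is a sketch under stated assumptions, not the authors' construction: `build_context`, the equal split of the budget, and padding short fillers by repetition are all hypothetical.

```python
def build_context(relevant, fillers, num_fillers, budget):
    """Return a list of documents (each a list of words) totalling `budget` words.

    The relevant document always comes first, so its position is constant.
    The remaining budget is split as evenly as possible over `num_fillers`
    filler documents, so only the document count varies between settings.
    """
    if not 1 <= num_fillers <= len(fillers):
        raise ValueError("num_fillers must be between 1 and len(fillers)")
    remaining = budget - len(relevant)
    if remaining < num_fillers:
        raise ValueError("budget too small for the requested number of fillers")

    base, extra = divmod(remaining, num_fillers)
    docs = [relevant]
    for i in range(num_fillers):
        size = base + (1 if i < extra else 0)
        source = fillers[i]
        # Repeat the filler text cyclically if it is shorter than its share.
        docs.append([source[j % len(source)] for j in range(size)])
    return docs
```

Calling it with `num_fillers` set to 1, 2 and 4 under the same `budget` gives contexts of identical length that differ only in document count.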

Title: Shaping Shared Languages: Human and Large Language Models' Inductive Biases in Emergent Communication

Authors: Tom Kouwenhoven, Max Peeperkorn, Roy de Kleijn, Tessa Verhoef
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.04395
Pdf URL: https://arxiv.org/pdf/2503.04395
Copy Paste: [[2503.04395]] Shaping Shared Languages: Human and Large Language Models' Inductive Biases in Emergent Communication(https://arxiv.org/abs/2503.04395)
Keywords: language model, llm
Abstract: Languages are shaped by the inductive biases of their users. Using a classical referential game, we investigate how artificial languages evolve when optimised for inductive biases in humans and large language models (LLMs) via Human-Human, LLM-LLM and Human-LLM experiments. We show that referentially grounded vocabularies emerge that enable reliable communication in all conditions, even when humans and LLMs collaborate. Comparisons between conditions reveal that languages optimised for LLMs subtly differ from those optimised for humans. Interestingly, interactions between humans and LLMs alleviate these differences and result in vocabularies which are more human-like than LLM-like. These findings advance our understanding of how inductive biases in LLMs play a role in the dynamic nature of human language and contribute to maintaining alignment in human and machine communication. In particular, our work underscores the need to think of new methods that include human interaction in the training processes of LLMs, and shows that using communicative success as a reward signal can be a fruitful, novel direction.

Title: TableLoRA: Low-rank Adaptation on Table Structure Understanding for Large Language Models

Authors: Xinyi He, Yihao Liu, Mengyu Zhou, Yeye He, Haoyu Dong, Shi Han, Zejian Yuan, Dongmei Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.04396
Pdf URL: https://arxiv.org/pdf/2503.04396
Copy Paste: [[2503.04396]] TableLoRA: Low-rank Adaptation on Table Structure Understanding for Large Language Models(https://arxiv.org/abs/2503.04396)
Keywords: language model, llm
Abstract: Tabular data are crucial in many fields, and their understanding by large language models (LLMs) under a parameter-efficient paradigm is important. However, directly applying parameter-efficient fine-tuning (PEFT) techniques to tabular tasks presents significant challenges, particularly in terms of better table serialization and the representation of two-dimensional structured information within a one-dimensional sequence. To address this, we propose TableLoRA, a module designed to improve LLMs' understanding of table structure during PEFT. It incorporates special tokens for serializing tables with a special token encoder and uses 2D LoRA to encode low-rank information on cell positions. Experiments on four tabular-related datasets demonstrate that TableLoRA consistently outperforms vanilla LoRA and surpasses various table encoding methods tested in control experiments. These findings reveal that TableLoRA, as a table-specific LoRA, enhances the ability of LLMs to process tabular data effectively, especially in low-parameter settings, demonstrating its potential as a robust solution for handling table-related tasks.

Title: Comparative Study of Zero-Shot Cross-Lingual Transfer for Bodo POS and NER Tagging Using Gemini 2.0 Flash Thinking Experimental Model

Authors: Sanjib Narzary, Bihung Brahma, Haradip Mahilary, Mahananda Brahma, Bidisha Som, Sukumar Nandi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.04405
Pdf URL: https://arxiv.org/pdf/2503.04405
Copy Paste: [[2503.04405]] Comparative Study of Zero-Shot Cross-Lingual Transfer for Bodo POS and NER Tagging Using Gemini 2.0 Flash Thinking Experimental Model(https://arxiv.org/abs/2503.04405)
Keywords: prompt
Abstract: Named Entity Recognition (NER) and Part-of-Speech (POS) tagging are critical tasks for Natural Language Processing (NLP), yet their availability for low-resource languages (LRLs) like Bodo remains limited. This article presents a comparative empirical study investigating the effectiveness of Google's Gemini 2.0 Flash Thinking Experimental model for zero-shot cross-lingual transfer of POS and NER tagging to Bodo. We explore two distinct methodologies: (1) direct translation of English sentences to Bodo followed by tag transfer, and (2) prompt-based tag transfer on parallel English-Bodo sentence pairs. Both methods leverage the machine translation and cross-lingual understanding capabilities of Gemini 2.0 Flash Thinking Experimental to project English POS and NER annotations onto Bodo text in CoNLL-2003 format. Our findings reveal the capabilities and limitations of each approach, demonstrating that while both methods show promise for bootstrapping Bodo NLP, prompt-based transfer exhibits superior performance, particularly for NER. We provide a detailed analysis of the results, highlighting the impact of translation quality, grammatical divergences, and the inherent challenges of zero-shot cross-lingual transfer. The article concludes by discussing future research directions, emphasizing the need for hybrid approaches, few-shot fine-tuning, and the development of dedicated Bodo NLP resources to achieve high-accuracy POS and NER tagging for this low-resource language.

Title: Can Large Language Models Predict Antimicrobial Resistance Gene?

Authors: Hyunwoo Yoo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.04413
Pdf URL: https://arxiv.org/pdf/2503.04413
Copy Paste: [[2503.04413]] Can Large Language Models Predict Antimicrobial Resistance Gene?(https://arxiv.org/abs/2503.04413)
Keywords: language model
Abstract: This study demonstrates that generative large language models can be utilized in a more flexible manner for DNA sequence analysis and classification tasks compared to traditional transformer encoder-based models. While recent encoder-based models such as DNABERT and Nucleotide Transformer have shown significant performance in DNA sequence classification, transformer decoder-based generative models have not yet been extensively explored in this field. This study evaluates how effectively generative Large Language Models handle DNA sequences with various labels and analyzes performance changes when additional textual information is provided. Experiments were conducted on antimicrobial resistance genes, and the results show that generative Large Language Models can offer comparable or potentially better predictions, demonstrating flexibility and accuracy when incorporating both sequence and textual information. The code and data used in this work are available at the following GitHub repository: this https URL.

Title: Revisiting the Othello World Model Hypothesis

Authors: Yifei Yuan, Anders Søgaard
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.04421
Pdf URL: https://arxiv.org/pdf/2503.04421
Copy Paste: [[2503.04421]] Revisiting the Othello World Model Hypothesis(https://arxiv.org/abs/2503.04421)
Keywords: language model, gpt
Abstract: Li et al. (2023) used the Othello board game as a test case for the ability of GPT-2 to induce world models, and were followed up by Nanda et al. (2023b). We briefly discuss the original experiments, expanding them to include more language models with more comprehensive probing. Specifically, we analyze sequences of Othello board states and train the model to predict the next move based on previous moves. We evaluate seven language models (GPT-2, T5, Bart, Flan-T5, Mistral, LLaMA-2, and Qwen2.5) on the Othello task and conclude that these models not only learn to play Othello, but also induce the Othello board layout. We find that all models achieve up to 99% accuracy in unsupervised grounding and exhibit high similarity in the board features they learned. This provides considerably stronger evidence for the Othello World Model Hypothesis than previous works.

Title: A Dataset for Analysing News Framing in Chinese Media

Authors: Owen Cook, Yida Mu, Xinye Yang, Xingyi Song, Kalina Bontcheva
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.04439
Pdf URL: https://arxiv.org/pdf/2503.04439
Copy Paste: [[2503.04439]] A Dataset for Analysing News Framing in Chinese Media(https://arxiv.org/abs/2503.04439)
Keywords: gpt
Abstract: Framing is an essential device in news reporting, allowing the writer to influence public perceptions of current affairs. While there are existing automatic news framing detection datasets in various languages, none of them focus on news framing in the Chinese language which has complex character meanings and unique linguistic features. This study introduces the first Chinese News Framing dataset, to be used as either a stand-alone dataset or a supplementary resource to the SemEval-2023 task 3 dataset. We detail its creation and we run baseline experiments to highlight the need for such a dataset and create benchmarks for future research, providing results obtained through fine-tuning XLM-RoBERTa-Base and using GPT-4o in the zero-shot setting. We find that GPT-4o performs significantly worse than fine-tuned XLM-RoBERTa across all languages. For the Chinese language, we obtain an F1-micro (the performance metric for SemEval task 3, subtask 2) score of 0.719 using only samples from our Chinese News Framing dataset and a score of 0.753 when we augment the SemEval dataset with Chinese news framing samples. With positive news frame detection results, this dataset is a valuable resource for detecting news frames in the Chinese language and is a valuable supplement to the SemEval-2023 task 3 dataset.
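For readers unfamiliar with the metric, micro-averaged F1 pools true positives, false positives and false negatives over all instances and labels before computing precision and recall. The sketch below shows the standard multi-label computation. The function name and the label values in the test data are hypothetical and are not taken from the dataset.

```python
def f1_micro(gold, pred):
    """Micro-averaged F1 for multi-label predictions.

    gold and pred are parallel lists of label sets, one set per article.
    """
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)  # labels predicted and correct
        fp += len(p - g)  # labels predicted but not in the gold set
        fn += len(g - p)  # gold labels that were missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Because counts are pooled before averaging, frequent frames influence the score more than rare ones.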

Title: Guiding LLMs to Generate High-Fidelity and High-Quality Counterfactual Explanations for Text Classification

Authors: Van Bach Nguyen, Christin Seifert, Jörg Schlötterer
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.04463
Pdf URL: https://arxiv.org/pdf/2503.04463
Copy Paste: [[2503.04463]] Guiding LLMs to Generate High-Fidelity and High-Quality Counterfactual Explanations for Text Classification(https://arxiv.org/abs/2503.04463)
Keywords: language model, llm
Abstract: The need for interpretability in deep learning has driven interest in counterfactual explanations, which identify minimal changes to an instance that change a model's prediction. Current counterfactual (CF) generation methods require task-specific fine-tuning and produce low-quality text. Large Language Models (LLMs), though effective for high-quality text generation, struggle with label-flipping counterfactuals (i.e., counterfactuals that change the prediction) without fine-tuning. We introduce two simple classifier-guided approaches to support counterfactual generation by LLMs, eliminating the need for fine-tuning while preserving the strengths of LLMs. Despite their simplicity, our methods outperform state-of-the-art counterfactual generation methods and are effective across different LLMs, highlighting the benefits of guiding counterfactual generation by LLMs with classifier information. We further show that data augmentation by our generated CFs can improve a classifier's robustness. Our analysis reveals a critical issue in counterfactual generation by LLMs: LLMs rely on parametric knowledge rather than faithfully following the classifier.
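The abstract does not describe the two classifier-guided approaches in detail. As an illustration of the general idea only, the sketch below filters LLM-generated candidates with the classifier and keeps the label-flipping candidate closest to the original. Both `select_counterfactual` and the use of token-level edit distance as the notion of "minimal change" are assumptions, not the authors' method.

```python
def token_edit_distance(a, b):
    """Levenshtein distance between two token lists."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # delete tok_a
                            curr[j - 1] + 1,      # insert tok_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def select_counterfactual(original, candidates, classify, target_label):
    """Return the candidate that the classifier assigns `target_label`
    and that is closest to `original`, or None if no candidate flips the label.

    `classify` is any callable mapping a text to a label (hypothetical interface).
    """
    flipped = [c for c in candidates if classify(c) == target_label]
    if not flipped:
        return None
    orig_tokens = original.split()
    return min(flipped, key=lambda c: token_edit_distance(orig_tokens, c.split()))
```

Checking candidates against the classifier, rather than trusting the LLM's own judgement of the label, is what addresses the failure mode noted at the end of the abstract.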

Title: Generalized Interpolating Discrete Diffusion

Authors: Dimitri von Rütte, Janis Fluri, Yuhui Ding, Antonio Orvieto, Bernhard Schölkopf, Thomas Hofmann
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.04482
Pdf URL: https://arxiv.org/pdf/2503.04482
Copy Paste: [[2503.04482]] Generalized Interpolating Discrete Diffusion(https://arxiv.org/abs/2503.04482)
Keywords: language model, prompt
Abstract: While state-of-the-art language models achieve impressive results through next-token prediction, they have inherent limitations such as the inability to revise already generated tokens. This has prompted exploration of alternative approaches such as discrete diffusion. However, masked diffusion, which has emerged as a popular choice due to its simplicity and effectiveness, reintroduces this inability to revise words. To overcome this, we generalize masked diffusion and derive the theoretical backbone of a family of general interpolating discrete diffusion (GIDD) processes offering greater flexibility in the design of the noising processes. Leveraging a novel diffusion ELBO, we achieve compute-matched state-of-the-art performance in diffusion language modeling. Exploiting GIDD's flexibility, we explore a hybrid approach combining masking and uniform noise, leading to improved sample quality and unlocking the ability for the model to correct its own mistakes, an area where autoregressive models notoriously have struggled. Our code and models are open-source: this https URL
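To make the hybrid of masking and uniform noise concrete, the following toy forward-corruption step keeps each token with some probability and otherwise replaces it with either a mask token or a uniformly sampled vocabulary token. It is a deliberately simplified sketch: the names `corrupt`, `keep_prob` and `uniform_frac` are hypothetical, and the time-dependent schedule and the ELBO derived in the paper are not modelled.

```python
import random

MASK = "[MASK]"


def corrupt(tokens, vocab, keep_prob, uniform_frac, rng):
    """Apply one toy noising step to a token sequence.

    keep_prob:    probability that a token is left untouched.
    uniform_frac: among corrupted tokens, the fraction replaced by a random
                  vocabulary token instead of the mask token.
    rng:          a random.Random instance, passed in for reproducibility.
    """
    noised = []
    for tok in tokens:
        if rng.random() < keep_prob:
            noised.append(tok)
        elif rng.random() < uniform_frac:
            noised.append(rng.choice(vocab))  # uniform noise: may be a wrong word
        else:
            noised.append(MASK)               # masking noise
    return noised
```

With `uniform_frac=0` this reduces to pure masking. With a positive value, some visible tokens are wrong, so a denoiser trained on such inputs has to learn to revise unmasked tokens as well as fill in masked ones.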

Title: Large Language Models in Bioinformatics: A Survey

Authors: Zhenyu Wang, Zikang Wang, Jiyue Jiang, Pengan Chen, Xiangyu Shi, Yu Li
Subjects: cs.CL, q-bio.GN
Abstract URL: https://arxiv.org/abs/2503.04490
Pdf URL: https://arxiv.org/pdf/2503.04490
Copy Paste: [[2503.04490]] Large Language Models in Bioinformatics: A Survey(https://arxiv.org/abs/2503.04490)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) are revolutionizing bioinformatics, enabling advanced analysis of DNA, RNA, proteins, and single-cell data. This survey provides a systematic review of recent advancements, focusing on genomic sequence modeling, RNA structure prediction, protein function inference, and single-cell transcriptomics. Meanwhile, we also discuss several key challenges, including data scarcity, computational complexity, and cross-omics integration, and explore future directions such as multimodal learning, hybrid AI models, and clinical applications. By offering a comprehensive perspective, this paper underscores the transformative potential of LLMs in driving innovations in bioinformatics and precision medicine.

Title: Keeping Yourself is Important in Downstream Tuning Multimodal Large Language Model

Authors: Wenke Huang, Jian Liang, Xianda Guo, Yiyang Fang, Guancheng Wan, Xuankun Rong, Chi Wen, Zekun Shi, Qingyun Li, Didi Zhu, Yanbiao Ma, Ke Liang, Bin Yang, He Li, Jiawei Shao, Mang Ye, Bo Du
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.04543
Pdf URL: https://arxiv.org/pdf/2503.04543
Copy Paste: [[2503.04543]] Keeping Yourself is Important in Downstream Tuning Multimodal Large Language Model(https://arxiv.org/abs/2503.04543)
Keywords: language model, llm
Abstract: Multi-modal Large Language Models (MLLMs) integrate visual and linguistic reasoning to address complex tasks such as image captioning and visual question answering. While MLLMs demonstrate remarkable versatility, they show limited performance on specialized applications. However, tuning MLLMs for downstream tasks encounters two key challenges: Task-Expert Specialization, where distribution shifts between pre-training and target datasets constrain target performance, and Open-World Stabilization, where catastrophic forgetting erases the model's general knowledge. In this work, we systematically review recent advancements in MLLM tuning methodologies, classifying them into three paradigms: (I) Selective Tuning, (II) Additive Tuning, and (III) Reparameterization Tuning. Furthermore, we benchmark these tuning strategies across popular MLLM architectures and diverse downstream tasks to establish standardized evaluation analysis and systematic tuning principles. Finally, we highlight several open challenges in this domain and propose future research directions. To facilitate ongoing progress in this rapidly evolving field, we provide a public repository that continuously tracks developments: this https URL.

Title: Compositional Translation: A Novel LLM-based Approach for Low-resource Machine Translation

Authors: Armel Zebaze, Benoît Sagot, Rachel Bawden
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.04554
Pdf URL: https://arxiv.org/pdf/2503.04554
Copy Paste: [[2503.04554]] Compositional Translation: A Novel LLM-based Approach for Low-resource Machine Translation(https://arxiv.org/abs/2503.04554)
Keywords: language model, llm, prompt
Abstract: The ability of generative large language models (LLMs) to perform in-context learning has given rise to a large body of research into how best to prompt models for various natural language processing tasks. Machine Translation (MT) has been shown to benefit from in-context examples, in particular when they are semantically similar to the sentence to translate. In this paper, we propose a new LLM-based translation paradigm, compositional translation, to replace naive few-shot MT with similarity-based demonstrations. An LLM is used to decompose a sentence into simpler phrases, and then to translate each phrase with the help of retrieved demonstrations. Finally, the LLM is prompted to translate the initial sentence with the help of the self-generated phrase-translation pairs. Our intuition is that this approach should improve translation because these shorter phrases should be intrinsically easier to translate and easier to match with relevant examples. This is especially beneficial in low-resource scenarios, and more generally whenever the selection pool is small or out of domain. We show that compositional translation boosts LLM translation performance on a wide range of popular MT benchmarks, including FLORES 200, NTREX 128 and TICO-19. Code and outputs are available at this https URL
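The three steps in the abstract (decompose, translate each phrase with retrieved demonstrations, then translate the full sentence with the self-generated pairs) can be sketched as a small pipeline. The callables `decompose`, `retrieve` and `translate` are hypothetical stand-ins for the LLM prompts and the retriever. Their signatures are assumptions made for this illustration.

```python
def compositional_translate(sentence, decompose, retrieve, translate):
    """Translate `sentence` via phrase-level intermediate translations.

    decompose(sentence)    -> list of simpler phrases
    retrieve(phrase)       -> list of (source, target) demonstrations
    translate(text, demos) -> translation of `text` given demonstrations
    Returns the final translation and the self-generated phrase pairs.
    """
    phrase_pairs = []
    for phrase in decompose(sentence):
        demos = retrieve(phrase)  # similarity-based examples for this phrase
        phrase_pairs.append((phrase, translate(phrase, demos)))

    # The phrase-translation pairs act as the demonstrations for the full sentence.
    final = translate(sentence, phrase_pairs)
    return final, phrase_pairs
```

In practice each callable would wrap a prompt to the same LLM, which keeps the pipeline logic independent of any particular model API.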

Title: Compositional Causal Reasoning Evaluation in Language Models

Authors: Jacqueline R. M. A. Maasch, Alihan Hüyük, Xinnuo Xu, Aditya V. Nori, Javier Gonzalez
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.04556
Pdf URL: https://arxiv.org/pdf/2503.04556
Copy Paste: [[2503.04556]] Compositional Causal Reasoning Evaluation in Language Models(https://arxiv.org/abs/2503.04556)
Keywords: language model, gpt
Abstract: Causal reasoning and compositional reasoning are two core aspirations in generative AI. Measuring the extent of these behaviors requires principled evaluation methods. We explore a unified perspective that considers both behaviors simultaneously, termed compositional causal reasoning (CCR): the ability to infer how causal measures compose and, equivalently, how causal quantities propagate through graphs. We instantiate a framework for the systematic evaluation of CCR for the average treatment effect and the probability of necessity and sufficiency. As proof of concept, we demonstrate the design of CCR tasks for language models in the Llama, Phi, and GPT families. On a math word problem, our framework revealed a range of taxonomically distinct error patterns. Additionally, CCR errors increased with the complexity of causal paths for all models except o1.

Title: HybridNorm: Towards Stable and Efficient Transformer Training via Hybrid Normalization

Authors: Zhijian Zhuo, Yutao Zeng, Ya Wang, Sijun Zhang, Jian Yang, Xiaoqing Li, Xun Zhou, Jinwen Ma
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.04598
Pdf URL: https://arxiv.org/pdf/2503.04598
Copy Paste: [[2503.04598]] HybridNorm: Towards Stable and Efficient Transformer Training via Hybrid Normalization(https://arxiv.org/abs/2503.04598)
Keywords: language model, llm
Abstract: Transformers have become the de facto architecture for a wide range of machine learning tasks, particularly in large language models (LLMs). Despite their remarkable performance, challenges remain in training deep transformer networks, especially regarding the location of layer normalization. While Pre-Norm structures facilitate easier training due to their more prominent identity path, they often yield suboptimal performance compared to Post-Norm. In this paper, we propose HybridNorm, a straightforward yet effective hybrid normalization strategy that integrates the advantages of both Pre-Norm and Post-Norm approaches. Specifically, HybridNorm employs QKV normalization within the attention mechanism and Post-Norm in the feed-forward network (FFN) of each transformer block. This design not only stabilizes training but also enhances performance, particularly in the context of LLMs. Comprehensive experiments in both dense and sparse architectures show that HybridNorm consistently outperforms both Pre-Norm and Post-Norm approaches, achieving state-of-the-art results across various benchmarks. These findings highlight the potential of HybridNorm as a more stable and effective technique for improving the training and performance of deep transformer models. Code is available at this https URL.
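A toy, dependency-free sketch of where the two normalizations could sit in one block, based only on the abstract's description: normalize the Q, K and V projections inside attention, and apply Post-Norm around the FFN residual. It assumes single-head attention without masking or learned normalization parameters, and every function name here is hypothetical rather than the authors' code.

```python
import math


def layer_norm(vec, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def matvec(mat, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in mat]


def attention_qkv_norm(xs, wq, wk, wv):
    """Single-head attention in which Q, K and V are each normalized."""
    qs = [layer_norm(matvec(wq, x)) for x in xs]
    ks = [layer_norm(matvec(wk, x)) for x in xs]
    vs = [layer_norm(matvec(wv, x)) for x in xs]
    dim = len(qs[0])
    outputs = []
    for q in qs:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in ks]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        outputs.append([sum(w / total * v[i] for w, v in zip(weights, vs))
                        for i in range(dim)])
    return outputs


def hybrid_block(xs, wq, wk, wv, ffn):
    """Attention sublayer with QKV normalization, then a Post-Norm FFN sublayer."""
    attn = attention_qkv_norm(xs, wq, wk, wv)
    # Residual connection around attention; the input itself is not normalized.
    hidden = [[a + b for a, b in zip(x, o)] for x, o in zip(xs, attn)]
    # Post-Norm: normalize after adding the FFN output to the residual.
    return [layer_norm([h + f for h, f in zip(row, ffn(row))]) for row in hidden]
```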

Title: Towards Data-Efficient Language Models: A Child-Inspired Approach to Language Learning

Authors: Mohammad Amin Ghanizadeh, Mohammad Javad Dousti
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.04611
Pdf URL: https://arxiv.org/pdf/2503.04611
Copy Paste: [[2503.04611]] Towards Data-Efficient Language Models: A Child-Inspired Approach to Language Learning(https://arxiv.org/abs/2503.04611)
Keywords: language model, llm
Abstract: In this work, we explain our approach employed in the BabyLM Challenge, which uses various methods of training language models (LMs) with significantly less data compared to traditional large language models (LLMs) and is inspired by how human children learn. While a human child is exposed to far less linguistic input than an LLM, they still achieve remarkable language understanding and generation abilities. To this end, we develop a model trained on a curated dataset consisting of 10 million words, primarily sourced from child-directed transcripts. The 2024 BabyLM Challenge initial dataset of 10M words is filtered to 8.5M. Next, it is supplemented with a randomly selected subset of the TVR dataset consisting of 1.5M words of television dialogues. The latter dataset ensures that, similar to children, the model is also exposed to language through media. Furthermore, we reduce the vocabulary size to 32,000 tokens, aligning it with the limited vocabulary of children in the early stages of language acquisition. We use curriculum learning and are able to match the baseline on certain benchmarks while surpassing the baseline on others. Additionally, incorporating common LLM training datasets, such as MADLAD-400, degrades performance. These findings underscore the importance of dataset selection, vocabulary scaling, and curriculum learning in creating more data-efficient language models that better mimic human learning processes.

Title: HalluCounter: Reference-free LLM Hallucination Detection in the Wild!

Authors: Ashok Urlana, Gopichand Kanumolu, Charaka Vinayak Kumar, Bala Mallikarjunarao Garlapati, Rahul Mishra
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.04615
Pdf URL: https://arxiv.org/pdf/2503.04615
Copy Paste: [[2503.04615]] HalluCounter: Reference-free LLM Hallucination Detection in the Wild!(https://arxiv.org/abs/2503.04615)
Keywords: llm, hallucination
Abstract: Response consistency-based, reference-free hallucination detection (RFHD) methods do not depend on internal model states, such as generation probabilities or gradients, which grey-box models typically rely on but are inaccessible in closed-source LLMs. However, their inability to capture query-response alignment patterns often results in lower detection accuracy. Additionally, the lack of large-scale benchmark datasets spanning diverse domains remains a challenge, as most existing datasets are limited in size and scope. To this end, we propose HalluCounter, a novel reference-free hallucination detection method that utilizes both response-response and query-response consistency and alignment patterns. This enables the training of a classifier that detects hallucinations and provides a confidence score and an optimal response for user queries. Furthermore, we introduce HalluCounterEval, a benchmark dataset comprising both synthetically generated and human-curated samples across multiple domains. Our method outperforms state-of-the-art approaches by a significant margin, achieving over 90% average confidence in hallucination detection across datasets.

Title: Better Process Supervision with Bi-directional Rewarding Signals

Authors: Wenxiang Chen, Wei He, Zhiheng Xi, Honglin Guo, Boyang Hong, Jiazheng Zhang, Rui Zheng, Nijun Li, Tao Gui, Yun Li, Qi Zhang, Xuanjing Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.04618
Pdf URL: https://arxiv.org/pdf/2503.04618
Copy Paste: [[2503.04618]] Better Process Supervision with Bi-directional Rewarding Signals(https://arxiv.org/abs/2503.04618)
Keywords: language model, llm
Abstract: Process supervision, i.e., evaluating each step, is critical for complex large language model (LLM) reasoning and test-time searching with increased inference compute. Existing approaches, represented by process reward models (PRMs), primarily focus on rewarding signals up to the current step, exhibiting a one-directional nature and lacking a mechanism to model the distance to the final target. To address this problem, we draw inspiration from the A* algorithm, which states that an effective supervisory signal should simultaneously consider the incurred cost and the estimated cost for reaching the target. Building on this key insight, we introduce BiRM, a novel process supervision model that not only evaluates the correctness of previous steps but also models the probability of future success. We conduct extensive experiments on mathematical reasoning tasks and demonstrate that BiRM provides more precise evaluations of LLM reasoning steps, achieving an improvement of 3.1% on Gaokao2023 over PRM under the Best-of-N sampling method. Besides, in search-based strategies, BiRM provides more comprehensive guidance and outperforms ORM by 5.0% and PRM by 3.8% respectively on MATH-500.
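
A minimal sketch of the idea described in the abstract above, not the authors' implementation: by analogy with A*'s f = g + h, each partial solution gets a backward-looking score (correctness of the steps so far) and a forward-looking score (estimated probability of eventual success), and Best-of-N picks the candidate with the highest combined value. The names `Candidate`, `combined_score`, `best_of_n`, the weight `beta`, and the additive combination are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    text: str
    step_score: float    # hypothetical: estimated correctness of the steps so far, in [0, 1]
    future_score: float  # hypothetical: estimated probability of reaching a correct answer


def combined_score(c: Candidate, beta: float = 1.0) -> float:
    # Assumed additive combination, mirroring f = g + h in A*.
    return c.step_score + beta * c.future_score


def best_of_n(candidates: List[Candidate], beta: float = 1.0) -> Candidate:
    # Return the candidate with the highest combined score.
    return max(candidates, key=lambda c: combined_score(c, beta))
```

Setting `beta=0` reduces the ranking to a purely backward-looking score, which is the one-directional behaviour the abstract attributes to existing PRMs.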

Title: SynGraph: A Dynamic Graph-LLM Synthesis Framework for Sparse Streaming User Sentiment Modeling

Authors: Xin Zhang, Qiyu Wei, Yingjie Zhu, Linhai Zhang, Deyu Zhou, Sophia Ananiadou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.04619
Pdf URL: https://arxiv.org/pdf/2503.04619
Copy Paste: [[2503.04619]] SynGraph: A Dynamic Graph-LLM Synthesis Framework for Sparse Streaming User Sentiment Modeling(https://arxiv.org/abs/2503.04619)
Keywords: llm
Abstract: User reviews on e-commerce platforms exhibit dynamic sentiment patterns driven by temporal and contextual factors. Traditional sentiment analysis methods focus on static reviews, failing to capture the evolving temporal relationship between user sentiment rating and textual content. Sentiment analysis on streaming reviews addresses this limitation by modeling and predicting the temporal evolution of user sentiments. However, it suffers from data sparsity, manifesting in temporal, spatial, and combined forms. In this paper, we introduce SynGraph, a novel framework designed to address data sparsity in sentiment analysis on streaming reviews. SynGraph alleviates data sparsity by categorizing users into mid-tail, long-tail, and extreme scenarios and incorporating LLM-augmented enhancements within a dynamic graph-based structure. Experiments on real-world datasets demonstrate its effectiveness in addressing sparsity and improving sentiment modeling in streaming reviews.

Title: START: Self-taught Reasoner with Tools

Authors: Chengpeng Li, Mingfeng Xue, Zhenru Zhang, Jiaxi Yang, Beichen Zhang, Xiang Wang, Bowen Yu, Binyuan Hui, Junyang Lin, Dayiheng Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.04625
Pdf URL: https://arxiv.org/pdf/2503.04625
Copy Paste: [[2503.04625]] START: Self-taught Reasoner with Tools(https://arxiv.org/abs/2503.04625)
Keywords: llm, hallucination, chain-of-thought
Abstract: Large reasoning models (LRMs) like OpenAI-o1 and DeepSeek-R1 have demonstrated remarkable capabilities in complex reasoning tasks through the utilization of long chain-of-thought (CoT). However, these models often suffer from hallucinations and inefficiencies due to their reliance solely on internal reasoning processes. In this paper, we introduce START (Self-Taught Reasoner with Tools), a novel tool-integrated long CoT reasoning LLM that significantly enhances reasoning capabilities by leveraging external tools. Through code execution, START is capable of performing complex computations, self-checking, exploring diverse methods, and self-debugging, thereby addressing the limitations of LRMs. The core innovation of START lies in its self-learning framework, which comprises two key techniques: 1) Hint-infer: We demonstrate that inserting artificially designed hints (e.g., "Wait, maybe using Python here is a good idea.") during the inference process of an LRM effectively stimulates its ability to utilize external tools without the need for any demonstration data. Hint-infer can also serve as a simple and effective sequential test-time scaling method; 2) Hint Rejection Sampling Fine-Tuning (Hint-RFT): Hint-RFT combines Hint-infer and RFT by scoring, filtering, and modifying the reasoning trajectories with tool invocation generated by an LRM via Hint-infer, followed by fine-tuning the LRM. Through this framework, we have fine-tuned the QwQ-32B model to achieve START. On PhD-level science QA (GPQA), competition-level math benchmarks (AMC23, AIME24, AIME25), and the competition-level code benchmark (LiveCodeBench), START achieves accuracy rates of 63.6%, 95.0%, 66.7%, 47.1%, and 47.3%, respectively. It significantly outperforms the base QwQ-32B and achieves performance comparable to the state-of-the-art open-weight model R1-Distill-Qwen-32B and the proprietary model o1-Preview.

Title: SurveyForge: On the Outline Heuristics, Memory-Driven Generation, and Multi-dimensional Evaluation for Automated Survey Writing

Authors: Xiangchao Yan, Shiyang Feng, Jiakang Yuan, Renqiu Xia, Bin Wang, Bo Zhang, Lei Bai
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.04629
Pdf URL: https://arxiv.org/pdf/2503.04629
Copy Paste: [[2503.04629]] SurveyForge: On the Outline Heuristics, Memory-Driven Generation, and Multi-dimensional Evaluation for Automated Survey Writing(https://arxiv.org/abs/2503.04629)
Keywords: llm, agent
Abstract: Survey papers play a crucial role in scientific research, especially given the rapid growth of research publications. Recently, researchers have begun using LLMs to automate survey generation for better efficiency. However, the quality gap between LLM-generated surveys and those written by humans remains significant, particularly in terms of outline quality and citation accuracy. To close these gaps, we introduce SurveyForge, which first generates the outline by analyzing the logical structure of human-written outlines and referring to the retrieved domain-related articles. Subsequently, leveraging high-quality papers retrieved from memory by our scholar navigation agent, SurveyForge can automatically generate and refine the content of the generated article. Moreover, to achieve a comprehensive evaluation, we construct SurveyBench, which includes 100 human-written survey papers for win-rate comparison and assesses AI-generated survey papers across three dimensions: reference, outline, and content quality. Experiments demonstrate that SurveyForge can outperform previous works such as AutoSurvey.

Title: Mark Your LLM: Detecting the Misuse of Open-Source Large Language Models via Watermarking

Authors: Yijie Xu, Aiwei Liu, Xuming Hu, Lijie Wen, Hui Xiong
Subjects: cs.CL, cs.AI, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2503.04636
Pdf URL: https://arxiv.org/pdf/2503.04636
Copy Paste: [[2503.04636]] Mark Your LLM: Detecting the Misuse of Open-Source Large Language Models via Watermarking(https://arxiv.org/abs/2503.04636)
Keywords: language model, llm
Abstract: As open-source large language models (LLMs) like Llama3 become more capable, it is crucial to develop watermarking techniques to detect their potential misuse. Existing watermarking methods either add watermarks during LLM inference, which is unsuitable for open-source LLMs, or primarily target classification LLMs rather than recent generative LLMs. Adapting these watermarks to open-source LLMs for misuse detection remains an open challenge. This work defines two misuse scenarios for open-source LLMs: intellectual property (IP) violation and LLM Usage Violation. Then, we explore the application of inference-time watermark distillation and backdoor watermarking in these contexts. We propose comprehensive evaluation methods to assess the impact of various real-world further fine-tuning scenarios on watermarks and the effect of these watermarks on LLM performance. Our experiments reveal that backdoor watermarking could effectively detect IP Violation, while inference-time watermark distillation is applicable in both scenarios but less robust to further fine-tuning and has a more significant impact on LLM performance compared to backdoor watermarking. Exploring more advanced watermarking methods for open-source LLMs to detect their misuse should be an important future direction.

Title: IFIR: A Comprehensive Benchmark for Evaluating Instruction-Following in Expert-Domain Information Retrieval

Authors: Tingyu Song, Guo Gan, Mingsheng Shang, Yilun Zhao
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2503.04644
Pdf URL: https://arxiv.org/pdf/2503.04644
Copy Paste: [[2503.04644]] IFIR: A Comprehensive Benchmark for Evaluating Instruction-Following in Expert-Domain Information Retrieval(https://arxiv.org/abs/2503.04644)
Keywords: llm
Abstract: We introduce IFIR, the first comprehensive benchmark designed to evaluate instruction-following information retrieval (IR) in expert domains. IFIR includes 2,426 high-quality examples and covers eight subsets across four specialized domains: finance, law, healthcare, and science literature. Each subset addresses one or more domain-specific retrieval tasks, replicating real-world scenarios where customized instructions are critical. IFIR enables a detailed analysis of instruction-following retrieval capabilities by incorporating instructions at different levels of complexity. We also propose a novel LLM-based evaluation method to provide a more precise and reliable assessment of model performance in following instructions. Through extensive experiments on 15 frontier retrieval models, including those based on LLMs, our results reveal that current models face significant challenges in effectively following complex, domain-specific instructions. We further provide in-depth analyses to highlight these limitations, offering valuable insights to guide future advancements in retriever development.

Title: Implicit Cross-Lingual Rewarding for Efficient Multilingual Preference Alignment

Authors: Wen Yang, Junhong Wu, Chen Wang, Chengqing Zong, Jiajun Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.04647
Pdf URL: https://arxiv.org/pdf/2503.04647
Copy Paste: [[2503.04647]] Implicit Cross-Lingual Rewarding for Efficient Multilingual Preference Alignment(https://arxiv.org/abs/2503.04647)
Keywords: language model, llm
Abstract: Direct Preference Optimization (DPO) has become a prominent method for aligning Large Language Models (LLMs) with human preferences. While DPO has enabled significant progress in aligning English LLMs, multilingual preference alignment is hampered by data scarcity. To address this, we propose a novel approach that captures learned preferences from well-aligned English models by implicit rewards and transfers them to other languages through iterative training. Specifically, we derive an implicit reward model from the logits of an English DPO-aligned model and its corresponding reference model. This reward model is then leveraged to annotate preference relations in cross-lingual instruction-following pairs, using English instructions to evaluate multilingual responses. The annotated data is subsequently used for multilingual DPO fine-tuning, facilitating preference knowledge transfer from English to other languages. Fine-tuning Llama3 for two iterations resulted in a 12.72% average improvement in Win Rate and a 5.97% increase in Length Control Win Rate across all training languages on the X-AlpacaEval leaderboard. Our findings demonstrate that leveraging existing English-aligned models can enable efficient and effective multilingual preference alignment, significantly reducing the need for extensive multilingual preference data. The code is available at this https URL
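
A minimal sketch of the implicit reward mentioned in the abstract above, using the standard DPO form r(x, y) = beta * (log pi_theta(y|x) - log pi_ref(y|x)), which holds up to a term that depends only on the prompt. It assumes per-token log-probabilities of each response have already been computed under both the aligned model and its reference model. The function names, the `beta` value, and the pairwise annotation rule are illustrative assumptions, not the authors' released code.

```python
from typing import Sequence, Tuple


def implicit_reward(policy_logprobs: Sequence[float],
                    ref_logprobs: Sequence[float],
                    beta: float = 0.1) -> float:
    # Sequence log-probability is the sum of per-token log-probabilities.
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("both models must score the same tokens")
    return beta * (sum(policy_logprobs) - sum(ref_logprobs))


def annotate_preference(resp_a: Tuple[Sequence[float], Sequence[float]],
                        resp_b: Tuple[Sequence[float], Sequence[float]],
                        beta: float = 0.1) -> Tuple[str, str]:
    # Each response is (policy_logprobs, ref_logprobs).
    # Returns (chosen, rejected) labels by comparing implicit rewards.
    reward_a = implicit_reward(*resp_a, beta=beta)
    reward_b = implicit_reward(*resp_b, beta=beta)
    return ("a", "b") if reward_a >= reward_b else ("b", "a")
```

Because both responses answer the same prompt, the prompt-only term cancels in the comparison, so the sign of the reward difference is enough to label a pair.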

Title: An Information-theoretic Multi-task Representation Learning Framework for Natural Language Understanding

Authors: Dou Hu, Lingwei Wei, Wei Zhou, Songlin Hu
Subjects: cs.CL, cs.IT, cs.LG
Abstract URL: https://arxiv.org/abs/2503.04667
Pdf URL: https://arxiv.org/pdf/2503.04667
Copy Paste: [[2503.04667]] An Information-theoretic Multi-task Representation Learning Framework for Natural Language Understanding(https://arxiv.org/abs/2503.04667)
Keywords: language model
Abstract: This paper proposes a new principled multi-task representation learning framework (InfoMTL) to extract noise-invariant sufficient representations for all tasks. It ensures sufficiency of shared representations for all tasks and mitigates the negative effect of redundant features, which can enhance language understanding of pre-trained language models (PLMs) under the multi-task paradigm. Firstly, a shared information maximization principle is proposed to learn more sufficient shared representations for all target tasks. It can avoid the insufficiency issue arising from representation compression in the multi-task paradigm. Secondly, a task-specific information minimization principle is designed to mitigate the negative effect of potential redundant features in the input for each task. It can compress task-irrelevant redundant information and preserve necessary information relevant to the target for multi-task prediction. Experiments on six classification benchmarks show that our method outperforms 12 comparative multi-task methods under the same multi-task settings, especially in data-constrained and noisy scenarios. Extensive experiments demonstrate that the learned representations are more sufficient, data-efficient, and robust.

Title: LLM-guided Plan and Retrieval: A Strategic Alignment for Interpretable User Satisfaction Estimation in Dialogue

Authors: Sangyeop Kim, Sohhyung Park, Jaewon Jung, Jinseok Kim, Sungzoon Cho
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.04675
Pdf URL: https://arxiv.org/pdf/2503.04675
Copy Paste: [[2503.04675]] LLM-guided Plan and Retrieval: A Strategic Alignment for Interpretable User Satisfaction Estimation in Dialogue(https://arxiv.org/abs/2503.04675)
Keywords: language model, llm
Abstract: Understanding user satisfaction with conversational systems, known as User Satisfaction Estimation (USE), is essential for assessing dialogue quality and enhancing user experiences. However, existing methods for USE face challenges due to limited understanding of underlying reasons for user dissatisfaction and the high costs of annotating user intentions. To address these challenges, we propose PRAISE (Plan and Retrieval Alignment for Interpretable Satisfaction Estimation), an interpretable framework for effective user satisfaction prediction. PRAISE operates through three key modules. The Strategy Planner develops strategies, which are natural language criteria for classifying user satisfaction. The Feature Retriever then incorporates knowledge on user satisfaction from Large Language Models (LLMs) and retrieves relevance features from utterances. Finally, the Score Analyzer evaluates strategy predictions and classifies user satisfaction. Experimental results demonstrate that PRAISE achieves state-of-the-art performance on three benchmarks for the USE task. Beyond its superior performance, PRAISE offers additional benefits. It enhances interpretability by providing instance-level explanations through effective alignment of utterances with strategies. Moreover, PRAISE operates more efficiently than existing approaches by eliminating the need for LLMs during the inference phase.

Title: DIMSUM: Discourse in Mathematical Reasoning as a Supervision Module

Authors: Krish Sharma, Niyar R Barman, Nicholas Asher, Akshay Chaturvedi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.04685
Pdf URL: https://arxiv.org/pdf/2503.04685
Copy Paste: [[2503.04685]] DIMSUM: Discourse in Mathematical Reasoning as a Supervision Module(https://arxiv.org/abs/2503.04685)
Keywords: llm
Abstract: We look at reasoning on GSM8k, a dataset of short texts presenting primary-school math problems. We find, with Mirzadeh et al. (2024), that current LLM progress on the data set may not be explained by better reasoning but by exposure to a broader pretraining data distribution. We then introduce a novel information source for helping models with less data or inferior training reason better: discourse structure. We show that discourse structure improves performance for models like Llama2 13b by up to 160%. Even for models that have most likely memorized the data set, adding discourse structural information to the model still improves predictions and dramatically improves large model performance on out-of-distribution examples.

Title: Quantifying the Reasoning Abilities of LLMs on Real-world Clinical Cases

Authors: Pengcheng Qiu, Chaoyi Wu, Shuyu Liu, Weike Zhao, Ya Zhang, Yanfeng Wang, Weidi Xie
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.04691
Pdf URL: https://arxiv.org/pdf/2503.04691
Copy Paste: [[2503.04691]] Quantifying the Reasoning Abilities of LLMs on Real-world Clinical Cases(https://arxiv.org/abs/2503.04691)
Keywords: language model, llm, agent
Abstract: The latest reasoning-enhanced large language models (reasoning LLMs), such as DeepSeek-R1 and OpenAI-o3, have demonstrated remarkable success. However, the application of such reasoning enhancements to the highly professional medical domain has not been clearly evaluated, particularly with regard to not only assessing the final generation but also examining the quality of the reasoning process. In this study, we present MedR-Bench, a reasoning-focused medical evaluation benchmark comprising 1,453 structured patient cases with reasoning references mined from case reports. Our benchmark spans 13 body systems and 10 specialty disorders, encompassing both common and rare diseases. In our evaluation, we introduce a versatile framework consisting of three critical clinical stages: assessment recommendation, diagnostic decision-making, and treatment planning, comprehensively capturing the LLMs' performance across the entire patient journey in healthcare. For metrics, we propose a novel agentic system, Reasoning Evaluator, designed to automate and objectively quantify free-text reasoning responses in a scalable manner from the perspectives of efficiency, factuality, and completeness by dynamically searching and performing cross-referencing checks. As a result, we assess five state-of-the-art reasoning LLMs, including DeepSeek-R1, OpenAI-o3-mini, and others. Our results reveal that current LLMs can handle relatively simple diagnostic tasks with sufficient critical assessment results, achieving accuracy generally over 85%. However, they still struggle with more complex tasks, such as assessment recommendation and treatment planning. Their reasoning processes are generally reliable, with factuality scores exceeding 90%, though they often omit critical reasoning steps. Our study clearly reveals further development directions for current clinical LLMs.

Title: UIPE: Enhancing LLM Unlearning by Removing Knowledge Related to Forgetting Targets

Authors: Wenyu Wang, Mengqi Zhang, Xiaotian Ye, Zhaochun Ren, Zhumin Chen, Pengjie Ren
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.04693
Pdf URL: https://arxiv.org/pdf/2503.04693
Copy Paste: [[2503.04693]] UIPE: Enhancing LLM Unlearning by Removing Knowledge Related to Forgetting Targets(https://arxiv.org/abs/2503.04693)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) inevitably acquire harmful information during training on massive datasets. LLM unlearning aims to eliminate the influence of such harmful information while maintaining the model's overall performance. Existing unlearning methods, represented by gradient ascent-based approaches, primarily focus on forgetting target data while overlooking the crucial impact of logically related knowledge on the effectiveness of unlearning. In this paper, through both theoretical and experimental analyses, we first demonstrate that a key reason for the suboptimal unlearning performance is that models can reconstruct the target content through reasoning with logically related knowledge. To address this issue, we propose Unlearning Improvement via Parameter Extrapolation (UIPE), a method that removes knowledge highly correlated with the forgetting targets. Experimental results show that UIPE significantly enhances the performance of various mainstream LLM unlearning methods on the TOFU benchmark.

Title: L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning

Authors: Pranjal Aggarwal, Sean Welleck
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.04697
Pdf URL: https://arxiv.org/pdf/2503.04697
Copy Paste: [[2503.04697]] L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning(https://arxiv.org/abs/2503.04697)
Keywords: language model, gpt, prompt, chain-of-thought
Abstract: Reasoning language models have shown an uncanny ability to improve performance at test-time by "thinking longer", that is, by generating longer chain-of-thought sequences and hence using more compute. However, the length of their chain-of-thought reasoning is not controllable, making it impossible to allocate test-time compute to achieve a desired level of performance. We introduce Length Controlled Policy Optimization (LCPO), a simple reinforcement learning method that optimizes for accuracy and adherence to user-specified length constraints. We use LCPO to train L1, a reasoning language model that produces outputs satisfying a length constraint given in its prompt. L1's length control allows for smoothly trading off computational cost and accuracy on a wide range of tasks, and outperforms the state-of-the-art S1 method for length control. Furthermore, we uncover an unexpected short chain-of-thought capability in models trained with LCPO. For instance, our 1.5B L1 model surpasses GPT-4o at equal reasoning lengths. Overall, LCPO enables precise control over reasoning length, allowing for fine-grained allocation of test-time compute and accuracy. We release code and models at this https URL
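
The abstract above says LCPO optimizes for accuracy and adherence to a user-specified length, but it does not state the reward itself. The sketch below shows one plausible shape, assumed purely for illustration: a correctness term minus a penalty proportional to the deviation from the target length. The function name, the linear penalty, and the value of `alpha` are hypothetical and should not be read as the paper's definition.

```python
def length_controlled_reward(is_correct: bool,
                             n_tokens: int,
                             target_tokens: int,
                             alpha: float = 0.001) -> float:
    # Assumed form: 1 for a correct answer, 0 otherwise,
    # minus alpha times the absolute gap to the requested length.
    return float(is_correct) - alpha * abs(target_tokens - n_tokens)
```

Under this assumed form, a larger `alpha` favours hitting the requested length over correctness, which is the kind of trade-off between compute and accuracy the abstract describes.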

Title: Full-Duplex-Bench: A Benchmark to Evaluate Full-duplex Spoken Dialogue Models on Turn-taking Capabilities

Authors: Guan-Ting Lin, Jiachen Lian, Tingle Li, Qirui Wang, Gopala Anumanchipalli, Alexander H. Liu, Hung-yi Lee
Subjects: cs.CL, eess.AS
Abstract URL: https://arxiv.org/abs/2503.04721
Pdf URL: https://arxiv.org/pdf/2503.04721
Copy Paste: [[2503.04721]] Full-Duplex-Bench: A Benchmark to Evaluate Full-duplex Spoken Dialogue Models on Turn-taking Capabilities(https://arxiv.org/abs/2503.04721)
Keywords: language model
Abstract: Spoken dialogue modeling introduces unique challenges beyond text-based language modeling, demanding robust turn-taking, backchanneling, and real-time interaction. Although most Spoken Dialogue Models (SDMs) rely on half-duplex processing (handling speech one turn at a time), emerging full-duplex SDMs can listen and speak simultaneously, enabling more natural and engaging conversations. However, current evaluations of such models remain limited, often focusing on turn-based metrics or high-level corpus analyses (e.g., turn gaps, pauses). To address this gap, we present Full-Duplex-Bench, a new benchmark that systematically evaluates key conversational behaviors: pause handling, backchanneling, turn-taking, and interruption management. Our framework uses automatic metrics for consistent and reproducible assessments of SDMs' interactive performance. By offering an open and standardized evaluation benchmark, we aim to advance spoken dialogue modeling and encourage the development of more interactive and natural dialogue systems.

Title: Enough Coin Flips Can Make LLMs Act Bayesian

Authors: Ritwik Gupta, Rodolfo Corona, Jiaxin Ge, Eric Wang, Dan Klein, Trevor Darrell, David M. Chan
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.04722
Pdf URL: https://arxiv.org/pdf/2503.04722
Copy Paste: [[2503.04722]] Enough Coin Flips Can Make LLMs Act Bayesian(https://arxiv.org/abs/2503.04722)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) exhibit the ability to generalize given few-shot examples in their input prompt, an emergent capability known as in-context learning (ICL). We investigate whether LLMs utilize ICL to perform structured reasoning in ways that are consistent with a Bayesian framework or rely on pattern matching. Using a controlled setting of biased coin flips, we find that: (1) LLMs often possess biased priors, causing initial divergence in zero-shot settings, (2) in-context evidence outweighs explicit bias instructions, (3) LLMs broadly follow Bayesian posterior updates, with deviations primarily due to miscalibrated priors rather than flawed updates, and (4) attention magnitude has negligible effect on Bayesian inference. With sufficient demonstrations of biased coin flips via ICL, LLMs update their priors in a Bayesian manner.
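
For reference, the Bayesian baseline that a biased-coin experiment like the one above compares against can be sketched with a conjugate Beta-Bernoulli update. This is the textbook update, not code from the paper; the function name and the default uniform Beta(1, 1) prior are assumptions for illustration.

```python
from typing import Iterable


def posterior_mean_heads(flips: Iterable[str],
                         prior_heads: float = 1.0,
                         prior_tails: float = 1.0) -> float:
    # Beta(prior_heads, prior_tails) prior over P(heads); each observed flip
    # adds one pseudo-count to the matching parameter.
    a, b = prior_heads, prior_tails
    for flip in flips:
        if flip == "H":
            a += 1
        elif flip == "T":
            b += 1
        else:
            raise ValueError(f"unexpected flip symbol: {flip!r}")
    # Mean of the Beta(a, b) posterior.
    return a / (a + b)
```

A miscalibrated prior corresponds to starting from unequal `prior_heads` and `prior_tails`. With enough observed flips the counts dominate the prior, which matches the abstract's point that sufficient in-context evidence brings the estimate in line with the Bayesian posterior.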

Title: Shifting Long-Context LLMs Research from Input to Output

Authors: Yuhao Wu, Yushi Bai, Zhiqing Hu, Shangqing Tu, Ming Shan Hee, Juanzi Li, Roy Ka-Wei Lee
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.04723
Pdf URL: https://arxiv.org/pdf/2503.04723
Copy Paste: [[2503.04723]] Shifting Long-Context LLMs Research from Input to Output(https://arxiv.org/abs/2503.04723)
Keywords: language model, llm
Abstract: Recent advancements in long-context Large Language Models (LLMs) have primarily concentrated on processing extended input contexts, resulting in significant strides in long-context comprehension. However, the equally critical aspect of generating long-form outputs has received comparatively less attention. This paper advocates for a paradigm shift in NLP research toward addressing the challenges of long-output generation. Tasks such as novel writing, long-term planning, and complex reasoning require models to understand extensive contexts and produce coherent, contextually rich, and logically consistent extended text. These demands highlight a critical gap in current LLM capabilities. We underscore the importance of this under-explored domain and call for focused efforts to develop foundational LLMs tailored for generating high-quality, long-form outputs, which hold immense potential for real-world applications.

Title: LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM

Authors: Sambal Shikhar, Mohammed Irfan Kurpath, Sahal Shaji Mullappilly, Jean Lahoud, Fahad Khan, Rao Muhammad Anwer, Salman Khan, Hisham Cholakkal
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.04724
Pdf URL: https://arxiv.org/pdf/2503.04724
Copy Paste: [[2503.04724]] LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM(https://arxiv.org/abs/2503.04724)
Keywords: language model, llm
Abstract: Recent advancements in speech-to-speech dialogue systems leverage LLMs for multimodal interactions, yet they remain hindered by fine-tuning requirements, high computational overhead, and text-speech misalignment. Existing speech-enabled LLMs often degrade conversational quality by modifying the LLM, thereby compromising its linguistic capabilities. In contrast, we propose LLMVoX, a lightweight 30M-parameter, LLM-agnostic, autoregressive streaming TTS system that generates high-quality speech with low latency, while fully preserving the capabilities of the base LLM. Our approach achieves a significantly lower Word Error Rate compared to speech-enabled LLMs, while operating at comparable latency and UTMOS score. By decoupling speech synthesis from LLM processing via a multi-queue token streaming system, LLMVoX supports seamless, infinite-length dialogues. Its plug-and-play design also facilitates extension to various tasks with different backbones. Furthermore, LLMVoX generalizes to new languages with only dataset adaptation, attaining a low Character Error Rate on an Arabic speech task. Additionally, we have integrated LLMVoX with a Vision-Language Model to create an omni-model with speech, text, and vision capabilities, without requiring additional multimodal training. Our code base and project page are available at this https URL.
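
The Word Error Rate reported above is conventionally computed as the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The sketch below implements that standard definition with whitespace tokenization. The paper's exact text normalization and ASR setup are not given in the abstract, so treat the function as an illustration rather than the authors' evaluation script.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

Character Error Rate, also mentioned in the abstract, follows the same recurrence applied to characters instead of words.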

Title: L$^2$M: Mutual Information Scaling Law for Long-Context Language Modeling

Authors: Zhuo Chen, Oriol Mayné i Comas, Zhuotao Jin, Di Luo, Marin Soljačić
Subjects: cs.CL, cs.AI, cs.IT, cs.LG, physics.data-an
Abstract URL: https://arxiv.org/abs/2503.04725
Pdf URL: https://arxiv.org/pdf/2503.04725
Copy Paste: [[2503.04725]] L$^2$M: Mutual Information Scaling Law for Long-Context Language Modeling(https://arxiv.org/abs/2503.04725)
Keywords: language model, long context
Abstract: We rigorously establish a bipartite mutual information scaling law in natural language that governs long-range dependencies. This scaling law, which we show is distinct from and scales independently of the conventional two-point mutual information, is the key to understanding long-context language modeling. Using this scaling law, we formulate the Long-context Language Modeling (L$^2$M) condition, which relates a model's capacity for effective long context length modeling to the scaling of its latent state size for storing past information. Our results are validated through experiments on both transformers and state space models. This work establishes a theoretical foundation that guides the development of large language models toward longer context lengths.