2025-04-03

Title: Repetitions are not all alike: distinct mechanisms sustain repetition in language models

Authors: Matéo Mahaut, Francesca Franzon
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2504.01100
Pdf URL: https://arxiv.org/pdf/2504.01100
Copy Paste: [[2504.01100]] Repetitions are not all alike: distinct mechanisms sustain repetition in language models(https://arxiv.org/abs/2504.01100)
Keywords: language model, prompt
Abstract: Text generated by language models (LMs) can degrade into repetitive cycles, where identical word sequences are persistently repeated one after another. Prior research has typically treated repetition as a unitary phenomenon. However, repetitive sequences emerge under diverse tasks and contexts, raising the possibility that repetition may be driven by multiple underlying factors. Here, we experimentally explore the hypothesis that repetition in LMs can result from distinct mechanisms, reflecting different text generation strategies used by the model. We examine the internal workings of LMs under two conditions that prompt repetition: one in which repeated sequences emerge naturally after human-written text, and another where repetition is explicitly induced through an in-context learning (ICL) setup. Our analysis reveals key differences between the two conditions: the model exhibits varying levels of confidence, relies on different attention heads, and shows distinct patterns of change in response to controlled perturbations. These findings suggest that distinct internal mechanisms can interact to drive repetition, with implications for its interpretation and mitigation strategies. More broadly, our results highlight that the same surface behavior in LMs may be sustained by different underlying processes, acting independently or in combination.

Title: Can LLMs Grasp Implicit Cultural Values? Benchmarking LLMs' Metacognitive Cultural Intelligence with CQ-Bench

Authors: Ziyi Liu, Priyanka Dey, Zhenyu Zhao, Jen-tse Huang, Rahul Gupta, Yang Liu, Jieyu Zhao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.01127
Pdf URL: https://arxiv.org/pdf/2504.01127
Copy Paste: [[2504.01127]] Can LLMs Grasp Implicit Cultural Values? Benchmarking LLMs' Metacognitive Cultural Intelligence with CQ-Bench(https://arxiv.org/abs/2504.01127)
Keywords: language model, gpt, llm
Abstract: Cultural Intelligence (CQ) refers to the ability to understand unfamiliar cultural contexts, a crucial skill for large language models (LLMs) to effectively engage with globally diverse users. While existing research often focuses on explicitly stated cultural norms, such approaches fail to capture the subtle, implicit values that underlie real-world conversations. To address this gap, we introduce CQ-Bench, a benchmark specifically designed to assess LLMs' capability to infer implicit cultural values from natural conversational contexts. We generate a dataset of multi-character, conversation-based stories using values from the World Value Survey and GlobalOpinions datasets, covering ethical, religious, social, and political topics. Our dataset construction pipeline includes rigorous validation procedures (incorporation, consistency, and implicitness checks) using GPT-4o, with 98.2% human-model agreement in the final validation. Our benchmark consists of three tasks of increasing complexity: attitude detection, value selection, and value extraction. We find that while o1 and Deepseek-R1 models reach human-level performance in value selection (0.809 and 0.814), they still fall short in nuanced attitude detection, with F1 scores of 0.622 and 0.635, respectively. In the value extraction task, GPT-4o-mini and o3-mini score 0.602 and 0.598, highlighting the difficulty of open-ended cultural reasoning. Notably, fine-tuning smaller models (e.g., LLaMA-3.2-3B) on only 500 culturally rich examples improves performance by over 10%, even outperforming stronger baselines (o3-mini) in some cases. Using CQ-Bench, we provide insights into the current challenges in LLMs' CQ research and suggest practical pathways for enhancing LLMs' cross-cultural reasoning abilities.

Title: Is the Top Still Spinning? Evaluating Subjectivity in Narrative Understanding

Authors: Melanie Subbiah, Akankshya Mishra, Grace Kim, Liyan Tang, Greg Durrett, Kathleen McKeown
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.01132
Pdf URL: https://arxiv.org/pdf/2504.01132
Copy Paste: [[2504.01132]] Is the Top Still Spinning? Evaluating Subjectivity in Narrative Understanding(https://arxiv.org/abs/2504.01132)
Keywords: llm
Abstract: Determining faithfulness of a claim to a source document is an important problem across many domains. This task is generally treated as a binary judgment of whether the claim is supported or unsupported in relation to the source. In many cases, though, whether a claim is supported can be ambiguous. For instance, it may depend on making inferences from given evidence, and different people can reasonably interpret the claim as either supported or unsupported based on their agreement with those inferences. Forcing binary labels upon such claims lowers the reliability of evaluation. In this work, we reframe the task to manage the subjectivity involved with factuality judgments of ambiguous claims. We introduce LLM-generated edits of summaries as a method of providing a nuanced evaluation of claims: how much does a summary need to be edited to be unambiguous? Whether a claim gets rewritten and how much it changes can be used as an automatic evaluation metric, the Ambiguity Rewrite Metric (ARM), with a much richer feedback signal than a binary judgment of faithfulness. We focus on the area of narrative summarization as it is particularly rife with ambiguity and subjective interpretation. We show that ARM produces a 21% absolute improvement in annotator agreement on claim faithfulness, indicating that subjectivity is reduced.
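The idea of scoring "whether a claim gets rewritten and how much it changes" can be illustrated with a small sketch. This is not the paper's definition of ARM, which the abstract does not spell out. The function name `rewrite_change` and the word-level similarity from Python's `difflib` are assumptions chosen only to make the idea concrete.

```python
from difflib import SequenceMatcher


def rewrite_change(original: str, rewritten: str) -> float:
    """Hypothetical change score between a claim and its rewrite.

    Returns a value in [0, 1]: 0.0 means the claim was left untouched,
    larger values mean the rewrite changed more of the wording.
    Assumption: word-level similarity stands in for "how much it changes".
    """
    original_words = original.split()
    rewritten_words = rewritten.split()
    similarity = SequenceMatcher(None, original_words, rewritten_words).ratio()
    return 1.0 - similarity


# Illustrative claims (made up for this sketch, not taken from the paper).
unchanged = rewrite_change("the top falls", "the top falls")
edited = rewrite_change("the top falls", "the top may fall")
print(unchanged, edited)
```

A graded score of this kind carries more information than a binary supported/unsupported label, which is the contrast the abstract draws.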

Title: Follow the Flow: On Information Flow Across Textual Tokens in Text-to-Image Models

Authors: Guy Kaplan, Michael Toker, Yuval Reif, Yonatan Belinkov, Roy Schwartz
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.01137
Pdf URL: https://arxiv.org/pdf/2504.01137
Copy Paste: [[2504.01137]] Follow the Flow: On Information Flow Across Textual Tokens in Text-to-Image Models(https://arxiv.org/abs/2504.01137)
Keywords: prompt
Abstract: Text-to-Image (T2I) models often suffer from issues such as semantic leakage, incorrect feature binding, and omissions of key concepts in the generated image. This work studies these phenomena by looking into the role of information flow between textual token representations. To this end, we generate images by applying the diffusion component on a subset of contextual token representations in a given prompt and observe several interesting phenomena. First, in many cases, a word or multiword expression is fully represented by one or two tokens, while other tokens are redundant. For example, in "San Francisco's Golden Gate Bridge", the token "gate" alone captures the full expression. We demonstrate the redundancy of these tokens by removing them after textual encoding and generating an image from the resulting representation. Surprisingly, we find that this process not only maintains image generation performance but also reduces errors by 21% compared to standard generation. We then show that information can also flow between different expressions in a sentence, which often leads to semantic leakage. Based on this observation, we propose a simple, training-free method to mitigate semantic leakage: replacing the leaked item's representation after the textual encoding with its uncontextualized representation. Remarkably, this simple approach reduces semantic leakage by 85%. Overall, our work provides a comprehensive analysis of information flow across textual tokens in T2I models, offering both novel insights and practical benefits.

Title: $μ$KE: Matryoshka Unstructured Knowledge Editing of Large Language Models

Authors: Zian Su, Ziyang Huang, Kaiyuan Zhang, Xiangyu Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.01196
Pdf URL: https://arxiv.org/pdf/2504.01196
Copy Paste: [[2504.01196]] $μ$KE: Matryoshka Unstructured Knowledge Editing of Large Language Models(https://arxiv.org/abs/2504.01196)
Keywords: language model, llm, hallucination
Abstract: Large language models (LLMs) have emerged as powerful knowledge bases yet are limited by static training data, leading to issues such as hallucinations and safety risks. Editing a model's internal knowledge through the locate-and-edit paradigm has proven a cost-effective alternative to retraining, though current unstructured approaches, especially window-based autoregressive methods, often disrupt the causal dependency between early memory updates and later output tokens. In this work, we first theoretically analyze these limitations and then introduce Matryoshka Unstructured Knowledge Editing ($\mu$KE), a novel memory update mechanism that preserves such dependencies via a Matryoshka-style objective and adaptive loss coefficients. Empirical evaluations on two models across four benchmarks demonstrate that $\mu$KE improves edit efficacy by up to 12.33% over state-of-the-art methods, and remains robust when applied to diverse formatted edits, underscoring its potential for effective unstructured knowledge editing in LLMs.

Title: Medical large language models are easily distracted

Authors: Krithik Vishwanath, Anton Alyakin, Daniel Alexander Alber, Jin Vivian Lee, Douglas Kondziolka, Eric Karl Oermann
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2504.01201
Pdf URL: https://arxiv.org/pdf/2504.01201
Copy Paste: [[2504.01201]] Medical large language models are easily distracted(https://arxiv.org/abs/2504.01201)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Large language models (LLMs) have the potential to transform medicine, but real-world clinical scenarios contain extraneous information that can hinder performance. The rise of assistive technologies like ambient dictation, which automatically generates draft notes from live patient encounters, has the potential to introduce additional noise, making it crucial to assess the ability of LLMs to filter relevant data. To investigate this, we developed MedDistractQA, a benchmark using USMLE-style questions embedded with simulated real-world distractions. Our findings show that distracting statements (polysemous words with clinical meanings used in a non-clinical context or references to unrelated health conditions) can reduce LLM accuracy by up to 17.9%. Commonly proposed solutions to improve model performance, such as retrieval-augmented generation (RAG) and medical fine-tuning, did not change this effect and in some cases introduced their own confounders and further degraded performance. Our findings suggest that LLMs natively lack the logical mechanisms necessary to distinguish relevant from irrelevant clinical information, posing challenges for real-world applications. MedDistractQA and our results highlight the need for robust mitigation strategies to enhance LLM resilience to extraneous information.

Title: Detecting PTSD in Clinical Interviews: A Comparative Analysis of NLP Methods and Large Language Models

Authors: Feng Chen, Dror Ben-Zeev, Gillian Sparks, Arya Kadakia, Trevor Cohen
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.01216
Pdf URL: https://arxiv.org/pdf/2504.01216
Copy Paste: [[2504.01216]] Detecting PTSD in Clinical Interviews: A Comparative Analysis of NLP Methods and Large Language Models(https://arxiv.org/abs/2504.01216)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Post-Traumatic Stress Disorder (PTSD) remains underdiagnosed in clinical settings, presenting opportunities for automated detection to identify patients. This study evaluates natural language processing approaches for detecting PTSD from clinical interview transcripts. We compared general and mental health-specific transformer models (BERT/RoBERTa), embedding-based methods (SentenceBERT/LLaMA), and large language model prompting strategies (zero-shot/few-shot/chain-of-thought) using the DAIC-WOZ dataset. Domain-specific models significantly outperformed general models (Mental-RoBERTa F1=0.643 vs. RoBERTa-base 0.485). LLaMA embeddings with neural networks achieved the highest performance (F1=0.700). Zero-shot prompting using DSM-5 criteria yielded competitive results without training data (F1=0.657). Performance varied significantly across symptom severity and comorbidity status, with higher accuracy for severe PTSD cases and patients with comorbid depression. Our findings highlight the potential of domain-adapted embeddings and LLMs for scalable screening while underscoring the need for improved detection of nuanced presentations and offering insights for developing clinically viable AI tools for PTSD assessment.

Title: Catastrophic Forgetting in LLMs: A Comparative Analysis Across Language Tasks

Authors: Naimul Haque
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.01241
Pdf URL: https://arxiv.org/pdf/2504.01241
Copy Paste: [[2504.01241]] Catastrophic Forgetting in LLMs: A Comparative Analysis Across Language Tasks(https://arxiv.org/abs/2504.01241)
Keywords: language model, llm, prompt, agent
Abstract: Large Language Models (LLMs) have significantly advanced Natural Language Processing (NLP), particularly in Natural Language Understanding (NLU) tasks. As we progress toward an agentic world where LLM-based agents autonomously handle specialized tasks, it becomes crucial for these models to adapt to new tasks without forgetting previously learned information, a challenge known as catastrophic forgetting. This study evaluates the continual fine-tuning of various open-source LLMs with different parameter sizes (specifically models under 10 billion parameters) on key NLU tasks from the GLUE benchmark, including SST-2, MRPC, CoLA, and MNLI. By employing prompt engineering and task-specific adjustments, we assess and compare the models' abilities to retain prior knowledge while learning new tasks. Our results indicate that models such as Phi-3.5-mini exhibit minimal forgetting while maintaining strong learning capabilities, making them well-suited for continual learning environments. Additionally, models like Orca-2-7b and Qwen2.5-7B demonstrate impressive learning abilities and overall performance after fine-tuning. This work contributes to understanding catastrophic forgetting in LLMs and highlights prompt engineering as a way to optimize model performance for continual learning scenarios.

Title: Automated Factual Benchmarking for In-Car Conversational Systems using Large Language Models

Authors: Rafael Giebisch, Ken E. Friedl, Lev Sorokin, Andrea Stocco
Subjects: cs.CL, cs.AI, cs.LG, cs.SE
Abstract URL: https://arxiv.org/abs/2504.01248
Pdf URL: https://arxiv.org/pdf/2504.01248
Copy Paste: [[2504.01248]] Automated Factual Benchmarking for In-Car Conversational Systems using Large Language Models(https://arxiv.org/abs/2504.01248)
Keywords: language model, gpt, llm, hallucination, prompt
Abstract: In-car conversational systems promise to improve the in-vehicle user experience. Modern conversational systems are based on Large Language Models (LLMs), which makes them prone to errors such as hallucinations, i.e., inaccurate, fictitious, and therefore factually incorrect information. In this paper, we present an LLM-based methodology for the automatic factual benchmarking of in-car conversational systems. We instantiate our methodology with five LLM-based methods, leveraging ensembling techniques and diverse personae to enhance agreement and minimize hallucinations. We use our methodology to evaluate CarExpert, an in-car retrieval-augmented conversational question answering system, with respect to its factual correctness against a vehicle's manual. We produced a novel dataset specifically created for the in-car domain, and tested our methodology against an expert evaluation. Our results show that the combination of GPT-4 with Input Output Prompting achieves an agreement rate of over 90 per cent with expert evaluations on factual correctness, in addition to being the most efficient approach, with an average response time of 4.5s. Our findings suggest that LLM-based testing constitutes a viable approach for the validation of conversational systems regarding their factual correctness.

Title: Grade Guard: A Smart System for Short Answer Automated Grading

Authors: Niharika Dadu, Harsh Vardhan Singh, Romi Banerjee (Indian Institute of Technology Jodhpur)
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.01253
Pdf URL: https://arxiv.org/pdf/2504.01253
Copy Paste: [[2504.01253]] Grade Guard: A Smart System for Short Answer Automated Grading(https://arxiv.org/abs/2504.01253)
Keywords: language model, gpt, llm
Abstract: The advent of large language models (LLMs) in the education sector has provided impetus to automate grading short answer questions. LLMs make evaluating short answers very efficient, thus addressing issues like staff shortage. However, in the task of Automated Short Answer Grading (ASAG), LLM responses are influenced by diverse perspectives in their training dataset, leading to inaccuracies in evaluating nuanced or partially correct answers. To address this challenge, we propose a novel framework, Grade Guard. 1. To enhance the task-based specialization of the LLMs, the temperature parameter has been fine-tuned using Root Mean Square Error (RMSE). 2. Unlike traditional approaches, LLMs in Grade Guard compute an Indecisiveness Score (IS) along with the grade to reflect uncertainty in predicted grades. 3. A Confidence-Aware Loss (CAL) is introduced to generate an optimized Indecisiveness Score (IS). 4. To improve reliability, self-reflection based on the optimized IS has been introduced into the framework, enabling human re-evaluation to minimize incorrect grade assignments. Our experimentation shows that the best setting of Grade Guard outperforms traditional methods by 19.16% RMSE in Upstage Solar Pro, 23.64% RMSE in Upstage Solar Mini, 4.00% RMSE in Gemini 1.5 Flash, and 10.20% RMSE in GPT-4o Mini. Future work includes improving interpretability by generating rationales for grades to enhance accuracy. Expanding benchmark datasets and annotating them with domain-specific nuances will enhance grading accuracy. Finally, analyzing feedback to enhance confidence in predicted grades, reduce biases, optimize grading criteria, and personalize learning while supporting multilingual grading systems will make the solution more accurate, adaptable, fair, and inclusive.
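Step 1 above (choosing the temperature by RMSE) can be sketched as follows. This is an illustrative reading under stated assumptions, not the authors' procedure: the helper names `rmse` and `pick_temperature`, the candidate temperatures, and all grades are made up for the example, and the LLM grading call is replaced by precomputed grades.

```python
import math


def rmse(predicted, reference):
    """Root Mean Square Error between predicted and reference grades."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("grade lists must be non-empty and of equal length")
    squared_errors = [(p - r) ** 2 for p, r in zip(predicted, reference)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def pick_temperature(grades_by_temperature, reference):
    """Return the temperature whose grades have the lowest RMSE."""
    return min(
        grades_by_temperature,
        key=lambda t: rmse(grades_by_temperature[t], reference),
    )


# Hypothetical human reference grades and LLM grades per temperature.
reference = [2.0, 4.0, 5.0]
grades_by_temperature = {
    0.0: [2.0, 3.0, 5.0],
    0.7: [1.0, 4.0, 3.0],
    1.0: [4.0, 2.0, 5.0],
}
best = pick_temperature(grades_by_temperature, reference)
print(best)
```

In practice the grades for each temperature would come from running the grader on a held-out set of answers with human reference grades.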

Title: Prompt-Reverse Inconsistency: LLM Self-Inconsistency Beyond Generative Randomness and Prompt Paraphrasing

Authors: Jihyun Janice Ahn, Wenpeng Yin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.01282
Pdf URL: https://arxiv.org/pdf/2504.01282
Copy Paste: [[2504.01282]] Prompt-Reverse Inconsistency: LLM Self-Inconsistency Beyond Generative Randomness and Prompt Paraphrasing(https://arxiv.org/abs/2504.01282)
Keywords: language model, llm, prompt
Abstract: While the inconsistency of LLMs is not a novel topic, prior research has predominantly addressed two types of generative inconsistencies: i) Randomness Inconsistency: running the same LLM over multiple trials yields varying responses; ii) Paraphrase Inconsistency: paraphrased prompts result in different responses from the same LLM. Randomness Inconsistency arises from the inherent randomness due to stochastic sampling in generative models, while Paraphrase Inconsistency is a consequence of the language modeling objectives, where paraphrased prompts alter the distribution of vocabulary logits. This research discovers Prompt-Reverse Inconsistency (PRIN), a new form of LLM self-inconsistency: given a question and a couple of LLM-generated answer candidates, the LLM often has conflicting responses when prompted "Which are correct answers?" and "Which are incorrect answers?". PRIN poses a big concern as it undermines the credibility of LLM-as-a-judge, and suggests that LLMs struggle to adhere to basic logical rules. We conduct a series of experiments to investigate PRIN, examining the extent of PRIN across different LLMs, methods to mitigate it, potential applications, and its relationship with Randomness Inconsistency and Paraphrase Inconsistency. As the first study to explore PRIN, our findings offer valuable insights into the inner workings of LLMs and contribute to advancing trustworthy AI.
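The consistency condition behind PRIN can be made concrete with a small check: if the model's answers were logically consistent, the candidates it names as correct and the ones it names as incorrect would partition the candidate set. The sketch below flags candidates that violate this. The function name and the example data are hypothetical, and the paper may measure PRIN differently.

```python
def prompt_reverse_conflicts(candidates, said_correct, said_incorrect):
    """Return the candidates on which the two prompts disagree logically.

    A candidate is consistent when it appears in exactly one of the two
    answer sets. It is a conflict when it is called both correct and
    incorrect, or when it is named in neither response.
    """
    conflicts = []
    for candidate in candidates:
        in_correct = candidate in said_correct
        in_incorrect = candidate in said_incorrect
        if in_correct == in_incorrect:
            conflicts.append(candidate)
    return conflicts


# Hypothetical responses to "Which are correct answers?" and
# "Which are incorrect answers?" for the same three candidates.
candidates = ["A", "B", "C"]
conflicts = prompt_reverse_conflicts(candidates, {"A", "B"}, {"B"})
conflict_rate = len(conflicts) / len(candidates)
print(conflicts, conflict_rate)
```

Here "B" is labeled both correct and incorrect, and "C" is labeled neither, so two of the three candidates are in conflict.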

Title: ThinkPrune: Pruning Long Chain-of-Thought of LLMs via Reinforcement Learning

Authors: Bairu Hou, Yang Zhang, Jiabao Ji, Yujian Liu, Kaizhi Qian, Jacob Andreas, Shiyu Chang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.01296
Pdf URL: https://arxiv.org/pdf/2504.01296
Copy Paste: [[2504.01296]] ThinkPrune: Pruning Long Chain-of-Thought of LLMs via Reinforcement Learning(https://arxiv.org/abs/2504.01296)
Keywords: llm, chain-of-thought
Abstract: We present ThinkPrune, a simple yet effective method for pruning the thinking length for long-thinking LLMs, which have been found to often produce inefficient and redundant thinking processes. Existing preliminary explorations of reducing thinking length primarily focus on forcing the thinking process to early exit, rather than adapting the LLM to optimize and consolidate the thinking process, and therefore the length-performance tradeoff observed so far is sub-optimal. To fill this gap, ThinkPrune offers a simple solution that continuously trains the long-thinking LLMs via reinforcement learning (RL) with an added token limit, beyond which any unfinished thoughts and answers will be discarded, resulting in a zero reward. To further preserve model performance, we introduce an iterative length pruning approach, where multiple rounds of RL are conducted, each with an increasingly stringent token limit. We observed that ThinkPrune results in a remarkable performance-length tradeoff: on the AIME24 dataset, the reasoning length of DeepSeek-R1-Distill-Qwen-1.5B can be reduced by half with only a 2% drop in performance. We also observed that after pruning, the LLMs can bypass unnecessary steps while keeping the core reasoning process complete. Code is available at this https URL.
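The reward rule described above (anything past the token limit is discarded and earns zero reward) and the idea of tightening the limit over several RL rounds can be sketched as below. This is a minimal illustration, not the released code: the binary correctness reward of 1.0, the function names, and the geometric shrink schedule are all assumptions, since the abstract only says each round uses a more stringent limit.

```python
def clipped_reward(num_tokens: int, is_correct: bool, token_limit: int) -> float:
    """Reward for one sampled response under a hard token limit.

    Assumption: a response that fits within the limit earns 1.0 when its
    final answer is correct and 0.0 otherwise. A response that exceeds the
    limit is treated as unfinished and earns 0.0 regardless of correctness.
    """
    if num_tokens > token_limit:
        return 0.0
    return 1.0 if is_correct else 0.0


def limit_schedule(initial_limit: int, rounds: int, shrink: float = 0.5):
    """Hypothetical schedule: one token limit per RL round, each tighter."""
    limits = []
    limit = initial_limit
    for _ in range(rounds):
        limits.append(limit)
        limit = max(1, int(limit * shrink))
    return limits


limits = limit_schedule(initial_limit=4000, rounds=3)
print(limits)
print(clipped_reward(num_tokens=3500, is_correct=True, token_limit=limits[0]))
print(clipped_reward(num_tokens=3500, is_correct=True, token_limit=limits[1]))
```

The same 3500-token correct response is rewarded in the first round but not in the second, which is the pressure that pushes the policy toward shorter reasoning.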

Title: Biomedical Question Answering via Multi-Level Summarization on a Local Knowledge Graph

Authors: Lingxiao Guan, Yuanhao Huang, Jie Liu
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2504.01309
Pdf URL: https://arxiv.org/pdf/2504.01309
Copy Paste: [[2504.01309]] Biomedical Question Answering via Multi-Level Summarization on a Local Knowledge Graph(https://arxiv.org/abs/2504.01309)
Keywords: language model, retrieval augmented generation
Abstract: In Question Answering (QA), Retrieval Augmented Generation (RAG) has revolutionized performance in various domains. However, how to effectively capture multi-document relationships, particularly critical for biomedical tasks, remains an open question. In this work, we propose a novel method that utilizes propositional claims to construct a local knowledge graph from retrieved documents. Summaries are then derived via layerwise summarization from the knowledge graph to contextualize a small language model to perform QA. We achieved comparable or superior performance with our method over RAG baselines on several biomedical QA benchmarks. We also evaluated each individual step of our methodology over a targeted set of metrics, demonstrating its effectiveness.

Title: Adaptive Rectification Sampling for Test-Time Compute Scaling

Authors: Zhendong Tan, Xingjun Zhang, Chaoyi Hu, Yancheng Pan, Shaoxun Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.01317
Pdf URL: https://arxiv.org/pdf/2504.01317
Copy Paste: [[2504.01317]] Adaptive Rectification Sampling for Test-Time Compute Scaling(https://arxiv.org/abs/2504.01317)
Keywords: language model, llm
Abstract: The newly released OpenAI-o1 and DeepSeek-R1 have demonstrated that test-time scaling can significantly improve model performance, especially in complex tasks such as logical reasoning. Common test-time scaling methods involve generating more chains of thought (CoTs) or longer CoTs with self-correction. However, while self-correction can improve performance, it may lead to significant token waste and reduce readability of the CoT if the reasoning steps are already correct. To demonstrate that large language models (LLMs) can rectify errors at a more fine-grained level, we propose Adaptive Rectification Sampling (AR-Sampling), which can guide the LLMs to self-correct at the appropriate step. AR-Sampling leverages a process-supervised reward model (PRM) as a verifier and constructed trigger sentences to guide the model in adaptive step-level rethinking. Experiments on GSM8K and MATH500 indicate that our approach enables the models to rethink at a more fine-grained level, improving the accuracy of solutions while generating a reasonable number of additional tokens.

Title: GTR: Graph-Table-RAG for Cross-Table Question Answering

Authors: Jiaru Zou, Dongqi Fu, Sirui Chen, Xinrui He, Zihao Li, Yada Zhu, Jiawei Han, Jingrui He
Subjects: cs.CL, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2504.01346
Pdf URL: https://arxiv.org/pdf/2504.01346
Copy Paste: [[2504.01346]] GTR: Graph-Table-RAG for Cross-Table Question Answering(https://arxiv.org/abs/2504.01346)
Keywords: llm, prompt
Abstract: Beyond pure text, a substantial amount of knowledge is stored in tables. In real-world scenarios, user questions often require retrieving answers that are distributed across multiple tables. GraphRAG has recently attracted much attention for enhancing LLMs' reasoning capabilities by organizing external knowledge to address ad-hoc and complex questions, exemplifying a promising direction for cross-table question answering. In this paper, to address the current gap in available data, we first introduce a multi-table benchmark, MultiTableQA, comprising 60k tables and 25k user queries collected from real-world sources. Then, we propose the first Graph-Table-RAG framework, namely GTR, which reorganizes table corpora into a heterogeneous graph, employs a hierarchical coarse-to-fine retrieval process to extract the most relevant tables, and integrates graph-aware prompting for downstream LLMs' tabular reasoning. Extensive experiments show that GTR exhibits superior cross-table question-answering performance while maintaining high deployment efficiency, demonstrating its real-world practical applicability.

Title: LITE: LLM-Impelled efficient Taxonomy Evaluation

Authors: Lin Zhang, Zhouhong Gu, Suhang Zheng, Tao Wang, Tianyu Li, Hongwei Feng, Yanghua Xiao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.01369
Pdf URL: https://arxiv.org/pdf/2504.01369
Copy Paste: [[2504.01369]] LITE: LLM-Impelled efficient Taxonomy Evaluation(https://arxiv.org/abs/2504.01369)
Keywords: llm
Abstract: This paper presents LITE, an LLM-based evaluation method designed for efficient and flexible assessment of taxonomy quality. To address challenges in large-scale taxonomy evaluation, such as efficiency, fairness, and consistency, LITE adopts a top-down hierarchical evaluation strategy, breaking down the taxonomy into manageable substructures and ensuring result reliability through cross-validation and standardized input formats. LITE also introduces a penalty mechanism to handle extreme cases and provides both quantitative performance analysis and qualitative insights by integrating evaluation metrics closely aligned with task objectives. Experimental results show that LITE demonstrates high reliability in complex evaluation tasks, effectively identifying semantic errors, logical contradictions, and structural flaws in taxonomies, while offering directions for improvement. Code is available at this https URL .

Title: ToolACE-R: Tool Learning with Adaptive Self-Refinement

Authors: Xingshan Zeng, Weiwen Liu, Xu Huang, Zezhong Wang, Lingzhi Wang, Liangyou Li, Yasheng Wang, Lifeng Shang, Xin Jiang, Ruiming Tang, Qun Liu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.01400
Pdf URL: https://arxiv.org/pdf/2504.01400
Copy Paste: [[2504.01400]] ToolACE-R: Tool Learning with Adaptive Self-Refinement(https://arxiv.org/abs/2504.01400)
Keywords: language model, llm
Abstract: Tool learning, which allows Large Language Models (LLMs) to leverage external tools for solving complex user tasks, has emerged as a promising avenue for extending model capabilities. However, current approaches primarily focus on data synthesis for fine-tuning LLMs to invoke tools effectively, largely ignoring how to fully stimulate the potential of the model. In this paper, we propose ToolACE-R, a novel method that introduces adaptive self-refinement for tool invocations. Our approach features a model-aware iterative training procedure that progressively incorporates more training samples based on the model's evolving capabilities. Additionally, it allows LLMs to iteratively refine their tool calls, optimizing performance without requiring external feedback. To further enhance computational efficiency, we integrate an adaptive mechanism when scaling the inference time, enabling the model to autonomously determine when to stop the refinement process. We conduct extensive experiments across several benchmark datasets, showing that ToolACE-R achieves competitive performance compared to advanced API-based models, even without any refinement. Furthermore, its performance can be further improved efficiently through adaptive self-refinement. Our results demonstrate the effectiveness of the proposed method, which is compatible with base models of various sizes, offering a promising direction for more efficient tool learning.

Title: FAIRE: Assessing Racial and Gender Bias in AI-Driven Resume Evaluations

Authors: Athena Wen, Tanush Patil, Ansh Saxena, Yicheng Fu, Sean O'Brien, Kevin Zhu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.01420
Pdf URL: https://arxiv.org/pdf/2504.01420
Copy Paste: [[2504.01420]] FAIRE: Assessing Racial and Gender Bias in AI-Driven Resume Evaluations(https://arxiv.org/abs/2504.01420)
Keywords: language model, llm
Abstract: In an era where AI-driven hiring is transforming recruitment practices, concerns about fairness and bias have become increasingly important. To explore these issues, we introduce a benchmark, FAIRE (Fairness Assessment In Resume Evaluation), to test for racial and gender bias in large language models (LLMs) used to evaluate resumes across different industries. We use two methods, direct scoring and ranking, to measure how model performance changes when resumes are slightly altered to reflect different racial or gender identities. Our findings reveal that while every model exhibits some degree of bias, the magnitude and direction vary considerably. This benchmark provides a clear way to examine these differences and offers valuable insights into the fairness of AI-based hiring tools. It highlights the urgent need for strategies to reduce bias in AI-driven recruitment. Our benchmark code and dataset are open-sourced at our repository: this https URL.

Title: Refining Interactions: Enhancing Anisotropy in Graph Neural Networks with Language Semantics

Authors: Zhaoxing Li, Xiaoming Zhang, Haifeng Zhang, Chengxiang Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.01429
Pdf URL: https://arxiv.org/pdf/2504.01429
Copy Paste: [[2504.01429]] Refining Interactions: Enhancing Anisotropy in Graph Neural Networks with Language Semantics(https://arxiv.org/abs/2504.01429)
Keywords: language model, llm
Abstract: The integration of Large Language Models (LLMs) with Graph Neural Networks (GNNs) has recently been explored to enhance the capabilities of Text Attribute Graphs (TAGs). Most existing methods feed textual descriptions of the graph structure or neighbouring nodes' text directly into LLMs. However, these approaches often cause LLMs to treat structural information simply as general contextual text, thus limiting their effectiveness in graph-related tasks. In this paper, we introduce LanSAGNN (Language Semantic Anisotropic Graph Neural Network), a framework that extends the concept of anisotropic GNNs to the natural language level. This model leverages LLMs to extract tailor-made semantic information for node pairs, effectively capturing the unique interactions within node relationships. In addition, we propose an efficient dual-layer LLM fine-tuning architecture to better align LLMs' outputs with graph tasks. Experimental results demonstrate that LanSAGNN significantly enhances existing LLM-based methods without increasing complexity while also exhibiting strong robustness against interference.

Title: PROPHET: An Inferable Future Forecasting Benchmark with Causal Intervened Likelihood Estimation

Authors: Zhengwei Tao, Zhi Jin, Bincheng Li, Xiaoying Bai, Haiyan Zhao, Chengfeng Dou, Xiancai Chen, Jia Li, Linyu Li, Chongyang Tao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.01509
Pdf URL: https://arxiv.org/pdf/2504.01509
Copy Paste: [[2504.01509]] PROPHET: An Inferable Future Forecasting Benchmark with Causal Intervened Likelihood Estimation(https://arxiv.org/abs/2504.01509)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Predicting future events stands as one of the ultimate aspirations of artificial intelligence. Recent advances in large language model (LLM)-based systems have shown remarkable potential in forecasting future events, thereby garnering significant interest in the research community. Currently, several benchmarks have been established to evaluate forecasting capabilities by formalizing event prediction as a retrieval-augmented generation (RAG) and reasoning task. In these benchmarks, each prediction question is answered with relevant retrieved news articles. However, because these benchmarks do not consider whether the questions are supported by valid or sufficient rationales, some of their questions may be inherently noninferable. To address this issue, we introduce a new benchmark, PROPHET, which comprises inferable forecasting questions paired with relevant news for retrieval. To ensure the inferability of the benchmark, we propose Causal Intervened Likelihood (CIL), a statistical measure that assesses inferability through causal inference. In constructing this benchmark, we first collected recent trend forecasting questions and then filtered the data using CIL, resulting in an inferable benchmark for event prediction. Through extensive experiments, we first demonstrate the validity of CIL and then conduct in-depth investigations into event prediction with the aid of CIL. Subsequently, we evaluate several representative prediction systems on PROPHET, drawing valuable insights for future directions.

Title: Chain of Correction for Full-text Speech Recognition with Large Language Models

Authors: Zhiyuan Tang, Dong Wang, Zhikai Zhou, Yong Liu, Shen Huang, Shidong Shang
Subjects: cs.CL, eess.AS
Abstract URL: https://arxiv.org/abs/2504.01519
Pdf URL: https://arxiv.org/pdf/2504.01519
Copy Paste: [[2504.01519]] Chain of Correction for Full-text Speech Recognition with Large Language Models(https://arxiv.org/abs/2504.01519)
Keywords: language model, llm, long context, chat
Abstract: Full-text error correction with Large Language Models (LLMs) for Automatic Speech Recognition (ASR) has gained increased attention due to its potential to correct errors across long contexts and address a broader spectrum of error types, including punctuation restoration and inverse text normalization. Nevertheless, many challenges persist, including issues related to stability, controllability, completeness, and fluency. To mitigate these challenges, this paper proposes the Chain of Correction (CoC) for full-text error correction with LLMs, which corrects errors segment by segment using pre-recognized text as guidance within a regular multi-turn chat format. The CoC also uses pre-recognized full text for context, allowing the model to better grasp global semantics and maintain a comprehensive overview of the entire content. Utilizing the open-sourced full-text error correction dataset ChFT, we fine-tune a pre-trained LLM to evaluate the performance of the CoC framework. Experimental results demonstrate that the CoC effectively corrects errors in full-text ASR outputs, significantly outperforming baseline and benchmark systems. We further analyze how to set the correction threshold to balance under-correction and over-rephrasing, extrapolate the CoC model on extremely long ASR outputs, and investigate whether other types of information can be employed to guide the error correction process.

Title: Context-Aware Toxicity Detection in Multiplayer Games: Integrating Domain-Adaptive Pretraining and Match Metadata

Authors: Adrien Schurger-Foy, Rafal Dariusz Kocielnik, Caglar Gulcehre, R. Michael Alvarez
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.01534
Pdf URL: https://arxiv.org/pdf/2504.01534
Copy Paste: [[2504.01534]] Context-Aware Toxicity Detection in Multiplayer Games: Integrating Domain-Adaptive Pretraining and Match Metadata(https://arxiv.org/abs/2504.01534)
Keywords: llm, prompt, chat
Abstract: The detrimental effects of toxicity in competitive online video games are widely acknowledged, prompting publishers to monitor player chat conversations. This is challenging due to the context-dependent nature of toxicity, often spread across multiple messages or informed by non-textual interactions. Traditional toxicity detectors focus on isolated messages, missing the broader context needed for accurate moderation. This is especially problematic in video games, where interactions involve specialized slang, abbreviations, and typos, making it difficult for standard models to detect toxicity, especially given its rarity. We adapted RoBERTa LLM to support moderation tailored to video games, integrating both textual and non-textual context. By enhancing pretrained embeddings with metadata and addressing the unique slang and language quirks through domain adaptive pretraining, our method better captures the nuances of player interactions. Using two gaming datasets, from Defense of the Ancients 2 (DOTA 2) and Call of Duty®: Modern Warfare® III (MWIII), we demonstrate which sources of context (metadata, prior interactions, etc.) are most useful, how to best leverage them to boost performance, and the conditions conducive to doing so. This work underscores the importance of context-aware and domain-specific approaches for proactive moderation.

Title: From Smør-re-brød to Subwords: Training LLMs on Danish, One Morpheme at a Time

Authors: Mikkel Wildner Kildeberg, Emil Allerslev Schledermann, Nicolaj Larsen, Rob van der Goot
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.01540
Pdf URL: https://arxiv.org/pdf/2504.01540
Copy Paste: [[2504.01540]] From Smør-re-brød to Subwords: Training LLMs on Danish, One Morpheme at a Time(https://arxiv.org/abs/2504.01540)
Keywords: language model, gpt, llm
Abstract: The best performing transformer-based language models use subword tokenization techniques, such as Byte-Pair-Encoding (BPE). However, these approaches often overlook linguistic principles, such as morphological segmentation, which we believe is fundamental for understanding language-specific word structure. In this study, we leverage an annotated Danish morphological dataset to train a semi-supervised model for morphological segmentation, enabling the development of tokenizers optimized for Danish morphology. We evaluate four distinct tokenizers, including two custom morphological tokenizers, by analyzing their performance in morphologically segmenting Danish words. Additionally, we train two generative transformer models, CerebrasGPT-111M and LLaMA-3.2 1B, using these tokenizers and evaluate their downstream performance. Our findings reveal that our custom-developed tokenizers substantially enhance morphological segmentation, achieving an F1 score of 58.84, compared to 39.28 achieved by a Danish BPE tokenizer. In downstream tasks, models trained with our morphological tokenizers outperform those using BPE tokenizers across different evaluation metrics. These results highlight that incorporating Danish morphological segmentation strategies into tokenizers leads to improved performance in generative transformer models on Danish language.
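To make the F1 comparison in the abstract above concrete, here is a minimal sketch of one common way to score a morphological segmentation: F1 over internal boundary positions. The abstract does not say how the paper computes F1, so the boundary-based definition and the helper names `boundaries` and `boundary_f1` are assumptions for illustration only. The gold segmentation is taken from the paper's title.

```python
from typing import List, Set


def boundaries(segments: List[str]) -> Set[int]:
    """Character offsets of the internal boundaries of a segmentation."""
    positions: Set[int] = set()
    offset = 0
    for segment in segments[:-1]:   # the end of the word is not an internal boundary
        offset += len(segment)
        positions.add(offset)
    return positions


def boundary_f1(predicted: List[str], gold: List[str]) -> float:
    """F1 between predicted and gold boundary sets (assumed metric, see lead-in)."""
    pred_b, gold_b = boundaries(predicted), boundaries(gold)
    if not pred_b and not gold_b:
        return 1.0                  # both agree the word is unsegmented
    true_positives = len(pred_b & gold_b)
    precision = true_positives / len(pred_b) if pred_b else 0.0
    recall = true_positives / len(gold_b) if gold_b else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold = ["smør", "re", "brød"]       # segmentation shown in the paper's title
predicted = ["smør", "rebrød"]      # hypothetical tokenizer output missing one boundary
score = boundary_f1(predicted, gold)
print(round(score, 4))
```

Here the prediction finds one of the two gold boundaries with no false positives, so precision is 1.0 and recall is 0.5.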

Title: Register Always Matters: Analysis of LLM Pretraining Data Through the Lens of Language Variation

Authors: Amanda Myntti, Erik Henriksson, Veronika Laippala, Sampo Pyysalo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.01542
Pdf URL: https://arxiv.org/pdf/2504.01542
Copy Paste: [[2504.01542]] Register Always Matters: Analysis of LLM Pretraining Data Through the Lens of Language Variation(https://arxiv.org/abs/2504.01542)
Keywords: language model, llm
Abstract: Pretraining data curation is a cornerstone in Large Language Model (LLM) development, leading to growing research on quality filtering of large web corpora. From statistical quality flags to LLM-based labeling systems, datasets are divided into categories, frequently reducing to a binary: those passing the filters deemed as valuable examples, others discarded as useless or detrimental. However, a more detailed understanding of the contribution of different kinds of texts to model performance is still largely lacking. In this article, we present the first study utilizing registers (also known as genres) - a widely used standard in corpus linguistics to model linguistic variation - to curate pretraining datasets and investigate the effect of register on the performance of LLMs. We perform comparative studies by training models with register classified data and evaluating them using standard benchmarks, and show that the register of pretraining data substantially affects model performance. We uncover surprising relationships between the pretraining material and the resulting models: using the News register results in subpar performance, and on the contrary, including the Opinion class, covering texts such as reviews and opinion blogs, is highly beneficial. While a model trained on the entire unfiltered dataset outperforms those trained on datasets limited to a single register, combining well-performing registers like How-to-Instructions, Informational Description, and Opinion leads to major improvements. Furthermore, analysis of individual benchmark results reveals key differences in the strengths and drawbacks of specific register classes as pretraining data. These findings show that register is an important explainer of model variation and can facilitate more deliberate future data selection practices.

Title: Testing Low-Resource Language Support in LLMs Using Language Proficiency Exams: the Case of Luxembourgish

Authors: Cedric Lothritz, Jordi Cabot
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.01667
Pdf URL: https://arxiv.org/pdf/2504.01667
Copy Paste: [[2504.01667]] Testing Low-Resource Language Support in LLMs Using Language Proficiency Exams: the Case of Luxembourgish(https://arxiv.org/abs/2504.01667)
Keywords: language model, gpt, llm, chat
Abstract: Large Language Models (LLMs) have become an increasingly important tool in research and society at large. While LLMs are regularly used all over the world by experts and laypeople alike, they are predominantly developed with English-speaking users in mind, performing well in English and other widespread languages while less-resourced languages such as Luxembourgish are seen as a lower priority. This lack of attention is also reflected in the sparsity of available evaluation tools and datasets. In this study, we investigate the viability of language proficiency exams as such evaluation tools for the Luxembourgish language. We find that large models such as ChatGPT, Claude and DeepSeek-R1 typically achieve high scores, while smaller models show weak performances. We also find that the performances in such language exams can be used to predict performances in other NLP tasks.

Title: ToM-RL: Reinforcement Learning Unlocks Theory of Mind in Small LLMs

Authors: Yi-Long Lu, Chunhui Zhang, Jiajun Song, Lifeng Fan, Wei Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.01698
Pdf URL: https://arxiv.org/pdf/2504.01698
Copy Paste: [[2504.01698]] ToM-RL: Reinforcement Learning Unlocks Theory of Mind in Small LLMs(https://arxiv.org/abs/2504.01698)
Keywords: language model, gpt, llm
Abstract: Recent advancements in rule-based reinforcement learning (RL), applied during the post-training phase of large language models (LLMs), have significantly enhanced their capabilities in structured reasoning tasks such as mathematics and logical inference. However, the effectiveness of RL in social reasoning, particularly in Theory of Mind (ToM), the ability to infer others' mental states, remains largely unexplored. In this study, we demonstrate that RL methods effectively unlock ToM reasoning capabilities even in small-scale LLMs (0.5B to 7B parameters). Using a modest dataset comprising 3200 questions across diverse scenarios, our RL-trained 7B model achieves 84.50% accuracy on the Hi-ToM benchmark, surpassing models like GPT-4o and DeepSeek-v3 despite significantly fewer parameters. While smaller models (≤3B parameters) suffer from reasoning collapse, larger models (7B parameters) maintain stable performance through consistent belief tracking. Additionally, our RL-based models demonstrate robust generalization to higher-order, out-of-distribution ToM problems, novel textual presentations, and previously unseen datasets. These findings highlight RL's potential to enhance social cognitive reasoning, bridging the gap between structured problem-solving and nuanced social inference in LLMs.

Title: InfiniteICL: Breaking the Limit of Context Window Size via Long Short-term Memory Transformation

Authors: Bowen Cao, Deng Cai, Wai Lam
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.01707
Pdf URL: https://arxiv.org/pdf/2504.01707
Copy Paste: [[2504.01707]] InfiniteICL: Breaking the Limit of Context Window Size via Long Short-term Memory Transformation(https://arxiv.org/abs/2504.01707)
Keywords: language model, llm, long context, prompt
Abstract: In-context learning (ICL) is critical for large language models (LLMs), but its effectiveness is constrained by finite context windows, particularly in ultra-long contexts. To overcome this, we introduce InfiniteICL, a framework that parallels context and parameters in LLMs with short- and long-term memory in human cognitive systems, focusing on transforming temporary context knowledge into permanent parameter updates. This approach significantly reduces memory usage, maintains robust performance across varying input lengths, and theoretically enables infinite context integration through the principles of context knowledge elicitation, selection, and consolidation. Evaluations demonstrate that our method reduces context length by 90% while achieving 103% average performance of full-context prompting across fact recall, grounded reasoning, and skill acquisition tasks. When conducting sequential multi-turn transformations on complex, real-world contexts (with length up to 2M tokens), our approach surpasses full-context prompting while using only 0.4% of the original contexts. These findings highlight InfiniteICL's potential to enhance the scalability and efficiency of LLMs by breaking the limitations of conventional context window sizes.

Title: Style over Substance: Distilled Language Models Reason Via Stylistic Replication

Authors: Philip Lippmann, Jie Yang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.01738
Pdf URL: https://arxiv.org/pdf/2504.01738
Copy Paste: [[2504.01738]] Style over Substance: Distilled Language Models Reason Via Stylistic Replication(https://arxiv.org/abs/2504.01738)
Keywords: language model
Abstract: Specialized reasoning language models (RLMs) have demonstrated that scaling test-time computation through detailed reasoning traces significantly enhances performance. Although these traces effectively facilitate knowledge distillation into smaller, instruction-tuned models, the precise nature of transferred reasoning remains unclear. In this study, we investigate to what extent distilled models internalize replicated stylistic patterns during reasoning. To this end, we systematically analyze reasoning traces, identifying structural and lexical patterns that characterize successful reasoning. We then introduce two new datasets -- a dataset of emergent reasoning traces and a synthetic dataset explicitly constructed to replicate these stylistic patterns -- to precisely examine their influence on distilled models' reasoning capabilities. We find that models trained on the synthetic traces achieve comparable performance, indicating that distilled reasoning abilities rely significantly on surface-level patterns. Surprisingly, we observe an increase in performance even when the synthetic traces are altered to lead to the wrong answer. Our findings highlight how stylistic patterns can be leveraged to efficiently enhance LM reasoning across diverse model families.

Title: OpenThaiGPT 1.6 and R1: Thai-Centric Open Source and Reasoning Large Language Models

Authors: Sumeth Yuenyong, Thodsaporn Chay-intr, Kobkrit Viriyayudhakorn
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.01789
Pdf URL: https://arxiv.org/pdf/2504.01789
Copy Paste: [[2504.01789]] OpenThaiGPT 1.6 and R1: Thai-Centric Open Source and Reasoning Large Language Models(https://arxiv.org/abs/2504.01789)
Keywords: language model, gpt, llm
Abstract: We present OpenThaiGPT 1.6 and R1 (OTG-1.6 and OTG-R1), Thai-centric Large Language Models (LLMs) developed through distinct methodologies to enhance generalization and reasoning capabilities. OTG-1.6 employs Task Arithmetic model merging for broad generalization, while OTG-R1 integrates multi-stage training with the Less-Is-More Reasoning Hypothesis (LIMO) for advanced reasoning. Benchmark evaluations demonstrate superior performance across Thai language tasks, achieving competitive results against larger-scale open-source Thai LLMs. This paper details the proposed models, training processes, benchmarks, and results, highlighting improvements over previous models and establishing new performance standards for Thai-centric LLMs.
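The Task Arithmetic merging mentioned in the OpenThaiGPT abstract can be sketched in its generic form: each fine-tuned model contributes a task vector (its weights minus the base weights), and the scaled sum of task vectors is added back to the base. The abstract gives no details of OTG-1.6's actual recipe, so the function name, the plain-list weight format, and the `scale` value below are assumptions for illustration.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]  # parameter name -> flat list of values (toy format)


def task_arithmetic_merge(base: Weights, finetuned: List[Weights], scale: float = 1.0) -> Weights:
    """Return base + scale * sum_i (finetuned_i - base), parameter by parameter."""
    merged: Weights = {}
    for name, base_values in base.items():
        # Accumulate the task vectors of all fine-tuned models for this parameter.
        delta = [0.0] * len(base_values)
        for model in finetuned:
            for i, (w_ft, w_base) in enumerate(zip(model[name], base_values)):
                delta[i] += w_ft - w_base
        merged[name] = [w + scale * d for w, d in zip(base_values, delta)]
    return merged


# Toy weights; real checkpoints would hold tensors, not two-element lists.
base = {"layer.weight": [1.0, 2.0]}
model_a = {"layer.weight": [1.5, 2.0]}   # hypothetical model fine-tuned on task A
model_b = {"layer.weight": [1.0, 3.0]}   # hypothetical model fine-tuned on task B
merged = task_arithmetic_merge(base, [model_a, model_b], scale=0.5)
print(merged)
```

With `scale=0.5`, each parameter moves halfway along the summed task vectors, so the merged model keeps part of both fine-tunes without any additional training.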

Title: Investigating and Scaling up Code-Switching for Multilingual Language Model Pre-Training

Authors: Zhijun Wang, Jiahuan Li, Hao Zhou, Rongxiang Weng, Jingang Wang, Xin Huang, Xue Han, Junlan Feng, Chao Deng, Shujian Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.01801
Pdf URL: https://arxiv.org/pdf/2504.01801
Copy Paste: [[2504.01801]] Investigating and Scaling up Code-Switching for Multilingual Language Model Pre-Training(https://arxiv.org/abs/2504.01801)
Keywords: language model, llm
Abstract: Large language models (LLMs) exhibit remarkable multilingual capabilities despite the extreme language imbalance in the pre-training data. In this paper, we closely examine the reasons behind this phenomenon, focusing on the pre-training corpus. We find that the existence of code-switching, alternating between different languages within a context, is key to multilingual capabilities. We conduct an analysis to investigate code-switching in the pre-training corpus, examining its presence and categorizing it into four types within two quadrants. We then assess its impact on multilingual performance. These types of code-switching data are unbalanced in proportions and demonstrate different effects on facilitating language transfer. To better explore the power of code-switching for language alignment during pre-training, we investigate the strategy of synthetic code-switching. We continuously scale up the synthetic code-switching data and observe remarkable improvements in both benchmarks and representation space. Extensive experiments indicate that incorporating synthetic code-switching data enables better language alignment and generalizes well to high, medium, and low-resource languages with pre-training corpora of varying qualities.

Title: YourBench: Easy Custom Evaluation Sets for Everyone

Authors: Sumuk Shashidhar, Clémentine Fourrier, Alina Lozovskia, Thomas Wolf, Gokhan Tur, Dilek Hakkani-Tür
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.01833
Pdf URL: https://arxiv.org/pdf/2504.01833
Copy Paste: [[2504.01833]] YourBench: Easy Custom Evaluation Sets for Everyone(https://arxiv.org/abs/2504.01833)
Keywords: language model, llm
Abstract: Evaluating large language models (LLMs) effectively remains a critical bottleneck, as traditional static benchmarks suffer from saturation and contamination, while human evaluations are costly and slow. This hinders timely or domain-specific assessment, crucial for real-world applications. We introduce YourBench, a novel, open-source framework that addresses these limitations by enabling dynamic, automated generation of reliable, up-to-date, and domain-tailored benchmarks cheaply and without manual annotation, directly from user-provided documents. We demonstrate its efficacy by replicating 7 diverse MMLU subsets using minimal source text, achieving this for under 15 USD in total inference costs while perfectly preserving the relative model performance rankings (Spearman Rho = 1) observed on the original benchmark. To ensure that YourBench generates data grounded in provided input instead of relying on posterior parametric knowledge in models, we also introduce Tempora-0325, a novel dataset of over 7K diverse documents, published exclusively after March 2025. Our comprehensive analysis spans 26 SoTA models from 7 major families across varying scales (3-671B parameters) to validate the quality of generated evaluations through rigorous algorithmic checks (e.g., citation grounding) and human assessments. We release the YourBench library, the Tempora-0325 dataset, 150k+ question answer pairs based on Tempora and all evaluation and inference traces to facilitate reproducible research and empower the community to generate bespoke benchmarks on demand, fostering more relevant and trustworthy LLM evaluation.
摘要：有效评估大型语言模型（LLM）仍然是一个关键的瓶颈，因为传统的静态基准遭受饱和和污染的影响，而人类评估则昂贵且缓慢。这阻碍了及时或特定领域的评估，而这类评估对于现实世界的应用至关重要。我们介绍了YourBench，这是一个新颖的开源框架，它能够直接从用户提供的文档中以低成本、无需手动注释的方式动态、自动地生成可靠、最新且针对特定领域的基准，从而解决这些局限性。我们通过使用最少的源文本复制7个不同的MMLU子集来证明其功效，总推理成本低于15美元，同时完美地保留了在原始基准上观察到的相对模型性能排名（Spearman Rho = 1）。为了确保YourBench生成的数据以所提供的输入为依据，而不是依赖模型中的后验参数知识，我们还引入了Tempora-0325，这是一个包含7K多份多样化文档的新数据集，这些文档均在2025年3月之后发布。我们的综合分析涵盖来自7个主要家族、不同规模（3-671B参数）的26个SoTA模型，通过严格的算法检查（例如引用依据）和人类评估来验证所生成评估的质量。我们发布了YourBench库、Tempora-0325数据集、基于Tempora的150K+问答对以及所有评估和推理痕迹，以促进可复现的研究，并使社区能够按需生成定制基准，从而促进更相关、更可信的LLM评估。
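The "Spearman Rho = 1" claim in the YourBench abstract means that the models are ordered identically on the original and the regenerated benchmark. The sketch below shows how such a rank correlation can be computed. It is not taken from the YourBench library; the function names and the accuracy values are hypothetical and chosen only for illustration.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: the Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical accuracies of four models (illustrative numbers only).
original_scores = [0.71, 0.64, 0.58, 0.49]
replicated_scores = [0.80, 0.66, 0.61, 0.40]  # same ordering, different values
swapped_scores = [0.80, 0.61, 0.66, 0.40]     # two middle models swapped

rho_same = spearman_rho(original_scores, replicated_scores)
rho_swapped = spearman_rho(original_scores, swapped_scores)
```

Because only ranks matter, the regenerated benchmark can shift absolute scores and still reach rho = 1, while swapping any two models lowers it.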

Title: LRAGE: Legal Retrieval Augmented Generation Evaluation Tool

Authors: Minhu Park, Hongseok Oh, Eunkyung Choi, Wonseok Hwang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.01840
Pdf URL: https://arxiv.org/pdf/2504.01840
Copy Paste: [[2504.01840]] LRAGE: Legal Retrieval Augmented Generation Evaluation Tool(https://arxiv.org/abs/2504.01840)
Keywords: language model, llm, retrieval augmented generation, retrieval-augmented generation
Abstract: Recently, building retrieval-augmented generation (RAG) systems to enhance the capability of large language models (LLMs) has become a common practice. Especially in the legal domain, previous judicial decisions play a significant role under the doctrine of stare decisis, which emphasizes the importance of making decisions based on (retrieved) prior documents. However, the overall performance of a RAG system depends on many components: (1) retrieval corpora, (2) retrieval algorithms, (3) rerankers, (4) LLM backbones, and (5) evaluation metrics. Here we propose LRAGE, an open-source tool for holistic evaluation of RAG systems focusing on the legal domain. LRAGE provides GUI and CLI interfaces to facilitate seamless experiments and investigate how changes in the aforementioned five components affect the overall accuracy. We validated LRAGE using multilingual legal benchmarks including Korean (KBL), English (LegalBench), and Chinese (LawBench) by demonstrating how the overall accuracy changes when varying the five components mentioned above. The source code is available at this https URL.
摘要：近来，构建检索增强生成（RAG）系统以增强大语言模型（LLM）的能力已成为一种普遍做法。尤其是在法律领域，以往的司法判决在遵循先例（stare decisis）原则下起着重要作用，该原则强调基于（检索到的）先前文件做出决定的重要性。但是，RAG系统的总体性能取决于许多组件：（1）检索语料库，（2）检索算法，（3）重排序器，（4）LLM骨干模型和（5）评估指标。在这里，我们提出了LRAGE，这是一种开源工具，用于对面向法律领域的RAG系统进行整体评估。LRAGE提供GUI和CLI接口，以便顺畅地开展实验，并研究上述五个组件的变化如何影响整体准确性。我们使用多语言法律基准（包括韩语（KBL）、英语（LegalBench）和中文（LawBench））验证了LRAGE，展示了在改变上述五个组件时整体准确性如何变化。源代码可在此HTTPS URL上找到。

Title: Cross-Lingual Consistency: A Novel Inference Framework for Advancing Reasoning in Large Language Models

Authors: Zhiwei Yu, Tuo Li, Changhong Wang, Hui Chen, Lang Zhou
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.01857
Pdf URL: https://arxiv.org/pdf/2504.01857
Copy Paste: [[2504.01857]] Cross-Lingual Consistency: A Novel Inference Framework for Advancing Reasoning in Large Language Models(https://arxiv.org/abs/2504.01857)
Keywords: language model, llm, chain-of-thought
Abstract: Chain-of-thought (CoT) has emerged as a critical mechanism for enhancing reasoning capabilities in large language models (LLMs), with self-consistency demonstrating notable promise in boosting performance. However, inherent linguistic biases in multilingual training corpora frequently cause semantic drift and logical inconsistencies, especially in sub-10B parameter LLMs handling complex inference tasks. To overcome these constraints, we propose the Cross-Lingual Consistency (CLC) framework, an innovative inference paradigm that integrates multilingual reasoning paths through majority voting to elevate LLMs' reasoning capabilities. Empirical evaluations on the CMATH dataset reveal CLC's superiority over the conventional self-consistency method, delivering 9.5%, 6.5%, and 6.0% absolute accuracy gains for DeepSeek-Math-7B-Instruct, Qwen2.5-Math-7B-Instruct, and Gemma2-9B-Instruct respectively. Expanding CLC's linguistic scope to 11 diverse languages implies two synergistic benefits: 1) neutralizing linguistic biases in multilingual training corpora through multilingual ensemble voting, 2) escaping monolingual reasoning traps by exploring the broader multilingual solution space. These dual benefits empirically enable more globally optimal reasoning paths compared to monolingual self-consistency baselines, as evidenced by the 4.1%-18.5% accuracy gains using Gemma2-9B-Instruct on the MGSM dataset.
摘要：思维链（CoT）已成为增强大语言模型（LLM）推理能力的关键机制，其中自一致性在提升性能方面展现出显著的潜力。但是，多语言训练语料库中固有的语言偏见经常引起语义漂移和逻辑不一致，尤其是在参数量低于10B的LLM处理复杂推理任务时。为了克服这些约束，我们提出了跨语言一致性（CLC）框架，这是一种创新的推理范式，通过多数投票整合多语言推理路径，以提高LLM的推理能力。在CMATH数据集上的实证评估表明CLC优于传统的自一致性方法，分别为DeepSeek-Math-7B-Instruct、Qwen2.5-Math-7B-Instruct和Gemma2-9B-Instruct带来了9.5％、6.5％和6.0％的绝对准确率提升。将CLC的语言范围扩展到11种不同的语言带来两个协同的好处：1）通过多语言集成投票中和多语言训练语料库中的语言偏见，2）通过探索更广泛的多语言解决方案空间来摆脱单语推理陷阱。与单语自一致性基线相比，这种双重益处在经验上可以实现更接近全局最优的推理路径，在MGSM数据集上使用Gemma2-9B-Instruct获得的4.1％-18.5％的准确率提升证明了这一点。
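The CLC abstract describes pooling reasoning paths from several languages and taking a majority vote over their final answers. A minimal sketch of that voting step follows, assuming the final answers have already been extracted and normalized into comparable strings. The function names and sample answers are hypothetical, and tie-breaking is left unspecified in the abstract, so this sketch simply keeps whichever answer `Counter` returns first.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer, ignoring failed extractions (None)."""
    counts = Counter(a for a in answers if a is not None)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


def cross_lingual_vote(answers_by_language):
    """Pool the sampled answers from every language, then vote once."""
    pooled = [a for answers in answers_by_language.values() for a in answers]
    return majority_vote(pooled)


# Hypothetical final answers sampled for one math problem (illustrative only).
sampled = {
    "en": ["41", "41", "42"],
    "zh": ["42", "42"],
    "fr": ["42", "41"],
}

monolingual_answer = majority_vote(sampled["en"])  # self-consistency, English only
cross_lingual_answer = cross_lingual_vote(sampled)  # vote across all languages
```

In this toy case the English-only vote settles on "41", while the pooled vote picks "42", which illustrates how paths in other languages can outvote a monolingual error.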

Title: TransientTables: Evaluating LLMs' Reasoning on Temporally Evolving Semi-structured Tables

Authors: Abhilash Shankarampeta, Harsh Mahajan, Tushar Kataria, Dan Roth, Vivek Gupta
Subjects: cs.CL, cs.CV, cs.IR
Abstract URL: https://arxiv.org/abs/2504.01879
Pdf URL: https://arxiv.org/pdf/2504.01879
Copy Paste: [[2504.01879]] TransientTables: Evaluating LLMs' Reasoning on Temporally Evolving Semi-structured Tables(https://arxiv.org/abs/2504.01879)
Keywords: language model, llm
Abstract: Humans continuously make new discoveries, and understanding the temporal sequence of events leading to these breakthroughs is essential for advancing science and society. This ability to reason over time allows us to identify future steps and understand the effects of financial and political decisions on our lives. However, large language models (LLMs) are typically trained on static datasets, limiting their ability to perform effective temporal reasoning. To assess the temporal reasoning capabilities of LLMs, we present the TRANSIENTTABLES dataset, which comprises 3,971 questions derived from over 14,000 tables, spanning 1,238 entities across multiple time periods. We introduce a template-based question-generation pipeline that harnesses LLMs to refine both templates and questions. Additionally, we establish baseline results using state-of-the-art LLMs to create a benchmark. We also introduce novel modeling strategies centered around task decomposition, enhancing LLM performance.
摘要：人类不断提出新发现，并了解导致这些突破的事件的时间顺序对于推进科学和社会至关重要。随着时间的流逝，这种推理能力使我们能够确定未来的步骤，并了解财务和政治决策对我们生活的影响。但是，大型语言模型（LLM）通常在静态数据集上进行培训，从而限制了它们执行有效的时间推理的能力。为了评估LLMS的时间推理功能，我们介绍了瞬态数据集，该数据集包含3,971个问题，这些问题来自14,000多个表，跨越了多个时间段内的1,238个实体。我们介绍了基于模板的问题生成管道，该管道利用LLMS来完善模板和问题。此外，我们使用最先进的LLMS建立基线结果来创建基准测试。我们还介绍了围绕任务分解的新型建模策略，从而提高了LLM的性能。

Title: STAR-1: Safer Alignment of Reasoning LLMs with 1K Data

Authors: Zijun Wang, Haoqin Tu, Yuhan Wang, Juncheng Wu, Jieru Mei, Brian R. Bartoldson, Bhavya Kailkhura, Cihang Xie
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.01903
Pdf URL: https://arxiv.org/pdf/2504.01903
Copy Paste: [[2504.01903]] STAR-1: Safer Alignment of Reasoning LLMs with 1K Data(https://arxiv.org/abs/2504.01903)
Keywords: gpt, llm
Abstract: This paper introduces STAR-1, a high-quality, just-1k-scale safety dataset specifically designed for large reasoning models (LRMs) like DeepSeek-R1. Built on three core principles -- diversity, deliberative reasoning, and rigorous filtering -- STAR-1 aims to address the critical needs for safety alignment in LRMs. Specifically, we begin by integrating existing open-source safety datasets from diverse sources. Then, we curate safety policies to generate policy-grounded deliberative reasoning samples. Lastly, we apply a GPT-4o-based safety scoring system to select training examples aligned with best practices. Experimental results show that fine-tuning LRMs with STAR-1 leads to an average 40% improvement in safety performance across four benchmarks, while only incurring a marginal decrease (e.g., an average of 1.1%) in reasoning ability measured across five reasoning tasks. Extensive ablation studies further validate the importance of our design principles in constructing STAR-1 and analyze its efficacy across both LRMs and traditional LLMs. Our project page is this https URL.
摘要：本文介绍了STAR-1，这是一种专为大型推理模型（LRMS）等设计的高质量，仅1K规模的安全数据集，例如DeepSeek-R1。 Star-1建立在三个核心原则（多样性，审议推理和严格的过滤）上，旨在满足LRMS安全对齐的关键需求。具体来说，我们首先将来自不同来源的现有开源安全数据集整合在一起。然后，我们制定安全政策，以生成政策基础的审议推理样本。最后，我们应用基于GPT-4O的安全评分系统来选择与最佳实践相一致的培训示例。实验结果表明，具有STAR-1的微调LRM在四个基准测试中的安全性能平均提高了40％，而在五个推理任务中测得的推理能力的边际下降（例如，平均为1.1％）。广泛的消融研究进一步验证了我们的设计原理在构建Star-1中的重要性，并分析了其在LRM和传统LLM的功效。我们的项目页面是此HTTPS URL。

Title: Bridging the Linguistic Divide: A Survey on Leveraging Large Language Models for Machine Translation

Authors: Baban Gain, Dibyanayan Bandyopadhyay, Asif Ekbal
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.01919
Pdf URL: https://arxiv.org/pdf/2504.01919
Copy Paste: [[2504.01919]] Bridging the Linguistic Divide: A Survey on Leveraging Large Language Models for Machine Translation(https://arxiv.org/abs/2504.01919)
Keywords: language model, llm, hallucination, prompt
Abstract: The advent of Large Language Models (LLMs) has significantly reshaped the landscape of machine translation (MT), particularly for low-resource languages and domains that lack sufficient parallel corpora, linguistic tools, and computational infrastructure. This survey presents a comprehensive overview of recent progress in leveraging LLMs for MT. We analyze techniques such as few-shot prompting, cross-lingual transfer, and parameter-efficient fine-tuning that enable effective adaptation to under-resourced settings. The paper also explores synthetic data generation strategies using LLMs, including back-translation and lexical augmentation. Additionally, we compare LLM-based translation with traditional encoder-decoder models across diverse language pairs, highlighting the strengths and limitations of each. We discuss persistent challenges such as hallucinations, evaluation inconsistencies, and inherited biases while also evaluating emerging LLM-driven metrics for translation quality. This survey offers practical insights and outlines future directions for building robust, inclusive, and scalable MT systems in the era of large-scale generative models.
摘要：大型语言模型（LLM）的出现显着重塑了机器翻译（MT）的景观，尤其是对于缺乏足够的并行语料库，语言工具和计算基础架构的低资源语言和域。这项调查概述了为MT利用LLM的最新进展。我们分析了诸如少量发动机，跨语义转移和参数有效的微调等技术，从而有效适应了资源不足的设置。本文还使用LLM探索了合成数据生成策略，包括反向翻译和词汇增强。此外，我们将基于LLM的翻译与跨不同语言对的传统编码器模型进行了比较，突出了每种语言的优势和局限性。我们讨论持续的挑战，例如幻觉，评估矛盾和继承的偏见，同时还评估了新兴的LLM驱动指标的翻译质量。这项调查提供了实用的见解，并概述了在大规模生成模型时代建立强大，包容和可扩展的MT系统的未来方向。

Title: Is the Reversal Curse a Binding Problem? Uncovering Limitations of Transformers from a Basic Generalization Failure

Authors: Boshi Wang, Huan Sun
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2504.01928
Pdf URL: https://arxiv.org/pdf/2504.01928
Copy Paste: [[2504.01928]] Is the Reversal Curse a Binding Problem? Uncovering Limitations of Transformers from a Basic Generalization Failure(https://arxiv.org/abs/2504.01928)
Keywords: llm
Abstract: Despite their impressive capabilities, LLMs exhibit a basic generalization failure known as the Reversal Curse, where they struggle to learn reversible factual associations. Understanding why this occurs could help identify weaknesses in current models and advance their generalization and robustness. In this paper, we conjecture that the Reversal Curse in LLMs is a manifestation of the long-standing binding problem in cognitive science, neuroscience and AI. Specifically, we identify two primary causes of the Reversal Curse stemming from transformers' limitations in conceptual binding: the inconsistency and entanglements of concept representations. We perform a series of experiments that support these conjectures. Our exploration leads to a model design based on JEPA (Joint-Embedding Predictive Architecture) that for the first time breaks the Reversal Curse without side-stepping it with specialized data augmentation or non-causal masking, and moreover, generalization could be further improved by incorporating special memory layers that support disentangled concept representations. We demonstrate that the skill of reversal unlocks a new kind of memory integration that enables models to solve large-scale arithmetic reasoning problems via parametric forward-chaining, outperforming frontier LLMs based on non-parametric memory and prolonged explicit reasoning.
摘要：尽管具有令人印象深刻的能力，但LLM表现出一种基本的泛化失败，称为“逆转诅咒”，即它们难以学习可逆的事实关联。了解为什么发生这种情况有助于确定当前模型中的弱点并提高其泛化能力和鲁棒性。在本文中，我们猜想LLM中的逆转诅咒是认知科学、神经科学和AI中长期存在的绑定问题的体现。具体而言，我们确定了逆转诅咒的两个主要原因，它们源于Transformer在概念绑定方面的局限性：概念表示的不一致和纠缠。我们进行了一系列支持这些猜想的实验。我们的探索得出了一种基于JEPA（联合嵌入预测架构）的模型设计，该设计首次打破了逆转诅咒，而无需借助专门的数据增强或非因果掩码来回避问题；此外，通过加入支持解耦概念表示的特殊记忆层，可以进一步改善泛化。我们证明，逆转的技能可以解锁一种新型的记忆整合，使模型能够通过参数化前向链接解决大规模的算术推理问题，优于基于非参数记忆和长时间显式推理的前沿LLM。

Title: A thorough benchmark of automatic text classification: From traditional approaches to large language models

Authors: Washington Cunha, Leonardo Rocha, Marcos André Gonçalves
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.01930
Pdf URL: https://arxiv.org/pdf/2504.01930
Copy Paste: [[2504.01930]] A thorough benchmark of automatic text classification: From traditional approaches to large language models(https://arxiv.org/abs/2504.01930)
Keywords: language model, llm
Abstract: Automatic text classification (ATC) has experienced remarkable advancements in the past decade, best exemplified by recent small and large language models (SLMs and LLMs), leveraged by Transformer architectures. Despite recent effectiveness improvements, a comprehensive cost-benefit analysis investigating whether the effectiveness gains of these recent approaches compensate for their much higher costs when compared to more traditional text classification approaches such as SVMs and Logistic Regression is still missing in the literature. In this context, this work's main contributions are twofold: (i) we provide a scientifically sound comparative analysis of the cost-benefit of twelve traditional and recent ATC solutions including five open LLMs, and (ii) a large benchmark comprising 22 datasets, including sentiment analysis and topic classification, with their (train-validation-test) partitions based on folded cross-validation procedures, along with documentation, and code. The release of code, data, and documentation enables the community to replicate experiments and advance the field in a more scientifically sound manner. Our comparative experimental results indicate that LLMs outperform traditional approaches (up to 26%-7.1% on average) and SLMs (up to 4.9%-1.9% on average) in terms of effectiveness. However, LLMs incur significantly higher computational costs due to fine-tuning, being, on average, 590x and 8.5x slower than traditional methods and SLMs, respectively. The results suggest the following recommendations: (1) LLMs for applications that require the best possible effectiveness and can afford the costs; (2) traditional methods such as Logistic Regression and SVM for resource-limited applications or those that cannot afford the cost of tuning large LLMs; and (3) SLMs like Roberta for near-optimal effectiveness-efficiency trade-off.
摘要：自动文本分类（ATC）在过去十年中取得了显著进步，最典型的体现是近期基于Transformer架构的小型和大型语言模型（SLM和LLM）。尽管近期有效性有所提高，但文献中仍然缺少全面的成本效益分析，以研究与更传统的文本分类方法（例如SVM和逻辑回归）相比，这些最新方法的有效性提升是否足以弥补其高得多的成本。在此背景下，这项工作的主要贡献有两方面：（i）我们对十二种传统和最新的ATC解决方案（包括五个开放的LLM）的成本效益进行了科学合理的比较分析；（ii）一个包含22个数据集的大型基准，涵盖情感分析和主题分类，并附带基于折叠交叉验证程序的（训练-验证-测试）划分，以及文档和代码。代码、数据和文档的发布使社区能够以更科学的方式复制实验并推进该领域。我们的比较实验结果表明，在有效性方面，LLM优于传统方法（平均最多高出26％-7.1％）和SLM（平均最多高出4.9％-1.9％）。但是，由于微调，LLM的计算成本明显更高，平均分别比传统方法和SLM慢590倍和8.5倍。结果提出了以下建议：（1）对于需要尽可能高的有效性并能负担成本的应用，使用LLM；（2）对于资源有限或无法负担大型LLM调优成本的应用，使用逻辑回归和SVM等传统方法；（3）使用Roberta这样的SLM，以获得近乎最优的有效性-效率权衡。

Title: Review, Refine, Repeat: Understanding Iterative Decoding of AI Agents with Dynamic Evaluation and Selection

Authors: Souradip Chakraborty, Mohammadreza Pourreza, Ruoxi Sun, Yiwen Song, Nino Scherrer, Jindong Gu, Furong Huang, Amrit Singh Bedi, Ahmad Beirami, Hamid Palangi, Tomas Pfister
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.01931
Pdf URL: https://arxiv.org/pdf/2504.01931
Copy Paste: [[2504.01931]] Review, Refine, Repeat: Understanding Iterative Decoding of AI Agents with Dynamic Evaluation and Selection(https://arxiv.org/abs/2504.01931)
Keywords: llm, agent
Abstract: While AI agents have shown remarkable performance at various tasks, they still struggle with complex multi-modal applications, structured generation and strategic planning. Improvements via standard fine-tuning are often impractical, as solving agentic tasks usually relies on black box API access without control over model parameters. Inference-time methods such as Best-of-N (BON) sampling offer a simple yet effective alternative to improve performance. However, BON lacks an iterative feedback integration mechanism. Hence, we propose Iterative Agent Decoding (IAD), which combines iterative refinement with dynamic candidate evaluation and selection guided by a verifier. IAD differs in how feedback is designed and integrated, specifically optimized to extract maximal signal from reward scores. We conduct a detailed comparison of baselines across key metrics on Sketch2Code, Text2SQL, and Webshop where IAD consistently outperforms baselines, achieving 3-6% absolute gains on Sketch2Code and Text2SQL (with and without LLM judges) and 8-10% gains on Webshop across multiple metrics. To better understand the source of IAD's gains, we perform controlled experiments to disentangle the effect of adaptive feedback from stochastic sampling, and find that IAD's improvements are primarily driven by verifier-guided refinement, not merely sampling diversity. We also show that both IAD and BON exhibit inference-time scaling with increased compute when guided by an optimal verifier. Our analysis highlights the critical role of verifier quality in effective inference-time optimization and examines the impact of noisy and sparse rewards on scaling behavior. Together, these findings offer key insights into the trade-offs and principles of effective inference-time optimization.
摘要：尽管AI代理在各种任务上表现出色，但它们在复杂的多模态应用、结构化生成和战略规划方面仍然存在困难。通过标准微调进行改进通常是不切实际的，因为解决代理任务通常依赖于黑盒API访问，无法控制模型参数。推理时方法（例如Best-of-N（BON）采样）提供了一种简单而有效的替代方案来提高性能。但是，BON缺乏迭代反馈整合机制。因此，我们提出了迭代代理解码（IAD），它将迭代细化与由验证器引导的动态候选评估和选择结合在一起。IAD的不同之处在于反馈的设计和整合方式，经过专门优化以从奖励分数中提取最大信号。我们在Sketch2Code、Text2SQL和Webshop上对各基线的关键指标进行了详细比较，IAD始终优于基线，在Sketch2Code和Text2SQL上获得了3-6％的绝对增益（无论是否使用LLM评判），在Webshop上的多个指标上获得了8-10％的增益。为了更好地理解IAD增益的来源，我们进行受控实验，将自适应反馈的效果与随机采样的效果区分开来，发现IAD的改进主要由验证器引导的细化驱动，而不仅仅是采样多样性。我们还表明，在最优验证器的引导下，IAD和BON都表现出随计算量增加的推理时扩展。我们的分析强调了验证器质量在有效推理时优化中的关键作用，并研究了嘈杂和稀疏奖励对扩展行为的影响。这些发现共同提供了对有效推理时优化的权衡和原则的关键见解。
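The contrast the abstract draws between Best-of-N sampling and verifier-guided iterative refinement can be sketched on a toy problem. This is an illustrative sketch only, not the authors' IAD implementation: `propose`, `verify`, and the integer-guessing task are hypothetical stand-ins for an LLM call, a reward model, and an agentic task.

```python
import random


def best_of_n(propose, verify, n):
    """Sample n independent candidates and keep the one the verifier prefers."""
    candidates = [propose(None) for _ in range(n)]
    return max(candidates, key=verify)


def iterative_decoding(propose, verify, rounds, n_per_round):
    """Each round proposes refinements of the current best candidate and
    keeps the best-scoring one, so the verifier score never decreases."""
    best = None
    for _ in range(rounds):
        candidates = [propose(best) for _ in range(n_per_round)]
        if best is not None:
            candidates.append(best)  # the incumbent stays eligible
        best = max(candidates, key=verify)
    return best


if __name__ == "__main__":
    # Toy task (hypothetical): guess an integer close to a hidden target.
    rng = random.Random(0)
    target = 37

    def verify(x):
        return -abs(x - target)  # higher is better

    def propose(previous):
        if previous is None:
            return rng.randint(0, 100)         # fresh, unconditioned sample
        return previous + rng.randint(-5, 5)   # local refinement of the incumbent

    # Both strategies use the same budget of 12 proposals.
    print("best-of-n:", best_of_n(propose, verify, n=12))
    print("iterative:", iterative_decoding(propose, verify, rounds=4, n_per_round=3))
```

The design difference is where the budget goes: Best-of-N spends all samples on unconditioned proposals, while the iterative loop conditions each new proposal on the best candidate found so far.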

Title: OpenCodeReasoning: Advancing Data Distillation for Competitive Coding

Authors: Wasi Uddin Ahmad, Sean Narenthiran, Somshubra Majumdar, Aleksander Ficek, Siddhartha Jain, Jocelyn Huang, Vahid Noroozi, Boris Ginsburg
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.01943
Pdf URL: https://arxiv.org/pdf/2504.01943
Copy Paste: [[2504.01943]] OpenCodeReasoning: Advancing Data Distillation for Competitive Coding(https://arxiv.org/abs/2504.01943)
Keywords: language model, llm
Abstract: Since the advent of reasoning-based large language models, many have found great success from distilling reasoning capabilities into student models. Such techniques have significantly bridged the gap between reasoning and standard LLMs on coding tasks. Despite this, much of the progress on distilling reasoning models remains locked behind proprietary datasets or lacks details on data curation, filtering and subsequent training. To address this, we construct a superior supervised fine-tuning (SFT) dataset that we use to achieve state-of-the-art coding capability results in models of various sizes. Our distilled models use only SFT to achieve 61.8% on LiveCodeBench and 24.6% on CodeContests, surpassing alternatives trained with reinforcement learning. We then perform analysis on the data sources used to construct our dataset, the impact of code execution filtering, and the importance of instruction/solution diversity. We observe that execution filtering negatively affected benchmark accuracy, leading us to prioritize instruction diversity over solution correctness. Finally, we also analyze the token efficiency and reasoning patterns utilized by these models. We will open-source these datasets and distilled models to the community.
摘要：自从基于推理的大语言模型的出现以来，许多人从将推理能力提炼到学生模型中取得了巨大的成功。这些技术在编码任务上已经显着弥合了推理和标准LLM之间的差距。尽管如此，蒸馏推理模型的许多进展仍然锁定在专有数据集后面，或者缺乏有关数据策展，过滤和随后培训的细节。为了解决这个问题，我们构建了一个卓越的监督微调（SFT）数据集，用于实现各种尺寸模型的最新编码功能。我们的蒸馏型仅使用SFT在Livecodebench上实现61.8％，而对于CodeContests的24.6％，超过了接受强化学习的替代方案。然后，我们对用于构建数据集的数据源进行分析，代码执行过滤的影响以及指令/解决方案多样性的重要性。我们观察到执行过滤对基准准确性产生负面影响，这使我们确定了指导多样性而不是解决方案正确性。最后，我们还分析了这些模型使用的令牌效率和推理模式。我们将向社区开放这些数据集和蒸馏模型。