2025-04-16

Title: LayerFlow: Layer-wise Exploration of LLM Embeddings using Uncertainty-aware Interlinked Projections

Authors: Rita Sevastjanova, Robin Gerling, Thilo Spinner, Mennatallah El-Assady
Subjects: cs.CL, cs.GR
Abstract URL: https://arxiv.org/abs/2504.10504
Pdf URL: https://arxiv.org/pdf/2504.10504
Copy Paste: [[2504.10504]] LayerFlow: Layer-wise Exploration of LLM Embeddings using Uncertainty-aware Interlinked Projections(https://arxiv.org/abs/2504.10504)
Keywords: language model, llm
Abstract: Large language models (LLMs) represent words through contextual word embeddings encoding different language properties like semantics and syntax. Understanding these properties is crucial, especially for researchers investigating language model capabilities, employing embeddings for tasks related to text similarity, or evaluating the reasons behind token importance as measured through attribution methods. Applications for embedding exploration frequently involve dimensionality reduction techniques, which reduce high-dimensional vectors to two dimensions used as coordinates in a scatterplot. This data transformation step introduces uncertainty that can be propagated to the visual representation and influence users' interpretation of the data. To communicate such uncertainties, we present LayerFlow - a visual analytics workspace that displays embeddings in an interlinked projection design and communicates the transformation, representation, and interpretation uncertainty. In particular, to hint at potential data distortions and uncertainties, the workspace includes several visual components, such as convex hulls showing 2D and HD clusters, data point pairwise distances, cluster summaries, and projection quality metrics. We show the usability of the presented workspace through replication and expert case studies that highlight the need to communicate uncertainty through multiple visual components and different data perspectives.
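Abstract (translated from Chinese): Large language models (LLMs) represent words through contextual word embeddings that encode different linguistic properties such as semantics and syntax. Understanding these properties is crucial, especially for researchers who study language model capabilities, use embeddings for tasks related to text similarity, or assess the reasons behind token importance as measured by attribution methods. Applications for embedding exploration often involve dimensionality reduction techniques, which reduce high-dimensional vectors to two dimensions used as coordinates in a scatterplot. This data transformation step introduces uncertainty that can propagate to the visual representation and affect how users interpret the data. To communicate such uncertainties, we present LayerFlow, a visual analytics workspace that displays embeddings in an interlinked projection design and communicates transformation, representation, and interpretation uncertainty. In particular, to hint at potential data distortions and uncertainties, the workspace includes several visual components, such as convex hulls showing 2D and HD clusters, pairwise distances between data points, cluster summaries, and projection quality metrics. We demonstrate the usability of the proposed workspace through replication and expert case studies, which highlight the need to communicate uncertainty through multiple visual components and different data perspectives.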

Title: Beyond Chains of Thought: Benchmarking Latent-Space Reasoning Abilities in Large Language Models

Authors: Thilo Hagendorff, Sarah Fabi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.10615
Pdf URL: https://arxiv.org/pdf/2504.10615
Copy Paste: [[2504.10615]] Beyond Chains of Thought: Benchmarking Latent-Space Reasoning Abilities in Large Language Models(https://arxiv.org/abs/2504.10615)
Keywords: language model, gpt, llm, prompt
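Abstract: Large language models (LLMs) can perform reasoning computations both internally within their latent space and externally by generating explicit token sequences like chains of thought. Significant progress in enhancing reasoning abilities has been made by scaling test-time compute. However, understanding and quantifying model-internal reasoning abilities - the inferential "leaps" models make between individual token predictions - remains crucial. This study introduces a benchmark (n = 4,000 items) designed to quantify model-internal reasoning in different domains. We achieve this by having LLMs indicate the correct solution to reasoning problems not through descriptive text, but by selecting a specific language of their initial response token that is different from English, the benchmark language. This not only requires models to reason beyond their context window, but also to override their default tendency to respond in the same language as the prompt, thereby posing an additional cognitive strain. We evaluate a set of 18 LLMs, showing significant performance variations, with GPT-4.5 achieving the highest accuracy (74.7%), outperforming models like Grok-2 (67.2%), and Llama 3.1 405B (65.6%). Control experiments and difficulty scaling analyses suggest that while LLMs engage in internal reasoning, we cannot rule out heuristic exploitations under certain conditions, marking an area for future investigation. Our experiments demonstrate that LLMs can "think" via latent-space computations, revealing model-internal inference strategies that need further understanding, especially regarding safety-related concerns such as covert planning, goal-seeking, or deception emerging without explicit token traces.
Abstract (translated from Chinese): Large language models (LLMs) can carry out reasoning computations both internally, within their latent space, and externally, by generating explicit token sequences such as chains of thought. Scaling test-time compute has brought significant progress in reasoning abilities. However, understanding and quantifying model-internal reasoning abilities, that is, the inferential "leaps" models make between individual token predictions, remains crucial. This study introduces a benchmark (n = 4,000 items) designed to quantify model-internal reasoning in different domains. We achieve this by having LLMs indicate the correct solution to a reasoning problem not through descriptive text but by choosing a specific language for their initial response token, one that differs from English, the benchmark language. This requires models not only to reason beyond their context window but also to override their default tendency to respond in the same language as the prompt, which imposes additional cognitive strain. We evaluate a set of 18 LLMs and find marked performance differences, with GPT-4.5 reaching the highest accuracy (74.7%), ahead of Grok-2 (67.2%) and Llama 3.1 405B (65.6%). Control experiments and difficulty scaling analyses suggest that, although LLMs engage in internal reasoning, heuristic exploitation cannot be ruled out under certain conditions, which marks an area for future research. Our experiments show that LLMs can "think" via latent-space computations, revealing model-internal inference strategies that need further understanding, especially with respect to safety-related concerns such as covert planning, goal-seeking, or deception that emerge without explicit token traces.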

Title: Better Estimation of the KL Divergence Between Language Models

Authors: Afra Amini, Tim Vieira, Ryan Cotterell
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.10637
Pdf URL: https://arxiv.org/pdf/2504.10637
Copy Paste: [[2504.10637]] Better Estimation of the KL Divergence Between Language Models(https://arxiv.org/abs/2504.10637)
Keywords: language model
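Abstract: Estimating the Kullback-Leibler (KL) divergence between language models has many applications, e.g., reinforcement learning from human feedback (RLHF), interpretability, and knowledge distillation. However, computing the exact KL divergence between two arbitrary language models is intractable. Thus, practitioners often resort to the use of sampling-based estimators. While it is easy to fashion a simple Monte Carlo (MC) estimator that provides an unbiased estimate of the KL divergence between language models, this estimator notoriously suffers from high variance, and can even result in a negative estimate of the KL divergence, a non-negative quantity. In this paper, we introduce a Rao-Blackwellized estimator that is also unbiased and provably has variance less than or equal to that of the standard Monte Carlo estimator. In an empirical study on sentiment-controlled fine-tuning, we show that our estimator provides more stable KL estimates and reduces variance substantially in practice. Additionally, we derive an analogous Rao-Blackwellized estimator of the gradient of the KL divergence, which leads to more stable training and produces models that more frequently appear on the Pareto frontier of reward vs. KL compared to the ones trained with the MC estimator of the gradient.
Abstract (translated from Chinese): Estimating the Kullback-Leibler (KL) divergence between language models has many applications, for example reinforcement learning from human feedback (RLHF), interpretability, and knowledge distillation. However, computing the exact KL divergence between two arbitrary language models is intractable, so practitioners usually resort to sampling-based estimators. Although it is easy to build a simple Monte Carlo (MC) estimator that gives an unbiased estimate of the KL divergence between language models, this estimator is known to have high variance and can even produce a negative estimate of the KL divergence, which is a non-negative quantity. In this paper, we introduce a Rao-Blackwellized estimator that is also unbiased and provably has variance less than or equal to that of the standard Monte Carlo estimator. In an empirical study on sentiment-controlled fine-tuning, we show that our estimator yields more stable KL estimates and substantially reduces variance in practice. In addition, we derive an analogous Rao-Blackwellized estimator of the gradient of the KL divergence, which leads to more stable training and produces models that appear on the Pareto frontier of reward vs. KL more often than models trained with the MC estimator of the gradient.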
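
The contrast between the two estimators described above can be illustrated on a toy problem. The following is a minimal sketch, not the authors' implementation: it assumes two hypothetical bigram "language models" `P` and `Q` over a three-token vocabulary with fixed-length sequences, and it assumes that Rao-Blackwellization takes the common form of replacing each sampled log-ratio term with the exact KL divergence between the two next-token distributions given the sampled prefix. All names and numbers are illustrative.

```python
import itertools
import math
import random
import statistics

VOCAB = 3    # toy vocabulary size (assumption)
LENGTH = 4   # fixed sequence length (assumption)
START = 3    # row index used as the start-of-sequence context

# Hypothetical bigram models: row = previous token (row 3 = start),
# columns = next-token probabilities.
P = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4], [0.5, 0.3, 0.2]]
Q = [[0.5, 0.3, 0.2], [0.2, 0.6, 0.2], [0.3, 0.4, 0.3], [0.4, 0.4, 0.2]]


def step_kl(p_row, q_row):
    """Exact KL divergence between two next-token distributions."""
    return sum(p * math.log(p / q) for p, q in zip(p_row, q_row))


def sample_estimates(rng):
    """Sample one sequence from P; return (MC estimate, Rao-Blackwellized estimate)."""
    prev, mc, rb = START, 0.0, 0.0
    for _ in range(LENGTH):
        p_row, q_row = P[prev], Q[prev]
        # Rao-Blackwellized term: expectation over the next token, taken exactly.
        rb += step_kl(p_row, q_row)
        tok = rng.choices(range(VOCAB), weights=p_row)[0]
        # Plain Monte Carlo term: log-ratio of the single sampled token.
        mc += math.log(p_row[tok] / q_row[tok])
        prev = tok
    return mc, rb


def exact_kl():
    """Ground truth by enumerating every sequence (feasible only for toy sizes)."""
    total = 0.0
    for seq in itertools.product(range(VOCAB), repeat=LENGTH):
        prev, log_p, log_q = START, 0.0, 0.0
        for tok in seq:
            log_p += math.log(P[prev][tok])
            log_q += math.log(Q[prev][tok])
            prev = tok
        total += math.exp(log_p) * (log_p - log_q)
    return total


def run(n_samples=5000, seed=0):
    """Return per-sample MC and Rao-Blackwellized estimates."""
    rng = random.Random(seed)
    pairs = [sample_estimates(rng) for _ in range(n_samples)]
    return [m for m, _ in pairs], [r for _, r in pairs]
```

In this sketch every Rao-Blackwellized sample is a sum of exact per-step KL terms and is therefore non-negative, whereas individual Monte Carlo samples can be negative; comparing `statistics.variance` of the two lists returned by `run()` shows the variance gap on the toy models.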

Title: Weight-of-Thought Reasoning: Exploring Neural Network Weights for Enhanced LLM Reasoning

Authors: Saif Punjwani, Larry Heck
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.10646
Pdf URL: https://arxiv.org/pdf/2504.10646
Copy Paste: [[2504.10646]] Weight-of-Thought Reasoning: Exploring Neural Network Weights for Enhanced LLM Reasoning(https://arxiv.org/abs/2504.10646)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Large language models (LLMs) have demonstrated remarkable reasoning capabilities when prompted with strategies such as Chain-of-Thought (CoT). However, these approaches focus on token-level output without considering internal weight dynamics. We introduce Weight-of-Thought (WoT) reasoning, a novel approach that examines neural network weights before inference to identify reasoning pathways. Unlike existing methods, WoT explores the weight space through graph-based message passing, multi-step reasoning processes, and attention mechanisms. Our implementation creates an interconnected graph of reasoning nodes. Experiments on diverse reasoning tasks (syllogistic, mathematical, algebraic, combinatorial, and geometric) demonstrate that WoT achieves superior performance compared to traditional methods, particularly for complex problems. This approach leads to both improved performance and greater interpretability of the reasoning process, offering a promising direction for enhancing LLM reasoning capabilities.
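Abstract (translated from Chinese): Large language models (LLMs) have shown remarkable reasoning capabilities when prompted with strategies such as Chain-of-Thought (CoT). However, these approaches focus on token-level output without considering internal weight dynamics. We introduce Weight-of-Thought (WoT) reasoning, a novel approach that examines neural network weights before inference in order to identify reasoning pathways. Unlike existing methods, WoT explores the weight space through graph-based message passing, multi-step reasoning processes, and attention mechanisms. Our implementation creates an interconnected graph of reasoning nodes. Experiments on diverse reasoning tasks (syllogistic, mathematical, algebraic, combinatorial, and geometric) show that WoT outperforms traditional methods, particularly on complex problems. The approach improves both performance and the interpretability of the reasoning process, offering a promising direction for enhancing LLM reasoning capabilities.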

Title: Improving In-Context Learning with Reasoning Distillation

Authors: Nafis Sadeq, Xin Xu, Zhouhang Xie, Julian McAuley, Byungkyu Kang, Prarit Lamba, Xiang Gao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.10647
Pdf URL: https://arxiv.org/pdf/2504.10647
Copy Paste: [[2504.10647]] Improving In-Context Learning with Reasoning Distillation(https://arxiv.org/abs/2504.10647)
Keywords: language model, gpt, prompt
Abstract: Language models rely on semantic priors to perform in-context learning, which leads to poor performance on tasks involving inductive reasoning. Instruction-tuning methods based on imitation learning can superficially enhance the in-context learning performance of language models, but they often fail to improve the model's understanding of the underlying rules that connect inputs and outputs in few-shot demonstrations. We propose ReDis, a reasoning distillation technique designed to improve the inductive reasoning capabilities of language models. Through a careful combination of data augmentation, filtering, supervised fine-tuning, and alignment, ReDis achieves significant performance improvements across a diverse range of tasks, including 1D-ARC, List Function, ACRE, and MiniSCAN. Experiments on three language model backbones show that ReDis outperforms equivalent few-shot prompting baselines across all tasks and even surpasses the teacher model, GPT-4o, in some cases. ReDis, based on the LLaMA-3 backbone, achieves relative improvements of 23.2%, 2.8%, and 66.6% over GPT-4o on 1D-ARC, ACRE, and MiniSCAN, respectively, within a similar hypothesis search space. The code, dataset, and model checkpoints will be made available at this https URL.
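Abstract (translated from Chinese): Language models rely on semantic priors to perform in-context learning, which leads to poor performance on tasks involving inductive reasoning. Instruction-tuning methods based on imitation learning can superficially improve the in-context learning performance of language models, but they often fail to improve the model's understanding of the underlying rules that connect inputs and outputs in few-shot demonstrations. We propose ReDis, a reasoning distillation technique designed to improve the inductive reasoning capabilities of language models. Through a careful combination of data augmentation, filtering, supervised fine-tuning, and alignment, ReDis achieves significant performance improvements across a diverse range of tasks, including 1D-ARC, List Function, ACRE, and MiniSCAN. Experiments on three language model backbones show that ReDis outperforms equivalent few-shot prompting baselines on all tasks and in some cases even surpasses the teacher model, GPT-4o. Within a similar hypothesis search space, ReDis based on the LLaMA-3 backbone achieves relative improvements of 23.2%, 2.8%, and 66.6% over GPT-4o on 1D-ARC, ACRE, and MiniSCAN, respectively. The code, dataset, and model checkpoints will be made available at this https URL.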

Title: LITERA: An LLM Based Approach to Latin-to-English Translation

Authors: Paul Rosu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.10660
Pdf URL: https://arxiv.org/pdf/2504.10660
Copy Paste: [[2504.10660]] LITERA: An LLM Based Approach to Latin-to-English Translation(https://arxiv.org/abs/2504.10660)
Keywords: gpt, llm, prompt
Abstract: This paper introduces an LLM-based Latin-to-English translation platform designed to address the challenges of translating Latin texts. We named the model LITERA, which stands for Latin Interpretation and Translations into English for Research Assistance. Through a multi-layered translation process utilizing a fine-tuned version of GPT-4o-mini and GPT-4o, LITERA offers an unprecedented level of accuracy, showcased by greatly improved BLEU scores, particularly in classical Latin, along with improved BLEURT scores. The development of LITERA involved close collaboration with Duke University's Classical Studies Department, which was instrumental in creating a small, high-quality parallel Latin-English dataset. This paper details the architecture, fine-tuning methodology, and prompting strategies used in LITERA, emphasizing its ability to produce literal translations.
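Abstract (translated from Chinese): This paper introduces an LLM-based Latin-to-English translation platform designed to address the challenges of translating Latin texts. We named the model LITERA, which stands for Latin Interpretation and Translations into English for Research Assistance. Through a multi-layered translation process that uses fine-tuned versions of GPT-4o-mini and GPT-4o, LITERA offers an unprecedented level of accuracy, reflected in greatly improved BLEU scores, particularly for classical Latin, along with improved BLEURT scores. The development of LITERA involved close collaboration with Duke University's Classical Studies Department, which was instrumental in creating a small, high-quality parallel Latin-English dataset. This paper details the architecture, fine-tuning methodology, and prompting strategies used in LITERA, emphasizing its ability to produce literal translations.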

Title: Keyword Extraction, and Aspect Classification in Sinhala, English, and Code-Mixed Content

Authors: F.A. Rizvi, T. Navojith, A.M.N.H. Adhikari, W.P.U. Senevirathna, Dharshana Kasthurirathna, Lakmini Abeywardhana
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.10679
Pdf URL: https://arxiv.org/pdf/2504.10679
Copy Paste: [[2504.10679]] Keyword Extraction, and Aspect Classification in Sinhala, English, and Code-Mixed Content(https://arxiv.org/abs/2504.10679)
Keywords: gpt
Abstract: Brand reputation in the banking sector is maintained through insightful analysis of customer opinion on code-mixed and multilingual content. Conventional NLP models misclassify or ignore code-mixed text when mixed with low-resource languages such as Sinhala-English, and fail to capture domain-specific knowledge. This study introduces a hybrid NLP method to improve keyword extraction, content filtering, and aspect-based classification of banking content. Keyword extraction in English is performed with a hybrid approach comprising a fine-tuned SpaCy NER model, FinBERT-based KeyBERT embeddings, YAKE, and EmbedRank, which results in a combined accuracy of 91.2%. Code-mixed and Sinhala keywords are extracted using a fine-tuned XLM-RoBERTa model integrated with a domain-specific Sinhala financial vocabulary, and it results in an accuracy of 87.4%. To ensure data quality, irrelevant comment filtering was performed using several models, with the BERT-base-uncased model achieving 85.2% for English and XLM-RoBERTa 88.1% for Sinhala, which was better than GPT-4o, SVM, and keyword-based filtering. Aspect classification followed the same pattern, with the BERT-base-uncased model achieving 87.4% for English and XLM-RoBERTa 85.9% for Sinhala, both exceeding GPT-4 and keyword-based approaches. These findings confirm that fine-tuned transformer models outperform traditional methods in multilingual financial text analysis. The present framework offers an accurate and scalable solution for brand reputation monitoring in code-mixed and low-resource banking environments.
Abstract (translated from Chinese): Brand reputation in the banking sector is maintained through insightful analysis of customer opinion on code-mixed and multilingual content. Conventional NLP models misclassify or ignore code-mixed text when it is mixed with low-resource languages such as Sinhala-English, and they fail to capture domain-specific knowledge. This study introduces a hybrid NLP method that improves keyword extraction, content filtering, and aspect-based classification of banking content. Keyword extraction in English uses a hybrid approach comprising a fine-tuned SpaCy NER model, FinBERT-based KeyBERT embeddings, YAKE, and EmbedRank, giving a combined accuracy of 91.2%. Code-mixed and Sinhala keywords are extracted with a fine-tuned XLM-RoBERTa model integrated with a domain-specific Sinhala financial vocabulary, giving an accuracy of 87.4%. To ensure data quality, irrelevant comments were filtered using several models; the BERT-base-uncased model achieved 85.2% for English and XLM-RoBERTa 88.1% for Sinhala, better than GPT-4o, SVM, and keyword-based filtering. Aspect classification followed the same pattern, with the BERT-base-uncased model achieving 87.4% for English and XLM-RoBERTa 85.9% for Sinhala, both exceeding GPT-4 and keyword-based approaches. These findings confirm that fine-tuned transformer models outperform traditional methods in multilingual financial text analysis. The present framework offers an accurate and scalable solution for brand reputation monitoring in code-mixed and low-resource banking environments.

Title: EMAFusion: A Self-Optimizing System for Seamless LLM Selection and Integration

Authors: Soham Shah, Kumar Shridhar, Surojit Chatterjee, Souvik Sen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.10681
Pdf URL: https://arxiv.org/pdf/2504.10681
Copy Paste: [[2504.10681]] EMAFusion: A Self-Optimizing System for Seamless LLM Selection and Integration(https://arxiv.org/abs/2504.10681)
Keywords: language model, gpt, llm
Abstract: While recent advances in large language models (LLMs) have significantly enhanced performance across diverse natural language tasks, the high computational and financial costs associated with their deployment remain substantial barriers. Existing routing strategies partially alleviate this challenge by assigning queries to cheaper or specialized models, but they frequently rely on extensive labeled data or fragile task-specific heuristics. Conversely, fusion techniques aggregate multiple LLM outputs to boost accuracy and robustness, yet they often exacerbate cost and may reinforce shared biases. We introduce EMAFusion, a new framework that self-optimizes for seamless LLM selection and reliable execution for a given query. Specifically, EMAFusion integrates a taxonomy-based router for familiar query types, a learned router for ambiguous inputs, and a cascading approach that progressively escalates from cheaper to more expensive models based on multi-judge confidence evaluations. Through extensive evaluations, we find EMAFusion outperforms the best individual models by over 2.6 percentage points (94.3% vs. 91.7%), while being 4X cheaper than the average cost. EMAFusion further achieves a remarkable 17.1 percentage point improvement over models like GPT-4 at less than 1/20th the cost. Our combined routing approach delivers 94.3% accuracy compared to taxonomy-based (88.1%) and learned model predictor-based (91.7%) methods alone, demonstrating the effectiveness of our unified strategy. Finally, EMAFusion supports flexible cost-accuracy trade-offs, allowing users to balance their budgetary constraints and performance needs.
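Abstract (translated from Chinese): Although recent advances in large language models (LLMs) have greatly improved performance across diverse natural language tasks, the high computational and financial costs of deploying them remain substantial barriers. Existing routing strategies partly ease this challenge by assigning queries to cheaper or specialized models, but they often rely on extensive labeled data or fragile task-specific heuristics. Fusion techniques, by contrast, aggregate multiple LLM outputs to improve accuracy and robustness, yet they often increase cost and may reinforce shared biases. We introduce EMAFusion, a new framework that self-optimizes for seamless LLM selection and reliable execution for a given query. Specifically, EMAFusion integrates a taxonomy-based router for familiar query types, a learned router for ambiguous inputs, and a cascading approach that progressively escalates from cheaper to more expensive models based on multi-judge confidence evaluations. Through extensive evaluations, we find that EMAFusion outperforms the best individual models by over 2.6 percentage points (94.3% vs. 91.7%) while being 4X cheaper than the average cost. EMAFusion further achieves a remarkable 17.1 percentage point improvement over models like GPT-4 at less than 1/20th of the cost. Our combined routing approach delivers 94.3% accuracy, compared with taxonomy-based (88.1%) and learned model predictor-based (91.7%) methods alone, demonstrating the effectiveness of our unified strategy. Finally, EMAFusion supports flexible cost-accuracy trade-offs, allowing users to balance their budgetary constraints and performance needs.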
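
The cascading step described above (escalating from cheaper to more expensive models until several judges are confident) can be sketched as follows. This is an illustrative sketch only, not EMAFusion's implementation: the `Model` class, the `cascade` function, the averaging of judge scores, and the `threshold` value are all assumptions introduced here, and the taxonomy-based and learned routers are omitted.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, List, Sequence, Tuple


@dataclass
class Model:
    """Hypothetical wrapper around an LLM endpoint."""
    name: str
    cost: float                      # cost per call, in arbitrary units
    answer: Callable[[str], str]     # maps a query to a reply


# A judge scores a (query, reply) pair with a confidence in [0, 1].
Judge = Callable[[str, str], float]


def cascade(
    query: str,
    models: Sequence[Model],
    judges: Sequence[Judge],
    threshold: float = 0.8,
) -> Tuple[str, str, float]:
    """Try models from cheapest to most expensive; stop once the mean
    judge confidence reaches the threshold. Returns (model name, reply,
    total cost spent). Falls back to the most expensive model's reply."""
    if not models or not judges:
        raise ValueError("need at least one model and one judge")
    ordered: List[Model] = sorted(models, key=lambda m: m.cost)
    spent = 0.0
    reply = ""
    for model in ordered:
        reply = model.answer(query)
        spent += model.cost
        confidence = mean([judge(query, reply) for judge in judges])
        if confidence >= threshold:
            return model.name, reply, spent
    return ordered[-1].name, reply, spent
```

Raising `threshold` makes the sketch escalate more often (higher cost, potentially higher accuracy), which is one simple way to expose the kind of cost-accuracy trade-off the abstract mentions.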

Title: HELIOS: Adaptive Model And Early-Exit Selection for Efficient LLM Inference Serving

Authors: Avinash Kumar, Shashank Nag, Jason Clemons, Lizy John, Poulami Das
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2504.10724
Pdf URL: https://arxiv.org/pdf/2504.10724
Copy Paste: [[2504.10724]] HELIOS: Adaptive Model And Early-Exit Selection for Efficient LLM Inference Serving(https://arxiv.org/abs/2504.10724)
Keywords: language model, llm, prompt
Abstract: Deploying large language models (LLMs) presents critical challenges due to the inherent trade-offs associated with key performance metrics, such as latency, accuracy, and throughput. Typically, gains in one metric are accompanied by degradation in others. Early-Exit LLMs (EE-LLMs) efficiently navigate this trade-off space by skipping some of the later model layers when they confidently find an output token early, thus reducing latency without impacting accuracy. However, as the early exits taken depend on the task and are unknown a priori to request processing, EE-LLMs conservatively load the entire model, limiting resource savings and throughput. Also, current frameworks statically select a model for a user task, limiting our ability to adapt to the changing nature of the input queries. We propose HELIOS to address these challenges. First, HELIOS shortlists a set of candidate LLMs and evaluates them using a subset of prompts, gathering telemetry data in real time. Second, HELIOS uses the early exit data from these evaluations to greedily load the selected model only up to a limited number of layers. This approach yields memory savings, which enable us to process more requests at the same time, thereby improving throughput. Third, HELIOS monitors and periodically reassesses the performance of the candidate LLMs and, if needed, switches to another model that can service incoming queries more efficiently (such as using fewer layers without lowering accuracy). Our evaluations show that HELIOS achieves 1.48× throughput, 1.10× energy-efficiency, 1.39× lower response time, and 3.7× improvements in inference batch sizes compared to the baseline, when optimizing for the respective service level objectives.
Abstract (translated from Chinese): Deploying large language models (LLMs) poses critical challenges because of the inherent trade-offs among key performance metrics such as latency, accuracy, and throughput. Typically, gains in one metric come with degradation in others. Early-Exit LLMs (EE-LLMs) navigate this trade-off space efficiently by skipping some of the later model layers when they confidently find an output token early, which reduces latency without affecting accuracy. However, because the early exits taken depend on the task and are not known before a request is processed, EE-LLMs conservatively load the entire model, limiting resource savings and throughput. In addition, current frameworks statically select a model for a user task, which limits the ability to adapt to the changing nature of input queries. We propose HELIOS to address these challenges. First, HELIOS shortlists a set of candidate LLMs and evaluates them on a subset of prompts, gathering telemetry data in real time. Second, HELIOS uses the early-exit data from these evaluations to greedily load the selected model only up to a limited number of layers. This saves memory, allowing more requests to be processed at the same time and thereby improving throughput. Third, HELIOS monitors and periodically reassesses the performance of the candidate LLMs and, if needed, switches to another model that can serve incoming queries more efficiently (for example, by using fewer layers without lowering accuracy). Our evaluations show that HELIOS achieves 1.48× throughput, 1.10× energy-efficiency, 1.39× lower response time, and 3.7× improvements in inference batch sizes compared to the baseline, when optimizing for the respective service level objectives.

Title: The Art of Audience Engagement: LLM-Based Thin-Slicing of Scientific Talks

Authors: Ralf Schmälzle, Sue Lim, Yuetong Du, Gary Bente
Subjects: cs.CL, cs.AI, cs.ET, cs.HC
Abstract URL: https://arxiv.org/abs/2504.10768
Pdf URL: https://arxiv.org/pdf/2504.10768
Copy Paste: [[2504.10768]] The Art of Audience Engagement: LLM-Based Thin-Slicing of Scientific Talks(https://arxiv.org/abs/2504.10768)
Keywords: language model, llm, prompt
Abstract: This paper examines the thin-slicing approach - the ability to make accurate judgments based on minimal information - in the context of scientific presentations. Drawing on research from nonverbal communication and personality psychology, we show that brief excerpts (thin slices) reliably predict overall presentation quality. Using a novel corpus of over one hundred real-life science talks, we employ Large Language Models (LLMs) to evaluate transcripts of full presentations and their thin slices. By correlating LLM-based evaluations of short excerpts with full-talk assessments, we determine how much information is needed for accurate predictions. Our results demonstrate that LLM-based evaluations align closely with human ratings, proving their validity, reliability, and efficiency. Critically, even very short excerpts (less than 10 percent of a talk) strongly predict overall evaluations. This suggests that the first moments of a presentation convey relevant information that is used in quality evaluations and can shape lasting impressions. The findings are robust across different LLMs and prompting strategies. This work extends thin-slicing research to public speaking and connects theories of impression formation to LLMs and current research on AI communication. We discuss implications for communication and social cognition research on message reception. Lastly, we suggest an LLM-based thin-slicing framework as a scalable feedback tool to enhance human communication.
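Abstract (translated from Chinese): This paper examines the thin-slicing approach, the ability to make accurate judgments based on minimal information, in the context of scientific presentations. Drawing on research from nonverbal communication and personality psychology, we show that brief excerpts (thin slices) reliably predict overall presentation quality. Using a novel corpus of over one hundred real-life science talks, we employ large language models (LLMs) to evaluate transcripts of full presentations and of their thin slices. By correlating LLM-based evaluations of short excerpts with full-talk assessments, we determine how much information is needed for accurate predictions. Our results show that LLM-based evaluations align closely with human ratings, demonstrating their validity, reliability, and efficiency. Critically, even very short excerpts (less than 10 percent of a talk) strongly predict overall evaluations. This suggests that the first moments of a presentation convey relevant information that is used in quality evaluations and can shape lasting impressions. The findings are robust across different LLMs and prompting strategies. This work extends thin-slicing research to public speaking and connects theories of impression formation to LLMs and current research on AI communication. We discuss implications for communication and social cognition research on message reception. Lastly, we suggest an LLM-based thin-slicing framework as a scalable feedback tool to enhance human communication.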

Title: GUM-SAGE: A Novel Dataset and Approach for Graded Entity Salience Prediction

Authors: Jessica Lin, Amir Zeldes
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.10792
Pdf URL: https://arxiv.org/pdf/2504.10792
Copy Paste: [[2504.10792]] GUM-SAGE: A Novel Dataset and Approach for Graded Entity Salience Prediction(https://arxiv.org/abs/2504.10792)
Keywords: llm
Abstract: Determining and ranking the most salient entities in a text is critical for user-facing systems, especially as users increasingly rely on models to interpret long documents they only partially read. Graded entity salience addresses this need by assigning entities scores that reflect their relative importance in a text. Existing approaches fall into two main categories: subjective judgments of salience, which allow for gradient scoring but lack consistency, and summarization-based methods, which define salience as mention-worthiness in a summary, promoting explainability but limiting outputs to binary labels (entities are either summary-worthy or not). In this paper, we introduce a novel approach for graded entity salience that combines the strengths of both approaches. Using an English dataset spanning 12 spoken and written genres, we collect 5 summaries per document and calculate each entity's salience score based on its presence across these summaries. Our approach shows stronger correlation with scores based on human summaries and alignments, and outperforms existing techniques, including LLMs. We release our data and code at this https URL to support further research on graded salient entity extraction.
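Abstract (translated from Chinese): Determining and ranking the most salient entities in a text is critical for user-facing systems, especially as users increasingly rely on models to interpret long documents that they read only in part. Graded entity salience addresses this need by assigning entities scores that reflect their relative importance in a text. Existing approaches fall into two main categories: subjective judgments of salience, which allow gradient scoring but lack consistency, and summarization-based methods, which define salience as mention-worthiness in a summary, promoting explainability but limiting outputs to binary labels (an entity is either summary-worthy or not). In this paper, we introduce a novel approach to graded entity salience that combines the strengths of both. Using an English dataset spanning 12 spoken and written genres, we collect 5 summaries per document and calculate each entity's salience score based on its presence across these summaries. Our approach correlates more strongly with scores based on human summaries and alignments, and it outperforms existing techniques, including LLMs. We release our data and code at this https URL to support further research on graded salient entity extraction.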
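
The idea of scoring an entity by its presence across several summaries can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the GUM-SAGE pipeline: the function name `salience_scores` is hypothetical, the score is assumed to be the number of summaries that mention the entity, and mentions are detected by naive case-insensitive substring matching rather than by aligning entity mentions.

```python
from typing import Dict, Sequence


def salience_scores(entities: Sequence[str], summaries: Sequence[str]) -> Dict[str, int]:
    """Score each entity by the number of summaries that mention it.

    With 5 summaries per document, scores range from 0 (never mentioned)
    to 5 (mentioned in every summary). Matching is naive substring search,
    which is an assumption made for illustration.
    """
    lowered = [summary.lower() for summary in summaries]
    return {
        entity: sum(entity.lower() in summary for summary in lowered)
        for entity in entities
    }


# Hypothetical example document with five summaries.
example_summaries = [
    "Alice met Bob in Paris.",
    "Alice travelled to Paris.",
    "Alice gave a talk.",
    "A talk was given by Alice.",
    "The trip ended in Paris.",
]
example_scores = salience_scores(["Alice", "Paris", "Bob", "Berlin"], example_summaries)
```

Counting summaries rather than returning a yes/no label is what turns a binary summary-worthiness signal into a graded one in this sketch.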

Title: Name of Thrones: Evaluating How LLMs Rank Student Names, Race, and Gender in Status Hierarchies

Authors: Annabella Sakunkoo, Jonathan Sakunkoo
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2504.10797
Pdf URL: https://arxiv.org/pdf/2504.10797
Copy Paste: [[2504.10797]] Name of Thrones: Evaluating How LLMs Rank Student Names, Race, and Gender in Status Hierarchies(https://arxiv.org/abs/2504.10797)
Keywords: llm
Abstract: Across cultures, names tell a lot about their bearers as they carry deep personal and cultural significance. Names also serve as powerful signals of gender, race, and status in the social hierarchy - a pecking order in which individual positions shape others' expectations on their perceived competence and worth. With the widespread adoption of LLMs and as names are often an input for LLMs, it is crucial to evaluate whether LLMs may sort people into status positions based on first and last names and, if so, whether it is in an unfair, biased fashion. While prior work has primarily investigated biases in first names, little attention has been paid to last names and even less to the combined effects of first and last names. In this study, we conduct a large-scale analysis of name variations across 5 ethnicities to examine how AI exhibits name biases. Our study investigates three key characteristics of inequality and finds that LLMs reflect and reinforce status hierarchies based on names that signal gender and ethnicity as they encode differential expectations of competence, leadership, and economic potential. Contrary to the common assumption that AI tends to favor Whites, we show that East and, in some contexts, South Asian names receive higher rankings. We also disaggregate Asians, a population projected to be the largest immigrant group in the U.S. by 2055. Our results challenge the monolithic Asian model minority assumption, illustrating a more complex and stratified model of bias. Gender moderates biases, with girls facing unfair disadvantages in certain racial groups. Additionally, spanning cultural categories by adopting Western first names improves AI-perceived status for East and Southeast Asian students, particularly for girls. Our findings underscore the importance of intersectional and more nuanced understandings of race, gender, and mixed identities in the evaluation of LLMs.
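Abstract (translated from Chinese): Across cultures, names tell a lot about their bearers, as they carry deep personal and cultural significance. Names also serve as powerful signals of gender, race, and status in the social hierarchy, a pecking order in which individual positions shape others' expectations of a person's perceived competence and worth. With the widespread adoption of LLMs, and because names are often an input to LLMs, it is crucial to evaluate whether LLMs sort people into status positions based on first and last names and, if so, whether they do so in an unfair, biased fashion. While prior work has mainly investigated biases in first names, little attention has been paid to last names, and even less to the combined effects of first and last names. In this study, we conduct a large-scale analysis of name variations across 5 ethnicities to examine how AI exhibits name biases. Our study investigates three key characteristics of inequality and finds that LLMs reflect and reinforce status hierarchies based on names that signal gender and ethnicity, as they encode differential expectations of competence, leadership, and economic potential. Contrary to the common assumption that AI tends to favor Whites, we show that East Asian and, in some contexts, South Asian names receive higher rankings. We also disaggregate Asians, a population projected to be the largest immigrant group in the U.S. by 2055. Our results challenge the monolithic Asian model minority assumption and illustrate a more complex and stratified model of bias. Gender moderates biases, with girls facing unfair disadvantages in certain racial groups. Additionally, spanning cultural categories by adopting Western first names improves AI-perceived status for East and Southeast Asian students, particularly for girls. Our findings underscore the importance of intersectional and more nuanced understandings of race, gender, and mixed identities in the evaluation of LLMs.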

Title: CLASH: Evaluating Language Models on Judging High-Stakes Dilemmas from Multiple Perspectives

Authors: Ayoung Lee, Ryan Sungmo Kwon, Peter Railton, Lu Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.10823
Pdf URL: https://arxiv.org/pdf/2504.10823
Copy Paste: [[2504.10823]] CLASH: Evaluating Language Models on Judging High-Stakes Dilemmas from Multiple Perspectives(https://arxiv.org/abs/2504.10823)
Keywords: language model, gpt, llm
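Abstract: Navigating high-stakes dilemmas involving conflicting values is challenging even for humans, let alone for AI. Yet prior work in evaluating the reasoning capabilities of large language models (LLMs) in such situations has been limited to everyday scenarios. To close this gap, this work first introduces CLASH (Character perspective-based LLM Assessments in Situations with High-stakes), a meticulously curated dataset consisting of 345 high-impact dilemmas along with 3,795 individual perspectives of diverse values. In particular, we design CLASH in a way to support the study of critical aspects of value-based decision-making processes which are missing from prior work, including understanding decision ambivalence and psychological discomfort as well as capturing the temporal shifts of values in characters' perspectives. By benchmarking 10 open and closed frontier models, we uncover several key findings. (1) Even the strongest models, such as GPT-4o and Claude-Sonnet, achieve less than 50% accuracy in identifying situations where the decision should be ambivalent, while they perform significantly better in clear-cut scenarios. (2) While LLMs reasonably predict psychological discomfort as marked by humans, they inadequately comprehend perspectives involving value shifts, indicating a need for LLMs to reason over complex values. (3) Our experiments also reveal a significant correlation between LLMs' value preferences and their steerability towards a given value. (4) Finally, LLMs exhibit greater steerability when engaged in value reasoning from a third-party perspective, compared to a first-person setup, though certain value pairs benefit uniquely from the first-person framing.
Abstract (translated from Chinese): Navigating high-stakes dilemmas that involve conflicting values is challenging even for humans, let alone for AI. Yet prior work on evaluating the reasoning capabilities of large language models (LLMs) in such situations has been limited to everyday scenarios. To close this gap, this work first introduces CLASH (Character perspective-based LLM Assessments in Situations with High-stakes), a meticulously curated dataset consisting of 345 high-impact dilemmas along with 3,795 individual perspectives of diverse values. In particular, we design CLASH to support the study of critical aspects of value-based decision-making that are missing from prior work, including understanding decision ambivalence and psychological discomfort and capturing temporal shifts of values in characters' perspectives. By benchmarking 10 open and closed frontier models, we uncover several key findings. (1) Even the strongest models, such as GPT-4o and Claude-Sonnet, achieve less than 50% accuracy in identifying situations where the decision should be ambivalent, while they perform significantly better in clear-cut scenarios. (2) While LLMs reasonably predict psychological discomfort as marked by humans, they inadequately comprehend perspectives involving value shifts, indicating a need for LLMs to reason over complex values. (3) Our experiments also reveal a significant correlation between LLMs' value preferences and their steerability towards a given value. (4) Finally, LLMs exhibit greater steerability when engaged in value reasoning from a third-party perspective than in a first-person setup, though certain value pairs benefit uniquely from first-person framing.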

Title: Moving Beyond Next-Token Prediction: Transformers are Context-Sensitive Language Generators

Authors: Phill Kyu Rhee
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.10845
Pdf URL: https://arxiv.org/pdf/2504.10845
Copy Paste: [[2504.10845]] Moving Beyond Next-Token Prediction: Transformers are Context-Sensitive Language Generators(https://arxiv.org/abs/2504.10845)
Keywords: language model, llm
Abstract: Large Language Models (LLMs), powered by Transformers, have demonstrated human-like intelligence capabilities, yet their underlying mechanisms remain poorly understood. This paper presents a novel framework for interpreting LLMs as probabilistic left context-sensitive languages (CSLs) generators. We hypothesize that Transformers can be effectively decomposed into three fundamental components: context windows, attention mechanisms, and autoregressive generation frameworks. This decomposition allows for the development of more flexible and interpretable computational models, moving beyond the traditional view of attention and autoregression as inseparable processes. We argue that next-token predictions can be understood as probabilistic, dynamic approximations of left CSL production rules, providing an intuitive explanation for how simple token predictions can yield human-like intelligence outputs. Given that all CSLs are left context-sensitive (Penttonen, 1974), we conclude that Transformers stochastically approximate CSLs, which are widely recognized as models of human-like intelligence. This interpretation bridges the gap between Formal Language Theory and the observed generative power of Transformers, laying a foundation for future advancements in generative AI theory and applications. Our novel perspective on Transformer architectures will foster a deeper understanding of LLMs and their future potentials.
摘要：由变形金刚提供支持的大型语言模型（LLM）表现出了类似人类的智力能力，但其潜在机制仍然很少了解。本文提出了一个新的框架，用于将LLM解释为概率的左上背景敏感语言（CSL）发电机。我们假设变压器可以有效地分解为三个基本组成部分：上下文窗口，注意机制和自回旋生成框架。这种分解允许开发更灵活，更可解释的计算模型，超越了作为不可分割的过程的传统关注和自动追溯观点。我们认为，下一步的预测可以理解为左CSL生产规则的概率，动态近似，为简单的标记预测如何产生类似人类的智能输出提供了直观的解释。鉴于所有CSL均为上下文敏感（Penttonen，1974年），我们得出结论，变形金刚随机近似CSL，被广泛认为是类似人类智力的模型。这种解释弥合了形式语言理论与观察到的变形金刚的生成力量之间的差距，为生成AI理论和应用中的未来进步奠定了基础。我们对变压器体系结构的小说观点将增强对LLM及其未来潜力的深入了解。

Title: Ai2 Scholar QA: Organized Literature Synthesis with Attribution

Authors: Amanpreet Singh, Joseph Chee Chang, Chloe Anastasiades, Dany Haddad, Aakanksha Naik, Amber Tanaka, Angele Zamarron, Cecile Nguyen, Jena D. Hwang, Jason Dunkleberger, Matt Latzke, Smita Rao, Jaron Lochner, Rob Evans, Rodney Kinney, Daniel S. Weld, Doug Downey, Sergey Feldman
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.10861
Pdf URL: https://arxiv.org/pdf/2504.10861
Copy Paste: [[2504.10861]] Ai2 Scholar QA: Organized Literature Synthesis with Attribution(https://arxiv.org/abs/2504.10861)
Keywords: retrieval-augmented generation
Abstract: Retrieval-augmented generation is increasingly effective in answering scientific questions from literature, but many state-of-the-art systems are expensive and closed-source. We introduce Ai2 Scholar QA, a free online scientific question answering application. To facilitate research, we make our entire pipeline public: as a customizable open-source Python package and interactive web app, along with paper indexes accessible through public APIs and downloadable datasets. We describe our system in detail and present experiments analyzing its key design decisions. In an evaluation on a recent scientific QA benchmark, we find that Ai2 Scholar QA outperforms competing systems.
摘要：检索声明的一代越来越有效地回答文献中的科学问题，但是许多最先进的系统都是昂贵且封闭的。我们介绍了AI2学者QA，这是一个免费的在线科学问题回答应用程序。为了促进研究，我们将整个管道公开：作为可自定义的开源Python软件包和Interactive Web应用程序，以及通过公共API和可下载数据集访问的纸质索引。我们详细描述了我们的系统，并目前进行了分析其关键设计决策的实验。在对最近的科学质量检查基准测试的评估中，我们发现AI2学者QA的表现优于竞争系统。

Title: Efficient Reasoning Models: A Survey

Authors: Sicheng Feng, Gongfan Fang, Xinyin Ma, Xinchao Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.10903
Pdf URL: https://arxiv.org/pdf/2504.10903
Copy Paste: [[2504.10903]] Efficient Reasoning Models: A Survey(https://arxiv.org/abs/2504.10903)
Keywords: language model, chain-of-thought
Abstract: Reasoning models have demonstrated remarkable progress in solving complex and logic-intensive tasks by generating extended Chain-of-Thoughts (CoTs) prior to arriving at a final answer. Yet, the emergence of this "slow-thinking" paradigm, with numerous tokens generated in sequence, inevitably introduces substantial computational overhead. To this end, it highlights an urgent need for effective acceleration. This survey aims to provide a comprehensive overview of recent advances in efficient reasoning. It categorizes existing works into three key directions: (1) shorter - compressing lengthy CoTs into concise yet effective reasoning chains; (2) smaller - developing compact language models with strong reasoning capabilities through techniques such as knowledge distillation, other model compression techniques, and reinforcement learning; and (3) faster - designing efficient decoding strategies to accelerate inference. A curated collection of papers discussed in this survey is available in our GitHub repository.
摘要：推理模型通过在获得最终答案之前生成扩展的思想链（COTS）来证明在解决复杂和逻辑密集型任务方面取得了显着进展。然而，这种“缓慢思考”的范式的出现，序列产生了许多令牌，不可避免地引入了实质性的计算开销。为此，它突出了迫切需要有效加速。这项调查旨在全面概述有效推理的最新进展。它将现有作品分为三个关键方向：（1）较短 - 将冗长的婴儿床压缩成简洁而有效的推理链；（2）通过知识蒸馏，其他模型压缩技术和强化学习等技术，开发具有强大推理能力的紧凑型语言模型；（3）更快 - 设计有效的解码策略以加速推理。我们的GitHub存储库中提供了一系列策划的论文集合。

Title: Understanding LLMs' Cross-Lingual Context Retrieval: How Good It Is And Where It Comes From

Authors: Changjiang Gao, Hankun Lin, Shujian Huang, Xin Huang, Xue Han, Junlan Feng, Chao Deng, Jiajun Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.10906
Pdf URL: https://arxiv.org/pdf/2504.10906
Copy Paste: [[2504.10906]] Understanding LLMs' Cross-Lingual Context Retrieval: How Good It Is And Where It Comes From(https://arxiv.org/abs/2504.10906)
Keywords: language model, gpt, llm
Abstract: The ability of cross-lingual context retrieval is a fundamental aspect of cross-lingual alignment of large language models (LLMs), where the model extracts context information in one language based on requests in another language. Despite its importance in real-life applications, this ability has not been adequately investigated for state-of-the-art models. In this paper, we evaluate the cross-lingual context retrieval ability of over 40 LLMs across 12 languages to understand the source of this ability, using cross-lingual machine reading comprehension (xMRC) as a representative scenario. Our results show that several small, post-trained open LLMs show strong cross-lingual context retrieval ability, comparable to closed-source LLMs such as GPT-4o, and their estimated oracle performances greatly improve after post-training. Our interpretability analysis shows that the cross-lingual context retrieval process can be divided into two main phases: question encoding and answer retrieval, which are formed in pre-training and post-training, respectively. The phasing stability correlates with xMRC performance, and the xMRC bottleneck lies at the last model layers in the second phase, where the effect of post-training can be evidently observed. Our results also indicate that larger-scale pretraining cannot improve the xMRC performance. Instead, larger LLMs need further multilingual post-training to fully unlock their cross-lingual context retrieval potential. Our code is available at this https URL
摘要：跨语言上下文检索的能力是大语言模型（LLMS）的跨语言对齐的一个基本方面，其中模型根据另一种语言的请求以一种语言提取上下文信息。尽管它在现实生活中的重要性很重要，但这种能力尚未针对最先进的模型进行充分研究。在本文中，我们使用跨语言机器阅读理解理解（XMRC）作为代表性的方案，评估了12种语言中40多种LLM的跨语性上下文检索能力。我们的结果表明，几个小的，训练后的开放式LLM显示出强大的跨语性上下文检索能力，可与封闭源LLM相当，例如GPT-4O，估计的Oracle表现在训练后训练后大大提高。我们的可解释性分析表明，跨语言上下文检索过程可以分为两个主要阶段：问题编码和回答检索，分别在训练前和训练后形成。相位稳定性与XMRC性能相关，XMRC瓶颈位于第二阶段的最后一个模型层，显然可以观察到后训练的效果。我们的结果还表明，大规模预处理无法提高XMRC性能。取而代之的是，较大的LLM需要进一步的多语言后培训，以完全解锁其跨语性上下文检索潜力。我们的代码，可在此HTTPS URL上找到

Title: Exploring the Role of KG-Based RAG in Japanese Medical Question Answering with Small-Scale LLMs

Authors: Yingjian Chen, Feiyang Li, Xingyu Song, Tianxiao Li, Issey Sudeka, Irene Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.10982
Pdf URL: https://arxiv.org/pdf/2504.10982
Copy Paste: [[2504.10982]] Exploring the Role of KG-Based RAG in Japanese Medical Question Answering with Small-Scale LLMs(https://arxiv.org/abs/2504.10982)
Keywords: language model, gpt, llm, retrieval-augmented generation
Abstract: Large language models (LLMs) perform well in medical QA, but their effectiveness in Japanese contexts is limited due to privacy constraints that prevent the use of commercial models like GPT-4 in clinical settings. As a result, recent efforts focus on instruction-tuning open-source LLMs, though the potential of combining them with retrieval-augmented generation (RAG) remains underexplored. To bridge this gap, we are the first to explore a knowledge graph-based (KG) RAG framework for Japanese medical QA with small-scale open-source LLMs. Experimental results show that KG-based RAG has only a limited impact on Japanese medical QA using small-scale open-source LLMs. Further case studies reveal that the effectiveness of the RAG is sensitive to the quality and relevance of the external retrieved content. These findings offer valuable insights into the challenges and potential of applying RAG in Japanese medical QA, while also serving as a reference for other low-resource languages.
摘要：大型语言模型（LLMS）在医疗质量检查中表现良好，但是由于隐私的限制，它们在日本环境中的有效性受到限制，从而阻止了在临床环境中使用诸如GPT-4之类的商业模型。结果，最近的努力着重于调整开源LLM的指导，尽管将它们与检索型发电（RAG）相结合的潜力尚未得到充满意。为了弥合这一差距，我们是第一个探索日本医疗质量检查QA小型开源LLM的知识图（kg）抹布框架的人。实验结果表明，基于KG的抹布使用小规模的开源LLM对日本医疗质量质量质量质量质量检查的影响有限。进一步的案例研究表明，抹布的有效性对外部检索内容的质量和相关性敏感。这些发现为在日本医疗质量保证中应用抹布的挑战和潜力提供了宝贵的见解，同时也可以参考其他低资源语言。

Title: ReZero: Enhancing LLM search ability by trying one-more-time

Authors: Alan Dao (Gia Tuan Dao), Thinh Le
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.11001
Pdf URL: https://arxiv.org/pdf/2504.11001
Copy Paste: [[2504.11001]] ReZero: Enhancing LLM search ability by trying one-more-time(https://arxiv.org/abs/2504.11001)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) improves Large Language Model (LLM) performance on knowledge-intensive tasks but depends heavily on initial search query quality. Current methods, often using Reinforcement Learning (RL), typically focus on query formulation or reasoning over results, without explicitly encouraging persistence after a failed search. We introduce ReZero (Retry-Zero), a novel RL framework that directly rewards the act of retrying a search query following an initial unsuccessful attempt. This incentivizes the LLM to explore alternative queries rather than prematurely halting. ReZero demonstrates significant improvement, achieving 46.88% accuracy compared to a 25% baseline. By rewarding persistence, ReZero enhances LLM robustness in complex information-seeking scenarios where initial queries may prove insufficient.
摘要：检索增强的生成（RAG）改善了知识密集型任务上的大语言模型（LLM）性能，但在很大程度上取决于初始搜索查询质量。当前的方法通常使用加固学习（RL），通常集中于查询公式或推理而不是结果，而不会在搜索失败后明确鼓励持久性。我们介绍了Rezero（Retry-Zero），这是一个新颖的RL框架，在初始失败尝试后，直接奖励重试搜索查询的行为。这激励LLM探索替代查询，而不是过早停止。 Rezero表现出显着改善，与基线相比，获得了46.88％的精度。通过奖励持久性，Rezero在复杂的信息寻求信息方案中增强了LLM的鲁棒性，而初始查询可能不足。
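
Note: the abstract above describes rewarding the act of retrying a search after a failed first attempt, but does not give the reward's exact form. The following is a minimal Python sketch of one way such a signal could be shaped, not the authors' implementation. The `Step` schema, the `retry_bonus` weight, and the choice to pay the bonus only when the final answer is correct are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """One action in a rollout (hypothetical schema, not taken from the paper)."""
    kind: str                # "search" or "answer"
    succeeded: bool = False  # for a search: did it return useful results?


def retry_reward(steps: List[Step], answer_correct: bool,
                 retry_bonus: float = 0.5) -> float:
    """Toy reward: 1.0 for a correct answer plus a bonus per retry.

    A "retry" is any search issued after the most recent search failed.
    Assumption of this sketch: the bonus is paid only when the final
    answer is correct, so retrying is not rewarded for its own sake.
    """
    if not answer_correct:
        return 0.0
    retries = 0
    previous_search_failed = False
    for step in steps:
        if step.kind != "search":
            continue
        if previous_search_failed:
            retries += 1
        previous_search_failed = not step.succeeded
    return 1.0 + retry_bonus * retries


# A rollout whose first search fails and whose second search succeeds.
rollout = [Step("search", succeeded=False),
           Step("search", succeeded=True),
           Step("answer")]
print(retry_reward(rollout, answer_correct=True))
```

In this toy version, a rollout that answers correctly after one retry scores higher than one that answers correctly on the first search, which is the kind of persistence incentive the abstract describes.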

Title: Dynamic Compressing Prompts for Efficient Inference of Large Language Models

Authors: Jinwu Hu, Wei Zhang, Yufeng Wang, Yu Hu, Bin Xiao, Mingkui Tan, Qing Du
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.11004
Pdf URL: https://arxiv.org/pdf/2504.11004
Copy Paste: [[2504.11004]] Dynamic Compressing Prompts for Efficient Inference of Large Language Models(https://arxiv.org/abs/2504.11004)
Keywords: language model, llm, prompt, agent
Abstract: Large Language Models (LLMs) have shown outstanding performance across a variety of tasks, partly due to advanced prompting techniques. However, these techniques often require lengthy prompts, which increase computational costs and can hinder performance because of the limited context windows of LLMs. While prompt compression is a straightforward solution, existing methods confront the challenges of retaining essential information, adapting to context changes, and remaining effective across different tasks. To tackle these issues, we propose a task-agnostic method called Dynamic Compressing Prompts (LLM-DCP). Our method reduces the number of prompt tokens while aiming to preserve the performance as much as possible. We model prompt compression as a Markov Decision Process (MDP), enabling the DCP-Agent to sequentially remove redundant tokens by adapting to dynamic contexts and retaining crucial content. We develop a reward function for training the DCP-Agent that balances the compression rate, the quality of the LLM output, and the retention of key information. This allows for prompt token reduction without needing an external black-box LLM. Inspired by the progressive difficulty adjustment in curriculum learning, we introduce a Hierarchical Prompt Compression (HPC) training strategy that gradually increases the compression difficulty, enabling the DCP-Agent to learn an effective compression method that maintains information integrity. Experiments demonstrate that our method outperforms state-of-the-art techniques, especially at higher compression rates. The code for our approach will be available at this https URL.
摘要：大型语言模型（LLM）在各种任务中表现出出色的性能，部分原因是高级提示技术。但是，这些技术通常需要冗长的提示，这增加了计算成本，并且由于LLM的上下文窗口有限，因此可能会阻碍性能。虽然迅速压缩是一个直接的解决方案，但现有方法面临保留基本信息，适应上下文更改以及在不同任务中保持有效的挑战。为了解决这些问题，我们提出了一种称为动态压缩提示（LLM-DCP）的任务无关方法。我们的方法减少了迅速令牌的数量，同时旨在尽可能地保留性能。我们将提示压缩为马尔可夫决策过程（MDP）建模，从而使DCP代理通过适应动态上下文并保留关键内容来顺序删除冗余令牌。我们为培训DCP代理而开发了奖励功能，该奖励能够平衡压缩率，LLM输出的质量和保留关键信息。这允许及时减少令牌，而无需外部黑盒LLM。受课程学习的逐步调整的启发，我们引入了层次及时压缩（HPC）培训策略，逐渐增加了压缩难度，从而使DCP代理能够学习一种维持信息完整性的有效压缩方法。实验表明，我们的方法优于最先进的技术，尤其是在较高的压缩率下。我们方法的代码将在此HTTPS URL上获得。

Title: LazyReview A Dataset for Uncovering Lazy Thinking in NLP Peer Reviews

Authors: Sukannya Purkayastha, Zhuang Li, Anne Lauscher, Lizhen Qu, Iryna Gurevych
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.11042
Pdf URL: https://arxiv.org/pdf/2504.11042
Copy Paste: [[2504.11042]] LazyReview A Dataset for Uncovering Lazy Thinking in NLP Peer Reviews(https://arxiv.org/abs/2504.11042)
Keywords: language model, llm
Abstract: Peer review is a cornerstone of quality control in scientific publishing. With the increasing workload, the unintended use of 'quick' heuristics, referred to as lazy thinking, has emerged as a recurring issue compromising review quality. Automated methods to detect such heuristics can help improve the peer-reviewing process. However, there is limited NLP research on this issue, and no real-world dataset exists to support the development of detection tools. This work introduces LazyReview, a dataset of peer-review sentences annotated with fine-grained lazy thinking categories. Our analysis reveals that Large Language Models (LLMs) struggle to detect these instances in a zero-shot setting. However, instruction-based fine-tuning on our dataset significantly boosts performance by 10-20 points, highlighting the importance of high-quality training data. Furthermore, a controlled experiment demonstrates that reviews revised with lazy thinking feedback are more comprehensive and actionable than those written without such feedback. We will release our dataset and the enhanced guidelines that can be used to train junior reviewers in the community. (Code available here: this https URL)
摘要：同行评审是科学出版中质量控制的基石。随着工作量越来越大，“快速”启发式方法的意外使用被称为懒惰思维，已成为一个反复出现的问题，损害了审查质量。检测这种启发式方法的自动化方法可以帮助改善同行评审过程。但是，关于此问题的NLP研究有限，没有现实世界中的数据集来支持检测工具的开发。这项工作介绍了LazyReview，这是一个带有细粒度懒惰思维类别的同行评审句子的数据集。我们的分析表明，大型语言模型（LLMS）难以在零拍设置中检测这些实例。但是，基于指导的微调在我们的数据集中显着提高了10-20个性能点，从而突出了高质量培训数据的重要性。此外，一个受控的实验表明，用懒惰思考的反馈进行修订的评论比没有这种反馈的书面更全面和可操作。我们将发布我们的数据集以及可用于培训社区初级审阅者的增强指南。（可在此处提供代码：此HTTPS URL）

Title: DeepMLF: Multimodal language model with learnable tokens for deep fusion in sentiment analysis

Authors: Efthymios Georgiou, Vassilis Katsouros, Yannis Avrithis, Alexandros Potamianos
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.11082
Pdf URL: https://arxiv.org/pdf/2504.11082
Copy Paste: [[2504.11082]] DeepMLF: Multimodal language model with learnable tokens for deep fusion in sentiment analysis(https://arxiv.org/abs/2504.11082)
Keywords: language model, llm
Abstract: While multimodal fusion has been extensively studied in Multimodal Sentiment Analysis (MSA), the role of fusion depth and multimodal capacity allocation remains underexplored. In this work, we position fusion depth, scalability, and dedicated multimodal capacity as primary factors for effective fusion. We introduce DeepMLF, a novel multimodal language model (LM) with learnable tokens tailored toward deep fusion. DeepMLF leverages an audiovisual encoder and a pretrained decoder LM augmented with multimodal information across its layers. We append learnable tokens to the LM that: 1) capture modality interactions in a controlled fashion and 2) preserve independent information flow for each modality. These fusion tokens gather linguistic information via causal self-attention in LM Blocks and integrate with audiovisual information through cross-attention MM Blocks. Serving as dedicated multimodal capacity, this design enables progressive fusion across multiple layers, providing depth in the fusion process. Our training recipe combines modality-specific losses and language modelling loss, with the decoder LM tasked to predict ground truth polarity. Across three MSA benchmarks with varying dataset characteristics, DeepMLF achieves state-of-the-art performance. Our results confirm that deeper fusion leads to better performance, with optimal fusion depths (5-7) exceeding those of existing approaches. Additionally, our analysis on the number of fusion tokens reveals that small token sets ($\sim$20) achieve optimal performance. We examine the importance of representation learning order (fusion curriculum) through audiovisual encoder initialization experiments. Our ablation studies demonstrate the superiority of the proposed fusion design and gating while providing a holistic examination of DeepMLF's scalability to LLMs, and the impact of each training objective and embedding regularization.
摘要：虽然多模式融合在多模式情感分析（MSA）中进行了广泛研究，但融合深度和多模式能力分配的作用仍未得到充实。在这项工作中，我们将融合深度，可伸缩性和专用多模式能力定位为有效融合的主要因素。我们介绍了DeepMLF，这是一种新型的多模式模型（LM），其可学习的代币针对深层融合而定。 DEEPMLF利用了视听编码器和预验证的解码器LM增强，并在其层上使用了多模式信息。我们将可学习的令牌附加到LM上：1）以受控方式捕获模式相互作用，2）为每种模式保留独立的信息流。这些融合令牌通过在LM块中的因果自我注意力来收集语言信息，并通过跨注意的MM块与视听信息集成。该设计充当专用的多模式能力，可以使多个层进行渐进式融合，从而在融合过程中提供了深度。我们的培训配方结合了特定于方式的损失和语言建模损失，而解码器LM的任务是预测地面真理极性。在三个具有不同数据集特性的MSA基准测试中，DEEPMLF实现了最先进的性能。我们的结果证实，更深层次的融合会导致更好的性能，而最佳融合深度（5-7）超过了现有方法。此外，我们对融合令牌数量的分析表明，小令牌集（$ \ sim $ 20）可实现最佳性能。我们通过视听编码器初始化实验来研究表示学习顺序（融合课程）的重要性。我们的消融研究表明了拟议的融合设计和门控的优越性，同时提供了对DEEPMLF对LLM的可伸缩性的整体检查，以及每个训练目标和嵌入正则化的影响。

Title: Using LLMs as prompt modifier to avoid biases in AI image generators

Authors: René Peinl
Subjects: cs.CL, cs.CV, cs.CY
Abstract URL: https://arxiv.org/abs/2504.11104
Pdf URL: https://arxiv.org/pdf/2504.11104
Copy Paste: [[2504.11104]] Using LLMs as prompt modifier to avoid biases in AI image generators(https://arxiv.org/abs/2504.11104)
Keywords: language model, llm, prompt
Abstract: This study examines how Large Language Models (LLMs) can reduce biases in text-to-image generation systems by modifying user prompts. We define bias as a model's unfair deviation from population statistics given neutral prompts. Our experiments with Stable Diffusion XL, 3.5 and Flux demonstrate that LLM-modified prompts significantly increase image diversity and reduce bias without the need to change the image generators themselves. While occasionally producing results that diverge from original user intent for elaborate prompts, this approach generally provides more varied interpretations of underspecified requests rather than superficial variations. The method works particularly well for less advanced image generators, though limitations persist for certain contexts like disability representation. All prompts and generated images are available at this https URL
摘要：这项研究研究了大型语言模型（LLMS）如何通过修改用户提示来减少文本到图像生成系统中的偏差。我们将偏见定义为模型与人口统计的不公平偏差，给定中性提示。我们对稳定的扩散XL，3.5和通量的实验表明，LLM修饰会促使图像多样性显着增加并减少偏差，而无需更改图像发生器本身。尽管偶尔会产生与原始用户意图相差的结果，但这种方法通常会提供更多关于未指定请求而不是表面变化的解释。该方法对于较不高级的图像生成器特别有效，尽管对于某些上下文（例如残疾表示）的限制仍然存在。所有提示和生成的图像均可在此HTTPS URL上找到

Title: Benchmarking Vision Language Models on German Factual Data

Authors: René Peinl, Vincent Tischler
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.11108
Pdf URL: https://arxiv.org/pdf/2504.11108
Copy Paste: [[2504.11108]] Benchmarking Vision Language Models on German Factual Data(https://arxiv.org/abs/2504.11108)
Keywords: language model, llm, prompt
Abstract: Similar to LLMs, the development of vision language models is mainly driven by English datasets and models trained in English and Chinese language, whereas support for other languages, even those considered high-resource languages such as German, remains significantly weaker. In this work we present an analysis of open-weight VLMs on factual knowledge in the German and English language. We disentangle the image-related aspects from the textual ones by analyzing accuracy with jury-as-a-judge in both prompt languages and images from German and international contexts. We found that for celebrities and sights, VLMs struggle because they are lacking visual cognition of German image contents. For animals and plants, the tested models can often correctly identify the image contents according to the scientific name or English common name but fail in German language. Cars and supermarket products were identified equally well in English and German images across both prompt languages.
摘要：与LLM相似，视觉语言模型的发展主要是由英语和中文培训的英语数据集和模型驱动的，而对其他语言的支持，即使是那些被认为是高源语言（例如德语）的语言，也显着弱。在这项工作中，我们介绍了对德语和英语的事实知识的开放权重VLM的分析。我们通过以及时的语言和来自德语和国际环境的陪审团的方式分析陪审团的法官来分析与图像相关的方面与文本相关的方面。我们发现，对于名人和景点，VLM挣扎，因为他们缺乏对德国形象内容的视觉认知。对于动物和植物，经过测试的模型通常可以正确地识别以科学名称或英语通用名称的形象符合图像内容，但在德国的兰加格中失败了。在两种及时的语言中，汽车和超市产品在英语和德语图像中的鉴定都同样很好。

Title: MuSeD: A Multimodal Spanish Dataset for Sexism Detection in Social Media Videos

Authors: Laura De Grazia, Pol Pastells, Mauro Vázquez Chas, Desmond Elliott, Danae Sánchez Villegas, Mireia Farrús, Mariona Taulé
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.11169
Pdf URL: https://arxiv.org/pdf/2504.11169
Copy Paste: [[2504.11169]] MuSeD: A Multimodal Spanish Dataset for Sexism Detection in Social Media Videos(https://arxiv.org/abs/2504.11169)
Keywords: language model, llm
Abstract: Sexism is generally defined as prejudice and discrimination based on sex or gender, affecting every sector of society, from social institutions to relationships and individual behavior. Social media platforms amplify the impact of sexism by conveying discriminatory content not only through text but also across multiple modalities, highlighting the critical need for a multimodal approach to the analysis of sexism online. With the rise of social media platforms where users share short videos, sexism is increasingly spreading through video content. Automatically detecting sexism in videos is a challenging task, as it requires analyzing the combination of verbal, audio, and visual elements to identify sexist content. In this study, (1) we introduce MuSeD, a new Multimodal Spanish dataset for Sexism Detection consisting of $\approx$ 11 hours of videos extracted from TikTok and BitChute; (2) we propose an innovative annotation framework for analyzing the contribution of textual and multimodal labels in the classification of sexist and non-sexist content; and (3) we evaluate a range of large language models (LLMs) and multimodal LLMs on the task of sexism detection. We find that visual information plays a key role in labeling sexist content for both humans and models. Models effectively detect explicit sexism; however, they struggle with implicit cases, such as stereotypes, instances where annotators also show low agreement. This highlights the inherent difficulty of the task, as identifying implicit sexism depends on the social and cultural context.
摘要：性别歧视通常被定义为基于性别或性别的偏见和歧视，从社会制度到人际关系和个人行为，影响社会的每个部门。社交媒体平台不仅通过文本传达歧视性内容，而且可以通过多种方式传达歧视性内容，从而扩大了性别歧视的影响，从而突出了对多模式方法进行在线分析的多模式方法的迫切需求。随着社交媒体平台的兴起，用户共享简短的视频，性别歧视越来越多地通过视频内容传播。在视频中自动检测性别歧视是一项具有挑战性的任务，因为它需要分析言语，音频和视觉元素的组合以识别性别歧视内容。在这项研究中，（1）我们介绍了一个新的多式联运数据集，用于性别歧视检测，该数据集由$ \ $ \ $ \ $ \ $ 11小时的视频组成，这些视频从Tiktok和Bitchute中提取；（2）我们提出了一个创新的注释框架，用于分析文本和多模式标签在性别歧视和非性别含量分类中的贡献；（3）我们在性别歧视检测的任务上评估了一系列大语模型（LLM）和多模式LLM。我们发现，视觉信息在标记人类和模型的性别歧视内容中起着关键作用。模型有效地检测出明确的性别歧视；但是，他们在隐式案例（例如刻板印象）中挣扎，注释者也表现出低同意的情况。这突出了任务的内在困难，因为识别隐性性别歧视取决于社会和文化背景。

Title: Bias Beyond English: Evaluating Social Bias and Debiasing Methods in a Low-Resource Setting

Authors: Ej Zhou, Weiming Lu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.11183
Pdf URL: https://arxiv.org/pdf/2504.11183
Copy Paste: [[2504.11183]] Bias Beyond English: Evaluating Social Bias and Debiasing Methods in a Low-Resource Setting(https://arxiv.org/abs/2504.11183)
Keywords: language model
Abstract: Social bias in language models can potentially exacerbate social inequalities. Despite it having garnered wide attention, most research focuses on English data. In a low-resource scenario, the models often perform worse due to insufficient training data. This study aims to leverage high-resource language corpora to evaluate bias and experiment with debiasing methods in low-resource languages. We evaluated the performance of recent multilingual models in five languages: English (eng), Chinese (zho), Russian (rus), Indonesian (ind) and Thai (tha), and analyzed four bias dimensions: gender, religion, nationality, and race-color. By constructing multilingual bias evaluation datasets, this study allows fair comparisons between models across languages. We have further investigated three debiasing methods (CDA, Dropout, and SenDeb) and demonstrated that debiasing methods from high-resource languages can be effectively transferred to low-resource ones, providing actionable insights for fairness research in multilingual NLP.
摘要：语言模型中的社会偏见可能会加剧社会不平等现象。尽管它引起了广泛的关注，但大多数研究都集中在英语数据上。在低资源场景中，由于培训数据不足，模型通常的性能往往更糟。这项研究旨在利用高资源语言语言语料库来评估偏见和实验低资源语言的偏见方法。我们评估了五种语言的最新多语言模型的性能：英语（\ textsc {eng}），中文（\ textsc {zho}），俄语（\ textsc {rus}），iNdonesian（\ textsc {indsc {ind}）和thai（\ textsc {tha}），并分析了四个bias dimensions：\ fextion {gias dimensions：\ distion： \ textit {宗教}，\ textit {norlantity}和\ textit {race-color}。通过构建多语言偏见评估数据集，本研究可以在跨语言的模型之间进行公平的比较。我们还进一步研究了三种辩护方法 - \ texttt {cda}，\ texttt {droptt}，\ texttt {sendeb} - ，并证明，从高水库中使用的偏见方法可以有效地转移到低资源的人中，从而提供可行的洞察力。

Title: Benchmarking Next-Generation Reasoning-Focused Large Language Models in Ophthalmology: A Head-to-Head Evaluation on 5,888 Items

Authors: Minjie Zou, Sahana Srinivasan, Thaddaeus Wai Soon Lo, Ke Zou, Gabriel Dawei Yang, Xuguang Ai, Hyunjae Kim, Maxwell Singer, Fares Antaki, Kelvin Li, Robert Chang, Marcus Tan, David Ziyou Chen, Dianbo Liu, Qingyu Chen, Yih Chung Tham
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.11186
Pdf URL: https://arxiv.org/pdf/2504.11186
Copy Paste: [[2504.11186]] Benchmarking Next-Generation Reasoning-Focused Large Language Models in Ophthalmology: A Head-to-Head Evaluation on 5,888 Items(https://arxiv.org/abs/2504.11186)
Keywords: language model, llm
Abstract: Recent advances in reasoning-focused large language models (LLMs) mark a shift from general LLMs toward models designed for complex decision-making, a crucial aspect in medicine. However, their performance in specialized domains like ophthalmology remains underexplored. This study comprehensively evaluated and compared the accuracy and reasoning capabilities of four newly developed reasoning-focused LLMs, namely DeepSeek-R1, OpenAI o1, o3-mini, and Gemini 2.0 Flash-Thinking. Each model was assessed using 5,888 multiple-choice ophthalmology exam questions from the MedMCQA dataset in a zero-shot setting. Quantitative evaluation included accuracy, Macro-F1, and five text-generation metrics (ROUGE-L, METEOR, BERTScore, BARTScore, and AlignScore), computed against ground-truth reasonings. Average inference time was recorded for a subset of 100 randomly selected questions. Additionally, two board-certified ophthalmologists qualitatively assessed clarity, completeness, and reasoning structure of responses to differential diagnosis questions. o1 (0.902) and DeepSeek-R1 (0.888) achieved the highest accuracy, with o1 also leading in Macro-F1 (0.900). The performance of models across the text-generation metrics varied: o3-mini excelled in ROUGE-L (0.151), o1 in METEOR (0.232), DeepSeek-R1 and o3-mini tied for BERTScore (0.673), DeepSeek-R1 (-4.105) and Gemini 2.0 Flash-Thinking (-4.127) performed best in BARTScore, while o3-mini (0.181) and o1 (0.176) led AlignScore. Inference time across the models varied, with DeepSeek-R1 being slowest (40.4 seconds) and Gemini 2.0 Flash-Thinking fastest (6.7 seconds). Qualitative evaluation revealed that DeepSeek-R1 and Gemini 2.0 Flash-Thinking tended to provide detailed and comprehensive intermediate reasoning, whereas o1 and o3-mini displayed concise and summarized justifications.
摘要：以推理为重点的大语言模型（LLMS）的最新进展标志着从一般LLMS向设计用于复杂决策的模型的转变，这是医学上的关键方面。但是，它们在诸如眼科之类的专业领域的表现仍然没有被忽视。这项研究全面评估并比较了四个新开发的以推理为中心的LLM的准确性和推理能力，即DeepSeek-R1，OpenAI O1，O3-Mini和Gemini 2.0 2.0闪存思维。使用5,888个从MEDMCQA数据集中的5,888个多项选择眼科考试问题评估了每个模型。定量评估包括准确性，宏F1和五个文本生成指标（Rouge-L，Meteor，Bertscore，Bartscore和AlignScore），这些指标是根据地面真实的推理计算的。记录了100个随机选择问题的子集的平均推理时间。此外，两名经过董事会认证的眼科医生定性评估了对差异诊断问题的反应的清晰度，完整性和推理结构。O1（0.902）和DeepSeek-R1（0.888）达到了最高准确性，O1在宏F1（0.900）中也引导了O1。在文本生成指标上的模型的性能各不相同：O3-Mini在Rouge-L（0.151），流星（0.232）中表现出色（0.232），DeepSeek-R1和O3-Mini绑在Bertscore（0.673）（0.673），DeepSeek-R1（0.673），DeepSeek-R1（-4.4.105）和gemini 2.0 flashie 2.0 flash-sc时（-4.44.127）（0.181）和O1（0.176）LED AlignScore。模型中的推理时间各不相同，DeepSeek-R1最慢（40.4秒）和Gemini 2.0闪存最快（6.7秒）。定性评估表明，DeepSeek-R1和Gemini 2.0闪存的思维倾向于提供详细而全面的中间推理，而O1和O3-Mini则表现出简洁并总结了理由。

Title: From Misleading Queries to Accurate Answers: A Three-Stage Fine-Tuning Method for LLMs

Authors: Guocong Li, Weize Liu, Yihang Wu, Ping Wang, Shuaihan Huang, Hongxia Xu, Jian Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.11277
Pdf URL: https://arxiv.org/pdf/2504.11277
Copy Paste: [[2504.11277]] From Misleading Queries to Accurate Answers: A Three-Stage Fine-Tuning Method for LLMs(https://arxiv.org/abs/2504.11277)
Keywords: language model, llm, hallucination
Abstract: Large language models (LLMs) exhibit excellent performance in natural language processing (NLP), but remain highly sensitive to the quality of input queries, especially when these queries contain misleading or inaccurate information. Existing methods focus on correcting the output, but they often overlook the potential of improving the ability of LLMs to detect and correct misleading content in the input itself. In this paper, we propose a novel three-stage fine-tuning method that enhances the ability of LLMs to detect and correct misleading information in the input, further improving response accuracy and reducing hallucinations. Specifically, the three stages include (1) training LLMs to identify misleading information, (2) training LLMs to correct the misleading information using built-in or external knowledge, and (3) training LLMs to generate accurate answers based on the corrected queries. To evaluate our method, we conducted experiments on three datasets for the hallucination detection task and the question answering (QA) task, as well as two datasets containing misleading information that we constructed. The experimental results demonstrate that our method significantly improves the accuracy and factuality of LLM responses, while also enhancing the ability to detect hallucinations and reducing the generation of hallucinations in the output, particularly when the query contains misleading information. We will publicly release our code upon acceptance.
摘要：大型语言模型（LLM）在自然语言处理（NLP）方面表现出色，但对输入查询的质量仍然高度敏感，尤其是当这些查询包含误导性或不准确的信息时。现有的方法着重于纠正输出，但它们通常会忽略提高LLMS检测和纠正输入本身中误导性内容的能力的潜力。在本文中，我们提出了一种新颖的三阶段微调方法，可以增强LLM在输入中检测和纠正误导信息的能力，从而进一步提高响应准确性并降低幻觉。具体而言，这三个阶段包括（1）培训LLMS以识别误导性信息，（2）培训LLMS使用内置或外部知识纠正误导性信息，以及（3）培训LLMS根据校正的查询生成准确的答案。为了评估我们的方法，我们在三个数据集上进行了实验，以实现幻觉检测任务和问题答案（QA）任务，以及两个包含我们构建的误导信息的数据集。实验结果表明，我们的方法显着提高了LLM响应的准确性和事实，同时还提高了检测幻觉和减少输出中幻觉产生的能力，尤其是当查询包含误导性信息时。我们将在接受后公开发布我们的代码。

Title: Automated Python Translation

Authors: Joshua Otten, Antonios Anastasopoulos, Kevin Moran
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.11290
Pdf URL: https://arxiv.org/pdf/2504.11290
Copy Paste: [[2504.11290]] Automated Python Translation(https://arxiv.org/abs/2504.11290)
Keywords: language model
Abstract: Python is one of the most commonly used programming languages in industry and education. Its English keywords and built-in functions/modules allow it to come close to pseudo-code in terms of its readability and ease of writing. However, those who do not speak English may not experience these advantages. In fact, they may even be hindered in their ability to understand Python code, as the English nature of its terms creates an additional layer of overhead. To that end, we introduce the task of automatically translating Python's natural modality (keywords, error types, identifiers, etc.) into other human languages. This presents a unique challenge, considering the abbreviated nature of these forms, as well as potential untranslatability of advanced mathematical/programming concepts across languages. We therefore create an automated pipeline to translate Python into other human languages, comparing strategies using machine translation and large language models. We then use this pipeline to acquire translations from five common Python libraries (pytorch, pandas, tensorflow, numpy, and random) in seven languages, and do a quality test on a subset of these terms in French, Greek, and Bengali. We hope this will provide a clearer path forward towards creating a universal Python, accessible to anyone regardless of nationality or language background.
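
Illustration (not from the paper): a minimal sketch of what translating Python's keywords could look like at the token level. The French mapping FR_KEYWORDS and the function name translate_keywords are hypothetical, and the authors' pipeline, which also covers error types and identifiers and compares machine translation with LLMs, is not reproduced here. The sketch uses only the standard tokenize and keyword modules, so keywords inside strings and comments are left alone.

```python
import io
import keyword
import tokenize

# Hypothetical keyword mapping, for illustration only (not the paper's output).
FR_KEYWORDS = {"def": "definir", "if": "si", "else": "sinon", "return": "retourner"}


def translate_keywords(source: str, mapping: dict) -> str:
    """Replace Python keywords with their mapped forms, leaving everything else intact."""
    lines = source.splitlines(keepends=True)
    edits = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        # Keywords are reported as NAME tokens; strings and comments have other types.
        if tok.type == tokenize.NAME and keyword.iskeyword(tok.string) and tok.string in mapping:
            row, col_start = tok.start
            edits.append((row - 1, col_start, tok.end[1], mapping[tok.string]))
    # Apply edits right to left so earlier column offsets stay valid.
    for row, col_start, col_end, new in sorted(edits, reverse=True):
        lines[row] = lines[row][:col_start] + new + lines[row][col_end:]
    return "".join(lines)


example = "def f(x):\n    if x:\n        return 'if'\n    else:\n        return x\n"
print(translate_keywords(example, FR_KEYWORDS))
```

The translated text is no longer valid Python, so a real system would also need the reverse mapping before execution.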

Title: REWARD CONSISTENCY: Improving Multi-Objective Alignment from a Data-Centric Perspective

Authors: Zhihao Xu, Yongqi Tong, Xin Zhang, Jun Zhou, Xiting Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.11337
Pdf URL: https://arxiv.org/pdf/2504.11337
Copy Paste: [[2504.11337]] REWARD CONSISTENCY: Improving Multi-Objective Alignment from a Data-Centric Perspective(https://arxiv.org/abs/2504.11337)
Keywords: language model
Abstract: Multi-objective preference alignment in language models often encounters a challenging trade-off: optimizing for one human preference (e.g., helpfulness) frequently compromises others (e.g., harmlessness) due to the inherent conflicts between competing objectives. While prior work mainly focuses on algorithmic solutions, we explore a novel data-driven approach to uncover the types of data that can effectively mitigate these conflicts. Specifically, we propose the concept of Reward Consistency (RC), which identifies samples that align with multiple preference objectives, thereby reducing conflicts during training. Through gradient-based analysis, we demonstrate that RC-compliant samples inherently constrain performance degradation during multi-objective optimization. Building on these insights, we further develop Reward Consistency Sampling, a framework that automatically constructs preference datasets that effectively mitigate conflicts during multi-objective alignment. Our generated data achieves an average improvement of 13.37% in both the harmless rate and helpfulness win rate when optimizing harmlessness and helpfulness, and can consistently resolve conflicts in varying multi-objective scenarios.
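
Illustration (not from the paper): one simple reading of "samples that align with multiple preference objectives" is a preference pair whose chosen response scores higher than the rejected one under every objective's reward function. The sketch below filters a dataset under that assumption. The function name, the strict-inequality criterion, and the toy reward functions are all hypothetical; the paper's Reward Consistency Sampling framework may differ.

```python
from typing import Callable, Dict, List, Tuple

Pair = Tuple[str, str, str]            # (prompt, chosen, rejected)
RewardFn = Callable[[str, str], float]  # (prompt, response) -> score


def reward_consistent(pairs: List[Pair], rewards: Dict[str, RewardFn]) -> List[Pair]:
    """Keep pairs whose chosen response beats the rejected one under every objective."""
    kept = []
    for prompt, chosen, rejected in pairs:
        if all(r(prompt, chosen) > r(prompt, rejected) for r in rewards.values()):
            kept.append((prompt, chosen, rejected))
    return kept


# Toy stand-ins for learned reward models (assumptions for the demo only).
toy_rewards = {
    "helpfulness": lambda prompt, resp: float(len(resp.split())),
    "harmlessness": lambda prompt, resp: -float(resp.count("stupid")),
}

toy_pairs = [
    ("q1", "Here is a detailed safe answer", "No stupid"),
    ("q2", "Long answer calling you stupid twice stupid", "Short reply"),
]

print(reward_consistent(toy_pairs, toy_rewards))
```

In the toy data, the second pair is dropped because its chosen response wins on helpfulness but loses on harmlessness, which is the kind of conflict the abstract describes.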

Title: OpenTuringBench: An Open-Model-based Benchmark and Framework for Machine-Generated Text Detection and Attribution

Authors: Lucio La Cava, Andrea Tagarelli
Subjects: cs.CL, cs.AI, cs.CY, cs.HC, physics.soc-ph
Abstract URL: https://arxiv.org/abs/2504.11369
Pdf URL: https://arxiv.org/pdf/2504.11369
Copy Paste: [[2504.11369]] OpenTuringBench: An Open-Model-based Benchmark and Framework for Machine-Generated Text Detection and Attribution(https://arxiv.org/abs/2504.11369)
Keywords: language model, llm
Abstract: Open Large Language Models (OLLMs) are increasingly leveraged in generative AI applications, posing new challenges for detecting their outputs. We propose OpenTuringBench, a new benchmark based on OLLMs, designed to train and evaluate machine-generated text detectors on the Turing Test and Authorship Attribution problems. OpenTuringBench focuses on a representative set of OLLMs, and features a number of challenging evaluation tasks, including human/machine-manipulated texts, out-of-domain texts, and texts from previously unseen models. We also provide OTBDetector, a contrastive learning framework to detect and attribute OLLM-based machine-generated texts. Results highlight the relevance and varying degrees of difficulty of the OpenTuringBench tasks, with our detector achieving remarkable capabilities across the various tasks and outperforming most existing detectors. Resources are available on the OpenTuringBench Hugging Face repository at this https URL

Title: Cancer-Myth: Evaluating AI Chatbot on Patient Questions with False Presuppositions

Authors: Wang Bill Zhu, Tianqi Chen, Ching Ying Lin, Jade Law, Mazen Jizzini, Jorge J. Nieva, Ruishan Liu, Robin Jia
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2504.11373
Pdf URL: https://arxiv.org/pdf/2504.11373
Copy Paste: [[2504.11373]] Cancer-Myth: Evaluating AI Chatbot on Patient Questions with False Presuppositions(https://arxiv.org/abs/2504.11373)
Keywords: language model, gpt, llm, chat, agent
Abstract: Cancer patients are increasingly turning to large language models (LLMs) as a new form of internet search for medical information, making it critical to assess how well these models handle complex, personalized questions. However, current medical benchmarks focus on medical exams or consumer-searched questions and do not evaluate LLMs on real patient questions with detailed clinical contexts. In this paper, we first evaluate LLMs on cancer-related questions drawn from real patients, reviewed by three hematology oncology physicians. While responses are generally accurate, with GPT-4-Turbo scoring 4.13 out of 5, the models frequently fail to recognize or address false presuppositions in the questions, posing risks to safe medical decision-making. To study this limitation systematically, we introduce Cancer-Myth, an expert-verified adversarial dataset of 585 cancer-related questions with false presuppositions. On this benchmark, no frontier LLM -- including GPT-4o, this http URL, and Claude-3.5-Sonnet -- corrects these false presuppositions more than 30% of the time. Even advanced medical agentic methods do not prevent LLMs from ignoring false presuppositions. These findings expose a critical gap in the clinical reliability of LLMs and underscore the need for more robust safeguards in medical AI systems.

Title: RankAlign: A Ranking View of the Generator-Validator Gap in Large Language Models

Authors: Juan Diego Rodriguez, Wenxuan Ding, Katrin Erk, Greg Durrett
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.11381
Pdf URL: https://arxiv.org/pdf/2504.11381
Copy Paste: [[2504.11381]] RankAlign: A Ranking View of the Generator-Validator Gap in Large Language Models(https://arxiv.org/abs/2504.11381)
Keywords: language model, llm, prompt
Abstract: Although large language models (LLMs) have become generally more capable and accurate across many tasks, some fundamental sources of unreliability remain in their behavior. One key limitation is their inconsistency at reporting the same information when prompts are changed. In this paper, we consider the discrepancy between a model's generated answer and its own verification of that answer, the generator-validator gap. We define this gap in a more stringent way than prior work: we expect correlation of scores from a generator and a validator over the entire set of candidate answers. We show that according to this measure, a large gap exists in various settings, including question answering, lexical semantics tasks, and next-word prediction. We then propose RankAlign, a ranking-based training method, and show that it significantly closes the gap by 31.8% on average, surpassing all baseline methods. Moreover, this approach generalizes well to out-of-domain tasks and lexical items.
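
Illustration (not from the paper): the abstract measures the generator-validator gap as a correlation of scores over the full candidate set. The sketch below computes a Pearson correlation between two score lists. The choice of Pearson, the function name, and the toy scores are assumptions; the abstract does not say which correlation statistic the authors use.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


# Toy scores for four candidate answers (made-up numbers, for the demo only):
# generator log-probabilities and validator "yes" probabilities.
generator_scores = [-0.5, -1.2, -3.0, -4.1]
validator_scores = [0.9, 0.4, 0.7, 0.1]

print(pearson(generator_scores, validator_scores))
```

A value near 1 would mean the two views rank candidates consistently; a lower value indicates a larger gap under this reading.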

Title: Efficient Hybrid Language Model Compression through Group-Aware SSM Pruning

Authors: Ali Taghibakhshi, Sharath Turuvekere Sreenivas, Saurav Muralidharan, Marcin Chochowski, Yashaswi Karnati, Raviraj Joshi, Ameya Sunil Mahabaleshwarkar, Zijia Chen, Yoshi Suhara, Oluwatobi Olabiyi, Daniel Korzekwa, Mostofa Patwary, Mohammad Shoeybi, Jan Kautz, Bryan Catanzaro, Ashwath Aithal, Nima Tajbakhsh, Pavlo Molchanov
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.11409
Pdf URL: https://arxiv.org/pdf/2504.11409
Copy Paste: [[2504.11409]] Efficient Hybrid Language Model Compression through Group-Aware SSM Pruning(https://arxiv.org/abs/2504.11409)
Keywords: language model, llm
Abstract: Hybrid LLM architectures that combine Attention and State Space Models (SSMs) achieve state-of-the-art accuracy and runtime performance. Recent work has demonstrated that applying compression and distillation to Attention-only models yields smaller, more accurate models at a fraction of the training cost. In this work, we explore the effectiveness of compressing Hybrid architectures. We introduce a novel group-aware pruning strategy that preserves the structural integrity of SSM blocks and their sequence modeling capabilities. Furthermore, we demonstrate the necessity of such SSM pruning to achieve improved accuracy and inference speed compared to traditional approaches. Our compression recipe combines SSM, FFN, embedding dimension, and layer pruning, followed by knowledge distillation-based retraining, similar to the MINITRON technique. Using this approach, we compress the Nemotron-H 8B Hybrid model down to 4B parameters with up to 40x fewer training tokens. The resulting model surpasses the accuracy of similarly-sized models while achieving 2x faster inference, significantly advancing the Pareto frontier.

Title: Reinforcing Compositional Retrieval: Retrieving Step-by-Step for Composing Informative Contexts

Authors: Quanyu Long, Jianda Chen, Zhengyuan Liu, Nancy F. Chen, Wenya Wang, Sinno Jialin Pan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.11420
Pdf URL: https://arxiv.org/pdf/2504.11420
Copy Paste: [[2504.11420]] Reinforcing Compositional Retrieval: Retrieving Step-by-Step for Composing Informative Contexts(https://arxiv.org/abs/2504.11420)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across numerous tasks, yet they often rely on external context to handle complex tasks. While retrieval-augmented frameworks traditionally focus on selecting top-ranked documents in a single pass, many real-world scenarios demand compositional retrieval, where multiple sources must be combined in a coordinated manner. In this work, we propose a tri-encoder sequential retriever that models this process as a Markov Decision Process (MDP), decomposing the probability of retrieving a set of elements into a sequence of conditional probabilities and allowing each retrieval step to be conditioned on previously selected examples. We train the retriever in two stages: first, we efficiently construct supervised sequential data for initial policy training; we then refine the policy to align with the LLM's preferences using a reward grounded in the structural correspondence of generated programs. Experimental results show that our method consistently and significantly outperforms baselines, underscoring the importance of explicitly modeling inter-example dependencies. These findings highlight the potential of compositional retrieval for tasks requiring multiple pieces of evidence or examples.
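
Illustration (not from the paper): the decomposition described in the abstract corresponds to the chain rule of probability. The symbols are assumptions: $x$ is the query and $z_1, \dots, z_k$ are the retrieved elements in selection order.

```latex
P(z_1, \dots, z_k \mid x) \;=\; \prod_{i=1}^{k} P\bigl(z_i \mid x,\, z_1, \dots, z_{i-1}\bigr)
```

Each factor conditions on the elements already selected, which is what lets a sequential retriever model dependencies between examples.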

Title: A Dual-Space Framework for General Knowledge Distillation of Large Language Models

Authors: Xue Zhang, Songming Zhang, Yunlong Liang, Fandong Meng, Yufeng Chen, Jinan Xu, Jie Zhou
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.11426
Pdf URL: https://arxiv.org/pdf/2504.11426
Copy Paste: [[2504.11426]] A Dual-Space Framework for General Knowledge Distillation of Large Language Models(https://arxiv.org/abs/2504.11426)
Keywords: language model, llm
Abstract: Knowledge distillation (KD) is a promising solution to compress large language models (LLMs) by transferring their knowledge to smaller models. During this process, white-box KD methods usually minimize the distance between the output distributions of the teacher model and the student model to transfer more information. However, we reveal that the current white-box KD framework exhibits two limitations: a) bridging probability distributions from different output spaces will limit the similarity between the teacher model and the student model; b) this framework cannot be applied to LLMs with different vocabularies. One of the root causes for these limitations is that the distributions from the teacher and the student for KD are output by different prediction heads, which yield distributions in different output spaces and dimensions. Therefore, in this paper, we propose a dual-space knowledge distillation (DSKD) framework that unifies the prediction heads of the teacher and the student models for KD. Specifically, we first introduce two projectors with ideal initialization to project the teacher/student hidden states into the student/teacher representation spaces. After this, the hidden states from different models can share the same head and unify the output spaces of the distributions. Furthermore, we develop an exact token alignment (ETA) algorithm to align the same tokens in two differently-tokenized sequences. Based on the above, our DSKD framework is a general KD framework that supports both off-policy and on-policy KD, and KD between any two LLMs regardless of their vocabularies. Extensive experiments on instruction-following, mathematical reasoning, and code generation benchmarks show that DSKD significantly outperforms existing methods based on the current white-box KD framework and surpasses other cross-tokenizer KD methods for LLMs with different vocabularies.
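
Illustration (not from the paper): a minimal sketch of the standard white-box KD objective the abstract refers to, using forward KL divergence between teacher and student next-token distributions. The choice of forward KL and the function names are assumptions, and this is not the DSKD method itself.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one vocabulary-sized logit vector."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for two distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def kd_loss(teacher_logits: Sequence[float], student_logits: Sequence[float]) -> float:
    """Forward-KL distillation loss for a single token position."""
    if len(teacher_logits) != len(student_logits):
        # The element-wise comparison only makes sense over a shared vocabulary.
        raise ValueError("teacher and student must share the same vocabulary size")
    return kl_divergence(softmax(teacher_logits), softmax(student_logits))


print(kd_loss([2.0, 0.5, -1.0], [1.0, 1.0, 0.0]))
```

The length check makes limitation (b) from the abstract concrete: this loss is undefined when the two models have different vocabularies.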

Title: Masculine Defaults via Gendered Discourse in Podcasts and Large Language Models

Authors: Maria Teleki, Xiangjue Dong, Haoran Liu, James Caverlee
Subjects: cs.CL, cs.AI, cs.CY, cs.LG, cs.SI
Abstract URL: https://arxiv.org/abs/2504.11431
Pdf URL: https://arxiv.org/pdf/2504.11431
Copy Paste: [[2504.11431]] Masculine Defaults via Gendered Discourse in Podcasts and Large Language Models(https://arxiv.org/abs/2504.11431)
Keywords: language model, llm
Abstract: Masculine defaults are widely recognized as a significant type of gender bias, but they are often unseen as they are under-researched. Masculine defaults involve three key parts: (i) the cultural context, (ii) the masculine characteristics or behaviors, and (iii) the reward for, or simply acceptance of, those masculine characteristics or behaviors. In this work, we study discourse-based masculine defaults, and propose a twofold framework for (i) the large-scale discovery and analysis of gendered discourse words in spoken content via our Gendered Discourse Correlation Framework (GDCF); and (ii) the measurement of the gender bias associated with these gendered discourse words in LLMs via our Discourse Word-Embedding Association Test (D-WEAT). We focus our study on podcasts, a popular and growing form of social media, analyzing 15,117 podcast episodes. We analyze correlations between gender and discourse words -- discovered via LDA and BERTopic -- to automatically form gendered discourse word lists. We then study the prevalence of these gendered discourse words in domain-specific contexts, and find that gendered discourse-based masculine defaults exist in the domains of business, technology/politics, and video games. Next, we study the representation of these gendered discourse words from a state-of-the-art LLM embedding model from OpenAI, and find that the masculine discourse words have a more stable and robust representation than the feminine discourse words, which may result in better system performance on downstream tasks for men. Hence, men are rewarded for their discourse patterns with better system performance by one of the state-of-the-art language models -- and this embedding disparity is a representational harm and a masculine default.

Title: TextArena

Authors: Leon Guertler, Bobby Cheng, Simon Yu, Bo Liu, Leshem Choshen, Cheston Tan
Subjects: cs.CL, cs.AI, cs.LG, cs.MA
Abstract URL: https://arxiv.org/abs/2504.11442
Pdf URL: https://arxiv.org/pdf/2504.11442
Copy Paste: [[2504.11442]] TextArena(https://arxiv.org/abs/2504.11442)
Keywords: language model, llm, agent
Abstract: TextArena is an open-source collection of competitive text-based games for training and evaluation of agentic behavior in Large Language Models (LLMs). It spans 57+ unique environments (including single-player, two-player, and multi-player setups) and allows for easy evaluation of model capabilities via an online-play system (against humans and other submitted models) with real-time TrueSkill scores. Traditional benchmarks rarely assess dynamic social skills such as negotiation, theory of mind, and deception, creating a gap that TextArena addresses. Designed with research, community and extensibility in mind, TextArena emphasizes ease of adding new games, adapting the framework, testing models, playing against the models, and training models. Detailed documentation of environments, games, leaderboard, and examples are available on this https URL and this https URL.

Title: DeepMath-103K: A Large-Scale, Challenging, Decontaminated, and Verifiable Mathematical Dataset for Advancing Reasoning

Authors: Zhiwei He, Tian Liang, Jiahao Xu, Qiuzhi Liu, Xingyu Chen, Yue Wang, Linfeng Song, Dian Yu, Zhenwen Liang, Wenxuan Wang, Zhuosheng Zhang, Rui Wang, Zhaopeng Tu, Haitao Mi, Dong Yu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.11456
Pdf URL: https://arxiv.org/pdf/2504.11456
Copy Paste: [[2504.11456]] DeepMath-103K: A Large-Scale, Challenging, Decontaminated, and Verifiable Mathematical Dataset for Advancing Reasoning(https://arxiv.org/abs/2504.11456)
Keywords: llm
Abstract: The capacity for complex mathematical reasoning is a key benchmark for artificial intelligence. While reinforcement learning (RL) applied to LLMs shows promise, progress is significantly hindered by the lack of large-scale training data that is sufficiently challenging, possesses verifiable answer formats suitable for RL, and is free from contamination with evaluation benchmarks. To address these limitations, we introduce DeepMath-103K, a new, large-scale dataset comprising approximately 103K mathematical problems, specifically designed to train advanced reasoning models via RL. DeepMath-103K is curated through a rigorous pipeline involving source analysis, stringent decontamination against numerous benchmarks, and filtering for high difficulty (primarily Levels 5-9), significantly exceeding existing open resources in challenge. Each problem includes a verifiable final answer, enabling rule-based RL, and three distinct R1-generated solutions suitable for diverse training paradigms like supervised fine-tuning or distillation. Spanning a wide range of mathematical topics, DeepMath-103K promotes the development of generalizable reasoning. We demonstrate that models trained on DeepMath-103K achieve significant improvements on challenging mathematical benchmarks, validating its effectiveness. We release DeepMath-103K publicly to facilitate community progress in building more capable AI reasoning systems: this https URL.
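
Illustration (not from the paper): a verifiable final answer allows a reward to be computed by a rule instead of a learned model. The sketch below gives reward 1.0 when the text after the last "Answer:" marker matches the reference after light normalization. The marker convention, the normalization, and the function name are assumptions; the dataset's actual answer format and any checker the authors use are not described in the abstract.

```python
def rule_based_reward(response: str, reference: str, marker: str = "Answer:") -> float:
    """Return 1.0 if the final answer after the last marker matches the reference, else 0.0."""
    _, found, tail = response.rpartition(marker)
    if not found:
        # No marker means no extractable final answer.
        return 0.0

    def normalize(text: str) -> str:
        # Drop all whitespace, a trailing period, and case differences.
        return "".join(text.split()).rstrip(".").lower()

    return 1.0 if normalize(tail) == normalize(reference) else 0.0


print(rule_based_reward("We factor the quadratic. Answer: 42.", "42"))
```

Exact string matching is deliberately naive: mathematically equivalent forms such as 1/2 and 0.5 would not match, so a practical checker would need symbolic or numeric comparison.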