2025-05-19

Title: GeoGrid-Bench: Can Foundation Models Understand Multimodal Gridded Geo-Spatial Data?

Authors: Bowen Jiang, Yangxinyu Xie, Xiaomeng Wang, Jiashu He, Joshua Bergerson, John K Hutchison, Jordan Branham, Camillo J Taylor, Tanwi Mallick
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.10714
Pdf URL: https://arxiv.org/pdf/2505.10714
Copy Paste: [[2505.10714]] GeoGrid-Bench: Can Foundation Models Understand Multimodal Gridded Geo-Spatial Data?(https://arxiv.org/abs/2505.10714)
Keywords: language model
Abstract: We present GeoGrid-Bench, a benchmark designed to evaluate the ability of foundation models to understand geo-spatial data in the grid structure. Geo-spatial datasets pose distinct challenges due to their dense numerical values, strong spatial and temporal dependencies, and unique multimodal representations including tabular data, heatmaps, and geographic visualizations. To assess how foundation models can support scientific research in this domain, GeoGrid-Bench features large-scale, real-world data covering 16 climate variables across 150 locations and extended time frames. The benchmark includes approximately 3,200 question-answer pairs, systematically generated from 8 domain expert-curated templates to reflect practical tasks encountered by human scientists. These range from basic queries at a single location and time to complex spatiotemporal comparisons across regions and periods. Our evaluation reveals that vision-language models perform best overall, and we provide a fine-grained analysis of the strengths and limitations of different foundation models in different geo-spatial tasks. This benchmark offers clearer insights into how foundation models can be effectively applied to geo-spatial data analysis and used to support scientific research.
Summary: GeoGrid-Bench is a benchmark for testing how well foundation models understand gridded geo-spatial data presented as tables, heatmaps, and geographic visualizations. It covers 16 climate variables across 150 locations with about 3,200 question-answer pairs generated from 8 expert-curated templates, and vision-language models perform best overall.

Title: A Modular Approach for Clinical SLMs Driven by Synthetic Data with Pre-Instruction Tuning, Model Merging, and Clinical-Tasks Alignment

Authors: Jean-Philippe Corbeil, Amin Dada, Jean-Michel Attendu, Asma Ben Abacha, Alessandro Sordoni, Lucas Caccia, François Beaulieu, Thomas Lin, Jens Kleesiek, Paul Vozila
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.10717
Pdf URL: https://arxiv.org/pdf/2505.10717
Copy Paste: [[2505.10717]] A Modular Approach for Clinical SLMs Driven by Synthetic Data with Pre-Instruction Tuning, Model Merging, and Clinical-Tasks Alignment(https://arxiv.org/abs/2505.10717)
Keywords: language model, gpt
Abstract: High computation costs and latency of large language models such as GPT-4 have limited their deployment in clinical settings. Small language models (SLMs) offer a cost-effective alternative, but their limited capacity requires biomedical domain adaptation, which remains challenging. An additional bottleneck is the unavailability and high sensitivity of clinical data. To address these challenges, we propose a novel framework for adapting SLMs into high-performing clinical models. We introduce the MediPhi collection of 3.8B-parameter SLMs developed with our novel framework: pre-instruction tuning of experts on relevant medical and clinical corpora (PMC, Medical Guideline, MedWiki, etc.), model merging, and clinical-tasks alignment. To cover most clinical tasks, we extended the CLUE benchmark to CLUE+, doubling its size. Our expert models deliver relative improvements on this benchmark over the base model without any task-specific fine-tuning: 64.3% on medical entities, 49.5% on radiology reports, and 44% on ICD-10 coding (outperforming GPT-4-0125 by 14%). We unify the expert models into MediPhi via model merging, preserving gains across benchmarks. Furthermore, we built the MediFlow collection, a synthetic dataset of 2.5 million high-quality instructions on 14 medical NLP tasks, 98 fine-grained document types, and JSON format support. Alignment of MediPhi using supervised fine-tuning and direct preference optimization achieves further gains of 18.9% on average.
Summary: The MediPhi collection of 3.8B-parameter clinical SLMs is built through pre-instruction tuning of experts on medical and clinical corpora, model merging, and clinical-tasks alignment. On the extended CLUE+ benchmark the expert models improve over the base model by 64.3% on medical entities, 49.5% on radiology reports, and 44% on ICD-10 coding, and alignment on the 2.5-million-instruction synthetic MediFlow dataset adds a further 18.9% on average.

Title: AI-enhanced semantic feature norms for 786 concepts

Authors: Siddharth Suresh, Kushin Mukherjee, Tyler Giallanza, Xizheng Yu, Mia Patil, Jonathan D. Cohen, Timothy T. Rogers
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2505.10718
Pdf URL: https://arxiv.org/pdf/2505.10718
Copy Paste: [[2505.10718]] AI-enhanced semantic feature norms for 786 concepts(https://arxiv.org/abs/2505.10718)
Keywords: language model, llm
Abstract: Semantic feature norms have been foundational in the study of human conceptual knowledge, yet traditional methods face trade-offs between concept/feature coverage and verifiability of quality due to the labor-intensive nature of norming studies. Here, we introduce a novel approach that augments a dataset of human-generated feature norms with responses from large language models (LLMs) while verifying the quality of norms against reliable human judgments. We find that our AI-enhanced feature norm dataset, NOVA: Norms Optimized Via AI, shows much higher feature density and overlap among concepts while outperforming a comparable human-only norm dataset and word-embedding models in predicting people's semantic similarity judgments. Taken together, we demonstrate that human conceptual knowledge is richer than captured in previous norm datasets and show that, with proper validation, LLMs can serve as powerful tools for cognitive science research.
Summary: NOVA (Norms Optimized Via AI) augments human-generated semantic feature norms with LLM responses while validating quality against reliable human judgments. The resulting dataset has much higher feature density and overlap among concepts, and it predicts people's semantic similarity judgments better than a comparable human-only norm dataset and word-embedding models.

Title: Tracr-Injection: Distilling Algorithms into Pre-trained Language Models

Authors: Tomás Vergara-Browne, Álvaro Soto
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.10719
Pdf URL: https://arxiv.org/pdf/2505.10719
Copy Paste: [[2505.10719]] Tracr-Injection: Distilling Algorithms into Pre-trained Language Models(https://arxiv.org/abs/2505.10719)
Keywords: language model
Abstract: Motivated by the surge of large language models, there has been a push to formally characterize the symbolic abilities intrinsic to the transformer architecture. A programming language called RASP has been proposed, which can be directly compiled into transformer weights to implement these algorithms. However, the tasks that can be implemented in RASP are rarely learned from natural unsupervised data, showing a mismatch between the theoretical capabilities of the transformer architecture and the practical learnability of these capabilities from unsupervised data. We propose tracr-injection, a method that allows us to distill algorithms written in RASP directly into a pre-trained language model. We showcase our method by injecting 3 different algorithms into a language model. We show how our method creates an interpretable subspace within the model's residual stream, which can be decoded into the variables present in the code of the RASP algorithm. Additionally, we found that the proposed method can improve out-of-distribution performance compared to our baseline, indicating that a more symbolic mechanism is indeed taking place in the inner workings of the model. We release the code used to run our experiments.
Summary: Tracr-injection distills algorithms written in RASP directly into a pre-trained language model. Demonstrated with 3 algorithms, it creates an interpretable subspace in the residual stream that decodes into the variables of the RASP code, and it can improve out-of-distribution performance over the baseline.

Title: Model Performance-Guided Evaluation Data Selection for Effective Prompt Optimization

Authors: Ximing Dong, Shaowei Wang, Dayi Lin, Ahmed E. Hassan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.10736
Pdf URL: https://arxiv.org/pdf/2505.10736
Copy Paste: [[2505.10736]] Model Performance-Guided Evaluation Data Selection for Effective Prompt Optimization(https://arxiv.org/abs/2505.10736)
Keywords: language model, llm, prompt
Abstract: Optimizing Large Language Model (LLM) performance requires well-crafted prompts, but manual prompt engineering is labor-intensive and often ineffective. Automated prompt optimization techniques address this challenge but the majority of them rely on randomly selected evaluation subsets, which fail to represent the full dataset, leading to unreliable evaluations and suboptimal prompts. Existing coreset selection methods, designed for LLM benchmarking, are unsuitable for prompt optimization due to challenges in clustering similar samples, high data collection costs, and the unavailability of performance data for new or private datasets. To overcome these issues, we propose IPOMP, an Iterative evaluation data selection for effective Prompt Optimization using real-time Model Performance. IPOMP is a two-stage approach that selects representative and diverse samples using semantic clustering and boundary analysis, followed by iterative refinement with real-time model performance data to replace redundant samples. Evaluations on the BIG-bench dataset show that IPOMP improves effectiveness by 1.6% to 5.3% and stability by at least 57% compared with SOTA baselines, with minimal computational overhead below 1%. Furthermore, the results demonstrate that our real-time performance-guided refinement approach can be universally applied to enhance existing coreset selection methods.
Summary: IPOMP selects evaluation data for prompt optimization in two stages: semantic clustering and boundary analysis to pick representative and diverse samples, then iterative refinement with real-time model performance data to replace redundant samples. On BIG-bench it improves effectiveness by 1.6% to 5.3% and stability by at least 57% over SOTA baselines, with computational overhead below 1%.

Title: Ranked Voting based Self-Consistency of Large Language Models

Authors: Weiqin Wang, Yile Wang, Hui Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.10772
Pdf URL: https://arxiv.org/pdf/2505.10772
Copy Paste: [[2505.10772]] Ranked Voting based Self-Consistency of Large Language Models(https://arxiv.org/abs/2505.10772)
Keywords: language model, chain-of-thought
Abstract: Majority voting is considered an effective method to enhance chain-of-thought reasoning, as it selects the answer with the highest "self-consistency" among different reasoning paths (Wang et al., 2023). However, previous chain-of-thought reasoning methods typically generate only a single answer in each trial, thereby ignoring the possibility of other potential answers. As a result, these alternative answers are often overlooked in subsequent voting processes. In this work, we propose to generate ranked answers in each reasoning process and conduct ranked voting among multiple ranked answers from different responses, thereby making the overall self-consistency more reliable. Specifically, we use three ranked voting methods: instant-runoff voting, Borda count voting, and mean reciprocal rank voting. We validate our methods on six datasets, including three multiple-choice and three open-ended question-answering tasks, using both advanced open-source and closed-source large language models. Extensive experimental results indicate that our proposed method outperforms the baselines, showcasing the potential of leveraging the information of ranked answers and using ranked voting to improve reasoning performance. The code is available via the link on the arXiv abstract page.
Summary: Instead of majority voting over single answers, each reasoning path produces a ranked list of answers, and the lists are aggregated with instant-runoff voting, Borda count voting, or mean reciprocal rank voting. Across three multiple-choice and three open-ended question-answering datasets, with both open-source and closed-source LLMs, the method outperforms the baselines.
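
The three ranked voting rules named in this abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the tie-breaking (alphabetical order), and the example rankings are assumptions made for illustration, and each ranking is assumed to be a non-empty list of answers ordered best first.

```python
from collections import defaultdict


def borda_count(rankings):
    """Borda count: position i in a ranking of length n earns n - i points."""
    scores = defaultdict(float)
    for ranking in rankings:
        n = len(ranking)
        for i, answer in enumerate(ranking):
            scores[answer] += n - i
    # Assumed tie-break: alphabetical order among top-scoring answers.
    return max(sorted(scores), key=scores.get)


def mrr_vote(rankings):
    """Mean reciprocal rank voting: position i (0-based) earns 1 / (i + 1)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for i, answer in enumerate(ranking):
            scores[answer] += 1.0 / (i + 1)
    return max(sorted(scores), key=scores.get)


def instant_runoff(rankings):
    """Instant-runoff: drop the weakest answer until one holds a majority."""
    remaining = {answer for ranking in rankings for answer in ranking}
    while True:
        # Count each ballot for its highest-ranked answer still in the race.
        counts = {answer: 0 for answer in sorted(remaining)}
        for ranking in rankings:
            for answer in ranking:
                if answer in remaining:
                    counts[answer] += 1
                    break
        leader = max(counts, key=counts.get)
        if counts[leader] * 2 > sum(counts.values()) or len(remaining) == 1:
            return leader
        remaining.discard(min(counts, key=counts.get))


# Hypothetical ranked answers from five sampled reasoning paths.
rankings = [
    ["A", "B", "C"],
    ["A", "C", "B"],
    ["B", "C", "A"],
    ["C", "B", "A"],
    ["B", "A", "C"],
]
print(borda_count(rankings), mrr_vote(rankings), instant_runoff(rankings))
```

In this example the first choices alone are tied between A and B (two votes each), so plain majority voting is inconclusive, while the lower-ranked positions let all three rules settle on one answer.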

Title: A Systematic Analysis of Base Model Choice for Reward Modeling

Authors: Kian Ahrabian, Pegah Jandaghi, Negar Mokhberian, Sai Praneeth Karimireddy, Jay Pujara
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.10775
Pdf URL: https://arxiv.org/pdf/2505.10775
Copy Paste: [[2505.10775]] A Systematic Analysis of Base Model Choice for Reward Modeling(https://arxiv.org/abs/2505.10775)
Keywords: language model, llm
Abstract: Reinforcement learning from human feedback (RLHF) and, at its core, reward modeling have become a crucial part of training powerful large language models (LLMs). One commonly overlooked factor in training high-quality reward models (RMs) is the effect of the base model, which is becoming more challenging to choose given the rapidly growing pool of LLMs. In this work, we present a systematic analysis of the effect of base model selection on reward modeling performance. Our results show that the performance can be improved by up to 14% compared to the most common (i.e., default) choice. Moreover, we showcase the strong statistical relation between some existing benchmarks and downstream performances. We also demonstrate that the results from a small set of benchmarks could be combined to boost the model selection (+18% on average in the top 5-10). Lastly, we illustrate the impact of different post-training steps on the final performance and explore using estimated data distributions to reduce performance prediction error.
Summary: A systematic analysis of how base model choice affects reward modeling shows gains of up to 14% over the most common (default) choice. Some existing benchmarks correlate strongly with downstream performance, and combining a small set of them improves model selection by 18% on average in the top 5-10.

Title: Finetune-RAG: Fine-Tuning Language Models to Resist Hallucination in Retrieval-Augmented Generation

Authors: Zhan Peng Lee, Andre Lin, Calvin Tan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.10792
Pdf URL: https://arxiv.org/pdf/2505.10792
Copy Paste: [[2505.10792]] Finetune-RAG: Fine-Tuning Language Models to Resist Hallucination in Retrieval-Augmented Generation(https://arxiv.org/abs/2505.10792)
Keywords: language model, llm, hallucination, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) has emerged as a powerful framework to improve factuality in large language models (LLMs) by grounding their outputs in retrieved documents. However, ensuring perfect retrieval of relevant information remains challenging, and when irrelevant content is passed downstream to an LLM, it can lead to hallucinations. In this work, we propose Finetune-RAG, a simple and effective fine-tuning approach that features the first-of-its-kind RAG training dataset constructed to mimic real-world imperfections. Experimental results show that Finetune-RAG improves factual accuracy by 21.2% over the base model. We also propose Bench-RAG, an LLM-as-a-judge evaluation pipeline that stress-tests models under realistic imperfect retrieval scenarios. Our codebase and dataset are fully open sourced for community use.
Summary: Finetune-RAG fine-tunes language models on a RAG training dataset built to mimic imperfect real-world retrieval, improving factual accuracy by 21.2% over the base model. The accompanying Bench-RAG pipeline uses an LLM as judge to stress-test models under realistic imperfect retrieval.

Title: Enhancing Low-Resource Minority Language Translation with LLMs and Retrieval-Augmented Generation for Cultural Nuances

Authors: Chen-Chi Chang, Chong-Fu Li, Chu-Hsuan Lee, Hung-Shin Lee
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.10829
Pdf URL: https://arxiv.org/pdf/2505.10829
Copy Paste: [[2505.10829]] Enhancing Low-Resource Minority Language Translation with LLMs and Retrieval-Augmented Generation for Cultural Nuances(https://arxiv.org/abs/2505.10829)
Keywords: language model, llm, retrieval-augmented generation
Abstract: This study investigates the challenges of translating low-resource languages by integrating Large Language Models (LLMs) with Retrieval-Augmented Generation (RAG). Various model configurations were tested on Hakka translations, with BLEU scores ranging from 12% (dictionary-only) to 31% (RAG with Gemini 2.0). The best-performing model (Model 4) combined retrieval and advanced language modeling, improving lexical coverage, particularly for specialized or culturally nuanced terms, and enhancing grammatical coherence. A two-stage method (Model 3) using dictionary outputs refined by Gemini 2.0 achieved a BLEU score of 26%, highlighting iterative correction's value and the challenges of domain-specific expressions. Static dictionary-based approaches struggled with context-sensitive content, demonstrating the limitations of relying solely on predefined resources. These results emphasize the need for curated resources, domain knowledge, and ethical collaboration with local communities, offering a framework that improves translation accuracy and fluency while supporting cultural preservation.
Summary: Several configurations combining LLMs with RAG were tested on Hakka translation, with BLEU scores from 12% (dictionary-only) to 31% (RAG with Gemini 2.0). A two-stage method that refines dictionary outputs with Gemini 2.0 reached 26%, while static dictionary-based approaches struggled with context-sensitive content.

Title: Learning When to Think: Shaping Adaptive Reasoning in R1-Style Models via Multi-Stage RL

Authors: Songjun Tu, Jiahao Lin, Qichao Zhang, Xiangyu Tian, Linjing Li, Xiangyuan Lan, Dongbin Zhao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.10832
Pdf URL: https://arxiv.org/pdf/2505.10832
Copy Paste: [[2505.10832]] Learning When to Think: Shaping Adaptive Reasoning in R1-Style Models via Multi-Stage RL(https://arxiv.org/abs/2505.10832)
Keywords: prompt
Abstract: Large reasoning models (LRMs) are proficient at generating explicit, step-by-step reasoning sequences before producing final answers. However, such detailed reasoning can introduce substantial computational overhead and latency, particularly for simple problems. To address this over-thinking problem, we explore how to equip LRMs with adaptive thinking capabilities: enabling them to dynamically decide whether or not to engage in explicit reasoning based on problem complexity. Building on R1-style distilled models, we observe that inserting a simple ellipsis ("...") into the prompt can stochastically trigger either a thinking or no-thinking mode, revealing a latent controllability in the reasoning behavior. Leveraging this property, we propose AutoThink, a multi-stage reinforcement learning (RL) framework that progressively optimizes reasoning policies via stage-wise reward shaping. AutoThink learns to invoke explicit reasoning only when necessary, while defaulting to succinct responses for simpler tasks. Experiments on five mainstream mathematical benchmarks demonstrate that AutoThink achieves favorable accuracy-efficiency trade-offs compared to recent prompting and RL-based pruning methods. It can be seamlessly integrated into any R1-style model, including both distilled and further fine-tuned variants. Notably, AutoThink improves relative accuracy by 6.4 percent while reducing token usage by 52 percent on DeepSeek-R1-Distill-Qwen-1.5B, establishing a scalable and adaptive reasoning paradigm for LRMs.
Summary: Inserting an ellipsis ("...") into the prompt of R1-style distilled models stochastically triggers either a thinking or a no-thinking mode. AutoThink builds on this with multi-stage RL and stage-wise reward shaping so that explicit reasoning is invoked only when needed, improving relative accuracy by 6.4 percent while cutting token usage by 52 percent on DeepSeek-R1-Distill-Qwen-1.5B.

Title: Multimodal Event Detection: Current Approaches and Defining the New Playground through LLMs and VLMs

Authors: Abhishek Dey, Aabha Bothera, Samhita Sarikonda, Rishav Aryan, Sanjay Kumar Podishetty, Akshay Havalgi, Gaurav Singh, Saurabh Srivastava
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2505.10836
Pdf URL: https://arxiv.org/pdf/2505.10836
Copy Paste: [[2505.10836]] Multimodal Event Detection: Current Approaches and Defining the New Playground through LLMs and VLMs(https://arxiv.org/abs/2505.10836)
Keywords: gpt, llm
Abstract: In this paper, we study the challenges of detecting events on social media, where traditional unimodal systems struggle due to the rapid and multimodal nature of data dissemination. We employ a range of models, including unimodal ModernBERT and ConvNeXt-V2, multimodal fusion techniques, and advanced generative models such as GPT-4o and LLaVA. We also study the effect of providing multimodal generative models (such as GPT-4o) with a single modality to assess their efficacy. Our results indicate that while multimodal approaches notably outperform unimodal counterparts, generative approaches, despite having a large number of parameters, lag behind supervised methods in precision. Furthermore, we found that they lag behind instruction-tuned models because of their inability to generate event classes correctly. During our error analysis, we discovered that common social media issues such as leet speak and text elongation are effectively handled by generative approaches but are hard to tackle using supervised approaches.
Summary: The study compares unimodal models (ModernBERT, ConvNeXt-V2), multimodal fusion, and generative models (GPT-4o, LLaVA) for event detection on social media. Multimodal approaches outperform unimodal ones, but generative models lag behind supervised methods in precision, although they handle leet speak and text elongation better.

Title: Have Multimodal Large Language Models (MLLMs) Really Learned to Tell the Time on Analog Clocks?

Authors: Tairan Fu, Miguel González, Javier Conde, Elena Merino-Gómez, Pedro Reviriego
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.10862
Pdf URL: https://arxiv.org/pdf/2505.10862
Copy Paste: [[2505.10862]] Have Multimodal Large Language Models (MLLMs) Really Learned to Tell the Time on Analog Clocks?(https://arxiv.org/abs/2505.10862)
Keywords: language model, gpt, llm
Abstract: Multimodal Large Language Models (MLLMs), which can answer complex questions about an image, struggle to tell the time on analog clocks. This is probably due to the lack of images with clocks at different times in their training set. In this work we explore this issue with one of the latest MLLMs, GPT-4.1, to understand why MLLMs fail to tell the time and whether fine-tuning can solve the problem. The results show how models are making progress in reading the time on analog clocks. But have they really learned to do it, or have they only learned patterns in their training datasets? In this work we put the models to the test with different clocks to illustrate the limitations of MLLMs in abstracting and generalizing.
Summary: MLLMs struggle to read analog clocks, probably because clock images showing different times are scarce in their training sets. Using GPT-4.1, the authors study why this fails and whether fine-tuning helps, then test the models on different clocks to show the limits of their abstraction and generalization.

Title: Improve Rule Retrieval and Reasoning with Self-Induction and Relevance ReEstimate

Authors: Ziyang Huang, Wangtao Sun, Jun Zhao, Kang Liu
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2505.10870
Pdf URL: https://arxiv.org/pdf/2505.10870
Copy Paste: [[2505.10870]] Improve Rule Retrieval and Reasoning with Self-Induction and Relevance ReEstimate(https://arxiv.org/abs/2505.10870)
Keywords: language model, llm
Abstract: This paper systematically addresses the challenges of rule retrieval, a crucial yet underexplored area. Vanilla retrieval methods, which use sparse or dense retrievers to directly search for relevant rules to support downstream reasoning, often suffer from low accuracy. This is primarily due to a significant semantic gap between the instantiated facts in the queries and the abstract representations of the rules. Such misalignment results in suboptimal retrieval quality, which in turn negatively impacts reasoning performance. To overcome these challenges, we propose Self-Induction Augmented Retrieval (SIAR), a novel approach that utilizes Large Language Models (LLMs) to induce potential inferential rules that might offer benefits for reasoning by abstracting the underlying knowledge and logical structure in queries. These induced rules are then used for query augmentation to improve retrieval effectiveness. Additionally, we introduce Rule Relevance ReEstimate (R^3), a method that re-estimates the relevance of retrieved rules by assessing whether the abstract knowledge they contain can be instantiated to align with the facts in the queries, and how helpful they are for reasoning. Extensive experiments across various settings demonstrate the effectiveness and versatility of our proposed methods.
Summary: Rule retrieval suffers from a semantic gap between the instantiated facts in queries and the abstract form of rules. SIAR has an LLM induce candidate rules from the query and uses them for query augmentation, and Rule Relevance ReEstimate (R^3) re-scores retrieved rules by whether they can be instantiated to match the query facts and help reasoning.

Title: A Survey on the Safety and Security Threats of Computer-Using Agents: JARVIS or Ultron?

Authors: Ada Chen, Yongjiang Wu, Junyuan Zhang, Shu Yang, Jen-tse Huang, Kun Wang, Wenxuan Wang, Shuai Wang
Subjects: cs.CL, cs.AI, cs.CR, cs.CV, cs.SE
Abstract URL: https://arxiv.org/abs/2505.10924
Pdf URL: https://arxiv.org/pdf/2505.10924
Copy Paste: [[2505.10924]] A Survey on the Safety and Security Threats of Computer-Using Agents: JARVIS or Ultron?(https://arxiv.org/abs/2505.10924)
Keywords: llm, agent
Abstract: Recently, AI-driven interactions with computing devices have advanced from basic prototype tools to sophisticated, LLM-based systems that emulate human-like operations in graphical user interfaces. We are now witnessing the emergence of Computer-Using Agents (CUAs), capable of autonomously performing tasks such as navigating desktop applications, web pages, and mobile apps. However, as these agents grow in capability, they also introduce novel safety and security risks. Vulnerabilities in LLM-driven reasoning, with the added complexity of integrating multiple software components and multimodal inputs, further complicate the security landscape. In this paper, we present a systematization of knowledge on the safety and security threats of CUAs. We conduct a comprehensive literature review and distill our findings along four research objectives: (i) define the CUA that suits safety analysis; (ii) categorize current safety threats among CUAs; (iii) propose a comprehensive taxonomy of existing defensive strategies; (iv) summarize prevailing benchmarks, datasets, and evaluation metrics used to assess the safety and performance of CUAs. Building on these insights, our work provides future researchers with a structured foundation for exploring unexplored vulnerabilities and offers practitioners actionable guidance in designing and deploying secure Computer-Using Agents.
Summary: This survey systematizes knowledge on the safety and security threats of Computer-Using Agents (CUAs). It defines the CUA for safety analysis, categorizes current threats, proposes a taxonomy of defensive strategies, and summarizes the benchmarks, datasets, and metrics used to assess CUA safety and performance.

Title: Connecting the Dots: A Chain-of-Collaboration Prompting Framework for LLM Agents

Authors: Jiaxing Zhao, Hongbin Xie, Yuzhen Lei, Xuan Song, Zhuoran Shi, Lianxin Li, Shuangxue Liu, Haoran Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.10936
Pdf URL: https://arxiv.org/pdf/2505.10936
Copy Paste: [[2505.10936]] Connecting the Dots: A Chain-of-Collaboration Prompting Framework for LLM Agents(https://arxiv.org/abs/2505.10936)
Keywords: language model, gpt, llm, prompt, chain-of-thought, agent
Abstract: Large Language Models (LLMs) have demonstrated impressive performance in executing complex reasoning tasks. Chain-of-thought effectively enhances reasoning capabilities by unlocking the potential of large models, while multi-agent systems provide more comprehensive solutions by integrating the collective intelligence of multiple agents. However, both approaches face significant limitations. A single agent with chain-of-thought faces collaboration challenges because of the inherent complexity of designing cross-domain prompts. Meanwhile, multi-agent systems consume substantial tokens and inevitably dilute the primary problem, which is particularly problematic in business workflow tasks. To address these challenges, we propose Cochain, a collaboration prompting framework that effectively solves the business workflow collaboration problem by combining knowledge and prompts at a reduced cost. Specifically, we construct an integrated knowledge graph that incorporates knowledge from multiple stages. Furthermore, by maintaining and retrieving a prompts tree, we can obtain prompt information relevant to other stages of the business workflow. We perform extensive evaluations of Cochain across multiple datasets, demonstrating that Cochain outperforms all baselines in both prompt engineering and multi-agent LLMs. Additionally, expert evaluation results indicate that the use of a small model in combination with Cochain outperforms GPT-4.
Summary: Cochain is a collaboration prompting framework for business workflows that combines an integrated multi-stage knowledge graph with a prompts tree, at lower cost than multi-agent systems. It outperforms all prompt engineering and multi-agent baselines, and expert evaluation indicates that a small model with Cochain outperforms GPT-4.

Title: Reasoning with OmniThought: A Large CoT Dataset with Verbosity and Cognitive Difficulty Annotations

Authors: Wenrui Cai, Chengyu Wang, Junbing Yan, Jun Huang, Xiangzhong Fang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.10937
Pdf URL: https://arxiv.org/pdf/2505.10937
Copy Paste: [[2505.10937]] Reasoning with OmniThought: A Large CoT Dataset with Verbosity and Cognitive Difficulty Annotations(https://arxiv.org/abs/2505.10937)
Keywords: chain-of-thought
Abstract: The emergence of large reasoning models (LRMs) has transformed Natural Language Processing by excelling in complex tasks such as mathematical problem-solving and code generation. These models leverage chain-of-thought (CoT) processes, enabling them to emulate human-like reasoning strategies. However, the advancement of LRMs is hindered by the lack of comprehensive CoT datasets. Current resources often fail to provide extensive reasoning problems with coherent CoT processes distilled from multiple teacher models and do not account for multifaceted properties describing the internal characteristics of CoTs. To address these challenges, we introduce OmniThought, a large-scale dataset featuring 2 million CoT processes generated and validated by two powerful LRMs as teacher models. Each CoT process in OmniThought is annotated with novel Reasoning Verbosity (RV) and Cognitive Difficulty (CD) scores, which describe the appropriateness of CoT verbosity and cognitive difficulty level for models to comprehend these reasoning processes. We further establish a self-reliant pipeline to curate this dataset. Extensive experiments using Qwen2.5 models of various sizes demonstrate the positive impact of our proposed scores on LRM training effectiveness. Based on the proposed OmniThought dataset, we further train and release a series of high-performing LRMs, specifically equipped with stronger reasoning abilities and optimal CoT output length and difficulty level. Our contributions significantly enhance the development and training of LRMs for solving complex tasks.
摘要：大型推理模型（LRMS）的出现通过在数学解决问题和代码生成等复杂任务中出色，从而改变了自然语言处理。这些模型利用了经过思考链（COT）过程，使它们能够模仿类似人类的推理策略。但是，缺乏全面的COT数据集阻碍了LRMS的进步。当前的资源通常无法通过从多个教师模型中提取的连贯的COT过程提供广泛的推理问题，并且不考虑描述COTS内部特征的多方面属性。为了应对这些挑战，我们介绍了一个大规模数据集Omnithought，该数据集具有由两个强大的LRMs作为教师模型生成和验证的200万个COT流程。全新的COT过程都以新颖的推理冗长（RV）和认知难度（CD）分数来注释，这些分数描述了模型理解这些推理过程的COT冗长和认知难度水平的适当性。我们进一步建立了一个自力更生的管道来策划该数据集。使用QWEN2.5各种大小模型的广泛实验证明了我们提出的分数对LRM训练效果的积极影响。根据提议的全体数据集，我们进一步训练并发布了一系列高性能的LRM，这些LRMS具有更强的推理能力以及最佳的COT输出长度和难度水平。我们的贡献大大提高了LRM的开发和培训，以解决复杂的任务。

Title: Accurate KV Cache Quantization with Outlier Tokens Tracing

Authors: Yi Su, Yuechi Zhou, Quantong Qiu, Juntao Li, Qingrong Xia, Ping Li, Xinyu Duan, Zhefeng Wang, Min Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.10938
Pdf URL: https://arxiv.org/pdf/2505.10938
Copy Paste: [[2505.10938]] Accurate KV Cache Quantization with Outlier Tokens Tracing(https://arxiv.org/abs/2505.10938)
Keywords: language model, llm
Abstract: The impressive capabilities of Large Language Models (LLMs) come at the cost of substantial computational resources during deployment. While KV Cache can significantly reduce recomputation during inference, it also introduces additional memory overhead. KV Cache quantization presents a promising solution, striking a good balance between memory usage and accuracy. Previous research has shown that the Keys are distributed by channel, while the Values are distributed by token. Consequently, the common practice is to apply channel-wise quantization to the Keys and token-wise quantization to the Values. However, our further investigation reveals that a small subset of unusual tokens exhibit unique characteristics that deviate from this pattern, which can substantially impact quantization accuracy. To address this, we develop a simple yet effective method to identify these tokens accurately during the decoding process and exclude them from quantization as outlier tokens, significantly improving overall accuracy. Extensive experiments show that our method achieves significant accuracy improvements under 2-bit quantization and can deliver a 6.4 times reduction in memory usage and a 2.3 times increase in throughput.
摘要：大型语言模型（LLMS）的令人印象深刻的功能以部署期间的大量计算资源为代价。尽管KV缓存可以在推理过程中显着减少重新计算，但它也引入了其他内存开销。 KV缓存量化提出了一个有希望的解决方案，在内存使用和准确性之间达到了良好的平衡。先前的研究表明，这些键是由通道分布的，而值则由令牌分布。因此，普遍的做法是将渠道量化应用于键，将令牌量化应用于值。但是，我们的进一步研究表明，一小部分不寻常的令牌表现出独特的特征，这些特征偏离了这种模式，从而实质上会影响量化精度。为了解决这个问题，我们开发了一种简单而有效的方法，可以在解码过程中准确识别这些令牌，并将其排除在量化方面，将其排除在量化方面，从而显着提高了整体准确性。广泛的实验表明，我们的方法在2位量化下实现了明显的准确性提高，并且可以减少记忆使用情况的6.4倍，并且吞吐量增加了2.3倍。
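
The quantization layout described in the abstract above (channel-wise for Keys, token-wise for Values, outlier tokens kept out of quantization) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' method: the abstract does not say how outlier tokens are identified, so the sketch simply takes their row indices as an input, and all function names are hypothetical.

```python
def quantize(values, bits=2):
    """Uniform min-max quantization of a list of floats; returns dequantized values."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    if hi == lo:
        return list(values)
    scale = (hi - lo) / levels
    return [lo + round((v - lo) / scale) * scale for v in values]


def quantize_per_channel(matrix, bits=2):
    """Keys: one quantization range per column (channel) of a tokens x channels matrix."""
    cols = [quantize(list(col), bits) for col in zip(*matrix)]
    return [list(row) for row in zip(*cols)]


def quantize_per_token(matrix, bits=2):
    """Values: one quantization range per row (token)."""
    return [quantize(list(row), bits) for row in matrix]


def quantize_keys_excluding(keys, outlier_rows, bits=2):
    """Quantize Keys per channel but keep the given outlier rows in full precision.

    `outlier_rows` is supplied by the caller; how to detect outliers is NOT
    specified in the abstract, so no detection rule is implemented here.
    """
    kept = [row for i, row in enumerate(keys) if i not in outlier_rows]
    dequantized = iter(quantize_per_channel(kept, bits))
    return [list(keys[i]) if i in outlier_rows else next(dequantized)
            for i in range(len(keys))]


def mse(a, b):
    """Mean squared error between two equally shaped matrices."""
    diffs = [(x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)


keys = [[0.10, 0.20], [0.20, 0.10], [0.15, 0.12], [5.0, -4.0]]  # last token is unusual
err_all = mse(keys, quantize_per_channel(keys))
err_excl = mse(keys, quantize_keys_excluding(keys, {3}))
```

In this toy input the single unusual token stretches every channel's range, so excluding it leaves much finer 2-bit steps for the remaining tokens and `err_excl` comes out smaller than `err_all`.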

Title: GenKnowSub: Improving Modularity and Reusability of LLMs through General Knowledge Subtraction

Authors: Mohammadtaha Bagherifard, Sahar Rajabi, Ali Edalat, Yadollah Yaghoobzadeh
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.10939
Pdf URL: https://arxiv.org/pdf/2505.10939
Copy Paste: [[2505.10939]] GenKnowSub: Improving Modularity and Reusability of LLMs through General Knowledge Subtraction(https://arxiv.org/abs/2505.10939)
Keywords: language model, llm
Abstract: Large language models often struggle with zero-shot generalization, and several modular approaches have been proposed to address this challenge. Yet, we hypothesize that a key limitation remains: the entanglement of general knowledge and task-specific adaptations. To overcome this, we propose a modular framework that disentangles these components by constructing a library of task-specific LoRA modules alongside a general-domain LoRA. By subtracting this general knowledge component from each task-specific module, we obtain residual modules that focus more exclusively on task-relevant information, a method we call general knowledge subtraction (GenKnowSub). Leveraging the refined task-specific modules and the Arrow routing algorithm (Ostapenko et al., 2024), we dynamically select and combine modules for new inputs without additional training. Our studies on the Phi-3 model and standard Arrow as baselines reveal that using general knowledge LoRAs derived from diverse languages, including English, French, and German, yields consistent performance gains in both monolingual and cross-lingual settings across a wide set of benchmarks. Further experiments on Phi-2 demonstrate how GenKnowSub generalizes to weaker LLMs. The complete code and data are available at this https URL.
摘要：大型语言模型通常会在零拍的概括方面遇到困难，并提出了几种模块化方法来应对这一挑战。然而，我们假设仍然存在一个关键限制：一般知识和特定于任务的适应的纠缠。为了克服这一点，我们提出了一个模块化框架，该框架通过构造特定于特定于任务的洛拉模块的库与通用域的洛拉（Lora）来解散这些组件。通过从每个特定任务的模块中减去此通用知识组件，我们获得了剩余模块，这些模块更专门关注与任务相关的信息，这是我们称之为通用知识减法（GenKnowsub）的方法。利用精致的特定任务模块和箭头路由算法\ citep {ostapenko2024towards}，我们动态选择并组合了未经其他培训的新输入的模块。我们对PHI-3模型和标准箭头作为基线的研究表明，使用源自不同语言（包括英语，法语和德语）的通用知识洛拉斯，在整个基准的单语言和跨语言环境中都能在各种基准中获得一致的性能增长。对PHI-2的进一步实验证明了Genknowsub如何推广到较弱的LLM。完整的代码和数据可在此HTTPS URL上找到。
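
The subtraction step described in the abstract above can be illustrated with tiny matrices. This is a rough sketch, assuming the subtraction is applied to the dense LoRA update ΔW = B·A; the abstract does not state whether the authors subtract the dense updates or the low-rank factors, and every name below is hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_delta(B, A):
    """Dense weight update of a LoRA module: delta_W = B @ A."""
    return matmul(B, A)


def subtract_general(task_delta, general_delta):
    """Residual update: task-specific update minus the general-knowledge update.

    Assumption: subtraction happens on the dense update, element by element.
    """
    return [[t - g for t, g in zip(t_row, g_row)]
            for t_row, g_row in zip(task_delta, general_delta)]


# Rank-1 toy modules for a 2 x 3 weight matrix.
task = lora_delta([[1], [2]], [[1, 0, 1]])      # [[1, 0, 1], [2, 0, 2]]
general = lora_delta([[1], [1]], [[1, 0, 0]])   # [[1, 0, 0], [1, 0, 0]]
residual = subtract_general(task, general)
```

Note that the difference of two rank-r updates generally has rank up to 2r, so a residual module built this way may no longer fit in the original low-rank factors.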

Title: Semantic Aware Linear Transfer by Recycling Pre-trained Language Models for Cross-lingual Transfer

Authors: Seungyoon Lee, Seongtae Hong, Hyeonseok Moon, Heuiseok Lim
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.10945
Pdf URL: https://arxiv.org/pdf/2505.10945
Copy Paste: [[2505.10945]] Semantic Aware Linear Transfer by Recycling Pre-trained Language Models for Cross-lingual Transfer(https://arxiv.org/abs/2505.10945)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) increasingly incorporate multilingual capabilities, fueling the demand to transfer them into target language-specific models. However, most approaches, which blend the source model's embedding by replacing the source vocabulary with the target language-specific vocabulary, may constrain expressive capacity in the target language since the source model is predominantly trained on English data. In this paper, we propose Semantic Aware Linear Transfer (SALT), a novel cross-lingual transfer technique that recycles embeddings from target language Pre-trained Language Models (PLMs) to transmit the deep representational strengths of PLM-derived embeddings to LLMs. SALT derives unique regression lines based on the similarity in the overlap of the source and target vocabularies, to handle each non-overlapping token's embedding space. Our extensive experiments show that SALT significantly outperforms other transfer methods, achieving lower loss and faster convergence during language adaptation. Notably, SALT obtains remarkable performance in cross-lingual understanding setups compared to other methods. Furthermore, we highlight the scalable use of PLMs to enhance the functionality of contemporary LLMs by conducting experiments with varying architectures.
摘要：大型语言模型（LLMS）越来越多地结合了多语言功能，从而推动了将其转移到针对特定语言模型中的需求。但是，大多数方法通过将源词汇替换为目标语言特定的词汇来融合源模型的嵌入，可能会限制目标语言的表达能力，因为源模型主要是对英语数据进行了培训。在本文中，我们提出了语义意识线性转移（SALT），这是一种新型的跨语性转移技术，可将嵌入到目标语言预训练的语言模型（PLM）中，以传递PLM衍生的嵌入的深度代表性强度到LLMS。盐根据源和目标词汇的重叠中的相似性得出独特的回归线，以处理每个非重叠令牌的嵌入空间。我们的广泛实验表明，在语言适应过程中，盐显着优于其他转移方法，并随着语言适应过程的加速融合而实现较低的损失。值得注意的是，与其他方法相比，盐在跨语言理解设置中获得了显着的性能。此外，我们强调了PLM的可扩展用途通过使用不同的体系结构进行实验来增强当代LLM的功能。

Title: The Way We Prompt: Conceptual Blending, Neural Dynamics, and Prompt-Induced Transitions in LLMs

Authors: Makoto Sato
Subjects: cs.CL, q-bio.NC
Abstract URL: https://arxiv.org/abs/2505.10948
Pdf URL: https://arxiv.org/pdf/2505.10948
Copy Paste: [[2505.10948]] The Way We Prompt: Conceptual Blending, Neural Dynamics, and Prompt-Induced Transitions in LLMs(https://arxiv.org/abs/2505.10948)
Keywords: language model, llm, hallucination, prompt
Abstract: Large language models (LLMs), inspired by neuroscience, exhibit behaviors that often evoke a sense of personality and intelligence, yet the mechanisms behind these effects remain elusive. Here, we operationalize Conceptual Blending Theory (CBT) as an experimental framework, using prompt-based methods to reveal how LLMs blend and compress meaning. By systematically investigating Prompt-Induced Transitions (PIT) and Prompt-Induced Hallucinations (PIH), we uncover structural parallels and divergences between artificial and biological cognition. Our approach bridges linguistics, neuroscience, and empirical AI research, demonstrating that human-AI collaboration can serve as a living prototype for the future of cognitive science. This work proposes prompt engineering not just as a technical tool, but as a scientific method for probing the deep structure of meaning itself.
摘要：受神经科学启发的大型语言模型（LLMS）表现出经常引起人格和智力的行为，这些效果背后的机制仍然难以捉摸。在这里，我们使用基于及时的方法来揭示LLMS如何融合和压缩含义，将概念融合理论（CBT）作为实验框架。通过系统地研究迅速诱导的过渡（PIT）和迅速引起的幻觉（PIH），我们发现了人工和生物认知之间的结构相似之处和分歧。我们的方法桥接语言学，神经科学和经验AI研究，表明人类合作可以作为认知科学未来的生动原型。这项工作提出了促使工程不仅作为技术工具，而且是一种科学方法，用于探讨意义本身的深层结构。

Title: Illusion or Algorithm? Investigating Memorization, Emergence, and Symbolic Processing in In-Context Learning

Authors: Jingcheng Niu, Subhabrata Dutta, Ahmed Elshabrawy, Harish Tayyar Madabushi, Iryna Gurevych
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.11004
Pdf URL: https://arxiv.org/pdf/2505.11004
Copy Paste: [[2505.11004]] Illusion or Algorithm? Investigating Memorization, Emergence, and Symbolic Processing in In-Context Learning(https://arxiv.org/abs/2505.11004)
Keywords: language model
Abstract: Large-scale Transformer language models (LMs) trained solely on next-token prediction with web-scale data can solve a wide range of tasks after seeing just a few examples. The mechanism behind this capability, known as in-context learning (ICL), remains both controversial and poorly understood. Some studies argue that it is merely the result of memorizing vast amounts of data, while others contend that it reflects a fundamental, symbolic algorithmic development in LMs. In this work, we introduce a suite of investigative tasks and a novel method to systematically investigate ICL by leveraging the full Pythia scaling suite, including interim checkpoints that capture progressively larger amounts of training data. By carefully exploring ICL performance on downstream tasks and simultaneously conducting a mechanistic analysis of the residual stream's subspace, we demonstrate that ICL extends beyond mere "memorization" of the training corpus, yet does not amount to the implementation of an independent symbolic algorithm. Our results also clarify several aspects of ICL, including the influence of training dynamics, model capabilities, and elements of mechanistic interpretability. Overall, our work advances the understanding of ICL and its implications, offering model developers insights into potential improvements and providing AI security practitioners with a basis for more informed guidelines.
摘要：仅在仅使用Web尺度数据进行临时预测培训的大规模变压器语言模型（LMS）可以在仅看到几个示例后解决广泛的任务。这种能力背后的机制（称为封闭式学习（ICL））仍然有争议且知之甚少。一些研究认为，这仅仅是记住大量数据的结果，而另一些研究则认为它反映了LMS中基本的象征性算法发展。在这项工作中，我们介绍了一套调查任务和一种新颖的方法来系统地研究ICL，以利用完整的Pythia缩放套件，包括逐渐捕获大量训练数据的临时检查点。通过仔细探索下游任务上的ICL性能，并同时对残差流的子空间进行了机械分析，我们证明了ICL超出了训练语料库的“记忆”超出的，但并不等于实施独立的符号算法。我们的结果还阐明了ICL的几个方面，包括训练动力学，模型能力和机械解释性要素的影响。总体而言，我们的工作促进了对ICL及其含义的理解，从而为模型开发人员提供了对潜在改进的洞察力，并为AI安全从业人员提供了更明智的指导方针的基础。

Title: Review-Instruct: A Review-Driven Multi-Turn Conversations Generation Method for Large Language Models

Authors: Jiangxu Wu, Cong Wang, TianHuang Su, Jun Yang, Haozhi Lin, Chao Zhang, Ming Peng, Kai Shi, SongPan Yang, BinQing Pan, ZiXian Li, Ni Yang, ZhenYu Yang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.11010
Pdf URL: https://arxiv.org/pdf/2505.11010
Copy Paste: [[2505.11010]] Review-Instruct: A Review-Driven Multi-Turn Conversations Generation Method for Large Language Models(https://arxiv.org/abs/2505.11010)
Keywords: language model, llm, agent
Abstract: The effectiveness of large language models (LLMs) in conversational AI is hindered by their reliance on single-turn supervised fine-tuning (SFT) data, which limits contextual coherence in multi-turn dialogues. Existing methods for generating multi-turn dialogue data struggle to ensure both diversity and quality in instructions. To address this, we propose Review-Instruct, a novel framework that synthesizes multi-turn conversations through an iterative "Ask-Respond-Review" process involving three agent roles: a Candidate, multiple Reviewers, and a Chairman. The framework iteratively refines instructions by incorporating Reviewer feedback, enhancing dialogue diversity and difficulty. We construct a multi-turn dataset using the Alpaca dataset and fine-tune the LLaMA2-13B model. Evaluations on MT-Bench, MMLU-Pro, and Auto-Arena demonstrate significant improvements, achieving absolute gains of 2.9% on MMLU-Pro and 2% on MT-Bench compared to prior state-of-the-art models based on LLaMA2-13B. Ablation studies confirm the critical role of the Review stage and the use of multiple Reviewers in boosting instruction diversity and difficulty. Our work highlights the potential of review-driven, multi-agent frameworks for generating high-quality conversational data at scale.
摘要：大型语言模型（LLMS）在对话式AI中的有效性受到对单转监督的微调（SFT）数据的依赖，这限制了多转交对话中的上下文连贯性。生成多转化对话数据的现有方法努力确保指令的多样性和质量。为了解决这个问题，我们提出了审查教学，这是一个新颖的框架，通过迭代的“ Ask-Respond-Review”过程综合了涉及三个代理角色的过程：候选人，多个审阅者和主席。该框架通过结合审阅者的反馈，增强对话多样性和难度来迭代地完善说明。我们使用羊驼数据集构建一个多转移数据集，然后微调Llama2-13b型号。对MMLU-PRO的MT-BE，MMLU-PRO和AUTO-ARENA的评估表现出显着改善，与基于Llama2-13b的先前最新的模型相比，MMLU-PRO的绝对增长率为2.9 \％，MT基板的绝对增长率为2 \％。消融研究证实了审查阶段的关键作用以及多个审阅者在提高教学多样性和难度方面的使用。我们的工作突出了以审查驱动的多代理框架的潜力，该框架是在大规模生成高质量的对话数据的潜力。

Title: OntoURL: A Benchmark for Evaluating Large Language Models on Symbolic Ontological Understanding, Reasoning and Learning

Authors: Xiao Zhang, Huiyuan Lai, Qianru Meng, Johan Bos
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.11031
Pdf URL: https://arxiv.org/pdf/2505.11031
Copy Paste: [[2505.11031]] OntoURL: A Benchmark for Evaluating Large Language Models on Symbolic Ontological Understanding, Reasoning and Learning(https://arxiv.org/abs/2505.11031)
Keywords: language model, llm
Abstract: Large language models (LLMs) have demonstrated remarkable capabilities across a range of natural language processing tasks, yet their ability to process structured symbolic knowledge remains underexplored. To address this gap, we propose a taxonomy of LLMs' ontological capabilities and introduce OntoURL, the first comprehensive benchmark designed to systematically evaluate LLMs' proficiency in handling ontologies -- formal, symbolic representations of domain knowledge through concepts, relationships, and instances. Based on the proposed taxonomy, OntoURL systematically assesses three dimensions: understanding, reasoning, and learning through 15 distinct tasks comprising 58,981 questions derived from 40 ontologies across 8 domains. Experiments with 20 open-source LLMs reveal significant performance differences across models, tasks, and domains, with current LLMs showing proficiency in understanding ontological knowledge but substantial weaknesses in reasoning and learning tasks. These findings highlight fundamental limitations in LLMs' capability to process symbolic knowledge and establish OntoURL as a critical benchmark for advancing the integration of LLMs with formal knowledge representations.
摘要：大型语言模型（LLM）在一系列自然语言处理任务中表现出了显着的功能，但它们处理结构化符号知识的能力仍然没有得到充实的态度。为了解决这一差距，我们提出了LLMS的本体论能力的分类法，并引入Ontourl，这是第一个旨在系统地评估LLM在处理本体论的熟练程度的全面基准，即通过概念，关系和实例和实例和实例的正式，象征性地表示领域知识的符号表示。基于提议的分类法，Ontourl系统地评估了三个维度：通过15个不同的任务理解，推理和学习，其中包含58,981个问题，这些任务来自8个领域的40个本体。具有20个开源LLM的实验揭示了模型，任务和域之间的显着性能差异，当前的LLM表现出精通理解本体论知识的水平，但在推理和学习任务方面存在很大的弱点。这些发现凸显了LLMS处理符号知识的能力中的基本局限性，并确立OnTourl作为将LLM与正式知识表示的集成的关键基准。

Title: BLEUBERI: BLEU is a surprisingly effective reward for instruction following

Authors: Yapei Chang, Yekyung Kim, Michael Krumdick, Amir Zadeh, Chuan Li, Chris Tanner, Mohit Iyyer
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.11080
Pdf URL: https://arxiv.org/pdf/2505.11080
Copy Paste: [[2505.11080]] BLEUBERI: BLEU is a surprisingly effective reward for instruction following(https://arxiv.org/abs/2505.11080)
Keywords: language model, llm
Abstract: Reward models are central to aligning LLMs with human preferences, but they are costly to train, requiring large-scale human-labeled preference data and powerful pretrained LLM backbones. Meanwhile, the increasing availability of high-quality synthetic instruction-following datasets raises the question: can simpler, reference-based metrics serve as viable alternatives to reward models during RL-based alignment? In this paper, we show first that BLEU, a basic string-matching metric, surprisingly matches strong reward models in agreement with human preferences on general instruction-following datasets. Based on this insight, we develop BLEUBERI, a method that first identifies challenging instructions and then applies Group Relative Policy Optimization (GRPO) using BLEU directly as the reward function. We demonstrate that BLEUBERI-trained models are competitive with models trained via reward model-guided RL across four challenging instruction-following benchmarks and three different base language models. A human evaluation further supports that the quality of BLEUBERI model outputs is on par with those from reward model-aligned models. Moreover, BLEUBERI models generate outputs that are more factually grounded than competing methods. Overall, we show that given access to high-quality reference outputs (easily obtained via existing instruction-following datasets or synthetic data generation), string matching-based metrics are cheap yet effective proxies for reward models during alignment. We release our code and data at this https URL.
摘要：奖励模型对于将LLM与人类偏好保持一致，但训练费用很高，需要大规模的人体标记的偏好数据和强大的预算LLM骨架。同时，高质量合成指令遵循数据集的可用性提高了一个问题：在基于RL的一致性期间，更简单，基于参考的指标可以作为奖励模型的可行替代方案吗？在本文中，我们首先证明了BLEU是一种基本的弦乐指标，令人惊讶地匹配了强大的奖励模型，该模型与人的偏好符合一般指导遵循数据集的偏好。基于这种见解，我们开发了Bleuberi，该方法首先识别有挑战性的指示，然后直接使用BLEU直接应用小组相对策略优化（GRPO）作为奖励功能。我们证明，Bleuberi受过训练的模型具有通过奖励模型引导的RL训练的模型竞争，这些模型在四个挑战性的指导下基准和三种不同的基本语言模型中都具有竞争力。人类评估进一步支持，即布鲁贝里模型输出的质量与奖励模型对齐模型的质量相当。此外，Bleuberi模型产生的输出比竞争方法更为基础。总体而言，我们表明，给定对高质量参考输出的访问（通过现有的指令数据集或合成数据生成很容易获得），基于字符串匹配的指标在对齐过程中是廉价但有效的奖励模型代理。我们在此HTTPS URL上发布代码和数据。
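
The idea in the abstract above, scoring sampled responses against a reference with a string-matching metric and feeding those scores to GRPO, can be sketched as follows. This is a simplified illustration, not the authors' code: `simple_bleu` is a hypothetical BLEU-like score (clipped n-gram precision up to bigrams, add-one smoothing, brevity penalty) rather than a reference BLEU implementation, and the advantage is the usual group-relative normalization (reward minus group mean, divided by group standard deviation).

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, reference, max_n=2):
    """Simplified BLEU-like score in [0, 1]; whitespace tokenization is an assumption."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        c, r = ngrams(cand, n), ngrams(ref, n)
        overlap = sum(min(count, r[gram]) for gram, count in c.items())
        total = sum(c.values())
        # Add-one smoothing keeps the logarithm finite when nothing matches.
        log_precision += math.log((overlap + 1) / (total + 1)) / max_n
    brevity = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity * math.exp(log_precision)


def group_relative_advantages(rewards):
    """Normalize rewards within one group of samples for the same prompt."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    if std == 0:
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]


reference = "the cat sat on the mat"
samples = ["the cat sat on the mat", "a dog ran", "the cat sat on a mat"]
rewards = [simple_bleu(s, reference) for s in samples]
advantages = group_relative_advantages(rewards)
```

The sample that matches the reference exactly receives the largest advantage, and the advantages within a group sum to zero, so the policy update only depends on how samples rank relative to each other.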

Title: Towards Better Evaluation for Generated Patent Claims

Authors: Lekang Jiang, Pascal A Scherz, Stephan Goetz
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.11095
Pdf URL: https://arxiv.org/pdf/2505.11095
Copy Paste: [[2505.11095]] Towards Better Evaluation for Generated Patent Claims(https://arxiv.org/abs/2505.11095)
Keywords: language model, llm
Abstract: Patent claims define the scope of protection and establish the legal boundaries of an invention. Drafting these claims is a complex and time-consuming process that usually requires the expertise of skilled patent attorneys, which can form a large access barrier for many small enterprises. To solve these challenges, researchers have investigated the use of large language models (LLMs) for automating patent claim generation. However, existing studies highlight inconsistencies between automated evaluation metrics and human expert assessments. To bridge this gap, we introduce Patent-CE, the first comprehensive benchmark for evaluating patent claims. Patent-CE includes comparative claim evaluations annotated by patent experts, focusing on five key criteria: feature completeness, conceptual clarity, terminology consistency, logical linkage, and overall quality. Additionally, we propose PatClaimEval, a novel multi-dimensional evaluation method specifically designed for patent claims. Our experiments demonstrate that PatClaimEval achieves the highest correlation with human expert evaluations across all assessment criteria among all tested metrics. This research provides the groundwork for more accurate evaluations of automated patent claim generation systems.
摘要：专利主张定义了保护的范围，并建立了发明的法律界限。起草这些主张是一个复杂且耗时的过程，通常需要熟练的专利律师的专业知识，这可以为许多小型企业形成大型访问障碍。为了解决这些挑战，研究人员已经调查了使用大型语言模型（LLMS）来自动化专利索赔生成。但是，现有的研究突出了自动评估指标与人类专家评估之间的矛盾。为了弥合这一差距，我们介绍了专利权，这是评估专利主张的第一个全面基准。专利CE包括专利专家注释的比较主张评估，重点介绍五个关键标准：功能完整性，概念清晰度，术语一致性，逻辑联系和整体质量。此外，我们提出了一种专门为专利主张设计的新型多维评估方法Patclaimeval。我们的实验表明，Patclaimeval与所有测试指标的所有评估标准中的人类专家评估达到了最高的相关性。这项研究为对自动专利索赔生成系统的更准确评估提供了基础。

Title: Scaling Reasoning can Improve Factuality in Large Language Models

Authors: Mike Zhang, Johannes Bjerva, Russa Biswas
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.11140
Pdf URL: https://arxiv.org/pdf/2505.11140
Copy Paste: [[2505.11140]] Scaling Reasoning can Improve Factuality in Large Language Models(https://arxiv.org/abs/2505.11140)
Keywords: language model, llm
Abstract: Recent studies on large language model (LLM) reasoning capabilities have demonstrated promising improvements in model performance by leveraging a lengthy thinking process and additional computational resources during inference, primarily in tasks involving mathematical reasoning (Muennighoff et al., 2025). However, it remains uncertain if longer reasoning chains inherently enhance factual accuracy, particularly beyond mathematical contexts. In this work, we thoroughly examine LLM reasoning within complex open-domain question-answering (QA) scenarios. We initially distill reasoning traces from advanced, large-scale reasoning models (QwQ-32B and DeepSeek-R1-671B), then fine-tune a variety of models ranging from smaller, instruction-tuned variants to larger architectures based on Qwen2.5. To enrich reasoning traces, we introduce factual information from knowledge graphs in the form of paths into our reasoning traces. Our experimental setup includes four baseline approaches and six different instruction-tuned models evaluated across a benchmark of six datasets, encompassing over 22.6K questions. Overall, we carry out 168 experimental runs and analyze approximately 1.7 million reasoning traces. Our findings indicate that, within a single run, smaller reasoning models achieve noticeable improvements in factual accuracy compared to their original instruction-tuned counterparts. Moreover, our analysis demonstrates that with added test-time compute and token budgets, factual accuracy consistently improves by 2-8%, further confirming the effectiveness of test-time scaling for enhancing performance and consequently improving reasoning accuracy in open-domain QA tasks. We release all the experimental artifacts for further research.
摘要：关于大语言模型（LLM）推理能力的最新研究表明，通过利用冗长的思维过程和推断期间的其他计算资源，主要是在涉及数学推理的任务中，可以改善模型性能（Muennighoff等，2025年）。但是，仍然不确定较长的推理链是否固有地提高了事实准确性，尤其是在数学环境之外。在这项工作中，我们在复杂的开放域问题（QA）方案中彻底检查了LLM推理。我们最初是从高级的大规模推理模型（QWQ-32B和DeepSeek-R1-671b）中提取推理痕迹，然后微调各种型号，从较小的，指导调节的变体到基于QWEN2.5的较大体系结构。为了丰富推理痕迹，我们将以路径形式的知识图引入了事实信息，以将其推理到我们的推理轨迹中。我们的实验设置包括四种基线方法和六个不同的指令调整模型，这些模型在六个数据集的基准中评估，其中包括超过22.6k的问题。总体而言，我们进行了168次实验跑步，并分析了大约170万个推理轨迹。我们的发现表明，与原始指导调节的对应物相比，在一次运行中，较小的推理模型在事实准确性方面取得了明显的改进。此外，我们的分析表明，添加测试时间计算和令牌预算的事实准确性始终提高2-8％，进一步证实了测试时间缩放对提高性能的有效性，并因此提高了开放量质量质量质量质量质量质量固定任务的推理准确性。我们释放所有实验工件以进行进一步研究。

Title: SoLoPO: Unlocking Long-Context Capabilities in LLMs via Short-to-Long Preference Optimization

Authors: Huashan Sun, Shengyi Liao, Yansen Han, Yu Bai, Yang Gao, Cheng Fu, Weizhou Shen, Fanqi Wan, Ming Yan, Ji Zhang, Fei Huang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.11166
Pdf URL: https://arxiv.org/pdf/2505.11166
Copy Paste: [[2505.11166]] SoLoPO: Unlocking Long-Context Capabilities in LLMs via Short-to-Long Preference Optimization(https://arxiv.org/abs/2505.11166)
Keywords: language model, llm, long context
Abstract: Despite advances in pretraining with extended context lengths, large language models (LLMs) still face challenges in effectively utilizing real-world long-context information, primarily due to insufficient long-context alignment caused by data quality issues, training inefficiencies, and the lack of well-designed optimization objectives. To address these limitations, we propose a framework named Short-to-Long Preference Optimization (SoLoPO), decoupling long-context preference optimization (PO) into two components: short-context PO and short-to-long reward alignment (SoLo-RA), supported by both theoretical and empirical evidence. Specifically, short-context PO leverages preference pairs sampled from short contexts to enhance the model's contextual knowledge utilization ability. Meanwhile, SoLo-RA explicitly encourages reward score consistency utilization for the responses when conditioned on both short and long contexts that contain identical task-relevant information. This facilitates transferring the model's ability to handle short contexts into long-context scenarios. SoLoPO is compatible with mainstream preference optimization algorithms, while substantially improving the efficiency of data construction and training processes. Experimental results show that SoLoPO enhances all these algorithms with respect to stronger length and domain generalization abilities across various long-context benchmarks, while achieving notable improvements in both computational and memory efficiency.
摘要：尽管在延长上下文长度方面进行了预测，但大型语言模型（LLMS）仍然面临着有效利用现实世界的长篇小说信息的挑战，这主要是由于数据质量问题引起的长期延长不足，培训效率低下以及缺乏精心设计的优化目标。为了解决这些限制，我们提出了一个名为$ \ textbf {s} $ h $ \ textbf {o} $ rt-to- $ \ $ \ $ \ textbf {lo} $ ng $ \ ng $ \ textbf {p} $两个组成部分：短上下文PO和短期奖励对准（Solo-RA），并得到理论和经验证据的支持。具体而言，短篇小说po利用从短上下文采样的偏好对，以增强模型的上下文知识利用能力。同时，Solo-RA明确鼓励在包含相同任务相同信息的短篇小说和长篇小说条件下进行响应的奖励分数一致性利用。这有助于将模型处理短上下文的能力转移到长篇小说方案中。 Solopo与主流偏好优化算法兼容，同时大大提高了数据构建和培训过程的效率。实验结果表明，Solopo在各种长篇文化基准的长度和域的概括能力方面增强了所有这些算法，同时在计算和记忆效率方面都取得了显着提高。

Title: Low-Resource Language Processing: An OCR-Driven Summarization and Translation Pipeline

Authors: Hrishit Madhavi, Jacob Cherian, Yuvraj Khamkar, Dhananjay Bhagat
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.11177
Pdf URL: https://arxiv.org/pdf/2505.11177
Copy Paste: [[2505.11177]] Low-Resource Language Processing: An OCR-Driven Summarization and Translation Pipeline(https://arxiv.org/abs/2505.11177)
Keywords: language model
Abstract: This paper presents an end-to-end suite for multilingual information extraction and processing from image-based documents. The system uses Optical Character Recognition (Tesseract) to extract text in languages such as English, Hindi, and Tamil, and then a pipeline involving large language model APIs (Gemini) for cross-lingual translation, abstractive summarization, and re-translation into a target language. Additional modules add sentiment analysis (TensorFlow), topic classification (Transformers), and date extraction (Regex) for better document comprehension. Made available in an accessible Gradio interface, the current research shows a real-world application of libraries, models, and APIs to close the language gap and enhance access to information in image media across different linguistic environments.
摘要：本文提出了一个端到端套件，用于从基于图像的文档中提取和处理多种语言。该系统使用光学特征识别（Tesseract）用英语，印地语和泰米尔语等语言提取文本，然后使用涉及大语言模型API（Gemini）的管道进行交叉翻译，抽象性摘要，并将其重新翻译为目标语言。其他模块添加了情感分析（TensorFlow），主题分类（变形金刚）和日期提取（REGEX），以更好地理解文档。当前的研究在可访问的Gradio接口中提供，显示了库，模型和API的现实应用，以缩小语言差距，并增强跨不同语言环境中图像媒体中信息的访问

Title: HAPO: Training Language Models to Reason Concisely via History-Aware Policy Optimization

Authors: Chengyu Huang, Zhengxin Zhang, Claire Cardie
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.11225
Pdf URL: https://arxiv.org/pdf/2505.11225
Copy Paste: [[2505.11225]] HAPO: Training Language Models to Reason Concisely via History-Aware Policy Optimization(https://arxiv.org/abs/2505.11225)
Keywords: language model, llm
Abstract: While scaling the length of responses at test-time has been shown to markedly improve the reasoning abilities and performance of large language models (LLMs), it often results in verbose outputs and increases inference cost. Prior approaches for efficient test-time scaling, typically using universal budget constraints or query-level length optimization, do not leverage historical information from previous encounters with the same problem during training. We hypothesize that this limits their ability to progressively make solutions more concise over time. To address this, we present History-Aware Policy Optimization (HAPO), which keeps track of a history state (e.g., the minimum length over previously generated correct responses) for each problem. HAPO employs a novel length reward function based on this history state to incentivize the discovery of correct solutions that are more concise than those previously found. Crucially, this reward structure avoids overly penalizing shorter incorrect responses with the goal of facilitating exploration towards more efficient solutions. By combining this length reward with a correctness reward, HAPO jointly optimizes for correctness and efficiency. We use HAPO to train DeepSeek-R1-Distill-Qwen-1.5B, DeepScaleR-1.5B-Preview, and Qwen-2.5-1.5B-Instruct, and evaluate HAPO on several math benchmarks that span various difficulty levels. Experiment results demonstrate that HAPO effectively induces LLMs' concise reasoning abilities, producing length reductions of 33-59% with accuracy drops of only 2-5%.
摘要：虽然显示测试时间的响应长度已被证明可以显着提高大语模型（LLMS）的推理能力和性能，但它通常会导致详细的输出并增加推理成本。通常使用通用预算约束或查询级别长度优化的有效测试时间缩放的先前方法，不会在培训期间从以前的相遇中相遇的相遇中利用历史信息。我们假设这限制了他们随着时间的推移使解决方案更加简洁的能力。为了解决这个问题，我们提出了历史吸引的策略优化（HAPO），该策略优化跟踪每个问题的历史状态（例如，超过先前生成的正确响应的最小长度）。 Hapo基于这种历史状态采用了新的长度奖励功能，激励发现正确的解决方案，这些解决方案比以前发现的更简洁。至关重要的是，这种奖励结构避免了过度惩罚较短的错误反应，目的是促进对更有效的解决方案的探索。通过将这一长度奖励与正确的奖励相结合，HAPO共同优化了正确性和效率。我们使用HAPO来训练DeepSeek-R1-Distill-Qwen-1.5b，DeepScaler-1.5b-Preview和Qwen-2.5-1.5b-instruct，并在几个跨越各种难度级别的数学基准上评估HAPO。实验结果表明，HAPO有效地诱导了LLMS的简洁推理能力，而精度下降仅为2-5％，长度降低了33-59％。
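
The history-aware reward described in the abstract above can be sketched roughly as follows. The abstract gives the ingredients (a per-problem history state such as the minimum length of past correct responses, a length reward relative to that state, no excessive penalty for shorter incorrect responses, and a sum with a correctness reward) but not the exact formula, so the clipped linear shape, the `weight` parameter, and all names here are assumptions made for illustration.

```python
class LengthHistory:
    """Tracks, per problem, the shortest correct response length seen so far."""

    def __init__(self):
        self.best = {}

    def length_reward(self, problem_id, length, correct):
        """Hypothetical length reward in [-1, 1] relative to the history state."""
        h = self.best.get(problem_id)
        if h is None or h <= 0:
            return 0.0  # no correct response seen yet: no length signal
        relative = max(-1.0, min(1.0, (h - length) / h))
        if correct:
            return relative  # shorter than history is rewarded, longer is penalized
        # Incorrect responses get no bonus, and are not penalized for being short.
        return min(0.0, relative)

    def update(self, problem_id, length, correct):
        """Only correct responses can lower the stored minimum length."""
        if correct:
            self.best[problem_id] = min(length, self.best.get(problem_id, length))


def total_reward(history, problem_id, length, correct, weight=0.5):
    """Correctness reward plus weighted length reward; then update the history."""
    reward = (1.0 if correct else 0.0) + weight * history.length_reward(
        problem_id, length, correct)
    history.update(problem_id, length, correct)
    return reward


history = LengthHistory()
r1 = total_reward(history, "p1", 100, True)    # first correct answer sets the history
r2 = total_reward(history, "p1", 50, True)     # shorter correct answer earns a bonus
r3 = total_reward(history, "p1", 25, False)    # short but wrong: no extra penalty
r4 = total_reward(history, "p1", 100, False)   # long and wrong: penalized
```

Computing the reward before updating the history means a response is always compared against what was known before it was generated.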

Title: Semantic Caching of Contextual Summaries for Efficient Question-Answering with Language Models

Authors: Camille Couturier, Spyros Mastorakis, Haiying Shen, Saravan Rajmohan, Victor Rühle
Subjects: cs.CL, cs.AI, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2505.11271
Pdf URL: https://arxiv.org/pdf/2505.11271
Copy Paste: [[2505.11271]] Semantic Caching of Contextual Summaries for Efficient Question-Answering with Language Models(https://arxiv.org/abs/2505.11271)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Large Language Models (LLMs) are increasingly deployed across edge and cloud platforms for real-time question-answering and retrieval-augmented generation. However, processing lengthy contexts in distributed systems incurs high computational overhead, memory usage, and network bandwidth. This paper introduces a novel semantic caching approach for storing and reusing intermediate contextual summaries, enabling efficient information reuse across similar queries in LLM-based QA workflows. Our method reduces redundant computations by up to 50-60% while maintaining answer accuracy comparable to full document processing, as demonstrated on NaturalQuestions, TriviaQA, and a synthetic ArXiv dataset. This approach balances computational cost and response quality, critical for real-time AI assistants.
摘要：大型语言模型（LLM）越来越多地跨越边缘和云平台，以实时提问和检索效果。但是，在分布式系统中处理冗长的上下文会导致高计算开销，内存使用情况和网络带宽。本文介绍了一种新型的语义缓存方法，用于存储和重复中间上下文摘要，从而使基于LLM的QA工作流中的类似查询有效地重复使用。我们的方法将冗余计算降低了50-60％，同时保持答案准确性可与完整文档处理相当，如AnturalQuestions，Triviaqa和合成ARXIV数据集所示。这种方法平衡了计算成本和响应质量，这对于实时AI助手至关重要。
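
The caching idea in the abstract above, reusing a stored contextual summary when a new query is close enough to an earlier one, can be sketched with a toy similarity measure. This is only an illustration under stated assumptions: the bag-of-words `embed` function stands in for a real embedding model, the 0.8 threshold is arbitrary, and the class and method names are hypothetical rather than the authors' interface.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: bag-of-words counts (a stand-in for a real encoder)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    """Stores (query embedding, summary) pairs and serves near-duplicate queries."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold
        self.entries = []

    def store(self, query, summary):
        self.entries.append((embed(query), summary))

    def lookup(self, query):
        """Return the cached summary of the most similar query, or None on a miss."""
        q = embed(query)
        best = max(self.entries, key=lambda e: cosine(q, e[0]), default=None)
        if best is not None and cosine(q, best[0]) >= self.threshold:
            return best[1]
        return None


cache = SemanticCache()
cache.store("what is the capital of france", "Paris is the capital of France.")
hit = cache.lookup("what is the capital of france ?")
miss = cache.lookup("how do transformers work")
```

On a miss the caller would summarize the retrieved context with the LLM and call `store`, so that later similar queries skip that computation.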

Title: Search and Refine During Think: Autonomous Retrieval-Augmented Reasoning of LLMs

Authors: Yaorui Shi, Shihan Li, Chang Wu, Zhiyuan Liu, Junfeng Fang, Hengxing Cai, An Zhang, Xiang Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.11277
Pdf URL: https://arxiv.org/pdf/2505.11277
Copy Paste: [[2505.11277]] Search and Refine During Think: Autonomous Retrieval-Augmented Reasoning of LLMs(https://arxiv.org/abs/2505.11277)
Keywords: language model, llm
Abstract: Large language models have demonstrated impressive reasoning capabilities but are inherently limited by their knowledge reservoir. Retrieval-augmented reasoning mitigates this limitation by allowing LLMs to query external resources, but existing methods often retrieve irrelevant or noisy information, hindering accurate reasoning. In this paper, we propose AutoRefine, a reinforcement learning post-training framework that adopts a new "search-and-refine-during-think" paradigm. AutoRefine introduces explicit knowledge refinement steps between successive search calls, enabling the model to iteratively filter, distill, and organize evidence before generating an answer. Furthermore, we incorporate tailored retrieval-specific rewards alongside answer correctness rewards using group relative policy optimization. Experiments on single-hop and multi-hop QA benchmarks demonstrate that AutoRefine significantly outperforms existing approaches, particularly in complex, multi-hop reasoning scenarios. Detailed analysis shows that AutoRefine issues frequent, higher-quality searches and synthesizes evidence effectively.
摘要：大型语言模型表现出了令人印象深刻的推理能力，但本质上受其知识库的限制。通过允许LLM查询外部资源来检索启动的推理可以减轻这种限制，但是现有方法通常会检索无关紧要或嘈杂的信息，从而阻碍了准确的推理。在本文中，我们提出了AutoreFine，这是一种增强培训后的培训后框架，采用了新的``搜索''d-Refine-distring-Inkink''范式。 Autorefine在连续的搜索呼叫之间介绍了明确的知识完善步骤，使模型能够迭代过滤，提炼和组织证据，然后再产生答案。此外，我们使用小组相对策略优化融合了量身定制的检索特定奖励，并将其答案奖励。单跳和多跳QA基准测试的实验表明，自动蛋白明显优于现有方法，尤其是在复杂的，多跳的推理方案中。详细的分析表明，自动蛋白问题经常进行，高质量的搜索并有效地综合了证据。

Title: Temporal fine-tuning for early risk detection

Authors: Horacio Thompson, Esaú Villatoro-Tello, Manuel Montes-y-Gómez, Marcelo Errecalde
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.11280
Pdf URL: https://arxiv.org/pdf/2505.11280
Copy Paste: [[2505.11280]] Temporal fine-tuning for early risk detection(https://arxiv.org/abs/2505.11280)
Keywords: prompt
Abstract: Early Risk Detection (ERD) on the Web aims to promptly identify users facing social and health issues. Users are analyzed post-by-post, and it is necessary to guarantee correct and quick answers, which is particularly challenging in critical scenarios. ERD involves optimizing classification precision and minimizing detection delay. Standard classification metrics may not suffice, so specific metrics such as ERDE(theta), which explicitly consider precision and delay, are used instead. Current research focuses on applying a multi-objective approach, prioritizing classification performance and establishing a separate criterion for decision time. In this work, we propose a completely different strategy, temporal fine-tuning, which allows tuning transformer-based models by explicitly incorporating time within the learning process. Our method allows us to analyze complete user post histories, tune models considering different contexts, and evaluate training performance using temporal metrics. We evaluated our proposal in the depression and eating disorders tasks for the Spanish language, achieving competitive results compared to the best models of MentalRiskES 2023. We found that temporal fine-tuning optimized decisions considering context and time progress. In this way, by properly taking advantage of the power of transformers, it is possible to address ERD by combining precision and speed as a single objective.
摘要：网络上的早期风险检测（ERD）旨在迅速确定面临社会和健康问题的用户。对用户进行逐个后的分析，有必要保证正确和快速的答案，这在关键场景中尤其具有挑战性。 ERD涉及优化分类精度和最小化检测延迟。标准分类指标可能不够，诉诸于特定的指标，例如ERDE（THETA），这些指标明确考虑精度和延迟。当前的研究重点是采用多目标方法，确定分类绩效的优先级，并为决策时间建立单独的标准。在这项工作中，我们提出了一种完全不同的策略，即时间微调，该策略可以通过将时间明确地纳入学习过程来调整基于变压器的模型。我们的方法使我们能够分析完整的用户发布历史记录，考虑不同上下文的模型，并使用时间指标评估培训性能。我们评估了我们在抑郁症和饮食失调方面的提议，与2023年精神风险的最佳模型相比，取得了竞争成果。我们发现，考虑到上下文和时间进展，暂时的微调优化决策。这样，通过正确利用变压器的力量，可以通过将精度和速度结合为一个目标来解决ERD。
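Note: the abstract above refers to ERDE(theta), a metric that penalizes late correct detections. The sketch below assumes the commonly used per-user definition, in which a true positive issued after `k` posts costs `c_tp` scaled by a sigmoid latency factor centered at the threshold `o`. The function names and the default cost values are illustrative assumptions and are not taken from the paper.

```python
import math


def latency_cost(k, o):
    """Sigmoid latency factor 1 - 1 / (1 + exp(k - o)), in a stable form.

    k: number of posts read before the decision; o: the ERDE threshold.
    """
    x = k - o
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def erde(decision, truth, k, o, c_fp=0.1, c_fn=1.0, c_tp=1.0):
    """Per-user early-risk detection error (default costs are assumptions)."""
    if decision and truth:
        return latency_cost(k, o) * c_tp  # correct alarm, penalized if late
    if decision and not truth:
        return c_fp  # false alarm
    if truth:
        return c_fn  # missed at-risk user
    return 0.0  # correct non-alarm


def mean_erde(records, o):
    """Average ERDE over (decision, truth, k) tuples."""
    return sum(erde(d, t, k, o) for d, t, k in records) / len(records)
```

A correct alarm raised exactly at the threshold (`k == o`) costs half of `c_tp`; earlier alarms cost less and later ones approach the full cost, which is how the metric folds delay into a single score.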

Title: XtraGPT: LLMs for Human-AI Collaboration on Controllable Academic Paper Revision

Authors: Nuo Chen, Andre Lin HuiKai, Jiaying Wu, Junyi Hou, Zining Zhang, Qian Wang, Xidong Wang, Bingsheng He
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.11336
Pdf URL: https://arxiv.org/pdf/2505.11336
Copy Paste: [[2505.11336]] XtraGPT: LLMs for Human-AI Collaboration on Controllable Academic Paper Revision(https://arxiv.org/abs/2505.11336)
Keywords: language model, gpt, llm, prompt
Abstract: Despite the growing adoption of large language models (LLMs) in academic workflows, their capabilities remain limited when it comes to supporting high-quality scientific writing. Most existing systems are designed for general-purpose scientific text generation and fail to meet the sophisticated demands of research communication beyond surface-level polishing, such as conceptual coherence across sections. Furthermore, academic writing is inherently iterative and revision-driven, a process not well supported by direct prompting-based paradigms. To address these scenarios, we propose a human-AI collaboration framework for academic paper revision. We first introduce a comprehensive dataset of 7,040 research papers from top-tier venues annotated with over 140,000 instruction-response pairs that reflect realistic, section-level scientific revisions. Building on the dataset, we develop XtraGPT, the first suite of open-source LLMs, ranging from 1.5B to 14B parameters, designed to provide context-aware, instruction-guided writing assistance. Extensive experiments validate that XtraGPT significantly outperforms same-scale baselines and approaches the quality of proprietary systems. Both automated preference assessments and human evaluations confirm the effectiveness of our models in improving scientific drafts.
摘要：尽管大型语言模型（LLM）在学术工作流程中的采用越来越多，但在支持高质量的科学写作方面，它们的能力仍然有限。大多数现有系统都是为了通用科学文本生成而设计的，并且无法满足超出表面层面抛光范围的研究交流的复杂需求，例如各个部分的概念连贯性。此外，学术写作本质上是迭代和修订驱动的，这一过程不受直接提示的范式的很好支持。为了解决这些情况，我们为学术论文修订提出了人类协作框架。我们首先介绍了一个全面的数据集，其中包括7,040篇研究论文，这些研究论文来自注释的顶级场所，其中包含超过140,000个指令 - 响应对，反映了现实的，部分级的科学修订。在数据集中，我们开发了Xtragpt，这是第一套开源LLM的套件，旨在提供上下文感知的，指导引导的写作帮助，范围从1.5B到14B参数。广泛的实验验证了Xtragpt显着胜过相同的基准，并接近专有系统的质量。自动偏好评估和人类评估都证实了我们模型在改善科学草案方面的有效性。

Title: Benchmarking Critical Questions Generation: A Challenging Reasoning Task for Large Language Models

Authors: Banca Calvo Figueras, Rodrigo Agerri
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.11341
Pdf URL: https://arxiv.org/pdf/2505.11341
Copy Paste: [[2505.11341]] Benchmarking Critical Questions Generation: A Challenging Reasoning Task for Large Language Models(https://arxiv.org/abs/2505.11341)
Keywords: language model, llm
Abstract: The task of Critical Questions Generation (CQs-Gen) aims to foster critical thinking by enabling systems to generate questions that expose assumptions and challenge the reasoning in arguments. Despite growing interest in this area, progress has been hindered by the lack of suitable datasets and automatic evaluation standards. This work presents a comprehensive approach to support the development and benchmarking of systems for this task. We construct the first large-scale manually-annotated dataset. We also investigate automatic evaluation methods and identify a reference-based technique using large language models (LLMs) as the strategy that best correlates with human judgments. Our zero-shot evaluation of 11 LLMs establishes a strong baseline while showcasing the difficulty of the task. Data, code, and a public leaderboard are provided to encourage further research not only in terms of model performance, but also to explore the practical benefits of CQs-Gen for both automated reasoning and human critical thinking.
摘要：关键问题的任务生成（CQS-GEN）旨在通过使系统能够产生揭示假设并挑战参数中推理的问题来促进批判性思维。尽管对这一领域的兴趣日益增加，但由于缺乏合适的数据集和自动评估标准而阻碍了进步。这项工作提出了一种全面的方法来支持该任务系统的开发和基准测试。我们构建了第一个大规模手动通知的数据集。我们还研究了自动评估方法，并使用大语言模型（LLM）确定基于参考的技术，作为与人类判断最相关的策略。我们对11个LLM的零射门评估在展示任务困难的同时建立了强大的基线。提供数据，代码和公共排行榜，不仅鼓励在模型绩效方面进行进一步的研究，还可以探索CQS-GEN对自动推理和人类批判性思维的实际好处。

Title: LegoSLM: Connecting LLM with Speech Encoder using CTC Posteriors

Authors: Rao Ma, Tongzhou Chen, Kartik Audhkhasi, Bhuvana Ramabhadran
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2505.11352
Pdf URL: https://arxiv.org/pdf/2505.11352
Copy Paste: [[2505.11352]] LegoSLM: Connecting LLM with Speech Encoder using CTC Posteriors(https://arxiv.org/abs/2505.11352)
Keywords: language model, llm, prompt
Abstract: Recently, large-scale pre-trained speech encoders and Large Language Models (LLMs) have been released, which show state-of-the-art performance on a range of spoken language processing tasks including Automatic Speech Recognition (ASR). To effectively combine both models for better performance, continuous speech prompts and ASR error correction have been adopted. However, these methods are prone to suboptimal performance or are inflexible. In this paper, we propose a new paradigm, LegoSLM, that bridges speech encoders and LLMs using the ASR posterior matrices. The speech encoder is trained to generate Connectionist Temporal Classification (CTC) posteriors over the LLM vocabulary, which are used to reconstruct pseudo-audio embeddings by computing a weighted sum of the LLM input embeddings. These embeddings are concatenated with text embeddings in the LLM input space. Using the well-performing USM and Gemma models as an example, we demonstrate that our proposed LegoSLM method yields good performance on both ASR and speech translation tasks. By connecting USM with Gemma models, we can get an average of 49% WERR over the USM-CTC baseline on 8 MLS testsets. The trained model also exhibits modularity in a range of settings -- after fine-tuning the Gemma model weights, the speech encoder can be switched and combined with the LLM in a zero-shot fashion. Additionally, we propose to control the decode-time influence of the USM and LLM using a softmax temperature, which shows effectiveness in domain adaptation.
摘要：最近，已经发布了大规模的预训练的语音编码器和大型语言模型（LLMS），这些语言模型（LLMS）在一系列口语处理任务上显示了最先进的性能，包括自动语音识别（ASR）。为了有效地结合这两个模型以提高性能，连续的语音提示和ASR误差校正已被采用。但是，这些方法容易出现次优性能或不灵活。在本文中，我们提出了一种新的范式LEGOSLM，该范式使用ASR后矩阵桥接语音编码和LLM。训练语音编码器可以通过LLM词汇生成连接派时间分类（CTC）后期，这些后词用于通过计算LLM输入嵌入的加权总和来重建伪audio嵌入。这些嵌入与LLM输入空间中的文本嵌入相连。以表现良好的USM和Gemma模型为例，我们证明我们提出的LEGOSLM方法在ASR和语音翻译任务上都能产生良好的性能。通过将USM与Gemma型号联系起来，我们可以在8个MLS测试集上平均获得USM-CTC基线的49％WERR。受过训练的模型还显示了各种设置的模块化 - 在微调了Gemma模型权重之后，可以以零拍的方式切换语音编码器并与LLM结合。此外，我们建议使用软磁性温度来控制USM和LLM的解码时间影响，从而显示出域适应性的有效性。
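Note: the abstract above describes reconstructing pseudo-audio embeddings as a posterior-weighted sum of the LLM input embeddings, with a softmax temperature controlling how sharp the posteriors are. The sketch below illustrates only that arithmetic in plain Python. The function names, the toy shapes, and the omission of any special handling for the CTC blank symbol are assumptions made for illustration; this is not the LegoSLM code.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over one frame of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def pseudo_audio_embeddings(ctc_logits, embedding_table, temperature=1.0):
    """Map per-frame logits (T x V) to embeddings (T x D).

    Each output row is sum_v posterior[v] * embedding_table[v], i.e. the
    weighted sum of LLM input embeddings described in the abstract.
    """
    dim = len(embedding_table[0])
    frames = []
    for frame_logits in ctc_logits:
        posterior = softmax(frame_logits, temperature)
        frames.append([
            sum(p * row[d] for p, row in zip(posterior, embedding_table))
            for d in range(dim)
        ])
    return frames
```

With a low temperature the posterior approaches one-hot and each frame collapses onto a single token embedding; a higher temperature spreads weight across the vocabulary, so more of the encoder's uncertainty reaches the LLM.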

Title: GuideBench: Benchmarking Domain-Oriented Guideline Following for LLM Agents

Authors: Lingxiao Diao, Xinyue Xu, Wanxuan Sun, Cheng Yang, Zhuosheng Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.11368
Pdf URL: https://arxiv.org/pdf/2505.11368
Copy Paste: [[2505.11368]] GuideBench: Benchmarking Domain-Oriented Guideline Following for LLM Agents(https://arxiv.org/abs/2505.11368)
Keywords: language model, llm, agent
Abstract: Large language models (LLMs) have been widely deployed as autonomous agents capable of following user instructions and making decisions in real-world applications. Previous studies have made notable progress in benchmarking the instruction-following capabilities of LLMs in general domains, with a primary focus on their inherent commonsense knowledge. Recently, LLMs have been increasingly deployed as domain-oriented agents, which rely on domain-oriented guidelines that may conflict with their commonsense knowledge. These guidelines exhibit two key characteristics: they consist of a wide range of domain-oriented rules and are subject to frequent updates. Despite these challenges, the absence of comprehensive benchmarks for evaluating the domain-oriented guideline-following capabilities of LLMs presents a significant obstacle to their effective assessment and further development. In this paper, we introduce GuideBench, a comprehensive benchmark designed to evaluate the guideline-following performance of LLMs. GuideBench evaluates LLMs on three critical aspects: (i) adherence to diverse rules, (ii) robustness to rule updates, and (iii) alignment with human preferences. Experimental results on a range of LLMs indicate substantial opportunities for improving their ability to follow domain-oriented guidelines.
摘要：大型语言模型（LLMS）已被广泛部署为能够遵循用户说明并在现实世界应用中做出决策的自主代理。先前的研究在基准在一般领域的LLM功能之后基准指导方面取得了显着进步，主要关注其固有的常识性知识。最近，LLM越来越多地部署为面向域的代理，这些代理依赖于可能与他们的常识性知识相抵触的指导指南。这些指南具有两个关键特征：它们由各种面向域的规则组成，并经常进行更新。尽管有这些挑战，但缺乏全面的基准来评估LLMS功能后的面向域指南，这是其有效评估和进一步发展的重要障碍。在本文中，我们介绍了GuideBench，这是一个综合基准，旨在评估LLMS性能后的指南。指南对LLM在三个关键方面进行评估：（i）遵守不同规则，（ii）稳健性规则更新，以及（iii）与人类偏好的一致性。一系列LLM的实验结果表明，有很大的机会来提高其遵循域名指南的能力。

Title: CARES: Comprehensive Evaluation of Safety and Adversarial Robustness in Medical LLMs

Authors: Sijia Chen, Xiaomin Li, Mengxue Zhang, Eric Hanchen Jiang, Qingcheng Zeng, Chen-Hsiang Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.11413
Pdf URL: https://arxiv.org/pdf/2505.11413
Copy Paste: [[2505.11413]] CARES: Comprehensive Evaluation of Safety and Adversarial Robustness in Medical LLMs(https://arxiv.org/abs/2505.11413)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) are increasingly deployed in medical contexts, raising critical concerns about safety, alignment, and susceptibility to adversarial manipulation. While prior benchmarks assess model refusal capabilities for harmful prompts, they often lack clinical specificity, graded harmfulness levels, and coverage of jailbreak-style attacks. We introduce CARES (Clinical Adversarial Robustness and Evaluation of Safety), a benchmark for evaluating LLM safety in healthcare. CARES includes over 18,000 prompts spanning eight medical safety principles, four harm levels, and four prompting styles: direct, indirect, obfuscated, and role-play, to simulate both malicious and benign use cases. We propose a three-way response evaluation protocol (Accept, Caution, Refuse) and a fine-grained Safety Score metric to assess model behavior. Our analysis reveals that many state-of-the-art LLMs remain vulnerable to jailbreaks that subtly rephrase harmful prompts, while also over-refusing safe but atypically phrased queries. Finally, we propose a mitigation strategy using a lightweight classifier to detect jailbreak attempts and steer models toward safer behavior via reminder-based conditioning. CARES provides a rigorous framework for testing and improving medical LLM safety under adversarial and ambiguous conditions.
摘要：大型语言模型（LLM）越来越多地在医学环境中部署，对安全性，对齐和对抗性操纵的敏感性引起了关键的关注。虽然先前的基准测试评估模型拒绝有害提示的能力，但它们通常缺乏临床特异性，有害水平和越狱式攻击的覆盖范围。我们介绍Cares（临床对抗性鲁棒性和安全性评估），这是评估医疗保健LLM安全性的基准。护理包括超过18,000个提示，涵盖了八项医疗安全原则，四个危害水平和四种提示风格：直接，间接，混淆和角色扮演，以模拟恶意和良性用例。我们提出了一个三向响应评估方案（接受，谨慎，垃圾）和一个精细的安全评分度量，以评估模型行为。我们的分析表明，许多最先进的LLM仍然容易受到越狱的攻击，这些越狱会重塑有害的提示，同时也过度重新调整了安全但非典型的询问。最后，我们提出了一种使用轻量级分类器来检测越狱尝试并通过基于提醒的调节来检测更安全行为的缓解策略。 CARES提供了一个严格的框架，用于在对抗和模棱两可的条件下测试和改善医疗LLM安全性。

Title: Towards Cultural Bridge by Bahnaric-Vietnamese Translation Using Transfer Learning of Sequence-To-Sequence Pre-training Language Model

Authors: Phan Tran Minh Dat, Vo Hoang Nhat Khang, Quan Thanh Tho
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.11421
Pdf URL: https://arxiv.org/pdf/2505.11421
Copy Paste: [[2505.11421]] Towards Cultural Bridge by Bahnaric-Vietnamese Translation Using Transfer Learning of Sequence-To-Sequence Pre-training Language Model(https://arxiv.org/abs/2505.11421)
Keywords: language model, gpt
Abstract: This work explores the journey towards achieving Bahnaric-Vietnamese translation for the sake of culturally bridging the two ethnic groups in Vietnam. However, translating from Bahnaric to Vietnamese also encounters some difficulties. The most prominent challenge is the lack of available original resources for the Bahnaric source language, including vocabulary, grammar, dialogue patterns and bilingual corpora, which hinders the data collection process for training. To address this, we leverage a transfer learning approach using a sequence-to-sequence pre-trained language model. First of all, we leverage a pre-trained Vietnamese language model to capture the characteristics of this language. In particular, to further serve the purpose of machine translation, we aim for a sequence-to-sequence model, not an encoder-only model like BERT or a decoder-only model like GPT. Taking advantage of the significant similarity between the two languages, we continue training the model with the currently limited bilingual resources of Vietnamese-Bahnaric text to perform the transfer learning from language model to machine translation. Thus, this approach can help to handle the problem of imbalanced resources between two languages, while also optimizing the training and computational processes. Additionally, we enhanced the datasets using data augmentation to generate additional resources and defined some heuristic methods to help make the translation more precise. Our approach has been validated to be highly effective for the Bahnaric-Vietnamese translation model, contributing to the expansion and preservation of languages, and facilitating better mutual understanding between the two ethnic groups.
摘要：这项工作探讨了为了在越南弥合文化繁殖而实现巴赫纳里奇 - 越南翻译的旅程。但是，从巴赫纳里克翻译成越南人也遇到了一些困难。最突出的挑战是缺乏可用的原始Bahnaric Resources语言，包括词汇，语法，对话模式和双语语料库，这阻碍了数据收集的培训过程。为了解决这个问题，我们利用序列到序列前培训语言模型来利用转移学习方法。首先，我们利用预先训练的越南语言模型来捕获这种语言的特征。尤其是，为了进一步实现机器翻译的目的，我们的目标是一个顺序到序列模型，而不是像bert或仅像GPT一样编码。利用两种语言之间的显着相似性，我们继续使用当前有限的越南 - 巴哈纳语文本的双语资源来训练该模型，以执行从语言模型到机器翻译的转移学习。因此，这种方法可以帮助解决两种语言之间资源不平衡的问题，同时还可以优化培训和计算过程。此外，我们还使用数据增强来增强数据集，以生成其他资源，并定义了一些启发式方法，以帮助翻译更精确。我们的方法已得到验证，可以对巴赫纳里奇 - 越南翻译模型非常有效，有助于扩展和保存语言，并促进两个民族之间的更好的相互理解。

Title: When Thinking Fails: The Pitfalls of Reasoning for Instruction-Following in LLMs

Authors: Xiaomin Li, Zhou Yu, Zhiwei Zhang, Xupeng Chen, Ziji Zhang, Yingying Zhuang, Narayanan Sadagopan, Anurag Beniwal
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.11423
Pdf URL: https://arxiv.org/pdf/2505.11423
Copy Paste: [[2505.11423]] When Thinking Fails: The Pitfalls of Reasoning for Instruction-Following in LLMs(https://arxiv.org/abs/2505.11423)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Reasoning-enhanced large language models (RLLMs), whether explicitly trained for reasoning or prompted via chain-of-thought (CoT), have achieved state-of-the-art performance on many complex reasoning tasks. However, we uncover a surprising and previously overlooked phenomenon: explicit CoT reasoning can significantly degrade instruction-following accuracy. Evaluating 15 models on two benchmarks: IFEval (with simple, rule-verifiable constraints) and ComplexBench (with complex, compositional constraints), we consistently observe performance drops when CoT prompting is applied. Through large-scale case studies and an attention-based analysis, we identify common patterns where reasoning either helps (e.g., with formatting or lexical precision) or hurts (e.g., by neglecting simple constraints or introducing unnecessary content). We propose a metric, constraint attention, to quantify model focus during generation and show that CoT reasoning often diverts attention away from instruction-relevant tokens. To mitigate these effects, we introduce and evaluate four strategies: in-context learning, self-reflection, self-selective reasoning, and classifier-selective reasoning. Our results demonstrate that selective reasoning strategies, particularly classifier-selective reasoning, can substantially recover lost performance. To our knowledge, this is the first work to systematically expose reasoning-induced failures in instruction-following and offer practical mitigation strategies.
摘要：推理增强的大型语言模型（RLLM），无论是针对推理的明确培训还是通过思考链（COT）提示，都在许多复杂的推理任务上都实现了最先进的绩效。但是，我们发现了一个令人惊讶且以前被忽视的现象：显式的COT推理可以显着降低跟随指导的准确性。在两个基准上评估15个模型：IFEVAL（具有简单的，可验证的约束）和复杂的基础（具有复杂的，组成约束），当应用COT提示时，我们一致地观察到性能下降。通过大规模的案例研究和基于注意力的分析，我们确定了推理有助于（例如，格式化或词汇精度）或伤害的共同模式（例如，通过忽略简单的约束或引入不必要的内容）。我们提出了一个指标，限制的注意，以量化生成过程中的模型焦点，并表明COT推理通常会将注意力转移到与教学相关的代币中。为了减轻这些效果，我们介绍和评估四种策略：内在学习，自我反思，自选择性推理和分类器选择性推理。我们的结果表明，选择性推理策略，尤其是分类者选择性推理，可以大大恢复损失的性能。据我们所知，这是系统地揭露推理诱导的指导跟踪中的失败并提供实际缓解策略的第一项工作。

Title: GODBench: A Benchmark for Multimodal Large Language Models in Video Comment Art

Authors: Chenkai Zhang, Yiming Lei, Zeming Liu, Haitao Leng, Shaoguo Liu, Tingting Gao, Qingjie Liu, Yunhong Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.11436
Pdf URL: https://arxiv.org/pdf/2505.11436
Copy Paste: [[2505.11436]] GODBench: A Benchmark for Multimodal Large Language Models in Video Comment Art(https://arxiv.org/abs/2505.11436)
Keywords: language model, llm, chain-of-thought
Abstract: Video Comment Art enhances user engagement by providing creative content that conveys humor, satire, or emotional resonance, requiring a nuanced and comprehensive grasp of cultural and contextual subtleties. Although Multimodal Large Language Models (MLLMs) and Chain-of-Thought (CoT) have demonstrated strong reasoning abilities in STEM tasks (e.g. mathematics and coding), they still struggle to generate creative expressions such as resonant jokes and insightful satire. Moreover, existing benchmarks are constrained by their limited modalities and insufficient categories, hindering the exploration of comprehensive creativity in video-based Comment Art creation. To address these limitations, we introduce GODBench, a novel benchmark that integrates video and text modalities to systematically evaluate MLLMs' abilities to compose Comment Art. Furthermore, inspired by the propagation patterns of waves in physics, we propose Ripple of Thought (RoT), a multi-step reasoning framework designed to enhance the creativity of MLLMs. Extensive experiments reveal that existing MLLMs and CoT methods still face significant challenges in understanding and generating creative video comments. In contrast, RoT provides an effective approach to improve creative composing, highlighting its potential to drive meaningful advancements in MLLM-based creativity. GODBench is publicly available at this https URL.
摘要：视频评论艺术通过提供传达幽默，讽刺或情感共鸣的创意内容来增强用户参与度，需要对文化和上下文微妙的细微差别掌握。尽管多模式的大语言模型（MLLM）和思考链（COT）在STEM任务（例如数学和编码）中表现出了强大的推理能力，但他们仍然很难产生诸如谐音笑话和见识的讽刺之类的创造性表达。此外，现有的基准受到其有限的方式和类别不足的限制，从而阻碍了基于视频的评论艺术创作中综合创造力的探索。为了解决这些局限性，我们介绍了Godbench，这是一种新颖的基准，该基准将视频和文本方式整合在一起，以系统地评估MLLM的构成评论艺术的能力。此外，灵感来自物理学中波浪的传播模式，我们提出了思想涟漪（ROT），这是一个多步推理框架，旨在增强MLLM的创造力。广泛的实验表明，现有的MLLM和COT方法在理解和产生创意视频评论方面仍然面临重大挑战。相比之下，ROT提供了一种有效的方法来改善创意作品，突出了其推动基于MLLM的创造力进步的潜力。在此HTTPS URL上公开可用Godbench。

Title: Is Compression Really Linear with Code Intelligence?

Authors: Xianzhen Luo, Shijie Xuyang, Tianhao Cheng, Zheng Chu, Houyi Li, ziqi wang, Siming Huang, Qingfu Zhu, Qiufeng Wang, Xiangyu Zhang, Shuigeng Zhou, Wanxiang Che
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.11441
Pdf URL: https://arxiv.org/pdf/2505.11441
Copy Paste: [[2505.11441]] Is Compression Really Linear with Code Intelligence?(https://arxiv.org/abs/2505.11441)
Keywords: language model, llm
Abstract: Understanding the relationship between data compression and the capabilities of Large Language Models (LLMs) is crucial, especially in specialized domains like code intelligence. Prior work posited a linear relationship between compression and general intelligence. However, it overlooked the multifaceted nature of code that encompasses diverse programming languages and tasks, and struggled with fair evaluation of modern Code LLMs. We address this by evaluating a diverse array of open-source Code LLMs on comprehensive multi-language, multi-task code benchmarks. To address the challenge of efficient and fair evaluation of pre-trained LLMs' code intelligence, we introduce Format Annealing, a lightweight, transparent training methodology designed to assess the intrinsic capabilities of these pre-trained models equitably. Compression efficacy, measured as bits-per-character (BPC), is determined using a novel, large-scale, and previously unseen code validation set derived from GitHub. Our empirical results reveal a fundamental logarithmic relationship between measured code intelligence and BPC. This finding refines prior hypotheses of linearity, which we suggest are likely observations of the logarithmic curve's tail under specific, limited conditions. Our work provides a more nuanced understanding of compression's role in developing code intelligence and contributes a robust evaluation framework in the code domain.
摘要：了解数据压缩与大语言模型（LLM）功能之间的关系至关重要，尤其是在诸如代码智能之类的专用领域。先前的工作在压缩和一般智能之间建立了线性关系。但是，它忽略了代码的多方面性质，该代码涵盖了各种编程语言和任务，并在对现代代码LLM的公平评估中挣扎。我们通过评估全面的多语言多任务代码基准上的各种开源代码LLM来解决这一问题。为了解决对预培训的LLMS守则智能的高效和公平评估的挑战，我们介绍了\ textit {格式退火}，这是一种轻巧，透明的培训方法，旨在评估这些预训练模型的内在功能。使用新型，大规模和以前看不见的代码验证集得出的新型，大规模且以前未见的代码验证集确定的压缩功效，以每张字体（BPC）为单位（BPC）测量。我们的经验结果揭示了测得的代码智能与BPC之间的基本对数关系。这一发现优化了先前的线性假设，我们认为这可能是对对数曲线的尾巴的观察，在特定的有限条件下。我们的工作为压缩在开发代码智能中的作用提供了更细微的理解，并在代码域中贡献了强大的评估框架。
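Note: the abstract above measures compression efficacy as bits-per-character (BPC). As a reminder of how that quantity is usually computed from a language model's token log-probabilities, here is a minimal sketch. The function name and the choice to count Python string characters (rather than bytes) are assumptions, and the paper's exact normalization may differ.

```python
import math


def bits_per_character(token_logprobs, text):
    """Bits-per-character of a model on `text`.

    token_logprobs: natural-log probabilities the model assigned to each
    token of `text`. The total negative log-likelihood is converted from
    nats to bits and divided by the number of characters.
    """
    if not text:
        raise ValueError("text must be non-empty")
    total_nats = -sum(token_logprobs)
    total_bits = total_nats / math.log(2)
    return total_bits / len(text)
```

For example, a four-character string split into two tokens, each assigned probability 0.25, costs 2 bits per token, so 4 bits in total and a BPC of 1.0. Lower BPC means better compression of the validation code.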

Title: Disentangling Reasoning and Knowledge in Medical Large Language Models

Authors: Rahul Thapa, Qingyang Wu, Kevin Wu, Harrison Zhang, Angela Zhang, Eric Wu, Haotian Ye, Suhana Bedi, Nevin Aresh, Joseph Boen, Shriya Reddy, Ben Athiwaratkun, Shuaiwen Leon Song, James Zou
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.11462
Pdf URL: https://arxiv.org/pdf/2505.11462
Copy Paste: [[2505.11462]] Disentangling Reasoning and Knowledge in Medical Large Language Models(https://arxiv.org/abs/2505.11462)
Keywords: language model, gpt, llm
Abstract: Medical reasoning in large language models (LLMs) aims to emulate clinicians' diagnostic thinking, but current benchmarks such as MedQA-USMLE, MedMCQA, and PubMedQA often mix reasoning with factual recall. We address this by separating 11 biomedical QA benchmarks into reasoning- and knowledge-focused subsets using a PubMedBERT classifier that reaches 81 percent accuracy, comparable to human performance. Our analysis shows that only 32.8 percent of questions require complex reasoning. We evaluate biomedical models (HuatuoGPT-o1, MedReason, m1) and general-domain models (DeepSeek-R1, o4-mini, Qwen3), finding consistent gaps between knowledge and reasoning performance. For example, m1 scores 60.5 on knowledge but only 47.1 on reasoning. In adversarial tests where models are misled with incorrect initial reasoning, biomedical models degrade sharply, while larger or RL-trained general models show more robustness. To address this, we train BioMed-R1 using fine-tuning and reinforcement learning on reasoning-heavy examples. It achieves the strongest performance among similarly sized models. Further gains may come from incorporating clinical case reports and training with adversarial and backtracking scenarios.
摘要：大型语言模型（LLM）中的医学推理旨在模仿临床医生的诊断思维，但是当前的基准测试（例如MEDQA-USMLE，MEDMCQA和PubMedQA）经常将推理与事实召回相结合。我们通过使用PubMedbert分类器将11个生物医学质量检查基准分离为推理和知识的子集来解决这一问题，该分类器达到81％的精度，可与人类绩效相当。我们的分析表明，只有32.8％的问题需要复杂的推理。我们评估生物医学模型（Huatuogpt-O1，Medreason，M1）和通用域模型（DeepSeek-R1，O4-Mini，Qwen3），发现知识和推理性能之间的一致差距。例如，M1在知识上得分60.5，但仅在推理方面只有47.1分。在对逆转测试中，模型被错误的初始推理误导，生物医学模型急剧降低，而较大或经过RL训练的通用模型则表现出更大的鲁棒性。为了解决这个问题，我们使用微调和加强学习来训练Biomed-R1，以实例进行理性。它在类似大小的模型中达到了最强的性能。将临床病例报告和对抗性和回溯场景纳入临床病例报告和培训可能会带来进一步的收益。

Title: HelpSteer3-Preference: Open Human-Annotated Preference Data across Diverse Tasks and Languages

Authors: Zhilin Wang, Jiaqi Zeng, Olivier Delalleau, Hoo-Chang Shin, Felipe Soares, Alexander Bukharin, Ellie Evans, Yi Dong, Oleksii Kuchaiev
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.11475
Pdf URL: https://arxiv.org/pdf/2505.11475
Copy Paste: [[2505.11475]] HelpSteer3-Preference: Open Human-Annotated Preference Data across Diverse Tasks and Languages(https://arxiv.org/abs/2505.11475)
Keywords: language model, llm
Abstract: Preference datasets are essential for training general-domain, instruction-following language models with Reinforcement Learning from Human Feedback (RLHF). Each subsequent data release raises expectations for future data collection, meaning there is a constant need to advance the quality and diversity of openly available preference data. To address this need, we introduce HelpSteer3-Preference, a permissively licensed (CC-BY-4.0), high-quality, human-annotated preference dataset comprising over 40,000 samples. These samples span diverse real-world applications of large language models (LLMs), including tasks relating to STEM, coding and multilingual scenarios. Using HelpSteer3-Preference, we train Reward Models (RMs) that achieve top performance on RM-Bench (82.4%) and JudgeBench (73.7%). This represents a substantial improvement (~10% absolute) over the previously best-reported results from existing RMs. We demonstrate that HelpSteer3-Preference can also be applied to train Generative RMs, and show how policy models can be aligned with RLHF using our RMs. Dataset (CC-BY-4.0): this https URL
摘要：偏好数据集对于通过从人类反馈（RLHF）学习的增强学习培训通用域，指导语言模型至关重要。随后的每个数据发布都会提高对未来数据收集的期望，这意味着不断需要提高公开可用偏好数据的质量和多样性。为了满足这一需求，我们介绍了helpsteer3-preference，这是一个允许的许可（CC-BY-4.0），高质量的人类宣布的偏好数据集，其中包括40,000多个样本。这些样本涵盖了大型语言模型（LLMS）的各种现实世界应用，包括与STEM，编码和多语言方案有关的任务。使用HelpSteer3-Preference，我们培训RM Bench（82.4％）和法官Bench（73.7％）的奖励模型（RMS）。这比现有RMS先前报告的结果相比，这是实质性改善（绝对10％）。我们证明，HelpSteer3-Preference也可以应用于训练生成RMS以及如何使用我们的RMS与RLHF对齐的策略模型。数据集（CC-BY-4.0）：此HTTPS URL

Title: Improving Assembly Code Performance with Large Language Models via Reinforcement Learning

Authors: Anjiang Wei, Tarun Suresh, Huanmi Tan, Yinglun Xu, Gagandeep Singh, Ke Wang, Alex Aiken
Subjects: cs.CL, cs.AI, cs.PF, cs.PL, cs.SE
Abstract URL: https://arxiv.org/abs/2505.11480
Pdf URL: https://arxiv.org/pdf/2505.11480
Copy Paste: [[2505.11480]] Improving Assembly Code Performance with Large Language Models via Reinforcement Learning(https://arxiv.org/abs/2505.11480)
Keywords: language model, llm
Abstract: Large language models (LLMs) have demonstrated strong performance across a wide range of programming tasks, yet their potential for code optimization remains underexplored. This work investigates whether LLMs can optimize the performance of assembly code, where fine-grained control over execution enables improvements that are difficult to express in high-level languages. We present a reinforcement learning framework that trains LLMs using Proximal Policy Optimization (PPO), guided by a reward function that considers both functional correctness, validated through test cases, and execution performance relative to the industry-standard compiler gcc -O3. To support this study, we introduce a benchmark of 8,072 real-world programs. Our model, Qwen2.5-Coder-7B-PPO, achieves 96.0% test pass rates and an average speedup of 1.47x over the gcc -O3 baseline, outperforming all 20 other models evaluated, including Claude-3.7-sonnet. These results indicate that reinforcement learning can unlock the potential of LLMs to serve as effective optimizers for assembly code performance.
摘要：大型语言模型（LLMS）在各种编程任务中都表现出了很强的性能，但它们的代码优化潜力仍然没有得到充实。这项工作调查了LLM是否可以优化组装代码的性能，在这种情况下，对执行的细粒度控制能够以高级语言表达的改进。我们提出了一个强化学习框架，该框架使用近端策略优化（PPO）训练LLM，并以奖励功能为指导，该奖励功能既考虑功能正确性，又通过测试用例验证，以及相对于行业标准编译器GCC -O3的执行绩效。为了支持这项研究，我们介绍了8,072个现实世界计划的基准。我们的模型QWEN2.5-编码-7B-PPO达到96.0％的测试率和在GCC -O3基线上的平均速度为1.47倍，表现优于评估的所有其他20个模型，包括Claude-3.7-Sonnet。这些结果表明，增强学习可以解锁LLM的潜力，以作为组装代码性能的有效优化者。
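Note: the abstract above describes a reward that accounts for both functional correctness (checked with test cases) and execution performance relative to gcc -O3. The abstract does not give the exact formula, so the sketch below is one plausible shape under stated assumptions: the reward is zero unless every test passes, and otherwise equals the speedup over the baseline. The function name and this gating rule are hypothetical.

```python
def speedup_reward(tests_passed, tests_total, baseline_seconds, candidate_seconds):
    """Illustrative reward: correctness-gated speedup (an assumed design).

    baseline_seconds: runtime of the gcc -O3 output for the same program.
    candidate_seconds: runtime of the model-generated assembly.
    """
    if tests_total <= 0:
        raise ValueError("tests_total must be positive")
    if baseline_seconds <= 0 or candidate_seconds <= 0:
        raise ValueError("runtimes must be positive")
    if tests_passed < tests_total:
        return 0.0  # incorrect code earns nothing, however fast it runs
    return baseline_seconds / candidate_seconds  # values above 1.0 beat the baseline
```

Under this sketch, a candidate that passes all tests and runs in 1.0 s against a 1.47 s baseline would receive a reward of 1.47, matching the way the abstract reports speedups as a ratio over gcc -O3.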

Title: SoftCoT++: Test-Time Scaling with Soft Chain-of-Thought Reasoning

Authors: Yige Xu, Xu Guo, Zhiwei Zeng, Chunyan Miao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.11484
Pdf URL: https://arxiv.org/pdf/2505.11484
Copy Paste: [[2505.11484]] SoftCoT++: Test-Time Scaling with Soft Chain-of-Thought Reasoning(https://arxiv.org/abs/2505.11484)
Keywords: llm, chain-of-thought
Abstract: Test-Time Scaling (TTS) refers to approaches that improve reasoning performance by allocating extra computation during inference, without altering the model's parameters. While existing TTS methods operate in a discrete token space by generating more intermediate steps, recent studies in Coconut and SoftCoT have demonstrated that thinking in the continuous latent space can further enhance the reasoning performance. Such latent thoughts encode informative thinking without the information loss associated with autoregressive token generation, sparking increased interest in continuous-space reasoning. Unlike discrete decoding, where repeated sampling enables exploring diverse reasoning paths, latent representations in continuous space are fixed for a given input, which limits diverse exploration, as all decoded paths originate from the same latent thought. To overcome this limitation, we introduce SoftCoT++ to extend SoftCoT to the Test-Time Scaling paradigm by enabling diverse exploration of thinking paths. Specifically, we perturb latent thoughts via multiple specialized initial tokens and apply contrastive learning to promote diversity among soft thought representations. Experiments across five reasoning benchmarks and two distinct LLM architectures demonstrate that SoftCoT++ significantly boosts SoftCoT and also outperforms SoftCoT with self-consistency scaling. Moreover, it shows strong compatibility with conventional scaling techniques such as self-consistency. Source code is available at this https URL.
摘要：测试时间缩放（TTS）是指通过在推理过程中分配额外的计算而无需更改模型参数来改善推理性能的方法。尽管现有的TTS方法通过产生更中间的步骤在离散的令牌空间中运行，但最近对椰子和软核的研究表明，在连续的潜在空间中的思考可以进一步提高推理性能。这种潜在的思想编码信息性思维，而没有与自回归令牌产生相关的信息损失，从而引发了人们对连续空间推理的兴趣。与离散解码不同，在重复采样可以探索各种推理路径的情况下，连续空间中的潜在表示是为给定输入固定的，这限制了各种探索，因为所有解码路径都来自相同的潜在思想。为了克服这一限制，我们将SoftCot ++介绍，通过实现对思维路径的多样化探索，将软件扩展到测试时间缩放范式。具体而言，我们通过多个专业的初始令牌扰动潜在思想，并应用对比度学习以促进软思维表示之间的多样性。在五个推理基准和两个不同的LLM架构上进行的实验表明，SoftCot ++显着提高了软件，并且以自稳态缩放的速度优于软核。此外，它显示出与常规缩放技术（例如自洽性）的强烈兼容性。源代码可在此HTTPS URL上找到。

Title: Modeling cognitive processes of natural reading with transformer-based Language Models

Authors: Bruno Bianchi, Fermín Travi, Juan E. Kamienkowski
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.11485
Pdf URL: https://arxiv.org/pdf/2505.11485
Copy Paste: [[2505.11485]] Modeling cognitive processes of natural reading with transformer-based Language Models(https://arxiv.org/abs/2505.11485)
Keywords: language model, gpt
Abstract: Recent advances in Natural Language Processing (NLP) have led to the development of highly sophisticated language models for text generation. In parallel, neuroscience has increasingly employed these models to explore cognitive processes involved in language comprehension. Previous research has shown that models such as N-grams and LSTM networks can partially account for predictability effects in explaining eye movement behaviors, specifically Gaze Duration, during reading. In this study, we extend these findings by evaluating transformer-based models (GPT2, LLaMA-7B, and LLaMA2-7B) to further investigate this relationship. Our results indicate that these architectures outperform earlier models in explaining the variance in Gaze Durations recorded from Rioplatense Spanish readers. However, similar to previous studies, these models still fail to account for the entirety of the variance captured by human predictability. These findings suggest that, despite their advancements, state-of-the-art language models continue to predict language in ways that differ from human readers.
摘要：自然语言处理（NLP）的最新进展已导致开发了高度复杂的文本模型。同时，神经科学越来越多地利用这些模型来探索与语言理解有关的认知过程。先前的研究表明，诸如N-Grams和LSTM网络之类的模型可以部分解释阅读过程中解释眼运动行为，特别是凝视持续时间的可预测性效应。在这项研究中，我们通过评估基于变压器的模型（GPT2，Llama-7b和Llama2-7b）来扩展这些发现，以进一步研究这种关系。我们的结果表明，这些体系结构在解释Rioplantense西班牙读者中记录的凝视持续时间方面的差异方面的表现优于早期模型。但是，与以前的研究类似，这些模型仍然无法解释人类可预测性捕获的整个方差。这些发现表明，尽管有进步，但最先进的语言模型继续以与人类读者不同的方式预测语言。