2025-05-28

Title: Guiding Giants: Lightweight Controllers for Weighted Activation Steering in LLMs

Authors: Amr Hegazy, Mostafa Elhoushi, Amr Alanwar
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.20309
Pdf URL: https://arxiv.org/pdf/2505.20309
Copy Paste: [[2505.20309]] Guiding Giants: Lightweight Controllers for Weighted Activation Steering in LLMs(https://arxiv.org/abs/2505.20309)
Keywords: language model, llm, prompt, chat
Abstract: Controlling undesirable Large Language Model (LLM) behaviors, such as the generation of unsafe content or failing to adhere to safety guidelines, often relies on costly fine-tuning. Activation steering provides an alternative for inference-time control, but existing methods typically lack fine-grained, adaptive mechanisms. We introduce a novel approach using a lightweight, trainable controller network integrated during inference. This controller network observes specific intermediate LLM activations and predicts both a global scaling factor and layer-specific weights. The predicted global scaling factor and layer-specific weights then dynamically modulate the intensity of a steering patch, derived from a pre-computed "refusal direction" vector, applied across the LLM's layers during generation. Trained on activations from both harmful and benign prompts, our controller learns to discriminatively apply nuanced, layer-aware interventions, activating steering primarily for harmful inputs. Experiments using safety benchmarks like ToxicChat & In-The-Wild Jailbreak Prompts demonstrate that our weighted steering controller significantly increases refusal rates compared to the base LLM, achieving targeted behavioral modification without altering the original model parameters. Our experiments with Llama-3.1-8B, Llama-3.2-1B & Mistral-7B show our approach outperforms existing methods, presenting an efficient and adaptive method for fine-grained control over LLM behavior at inference time.
摘要：控制不良的大语言模型（LLM）行为，例如产生不安全的内容或不遵守安全指南，通常依赖于昂贵的微调。激活转向为推理时间控制提供了替代方案，但现有方法通常缺乏细粒度的自适应机制。我们使用在推理过程中集成的轻巧，可训练的控制器网络介绍了一种新颖的方法。该控制器网络观察特定的中间LLM激活，并预测全局缩放系数和特定于层的权重。然后，预测的全局缩放系数和特定于层特异性权重动态调节转向贴片的强度，该方向贴片的强度是从预先计算的“拒绝方向”向量衍生而成的，该向量在生成过程中跨LLM层应用。经过有害和良性提示的激活培训，我们的控制者学会了歧视性地采用细微差别的，感知的干预措施，主要激活转向，主要用于有害输入。与基本LLM相比，使用Tocicchat和野外越狱等安全基准的实验表明，我们的加权转向控制器显着提高了拒绝率，从而实现了目标行为修改而没有更改原始模型参数。我们对Llama-3.1-8B，Llama-3.2-1b和Mistral-7b进行的实验表明，我们的方法优于现有方法，在推理时提出了一种有效且适应性的方法，用于对LLM行为进行细粒度控制。
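
The weighted steering described in this abstract can be illustrated with a minimal sketch. It assumes the patch is purely additive (each layer's activation gets the refusal direction scaled by the product of the global factor and that layer's weight); the abstract does not state the exact formula, and the names `steer_layer` and `steer_all_layers` are hypothetical. In the paper, a trained controller predicts the scale and weights; here they are passed in as plain numbers.

```python
from typing import List

Vector = List[float]


def steer_layer(hidden: Vector, direction: Vector,
                global_scale: float, layer_weight: float) -> Vector:
    """Add the steering direction to one layer's activation.

    Assumption: the patch is additive with strength
    global_scale * layer_weight.
    """
    strength = global_scale * layer_weight
    return [h + strength * d for h, d in zip(hidden, direction)]


def steer_all_layers(hiddens: List[Vector], direction: Vector,
                     global_scale: float,
                     layer_weights: List[float]) -> List[Vector]:
    """Apply the same direction to every layer with its own weight."""
    return [steer_layer(h, direction, global_scale, w)
            for h, w in zip(hiddens, layer_weights)]
```

A layer weight of 0 leaves that layer untouched, and a global scale of 0 leaves every layer untouched, which matches the stated goal of steering mainly on harmful inputs.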

Title: Arctic-Text2SQL-R1: Simple Rewards, Strong Reasoning in Text-to-SQL

Authors: Zhewei Yao, Guoheng Sun, Lukasz Borchmann, Zheyu Shen, Minghang Deng, Bohan Zhai, Hao Zhang, Ang Li, Yuxiong He
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.20315
Pdf URL: https://arxiv.org/pdf/2505.20315
Copy Paste: [[2505.20315]] Arctic-Text2SQL-R1: Simple Rewards, Strong Reasoning in Text-to-SQL(https://arxiv.org/abs/2505.20315)
Keywords: language model, llm
Abstract: Translating natural language into SQL (Text2SQL) is a longstanding challenge at the intersection of natural language understanding and structured data access. While large language models (LLMs) have significantly improved fluency in SQL generation, producing correct and executable SQL--particularly for complex queries--remains a bottleneck. We present Arctic-Text2SQL-R1, a reinforcement learning (RL) framework and model family designed to generate accurate, executable SQL using a lightweight reward signal based solely on execution correctness. Our approach avoids brittle intermediate supervision and complex reward shaping, promoting stable training and alignment with the end task. Combined with carefully curated data, strong supervised initialization, and effective training practices, Arctic-Text2SQL-R1 achieves state-of-the-art execution accuracy across six diverse Text2SQL benchmarks, including the top position on the BIRD leaderboard. Notably, our 7B model outperforms prior 70B-class systems, highlighting the framework's scalability and efficiency. We further demonstrate inference-time robustness through simple extensions like value retrieval and majority voting. Extensive experiments and ablation studies offer both positive and negative insights, providing practical guidance for future Text2SQL research.
摘要：在自然语言理解和结构化数据访问的交集中，将自然语言转换为SQL（Test2SQL）是一个长期的挑战。尽管大型语言模型（LLMS）在SQL生成中的流利度有了显着提高，可用于复杂查询的正确且可执行的SQL-孔 - 瓶颈。我们提出了Arctic-Text2SQL-R1，这是一种增强学习（RL）框架和模型家族，旨在使用仅基于执行正确性的轻巧奖励信号来生成准确的可执行性SQL。我们的方法避免了脆弱的中间监督和复杂的奖励成型，从而促进了稳定的培训并与最终任务保持一致。结合精心策划的数据，强大的监督初始化和有效的培训实践，Arctic-Text2SQL-R1在六种不同的Test2SQL基准中实现了最新的执行精度，包括在Bird排行榜上的最高位置。值得注意的是，我们的7B模型优于先前的70B级系统，突出了该框架的可扩展性和效率。我们通过简单的扩展（如价值检索和多数投票）进一步证明了推理时间的鲁棒性。广泛的实验和消融研究提供了积极的和负面的见解，为未来的Test2SQL研究提供了实用的指导。
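
An execution-correctness reward of the kind this abstract mentions might look like the sketch below. It assumes a binary reward (1.0 when the predicted query runs and returns the same multiset of rows as the reference query, 0.0 otherwise); the abstract does not give the exact reward definition. The function names and the demo table are hypothetical, and SQLite is used only because it ships with Python.

```python
import sqlite3
from collections import Counter


def execution_reward(conn: sqlite3.Connection,
                     predicted_sql: str, gold_sql: str) -> float:
    """Return 1.0 if both queries yield the same multiset of rows."""
    try:
        predicted_rows = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return 0.0  # non-executable SQL earns no reward
    gold_rows = conn.execute(gold_sql).fetchall()
    # Compare as multisets so row order does not matter.
    return 1.0 if Counter(predicted_rows) == Counter(gold_rows) else 0.0


def make_demo_db() -> sqlite3.Connection:
    """Build a tiny in-memory database (illustrative data only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER, age INTEGER)")
    conn.executemany("INSERT INTO users VALUES (?, ?)",
                     [(1, 25), (2, 40), (3, 31)])
    return conn
```

Because the reward depends only on query results, two syntactically different queries that return the same rows receive the same score.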

Title: Beyond Demonstrations: Dynamic Vector Construction from Latent Representations

Authors: Wang Cai, Hsiu-Yuan Huang, Zhixiang Wang, Yunfang Wu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.20318
Pdf URL: https://arxiv.org/pdf/2505.20318
Copy Paste: [[2505.20318]] Beyond Demonstrations: Dynamic Vector Construction from Latent Representations(https://arxiv.org/abs/2505.20318)
Keywords: language model, llm
Abstract: In-Context derived Vector (ICV) methods extract task-relevant representations from large language models (LLMs) and reinject them during inference, achieving comparable performance to few-shot In-Context Learning (ICL) without repeated demonstration processing. However, existing ICV methods remain sensitive to ICL-specific factors, often use coarse or semantically fragmented representations as the source of the vector, and rely on heuristic-based injection positions, limiting their applicability. To address these issues, we propose Dynamic Vector (DyVec), which incorporates an Exhaustive Query Rotation (EQR) strategy to extract robust semantically aggregated latent representations by mitigating variance introduced by ICL. It then applies Dynamic Latent Segmentation and Injection to adaptively partition representations based on task complexity and leverages REINFORCE-based optimization to learn optimal injection positions for each segment. Experimental results show that DyVec outperforms few-shot ICL, LoRA, and prior ICV baselines. Further analysis highlights the effectiveness of dynamically segmenting and injecting semantically aggregated latent representations. DyVec provides a lightweight and data-efficient solution for inference-time task adaptation.
摘要：内部下文衍生的向量（ICV）方法从大语言模型（LLMS）中提取与任务相关的表示，并在推理过程中重新注射它们，从而在没有重复演示处理的情况下实现了与少数射击中的内在学习（ICL）相当的性能。但是，现有的ICV方法对ICL特异性因素仍然敏感，通常将粗或语义上的碎片表示作为向量的来源，并依靠基于启发式的注射位置，从而限制了其适用性。为了解决这些问题，我们提出了动态矢量（DYVEC），该矢量（DYVEC）结合了详尽的查询旋转（EQR）策略，通过缓解ICL引入的差异来提取鲁棒的语义汇总潜在表示。然后，它将动态潜在细分和注入应用于基于任务复杂性和利用基于增强的优化的自适应分区表示，以学习每个细分市场的最佳注入位置。实验结果表明，DYVEC的表现优于少数ICL，LORA和先前的ICV基准。进一步的分析强调了动态分割和注入语义汇总的潜在表示的有效性。 DYVEC为推理时间任务适应提供了轻巧且具有数据效率的解决方案。

Title: Less Context, Same Performance: A RAG Framework for Resource-Efficient LLM-Based Clinical NLP

Authors: Satya Narayana Cheetirala, Ganesh Raut, Dhavalkumar Patel, Fabio Sanatana, Robert Freeman, Matthew A Levin, Girish N. Nadkarni, Omar Dawkins, Reba Miller, Randolph M. Steinhagen, Eyal Klang, Prem Timsina
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.20320
Pdf URL: https://arxiv.org/pdf/2505.20320
Copy Paste: [[2505.20320]] Less Context, Same Performance: A RAG Framework for Resource-Efficient LLM-Based Clinical NLP(https://arxiv.org/abs/2505.20320)
Keywords: language model, gpt, llm, retrieval augmented generation
Abstract: Long text classification is challenging for Large Language Models (LLMs) due to token limits and high computational costs. This study explores whether a Retrieval Augmented Generation (RAG) approach using only the most relevant text segments can match the performance of processing entire clinical notes with large-context LLMs. We begin by splitting clinical documents into smaller chunks, converting them into vector embeddings, and storing these in a FAISS index. We then retrieve the top 4,000 words most pertinent to the classification query and feed these consolidated segments into an LLM. We evaluated three LLMs (GPT4o, LLaMA, and Mistral) on a surgical complication identification task. Metrics such as AUC ROC, precision, recall, and F1 showed no statistically significant differences between the RAG-based approach and whole-text processing (p > 0.05). These findings indicate that RAG can significantly reduce token usage without sacrificing classification accuracy, providing a scalable and cost-effective solution for analyzing lengthy clinical documents.
摘要：由于令牌限制和高计算成本，长期文本分类对于大型语言模型（LLM）具有挑战性。这项研究探讨了仅使用最相关的文本段的检索增强生成（RAG）方法是否可以与使用较大的上下文LLM处理整个临床笔记的性能相匹配。我们首先将临床文档分成较小的块，将其转换为矢量嵌入，然后将其存储在Faiss指数中。然后，我们检索了与分类查询最相关的前4000个单词，并将这些综合段馈入LLM。我们在手术并发症识别任务上评估了三个LLM（GPT4O，Llama和Mistral）。诸如AUC ROC，精度，召回和F1之类的指标在基于RAG的方法和全文处理之间没有统计学上的显着差异（P> 0.05p> 0.05）。这些发现表明，抹布可以大大减少令牌使用情况而不牺牲分类精度，从而为分析冗长的临床文档提供了可扩展且具有成本效益的解决方案。
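
The chunk, rank, and budget steps in this abstract can be sketched as follows. This is not the authors' pipeline: bag-of-words cosine similarity stands in for the learned embeddings and FAISS index, so that the example runs with the standard library alone. The chunk size of 200 words and all function names are assumptions; only the 4,000-word budget comes from the abstract.

```python
import math
import re
from collections import Counter
from typing import List


def bag_of_words(text: str) -> Counter:
    """Lowercased token counts (a stand-in for a learned embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def chunk_words(document: str, chunk_size: int) -> List[str]:
    """Split a document into consecutive chunks of chunk_size words."""
    words = document.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]


def retrieve(document: str, query: str,
             chunk_size: int = 200, word_budget: int = 4000) -> List[str]:
    """Return the highest-scoring chunks that fit in the word budget."""
    query_vec = bag_of_words(query)
    ranked = sorted(chunk_words(document, chunk_size),
                    key=lambda c: cosine(bag_of_words(c), query_vec),
                    reverse=True)
    selected, used = [], 0
    for chunk in ranked:
        n_words = len(chunk.split())
        if used + n_words > word_budget:
            break  # stop at the first chunk that would exceed the budget
        selected.append(chunk)
        used += n_words
    return selected
```

The selected chunks would then be concatenated and passed to the LLM together with the classification query.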

Title: BiomedSQL: Text-to-SQL for Scientific Reasoning on Biomedical Knowledge Bases

Authors: Mathew J. Koretsky, Maya Willey, Adi Asija, Owen Bianchi, Chelsea X. Alvarado, Tanay Nayak, Nicole Kuznetsov, Sungwon Kim, Mike A. Nalls, Daniel Khashabi, Faraz Faghri
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.20321
Pdf URL: https://arxiv.org/pdf/2505.20321
Copy Paste: [[2505.20321]] BiomedSQL: Text-to-SQL for Scientific Reasoning on Biomedical Knowledge Bases(https://arxiv.org/abs/2505.20321)
Keywords: gpt, llm, prompt, agent
Abstract: Biomedical researchers increasingly rely on large-scale structured databases for complex analytical tasks. However, current text-to-SQL systems often struggle to map qualitative scientific questions into executable SQL, particularly when implicit domain reasoning is required. We introduce BiomedSQL, the first benchmark explicitly designed to evaluate scientific reasoning in text-to-SQL generation over a real-world biomedical knowledge base. BiomedSQL comprises 68,000 question/SQL query/answer triples grounded in a harmonized BigQuery knowledge base that integrates gene-disease associations, causal inference from omics data, and drug approval records. Each question requires models to infer domain-specific criteria, such as genome-wide significance thresholds, effect directionality, or trial phase filtering, rather than rely on syntactic translation alone. We evaluate a range of open- and closed-source LLMs across prompting strategies and interaction paradigms. Our results reveal a substantial performance gap: GPT-o3-mini achieves 59.0% execution accuracy, while our custom multi-step agent, BMSQL, reaches 62.6%, both well below the expert baseline of 90.0%. BiomedSQL provides a new foundation for advancing text-to-SQL systems capable of supporting scientific discovery through robust reasoning over structured biomedical knowledge bases. Our dataset is publicly available at this https URL, and our code is open-source at this https URL.
摘要：生物医学研究人员越来越依赖大规模的结构化数据库来进行复杂的分析任务。但是，当前的文本到SQL系统通常很难将定性科学问题映射到可执行的SQL中，尤其是在需要隐式域推理时。我们介绍了BiomedSQL，这是第一个明确设计的基准，旨在评估现实世界中生物医学知识库中文本到SQL生成中的科学推理。 BiomedSQL包括68,000个问题/SQL查询/答案三元组，该问题基于一个统一的BigQuery知识库，该知识库集成了基因 - 疾病协会，OMICS数据的因果推断以及药物批准记录。每个问题都需要模型来推断域特异性标准，例如全基因组显着性阈值，效应方向性或试验相滤波，而不是仅依赖于句法翻译。我们在提示的策略和互动范例中评估了一系列开放和封闭的LLM。我们的结果表明，性能差距很大：GPT-O3-MINI的执行精度达到59.0％，而我们的自定义多步代理BMSQL达到62.6％，均低于专家基线90.0％。 BiomedSQL为推进文本到SQL系统提供了新的基础，该系统能够通过对结构化的生物医学知识库进行强有力的推理来支持科学发现。我们的数据集可在此HTTPS URL上公开可用，并且我们的代码在此HTTPS URL上是开源的。

Title: Beyond Prompt Engineering: Robust Behavior Control in LLMs via Steering Target Atoms

Authors: Mengru Wang, Ziwen Xu, Shengyu Mao, Shumin Deng, Zhaopeng Tu, Huajun Chen, Ningyu Zhang
Subjects: cs.CL, cs.AI, cs.CV, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2505.20322
Pdf URL: https://arxiv.org/pdf/2505.20322
Copy Paste: [[2505.20322]] Beyond Prompt Engineering: Robust Behavior Control in LLMs via Steering Target Atoms(https://arxiv.org/abs/2505.20322)
Keywords: language model, llm, prompt
Abstract: Precise control over language model generation is vital for ensuring both safety and reliability. Although prompt engineering and steering are commonly used to intervene in model behaviors, the vast number of parameters in models often results in highly intertwined internal representations. This interdependency can limit control precision and sometimes lead to unintended side effects. Recent research has explored the use of sparse autoencoders (SAE) to disentangle knowledge in high-dimensional spaces for steering. However, these applications have been limited to toy tasks owing to the nontrivial issue of locating atomic knowledge components. In this paper, we propose Steering Target Atoms (STA), a novel method that isolates and manipulates disentangled knowledge components to enhance safety. Comprehensive experiments demonstrate the effectiveness of our approach. Further analysis reveals that steering exhibits superior robustness and flexibility, particularly in adversarial scenarios. We also apply the steering strategy to the large reasoning model, confirming its effectiveness in precise reasoning control.
摘要：对语言模型产生的精确控制对于确保安全性和可靠性至关重要。尽管迅速的工程和转向通常用于干预模型行为，但模型中的大量参数通常会导致高度相互交织的内部表示形式。这种相互依存的性能可以限制控制精度，有时会导致意外的副作用。最近的研究探索了稀疏自动编码器（SAE）在高维空间中解散知识的使用。但是，由于定位原子知识组件的非平凡问题，这些应用程序仅限于玩具任务。在本文中，我们提出了转向目标原子（STA），该方法是一种新颖的方法，可以分离并操纵分离的知识成分以增强安全性。全面的实验证明了我们方法的有效性。进一步的分析表明，转向具有出色的鲁棒性和灵活性，尤其是在对抗情况下。我们还将转向策略应用于大型推理模型，确认其在精确推理控制中的有效性。

Title: PMOA-TTS: Introducing the PubMed Open Access Textual Times Series Corpus

Authors: Shahriar Noroozizadeh, Sayantan Kumar, George H. Chen, Jeremy C. Weiss
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.20323
Pdf URL: https://arxiv.org/pdf/2505.20323
Copy Paste: [[2505.20323]] PMOA-TTS: Introducing the PubMed Open Access Textual Times Series Corpus(https://arxiv.org/abs/2505.20323)
Keywords: llm, prompt
Abstract: Understanding temporal dynamics in clinical narratives is essential for modeling patient trajectories, yet large-scale temporally annotated resources remain limited. We present PMOA-TTS, the first openly available dataset of 124,699 PubMed Open Access (PMOA) case reports, each converted into structured (event, time) timelines via a scalable LLM-based pipeline. Our approach combines heuristic filtering with Llama 3.3 to identify single-patient case reports, followed by prompt-driven extraction using Llama 3.3 and DeepSeek R1, resulting in over 5.6 million timestamped clinical events. To assess timeline quality, we evaluate against a clinician-curated reference set using three metrics: (i) event-level matching (80% match at a cosine similarity threshold of 0.1), (ii) temporal concordance (c-index > 0.90), and (iii) Area Under the Log-Time CDF (AULTC) for timestamp alignment. Corpus-level analysis shows wide diagnostic and demographic coverage. In a downstream survival prediction task, embeddings from extracted timelines achieve time-dependent concordance indices up to 0.82 $\pm$ 0.01, demonstrating the predictive value of temporally structured narratives. PMOA-TTS provides a scalable foundation for timeline extraction, temporal reasoning, and longitudinal modeling in biomedical NLP. The dataset is available at: this https URL.
摘要：了解临床叙事中的时间动态对于对患者轨迹进行建模至关重要，但大规模的时间注释资源仍然有限。我们提出PMOA-TTS，这是第一个公开可用的数据集，该数据集由124,699个PubMed Open Access（PMOA）案例报告，每个数据集通过基于可扩展的LLM的管道转换为结构化的（事件，时间）时间表。我们的方法将启发式过滤与Llama 3.3结合在一起，以识别单患有病例报告，然后使用Llama 3.3和DeepSeek R1进行迅速驱动提取，从而导致超过560万个时间戳临床事件。为了评估时间轴质量，我们根据使用三个指标进行临床医生策划的参考集进行评估：（i）事件级匹配（余弦相似性阈值的80％匹配为0.1），（ii）时间一致性（c-index> 0.90）和（iii）在log-time cdf（auttc（auttc）下进行时间段的面积。语料库级分析显示了广泛的诊断和人口统计覆盖范围。在下游的生存预测任务中，提取时间表的嵌入达到时间依赖的一致性指数高达0.82 $ \ pm $ 0.01，这表明了时间结构化叙述的预测价值。 PMOA-TTS为生物医学NLP中的时间线提取，时间推理和纵向建模提供了可扩展的基础。该数据集可在以下网址提供：此HTTPS URL。

Title: Guided by Gut: Efficient Test-Time Scaling with Reinforced Intrinsic Confidence

Authors: Amirhosein Ghasemabadi, Keith G. Mills, Baochun Li, Di Niu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.20325
Pdf URL: https://arxiv.org/pdf/2505.20325
Copy Paste: [[2505.20325]] Guided by Gut: Efficient Test-Time Scaling with Reinforced Intrinsic Confidence(https://arxiv.org/abs/2505.20325)
Keywords: language model, llm
Abstract: Test-Time Scaling (TTS) methods for enhancing Large Language Model (LLM) reasoning often incur substantial computational costs, primarily due to extensive reliance on external Process Reward Models (PRMs) or sampling methods like Best-of-N (BoN). This paper introduces Guided by Gut (GG), an efficient self-guided TTS framework that achieves PRM-level performance without costly external verifier models. Our method employs a lightweight tree search guided solely by intrinsic LLM signals, token-level confidence and step novelty. One critical innovation is improving the reliability of internal confidence estimates via a targeted reinforcement learning fine-tuning phase. Empirical evaluations on challenging mathematical reasoning benchmarks demonstrate that GG enables smaller models (e.g., 1.5B parameters) to achieve accuracy matching or surpassing significantly larger models (e.g., 32B-70B parameters), while reducing GPU memory usage by up to 10x. Compared to PRM-based methods, GG achieves comparable accuracy with 8x faster inference speeds and 4-5x lower memory usage. Additionally, GG reduces KV cache memory usage by approximately 50% compared to the BoN strategy, facilitating more efficient and practical deployment of TTS techniques.
摘要：增强大语模型（LLM）推理的测试时间缩放（TTS）方法通常会产生大量的计算成本，这主要是由于广泛依赖外部过程奖励模型（PRMS）或采样方法（例如Best-N（BON））。本文介绍了由ut（gg）指导的，这是一个有效的自引导TTS框架，可在没有昂贵的外部验证器模型的情况下实现PRM级的性能。我们的方法采用了轻巧的树搜索，仅由固有的LLM信号，令牌级信心和阶跃新颖性引导。一种关键的创新是通过有针对性的加强学习微调阶段提高内部置信度估计的可靠性。对具有挑战性的数学推理基准的经验评估表明，GG可以使较小的模型（例如1.5B参数）实现准确性匹配或超过较大的模型（例如32B-70B参数），而将GPU存储器的使用量最大减少10倍。与基于PRM的方法相比，GG可以以更快的推理速度和低4-5倍的记忆使用速度达到可比精度。此外，与BON策略相比，GG将KV高速缓存存储器的使用量减少了约50％，从而促进了TTS技术的更高效和实用的部署。

Title: Multi-Scale Manifold Alignment: A Unified Framework for Enhanced Explainability of Large Language Models

Authors: Yukun Zhang, Qi Dong
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.20333
Pdf URL: https://arxiv.org/pdf/2505.20333
Copy Paste: [[2505.20333]] Multi-Scale Manifold Alignment: A Unified Framework for Enhanced Explainability of Large Language Models(https://arxiv.org/abs/2505.20333)
Keywords: language model, llm
Abstract: Recent advances in Large Language Models (LLMs) have achieved strong performance, yet their internal reasoning remains opaque, limiting interpretability and trust in critical applications. We propose a novel Multi-Scale Manifold Alignment framework that decomposes the latent space into global, intermediate, and local semantic manifolds capturing themes, context, and word-level details. Our method introduces cross-scale mapping functions that jointly enforce geometric alignment (e.g., Procrustes analysis) and information preservation (via mutual information constraints like MINE or VIB). We further incorporate curvature regularization and hyperparameter tuning for stable optimization. Theoretical analysis shows that alignment error, measured by KL divergence, can be bounded under mild assumptions. This framework offers a unified explanation of how LLMs structure multi-scale semantics, advancing interpretability and enabling applications such as bias detection and robustness enhancement.
摘要：大型语言模型（LLM）的最新进展已经取得了出色的表现，但他们的内部推理仍然不透明，限制了对关键应用的解释性和信任。我们提出了一个新颖的Multi_scale歧管对齐框架，该框架将潜在空间分解为全局，中间和本地语义歧管，以捕获主题，上下文和文字级别的细节。我们的方法介绍了共同执行几何对齐（例如procrustes分析）和信息保存（通过我的或VIB等相互信息约束）共同执行几何形状比对的跨尺度映射函数。我们进一步结合了曲率正则化和超参数调整以进行稳定优化。理论分析表明，通过KL差异测量的对齐误差可以在轻度假设下进行界定。该框架提供了统一的解释，即LLMS如何构建多尺度语义，提高可解释性和启用应用程序，例如偏置检测和鲁棒性增强。

Title: Lookahead Q-Cache: Achieving More Consistent KV Cache Eviction via Pseudo Query

Authors: Yixuan Wang, Shiyu Ji, Yijun Liu, Yuzhuang Xu, Yang Xu, Qingfu Zhu, Wanxiang Che
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.20334
Pdf URL: https://arxiv.org/pdf/2505.20334
Copy Paste: [[2505.20334]] Lookahead Q-Cache: Achieving More Consistent KV Cache Eviction via Pseudo Query(https://arxiv.org/abs/2505.20334)
Keywords: language model, llm
Abstract: Large language models (LLMs) rely on key-value cache (KV cache) to accelerate decoding by reducing redundant computations. However, the KV cache memory usage grows substantially with longer text sequences, posing challenges for efficient deployment. Existing KV cache eviction methods prune tokens using prefilling-stage attention scores, causing inconsistency with actual inference queries, especially under tight memory budgets. In this paper, we propose Lookahead Q-Cache (LAQ), a novel eviction framework that generates low-cost pseudo lookahead queries to better approximate the true decoding-stage queries. By using these lookahead queries as the observation window for importance estimation, LAQ achieves more consistent and accurate KV cache eviction aligned with real inference scenarios. Experimental results on LongBench and Needle-in-a-Haystack benchmarks show that LAQ outperforms existing methods across various budget levels, achieving a 1 $\sim$ 4 point improvement on LongBench under limited cache budget. Moreover, LAQ is complementary to existing approaches and can be flexibly combined to yield further improvements.
摘要：大型语言模型（LLMS）依靠键值缓存（KV缓存）来通过减少冗余计算来加速解码。但是，KV缓存存储器的使用情况大大增长，文本序列较长，对有效部署提出了挑战。现有的KV缓存驱逐方法使用预填充阶段的注意分数修剪令牌，从而导致与实际推理查询不一致，尤其是在紧张的记忆预算下。在本文中，我们提出了LookAhead Q-Cache（LAQ），这是一个新型的驱逐框架，生成低成本的lookAhead查询，以更好地近似真正的解码阶段查询。通过使用这些LookAhead查询作为重要性估计的观察窗口，LAQ实现了与真实推理方案一致的更一致和准确的KV缓存驱逐。 Longbench和针中的实验结果表明，LAQ在各种预算水平上胜过现有方法，在有限的缓存预算下，Longbench的1 $ \ sim $ 4分提高了。此外，LAQ是现有方法互补的，可以灵活地合并以获得进一步的改进。
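
The idea of scoring cached tokens against an observation window of queries can be sketched as below. This is a generic attention-score eviction sketch, not LAQ itself: it takes the window queries as given (in LAQ they would be the pseudo lookahead queries), sums softmax attention over them, and keeps the top positions within the budget. Single-head attention, the summation rule, and all function names are assumptions.

```python
import math
from typing import List

Vector = List[float]


def softmax(xs: List[float]) -> List[float]:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def key_importance(queries: List[Vector], keys: List[Vector]) -> List[float]:
    """Sum each key's attention probability over all window queries."""
    dim = len(keys[0])
    importance = [0.0] * len(keys)
    for q in queries:
        logits = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
                  for k in keys]
        for i, p in enumerate(softmax(logits)):
            importance[i] += p
    return importance


def select_kept_positions(queries: List[Vector], keys: List[Vector],
                          budget: int) -> List[int]:
    """Return the cache positions to keep, in their original order."""
    importance = key_importance(queries, keys)
    top = sorted(range(len(keys)), key=lambda i: importance[i],
                 reverse=True)[:budget]
    return sorted(top)
```

Positions not returned by `select_kept_positions` would be evicted from the cache; the quality of the selection depends on how closely the window queries resemble the queries seen during decoding.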

Title: Language Model Distillation: A Temporal Difference Imitation Learning Perspective

Authors: Zishun Yu, Shangzhe Li, Xinhua Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.20335
Pdf URL: https://arxiv.org/pdf/2505.20335
Copy Paste: [[2505.20335]] Language Model Distillation: A Temporal Difference Imitation Learning Perspective(https://arxiv.org/abs/2505.20335)
Keywords: language model
Abstract: Large language models have led to significant progress across many NLP tasks, although their massive sizes often incur substantial computational costs. Distillation has become a common practice to compress these large and highly capable models into smaller, more efficient ones. Many existing language model distillation methods can be viewed as behavior cloning from the perspective of imitation learning or inverse reinforcement learning. This viewpoint has inspired subsequent studies that leverage (inverse) reinforcement learning techniques, including variations of behavior cloning and temporal difference learning methods. Rather than proposing yet another specific temporal difference method, we introduce a general framework for temporal difference-based distillation by exploiting the distributional sparsity of the teacher model. Specifically, it is often observed that language models assign most probability mass to a small subset of tokens. Motivated by this observation, we design a temporal difference learning framework that operates on a reduced action space (a subset of vocabulary), and demonstrate how practical algorithms can be derived and the resulting performance improvements.
摘要：大型语言模型导致了许多NLP任务的重大进展，尽管它们的大小通常会导致大量的计算成本。蒸馏已成为将这些大型且高度强大的模型压缩成较小，更有效的效率的常见做法。从模仿学习的角度或逆增强学习的角度来看，许多现有的语言模型蒸馏方法可以看作是行为克隆。该观点启发了随后的研究，这些研究利用了（逆）增强学习技术，包括行为克隆和时间差异学习方法的变化。我们没有提出另一种特定的时间差异方法，而是通过利用教师模型的分布稀疏性来引入一个基于时间差异的蒸馏的一般框架。具体而言，通常观察到语言模型将最大概率质量分配给一小部分令牌。在这一观察结果的推动下，我们设计了一个时间差异学习框架，该框架在减少的动作空间（词汇的子集）上运行，并演示了如何得出实用算法并改善了性能的实用算法。

Title: MOSLIM: Align with diverse preferences in prompts through reward classification

Authors: Yu Zhang, Wanli Jiang, Zhengyu Yang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.20336
Pdf URL: https://arxiv.org/pdf/2505.20336
Copy Paste: [[2505.20336]] MOSLIM: Align with diverse preferences in prompts through reward classification(https://arxiv.org/abs/2505.20336)
Keywords: language model, llm, prompt
Abstract: The multi-objective alignment of Large Language Models (LLMs) is essential for ensuring foundational models conform to diverse human preferences. Current research in this field typically involves either multiple policies or multiple reward models customized for various preferences, or the need to train a preference-specific supervised fine-tuning (SFT) model. In this work, we introduce a novel multi-objective alignment method, MOSLIM, which utilizes a single reward model and policy model to address diverse objectives. MOSLIM provides a flexible way to control these objectives through prompting and does not require preference training during the SFT phase, allowing thousands of off-the-shelf models to be directly utilized within this training framework. MOSLIM leverages a multi-head reward model that classifies question-answer pairs instead of scoring them and then optimizes the policy model with a scalar reward derived from a mapping function that converts classification results from the reward model into reward scores. We demonstrate the efficacy of our proposed method across several multi-objective benchmarks and conduct ablation studies on various reward model sizes and policy optimization methods. The MOSLIM method outperforms current multi-objective approaches in most results while requiring significantly fewer GPU computing resources compared with existing policy optimization methods.
摘要：大语言模型（LLM）的多目标对准对于确保符合各种人类偏好的基础模型至关重要。该领域的当前研究通常涉及多个政策或针对各种偏好定制的多个奖励模型，或者需要培训特定于偏好的监督微调（SFT）模型。在这项工作中，我们介绍了一种新型的多目标对准方法，即Moslim，该方法利用单个奖励模型和政策模型来解决各种目标。 Moslim提供了一种灵活的方法来通过提示来控制这些目标，并且不需要在SFT阶段进行偏好培训，从而可以在此培训框架内直接使用成千上万的现成模型。 Moslim利用了一个多头奖励模型，该模型对问答的分类而不是对它们进行分类，然后使用从映射函数中获得的标量奖励来优化策略模型，该映射功能将分类从奖励模型转换为奖励分数。我们证明了我们提出的方法在几个多目标基准中的功效，并就各种奖励模型大小和策略优化方法进行消融研究。与现有策略优化方法相比，MOSLIM方法在大多数结果中的当前多目标方法都优于当前的多目标方法。

Title: Assessing the Capability of LLMs in Solving POSCOMP Questions

Authors: Cayo Viegas, Rohit Gheyi, Márcio Ribeiro
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.20338
Pdf URL: https://arxiv.org/pdf/2505.20338
Copy Paste: [[2505.20338]] Assessing the Capability of LLMs in Solving POSCOMP Questions(https://arxiv.org/abs/2505.20338)
Keywords: language model, gpt, llm, chat
Abstract: Recent advancements in Large Language Models (LLMs) have significantly expanded the capabilities of artificial intelligence in natural language processing tasks. Despite this progress, their performance in specialized domains such as computer science remains relatively unexplored. Understanding the proficiency of LLMs in these domains is critical for evaluating their practical utility and guiding future developments. The POSCOMP, a prestigious Brazilian examination used for graduate admissions in computer science promoted by the Brazilian Computer Society (SBC), provides a challenging benchmark. This study investigates whether LLMs can match or surpass human performance on the POSCOMP exam. Four LLMs - ChatGPT-4, Gemini 1.0 Advanced, Claude 3 Sonnet, and Le Chat Mistral Large - were initially evaluated on the 2022 and 2023 POSCOMP exams. The assessments measured the models' proficiency in handling complex questions typical of the exam. LLM performance was notably better on text-based questions than on image interpretation tasks. In the 2022 exam, ChatGPT-4 led with 57 correct answers out of 69 questions, followed by Gemini 1.0 Advanced (49), Le Chat Mistral (48), and Claude 3 Sonnet (44). Similar trends were observed in the 2023 exam. ChatGPT-4 achieved the highest performance, surpassing all students who took the POSCOMP 2023 exam. LLMs, particularly ChatGPT-4, show promise in text-based tasks on the POSCOMP exam, although image interpretation remains a challenge. Given the rapid evolution of LLMs, we expanded our analysis to include more recent models - o1, Gemini 2.5 Pro, Claude 3.7 Sonnet, and o3-mini-high - evaluated on the 2022-2024 POSCOMP exams. These newer models demonstrate further improvements and consistently surpass both the average and top-performing human participants across all three years.
摘要：大型语言模型（LLM）的最新进展已大大扩大了自然语言处理任务中人工智能的能力。尽管取得了这种进步，但它们在计算机科学等专业领域的表现仍然相对尚未探索。了解LLM在这些领域的熟练程度对于评估其实际实用性和指导未来发展至关重要。 POSCOMP是一项由巴西计算机协会（SBC）推广的计算机科学研究生录取的享有声望的巴西考试，提供了具有挑战性的基准。这项研究调查了LLM在POSOMP考试中是否可以匹配或超过人类绩效。最初在2022年和2023年的POSCOMP考试中评估了四个LLM -Chatgpt -4，Gemini 1.0 Advanced，Claude 3 Sonnet和Le Chat Mistral大型。评估测量了模型在处理考试典型的复杂问题方面的熟练程度。在基于文本的问题上，LLM的性能要比图像解释任务更好。在2022年的考试中，Chatgpt-4在69个问题中获得了57个正确答案，其次是Gemini 1.0 Advanced（49），Le Chat Mistral（48）和Claude 3 Sonnet（44）。在2023年的考试中观察到了类似的趋势。 Chatgpt-4的表现最高，超过了参加POSCOMP 2023考试的所有学生。 LLM，尤其是ChatGpt-4，在POSCOMP考试中的基于文本的任务中显示出希望，尽管图像解释仍然是一个挑战。鉴于LLM的快速演变，我们将分析扩展到包括最新模型-O1，Gemini 2.5 Pro，Claude 3.7十四行诗和O3-Mini-High - 在2022-2024 POSOMP考试中进行了评估。这些较新的模型表现出进一步的进步，并且在三年中始终超过了平均和表现最佳的人类参与者。

Title: Dynamic Manifold Evolution Theory: Modeling and Stability Analysis of Latent Representations in Large Language Models

Authors: Yukun Zhang, Qi Dong
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.20340
Pdf URL: https://arxiv.org/pdf/2505.20340
Copy Paste: [[2505.20340]] Dynamic Manifold Evolution Theory: Modeling and Stability Analysis of Latent Representations in Large Language Models(https://arxiv.org/abs/2505.20340)
Keywords: language model
Abstract: We introduce Dynamic Manifold Evolution Theory (DMET), a unified framework that models large language model generation as a controlled dynamical system evolving on a low-dimensional semantic manifold. By casting latent-state updates as discrete-time Euler approximations of continuous dynamics, we map intrinsic energy-driven flows and context-dependent forces onto Transformer components (residual connections, attention, feed-forward networks). Leveraging Lyapunov stability theory, we define three empirical metrics (state continuity, clustering quality, topological persistence) that quantitatively link latent-trajectory properties to text fluency, grammaticality, and semantic coherence. Extensive experiments across decoding parameters validate DMET's predictions and yield principled guidelines for balancing creativity and consistency in text generation.
摘要：我们介绍了动态歧管演化理论（DMET），这是一个统一的框架，将大型语言模型生成模型建模为在低_D二维语义歧管上演变的受控动力系统。通过将statent_state施放为连续动力学的离散时间EULER近似，我们将intinsic Energy_驱动的流和context_依赖性力映射到变压器组件（残差连接，注意力，馈送方向网络）上。利用Lyapunov稳定性理论，我们定义了三个经验指标（状态连续性，聚类质量，拓扑持久性），它们将stitent_trajectory属性链接到文本流利度，语法和语义相干性。跨解码参数进行的广泛实验验证了DMET的预测和产量原则上的指南，以平衡文本生成的创造力和一致性。

Title: Do LLMs have a Gender (Entropy) Bias?

Authors: Sonal Prabhune, Balaji Padmanabhan, Kaushik Dutta
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.20343
Pdf URL: https://arxiv.org/pdf/2505.20343
Copy Paste: [[2505.20343]] Do LLMs have a Gender (Entropy) Bias?(https://arxiv.org/abs/2505.20343)
Keywords: gpt, llm, prompt, chat
Abstract: We investigate the existence and persistence of a specific type of gender bias in some of the popular LLMs and contribute a new benchmark dataset, RealWorldQuestioning (released on HuggingFace), developed from real-world questions across four key domains in business and health contexts: education, jobs, personal financial management, and general health. We define and study entropy bias, which we define as a discrepancy in the amount of information generated by an LLM in response to real questions users have asked. We tested this using four different LLMs and evaluated the generated responses both qualitatively and quantitatively by using ChatGPT-4o (as "LLM-as-judge"). Our analyses (metric-based comparisons and "LLM-as-judge" evaluation) suggest that there is no significant bias in LLM responses for men and women at a category level. However, at a finer granularity (the individual question level), there are substantial differences in LLM responses for men and women in the majority of cases, which often "cancel" each other out because some responses are better for men and others are better for women. This is still a concern since typical users of these tools often ask only a specific question, as opposed to several varied ones, in each of these common yet important areas of life. We suggest a simple debiasing approach that iteratively merges the responses for the two genders to produce a final result. Our approach demonstrates that a simple, prompt-based debiasing strategy can effectively debias LLM outputs, thus producing responses with higher information content than both gendered variants in 78% of the cases, and consistently achieving a balanced integration in the remaining cases.
摘要：我们研究了一些流行的LLM中一种特定类型的性别偏见的存在和持久性，并贡献了一个新的基准数据集，RealworldQuestioning（在Huggingface上发布），这是从商业和健康环境中四个关键领域的现实世界中开发的：教育，工作，工作，个人财务管理以及一般健康。我们定义和研究熵偏见，我们将其定义为响应用户提出的实际问题而产生的LLM产生的信息量的差异。我们使用四个不同的LLMS对此进行了测试，并通过使用ChatGpt-4O（为“ LLM-AS-Gudge”）对生成的响应进行了定性和定量评估。我们的分析（基于公制的比较和“ LLM-AS-AS-Gudge”评估）表明，在类别级别上，男女的LLM反应没有明显的偏见。但是，从更细的粒度（个体问题水平）上，在大多数情况下，男性和女性的LLM反应存在很大差异，由于某些回答对男性的反应更好，反之亦然。这仍然是一个问题，因为这些工具的典型用户通常会提出一个特定的问题（仅），而不是在这些共同但重要的生活领域中的几个不同的问题。我们建议一种简单的证词方法，它迭代地将两个性别产生最终结果的反应融合在一起。我们的方法表明，基于简单的基于及时的偏差策略可以有效地debias llm输出，从而在78％的情况下产生更高的信息含量的响应，并在其余情况下始终达到平衡的整合。
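
Illustrative note: the entropy-bias idea in the abstract above can be sketched as a comparison of how much information two responses carry. The sketch below is only an assumption-laden illustration: the whitespace tokenization, the function names, and the use of word-level Shannon entropy as a proxy for "amount of information" are all hypothetical choices, since the abstract does not specify the authors' exact metric.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Shannon entropy (in bits) of the empirical word distribution of `text`.

    Assumption: lowercase whitespace tokenization stands in for whatever
    tokenization a real study would use.
    """
    words = text.lower().split()
    if not words:
        return 0.0
    counts = Counter(words)
    total = len(words)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def entropy_gap(response_a: str, response_b: str) -> float:
    """Positive when response_a has higher word-level entropy than response_b."""
    return shannon_entropy(response_a) - shannon_entropy(response_b)
```

A gap near zero averaged over a category could still hide large per-question gaps of opposite sign, which is the cancellation effect the abstract describes.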

Title: SeRL: Self-Play Reinforcement Learning for Large Language Models with Limited Data

Authors: Wenkai Fang, Shunyu Liu, Yang Zhou, Kongcheng Zhang, Tongya Zheng, Kaixuan Chen, Mingli Song, Dacheng Tao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.20347
Pdf URL: https://arxiv.org/pdf/2505.20347
Copy Paste: [[2505.20347]] SeRL: Self-Play Reinforcement Learning for Large Language Models with Limited Data(https://arxiv.org/abs/2505.20347)
Keywords: language model, llm
Abstract: Recent advances have demonstrated the effectiveness of Reinforcement Learning (RL) in improving the reasoning capabilities of Large Language Models (LLMs). However, existing works inevitably rely on high-quality instructions and verifiable rewards for effective training, both of which are often difficult to obtain in specialized domains. In this paper, we propose Self-play Reinforcement Learning (SeRL) to bootstrap LLM training with limited initial data. Specifically, SeRL comprises two complementary modules: self-instruction and self-rewarding. The former module generates additional instructions based on the available data at each training step, employing robust online filtering strategies to ensure instruction quality, diversity, and difficulty. The latter module introduces a simple yet effective majority-voting mechanism to estimate response rewards for additional instructions, eliminating the need for external annotations. Finally, SeRL performs conventional RL based on the generated data, facilitating iterative self-play learning. Extensive experiments on various reasoning benchmarks and across different LLM backbones demonstrate that the proposed SeRL yields results superior to its counterparts and achieves performance on par with those obtained by high-quality data with verifiable rewards. Our code is available at this https URL.
摘要：最近的进步表明，加强学习（RL）在改善大语言模型（LLMS）的推理能力方面的有效性。但是，现有的作品不可避免地依靠高质量的说明和可验证的奖励来进行有效的培训，这两者通常都在专业领域很难获得。在本文中，我们提出了自我游戏加强学习（SERL），以进行有限的初始数据进行引导LLM培训。具体而言，SERL包括两个互补的模块：自我指导和自我奖励。前一个模块根据每个培训步骤在可用数据中生成其他说明，采用强大的在线过滤策略来确保指导质量，多样性和困难。后一个模块引入了一种简单而有效的多数投票机制，以估算响应奖励以获取其他说明，从而消除了对外部注释的需求。最后，SERL根据生成的数据执行常规RL，从而促进迭代自我播放学习。对各种推理基准和不同LLM骨架的广泛实验表明，所提出的SERL产量的结果优于其对应物，并且与具有可验证奖励的高质量数据获得的相同的表现。我们的代码可在此HTTPS URL上找到。
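
Illustrative note: the majority-voting reward mentioned in the abstract above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name, the binary 1.0/0.0 reward, and the exact-match comparison of final answers are hypothetical choices made for illustration.

```python
from collections import Counter


def majority_vote_rewards(answers):
    """Estimate rewards for sampled responses without external labels.

    `answers` holds the final answers extracted from several sampled
    responses to the same instruction. The most frequent answer is treated
    as a pseudo-label; each response gets reward 1.0 if it agrees with that
    pseudo-label and 0.0 otherwise (an assumed reward scheme).
    """
    if not answers:
        return None, []
    # most_common(1) returns the single most frequent answer; how ties are
    # broken is left to Counter and is not meaningful here.
    majority, _ = Counter(answers).most_common(1)[0]
    rewards = [1.0 if a == majority else 0.0 for a in answers]
    return majority, rewards
```

For example, sampled answers `["42", "41", "42"]` would yield the pseudo-label `"42"` and rewards `[1.0, 0.0, 1.0]`.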

Title: Rethinking Text-based Protein Understanding: Retrieval or LLM?

Authors: Juntong Wu, Zijing Liu, He Cao, Hao Li, Bin Feng, Zishan Shu, Ke Yu, Li Yuan, Yu Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.20354
Pdf URL: https://arxiv.org/pdf/2505.20354
Copy Paste: [[2505.20354]] Rethinking Text-based Protein Understanding: Retrieval or LLM?(https://arxiv.org/abs/2505.20354)
Keywords: language model, llm
Abstract: In recent years, protein-text models have gained significant attention for their potential in protein generation and understanding. Current approaches focus on integrating protein-related knowledge into large language models through continued pretraining and multi-modal alignment, enabling simultaneous comprehension of textual descriptions and protein sequences. Through a thorough analysis of existing model architectures and text-based protein understanding benchmarks, we identify significant data leakage issues present in current benchmarks. Moreover, conventional metrics derived from natural language processing fail to accurately assess the model's performance in this domain. To address these limitations, we reorganize existing datasets and introduce a novel evaluation framework based on biological entities. Motivated by our observation, we propose a retrieval-enhanced method, which significantly outperforms fine-tuned LLMs for protein-to-text generation and shows accuracy and efficiency in training-free scenarios. Our code and data can be seen at this https URL.
摘要：近年来，蛋白质文本模型因其在蛋白质产生和理解中的潜力而引起了人们的重视。当前的方法着重于通过持续预处理和多模式对准将蛋白质相关知识纳入大语言模型，从而可以同时理解文本描述和蛋白质序列。通过对现有模型架构和基于文本的蛋白质理解基准的透彻分析，我们确定了当前基准中存在的重要数据泄漏问题。此外，从自然语言处理中得出的常规指标无法准确评估该模型在该领域的性能。为了解决这些限制，我们重组现有数据集并基于生物实体引入新的评估框架。在我们的观察过程中，我们提出了一种检索增强的方法，该方法极大地超过了微调的LLM，用于蛋白质到文本生成，并显示了无训练场景的准确性和效率。我们的代码和数据可以在此HTTPS URL上看到。

Title: Enhancing Logical Reasoning in Language Models via Symbolically-Guided Monte Carlo Process Supervision

Authors: Xingwei Tan, Marco Valentino, Mahmud Akhter, Maria Liakata, Nikolaos Aletras
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.20415
Pdf URL: https://arxiv.org/pdf/2505.20415
Copy Paste: [[2505.20415]] Enhancing Logical Reasoning in Language Models via Symbolically-Guided Monte Carlo Process Supervision(https://arxiv.org/abs/2505.20415)
Keywords: language model, llm
Abstract: Large language models (LLMs) have shown promising performance in mathematical and logical reasoning benchmarks. However, recent studies have pointed to memorization, rather than generalization, as one of the leading causes for such performance. LLMs, in fact, are susceptible to content variations, demonstrating a lack of robust symbolic abstractions supporting their reasoning process. To improve reliability, many attempts have been made to combine LLMs with symbolic methods. Nevertheless, existing approaches fail to effectively leverage symbolic representations due to the challenges involved in developing reliable and scalable verification mechanisms. In this paper, we propose to overcome such limitations by generating symbolic reasoning trajectories and selecting the high-quality ones using a process reward model automatically tuned based on Monte Carlo estimation. The trajectories are then employed via fine-tuning methods to improve logical reasoning and generalization. Our results on logical reasoning benchmarks such as FOLIO and LogicAsker show the effectiveness of the proposed method with large gains on frontier and open-weight models. Moreover, additional experiments on claim verification reveal that fine-tuning on the generated symbolic reasoning trajectories enhances out-of-domain generalizability, suggesting the potential impact of symbolically-guided process supervision in alleviating the effect of memorization on LLM reasoning.
摘要：大型语言模型（LLMS）在数学和逻辑推理基准中表现出了有希望的表现。但是，最近的研究指出，记忆而不是概括是这种表现的主要原因之一。实际上，LLMS容易受到内容变化的影响，表明缺乏支持其推理过程的符合符号抽象。为了提高可靠性，已经尝试将LLM与符号方法相结合。然而，由于开发可靠且可扩展的验证机制所涉及的挑战，现有方法无法有效利用符号表示。在本文中，我们建议通过生成符号推理轨迹来克服此类局限性，并使用基于蒙特卡洛估计的工艺奖励模型选择高质量的轨迹。然后通过微调方法采用轨迹来改善逻辑推理和概括。我们在逻辑推理基准（例如对开本和Logicasker）上的结果显示了该方法在前沿和开放式模型上具有很大收益的有效性。此外，有关索赔验证的其他实验表明，对产生的符号推理轨迹进行微调增强了域外的通用性，这表明象征性引导的过程监督对减轻记忆对LLM推理的影响的潜在影响。

Title: GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven Synthetic Data Generation

Authors: Zihong Chen, Wanli Jiang, Jinzhe Li, Zhonghang Yuan, Huanjun Kong, Wanli Ouyang, Nanqing Dong
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.20416
Pdf URL: https://arxiv.org/pdf/2505.20416
Copy Paste: [[2505.20416]] GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven Synthetic Data Generation(https://arxiv.org/abs/2505.20416)
Keywords: language model, llm
Abstract: Fine-tuning for large language models (LLMs) typically requires substantial amounts of high-quality supervised data, which is both costly and labor-intensive to acquire. While synthetic data generation has emerged as a promising solution, existing approaches frequently suffer from factual inaccuracies, insufficient long-tail coverage, simplistic knowledge structures, and homogenized outputs. To address these challenges, we introduce GraphGen, a knowledge graph-guided framework designed for three key question-answering (QA) scenarios: atomic QA, aggregated QA, and multi-hop QA. It begins by constructing a fine-grained knowledge graph from the source text. It then identifies knowledge gaps in LLMs using the expected calibration error metric, prioritizing the generation of QA pairs that target high-value, long-tail knowledge. Furthermore, GraphGen incorporates multi-hop neighborhood sampling to capture complex relational information and employs style-controlled generation to diversify the resulting QA data. Experimental results on knowledge-intensive tasks under closed-book settings demonstrate that GraphGen outperforms conventional synthetic data methods, offering a more reliable and comprehensive solution to the data scarcity challenge in supervised fine-tuning. The code and data are publicly available at this https URL.
摘要：大型语言模型（LLM）的微调通常需要大量的高质量监督数据，这既昂贵又富有劳动力的收购。尽管合成数据的生成已经成为有前途的解决方案，但现有的方法经常遭受事实不准确，长尾覆盖不足，简单知识结构和均质输出的影响。为了应对这些挑战，我们介绍了GraphGen，这是一个知识图指导的框架，专为三个关键的问题避开（QA）方案：原子QA，汇总QA和多跳QA。它首先从源文本中构造细粒度的知识图。然后，它使用预期的校准误差度量来识别LLMS中的知识差距，并优先针对针对高价值，长尾知识的QA对生成。此外，GraphGen结合了多跳的邻域抽样，以捕获复杂的关系信息，并采用样式控制的生成来使所得的QA数据多样化。关于封装设置下的知识密集任务的实验结果表明，GraphGen的表现优于常规合成数据方法，为监督微调中的数据稀缺挑战提供了更可靠，更全面的解决方案。代码和数据可在此HTTPS URL上公开可用。
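
Illustrative note: the abstract above names the expected calibration error (ECE) as the signal for finding knowledge gaps. A minimal sketch of the standard binned ECE follows; the function name, the equal-width binning, and the default of 10 bins are assumptions, and the abstract does not say how GraphGen obtains confidences or applies the metric.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean of |accuracy - mean confidence| per bin.

    `confidences` are predicted probabilities in [0, 1]; `correct` are
    matching booleans saying whether each prediction was right.
    """
    n = len(confidences)
    if n == 0:
        return 0.0
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins; a confidence of exactly 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, float(ok)))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(o for _, o in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece
```

A large gap between confidence and accuracy on statements about some part of the knowledge graph would, under this reading, mark that part as a candidate for targeted QA generation.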

Title: SEMMA: A Semantic Aware Knowledge Graph Foundation Model

Authors: Arvindh Arun, Sumit Kumar, Mojtaba Nayyeri, Bo Xiong, Ponnurangam Kumaraguru, Antonio Vergari, Steffen Staab
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.20422
Pdf URL: https://arxiv.org/pdf/2505.20422
Copy Paste: [[2505.20422]] SEMMA: A Semantic Aware Knowledge Graph Foundation Model(https://arxiv.org/abs/2505.20422)
Keywords: language model, llm
Abstract: Knowledge Graph Foundation Models (KGFMs) have shown promise in enabling zero-shot reasoning over unseen graphs by learning transferable patterns. However, most existing KGFMs rely solely on graph structure, overlooking the rich semantic signals encoded in textual attributes. We introduce SEMMA, a dual-module KGFM that systematically integrates transferable textual semantics alongside structure. SEMMA leverages Large Language Models (LLMs) to enrich relation identifiers, generating semantic embeddings that subsequently form a textual relation graph, which is fused with the structural component. Across 54 diverse KGs, SEMMA outperforms purely structural baselines like ULTRA in fully inductive link prediction. Crucially, we show that in more challenging generalization settings, where the test-time relation vocabulary is entirely unseen, structural methods collapse while SEMMA is 2x more effective. Our findings demonstrate that textual semantics are critical for generalization in settings where structure alone fails, highlighting the need for foundation models that unify structural and linguistic signals in knowledge reasoning.
摘要：知识图基础模型（kgfms）在通过学习可转移模式通过不见图表实现零射击推理方面表现出了希望。但是，大多数现有的kgfms仅依赖于图形结构，忽略了文本属性中编码的丰富语义信号。我们介绍了SEMMA，这是一种双模型kgfm，该kgfm系统地将可转移的文本语义与结构旁边集成在一起。 SEMMA利用大型语言模型（LLM）来丰富关系标识符，生成语义嵌入，随后形成文本关系图，该图与结构分量融合在一起。在54种不同的公斤中，SEMMA在完全感应链路预测中纯粹优于纯粹的结构基线，例如Ultra。至关重要的是，我们表明，在更具挑战性的概括环境中，测试时间关系词汇完全看不见，结构方法崩溃，而SEMMA则更有效。我们的发现表明，文本语义对于仅结构失败的设置中的概括至关重要，强调了对知识推理中结构和语言信号的基础模型的需求。

Title: HAMburger: Accelerating LLM Inference via Token Smashing

Authors: Jingyu Liu, Ce Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.20438
Pdf URL: https://arxiv.org/pdf/2505.20438
Copy Paste: [[2505.20438]] HAMburger: Accelerating LLM Inference via Token Smashing(https://arxiv.org/abs/2505.20438)
Keywords: language model, llm
Abstract: The growing demand for efficient Large Language Model (LLM) inference requires a holistic optimization on algorithms, systems, and hardware. However, very few works have fundamentally changed the generation pattern: each token needs one forward pass and one KV cache. This can be sub-optimal because we found that LLMs are extremely capable of self-identifying the exact dose of information that a single KV cache can store, and many tokens can be generated confidently without global context. Based on this insight, we introduce HAMburger, a Hierarchically Auto-regressive Model that redefines resource allocation in LLMs by moving beyond uniform computation and storage per token during inference. Stacking a compositional embedder and a micro-step decoder in between a base LLM, HAMburger smashes multiple tokens into a single KV and generates several tokens per step. Additionally, HAMburger functions as a speculative decoding framework where it can blindly trust self-drafted tokens. As a result, HAMburger shifts the growth of KV cache and forward FLOPs from linear to sub-linear with respect to output length, and adjusts its inference speed based on query perplexity and output structure. Extensive evaluations show that HAMburger reduces the KV cache computation by up to 2x and achieves up to 2x TPS, while maintaining quality in both short- and long-context tasks. Our method explores an extremely challenging inference regime that requires both computation- and memory-efficiency with a hardware-agnostic design.
摘要：对高效大语言模型（LLM）推断的需求不断增长，需要对算法，系统和硬件进行整体优化。但是，很少有作品从根本上改变了生成模式：每个令牌都需要一个向前的通行证和一个KV缓存。这可能是次优的，因为我们发现LLM非常能够自我识别单个KV缓存可以存储的确切信息，并且许多令牌可以自信地生成而没有全局上下文。基于此见解，我们介绍了汉堡，这是一种层次自动回归模型，通过在推理过程中超越统一的计算和存储空间来重新定义LLMS中的资源分配。 Hamburger将组成嵌入器和微步解码器堆叠在基础LLM之间，将多个令牌粉碎成单个KV，并每步生成几个令牌。此外，汉堡包充当投机解码框架，它可以盲目信任自由放射的令牌。结果，汉堡包将KV缓存和前向触发器的生长从线性转移到子线性相对于输出长度，并根据查询的困惑和输出结构调整其推理速度。广泛的评估表明，汉堡包可将KV缓存计算降低2 $ \ times $，并达到高达2 $ \ times $ tps，同时保持短篇小说和长篇小说任务的质量。我们的方法探索了一种极具挑战性的推理制度，需要使用硬件 - 静态设计同时计算和内存效率。

Title: In-context Language Learning for Endangered Languages in Speech Recognition

Authors: Zhaolin Li, Jan Niehues
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.20445
Pdf URL: https://arxiv.org/pdf/2505.20445
Copy Paste: [[2505.20445]] In-context Language Learning for Endangered Languages in Speech Recognition(https://arxiv.org/abs/2505.20445)
Keywords: language model, llm
Abstract: With approximately 7,000 languages spoken worldwide, current large language models (LLMs) support only a small subset. Prior research indicates LLMs can learn new languages for certain tasks without supervised data. We extend this investigation to speech recognition, investigating whether LLMs can learn unseen, low-resource languages through in-context learning (ICL). With experiments on four diverse endangered languages that LLMs have not been trained on, we find that providing more relevant text samples enhances performance in both language modelling and Automatic Speech Recognition (ASR) tasks. Furthermore, we show that the probability-based approach outperforms the traditional instruction-based approach in language learning. Lastly, we show ICL enables LLMs to achieve ASR performance that is comparable to or even surpasses dedicated language models trained specifically for these languages, while preserving the original capabilities of the LLMs.
摘要：当前的大型语言模型（LLMS）只有大约7,000种语言，仅支持一个小子集。先前的研究表明，LLM可以在没有监督数据的情况下学习某些任务的新语言。我们将此调查扩展到语音识别，研究LLM是否可以通过内在学习（ICL）学习看不见的低资源语言。通过对尚未培训LLM的四种不同濒危语言的实验，我们发现提供更相关的文本样本可以增强语言建模和自动语音识别（ASR）任务的性能。此外，我们表明，基于概率的方法在语言学习中优于基于传统的教学方法。最后，我们显示ICL使LLM可以实现与专门针对这些语言训练的专用语言模型相当甚至超过专门训练的ASR性能，同时保留了LLM的原始功能。

Title: Amulet: Putting Complex Multi-Turn Conversations on the Stand with LLM Juries

Authors: Sahana Ramnath, Anurag Mudgil, Brihi Joshi, Skyler Hallinan, Xiang Ren
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.20451
Pdf URL: https://arxiv.org/pdf/2505.20451
Copy Paste: [[2505.20451]] Amulet: Putting Complex Multi-Turn Conversations on the Stand with LLM Juries(https://arxiv.org/abs/2505.20451)
Keywords: language model, llm
Abstract: Today, large language models are widely used as judges to evaluate responses from other language models. Hence, it is imperative to benchmark and improve these LLM-judges on real-world language model usage: a typical human-assistant conversation is lengthy, and shows significant diversity in topics, intents, and requirements across turns, e.g. social interactions, task requests, feedback. We present Amulet, a framework that leverages pertinent linguistic concepts of dialog-acts and maxims to improve the accuracy of LLM-judges on preference data with complex, multi-turn conversational context. Amulet presents valuable insights about (a) the communicative structures and intents present in the conversation (dialog acts), and (b) the satisfaction of conversational principles (maxims) by the preference responses, and uses them to make judgments. On four challenging datasets, Amulet shows that (a) humans frequently (60 to 70 percent of the time) change their intents from one turn of the conversation to the next, and (b) in 75 percent of instances, the preference responses can be differentiated via dialog acts and/or maxims, reiterating the latter's significance in judging such data. Amulet can be used either as a judge by applying the framework to a single LLM, or integrated into a jury with different LLM judges; our judges and juries show strong improvements on relevant baselines for all four datasets.
摘要：如今，大型语言模型被广泛用作评估其他语言模型的响应的法官。因此，必须对现实世界语言模型使用基准和改进这些LLM判断力：典型的人类辅助对话很长，并且在转弯的主题，意图和要求上显示出显着的多样性，例如。社交互动，任务请求，反馈。我们提出护身符，该框架利用对话行为和格言的相关语言概念来提高LLM-gudges在偏好数据上的准确性，并具有复杂的，多转的对话环境。护身符对（a）对话中存在的沟通结构和意图（对话行为）和（b）通过偏好响应满足对话原理（Maxims）的宝贵见解，并利用它们来做出判断。在四个具有挑战性的数据集中，护身符表明（a）人类（60％到70％的时间）经常将意图从对话的一转变为下一个，并且（b）在75％的实例中，可以通过对话框和/或最大值来区分偏好响应，从而在后面的对话中重复使用此类数据的重要性。护身符可以通过将框架应用于单个LLM或集成到具有不同LLM法官的陪审团中来将其用作法官；我们的法官和陪审团在所有四个数据集的相关基线上都表现出很大的改善。

Title: Conversation Kernels: A Flexible Mechanism to Learn Relevant Context for Online Conversation Understanding

Authors: Vibhor Agarwal, Arjoo Gupta, Suparna De, Nishanth Sastry
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.20482
Pdf URL: https://arxiv.org/pdf/2505.20482
Copy Paste: [[2505.20482]] Conversation Kernels: A Flexible Mechanism to Learn Relevant Context for Online Conversation Understanding(https://arxiv.org/abs/2505.20482)
Keywords: language model
Abstract: Understanding online conversations has attracted research attention with the growth of social networks and online discussion forums. Content analysis of posts and replies in online conversations is difficult because each individual utterance is usually short and may implicitly refer to other posts within the same conversation. Thus, understanding individual posts requires capturing the conversational context and dependencies between different parts of a conversation tree and then encoding the context dependencies between posts and comments/replies into the language model. To this end, we propose a general-purpose mechanism to discover appropriate conversational context for various aspects about an online post in a conversation, such as whether it is informative, insightful, interesting or funny. Specifically, we design two families of Conversation Kernels, which explore different parts of the neighborhood of a post in the tree representing the conversation and through this, build relevant conversational context that is appropriate for each task being considered. We apply our developed method to conversations crawled from this http URL, which allows users to apply highly different labels to posts, such as 'insightful', 'funny', etc., and therefore provides an ideal experimental platform to study whether a framework such as Conversation Kernels is general-purpose and flexible enough to be adapted to disparately different conversation understanding tasks.
摘要：了解在线对话通过社交网络和在线讨论论坛的增长引起了研究的关注。在线对话中的帖子和答复的内容分析很困难，因为每个单独的话语通常都短，并且可能隐含地指在同一对话中的其他帖子。因此，了解各个帖子需要捕获对话树不同部分之间的对话上下文和依赖关系，然后编码帖子和评论/答复之间的上下文依赖关系。为此，我们提出了一种通用机制，以发现有关对话中有关在线帖子的各个方面的适当对话环境，例如它是有益，有见地，有趣还是有趣的。具体来说，我们设计了两个对话内核家族，它们探索了代表对话的树中帖子附近的不同部分，并构建相关的对话上下文，这些上下文适用于所考虑的每个任务。我们将开发的方法应用于此HTTP URL的对话，该对话允许用户将高度不同的标签应用于诸如“有见地”，“ Funny”等帖子等帖子，因此提供了一个理想的实验平台来研究诸如对话内核之类的框架是否是通用的，并且是否具有足够的灵活性，以适应足够的适应性，以适应渐进地不同的对话理解任务。

Title: InFact: Informativeness Alignment for Improved LLM Factuality

Authors: Roi Cohen, Russa Biswas, Gerard de Melo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.20487
Pdf URL: https://arxiv.org/pdf/2505.20487
Copy Paste: [[2505.20487]] InFact: Informativeness Alignment for Improved LLM Factuality(https://arxiv.org/abs/2505.20487)
Keywords: llm
Abstract: Factual completeness is a general term that captures how detailed and informative a factually correct text is. For instance, the sentence "Barack Obama was born in the United States" is factually correct, though less informative than the sentence "Barack Obama was born in Honolulu, Hawaii, United States". Despite the known fact that LLMs tend to hallucinate and generate factually incorrect text, they might also tend to generate text that is factually correct yet less informative than other available choices. In this work, we tackle this problem by proposing an informativeness alignment mechanism. This mechanism takes advantage of recent factual benchmarks to propose an informativeness alignment objective. This objective prioritizes answers that are both correct and informative. A key finding of our work is that when training a model to maximize this objective or optimize its preference, we can improve not just informativeness but also factuality.
摘要：事实完整性是一个一般术语，可捕获事实正确的文本的详细和信息。例如，事实判决``巴拉克·奥巴马（Barack Obama）出生于美国''是正确的，尽管比事实判决``巴拉克·奥巴马（Barack Obama）出生于美国夏威夷州檀香山''。尽管已知LLM倾向于幻觉和产生事实不正确的文本，但它们也可能倾向于选择产生实际上是正确且与其他更有信息的选择相比，实际上是正确且信息不足的事实文本。在这项工作中，我们通过提出信息对准机制来解决这个问题。该机制利用了最新的事实基准提出信息对准目标。该目标优先考虑正确且内容丰富的答案。我们工作的关键发现是，当训练模型以最大化这一目标或优化其偏好时，我们不仅可以提高信息性，而且可以提高事实。

Title: Beyond Keywords: Evaluating Large Language Model Classification of Nuanced Ableism

Authors: Naba Rizvi, Harper Strickland, Saleha Ahmedi, Aekta Kallepalli, Isha Khirwadkar, William Wu, Imani N. S. Munyaka, Nedjma Ousidhoum
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.20500
Pdf URL: https://arxiv.org/pdf/2505.20500
Copy Paste: [[2505.20500]] Beyond Keywords: Evaluating Large Language Model Classification of Nuanced Ableism(https://arxiv.org/abs/2505.20500)
Keywords: language model, llm
Abstract: Large language models (LLMs) are increasingly used in decision-making tasks like résumé screening and content moderation, giving them the power to amplify or suppress certain perspectives. While previous research has identified disability-related biases in LLMs, little is known about how they conceptualize ableism or detect it in text. We evaluate the ability of four LLMs to identify nuanced ableism directed at autistic individuals. We examine the gap between their understanding of relevant terminology and their effectiveness in recognizing ableist content in context. Our results reveal that LLMs can identify autism-related language but often miss harmful or offensive connotations. Further, we conduct a qualitative comparison of human and LLM explanations. We find that LLMs tend to rely on surface-level keyword matching, leading to context misinterpretations, in contrast to human annotators who consider context, speaker identity, and potential impact. On the other hand, both LLMs and humans agree on the annotation scheme, suggesting that a binary classification is adequate for evaluating LLM performance, which is consistent with findings from prior studies involving human annotators.
摘要：大型语言模型（LLM）越来越多地用于诸如简历筛选和内容审核之类的决策任务中，从而使他们有能力放大或抑制某些观点。尽管以前的研究已经确定了LLMS中与残疾相关的偏见，但对它们如何概念化能力或在文本中检测到它的知识知之甚少。我们评估了四个LLM识别针对自闭症患者的细微差别能力的能力。我们研究了他们对相关术语的理解与它们在上下文中识别能干内容方面的有效性之间的差距。我们的结果表明，LLM可以识别与自闭症相关的语言，但通常会错过有害或冒犯性的含义。此外，我们对人类和LLM解释进行了定性比较。我们发现，与考虑上下文，说话者身份和潜在影响的人类注释者相比，LLM倾向于依靠表面级关键字匹配，导致上下文误解。另一方面，LLM和人类都同意注释方案，这表明二元分类足以评估LLM性能，这与涉及人类注释者的先前研究的发现一致。

Title: Gatsby Without the 'E': Crafting Lipograms with LLMs

Authors: Rohan Balasubramanian, Nitish Gokulakrishnan, Syeda Jannatus Saba, Steven Skiena
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.20501
Pdf URL: https://arxiv.org/pdf/2505.20501
Copy Paste: [[2505.20501]] Gatsby Without the 'E': Crafting Lipograms with LLMs(https://arxiv.org/abs/2505.20501)
Keywords: language model, llm
Abstract: Lipograms are a unique form of constrained writing where all occurrences of a particular letter are excluded from the text, typified by the novel Gadsby, which daringly avoids all usage of the letter 'e'. In this study, we explore the power of modern large language models (LLMs) by transforming F. Scott Fitzgerald's novel The Great Gatsby into a fully 'e'-less text. We experimented with a range of techniques, from baseline methods like synonym replacement to sophisticated generative models enhanced with beam search and named entity analysis. We show that excluding up to 3.6% of the most common letters (up to the letter 'u') had minimal impact on the text's meaning, although translation fidelity rapidly and predictably decays with stronger lipogram constraints. Our work highlights the surprising flexibility of English under strict constraints, revealing just how adaptable and creative language can be.
摘要：lipograms是一种独特的限制写作形式，其中所有特定字母的出现都被排除在文本中，以小说gadsby为代表，该文本大胆地避免了字母“ e”的所有用法。在这项研究中，我们通过将小说Scott Fitzgerald的《 Great Gatsby》转换为完全'E'-e'-少数文本，探索现代大型语言模型（LLM）的力量。我们尝试了一系列技术，从基线方法（例如同义词替代品）到通过梁搜索和命名实体分析增强的复杂生成模型。我们表明，不包括多达3.6％的最常见字母（直到字母“ u”）对文本的含义的影响最小，尽管翻译富度迅速且可预测地衰减具有更强的脂肪图约束。我们的工作突出了严格的限制下的英语灵活性令人惊讶的灵活性，揭示了如何适应和创造性的语言。
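
Illustrative note: the lipogram constraint described in the abstract above is easy to check mechanically. The sketch below is a hypothetical helper (its name and whitespace word splitting are assumptions, not part of the paper) that lists the words in a candidate rewrite that still contain a banned letter.

```python
def lipogram_violations(text, banned="e"):
    """Return the words in `text` that contain any banned letter.

    The comparison is case-insensitive; words are split on whitespace,
    so attached punctuation stays part of the word.
    """
    banned_set = set(banned.lower())
    return [word for word in text.split() if banned_set & set(word.lower())]
```

An empty result means the text satisfies the constraint; passing several letters in `banned` models the stronger constraints the abstract mentions.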

Title: Large Language Models for IT Automation Tasks: Are We There Yet?

Authors: Md Mahadi Hassan, John Salvador, Akond Rahman, Santu Karmaker
Subjects: cs.CL, cs.SE
Abstract URL: https://arxiv.org/abs/2505.20505
Pdf URL: https://arxiv.org/pdf/2505.20505
Copy Paste: [[2505.20505]] Large Language Models for IT Automation Tasks: Are We There Yet?(https://arxiv.org/abs/2505.20505)
Keywords: language model, llm
Abstract: LLMs show promise in code generation, yet their effectiveness for IT automation tasks, particularly for tools like Ansible, remains understudied. Existing benchmarks rely primarily on synthetic tasks that fail to capture the needs of practitioners who use IT automation tools, such as Ansible. We present ITAB (IT Automation Task Benchmark), a benchmark of 126 diverse tasks (e.g., configuring servers, managing files) where each task accounts for state reconciliation: a property unique to IT automation tools. ITAB evaluates LLMs' ability to generate functional Ansible automation scripts via dynamic execution in controlled environments. We evaluate 14 open-source LLMs, none of which accomplish pass@10 at a rate beyond 12%. To explain these low scores, we analyze 1,411 execution failures across the evaluated LLMs and identify two main categories of prevalent semantic errors: failures in state reconciliation related reasoning (44.87% combined from variable (11.43%), host (11.84%), path (11.63%), and template (9.97%) issues) and deficiencies in module-specific execution knowledge (24.37% combined from attribute and parameter (14.44%) and module (9.93%) errors). Our findings reveal key limitations in open-source LLMs' ability to track state changes and apply specialized module knowledge, indicating that reliable IT automation will require major advances in state reasoning and domain-specific execution understanding.
摘要：LLM在代码生成中表现出希望，但是它们对IT自动化任务的有效性，尤其是对于诸如Ansible之类的工具，仍在研究中。现有的基准主要依赖于未能捕获使用IT自动化工具的从业者（例如Ansible）的从业者的需求的合成任务。我们提出ITAB（IT Automation Task -Benchmark），这是126个不同任务（例如，配置服务器，管理文件）的基准，其中每个任务都在于状态核对：IT自动化工具所独有的属性。 ITAB评估了LLMS在受控环境中通过动态执行生成功能性和自动化脚本的能力。我们评估了14个开源LLM，没有一个以超过12％的速度完成@10。为了解释这些较低的分数，我们分析了评估的LLMS中的1,411个执行失败，并确定了两个主要语义错误的主要类别：状态和解相关推理中的失败（44.87％的变量（11.43％）（11.43％）（11.84％），路径（11.63％）（11.63％）和模式（9.97％）和deficiencienciencienciencienciencience and s-Sprience in Comession（11.84％）（24.37％的属性和参数（14.44％）和模块（9.93％）错误合并）。我们的发现揭示了开源LLMS跟踪状态变化和应用专业模块知识的能力的关键局限性，这表明可靠的IT自动化将需要在州推理和特定领域的执行理解方面取得重大进步。
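
Illustrative note: the abstract above reports pass@10 but does not say how it is computed. A commonly used formulation is the unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n samples are generated per task and c of them pass. The sketch below assumes that formulation; the function name and argument names are hypothetical.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one task.

    n: number of generated samples, c: number that passed, k: budget.
    Equals the probability that at least one of k samples drawn without
    replacement from the n generated samples is a passing one.
    """
    if k > n:
        raise ValueError("k must not exceed the number of samples n")
    if n - c < k:
        # Fewer than k failing samples: every draw of k contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

A benchmark-level score would then be the mean of this value over all tasks.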

Title: AstroVisBench: A Code Benchmark for Scientific Computing and Visualization in Astronomy

Authors: Sebastian Antony Joseph, Syed Murtaza Husain, Stella S. R. Offner, Stéphanie Juneau, Paul Torrey, Adam S. Bolton, Juan P. Farias, Niall Gaffney, Greg Durrett, Junyi Jessy Li
Subjects: cs.CL, astro-ph.IM, cs.LG
Abstract URL: https://arxiv.org/abs/2505.20538
Pdf URL: https://arxiv.org/pdf/2505.20538
Copy Paste: [[2505.20538]] AstroVisBench: A Code Benchmark for Scientific Computing and Visualization in Astronomy(https://arxiv.org/abs/2505.20538)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) are being explored for applications in scientific research, including their capabilities to synthesize literature, answer research questions, generate research ideas, and even conduct computational experiments. Ultimately, our goal is for these to help scientists derive novel scientific insights. In many areas of science, such insights often arise from processing and visualizing data to understand its patterns. However, evaluating whether an LLM-mediated scientific workflow produces outputs conveying the correct scientific insights is challenging and has not been addressed in past work. We introduce AstroVisBench, the first benchmark for both scientific computing and visualization in the astronomy domain. AstroVisBench judges a language model's ability to both (1) create astronomy-specific workflows to process and analyze data and (2) visualize the results of these workflows through complex plots. Our evaluation of visualizations uses a novel LLM-as-a-judge workflow, which is validated against annotation by five professional astronomers. Using AstroVisBench we present an evaluation of state-of-the-art language models, showing a significant gap in their ability to engage in astronomy research as useful assistants. This evaluation provides a strong end-to-end evaluation for AI scientists that offers a path forward for the development of visualization-based workflows, which are central to a broad range of domains from physics to biology.
摘要：正在探索大型语言模型（LLM），以用于科学研究中的应用，包括它们合成文学，回答研究问题，产生研究思想甚至进行计算实验的能力。最终，我们的目标是帮助科学家获得新颖的科学见解。在许多科学领域，这些见解通常是由于处理和可视化数据以理解其模式的。但是，评估LLM介导的科学工作流是否产生传达正确的科学见解的输出是否具有挑战性，在过去的工作中尚未解决。我们介绍了Astrovisbench，这是天文学域中科学计算和可视化的第一个基准。 Astrovisbench评判了语言模型的两者（1）创建特定于天文学的工作流以处理和分析数据的能力，以及（2）通过复杂的图可视化这些工作流的结果。我们对可视化的评估使用了一种新颖的LLM-AS-A-A-a-a-a-a-a-Gudge工作流程，该工作流得到了五名专业天文学家的注释来验证。使用Astrovisbench，我们提出了对最先进的语言模型的评估，显示了他们作为有用助手的天文学研究能力的显着差距。该评估为AI科学家提供了强大的端到端评估，为开发基于可视化的工作流程提供了前进的途径，这对于从物理学到生物学的广泛领域至关重要。

Title: Paths Not Taken: Understanding and Mending the Multilingual Factual Recall Pipeline

Authors: Meng Lu, Ruochen Zhang, Ellie Pavlick, Carsten Eickhoff
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.20546
Pdf URL: https://arxiv.org/pdf/2505.20546
Copy Paste: [[2505.20546]] Paths Not Taken: Understanding and Mending the Multilingual Factual Recall Pipeline(https://arxiv.org/abs/2505.20546)
Keywords: language model, llm
Abstract: Multilingual large language models (LLMs) often exhibit factual inconsistencies across languages, with significantly better performance in factual recall tasks in English than in other languages. The causes of these failures, however, remain poorly understood. Using mechanistic analysis techniques, we uncover the underlying pipeline that LLMs employ, which involves using the English-centric factual recall mechanism to process multilingual queries and then translating English answers back into the target language. We identify two primary sources of error: insufficient engagement of the reliable English-centric mechanism for factual recall, and incorrect translation from English back into the target language for the final answer. To address these vulnerabilities, we introduce two vector interventions, both independent of languages and datasets, to redirect the model toward better internal paths for higher factual consistency. Our interventions combined increase the recall accuracy by over 35 percent for the lowest-performing language. Our findings demonstrate how mechanistic insights can be used to unlock latent multilingual capabilities in LLMs.
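Summary: Multilingual large language models (LLMs) often exhibit factual inconsistencies across languages, performing markedly better on factual recall tasks in English than in other languages. The causes of these failures, however, remain poorly understood. Using mechanistic analysis techniques, we uncover the underlying pipeline that LLMs employ: they process multilingual queries with an English-centric factual recall mechanism and then translate the English answer back into the target language. We identify two primary sources of error: insufficient engagement of the reliable English-centric recall mechanism, and incorrect translation from English back into the target language for the final answer. To address these vulnerabilities, we introduce two vector interventions, both independent of language and dataset, that redirect the model toward better internal paths for higher factual consistency. Combined, the interventions raise recall accuracy by over 35 percent for the lowest-performing language. Our findings show how mechanistic insights can be used to unlock latent multilingual capabilities in LLMs.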

Title: Effectiveness of Prompt Optimization in NL2SQL Systems

Authors: Sairam Gurajada, Eser Kandogan, Sajjadur Rahman
Subjects: cs.CL, cs.DB
Abstract URL: https://arxiv.org/abs/2505.20591
Pdf URL: https://arxiv.org/pdf/2505.20591
Copy Paste: [[2505.20591]] Effectiveness of Prompt Optimization in NL2SQL Systems(https://arxiv.org/abs/2505.20591)
Keywords: language model, llm, prompt
Abstract: NL2SQL approaches have greatly benefited from the impressive capabilities of large language models (LLMs). In particular, bootstrapping an NL2SQL system for a specific domain can be as simple as instructing an LLM with sufficient contextual information, such as schema details and translation demonstrations. However, building an accurate system still requires the rigorous task of selecting the right context for each query, including identifying relevant schema elements, cell values, and suitable exemplars that help the LLM understand domain-specific nuances. Retrieval-based methods have become the go-to approach for identifying such context. While effective, these methods introduce additional inference-time costs due to the retrieval process. In this paper, we argue that production scenarios demand high-precision, high-performance NL2SQL systems, rather than simply high-quality SQL generation, which is the focus of most current NL2SQL approaches. In such scenarios, the careful selection of a static set of exemplars, one that captures the intricacies of the query log, target database, SQL constructs, and execution latencies, plays a more crucial role than exemplar selection based solely on similarity. The key challenge, however, lies in identifying a representative set of exemplars for a given production setting. To this end, we propose a prompt optimization framework that not only addresses the high-precision requirement but also optimizes the performance of the generated SQL through multi-objective optimization. Preliminary empirical analysis demonstrates the effectiveness of the proposed framework.
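Summary: NL2SQL approaches have benefited greatly from the capabilities of large language models (LLMs). In particular, bootstrapping an NL2SQL system for a specific domain can be as simple as instructing an LLM with enough contextual information, such as schema details and translation demonstrations. Building an accurate system, however, still requires carefully selecting the right context for each query, including relevant schema elements, cell values, and suitable exemplars that help the LLM understand domain-specific nuances. Retrieval-based methods have become the preferred way to identify such context. Although effective, they add inference-time cost because of the retrieval step. In this paper, we argue that production scenarios demand high-precision, high-performance NL2SQL systems rather than merely high-quality SQL generation, which is the focus of most current NL2SQL approaches. In such scenarios, a carefully selected static set of exemplars that captures the intricacies of the query log, target database, SQL constructs, and execution latencies matters more than exemplar selection based on similarity alone. The main challenge lies in identifying a representative set of exemplars for a given production setting. To this end, we propose a prompt optimization framework that not only addresses the high-precision requirement but also optimizes the performance of the generated SQL through multi-objective optimization. Preliminary empirical analysis demonstrates the effectiveness of the proposed framework.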

Title: REAL-Prover: Retrieval Augmented Lean Prover for Mathematical Reasoning

Authors: Ziju Shen, Naohao Huang, Fanyi Yang, Yutong Wang, Guoxiong Gao, Tianyi Xu, Jiedong Jiang, Wanyi He, Pu Yang, Mengzhou Sun, Haocheng Ju, Peihao Wu, Bryan Dai, Bin Dong
Subjects: cs.CL, cs.AI, cs.LG, cs.LO
Abstract URL: https://arxiv.org/abs/2505.20613
Pdf URL: https://arxiv.org/pdf/2505.20613
Copy Paste: [[2505.20613]] REAL-Prover: Retrieval Augmented Lean Prover for Mathematical Reasoning(https://arxiv.org/abs/2505.20613)
Keywords: language model
Abstract: Nowadays, formal theorem provers have made monumental progress on high-school and competition-level mathematics, but few of them generalize to more advanced mathematics. In this paper, we present REAL-Prover, a new open-source stepwise theorem prover for Lean 4 to push this boundary. This prover, based on our fine-tuned large language model (REAL-Prover-v1) and integrated with a retrieval system (Leansearch-PS), notably boosts performance on solving college-level mathematics problems. To train REAL-Prover-v1, we developed HERALD-AF, a data extraction pipeline that converts natural language math problems into formal statements, and a new open-source Lean 4 interactive environment (Jixia-interactive) to facilitate synthesis data collection. In our experiments, our prover using only supervised fine-tuning achieves competitive results with a 23.7% success rate (Pass@64) on the ProofNet dataset, comparable to state-of-the-art (SOTA) models. To further evaluate our approach, we introduce FATE-M, a new benchmark focused on algebraic problems, where our prover achieves a SOTA success rate of 56.7% (Pass@64).
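Summary: Formal theorem provers have made great progress on high-school and competition-level mathematics, but few of them generalize to more advanced mathematics. In this paper, we present REAL-Prover, a new open-source stepwise theorem prover for Lean 4 that pushes this boundary. The prover is based on our fine-tuned large language model (REAL-Prover-v1) and integrated with a retrieval system (Leansearch-PS), and it notably improves performance on college-level mathematics problems. To train REAL-Prover-v1, we developed HERALD-AF, a data extraction pipeline that converts natural language math problems into formal statements, and a new open-source Lean 4 interactive environment (Jixia-interactive) to facilitate synthetic data collection. In our experiments, the prover, using only supervised fine-tuning, achieves competitive results with a 23.7% success rate (Pass@64) on the ProofNet dataset, comparable to state-of-the-art (SOTA) models. To further evaluate our approach, we introduce FATE-M, a new benchmark focused on algebraic problems, on which our prover achieves a SOTA success rate of 56.7% (Pass@64).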

Title: SeqPO-SiMT: Sequential Policy Optimization for Simultaneous Machine Translation

Authors: Ting Xu, Zhichao Huang, Jiankai Sun, Shanbo Cheng, Wai Lam
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.20622
Pdf URL: https://arxiv.org/pdf/2505.20622
Copy Paste: [[2505.20622]] SeqPO-SiMT: Sequential Policy Optimization for Simultaneous Machine Translation(https://arxiv.org/abs/2505.20622)
Keywords: llm
Abstract: We present Sequential Policy Optimization for Simultaneous Machine Translation (SeqPO-SiMT), a new policy optimization framework that defines the simultaneous machine translation (SiMT) task as a sequential decision making problem, incorporating a tailored reward to enhance translation quality while reducing latency. In contrast to popular Reinforcement Learning from Human Feedback (RLHF) methods, such as PPO and DPO, which are typically applied in single-step tasks, SeqPO-SiMT effectively tackles the multi-step SiMT task. This intuitive framework allows the SiMT LLMs to simulate and refine the SiMT process using a tailored reward. We conduct experiments on six datasets from diverse domains for En to Zh and Zh to En SiMT tasks, demonstrating that SeqPO-SiMT consistently achieves significantly higher translation quality with lower latency. In particular, SeqPO-SiMT outperforms the supervised fine-tuning (SFT) model by 1.13 points in COMET, while reducing the Average Lagging by 6.17 in the NEWSTEST2021 En to Zh dataset. While SiMT operates with far less context than offline translation, the SiMT results of SeqPO-SiMT on 7B LLM surprisingly rival the offline translation of high-performing LLMs, including Qwen-2.5-7B-Instruct and LLaMA-3-8B-Instruct.
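Summary: We present Sequential Policy Optimization for Simultaneous Machine Translation (SeqPO-SiMT), a new policy optimization framework that defines the simultaneous machine translation (SiMT) task as a sequential decision-making problem and incorporates a tailored reward to improve translation quality while reducing latency. In contrast to popular Reinforcement Learning from Human Feedback (RLHF) methods such as PPO and DPO, which are typically applied to single-step tasks, SeqPO-SiMT effectively handles the multi-step SiMT task. This intuitive framework lets SiMT LLMs simulate and refine the SiMT process using a tailored reward. We conduct experiments on six datasets from diverse domains for En to Zh and Zh to En SiMT tasks, showing that SeqPO-SiMT consistently achieves significantly higher translation quality at lower latency. In particular, SeqPO-SiMT outperforms the supervised fine-tuning (SFT) model by 1.13 points in COMET while reducing Average Lagging by 6.17 on the NEWSTEST2021 En to Zh dataset. Although SiMT operates with far less context than offline translation, the SiMT results of SeqPO-SiMT on a 7B LLM surprisingly rival the offline translation of high-performing LLMs, including Qwen-2.5-7B-Instruct and LLaMA-3-8B-Instruct.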

Title: POLAR: A Benchmark for Multilingual, Multicultural, and Multi-Event Online Polarization

Authors: Usman Naseem, Juan Ren, Saba Anwar, Sarah Kohail, Rudy Alexandro Garrido Veliz, Robert Geislinger, Aisha Jabr, Idris Abdulmumin, Laiba Qureshi, Aarushi Ajay Borkar, Maryam Ibrahim Mukhtar, Abinew Ali Ayele, Ibrahim Said Ahmad, Adem Ali, Martin Semmann, Shamsuddeen Hassan Muhammad, Seid Muhie Yimam
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.20624
Pdf URL: https://arxiv.org/pdf/2505.20624
Copy Paste: [[2505.20624]] POLAR: A Benchmark for Multilingual, Multicultural, and Multi-Event Online Polarization(https://arxiv.org/abs/2505.20624)
Keywords: language model, llm
Abstract: Online polarization poses a growing challenge for democratic discourse, yet most computational social science research remains monolingual, culturally narrow, or event-specific. We introduce POLAR, a multilingual, multicultural, and multievent dataset with over 23k instances in seven languages from diverse online platforms and real-world events. Polarization is annotated along three axes: presence, type, and manifestation, using a variety of annotation platforms adapted to each cultural context. We conduct two main experiments: (1) we fine-tune six multilingual pretrained language models in both monolingual and cross-lingual setups; and (2) we evaluate a range of open and closed large language models (LLMs) in few-shot and zero-shot scenarios. Results show that while most models perform well on binary polarization detection, they achieve substantially lower scores when predicting polarization types and manifestations. These findings highlight the complex, highly contextual nature of polarization and the need for robust, adaptable approaches in NLP and computational social science. All resources will be released to support further research and effective mitigation of digital polarization globally.
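Summary: Online polarization poses a growing challenge for democratic discourse, yet most computational social science research remains monolingual, culturally narrow, or event-specific. We introduce POLAR, a multilingual, multicultural, and multi-event dataset with over 23k instances in seven languages from diverse online platforms and real-world events. Polarization is annotated along three axes (presence, type, and manifestation) using a variety of annotation platforms adapted to each cultural context. We conduct two main experiments: (1) we fine-tune six multilingual pretrained language models in both monolingual and cross-lingual setups; and (2) we evaluate a range of open and closed large language models (LLMs) in few-shot and zero-shot scenarios. Results show that while most models perform well on binary polarization detection, they score substantially lower when predicting polarization types and manifestations. These findings highlight the complex, highly contextual nature of polarization and the need for robust, adaptable approaches in NLP and computational social science. All resources will be released to support further research and effective mitigation of digital polarization globally.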

Title: Long Context Scaling: Divide and Conquer via Multi-Agent Question-driven Collaboration

Authors: Sibo Xiao, Zixin Lin, Wenyang Gao, Yue Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.20625
Pdf URL: https://arxiv.org/pdf/2505.20625
Copy Paste: [[2505.20625]] Long Context Scaling: Divide and Conquer via Multi-Agent Question-driven Collaboration(https://arxiv.org/abs/2505.20625)
Keywords: language model, llm, long context, agent
Abstract: Processing long contexts has become a critical capability for modern large language models (LLMs). Existing works leverage agent-based divide-and-conquer methods for processing long contexts. But these methods face crucial limitations, including prohibitive accumulated latency and amplified information loss from excessive agent invocations, and the disruption of inherent textual dependencies by immoderate partitioning. In this paper, we propose a novel multi-agent framework XpandA (Expand-Agent) coupled with question-driven workflow and dynamic partitioning for robust long-context processing. XpandA overcomes these limitations through: 1) dynamic partitioning of long texts, which adaptively modulates the filling rate of context windows for input sequences of vastly varying lengths; 2) question-guided protocol to update flat information ensembles within centralized shared memory, constructing consistent inter-agent knowledge across partitions; and 3) selectively replaying specific partitions based on the state-tracking of question-information couples to promote the resolution of inverted-order structures across partitions (e.g., flashbacks). We perform a comprehensive evaluation of XpandA on multiple long-context benchmarks with length varying from 1k to 1M, demonstrating XpandA's feasibility for processing ultra-long sequences and its significant effectiveness in enhancing the long-context capabilities of various LLMs by achieving 20% improvements and 1.5x inference speedup over baselines of full-context, RAG and previous agent-based methods.
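Summary: Processing long contexts has become a critical capability for modern large language models (LLMs). Existing work uses agent-based divide-and-conquer methods to process long contexts. These methods, however, face crucial limitations, including prohibitive accumulated latency and amplified information loss from excessive agent invocations, as well as the disruption of inherent textual dependencies by immoderate partitioning. In this paper, we propose a novel multi-agent framework, XpandA (Expand-Agent), coupled with a question-driven workflow and dynamic partitioning for robust long-context processing. XpandA overcomes these limitations through: 1) dynamic partitioning of long texts, which adaptively modulates the filling rate of context windows for input sequences of vastly varying lengths; 2) a question-guided protocol that updates flat information ensembles within centralized shared memory, building consistent inter-agent knowledge across partitions; and 3) selective replay of specific partitions based on state-tracking of question-information couples, to help resolve inverted-order structures across partitions (e.g., flashbacks). We comprehensively evaluate XpandA on multiple long-context benchmarks with lengths from 1k to 1M, demonstrating its feasibility for processing ultra-long sequences and its effectiveness in enhancing the long-context capabilities of various LLMs, with 20% improvements and a 1.5x inference speedup over full-context, RAG, and previous agent-based baselines.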

Title: Test-Time Learning for Large Language Models

Authors: Jinwu Hu, Zhitian Zhang, Guohao Chen, Xutao Wen, Chao Shuai, Wei Luo, Bin Xiao, Yuanqing Li, Mingkui Tan
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.20633
Pdf URL: https://arxiv.org/pdf/2505.20633
Copy Paste: [[2505.20633]] Test-Time Learning for Large Language Models(https://arxiv.org/abs/2505.20633)
Keywords: language model, llm
Abstract: While Large Language Models (LLMs) have exhibited remarkable emergent capabilities through extensive pre-training, they still face critical limitations in generalizing to specialized domains and handling diverse linguistic variations, known as distribution shifts. In this paper, we propose a Test-Time Learning (TTL) paradigm for LLMs, namely TLM, which dynamically adapts LLMs to target domains using only unlabeled test data during testing. Specifically, we first provide empirical evidence and theoretical insights to reveal that more accurate predictions from LLMs can be achieved by minimizing the input perplexity of the unlabeled test data. Based on this insight, we formulate the Test-Time Learning process of LLMs as input perplexity minimization, enabling self-supervised enhancement of LLM performance. Furthermore, we observe that high-perplexity samples tend to be more informative for model optimization. Accordingly, we introduce a Sample Efficient Learning Strategy that actively selects and emphasizes these high-perplexity samples for test-time updates. Lastly, to mitigate catastrophic forgetting and ensure adaptation stability, we adopt Low-Rank Adaptation (LoRA) instead of full-parameter optimization, which allows lightweight model updates while preserving more original knowledge from the model. We introduce the AdaptEval benchmark for TTL and demonstrate through experiments that TLM improves performance by at least 20% compared to original LLMs on domain knowledge adaptation.
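Summary: Although Large Language Models (LLMs) have shown remarkable emergent capabilities through extensive pre-training, they still face critical limitations in generalizing to specialized domains and handling diverse linguistic variations, known as distribution shifts. In this paper, we propose a Test-Time Learning (TTL) paradigm for LLMs, namely TLM, which dynamically adapts LLMs to target domains using only unlabeled test data during testing. Specifically, we first provide empirical evidence and theoretical insights showing that more accurate LLM predictions can be achieved by minimizing the input perplexity of the unlabeled test data. Based on this insight, we formulate the Test-Time Learning process of LLMs as input perplexity minimization, enabling self-supervised enhancement of LLM performance. Furthermore, we observe that high-perplexity samples tend to be more informative for model optimization. Accordingly, we introduce a Sample Efficient Learning Strategy that actively selects and emphasizes these high-perplexity samples for test-time updates. Lastly, to mitigate catastrophic forgetting and ensure adaptation stability, we adopt Low-Rank Adaptation (LoRA) instead of full-parameter optimization, which allows lightweight model updates while preserving more of the model's original knowledge. We introduce the AdaptEval benchmark for TTL and show through experiments that TLM improves performance by at least 20% over the original LLMs on domain knowledge adaptation.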
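
The two quantities this abstract relies on, input perplexity and the selection of high-perplexity samples, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the fixed `threshold`, and the toy log-probabilities are all assumptions made for illustration, and the LoRA update itself is omitted.

```python
import math


def perplexity(token_logprobs):
    """Perplexity of one sequence from its per-token natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


def select_high_perplexity(samples, threshold):
    """Return names of samples whose input perplexity exceeds `threshold`.

    `threshold` is a hypothetical cutoff; the abstract does not say how
    high-perplexity samples are chosen or weighted.
    """
    return [name for name, logprobs in samples.items()
            if perplexity(logprobs) > threshold]


# Toy data: log-probabilities a model might assign to the input tokens.
samples = {
    "in_domain": [math.log(0.5), math.log(0.5)],  # perplexity about 2
    "shifted": [math.log(0.1), math.log(0.1)],    # perplexity about 10
}
selected = select_high_perplexity(samples, threshold=5.0)
```

In this toy setup only the `shifted` sample passes the cutoff, so it is the one that would be emphasized in a test-time update.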

Title: STEER-BENCH: A Benchmark for Evaluating the Steerability of Large Language Models

Authors: Kai Chen, Zihao He, Taiwei Shi, Kristina Lerman
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.20645
Pdf URL: https://arxiv.org/pdf/2505.20645
Copy Paste: [[2505.20645]] STEER-BENCH: A Benchmark for Evaluating the Steerability of Large Language Models(https://arxiv.org/abs/2505.20645)
Keywords: language model, llm
Abstract: Steerability, or the ability of large language models (LLMs) to adapt outputs to align with diverse community-specific norms, perspectives, and communication styles, is critical for real-world applications but remains under-evaluated. We introduce Steer-Bench, a benchmark for assessing population-specific steering using contrasting Reddit communities. Covering 30 contrasting subreddit pairs across 19 domains, Steer-Bench includes over 10,000 instruction-response pairs and 5,500 validated multiple-choice questions with corresponding silver labels to test alignment with diverse community norms. Our evaluation of 13 popular LLMs using Steer-Bench reveals that while human experts achieve an accuracy of 81% with silver labels, the best-performing models reach only around 65% accuracy depending on the domain and configuration. Some models lag behind human-level alignment by over 15 percentage points, highlighting significant gaps in community-sensitive steerability. Steer-Bench is a benchmark to systematically assess how effectively LLMs understand community-specific instructions, their resilience to adversarial steering attempts, and their ability to accurately represent diverse cultural and ideological perspectives.
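Summary: Steerability, the ability of large language models (LLMs) to adapt their outputs to diverse community-specific norms, perspectives, and communication styles, is critical for real-world applications but remains under-evaluated. We introduce Steer-Bench, a benchmark for assessing population-specific steering using contrasting Reddit communities. Covering 30 contrasting subreddit pairs across 19 domains, Steer-Bench includes over 10,000 instruction-response pairs and 5,500 validated multiple-choice questions with corresponding silver labels to test alignment with diverse community norms. Our evaluation of 13 popular LLMs on Steer-Bench shows that while human experts reach 81% accuracy against the silver labels, the best-performing models reach only around 65% accuracy, depending on the domain and configuration. Some models lag behind human-level alignment by over 15 percentage points, highlighting significant gaps in community-sensitive steerability. Steer-Bench is a benchmark for systematically assessing how effectively LLMs understand community-specific instructions, how resilient they are to adversarial steering attempts, and how accurately they represent diverse cultural and ideological perspectives.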

Title: FinTagging: An LLM-ready Benchmark for Extracting and Structuring Financial Information

Authors: Yan Wang, Yang Ren, Lingfei Qian, Xueqing Peng, Keyi Wang, Yi Han, Dongji Feng, Xiao-Yang Liu, Jimin Huang, Qianqian Xie
Subjects: cs.CL, cs.AI, cs.CE
Abstract URL: https://arxiv.org/abs/2505.20650
Pdf URL: https://arxiv.org/pdf/2505.20650
Copy Paste: [[2505.20650]] FinTagging: An LLM-ready Benchmark for Extracting and Structuring Financial Information(https://arxiv.org/abs/2505.20650)
Keywords: language model, llm
Abstract: We introduce FinTagging, the first full-scope, table-aware XBRL benchmark designed to evaluate the structured information extraction and semantic alignment capabilities of large language models (LLMs) in the context of XBRL-based financial reporting. Unlike prior benchmarks that oversimplify XBRL tagging as flat multi-class classification and focus solely on narrative text, FinTagging decomposes the XBRL tagging problem into two subtasks: FinNI for financial entity extraction and FinCL for taxonomy-driven concept alignment. It requires models to jointly extract facts and align them with the full 10k+ US-GAAP taxonomy across both unstructured text and structured tables, enabling realistic, fine-grained evaluation. We assess a diverse set of LLMs under zero-shot settings, systematically analyzing their performance on both subtasks and overall tagging accuracy. Our results reveal that, while LLMs demonstrate strong generalization in information extraction, they struggle with fine-grained concept alignment, particularly in disambiguating closely related taxonomy entries. These findings highlight the limitations of existing LLMs in fully automating XBRL tagging and underscore the need for improved semantic reasoning and schema-aware modeling to meet the demands of accurate financial disclosure. Code is available at our GitHub repository and data is at our Hugging Face repository.
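Summary: We introduce FinTagging, the first full-scope, table-aware XBRL benchmark designed to evaluate the structured information extraction and semantic alignment capabilities of large language models (LLMs) in the context of XBRL-based financial reporting. Unlike prior benchmarks, which oversimplify XBRL tagging as flat multi-class classification and focus solely on narrative text, FinTagging decomposes the XBRL tagging problem into two subtasks: FinNI for financial entity extraction and FinCL for taxonomy-driven concept alignment. It requires models to jointly extract facts and align them with the full 10k+ US-GAAP taxonomy across both unstructured text and structured tables, enabling realistic, fine-grained evaluation. We assess a diverse set of LLMs under zero-shot settings, systematically analyzing their performance on both subtasks and their overall tagging accuracy. Our results show that while LLMs generalize well in information extraction, they struggle with fine-grained concept alignment, particularly in disambiguating closely related taxonomy entries. These findings highlight the limitations of existing LLMs in fully automating XBRL tagging and underscore the need for improved semantic reasoning and schema-aware modeling to meet the demands of accurate financial disclosure. Code is available at our GitHub repository and data at our Hugging Face repository.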

Title: Enhancing Transformation from Natural Language to Signal Temporal Logic Using LLMs with Diverse External Knowledge

Authors: Yue Fang, Zhi Jin, Jie An, Hongshen Chen, Xiaohong Chen, Naijun Zhan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.20658
Pdf URL: https://arxiv.org/pdf/2505.20658
Copy Paste: [[2505.20658]] Enhancing Transformation from Natural Language to Signal Temporal Logic Using LLMs with Diverse External Knowledge(https://arxiv.org/abs/2505.20658)
Keywords: language model, llm
Abstract: Temporal Logic (TL), especially Signal Temporal Logic (STL), enables precise formal specification, making it widely used in cyber-physical systems such as autonomous driving and robotics. Automatically transforming NL into STL is an attractive approach to overcome the limitations of manual transformation, which is time-consuming and error-prone. However, due to the lack of datasets, automatic transformation currently faces significant challenges and has not been fully explored. In this paper, we propose an NL-STL dataset named STL-Diversity-Enhanced (STL-DivEn), which comprises 16,000 samples enriched with diverse patterns. To develop the dataset, we first manually create a small-scale seed set of NL-STL pairs. Next, representative examples are identified through clustering and used to guide large language models (LLMs) in generating additional NL-STL pairs. Finally, diversity and accuracy are ensured through rigorous rule-based filters and human validation. Furthermore, we introduce the Knowledge-Guided STL Transformation (KGST) framework, a novel approach for transforming natural language into STL, involving a generate-then-refine process based on external knowledge. Statistical analysis shows that the STL-DivEn dataset exhibits more diversity than the existing NL-STL dataset. Moreover, both metric-based and human evaluations indicate that our KGST approach outperforms baseline models in transformation accuracy on STL-DivEn and DeepSTL datasets.
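Summary: Temporal Logic (TL), especially Signal Temporal Logic (STL), enables precise formal specification and is therefore widely used in cyber-physical systems such as autonomous driving and robotics. Automatically transforming natural language (NL) into STL is an attractive way to overcome the limitations of manual transformation, which is time-consuming and error-prone. However, because of the lack of datasets, automatic transformation currently faces significant challenges and has not been fully explored. In this paper, we propose an NL-STL dataset named STL-Diversity-Enhanced (STL-DivEn), which comprises 16,000 samples enriched with diverse patterns. To build the dataset, we first manually create a small-scale seed set of NL-STL pairs. Next, representative examples are identified through clustering and used to guide large language models (LLMs) in generating additional NL-STL pairs. Finally, diversity and accuracy are ensured through rigorous rule-based filters and human validation. Furthermore, we introduce the Knowledge-Guided STL Transformation (KGST) framework, a novel approach for transforming natural language into STL that uses a generate-then-refine process based on external knowledge. Statistical analysis shows that the STL-DivEn dataset exhibits more diversity than the existing NL-STL dataset. Moreover, both metric-based and human evaluations indicate that our KGST approach outperforms baseline models in transformation accuracy on the STL-DivEn and DeepSTL datasets.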

Title: BacktrackAgent: Enhancing GUI Agent with Error Detection and Backtracking Mechanism

Authors: Qinzhuo Wu, Pengzhi Gao, Wei Liu, Jian Luan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.20660
Pdf URL: https://arxiv.org/pdf/2505.20660
Copy Paste: [[2505.20660]] BacktrackAgent: Enhancing GUI Agent with Error Detection and Backtracking Mechanism(https://arxiv.org/abs/2505.20660)
Keywords: agent
Abstract: Graphical User Interface (GUI) agents have gained substantial attention due to their impressive capabilities to complete tasks through multiple interactions within GUI environments. However, existing agents primarily focus on enhancing the accuracy of individual actions and often lack effective mechanisms for detecting and recovering from errors. To address these shortcomings, we propose the BacktrackAgent, a robust framework that incorporates a backtracking mechanism to improve task completion efficiency. BacktrackAgent includes verifier, judger, and reflector components as modules for error detection and recovery, while also applying judgment rewards to further enhance the agent's performance. Additionally, we develop a training dataset specifically designed for the backtracking mechanism, which considers the outcome pages after action executions. Experimental results show that BacktrackAgent has achieved performance improvements in both task success rate and step accuracy on Mobile3M and Auto-UI benchmarks. Our data and code will be released upon acceptance.
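Summary: Graphical User Interface (GUI) agents have attracted substantial attention for their ability to complete tasks through multiple interactions within GUI environments. However, existing agents focus primarily on improving the accuracy of individual actions and often lack effective mechanisms for detecting and recovering from errors. To address these shortcomings, we propose BacktrackAgent, a robust framework that incorporates a backtracking mechanism to improve task completion efficiency. BacktrackAgent includes verifier, judger, and reflector components as modules for error detection and recovery, and it also applies judgment rewards to further enhance the agent's performance. In addition, we develop a training dataset designed specifically for the backtracking mechanism, which considers the outcome pages after action executions. Experimental results show that BacktrackAgent improves both task success rate and step accuracy on the Mobile3M and Auto-UI benchmarks. Our data and code will be released upon acceptance.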

Title: Self-Route: Automatic Mode Switching via Capability Estimation for Efficient Reasoning

Authors: Yang He, Xiao Ding, Bibo Cai, Yufei Zhang, Kai Xiong, Zhouhao Sun, Bing Qin, Ting Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.20664
Pdf URL: https://arxiv.org/pdf/2505.20664
Copy Paste: [[2505.20664]] Self-Route: Automatic Mode Switching via Capability Estimation for Efficient Reasoning(https://arxiv.org/abs/2505.20664)
Keywords: language model, llm, chain-of-thought
Abstract: While reasoning-augmented large language models (RLLMs) significantly enhance complex task performance through extended reasoning chains, they inevitably introduce substantial unnecessary token consumption, particularly for simpler problems where Short Chain-of-Thought (Short CoT) suffices. This overthinking phenomenon leads to inefficient resource usage without proportional accuracy gains. To address this issue, we propose Self-Route, a dynamic reasoning framework that automatically selects between general and reasoning modes based on model capability estimation. Our approach introduces a lightweight pre-inference stage to extract capability-aware embeddings from hidden layer representations, enabling real-time evaluation of the model's ability to solve problems. We further construct Gradient-10K, a model difficulty estimation-based dataset with dense complexity sampling, to train the router for precise capability boundary detection. Extensive experiments demonstrate that Self-Route achieves comparable accuracy to reasoning models while reducing token consumption by 30-55% across diverse benchmarks. The proposed framework demonstrates consistent effectiveness across models with different parameter scales and reasoning paradigms, highlighting its general applicability and practical value.
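Summary: Although reasoning-augmented large language models (RLLMs) significantly improve performance on complex tasks through extended reasoning chains, they inevitably consume a large number of unnecessary tokens, particularly on simpler problems where a Short Chain-of-Thought (Short CoT) suffices. This overthinking leads to inefficient resource usage without proportional accuracy gains. To address the issue, we propose Self-Route, a dynamic reasoning framework that automatically selects between general and reasoning modes based on model capability estimation. Our approach introduces a lightweight pre-inference stage that extracts capability-aware embeddings from hidden layer representations, enabling real-time evaluation of the model's ability to solve a problem. We further construct Gradient-10K, a dataset based on model difficulty estimation with dense complexity sampling, to train the router for precise capability boundary detection. Extensive experiments show that Self-Route achieves accuracy comparable to reasoning models while reducing token consumption by 30-55% across diverse benchmarks. The framework is consistently effective across models with different parameter scales and reasoning paradigms, highlighting its general applicability and practical value.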
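
The routing decision described in this abstract can be pictured as a small probe over a hidden-state embedding. The sketch below is only an illustration under stated assumptions: the logistic probe, its `weights` and `bias`, the 0.5 threshold, and the two-dimensional toy embeddings are hypothetical and are not taken from the paper, which does not specify the router's architecture in the abstract.

```python
import math


def solve_probability(embedding, weights, bias):
    """Estimated probability that the model can answer in general (short) mode.

    A hypothetical logistic probe over a capability-aware embedding.
    """
    z = sum(w * x for w, x in zip(weights, embedding)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def choose_mode(embedding, weights, bias, threshold=0.5):
    """Route to 'general' when the probe is confident, else to 'reasoning'."""
    p = solve_probability(embedding, weights, bias)
    return "general" if p >= threshold else "reasoning"


# Toy probe parameters and embeddings (illustrative values only).
weights, bias = [2.0, -2.0], 0.0
easy_mode = choose_mode([1.0, 0.0], weights, bias)  # probe output above 0.5
hard_mode = choose_mode([0.0, 1.0], weights, bias)  # probe output below 0.5
```

Because the probe runs once before generation, its cost is small compared with the tokens saved when an easy query skips the long reasoning chain.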

Title: Pretraining Language Models to Ponder in Continuous Space

Authors: Boyi Zeng, Shixiang Song, Siyuan Huang, Yixuan Wang, He Li, Ziwei He, Xinbing Wang, Zhiyu Li, Zhouhan Lin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.20674
Pdf URL: https://arxiv.org/pdf/2505.20674
Copy Paste: [[2505.20674]] Pretraining Language Models to Ponder in Continuous Space(https://arxiv.org/abs/2505.20674)
Keywords: language model, gpt
Abstract: Humans ponder before articulating complex sentence elements, enabling deeper cognitive processing through focused effort. In this work, we introduce this pondering process into language models by repeatedly invoking the forward process within a single token generation step. During pondering, instead of generating an actual token sampled from the prediction distribution, the model ponders by yielding a weighted sum of all token embeddings according to the predicted token distribution. The generated embedding is then fed back as input for another forward pass. We show that the model can learn to ponder in this way through self-supervised learning, without any human annotations. Our method is straightforward and can be seamlessly integrated with various existing language models. Experiments across three widely used open-source architectures (GPT-2, Pythia, and LLaMA) and extensive downstream task evaluations demonstrate the effectiveness and generality of our method. For language modeling tasks, pondering language models achieve performance comparable to vanilla models with twice the number of parameters. On 9 downstream benchmarks, our pondering-enhanced Pythia models significantly outperform the official Pythia models. Notably, pondering-enhanced Pythia-1B is comparable to TinyLlama-1.1B, which is trained on 10 times more data. The code is available at this https URL.
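Summary: Humans ponder before articulating complex sentence elements, which enables deeper cognitive processing through focused effort. In this work, we introduce this pondering process into language models by repeatedly invoking the forward pass within a single token generation step. During pondering, instead of generating an actual token sampled from the prediction distribution, the model yields a weighted sum of all token embeddings according to the predicted token distribution. The resulting embedding is then fed back as input for another forward pass. We show that the model can learn to ponder in this way through self-supervised learning, without any human annotations. Our method is straightforward and can be integrated seamlessly with various existing language models. Experiments across three widely used open-source architectures (GPT-2, Pythia, and LLaMA) and extensive downstream task evaluations demonstrate the effectiveness and generality of the method. On language modeling tasks, pondering language models achieve performance comparable to vanilla models with twice as many parameters. On 9 downstream benchmarks, our pondering-enhanced Pythia models significantly outperform the official Pythia models. Notably, pondering-enhanced Pythia-1B is comparable to TinyLlama-1.1B, which is trained on 10 times more data. The code is available at this https URL.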
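
The core operation named in this abstract, replacing a sampled token with the probability-weighted sum of all token embeddings, can be sketched in a few lines. This is a toy illustration rather than the authors' code: the three-token vocabulary, the two-dimensional `embedding_table`, and the function names are assumptions, and how the resulting vector is combined with the original input before the next forward pass is not shown.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pondering_embedding(logits, embedding_table):
    """Expected token embedding under the predicted token distribution.

    Instead of sampling one token, every row of the embedding table is
    weighted by its predicted probability and the rows are summed.
    """
    probs = softmax(logits)
    dim = len(embedding_table[0])
    return [sum(p * row[d] for p, row in zip(probs, embedding_table))
            for d in range(dim)]


# Toy vocabulary of three tokens with two-dimensional embeddings.
embedding_table = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
uniform = pondering_embedding([0.0, 0.0, 0.0], embedding_table)
peaked = pondering_embedding([100.0, 0.0, 0.0], embedding_table)
```

With uniform logits the result is the mean of the embedding rows; with one dominant logit it collapses to that token's embedding, so ordinary decoding appears as a limiting case of the weighted sum.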

Title: SELF-PERCEPT: Introspection Improves Large Language Models' Detection of Multi-Person Mental Manipulation in Conversations

Authors: Danush Khanna, Pratinav Seth, Sidhaarth Sredharan Murali, Aditya Kumar Guru, Siddharth Shukla, Tanuj Tyagi, Sandeep Chaurasia, Kripabandhu Ghosh
Subjects: cs.CL, cs.HC, cs.LG
Abstract URL: https://arxiv.org/abs/2505.20679
Pdf URL: https://arxiv.org/pdf/2505.20679
Copy Paste: [[2505.20679]] SELF-PERCEPT: Introspection Improves Large Language Models' Detection of Multi-Person Mental Manipulation in Conversations(https://arxiv.org/abs/2505.20679)
Keywords: language model, gpt, llm, prompt
Abstract: Mental manipulation is a subtle yet pervasive form of abuse in interpersonal communication, making its detection critical for safeguarding potential victims. However, due to manipulation's nuanced and context-specific nature, identifying manipulative language in complex, multi-turn, and multi-person conversations remains a significant challenge for large language models (LLMs). To address this gap, we introduce the MultiManip dataset, comprising 220 multi-turn, multi-person dialogues balanced between manipulative and non-manipulative interactions, all drawn from reality shows that mimic real-world scenarios. For manipulative interactions, it includes 11 distinct manipulations depicting real-life scenarios. We conduct extensive evaluations of state-of-the-art LLMs, such as GPT-4o and Llama-3.1-8B, employing various prompting strategies. Despite their capabilities, these models often struggle to detect manipulation effectively. To overcome this limitation, we propose SELF-PERCEPT, a novel, two-stage prompting framework inspired by Self-Perception Theory, demonstrating strong performance in detecting multi-person, multi-turn mental manipulation. Our code and data are publicly available at this https URL.
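Summary: Mental manipulation is a subtle yet pervasive form of abuse in interpersonal communication, so detecting it is critical for safeguarding potential victims. However, because manipulation is nuanced and context-specific, identifying manipulative language in complex, multi-turn, multi-person conversations remains a significant challenge for large language models (LLMs). To address this gap, we introduce the MultiManip dataset, comprising 220 multi-turn, multi-person dialogues balanced between manipulative and non-manipulative interactions, all drawn from reality shows that mimic real-world scenarios. For manipulative interactions, it includes 11 distinct manipulations depicting real-life scenarios. We extensively evaluate state-of-the-art LLMs, such as GPT-4o and Llama-3.1-8B, with various prompting strategies. Despite their capabilities, these models often struggle to detect manipulation effectively. To overcome this limitation, we propose SELF-PERCEPT, a novel two-stage prompting framework inspired by Self-Perception Theory, which performs strongly in detecting multi-person, multi-turn mental manipulation. Our code and data are publicly available at this https URL.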

Title: Beyond Templates: Dynamic Adaptation of Reasoning Demonstrations via Feasibility-Aware Exploration

Authors: Yong Wu, Weihang Pan, Ke Li, Chen Binhui, Ping Li, Binbin Lin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.20700
Pdf URL: https://arxiv.org/pdf/2505.20700
Copy Paste: [[2505.20700]] Beyond Templates: Dynamic Adaptation of Reasoning Demonstrations via Feasibility-Aware Exploration(https://arxiv.org/abs/2505.20700)
Keywords: language model, llm
Abstract: Large language models (LLMs) have shown remarkable reasoning capabilities, yet aligning such abilities to small language models (SLMs) remains a challenge due to distributional mismatches and limited model capacity. Existing reasoning datasets, typically designed for powerful LLMs, often lead to degraded performance when directly applied to weaker models. In this work, we introduce Dynamic Adaptation of Reasoning Trajectories (DART), a novel data adaptation framework that bridges the capability gap between expert reasoning trajectories and diverse SLMs. Instead of uniformly imitating expert steps, DART employs a selective imitation strategy guided by step-wise adaptability estimation via solution simulation. When expert steps surpass the student's capacity -- signaled by an Imitation Gap -- the student autonomously explores alternative reasoning paths, constrained by outcome consistency. We validate DART across multiple reasoning benchmarks and model scales, demonstrating that it significantly improves generalization and data efficiency over static fine-tuning. Our method enhances supervision quality by aligning training signals with the student's reasoning capabilities, offering a scalable solution for reasoning alignment in resource-constrained models.
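Summary: Large language models (LLMs) have shown remarkable reasoning capabilities, yet transferring such abilities to small language models (SLMs) remains a challenge because of distributional mismatches and limited model capacity. Existing reasoning datasets, typically designed for powerful LLMs, often degrade performance when applied directly to weaker models. In this work, we introduce Dynamic Adaptation of Reasoning Trajectories (DART), a novel data adaptation framework that bridges the capability gap between expert reasoning trajectories and diverse SLMs. Instead of uniformly imitating expert steps, DART uses a selective imitation strategy guided by step-wise adaptability estimation via solution simulation. When expert steps exceed the student's capacity, as signaled by an Imitation Gap, the student autonomously explores alternative reasoning paths, constrained by outcome consistency. We validate DART across multiple reasoning benchmarks and model scales, showing that it significantly improves generalization and data efficiency over static fine-tuning. Our method improves supervision quality by aligning training signals with the student's reasoning capabilities, offering a scalable solution for reasoning alignment in resource-constrained models.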

Title: Dissecting Physics Reasoning in Small Language Models: A Multi-Dimensional Analysis from an Educational Perspective

Authors: Nicy Scaria, Silvester John Joseph Kennedy, Diksha Seth, Deepak Subramani
Subjects: cs.CL, cs.AI, physics.ed-ph
Abstract URL: https://arxiv.org/abs/2505.20707
Pdf URL: https://arxiv.org/pdf/2505.20707
Copy Paste: [[2505.20707]] Dissecting Physics Reasoning in Small Language Models: A Multi-Dimensional Analysis from an Educational Perspective(https://arxiv.org/abs/2505.20707)
Keywords: language model, llm
Abstract: Small Language Models (SLMs) offer computational efficiency and accessibility, making them promising for educational applications. However, their capacity for complex reasoning, particularly in domains such as physics, remains underexplored. This study investigates the high school physics reasoning capabilities of state-of-the-art SLMs (under 4 billion parameters), including instruct versions of Llama 3.2, Phi 4 Mini, Gemma 3, and the Qwen series. We developed a comprehensive physics dataset from the OpenStax High School Physics textbook, annotated according to Bloom's Taxonomy, with LaTeX and plaintext mathematical notations. A novel cultural contextualization approach was applied to a subset, creating culturally adapted problems for Asian, African, and South American/Australian contexts while preserving core physics principles. Using an LLM-as-a-judge framework with Google's Gemini 2.5 Flash, we evaluated answer and reasoning chain correctness, along with calculation accuracy. The results reveal significant differences between the SLMs. Qwen 3 1.7B achieved high 'answer accuracy' (85%), but 'fully correct reasoning' was substantially lower (38%). The format of the mathematical notation had a negligible impact on performance. SLMs exhibited varied performance across the physics topics and showed a decline in reasoning quality with increasing cognitive and knowledge complexity. In particular, the consistency of reasoning was largely maintained in diverse cultural contexts, especially by better performing models. These findings indicate that, while SLMs can often find correct answers, their underlying reasoning is frequently flawed, suggesting an overreliance on pattern recognition. For SLMs to become reliable educational tools in physics, future development must prioritize enhancing genuine understanding and the generation of sound, verifiable reasoning chains over mere answer accuracy.
摘要：小型语言模型（SLM）提供了计算效率和可访问性，使其对教育应用有望。但是，它们的复杂推理能力，特别是在诸如物理等领域，仍然没有得到充实的态度。这项研究调查了最先进的SLM（低于40亿个参数）的高中物理学推理能力，包括Llama 3.2，Phi 4 Mini，Gemma 3和Qwen系列的指示版本。我们从OpenStax高中物理教科书中开发了一个全面的物理数据集，并根据Bloom的分类法注释，并带有乳胶和纯种数学符号。一种新颖的文化背景化方法应用于子集，为亚洲，非洲和南美/澳大利亚的环境创造了文化适应的问题，同时保留了核心物理原则。使用Google的Gemini 2.5 Flash使用LLM-AS-A-A-A-A-A-a-Gudge框架，我们评估了答案和推理链正确性，并计算精度。结果揭示了SLM之间的显着差异。 QWEN 3 1.7B获得了高“答案准确性”（85％），但“完全正确的推理”基本低（38％）。数学符号的格式对性能的影响可以忽略不计。 SLM在物理主题中表现出不同的性能，并随着认知和知识复杂性的提高表现出推理质量的下降。特别是，推理的一致性在很大程度上是在各种文化背景下，尤其是通过更好的表现模型。这些发现表明，尽管SLM经常可以找到正确的答案，但它们的基本推理经常存在缺陷，这表明对模式识别过高。为了使SLM成为物理学上可靠的教育工具，未来的发展必须优先提高真正的理解，并产生声音，可验证的推理链，而不是单纯的答案准确性。

Title: SPA-RL: Reinforcing LLM Agents via Stepwise Progress Attribution

Authors: Hanlin Wang, Chak Tou Leong, Jiashuo Wang, Jian Wang, Wenjie Li
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.20732
Pdf URL: https://arxiv.org/pdf/2505.20732
Copy Paste: [[2505.20732]] SPA-RL: Reinforcing LLM Agents via Stepwise Progress Attribution(https://arxiv.org/abs/2505.20732)
Keywords: llm, agent
Abstract: Reinforcement learning (RL) holds significant promise for training LLM agents to handle complex, goal-oriented tasks that require multi-step interactions with external environments. However, a critical challenge when applying RL to these agentic tasks arises from delayed rewards: feedback signals are typically available only after the entire task is completed. This makes it non-trivial to assign delayed rewards to earlier actions, providing insufficient guidance regarding environmental constraints and hindering agent training. In this work, we draw on the insight that the ultimate completion of a task emerges from the cumulative progress an agent makes across individual steps. We propose Stepwise Progress Attribution (SPA), a general reward redistribution framework that decomposes the final reward into stepwise contributions, each reflecting its incremental progress toward overall task completion. To achieve this, we train a progress estimator that accumulates stepwise contributions over a trajectory to match the task completion. During policy optimization, we combine the estimated per-step contribution with a grounding signal for actions executed in the environment as the fine-grained, intermediate reward for effective agent training. Extensive experiments on common agent benchmarks (including Webshop, ALFWorld, and VirtualHome) demonstrate that SPA consistently outperforms the state-of-the-art method in both success rate (+2.5% on average) and grounding accuracy (+1.9% on average). Further analyses demonstrate that our method remarkably provides more effective intermediate rewards for RL training. Our code is available at this https URL.
摘要：强化学习（RL）对培训LLM代理人进行了巨大的希望，以处理需要与外部环境进行多步交互的复杂，面向目标的任务。但是，将RL应用于这些代理任务时，一个关键的挑战来自延迟的奖励：反馈信号通常仅在完成整个任务后才获得。这使得将延迟的奖励分配给早期行动是非平凡的，从而提供了有关环境限制和阻碍代理培训的足够指导。在这项工作中，我们借鉴了一个见解，即，任务的最终完成是从代理商跨个别步骤中所取得的累积进步中得出的。我们提出了逐步进度归因（SPA），这是一个一般的奖励再分配框架，将最终奖励分解为逐步贡献，每种奖励都反映了其在整体任务完成方面的增量进度。为了实现这一目标，我们训练一个进度估计器，该估计器积累了逐步贡献轨迹，以匹配任务完成。在策略优化期间，我们将估计的每步贡献与在环境中执行的行动的接地信号相结合，作为有效代理培训的细粒度，中等奖励。对普通代理基准（包括网络商店，Alfworld和VirtualHome）进行的广泛实验表明，水疗中心始终以成功率（平均+2.5 \％）和接地准确性（平均+1.9 \％\％）以最先进的方法优于最先进的方法。进一步的分析表明，我们的方法显着为RL培训提供了更有效的中间奖励。我们的代码可在此HTTPS URL上找到。
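
The reward redistribution idea in the SPA-RL abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the learned progress estimator is replaced here by fixed, hypothetical per-step scores that are simply rescaled so they sum to the final reward, and the function names, the `beta` weight, and the binary grounding signal are all assumptions made for illustration.

```python
from typing import List


def redistribute_reward(raw_scores: List[float], final_reward: float) -> List[float]:
    """Rescale hypothetical per-step scores so that they sum to the final reward.

    Stand-in for a trained progress estimator (assumption): the abstract only
    says the stepwise contributions accumulate to match task completion.
    """
    if not raw_scores:
        return []
    total = sum(raw_scores)
    if total == 0:
        # No step stands out, so spread the final reward evenly.
        return [final_reward / len(raw_scores)] * len(raw_scores)
    return [final_reward * score / total for score in raw_scores]


def intermediate_rewards(
    contributions: List[float], grounded: List[bool], beta: float = 0.1
) -> List[float]:
    """Add a grounding bonus (assumed binary, weighted by beta) to each contribution."""
    return [c + beta * (1.0 if g else 0.0) for c, g in zip(contributions, grounded)]


# Example: a four-step trajectory that ends in success (final reward 1.0).
contributions = redistribute_reward([1.0, 3.0, 0.0, 4.0], final_reward=1.0)
rewards = intermediate_rewards(contributions, [True, True, False, True])
```

The point of the sketch is the invariant: the per-step contributions always add up to the delayed final reward, so the dense signal stays consistent with the sparse one.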

Title: Silencer: From Discovery to Mitigation of Self-Bias in LLM-as-Benchmark-Generator

Authors: Peiwen Yuan, Yiwei Li, Shaoxiong Feng, Xinglin Wang, Yueqi Zhang, Jiayi Shi, Chuyi Tan, Boyuan Pan, Yao Hu, Kan Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.20738
Pdf URL: https://arxiv.org/pdf/2505.20738
Copy Paste: [[2505.20738]] Silencer: From Discovery to Mitigation of Self-Bias in LLM-as-Benchmark-Generator(https://arxiv.org/abs/2505.20738)
Keywords: llm
Abstract: LLM-as-Benchmark-Generator methods have been widely studied as a supplement to human annotators for scalable evaluation, while the potential biases within this paradigm remain underexplored. In this work, we systematically define and validate the phenomenon of inflated performance in models evaluated on their self-generated benchmarks, referred to as self-bias, and attribute it to sub-biases arising from question domain, language style, and wrong labels. On this basis, we propose Silencer, a general framework that leverages the heterogeneity between multiple generators at both the sample and benchmark levels to neutralize bias and generate high-quality, self-bias-silenced benchmarks. Experimental results across various settings demonstrate that Silencer can suppress self-bias to near zero and significantly improve the evaluation effectiveness of the generated benchmark (with an average improvement from 0.655 to 0.833 in Pearson correlation with a high-quality human-annotated benchmark), while also exhibiting strong generalizability.
摘要：LLM-AS基准获得方法已被广泛研究为对人类注释的补充，以进行可扩展评估，而该范式内的潜在偏见仍然没有被忽视。在这项工作中，我们系统地定义并验证了根据其自基础基准评估的模型中夸张的性能现象，称为自偏见，并将其归因于由问题域，语言样式和错误标签产生的子偏见。在此基础上，我们提出了一个消音器，这是一个一般框架，它利用了样本和基准水平的多个发电机之间的异质性，以中和偏见并产生高质量的，自偏见的基准测试。各种环境中的实验结果表明，消音器可以抑制自偏见至接近零，从而显着提高了生成的基准测试的评估效率（在皮尔逊相关性与高质量的人摩擦的基准相关性的平均提高到0.655到0.833），同时也表现出强大的可推广性。

Title: CogniBench: A Legal-inspired Framework and Dataset for Assessing Cognitive Faithfulness of Large Language Models

Authors: Xiaqiang Tang, Jian Li, Keyu Hu, Du Nan, Xiaolong Li, Xi Zhang, Weigao Sun, Sihong Xie
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.20767
Pdf URL: https://arxiv.org/pdf/2505.20767
Copy Paste: [[2505.20767]] CogniBench: A Legal-inspired Framework and Dataset for Assessing Cognitive Faithfulness of Large Language Models(https://arxiv.org/abs/2505.20767)
Keywords: language model, llm, hallucination
Abstract: Faithfulness hallucinations are claims generated by a Large Language Model (LLM) that are not supported by the context provided to the LLM. Lacking an assessment standard, existing benchmarks contain only "factual statements" that rephrase source materials, without marking "cognitive statements" that make inferences from the given context, making the consistency evaluation and optimization of cognitive statements difficult. Inspired by how evidence is assessed in the legislative domain, we design a rigorous framework to assess different levels of faithfulness of cognitive statements and create a benchmark dataset where we reveal insightful statistics. We design an annotation pipeline to create larger benchmarks for different LLMs automatically, and the resulting larger-scale CogniBench-L dataset can be used to train an accurate cognitive hallucination detection model. We release our model and dataset at: this https URL
摘要：忠实幻觉是由提供给LLM的上下文支持的大型语言模型（LLM）产生的。缺乏评估标准，现有的基准仅包含“事实陈述”，即在没有标记“认知陈述”的情况下重现源材料，这些材料从给定的上下文中推断出来，从而使认知陈述的一致性评估和优化变得困难。受到立法领域中如何评估证据的启发，我们设计了一个严格的框架，以评估认知陈述的不同忠诚程度，并创建一个基准数据集，我们可以在其中透露有见地的统计数据。我们设计了一条注释管道，以自动为不同的LLM创建更大的基准测试，而所得的较大的cognibench-l数据集可用于训练准确的认知幻觉检测模型。我们在以下位置发布我们的模型和数据集：此HTTPS URL

Title: SpecExtend: A Drop-in Enhancement for Speculative Decoding of Long Sequences

Authors: Jungyoub Cha, Hyunjong Kim, Sungzoon Cho
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.20776
Pdf URL: https://arxiv.org/pdf/2505.20776
Copy Paste: [[2505.20776]] SpecExtend: A Drop-in Enhancement for Speculative Decoding of Long Sequences(https://arxiv.org/abs/2505.20776)
Keywords: language model, llm
Abstract: Speculative decoding is a widely adopted technique for accelerating inference in large language models (LLMs), but its performance degrades on long inputs due to increased attention cost and reduced draft accuracy. We introduce SpecExtend, a drop-in enhancement that improves the performance of speculative decoding on long sequences without any additional training. SpecExtend integrates efficient attention mechanisms such as FlashAttention and Hybrid Tree Attention into both the draft and target models, reducing latency across all stages. To improve draft accuracy and speed, we propose Cross-model Retrieval, a novel KV cache update strategy that uses the target model's attention scores to dynamically select relevant context for the draft model. Extensive evaluations on three long-context understanding datasets show that SpecExtend accelerates standard tree-based speculative decoding by up to 2.22x for inputs up to 16K tokens, providing an effective solution for speculative decoding of long sequences. The code is available at this https URL.
摘要：投机解码是一种广泛采用的技术，用于加速大型语言模型（LLMS）的推断，但是由于注意力成本的提高和降低了准确性草案，其性能会降低长输入。我们介绍了Specextend，这是一种插入增强功能，可提高长序列中投机解码的性能，而无需任何其他培训。 Specextend将有效的注意机制（例如闪存和混合树的注意力集中在草稿和目标模型中），从而降低了各个阶段的延迟。为了提高草稿准确性和速度，我们提出了跨模型检索，这是一种新颖的KV缓存更新策略，它使用目标模型的注意力得分来动态选择模型草案的相关上下文。对三个长篇小说理解数据集进行了广泛的评估表明，Specextend可以通过高达2.22倍的输入加速标准的基于树的投机解码，最多可加速16K令牌，为长序列的投机解码提供了有效的解决方案。该代码可在此HTTPS URL上找到。
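
As a rough illustration of the Cross-model Retrieval idea in the SpecExtend abstract (using the target model's attention scores to pick context for the draft model), the sketch below selects the highest-scoring chunks of the input. The chunking scheme, the mean-score ranking, and the names `select_context_chunks`, `chunk_size`, and `top_k` are assumptions for illustration and are not taken from the paper or its code.

```python
from typing import List


def select_context_chunks(
    attn_scores: List[float], chunk_size: int, top_k: int
) -> List[int]:
    """Return token positions of the top_k chunks ranked by mean attention score.

    attn_scores holds one (hypothetical) target-model attention weight per
    context token. Selected positions are returned in their original order,
    so a draft model's cache could keep them as a shortened context.
    """
    n = len(attn_scores)
    spans = [(start, min(start + chunk_size, n)) for start in range(0, n, chunk_size)]
    scored = [(sum(attn_scores[a:b]) / (b - a), (a, b)) for a, b in spans]
    best = sorted(scored, key=lambda item: item[0], reverse=True)[:top_k]
    keep = sorted(span for _, span in best)
    return [i for a, b in keep for i in range(a, b)]


# Example: eight context tokens, chunks of two tokens, keep the best two chunks.
scores = [0.01, 0.02, 0.30, 0.25, 0.02, 0.01, 0.20, 0.19]
kept_positions = select_context_chunks(scores, chunk_size=2, top_k=2)
```

Keeping whole chunks rather than individual tokens is a design choice of this sketch; it preserves local word order inside each retained piece of context.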

Title: CHIMERA: A Knowledge Base of Idea Recombination in Scientific Literature

Authors: Noy Sternlicht, Tom Hope
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.20779
Pdf URL: https://arxiv.org/pdf/2505.20779
Copy Paste: [[2505.20779]] CHIMERA: A Knowledge Base of Idea Recombination in Scientific Literature(https://arxiv.org/abs/2505.20779)
Keywords: llm
Abstract: A hallmark of human innovation is the process of recombination -- creating original ideas by integrating elements of existing mechanisms and concepts. In this work, we automatically mine the scientific literature and build CHIMERA: a large-scale knowledge base (KB) of recombination examples. CHIMERA can be used to empirically explore at scale how scientists recombine concepts and take inspiration from different areas, or to train supervised machine learning models that learn to predict new creative cross-domain directions. To build this KB, we present a novel information extraction task of extracting recombination from scientific paper abstracts, collect a high-quality corpus of hundreds of manually annotated abstracts, and use it to train an LLM-based extraction model. The model is applied to a large corpus of papers in the AI domain, yielding a KB of over 28K recombination examples. We analyze CHIMERA to explore the properties of recombination in different subareas of AI. Finally, we train a scientific hypothesis generation model using the KB, which predicts new recombination directions that real-world researchers find inspiring. Our data and code are available at this https URL
摘要：人类创新的标志是重组的过程 - 通过整合现有机制和概念的要素来创造原创思想。在这项工作中，我们自动挖掘科学文献并构建嵌合体：重组示例的大规模知识库（KB）。 Chimera可用于大规模地探索科学家如何重组概念并从不同领域汲取灵感，或者训练有监督的机器学习模型，以学习预测新的创意跨域方向。为了构建此KB，我们提出了一项新的信息提取任务，该任务是从科学论文摘要中提取重组，收集数百个手动注释摘要的高质量语料库，并使用它来训练基于LLM的提取模型。该模型应用于AI域中的大量论文，产生了超过28K的重组示例的Kb。我们分析了嵌合体以探索AI不同亚地区重组的特性。最后，我们使用KB训练科学假设的产生模型，该模型预测了现实世界研究人员认为鼓舞人心的新重组方向。我们的数据和代码可在此HTTPS URL上找到

Title: Improved Representation Steering for Language Models

Authors: Zhengxuan Wu, Qinan Yu, Aryaman Arora, Christopher D. Manning, Christopher Potts
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.20809
Pdf URL: https://arxiv.org/pdf/2505.20809
Copy Paste: [[2505.20809]] Improved Representation Steering for Language Models(https://arxiv.org/abs/2505.20809)
Keywords: language model, prompt
Abstract: Steering methods for language models (LMs) seek to provide fine-grained and interpretable control over model generations by variously changing model inputs, weights, or representations to adjust behavior. Recent work has shown that adjusting weights or representations is often less effective than steering by prompting, for instance when wanting to introduce or suppress a particular concept. We demonstrate how to improve representation steering via our new Reference-free Preference Steering (RePS), a bidirectional preference-optimization objective that jointly does concept steering and suppression. We train three parameterizations of RePS and evaluate them on AxBench, a large-scale model steering benchmark. On Gemma models with sizes ranging from 2B to 27B, RePS outperforms all existing steering methods trained with a language modeling objective and substantially narrows the gap with prompting -- while promoting interpretability and minimizing parameter count. In suppression, RePS matches the language-modeling objective on Gemma-2 and outperforms it on the larger Gemma-3 variants while remaining resilient to prompt-based jailbreaking attacks that defeat prompting. Overall, our results suggest that RePS provides an interpretable and robust alternative to prompting for both steering and suppression.
摘要：语言模型（LMS）的转向方法试图通过改变模型输入，权重或表示行为来提供对模型世代的细粒度和可解释的控制。最近的工作表明，调整权重或表示形式通常不如提示，例如，在想引入或抑制特定概念时。我们演示了如何通过我们的新的无参考偏好转向（REP）改善表示的指导，这是一个双向偏好优先化的目标，共同执行概念转向和抑制。我们训练三个代表的参数化，并在Axbench上进行评估，Axbench是一个大规模的模型转向基准。在尺寸从2B到27B不等的Gemma模型上，代表的表现优于所有经过语言建模目标训练的现有转向方法，并通过提示促进了差距，同时促进了可解释性和最小化参数计数。在镇压中，代表与Gemma-2的语言模型目标相匹配，并在较大的Gemma-3变体上胜过它，同时仍然具有弹性，可抗击基于迅速的越狱攻击。总体而言，我们的结果表明，代表为提示转向和抑制提供了一种可解释且强大的替代方案。

Title: Rethinking Information Synthesis in Multimodal Question Answering A Multi-Agent Perspective

Authors: Krishna Singh Rajput, Tejas Anvekar, Chitta Baral, Vivek Gupta
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.20816
Pdf URL: https://arxiv.org/pdf/2505.20816
Copy Paste: [[2505.20816]] Rethinking Information Synthesis in Multimodal Question Answering A Multi-Agent Perspective(https://arxiv.org/abs/2505.20816)
Keywords: language model, llm, agent
Abstract: Recent advances in multimodal question answering have primarily focused on combining heterogeneous modalities or fine-tuning multimodal large language models. While these approaches have shown strong performance, they often rely on a single, generalized reasoning strategy, overlooking the unique characteristics of each modality, ultimately limiting both accuracy and interpretability. To address these limitations, we propose MAMMQA, a multi-agent QA framework for multimodal inputs spanning text, tables, and images. Our system includes two Visual Language Model (VLM) agents and one text-based Large Language Model (LLM) agent. The first VLM decomposes the user query into sub-questions and sequentially retrieves partial answers from each modality. The second VLM synthesizes and refines these results through cross-modal reasoning. Finally, the LLM integrates the insights into a cohesive answer. This modular design enhances interpretability by making the reasoning process transparent and allows each agent to operate within its domain of expertise. Experiments on diverse multimodal QA benchmarks demonstrate that our cooperative, multi-agent framework consistently outperforms existing baselines in both accuracy and robustness.
摘要：多模式问答的最新进展主要集中在结合异质方式或微调多模式大语言模型。尽管这些方法表现出很强的性能，但它们通常依赖于一种普遍的推理策略，从而忽略了每种方式的独特特征，最终限制了准确性和解释性。为了解决这些限制，我们提出了Mammqa，这是一个多模式输入的多代理QA框架，涵盖了文本，表格和图像。我们的系统包括两个视觉语言模型（VLM）代理和一个基于文本的大语言模型（LLM）代理。第一个VLM将用户查询分解为子问题，并从每种模式中顺序检索部分答案。第二VLM通过跨模式推理综合并完善了这些结果。最后，LLM将洞察力整合到一个凝聚力的答案中。这种模块化设计通过使推理过程透明，并允许每个代理在其专业知识领域内运行，从而增强了可解释性。关于多模式QA基准测试的实验表明，我们的合作，多代理框架的准确性和鲁棒性始终优于现有基线。

Title: Tracing and Reversing Rank-One Model Edits

Authors: Paul Youssef, Zhixue Zhao, Christin Seifert, Jörg Schlötterer
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.20819
Pdf URL: https://arxiv.org/pdf/2505.20819
Copy Paste: [[2505.20819]] Tracing and Reversing Rank-One Model Edits(https://arxiv.org/abs/2505.20819)
Keywords: language model, llm, prompt
Abstract: Knowledge editing methods (KEs) are a cost-effective way to update the factual content of large language models (LLMs), but they pose a dual-use risk. While KEs are beneficial for updating outdated or incorrect information, they can be exploited maliciously to implant misinformation or bias. In order to defend against these types of malicious manipulation, we need robust techniques that can reliably detect, interpret, and mitigate adversarial edits. This work investigates the traceability and reversibility of knowledge edits, focusing on the widely used Rank-One Model Editing (ROME) method. We first show that ROME introduces distinctive distributional patterns in the edited weight matrices, which can serve as effective signals for locating the edited weights. Second, we show that these altered weights can reliably be used to predict the edited factual relation, enabling partial reconstruction of the modified fact. Building on this, we propose a method to infer the edited object entity directly from the modified weights, without access to the editing prompt, achieving over 95% accuracy. Finally, we demonstrate that ROME edits can be reversed, recovering the model's original outputs with at least 80% accuracy. Our findings highlight the feasibility of detecting, tracing, and reversing edits based on the edited weights, offering a robust framework for safeguarding LLMs against adversarial manipulations.
摘要：知识编辑方法（KES）是更新大语模型（LLMS）的事实内容的一种经济有效的方式，但它们构成双重使用风险。尽管KE有益于更新过时或不正确的信息，但它们可能被恶意剥削成植入物错误或偏见。为了防止这些类型的恶意操纵，我们需要可靠地检测，解释和减轻对抗性编辑的强大技术。这项工作调查了知识编辑的可追溯性和可逆性，重点是广泛使用的排名第一模型编辑（罗马）方法。我们首先表明罗马在编辑的重量矩阵中引入了独特的分布模式，该矩阵可以用作定位编辑权重的有效信号。其次，我们表明，这些变化的权重可以可靠地用于预测编辑的事实关系，从而可以部分重建修改后的事实。在此基础上，我们提出了一种直接从修改的权重推断所编辑的对象实体的方法，而无需访问编辑提示，可以实现超过95％的精度。最后，我们证明了罗马的编辑可以逆转，以$ \ geq $ 80％的精度恢复了模型的原始输出。我们的发现突出了基于编辑的权重检测，追踪和逆转编辑的可行性，为保护LLMS防止对抗性操作提供了强大的框架。

Title: Reinforced Informativeness Optimization for Long-Form Retrieval-Augmented Generation

Authors: Yuhao Wang, Ruiyang Ren, Yucheng Wang, Wayne Xin Zhao, Jing Liu, Hua Wu, Haifeng Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.20825
Pdf URL: https://arxiv.org/pdf/2505.20825
Copy Paste: [[2505.20825]] Reinforced Informativeness Optimization for Long-Form Retrieval-Augmented Generation(https://arxiv.org/abs/2505.20825)
Keywords: language model, hallucination, retrieval-augmented generation
Abstract: Long-form question answering (LFQA) presents unique challenges for large language models, requiring the synthesis of coherent, paragraph-length answers. While retrieval-augmented generation (RAG) systems have emerged as a promising solution, existing research struggles with key limitations: the scarcity of high-quality training data for long-form generation, the compounding risk of hallucination in extended outputs, and the absence of reliable evaluation metrics for factual completeness. In this paper, we propose RioRAG, a novel reinforcement learning (RL) framework that advances long-form RAG through reinforced informativeness optimization. Our approach introduces two fundamental innovations to address the core challenges. First, we develop an RL training paradigm of reinforced informativeness optimization that directly optimizes informativeness and effectively addresses the slow-thinking deficit in conventional RAG systems, bypassing the need for expensive supervised data. Second, we propose a nugget-centric hierarchical reward modeling approach that enables precise assessment of long-form answers through a three-stage process: extracting the nugget from every source webpage, constructing a nugget claim checklist, and computing rewards based on factual alignment. Extensive experiments on two LFQA benchmarks, LongFact and RAGChecker, demonstrate the effectiveness of the proposed method. Our code is available at this https URL.
摘要：长形式的答案（LFQA）对大语言模型提出了独特的挑战，需要合成连贯的段落长度答案。尽管检索增强的一代（RAG）系统已经成为有前途的解决方案，但现有的研究斗争具有关键的局限性：长期产生的高质量培训数据的稀缺性，扩展产出中幻觉的复杂风险以及缺乏可靠的评估指标而没有可靠的评估指标来实现事实的完整性。在本文中，我们提出了Riorag，这是一种新颖的增强学习（RL）框架，该框架通过增强的信息优化来推进长形抹布。我们的方法引入了两项基本创新，以应对核心挑战。首先，我们开发了一个RL培训范围，具有加强信息优化的优化，可以直接优化信息性，并有效地解决传统抹布系统中的缓慢思考的赤字，从而绕开了对昂贵的监督数据的需求。其次，我们提出了一种以掘金为中心的层次奖励建模方法，该方法可以通过三阶段的过程进行精确评估长形答案：从每个源网页中提取掘金，从每个源网页中提取掘金，构建掘金声明清单，并基于事实统一的计算奖励。在两个LFQA基准长期和Ragchecker上进行了广泛的实验证明了该方法的有效性。我们的代码可在此HTTPS URL上找到。

Title: AdParaphrase v2.0: Generating Attractive Ad Texts Using a Preference-Annotated Paraphrase Dataset

Authors: Soichiro Murakami, Peinan Zhang, Hidetaka Kamigaito, Hiroya Takamura, Manabu Okumura
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.20826
Pdf URL: https://arxiv.org/pdf/2505.20826
Copy Paste: [[2505.20826]] AdParaphrase v2.0: Generating Attractive Ad Texts Using a Preference-Annotated Paraphrase Dataset(https://arxiv.org/abs/2505.20826)
Keywords: language model
Abstract: Identifying factors that make ad text attractive is essential for advertising success. This study proposes AdParaphrase v2.0, a dataset for ad text paraphrasing, containing human preference data, to enable the analysis of the linguistic factors and to support the development of methods for generating attractive ad texts. Compared with v1.0, this dataset is 20 times larger, comprising 16,460 ad text paraphrase pairs, each annotated with preference data from ten evaluators, thereby enabling a more comprehensive and reliable analysis. Through the experiments, we identified multiple linguistic features of engaging ad texts that were not observed in v1.0 and explored various methods for generating attractive ad texts. Furthermore, our analysis demonstrated the relationships between human preference and ad performance, and highlighted the potential of reference-free metrics based on large language models for evaluating ad text attractiveness. The dataset is publicly available at: this https URL.
摘要：识别使广告文本具有吸引力的因素对于广告成功至关重要。这项研究提出了Adparaphrase v2.0，该数据集包含人类偏好数据，以实现语言因素的分析并支持生成有吸引力的AD文本的方法的开发。与v1.0相比，该数据集大于20倍，包括16,460 AD文本释义对，每个释义对十个评估者的偏好数据进行注释，从而实现了更全面和更可靠的分析。通过实验，我们确定了在V1.0中未观察到的引人入胜的广告文本的多种语言特征，并探索了生成有吸引力的AD文本的各种方法。此外，我们的分析证明了人类偏好与广告表现之间的关系，并强调了基于大语言模型的无参考指标的潜力，以评估广告文本吸引力。该数据集可公开获得：此HTTPS URL。

Title: Concealment of Intent: A Game-Theoretic Analysis

Authors: Xinbo Wu, Abhishek Umrawal, Lav R. Varshney
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.20841
Pdf URL: https://arxiv.org/pdf/2505.20841
Copy Paste: [[2505.20841]] Concealment of Intent: A Game-Theoretic Analysis(https://arxiv.org/abs/2505.20841)
Keywords: language model, llm, prompt
Abstract: As large language models (LLMs) grow more capable, concerns about their safe deployment have also grown. Although alignment mechanisms have been introduced to deter misuse, they remain vulnerable to carefully designed adversarial prompts. In this work, we present a scalable attack strategy: intent-hiding adversarial prompting, which conceals malicious intent through the composition of skills. We develop a game-theoretic framework to model the interaction between such attacks and defense systems that apply both prompt and response filtering. Our analysis identifies equilibrium points and reveals structural advantages for the attacker. To counter these threats, we propose and analyze a defense mechanism tailored to intent-hiding attacks. Empirically, we validate the attack's effectiveness on multiple real-world LLMs across a range of malicious behaviors, demonstrating clear advantages over existing adversarial prompting techniques.
摘要：随着大型语言模型（LLMS）的发展越来越有能力，对其安全部署的担忧也越来越大。尽管已经引入了对准机制来阻止滥用，但它们仍然容易受到精心设计的对抗提示的影响。在这项工作中，我们提出了可扩展的攻击策略：掩盖对抗性提示，这通过技能组成掩盖了恶意意图。我们开发了一个游戏理论框架，以模拟此类攻击和防御系统之间的相互作用，这些攻击系统既有提示和响应过滤。我们的分析确定了平衡点，并揭示了攻击者的结构优势。为了应对这些威胁，我们提出并分析了一种针对隐藏攻击的防御机制。从经验上讲，我们验证了攻击对多种恶意行为的多个现实世界LLM的有效性，证明了与现有的对抗性提示技术相比具有明显的优势。

Title: Divide-Then-Align: Honest Alignment based on the Knowledge Boundary of RAG

Authors: Xin Sun, Jianan Xie, Zhongqi Chen, Qiang Liu, Shu Wu, Yuehe Chen, Bowen Song, Weiqiang Wang, Zilei Wang, Liang Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.20871
Pdf URL: https://arxiv.org/pdf/2505.20871
Copy Paste: [[2505.20871]] Divide-Then-Align: Honest Alignment based on the Knowledge Boundary of RAG(https://arxiv.org/abs/2505.20871)
Keywords: language model, llm
Abstract: Large language models (LLMs) augmented with retrieval systems have significantly advanced natural language processing tasks by integrating external knowledge sources, enabling more accurate and contextually rich responses. To improve the robustness of such systems against noisy retrievals, Retrieval-Augmented Fine-Tuning (RAFT) has emerged as a widely adopted method. However, RAFT conditions models to generate answers even in the absence of reliable knowledge. This behavior undermines their reliability in high-stakes domains, where acknowledging uncertainty is critical. To address this issue, we propose Divide-Then-Align (DTA), a post-training approach designed to endow RAG systems with the ability to respond with "I don't know" when the query is out of the knowledge boundary of both the retrieved passages and the model's internal knowledge. DTA divides data samples into four knowledge quadrants and constructs tailored preference data for each quadrant, resulting in a curated dataset for Direct Preference Optimization (DPO). Experimental results on three benchmark datasets demonstrate that DTA effectively balances accuracy with appropriate abstention, enhancing the reliability and trustworthiness of retrieval-augmented systems.
摘要：通过集成外部知识源，可以通过集成外部知识来源，可以更准确且上下文丰富的响应来实现大型语言模型（LLMS）增强的自然语言处理任务。为了改善此类系统对嘈杂检索的鲁棒性，取回式的微调（RAFT）已成为一种广泛采用的方法。但是，即使在没有可靠的知识的情况下，筏条件也可以产生答案。这种行为破坏了它们在高风险领域的可靠性，因为承认不确定性至关重要。为了解决这个问题，我们提出了划分-Align（DTA），这是一种训练后的方法，旨在赋予抹布系统的能力，即当查询不超出检索到的段落的知识边界和模型的内部知识的知识边界时。 DTA将数据样本划分为四个知识象限，并为每个象限定制偏好数据，从而产生一个策划的数据集，以进行直接偏好优化（DPO）。三个基准数据集的实验结果表明，DTA有效地平衡了准确性与适当的弃权，从而增强了检索功能的系统的可靠性和可信度。
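
The quadrant-based data construction described in the Divide-Then-Align abstract can be sketched as follows. The abstract does not spell out the quadrant definitions, so this sketch assumes they come from two binary checks (whether the model's internal knowledge suffices, and whether the retrieved passages suffice). The function names, the dictionary layout, and the choice of rejected responses are likewise hypothetical.

```python
from typing import Dict

IDK = "I don't know"


def knowledge_quadrant(model_knows: bool, retrieval_knows: bool) -> str:
    """Name one of four quadrants from two binary knowledge checks (assumed definition)."""
    model_part = "model-known" if model_knows else "model-unknown"
    retrieval_part = "retrieval-known" if retrieval_knows else "retrieval-unknown"
    return f"{model_part}/{retrieval_part}"


def build_preference_pair(
    question: str,
    gold_answer: str,
    model_answer: str,
    model_knows: bool,
    retrieval_knows: bool,
) -> Dict[str, str]:
    """Build one DPO-style preference record for a sample.

    Outside both knowledge boundaries, abstaining is preferred over the
    model's (likely wrong) answer; otherwise the gold answer is preferred
    over an unnecessary abstention. Both choices are assumptions.
    """
    if not model_knows and not retrieval_knows:
        chosen, rejected = IDK, model_answer
    else:
        chosen, rejected = gold_answer, IDK
    return {
        "quadrant": knowledge_quadrant(model_knows, retrieval_knows),
        "prompt": question,
        "chosen": chosen,
        "rejected": rejected,
    }


# Example: neither the model nor the retrieved passages can answer the query.
pair = build_preference_pair(
    "Who won the 2031 final?", "unknown", "Team A", model_knows=False, retrieval_knows=False
)
```

Records of this shape (prompt, chosen, rejected) are the usual input format for preference optimization, which is why the sketch returns a flat dictionary.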

Title: Can LLMs Learn to Map the World from Local Descriptions?

Authors: Sirui Xia, Aili Chen, Xintao Wang, Tinghui Zhu, Yikai Zhang, Jiangjie Chen, Yanghua Xiao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.20874
Pdf URL: https://arxiv.org/pdf/2505.20874
Copy Paste: [[2505.20874]] Can LLMs Learn to Map the World from Local Descriptions?(https://arxiv.org/abs/2505.20874)
Keywords: language model, llm
Abstract: Recent advances in Large Language Models (LLMs) have demonstrated strong capabilities in tasks such as code and mathematics. However, their potential to internalize structured spatial knowledge remains underexplored. This study investigates whether LLMs, grounded in locally relative human observations, can construct coherent global spatial cognition by integrating fragmented relational descriptions. We focus on two core aspects of spatial cognition: spatial perception, where models infer consistent global layouts from local positional relationships, and spatial navigation, where models learn road connectivity from trajectory data and plan optimal paths between unconnected locations. Experiments conducted in a simulated urban environment demonstrate that LLMs not only generalize to unseen spatial relationships between points of interest (POIs) but also exhibit latent representations aligned with real-world spatial distributions. Furthermore, LLMs can learn road connectivity from trajectory descriptions, enabling accurate path planning and dynamic spatial awareness during navigation.
摘要：大型语言模型（LLM）的最新进展已证明在代码和数学等任务中具有很强的功能。但是，它们内化结构化的空间知识的潜力仍然没有被逐渐解散。这项研究研究了基于本地相对人类观察的LLM是否可以通过整合零散的关系描述来构建连贯的全局空间认知。我们关注空间认知的两个核心方面：空间感知，其中模型从局部位置关系中推断出一致的全球布局以及空间导航，其中模型从轨迹数据和未连接位置之间的最佳路径学习道路连接。在模拟的城市环境中进行的实验表明，LLM不仅概括了兴趣点之间看不见的空间关系（POI）（POI），而且表现出与现实世界空间分布相符的潜在表示。此外，LLM可以从轨迹描述中学习道路连通性，从而在导航过程中实现准确的路径计划和动态空间意识。

Title: Trans-EnV: A Framework for Evaluating the Linguistic Robustness of LLMs Against English Varieties

Authors: Jiyoung Lee, Seungho Kim, Jieun Han, Jun-Min Lee, Kitaek Kim, Alice Oh, Edward Choi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.20875
Pdf URL: https://arxiv.org/pdf/2505.20875
Copy Paste: [[2505.20875]] Trans-EnV: A Framework for Evaluating the Linguistic Robustness of LLMs Against English Varieties(https://arxiv.org/abs/2505.20875)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) are predominantly evaluated on Standard American English (SAE), often overlooking the diversity of global English varieties. This narrow focus may raise fairness concerns as degraded performance on non-standard varieties can lead to unequal benefits for users worldwide. Therefore, it is critical to extensively evaluate the linguistic robustness of LLMs on multiple non-standard English varieties. We introduce Trans-EnV, a framework that automatically transforms SAE datasets into multiple English varieties to evaluate the linguistic robustness. Our framework combines (1) linguistics expert knowledge to curate variety-specific features and transformation guidelines from linguistic literature and corpora, and (2) LLM-based transformations to ensure both linguistic validity and scalability. Using Trans-EnV, we transform six benchmark datasets into 38 English varieties and evaluate seven state-of-the-art LLMs. Our results reveal significant performance disparities, with accuracy decreasing by up to 46.3% on non-standard varieties. These findings highlight the importance of comprehensive linguistic robustness evaluation across diverse English varieties. Each construction of Trans-EnV was validated through rigorous statistical testing and consultation with a researcher in the field of second language acquisition, ensuring its linguistic validity. Our code (this https URL) and datasets (this https URL) are publicly available.
摘要：大型语言模型（LLMS）主要根据标准的美国英语（SAE）进行评估，通常忽略了全球英语品种的多样性。这种狭窄的重点可能会引起公平的关注，因为在非标准品种上的性能下降会导致全球用户带来不平等的利益。因此，至关重要的是，在多个非标准的英语品种上广泛评估LLM的语言鲁棒性。我们介绍了Trans-Env，该框架自动将SAE数据集转换为多个英语品种以评估语言鲁棒性。我们的框架结合了（1）语言学专家知识，以策划各种特定特征和语言文献和语料库的转型指南，以及（2）基于LLM的转换，以确保语言有效性和可扩展性。使用Trans-Env，我们将六个基准数据集转换为38个英语品种，并评估七个最先进的LLMS。我们的结果显示出明显的性能差异，非标准品种的准确性降低了46.3％。这些发现突出了各种英语品种综合语言鲁棒性评估的重要性。 Trans-ENV的每次构建均通过严格的统计测试和与第二语言获取领域的研究人员进行咨询来验证，从而确保其语言有效性。我们的\ href {此https url} {code}和\ href {此https url} {dataSets}公开可用。

Title: MSA at SemEval-2025 Task 3: High Quality Weak Labeling and LLM Ensemble Verification for Multilingual Hallucination Detection

Authors: Baraa Hikal, Ahmed Nasreldin, Ali Hamdi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.20880
Pdf URL: https://arxiv.org/pdf/2505.20880
Copy Paste: [[2505.20880]] MSA at SemEval-2025 Task 3: High Quality Weak Labeling and LLM Ensemble Verification for Multilingual Hallucination Detection(https://arxiv.org/abs/2505.20880)
Keywords: language model, llm, hallucination, prompt
Abstract: This paper describes our submission for SemEval-2025 Task 3: Mu-SHROOM, the Multilingual Shared-task on Hallucinations and Related Observable Overgeneration Mistakes. The task involves detecting hallucinated spans in text generated by instruction-tuned Large Language Models (LLMs) across multiple languages. Our approach combines task-specific prompt engineering with an LLM ensemble verification mechanism, where a primary model extracts hallucination spans and three independent LLMs adjudicate their validity through probability-based voting. This framework simulates the human annotation workflow used in the shared task validation and test data. Additionally, fuzzy matching refines span alignment. Our system ranked 1st in Arabic and Basque, 2nd in German, Swedish, and Finnish, and 3rd in Czech, Farsi, and French.
摘要：本文介绍了我们对Semeval-2025任务的提交3：MU SHROOM，关于幻觉的多语言共享任务以及相关的可观察到的过度错误。该任务涉及检测通过多种语言通过指令调整的大语言模型（LLM）生成的文本中的幻觉跨度。我们的方法将特定于任务的及时工程与LLM集合验证机制结合在一起，其中主要模型提取幻觉跨度，三个独立的LLM通过基于概率的投票来裁定其有效性。该框架模拟了共享任务验证和测试数据中使用的人类注释工作流程。此外，模糊匹配的完善跨度对齐。我们的系统在阿拉伯语和巴斯克语中排名第一，在德国，瑞典语和芬兰人中排名第二，在捷克，法尔西和法国的第三名。
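
The verification step described in this abstract (several LLMs adjudicating each extracted span by probability-based voting, followed by fuzzy matching to align spans) can be illustrated with a minimal sketch. The abstract does not specify the aggregation rule or the matcher, so everything here is an assumption made for illustration: the mean-probability rule, the 0.5 threshold, the sliding-window matcher built on Python's `difflib`, and the function names `vote`, `align_span`, and `verify_spans`.

```python
from difflib import SequenceMatcher
from statistics import mean


def vote(probabilities, threshold=0.5):
    """Accept a span when the mean verifier probability reaches the threshold.

    Assumption: mean aggregation and a 0.5 cut-off; the paper may differ.
    """
    return mean(probabilities) >= threshold


def align_span(text, span):
    """Return (start, end) of the substring of `text` that best matches `span`."""
    exact = text.find(span)
    if exact != -1:
        return exact, exact + len(span)
    # Fall back to a sliding window scored with difflib's similarity ratio.
    width = len(span)
    best_start, best_score = 0, -1.0
    for start in range(max(1, len(text) - width + 1)):
        score = SequenceMatcher(None, text[start:start + width], span).ratio()
        if score > best_score:
            best_start, best_score = start, score
    return best_start, min(len(text), best_start + width)


def verify_spans(text, candidates):
    """Keep candidate spans that pass the vote and map them to character offsets.

    `candidates` is a list of (span_text, [p1, p2, p3]) pairs, where each p is
    one verifier's probability that the span is hallucinated.
    """
    return [align_span(text, span) for span, probs in candidates if vote(probs)]
```

For example, `verify_spans(text, [("in Berlin Germany", [0.9, 0.8, 0.7])])` would keep the span and return offsets into `text` even though the extracted string dropped a comma.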

Title: EasyDistill: A Comprehensive Toolkit for Effective Knowledge Distillation of Large Language Models

Authors: Chengyu Wang, Junbing Yan, Wenrui Cai, Yuanhao Yue, Jun Huang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.20888
Pdf URL: https://arxiv.org/pdf/2505.20888
Copy Paste: [[2505.20888]] EasyDistill: A Comprehensive Toolkit for Effective Knowledge Distillation of Large Language Models(https://arxiv.org/abs/2505.20888)
Keywords: language model, llm
Abstract: In this paper, we present EasyDistill, a comprehensive toolkit designed for effective black-box and white-box knowledge distillation (KD) of large language models (LLMs). Our framework offers versatile functionalities, including data synthesis, supervised fine-tuning, ranking optimization, and reinforcement learning techniques specifically tailored for KD scenarios. The toolkit accommodates KD functionalities for both System 1 (fast, intuitive) and System 2 (slow, analytical) models. With its modular design and user-friendly interface, EasyDistill empowers researchers and industry practitioners to seamlessly experiment with and implement state-of-the-art KD strategies for LLMs. In addition, EasyDistill provides a series of robust distilled models and KD-based industrial solutions developed by us, along with the corresponding open-sourced datasets, catering to a variety of use cases. Furthermore, we describe the seamless integration of EasyDistill into Alibaba Cloud's Platform for AI (PAI). Overall, the EasyDistill toolkit makes advanced KD techniques for LLMs more accessible and impactful within the NLP community.
摘要：在本文中，我们提出了EasyDistill，这是一个综合工具包，旨在用于大型语言模型（LLMS）的有效黑盒和白盒知识蒸馏（KD）。我们的框架提供了多功能功能，包括数据综合，监督微调，排名优化和强化学习技术，专门针对KD方案量身定制。该工具包适用于系统1（快速，直觉）和系统2（慢速，分析）模型的KD功能。 EasyDistill凭借其模块化设计和用户友好的界面，使研究人员和行业从业人员能够无缝尝试并实施最新的LLMS策略。此外，EasyDistill提供了一系列可靠的蒸馏型和基于KD的工业解决方案，以及相应的开源数据集，可满足各种用例。此外，我们描述了EasyDistill无缝集成到Alibaba Cloud的AI（PAI）平台中。总体而言，EasyDistill Toolkit在NLP社区中为LLMS提供了更容易访问和影响力的高级KD技术。

Title: A Stereotype Content Analysis on Color-related Social Bias in Large Vision Language Models

Authors: Junhyuk Choi, Minju Kim, Yeseon Hong, Bugeun Kim
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.20901
Pdf URL: https://arxiv.org/pdf/2505.20901
Copy Paste: [[2505.20901]] A Stereotype Content Analysis on Color-related Social Bias in Large Vision Language Models(https://arxiv.org/abs/2505.20901)
Keywords: language model
Abstract: As large vision language models (LVLMs) rapidly advance, concerns about their potential to learn and generate social biases and stereotypes are increasing. Previous studies on stereotypes in LVLMs face two primary limitations: metrics that overlook the importance of content words, and datasets that overlook the effect of color. To address these limitations, this study introduces new evaluation metrics based on the Stereotype Content Model (SCM). We also propose BASIC, a benchmark for assessing gender, race, and color stereotypes. Using SCM metrics and BASIC, we conduct a study with eight LVLMs to discover stereotypes. We report three findings: (1) The SCM-based evaluation is effective in capturing stereotypes. (2) LVLMs exhibit color stereotypes in their output, along with gender and race stereotypes. (3) The interaction between model architecture and parameter size seems to affect stereotypes. We release BASIC publicly on [anonymized for review].
摘要：随着大型视觉语言模型（LVLM）迅速发展，对它们学习和产生社会偏见和刻板印象的潜力的担忧正在增加。先前对LVLM刻板印象的研究面临着两个主要局限性：忽略内容词的重要性的指标，而数据集则忽略了颜色的效果。为了解决这些局限性，本研究介绍了基于刻板印象内容模型（SCM）的新评估指标。我们还提出了基本，是评估性别，种族和颜色刻板印象的基准。使用SCM指标和基本，我们对八个LVLM进行了一项研究，以发现刻板印象。结果，我们发现了三个发现。（1）基于SCM的评估可有效捕获刻板印象。（2）LVLMS与性别和种族的颜色刻板印象在输出中表现出颜色刻板印象。（3）模型体系结构和参数大小之间的相互作用似乎会影响刻板印象。我们公开发布[匿名审查]。

Title: Towards Objective Fine-tuning: How LLMs' Prior Knowledge Causes Potential Poor Calibration?

Authors: Ziming Wang, Zeyu Shi, Haoyi Zhou, Shiqi Gao, Qingyun Sun, Jianxin Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.20903
Pdf URL: https://arxiv.org/pdf/2505.20903
Copy Paste: [[2505.20903]] Towards Objective Fine-tuning: How LLMs' Prior Knowledge Causes Potential Poor Calibration?(https://arxiv.org/abs/2505.20903)
Keywords: language model, llm
Abstract: Fine-tuned Large Language Models (LLMs) often demonstrate poor calibration, with their confidence scores misaligned with actual performance. While calibration has been extensively studied in models trained from scratch, the impact of LLMs' prior knowledge on calibration during fine-tuning remains understudied. Our research reveals that LLMs' prior knowledge causes potential poor calibration due to the ubiquitous presence of known data in real-world fine-tuning, which appears harmful for calibration. Specifically, data aligned with LLMs' prior knowledge would induce overconfidence, while new knowledge improves calibration. Our findings expose a tension: LLMs' encyclopedic knowledge, while enabling task versatility, undermines calibration through unavoidable knowledge overlaps. To address this, we propose CogCalib, a cognition-aware framework that applies targeted learning strategies according to the model's prior knowledge. Experiments across 7 tasks using 3 LLM families prove that CogCalib significantly improves calibration while maintaining performance, achieving an average 57% reduction in ECE compared to standard fine-tuning in Llama3-8B. These improvements generalize well to out-of-domain tasks, enhancing the objectivity and reliability of domain-specific LLMs, and making them more trustworthy for critical human-AI interaction applications.
摘要：微调的大语言模型（LLM）通常表现出校准不佳，其信心得分与实际表现不一致。尽管校准已经在从头开始训练的模型中进行了广泛的研究，但LLMS在微调过程中对校准的知识的影响仍在研究中。我们的研究表明，LLMS的先验知识会导致潜在的校准，这是由于无处不在的现实微型微调中存在已知数据，这似乎对校准有害。具体而言，与LLMS的先验知识一致的数据会导致过度自信，而新知识会改善校准。我们的发现暴露了张力：LLMS的百科全书知识，同时实现任务多功能性，通过不可避免的知识重叠来破坏校准。为了解决这个问题，我们提出了Cogcalib，这是一种认知感知的框架，该框架根据模型的先验知识应用了针对性的学习策略。使用3个LLM家族进行的7个任务实验证明，与Llama3-8B的标准微调相比，Cogcalib在保持性能的同时显着改善了校准，ECE平均降低了57％。这些改进很好地推广到了域外任务，增强了特定于域的LLM的客观性和可靠性，并使它们对关键的人类互动应用更加值得信赖。
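
The abstract reports its headline result as a reduction in ECE (expected calibration error). As a reminder of what that metric measures, the sketch below computes the standard binned ECE: the sample-weighted average gap between accuracy and mean confidence within each confidence bin. The equal-width binning and the default of 10 bins are assumptions; the abstract does not state which variant the authors use, and the function name is hypothetical.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: sum over bins of (bin size / N) * |accuracy - mean confidence|.

    `confidences` are probabilities in [0, 1]; `correct` holds 1/0 (or True/False)
    for whether each prediction was right. Equal-width bins are an assumption.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += len(members) / n * abs(accuracy - avg_conf)
    return ece
```

An overconfident model, for instance one that answers with 0.9 confidence but is right only 75% of the time, yields a gap of 0.15 in that bin; a lower ECE means confidence tracks accuracy more closely.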

Title: Automated Privacy Information Annotation in Large Language Model Interactions

Authors: Hang Zeng, Xiangyu Liu, Yong Hu, Chaoyue Niu, Fan Wu, Shaojie Tang, Guihai Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.20910
Pdf URL: https://arxiv.org/pdf/2505.20910
Copy Paste: [[2505.20910]] Automated Privacy Information Annotation in Large Language Model Interactions(https://arxiv.org/abs/2505.20910)
Keywords: language model, llm
Abstract: Users interacting with large language models (LLMs) under their real identifiers often unknowingly risk disclosing private information. Automatically notifying users whether their queries leak privacy and which phrases leak what private information has therefore become a practical need. Existing privacy detection methods, however, were designed for different objectives and application scenarios, typically tagging personally identifiable information (PII) in anonymous content. In this work, to support the development and evaluation of privacy detection models for LLM interactions that are deployable on local user devices, we construct a large-scale multilingual dataset with 249K user queries and 154K annotated privacy phrases. In particular, we build an automated privacy annotation pipeline with cloud-based strong LLMs to automatically extract privacy phrases from dialogue datasets and annotate leaked information. We also design evaluation metrics at the levels of privacy leakage, extracted privacy phrase, and privacy information. We further establish baseline methods using light-weight LLMs with both tuning-free and tuning-based methods, and report a comprehensive evaluation of their performance. Evaluation results reveal a gap between current performance and the requirements of real-world LLM applications, motivating future research into more effective local privacy detection methods grounded in our dataset.
摘要：用户在其实际标识符下与大型语言模型（LLM）互动通常在不知不觉中冒着披露私人信息的风险。自动通知用户的查询是否泄漏隐私以及哪些短语泄漏了哪些私人信息已成为实际需求。但是，现有的隐私检测方法是针对不同目标和应用程序方案设计的，通常在匿名内容中标记个人身份信息（PII）。在这项工作中，为了支持可在本地用户设备上部署的LLM交互的隐私检测模型的开发和评估，我们构建了一个具有249K用户查询和154K注释隐私短语的大规模多语言数据集。特别是，我们使用基于云的强LLM构建自动隐私注释管道，以自动从对话数据集中提取隐私短语并注释泄漏的信息。我们还在隐私泄漏，提取的隐私短语和隐私信息的层面上设计评估指标。我们进一步建立了使用轻量级LLM的基线方法，既可以使用无调和基于调整的方法，又报告了对其性能的全面评估。评估结果揭示了当前绩效与现实世界中LLM应用程序的要求之间存在差距，从而激发了对基于我们数据集中更有效的本地隐私检测方法的未来研究。

Title: Automatic Transmission for LLM Tiers: Optimizing Cost and Accuracy in Large Language Models

Authors: Injae Na, Keonwoong Noh, Woohwan Jung
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.20921
Pdf URL: https://arxiv.org/pdf/2505.20921
Copy Paste: [[2505.20921]] Automatic Transmission for LLM Tiers: Optimizing Cost and Accuracy in Large Language Models(https://arxiv.org/abs/2505.20921)
Keywords: language model, llm
Abstract: LLM providers typically offer multiple LLM tiers, varying in performance and price. As NLP tasks become more complex and modularized, selecting a suitable LLM tier for each subtask is a key challenge in balancing cost and performance. To address this problem, we introduce the LLM Automatic Transmission (LLM-AT) framework, which automatically selects LLM tiers without training. LLM-AT consists of a Starter, a Generator, and a Judge. The Starter selects the initial LLM tier expected to solve the given question, the Generator produces a response using the LLM of the selected tier, and the Judge evaluates the validity of the response. If the response is invalid, LLM-AT iteratively upgrades to a higher-tier model, generates a new response, and re-evaluates until a valid response is obtained. Additionally, we propose an accuracy estimator, which enables suitable initial LLM tier selection without training. Given an input question, the accuracy estimator estimates the expected accuracy of each LLM tier by computing the valid response rate across the top-k similar queries from past inference records. Experiments demonstrate that LLM-AT achieves superior performance while reducing costs, making it a practical solution for real-world applications.
摘要：LLM提供商通常提供多个LLM级别，其性能和价格各不相同。随着NLP任务变得更加复杂和模块化，为每个子任务选择合适的LLM层是平衡成本和性能之间的关键挑战。为了解决该问题，我们介绍了LLM自动变速箱（LLM-AT）框架，该框架自动选择LLM层而无需培训。 LLM-AT由起动器，发电机和法官组成。起动器选择预期解决给定问题的初始LLM层，使用选定层的LLM产生响应，法官评估了响应的有效性。如果响应无效，LLM-AT迭代升级为高层模型，生成新的响应并重新评估，直到获得有效的响应为止。此外，我们提出了精确度估计器，该估计量可以在没有训练的情况下进行合适的初始LLM层选择。给定输入问题，精度估计器通过计算过去推断记录的TOP-K相似查询的有效响应率来估计每个LLM层的预期准确性。实验表明，LLM-AT在降低成本的同时取得了出色的性能，这使其成为现实世界应用的实用解决方案。
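
The control flow described in this abstract (estimate each tier's accuracy from the top-k most similar past queries, start at a suitable tier, then generate, judge, and upgrade until a response is valid) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the token-overlap (Jaccard) similarity, the 0.5 acceptance threshold, the cheapest-first tier ordering, and all function names are hypothetical, and `generate` and `judge` are caller-supplied callables standing in for real model calls.

```python
def jaccard(a, b):
    """Token-overlap similarity; a stand-in for whatever similarity the paper uses."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def estimate_accuracy(question, history, tier, k=3):
    """Valid-response rate of `tier` over the top-k most similar past queries.

    `history` is a list of (query, tier, was_valid) records.
    """
    scored = [(jaccard(question, q), valid) for q, t, valid in history if t == tier]
    top = sorted(scored, key=lambda r: r[0], reverse=True)[:k]
    if not top:
        return 0.0
    return sum(valid for _, valid in top) / len(top)


def select_start_tier(question, history, tiers, threshold=0.5, k=3):
    """Starter: pick the cheapest tier whose estimated accuracy meets the threshold."""
    for tier in tiers:  # assumed to be ordered cheapest first
        if estimate_accuracy(question, history, tier, k) >= threshold:
            return tier
    return tiers[-1]


def answer(question, history, tiers, generate, judge, threshold=0.5, k=3):
    """Generate with the starting tier, then upgrade until the judge accepts."""
    start = tiers.index(select_start_tier(question, history, tiers, threshold, k))
    tier, response = tiers[start], None
    for tier in tiers[start:]:
        response = generate(tier, question)  # Generator
        if judge(question, response):        # Judge
            break
    # If no tier produced a valid response, the top tier's response is returned.
    return tier, response
```

Because the estimator only reads past inference records, no component in this sketch needs training, which matches the training-free claim in the abstract.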

Title: Multi-objective Large Language Model Alignment with Hierarchical Experts

Authors: Zhuo Li, Guodong Du, Weiyang Guo, Yigeng Zhou, Xiucheng Li, Wenya Wang, Fangming Liu, Yequan Wang, Deheng Ye, Min Zhang, Jing Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.20925
Pdf URL: https://arxiv.org/pdf/2505.20925
Copy Paste: [[2505.20925]] Multi-objective Large Language Model Alignment with Hierarchical Experts(https://arxiv.org/abs/2505.20925)
Keywords: language model, llm
Abstract: Aligning large language models (LLMs) to simultaneously satisfy multiple objectives remains a significant challenge, especially given the diverse and often conflicting nature of human preferences. Existing alignment methods struggle to balance trade-offs effectively, often requiring costly retraining or yielding suboptimal results across the Pareto frontier of preferences. In this paper, we introduce HoE (Hierarchical Mixture-of-Experts), a lightweight, parameter-efficient, and plug-and-play approach that eliminates the need for model training, while enabling LLMs to adapt across the entire Pareto frontier and accommodate diverse user preferences. In particular, HoE consists of three hierarchical components: LoRA Experts, Router Experts and Preference Routing, reaching optimal Pareto frontiers and achieving a trade-off between parameter size, training cost, and performance. We evaluate HoE across various tasks on 14 objectives and 200 different preferences among 6 benchmarks, demonstrating superior performance over 15 recent baselines. Code is available in the supplementary materials.
摘要：使大型语言模型（LLM）同时满足多个目标仍然是一个重大挑战，尤其是考虑到人类偏好的多样化和经常相互冲突的本质。现有的一致性方法难以有效地平衡权衡取舍，通常需要在偏好的偏好边界上昂贵的再培训或产生次优的结果。 In this paper, we introduce \textit{HoE}(Hierarchical Mixture-of-Experts), a \textit{lightweight}, \textit{parameter-efficient}, and \textit{plug-and-play} approach that eliminates the need for model training, while enabling LLMs to adapt across the entire Pareto frontier and accommodate diverse user preferences.特别是，\ textit {hoe}由三个层次组成部分组成：洛拉专家，路由器专家和偏好路由，达到最佳的帕累托前沿，并在参数规模，培训成本和性能之间取消了权衡。我们在6个基准中的14个目标和200个不同的偏好上评估了\ textit {hoe}，这表明了比最近15个基线的卓越性能。补充材料中有代码。

Title: Information-Theoretic Complementary Prompts for Improved Continual Text Classification

Authors: Duzhen Zhang, Yong Ren, Chenxing Li, Dong Yu, Tielin Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.20933
Pdf URL: https://arxiv.org/pdf/2505.20933
Copy Paste: [[2505.20933]] Information-Theoretic Complementary Prompts for Improved Continual Text Classification(https://arxiv.org/abs/2505.20933)
Keywords: prompt
Abstract: Continual Text Classification (CTC) aims to continuously classify new text data over time while minimizing catastrophic forgetting of previously acquired knowledge. However, existing methods often focus on task-specific knowledge, overlooking the importance of shared, task-agnostic knowledge. Inspired by the complementary learning systems theory, which posits that humans learn continually through the interaction of two systems -- the hippocampus, responsible for forming distinct representations of specific experiences, and the neocortex, which extracts more general and transferable representations from past experiences -- we introduce Information-Theoretic Complementary Prompts (InfoComp), a novel approach for CTC. InfoComp explicitly learns two distinct prompt spaces: P(rivate)-Prompt and S(hared)-Prompt. These respectively encode task-specific and task-invariant knowledge, enabling models to sequentially learn classification tasks without relying on data replay. To promote more informative prompt learning, InfoComp uses an information-theoretic framework that maximizes mutual information between different parameters (or encoded representations). Within this framework, we design two novel loss functions: (1) to strengthen the accumulation of task-specific knowledge in P-Prompt, effectively mitigating catastrophic forgetting, and (2) to enhance the retention of task-invariant knowledge in S-Prompt, improving forward knowledge transfer. Extensive experiments on diverse CTC benchmarks show that our approach outperforms previous state-of-the-art methods.
摘要：持续的文本分类（CTC）旨在随着时间的推移不断地对新文本数据进行分类，同时最大程度地减少对先前获得的知识的灾难性忘记。但是，现有的方法通常集中在特定于任务的知识上，忽略了共享的任务不合Snostic知识的重要性。受互补学习系统理论的启发，该理论认为，人类通过两个系统的相互作用不断学习 - 海马，负责形成特定体验的不同表示，而新皮层则从过去的经验中提取更通用和可转移的表示形式 - 我们介绍了信息性互补提示（Infocomps），一种用于CTC的方法。 InfoComp明确地学习了两个不同的提示空间：P（contrate） - Prompt and S（Hared） - Prompt。这些分别编码特定于任务和任务不变的知识，使模型能够在不依赖数据重播的情况下依次学习分类任务。为了促进更多信息的及时学习，InfoComp使用信息理论框架，该框架可在不同参数（或编码表示）之间最大化相互信息。在此框架内，我们设计了两个新颖的损失功能：（1）加强特定于p prompt的任务知识的积累，有效地减轻灾难性的遗忘，（2）增强在s-prompt中的任务不变知识的保留，从而改善了向前知识的转移。关于不同CTC基准测试的广泛实验表明，我们的方法的表现优于先前的最新方法。

Title: On VLMs for Diverse Tasks in Multimodal Meme Classification

Authors: Deepesh Gavit, Debajyoti Mazumder, Samiran Das, Jasabanta Patro
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.20937
Pdf URL: https://arxiv.org/pdf/2505.20937
Copy Paste: [[2505.20937]] On VLMs for Diverse Tasks in Multimodal Meme Classification(https://arxiv.org/abs/2505.20937)
Keywords: language model, llm, prompt
Abstract: In this paper, we present a comprehensive and systematic analysis of vision-language models (VLMs) for disparate meme classification tasks. We introduce a novel approach that generates a VLM-based understanding of meme images and fine-tunes LLMs on a textual understanding of the embedded meme text to improve performance. Our contributions are threefold: (1) Benchmarking VLMs with diverse prompting strategies tailored to each sub-task; (2) Evaluating LoRA fine-tuning across all VLM components to assess performance gains; and (3) Proposing a novel approach where detailed meme interpretations generated by VLMs are used to train smaller language models (LLMs), significantly improving classification. The strategy of combining VLMs with LLMs improved the baseline performance by 8.34%, 3.52% and 26.24% for sarcasm, offensive and sentiment classification, respectively. Our results reveal the strengths and limitations of VLMs and present a novel strategy for meme understanding.
摘要：在本文中，我们介绍了视觉模型（VLM）的全面，系统的分析，用于不同的模因分类任务。我们介绍了一种新颖的方法，该方法对模因图像产生了基于VLM的理解，并通过对嵌入式模因文本的文本理解以改善性能的文本理解。我们的贡献是三个方面：（1）用各种提示的策略故意针对每个子任务对VLM进行基准测试；（2）评估所有VLM组件中的Lora微调以评估绩效提高；（3）提出一种新颖的方法，其中使用VLM产生的详细模因解释用于训练较小的语言模型（LLMS），从而显着改善了分类。将VLM与LLMS相结合的策略分别提高了8.34％，3.52％和26.24％的讽刺，进攻性和情感分类。我们的结果揭示了VLM的优势和局限性，并提出了模因理解的新型策略。

Title: Research Community Perspectives on "Intelligence" and Large Language Models

Authors: Bertram Højer, Terne Sasha Thorn Jakobsen, Anna Rogers, Stefan Heinrich
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2505.20959
Pdf URL: https://arxiv.org/pdf/2505.20959
Copy Paste: [[2505.20959]] Research Community Perspectives on "Intelligence" and Large Language Models(https://arxiv.org/abs/2505.20959)
Keywords: language model
Abstract: Despite the widespread use of "artificial intelligence" (AI) framing in Natural Language Processing (NLP) research, it is not clear what researchers mean by "intelligence". To that end, we present the results of a survey on the notion of "intelligence" among researchers and its role in the research agenda. The survey elicited complete responses from 303 researchers from a variety of fields including NLP, Machine Learning (ML), Cognitive Science, Linguistics, and Neuroscience. We identify 3 criteria of intelligence that the community agrees on the most: generalization, adaptability, and reasoning. Our results suggest that the perception of current NLP systems as "intelligent" is a minority position (29%). Furthermore, only 16.2% of the respondents see developing intelligent systems as a research goal, and these respondents are more likely to consider the current systems intelligent.
摘要：尽管在自然语言处理（NLP）研究中广泛使用“人工智能”（AI）框架，但尚不清楚研究人员对“智能”的含义。为此，我们介绍了研究人员中关于“情报”概念及其在研究议程中的作用的调查结果。该调查从NLP，机器学习（ML），认知科学，语言学和神经科学等各个领域的303名研究人员中提出了完整的反应。我们确定了社区最共同达成的3个情报标准：概括，适应性和推理。我们的结果表明，对当前NLP系统“智能”的看法是少数群体立场（29％）。此外，只有16.2％的受访者将开发智能系统作为研究目标，这些受访者更有可能考虑当前的系统智能。

Title: Context-Aware Content Moderation for German Newspaper Comments

Authors: Felix Krejca, Tobias Kietreiber, Alexander Buchelt, Sebastian Neumaier
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.20963
Pdf URL: https://arxiv.org/pdf/2505.20963
Copy Paste: [[2505.20963]] Context-Aware Content Moderation for German Newspaper Comments(https://arxiv.org/abs/2505.20963)
Keywords: gpt, chat
Abstract: The increasing volume of online discussions requires advanced automatic content moderation to maintain responsible discourse. While hate speech detection on social media is well-studied, research on German-language newspaper forums remains limited. Existing studies often neglect platform-specific context, such as user history and article themes. This paper addresses this gap by developing and evaluating binary classification models for automatic content moderation in German newspaper forums, incorporating contextual information. Using LSTM, CNN, and ChatGPT-3.5 Turbo, and leveraging the One Million Posts Corpus from the Austrian newspaper Der Standard, we assess the impact of context-aware models. Results show that CNN and LSTM models benefit from contextual information and perform competitively with state-of-the-art approaches. In contrast, ChatGPT's zero-shot classification does not improve with added context and underperforms.
摘要：越来越多的在线讨论需要高级自动内容审核才能维持负责任的话语。尽管在社交媒体上进行了仇恨言论检测，但在德语报纸论坛上进行的研究仍然有限。现有的研究通常忽略了特定于平台的环境，例如用户历史记录和文章主题。本文通过在德国报纸论坛中开发和评估自动内容适量的二进制分类模型来解决这一差距，并结合上下文信息。使用LSTM，CNN和Chatgpt-3.5 Turbo，并利用奥地利报纸Der标准的100万个职位，我们评估了上下文感知模型的影响。结果表明，CNN和LSTM模型受益于上下文信息，并通过最先进的方法进行竞争性。相比之下，Chatgpt的零射击分类不会随着上下文和表现不佳而改善。

Title: Reason-Align-Respond: Aligning LLM Reasoning with Knowledge Graphs for KGQA

Authors: Xiangqing Shen, Fanfan Wang, Rui Xia
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.20971
Pdf URL: https://arxiv.org/pdf/2505.20971
Copy Paste: [[2505.20971]] Reason-Align-Respond: Aligning LLM Reasoning with Knowledge Graphs for KGQA(https://arxiv.org/abs/2505.20971)
Keywords: llm, hallucination
Abstract: LLMs have demonstrated remarkable capabilities in complex reasoning tasks, yet they often suffer from hallucinations and lack reliable factual grounding. Meanwhile, knowledge graphs (KGs) provide structured factual knowledge but lack the flexible reasoning abilities of LLMs. In this paper, we present Reason-Align-Respond (RAR), a novel framework that systematically integrates LLM reasoning with knowledge graphs for KGQA. Our approach consists of three key components: a Reasoner that generates human-like reasoning chains, an Aligner that maps these chains to valid KG paths, and a Responser that synthesizes the final answer. We formulate this process as a probabilistic model and optimize it using the Expectation-Maximization algorithm, which iteratively refines the reasoning chains and knowledge paths. Extensive experiments on multiple benchmarks demonstrate the effectiveness of RAR, achieving state-of-the-art performance with Hit@1 scores of 93.3% and 91.0% on WebQSP and CWQ respectively. Human evaluation confirms that RAR generates high-quality, interpretable reasoning chains well-aligned with KG paths. Furthermore, RAR exhibits strong zero-shot generalization capabilities and maintains computational efficiency during inference.
摘要：LLM在复杂的推理任务中表现出了显着的功能，但它们常常遭受幻觉的困扰，并且缺乏可靠的事实基础。同时，知识图（KGS）提供了结构化的事实知识，但缺乏LLM的灵活推理能力。在本文中，我们提出了理性 - 阿里格 - 雷（RAR），这是一个新颖的框架，将LLM推理与KGQA的知识图一起整合。我们的方法由三个关键组成部分组成：一种生成类似人类的推理链的推理器，将这些链映射到有效kg路径的对准器，以及综合最终答案的响应者。我们将此过程作为概率模型制定，并使用预期最大化算法进行优化，该算法迭代地完善了推理链和知识路径。对多个基准测试的广泛实验证明了RAR的有效性，在WebQSP和CWQ上分别以93.3％和91.0％的命中率达到了最先进的性能。人类评估证实，RAR会产生高质量的，可解释的推理链与KG路径很好。此外，RAR表现出强大的零击功能，并在推理过程中保持计算效率。

Title: Contrastive Learning on LLM Back Generation Treebank for Cross-domain Constituency Parsing

Authors: Peiming Guo, Meishan Zhang, Jianling Li, Min Zhang, Yue Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.20976
Pdf URL: https://arxiv.org/pdf/2505.20976
Copy Paste: [[2505.20976]] Contrastive Learning on LLM Back Generation Treebank for Cross-domain Constituency Parsing(https://arxiv.org/abs/2505.20976)
Keywords: language model, llm
Abstract: Cross-domain constituency parsing is still an unsolved challenge in computational linguistics, since the available multi-domain constituency treebanks are limited. In this paper, we investigate automatic treebank generation by large language models (LLMs). Because LLMs perform poorly on constituency parsing, we propose a novel treebank generation method, LLM back generation, which resembles the reverse process of constituency parsing. LLM back generation takes an incomplete cross-domain constituency tree with only domain keyword leaf nodes as input and fills in the missing words to generate the cross-domain constituency treebank. We also introduce a span-level contrastive learning pre-training strategy to make full use of the LLM back generation treebank for cross-domain constituency parsing. We verify the effectiveness of our LLM back generation treebank coupled with contrastive learning pre-training on five target domains of MCTB. Experimental results show that our approach achieves state-of-the-art average performance compared with various baselines.
摘要：跨域选区解析仍然是计算语言学中未解决的挑战，因为可用的多域选区树库有限。我们在本文中研究了大型语言模型（LLM）的自动生成。 LLMS在选区解析上的性能很差，因此我们提出了一种新型的树库生成方法，LLM后期生成，这与选区解析的反向过程相似。 LLM背部生成将不完整的跨域选区树采用只有域关键字叶节点作为输入，并填充缺失的单词以生成跨域的选区树库。此外，我们还引入了一个跨度对比度学习预训练策略，以充分利用LLM背部Treebank进行跨域选区解析。我们验证了LLM背部树库的有效性，并在MCTB的五个目标域上进行了对比度学习预训练。实验结果表明，与各种基准相比，我们的方法平均达到了最先进的结果。

Title: Evaluating and Steering Modality Preferences in Multimodal Large Language Model

Authors: Yu Zhang, Jinlong Ma, Yongshuai Hou, Xuefeng Bai, Kehai Chen, Yang Xiang, Jun Yu, Min Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.20977
Pdf URL: https://arxiv.org/pdf/2505.20977
Copy Paste: [[2505.20977]] Evaluating and Steering Modality Preferences in Multimodal Large Language Model(https://arxiv.org/abs/2505.20977)
Keywords: language model, llm, hallucination, prompt
Abstract: Multimodal large language models (MLLMs) have achieved remarkable performance on complex tasks with multimodal context. However, it is still understudied whether they exhibit modality preference when processing multimodal contexts. To study this question, we first build a MC² benchmark under controlled evidence conflict scenarios to systematically evaluate modality preference, which is the tendency to favor one modality over another when making decisions based on multimodal conflicting evidence. Our extensive evaluation reveals that all 18 tested MLLMs generally demonstrate clear modality bias, and modality preference can be influenced by external interventions. An in-depth analysis reveals that the preference direction can be captured within the latent representations of MLLMs. Built on this, we propose a probing and steering method based on representation engineering to explicitly control modality preference without additional fine-tuning or carefully crafted prompts. Our method effectively amplifies modality preference toward a desired direction and applies to downstream tasks such as hallucination mitigation and multimodal machine translation, yielding promising improvements.
摘要：多模式大语言模型（MLLM）在具有多模式上下文的复杂任务上取得了出色的性能。但是，在处理多模式上下文时，它们是否表现出模态偏好仍在研究。为了研究这个问题，我们首先构建了一个\ textbf {mc \ textsuperscript {2}}在受控的证据冲突场景下基于系统评估模式偏好的基准，这是在基于多模式相互矛盾的证据做出决策时倾向于偏爱一种模式而不是另一种方式。我们广泛的评估表明，所有18种测试的MLLM通常都表现出明显的模态偏见，并且模态偏好可能受到外部干预措施的影响。深入的分析表明，可以在MLLM的潜在表示中捕获偏好方向。在此基础上，我们提出了一种基于表示工程的探测和转向方法，以明确控制模式偏好，而无需其他微调或精心制作的提示。我们的方法有效地扩大了对所需方向的模态偏好，并适用于诸如缓解幻觉和多模式机器翻译等下游任务，从而带来了令人鼓舞的改进。

Title: Who Reasons in the Large Language Models?

Authors: Jie Shao, Jianxin Wu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.20993
Pdf URL: https://arxiv.org/pdf/2505.20993
Copy Paste: [[2505.20993]] Who Reasons in the Large Language Models?(https://arxiv.org/abs/2505.20993)
Keywords: language model, llm
Abstract: Despite the impressive performance of large language models (LLMs), the process of endowing them with new capabilities--such as mathematical reasoning--remains largely empirical and opaque. A critical open question is whether reasoning abilities stem from the entire model or from specific modules, or are merely artifacts of overfitting. In this work, we hypothesize that the reasoning capabilities in well-trained LLMs are primarily attributed to the output projection module (oproj) in the Transformer's multi-head self-attention (MHSA) mechanism. To support this hypothesis, we introduce Stethoscope for Networks (SfN), a suite of diagnostic tools designed to probe and analyze the internal behaviors of LLMs. Using SfN, we provide both circumstantial and empirical evidence suggesting that oproj plays a central role in enabling reasoning, whereas other modules contribute more to fluent dialogue. These findings offer a new perspective on LLM interpretability and open avenues for more targeted training strategies, potentially enabling more efficient and specialized LLMs.
摘要：尽管大型语言模型（LLMS）的表现令人印象深刻，但赋予它们具有新功能的过程（例如数学推理），例如 - 怪物在很大程度上是经验和不透明的。一个关键的开放问题是推理能力是源于整个模型，特定模块还是仅仅是过度拟合的伪像。在这项工作中，我们假设训练有素的LLM中的推理能力主要归因于变压器多头自我注意（MHSA）机制中的输出投影模块（OPROJ）。为了支持这一假设，我们引入了网络听诊器（SFN），这是一套诊断工具，旨在探究和分析LLM的内部行为。使用SFN，我们提供了间接和经验证据，表明OPROJ在启用推理方面起着核心作用，而其他模块对流利的对话贡献了更大的作用。这些发现为LLM的可解释性和开放途径提供了新的观点，以实现更有针对性的培训策略，从而有可能使更高效，更专业的LLMS。

Title: Uncertainty Unveiled: Can Exposure to More In-context Examples Mitigate Uncertainty for Large Language Models?

Authors: Yifei Wang, Yu Sheng, Linjing Li, Daniel Zeng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.21003
Pdf URL: https://arxiv.org/pdf/2505.21003
Copy Paste: [[2505.21003]] Uncertainty Unveiled: Can Exposure to More In-context Examples Mitigate Uncertainty for Large Language Models?(https://arxiv.org/abs/2505.21003)
Keywords: language model
Abstract: Recent advances in handling long sequences have facilitated the exploration of long-context in-context learning (ICL). While much of the existing research emphasizes performance improvements driven by additional in-context examples, the influence on the trustworthiness of generated responses remains underexplored. This paper addresses this gap by investigating how increased examples influence predictive uncertainty, an essential aspect in trustworthiness. We begin by systematically quantifying the uncertainty of ICL with varying shot counts, analyzing the impact of example quantity. Through uncertainty decomposition, we introduce a novel perspective on performance enhancement, with a focus on epistemic uncertainty (EU). Our results reveal that additional examples reduce total uncertainty in both simple and complex tasks by injecting task-specific knowledge, thereby diminishing EU and enhancing performance. For complex tasks, these advantages emerge only after addressing the increased noise and uncertainty associated with longer inputs. Finally, we explore the evolution of internal confidence across layers, unveiling the mechanisms driving the reduction in uncertainty.
摘要：处理长序列的最新进展促进了对长期文化内部学习（ICL）的探索。尽管现有的许多研究强调了由其他文本示例驱动的绩效改进，但对产生的响应的可信度的影响仍然没有得到充实的影响。本文通过研究增加的例子如何影响预测不确定性，这是可信赖性的重要方面，从而解决了这一差距。我们首先通过系统地量化ICL的不确定性，并分析示例数量的影响。通过不确定性分解，我们介绍了对性能增强的新观点，重点是认知不确定性（EU）。我们的结果表明，其他示例通过注入特定于任务的知识来减少简单和复杂任务的总不确定性，从而减少欧盟并提高绩效。对于复杂的任务，这些优势仅在解决与更长的输入相关的噪声和不确定性增加后才出现。最后，我们探讨了跨层内部信心的演变，揭示了推动不确定性降低的机制。
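
The abstract refers to decomposing predictive uncertainty and isolating epistemic uncertainty (EU). One widely used entropy-based decomposition, shown here as a sketch, treats the entropy of the averaged prediction as total uncertainty, the average entropy of the individual predictions as aleatoric uncertainty, and their difference as epistemic uncertainty. Whether the paper uses exactly this formulation is an assumption, as is the idea that the individual distributions come from different sampled sets of in-context examples; the function names are hypothetical.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)


def decompose_uncertainty(distributions):
    """Split predictive uncertainty into (total, aleatoric, epistemic).

    `distributions` holds one class-probability vector per run, e.g. one per
    sampled demonstration set (an assumption made for this illustration).
    """
    n = len(distributions)
    mean_dist = [sum(col) / n for col in zip(*distributions)]
    total = entropy(mean_dist)                               # entropy of the mean
    aleatoric = sum(entropy(d) for d in distributions) / n   # mean of the entropies
    epistemic = total - aleatoric                            # disagreement across runs
    return total, aleatoric, epistemic
```

When every run yields the same distribution the epistemic term is zero; when runs are individually confident but disagree with each other, nearly all of the total uncertainty is epistemic.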

Title: LLMs are Frequency Pattern Learners in Natural Language Inference

Authors: Liang Cheng, Zhaowei Wang, Mark Steedman
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.21011
Pdf URL: https://arxiv.org/pdf/2505.21011
Copy Paste: [[2505.21011]] LLMs are Frequency Pattern Learners in Natural Language Inference(https://arxiv.org/abs/2505.21011)
Keywords: llm
Abstract: While fine-tuning LLMs on NLI corpora improves their inferential performance, the underlying mechanisms driving this improvement remain largely opaque. In this work, we conduct a series of experiments to investigate what LLMs actually learn during fine-tuning. We begin by analyzing predicate frequencies in premises and hypotheses across NLI datasets and identify a consistent frequency bias, where predicates in hypotheses occur more frequently than those in premises for positive instances. To assess the impact of this bias, we evaluate both standard and NLI fine-tuned LLMs on bias-consistent and bias-adversarial cases. We find that LLMs exploit frequency bias for inference and perform poorly on adversarial instances. Furthermore, fine-tuned LLMs exhibit significantly increased reliance on this bias, suggesting that they are learning these frequency patterns from datasets. Finally, we compute the frequencies of hyponyms and their corresponding hypernyms from WordNet, revealing a correlation between frequency bias and textual entailment. These findings help explain why learning frequency patterns can enhance model performance on inference tasks.
摘要：虽然对NLI Corpora的微调LLMS提高了其推论性能，但推动这种改进的基本机制仍然在很大程度上不透明。在这项工作中，我们进行了一系列实验，以研究LLM在微调过程中实际学到的东西。我们首先分析前提中的谓词频率和跨NLI数据集的假设，并确定一个一致的频率偏差，在该假设中，假设中的谓词发生的频率比前提的频率更高。为了评估这种偏见的影响，我们评估了标准和NLI微调的LLM对偏见和偏见的病例。我们发现LLMS利用频率偏差来推理，并且在对抗实例上表现不佳。此外，微调的LLMS表现出对这种偏见的依赖，这表明他们正在从数据集中学习这些频率模式。最后，我们计算了hordnet中的假说及其相应的高nyms的频率，从而揭示了频率偏差与文本构成之间的相关性。这些发现有助于解释为什么学习频率模式可以增强推理任务上的模型性能。
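
The frequency bias described in the abstract above (hypothesis predicates occurring more often than premise predicates in positive instances) can be illustrated with a small sketch. This is not the authors' code: the function name frequency_bias_rate, the toy corpus counts, and the predicate pairs are all invented for illustration, and the abstract does not say which corpus the real frequencies come from.

```python
from collections import Counter

def frequency_bias_rate(pairs, corpus_counts):
    """Fraction of (premise_predicate, hypothesis_predicate) pairs in which
    the hypothesis predicate is strictly more frequent than the premise one."""
    if not pairs:
        return 0.0
    hits = sum(1 for premise, hypothesis in pairs
               if corpus_counts[hypothesis] > corpus_counts[premise])
    return hits / len(pairs)

# Toy corpus frequencies (invented numbers, for illustration only).
corpus_counts = Counter({"buy": 50, "own": 120, "kill": 30,
                         "die": 200, "walk": 80, "sprint": 5})

# Hypothetical positive (entailment) pairs: premise predicate, hypothesis predicate.
positive_pairs = [("buy", "own"), ("kill", "die"), ("walk", "sprint")]

rate = frequency_bias_rate(positive_pairs, corpus_counts)
print(f"bias-consistent share: {rate:.2f}")
```

A share well above 0.5 on positive instances would correspond to the bias the abstract reports; pairs where the inequality is reversed play the role of the bias-adversarial cases.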

Title: Def-DTS: Deductive Reasoning for Open-domain Dialogue Topic Segmentation

Authors: Seungmin Lee, Yongsang Yoo, Minhwa Jung, Min Song
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.21033
Pdf URL: https://arxiv.org/pdf/2505.21033
Copy Paste: [[2505.21033]] Def-DTS: Deductive Reasoning for Open-domain Dialogue Topic Segmentation(https://arxiv.org/abs/2505.21033)
Keywords: language model, llm, prompt
Abstract: Dialogue Topic Segmentation (DTS) aims to divide dialogues into coherent segments. DTS plays a crucial role in various NLP downstream tasks, but suffers from chronic problems: data shortage, labeling ambiguity, and incremental complexity of recently proposed solutions. On the other hand, despite advances in Large Language Models (LLMs) and reasoning strategies, these have rarely been applied to DTS. This paper introduces Def-DTS: Deductive Reasoning for Open-domain Dialogue Topic Segmentation, which utilizes LLM-based multi-step deductive reasoning to enhance DTS performance and enable case studies using intermediate results. Our method employs a structured prompting approach for bidirectional context summarization, utterance intent classification, and deductive topic shift detection. In the intent classification process, we propose a generalizable intent list for domain-agnostic dialogue intent classification. Experiments in various dialogue settings demonstrate that Def-DTS consistently outperforms traditional and state-of-the-art approaches, with each subtask contributing to improved performance, particularly in reducing type 2 error. We also explore the potential for autolabeling, emphasizing the importance of LLM reasoning techniques in DTS.
摘要：对话主题细分（DTS）旨在将对话分为一致的段。 DTS在各种NLP下游任务中起着至关重要的作用，但遭受了慢性问题：数据短缺，标记歧义和最近提出的解决方案的增量复杂性。另一方面，尽管大语言模型（LLM）和推理策略取得了进步，但这些策略很少被应用于DTS。本文介绍了DEF-DTS：开放域对话主题细分的演绎推理，该推理利用基于LLM的多步扣除推理来增强DTS性能并使用中级结果启用案例研究。我们的方法采用双向上下文摘要，话语意图分类和演绎主题转移检测采用结构化提示方法。在意图分类过程中，我们提出了针对域 - 不合稳定对话意图分类的可推广意图列表。各种对话设置的实验表明，DEF-DT始终优于传统和最新方法，每个子任务都会有助于提高性能，尤其是在减少2型错误中。我们还探索了自动标签的潜力，强调了LLM推理技术在DTS中的重要性。

Title: FCKT: Fine-Grained Cross-Task Knowledge Transfer with Semantic Contrastive Learning for Targeted Sentiment Analysis

Authors: Wei Chen, Zhao Zhang, Meng Yuan, Kepeng Xu, Fuzhen Zhuang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.21040
Pdf URL: https://arxiv.org/pdf/2505.21040
Copy Paste: [[2505.21040]] FCKT: Fine-Grained Cross-Task Knowledge Transfer with Semantic Contrastive Learning for Targeted Sentiment Analysis(https://arxiv.org/abs/2505.21040)
Keywords: language model, llm
Abstract: In this paper, we address the task of targeted sentiment analysis (TSA), which involves two sub-tasks, i.e., identifying specific aspects from reviews and determining their corresponding sentiments. Aspect extraction forms the foundation for sentiment prediction, highlighting the critical dependency between these two tasks for effective cross-task knowledge transfer. While most existing studies adopt a multi-task learning paradigm to align task-specific features in the latent space, they predominantly rely on coarse-grained knowledge transfer. Such approaches lack fine-grained control over aspect-sentiment relationships, often assuming uniform sentiment polarity within related aspects. This oversimplification neglects contextual cues that differentiate sentiments, leading to negative transfer. To overcome these limitations, we propose FCKT, a fine-grained cross-task knowledge transfer framework tailored for TSA. By explicitly incorporating aspect-level information into sentiment prediction, FCKT achieves fine-grained knowledge transfer, effectively mitigating negative transfer and enhancing task performance. Experiments on three datasets, including comparisons with various baselines and large language models (LLMs), demonstrate the effectiveness of FCKT. The source code is available on this https URL.
摘要：在本文中，我们解决了目标情感分析（TSA）的任务，该任务涉及两个子任务，即从评论中识别特定方面并确定其相应的情感。方面提取构成了情感预测的基础，突出了这两个任务之间的关键依赖性，以进行有效的跨任务知识转移。尽管大多数现有的研究都采用多任务学习范式来使特定于任务特定特征在潜在空间中，但它们主要依赖于粗粒的知识转移。这种方法缺乏对方面关系的细粒度控制，通常假设相关方面的情感极性统一。这种过度简化忽略了与观点不同的情境线索，导致负面转移。为了克服这些局限性，我们提出了FCKT，这是一个针对TSA量身定制的细粒跨任务知识转移框架。通过将方面级信息明确地纳入情感预测中，FCKT实现了细粒度的知识转移，有效地减轻了负面转移并增强了任务绩效。在三个数据集上进行的实验，包括与各种基线和大语言模型（LLMS）的比较，证明了FCKT的有效性。源代码可在此HTTPS URL上使用。

Title: Predicting Implicit Arguments in Procedural Video Instructions

Authors: Anil Batra, Laura Sevilla-Lara, Marcus Rohrbach, Frank Keller
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2505.21068
Pdf URL: https://arxiv.org/pdf/2505.21068
Copy Paste: [[2505.21068]] Predicting Implicit Arguments in Procedural Video Instructions(https://arxiv.org/abs/2505.21068)
Keywords: gpt, llm
Abstract: Procedural texts help AI enhance reasoning about context and action sequences. Transforming these into Semantic Role Labeling (SRL) improves understanding of individual steps by identifying predicate-argument structure like {verb,what,where/with}. Procedural instructions are highly elliptic: for instance, in (i) add cucumber to the bowl and (ii) add sliced tomatoes, the second step's where argument is inferred from the context, referring to where the cucumber was placed. Prior SRL benchmarks often miss implicit arguments, leading to incomplete understanding. To address this, we introduce Implicit-VidSRL, a dataset that necessitates inferring implicit and explicit arguments from contextual information in multimodal cooking procedures. Our proposed dataset benchmarks multimodal models' contextual reasoning, requiring entity tracking through visual changes in recipes. We study recent multimodal LLMs and reveal that they struggle to predict implicit arguments of what and where/with from multi-modal procedural data given the verb. Lastly, we propose iSRL-Qwen2-VL, which achieves a 17% relative improvement in F1-score for what-implicit and 14.7% for where/with-implicit semantic roles over GPT-4o.
摘要：程序文本有助于AI增强有关上下文和动作序列的推理。将它们转换为语义角色标签（SRL），通过识别鉴定谓词argument结构（例如{动词，什么，where/with}）来改善对单个步骤的理解。程序说明是高度椭圆形的，例如（i）在碗中添加黄瓜，（ii）添加切成薄片的西红柿，这是从上下文中推断出参数的第二步，指的是将黄瓜放置在哪里。先前的SRL基准通常会错过隐性论点，从而导致不完整的理解。为了解决这个问题，我们介绍了隐式vidsrl，这是一个数据集，需要从多模式烹饪过程中的上下文信息中推断出隐式和明确的参数。我们提出的数据集基准多模型的上下文推理，需要通过视觉更改配方跟踪实体。我们研究了最新的多模式LLM，并透露他们努力预测有关动词的多模式程序数据的隐性论点。最后，我们提出了ISRL-QWEN2-VL，该ISRL-QWEN2-VL可在f1得分方面取得17％的相对相对提高，而在GPT-4O上，在某些地方/具有IMPLICEN的语义角色的14.7％。

Title: Faithfulness-Aware Uncertainty Quantification for Fact-Checking the Output of Retrieval Augmented Generation

Authors: Ekaterina Fadeeva, Aleksandr Rubashevskii, Roman Vashurin, Shehzaad Dhuliawala, Artem Shelmanov, Timothy Baldwin, Preslav Nakov, Mrinmaya Sachan, Maxim Panov
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.21072
Pdf URL: https://arxiv.org/pdf/2505.21072
Copy Paste: [[2505.21072]] Faithfulness-Aware Uncertainty Quantification for Fact-Checking the Output of Retrieval Augmented Generation(https://arxiv.org/abs/2505.21072)
Keywords: language model, llm, hallucination, retrieval augmented generation, retrieval-augmented generation
Abstract: Large Language Models (LLMs) enhanced with external knowledge retrieval, an approach known as Retrieval-Augmented Generation (RAG), have shown strong performance in open-domain question answering. However, RAG systems remain susceptible to hallucinations: factually incorrect outputs that may arise either from inconsistencies in the model's internal knowledge or incorrect use of the retrieved context. Existing approaches often conflate factuality with faithfulness to the retrieved context, misclassifying factually correct statements as hallucinations if they are not directly supported by the retrieval. In this paper, we introduce FRANQ (Faithfulness-based Retrieval Augmented UNcertainty Quantification), a novel method for hallucination detection in RAG outputs. FRANQ applies different Uncertainty Quantification (UQ) techniques to estimate factuality based on whether a statement is faithful to the retrieved context or not. To evaluate FRANQ and other UQ techniques for RAG, we present a new long-form Question Answering (QA) dataset annotated for both factuality and faithfulness, combining automated labeling with manual validation of challenging examples. Extensive experiments on long- and short-form QA across multiple datasets and LLMs show that FRANQ achieves more accurate detection of factual errors in RAG-generated responses compared to existing methods.
摘要：大型语言模型（LLMS）通过外部知识检索增强了，这种方法是一种称为检索的生成（RAG）的方法，在开放域问题的答案中表现出强烈的性能。但是，破布系统仍然容易受到幻觉的影响：实际上是由于模型内部知识中的不一致而引起的，或者对检索到的上下文的使用不正确。现有的方法通常将事实与忠诚与检索到的上下文混为一谈，如果检索未直接支持事实，将事实纠正的陈述误认为是幻觉。在本文中，我们介绍了Franq（基于忠诚的检索增强不确定性量化），这是一种新颖的抹布输出幻觉检测方法。 Franq采用不同的不确定性量化（UQ）技术来估算事实，该陈述是否忠于检索到的上下文。为了评估FRANQ和其他UQ技术的抹布，我们提出了一个新的长形式答案（QA）数据集，以说明事实和忠诚，将自动标签与手动验证挑战示例相结合。与现有方法相比，在多个数据集和LLM的长形质量检查方面进行了广泛的实验表明，Franq更准确地检测了抹布生成的响应中的事实错误。
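
One way to read "different UQ techniques depending on whether a statement is faithful" is as a probability-weighted mix of two factuality estimates. The sketch below is an assumption about how such a combination could look (the law of total probability over faithfulness), not a statement of FRANQ's actual formula; the function name and the three input probabilities are hypothetical.

```python
def combined_factuality(p_faithful: float,
                        p_true_if_faithful: float,
                        p_true_if_unfaithful: float) -> float:
    """Mix two factuality estimates, weighted by the probability that the
    statement is faithful to the retrieved context (assumed formulation)."""
    for p in (p_faithful, p_true_if_faithful, p_true_if_unfaithful):
        if not 0.0 <= p <= 1.0:
            raise ValueError("all inputs must be probabilities in [0, 1]")
    return (p_faithful * p_true_if_faithful
            + (1.0 - p_faithful) * p_true_if_unfaithful)

# A statement judged likely faithful leans on the faithful-branch estimate.
score = combined_factuality(0.5, 0.8, 0.4)
print(round(score, 3))
```

The point of the split is that a statement unsupported by the retrieval is not automatically scored as a hallucination: it falls back on a separate estimate of whether it is true on its own.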

Title: LLMs Think, But Not In Your Flow: Reasoning-Level Personalization for Black-Box Large Language Models

Authors: Jieyong Kim, Tongyoung Kim, Soonjin Yoon, Jaehyung Kim, Dongha Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.21082
Pdf URL: https://arxiv.org/pdf/2505.21082
Copy Paste: [[2505.21082]] LLMs Think, But Not In Your Flow: Reasoning-Level Personalization for Black-Box Large Language Models(https://arxiv.org/abs/2505.21082)
Keywords: language model, llm
Abstract: Large language models (LLMs) have recently achieved impressive performance across a wide range of natural language tasks and are now widely used in real-world applications. Among them, black-box LLMs--served via APIs without access to model internals--are especially dominant due to their scalability and ease of deployment. Despite their strong capabilities, these models typically produce generalized responses that overlook personal preferences and reasoning styles. This has led to growing interest in black-box LLM personalization, which aims to tailor model outputs to user-specific context without modifying model parameters. However, existing approaches primarily focus on response-level personalization, attempting to match final outputs without modeling personal thought process. To address this limitation, we propose RPM, a framework for reasoning-level personalization that aligns the model's reasoning process with a user's personalized logic. RPM first constructs statistical user-specific factors by extracting and grouping response-influential features from user history. It then builds personalized reasoning paths that reflect how these factors are used in context. In the inference stage, RPM retrieves reasoning-aligned examples for new queries via feature-level similarity and performs inference conditioned on the structured factors and retrieved reasoning paths, enabling the model to follow user-specific reasoning trajectories. This reasoning-level personalization enhances both predictive accuracy and interpretability by grounding model outputs in user-specific logic through structured information. Extensive experiments across diverse tasks show that RPM consistently outperforms response-level personalization methods, demonstrating the effectiveness of reasoning-level personalization in black-box LLMs.
摘要：大型语言模型（LLMS）最近在各种自然语言任务中取得了令人印象深刻的表现，现在已在现实世界应用中广泛使用。其中，通过API提供的Black-Box LLMS，无需访问模型内部设备 - 由于其可扩展性和易于部署性而尤其重要。尽管具有很强的功能，但这些模型通常会产生广义响应，以忽略个人喜好和推理方式。这导致对Black-Box LLM个性化的兴趣日益增加，该个性化旨在将模型输出量身定制为特定于用户的上下文而无需修改模型参数。但是，现有方法主要集中于响应级个性化，试图在不建模个人思维过程的情况下匹配最终输出。为了解决此限制，我们提出了RPM，这是推理级个性化的框架，将模型的推理过程与用户的个性化逻辑保持一致。 rpm首先通过从用户历史记录中提取和分组响应响应功能来构建统计用户特定的因素。然后，它建立了个性化的推理路径，以反映这些因素在上下文中的使用方式。在推理阶段，RPM通过特征级相似性来检索与推理一致的示例，并在结构化因子和检索推理路径上执行推理，使模型能够遵循用户特定的推理轨迹。通过结构化信息，通过将模型输出接地，这种推理级个性化通过接地模型输出来增强预测精度和解释性。跨不同任务的广泛实验表明，RPM始终优于响应级个性化方法，证明了在Black-Box LLM中推理级个性化的有效性。

Title: BLUCK: A Benchmark Dataset for Bengali Linguistic Understanding and Cultural Knowledge

Authors: Daeen Kabir, Minhajur Rahman Chowdhury Mahim, Sheikh Shafayat, Adnan Sadik, Arian Ahmed, Eunsu Kim, Alice Oh
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.21092
Pdf URL: https://arxiv.org/pdf/2505.21092
Copy Paste: [[2505.21092]] BLUCK: A Benchmark Dataset for Bengali Linguistic Understanding and Cultural Knowledge(https://arxiv.org/abs/2505.21092)
Keywords: language model, gpt, llm
Abstract: In this work, we introduce BLUCK, a new dataset designed to measure the performance of Large Language Models (LLMs) in Bengali linguistic understanding and cultural knowledge. Our dataset comprises 2366 multiple-choice questions (MCQs) carefully curated from compiled collections of several college and job level examinations and spans 23 categories covering knowledge on Bangladesh's culture and history and Bengali linguistics. We benchmarked BLUCK using 6 proprietary and 3 open-source LLMs - including GPT-4o, Claude-3.5-Sonnet, Gemini-1.5-Pro, Llama-3.3-70B-Instruct, and DeepSeekV3. Our results show that while these models perform reasonably well overall, they struggle in some areas of Bengali phonetics. Although current LLMs' performance on Bengali cultural and linguistic contexts is still not comparable to that of mainstream languages like English, our results indicate Bengali's status as a mid-resource language. Importantly, BLUCK is also the first MCQ-based evaluation benchmark that is centered around native Bengali culture, history, and linguistics.
摘要：在这项工作中，我们介绍了Bluck，这是一个新的数据集，旨在衡量孟加拉语语言理解和文化知识中大型语言模型（LLM）的性能。我们的数据集由2366个多项选择问题（MCQ）组成，这些问题（MCQ）仔细策划了几个大学和工作水平考试的收藏，涵盖了23个类别，涵盖了有关孟加拉国文化，历史以及孟加拉语语言学的知识。我们使用6个专有和3个开源LLM对BLUCK进行了基准测试 - 包括GPT-4O，Claude-3.5-Sonnet，Gemini-1.5-Pro，Llama-3.3-70B-Instruct和DeepSeekV3。我们的结果表明，尽管这些模型的整体表现相当出色，但是它们在孟加拉语音学的某些领域挣扎。尽管当前LLMS在孟加拉文化和语言环境中的表现仍然与英语等主流语言相媲美，但我们的结果表明孟加拉语作为中产阶级语言的地位。重要的是，Bluck也是第一个基于MCQ的评估基准，围绕本地孟加拉文化，历史和语言学。

Title: Thinker: Learning to Think Fast and Slow

Authors: Stephen Chung, Wenyu Du, Jie Fu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.21097
Pdf URL: https://arxiv.org/pdf/2505.21097
Copy Paste: [[2505.21097]] Thinker: Learning to Think Fast and Slow(https://arxiv.org/abs/2505.21097)
Keywords: language model, llm, long context
Abstract: Recent studies show that the reasoning capabilities of Large Language Models (LLMs) can be improved by applying Reinforcement Learning (RL) to question-answering (QA) tasks in areas such as math and coding. With a long context length, LLMs may learn to perform search, as indicated by the self-correction behavior observed in DeepSeek R1. However, this search behavior is often imprecise and lacks confidence, resulting in long, redundant responses and highlighting deficiencies in intuition and verification. Inspired by the Dual Process Theory in psychology, we introduce a simple modification to the QA task that includes four stages: Fast Thinking, where the LLM must answer within a strict token budget; Verification, where the model evaluates its initial response; Slow Thinking, where it refines the initial response with more deliberation; and Summarization, where it distills the refinement from the previous stage into precise steps. Our proposed task improves average accuracy from 24.9% to 27.9% for Qwen2.5-1.5B, and from 45.9% to 49.8% for DeepSeek-R1-Qwen-1.5B. Notably, for Qwen2.5-1.5B, the Fast Thinking mode alone achieves 26.8% accuracy using fewer than 1000 tokens, demonstrating substantial inference efficiency gains. These findings suggest that intuition and deliberative reasoning are distinct, complementary systems benefiting from targeted training.
摘要：最近的研究表明，可以通过将强化学习（RL）应用于诸如数学和编码等领域的问答任务（QA）任务来改善大语言模型（LLM）的推理能力。 llms的长度长度长，可以学会执行搜索，如在DeepSeek R1中观察到的自我纠正行为所表明的那样。但是，这种搜索行为通常是不精确的，并且缺乏信心，从而产生了冗长的，多余的反应，并突出了直觉和验证方面的缺陷。受心理学双重过程理论的启发，我们对质量检查任务进行了简单的修改，该任务包括四个阶段：快速思考，LLM必须在严格的标记预算内回答；验证，模型评估其初始响应；思维缓慢，在此更加审议的地方完善了最初的反应；和摘要，在此，它将从上一阶段的精炼提炼成精确的步骤。我们提出的任务将QWEN2.5-1.5B的平均准确性从24.9％提高到27.9％，而DeepSeek-R1-QWEN-1.5B的平均准确性从45.9％提高到45.9％至49.8％。值得注意的是，对于QWEN2.5-1.5B，仅使用少于1000个令牌就可以实现26.8％的精度，证明了大量推断效率的提高。这些发现表明，直觉和审议推理是明显的，互补的系统受益于有针对性的培训。
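
The four stages named in the abstract above (Fast Thinking, Verification, Slow Thinking, Summarization) can be sketched as a simple prompting pipeline. This is a hedged illustration only: the generate callable, the prompt wording, and both token budgets are assumptions, and the sketch runs every stage unconditionally because the abstract does not describe any branching or how the stages are trained with RL.

```python
from typing import Callable, Dict

def four_stage_answer(question: str,
                      generate: Callable[[str, int], str],
                      fast_budget: int = 1000,
                      slow_budget: int = 6000) -> Dict[str, str]:
    """Run the four stages in order; generate(prompt, max_tokens) is a
    hypothetical stand-in for any LLM call."""
    # Stage 1: Fast Thinking under a strict token budget.
    fast = generate(f"Answer briefly.\nQuestion: {question}", fast_budget)
    # Stage 2: Verification of the initial response.
    verdict = generate(
        f"Check this answer.\nQuestion: {question}\nAnswer: {fast}", fast_budget)
    # Stage 3: Slow Thinking refines the initial response with more deliberation.
    slow = generate(
        f"Refine carefully.\nQuestion: {question}\nDraft: {fast}\nCheck: {verdict}",
        slow_budget)
    # Stage 4: Summarization distills the refinement into precise steps.
    summary = generate(f"Summarize into precise steps.\nReasoning: {slow}", fast_budget)
    return {"fast": fast, "verification": verdict, "slow": slow, "summary": summary}

def echo_model(prompt: str, max_tokens: int) -> str:
    """Stub model for demonstration: reports the budget and the instruction line."""
    return f"[{max_tokens}] {prompt.splitlines()[0]}"

result = four_stage_answer("What is 2 + 2?", echo_model)
for stage, text in result.items():
    print(stage, "->", text)
```

Swapping echo_model for a real model client only requires matching the (prompt, max_tokens) signature.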

Title: A Lightweight Multi-Expert Generative Language Model System for Engineering Information and Knowledge Extraction

Authors: Bogdan Bogachov, Yaoyao Fiona Zhao
Subjects: cs.CL, cs.AI, cs.CE, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2505.21109
Pdf URL: https://arxiv.org/pdf/2505.21109
Copy Paste: [[2505.21109]] A Lightweight Multi-Expert Generative Language Model System for Engineering Information and Knowledge Extraction(https://arxiv.org/abs/2505.21109)
Keywords: language model, llm, hallucination
Abstract: Despite recent advancements in domain adaptation techniques for large language models, these methods remain computationally intensive, and the resulting models can still exhibit hallucination issues. Most existing adaptation methods do not prioritize reducing the computational resources required for fine-tuning and inference of language models. Hallucination issues have gradually decreased with each new model release. However, they remain prevalent in engineering contexts, where generating well-structured text with minimal errors and inconsistencies is critical. This work introduces a novel approach called the Small Language Graph (SLG), which is a lightweight adaptation solution designed to address the two key challenges outlined above. The system is structured in the form of a graph, where each node represents a lightweight expert - a small language model fine-tuned on specific and concise texts. The results of this study have shown that SLG was able to surpass conventional fine-tuning methods on the Exact Match metric by 3 times. Additionally, the fine-tuning process was 1.7 times faster compared to that of a larger stand-alone language model. These findings introduce a potential for small to medium-sized engineering companies to confidently use generative AI technologies, such as LLMs, without the necessity to invest in expensive computational resources. Also, the graph architecture and the small size of expert nodes offer a possible opportunity for distributed AI systems, thus potentially diverting the global need for expensive centralized compute clusters.
摘要：尽管针对大型语言模型的域适应技术最近取得了进步，但这些方法在计算密集程度上仍然存在，并且由此产生的模型仍然可以表现出幻觉问题。大多数现有的适应方法没有优先考虑减少语言模型进行微调和推断所需的计算资源。随着每个新型号发布，幻觉问题逐渐减少。但是，它们在工程环境中仍然很普遍，在工程环境中，生成结构良好的文本具有最小的错误和不一致是至关重要的。这项工作介绍了一种名为“小语言图”（SLG）的新方法，该方法是一种轻巧的适应解决方案，旨在解决上面概述的两个关键挑战。该系统是以图形的形式结构的，每个节点代表一个轻巧的专家 - 一种对特定和简洁文本进行微调的小语言模型。这项研究的结果表明，SLG能够在确切匹配度量的情况下超过3次的常规微调方法。此外，与较大的独立语言模型相比，微型调整过程的速度要快1.7倍。这些发现引入了中小型工程公司的潜力，可以自信地使用生成的AI技术，例如LLM，而无需投资昂贵的计算资源。此外，图形体系结构和专家节点的尺寸较小为分布式AI系统提供了可能的机会，从而有可能转移对昂贵的集中计算集群的全球需求。

Title: Will It Still Be True Tomorrow? Multilingual Evergreen Question Classification to Improve Trustworthy QA

Authors: Sergey Pletenev, Maria Marina, Nikolay Ivanov, Daria Galimzianova, Nikita Krayko, Mikhail Salnikov, Vasily Konovalov, Alexander Panchenko, Viktor Moskvoretskii
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.21115
Pdf URL: https://arxiv.org/pdf/2505.21115
Copy Paste: [[2505.21115]] Will It Still Be True Tomorrow? Multilingual Evergreen Question Classification to Improve Trustworthy QA(https://arxiv.org/abs/2505.21115)
Keywords: language model, gpt, llm
Abstract: Large Language Models (LLMs) often hallucinate in question answering (QA) tasks. A key yet underexplored factor contributing to this is the temporality of questions -- whether they are evergreen (answers remain stable over time) or mutable (answers change). In this work, we introduce EverGreenQA, the first multilingual QA dataset with evergreen labels, supporting both evaluation and training. Using EverGreenQA, we benchmark 12 modern LLMs to assess whether they encode question temporality explicitly (via verbalized judgments) or implicitly (via uncertainty signals). We also train EG-E5, a lightweight multilingual classifier that achieves SoTA performance on this task. Finally, we demonstrate the practical utility of evergreen classification across three applications: improving self-knowledge estimation, filtering QA datasets, and explaining GPT-4o retrieval behavior.
摘要：大型语言模型（LLMS）通常会在有问题的答案（QA）任务中幻觉。造成问题的关键却又没有被忽视的因素是问题的时间性 - 无论它们是常绿（随着时间的流逝，答案保持稳定）还是可变的（答案变化）。在这项工作中，我们介绍了Evergreenqa，这是第一个带有常绿标签的多语言QA数据集，支持评估和培训。使用Evergreenqa，我们基于12 Modern LLMS来评估它们是明确（通过口头判断）还是隐式（通过不确定性信号）来编码问题的问题。我们还训练EG-E5，这是一种轻巧的多语言分类器，可以在此任务上实现SOTA性能。最后，我们在三个应用程序中演示了常绿分类的实际实用性：改进自我知识估计，过滤QA数据集以及解释GPT-4O检索行为。

Title: Scaling and Prompting for Improved End-to-End Spoken Grammatical Error Correction

Authors: Mengjie Qian, Rao Ma, Stefano Bannò, Kate M. Knill, Mark J.F. Gales
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2505.21137
Pdf URL: https://arxiv.org/pdf/2505.21137
Copy Paste: [[2505.21137]] Scaling and Prompting for Improved End-to-End Spoken Grammatical Error Correction(https://arxiv.org/abs/2505.21137)
Keywords: prompt
Abstract: Spoken Grammatical Error Correction (SGEC) and Feedback (SGECF) are crucial for second language learners, teachers and test takers. Traditional SGEC systems rely on a cascaded pipeline consisting of an ASR, a module for disfluency detection (DD) and removal and one for GEC. With the rise of end-to-end (E2E) speech foundation models, we investigate their effectiveness in SGEC and feedback generation. This work introduces a pseudo-labelling process to address the challenge of limited labelled data, expanding the training data size from 77 hours to approximately 2500 hours, leading to improved performance. Additionally, we prompt an E2E Whisper-based SGEC model with fluent transcriptions, showing a slight improvement in SGEC performance, with more significant gains in feedback generation. Finally, we assess the impact of increasing model size, revealing that while pseudo-labelled data does not yield performance gain for a larger Whisper model, training with prompts proves beneficial.
摘要：口语语法误差校正（SGEC）和反馈（SGECF）对于第二语言学习者，教师和测试者至关重要。传统的SGEC系统依赖于由ASR组成的级联管道，一个用于探测（DD）的模块和删除的模块以及GEC。随着端到端（E2E）语音基础模型的兴起，我们研究了它们在SGEC和反馈生成中的有效性。这项工作引入了伪标记的过程，以应对有限标记数据的挑战，从而将培训数据规模从77小时扩大到大约2500小时，从而提高了性能。此外，我们促使基于E2E低语的SGEC模型具有流利的转录，显示出SGEC性能的略有改善，反馈生成的增长更大。最后，我们评估了增加模型大小的影响，表明虽然伪标记的数据不能为更大的耳语模型带来性能增长，但提示的训练证明是有益的。

Title: Leveraging LLM and Self-Supervised Training Models for Speech Recognition in Chinese Dialects: A Comparative Analysis

Authors: Tianyi Xu, Hongjie Chen, Wang Qing, Lv Hang, Jian Kang, Li Jie, Zhennan Lin, Yongxiang Li, Xie Lei
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2505.21138
Pdf URL: https://arxiv.org/pdf/2505.21138
Copy Paste: [[2505.21138]] Leveraging LLM and Self-Supervised Training Models for Speech Recognition in Chinese Dialects: A Comparative Analysis(https://arxiv.org/abs/2505.21138)
Keywords: language model, llm
Abstract: Large-scale training corpora have significantly improved the performance of ASR models. Unfortunately, due to the relative scarcity of data, Chinese accents and dialects remain a challenge for most ASR models. Recent advancements in self-supervised learning have shown that self-supervised pre-training, combined with large language models (LLMs), can effectively enhance ASR performance in low-resource scenarios. We aim to investigate the effectiveness of this paradigm for Chinese dialects. Specifically, we pre-train a Data2vec2 model on 300,000 hours of unlabeled dialect and accented speech data and do alignment training on a supervised dataset of 40,000 hours. Then, we systematically examine the impact of various projectors and LLMs on Mandarin, dialect, and accented speech recognition performance under this paradigm. Our method achieved SOTA results on multiple dialect datasets, including Kespeech. We will open-source our work to promote reproducible research.
摘要：大规模培训语料库已大大提高了ASR模型的性能。不幸的是，由于数据的相对稀缺性，对于大多数ASR模型来说，中国口音和方言仍然是一个挑战。自我监督学习的最新进展表明，自我监督的预训练与大语言模型（LLM）相结合可以有效地提高低资源场景中的ASR绩效。我们旨在调查这种范式对中国方言的有效性。具体而言，我们在300,000小时的未标记方言和重音语音数据上预先培训了Data2Vec2模型，并在40,000小时的监督数据集上进行对齐培训。然后，我们系统地检查了该范式下的各种投影仪和LLM对普通话，方言和重音语音识别表现的影响。我们的方法在包括Kespeech在内的多个方言数据集上实现了SOTA结果。我们将开源我们的工作以促进可重复的研究

Title: Assessment of L2 Oral Proficiency using Speech Large Language Models

Authors: Rao Ma, Mengjie Qian, Siyuan Tang, Stefano Bannò, Kate M. Knill, Mark J.F. Gales
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2505.21148
Pdf URL: https://arxiv.org/pdf/2505.21148
Copy Paste: [[2505.21148]] Assessment of L2 Oral Proficiency using Speech Large Language Models(https://arxiv.org/abs/2505.21148)
Keywords: language model, llm
Abstract: The growing population of L2 English speakers has increased the demand for developing automatic graders for spoken language assessment (SLA). Historically, statistical models, text encoders, and self-supervised speech models have been utilised for this task. However, cascaded systems suffer from the loss of information, while E2E graders also have limitations. With the recent advancements of multi-modal large language models (LLMs), we aim to explore their potential as L2 oral proficiency graders and overcome these issues. In this work, we compare various training strategies using regression and classification targets. Our results show that speech LLMs outperform all previous competitive baselines, achieving superior performance on two datasets. Furthermore, the trained grader demonstrates strong generalisation capabilities in the cross-part or cross-task evaluation, facilitated by the audio understanding knowledge acquired during LLM pre-training.
摘要：L2英语的人群不断增长，增加了对语言评估（SLA）开发自动分级器的需求。从历史上看，统计模型，文本编码和自我监督的语音模型已用于此任务。但是，级联的系统遭受了信息损失，而E2E级别也有局限性。随着多模式大语言模型（LLM）的最新进展，我们旨在探索其作为L2口服熟练度分级者的潜力，并克服这些问题。在这项工作中，我们使用回归和分类目标比较各种培训策略。我们的结果表明，语音LLM的表现优于所有以前的竞争基线，在两个数据集上取得了出色的性能。此外，受过训练的分级机在跨部件或交叉任务评估中表现出强大的概括能力，这是由LLM预训练期间获得的音频理解知识所促进的。

Title: M-Wanda: Improving One-Shot Pruning for Multilingual LLMs

Authors: Rochelle Choenni, Ivan Titov
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.21171
Pdf URL: https://arxiv.org/pdf/2505.21171
Copy Paste: [[2505.21171]] M-Wanda: Improving One-Shot Pruning for Multilingual LLMs(https://arxiv.org/abs/2505.21171)
Keywords: llm
Abstract: Multilingual LLM performance is often critically dependent on model size. With an eye on efficiency, this has led to a surge in interest in one-shot pruning methods that retain the benefits of large-scale pretraining while shrinking the model size. However, as pruning tends to come with performance loss, it is important to understand the trade-offs between multilinguality and sparsification. In this work, we study multilingual performance under different sparsity constraints and show that moderate ratios already substantially harm performance. To help bridge this gap, we propose M-Wanda, a pruning method that models cross-lingual variation by incorporating language-aware activation statistics into its pruning criterion and dynamically adjusts layerwise sparsity based on cross-lingual importance. We show that M-Wanda consistently improves performance at minimal additional costs. We are the first to explicitly optimize pruning to retain multilingual performance, and hope to inspire future advances in multilingual pruning.
摘要：多语言LLM性能通常关键取决于模型尺寸。着眼于效率，这导致对单发修剪方法的兴趣激增，这些方法在缩小模型大小的同时保留了大规模预处理的好处。但是，由于修剪往往会导致性能丧失，因此重要的是要了解多语言和稀疏之间的权衡。在这项工作中，我们在不同的稀疏性约束下研究多语言性能，并表明中等比率已经严重损害了性能。为了帮助弥合这一差距，我们提出了M-Wanda，这是一种修剪方法，该方法通过将语言感知的激活统计量纳入其修剪标准，并基于跨语性的重要性，通过将语言感知的激活统计统计到其修剪标准中进行建模。我们表明，M-Wanda始终以最低的额外成本提高性能。我们是第一个明确优化修剪以保留多语言表现的人，并希望激发未来的多语言修剪进步。

Title: TAT-R1: Terminology-Aware Translation with Reinforcement Learning and Word Alignment

Authors: Zheng Li, Mao Zheng, Mingyang Song, Wenjie Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.21172
Pdf URL: https://arxiv.org/pdf/2505.21172
Copy Paste: [[2505.21172]] TAT-R1: Terminology-Aware Translation with Reinforcement Learning and Word Alignment(https://arxiv.org/abs/2505.21172)
Keywords: language model, llm
Abstract: Recently, deep reasoning large language models (LLMs) like DeepSeek-R1 have made significant progress in tasks such as mathematics and coding. Inspired by this, several studies have employed reinforcement learning (RL) to enhance models' deep reasoning capabilities and improve machine translation (MT) quality. However, terminology translation, an essential task in MT, remains unexplored in deep reasoning LLMs. In this paper, we propose TAT-R1, a terminology-aware translation model trained with reinforcement learning and word alignment. Specifically, we first extract the keyword translation pairs using a word alignment model. Then we carefully design three types of rule-based alignment rewards with the extracted alignment relationships. With those alignment rewards, the RL-trained translation model can learn to focus on the accurate translation of key information, including terminology in the source text. Experimental results show the effectiveness of TAT-R1. Our model significantly improves terminology translation accuracy compared to the baseline models while maintaining comparable performance on general translation tasks. In addition, we conduct detailed ablation studies of the DeepSeek-R1-like training paradigm for machine translation and reveal several key findings.
摘要：最近，DeepSeek-R1（例如DeepSeek-R1）等深层推理大型语言模型（LLM）在数学和编码等任务中取得了重大进展。受到这一点的启发，一些研究采用了加强学习（RL）来增强模型的深层推理能力并提高机器翻译（MT）质量。但是，术语翻译是MT中的重要任务，在深度推理LLM中仍未探索。在本文中，我们提出了\ textbf {tat-r1}，这是一种术语感知的翻译模型，该模型训练有强化学习和单词对齐。具体来说，我们首先使用单词对齐模型提取关键字翻译对。然后，我们通过提取的对齐关系仔细设计了三种基于规则的对齐奖励。有了这些对齐奖励，经过RL训练的翻译模型可以学习关注关键信息的准确翻译，包括源文本中的术语。实验结果表明TAT-R1的有效性。与基线模型相比，我们的模型可显着提高术语翻译精度，同时保持一般翻译任务的可比性。此外，我们对机器翻译的DeepSeek-R1样训练范式进行了详细的消融研究，并揭示了一些关键发现。

Title: Walk Before You Run! Concise LLM Reasoning via Reinforcement Learning

Authors: Mingyang Song, Mao Zheng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.21178
Pdf URL: https://arxiv.org/pdf/2505.21178
Copy Paste: [[2505.21178]] Walk Before You Run! Concise LLM Reasoning via Reinforcement Learning(https://arxiv.org/abs/2505.21178)
Keywords: language model, llm, chain-of-thought
Abstract: As test-time scaling becomes a pivotal research frontier in Large Language Models (LLMs) development, contemporary and advanced post-training methodologies increasingly focus on extending the generation length of long Chain-of-Thought (CoT) responses to enhance reasoning capabilities toward DeepSeek R1-like performance. However, recent studies reveal a persistent overthinking phenomenon in state-of-the-art reasoning models, manifesting as excessive redundancy or repetitive thinking patterns in long CoT responses. To address this issue, in this paper, we propose a simple yet effective two-stage reinforcement learning framework for achieving concise reasoning in LLMs, named ConciseR. Specifically, the first stage, using more training steps, aims to incentivize the model's reasoning capabilities via Group Relative Policy Optimization with clip-higher and dynamic sampling components (GRPO++), and the second stage, using fewer training steps, explicitly enforces conciseness and improves efficiency via Length-aware Group Relative Policy Optimization (L-GRPO). Significantly, ConciseR only optimizes response length once all rollouts of a sample are correct, following the "walk before you run" principle. Extensive experimental results demonstrate that our ConciseR model, which generates more concise CoT reasoning responses, outperforms recent state-of-the-art reasoning models with zero RL paradigm across AIME 2024, MATH-500, AMC 2023, Minerva, and Olympiad benchmarks.
摘要：随着测试时间缩放成为大语言模型（LLMS）开发中的关键研究前沿，当代和先进的培训后方法越来越专注于扩展长期思维链（COT）的发电长度（COT）对增强推理能力的响应，以增强对DeepSeek R1类似R1的表现。然而，最近的研究表明，在最新的推理模型中存在持续的过度思考现象，表现为长期COT响应中过度的冗余或重复思维模式。为了解决这个问题，在本文中，我们提出了一个简单而有效的两阶段增强学习框架，以在LLMS中实现Conciser，以实现Conciser。具体来说，第一阶段采用更多的训练步骤，旨在通过小组相对策略优化模型的推理能力，并使用剪贴快，动态抽样组件（GRPO ++）（GRPO ++）和第二阶段来激励模型的推理能力，并使用较少的培训步骤来激励该模型的推理能力，并使用较少的培训步骤，明确地实现简洁的效率，并通过长度的小组相对策略优化（L-Grpo（L-Grpo）提高效率）。值得注意的是，在“步行前行走”原理之后，Conciser仅优化响应长度。广泛的实验结果表明，产生更简洁的COT推理响应的Conciser模型的表现优于最新的最新推理模型，该模型在AIME 2024，Math-500，AMC 2023，Minerva，Minerva和OlympiaD基准方面具有零RL范式。
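
The "walk before you run" rule in the abstract above (response length is optimized only once all rollouts of a sample are correct) can be sketched as a gated reward followed by group-relative normalization. This is an illustrative assumption, not the paper's L-GRPO objective: the linear length bonus, the max_len parameter, and the function names are hypothetical.

```python
from statistics import mean, pstdev

def group_rewards(correct, lengths, max_len):
    """Correctness reward per rollout; a length bonus is added only when
    every rollout in the group is correct (assumed bonus shape)."""
    base = [1.0 if c else 0.0 for c in correct]
    if all(correct):
        # Shorter responses get a larger bonus in [0, 1].
        return [b + (1.0 - min(n, max_len) / max_len)
                for b, n in zip(base, lengths)]
    return base

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within the group, as in group-relative policy methods."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Mixed group: length is ignored, only correctness is rewarded.
mixed = group_rewards([True, False], [10, 20], max_len=100)
# All-correct group: the shorter rollout earns the higher reward.
solved = group_rewards([True, True], [50, 100], max_len=100)
advantages = group_relative_advantages(solved)
print(mixed, solved, advantages)
```

Under this gating, a sample the model cannot yet solve reliably contributes no pressure toward brevity, which matches the ordering the abstract describes.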

Title: Exploring the Latent Capacity of LLMs for One-Step Text Generation

Authors: Gleb Mezentsev, Ivan Oseledets
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.21189
Pdf URL: https://arxiv.org/pdf/2505.21189
Copy Paste: [[2505.21189]] Exploring the Latent Capacity of LLMs for One-Step Text Generation(https://arxiv.org/abs/2505.21189)
Keywords: language model, llm
Abstract: A recent study showed that large language models (LLMs) can reconstruct surprisingly long texts - up to thousands of tokens - via autoregressive generation from just one specially trained input embedding. In this work, we explore whether such reconstruction is possible without autoregression. We show that frozen LLMs can generate hundreds of accurate tokens in just one forward pass, when provided with only two learned embeddings. This reveals a surprising and underexplored capability of LLMs - multi-token generation without iterative decoding. We investigate the behaviour of these embeddings and provide insight into the type of information they encode. We also empirically show that although these representations are not unique for a given text, they form connected and local regions in embedding space - a property that suggests the potential of learning a dedicated encoder into that space.

Title: Unveiling Instruction-Specific Neurons & Experts: An Analytical Framework for LLM's Instruction-Following Capabilities

Authors: Junyan Zhang, Yubo Gao, Yibo Yan, Jungang Li, Zhaorui Hou, Sicheng Tao, Shuliang Liu, Song Dai, Yonghua Hei, Junzhuo Li, Xuming Hu
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.21191
Pdf URL: https://arxiv.org/pdf/2505.21191
Copy Paste: [[2505.21191]] Unveiling Instruction-Specific Neurons & Experts: An Analytical Framework for LLM's Instruction-Following Capabilities(https://arxiv.org/abs/2505.21191)
Keywords: language model, llm
Abstract: The finetuning of Large Language Models (LLMs) has significantly advanced their instruction-following capabilities, yet the underlying computational mechanisms driving these improvements remain poorly understood. This study systematically examines how fine-tuning reconfigures LLM computations by isolating and analyzing instruction-specific sparse components, i.e., neurons in dense models and both neurons and experts in Mixture-of-Experts (MoE) architectures. In particular, we introduce HexaInst, a carefully curated and balanced instructional dataset spanning six distinct categories, and propose SPARCOM, a novel analytical framework comprising three key contributions: (1) a method for identifying these sparse components, (2) an evaluation of their functional generality and uniqueness, and (3) a systematic comparison of their alterations. Through experiments, we demonstrate functional generality, uniqueness, and the critical role of these components in instruction execution. By elucidating the relationship between fine-tuning-induced adaptations and sparse computational substrates, this work provides deeper insights into how LLMs internalize instruction-following behavior for the trustworthy LLM community.

Title: Pretrained LLMs Learn Multiple Types of Uncertainty

Authors: Roi Cohen, Omri Fahn, Gerard de Melo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.21218
Pdf URL: https://arxiv.org/pdf/2505.21218
Copy Paste: [[2505.21218]] Pretrained LLMs Learn Multiple Types of Uncertainty(https://arxiv.org/abs/2505.21218)
Keywords: language model, llm, hallucination
Abstract: Large Language Models are known to capture real-world knowledge, allowing them to excel in many downstream tasks. Despite recent advances, these models are still prone to what are commonly known as hallucinations, causing them to emit unwanted and factually incorrect text. In this work, we study how well LLMs capture uncertainty without being explicitly trained to do so. We show that, if uncertainty is considered a linear concept in the model's latent space, it might indeed be captured, even after only pretraining. We further show that, though unintuitive, LLMs appear to capture several different types of uncertainty, each of which can be useful to predict the correctness for a specific task or benchmark. Furthermore, we provide in-depth results such as demonstrating a correlation between our correctness prediction and the model's ability to abstain from misinformation using words, and the lack of impact of model scaling for capturing uncertainty. Finally, we claim that unifying the uncertainty types as a single one using instruction-tuning or [IDK]-token tuning is helpful for the model in terms of correctness prediction.

Title: LMCD: Language Models are Zeroshot Cognitive Diagnosis Learners

Authors: Yu He, Zihan Yao, Chentao Song, Tianyu Qi, Jun Liu, Ming Li, Qing Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.21239
Pdf URL: https://arxiv.org/pdf/2505.21239
Copy Paste: [[2505.21239]] LMCD: Language Models are Zeroshot Cognitive Diagnosis Learners(https://arxiv.org/abs/2505.21239)
Keywords: language model, llm
Abstract: Cognitive Diagnosis (CD) has become a critical task in AI-empowered education, supporting personalized learning by accurately assessing students' cognitive states. However, traditional CD models often struggle in cold-start scenarios due to the lack of student-exercise interaction data. Recent NLP-based approaches leveraging pre-trained language models (PLMs) have shown promise by utilizing textual features but fail to fully bridge the gap between semantic understanding and cognitive profiling. In this work, we propose Language Models as Zeroshot Cognitive Diagnosis Learners (LMCD), a novel framework designed to handle cold-start challenges by harnessing large language models (LLMs). LMCD operates via two primary phases: (1) Knowledge Diffusion, where LLMs generate enriched contents of exercises and knowledge concepts (KCs), establishing stronger semantic links; and (2) Semantic-Cognitive Fusion, where LLMs employ causal attention mechanisms to integrate textual information and student cognitive states, creating comprehensive profiles for both students and exercises. These representations are efficiently trained with off-the-shelf CD models. Experiments on two real-world datasets demonstrate that LMCD significantly outperforms state-of-the-art methods in both exercise-cold and domain-cold settings. The code is publicly available at this https URL

Title: Evaluation of LLMs in Medical Text Summarization: The Role of Vocabulary Adaptation in High OOV Settings

Authors: Gunjan Balde, Soumyadeep Roy, Mainack Mondal, Niloy Ganguly
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.21242
Pdf URL: https://arxiv.org/pdf/2505.21242
Copy Paste: [[2505.21242]] Evaluation of LLMs in Medical Text Summarization: The Role of Vocabulary Adaptation in High OOV Settings(https://arxiv.org/abs/2505.21242)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) recently achieved great success in medical text summarization by simply using in-context learning. However, these recent efforts do not perform fine-grained evaluations under difficult settings where LLMs might fail. They typically report performance scores over the entire dataset. Through our benchmarking study, we show that LLMs suffer a significant performance drop for data points with a high concentration of out-of-vocabulary (OOV) words or with high novelty. Vocabulary adaptation is an intuitive solution to this vocabulary mismatch issue, where the LLM vocabulary gets updated with certain expert domain (here, medical) words or subwords. An interesting finding from our study is that Llama-3.1, even with a vocabulary size of around 128K tokens, still faces an over-fragmentation issue with medical words. To that end, we show vocabulary adaptation helps improve the LLM summarization performance even in difficult settings. Through extensive experimentation with multiple vocabulary adaptation strategies, two continual pretraining strategies, and three benchmark medical summarization datasets, we gain valuable insights into the role of vocabulary adaptation strategies for customizing LLMs to the medical domain. We also performed a human evaluation study with medical experts where they found that vocabulary adaptation results in more relevant and faithful summaries. Our codebase is made publicly available at this https URL.
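The over-fragmentation issue mentioned in the abstract can be made concrete by counting subword pieces per word. The sketch below is only an illustration: `fragmentation` and `toy_tokenize` are hypothetical names, and the fixed-size chunking stands in for a real subword tokenizer, which would have to be supplied as the `tokenize` callable.

```python
from typing import Callable, List


def fragmentation(words: List[str],
                  tokenize: Callable[[str], List[str]]) -> float:
    """Average number of subword pieces per word (often called fertility)."""
    if not words:
        raise ValueError("need at least one word")
    return sum(len(tokenize(w)) for w in words) / len(words)


def toy_tokenize(word: str, size: int = 4) -> List[str]:
    """Stand-in tokenizer (assumption): cut a word into fixed-size chunks."""
    return [word[i:i + size] for i in range(0, len(word), size)]
```

A higher value means words are split into more pieces; comparing the score on medical terms against general vocabulary is one simple way to see the mismatch that vocabulary adaptation targets.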

Title: ReSCORE: Label-free Iterative Retriever Training for Multi-hop Question Answering with Relevance-Consistency Supervision

Authors: Dosung Lee, Wonjun Oh, Boyoung Kim, Minyoung Kim, Joonsuk Park, Paul Hongsuck Seo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.21250
Pdf URL: https://arxiv.org/pdf/2505.21250
Copy Paste: [[2505.21250]] ReSCORE: Label-free Iterative Retriever Training for Multi-hop Question Answering with Relevance-Consistency Supervision(https://arxiv.org/abs/2505.21250)
Keywords: language model
Abstract: Multi-hop question answering (MHQA) involves reasoning across multiple documents to answer complex questions. Dense retrievers typically outperform sparse methods like BM25 by leveraging semantic embeddings; however, they require labeled query-document pairs for fine-tuning. This poses a significant challenge in MHQA due to the high variability of the (reformulated) queries throughout the reasoning steps. To overcome this limitation, we introduce Retriever Supervision with Consistency and Relevance (ReSCORE), a novel method for training dense retrievers for MHQA without labeled documents. ReSCORE leverages large language models to capture each document's relevance to the question and consistency with the correct answer, and uses them to train a retriever within an iterative question-answering framework. Experiments on three MHQA benchmarks demonstrate the effectiveness of ReSCORE, with significant improvements in retrieval and, in turn, state-of-the-art MHQA performance. Our implementation is available at: this https URL.

Title: Multilingual Pretraining for Pixel Language Models

Authors: Ilker Kesen, Jonas F. Lotz, Ingo Ziegler, Phillip Rust, Desmond Elliott
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.21265
Pdf URL: https://arxiv.org/pdf/2505.21265
Copy Paste: [[2505.21265]] Multilingual Pretraining for Pixel Language Models(https://arxiv.org/abs/2505.21265)
Keywords: language model
Abstract: Pixel language models operate directly on images of rendered text, eliminating the need for a fixed vocabulary. While these models have demonstrated strong capabilities for downstream cross-lingual transfer, multilingual pretraining remains underexplored. We introduce PIXEL-M4, a model pretrained on four visually and linguistically diverse languages: English, Hindi, Ukrainian, and Simplified Chinese. Multilingual evaluations on semantic and syntactic tasks show that PIXEL-M4 outperforms an English-only counterpart on non-Latin scripts. Word-level probing analyses confirm that PIXEL-M4 captures rich linguistic features, even in languages not seen during pretraining. Furthermore, an analysis of its hidden representations shows that multilingual pretraining yields a semantic embedding space closely aligned across the languages used for pretraining. This work demonstrates that multilingual pretraining substantially enhances the capability of pixel language models to effectively support a diverse set of languages.

Title: rStar-Coder: Scaling Competitive Code Reasoning with a Large-Scale Verified Dataset

Authors: Yifei Liu, Li Lyna Zhang, Yi Zhu, Bingcheng Dong, Xudong Zhou, Ning Shang, Fan Yang, Mao Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.21297
Pdf URL: https://arxiv.org/pdf/2505.21297
Copy Paste: [[2505.21297]] rStar-Coder: Scaling Competitive Code Reasoning with a Large-Scale Verified Dataset(https://arxiv.org/abs/2505.21297)
Keywords: language model, llm
Abstract: Advancing code reasoning in large language models (LLMs) is fundamentally limited by the scarcity of high-difficulty datasets, especially those with verifiable input-output test cases necessary for rigorous solution validation at scale. We introduce rStar-Coder, which significantly improves LLM code reasoning capabilities by constructing a large-scale, verified dataset of 418K competition-level code problems and 580K long-reasoning solutions, along with rich test cases of varying difficulty. This is achieved through three core contributions: (1) we curate competitive programming code problems and oracle solutions to synthesize new, solvable problems; (2) we introduce a reliable input-output test case synthesis pipeline that decouples the generation into a three-step input generation method and a mutual verification mechanism for effective output labeling; (3) we augment problems with high-quality, test-case-verified long-reasoning solutions. Extensive experiments on Qwen models (1.5B-14B) across various code reasoning benchmarks demonstrate the superiority of the rStar-Coder dataset, achieving leading performance comparable to frontier reasoning LLMs with much smaller model sizes. On LiveCodeBench, rStar-Coder improves Qwen2.5-7B from 17.4% to an impressive 57.3%, and Qwen2.5-14B from 23.3% to 62.5%, surpassing o3-mini (low) by 3.1%. On the more challenging USA Computing Olympiad, our 7B model achieves an average pass@1 accuracy of 16.15%, outperforming the frontier-level QWQ-32B. Code and the dataset will be released at this https URL.

Title: How Humans and LLMs Organize Conceptual Knowledge: Exploring Subordinate Categories in Italian

Authors: Andrea Pedrotti, Giulia Rambelli, Caterina Villani, Marianna Bolognesi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.21301
Pdf URL: https://arxiv.org/pdf/2505.21301
Copy Paste: [[2505.21301]] How Humans and LLMs Organize Conceptual Knowledge: Exploring Subordinate Categories in Italian(https://arxiv.org/abs/2505.21301)
Keywords: llm
Abstract: People can categorize the same entity at multiple taxonomic levels, such as basic (bear), superordinate (animal), and subordinate (grizzly bear). While prior research has focused on basic-level categories, this study is the first attempt to examine the organization of categories by analyzing exemplars produced at the subordinate level. We present a new Italian psycholinguistic dataset of human-generated exemplars for 187 concrete words. We then use these data to evaluate whether textual and vision LLMs produce meaningful exemplars that align with human category organization across three key tasks: exemplar generation, category induction, and typicality judgment. Our findings show a low alignment between humans and LLMs, consistent with previous studies. However, their performance varies notably across different semantic domains. Ultimately, this study highlights both the promises and the constraints of using AI-generated exemplars to support psychological and linguistic research.

Title: Charting the Landscape of African NLP: Mapping Progress and Shaping the Road Ahead

Authors: Jesujoba O. Alabi, Michael A. Hedderich, David Ifeoluwa Adelani, Dietrich Klakow
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.21315
Pdf URL: https://arxiv.org/pdf/2505.21315
Copy Paste: [[2505.21315]] Charting the Landscape of African NLP: Mapping Progress and Shaping the Road Ahead(https://arxiv.org/abs/2505.21315)
Keywords: language model, llm
Abstract: With over 2,000 languages and potentially millions of speakers, Africa represents one of the richest linguistic regions in the world. Yet, this diversity is scarcely reflected in state-of-the-art natural language processing (NLP) systems and large language models (LLMs), which predominantly support a narrow set of high-resource languages. This exclusion not only limits the reach and utility of modern NLP technologies but also risks widening the digital divide across linguistic communities. Nevertheless, NLP research on African languages is active and growing. In recent years, there has been a surge of interest in this area, driven by several factors, including the creation of multilingual language resources, the rise of community-led initiatives, and increased support through funding programs. In this survey, we analyze 734 research papers on NLP for African languages published over the past five years, offering a comprehensive overview of recent progress across core tasks. We identify key trends shaping the field and conclude by outlining promising directions to foster more inclusive and sustainable NLP research for African languages.

Title: Leveraging large language models and traditional machine learning ensembles for ADHD detection from narrative transcripts

Authors: Yuxin Zhu, Yuting Guo, Noah Marchuck, Abeed Sarker, Yun Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.21324
Pdf URL: https://arxiv.org/pdf/2505.21324
Copy Paste: [[2505.21324]] Leveraging large language models and traditional machine learning ensembles for ADHD detection from narrative transcripts(https://arxiv.org/abs/2505.21324)
Keywords: language model, llm
Abstract: Despite rapid advances in large language models (LLMs), their integration with traditional supervised machine learning (ML) techniques that have proven applicability to medical data remains underexplored. This is particularly true for psychiatric applications, where narrative data often exhibit nuanced linguistic and contextual complexity, and can benefit from the combination of multiple models with differing characteristics. In this study, we introduce an ensemble framework for automatically classifying Attention-Deficit/Hyperactivity Disorder (ADHD) diagnosis (binary) using narrative transcripts. Our approach integrates three complementary models: LLaMA3, an open-source LLM that captures long-range semantic structure; RoBERTa, a pre-trained transformer model fine-tuned on labeled clinical narratives; and a Support Vector Machine (SVM) classifier trained using TF-IDF-based lexical features. These models are aggregated through a majority voting mechanism to enhance predictive robustness. The dataset comprises 441 instances: 352 for training and 89 for validation. Empirical results show that the ensemble outperforms individual models, achieving an F1 score of 0.71 (95% CI: [0.60-0.80]). Compared to the best-performing individual model (SVM), the ensemble improved recall while maintaining competitive precision. This indicates the strong sensitivity of the ensemble in identifying ADHD-related linguistic cues. These findings demonstrate the promise of hybrid architectures that leverage the semantic richness of LLMs alongside the interpretability and pattern recognition capabilities of traditional supervised ML, offering a new direction for robust and generalizable psychiatric text classification.
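The majority voting step that combines the three classifiers can be sketched as follows. This is a minimal illustration, not the authors' code: `majority_vote` and `ensemble_predict` are hypothetical names, and each model's output is assumed to be a binary label (1 for ADHD, 0 otherwise) already computed elsewhere.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(predictions: Sequence[int]) -> int:
    """Return the label chosen by most voters (odd voter count avoids ties)."""
    if len(predictions) % 2 == 0:
        raise ValueError("use an odd number of voters for binary labels")
    return Counter(predictions).most_common(1)[0][0]


def ensemble_predict(per_model: Sequence[Sequence[int]]) -> List[int]:
    """per_model[m][i] is model m's binary label for instance i."""
    return [majority_vote(votes) for votes in zip(*per_model)]
```

With three voters and binary labels, at least two models always agree, so no tie-breaking rule is needed.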

Title: PEDANTIC: A Dataset for the Automatic Examination of Definiteness in Patent Claims

Authors: Valentin Knappich, Annemarie Friedrich, Anna Hätty, Simon Razniewski
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.21342
Pdf URL: https://arxiv.org/pdf/2505.21342
Copy Paste: [[2505.21342]] PEDANTIC: A Dataset for the Automatic Examination of Definiteness in Patent Claims(https://arxiv.org/abs/2505.21342)
Keywords: language model, llm, agent
Abstract: Patent claims define the scope of protection for an invention. If there are ambiguities in a claim, it is rejected by the patent office. In the US, this is referred to as indefiniteness (35 U.S.C. § 112(b)) and is among the most frequent reasons for patent application rejection. The development of automatic methods for patent definiteness examination has the potential to make patent drafting and examination more efficient, but no annotated dataset has been published to date. We introduce PEDANTIC (Patent Definiteness Examination Corpus), a novel dataset of 14k US patent claims from patent applications relating to Natural Language Processing (NLP), annotated with reasons for indefiniteness. We construct PEDANTIC using a fully automatic pipeline that retrieves office action documents from the USPTO and uses Large Language Models (LLMs) to extract the reasons for indefiniteness. A human validation study confirms the pipeline's accuracy in generating high-quality annotations. To gain insight beyond binary classification metrics, we implement an LLM-as-Judge evaluation that compares the free-form reasoning of every model-cited reason with every examiner-cited reason. We show that LLM agents based on Qwen 2.5 32B and 72B struggle to outperform logistic regression baselines on definiteness prediction, even though they often correctly identify the underlying reasons. PEDANTIC provides a valuable resource for patent AI researchers, enabling the development of advanced examination models. We will publicly release the dataset and code.

Title: Leveraging Large Language Models for Bengali Math Word Problem Solving with Chain of Thought Reasoning

Authors: Bidyarthi Paul, Jalisha Jashim Era, Mirazur Rahman Zim, Tahmid Sattar Aothoi, Faisal Muhammad Shah
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.21354
Pdf URL: https://arxiv.org/pdf/2505.21354
Copy Paste: [[2505.21354]] Leveraging Large Language Models for Bengali Math Word Problem Solving with Chain of Thought Reasoning(https://arxiv.org/abs/2505.21354)
Keywords: language model, gpt, llm, prompt
Abstract: Solving Bengali Math Word Problems (MWPs) remains a major challenge in natural language processing (NLP) due to the language's low-resource status and the multi-step reasoning required. Existing models struggle with complex Bengali MWPs, largely because no human-annotated Bengali dataset has previously addressed this task. This gap has limited progress in Bengali mathematical reasoning. To address this, we created SOMADHAN, a dataset of 8792 complex Bengali MWPs with manually written, step-by-step solutions. We designed this dataset to support reasoning-focused evaluation and model development in a linguistically underrepresented context. Using SOMADHAN, we evaluated a range of large language models (LLMs) - including GPT-4o, GPT-3.5 Turbo, LLaMA series models, Deepseek, and Qwen - through both zero-shot and few-shot prompting with and without Chain of Thought (CoT) reasoning. CoT prompting consistently improved performance over standard prompting, especially in tasks requiring multi-step logic. LLaMA-3.3 70B achieved the highest accuracy of 88% with few-shot CoT prompting. We also applied Low-Rank Adaptation (LoRA) to fine-tune models efficiently, enabling them to adapt to Bengali MWPs with minimal computational cost. Our work fills a critical gap in Bengali NLP by providing a high-quality reasoning dataset and a scalable framework for solving complex MWPs. We aim to advance equitable research in low-resource languages and enhance reasoning capabilities in educational and language technologies.

Title: Evaluating LLM Adaptation to Sociodemographic Factors: User Profile vs. Dialogue History

Authors: Qishuai Zhong, Zongmin Li, Siqi Fan, Aixin Sun
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2505.21362
Pdf URL: https://arxiv.org/pdf/2505.21362
Copy Paste: [[2505.21362]] Evaluating LLM Adaptation to Sociodemographic Factors: User Profile vs. Dialogue History(https://arxiv.org/abs/2505.21362)
Keywords: language model, llm, prompt, agent
Abstract: Effective engagement by large language models (LLMs) requires adapting responses to users' sociodemographic characteristics, such as age, occupation, and education level. While many real-world applications leverage dialogue history for contextualization, existing evaluations of LLMs' behavioral adaptation often focus on single-turn prompts. In this paper, we propose a framework to evaluate LLM adaptation when attributes are introduced either (1) explicitly via user profiles in the prompt or (2) implicitly through multi-turn dialogue history. We assess the consistency of model behavior across these modalities. Using a multi-agent pipeline, we construct a synthetic dataset pairing dialogue histories with distinct user profiles and employ questions from the Value Survey Module (VSM 2013) (Hofstede and Hofstede, 2016) to probe value expression. Our findings indicate that most models adjust their expressed values in response to demographic changes, particularly in age and education level, but consistency varies. Models with stronger reasoning capabilities demonstrate greater alignment, indicating the importance of reasoning in robust sociodemographic adaptation.

Title: Analyzing values about gendered language reform in LLMs' revisions

Authors: Jules Watson, Xi Wang, Raymond Liu, Suzanne Stevenson, Barend Beekhuizen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.21378
Pdf URL: https://arxiv.org/pdf/2505.21378
Copy Paste: [[2505.21378]] Analyzing values about gendered language reform in LLMs' revisions(https://arxiv.org/abs/2505.21378)
Keywords: llm
Abstract: Within the common LLM use case of text revision, we study LLMs' revision of gendered role nouns (e.g., outdoorsperson/woman/man) and their justifications of such revisions. We evaluate their alignment with feminist and trans-inclusive language reforms for English. Drawing on insight from sociolinguistics, we further assess if LLMs are sensitive to the same contextual effects in the application of such reforms as people are, finding broad evidence of such effects. We discuss implications for value alignment.

Title: AutoJudger: An Agent-Driven Framework for Efficient Benchmarking of MLLMs

Authors: Xuanwen Ding, Chengjun Pan, Zejun Li, Jiwen Zhang, Siyuan Wang, Zhongyu Wei
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.21389
Pdf URL: https://arxiv.org/pdf/2505.21389
Copy Paste: [[2505.21389]] AutoJudger: An Agent-Driven Framework for Efficient Benchmarking of MLLMs(https://arxiv.org/abs/2505.21389)
Keywords: language model, llm, agent
Abstract: Evaluating multimodal large language models (MLLMs) is increasingly expensive, as the growing size and cross-modality complexity of benchmarks demand significant scoring efforts. To tackle this difficulty, we introduce AutoJudger, an agent-driven framework for efficient and adaptive benchmarking of MLLMs. AutoJudger employs Item Response Theory (IRT) to estimate question difficulty and an autonomous evaluation agent to dynamically select the most informative test questions based on the model's real-time performance. Specifically, AutoJudger incorporates two pivotal components: a semantic-aware retrieval mechanism to ensure that selected questions cover diverse and challenging scenarios across both vision and language modalities, and a dynamic memory that maintains contextual statistics of previously evaluated questions to guide coherent and globally informed question selection throughout the evaluation process. Extensive experiments on four representative multimodal benchmarks demonstrate that our adaptive framework dramatically reduces evaluation expenses, i.e., AutoJudger uses only 4% of the data to achieve over 90% ranking accuracy with respect to the full benchmark evaluation on MMT-Bench.
摘要：评估多模式大语言模型（MLLM）的越来越昂贵，因为基准的规模和跨模式的复杂性不断增长。为了解决这个困难，我们介绍了AutoJudger，这是一个由代理驱动的框架，用于对MLLM的高效和适应性基准测试，以解决这种不断升级的成本。 AutoJudger采用项目响应理论（IRT）来估算问题难度和自主评估代理，以动态选择基于模型的实时性能的最有用的测试问题。具体而言，AutoJudger结合了两个关键组成部分：一种语义感知的检索机制，以确保所选问题涵盖视觉和语言方式的各种各样和挑战性的情景，以及在整个评估过程中保持相干和全球知识的问题的先前评估问题的上下文统计数据，以保持先前评估的问题的上下文统计。对四个代表性的多模式基准进行的广泛实验表明，我们的适应性框架大大降低了评估费用，即自动截图仅使用4％的数据来实现超过90％的排名准确性，并在MMT Bench上进行完整的基准评估。
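
Illustrative sketch (not from the paper): the abstract says AutoJudger uses IRT to estimate question difficulty and then selects the most informative questions. A classical, much simpler stand-in for that selection step is maximum-Fisher-information item selection under a two-parameter logistic (2PL) IRT model. The 2PL form, the function names, and the item parameters below are all assumptions made for illustration; AutoJudger's actual agent, retrieval mechanism, and memory are not modeled here.

```python
import math

def p_correct(theta, a, b):
    """2PL IRT: probability that a model with ability theta answers an item
    with discrimination a and difficulty b correctly."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def fisher_information(theta, a, b):
    """Information an item carries about theta under the 2PL model."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)

def pick_next_item(theta, items, asked):
    """Return the index of the not-yet-asked item that is most informative
    at the current ability estimate theta."""
    candidates = [i for i in range(len(items)) if i not in asked]
    return max(candidates, key=lambda i: fisher_information(theta, *items[i]))

# Hypothetical item bank: (discrimination, difficulty) pairs.
items = [(1.0, -2.0), (1.0, 0.1), (1.0, 2.5)]
next_item = pick_next_item(0.0, items, asked=set())
```

With equal discrimination, the chosen item is the one whose difficulty lies closest to the current ability estimate, which is why a small adaptive subset can rank models almost as well as the full benchmark.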

Title: Improving Research Idea Generation Through Data: An Empirical Investigation in Social Science

Authors: Xiao Liu, Xinyi Dong, Xinyang Gao, Yansong Feng, Xun Pang
Subjects: cs.CL, cs.AI, cs.CY, cs.HC
Abstract URL: https://arxiv.org/abs/2505.21396
Pdf URL: https://arxiv.org/pdf/2505.21396
Copy Paste: [[2505.21396]] Improving Research Idea Generation Through Data: An Empirical Investigation in Social Science(https://arxiv.org/abs/2505.21396)
Keywords: language model, llm
Abstract: Recent advancements in large language models (LLMs) have shown promise in generating novel research ideas. However, these ideas often face challenges related to feasibility and expected effectiveness. This paper explores how augmenting LLMs with relevant data during the idea generation process can enhance the quality of generated ideas. We introduce two ways of incorporating data: (1) providing metadata during the idea generation stage to guide LLMs toward feasible directions, and (2) adding automatic validation during the idea selection stage to assess the empirical plausibility of hypotheses within ideas. We conduct experiments in the social science domain, specifically with climate negotiation topics, and find that metadata improves the feasibility of generated ideas by 20%, while automatic validation improves the overall quality of selected ideas by 7%. A human study shows that LLM-generated ideas, along with their related data and validation processes, inspire researchers to propose research ideas with higher quality. Our work highlights the potential of data-driven research idea generation, and underscores the practical utility of LLM-assisted ideation in real-world academic settings.
摘要：大型语言模型（LLM）的最新进展在产生新颖的研究思想方面已显示出希望。但是，这些想法通常面临与可行性和预期有效性有关的挑战。本文探讨了如何在想法生成过程中使用相关数据增强LLM可以提高产生的想法的质量。我们介绍了两种合并数据的方法：（1）在想法生成阶段提供元数据，以指导LLMS朝着可行的方向指导，（2）在思想选择阶段添加自动验证，以评估思想中假设的经验合理性。我们在社会科学领域进行了实验，特别是通过气候谈判主题进行的，发现元数据将产生的思想的可行性提高了20％，而自动验证将所选思想的整体质量提高了7％。人类的一项研究表明，LLM生成的思想以及它们的相关数据和验证过程激发了研究人员以更高的质量提出研究思想。我们的工作突出了数据驱动的研究思想生成的潜力，并强调了LLM辅助构想在实际学术环境中的实际实用性。

Title: DecisionFlow: Advancing Large Language Model as Principled Decision Maker

Authors: Xiusi Chen, Shanyong Wang, Cheng Qian, Hongru Wang, Peixuan Han, Heng Ji
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.21397
Pdf URL: https://arxiv.org/pdf/2505.21397
Copy Paste: [[2505.21397]] DecisionFlow: Advancing Large Language Model as Principled Decision Maker(https://arxiv.org/abs/2505.21397)
Keywords: language model, llm, prompt
Abstract: In high-stakes domains such as healthcare and finance, effective decision-making demands not just accurate outcomes but transparent and explainable reasoning. However, current language models often lack the structured deliberation needed for such tasks, instead generating decisions and justifications in a disconnected, post-hoc manner. To address this, we propose DecisionFlow, a novel decision modeling framework that guides models to reason over structured representations of actions, attributes, and constraints. Rather than predicting answers directly from prompts, DecisionFlow builds a semantically grounded decision space and infers a latent utility function to evaluate trade-offs in a transparent, utility-driven manner. This process produces decisions tightly coupled with interpretable rationales reflecting the model's reasoning. Empirical results on two high-stakes benchmarks show that DecisionFlow not only achieves up to 30% accuracy gains over strong prompting baselines but also enhances alignment in outcomes. Our work is a critical step toward integrating symbolic reasoning with LLMs, enabling more accountable, explainable, and reliable LLM decision support systems. We release the data and code at this https URL.
摘要：在医疗保健和金融等高风险领域中，有效的决策不仅需要准确的结果，而且需要透明且可解释的推理。但是，当前的语言模型通常缺乏此类任务所需的结构化审议，而是以断开的事后方式产生决策和理由。为了解决这个问题，我们提出了决策流，这是一个新颖的决策建模框架，它指导模型来推理动作，属性和约束的结构化表示。决策流没有直接从提示中预测答案，而是建立一个语义上的决策空间，并渗透潜在的效用功能，以透明，公用事业驱动的方式评估权衡。这个过程产生的决定与反映模型推理的可解释原理紧密相结合。两个高风险基准的经验结果表明，决策流不仅可以在强大的促使基线的情况下获得高达30％的准确性提高，而且可以增强结果的一致性。我们的工作是将象征性推理与LLM集成的关键步骤，从而使更负责任，可解释且可靠的LLM决策支持系统能够承担责任。我们在此HTTPS URL上发布数据和代码。

Title: Factual Self-Awareness in Language Models: Representation, Robustness, and Scaling

Authors: Hovhannes Tamoyan, Subhabrata Dutta, Iryna Gurevych
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.21399
Pdf URL: https://arxiv.org/pdf/2505.21399
Copy Paste: [[2505.21399]] Factual Self-Awareness in Language Models: Representation, Robustness, and Scaling(https://arxiv.org/abs/2505.21399)
Keywords: language model, llm
Abstract: Factual incorrectness in generated content is one of the primary concerns in the ubiquitous deployment of large language models (LLMs). Prior findings suggest LLMs can (sometimes) detect factual incorrectness in their generated content (i.e., fact-checking post-generation). In this work, we provide evidence supporting the presence of an internal compass in LLMs that dictates the correctness of factual recall at the time of generation. We demonstrate that for a given subject entity and a relation, LLMs internally encode linear features in the Transformer's residual stream that dictate whether the model will be able to recall the correct attribute (that forms a valid entity-relation-attribute triplet). This self-awareness signal is robust to minor formatting variations. We investigate the effects of context perturbation via different example selection strategies. Scaling experiments across model sizes and training dynamics highlight that self-awareness emerges rapidly during training and peaks in intermediate layers. These findings uncover intrinsic self-monitoring capabilities within LLMs, contributing to their interpretability and reliability.
摘要：生成内容的事实错误是大语模型（LLMS）无处不在的部署的主要问题之一。先前的发现表明，LLM可以（有时）检测其生成的内容（即事实检查后产生）的事实错误。在这项工作中，我们提供了支持LLMS内部指南针的存在的证据，该指南在发电时决定了事实召回的正确性。我们证明，对于给定的主题实体和关系，LLMS内部编码了变压器的残差流中的线性特征，该特征决定了它是否能够回忆正确的属性（形成有效的实体 - 缔合 - 属性三重态）。这种自我意识信号对于较小的格式变化是可靠的。我们通过不同的示例选择策略研究了上下文扰动的影响。跨模型尺寸和训练动力学的缩放实验凸显了自我意识在训练和中间层的峰值期间迅速出现。这些发现揭示了LLM中内在的自我监控能力，从而有助于其解释性和可靠性。

Title: RelationalFactQA: A Benchmark for Evaluating Tabular Fact Retrieval from Large Language Models

Authors: Dario Satriani, Enzo Veltri, Donatello Santoro, Paolo Papotti
Subjects: cs.CL, cs.AI, cs.DB
Abstract URL: https://arxiv.org/abs/2505.21409
Pdf URL: https://arxiv.org/pdf/2505.21409
Copy Paste: [[2505.21409]] RelationalFactQA: A Benchmark for Evaluating Tabular Fact Retrieval from Large Language Models(https://arxiv.org/abs/2505.21409)
Keywords: language model, llm
Abstract: Factuality in Large Language Models (LLMs) is a persistent challenge. Current benchmarks often assess short factual answers, overlooking the critical ability to generate structured, multi-record tabular outputs from parametric knowledge. We demonstrate that this relational fact retrieval is substantially more difficult than isolated point-wise queries, even when individual facts are known to the model, exposing distinct failure modes sensitive to output dimensionality (e.g., number of attributes or records). To systematically evaluate this under-explored capability, we introduce RelationalFactQA, a new benchmark featuring diverse natural language questions (paired with SQL) and gold-standard tabular answers, specifically designed to assess knowledge retrieval in a structured format. RelationalFactQA enables analysis across varying query complexities, output sizes, and data characteristics. Our experiments reveal that even state-of-the-art LLMs struggle significantly, not exceeding 25% factual accuracy in generating relational outputs, with performance notably degrading as output dimensionality increases. These findings underscore critical limitations in current LLMs' ability to synthesize structured factual knowledge and establish RelationalFactQA as a crucial resource for measuring future progress in LLM factuality.
摘要：大语言模型（LLM）中的事实是一个持续的挑战。当前的基准通常会评估简短的事实答案，从而忽略了从参数知识中产生结构化的多唱片表输出输出的关键能力。我们证明，即使模型已知单个事实，这种关系事实检索也比孤立的点查询要困难得多，从而暴露了对输出维度敏感的不同故障模式（例如属性或记录的数量）。为了系统地评估这种探索不足的功能，我们介绍了RelationalFactQA，这是一种新的基准测试，其中包含多种自然语言问题（与SQL配对）和金标准的表格答案，该答案是专门旨在评估结构化格式的知识检索的。 RelationalFactQA可以跨不同查询复杂性，输出大小和数据特征进行分析。我们的实验表明，即使是最先进的LLM也很大程度上挣扎，在产生关系产出时不超过25％的事实准确性，随着产出维度的增加，性能显着降级。这些发现强调了当前LLMS综合结构化事实知识并建立RelationFactQA作为衡量LLM事实未来进步的关键资源的临界限制。

Title: Pangu Pro MoE: Mixture of Grouped Experts for Efficient Sparsity

Authors: Yehui Tang, Xiaosong Li, Fangcheng Liu, Wei Guo, Hang Zhou, Yaoyuan Wang, Kai Han, Xianzhi Yu, Jinpeng Li, Hui Zang, Fei Mi, Xiaojun Meng, Zhicheng Liu, Hanting Chen, Binfan Zheng, Can Chen, Youliang Yan, Ruiming Tang, Peifeng Qin, Xinghao Chen, Dacheng Tao, Yunhe Wang (and Other Contributors)
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.21411
Pdf URL: https://arxiv.org/pdf/2505.21411
Copy Paste: [[2505.21411]] Pangu Pro MoE: Mixture of Grouped Experts for Efficient Sparsity(https://arxiv.org/abs/2505.21411)
Keywords: language model
Abstract: The surge of Mixture of Experts (MoE) in Large Language Models promises a much larger model parameter count and learning capacity at a small execution cost, because only a small fraction of parameters are activated for each input token. However, it is commonly observed that some experts are activated far more often than others, leading to system inefficiency when running the experts on different devices in parallel. Therefore, we introduce Mixture of Grouped Experts (MoGE), which groups the experts during selection and, by design, balances the expert workload better than MoE. It constrains tokens to activate an equal number of experts within each predefined expert group. When a model execution is distributed on multiple devices, this architectural design ensures a balanced computational load across devices, significantly enhancing throughput, particularly for the inference phase. Further, we build Pangu Pro MoE on Ascend NPUs, a sparse model based on MoGE with 72 billion total parameters, 16 billion of which are activated for each token. The configuration of Pangu Pro MoE is optimized for Ascend 300I Duo and 800I A2 through extensive system simulation studies. Our experiments indicate that MoGE indeed leads to better expert load balancing and more efficient execution for both model training and inference on Ascend NPUs. The inference performance of Pangu Pro MoE achieves 1148 tokens/s per card and can be further improved to 1528 tokens/s per card by speculative acceleration, outperforming comparable 32B and 72B Dense models. Furthermore, we achieve an excellent cost-to-performance ratio for model inference on Ascend 300I Duo. Our studies show that Ascend NPUs are capable of training Pangu Pro MoE with massive parallelization to make it a leading model within the sub-100B total parameter class, outperforming prominent open-source models like GLM-Z1-32B and Qwen3-32B.
摘要：专家（MOE）在大语言模型中的混合物的外科手术有望为更大的模型参数计数和学习能力的执行成本较小，因为每个输入令牌仅激活一小部分参数。但是，通常观察到，某些专家被激活的频率要比其他专家频率要大得多，从而导致系统效率低时，在并行运行不同设备上的专家时。因此，我们介绍了分组专家（MOGE）的混合物，该混合物在选择过程中对专家进行分组，并在本质上比MOE更好地平衡专家工作量。它限制了令牌，以激活每个预定义的专家组中相等数量的专家。当模型执行分布在多个设备上时，该架构设计可确保跨设备的平衡计算负载，从而显着增强吞吐量，尤其是在推理阶段。此外，我们在Ascend NPU上构建了Pangu Pro Moe，这是一种基于Moge的稀疏模型，总参数为720亿，每个代币都被激活了160亿。通过广泛的系统仿真研究，针对Ascend 300i Duo和800I A2的pangu Pro Moe的配置进行了优化。我们的实验表明，Moge确实会导致更好的专家负载平衡，并为模型培训和对Ascend NPU的推断提供更有效的执行。 Pangu Pro Moe的推理性能可实现每张卡的1148代币/s，可以通过投机加速度进一步提高每张卡的1528代币/s，表现优于可比的32B和72B密度的模型。 Furthermore, we achieve an excellent cost-to-performance ratio for model inference on Ascend 300I this http URL studies show that Ascend NPUs are capable of training Pangu Pro MoE with massive parallelization to make it a leading model within the sub-100B total parameter class, outperforming prominent open-source models like GLM-Z1-32B and Qwen3-32B.
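
Illustrative sketch (not from the paper): the abstract describes MoGE as constraining each token to activate an equal number of experts within each predefined group. A minimal way to picture this is a per-group top-k over router scores. The contiguous group layout, the function name `grouped_top_k`, and the example scores are assumptions for illustration only; the real Pangu Pro MoE router and its device placement are not specified in this abstract.

```python
def grouped_top_k(scores, num_groups, k_per_group):
    """Select k_per_group experts inside every group of a flat score list.

    Assumes experts are laid out contiguously, so group g owns the index
    range [g * size, (g + 1) * size).
    """
    if len(scores) % num_groups != 0:
        raise ValueError("number of experts must be divisible by num_groups")
    size = len(scores) // num_groups
    selected = []
    for g in range(num_groups):
        group = range(g * size, (g + 1) * size)
        # Highest-scoring experts within this group only.
        best = sorted(group, key=lambda i: scores[i], reverse=True)[:k_per_group]
        selected.extend(sorted(best))
    return selected

# Hypothetical router scores for 8 experts split into 4 groups of 2.
scores = [0.9, 0.8, 0.1, 0.2, 0.3, 0.05, 0.7, 0.6]
moge_choice = grouped_top_k(scores, num_groups=4, k_per_group=1)
# Plain global top-4 for comparison: it may pile onto a few groups.
plain_choice = sorted(sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:4])
```

In this toy case the global top-4 picks both experts of the first and last groups and none from the middle two, while the grouped selection always takes exactly one expert per group. If each group were placed on its own device (an assumption), every device would receive the same number of activations per token.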

Title: RefTool: Enhancing Model Reasoning with Reference-Guided Tool Creation

Authors: Xiao Liu, Da Yin, Zirui Wu, Yansong Feng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.21413
Pdf URL: https://arxiv.org/pdf/2505.21413
Copy Paste: [[2505.21413]] RefTool: Enhancing Model Reasoning with Reference-Guided Tool Creation(https://arxiv.org/abs/2505.21413)
Keywords: language model, llm
Abstract: Tools enhance the reasoning capabilities of large language models (LLMs) in complex problem-solving tasks, but not all tasks have available tools. In the absence of predefined tools, prior works have explored instructing LLMs to generate tools on their own. However, such approaches rely heavily on the models' internal knowledge and would fail in domains beyond the LLMs' knowledge scope. To address this limitation, we propose RefTool, a reference-guided framework for automatic tool creation that leverages structured external materials such as textbooks. RefTool consists of two modules: (1) tool creation, where LLMs generate executable tools from reference content, validate them using illustrative examples, and organize them hierarchically into a toolbox; and (2) tool utilization, where LLMs navigate the toolbox structure to select and apply the appropriate tools to solve problems. Experiments on causality, physics, and chemistry benchmarks demonstrate that RefTool outperforms existing tool-creation and domain-specific reasoning methods by 11.3% on average accuracy, while being cost-efficient and broadly generalizable. Analyses reveal that grounding tool creation in references produces accurate and faithful tools, and that the hierarchical structure facilitates effective tool selection. RefTool enables LLMs to overcome knowledge limitations, demonstrating the value of grounding tool creation in external references for enhanced and generalizable reasoning.
摘要：工具在复杂解决问题的任务中增强了大语言模型（LLM）的推理能力，但并非所有任务都具有可用的工具。在没有预定义工具的情况下，先前的工作探索了指示LLM自己生成工具的工具。但是，这种方法在很大程度上依赖于模型的内部知识，并且在LLMS知识范围之外的领域中将失败。为了解决此限制，我们提出了Reftool，这是一个自动创建的参考引导框架，利用结构化的外部材料（例如教科书）。重新传输由两个模块组成：（1）工具创建，其中LLMS从参考内容中生成可执行的工具，使用说明性示例对其进行验证，然后将它们层次组织到工具箱中；（2）工具利用率，其中LLM导航工具箱结构以选择并应用适当的工具来解决问题。有关因果关系，物理和化学基准的实验表明，重新赋予的表现优于现有的工具创造和特定于域的推理方法的平均准确性11.3％，而具有成本效益且可广泛的概括。分析表明，参考文献中的基础工具创建会产生准确而忠实的工具，并且层次结构有助于有效的工具选择。 Reftool使LLM能够克服知识局限性，证明了在外部参考文献中建立接地工具的价值，以增强和可推广的推理。

Title: Towards Better Instruction Following Retrieval Models

Authors: Yuchen Zhuang, Aaron Trinh, Rushi Qiang, Haotian Sun, Chao Zhang, Hanjun Dai, Bo Dai
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2505.21439
Pdf URL: https://arxiv.org/pdf/2505.21439
Copy Paste: [[2505.21439]] Towards Better Instruction Following Retrieval Models(https://arxiv.org/abs/2505.21439)
Keywords: language model
Abstract: Modern information retrieval (IR) models, trained exclusively on standard <query, passage> pairs, struggle to effectively interpret and follow explicit user instructions. We introduce InF-IR, a large-scale, high-quality training corpus tailored for enhancing retrieval models in Instruction-Following IR. InF-IR expands traditional training pairs into over 38,000 expressive <instruction, query, passage> triplets as positive samples. In particular, for each positive triplet, we generate two additional hard negative examples by poisoning both instructions and queries, which are then rigorously validated by an advanced reasoning model (o3-mini) to ensure semantic plausibility while maintaining instructional incorrectness. Unlike existing corpora that primarily support computationally intensive reranking tasks for decoder-only language models, the highly contrastive positive-negative triplets in InF-IR further enable efficient representation learning for smaller encoder-only models, facilitating direct embedding-based retrieval. Using this corpus, we train InF-Embed, an instruction-aware embedding model optimized through contrastive learning and instruction-query attention mechanisms to align retrieval outcomes precisely with user intents. Extensive experiments across five instruction-based retrieval benchmarks demonstrate that InF-Embed significantly surpasses competitive baselines by 8.1% in p-MRR, a metric that measures instruction-following capabilities.
摘要：现代信息检索（IR）模型，专门针对标准<查询，通道>对培训，难以有效地解释和遵循明确的用户说明。我们介绍了INF-IR，这是一种量身定制的大型高质量培训语料库，该语料库旨在增强IR的指导型IR中的检索模型。 INF-IR将传统培训对扩展到38,000多个表达<指导，查询，段落>三胞胎作为正样本。特别是，对于每个积极的三胞胎，我们通过中毒指令和查询来产生两个额外的硬性示例，然后通过先进的推理模型（O3-MINI）进行严格验证，以确保语义上的合理性，同时保持教学不正确。与现有的Corpora不同，该公司主要支持仅针对纯语言模型的计算密集型重新计算任务，而Inf-IR中高度对比度的正面阴性三胞胎进一步为较小的编码模型提供了有效的代表性学习，从而促进了基于直接嵌入的直接嵌入检索。使用此语料库，我们训练通过对比度学习和指导 - 疑问注意机制优化的指令感知的嵌入模型，以将检索结果与用户的意图恰好相结合。五个基于指令的检索基准进行的广泛实验表明，在P-MRR中，INF插入的实验可显着超过8.1％的竞争基准，从而衡量了遵循指导遵循的能力。
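
Illustrative sketch (not from the paper): the abstract says each positive triplet is paired with two hard negatives and that InF-Embed is trained with contrastive learning. One common form of such an objective is an InfoNCE-style loss over cosine similarities. The specific loss, the temperature value, and the toy vectors below are assumptions for illustration; the abstract does not state which contrastive loss InF-Embed uses.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def info_nce(query_vec, positive_vec, negative_vecs, temperature=0.05):
    """InfoNCE-style loss: negative log-softmax of the positive's similarity
    against the positive plus all hard negatives."""
    logits = [cosine(query_vec, positive_vec) / temperature]
    logits += [cosine(query_vec, n) / temperature for n in negative_vecs]
    m = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]

# Hypothetical 2-d embeddings: the encoded (instruction, query) and passages.
query = [1.0, 0.0]
good = info_nce(query, [1.0, 0.0], [[0.0, 1.0], [0.0, 1.0]])  # positive is closest
bad = info_nce(query, [0.0, 1.0], [[1.0, 0.0], [0.0, 1.0]])   # a negative is closest
```

The loss is small when the positive passage is the nearest one and large when a poisoned negative is nearer, which is the pressure that pushes the encoder to attend to the instruction rather than the query alone.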

Title: Words Like Knives: Backstory-Personalized Modeling and Detection of Violent Communication

Authors: Jocelyn Shen, Akhila Yerukola, Xuhui Zhou, Cynthia Breazeal, Maarten Sap, Hae Won Park
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.21451
Pdf URL: https://arxiv.org/pdf/2505.21451
Copy Paste: [[2505.21451]] Words Like Knives: Backstory-Personalized Modeling and Detection of Violent Communication(https://arxiv.org/abs/2505.21451)
Keywords: llm
Abstract: Conversational breakdowns in close relationships are deeply shaped by personal histories and emotional context, yet most NLP research treats conflict detection as a general task, overlooking the relational dynamics that influence how messages are perceived. In this work, we leverage nonviolent communication (NVC) theory to evaluate LLMs in detecting conversational breakdowns and assessing how relationship backstory influences both human and model perception of conflicts. Given the sensitivity and scarcity of real-world datasets featuring conflict between familiar social partners with rich personal backstories, we contribute the PersonaConflicts Corpus, a dataset of N=5,772 naturalistic simulated dialogues spanning diverse conflict scenarios between friends, family members, and romantic partners. Through a controlled human study, we annotate a subset of dialogues and obtain fine-grained labels of communication breakdown types on individual turns, and assess the impact of backstory on human and model perception of conflict in conversation. We find that the polarity of relationship backstories significantly shifted human perception of communication breakdowns and impressions of the social partners, yet models struggle to meaningfully leverage those backstories in the detection task. Additionally, we find that models consistently overestimate how positively a message will make a listener feel. Our findings underscore the critical role of personalization to relationship contexts in enabling LLMs to serve as effective mediators in human communication for authentic connection.
摘要：亲密关系中的对话分解是由个人历史和情感上下文深刻影响的，但是大多数NLP研究都将冲突检测视为一项一般任务，忽略了影响信息如何被感知的关系动态。在这项工作中，我们利用非暴力沟通（NVC）理论评估LLM在检测对话分解并评估关系背景如何影响人类和模型对冲突的模型感知方面评估。鉴于现实世界数据集的敏感性和稀缺性具有熟悉的社会伙伴与丰富的个人背景故事之间的冲突，我们贡献了PersonAconFlicts语料库，这是N = 5,772个自然主义模拟对话的数据集，涵盖了朋友，家人和浪漫的党派和浪漫派对之间的多样化冲突场景。通过对对话的一部分，通过对对话的子集进行注释，并获得对单个转弯的沟通崩溃类型的细粒标签，并评估背景对人类对对话中冲突的影响和模型的影响。我们发现，关系背景的极性显着转移了人类对沟通崩溃和社会伙伴印象的看法，但是模型努力在检测任务中有意义地利用这些背景故事。此外，我们发现模型始终高估了信息会使听众感到积极的积极方式。我们的发现强调了个性化对关系环境的关键作用在使LLMS能够充当人类交流中的有效调解人以实现真实联系。

Title: Do LLMs Need to Think in One Language? Correlation between Latent Language and Task Performance

Authors: Shintaro Ozaki, Tatsuya Hiraoka, Hiroto Otake, Hiroki Ouchi, Masaru Isonuma, Benjamin Heinzerling, Kentaro Inui, Taro Watanabe, Yusuke Miyao, Yohei Oseki, Yu Takagi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.21458
Pdf URL: https://arxiv.org/pdf/2505.21458
Copy Paste: [[2505.21458]] Do LLMs Need to Think in One Language? Correlation between Latent Language and Task Performance(https://arxiv.org/abs/2505.21458)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) are known to consistently process information in an internal language in which they are proficient, referred to as the latent language, which may differ from the input or output languages. However, how the discrepancy between the latent language and the input and output languages affects downstream task performance remains largely unexplored. While many studies research the latent language of LLMs, few address its importance in influencing task performance. In our study, we hypothesize that thinking in the latent language consistently enhances downstream task performance. To validate this, our work varies the input prompt languages across multiple downstream tasks and analyzes the correlation between consistency in latent language and task performance. We create datasets consisting of questions from diverse domains such as translation and geo-culture, which are influenced by the choice of latent language. Experimental results across multiple LLMs on translation and geo-culture tasks, which are sensitive to the choice of language, indicate that maintaining consistency in latent language is not always necessary for optimal downstream task performance. This is because these models adapt their internal representations near the final layers to match the target language, reducing the impact of consistency on overall performance.
摘要：已知大型语言模型（LLM）可以使用熟练的内部语言（称为潜在语言）处理信息，这可能与输入或输出语言不同。但是，潜在语言与输入语言之间的差异如何影响下游任务绩效，这在很大程度上尚未探索。尽管许多研究研究LLM的潜在语言，但很少有人解决其在影响任务绩效的重要性。在我们的研究中，我们假设使用潜在语言思考一致地增强了下游任务绩效。为了验证这一点，我们的工作改变了多个下游任务的输入提示语言，并分析了潜在语言的一致性与任务性能之间的相关性。我们创建的数据集由来自不同领域的问题（例如翻译和地理文化）组成，这些问题受到潜在语言的选择影响。对语言选择敏感的翻译和地理文化任务的多个LLM的实验结果表明，保持潜在语言的一致性并非总是必要的，这对于最佳的下游任务表现并不是必需的。这是因为这些模型在最终层附近适应了其内部表示形式，以匹配目标语言，从而降低了一致性对整体性能的影响。

Title: Accelerating Diffusion Language Model Inference via Efficient KV Caching and Guided Diffusion

Authors: Zhanqiu Hu, Jian Meng, Yash Akhauri, Mohamed S. Abdelfattah, Jae-sun Seo, Zhiru Zhang, Udit Gupta
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.21467
Pdf URL: https://arxiv.org/pdf/2505.21467
Copy Paste: [[2505.21467]] Accelerating Diffusion Language Model Inference via Efficient KV Caching and Guided Diffusion(https://arxiv.org/abs/2505.21467)
Keywords: language model, prompt
Abstract: Diffusion language models offer parallel token generation and inherent bidirectionality, promising more efficient and powerful sequence modeling compared to autoregressive approaches. However, state-of-the-art diffusion models (e.g., Dream 7B, LLaDA 8B) suffer from slow inference. While they match the quality of similarly sized Autoregressive (AR) Models (e.g., Qwen2.5 7B, Llama3 8B), their iterative denoising requires multiple full-sequence forward passes, resulting in high computational costs and latency, particularly for long input prompts and long-context scenarios. Furthermore, parallel token generation introduces token incoherence problems, and current sampling heuristics suffer from significant quality drops as the number of denoising steps decreases. We address these limitations with two training-free techniques. First, we propose FreeCache, a Key-Value (KV) approximation caching technique that reuses stable KV projections across denoising steps, effectively reducing the computational cost of DLM inference. Second, we introduce Guided Diffusion, a training-free method that uses a lightweight pretrained autoregressive model to supervise token unmasking, dramatically reducing the total number of denoising iterations without sacrificing quality. We conduct extensive evaluations on open-source reasoning benchmarks, and our combined methods deliver up to a 34x end-to-end speedup without compromising accuracy. For the first time, diffusion language models achieve latency comparable to, and even lower than, that of the widely adopted autoregressive models. Our work paves the way for scaling up diffusion language models to a broader scope of applications across different domains.
摘要：扩散语言模型具有平行的代币产生和固有的双向性，与自回归方法相比，有望更有效，更强大的序列建模。但是，最先进的扩散模型（例如，梦想7b，llada 8b）的推理缓慢。尽管它们匹配了类似尺寸的自回旋（AR）型号（例如Qwen2.5 7b，llama3 8b）的质量，但它们的迭代授权需要多个完整的前进通行证，从而导致高计算成本和潜伏期，尤其是对于长输入提示和长期接收的现象。此外，平行令牌产生引入了令牌不一致的问题，并且当前的采样启发式方法因质量下降而遭受降低的降低，而降低了deno的步骤。我们通过两种无培训技术来解决这些限制。首先，我们提出了Freecache，这是一种键值（KV）近似缓存技术，可重用跨剥离步骤稳定的KV投影，从而有效地降低了DLM推断的计算成本。其次，我们介绍了一种无训练的方法，它使用轻巧的自动回归模型来监督令牌揭开措施，从而大大减少了迭代迭代的总数而无需牺牲质量。我们对开源推理基准进行了广泛的评估，我们的组合方法可提供34倍的端到端速度，而不会损害准确性。作为广泛采用的自回归模型，扩散语言模型首次达到了可比甚至更快的延迟。我们的工作成功地为将扩散语言模型扩展到跨不同领域的应用程序范围铺平了道路。

Title: Scaling External Knowledge Input Beyond Context Windows of LLMs via Multi-Agent Collaboration

Authors: Zijun Liu, Zhennan Wan, Peng Li, Ming Yan, Ji Zhang, Fei Huang, Yang Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.21471
Pdf URL: https://arxiv.org/pdf/2505.21471
Copy Paste: [[2505.21471]] Scaling External Knowledge Input Beyond Context Windows of LLMs via Multi-Agent Collaboration(https://arxiv.org/abs/2505.21471)
Keywords: language model, llm, agent
Abstract: With the rapid advancement of post-training techniques for reasoning and information seeking, large language models (LLMs) can incorporate a large quantity of retrieved knowledge to solve complex tasks. However, the limited context window of LLMs obstructs scaling the amount of external knowledge input, prohibiting further improvement, especially for tasks requiring a significant amount of external knowledge. Existing context window extension methods inevitably cause information loss. LLM-based multi-agent methods emerge as a new paradigm to handle massive input in a distributional manner, where we identify two core bottlenecks in existing knowledge synchronization and reasoning processes. In this work, we develop a multi-agent framework, ExtAgents, to overcome the bottlenecks and enable better scalability in inference-time knowledge integration without longer-context training. Benchmarked with our enhanced multi-hop question answering test, ∞Bench+, and other public test sets including long survey generation, ExtAgents significantly enhances the performance over existing non-training methods with the same amount of external knowledge input, regardless of whether it falls within or exceeds the context window. Moreover, the method maintains high efficiency due to high parallelism. Further study in the coordination of LLM agents on increasing external knowledge input could benefit real-world applications.
摘要：随着用于推理和信息寻求信息的训练后技术的快速发展，大型语言模型（LLMS）可以结合大量检索的知识来解决复杂的任务。但是，LLMS的有限上下文窗口阻碍了扩展外部知识输入的数量，禁止进一步改进，尤其是对于需要大量外部知识的任务。现有上下文窗口扩展方法不可避免地会导致信息丢失。基于LLM的多代理方法是一种新的范式，以分布方式处理大量输入，在该范式中，我们在现有知识同步和推理过程中识别两个核心瓶颈。在这项工作中，我们开发了一个多代理框架，即$ \ textbf {extragents} $，以克服瓶颈，并在没有更长的文本培训的情况下启用推理时间知识集成的更好的可扩展性。 Benchmarked with our enhanced multi-hop question answering test, $\textbf{$\boldsymbol{\infty}$Bench+}$, and other public test sets including long survey generation, ExtAgents significantly enhances the performance over existing non-training methods with the same amount of external knowledge input, regardless of whether it falls $\textit{within or exceeds the context window}$.此外，该方法由于高平行性而保持高效率。在LLM代理人增加外部知识输入方面的协调方面的进一步研究可以使现实世界中的应用有益。

Title: Are Language Models Consequentialist or Deontological Moral Reasoners?

Authors: Keenan Samway, Max Kleiman-Weiner, David Guzman Piedrahita, Rada Mihalcea, Bernhard Schölkopf, Zhijing Jin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.21479
Pdf URL: https://arxiv.org/pdf/2505.21479
Copy Paste: [[2505.21479]] Are Language Models Consequentialist or Deontological Moral Reasoners?(https://arxiv.org/abs/2505.21479)
Keywords: language model, llm
Abstract: As AI systems increasingly navigate applications in healthcare, law, and governance, understanding how they handle ethically complex scenarios becomes critical. Previous work has mainly examined the moral judgments in large language models (LLMs), rather than their underlying moral reasoning process. In contrast, we focus on a large-scale analysis of the moral reasoning traces provided by LLMs. Furthermore, unlike prior work that attempted to draw inferences from only a handful of moral dilemmas, our study leverages over 600 distinct trolley problems as probes for revealing the reasoning patterns that emerge within different LLMs. We introduce and test a taxonomy of moral rationales to systematically classify reasoning traces according to two main normative ethical theories: consequentialism and deontology. Our analysis reveals that LLM chains-of-thought tend to favor deontological principles based on moral obligations, while post-hoc explanations shift notably toward consequentialist rationales that emphasize utility. Our framework provides a foundation for understanding how LLMs process and articulate ethical considerations, an important step toward safe and interpretable deployment of LLMs in high-stakes decision-making environments. Our code is available at this https URL.
摘要：随着AI系统越来越多地在医疗保健，法律和治理中导航应用程序，了解它们如何处理道德上复杂的方案变得至关重要。先前的工作主要研究了大语言模型（LLM）的道德判断，而不是其基本的道德推理过程。相反，我们专注于对LLMS提供的道德推理痕迹的大规模分析。此外，与试图从几个道德困境中提取推断的先前工作不同，我们的研究利用了600多个不同的手推车问题，作为探索揭示不同LLM中出现的推理模式的探针。我们介绍并测试道德理由的分类法，根据两种主要的规范伦理理论系统地对推理痕迹进行分类：结果主义和义务学。我们的分析表明，LLM链链倾向于基于道德义务偏爱道义原理，而事后解释显着转向强调效用的结果主义理由。我们的框架为了解LLM的过程和表达道德考虑如何，这是迈向高风险决策环境中LLM的安全和可解释部署的重要一步。我们的代码可在此HTTPS URL上找到。

Title: UI-Genie: A Self-Improving Approach for Iteratively Boosting MLLM-based Mobile GUI Agents

Authors: Han Xiao, Guozhi Wang, Yuxiang Chai, Zimu Lu, Weifeng Lin, Hao He, Lue Fan, Liuyang Bian, Rui Hu, Liang Liu, Shuai Ren, Yafei Wen, Xiaoxin Chen, Aojun Zhou, Hongsheng Li
Subjects: cs.CL, cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2505.21496
Pdf URL: https://arxiv.org/pdf/2505.21496
Copy Paste: [[2505.21496]] UI-Genie: A Self-Improving Approach for Iteratively Boosting MLLM-based Mobile GUI Agents(https://arxiv.org/abs/2505.21496)
Keywords: llm, agent
Abstract: In this paper, we introduce UI-Genie, a self-improving framework addressing two key challenges in GUI agents: verification of trajectory outcome is challenging and high-quality training data are not scalable. These challenges are addressed by a reward model and a self-improving pipeline, respectively. The reward model, UI-Genie-RM, features an image-text interleaved architecture that efficiently processes historical context and unifies action-level and task-level rewards. To support the training of UI-Genie-RM, we develop deliberately-designed data generation strategies including rule-based verification, controlled trajectory corruption, and hard negative mining. To address the second challenge, a self-improvement pipeline progressively expands solvable complex GUI tasks by enhancing both the agent and reward models through reward-guided exploration and outcome verification in dynamic environments. For training the model, we generate UI-Genie-RM-517k and UI-Genie-Agent-16k, establishing the first reward-specific dataset for GUI agents while demonstrating high-quality synthetic trajectory generation without manual annotation. Experimental results show that UI-Genie achieves state-of-the-art performance across multiple GUI agent benchmarks with three generations of data-model self-improvement. We open-source our complete framework implementation and generated datasets to facilitate further research in this https URL.
摘要：在本文中，我们介绍了UI-Genie，这是一个自我改善的框架，解决了GUI代理中两个关键挑战：轨迹结果的验证是具有挑战性的，高质量的培训数据是不可扩展的。这些挑战分别由奖励模型和自我完善的管道解决。奖励模型UI-Genie-RM具有图像文本交织的体系结构，该体系结构有效地制作了历史上下文，并统一了行动级别和任务级别的奖励。为了进行UI-Genie-RM的培训，我们制定了故意设计的数据通知策略，包括基于规则的验证，受控的轨迹腐败和硬采矿。为了应对第二项挑战，自我改善管道通过在动态环境中通过奖励引导的探索和结果验证来增强代理和奖励模型，从而逐步扩大了可解决的复杂GUI任务。为了训练该模型，我们生成UI-GENIE-RM-517K和UI-GENIE-AGENT-16K，为GUI代理建立了第一个特定于GUI的奖励数据集，同时展示了没有人工注释的高质量合成轨迹轨迹。实验结果表明，UI-Genie在多个GUI代理基准中实现了三代数据模型自我改进的最先进的性能。我们开源的完整框架实现并生成数据集，以促进此HTTPS URL的进一步研究。

Title: Silence is Not Consensus: Disrupting Agreement Bias in Multi-Agent LLMs via Catfish Agent for Clinical Decision Making

Authors: Yihan Wang, Qiao Yan, Zhenghao Xing, Lihao Liu, Junjun He, Chi-Wing Fu, Xiaowei Hu, Pheng-Ann Heng
Subjects: cs.CL, cs.AI, cs.LG, q-bio.OT
Abstract URL: https://arxiv.org/abs/2505.21503
Pdf URL: https://arxiv.org/pdf/2505.21503
Copy Paste: [[2505.21503]] Silence is Not Consensus: Disrupting Agreement Bias in Multi-Agent LLMs via Catfish Agent for Clinical Decision Making(https://arxiv.org/abs/2505.21503)
Keywords: language model, gpt, llm, agent
Abstract: Large language models (LLMs) have demonstrated strong potential in clinical question answering, with recent multi-agent frameworks further improving diagnostic accuracy via collaborative reasoning. However, we identify a recurring issue of Silent Agreement, where agents prematurely converge on diagnoses without sufficient critical analysis, particularly in complex or ambiguous cases. We present a new concept called Catfish Agent, a role-specialized LLM designed to inject structured dissent and counter silent agreement. Inspired by the "catfish effect" in organizational psychology, the Catfish Agent is designed to challenge emerging consensus to stimulate deeper reasoning. We formulate two mechanisms to encourage effective and context-aware interventions: (i) a complexity-aware intervention that modulates agent engagement based on case difficulty, and (ii) a tone-calibrated intervention articulated to balance critique and collaboration. Evaluations on nine medical Q&A and three medical VQA benchmarks show that our approach consistently outperforms both single- and multi-agent LLM frameworks, including leading commercial models such as GPT-4o and DeepSeek-R1.
摘要：大型语言模型（LLM）在临床问题答案中表现出强大的潜力，最近的多机构框架通过协作推理进一步提高了诊断准确性。但是，我们确定了一个反复出现的无声协议问题，在没有足够的批判性分析的情况下，尤其是在复杂或模棱两可的情况下，代理在诊断上过早地融合了诊断。我们提出了一个名为Catfish Agent的新概念，这是一种旨在注入结构化异议和反静音协议的角色专业的LLM。受组织心理学中的``cat鱼效应''的启发，cat鱼代理人旨在挑战新兴共识以刺激更深的推理。我们制定了两种机制，以鼓励有效和上下文感知干预措施：（i）一种复杂性感知的干预措施，根据情况难度调节剂的参与度，以及（ii）一种表达语调的干预措施，以平衡批评和协作。对9个医疗问答和3个医疗VQA基准的评估表明，我们的方法始终优于单一和多代理LLMS框架，包括领先的商业模型，例如GPT-4O和DeepSeek-R1。

Title: How does Alignment Enhance LLMs' Multilingual Capabilities? A Language Neurons Perspective

Authors: Shimao Zhang, Zhejian Lai, Xiang Liu, Shuaijie She, Xiao Liu, Yeyun Gong, Shujian Huang, Jiajun Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.21505
Pdf URL: https://arxiv.org/pdf/2505.21505
Copy Paste: [[2505.21505]] How does Alignment Enhance LLMs' Multilingual Capabilities? A Language Neurons Perspective(https://arxiv.org/abs/2505.21505)
Keywords: llm
Abstract: Multilingual Alignment is an effective and representative paradigm to enhance LLMs' multilingual capabilities, which transfers the capabilities from the high-resource languages to the low-resource languages. Meanwhile, some studies on language-specific neurons reveal that certain neurons are selectively activated in LLMs when processing different languages. This provides a new perspective to analyze and understand LLMs' mechanisms more specifically in multilingual scenarios. In this work, we propose a new finer-grained neuron identification algorithm, which detects language neurons (including language-specific neurons and language-related neurons) and language-agnostic neurons. Furthermore, based on the distributional characteristics of different types of neurons, we divide the LLMs' internal process for multilingual inference into four parts: (1) multilingual understanding, (2) shared semantic space reasoning, (3) multilingual output space transformation, and (4) vocabulary space outputting. Additionally, we systematically analyze the models before and after alignment with a focus on different types of neurons. We also analyze the phenomenon of "Spontaneous Multilingual Alignment". Overall, our work conducts a comprehensive investigation based on different types of neurons, providing empirical results and valuable insights for better understanding the multilingual alignment and multilingual capabilities of LLMs.
摘要：多语言对齐方式是增强LLMS多语言能力的有效且具有代表性的范式，可将功能从高资源语言转移到低资源语言。同时，对语言特异性神经元的一些研究表明，在处理不同语言时，有一些语言特定的神经元在LLM中有选择性地激活。这提供了一种新的观点，可以在多语言方案中更具体地分析和理解LLM的机制。在这项工作中，我们提出了一种新的细粒神经元识别算法，该算法检测语言神经元（包括特定语言的神经元和与语言相关的神经元）和语言敏锐的神经元。此外，根据不同类型的神经元的分布特征，我们将LLMS多语言推断的内部过程分为四个部分：（1）多语言理解，（2）共享的语义空间推理，（3）多语言输出空间变换，以及（4）词汇空间输出。此外，我们在对齐前后系统地分析了对不同类型神经元的模型。我们还分析了“自发多语言对齐”的现象。总体而言，我们的工作基于不同类型的神经元进行了全面的调查，提供了经验结果和有价值的见解，以更好地理解LLMS的多语言对齐和多语言能力。