2025-05-27

Title: Advancing Uto-Aztecan Language Technologies: A Case Study on the Endangered Comanche Language

Authors: Jesus Alvarez C, Daua D. Karajeanes, Ashley Celeste Prado, John Ruttan, Ivory Yang, Sean O'Brien, Vasu Sharma, Kevin Zhu
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.18159
Pdf URL: https://arxiv.org/pdf/2505.18159
Copy Paste: [[2505.18159]] Advancing Uto-Aztecan Language Technologies: A Case Study on the Endangered Comanche Language(https://arxiv.org/abs/2505.18159)
Keywords: gpt, llm, prompt
Abstract: The digital exclusion of endangered languages remains a critical challenge in NLP, limiting both linguistic research and revitalization efforts. This study introduces the first computational investigation of Comanche, an Uto-Aztecan language on the verge of extinction, demonstrating how minimal-cost, community-informed NLP interventions can support language preservation. We present a manually curated dataset of 412 phrases, a synthetic data generation pipeline, and an empirical evaluation of GPT-4o and GPT-4o-mini for language identification. Our experiments reveal that while LLMs struggle with Comanche in zero-shot settings, few-shot prompting significantly improves performance, achieving near-perfect accuracy with just five examples. Our findings highlight the potential of targeted NLP methodologies in low-resource contexts and emphasize that visibility is the first step toward inclusion. By establishing a foundation for Comanche in NLP, we advocate for computational approaches that prioritize accessibility, cultural sensitivity, and community engagement.

Title: Do BERT-Like Bidirectional Models Still Perform Better on Text Classification in the Era of LLMs?

Authors: Junyan Zhang, Yiming Huang, Shuliang Liu, Yubo Gao, Xuming Hu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18215
Pdf URL: https://arxiv.org/pdf/2505.18215
Copy Paste: [[2505.18215]] Do BERT-Like Bidirectional Models Still Perform Better on Text Classification in the Era of LLMs?(https://arxiv.org/abs/2505.18215)
Keywords: llm
Abstract: The rapid adoption of LLMs has overshadowed the potential advantages of traditional BERT-like models in text classification. This study challenges the prevailing "LLM-centric" trend by systematically comparing three categories of methods, i.e., fine-tuning of BERT-like models, utilization of LLM internal states, and zero-shot inference, across six high-difficulty datasets. Our findings reveal that BERT-like models often outperform LLMs. We further categorize datasets into three types, perform PCA and probing experiments, and identify task-specific model strengths: BERT-like models excel in pattern-driven tasks, while LLMs dominate those requiring deep semantics or world knowledge. Based on this, we propose TaMAS, a fine-grained task selection strategy, advocating for a nuanced, task-driven approach over a one-size-fits-all reliance on LLMs.

Title: CoMet: Metaphor-Driven Covert Communication for Multi-Agent Language Games

Authors: Shuhang Xu, Fangwei Zhong
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18218
Pdf URL: https://arxiv.org/pdf/2505.18218
Copy Paste: [[2505.18218]] CoMet: Metaphor-Driven Covert Communication for Multi-Agent Language Games(https://arxiv.org/abs/2505.18218)
Keywords: language model, llm, agent
Abstract: Metaphors are a crucial way for humans to express complex or subtle ideas by comparing one concept to another, often from a different domain. However, many large language models (LLMs) struggle to interpret and apply metaphors in multi-agent language games, hindering their ability to engage in covert communication and semantic evasion, which are crucial for strategic communication. To address this challenge, we introduce CoMet, a framework that enables LLM-based agents to engage in metaphor processing. CoMet combines a hypothesis-based metaphor reasoner with a metaphor generator that improves through self-reflection and knowledge integration. This enhances the agents' ability to interpret and apply metaphors, improving the strategic and nuanced quality of their interactions. We evaluate CoMet on two multi-agent language games - Undercover and Adversarial Taboo - which emphasize Covert Communication and Semantic Evasion. Experimental results demonstrate that CoMet significantly enhances the agents' ability to communicate strategically using metaphors.

Title: IDA-Bench: Evaluating LLMs on Interactive Guided Data Analysis

Authors: Hanyu Li, Haoyu Liu, Tingyu Zhu, Tianyu Guo, Zeyu Zheng, Xiaotie Deng, Michael I. Jordan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18223
Pdf URL: https://arxiv.org/pdf/2505.18223
Copy Paste: [[2505.18223]] IDA-Bench: Evaluating LLMs on Interactive Guided Data Analysis(https://arxiv.org/abs/2505.18223)
Keywords: language model, llm, agent
Abstract: Large Language Models (LLMs) show promise as data analysis agents, but existing benchmarks overlook the iterative nature of the field, where experts' decisions evolve with deeper insights into the dataset. To address this, we introduce IDA-Bench, a novel benchmark evaluating LLM agents in multi-round interactive scenarios. Derived from complex Kaggle notebooks, tasks are presented as sequential natural language instructions by an LLM-simulated user. An agent's performance is judged by comparing its final numerical output to the human-derived baseline. Initial results show that even state-of-the-art coding agents (like Claude-3.7-thinking) succeed on < 50% of the tasks, highlighting limitations not evident in single-turn tests. This work underscores the need to improve LLMs' multi-round capabilities for building more reliable data analysis agents, highlighting the necessity of achieving a balance between instruction following and reasoning.

Title: Think or Not? Exploring Thinking Efficiency in Large Reasoning Models via an Information-Theoretic Lens

Authors: Xixian Yong, Xiao Zhou, Yingying Zhang, Jinlin Li, Yefeng Zheng, Xian Wu
Subjects: cs.CL, cs.AI, cs.IT
Abstract URL: https://arxiv.org/abs/2505.18237
Pdf URL: https://arxiv.org/pdf/2505.18237
Copy Paste: [[2505.18237]] Think or Not? Exploring Thinking Efficiency in Large Reasoning Models via an Information-Theoretic Lens(https://arxiv.org/abs/2505.18237)
Keywords: language model
Abstract: The recent rise of Large Reasoning Models (LRMs) has significantly improved multi-step reasoning performance, but often at the cost of generating excessively long reasoning chains. This paper revisits the efficiency of such reasoning processes through an information-theoretic lens, revealing a fundamental trade-off between reasoning length and semantic efficiency. We propose two metrics, InfoBias and InfoGain, to quantify divergence from ideal reasoning paths and stepwise information contribution, respectively. Empirical analyses show that longer reasoning chains tend to exhibit higher information bias and diminishing information gain, especially for incorrect answers. Motivated by these findings, we introduce an entropy-based Adaptive Think strategy that dynamically halts reasoning once confidence is sufficiently high, improving efficiency while maintaining competitive accuracy. Compared to the Vanilla Think approach (default mode), our strategy yields a 1.10% improvement in average accuracy and a 50.80% reduction in token usage on QwQ-32B across six benchmark tasks spanning diverse reasoning types and difficulty levels, demonstrating superior efficiency and reasoning performance. These results underscore the promise of entropy-based methods for enhancing both accuracy and cost-efficiency in large language model deployment.
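
The entropy-based halting idea in the abstract above can be illustrated with a small sketch. The abstract does not give the exact criterion, so everything here is an assumption for illustration: the function names, the use of Shannon entropy over a per-step answer distribution, the 0.5-bit threshold, and the hypothetical `trace` data are not taken from the paper.

```python
import math

def shannon_entropy(probs):
    """Entropy in bits of a discrete distribution (zero-probability terms are skipped)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def steps_until_confident(step_distributions, threshold_bits=0.5):
    """Return the 1-based index of the first reasoning step whose answer
    distribution has entropy below the threshold, or the total number of
    steps if the threshold is never reached."""
    for i, probs in enumerate(step_distributions, start=1):
        if shannon_entropy(probs) < threshold_bits:
            return i
    return len(step_distributions)

# Hypothetical answer distributions after each reasoning step (four answer options).
trace = [
    [0.25, 0.25, 0.25, 0.25],        # no information yet: 2 bits
    [0.60, 0.20, 0.10, 0.10],        # still uncertain
    [0.97, 0.01, 0.01, 0.01],        # confident: entropy falls below 0.5 bits
    [0.99, 0.005, 0.0025, 0.0025],   # extra reasoning that could be skipped
]
halt_step = steps_until_confident(trace)
print(halt_step)
```

In this toy trace the loop stops one step before the end, which is the kind of token saving the abstract attributes to halting once confidence is high.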

Title: Taming LLMs with Negative Samples: A Reference-Free Framework to Evaluate Presentation Content with Actionable Feedback

Authors: Ananth Muppidi, Tarak Das, Sambaran Bandyopadhyay, Tripti Shukla, Dharun D A
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18240
Pdf URL: https://arxiv.org/pdf/2505.18240
Copy Paste: [[2505.18240]] Taming LLMs with Negative Samples: A Reference-Free Framework to Evaluate Presentation Content with Actionable Feedback(https://arxiv.org/abs/2505.18240)
Keywords: language model, llm
Abstract: Automatically generating presentation slides is an important problem in the era of generative AI. This paper focuses on evaluating multimodal content in presentation slides that can effectively summarize a document and convey concepts to a broad audience. We introduce a benchmark dataset, RefSlides, consisting of human-made high-quality presentations that span various topics. Next, we propose a set of metrics to characterize different intrinsic properties of the content of a presentation and present REFLEX, an evaluation approach that generates scores and actionable feedback for these metrics. We achieve this by generating negative presentation samples with different degrees of metric-specific perturbations and using them to fine-tune LLMs. This reference-free evaluation technique does not require ground truth presentations during inference. Our extensive automated and human experiments demonstrate that our evaluation approach outperforms classical heuristic-based and state-of-the-art large language model-based evaluations in generating scores and explanations.

Title: Multi-Scale Probabilistic Generation Theory: A Hierarchical Framework for Interpreting Large Language Models

Authors: Yukin Zhang, Qi Dong
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18244
Pdf URL: https://arxiv.org/pdf/2505.18244
Copy Paste: [[2505.18244]] Multi-Scale Probabilistic Generation Theory: A Hierarchical Framework for Interpreting Large Language Models(https://arxiv.org/abs/2505.18244)
Keywords: language model, gpt
Abstract: Large Transformer-based language models achieve remarkable performance but remain opaque in how they plan, structure, and realize text. We introduce Multi-Scale Probabilistic Generation Theory (MSPGT), a hierarchical framework that factorizes generation into three semantic scales (global context, intermediate structure, and local word choices) and aligns each scale with specific layer ranges in Transformer architectures. To identify scale boundaries, we propose two complementary metrics: attention span thresholds and inter-layer mutual information peaks. Across four representative models (GPT-2, BERT, RoBERTa, and T5), these metrics yield stable local/intermediate/global partitions, corroborated by probing tasks and causal interventions. We find that decoder-only models allocate more layers to intermediate and global processing while encoder-only models emphasize local feature extraction. Through targeted interventions, we demonstrate that local-scale manipulations primarily influence lexical diversity, intermediate-scale modifications affect sentence structure and length, and global-scale perturbations impact discourse coherence, all with statistically significant effects. MSPGT thus offers a unified, architecture-agnostic method for interpreting, diagnosing, and controlling large language models, bridging the gap between mechanistic interpretability and emergent capabilities.

Title: MetaGen Blended RAG: Higher Accuracy for Domain-Specific Q&A Without Fine-Tuning

Authors: Kunal Sawarkar, Shivam R. Solanki, Abhilasha Mangal
Subjects: cs.CL, cs.AI, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2505.18247
Pdf URL: https://arxiv.org/pdf/2505.18247
Copy Paste: [[2505.18247]] MetaGen Blended RAG: Higher Accuracy for Domain-Specific Q&A Without Fine-Tuning(https://arxiv.org/abs/2505.18247)
Keywords: gpt, llm, retrieval-augmented generation
Abstract: Despite the widespread exploration of Retrieval-Augmented Generation (RAG), its deployment in enterprises for domain-specific datasets remains limited due to poor answer accuracy. These corpora, often shielded behind firewalls in private enterprise knowledge bases, have complex, domain-specific terminology that is rarely seen by LLMs during pre-training. They exhibit significant semantic variability across domains (such as networking, military, or legal), or even within a single domain like medicine, and thus result in poor context precision for RAG systems. Currently, in such situations, fine-tuning or RAG with fine-tuning is attempted, but these approaches are slow and expensive, and their accuracy does not generalize as new domain-specific data emerges. We propose an approach for Enterprise Search that focuses on enhancing the retriever for a domain-specific corpus through hybrid query indexes and metadata enrichment. This 'MetaGen Blended RAG' method constructs a metadata generation pipeline using key concepts, topics, and acronyms, and then creates a metadata-enriched hybrid index with boosted search queries. This approach avoids overfitting and generalizes effectively across domains. On the PubMedQA benchmark for the biomedical domain, the proposed method achieves 82% retrieval accuracy and 77% RAG accuracy, surpassing all previous RAG accuracy results without fine-tuning and setting a new benchmark for zero-shot results while outperforming much larger models like GPT3.5. The results are even comparable to the best fine-tuned models on this dataset, and we further demonstrate the robustness and scalability of the approach by evaluating it on other Q&A datasets such as SQuAD and NQ.

Title: TAGS: A Test-Time Generalist-Specialist Framework with Retrieval-Augmented Reasoning and Verification

Authors: Jianghao Wu, Feilong Tang, Yulong Li, Ming Hu, Haochen Xue, Shoaib Jameel, Yutong Xie, Imran Razzak
Subjects: cs.CL, cs.AI, cs.MA
Abstract URL: https://arxiv.org/abs/2505.18283
Pdf URL: https://arxiv.org/pdf/2505.18283
Copy Paste: [[2505.18283]] TAGS: A Test-Time Generalist-Specialist Framework with Retrieval-Augmented Reasoning and Verification(https://arxiv.org/abs/2505.18283)
Keywords: language model, gpt, llm, prompt, chain-of-thought
Abstract: Recent advances such as Chain-of-Thought prompting have significantly improved large language models (LLMs) in zero-shot medical reasoning. However, prompting-based methods often remain shallow and unstable, while fine-tuned medical LLMs suffer from poor generalization under distribution shifts and limited adaptability to unseen clinical scenarios. To address these limitations, we present TAGS, a test-time framework that combines a broadly capable generalist with a domain-specific specialist to offer complementary perspectives without any model fine-tuning or parameter updates. To support this generalist-specialist reasoning process, we introduce two auxiliary modules: a hierarchical retrieval mechanism that provides multi-scale exemplars by selecting examples based on both semantic and rationale-level similarity, and a reliability scorer that evaluates reasoning consistency to guide final answer aggregation. TAGS achieves strong performance across nine MedQA benchmarks, boosting GPT-4o accuracy by 13.8%, DeepSeek-R1 by 16.8%, and improving a vanilla 7B model from 14.1% to 23.9%. These results surpass several fine-tuned medical LLMs, without any parameter updates. The code will be available at this https URL.

Title: Thinking Fast and Right: Balancing Accuracy and Reasoning Length with Adaptive Rewards

Authors: Jinyan Su, Claire Cardie
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.18298
Pdf URL: https://arxiv.org/pdf/2505.18298
Copy Paste: [[2505.18298]] Thinking Fast and Right: Balancing Accuracy and Reasoning Length with Adaptive Rewards(https://arxiv.org/abs/2505.18298)
Keywords: language model, llm
Abstract: Large language models (LLMs) have demonstrated strong reasoning abilities in mathematical tasks, often enhanced through reinforcement learning (RL). However, RL-trained models frequently produce unnecessarily long reasoning traces -- even for simple queries -- leading to increased inference costs and latency. While recent approaches attempt to control verbosity by adding length penalties to the reward function, these methods rely on fixed penalty terms that are hard to tune and cannot adapt as the model's reasoning capability evolves, limiting their effectiveness. In this work, we propose an adaptive reward-shaping method that enables LLMs to "think fast and right" -- producing concise outputs without sacrificing correctness. Our method dynamically adjusts the reward trade-off between accuracy and response length based on model performance: when accuracy is high, the length penalty increases to encourage faster length reduction; when accuracy drops, the penalty is relaxed to preserve correctness. This adaptive reward accelerates early-stage length reduction while avoiding over-compression in later stages. Experiments across multiple datasets show that our approach consistently and dramatically reduces reasoning length while largely maintaining accuracy, offering a new direction for cost-efficient adaptive reasoning in large-scale language models.
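
The adaptive trade-off described in the abstract above (raise the length penalty when accuracy is high, relax it when accuracy drops) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' formulation: the additive update rule, the target accuracy of 0.8, the step size, and the names `shaped_reward` and `update_penalty` are all hypothetical.

```python
def shaped_reward(correct, length, penalty, max_length=1000):
    """Correctness reward minus a length penalty scaled by the current coefficient."""
    return (1.0 if correct else 0.0) - penalty * (length / max_length)

def update_penalty(penalty, batch_accuracy, target_accuracy=0.8,
                   step=0.05, max_penalty=1.0):
    """Raise the penalty while accuracy meets the (assumed) target;
    relax it when accuracy falls below the target."""
    if batch_accuracy >= target_accuracy:
        penalty += step
    else:
        penalty -= step
    # Keep the coefficient in a sane range.
    return min(max(penalty, 0.0), max_penalty)

# Hypothetical per-batch accuracies observed during training.
penalty = 0.1
history = []
for acc in [0.90, 0.85, 0.70, 0.82]:
    penalty = update_penalty(penalty, acc)
    history.append(round(penalty, 2))
print(history)
```

A fixed penalty corresponds to never calling `update_penalty`; the adaptive variant lets the pressure toward shorter outputs follow the model's current accuracy.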

Title: Is It Bad to Work All the Time? Cross-Cultural Evaluation of Social Norm Biases in GPT-4

Authors: Zhuozhuo Joy Liu, Farhan Samir, Mehar Bhatia, Laura K. Nelson, Vered Shwartz
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18322
Pdf URL: https://arxiv.org/pdf/2505.18322
Copy Paste: [[2505.18322]] Is It Bad to Work All the Time? Cross-Cultural Evaluation of Social Norm Biases in GPT-4(https://arxiv.org/abs/2505.18322)
Keywords: gpt, llm
Abstract: LLMs have been demonstrated to align with the values of Western or North American cultures. Prior work predominantly showed this effect through leveraging surveys that directly ask (originally people and now also LLMs) about their values. However, it is hard to believe that LLMs would consistently apply those values in real-world scenarios. To address that, we take a bottom-up approach, asking LLMs to reason about cultural norms in narratives from different cultures. We find that GPT-4 tends to generate norms that, while not necessarily incorrect, are significantly less culture-specific. In addition, while it avoids overtly generating stereotypes, the stereotypical representations of certain cultures are merely hidden rather than suppressed in the model, and such stereotypes can be easily recovered. Addressing these challenges is a crucial step towards developing LLMs that fairly serve their diverse user base.

Title: PerMedCQA: Benchmarking Large Language Models on Medical Consumer Question Answering in Persian Language

Authors: Naghmeh Jamali, Milad Mohammadi, Danial Baledi, Zahra Rezvani, Hesham Faili
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18331
Pdf URL: https://arxiv.org/pdf/2505.18331
Copy Paste: [[2505.18331]] PerMedCQA: Benchmarking Large Language Models on Medical Consumer Question Answering in Persian Language(https://arxiv.org/abs/2505.18331)
Keywords: language model, llm
Abstract: Medical consumer question answering (CQA) is crucial for empowering patients by providing personalized and reliable health information. Despite recent advances in large language models (LLMs) for medical QA, consumer-oriented and multilingual resources, particularly in low-resource languages like Persian, remain sparse. To bridge this gap, we present PerMedCQA, the first Persian-language benchmark for evaluating LLMs on real-world, consumer-generated medical questions. Curated from a large medical QA forum, PerMedCQA contains 68,138 question-answer pairs, refined through careful data cleaning from an initial set of 87,780 raw entries. We evaluate several state-of-the-art multilingual and instruction-tuned LLMs, utilizing MedJudge, a novel rubric-based evaluation framework driven by an LLM grader, validated against expert human annotators. Our results highlight key challenges in multilingual medical QA and provide valuable insights for developing more accurate and context-aware medical assistance systems. The data is publicly available on this https URL

Title: Model Editing with Graph-Based External Memory

Authors: Yash Kumar Atri, Ahmed Alaa, Thomas Hartvigsen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.18343
Pdf URL: https://arxiv.org/pdf/2505.18343
Copy Paste: [[2505.18343]] Model Editing with Graph-Based External Memory(https://arxiv.org/abs/2505.18343)
Keywords: language model, gpt, llm, hallucination
Abstract: Large language models (LLMs) have revolutionized natural language processing, yet their practical utility is often limited by persistent issues of hallucinations and outdated parametric knowledge. Although post-training model editing offers a pathway for dynamic updates, existing methods frequently suffer from overfitting and catastrophic forgetting. To tackle these challenges, we propose a novel framework that leverages hyperbolic geometry and graph neural networks for precise and stable model edits. We introduce HYPE (HYperbolic Parameter Editing), which comprises three key components: (i) Hyperbolic Graph Construction, which uses Poincaré embeddings to represent knowledge triples in hyperbolic space, preserving hierarchical relationships and preventing unintended side effects by ensuring that edits to parent concepts do not inadvertently affect child concepts; (ii) Möbius-Transformed Updates, which apply hyperbolic addition to propagate edits while maintaining structural consistency within the hyperbolic manifold, unlike conventional Euclidean updates that distort relational distances; and (iii) Dual Stabilization, which combines gradient masking and periodic GNN parameter resetting to prevent catastrophic forgetting by focusing updates on critical parameters and preserving long-term knowledge. Experiments on CounterFact, CounterFact+, and MQuAKE with GPT-J and GPT2-XL demonstrate that HYPE significantly enhances edit stability, factual accuracy, and multi-hop reasoning.
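
The "hyperbolic addition" mentioned in the abstract above is, in the Poincaré ball model, Möbius addition. The sketch below implements the standard formula for curvature -1; how HYPE actually applies it to parameter updates is not described in the abstract, so the example vectors and helper names are illustrative assumptions only.

```python
def dot(u, v):
    """Euclidean inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def mobius_add(x, y):
    """Mobius addition of x and y in the Poincare ball (curvature -1):
    ((1 + 2<x,y> + |y|^2) x + (1 - |x|^2) y) / (1 + 2<x,y> + |x|^2 |y|^2)."""
    xy, x2, y2 = dot(x, y), dot(x, x), dot(y, y)
    denom = 1 + 2 * xy + x2 * y2
    return [((1 + 2 * xy + y2) * a + (1 - x2) * b) / denom
            for a, b in zip(x, y)]

# Two hypothetical points inside the unit ball.
x = [0.5, 0.2]
y = [0.3, -0.4]
z = mobius_add(x, y)
print(z, dot(z, z) < 1)  # the result stays inside the unit ball
```

Unlike plain vector addition, which can push a point outside the ball, Möbius addition keeps results on the manifold, which is the structural-consistency property the abstract refers to.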

Title: The Unreasonable Effectiveness of Model Merging for Cross-Lingual Transfer in LLMs

Authors: Lucas Bandarkar, Nanyun Peng
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.18356
Pdf URL: https://arxiv.org/pdf/2505.18356
Copy Paste: [[2505.18356]] The Unreasonable Effectiveness of Model Merging for Cross-Lingual Transfer in LLMs(https://arxiv.org/abs/2505.18356)
Keywords: language model, llm
Abstract: Large language models (LLMs) still struggle across tasks outside of high-resource languages. In this work, we investigate cross-lingual transfer to lower-resource languages where task-specific post-training data is scarce. Building on prior work, we first validate that the subsets of model parameters that matter most for mathematical reasoning and multilingual capabilities are distinctly non-overlapping. To exploit this implicit separability between task and target language parameterization, we develop and analyze numerous modular frameworks to improve the composition of the two during fine-tuning. These methods generally employ freezing parameters or post hoc model merging to assign math and language improvement to different key parts of the LLM. In the absence of in-language math data, we demonstrate that the modular approaches successfully improve upon baselines across three languages, four models, and two fine-tuning paradigms (full and LoRA). Furthermore, we identify the most consistently successful modular method to be fine-tuning separate language and math experts and model merging via Layer-Swapping, somewhat surprisingly. We offer possible explanations for this result via recent works on the linearity of task vectors. We further explain this by empirically showing that reverting less useful fine-tuning updates after training often outperforms freezing them from the start.

Title: SchemaGraphSQL: Efficient Schema Linking with Pathfinding Graph Algorithms for Text-to-SQL on Large-Scale Databases

Authors: AmirHossein Safdarian, Milad Mohammadi, Ehsan Jahanbakhsh, Mona Shahamat Naderi, Heshaam Faili
Subjects: cs.CL, cs.AI, cs.DB
Abstract URL: https://arxiv.org/abs/2505.18363
Pdf URL: https://arxiv.org/pdf/2505.18363
Copy Paste: [[2505.18363]] SchemaGraphSQL: Efficient Schema Linking with Pathfinding Graph Algorithms for Text-to-SQL on Large-Scale Databases(https://arxiv.org/abs/2505.18363)
Keywords: language model, llm, prompt
Abstract: Text-to-SQL systems translate natural language questions into executable SQL queries, and recent progress with large language models (LLMs) has driven substantial improvements in this task. Schema linking remains a critical component in Text-to-SQL systems, reducing prompt size for models with narrow context windows and sharpening model focus even when the entire schema fits. We present a zero-shot, training-free schema linking approach that first constructs a schema graph based on foreign key relations, then uses a single prompt to Gemini 2.5 Flash to extract source and destination tables from the user query, followed by applying classical path-finding algorithms and post-processing to identify the optimal sequence of tables and columns that should be joined, enabling the LLM to generate more accurate SQL queries. Despite being simple, cost-effective, and highly scalable, our method achieves state-of-the-art results on the BIRD benchmark, outperforming previous specialized, fine-tuned, and complex multi-step LLM-based approaches. We conduct detailed ablation studies to examine the precision-recall trade-off in our framework. Additionally, we evaluate the execution accuracy of our schema filtering method compared to other approaches across various model sizes.
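
The graph step in the abstract above (build a schema graph from foreign-key relations, then run a classical path-finding algorithm between source and destination tables) can be sketched with breadth-first search. The schema, table names, and function names below are hypothetical, and the LLM call that extracts the source and destination tables is omitted; the paper's own algorithm and post-processing may differ.

```python
from collections import deque

def build_schema_graph(foreign_keys):
    """Build an undirected adjacency map from (table, referenced_table) pairs."""
    graph = {}
    for a, b in foreign_keys:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph

def shortest_join_path(graph, source, destination):
    """Breadth-first search for the shortest chain of tables linking source
    to destination; returns None if either table is unknown or unreachable."""
    if source not in graph or destination not in graph:
        return None
    queue = deque([[source]])
    visited = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == destination:
            return path
        for neighbor in sorted(graph[path[-1]]):  # sorted for deterministic output
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None

# Hypothetical foreign-key relations for a small retail schema.
graph = build_schema_graph([
    ("orders", "customers"),
    ("order_items", "orders"),
    ("order_items", "products"),
    ("products", "suppliers"),
    ("addresses", "customers"),
])
path = shortest_join_path(graph, "customers", "products")
print(path)
```

Only the tables on the returned path would then be passed to the SQL-generating model, which is how schema linking shrinks the prompt.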

Title: ShIOEnv: A CLI Behavior-Capturing Environment Enabling Grammar-Guided Command Synthesis for Dataset Curation

Authors: Jarrod Ragsdale, Rajendra Boppana
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.18374
Pdf URL: https://arxiv.org/pdf/2505.18374
Copy Paste: [[2505.18374]] ShIOEnv: A CLI Behavior-Capturing Environment Enabling Grammar-Guided Command Synthesis for Dataset Curation(https://arxiv.org/abs/2505.18374)
Keywords: language model, gpt
Abstract: Command-line interfaces (CLIs) provide structured textual environments for system administration. Explorations have been performed using pre-trained language models (PLMs) to simulate these environments for safe interaction in high-risk environments. However, their use has been constrained to frozen, large parameter models like GPT. For smaller architectures to reach a similar level of believability, a rich dataset of CLI interactions is required. Existing public datasets focus on mapping natural-language tasks to commands, omitting crucial execution data such as exit codes, outputs, and environmental side effects, limiting their usability for behavioral modeling. We introduce a Shell Input-Output Environment (ShIOEnv), which casts command construction as a Markov Decision Process whose state is the partially built sequence and whose actions append arguments. After each action, ShIOEnv executes the candidate and returns its exit status, output, and progress toward a minimal-length behavioral objective. Due to the intractable nature of the combinatorial argument state-action space, we derive a context-free grammar from man pages to mask invalid arguments from being emitted. We explore random and proximal-policy optimization (PPO)-optimized sampling of unrestricted and grammar-masked action spaces to produce four exploration strategies. We observed that grammar masking and PPO significantly improve sample efficiency to produce a higher quality dataset (maximizing the number of arguments while minimizing redundancies). Policy-generated datasets of shell input-output behavior pairs are used to fine-tune CodeT5, where we observe 85% improvements in BLEU-4 when constraining the action space to grammar productions, with an additional 26% improvement when applying PPO. The ShIOEnv environment and curated command behavior datasets are released for use in future research.
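The idea of masking invalid arguments while a command is built step by step can be sketched as follows. This is a toy illustration under stated assumptions: the vocabulary and the "grammar" (each valid option may appear at most once) are hypothetical and not derived from any man page, sampling is uniformly random rather than PPO-optimized, and no command is actually executed, so exit status and output are not modeled.

```python
import random

# Hypothetical argument vocabulary for `ls`; "-z" and "--bogus" stand in
# for arguments that the grammar should reject.
VOCAB = ["-l", "-a", "-h", "--color=auto", "-z", "--bogus"]
VALID_OPTIONS = {"-l", "-a", "-h", "--color=auto"}

def valid_actions(state):
    """Grammar mask: arguments the toy grammar allows after the partial command."""
    used = set(state[1:])
    return [tok for tok in VOCAB if tok in VALID_OPTIONS and tok not in used]

def rollout(rng, max_args=3, masked=True):
    """Sample one command; the state is the partially built argument sequence."""
    state = ["ls"]
    for _ in range(max_args):
        actions = valid_actions(state) if masked else VOCAB
        if not actions:
            break
        state.append(rng.choice(actions))  # the action appends one argument
    return state

def is_valid(command):
    """A command is valid if every argument is known and none is repeated."""
    args = command[1:]
    return all(a in VALID_OPTIONS for a in args) and len(args) == len(set(args))

rng = random.Random(0)
masked = [rollout(rng, masked=True) for _ in range(200)]
unmasked = [rollout(rng, masked=False) for _ in range(200)]
print("valid fraction with mask:", sum(map(is_valid, masked)) / len(masked))
print("valid fraction without mask:", sum(map(is_valid, unmasked)) / len(unmasked))
```

Every masked rollout is valid by construction, while unrestricted sampling wastes most rollouts on invalid commands, which is the sample-efficiency argument in miniature.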

Title: NileChat: Towards Linguistically Diverse and Culturally Aware LLMs for Local Communities

Authors: Abdellah El Mekki, Houdaifa Atou, Omer Nacar, Shady Shehata, Muhammad Abdul-Mageed
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.18383
Pdf URL: https://arxiv.org/pdf/2505.18383
Copy Paste: [[2505.18383]] NileChat: Towards Linguistically Diverse and Culturally Aware LLMs for Local Communities(https://arxiv.org/abs/2505.18383)
Keywords: language model, llm, chat
Abstract: Enhancing the linguistic capabilities of Large Language Models (LLMs) to include low-resource languages is a critical research area. Current research directions predominantly rely on synthetic data generated by translating English corpora, which, while demonstrating promising linguistic understanding and translation abilities, often results in models aligned with source language culture. These models frequently fail to represent the cultural heritage and values of local communities. This work proposes a methodology to create both synthetic and retrieval-based pre-training data tailored to a specific community, considering its (i) language, (ii) cultural heritage, and (iii) cultural values. We demonstrate our methodology using Egyptian and Moroccan dialects as testbeds, chosen for their linguistic and cultural richness and current underrepresentation in LLMs. As a proof-of-concept, we develop NileChat, a 3B parameter LLM adapted for Egyptian and Moroccan communities, incorporating their language, cultural heritage, and values. Our results on various understanding, translation, and cultural and values alignment benchmarks show that NileChat outperforms existing Arabic-aware LLMs of similar size and performs on par with larger models. We share our methods, data, and models with the community to promote the inclusion and coverage of more diverse communities in LLM development.

Title: RaDeR: Reasoning-aware Dense Retrieval Models

Authors: Debrup Das, Sam O' Nuallain, Razieh Rahimi
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2505.18405
Pdf URL: https://arxiv.org/pdf/2505.18405
Copy Paste: [[2505.18405]] RaDeR: Reasoning-aware Dense Retrieval Models(https://arxiv.org/abs/2505.18405)
Keywords: language model, llm, chain-of-thought
Abstract: We propose RaDeR, a set of reasoning-based dense retrieval models trained with data derived from mathematical problem solving using large language models (LLMs). Our method leverages retrieval-augmented reasoning trajectories of an LLM and self-reflective relevance evaluation, enabling the creation of both diverse and hard-negative samples for reasoning-intensive relevance. RaDeR retrievers, trained for mathematical reasoning, effectively generalize to diverse reasoning tasks in the BRIGHT and RAR-b benchmarks, consistently outperforming strong baselines overall. RaDeR achieves significantly higher performance than baselines on the Math and Coding splits. In addition, RaDeR presents the first dense retriever that outperforms BM25 when queries are Chain-of-Thought reasoning steps, underscoring the critical role of reasoning-based retrieval to augment reasoning language models. Furthermore, RaDeR achieves comparable or superior performance while using only 2.5% of the training data used by the concurrent work REASONIR, highlighting the quality of our synthesized training data.

Title: DanmakuTPPBench: A Multi-modal Benchmark for Temporal Point Process Modeling and Understanding

Authors: Yue Jiang, Jichu Li, Yang Liu, Dingkang Yang, Feng Zhou, Quyu Kong
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.18411
Pdf URL: https://arxiv.org/pdf/2505.18411
Copy Paste: [[2505.18411]] DanmakuTPPBench: A Multi-modal Benchmark for Temporal Point Process Modeling and Understanding(https://arxiv.org/abs/2505.18411)
Keywords: language model, llm, agent
Abstract: We introduce DanmakuTPPBench, a comprehensive benchmark designed to advance multi-modal Temporal Point Process (TPP) modeling in the era of Large Language Models (LLMs). While TPPs have been widely studied for modeling temporal event sequences, existing datasets are predominantly unimodal, hindering progress in models that require joint reasoning over temporal, textual, and visual information. To address this gap, DanmakuTPPBench comprises two complementary components: (1) DanmakuTPP-Events, a novel dataset derived from the Bilibili video platform, where user-generated bullet comments (Danmaku) naturally form multi-modal events annotated with precise timestamps, rich textual content, and corresponding video frames; (2) DanmakuTPP-QA, a challenging question-answering dataset constructed via a novel multi-agent pipeline powered by state-of-the-art LLMs and multi-modal LLMs (MLLMs), targeting complex temporal-textual-visual reasoning. We conduct extensive evaluations using both classical TPP models and recent MLLMs, revealing significant performance gaps and limitations in current methods' ability to model multi-modal event dynamics. Our benchmark establishes strong baselines and calls for further integration of TPP modeling into the multi-modal language modeling landscape. The code and dataset have been released at this https URL

Title: Retrieval Augmented Generation-based Large Language Models for Bridging Transportation Cybersecurity Legal Knowledge Gaps

Authors: Khandakar Ashrafi Akbar, Md Nahiyan Uddin, Latifur Khan, Trayce Hockstad, Mizanur Rahman, Mashrur Chowdhury, Bhavani Thuraisingham
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18426
Pdf URL: https://arxiv.org/pdf/2505.18426
Copy Paste: [[2505.18426]] Retrieval Augmented Generation-based Large Language Models for Bridging Transportation Cybersecurity Legal Knowledge Gaps(https://arxiv.org/abs/2505.18426)
Keywords: language model, llm, hallucination, retrieval augmented generation, retrieval-augmented generation
Abstract: As connected and automated transportation systems evolve, there is a growing need for federal and state authorities to revise existing laws and develop new statutes to address emerging cybersecurity and data privacy challenges. This study introduces a Retrieval-Augmented Generation (RAG) based Large Language Model (LLM) framework designed to support policymakers by extracting relevant legal content and generating accurate, inquiry-specific responses. The framework focuses on reducing hallucinations in LLMs by using a curated set of domain-specific questions to guide response generation. By incorporating retrieval mechanisms, the system enhances the factual grounding and specificity of its outputs. Our analysis shows that the proposed RAG-based LLM outperforms leading commercial LLMs across four evaluation metrics: AlignScore, ParaScore, BERTScore, and ROUGE, demonstrating its effectiveness in producing reliable and context-aware legal insights. This approach offers a scalable, AI-driven method for legislative analysis, supporting efforts to update legal frameworks in line with advancements in transportation technologies.

Title: Efficient Long CoT Reasoning in Small Language Models

Authors: Zhaoyang Wang, Jinqi Jiang, Tian Qiu, Hui Liu, Xianfeng Tang, Huaxiu Yao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18440
Pdf URL: https://arxiv.org/pdf/2505.18440
Copy Paste: [[2505.18440]] Efficient Long CoT Reasoning in Small Language Models(https://arxiv.org/abs/2505.18440)
Keywords: language model, chain-of-thought
Abstract: Recent large reasoning models such as DeepSeek-R1 exhibit strong complex problem-solving abilities by generating long chain-of-thought (CoT) reasoning steps. It is challenging to directly train small language models (SLMs) to produce long CoT. Thus, distillation becomes a practical method to give SLMs such reasoning ability. However, long CoT often contains a lot of redundant content (e.g., overthinking steps), which may make it hard for SLMs to learn given their relatively limited capacity and generalization. To address this issue, we propose a simple yet effective method to prune unnecessary steps in long CoT, and then employ an on-policy method for the SLM itself to curate valid and useful long CoT training data. In this way, SLMs can effectively learn efficient long CoT reasoning and preserve competitive performance at the same time. Experimental results across a series of mathematical reasoning benchmarks demonstrate the effectiveness of the proposed method in distilling long CoT reasoning ability into SLMs, maintaining competitive performance while significantly reducing the generation of redundant reasoning steps.

Title: BRIT: Bidirectional Retrieval over Unified Image-Text Graph

Authors: Ainulla Khan, Yamada Moyuru, Srinidhi Akella
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.18450
Pdf URL: https://arxiv.org/pdf/2505.18450
Copy Paste: [[2505.18450]] BRIT: Bidirectional Retrieval over Unified Image-Text Graph(https://arxiv.org/abs/2505.18450)
Keywords: language model, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) has emerged as a promising technique to enhance the quality and relevance of responses generated by large language models. While recent advancements have mainly focused on improving RAG for text-based queries, RAG on multi-modal documents containing both texts and images has not been fully explored, especially when fine-tuning does not work. This paper proposes BRIT, a novel multi-modal RAG framework that effectively unifies various text-image connections in the document into a multi-modal graph and retrieves the texts and images as a query-specific sub-graph. By traversing both image-to-text and text-to-image paths in the graph, BRIT retrieves not only directly query-relevant images and texts but also further relevant content for answering complex cross-modal multi-hop questions. To evaluate the effectiveness of BRIT, we introduce the MM-RAG test set, specifically designed for multi-modal question answering tasks that require understanding text-image relations. Our comprehensive experiments demonstrate the superiority of BRIT, highlighting its ability to handle cross-modal questions on multi-modal documents.

Title: MedScore: Factuality Evaluation of Free-Form Medical Answers

Authors: Heyuan Huang, Alexandra DeLucia, Vijay Murari Tiyyala, Mark Dredze
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.18452
Pdf URL: https://arxiv.org/pdf/2505.18452
Copy Paste: [[2505.18452]] MedScore: Factuality Evaluation of Free-Form Medical Answers(https://arxiv.org/abs/2505.18452)
Keywords: language model, llm, hallucination
Abstract: While Large Language Models (LLMs) can generate fluent and convincing responses, they are not necessarily correct. This is especially apparent in the popular decompose-then-verify factuality evaluation pipeline, where LLMs evaluate generations by decomposing the generations into individual, valid claims. Factuality evaluation is especially important for medical answers, since incorrect medical information could seriously harm the patient. However, existing factuality systems are a poor match for the medical domain, as they are typically only evaluated on objective, entity-centric, formulaic texts such as biographies and historical topics. This differs from condition-dependent, conversational, hypothetical, sentence-structure diverse, and subjective medical answers, which makes decomposition into valid facts challenging. We propose MedScore, a new approach to decomposing medical answers into condition-aware valid facts. Our method extracts up to three times more valid facts than existing methods, reducing hallucination and vague references, and retaining condition-dependency in facts. The resulting factuality score varies significantly by decomposition method, verification corpus, and backbone LLM, highlighting the importance of customizing each step for reliable factuality evaluation.
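The general shape of a decompose-then-verify pipeline can be sketched as below. This is not MedScore itself: the sentence-splitting decomposer and the substring-matching verifier are deliberately naive placeholders for the LLM-based steps, the corpus and answer are invented toy strings, and scoring an answer as the fraction of supported claims is an assumption borrowed from common practice rather than a definition taken from the abstract.

```python
def decompose(answer):
    """Placeholder decomposition: one claim per sentence.

    A real system would prompt an LLM to produce condition-aware claims.
    """
    return [s.strip() for s in answer.split(".") if s.strip()]

def verify(claim, corpus):
    """Placeholder verifier: supported if the claim appears verbatim in the corpus."""
    return any(claim.lower() in doc.lower() for doc in corpus)

def factuality_score(answer, corpus):
    """Fraction of decomposed claims that the verifier marks as supported."""
    claims = decompose(answer)
    if not claims:
        return 0.0
    return sum(verify(c, corpus) for c in claims) / len(claims)

# Toy verification corpus and answer, invented for illustration only.
corpus = ["Ibuprofen can reduce fever. Ibuprofen may irritate the stomach."]
answer = "Ibuprofen can reduce fever. Ibuprofen cures infections."
print(factuality_score(answer, corpus))
```

Swapping either placeholder changes the score for the same answer, which mirrors the abstract's point that the result depends on the decomposition method, the verification corpus, and the backbone model.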

Title: Hybrid Latent Reasoning via Reinforcement Learning

Authors: Zhenrui Yue, Bowen Jin, Huimin Zeng, Honglei Zhuang, Zhen Qin, Jinsung Yoon, Lanyu Shang, Jiawei Han, Dong Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.18454
Pdf URL: https://arxiv.org/pdf/2505.18454
Copy Paste: [[2505.18454]] Hybrid Latent Reasoning via Reinforcement Learning(https://arxiv.org/abs/2505.18454)
Keywords: language model, llm, chain-of-thought
Abstract: Recent advances in large language models (LLMs) have introduced latent reasoning as a promising alternative to autoregressive reasoning. By performing internal computation with hidden states from previous steps, latent reasoning benefits from more informative features rather than sampling a discrete chain-of-thought (CoT) path. Yet latent reasoning approaches are often incompatible with LLMs, as their continuous paradigm conflicts with the discrete nature of autoregressive generation. Moreover, these methods rely on CoT traces for training and thus fail to exploit the inherent reasoning patterns of LLMs. In this work, we explore latent reasoning by leveraging the intrinsic capabilities of LLMs via reinforcement learning (RL). To this end, we introduce hybrid reasoning policy optimization (HRPO), an RL-based hybrid latent reasoning approach that (1) integrates prior hidden states into sampled tokens with a learnable gating mechanism, and (2) initializes training with predominantly token embeddings while progressively incorporating more hidden features. This design maintains LLMs' generative capabilities and incentivizes hybrid reasoning using both discrete and continuous representations. In addition, the hybrid HRPO introduces stochasticity into latent reasoning via token sampling, thereby enabling RL-based optimization without requiring CoT trajectories. Extensive evaluations across diverse benchmarks show that HRPO outperforms prior methods in both knowledge- and reasoning-intensive tasks. Furthermore, HRPO-trained LLMs remain interpretable and exhibit intriguing behaviors like cross-lingual patterns and shorter completion lengths, highlighting the potential of our RL-based approach and offering insights for future work in latent reasoning.

Title: Anchored Diffusion Language Model

Authors: Litu Rout, Constantine Caramanis, Sanjay Shakkottai
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.18456
Pdf URL: https://arxiv.org/pdf/2505.18456
Copy Paste: [[2505.18456]] Anchored Diffusion Language Model(https://arxiv.org/abs/2505.18456)
Keywords: language model, chain-of-thought
Abstract: Diffusion Language Models (DLMs) promise parallel generation and bidirectional context, yet they underperform autoregressive (AR) models in both likelihood modeling and generated text quality. We identify that this performance gap arises when important tokens (e.g., key words or low-frequency words that anchor a sentence) are masked early in the forward process, limiting contextual information for accurate reconstruction. To address this, we introduce the Anchored Diffusion Language Model (ADLM), a novel two-stage framework that first predicts distributions over important tokens via an anchor network, and then predicts the likelihoods of missing tokens conditioned on the anchored predictions. ADLM significantly improves test perplexity on LM1B and OpenWebText, achieving up to 25.4% gains over prior DLMs, and narrows the gap with strong AR baselines. It also achieves state-of-the-art performance in zero-shot generalization across seven benchmarks and surpasses AR models in MAUVE score, which marks the first time a DLM generates better human-like text than an AR model. Theoretically, we derive an Anchored Negative Evidence Lower Bound (ANELBO) objective and show that anchoring improves sample complexity and likelihood modeling. Beyond diffusion, anchoring boosts performance in AR models and enhances reasoning in math and logic tasks, outperforming existing chain-of-thought approaches.

Title: Measuring South Asian Biases in Large Language Models

Authors: Mamnuya Rinki, Chahat Raj, Anjishnu Mukherjee, Ziwei Zhu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.18466
Pdf URL: https://arxiv.org/pdf/2505.18466
Copy Paste: [[2505.18466]] Measuring South Asian Biases in Large Language Models(https://arxiv.org/abs/2505.18466)
Keywords: language model, llm, prompt
Abstract: Evaluations of Large Language Models (LLMs) often overlook intersectional and culturally specific biases, particularly in underrepresented multilingual regions like South Asia. This work addresses these gaps by conducting a multilingual and intersectional analysis of LLM outputs across 10 Indo-Aryan and Dravidian languages, identifying how cultural stigmas influenced by purdah and patriarchy are reinforced in generative tasks. We construct a culturally grounded bias lexicon capturing previously unexplored intersectional dimensions including gender, religion, marital status, and number of children. We use our lexicon to quantify intersectional bias and the effectiveness of self-debiasing in open-ended generations (e.g., storytelling, hobbies, and to-do lists), where bias manifests subtly and remains largely unexamined in multilingual contexts. Finally, we evaluate two self-debiasing strategies (simple and complex prompts) to measure their effectiveness in reducing culturally specific bias in Indo-Aryan and Dravidian languages. Our approach offers a nuanced lens into cultural bias by introducing a novel bias lexicon and evaluation framework that extends beyond Eurocentric or small-scale multilingual settings.

Title: Investigating AI Rater Effects of Large Language Models: GPT, Claude, Gemini, and DeepSeek

Authors: Hong Jiao, Dan Song, Won-Chan Lee
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.18486
Pdf URL: https://arxiv.org/pdf/2505.18486
Copy Paste: [[2505.18486]] Investigating AI Rater Effects of Large Language Models: GPT, Claude, Gemini, and DeepSeek(https://arxiv.org/abs/2505.18486)
Keywords: language model, gpt, llm, prompt, chat
Abstract: Large language models (LLMs) have been widely explored for automated scoring in low-stakes assessment to facilitate learning and instruction. Empirical evidence on which LLM produces the most reliable scores and induces the fewest rater effects needs to be collected before LLMs are used for automated scoring in practice. This study compared ten LLMs (ChatGPT 3.5, ChatGPT 4, ChatGPT 4o, OpenAI o1, Claude 3.5 Sonnet, Gemini 1.5, Gemini 1.5 Pro, Gemini 2.0, DeepSeek V3, and DeepSeek R1) with human expert raters in scoring two types of writing tasks. The accuracy of the holistic and analytic scores from LLMs compared with human raters was evaluated in terms of Quadratic Weighted Kappa. Intra-rater consistency across prompts was compared in terms of Cronbach's alpha. Rater effects of LLMs were evaluated and compared with human raters using the Many-Facet Rasch model. The results in general supported the use of ChatGPT 4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet, which showed high scoring accuracy, better rater reliability, and fewer rater effects.
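The two agreement statistics named in this abstract have standard textbook definitions, sketched below in plain Python. The scores are invented toy data, category labels are assumed to be integers 0..K-1, and the study's own scoring setup (and its Many-Facet Rasch analysis) is not reproduced. Quadratic Weighted Kappa is computed as 1 - sum(w*O) / sum(w*E) with weights w_ij = (i - j)^2 / (K - 1)^2, and Cronbach's alpha as k/(k-1) * (1 - sum of item variances / variance of total scores).

```python
from statistics import variance

def quadratic_weighted_kappa(rater_a, rater_b, num_categories):
    """QWK for two integer score lists with labels in 0..num_categories-1.

    Undefined (division by zero) if both raters use a single identical category.
    """
    k, n = num_categories, len(rater_a)
    observed = [[0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    num = den = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2
            num += weight * observed[i][j]
            den += weight * hist_a[i] * hist_b[j] / n  # chance-expected count
    return 1.0 - num / den

def cronbach_alpha(items):
    """Cronbach's alpha; `items` is a list of k score lists over the same n cases."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]
    item_variance_sum = sum(variance(scores) for scores in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))

human = [0, 1, 2, 3]  # toy scores, not from the study
print(quadratic_weighted_kappa(human, [0, 1, 2, 3], 4))  # perfect agreement
print(quadratic_weighted_kappa(human, [3, 2, 1, 0], 4))  # fully reversed scores
print(cronbach_alpha([[1, 2, 3, 4], [1, 2, 3, 4]]))      # identical repeated scorings
```

Because the weights grow with the squared distance between scores, QWK penalizes a disagreement of three score points nine times as heavily as a disagreement of one point, which is why it suits ordinal essay scores.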

Title: The Pragmatic Mind of Machines: Tracing the Emergence of Pragmatic Competence in Large Language Models

Authors: Kefan Yu, Qingcheng Zeng, Weihao Xuan, Wanxin Li, Jingyi Wu, Rob Voigt
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.18497
Pdf URL: https://arxiv.org/pdf/2505.18497
Copy Paste: [[2505.18497]] The Pragmatic Mind of Machines: Tracing the Emergence of Pragmatic Competence in Large Language Models(https://arxiv.org/abs/2505.18497)
Keywords: language model, llm
Abstract: Current large language models (LLMs) have demonstrated emerging capabilities in social intelligence tasks, including implicature resolution (Sravanthi et al., 2024) and theory-of-mind reasoning (Shapira et al., 2024), both of which require substantial pragmatic understanding. However, how LLMs acquire this competence throughout the training process remains poorly understood. In this work, we introduce ALTPRAG, a dataset grounded in the pragmatic concept of alternatives, designed to evaluate whether LLMs at different training stages can accurately infer nuanced speaker intentions. Each instance pairs two contextually appropriate but pragmatically distinct continuations, enabling fine-grained assessment of both pragmatic interpretation and contrastive reasoning. We systematically evaluate 22 LLMs across key training stages: pre-training, supervised fine-tuning (SFT), and preference optimization, to examine the development of pragmatic competence. Our results show that even base models exhibit notable sensitivity to pragmatic cues, which improves consistently with increases in model and data scale. Additionally, SFT and RLHF contribute further gains, particularly in cognitive-pragmatic reasoning. These findings highlight pragmatic competence as an emergent and compositional property of LLM training and offer new insights for aligning models with human communicative norms.

Title: How Does Sequence Modeling Architecture Influence Base Capabilities of Pre-trained Language Models? Exploring Key Architecture Design Principles to Avoid Base Capabilities Degradation

Authors: Xin Lu, Yanyan Zhao, Si Wei, Shijin Wang, Bing Qin, Ting Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.18522
Pdf URL: https://arxiv.org/pdf/2505.18522
Copy Paste: [[2505.18522]] How Does Sequence Modeling Architecture Influence Base Capabilities of Pre-trained Language Models? Exploring Key Architecture Design Principles to Avoid Base Capabilities Degradation(https://arxiv.org/abs/2505.18522)
Keywords: language model
Abstract: Pre-trained language models represented by the Transformer have been proven to possess strong base capabilities, and the representative self-attention mechanism in the Transformer has become a classic in sequence modeling architectures. Unlike work that proposes sequence modeling architectures to improve the efficiency of the attention mechanism, this work focuses on the impact of sequence modeling architectures on base capabilities. Specifically, our concern is: How exactly do sequence modeling architectures affect the base capabilities of pre-trained language models? In this work, we first point out that the mixed domain pre-training setting commonly adopted in existing architecture design works fails to adequately reveal the differences in base capabilities among various architectures. To address this, we propose a limited domain pre-training setting with out-of-distribution testing, which successfully uncovers significant differences in base capabilities among architectures at an early stage. Next, we analyze the base capabilities of stateful sequence modeling architectures, and find that they exhibit significant degradation in base capabilities compared to the Transformer. Then, through a series of architecture component analyses, we summarize a key architecture design principle: a sequence modeling architecture needs to possess full-sequence arbitrary selection capability to avoid degradation in base capabilities. Finally, we empirically validate this principle using an extremely simple Top-1 element selection architecture and further generalize it to a more practical Top-1 chunk selection architecture. Experimental results validate our proposed sequence modeling architecture design principle and suggest that our work can serve as a valuable reference for future architecture improvements and novel designs.

Title: metaTextGrad: Automatically optimizing language model optimizers

Authors: Guowei Xu, Mert Yuksekgonul, Carlos Guestrin, James Zou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.18524
Pdf URL: https://arxiv.org/pdf/2505.18524
Copy Paste: [[2505.18524]] metaTextGrad: Automatically optimizing language model optimizers(https://arxiv.org/abs/2505.18524)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) are increasingly used in learning algorithms, evaluations, and optimization tasks. Recent studies have shown that using LLM-based optimizers to automatically optimize model prompts, demonstrations, predictions themselves, or other components can significantly enhance the performance of AI systems, as demonstrated by frameworks such as DSPy and TextGrad. However, optimizers built on language models themselves are usually designed by humans with manual design choices; optimizers themselves are not optimized. Moreover, these optimizers are general purpose by design, to be useful to a broad audience, and are not tailored for specific tasks. To address these challenges, we propose metaTextGrad, which focuses on designing a meta-optimizer to further enhance existing optimizers and align them to be good optimizers for a given task. Our approach consists of two key components: a meta prompt optimizer and a meta structure optimizer. The combination of these two significantly improves performance across multiple benchmarks, achieving an average absolute performance improvement of up to 6% compared to the best baseline.

Title: Reinforcement Fine-Tuning Powers Reasoning Capability of Multimodal Large Language Models

Authors: Haoyuan Sun, Jiaqi Wu, Bo Xia, Yifu Luo, Yifei Zhao, Kai Qin, Xufei Lv, Tiantian Zhang, Yongzhe Chang, Xueqian Wang
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2505.18536
Pdf URL: https://arxiv.org/pdf/2505.18536
Copy Paste: [[2505.18536]] Reinforcement Fine-Tuning Powers Reasoning Capability of Multimodal Large Language Models(https://arxiv.org/abs/2505.18536)
Keywords: language model, llm
Abstract: Standing in 2025, at a critical juncture in the pursuit of Artificial General Intelligence (AGI), reinforcement fine-tuning (RFT) has demonstrated significant potential in enhancing the reasoning capability of large language models (LLMs) and has led to the development of cutting-edge AI models such as OpenAI-o1 and DeepSeek-R1. Moreover, the efficient application of RFT to enhance the reasoning capability of multimodal large language models (MLLMs) has attracted widespread attention from the community. In this position paper, we argue that reinforcement fine-tuning powers the reasoning capability of multimodal large language models. To begin with, we provide a detailed introduction to the fundamental background knowledge that researchers interested in this field should be familiar with. Furthermore, we meticulously summarize the improvements of RFT in powering reasoning capability of MLLMs into five key points: diverse modalities, diverse tasks and domains, better training algorithms, abundant benchmarks and thriving engineering frameworks. Finally, we propose five promising directions for future research that the community might consider. We hope that this position paper will provide valuable insights to the community at this pivotal stage in the advancement toward AGI. Summary of works done on RFT for MLLMs is available at this https URL.

Title: Business as \textit{Rule}sual: A Benchmark and Framework for Business Rule Flow Modeling with LLMs

Authors: Chen Yang, Ruping Xu, Ruizhe Li, Bin Cao, Jing Fan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.18542
Pdf URL: https://arxiv.org/pdf/2505.18542
Copy Paste: [[2505.18542]] Business as \textit{Rule}sual: A Benchmark and Framework for Business Rule Flow Modeling with LLMs(https://arxiv.org/abs/2505.18542)
Keywords: language model, llm
Abstract: Process mining aims to discover, monitor and optimize the actual behaviors of real processes. While prior work has mainly focused on extracting procedural action flows from instructional texts, rule flows embedded in business documents remain underexplored. To this end, we introduce a novel annotated Chinese dataset, BPRF, which contains 50 business process documents with 326 explicitly labeled business rules across multiple domains. Each rule is represented as a (condition, action) pair, and we annotate logical dependencies between rules (sequential, conditional, or parallel). We also propose ExIde, a framework for automatic business rule extraction and dependency relationship identification using large language models (LLMs). We evaluate ExIde using 12 state-of-the-art (SOTA) LLMs on the BPRF dataset, benchmarking current LLMs on both rule extraction and dependency classification. Our results demonstrate the effectiveness of ExIde in extracting structured business rules and analyzing their interdependencies with current SOTA LLMs, paving the way for more automated and interpretable business process automation.

Title: Composable Cross-prompt Essay Scoring by Merging Models

Authors: Sanwoo Lee, Kun Liang, Yunfang Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.18548
Pdf URL: https://arxiv.org/pdf/2505.18548
Copy Paste: [[2505.18548]] Composable Cross-prompt Essay Scoring by Merging Models(https://arxiv.org/abs/2505.18548)
Keywords: llm, prompt
Abstract: Recent advances in cross-prompt automated essay scoring (AES) typically train models jointly on all source prompts, often requiring additional access to unlabeled target prompt essays simultaneously. However, using all sources is suboptimal in our pilot study, and re-accessing source datasets during adaptation raises privacy concerns. We propose a source-free adaptation approach that selectively merges individually trained source models' parameters instead of datasets. In particular, we simulate joint training through linear combinations of task vectors -- the parameter updates from fine-tuning. To optimize the combination's coefficients, we propose Prior-encoded Information Maximization (PIM), an unsupervised objective which promotes the model's score discriminability regularized by priors pre-computed from the sources. We employ Bayesian optimization as an efficient optimizer of PIM. Experimental results with LLMs on in-dataset and cross-dataset adaptation show that our method (1) consistently outperforms training jointly on all sources, (2) maintains superior robustness compared to other merging methods, (3) excels under severe distribution shifts where recent leading cross-prompt methods struggle, all while retaining computational efficiency.
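
The merging step described in the abstract (linear combinations of task vectors, where a task vector is the parameter update from fine-tuning) can be illustrated with a minimal sketch. This is not the authors' code: the toy parameter values, the dictionary layout, and the function names `task_vector` and `merge` are assumptions made for illustration, and the coefficients are fixed by hand here rather than found by optimizing the PIM objective with Bayesian optimization.

```python
def task_vector(finetuned, base):
    """Parameter update from fine-tuning: tau = theta_finetuned - theta_base."""
    return {name: [f - b for f, b in zip(finetuned[name], base[name])]
            for name in base}


def merge(base, task_vectors, coeffs):
    """Merged parameters: theta = theta_base + sum_i coeff_i * tau_i."""
    merged = {name: list(values) for name, values in base.items()}
    for tau, lam in zip(task_vectors, coeffs):
        for name in merged:
            merged[name] = [m + lam * t for m, t in zip(merged[name], tau[name])]
    return merged


# Toy parameters (hypothetical values): one two-element "layer" per model.
base = {"w": [1.0, 2.0]}
source_a = {"w": [2.0, 2.0]}  # model fine-tuned on source prompt A
source_b = {"w": [1.0, 4.0]}  # model fine-tuned on source prompt B

taus = [task_vector(source_a, base), task_vector(source_b, base)]
merged = merge(base, taus, coeffs=[0.5, 0.25])
print(merged)  # {'w': [1.5, 2.5]}
```

Only model parameters are combined, so no source essays need to be accessed at merge time, which is the property the abstract emphasizes.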

Title: MSA at BEA 2025 Shared Task: Disagreement-Aware Instruction Tuning for Multi-Dimensional Evaluation of LLMs as Math Tutors

Authors: Baraa Hikal, Mohamed Basem, Islam Oshallah, Ali Hamdi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.18549
Pdf URL: https://arxiv.org/pdf/2505.18549
Copy Paste: [[2505.18549]] MSA at BEA 2025 Shared Task: Disagreement-Aware Instruction Tuning for Multi-Dimensional Evaluation of LLMs as Math Tutors(https://arxiv.org/abs/2505.18549)
Keywords: language model, llm
Abstract: We present MSA-MathEval, our submission to the BEA 2025 Shared Task on evaluating AI tutor responses across four instructional dimensions: Mistake Identification, Mistake Location, Providing Guidance, and Actionability. Our approach uses a unified training pipeline to fine-tune a single instruction-tuned language model across all tracks, without any task-specific architectural changes. To improve prediction reliability, we introduce a disagreement-aware ensemble inference strategy that enhances coverage of minority labels. Our system achieves strong performance across all tracks, ranking 1st in Providing Guidance, 3rd in Actionability, and 4th in both Mistake Identification and Mistake Location. These results demonstrate the effectiveness of scalable instruction tuning and disagreement-driven modeling for robust, multi-dimensional evaluation of LLMs as educational tutors.

Title: Unraveling Misinformation Propagation in LLM Reasoning

Authors: Yiyang Feng, Yichen Wang, Shaobo Cui, Boi Faltings, Mina Lee, Jiawei Zhou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.18555
Pdf URL: https://arxiv.org/pdf/2505.18555
Copy Paste: [[2505.18555]] Unraveling Misinformation Propagation in LLM Reasoning(https://arxiv.org/abs/2505.18555)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities in reasoning, positioning them as promising tools for supporting human problem-solving. However, what happens when their performance is affected by misinformation, i.e., incorrect inputs introduced by users due to oversights or gaps in knowledge? Such misinformation is prevalent in real-world interactions with LLMs, yet how it propagates within LLMs' reasoning process remains underexplored. Focusing on mathematical reasoning, we present a comprehensive analysis of how misinformation affects intermediate reasoning steps and final answers. We also examine how effectively LLMs can correct misinformation when explicitly instructed to do so. Even with explicit instructions, LLMs succeed less than half the time in rectifying misinformation, despite possessing correct internal knowledge, leading to significant accuracy drops (10.02% - 72.20%). Further analysis shows that applying factual corrections early in the reasoning process most effectively reduces misinformation propagation, and fine-tuning on synthesized data with early-stage corrections significantly improves reasoning factuality. Our work offers a practical approach to mitigating misinformation propagation.

Title: Exploring the Vulnerability of the Content Moderation Guardrail in Large Language Models via Intent Manipulation

Authors: Jun Zhuang, Haibo Jin, Ye Zhang, Zhengjian Kang, Wenbin Zhang, Gaby G. Dagher, Haohan Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18556
Pdf URL: https://arxiv.org/pdf/2505.18556
Copy Paste: [[2505.18556]] Exploring the Vulnerability of the Content Moderation Guardrail in Large Language Models via Intent Manipulation(https://arxiv.org/abs/2505.18556)
Keywords: language model, gpt, llm, prompt, chain-of-thought
Abstract: Intent detection, a core component of natural language understanding, has considerably evolved as a crucial mechanism in safeguarding large language models (LLMs). While prior work has applied intent detection to enhance LLMs' moderation guardrails, showing a significant success against content-level jailbreaks, the robustness of these intent-aware guardrails under malicious manipulations remains under-explored. In this work, we investigate the vulnerability of intent-aware guardrails and demonstrate that LLMs exhibit implicit intent detection capabilities. We propose a two-stage intent-based prompt-refinement framework, IntentPrompt, that first transforms harmful inquiries into structured outlines and further reframes them into declarative-style narratives by iteratively optimizing prompts via feedback loops to enhance jailbreak success for red-teaming purposes. Extensive experiments across four public benchmarks and various black-box LLMs indicate that our framework consistently outperforms several cutting-edge jailbreak methods and evades even advanced Intent Analysis (IA) and Chain-of-Thought (CoT)-based defenses. Specifically, our "FSTR+SPIN" variant achieves attack success rates ranging from 88.25% to 96.54% against CoT-based defenses on the o1 model, and from 86.75% to 97.12% on the GPT-4o model under IA-based defenses. These findings highlight a critical weakness in LLMs' safety mechanisms and suggest that intent manipulation poses a growing challenge to content moderation guardrails.

Title: TAG-INSTRUCT: Controlled Instruction Complexity Enhancement through Structure-based Augmentation

Authors: He Zhu, Zhiwen Ruan, Junyou Su, Xingwei He, Wenjia Zhang, Yun Chen, Guanhua Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.18557
Pdf URL: https://arxiv.org/pdf/2505.18557
Copy Paste: [[2505.18557]] TAG-INSTRUCT: Controlled Instruction Complexity Enhancement through Structure-based Augmentation(https://arxiv.org/abs/2505.18557)
Keywords: language model, llm, prompt
Abstract: High-quality instruction data is crucial for developing large language models (LLMs), yet existing approaches struggle to effectively control instruction complexity. We present TAG-INSTRUCT, a novel framework that enhances instruction complexity through structured semantic compression and controlled difficulty augmentation. Unlike previous prompt-based methods operating on raw text, TAG-INSTRUCT compresses instructions into a compact tag space and systematically enhances complexity through RL-guided tag expansion. Through extensive experiments, we show that TAG-INSTRUCT outperforms existing instruction complexity augmentation approaches. Our analysis reveals that operating in tag space provides superior controllability and stability across different instruction synthesis frameworks.

Title: From Word to World: Evaluate and Mitigate Culture Bias via Word Association Test

Authors: Xunlian Dai, Li Zhou, Benyou Wang, Haizhou Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18562
Pdf URL: https://arxiv.org/pdf/2505.18562
Copy Paste: [[2505.18562]] From Word to World: Evaluate and Mitigate Culture Bias via Word Association Test(https://arxiv.org/abs/2505.18562)
Keywords: language model, llm, prompt
Abstract: The human-centered word association test (WAT) serves as a cognitive proxy, revealing sociocultural variations through lexical-semantic patterns. We extend this test into an LLM-adaptive, free-relation task to assess the alignment of large language models (LLMs) with cross-cultural cognition. To mitigate the culture preference, we propose CultureSteer, an innovative approach that integrates a culture-aware steering mechanism to guide semantic representations toward culturally specific spaces. Experiments show that current LLMs exhibit significant bias toward Western (notably American) cultural schemas at the word association level. In contrast, our model substantially improves cross-cultural alignment, surpassing prompt-based methods in capturing diverse semantic associations. Further validation on culture-sensitive downstream tasks confirms its efficacy in fostering cognitive alignment across cultures. This work contributes a novel methodological paradigm for enhancing cultural awareness in LLMs, advancing the development of more inclusive language technologies.

Title: Removal of Hallucination on Hallucination: Debate-Augmented RAG

Authors: Wentao Hu, Wengyu Zhang, Yiyang Jiang, Chen Jason Zhang, Xiaoyong Wei, Qing Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18581
Pdf URL: https://arxiv.org/pdf/2505.18581
Copy Paste: [[2505.18581]] Removal of Hallucination on Hallucination: Debate-Augmented RAG(https://arxiv.org/abs/2505.18581)
Keywords: hallucination, retrieval-augmented generation, agent
Abstract: Retrieval-Augmented Generation (RAG) enhances factual accuracy by integrating external knowledge, yet it introduces a critical issue: erroneous or biased retrieval can mislead generation, compounding hallucinations, a phenomenon we term Hallucination on Hallucination. To address this, we propose Debate-Augmented RAG (DRAG), a training-free framework that integrates Multi-Agent Debate (MAD) mechanisms into both retrieval and generation stages. In retrieval, DRAG employs structured debates among proponents, opponents, and judges to refine retrieval quality and ensure factual reliability. In generation, DRAG introduces asymmetric information roles and adversarial debates, enhancing reasoning robustness and mitigating factual inconsistencies. Evaluations across multiple tasks demonstrate that DRAG improves retrieval reliability, reduces RAG-induced hallucinations, and significantly enhances overall factual accuracy. Our code is available at this https URL.

Title: Safety Alignment via Constrained Knowledge Unlearning

Authors: Zesheng Shi, Yucheng Zhou, Jing Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18588
Pdf URL: https://arxiv.org/pdf/2505.18588
Copy Paste: [[2505.18588]] Safety Alignment via Constrained Knowledge Unlearning(https://arxiv.org/abs/2505.18588)
Keywords: language model, llm
Abstract: Despite significant progress in safety alignment, large language models (LLMs) remain susceptible to jailbreak attacks. Existing defense mechanisms have not fully deleted harmful knowledge in LLMs, which allows such attacks to bypass safeguards and produce harmful outputs. To address this challenge, we propose a novel safety alignment strategy, Constrained Knowledge Unlearning (CKU), which focuses on two primary objectives: knowledge localization and retention, and unlearning harmful knowledge. CKU works by scoring neurons in specific multilayer perceptron (MLP) layers to identify a subset U of neurons associated with useful knowledge. During the unlearning process, CKU prunes the gradients of neurons in U to preserve valuable knowledge while effectively mitigating harmful content. Experimental results demonstrate that CKU significantly enhances model safety without compromising overall performance, offering a superior balance between safety and utility compared to existing methods. Additionally, our analysis of neuron knowledge sensitivity across various MLP layers provides valuable insights into the mechanics of safety alignment and model knowledge editing.
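
The gradient-pruning idea in the abstract (score neurons, collect the useful ones in a subset U, and prune their gradients during unlearning) can be sketched as follows. This is an illustrative toy, not the paper's implementation: the scores are made-up numbers, each neuron is reduced to a single weight, and the names `select_protected`, `prune_gradients`, and `sgd_step` are hypothetical. The abstract does not say how neurons are scored, so the sketch simply takes the top-k scores as U.

```python
def select_protected(scores, k):
    """Return the indices of the k highest-scoring neurons (the subset U)."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:k])


def prune_gradients(grads, protected):
    """Zero the gradient of every neuron in U so the update leaves it unchanged."""
    return [0.0 if i in protected else g for i, g in enumerate(grads)]


def sgd_step(weights, grads, lr):
    """Plain gradient step applied to one weight per neuron (a simplification)."""
    return [w - lr * g for w, g in zip(weights, grads)]


scores = [0.9, 0.1, 0.7, 0.2]   # hypothetical "useful knowledge" scores
weights = [1.0, 1.0, 1.0, 1.0]
grads = [0.5, 0.5, 0.5, 0.5]    # hypothetical gradients of the unlearning loss

U = select_protected(scores, k=2)
pruned = prune_gradients(grads, U)
updated = sgd_step(weights, pruned, lr=1.0)
print(sorted(U), pruned, updated)
```

Neurons 0 and 2 keep their weights because their gradients are masked, while the remaining neurons are free to change during unlearning.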

Title: Debate-to-Detect: Reformulating Misinformation Detection as a Real-World Debate with Large Language Models

Authors: Chen Han, Wenzhen Zheng, Xijin Tang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18596
Pdf URL: https://arxiv.org/pdf/2505.18596
Copy Paste: [[2505.18596]] Debate-to-Detect: Reformulating Misinformation Detection as a Real-World Debate with Large Language Models(https://arxiv.org/abs/2505.18596)
Keywords: language model, gpt, llm, agent
Abstract: The proliferation of misinformation on digital platforms reveals the limitations of traditional detection methods, which mostly rely on static classification and fail to capture the intricate process of real-world fact-checking. Despite advancements in Large Language Models (LLMs) that enhance automated reasoning, their application to misinformation detection remains hindered by issues of logical inconsistency and superficial verification. In response, we introduce Debate-to-Detect (D2D), a novel Multi-Agent Debate (MAD) framework that reformulates misinformation detection as a structured adversarial debate. Inspired by fact-checking workflows, D2D assigns domain-specific profiles to each agent and orchestrates a five-stage debate process, including Opening Statement, Rebuttal, Free Debate, Closing Statement, and Judgment. To transcend traditional binary classification, D2D introduces a multi-dimensional evaluation mechanism that assesses each claim across five distinct dimensions: Factuality, Source Reliability, Reasoning Quality, Clarity, and Ethics. Experiments with GPT-4o on two fake news datasets demonstrate significant improvements over baseline methods, and the case study highlights D2D's capability to iteratively refine evidence while improving decision transparency, representing a substantial advancement towards robust and interpretable misinformation detection. The code will be open-sourced in a future release.

Title: Flex-Judge: Think Once, Judge Anywhere

Authors: Jongwoo Ko, Sungnyun Kim, Sungwoo Cho, Se-Young Yun
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18601
Pdf URL: https://arxiv.org/pdf/2505.18601
Copy Paste: [[2505.18601]] Flex-Judge: Think Once, Judge Anywhere(https://arxiv.org/abs/2505.18601)
Keywords: language model, llm
Abstract: Human-generated reward signals are critical for aligning generative models with human preferences, guiding both training and inference-time evaluations. While large language models (LLMs) employed as proxy evaluators, i.e., LLM-as-a-Judge, significantly reduce the costs associated with manual annotations, they typically require extensive modality-specific training data and fail to generalize well across diverse multimodal tasks. In this paper, we propose Flex-Judge, a reasoning-guided multimodal judge model that leverages minimal textual reasoning data to robustly generalize across multiple modalities and evaluation formats. Our core intuition is that structured textual reasoning explanations inherently encode generalizable decision-making patterns, enabling an effective transfer to multimodal judgments, e.g., with images or videos. Empirical results demonstrate that Flex-Judge, despite being trained on significantly less text data, achieves competitive or superior performance compared to state-of-the-art commercial APIs and extensively trained multimodal evaluators. Notably, Flex-Judge has broad impact in modalities such as molecules, where comprehensive evaluation benchmarks are scarce, underscoring its practical value in resource-constrained domains. Our framework highlights reasoning-based text supervision as a powerful, cost-effective alternative to traditional annotation-intensive approaches, substantially advancing scalable multimodal model-as-a-judge.

Title: PM-KVQ: Progressive Mixed-precision KV Cache Quantization for Long-CoT LLMs

Authors: Tengxuan Liu, Shiyao Li, Jiayi Yang, Tianchen Zhao, Feng Zhou, Xiaohui Song, Guohao Dai, Shengen Yan, Huazhong Yang, Yu Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.18610
Pdf URL: https://arxiv.org/pdf/2505.18610
Copy Paste: [[2505.18610]] PM-KVQ: Progressive Mixed-precision KV Cache Quantization for Long-CoT LLMs(https://arxiv.org/abs/2505.18610)
Keywords: language model, llm, chain-of-thought
Abstract: Recently, significant progress has been made in developing reasoning-capable Large Language Models (LLMs) through long Chain-of-Thought (CoT) techniques. However, this long-CoT reasoning process imposes substantial memory overhead due to the large Key-Value (KV) Cache. Post-training KV Cache quantization has emerged as a promising compression technique and has been extensively studied in short-context scenarios. However, directly applying existing methods to long-CoT LLMs causes significant performance degradation for two reasons: (1) Large cumulative error: Existing methods fail to adequately leverage available memory, and they directly quantize the KV Cache during each decoding step, leading to large cumulative quantization error. (2) Short-context calibration: Due to Rotary Positional Embedding (RoPE), the use of short-context data during calibration fails to account for the distribution of less frequent channels in the Key Cache, resulting in performance loss. We propose Progressive Mixed-Precision KV Cache Quantization (PM-KVQ) for long-CoT LLMs to address the above issues in two ways: (1) To reduce cumulative error, we design a progressive quantization strategy to gradually lower the bit-width of KV Cache in each block. Then, we propose block-wise memory allocation to assign a higher bit-width to more sensitive transformer blocks. (2) To increase the calibration length without additional overhead, we propose a new calibration strategy that applies positional interpolation to short calibration data to approximate the data distribution of long-context data. Extensive experiments on 7B-70B long-CoT LLMs show that PM-KVQ improves reasoning benchmark performance by up to 8% over SOTA baselines under the same memory budget. Our code is available at this https URL.
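
The intuition behind lowering the bit-width gradually can be sketched with a toy uniform quantizer. Everything below is an assumption made for illustration and is not taken from the paper: min-max uniform quantization, the candidate bit-widths (16, 8, 4, 2), the hypothetical budget of 4096 bits, and the rule in `pick_bits` that chooses the largest bit-width whose cache footprint still fits the budget. Block-wise memory allocation and calibration with positional interpolation are not modelled.

```python
def quantize(values, bits):
    """Uniform min-max quantization to 2**bits levels, followed by dequantization."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return list(values)
    scale = (hi - lo) / (2 ** bits - 1)
    return [lo + round((v - lo) / scale) * scale for v in values]


def max_error(values, bits):
    """Largest absolute reconstruction error at the given bit-width."""
    return max(abs(v - q) for v, q in zip(values, quantize(values, bits)))


def pick_bits(num_tokens, values_per_token, budget_bits, choices=(16, 8, 4, 2)):
    """Largest bit-width whose cache footprint still fits the memory budget."""
    for bits in choices:
        if num_tokens * values_per_token * bits <= budget_bits:
            return bits
    return choices[-1]


budget = 4096  # hypothetical memory budget, in bits
schedule = [pick_bits(n, values_per_token=4, budget_bits=budget)
            for n in (32, 100, 200, 400)]
print(schedule)  # [16, 8, 4, 2]

keys = [0.0, 0.1, 0.5, 1.0]  # toy cache entries
print(max_error(keys, 8) < max_error(keys, 2))  # True
```

In this sketch the cache stays at a high bit-width while memory allows and drops to lower precision only as the sequence grows, so early decoding steps incur less quantization error than they would if the lowest bit-width were applied from the start.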

Title: MAVL: A Multilingual Audio-Video Lyrics Dataset for Animated Song Translation

Authors: Woohyun Cho, Youngmin Kim, Sunghyun Lee, Youngjae Yu
Subjects: cs.CL, cs.LG, cs.MM, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2505.18614
Pdf URL: https://arxiv.org/pdf/2505.18614
Copy Paste: [[2505.18614]] MAVL: A Multilingual Audio-Video Lyrics Dataset for Animated Song Translation(https://arxiv.org/abs/2505.18614)
Keywords: llm, chain-of-thought
Abstract: Lyrics translation requires both accurate semantic transfer and preservation of musical rhythm, syllabic structure, and poetic style. In animated musicals, the challenge intensifies due to alignment with visual and auditory cues. We introduce the Multilingual Audio-Video Lyrics Benchmark for Animated Song Translation (MAVL), the first multilingual, multimodal benchmark for singable lyrics translation. By integrating text, audio, and video, MAVL enables richer and more expressive translations than text-only approaches. Building on this, we propose the Syllable-Constrained Audio-Video LLM with Chain-of-Thought (SylAVL-CoT), which leverages audio-video cues and enforces syllabic constraints to produce natural-sounding lyrics. Experimental results demonstrate that SylAVL-CoT significantly outperforms text-based models in singability and contextual accuracy, emphasizing the value of multimodal, multilingual approaches for lyrics translation.

Title: DDO: Dual-Decision Optimization via Multi-Agent Collaboration for LLM-Based Medical Consultation

Authors: Zhihao Jia, Mingyi Jia, Junwen Duan, Jianxin Wang
Subjects: cs.CL, cs.AI, cs.MA
Abstract URL: https://arxiv.org/abs/2505.18630
Pdf URL: https://arxiv.org/pdf/2505.18630
Copy Paste: [[2505.18630]] DDO: Dual-Decision Optimization via Multi-Agent Collaboration for LLM-Based Medical Consultation(https://arxiv.org/abs/2505.18630)
Keywords: language model, llm, agent
Abstract: Large Language Models (LLMs) demonstrate strong generalization and reasoning abilities, making them well-suited for complex decision-making tasks such as medical consultation (MC). However, existing LLM-based methods often fail to capture the dual nature of MC, which entails two distinct sub-tasks: symptom inquiry, a sequential decision-making process, and disease diagnosis, a classification problem. This mismatch often results in ineffective symptom inquiry and unreliable disease diagnosis. To address this, we propose DDO, a novel LLM-based framework that performs Dual-Decision Optimization by decoupling and independently optimizing the two sub-tasks through a collaborative multi-agent workflow. Experiments on three real-world MC datasets show that DDO consistently outperforms existing LLM-based approaches and achieves competitive performance with state-of-the-art generation-based methods, demonstrating its effectiveness in the MC task.

Title: Multilingual Question Answering in Low-Resource Settings: A Dzongkha-English Benchmark for Foundation Models

Authors: Md. Tanzib Hosain, Rajan Das Gupta, Md. Kishor Morol
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.18638
Pdf URL: https://arxiv.org/pdf/2505.18638
Copy Paste: [[2505.18638]] Multilingual Question Answering in Low-Resource Settings: A Dzongkha-English Benchmark for Foundation Models(https://arxiv.org/abs/2505.18638)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: In this work, we provide DZEN, a dataset of parallel Dzongkha and English test questions for Bhutanese middle and high school students. The over 5K questions in our collection span a variety of scientific topics and include factual, application, and reasoning-based questions. We use our parallel dataset to test a number of Large Language Models (LLMs) and find a significant performance difference between the models in English and Dzongkha. We also look at different prompting strategies and discover that Chain-of-Thought (CoT) prompting works well for reasoning questions but less well for factual ones. We also find that adding English translations enhances the precision of Dzongkha question responses. Our results point to exciting avenues for further study to improve LLM performance in Dzongkha and, more generally, in low-resource languages. We release the dataset at: this https URL.
摘要：在这项工作中，我们为不丹中学和高中生提供了Dzen，平行Dzongkha和英语测试问题的数据集。我们收藏中的5K问题涵盖了各种科学主题，包括事实，应用和基于推理的问题。我们使用并行数据集测试许多大型语言模型（LLMS），并在英语和Dzongkha中找到模型之间的显着性能差异。我们还研究了不同的提示策略，并发现促使人们在推理问题方面效果很好，但对事实问题的效果不佳。我们还发现，添加英语翻译提高了Dzongkha问题回答的精度。我们的结果表明，令人兴奋的途径是进一步研究，以提高Dzongkha的LLM表现，更普遍地使用低资源语言。我们在以下位置发布数据集：此HTTPS URL。

Title: Skip-Thinking: Chunk-wise Chain-of-Thought Distillation Enable Smaller Language Models to Reason Better and Faster

Authors: Xiao Chen, Sihang Zhou, Ke Liang, Xiaoyu Sun, Xinwang Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.18642
Pdf URL: https://arxiv.org/pdf/2505.18642
Copy Paste: [[2505.18642]] Skip-Thinking: Chunk-wise Chain-of-Thought Distillation Enable Smaller Language Models to Reason Better and Faster(https://arxiv.org/abs/2505.18642)
Keywords: language model, llm, chain-of-thought
Abstract: Chain-of-thought (CoT) distillation allows a large language model (LLM) to guide a small language model (SLM) in reasoning tasks. Existing methods train the SLM to learn the long rationale in one iteration, resulting in two issues: 1) Long rationales lead to a large token-level batch size during training, making gradients of core reasoning tokens (i.e., tokens that directly affect the correctness of subsequent reasoning) over-smoothed, as they make up only a tiny fraction of the rationale. As a result, the SLM converges to sharp minima where it fails to grasp the reasoning logic. 2) The response is slow, as the SLM must generate a long rationale before reaching the answer. Therefore, we propose chunk-wise training (CWT), which uses a heuristic search to divide the rationale into internally semantically coherent chunks and focuses the SLM on learning from only one chunk per iteration. In this way, CWT naturally isolates non-reasoning chunks that do not involve core reasoning tokens (e.g., summary and transitional chunks) from the SLM's learning of reasoning chunks, which increases the fraction of core reasoning tokens in the corresponding iteration. Based on CWT, skip-thinking training (STT) is proposed. STT makes the SLM automatically skip intermediate non-reasoning chunks to reach the answer, improving reasoning speed while maintaining accuracy. We validate our approach on a variety of SLMs and multiple reasoning tasks.
摘要：经过思考链（COT）蒸馏允许大型语言模型（LLM）在推理任务中指导小语言模型（SLM）。现有方法训练SLM在一次迭代中学习长期基本原理，从而导致长期理由导致训练期间的较大令牌级别的批量大小，从而使核心推理令牌的梯度（即令牌将直接影响随后的推理的正确性），因为它们会导致其较小的效果，因为它们会造成一定的作品。结果，SLM会收敛到尖锐的最小值，在那里它无法掌握推理逻辑。 2）响应速度很慢，因为SLM必须在达到答案之前产生较长的理由。因此，我们提出了块培训（CWT），该培训使用启发式搜索将基本原理分为内部语义连贯的块，并将SLM专注于仅从迭代中的一个块中学习。通过这种方式，CWT自然地隔离了不涉及SLM学习的核心推理令牌（例如，摘要和过渡块）的非争议块（例如，摘要和过渡块），用于推理块，从而使相应迭代的核心推理令牌增加。根据CWT，提出了跳过思维训练（STT）。 STT使SLM自动跳过非争议的中等块以达到答案，从而提高了推理速度，同时保持准确性。我们验证了各种SLM和多个推理任务的方法。

Title: Climate-Eval: A Comprehensive Benchmark for NLP Tasks Related to Climate Change

Authors: Murathan Kurfalı, Shorouq Zahra, Joakim Nivre, Gabriele Messori
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.18653
Pdf URL: https://arxiv.org/pdf/2505.18653
Copy Paste: [[2505.18653]] Climate-Eval: A Comprehensive Benchmark for NLP Tasks Related to Climate Change(https://arxiv.org/abs/2505.18653)
Keywords: language model, llm
Abstract: Climate-Eval is a comprehensive benchmark designed to evaluate natural language processing models across a broad range of tasks related to climate change. Climate-Eval aggregates existing datasets along with a newly developed news classification dataset, created specifically for this release. This results in a benchmark of 25 tasks based on 13 datasets, covering key aspects of climate discourse, including text classification, question answering, and information extraction. Our benchmark provides a standardized evaluation suite for systematically assessing the performance of large language models (LLMs) on these tasks. Additionally, we conduct an extensive evaluation of open-source LLMs (ranging from 2B to 70B parameters) in both zero-shot and few-shot settings, analyzing their strengths and limitations in the domain of climate change.
摘要：气候恶化是一种综合基准，旨在评估与气候变化相关的广泛任务中的自然语言处理模型。气候评估汇总现有数据集以及专门为此版本创建的新开发的新闻分类数据集。这导致基于13个数据集的25个任务的基准，涵盖了气候话语的关键方面，包括文本分类，问答和信息提取。我们的基准提供了一个标准化的评估套件，用于系统地评估这些任务上大语言模型（LLMS）的性能。此外，我们对零射击和少量设置的开源LLM进行了广泛的评估（范围从2B到70b参数范围），分析了它们在气候变化领域的优势和局限性。

Title: Robustness in Large Language Models: A Survey of Mitigation Strategies and Evaluation Metrics

Authors: Pankaj Kumar, Subhankar Mishra
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.18658
Pdf URL: https://arxiv.org/pdf/2505.18658
Copy Paste: [[2505.18658]] Robustness in Large Language Models: A Survey of Mitigation Strategies and Evaluation Metrics(https://arxiv.org/abs/2505.18658)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have emerged as a promising cornerstone for the development of natural language processing (NLP) and artificial intelligence (AI). However, ensuring the robustness of LLMs remains a critical challenge. To address these challenges and advance the field, this survey provides a comprehensive overview of current studies in this area. First, we systematically examine the nature of robustness in LLMs, including its conceptual foundations, the importance of consistent performance across diverse inputs, and the implications of failure modes in real-world applications. Next, we analyze the sources of non-robustness, categorizing intrinsic model limitations, data-driven vulnerabilities, and external adversarial factors that compromise reliability. Following this, we review state-of-the-art mitigation strategies, and then we discuss widely adopted benchmarks, emerging metrics, and persistent gaps in assessing real-world reliability. Finally, we synthesize findings from existing surveys and interdisciplinary studies to highlight trends, unresolved issues, and pathways for future research.
摘要：大型语言模型（LLM）已成为自然语言处理（NLP）和人工智能（AI）发展的有前途的基石。但是，确保LLMS的鲁棒性仍然是一个关键挑战。为了应对这些挑战并推进了该领域，这项调查提供了对该领域当前研究的全面概述。首先，我们系统地检查了LLMS中鲁棒性的性质，包括其概念基础，各种输入中一致性的重要性以及在现实世界应用中失败模式的含义。接下来，我们分析了非舒适性的来源，对固有的模型限制，数据驱动的漏洞以及损害可靠性的外部对抗性因素进行分类。在此之后，我们回顾了最新的缓解策略，然后在评估现实世界可靠性时讨论广泛采用的基准，新兴指标和持续差距。最后，我们从现有的调查和跨学科研究中综合发现，以突出趋势，未解决的问题以及未来研究的途径。

Title: Cross-Lingual Pitfalls: Automatic Probing Cross-Lingual Weakness of Multilingual Large Language Models

Authors: Zixiang Xu, Yanbo Wang, Yue Huang, Xiuying Chen, Jieyu Zhao, Meng Jiang, Xiangliang Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.18673
Pdf URL: https://arxiv.org/pdf/2505.18673
Copy Paste: [[2505.18673]] Cross-Lingual Pitfalls: Automatic Probing Cross-Lingual Weakness of Multilingual Large Language Models(https://arxiv.org/abs/2505.18673)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have achieved remarkable success in Natural Language Processing (NLP), yet their cross-lingual performance consistency remains a significant challenge. This paper introduces a novel methodology for efficiently identifying inherent cross-lingual weaknesses in LLMs. Our approach leverages beam search and LLM-based simulation to generate bilingual question pairs that expose performance discrepancies between English and target languages. We construct a new dataset of over 6,000 bilingual pairs across 16 languages using this methodology, demonstrating its effectiveness in revealing weaknesses even in state-of-the-art models. The extensive experiments demonstrate that our method precisely and cost-effectively pinpoints cross-lingual weaknesses, consistently revealing over 50% accuracy drops in target languages across a wide range of models. Moreover, further experiments investigate the relationship between linguistic similarity and cross-lingual weaknesses, revealing that linguistically related languages share similar performance patterns and benefit from targeted post-training. Code is available at this https URL.
摘要：大型语言模型（LLM）在自然语言处理（NLP）方面取得了巨大的成功，但是他们的跨语性表现一致性仍然是一个重大挑战。本文介绍了一种新型方法，用于有效地识别LLMS中固有的跨语性弱点。我们的方法利用光束搜索和基于LLM的模拟生成双语问题对，以暴露英语和目标语言之间的性能差异。我们使用这种方法构建了一个在16种语言上进行6,000多个双语对的新数据集，即使在最新模型中，也证明了它在揭示弱点方面的有效性。广泛的实验表明，我们的方法精确和成本效率地指出了跨语性弱点，始终揭示了超过50 \％的准确性在广泛模型的目标语言中的准确性下降。此外，进一步的实验研究了语言相似性与跨语性弱点之间的关系，表明语言相关的语言具有相似的性能模式，并受益于靶向后培训。代码可在此HTTPS URL上找到。

Title: Social Good or Scientific Curiosity? Uncovering the Research Framing Behind NLP Artefacts

Authors: Eric Chamoun, Nedjma Ousidhoum, Michael Schlichtkrull, Andreas Vlachos
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.18677
Pdf URL: https://arxiv.org/pdf/2505.18677
Copy Paste: [[2505.18677]] Social Good or Scientific Curiosity? Uncovering the Research Framing Behind NLP Artefacts(https://arxiv.org/abs/2505.18677)
Keywords: llm
Abstract: Clarifying the research framing of NLP artefacts (e.g., models, datasets) is crucial to aligning research with practical applications. Recent studies manually analyzed NLP research across domains, showing that few papers explicitly identify key stakeholders, intended uses, or appropriate contexts. In this work, we propose to automate this analysis, developing a three-component system that infers research framings by first extracting key elements (means, ends, stakeholders), then linking them through interpretable rules and contextual reasoning. We evaluate our approach on two domains: automated fact-checking using an existing dataset, and hate speech detection for which we annotate a new dataset, achieving consistent improvements over strong LLM baselines. Finally, we apply our system to recent automated fact-checking papers and uncover three notable trends: a rise in vague or underspecified research goals, increased emphasis on scientific exploration over application, and a shift toward supporting human fact-checkers rather than pursuing full automation.
摘要：阐明NLP工件的研究框架（例如，模型，数据集等）对于将研究与实际应用一致至关重要。最近的研究手动分析了跨领域的NLP研究，表明很少有论文明确识别主要利益相关者，预期用途或适当的环境。在这项工作中，我们建议自动进行此分析，开发一个三组分系统，该系统通过首先提取关键元素（含义，目的，利益相关者），然后通过可解释的规则和上下文推理将其链接到研究框架。我们在两个领域中评估了我们的方法：使用现有数据集进行自动事实检查，并仇恨言语检测注释了对强大LLM基准的新数据集检验的一致改进。最后，我们将系统应用于最新的自动事实检查论文，并发现了三个显着的趋势：模糊或指定的研究目标的增加，对科学探索而不是应用程序的强调，以及朝着支持人类事实检查者而不是追求全自动自动化的转变。

Title: TULUN: Transparent and Adaptable Low-resource Machine Translation

Authors: Raphaël Merx, Hanna Suominen, Lois Hong, Nick Thieberger, Trevor Cohn, Ekaterina Vylomova
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.18683
Pdf URL: https://arxiv.org/pdf/2505.18683
Copy Paste: [[2505.18683]] TULUN: Transparent and Adaptable Low-resource Machine Translation(https://arxiv.org/abs/2505.18683)
Keywords: language model, llm
Abstract: Machine translation (MT) systems that support low-resource languages often struggle on specialized domains. While researchers have proposed various techniques for domain adaptation, these approaches typically require model fine-tuning, making them impractical for non-technical users and small organizations. To address this gap, we propose Tulun, a versatile solution for terminology-aware translation, combining neural MT with large language model (LLM)-based post-editing guided by existing glossaries and translation memories. Our open-source web-based platform enables users to easily create, edit, and leverage terminology resources, fostering a collaborative human-machine translation process that respects and incorporates domain expertise while increasing MT accuracy. Evaluations show effectiveness in both real-world and benchmark scenarios: on medical and disaster relief translation tasks for Tetun and Bislama, our system achieves improvements of 16.90-22.41 ChrF++ points over baseline MT systems. Across six low-resource languages on the FLORES dataset, Tulun outperforms both standalone MT and LLM approaches, achieving an average improvement of 2.8 ChrF points over NLLB-54B.
摘要：支持低资源语言的机器翻译（MT）系统通常在专业领域上挣扎。尽管研究人员提出了针对域适应的各种技术，但这些方法通常需要微调模型，这对于非技术用户和小型组织来说是不切实际的。为了解决这一差距，我们提出了Tulun，Tulun是一种用于术语感知翻译的多功能解决方案，将神经MT与大语言模型（LLM）基于现有的词汇表和翻译记忆指导的后编辑。我们的开源网络平台使用户可以轻松创建，编辑和利用术语资源，从而促进了协作的人机翻译过程，该过程尊重并融入了域专业知识，同时提高了MT的准确性。评估在现实世界和基准方案中都显示出有效性：关于Tetun和Bislama的医疗和救灾翻译任务，我们的系统可在基线MT系统上提高16.90-22.41 CHRF ++分。在Flores数据集上的六种低资源语言中，Tulun的表现均优于独立MT和LLM方法，在NLLB-54B上平均提高了2.8 CHRF点。

Title: Large Language Models in the Task of Automatic Validation of Text Classifier Predictions

Authors: Aleksandr Tsymbalov
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.18688
Pdf URL: https://arxiv.org/pdf/2505.18688
Copy Paste: [[2505.18688]] Large Language Models in the Task of Automatic Validation of Text Classifier Predictions(https://arxiv.org/abs/2505.18688)
Keywords: language model, llm
Abstract: Machine learning models for text classification are trained to predict a class for a given text. To do this, training and validation samples must be prepared: a set of texts is collected, and each text is assigned a class. These classes are usually assigned by human annotators with different expertise levels, depending on the specific classification task. Collecting such samples from scratch is labor-intensive because it requires finding specialists and compensating them for their work; moreover, the number of available specialists is limited, and their productivity is constrained by human factors. While it may not be too resource-intensive to collect samples once, the ongoing need to retrain models (especially in incremental learning pipelines) to address data drift (also called model drift) makes the data collection process crucial and costly over the model's entire lifecycle. This paper proposes several approaches to replace human annotators with Large Language Models (LLMs) to test classifier predictions for correctness, helping ensure model quality and support high-quality incremental learning.
摘要：培训了用于文本分类的机器学习模型，以预测给定文本的类。为此，必须准备培训和验证样本：收集一组文本，并为每个文本分配一个类。这些类通常由具有不同专业知识级别的人类注释者分配，具体取决于特定的分类任务。从头开始收集此类样本是劳动密集型的，因为它需要寻找专家并为他们的工作补偿；此外，可用专家的数量有限，其生产力受到人为因素的限制。虽然一次收集样本可能不是资源密集的，但持续的重新培训需要（尤其是在增量学习管道中）来解决数据漂移（也称为模型漂移），这使得数据收集过程在模型的整个生命周期中至关重要且昂贵。本文提出了几种用大语言模型（LLM）替代人类注释者的方法，以测试分类器预测的正确性，有助于确保模型质量并支持高质量的增量学习。

Title: Benchmarking and Rethinking Knowledge Editing for Large Language Models

Authors: Guoxiu He, Xin Song, Futing Wang, Aixin Sun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.18690
Pdf URL: https://arxiv.org/pdf/2505.18690
Copy Paste: [[2505.18690]] Benchmarking and Rethinking Knowledge Editing for Large Language Models(https://arxiv.org/abs/2505.18690)
Keywords: language model, llm
Abstract: Knowledge editing aims to update the embedded knowledge within Large Language Models (LLMs). However, existing approaches, whether through parameter modification or external memory integration, often suffer from inconsistent evaluation objectives and experimental setups. To address this gap, we conduct a comprehensive benchmarking study. In addition to fact-level datasets, we introduce more complex event-based datasets and general-purpose datasets drawn from other tasks. Our evaluation covers both instruction-tuned and reasoning-oriented LLMs, under a realistic autoregressive inference setting rather than teacher-forced decoding. Beyond single-edit assessments, we also evaluate multi-edit scenarios to better reflect practical demands. We employ four evaluation dimensions, including portability, and compare all recent methods against a simple and straightforward baseline named Selective Contextual Reasoning (SCR). Empirical results reveal that parameter-based editing methods perform poorly under realistic conditions. In contrast, SCR consistently outperforms them across all settings. This study offers new insights into the limitations of current knowledge editing methods and highlights the potential of context-based reasoning as a more robust alternative.
摘要：知识编辑旨在更新大语言模型（LLMS）中的嵌入式知识。但是，现有方法，无论是通过参数修改还是外部内存集成，通常都遭受不一致的评估目标和实验设置。为了解决这一差距，我们进行了全面的基准测试研究。除了事实级数据集外，我们还介绍了更复杂的基于事件的数据集和从其他任务中绘制的通用数据集。我们的评估涵盖了指导调整和面向推理的LLM，在现实的自动回归推理环境下而不是教师的解码下。除了单编辑评估外，我们还评估了多编辑方案，以更好地反映实际需求。我们采用四个评估维度，包括可移植性，并将所有最新方法与名为选择性上下文推理（SCR）的简单直接基线进行比较。经验结果表明，基于参数的编辑方法在现实条件下的性能差。相比之下，SCR在所有设置上始终优于它们。这项研究为当前知识编辑方法的局限性提供了新的见解，并突出了基于上下文的推理作为更强大的替代方案的潜力。

Title: Optimal Transport-Based Token Weighting scheme for Enhanced Preference Optimization

Authors: Meng Li, Guangda Huzhang, Haibo Zhang, Xiting Wang, Anxiang Zeng
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.18720
Pdf URL: https://arxiv.org/pdf/2505.18720
Copy Paste: [[2505.18720]] Optimal Transport-Based Token Weighting scheme for Enhanced Preference Optimization(https://arxiv.org/abs/2505.18720)
Keywords: language model, llm
Abstract: Direct Preference Optimization (DPO) has emerged as a promising framework for aligning Large Language Models (LLMs) with human preferences by directly optimizing the log-likelihood difference between chosen and rejected responses. However, existing methods assign equal importance to all tokens in the response, while humans focus on more meaningful parts. This leads to suboptimal preference optimization, as irrelevant or noisy tokens disproportionately influence DPO loss. To address this limitation, we propose an Optimal Transport-based token weighting scheme for enhancing direct Preference Optimization (OTPO). By emphasizing semantically meaningful token pairs and de-emphasizing less relevant ones, our method introduces a context-aware token weighting scheme that yields a more contrastive reward difference estimate. This adaptive weighting enhances reward stability, improves interpretability, and ensures that preference optimization focuses on meaningful differences between responses. Extensive experiments have validated OTPO's effectiveness in improving instruction-following ability across various settings. Code is available at this https URL.
摘要：通过直接优化所选响应和被拒绝响应之间的对数似然差，直接偏好优化（DPO）已成为将大型语言模型（LLM）与人类偏好对齐的有前途的框架。但是，现有方法对响应中的所有令牌赋予同等的重要性，而人类则专注于更有意义的部分。这会导致次优的偏好优化，因为无关或嘈杂的令牌会不成比例地影响DPO损失。为了解决此限制，我们提出了一种基于最优传输（Optimal Transport）的令牌加权方案，用于增强直接偏好优化（OTPO）。通过强调语义上有意义的令牌对并弱化不太相关的令牌对，我们的方法引入了一种上下文感知的令牌加权方案，从而产生更具对比性的奖励差异估计。这种自适应加权增强了奖励稳定性，提高了可解释性，并确保偏好优化集中在响应之间的有意义差异上。广泛的实验已经验证了OTPO在各种设置下提升指令遵循能力的有效性。代码可在此HTTPS URL上找到。
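
The idea of weighting tokens inside the DPO objective can be illustrated with a minimal Python sketch. It assumes the per-token weights are already given; how OTPO derives them from optimal transport is not described in the abstract and is not reproduced here. The function names, the dictionary layout, and the default `beta` are hypothetical choices made for illustration only.

```python
import math

def weighted_logratio(policy_logps, ref_logps, weights):
    """Weighted sum over tokens of (log pi_policy - log pi_ref)."""
    return sum(w * (p - r) for w, p, r in zip(weights, policy_logps, ref_logps))

def token_weighted_dpo_loss(chosen, rejected, beta=0.1):
    """DPO-style loss with per-token weights (illustrative, not the paper's code).

    chosen / rejected: dicts holding per-token lists under the keys
    'policy' (policy log-probs), 'ref' (reference log-probs), 'weights'.
    With all weights equal to 1 this reduces to the standard DPO loss.
    """
    margin = beta * (
        weighted_logratio(chosen["policy"], chosen["ref"], chosen["weights"])
        - weighted_logratio(rejected["policy"], rejected["ref"], rejected["weights"])
    )
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

Raising the weight of a chosen-response token whose policy log-probability exceeds the reference one widens the margin and lowers the loss, which is the sense in which a weighting scheme can shift the objective toward selected tokens.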

Title: LogicCat: A Chain-of-Thought Text-to-SQL Benchmark for Multi-Domain Reasoning Challenges

Authors: Tao Liu, Hongying Zan, Yifan Li, Dixuan Zhang, Lulu Kong, Haixin Liu, Jiaming Hou, Aoze Zheng, Rui Li, Yiming Qiao, Zewei Luo, Qi Wang, Zhiqiang Zhang, Jiaxi Li, Supeng Liu, Kunli Zhang, Min Peng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.18744
Pdf URL: https://arxiv.org/pdf/2505.18744
Copy Paste: [[2505.18744]] LogicCat: A Chain-of-Thought Text-to-SQL Benchmark for Multi-Domain Reasoning Challenges(https://arxiv.org/abs/2505.18744)
Keywords: chain-of-thought
Abstract: Text-to-SQL is a fundamental task in natural language processing that seeks to translate natural language questions into meaningful and executable SQL queries. While existing datasets are extensive and primarily focus on business scenarios and operational logic, they frequently lack coverage of domain-specific knowledge and complex mathematical reasoning. To address this gap, we present a novel dataset tailored for complex reasoning and chain-of-thought analysis in SQL inference, encompassing physical, arithmetic, commonsense, and hypothetical reasoning. The dataset consists of 4,038 English questions, each paired with a unique SQL query and accompanied by 12,114 step-by-step reasoning annotations, spanning 45 databases across diverse domains. Experimental results demonstrate that LogicCat substantially increases the difficulty for state-of-the-art models, with the highest execution accuracy reaching only 14.96%. Incorporating our chain-of-thought annotations boosts performance to 33.96%. Benchmarking leading public methods on Spider and BIRD further underscores the unique challenges presented by LogicCat, highlighting the significant opportunities for advancing research in robust, reasoning-driven text-to-SQL systems. We have released our dataset code at this https URL.
摘要：文本到SQL是自然语言处理中的一项基本任务，该任务旨在将自然语言问题转化为有意义且可执行的SQL查询。尽管现有数据集是广泛的，并且主要关注业务场景和运营逻辑，但它们经常缺乏对特定于领域的知识和复杂数学推理的覆盖范围。为了解决这一差距，我们提出了一个针对SQL推断中复杂推理和经过思考分析的新型数据集，包括物理，算术，常识和假设推理。该数据集由4,038个英语问题组成，每个问题都与唯一的SQL查询配对，并伴有12,114个逐步推理注释，涵盖了跨不同域的45个数据库。实验结果表明，LogicCAT大大增加了最新模型的难度，而执行精度最高的精度仅达到14.96％。结合我们的思想链注释可将绩效提高到33.96％。基准对蜘蛛和伯德的领先公共方法进行了基准，进一步强调了LogicCat提出的独特挑战，突出了在强大的，推理驱动的文本到SQL系统中促进研究的重要机会。我们已经在此HTTPS URL上发布了数据集代码。
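
Execution accuracy, the metric quoted above, is commonly computed by running the predicted and the gold query against the same database and comparing the results. The sketch below shows one way to do this with Python's standard `sqlite3` module. The helper names, the demo table, and the order-insensitive comparison are assumptions made for illustration; the benchmark's own evaluation script may differ.

```python
import sqlite3

def execution_match(conn, predicted_sql, gold_sql):
    """True if both queries run and return the same rows, ignoring row order."""
    try:
        predicted_rows = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False  # a query that fails to execute counts as wrong
    gold_rows = conn.execute(gold_sql).fetchall()
    return sorted(predicted_rows, key=repr) == sorted(gold_rows, key=repr)

def execution_accuracy(conn, pairs):
    """Fraction of (predicted_sql, gold_sql) pairs whose results match."""
    if not pairs:
        return 0.0
    return sum(execution_match(conn, p, g) for p, g in pairs) / len(pairs)

def build_demo_db():
    """Tiny in-memory database used only for this illustration."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE readings (sensor TEXT, value REAL)")
    conn.executemany(
        "INSERT INTO readings VALUES (?, ?)",
        [("a", 1.0), ("b", 2.5), ("c", 4.0)],
    )
    return conn
```

Comparing result sets rather than SQL strings lets differently written but equivalent queries count as correct, while queries that fail to execute are scored as wrong.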

Title: Unifying Attention Heads and Task Vectors via Hidden State Geometry in In-Context Learning

Authors: Haolin Yang, Hakaze Cho, Yiqiao Zhong, Naoya Inoue
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.18752
Pdf URL: https://arxiv.org/pdf/2505.18752
Copy Paste: [[2505.18752]] Unifying Attention Heads and Task Vectors via Hidden State Geometry in In-Context Learning(https://arxiv.org/abs/2505.18752)
Keywords: language model, prompt
Abstract: The unusual properties of in-context learning (ICL) have prompted investigations into the internal mechanisms of large language models. Prior work typically focuses on either special attention heads or task vectors at specific layers, but lacks a unified framework linking these components to the evolution of hidden states across layers that ultimately produce the model's output. In this paper, we propose such a framework for ICL in classification tasks by analyzing two geometric factors that govern performance: the separability and alignment of query hidden states. A fine-grained analysis of layer-wise dynamics reveals a striking two-stage mechanism: separability emerges in early layers, while alignment develops in later layers. Ablation studies further show that Previous Token Heads drive separability, while Induction Heads and task vectors enhance alignment. Our findings thus bridge the gap between attention heads and task vectors, offering a unified account of ICL's underlying mechanisms.
摘要：文化学习（ICL）的异常属性促使研究了大语言模型的内部机制。先前的工作通常专注于特定层的特殊注意力头或任务向量，但缺乏将这些组件与跨层中隐藏状态的演变联系起来的统一框架，最终会产生模型的输出。在本文中，我们通过分析控制性能的两个几何因素：查询隐藏状态的可分离性和对齐方式，提出了ICL分类任务中的ICL框架。对层动力学的细粒度分析揭示了一种引人注目的两阶段机制：在早期层中出现可分离性，而对齐方式则在后来的层中发展。消融研究进一步表明，以前的令牌头可以驱动可分离性，而感应头和任务向量则增强了对齐。因此，我们的发现弥合了注意力头和任务向量之间的差距，提供了ICL基本机制的统一说明。

Title: Few-Shot Optimization for Sensor Data Using Large Language Models: A Case Study on Fatigue Detection

Authors: Elsen Ronando, Sozo Inoue
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.18754
Pdf URL: https://arxiv.org/pdf/2505.18754
Copy Paste: [[2505.18754]] Few-Shot Optimization for Sensor Data Using Large Language Models: A Case Study on Fatigue Detection(https://arxiv.org/abs/2505.18754)
Keywords: language model, llm, prompt
Abstract: In this paper, we propose a novel few-shot optimization with HED-LM (Hybrid Euclidean Distance with Large Language Models) to improve example selection for sensor-based classification tasks. While few-shot prompting enables efficient inference with limited labeled data, its performance largely depends on the quality of selected examples. HED-LM addresses this challenge through a hybrid selection pipeline that filters candidate examples based on Euclidean distance and re-ranks them using contextual relevance scored by large language models (LLMs). To validate its effectiveness, we apply HED-LM to a fatigue detection task using accelerometer data characterized by overlapping patterns and high inter-subject variability. Unlike simpler tasks such as activity recognition, fatigue detection demands more nuanced example selection due to subtle differences in physiological signals. Our experiments show that HED-LM achieves a mean macro F1-score of 69.13 ± 10.71%, outperforming both random selection (59.30 ± 10.13%) and distance-only filtering (67.61 ± 11.39%). These represent relative improvements of 16.6% and 2.3%, respectively. The results confirm that combining numerical similarity with contextual relevance improves the robustness of few-shot prompting. Overall, HED-LM offers a practical solution to improve performance in real-world sensor-based learning tasks and shows potential for broader applications in healthcare monitoring, human activity recognition, and industrial safety scenarios.
摘要：在本文中，我们提出了一种新颖的几弹性优化，并使用HED-LM（与大语言模型的混合欧几里得距离）提出了一个新的，以改善基于传感器的分类任务的示例选择。虽然很少射击提示可以使用有限的标记数据进行有效的推断，但其性能很大程度上取决于所选示例的质量。 HED-LM通过混合选择管道解决了这一挑战，该管道根据欧几里得距离过滤候选示例，并使用大语言模型（LLMS）评分的上下文相关性对其进行重新排列。为了验证其有效性，我们使用加速度计数据将HED-LM应用于疲劳检测任务，该数据以重叠模式和高主题间可变性为特征。与更简单的任务（例如活动识别）不同，由于生理信号的细微差异，疲劳检测需要更细微的示例选择。我们的实验表明，HED-LM的平均宏F1分数为69.13 $ \ pm $ 10.71％，表现优于随机选择（59.30 $ \ pm $ 10.13％）和仅距离过滤（67.61 $ \ pm $ $ $ 11.39％）。这些分别代表16.6％和2.3％的相对改善。结果证实，将数值相似性与上下文相关性相结合可以提高少量提示的鲁棒性。总体而言，HED-LM提供了一种实用的解决方案，可以提高基于现实的传感器学习任务的性能，并显示出在医疗保健监测，人类活动识别和工业安全方案中更广泛应用的潜力。
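
The two-stage selection described above (distance filter, then relevance re-ranking) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `select_examples`, `n_candidates`, `k`, and the `score_fn` callback are hypothetical names, and `score_fn` stands in for the LLM-based relevance scoring, which the abstract does not specify.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def select_examples(query, pool, score_fn, n_candidates=4, k=2):
    """Pick k few-shot examples for a query feature vector.

    pool: list of (features, label) pairs.
    score_fn(query, example) -> float: placeholder for an LLM relevance score.
    Stage 1 keeps the n_candidates nearest examples by Euclidean distance.
    Stage 2 re-ranks those candidates by score_fn, highest first.
    """
    nearest = sorted(pool, key=lambda ex: euclidean(query, ex[0]))[:n_candidates]
    reranked = sorted(nearest, key=lambda ex: score_fn(query, ex), reverse=True)
    return reranked[:k]
```

Because re-ranking only sees the distance-filtered candidates, an example that scores highly on relevance but lies far from the query is never selected, which keeps the number of relevance-scoring calls bounded by `n_candidates`.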

Title: How Is LLM Reasoning Distracted by Irrelevant Context? An Analysis Using a Controlled Benchmark

Authors: Minglai Yang, Ethan Huang, Liang Zhang, Mihai Surdeanu, William Wang, Liangming Pan
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.18761
Pdf URL: https://arxiv.org/pdf/2505.18761
Copy Paste: [[2505.18761]] How Is LLM Reasoning Distracted by Irrelevant Context? An Analysis Using a Controlled Benchmark(https://arxiv.org/abs/2505.18761)
Keywords: language model, llm
Abstract: We introduce Grade School Math with Distracting Context (GSM-DC), a synthetic benchmark to evaluate Large Language Models' (LLMs) reasoning robustness against systematically controlled irrelevant context (IC). GSM-DC constructs symbolic reasoning graphs with precise distractor injections, enabling rigorous, reproducible evaluation. Our experiments demonstrate that LLMs are significantly sensitive to IC, affecting both reasoning path selection and arithmetic accuracy. Additionally, training models with strong distractors improves performance in both in-distribution and out-of-distribution scenarios. We further propose a stepwise tree search guided by a process reward model, which notably enhances robustness in out-of-distribution conditions.
摘要：我们以分散注意力的环境（GSM-DC）介绍了小学数学，这是一种综合基准，用于评估大型语言模型的（LLMS）推理对系统控制的无关紧要的背景（IC）的鲁棒性。 GSM-DC用精确的干扰物注射构建符号推理图，从而实现严格的可再现评估。我们的实验表明，LLM对IC显着敏感，从而影响推理路径选择和算术准确性。此外，具有强大干扰物的培训模型可改善分布和分发场景的性能。我们进一步提出了一个以过程奖励模型为指导的逐步搜索，该搜索尤其增强了分布条件下的鲁棒性。

Title: Disentangling Knowledge Representations for Large Language Model Editing

Authors: Mengqi Zhang, Zisheng Zhou, Xiaotian Ye, Qiang Liu, Zhaochun Ren, Zhumin Chen, Pengjie Ren
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.18774
Pdf URL: https://arxiv.org/pdf/2505.18774
Copy Paste: [[2505.18774]] Disentangling Knowledge Representations for Large Language Model Editing(https://arxiv.org/abs/2505.18774)
Keywords: language model, llm
Abstract: Knowledge Editing has emerged as a promising solution for efficiently updating embedded knowledge in large language models (LLMs). While existing approaches demonstrate effectiveness in integrating new knowledge and preserving the original capabilities of LLMs, they fail to maintain fine-grained irrelevant knowledge facts that share the same subject as edited knowledge but differ in relation and object. This challenge arises because subject representations inherently encode multiple attributes, causing the target and fine-grained irrelevant knowledge to become entangled in the representation space, and thus vulnerable to unintended alterations during editing. To address this, we propose DiKE, a novel approach that Disentangles Knowledge representations for LLM Editing. DiKE consists of two key components: a Knowledge Representation Disentanglement (KRD) module that decomposes the subject representation into target-knowledge-related and -unrelated components, and a Disentanglement-based Knowledge Edit (DKE) module that updates only the target-related component while explicitly preserving the unrelated one. We further derive a closed-form, rank-one parameter update based on matrix theory to enable efficient and minimally invasive edits. To rigorously evaluate fine-grained irrelevant knowledge preservation, we construct FINE-KED, a new benchmark comprising fine-grained irrelevant knowledge at different levels of relational similarity to the edited knowledge. Extensive experiments across multiple LLMs demonstrate that DiKE substantially improves fine-grained irrelevant knowledge preservation while maintaining competitive general editing performance.
摘要：知识编辑已成为有前途的解决方案，用于有效地更新大型语言模型（LLMS）中的嵌入式知识。尽管现有方法表明在整合新知识和保留LLM的原始功能方面表现出有效性，但他们无法维持具有与编辑知识相同的主题但在关系和对象方面具有相同主题的精细无关的知识事实。之所以出现这一挑战，是因为主题表示本质上编码多个属性，导致目标和细粒度的无关知识在表示空间中纠缠，因此在编辑过程中容易受到意外改动的影响。为了解决这个问题，我们提出了Dike，这是一种新颖的方法，可以解开LLM编辑（Dike）的知识表示。 Dike由两个关键组成部分组成：知识表示解散（KRD）模块，该模块将主题表示分解为目标知识群体相关和无关的组件，以及基于分解的知识编辑（DKE）模块，仅更新目标相关组件，同时显式地保留无关的一个。我们进一步得出了基于矩阵理论的封闭形式，排名一号参数更新，以实现有效且微创的编辑。为了严格评估精细粒度无关的知识保存，我们构建了Fine-Ked，这是一种新的基准测试，其中包含与编辑知识不同的关系相似性的精细粒度无关知识。跨多个LLM的广泛实验表明，堤防可在维持竞争激烈的一般编辑性能的同时，大大改善了细粒度无关的知识保存。

Title: ALPS: Attention Localization and Pruning Strategy for Efficient Alignment of Large Language Models

Authors: Hao Chen, Haoze Li, Zhiqing Xiao, Lirong Gao, Qi Zhang, Xiaomeng Hu, Ningtao Wang, Xing Fu, Junbo Zhao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18799
Pdf URL: https://arxiv.org/pdf/2505.18799
Copy Paste: [[2505.18799]] ALPS: Attention Localization and Pruning Strategy for Efficient Alignment of Large Language Models(https://arxiv.org/abs/2505.18799)
Keywords: language model, llm
Abstract: Aligning general-purpose large language models (LLMs) to downstream tasks often incurs significant costs, including constructing task-specific instruction pairs and extensive training adjustments. Prior research has explored various avenues to enhance alignment efficiency, primarily through minimal-data training or data-driven activations to identify key attention heads. However, these approaches inherently introduce data dependency, which hinders generalization and reusability. To address this issue and enhance model alignment efficiency, we propose the Attention Localization and Pruning Strategy (ALPS), an efficient algorithm that localizes the most task-sensitive attention heads and prunes by restricting attention training updates to these heads, thereby reducing alignment costs. Experimental results demonstrate that our method activates only 10% of attention parameters during fine-tuning while achieving a 2% performance improvement over baselines on three tasks. Moreover, the identified task-specific heads are transferable across datasets and mitigate knowledge forgetting. Our work and findings provide a novel perspective on efficient LLM alignment.
摘要：将通用大语言模型（LLM）与下游任务对齐，通常会产生巨大的成本，包括构建特定于任务的指令对和大量的训练调整。先前的研究探索了各种途径以提高对齐效率，主要是通过最小数据训练或数据驱动的激活来识别关键注意力头。但是，这些方法固有地引入了数据依赖性，这阻碍了泛化和可复用性。为了解决这一问题并提高模型对齐效率，我们提出了注意力定位与剪枝策略（Attention Localization and Pruning Strategy，ALPS），这是一种高效算法，它定位对任务最敏感的注意力头，并通过将注意力训练更新限制在这些头上来进行剪枝，从而降低对齐成本。实验结果表明，我们的方法在微调过程中仅激活10%的注意力参数，同时在三个任务上相对基线实现了2%的性能提升。此外，已识别的任务特定注意力头可以跨数据集迁移，并减轻知识遗忘。我们的工作和发现为高效的LLM对齐提供了一种新颖的视角。

Title: Don't Look Only Once: Towards Multimodal Interactive Reasoning with Selective Visual Revisitation

Authors: Jiwan Chung, Junhyeok Kim, Siyeol Kim, Jaeyoung Lee, Min Soo Kim, Youngjae Yu
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2505.18842
Pdf URL: https://arxiv.org/pdf/2505.18842
Copy Paste: [[2505.18842]] Don't Look Only Once: Towards Multimodal Interactive Reasoning with Selective Visual Revisitation(https://arxiv.org/abs/2505.18842)
Keywords: language model, llm
Abstract: We present v1, a lightweight extension to Multimodal Large Language Models (MLLMs) that enables selective visual revisitation during inference. While current MLLMs typically consume visual input only once and reason purely over internal memory, v1 introduces a simple point-and-copy mechanism that allows the model to dynamically retrieve relevant image regions throughout the reasoning process. This mechanism augments existing architectures with minimal modifications, enabling contextual access to visual tokens based on the model's evolving hypotheses. To train this capability, we construct v1g, a dataset of 300K multimodal reasoning traces with interleaved visual grounding annotations. Experiments on three multimodal mathematical reasoning benchmarks -- MathVista, MathVision, and MathVerse -- demonstrate that v1 consistently improves performance over comparable baselines, particularly on tasks requiring fine-grained visual reference and multi-step reasoning. Our results suggest that dynamic visual access is a promising direction for enhancing grounded multimodal reasoning. Code, models, and data will be released to support future research.

Title: Multi-Party Conversational Agents: A Survey

Authors: Sagar Sapkota, Mohammad Saqib Hasan, Mubarak Shah, Santu Karmaker
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.18845
Pdf URL: https://arxiv.org/pdf/2505.18845
Copy Paste: [[2505.18845]] Multi-Party Conversational Agents: A Survey(https://arxiv.org/abs/2505.18845)
Keywords: language model, llm, agent
Abstract: Multi-party Conversational Agents (MPCAs) are systems designed to engage in dialogue with more than two participants simultaneously. Unlike traditional two-party agents, designing MPCAs faces additional challenges due to the need to interpret both utterance semantics and social dynamics. This survey explores recent progress in MPCAs by addressing three key questions: 1) Can agents model each participant's mental states? (State of Mind Modeling); 2) Can they properly understand the dialogue content? (Semantic Understanding); and 3) Can they reason about and predict future conversation flow? (Agent Action Modeling). We review methods ranging from classical machine learning to Large Language Models (LLMs) and multi-modal systems. Our analysis underscores Theory of Mind (ToM) as essential for building intelligent MPCAs and highlights multi-modal understanding as a promising yet underexplored direction. Finally, this survey offers guidance to future researchers on developing more capable MPCAs.

Title: Writing Like the Best: Exemplar-Based Expository Text Generation

Authors: Yuxiang Liu, Kevin Chen-Chuan Chang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18859
Pdf URL: https://arxiv.org/pdf/2505.18859
Copy Paste: [[2505.18859]] Writing Like the Best: Exemplar-Based Expository Text Generation(https://arxiv.org/abs/2505.18859)
Keywords: language model, llm
Abstract: We introduce the Exemplar-Based Expository Text Generation task, aiming to generate an expository text on a new topic using an exemplar on a similar topic. Current methods fall short due to their reliance on extensive exemplar data, difficulty in adapting topic-specific content, and issues with long-text coherence. To address these challenges, we propose the concept of Adaptive Imitation and present a novel Recurrent Plan-then-Adapt (RePA) framework. RePA leverages large language models (LLMs) for effective adaptive imitation through a fine-grained plan-then-adapt process. RePA also enables recurrent segment-by-segment imitation, supported by two memory structures that enhance input clarity and output coherence. We also develop task-specific evaluation metrics (imitativeness, adaptiveness, and adaptive-imitativeness), using LLMs as evaluators. Experimental results across three diverse datasets that we collected demonstrate that RePA surpasses existing baselines in producing factual, consistent, and relevant texts for this task.

Title: Audio Jailbreak Attacks: Exposing Vulnerabilities in SpeechGPT in a White-Box Framework

Authors: Binhao Ma, Hanqing Guo, Zhengping Jay Luo, Rui Duan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.18864
Pdf URL: https://arxiv.org/pdf/2505.18864
Copy Paste: [[2505.18864]] Audio Jailbreak Attacks: Exposing Vulnerabilities in SpeechGPT in a White-Box Framework(https://arxiv.org/abs/2505.18864)
Keywords: language model, gpt, llm, prompt
Abstract: Recent advances in Multimodal Large Language Models (MLLMs) have significantly enhanced the naturalness and flexibility of human-computer interaction by enabling seamless understanding across text, vision, and audio modalities. Among these, voice-enabled models such as SpeechGPT have demonstrated considerable improvements in usability, offering expressive and emotionally responsive interactions that foster deeper connections in real-world communication scenarios. However, the use of voice introduces new security risks, as attackers can exploit the unique characteristics of spoken language, such as timing, pronunciation variability, and speech-to-text translation, to craft inputs that bypass defenses in ways not seen in text-based systems. Despite substantial research on text-based jailbreaks, the voice modality remains largely underexplored in terms of both attack strategies and defense mechanisms. In this work, we present an adversarial attack targeting the speech input of aligned MLLMs in a white-box scenario. Specifically, we introduce a novel token-level attack that leverages access to the model's speech tokenization to generate adversarial token sequences. These sequences are then synthesized into audio prompts, which effectively bypass alignment safeguards and induce prohibited outputs. Evaluated on SpeechGPT, our approach achieves up to 89 percent attack success rate across multiple restricted tasks, significantly outperforming existing voice-based jailbreak methods. Our findings shed light on the vulnerabilities of voice-enabled multimodal systems and help guide the development of more robust next-generation MLLMs.

Title: Sci-LoRA: Mixture of Scientific LoRAs for Cross-Domain Lay Paraphrasing

Authors: Ming Cheng, Jiaying Gong, Hoda Eldardiry
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.18867
Pdf URL: https://arxiv.org/pdf/2505.18867
Copy Paste: [[2505.18867]] Sci-LoRA: Mixture of Scientific LoRAs for Cross-Domain Lay Paraphrasing(https://arxiv.org/abs/2505.18867)
Keywords: language model
Abstract: Lay paraphrasing aims to make scientific information accessible to audiences without technical backgrounds. However, most existing studies focus on a single domain, such as biomedicine. With the rise of interdisciplinary research, it is increasingly necessary to comprehend knowledge spanning multiple technical fields. To address this, we propose Sci-LoRA, a model that leverages a mixture of LoRAs fine-tuned on multiple scientific domains. In particular, Sci-LoRA dynamically generates and applies weights for each LoRA, enabling it to adjust the impact of different domains based on the input text, without requiring explicit domain labels. To balance domain-specific knowledge and generalization across various domains, Sci-LoRA integrates information at both the data and model levels. This dynamic fusion enhances the adaptability and performance across various domains. Experimental results across twelve domains on five public datasets show that Sci-LoRA significantly outperforms state-of-the-art large language models and demonstrates flexible generalization and adaptability in cross-domain lay paraphrasing.
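The input-dependent weighting of several LoRAs described above can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the softmax gate, the helper names (`softmax`, `matvec`, `mix_lora_outputs`), and the tiny matrices are hypothetical and are not taken from the Sci-LoRA paper, which may compute its weights differently.

```python
import math

def softmax(scores):
    """Turn raw gate scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def mix_lora_outputs(x, base_weight, adapters, gate):
    """Return base(x) + sum_i w_i * B_i(A_i x), with w = softmax(gate x).

    `adapters` is a list of (A, B) low-rank pairs, one per domain.
    `gate` is a hypothetical linear scorer with one row per adapter.
    """
    weights = softmax(matvec(gate, x))
    out = matvec(base_weight, x)
    for w, (a, b) in zip(weights, adapters):
        delta = matvec(b, matvec(a, x))  # low-rank update B(Ax)
        out = [o + w * d for o, d in zip(out, delta)]
    return out, weights

# Toy setup: 2-dimensional input, two rank-1 "domain" adapters.
x = [1.0, 2.0]
base_weight = [[1.0, 0.0], [0.0, 1.0]]
adapters = [
    ([[1.0, 0.0]], [[1.0], [0.0]]),  # domain 1 acts on the first coordinate
    ([[0.0, 1.0]], [[0.0], [1.0]]),  # domain 2 acts on the second coordinate
]
gate = [[1.0, 0.0], [0.0, 1.0]]  # scores each domain from the input itself

mixed, weights = mix_lora_outputs(x, base_weight, adapters, gate)
print(weights, mixed)
```

Because the weights are computed from the input, no explicit domain label is needed at inference time, which is the property the abstract emphasizes.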

Title: CRMArena-Pro: Holistic Assessment of LLM Agents Across Diverse Business Scenarios and Interactions

Authors: Kung-Hsiang Huang, Akshara Prabhakar, Onkar Thorat, Divyansh Agarwal, Prafulla Kumar Choubey, Yixin Mao, Silvio Savarese, Caiming Xiong, Chien-Sheng Wu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18878
Pdf URL: https://arxiv.org/pdf/2505.18878
Copy Paste: [[2505.18878]] CRMArena-Pro: Holistic Assessment of LLM Agents Across Diverse Business Scenarios and Interactions(https://arxiv.org/abs/2505.18878)
Keywords: llm, prompt, agent
Abstract: While AI agents hold transformative potential in business, effective performance benchmarking is hindered by the scarcity of public, realistic business data on widely used platforms. Existing benchmarks often lack fidelity in their environments, data, and agent-user interactions, with limited coverage of diverse business scenarios and industries. To address these gaps, we introduce CRMArena-Pro, a novel benchmark for holistic, realistic assessment of LLM agents in diverse professional settings. CRMArena-Pro expands on CRMArena with nineteen expert-validated tasks across sales, service, and 'configure, price, and quote' processes, for both Business-to-Business and Business-to-Customer scenarios. It distinctively incorporates multi-turn interactions guided by diverse personas and robust confidentiality awareness assessments. Experiments reveal leading LLM agents achieve only around 58% single-turn success on CRMArena-Pro, with performance dropping significantly to approximately 35% in multi-turn settings. While Workflow Execution proves more tractable for top agents (over 83% single-turn success), other evaluated business skills present greater challenges. Furthermore, agents exhibit near-zero inherent confidentiality awareness; though targeted prompting can improve this, it often compromises task performance. These findings highlight a substantial gap between current LLM capabilities and enterprise demands, underscoring the need for advancements in multi-turn reasoning, confidentiality adherence, and versatile skill acquisition.

Title: Building a Functional Machine Translation Corpus for Kpelle

Authors: Kweku Andoh Yamoah, Jackson Weako, Emmanuel J. Dorley
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.18905
Pdf URL: https://arxiv.org/pdf/2505.18905
Copy Paste: [[2505.18905]] Building a Functional Machine Translation Corpus for Kpelle(https://arxiv.org/abs/2505.18905)
Keywords: language model
Abstract: In this paper, we introduce the first publicly available English-Kpelle dataset for machine translation, comprising over 2000 sentence pairs drawn from everyday communication, religious texts, and educational materials. By fine-tuning Meta's No Language Left Behind (NLLB) model on two versions of the dataset, we achieved BLEU scores of up to 30 in the Kpelle-to-English direction, demonstrating the benefits of data augmentation. Our findings align with NLLB-200 benchmarks on other African languages, underscoring Kpelle's potential for competitive performance despite its low-resource status. Beyond machine translation, this dataset enables broader NLP tasks, including speech recognition and language modelling. We conclude with a roadmap for future dataset expansion, emphasizing orthographic consistency, community-driven validation, and interdisciplinary collaboration to advance inclusive language technology development for Kpelle and other low-resourced Mande languages.

Title: Federated Retrieval-Augmented Generation: A Systematic Mapping Study

Authors: Abhijit Chakraborty, Chahana Dahal, Vivek Gupta
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2505.18906
Pdf URL: https://arxiv.org/pdf/2505.18906
Copy Paste: [[2505.18906]] Federated Retrieval-Augmented Generation: A Systematic Mapping Study(https://arxiv.org/abs/2505.18906)
Keywords: language model, retrieval-augmented generation
Abstract: Federated Retrieval-Augmented Generation (Federated RAG) combines Federated Learning (FL), which enables distributed model training without exposing raw data, with Retrieval-Augmented Generation (RAG), which improves the factual accuracy of language models by grounding outputs in external knowledge. As large language models are increasingly deployed in privacy-sensitive domains such as healthcare, finance, and personalized assistance, Federated RAG offers a promising framework for secure, knowledge-intensive natural language processing (NLP). To the best of our knowledge, this paper presents the first systematic mapping study of Federated RAG, covering literature published between 2020 and 2025. Following Kitchenham's guidelines for evidence-based software engineering, we develop a structured classification of research focuses, contribution types, and application domains. We analyze architectural patterns, temporal trends, and key challenges, including privacy-preserving retrieval, cross-client heterogeneity, and evaluation limitations. Our findings synthesize a rapidly evolving body of research, identify recurring design patterns, and surface open questions, providing a foundation for future work at the intersection of RAG and federated systems.

Title: SCRum-9: Multilingual Stance Classification over Rumours on Social Media

Authors: Yue Li, Jake Vasilakes, Zhixue Zhao, Carolina Scarton
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.18916
Pdf URL: https://arxiv.org/pdf/2505.18916
Copy Paste: [[2505.18916]] SCRum-9: Multilingual Stance Classification over Rumours on Social Media(https://arxiv.org/abs/2505.18916)
Keywords: llm
Abstract: We introduce SCRum-9, a multilingual dataset for Rumour Stance Classification, containing 7,516 tweet-reply pairs from X. SCRum-9 goes beyond existing stance classification datasets by covering more languages (9), linking examples to more fact-checked claims (2.1k), and including complex annotations from multiple annotators to account for intra- and inter-annotator variability. Annotations were made by at least three native speakers per language, totalling around 405 hours of annotation and 8,150 dollars in compensation. Experiments on SCRum-9 show that it is a challenging benchmark for both state-of-the-art LLMs (e.g. DeepSeek) and fine-tuned pre-trained models, motivating future work in this area.

Title: Benchmarking Large Language Models for Cyberbullying Detection in Real-World YouTube Comments

Authors: Amel Muminovic (International Balkan University)
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18927
Pdf URL: https://arxiv.org/pdf/2505.18927
Copy Paste: [[2505.18927]] Benchmarking Large Language Models for Cyberbullying Detection in Real-World YouTube Comments(https://arxiv.org/abs/2505.18927)
Keywords: language model, gpt, prompt
Abstract: As online platforms grow, comment sections increasingly host harassment that undermines user experience and well-being. This study benchmarks three leading large language models, OpenAI GPT-4.1, Google Gemini 1.5 Pro, and Anthropic Claude 3 Opus, on a corpus of 5,080 YouTube comments sampled from high-abuse threads in gaming, lifestyle, food vlog, and music channels. The dataset comprises 1,334 harmful and 3,746 non-harmful messages in English, Arabic, and Indonesian, annotated independently by two reviewers with substantial agreement (Cohen's kappa = 0.83). Using a unified prompt and deterministic settings, GPT-4.1 achieved the best overall balance with an F1 score of 0.863, precision of 0.887, and recall of 0.841. Gemini flagged the highest share of harmful posts (recall = 0.875) but its precision fell to 0.767 due to frequent false positives. Claude delivered the highest precision at 0.920 and the lowest false-positive rate of 0.022, yet its recall dropped to 0.720. Qualitative analysis showed that all three models struggle with sarcasm, coded insults, and mixed-language slang. These results underscore the need for moderation pipelines that combine complementary models, incorporate conversational context, and fine-tune for under-represented languages and implicit abuse. A de-identified version of the dataset and the full prompts are publicly released to promote reproducibility and further progress in automated content moderation.
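The precision and recall figures quoted above can be cross-checked with the standard harmonic-mean definition of F1. This is a minimal sketch; the function name `f1_score` is illustrative, and only GPT-4.1's F1 (0.863) is stated in the abstract, so the values for the other two models are derived here from the quoted precision and recall, not reported by the author.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall) pairs as quoted in the abstract.
reported = {
    "GPT-4.1": (0.887, 0.841),
    "Gemini 1.5 Pro": (0.767, 0.875),
    "Claude 3 Opus": (0.920, 0.720),
}

f1_by_model = {name: round(f1_score(p, r), 3) for name, (p, r) in reported.items()}
for name, f1 in f1_by_model.items():
    print(f"{name}: F1 = {f1}")
```

The GPT-4.1 value reproduces the reported 0.863, and the derived values for Gemini and Claude come out lower, consistent with the abstract's statement that GPT-4.1 had the best overall balance.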

Title: MetaMind: Modeling Human Social Thoughts with Metacognitive Multi-Agent Systems

Authors: Xuanming Zhang, Yuxuan Chen, Min-Hsuan Yeh, Yixuan Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.18943
Pdf URL: https://arxiv.org/pdf/2505.18943
Copy Paste: [[2505.18943]] MetaMind: Modeling Human Social Thoughts with Metacognitive Multi-Agent Systems(https://arxiv.org/abs/2505.18943)
Keywords: language model, llm, agent
Abstract: Human social interactions depend on the ability to infer others' unspoken intentions, emotions, and beliefs, a cognitive skill grounded in the psychological concept of Theory of Mind (ToM). While large language models (LLMs) excel in semantic understanding tasks, they struggle with the ambiguity and contextual nuance inherent in human communication. To bridge this gap, we introduce MetaMind, a multi-agent framework inspired by psychological theories of metacognition, designed to emulate human-like social reasoning. MetaMind decomposes social understanding into three collaborative stages: (1) a Theory-of-Mind Agent generates hypotheses about user mental states (e.g., intent, emotion), (2) a Domain Agent refines these hypotheses using cultural norms and ethical constraints, and (3) a Response Agent generates contextually appropriate responses while validating alignment with inferred intent. Our framework achieves state-of-the-art performance across three challenging benchmarks, with 35.7% improvement in real-world social scenarios and 6.2% gain in ToM reasoning. Notably, it enables LLMs to match human-level performance on key ToM tasks for the first time. Ablation studies confirm the necessity of all components, which showcase the framework's ability to balance contextual plausibility, social appropriateness, and user adaptation. This work advances AI systems toward human-like social intelligence, with applications in empathetic dialogue and culturally sensitive interactions. Code is available at this https URL.

Title: The Price of Format: Diversity Collapse in LLMs

Authors: Longfei Yun, Chenyang An, Zilong Wang, Letian Peng, Jingbo Shang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.18949
Pdf URL: https://arxiv.org/pdf/2505.18949
Copy Paste: [[2505.18949]] The Price of Format: Diversity Collapse in LLMs(https://arxiv.org/abs/2505.18949)
Keywords: language model, llm, prompt
Abstract: Instruction-tuned large language models (LLMs) employ structured templates, such as role markers and special tokens, to enforce format consistency during inference. However, we identify a critical limitation of such formatting: it induces a phenomenon we term diversity collapse, where the model generates semantically similar outputs for open-ended inputs, undermining creativity and variability. We systematically evaluate this effect across tasks like story completion and free-form generation, finding that (1) diversity collapse persists even under high-temperature sampling, and (2) structural tokens in templates significantly constrain the model's output space. To contextualize these findings, we fine-tune the same model using a range of structured prompts and then evaluate them across three axes: downstream task performance, alignment behavior, and output diversity. Our analysis shows that format consistency between fine-tuning and inference is crucial for structure-sensitive tasks (e.g., GSM8K, IFEval), but has marginal influence on knowledge-heavy tasks (e.g., MMLU, WebQuestions). In contrast, output diversity is primarily governed by the presence or absence of structural tokens, with minimal formatting yielding the most diverse outputs. These findings reveal that current prompting conventions, while beneficial for alignment, may inadvertently suppress output diversity, underscoring the need for diversity-aware prompt design and instruction tuning.

Title: BnMMLU: Measuring Massive Multitask Language Understanding in Bengali

Authors: Saman Sarker Joy
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.18951
Pdf URL: https://arxiv.org/pdf/2505.18951
Copy Paste: [[2505.18951]] BnMMLU: Measuring Massive Multitask Language Understanding in Bengali(https://arxiv.org/abs/2505.18951)
Keywords: language model, llm
Abstract: The Massive Multitask Language Understanding (MMLU) benchmark has been widely used to evaluate language models across various domains. However, existing MMLU datasets primarily focus on high-resource languages such as English, which leaves low-resource languages like Bengali underrepresented. In this paper, we introduce BnMMLU, a benchmark to evaluate the multitask language understanding capabilities of language models in Bengali. The dataset spans 23 domains, including science, humanities, mathematics and general knowledge, and is structured in a multiple-choice format to assess factual knowledge, application-based problem-solving and reasoning abilities of language models. It consists of 138,949 question-option pairs. We benchmark several proprietary and open-source large language models (LLMs) on the BnMMLU test set. Additionally, we annotate the test set with three cognitive categories (factual knowledge, procedural application, and reasoning) to gain deeper insights into model strengths and weaknesses across various cognitive tasks. The results reveal significant performance gaps, highlighting the need for improved pre-training and fine-tuning strategies tailored to Bengali data. We release the dataset and benchmark results to facilitate further research in this area.

Title: Evaluating AI for Finance: Is AI Credible at Assessing Investment Risk?

Authors: Divij Chawla, Ashita Bhutada, Do Duc Anh, Abhinav Raghunathan, Vinod SP, Cathy Guo, Dar Win Liew, Prannaya Gupta, Rishabh Bhardwaj, Rajat Bhardwaj, Soujanya Poria
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.18953
Pdf URL: https://arxiv.org/pdf/2505.18953
Copy Paste: [[2505.18953]] Evaluating AI for Finance: Is AI Credible at Assessing Investment Risk?(https://arxiv.org/abs/2505.18953)
Keywords: gpt
Abstract: We evaluate the credibility of leading AI models in assessing investment risk appetite. Our analysis spans proprietary (GPT-4, Claude 3.7, Gemini 1.5) and open-weight models (LLaMA 3.1/3.3, DeepSeek-V3, Mistral-small), using 1,720 user profiles constructed with 16 risk-relevant features across 10 countries and both genders. We observe significant variance across models in score distributions and demographic sensitivity. For example, GPT-4o assigns higher risk scores to Nigerian and Indonesian profiles, while LLaMA and DeepSeek show opposite gender tendencies in risk classification. While some models (e.g., GPT-4o, LLaMA 3.1) align closely with expected scores in low- and mid-risk ranges, none maintain consistent performance across regions and demographics. Our findings highlight the need for rigorous, standardized evaluations of AI systems in regulated financial contexts to prevent bias, opacity, and inconsistency in real-world deployment.

Title: System-1.5 Reasoning: Traversal in Language and Latent Spaces with Dynamic Shortcuts

Authors: Xiaoqiang Wang, Suyuchen Wang, Yun Zhu, Bang Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.18962
Pdf URL: https://arxiv.org/pdf/2505.18962
Copy Paste: [[2505.18962]] System-1.5 Reasoning: Traversal in Language and Latent Spaces with Dynamic Shortcuts(https://arxiv.org/abs/2505.18962)
Keywords: language model, llm, chain-of-thought
Abstract: Chain-of-thought (CoT) reasoning enables large language models (LLMs) to move beyond fast System-1 responses and engage in deliberative System-2 reasoning. However, this comes at the cost of significant inefficiency due to verbose intermediate output. Recent latent-space reasoning methods improve efficiency by operating on hidden states without decoding into language, yet they treat all steps uniformly, failing to distinguish critical deductions from auxiliary steps and resulting in suboptimal use of computational resources. In this paper, we propose System-1.5 Reasoning, an adaptive reasoning framework that dynamically allocates computation across reasoning steps through shortcut paths in latent space. Specifically, System-1.5 Reasoning introduces two types of dynamic shortcuts. The model depth shortcut (DS) adaptively reasons along the vertical depth by early exiting non-critical tokens through lightweight adapter branches, while allowing critical tokens to continue through deeper Transformer layers. The step shortcut (SS) reuses hidden states across the decoding steps to skip trivial steps and reason horizontally in latent space. Training System-1.5 Reasoning involves a two-stage self-distillation process: first distilling natural language CoT into latent-space continuous thought, and then distilling full-path System-2 latent reasoning into adaptive shortcut paths (System-1.5 Reasoning). Experiments on reasoning tasks demonstrate the superior performance of our method. For example, on GSM8K, System-1.5 Reasoning achieves reasoning performance comparable to traditional CoT fine-tuning methods while accelerating inference by over 20x and reducing token generation by 92.31% on average.

Title: Learning to Explain: Prototype-Based Surrogate Models for LLM Classification

Authors: Bowen Wei, Ziwei Zhu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.18970
Pdf URL: https://arxiv.org/pdf/2505.18970
Copy Paste: [[2505.18970]] Learning to Explain: Prototype-Based Surrogate Models for LLM Classification(https://arxiv.org/abs/2505.18970)
Keywords: language model, llm
Abstract: Large language models (LLMs) have demonstrated impressive performance on natural language tasks, but their decision-making processes remain largely opaque. Existing explanation methods either suffer from limited faithfulness to the model's reasoning or produce explanations that humans find difficult to understand. To address these challenges, we propose ProtoSurE, a novel prototype-based surrogate framework that provides faithful and human-understandable explanations for LLMs. ProtoSurE trains an interpretable-by-design surrogate model that aligns with the target LLM while utilizing sentence-level prototypes as human-understandable concepts. Extensive experiments show that ProtoSurE consistently outperforms SOTA explanation methods across diverse LLMs and datasets. Importantly, ProtoSurE demonstrates strong data efficiency, requiring relatively few training examples to achieve good performance, making it practical for real-world applications.

Title: Hierarchical Mamba Meets Hyperbolic Geometry: A New Paradigm for Structured Language Embeddings

Authors: Sarang Patil, Ashish Parmanand Pandey, Ioannis Koutis, Mengjia Xu
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.18973
Pdf URL: https://arxiv.org/pdf/2505.18973
Copy Paste: [[2505.18973]] Hierarchical Mamba Meets Hyperbolic Geometry: A New Paradigm for Structured Language Embeddings(https://arxiv.org/abs/2505.18973)
Keywords: language model
Abstract: Selective state-space models have achieved great success in long-sequence modeling. However, their capacity for language representation, especially in complex hierarchical reasoning tasks, remains underexplored. Most large language models rely on flat Euclidean embeddings, limiting their ability to capture latent hierarchies. To address this limitation, we propose Hierarchical Mamba (HiM), integrating efficient Mamba2 with exponential growth and curved nature of hyperbolic geometry to learn hierarchy-aware language embeddings for deeper linguistic understanding. Mamba2-processed sequences are projected to the Poincare ball (via tangent-based mapping) or Lorentzian manifold (via cosine and sine-based mapping) with "learnable" curvature, optimized with a combined hyperbolic loss. Our HiM model facilitates the capture of relational distances across varying hierarchical levels, enabling effective long-range reasoning. This makes it well-suited for tasks like mixed-hop prediction and multi-hop inference in hierarchical classification. We evaluated our HiM with four linguistic and medical datasets for mixed-hop prediction and multi-hop inference tasks. Experimental results demonstrated that: 1) Both HiM models effectively capture hierarchical relationships for four ontological datasets, surpassing Euclidean baselines. 2) HiM-Poincare captures fine-grained semantic distinctions with higher h-norms, while HiM-Lorentz provides more stable, compact, and hierarchy-preserving embeddings favoring robustness over detail.
摘要：选择性状态空间模型在长期序列建模中取得了巨大成功。但是，他们的语言表示能力，尤其是在复杂的层次结构推理任务中，仍然没有得到充实的态度。大多数大型语言模型都依赖于平坦的欧几里得嵌入，从而限制了它们捕获潜在层次结构的能力。为了解决这一限制，我们提出了层次的mamba（他），将有效的mAMBA2与双曲几何形状的指数增长和弯曲的性质相结合，以学习层次结构含义的语言嵌入，以更深入的语言理解。 MAMBA2所处理的序列被投影到Poincare Ball（通过基于切线的映射）或Lorentzian歧管（通过余弦和基于正弦的映射），并具有“可学习的”曲率，并用组合的双曲线损失进行了优化。我们的HIM模型促进了跨不同层次结构层面的关系距离的捕获，从而实现了有效的远程推理。这使其非常适合诸如混合跳跃预测和分层分类中的多跳推断之类的任务。我们使用四个语言和医学数据集评估了他的他，以进行混合跳跃预测和多跳推理任务。实验结果表明：1）他两个模型有效地捕获了四个本体论数据集的层次关系，超过了欧几里得基线。 2）他的poincare用更高的h-norms捕获了细颗粒的语义区别，而HIM-Lorentz提供了更稳定，紧凑和层次结构的嵌入嵌入，偏爱稳健性而不是细节。
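
A note on the projection step in the HiM abstract above: one common way to map a Euclidean vector into the Poincare ball is the exponential map at the origin, which rescales the vector by a tanh factor. The sketch below assumes that standard map and a fixed curvature parameter `c`; the function name `expmap0` is hypothetical, and the abstract does not confirm that this is the exact mapping HiM uses or how its learnable curvature is parameterized.

```python
import math


def expmap0(v, c=1.0):
    """Exponential map at the origin of a Poincare ball with curvature -c.

    Assumption: this is the textbook map, not necessarily the one used by HiM.
    """
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        # The origin maps to itself.
        return list(v)
    sqrt_c = math.sqrt(c)
    # tanh keeps the scaled norm at or below 1/sqrt(c), the radius of the ball.
    scale = math.tanh(sqrt_c * norm) / (sqrt_c * norm)
    return [scale * x for x in v]


point = expmap0([3.0, 4.0])
print(math.sqrt(sum(x * x for x in point)))  # norm of the mapped point
```

Vectors with a large Euclidean norm land close to the boundary of the ball, where hyperbolic distances grow quickly; this is the property that makes such embeddings attractive for tree-like hierarchies.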

Title: AI4Math: A Native Spanish Benchmark for University-Level Mathematical Reasoning in Large Language Models

Authors: Miguel Angel Peñaloza Perez (1 and 2 and 5), Bruno Lopez Orozco (1 and 2 and 3), Jesus Tadeo Cruz Soto (1 and 2 and 4), Michelle Bruno Hernandez (1 and 2), Miguel Angel Alvarado Gonzalez (1 and 2), Sandra Malagon (1 and 2) ((1) Carreras con Impacto, (2) Aixo Lab, (3) Facultad de Ciencias UNAM Mexico, (4) Facultad de Matematicas Universidad Veracruzana Mexico, (5) Centro de Investigación Cientifica y de Educacion Superior de Ensenada Baja California Mexico)
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.18978
Pdf URL: https://arxiv.org/pdf/2505.18978
Copy Paste: [[2505.18978]] AI4Math: A Native Spanish Benchmark for University-Level Mathematical Reasoning in Large Language Models(https://arxiv.org/abs/2505.18978)
Keywords: language model, gpt
Abstract: Existing mathematical reasoning benchmarks are predominantly English-only or translation-based, which can introduce semantic drift and mask language-specific reasoning errors. To address this, we present AI4Math, a benchmark of 105 original university-level math problems natively authored in Spanish. The dataset spans seven advanced domains (Algebra, Calculus, Geometry, Probability, Number Theory, Combinatorics, and Logic), and each problem is accompanied by a step-by-step human solution. We evaluate six large language models (GPT-4o, GPT-4o mini, o3-mini, LLaMA 3.3 70B, DeepSeek R1 685B, and DeepSeek V3 685B) under four configurations: zero-shot and chain-of-thought, each in Spanish and English. The top models (o3-mini, DeepSeek R1 685B, DeepSeek V3 685B) achieve over 70% accuracy, whereas LLaMA 3.3 70B and GPT-4o mini remain below 40%. Most models show no significant performance drop between languages, with GPT-4o even performing better on Spanish problems in the zero-shot setting. Geometry, Combinatorics, and Probability questions remain persistently challenging for all models. These results highlight the need for native-language benchmarks and domain-specific evaluations to reveal reasoning failures not captured by standard metrics.
摘要：现有的数学推理基准主要仅是英语或基于翻译的基准，它可以引入语义漂移和掩盖语言特定的推理错误。为了解决这个问题，我们介绍了AI4MATH，这是105个原始大学级数学问题的基准。数据集跨越了七个高级域（代数，微积分，几何，概率，数字理论，组合学和逻辑），每个问题都伴随着逐步的人类解决方案。我们评估了六种大型语言模型GPT 4O，GPT 4O MINI，O3 MINI，LLAMA 3.3 70B，DEEPSEEK R1 685B和DEEPSEEK V3 685B在四种配置下：零镜头和思想链，每种都用西班牙语和英语。顶级型号（O3 Mini，DeepSeek R1 685B，DeepSeek V3 685B）的精度超过70％，而Llama 3.3 70B和GPT-4O Mini保持在40％以下。大多数模型之间的性能没有明显的性能下降，而GPT 4O甚至在零射击设置中在西班牙问题上表现更好。对于所有模型，几何，组合和概率问题仍然持续具有挑战性。这些结果突出了对本地语言基准和特定于域的评估的需求，以揭示标准指标未捕获的推理故障。

Title: FiLLM -- A Filipino-optimized Large Language Model based on Southeast Asia Large Language Model (SEALLM)

Authors: Carlos Jude G. Maminta, Isaiah Job Enriquez, Deandre Nigel Nunez, Michael B. Dela Fuente (Institution College of Computer and Information Sciences, Polytechnic University of the Philippines, Sta. Mesa, Manila)
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18995
Pdf URL: https://arxiv.org/pdf/2505.18995
Copy Paste: [[2505.18995]] FiLLM -- A Filipino-optimized Large Language Model based on Southeast Asia Large Language Model (SEALLM)(https://arxiv.org/abs/2505.18995)
Keywords: language model, llm
Abstract: This study presents FiLLM, a Filipino-optimized large language model designed to enhance natural language processing (NLP) capabilities in the Filipino language. Built upon the SeaLLM-7B 2.5 model, FiLLM leverages Low-Rank Adaptation (LoRA) fine-tuning to optimize memory efficiency while maintaining task-specific performance. The model was trained and evaluated on diverse Filipino datasets to address key NLP tasks, including Named Entity Recognition (NER), Part-of-Speech (POS) tagging, Dependency Parsing, and Text Summarization. Performance comparisons with the CalamanCy model were conducted using F1 Score, Precision, Recall, Compression Rate, and Keyword Overlap metrics. Results indicate that CalamanCy outperforms FiLLM in several aspects, demonstrating its effectiveness in processing Filipino text with improved linguistic comprehension and adaptability. This research contributes to the advancement of Filipino NLP applications by providing an optimized, efficient, and scalable language model tailored for local linguistic needs.
摘要：这项研究介绍了Fillm是一种菲律宾优化的大语言模型，旨在增强菲律宾语言的自然语言处理（NLP）功能。 Fillm建立在Seallm-7b 2.5型号的基础上，利用低级适应（LORA）微调来优化记忆效率，同时保持特定于任务的性能。该模型在菲律宾的各种数据集上进行了培训和评估，以解决关键的NLP任务，包括指定的实体识别（NER），语音部分标签（POS）标记，依赖关系解析和文本摘要。使用F1得分，精度，召回，压缩率和关键字重叠度量指标进行了与灾难模型的性能比较。结果表明，灾难的表现优于几个方面，这表明了其在处理菲律宾文本方面的有效性，并改善了语言理解和适应性。这项研究通过提供针对当地语言需求的优化，高效且可扩展的语言模型来帮助菲律宾NLP应用程序的发展。
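
To make the memory argument behind LoRA in the FiLLM abstract above concrete: LoRA freezes a weight matrix of shape d_out x d_in and trains only a low-rank update B·A, with B of shape d_out x r and A of shape r x d_in. The sketch below just counts trainable parameters for one layer. The layer size and rank are illustrative assumptions, not values reported for FiLLM, and `lora_param_counts` is a hypothetical helper.

```python
def lora_param_counts(d_in, d_out, rank):
    """Return (full fine-tuning params, LoRA trainable params) for one linear layer."""
    full = d_in * d_out            # every entry of W is trainable
    lora = rank * (d_in + d_out)   # entries of A (rank x d_in) plus B (d_out x rank)
    return full, lora


# Hypothetical 4096 x 4096 projection with rank 8 (assumed sizes).
full, lora = lora_param_counts(4096, 4096, 8)
print(full, lora, full // lora)
```

Under these assumed sizes, the low-rank update trains a small fraction of the parameters of the full matrix, which is where the memory savings during fine-tuning come from.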

Title: VerIPO: Cultivating Long Reasoning in Video-LLMs via Verifier-Gudied Iterative Policy Optimization

Authors: Yunxin Li, Xinyu Chen, Zitao Li, Zhenyu Liu, Longyue Wang, Wenhan Luo, Baotian Hu, Min Zhang
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2505.19000
Pdf URL: https://arxiv.org/pdf/2505.19000
Copy Paste: [[2505.19000]] VerIPO: Cultivating Long Reasoning in Video-LLMs via Verifier-Gudied Iterative Policy Optimization(https://arxiv.org/abs/2505.19000)
Keywords: language model, llm, chain-of-thought
Abstract: Applying Reinforcement Learning (RL) to Video Large Language Models (Video-LLMs) shows significant promise for complex video reasoning. However, popular Reinforcement Fine-Tuning (RFT) methods, such as outcome-based Group Relative Policy Optimization (GRPO), are limited by data preparation bottlenecks (e.g., noise or high cost) and exhibit unstable improvements in the quality of long chain-of-thoughts (CoTs) and downstream performance. To address these limitations, we propose VerIPO, a Verifier-guided Iterative Policy Optimization method designed to gradually improve video LLMs' capacity for generating deep, long-term reasoning chains. The core component is the Rollout-Aware Verifier, positioned between the GRPO and Direct Preference Optimization (DPO) training phases to form the GRPO-Verifier-DPO training loop. This verifier leverages small LLMs as a judge to assess the reasoning logic of rollouts, enabling the construction of high-quality contrastive data, including reflective and contextually consistent CoTs. These curated preference samples drive the efficient DPO stage (7x faster than GRPO), leading to marked improvements in reasoning chain quality, especially in terms of length and contextual consistency. This training loop benefits from GRPO's expansive search and DPO's targeted optimization. Experimental results demonstrate: 1) Significantly faster and more effective optimization compared to standard GRPO variants, yielding superior performance; 2) Our trained models exceed the direct inference of large-scale instruction-tuned Video-LLMs, producing long and contextually consistent CoTs on diverse video reasoning tasks; and 3) Our model with one iteration outperforms powerful LMMs (e.g., Kimi-VL) and long reasoning models (e.g., Video-R1), highlighting its effectiveness and stability.
摘要：将强化学习（RL）应用于视频大语言模型（视频-LLM）显示出复杂的视频推理的重要希望。 However, popular Reinforcement Fine-Tuning (RFT) methods, such as outcome-based Group Relative Policy Optimization (GRPO), are limited by data preparation bottlenecks (e.g., noise or high cost) and exhibit unstable improvements in the quality of long chain-of-thoughts (CoTs) and downstream this http URL address these limitations, we propose VerIPO, a Verifier-guided Iterative Policy Optimization method designed to gradually提高Video LLM产生深厚的长期推理链的能力。核心组件是推出的验证者，位于GRPO和直接偏好优化（DPO）训练阶段之间，以形成GRPO-Verifier-DPO训练环。该验证者利用小型LLM作为法官评估推出的推理逻辑，从而实现高质量的对比度数据的构建，包括反射性和上下文一致的COTS。这些策划的偏好样品推动了有效的DPO阶段（比GRPO快7倍），从而导致推理链质量的明显改善，尤其是在长度和上下文一致性方面。该训练循环受益于GRPO的广泛搜索和DPO的目标优化。实验结果表明：1）与标准GRPO变体相比，显着更快，更有效的优化，从而产生出色的性能； 2）我们训练有素的模型超过了大规模指导调整的视频插件的直接推断，在各种视频推理任务上产生了长期且上下文一致的COTS； 3）我们的模型具有一个迭代的表现优于强大的LMM（例如KIMI-VL）和较长的推理模型（例如，Video-R1），突出了其有效性和稳定性。

Title: Efficient Data Selection at Scale via Influence Distillation

Authors: Mahdi Nikdan, Vincent Cohen-Addad, Dan Alistarh, Vahab Mirrokni
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.19051
Pdf URL: https://arxiv.org/pdf/2505.19051
Copy Paste: [[2505.19051]] Efficient Data Selection at Scale via Influence Distillation(https://arxiv.org/abs/2505.19051)
Keywords: language model, llm
Abstract: Effective data selection is critical for efficient training of modern Large Language Models (LLMs). This paper introduces Influence Distillation, a novel, mathematically-justified framework for data selection that employs second-order information to optimally weight training samples. By distilling each sample's influence on a target distribution, our method assigns model-specific weights that are used to select training data for LLM fine-tuning, guiding it toward strong performance on the target domain. We derive these optimal weights for both Gradient Descent and Adam optimizers. To ensure scalability and reduce computational cost, we propose a landmark-based approximation: influence is precisely computed for a small subset of "landmark" samples and then efficiently propagated to all other samples to determine their weights. We validate Influence Distillation by applying it to instruction tuning on the Tulu V2 dataset, targeting a range of tasks including GSM8k, SQuAD, and MMLU, across several models from the Llama and Qwen families. Experiments show that Influence Distillation matches or outperforms state-of-the-art performance while achieving up to 3.5x faster selection.
摘要：有效的数据选择对于现代大型语言模型（LLM）的有效培训至关重要。本文介绍了“影响蒸馏”，这是一种用于数据选择的新颖，数学上合并的框架，该框架采用二阶信息来最佳的体重训练样本。通过提炼每个样本对目标分布的影响，我们的方法分配了特定于模型的权重，这些权重用于选择LLM微调的训练数据，从而将其引导到目标域上的强劲性能。我们为梯度下降和亚当优化器提供了这些最佳权重。为了确保可伸缩性并降低计算成本，我们提出了一个$ \ textIt {基于里程碑的近似} $：精确计算了一小部分“地标”样本的子集，然后有效地传播给所有其他样品以确定其权重。我们通过将其应用于Tulu V2数据集上的指令调整来验证蒸馏，以针对Llama和Qwen家族的几种型号的一系列任务。实验表明，影响蒸馏的匹配或胜过最先进的性能，同时获得$ 3.5 \ times $ $ $的选择。

Title: An Embarrassingly Simple Defense Against LLM Abliteration Attacks

Authors: Harethah Abu Shairah, Hasan Abed Al Kader Hammoud, Bernard Ghanem, George Turkiyyah
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.19056
Pdf URL: https://arxiv.org/pdf/2505.19056
Copy Paste: [[2505.19056]] An Embarrassingly Simple Defense Against LLM Abliteration Attacks(https://arxiv.org/abs/2505.19056)
Keywords: language model, llm, prompt, chat
Abstract: Large language models (LLMs) are typically aligned to comply with safety guidelines by refusing harmful instructions. A recent attack, termed abliteration, isolates and suppresses the single latent direction most responsible for refusal behavior, enabling the model to generate unethical content. We propose a defense that modifies how models generate refusals. We construct an extended-refusal dataset that pairs harmful prompts with full responses that justify the reason for refusal. We then fine-tune Llama-2-7B-Chat and Qwen2.5-Instruct (1.5B and 3B parameters) on our extended-refusal dataset, and evaluate the resulting systems on a set of harmful prompts. In our experiments, extended-refusal models maintain high refusal rates, dropping by at most 10%, whereas baseline models' refusal rates drop by 70-80% after abliteration. A broad evaluation of safety and utility shows that extended-refusal fine-tuning neutralizes the abliteration attack while preserving general performance.
摘要：大型语言模型（LLM）通常通过拒绝有害说明来符合安全指南。最近的攻击称为消融，隔离并抑制了拒绝行为最负责的单个潜在方向，从而使模型能够产生不道德的内容。我们提出了一种修改模型如何产生拒绝的防御。我们构建了一个扩展的杂物数据集，其中包含有害提示，并充分响应，这证明了拒绝的原因。然后，我们在我们的扩展杂货数据集中微调Llama-2-7b-chat和Qwen2.5-Instruct（1.5b和3b参数），并在一组有害提示中评估所得系统。在我们的实验中，扩展的杂种模型保持较高的拒绝率，最多下降了10％，而基线模型的拒绝率下降了70-80％。对安全性和公用事业的广泛评估表明，扩展杂货微调可以中和消融攻击，同时保留一般绩效。
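
The abstract above describes abliteration as suppressing a single latent direction. As a minimal linear-algebra sketch of what suppressing a direction can mean, the snippet below removes the component of a vector along a given direction by orthogonal projection. Treating suppression as projection is an assumption here, the vectors are toy values rather than model activations, and `ablate_direction` is a hypothetical name.

```python
def ablate_direction(hidden, direction):
    """Remove the component of `hidden` that lies along `direction`.

    Assumption: suppression is modeled as orthogonal projection; toy vectors only.
    """
    norm = sum(d * d for d in direction) ** 0.5
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    unit = [d / norm for d in direction]
    # Scalar projection of the hidden vector onto the unit direction.
    coeff = sum(h * u for h, u in zip(hidden, unit))
    return [h - coeff * u for h, u in zip(hidden, unit)]


result = ablate_direction([3.0, 4.0], [1.0, 0.0])
print(result)
```

A behavior that depends on one direction disappears under such a projection. This is consistent with the abstract's point that longer, justified refusals are harder to remove this way.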

Title: UNCERTAINTY-LINE: Length-Invariant Estimation of Uncertainty for Large Language Models

Authors: Roman Vashurin, Maiya Goloburda, Preslav Nakov, Maxim Panov
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.19060
Pdf URL: https://arxiv.org/pdf/2505.19060
Copy Paste: [[2505.19060]] UNCERTAINTY-LINE: Length-Invariant Estimation of Uncertainty for Large Language Models(https://arxiv.org/abs/2505.19060)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have become indispensable tools across various applications, making it more important than ever to ensure the quality and the trustworthiness of their outputs. This has led to growing interest in uncertainty quantification (UQ) methods for assessing the reliability of LLM outputs. Many existing UQ techniques rely on token probabilities, which inadvertently introduces a bias with respect to the length of the output. While some methods attempt to account for this, we demonstrate that such biases persist even in length-normalized approaches. To address the problem, here we propose UNCERTAINTY-LINE (Length-INvariant Estimation), a simple debiasing procedure that regresses uncertainty scores on output length and uses the residuals as corrected, length-invariant estimates. Our method is post-hoc, model-agnostic, and applicable to a range of UQ measures. Through extensive evaluation on machine translation, summarization, and question-answering tasks, we demonstrate that UNCERTAINTY-LINE consistently improves uncertainty estimates over even nominally length-normalized UQ methods across multiple metrics and models.
摘要：大型语言模型（LLM）已成为各种应用程序中必不可少的工具，使得比以往任何时候都重要，以确保其产出的质量和可信度。这导致人们对评估LLM输出可靠性的不确定性定量（UQ）方法的兴趣日益增加。许多现有的UQ技术依赖于令牌概率，这无意间引入了有关输出长度的偏见。尽管某些方法试图考虑到这一点，但我们证明了这种偏见即使在长度归一化的方法中也持续存在。为了解决该问题，我们在这里提出不确定性线：（长度不变估计），这是一个简单的compiasing程序，可在输出长度上回归不确定性评分，并将残差使用为校正后的长度不变估计。我们的方法是事后，不可或缺的，并且适用于一系列UQ测量。通过对机器翻译，汇总和提问任务的广泛评估，我们证明了不确定性线：即使在多个指标和模型之间，即使是名义上长度归一化的UQ方法不确定性估计值也始终如一地改进。
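
The debiasing procedure in the UNCERTAINTY-LINE abstract above (regress uncertainty scores on output length, keep the residuals) can be sketched in a few lines. This version assumes an ordinary least-squares fit with a linear length term. The abstract does not state the exact regression form, so the linear choice, the toy numbers, and the name `length_debias` are all assumptions.

```python
def length_debias(scores, lengths):
    """Return residuals of a least-squares fit of scores on output length.

    Assumption: a simple linear regression with intercept.
    """
    n = len(scores)
    mean_s = sum(scores) / n
    mean_l = sum(lengths) / n
    var_l = sum((l - mean_l) ** 2 for l in lengths)
    cov = sum((l - mean_l) * (s - mean_s) for l, s in zip(lengths, scores))
    # With no length variation there is nothing to regress on.
    slope = cov / var_l if var_l > 0 else 0.0
    intercept = mean_s - slope * mean_l
    # Residual = observed score minus the part predicted from length alone.
    return [s - (intercept + slope * l) for s, l in zip(scores, lengths)]


# Toy data: raw uncertainty grows with output length.
lengths = [5, 10, 20, 40, 80]
scores = [1.0, 2.5, 3.0, 7.0, 12.0]
print(length_debias(scores, lengths))
```

By construction, the residuals of such a fit are linearly uncorrelated with length, which is the sense in which the corrected scores are length-invariant in this sketch.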

Title: Towards Harmonized Uncertainty Estimation for Large Language Models

Authors: Rui Li, Jing Long, Muge Qi, Heming Xia, Lei Sha, Peiyi Wang, Zhifang Sui
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.19073
Pdf URL: https://arxiv.org/pdf/2505.19073
Copy Paste: [[2505.19073]] Towards Harmonized Uncertainty Estimation for Large Language Models(https://arxiv.org/abs/2505.19073)
Keywords: language model, llm
Abstract: To facilitate robust and trustworthy deployment of large language models (LLMs), it is essential to quantify the reliability of their generations through uncertainty estimation. While recent efforts have made significant advancements by leveraging the internal logic and linguistic features of LLMs to estimate uncertainty scores, our empirical analysis shows that these methods fail to strike a harmonized estimation across indication, balance, and calibration, which hinders their broader capability for accurate uncertainty estimation. To address this challenge, we propose CUE (Corrector for Uncertainty Estimation), a straightforward yet effective method that employs a lightweight model trained on data aligned with the target LLM's performance to adjust uncertainty scores. Comprehensive experiments across diverse models and tasks demonstrate its effectiveness, with consistent improvements of up to 60% over existing methods.
摘要：为了促进大型语言模型（LLMS）的稳健和值得信赖的部署，必须通过不确定性估计来量化世代的可靠性。尽管最近的努力通过利用LLM的内部逻辑和语言特征来估计不确定性得分，从而取得了重大进步，但我们的经验分析突出了这些方法的陷阱，以实现指示，平衡和校准之间的一致估计，这阻碍了其更广泛的能力，以进行准确的不确定性估计。为了应对这一挑战，我们提出了提示（不确定性估计的校正器）：一种简单而有效的方法，采用了一种轻巧的模型，该模型在与目标LLM的性能一致的数据上训练以调整不确定性得分。跨不同模型和任务的综合实验证明了其有效性，这比现有方法的一致性高达60％。

Title: ReadBench: Measuring the Dense Text Visual Reading Ability of Vision-Language Models

Authors: Benjamin Clavié, Florian Brand
Subjects: cs.CL, cs.AI, cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2505.19091
Pdf URL: https://arxiv.org/pdf/2505.19091
Copy Paste: [[2505.19091]] ReadBench: Measuring the Dense Text Visual Reading Ability of Vision-Language Models(https://arxiv.org/abs/2505.19091)
Keywords: language model, prompt
Abstract: Recent advancements in Large Vision-Language Models (VLMs) have greatly enhanced their capability to jointly process text and images. However, despite extensive benchmarks evaluating visual comprehension (e.g., diagrams, color schemes, OCR tasks), there is limited assessment of VLMs' ability to read and reason about text-rich images effectively. To fill this gap, we introduce ReadBench, a multimodal benchmark specifically designed to evaluate the reading comprehension capabilities of VLMs. ReadBench transposes contexts from established text-only benchmarks into images of text while keeping textual prompts and questions intact. Evaluating leading VLMs with ReadBench, we find minimal-but-present performance degradation on short text-image inputs, while performance sharply declines for longer, multi-page contexts. Our experiments further reveal that text resolution has negligible effects on multimodal performance. These findings highlight needed improvements in VLMs, particularly their reasoning over visually presented extensive textual content, a capability critical for practical applications. ReadBench is available at this https URL.
摘要：大型视觉模型（VLM）的最新进展极大地增强了它们共同处理文本和图像的能力。然而，尽管广泛评估视觉理解的基准（例如，图，配色方案，OCR任务...），但VLMS对文本富裕图像有效阅读和推理的能力的评估有限。为了填补这一空白，我们介绍了登录，这是一种专门设计用于评估VLM的阅读理解能力的多模式基准。 dechbench将上下文从既定的文本基准测试中的上下文转换为文本图像，同时保持文本提示和问题完好无损。通过读取台评估领先的VLM，我们发现在简短的文本图像输入中最少但存在的性能降低，而性能急剧下降了更长的多页上下文。我们的实验进一步表明，文本分辨率对多模式性能的影响可忽略不计。这些发现突出显示了VLM中需要改进，尤其是它们在视觉上呈现的广泛文本内容上的推理，这对于实际应用至关重要。可在此HTTPS URL上找到登录。

Title: ASPO: Adaptive Sentence-Level Preference Optimization for Fine-Grained Multimodal Reasoning

Authors: Yeyuan Wang, Dehong Gao, Rujiao Long, Lei Yi, Linbo Jin, Libin Yang, Xiaoyan Cai
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2505.19100
Pdf URL: https://arxiv.org/pdf/2505.19100
Copy Paste: [[2505.19100]] ASPO: Adaptive Sentence-Level Preference Optimization for Fine-Grained Multimodal Reasoning(https://arxiv.org/abs/2505.19100)
Keywords: language model, llm
Abstract: Direct Preference Optimization (DPO) has gained significant attention for its simplicity and computational efficiency in aligning large language models (LLMs). Recent advancements have extended DPO to multimodal scenarios, achieving strong performance. However, traditional DPO relies on binary preference optimization, rewarding or penalizing entire responses without considering fine-grained segment correctness, leading to suboptimal solutions. The root of this issue lies in the absence of fine-grained supervision during the optimization process. To address this, we propose Adaptive Sentence-level Preference Optimization (ASPO), which evaluates individual sentences for more precise preference optimization. By dynamically calculating adaptive rewards at the sentence level based on model predictions, ASPO enhances response content assessment without additional models or parameters. This significantly improves the alignment of multimodal features. Extensive experiments show that ASPO substantially enhances the overall performance of multimodal models.
摘要：直接偏好优化（DPO）在对齐大语言模型（LLMS）中的简单性和计算效率方面已引起了人们的关注。最近的进步将DPO扩展到了多模式方案，从而实现了强劲的性能。但是，传统的DPO依赖于二进制偏好优化，奖励或惩罚整个响应，而无需考虑细分细分市场正确性，从而导致了次优的解决方案。这个问题的根源在于在优化过程中没有细粒度的监督。为了解决这个问题，我们提出了自适应句子级的偏好优化（ASPO），该句子评估了单个句子以获得更精确的偏好优化。通过基于模型预测在句子级别上动态计算自适应奖励，ASPO可以增强响应内容评估，而无需其他模型或参数。这显着改善了多模式特征的比对。广泛的实验表明，ASPO大大提高了多模型模型的整体性能。
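
For context on the ASPO abstract above, the standard response-level DPO objective that it contrasts against can be written as a short function. Each response contributes a single summed log-probability, so every sentence in it is rewarded or penalized together. This sketch shows only that baseline loss. ASPO's sentence-level adaptive rewards are not specified in the abstract and are not reproduced here, and the numeric inputs are made-up log-probabilities.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair, from summed response log-probs."""
    # Implicit reward margin between the chosen and the rejected response.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)), written as softplus(-margin) for numerical stability.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical log-probabilities: the policy prefers the chosen response more than
# the reference does, so the loss falls below log(2).
print(dpo_loss(-1.0, -5.0, -2.0, -2.0))
```

Because the loss sees only one scalar per response, a mostly correct answer with one wrong sentence is treated the same as a uniformly wrong one, which is the granularity issue the abstract raises.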

Title: CCHall: A Novel Benchmark for Joint Cross-Lingual and Cross-Modal Hallucinations Detection in Large Language Models

Authors: Yongheng Zhang, Xu Liu, Ruoxi Zhou, Qiguang Chen, Hao Fei, Wenpeng Lu, Libo Qin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.19108
Pdf URL: https://arxiv.org/pdf/2505.19108
Copy Paste: [[2505.19108]] CCHall: A Novel Benchmark for Joint Cross-Lingual and Cross-Modal Hallucinations Detection in Large Language Models(https://arxiv.org/abs/2505.19108)
Keywords: language model, llm, hallucination
Abstract: Investigating hallucination issues in large language models (LLMs) within cross-lingual and cross-modal scenarios can greatly advance the large-scale deployment in real-world applications. Nevertheless, the current studies are limited to a single scenario, either cross-lingual or cross-modal, leaving a gap in the exploration of hallucinations in the joint cross-lingual and cross-modal scenarios. Motivated by this, we introduce a novel joint Cross-lingual and Cross-modal Hallucinations benchmark (CCHall) to fill this gap. Specifically, CCHall simultaneously incorporates both cross-lingual and cross-modal hallucination scenarios, which can be used to assess the cross-lingual and cross-modal capabilities of LLMs. Furthermore, we conduct a comprehensive evaluation on CCHall, exploring both mainstream open-source and closed-source LLMs. The experimental results highlight that current LLMs still struggle with CCHall. We hope CCHall can serve as a valuable resource to assess LLMs in joint cross-lingual and cross-modal scenarios.
摘要：在跨语言和跨模式场景中调查大语言模型（LLM）中的幻觉问题可以大大推动现实应用程序中的大规模部署。然而，目前的研究仅限于跨语言或跨模式的单一场景，在探索幻觉的幻觉和跨模式场景中留下了差距。在此激励的情况下，我们引入了一种新型的跨语言和跨模式幻觉基准（CCHALL）来填补这一空白。具体而言，CCHALL同时结合了跨语义和跨模式幻觉场景，可用于评估LLMS的跨语性和跨模式功能。此外，我们对CCHALL进行了全面的评估，探索了主流开源和封闭源LLM。实验结果凸显了当前的LLM仍在CCHALL中遇到困难。我们希望Cchall可以作为评估联合跨语言和跨模式场景中LLM的宝贵资源。

Title: Self-Critique Guided Iterative Reasoning for Multi-hop Question Answering

Authors: Zheng Chu, Huiming Fan, Jingchang Chen, Qianyu Wang, Mingda Yang, Jiafeng Liang, Zhongjie Wang, Hao Li, Guo Tang, Ming Liu, Bing Qin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.19112
Pdf URL: https://arxiv.org/pdf/2505.19112
Copy Paste: [[2505.19112]] Self-Critique Guided Iterative Reasoning for Multi-hop Question Answering(https://arxiv.org/abs/2505.19112)
Keywords: language model, llm
Abstract: Although large language models (LLMs) have demonstrated remarkable reasoning capabilities, they still face challenges in knowledge-intensive multi-hop reasoning. Recent work explores iterative retrieval to address complex problems. However, the lack of intermediate guidance often results in inaccurate retrieval and flawed intermediate reasoning, leading to incorrect reasoning. To address these issues, we propose Self-Critique Guided Iterative Reasoning (SiGIR), which uses self-critique feedback to guide the iterative reasoning process. Specifically, through end-to-end training, we enable the model to iteratively address complex problems via question decomposition. Additionally, the model is able to self-evaluate its intermediate reasoning steps. During iterative reasoning, the model engages in branching exploration and employs self-evaluation to guide the selection of promising reasoning trajectories. Extensive experiments on three multi-hop reasoning datasets demonstrate the effectiveness of our proposed method, surpassing the previous SOTA by 8.6%. Furthermore, our thorough analysis offers insights for future research. Our code, data, and models are available at GitHub: this https URL.
摘要：尽管大型语言模型（LLM）表现出了非凡的推理能力，但它们仍然面临着知识密集的多跳上推理的挑战。最近的工作探讨了解决复杂问题的迭代检索。但是，缺乏中间指导通常会导致检索不正确和中间推理，从而导致不正确的推理。为了解决这些问题，我们提出了自我评价指导的迭代推理（Sigir），该推理使用自我批评的反馈来指导迭代推理过程。具体而言，通过端到端培训，我们可以通过问题分解使模型迭代地解决复杂问题。此外，该模型能够自我评估其中间推理步骤。在迭代推理期间，该模型参与分支探索并采用自我评估来指导有前途的推理轨迹的选择。在三个多跳的推理数据集上进行了广泛的实验，证明了我们提出的方法的有效性，超过了先前的SOTA $ 8.6 \％$。此外，我们的详尽分析为未来的研究提供了见解。我们的代码，数据和模型可在GitHub：此HTTPS URL上获得。

Title: Controlling Language Confusion in Multilingual LLMs

Authors: Nahyun Lee, Yeongseo Woo, Hyunwoo Ko, Guijin Son
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.19116
Pdf URL: https://arxiv.org/pdf/2505.19116
Copy Paste: [[2505.19116]] Controlling Language Confusion in Multilingual LLMs(https://arxiv.org/abs/2505.19116)
Keywords: language model, llm
Abstract: Large language models often suffer from language confusion, a phenomenon where responses are partially or entirely generated in unintended languages. This can critically impact user experience in low-resource settings. We hypothesize that conventional supervised fine-tuning exacerbates this issue because the softmax objective focuses probability mass only on the single correct token but does not explicitly penalize cross-lingual mixing. Interestingly, by observing loss trajectories during the pretraining phase, we observe that models fail to learn to distinguish between monolingual and language-confused text. Additionally, we find that ORPO, which adds penalties for unwanted output styles to standard SFT, effectively suppresses language-confused generations even at high decoding temperatures without degrading overall model performance. Our findings suggest that incorporating appropriate penalty terms can mitigate language confusion in low-resource settings with limited data.
摘要：大型语言模型通常遭受语言混乱的困扰，这种现象是用意外语言部分或完全产生的反应。这可能会严重影响低资源设置中的用户体验。我们假设常规的监督微调加剧了这个问题，因为软磁性物镜仅关注概率质量质量，仅在单个正确的令牌上，但不会明确惩罚跨语性的混合。有趣的是，通过观察预处理阶段的损失轨迹，我们观察到模型无法学会区分单语言和语言的文本。此外，我们发现ORPO为标准的SFT增加了对不需要的输出样式的罚款，即使在高解码温度下，也没有降低整体模型性能，也可以有效地抑制语言引用的世代。我们的发现表明，合并适当的罚款条款可以减轻低资源设置中的语言混乱，并具有有限的数据。

Title: Delving into Multilingual Ethical Bias: The MSQAD with Statistical Hypothesis Tests for Large Language Models

Authors: Seunguk Yu, Juhwan Choi, Youngbin Kim
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.19121
Pdf URL: https://arxiv.org/pdf/2505.19121
Copy Paste: [[2505.19121]] Delving into Multilingual Ethical Bias: The MSQAD with Statistical Hypothesis Tests for Large Language Models(https://arxiv.org/abs/2505.19121)
Keywords: language model, llm
Abstract: Despite the recent strides in large language models, studies have underscored the existence of social biases within these systems. In this paper, we delve into the validation and comparison of the ethical biases of LLMs concerning globally discussed and potentially sensitive topics, hypothesizing that these biases may arise from language-specific distinctions. Introducing the Multilingual Sensitive Questions & Answers Dataset (MSQAD), we collected news articles from Human Rights Watch covering 17 topics, and generated socially sensitive questions along with corresponding responses in multiple languages. We scrutinized the biases of these responses across languages and topics, employing two statistical hypothesis tests. The results showed that the null hypotheses were rejected in most cases, indicating biases arising from cross-language differences. It demonstrates that ethical biases in responses are widespread across various languages, and notably, these biases were prevalent even among different LLMs. By making the proposed MSQAD openly available, we aim to facilitate future research endeavors focused on examining cross-language biases in LLMs and their variant models.
摘要：尽管在大型语言模型中取得了进步，但研究突显了这些系统中社会偏见的存在。在本文中，我们深入研究了有关全球讨论和潜在敏感主题的LLMS伦理偏见的验证和比较，假设这些偏见可能来自语言特定的区别。介绍了多语言敏感的问题与答案数据集（MSQAD），我们从人权观察中收集了新闻文章，其中涵盖了17个主题，并产生了对社会敏感的问题以及多种语言的相应回答。我们采用了两个统计假设检验，仔细研究了跨语言和主题的这些反应的偏见。结果表明，在大多数情况下，零假设被拒绝，表明跨语言差异引起的偏见。它表明，反应中的道德偏见在各种语言上都广泛存在，尤其是这些偏见即使在不同的LLM中也普遍存在。通过公开提供建议的MSQAD，我们旨在促进未来的研究努力，专注于研究LLMS及其变体模型中的跨语言偏见。

Title: MMATH: A Multilingual Benchmark for Mathematical Reasoning

Authors: Wenyang Luo, Wayne Xin Zhao, Jing Sha, Shijin Wang, Ji-Rong Wen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.19126
Pdf URL: https://arxiv.org/pdf/2505.19126
Copy Paste: [[2505.19126]] MMATH: A Multilingual Benchmark for Mathematical Reasoning(https://arxiv.org/abs/2505.19126)
Keywords: language model, prompt
Abstract: The advent of large reasoning models, such as OpenAI o1 and DeepSeek R1, has significantly advanced complex reasoning tasks. However, their capabilities in multilingual complex reasoning remain underexplored, with existing efforts largely focused on simpler tasks like MGSM. To address this gap, we introduce MMATH, a benchmark for multilingual complex reasoning spanning 374 high-quality math problems across 10 typologically diverse languages. Using MMATH, we observe that even advanced models like DeepSeek R1 exhibit substantial performance disparities across languages and suffer from a critical off-target issue: generating responses in unintended languages. To address this, we explore strategies including prompting and training, demonstrating that reasoning in English and answering in target languages can simultaneously enhance performance and preserve target-language consistency. Our findings offer new insights and practical strategies for advancing the multilingual reasoning capabilities of large language models. Our code and data can be found at this https URL.
摘要：大型推理模型的出现，例如OpenAI O1和DeepSeek R1，具有明显的高级复杂推理任务。但是，它们在多语言复杂推理中的能力仍然没有被忽略，现有的努力主要集中在更简单的任务上，例如MGSM。为了解决这一差距，我们介绍了MMATH，这是一种基准，用于跨越10种类型上不同语言的374个高质量数学问题的多语言复杂推理。使用MMATH，我们观察到，即使是DeepSeek R1（例如DeepSeek R1）的高级模型也会在语言上表现出很大的性能差异，并且遭受了意外语言中关键的脱离目标产生问题的反应。为了解决这个问题，我们探讨了策略，包括提示和培训，证明英语推理和目标语言可以同时提高性能并保留目标语言的一致性。我们的发现提供了新的见解和实用策略，以推动大型语言模型的多语言推理能力。我们的代码和数据可以在此HTTPS URL上找到。

Title: RetrieveAll: A Multilingual Named Entity Recognition Framework with Large Language Models

Authors: Jin Zhang, Fan Gao, Linyu Li, Yongbin Yu, Xiangxiang Wang, Nyima Tashi, Gadeng Luosang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.19128
Pdf URL: https://arxiv.org/pdf/2505.19128
Copy Paste: [[2505.19128]] RetrieveAll: A Multilingual Named Entity Recognition Framework with Large Language Models(https://arxiv.org/abs/2505.19128)
Keywords: language model, prompt
Abstract: The rise of large language models has led to significant performance breakthroughs in named entity recognition (NER) for high-resource languages, yet there remains substantial room for improvement in low- and medium-resource languages. Existing multilingual NER methods face severe language interference during the multi-language adaptation process, manifested in feature conflicts between different languages and the competitive suppression of low-resource language features by high-resource languages. Although training a dedicated model for each language can mitigate such interference, it lacks scalability and incurs excessive computational costs in real-world applications. To address this issue, we propose RetrieveAll, a universal multilingual NER framework based on dynamic LoRA. The framework decouples task-specific features across languages and demonstrates efficient dynamic adaptability. Furthermore, we introduce a cross-granularity knowledge augmented method that fully exploits the intrinsic potential of the data without relying on external resources. By leveraging a hierarchical prompting mechanism to guide knowledge injection, this approach advances the paradigm from "prompt-guided inference" to "prompt-driven learning." Experimental results show that RetrieveAll outperforms existing baselines; on the PAN-X dataset, it achieves an average F1 improvement of 12.1 percent.
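Summary: The rise of large language models has brought significant performance breakthroughs in named entity recognition (NER) for high-resource languages, but substantial room for improvement remains for low- and medium-resource languages. Existing multilingual NER methods suffer severe language interference during multi-language adaptation, which shows up as feature conflicts between languages and as high-resource languages competitively suppressing low-resource language features. Training a dedicated model for each language mitigates this interference, but it does not scale and incurs excessive computational cost in real-world applications. To address this, we propose RetrieveAll, a universal multilingual NER framework based on dynamic LoRA. The framework decouples task-specific features across languages and demonstrates efficient dynamic adaptability. We also introduce a cross-granularity knowledge augmentation method that fully exploits the intrinsic potential of the data without relying on external resources. By using a hierarchical prompting mechanism to guide knowledge injection, the approach moves the paradigm from "prompt-guided inference" to "prompt-driven learning." Experiments show that RetrieveAll outperforms existing baselines; on the PAN-X dataset it improves average F1 by 12.1 percent.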

Title: Shifting AI Efficiency From Model-Centric to Data-Centric Compression

Authors: Xuyang Liu, Zichen Wen, Shaobo Wang, Junjie Chen, Zhishan Tao, Yubo Wang, Xiangqi Jin, Chang Zou, Yiyu Wang, Chenfei Liao, Xu Zheng, Honggang Chen, Weijia Li, Xuming Hu, Conghui He, Linfeng Zhang
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2505.19147
Pdf URL: https://arxiv.org/pdf/2505.19147
Copy Paste: [[2505.19147]] Shifting AI Efficiency From Model-Centric to Data-Centric Compression(https://arxiv.org/abs/2505.19147)
Keywords: language model, llm
Abstract: The rapid advancement of large language models (LLMs) and multi-modal LLMs (MLLMs) has historically relied on model-centric scaling through increasing parameter counts from millions to hundreds of billions to drive performance gains. However, as we approach hardware limits on model size, the dominant computational bottleneck has fundamentally shifted to the quadratic cost of self-attention over long token sequences, now driven by ultra-long text contexts, high-resolution images, and extended videos. In this position paper, we argue that the focus of research for efficient AI is shifting from model-centric compression to data-centric compression. We position token compression as the new frontier, which improves AI efficiency via reducing the number of tokens during model training or inference. Through comprehensive analysis, we first examine recent developments in long-context AI across various domains and establish a unified mathematical framework for existing model efficiency strategies, demonstrating why token compression represents a crucial paradigm shift in addressing long-context overhead. Subsequently, we systematically review the research landscape of token compression, analyzing its fundamental benefits and identifying its compelling advantages across diverse scenarios. Furthermore, we provide an in-depth analysis of current challenges in token compression research and outline promising future directions. Ultimately, our work aims to offer a fresh perspective on AI efficiency, synthesize existing research, and catalyze innovative developments to address the challenges that increasing context lengths pose to the AI community's advancement.
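Summary: The rapid advancement of large language models (LLMs) and multi-modal LLMs (MLLMs) has historically relied on model-centric scaling, increasing parameter counts from millions to hundreds of billions to drive performance gains. As we approach hardware limits on model size, however, the dominant computational bottleneck has fundamentally shifted to the quadratic cost of self-attention over long token sequences, now driven by ultra-long text contexts, high-resolution images, and extended videos. In this position paper, we argue that the focus of research on efficient AI is shifting from model-centric compression to data-centric compression. We position token compression as the new frontier: it improves AI efficiency by reducing the number of tokens during model training or inference. Through comprehensive analysis, we first examine recent developments in long-context AI across domains and establish a unified mathematical framework for existing model efficiency strategies, showing why token compression represents a crucial paradigm shift in addressing long-context overhead. We then systematically review the research landscape of token compression, analyzing its fundamental benefits and identifying its advantages across diverse scenarios. We also give an in-depth analysis of current challenges in token compression research and outline promising future directions. Ultimately, our work aims to offer a fresh perspective on AI efficiency, synthesize existing research, and catalyze innovative developments to address the challenges that increasing context lengths pose to the AI community's advancement.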
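
The quadratic cost mentioned in this abstract can be made concrete with a little arithmetic. The sketch below is not taken from the paper: it counts only the two matrix products of a single attention layer (QK^T and the attention-weighted sum over V), treats a multiply-add as 2 FLOPs, and ignores projections and softmax. The function names and the default head dimension of 128 are assumptions chosen for illustration.

```python
def attention_flops(num_tokens: int, head_dim: int, num_heads: int = 1) -> int:
    """Approximate FLOPs of the two matrix products in one self-attention layer.

    QK^T costs about 2 * n^2 * d FLOPs per head, and the attention-weighted
    sum over V costs the same. Projections and softmax are ignored.
    """
    per_head = 2 * (2 * num_tokens * num_tokens * head_dim)
    return per_head * num_heads


def compression_speedup(num_tokens: int, keep_ratio: float, head_dim: int = 128) -> float:
    """Ratio of attention FLOPs before and after keeping a fraction of the tokens."""
    kept = max(1, int(num_tokens * keep_ratio))
    return attention_flops(num_tokens, head_dim) / attention_flops(kept, head_dim)


if __name__ == "__main__":
    # Compare an uncompressed 32,768-token sequence with two compression levels.
    for ratio in (1.0, 0.5, 0.25):
        print(f"keep {ratio:.2f} of tokens -> {compression_speedup(32_768, ratio):.1f}x fewer attention FLOPs")
```

Under these assumptions, keeping half of the tokens cuts the attention cost by a factor of four, which is the intuition behind treating token count, rather than parameter count, as the quantity to compress.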

Title: SpokenNativQA: Multilingual Everyday Spoken Queries for LLMs

Authors: Firoj Alam, Md Arid Hasan, Shammur Absar Chowdhury
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.19163
Pdf URL: https://arxiv.org/pdf/2505.19163
Copy Paste: [[2505.19163]] SpokenNativQA: Multilingual Everyday Spoken Queries for LLMs(https://arxiv.org/abs/2505.19163)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have demonstrated remarkable performance across various disciplines and tasks. However, benchmarking their capabilities with multilingual spoken queries remains largely unexplored. In this study, we introduce SpokenNativQA, the first multilingual and culturally aligned spoken question-answering (SQA) dataset designed to evaluate LLMs in real-world conversational settings. The dataset comprises approximately 33,000 naturally spoken questions and answers in multiple languages, including low-resource and dialect-rich languages, providing a robust benchmark for assessing LLM performance in speech-based interactions. SpokenNativQA addresses the limitations of text-based QA datasets by incorporating speech variability, accents, and linguistic diversity. We benchmark different ASR systems and LLMs for SQA and present our findings. We released the data at (this https URL) and the experimental scripts at (this https URL) for the research community.
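Summary: Large language models (LLMs) have shown remarkable performance across many disciplines and tasks, but benchmarking their capabilities on multilingual spoken queries remains largely unexplored. In this study we introduce SpokenNativQA, the first multilingual and culturally aligned spoken question-answering (SQA) dataset designed to evaluate LLMs in real-world conversational settings. The dataset contains approximately 33,000 naturally spoken questions and answers in multiple languages, including low-resource and dialect-rich languages, and provides a robust benchmark for assessing LLM performance in speech-based interactions. SpokenNativQA addresses the limitations of text-based QA datasets by incorporating speech variability, accents, and linguistic diversity. We benchmark different ASR systems and LLMs on SQA and present our findings. We released the data at (this https URL) and the experimental scripts at (this https URL) for the research community.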

Title: Assistant-Guided Mitigation of Teacher Preference Bias in LLM-as-a-Judge

Authors: Zhuo Liu, Moxin Li, Xun Deng, Qifan Wang, Fuli Feng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.19176
Pdf URL: https://arxiv.org/pdf/2505.19176
Copy Paste: [[2505.19176]] Assistant-Guided Mitigation of Teacher Preference Bias in LLM-as-a-Judge(https://arxiv.org/abs/2505.19176)
Keywords: language model, gpt, llm
Abstract: LLM-as-a-Judge employs large language models (LLMs), such as GPT-4, to evaluate the quality of LLM-generated responses, gaining popularity for its cost-effectiveness and strong alignment with human evaluations. However, training proxy judge models using evaluation data generated by powerful teacher models introduces a critical yet previously overlooked issue: teacher preference bias, where the proxy judge model learns a biased preference for responses from the teacher model. To tackle this problem, we propose a novel setting that incorporates an additional assistant model, which is not biased toward the teacher model's responses, to complement the training data. Building on this setup, we introduce AGDe-Judge, a three-stage framework designed to debias from both the labels and feedbacks in the training data. Extensive experiments demonstrate that AGDe-Judge effectively reduces teacher preference bias while maintaining strong performance across six evaluation benchmarks. Code is available at this https URL.
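Summary: LLM-as-a-Judge uses large language models (LLMs) such as GPT-4 to evaluate the quality of LLM-generated responses, and has become popular for its cost-effectiveness and strong alignment with human evaluations. However, training proxy judge models on evaluation data generated by powerful teacher models introduces a critical but previously overlooked issue: teacher preference bias, in which the proxy judge learns a biased preference for responses from the teacher model. To tackle this problem, we propose a novel setting that adds an assistant model, one that is not biased toward the teacher model's responses, to complement the training data. Building on this setup, we introduce AGDe-Judge, a three-stage framework designed to debias both the labels and the feedback in the training data. Extensive experiments show that AGDe-Judge effectively reduces teacher preference bias while maintaining strong performance across six evaluation benchmarks. Code is available at this https URL.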

Title: Two LLMs debate, both are certain they've won

Authors: Minh Nhat Nguyen, Pradyumna Shyama Prasad
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.19184
Pdf URL: https://arxiv.org/pdf/2505.19184
Copy Paste: [[2505.19184]] Two LLMs debate, both are certain they've won(https://arxiv.org/abs/2505.19184)
Keywords: language model, llm, chain-of-thought, agent
Abstract: Can LLMs accurately adjust their confidence when facing opposition? Building on previous studies measuring calibration on static fact-based question-answering tasks, we evaluate Large Language Models (LLMs) in a dynamic, adversarial debate setting, uniquely combining two realistic factors: (a) a multi-turn format requiring models to update beliefs as new information emerges, and (b) a zero-sum structure to control for task-related uncertainty, since mutual high-confidence claims imply systematic overconfidence. We organized 60 three-round policy debates among ten state-of-the-art LLMs, with models privately rating their confidence (0-100) in winning after each round. We observed five concerning patterns: (1) Systematic overconfidence: models began debates with average initial confidence of 72.9% vs. a rational 50% baseline. (2) Confidence escalation: rather than reducing confidence as debates progressed, debaters increased their win probabilities, averaging 83% by the final round. (3) Mutual overestimation: in 61.7% of debates, both sides simultaneously claimed >=75% probability of victory, a logical impossibility. (4) Persistent self-debate bias: models debating identical copies increased confidence from 64.1% to 75.2%; even when explicitly informed their chance of winning was exactly 50%, confidence still rose (from 50.0% to 57.1%). (5) Misaligned private reasoning: models' private scratchpad thoughts sometimes differed from their public confidence ratings, raising concerns about faithfulness of chain-of-thought reasoning. These results suggest LLMs lack the ability to accurately self-assess or update their beliefs in dynamic, multi-turn tasks; a major concern as LLM outputs are deployed without careful review in assistant roles or agentic settings.
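Summary: Can LLMs accurately adjust their confidence when facing opposition? Building on previous studies measuring calibration on static fact-based question-answering tasks, we evaluate large language models (LLMs) in a dynamic, adversarial debate setting that combines two realistic factors: (a) a multi-turn format requiring models to update beliefs as new information emerges, and (b) a zero-sum structure that controls for task-related uncertainty, since mutual high-confidence claims imply systematic overconfidence. We organized 60 three-round policy debates among ten state-of-the-art LLMs, with models privately rating their confidence (0-100) in winning after each round. We observed five concerning patterns: (1) Systematic overconfidence: models began debates with an average initial confidence of 72.9%, against a rational 50% baseline. (2) Confidence escalation: rather than lowering confidence as debates progressed, debaters raised their win probabilities, averaging 83% by the final round. (3) Mutual overestimation: in 61.7% of debates, both sides simultaneously claimed >=75% probability of victory, a logical impossibility. (4) Persistent self-debate bias: models debating identical copies increased confidence from 64.1% to 75.2%; even when explicitly told their chance of winning was exactly 50%, confidence still rose (from 50.0% to 57.1%). (5) Misaligned private reasoning: models' private scratchpad thoughts sometimes differed from their public confidence ratings, raising concerns about the faithfulness of chain-of-thought reasoning. These results suggest that LLMs lack the ability to accurately self-assess or update their beliefs in dynamic, multi-turn tasks, a major concern as LLM outputs are deployed without careful review in assistant roles or agentic settings.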
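
The zero-sum argument in this abstract is simple arithmetic: if exactly one side wins, the two win probabilities should sum to 100, so two claims of 75 or more cannot both be calibrated. A minimal sketch of that check follows. The class and function names are hypothetical, the no-draw assumption is made explicit here, and the example confidence values in the usage note are invented for illustration rather than taken from the paper.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class DebateRound:
    confidence_a: float  # debater A's self-reported win probability, 0-100
    confidence_b: float  # debater B's self-reported win probability, 0-100


def overconfidence_excess(r: DebateRound) -> float:
    """Points by which the two claims exceed the coherent total of 100.

    Assumes a strictly zero-sum debate with exactly one winner (no draws).
    """
    return max(0.0, r.confidence_a + r.confidence_b - 100.0)


def mutual_overestimation_rate(rounds: Sequence[DebateRound], threshold: float = 75.0) -> float:
    """Fraction of debates in which both sides claim at least `threshold`."""
    if not rounds:
        return 0.0
    flagged = sum(
        1 for r in rounds
        if r.confidence_a >= threshold and r.confidence_b >= threshold
    )
    return flagged / len(rounds)
```

For example, `overconfidence_excess(DebateRound(80, 75))` reports 55 points of excess, while a coherent pair such as `DebateRound(60, 40)` reports none.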

Title: LIMOPro: Reasoning Refinement for Efficient and Effective Test-time Scaling

Authors: Yang Xiao, Jiashuo Wang, Ruifeng Yuan, Chunpu Xu, Kaishuai Xu, Wenjie Li, Pengfei Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.19187
Pdf URL: https://arxiv.org/pdf/2505.19187
Copy Paste: [[2505.19187]] LIMOPro: Reasoning Refinement for Efficient and Effective Test-time Scaling(https://arxiv.org/abs/2505.19187)
Keywords: language model, llm, chain-of-thought
Abstract: Large language models (LLMs) have demonstrated remarkable reasoning capabilities through test-time scaling approaches, particularly when fine-tuned with chain-of-thought (CoT) data distilled from more powerful large reasoning models (LRMs). However, these reasoning chains often contain verbose elements that mirror human problem-solving, categorized as progressive reasoning (the essential solution development path) and functional elements (verification processes, alternative solution approaches, and error corrections). While progressive reasoning is crucial, the functional elements significantly increase computational demands during test-time inference. We introduce PIR (Perplexity-based Importance Refinement), a principled framework that quantitatively evaluates the importance of each reasoning step based on its impact on answer prediction confidence. PIR systematically identifies and selectively prunes only low-importance functional steps while preserving progressive reasoning components, creating optimized training data that maintains the integrity of the core solution path while reducing verbosity. Models fine-tuned on PIR-optimized data exhibit superior test-time scaling properties, generating more concise reasoning chains while achieving improved accuracy (+0.9% to +6.6%) with significantly reduced token usage (-3% to -41%) across challenging reasoning benchmarks (AIME, AMC, and GPQA Diamond). Our approach demonstrates strong generalizability across different model sizes, data sources, and token budgets, offering a practical solution for deploying reasoning-capable LLMs in scenarios where efficient test-time scaling, response time, and computational efficiency are valuable constraints.
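Summary: Large language models (LLMs) have shown remarkable reasoning capabilities through test-time scaling, particularly when fine-tuned on chain-of-thought (CoT) data distilled from more powerful large reasoning models (LRMs). These reasoning chains, however, often contain verbose elements that mirror human problem-solving, which can be categorized as progressive reasoning (the essential solution development path) and functional elements (verification processes, alternative solution approaches, and error corrections). Progressive reasoning is crucial, but the functional elements significantly increase computational demands during test-time inference. We introduce PIR (Perplexity-based Importance Refinement), a principled framework that quantitatively evaluates the importance of each reasoning step based on its impact on answer prediction confidence. PIR systematically identifies and selectively prunes only low-importance functional steps while preserving progressive reasoning components, producing optimized training data that keeps the core solution path intact while reducing verbosity. Models fine-tuned on PIR-optimized data show superior test-time scaling properties, generating more concise reasoning chains while achieving improved accuracy (+0.9% to +6.6%) with significantly reduced token usage (-3% to -41%) across challenging reasoning benchmarks (AIME, AMC, and GPQA Diamond). The approach generalizes well across model sizes, data sources, and token budgets, offering a practical solution for deploying reasoning-capable LLMs where efficient test-time scaling, response time, and computational efficiency matter.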
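
The pruning rule described in this abstract can be sketched as a leave-one-out ablation. The abstract does not give PIR's exact importance formula, so the definition below (rise in answer perplexity when a step is removed) is an assumption, as are the names `step_importance`, `prune_functional_steps`, and the `answer_perplexity` callable, which stands in for a real language-model scorer. The toy scorer exists only to make the example runnable.

```python
from typing import Callable, List, Sequence, Tuple

# A reasoning step is (kind, text); kind is "progressive" or "functional".
Step = Tuple[str, str]
Scorer = Callable[[Sequence[str]], float]


def step_importance(steps: Sequence[Step], answer_perplexity: Scorer) -> List[float]:
    """Leave-one-out importance: how much answer perplexity rises without each step."""
    base = answer_perplexity([text for _, text in steps])
    scores = []
    for i in range(len(steps)):
        ablated = [text for j, (_, text) in enumerate(steps) if j != i]
        scores.append(answer_perplexity(ablated) - base)
    return scores


def prune_functional_steps(steps: Sequence[Step], answer_perplexity: Scorer,
                           threshold: float) -> List[Step]:
    """Drop functional steps scoring below `threshold`; always keep progressive steps."""
    scores = step_importance(steps, answer_perplexity)
    return [
        step for step, score in zip(steps, scores)
        if step[0] == "progressive" or score >= threshold
    ]


def toy_perplexity(texts: Sequence[str]) -> float:
    """Stand-in scorer: each text containing 'key' lowers answer perplexity by 2."""
    return 10.0 - 2.0 * sum("key" in t for t in texts)


if __name__ == "__main__":
    chain = [
        ("progressive", "key: set up the equation"),
        ("functional", "wait, let me double check"),
        ("functional", "key check: verify the root"),
        ("progressive", "solve"),
    ]
    for kind, text in prune_functional_steps(chain, toy_perplexity, threshold=1.0):
        print(kind, "|", text)
```

In this toy run only the uninformative functional step is removed; the functional step that affects the answer score and both progressive steps survive, which mirrors the selective pruning the abstract describes.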

Title: Misleading through Inconsistency: A Benchmark for Political Inconsistencies Detection

Authors: Nursulu Sagimbayeva, Ruveyda Betül Bahçeci, Ingmar Weber
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.19191
Pdf URL: https://arxiv.org/pdf/2505.19191
Copy Paste: [[2505.19191]] Misleading through Inconsistency: A Benchmark for Political Inconsistencies Detection(https://arxiv.org/abs/2505.19191)
Keywords: language model, llm, prompt
Abstract: Inconsistent political statements represent a form of misinformation. When left unnoticed, they erode public trust and pose challenges to accountability. Detecting inconsistencies automatically could support journalists in asking clarification questions, thereby helping to keep politicians accountable. We propose the inconsistency detection task and develop a scale of inconsistency types to prompt NLP research in this direction. To provide a resource for detecting inconsistencies in a political domain, we present a dataset of 698 human-annotated pairs of political statements with explanations of the annotators' reasoning for 237 samples. The statements mainly come from voting assistant platforms such as Wahl-O-Mat in Germany and Smartvote in Switzerland, reflecting real-world political issues. We benchmark Large Language Models (LLMs) on our dataset and show that in general, they are as good as humans at detecting inconsistencies, and might be even better than individual humans at predicting the crowd-annotated ground-truth. However, when it comes to identifying fine-grained inconsistency types, none of the models has reached the upper bound of performance (due to natural labeling variation), thus leaving room for improvement. We make our dataset and code publicly available.
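Summary: Inconsistent political statements are a form of misinformation. When they go unnoticed, they erode public trust and pose challenges to accountability. Detecting inconsistencies automatically could help journalists ask clarifying questions and so help hold politicians accountable. We propose the inconsistency detection task and develop a scale of inconsistency types to prompt NLP research in this direction. To provide a resource for detecting inconsistencies in the political domain, we present a dataset of 698 human-annotated pairs of political statements, with explanations of the annotators' reasoning for 237 samples. The statements come mainly from voting assistant platforms such as Wahl-O-Mat in Germany and Smartvote in Switzerland, reflecting real-world political issues. We benchmark large language models (LLMs) on our dataset and show that, in general, they are as good as humans at detecting inconsistencies and may be even better than individual humans at predicting the crowd-annotated ground truth. When it comes to identifying fine-grained inconsistency types, however, none of the models has reached the upper bound of performance (which is set by natural labeling variation), leaving room for improvement. We make our dataset and code publicly available.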

Title: DREAM: Drafting with Refined Target Features and Entropy-Adaptive Cross-Attention Fusion for Multimodal Speculative Decoding

Authors: Yunhai Hu, Tianhua Xia, Zining Liu, Rahul Raman, Xingyu Liu, Bo Bao, Eric Sather, Vithursan Thangarasa, Sai Qian Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.19201
Pdf URL: https://arxiv.org/pdf/2505.19201
Copy Paste: [[2505.19201]] DREAM: Drafting with Refined Target Features and Entropy-Adaptive Cross-Attention Fusion for Multimodal Speculative Decoding(https://arxiv.org/abs/2505.19201)
Keywords: language model, llm
Abstract: Speculative decoding (SD) has emerged as a powerful method for accelerating autoregressive generation in large language models (LLMs), yet its integration into vision-language models (VLMs) remains underexplored. We introduce DREAM, a novel speculative decoding framework tailored for VLMs that combines three key innovations: (1) a cross-attention-based mechanism to inject intermediate features from the target model into the draft model for improved alignment, (2) adaptive intermediate feature selection based on attention entropy to guide efficient draft model training, and (3) visual token compression to reduce draft model latency. DREAM enables efficient, accurate, and parallel multimodal decoding with significant throughput improvement. Experiments across a diverse set of recent popular VLMs, including LLaVA, Pixtral, SmolVLM and Gemma3, demonstrate up to 3.6x speedup over conventional decoding and significantly outperform prior SD baselines in both inference throughput and speculative draft acceptance length across a broad range of multimodal benchmarks. The code is publicly available at: this https URL
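Summary: Speculative decoding (SD) has emerged as a powerful method for accelerating autoregressive generation in large language models (LLMs), yet its integration into vision-language models (VLMs) remains underexplored. We introduce DREAM, a novel speculative decoding framework tailored for VLMs that combines three key innovations: (1) a cross-attention-based mechanism to inject intermediate features from the target model into the draft model for improved alignment, (2) adaptive intermediate feature selection based on attention entropy to guide efficient draft model training, and (3) visual token compression to reduce draft model latency. DREAM enables efficient, accurate, and parallel multimodal decoding with significant throughput improvement. Experiments across a diverse set of recent popular VLMs, including LLaVA, Pixtral, SmolVLM, and Gemma3, show up to 3.6x speedup over conventional decoding, and DREAM significantly outperforms prior SD baselines in both inference throughput and speculative draft acceptance length across a broad range of multimodal benchmarks. The code is publicly available at: this https URL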

Title: SpeakStream: Streaming Text-to-Speech with Interleaved Data

Authors: Richard He Bai, Zijin Gu, Tatiana Likhomanenko, Navdeep Jaitly
Subjects: cs.CL, cs.LG, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2505.19206
Pdf URL: https://arxiv.org/pdf/2505.19206
Copy Paste: [[2505.19206]] SpeakStream: Streaming Text-to-Speech with Interleaved Data(https://arxiv.org/abs/2505.19206)
Keywords: language model, llm, agent
Abstract: The latency bottleneck of traditional text-to-speech (TTS) systems fundamentally hinders the potential of streaming large language models (LLMs) in conversational AI. These TTS systems, typically trained and inferenced on complete utterances, introduce unacceptable delays, even with optimized inference speeds, when coupled with streaming LLM outputs. This is particularly problematic for creating responsive conversational agents where low first-token latency is critical. In this paper, we present SpeakStream, a streaming TTS system that generates audio incrementally from streaming text using a decoder-only architecture. SpeakStream is trained using a next-step prediction loss on interleaved text-speech data. During inference, it generates speech incrementally while absorbing streaming input text, making it particularly suitable for cascaded conversational AI agents where an LLM streams text to a TTS system. Our experiments demonstrate that SpeakStream achieves state-of-the-art latency results in terms of first-token latency while maintaining the quality of non-streaming TTS systems.
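Summary: The latency bottleneck of traditional text-to-speech (TTS) systems fundamentally limits the potential of streaming large language models (LLMs) in conversational AI. These TTS systems are typically trained and run on complete utterances, so they introduce unacceptable delays when coupled with streaming LLM outputs, even with optimized inference speeds. This is particularly problematic for building responsive conversational agents, where low first-token latency is critical. In this paper we present SpeakStream, a streaming TTS system that generates audio incrementally from streaming text using a decoder-only architecture. SpeakStream is trained with a next-step prediction loss on interleaved text-speech data. During inference it generates speech incrementally while absorbing streaming input text, which makes it particularly suitable for cascaded conversational AI agents in which an LLM streams text to a TTS system. Our experiments show that SpeakStream achieves state-of-the-art first-token latency while maintaining the quality of non-streaming TTS systems.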

Title: MOOSE-Chem2: Exploring LLM Limits in Fine-Grained Scientific Hypothesis Discovery via Hierarchical Search

Authors: Zonglin Yang, Wanhao Liu, Ben Gao, Yujie Liu, Wei Li, Tong Xie, Lidong Bing, Wanli Ouyang, Erik Cambria, Dongzhan Zhou
Subjects: cs.CL, cs.AI, cs.CE, stat.ML
Abstract URL: https://arxiv.org/abs/2505.19209
Pdf URL: https://arxiv.org/pdf/2505.19209
Copy Paste: [[2505.19209]] MOOSE-Chem2: Exploring LLM Limits in Fine-Grained Scientific Hypothesis Discovery via Hierarchical Search(https://arxiv.org/abs/2505.19209)
Keywords: language model, llm
Abstract: Large language models (LLMs) have shown promise in automating scientific hypothesis generation, yet existing approaches primarily yield coarse-grained hypotheses lacking critical methodological and experimental details. We introduce and formally define the novel task of fine-grained scientific hypothesis discovery, which entails generating detailed, experimentally actionable hypotheses from coarse initial research directions. We frame this as a combinatorial optimization problem and investigate the upper limits of LLMs' capacity to solve it when maximally leveraged. Specifically, we explore four foundational questions: (1) how to best harness an LLM's internal heuristics to formulate the fine-grained hypothesis it itself would judge as the most promising among all the possible hypotheses it might generate, based on its own internal scoring, thus defining a latent reward landscape over the hypothesis space; (2) whether such LLM-judged better hypotheses exhibit stronger alignment with ground-truth hypotheses; (3) whether shaping the reward landscape using an ensemble of diverse LLMs of similar capacity yields better outcomes than defining it with repeated instances of the strongest LLM among them; and (4) whether an ensemble of identical LLMs provides a more reliable reward landscape than a single LLM. To address these questions, we propose a hierarchical search method that incrementally proposes and integrates details into the hypothesis, progressing from general concepts to specific experimental configurations. We show that this hierarchical process smooths the reward landscape and enables more effective optimization. Empirical evaluations on a new benchmark of expert-annotated fine-grained hypotheses from recent chemistry literature show that our method consistently outperforms strong baselines.
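Summary: Large language models (LLMs) have shown promise in automating scientific hypothesis generation, but existing approaches mostly yield coarse-grained hypotheses that lack critical methodological and experimental details. We introduce and formally define the novel task of fine-grained scientific hypothesis discovery, which requires generating detailed, experimentally actionable hypotheses from coarse initial research directions. We frame this as a combinatorial optimization problem and investigate the upper limits of LLMs' capacity to solve it when maximally leveraged. Specifically, we explore four foundational questions: (1) how best to harness an LLM's internal heuristics to formulate the fine-grained hypothesis that it would itself judge most promising among all the hypotheses it might generate, based on its own internal scoring, thus defining a latent reward landscape over the hypothesis space; (2) whether such LLM-judged better hypotheses align more strongly with ground-truth hypotheses; (3) whether shaping the reward landscape with an ensemble of diverse LLMs of similar capacity yields better outcomes than defining it with repeated instances of the strongest LLM among them; and (4) whether an ensemble of identical LLMs provides a more reliable reward landscape than a single LLM. To address these questions, we propose a hierarchical search method that incrementally proposes and integrates details into the hypothesis, progressing from general concepts to specific experimental configurations. We show that this hierarchical process smooths the reward landscape and enables more effective optimization. Empirical evaluations on a new benchmark of expert-annotated fine-grained hypotheses from recent chemistry literature show that our method consistently outperforms strong baselines.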

Title: When Ethics and Payoffs Diverge: LLM Agents in Morally Charged Social Dilemmas

Authors: Steffen Backmann, David Guzman Piedrahita, Emanuel Tewolde, Rada Mihalcea, Bernhard Schölkopf, Zhijing Jin
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2505.19212
Pdf URL: https://arxiv.org/pdf/2505.19212
Copy Paste: [[2505.19212]] When Ethics and Payoffs Diverge: LLM Agents in Morally Charged Social Dilemmas(https://arxiv.org/abs/2505.19212)
Keywords: language model, llm, agent
Abstract: Recent advances in large language models (LLMs) have enabled their use in complex agentic roles, involving decision-making with humans or other agents, making ethical alignment a key AI safety concern. While prior work has examined both LLMs' moral judgment and strategic behavior in social dilemmas, there is limited understanding of how they act when moral imperatives directly conflict with rewards or incentives. To investigate this, we introduce Moral Behavior in Social Dilemma Simulation (MoralSim) and evaluate how LLMs behave in the prisoner's dilemma and public goods game with morally charged contexts. In MoralSim, we test a range of frontier models across both game structures and three distinct moral framings, enabling a systematic examination of how LLMs navigate social dilemmas in which ethical norms conflict with payoff-maximizing strategies. Our results show substantial variation across models in both their general tendency to act morally and the consistency of their behavior across game types, the specific moral framing, and situational factors such as opponent behavior and survival risks. Crucially, no model exhibits consistently moral behavior in MoralSim, highlighting the need for caution when deploying LLMs in agentic roles where the agent's "self-interest" may conflict with ethical expectations. Our code is available at this https URL.
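Summary: Recent advances in large language models (LLMs) have enabled their use in complex agentic roles that involve decision-making with humans or other agents, making ethical alignment a key AI safety concern. Prior work has examined both LLMs' moral judgment and their strategic behavior in social dilemmas, but there is limited understanding of how they act when moral imperatives directly conflict with rewards or incentives. To investigate this, we introduce Moral Behavior in Social Dilemma Simulation (MoralSim) and evaluate how LLMs behave in the prisoner's dilemma and the public goods game with morally charged contexts. In MoralSim, we test a range of frontier models across both game structures and three distinct moral framings, enabling a systematic examination of how LLMs navigate social dilemmas in which ethical norms conflict with payoff-maximizing strategies. Our results show substantial variation across models, both in their general tendency to act morally and in the consistency of their behavior across game types, specific moral framings, and situational factors such as opponent behavior and survival risks. Crucially, no model exhibits consistently moral behavior in MoralSim, highlighting the need for caution when deploying LLMs in agentic roles where the agent's "self-interest" may conflict with ethical expectations. Our code is available at this https URL.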
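
For readers unfamiliar with why payoffs and ethics can pull apart in the prisoner's dilemma, the sketch below shows that defection is the payoff-maximizing reply to either opponent action even though mutual cooperation yields a higher joint payoff. The payoff values are the conventional textbook ones (5, 3, 1, 0), not the payoffs used in MoralSim, and the function names are hypothetical.

```python
# Row player's and column player's payoffs for each (row_action, col_action).
# "C" = cooperate, "D" = defect. Values are illustrative textbook payoffs.
PAYOFFS = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}


def best_response(opponent_action: str) -> str:
    """Action that maximizes the row player's own payoff against a fixed opponent action."""
    return max("CD", key=lambda action: PAYOFFS[(action, opponent_action)][0])


def is_dominant(action: str) -> bool:
    """True if `action` is the best response to every opponent action."""
    return all(best_response(opponent) == action for opponent in "CD")


def joint_payoff(row_action: str, col_action: str) -> int:
    """Sum of both players' payoffs for one outcome."""
    return sum(PAYOFFS[(row_action, col_action)])


if __name__ == "__main__":
    print("Defection dominant:", is_dominant("D"))
    print("Joint payoff, mutual cooperation:", joint_payoff("C", "C"))
    print("Joint payoff, mutual defection:", joint_payoff("D", "D"))
```

A purely payoff-maximizing agent therefore defects, which is the behavior a morally charged framing is meant to put under tension.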

Title: The Overthinker's DIET: Cutting Token Calories with DIfficulty-AwarE Training

Authors: Weize Chen, Jiarui Yuan, Tailin Jin, Ning Ding, Huimin Chen, Zhiyuan Liu, Maosong Sun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.19217
Pdf URL: https://arxiv.org/pdf/2505.19217
Copy Paste: [[2505.19217]] The Overthinker's DIET: Cutting Token Calories with DIfficulty-AwarE Training(https://arxiv.org/abs/2505.19217)
Keywords: language model, llm
Abstract: Recent large language models (LLMs) exhibit impressive reasoning but often over-think, generating excessively long responses that hinder efficiency. We introduce DIET (DIfficulty-AwarE Training), a framework that systematically cuts these "token calories" by integrating on-the-fly problem difficulty into the reinforcement learning (RL) process. DIET dynamically adapts token compression strategies by modulating token penalty strength and conditioning target lengths on estimated task difficulty, to optimize the performance-efficiency trade-off. We also theoretically analyze the pitfalls of naive reward weighting in group-normalized RL algorithms like GRPO, and propose the Advantage Weighting technique, which enables stable and effective implementation of these difficulty-aware objectives. Experimental results demonstrate that DIET significantly reduces token counts while simultaneously improving reasoning performance. Beyond raw token reduction, we show two crucial benefits largely overlooked by prior work: (1) DIET leads to superior inference scaling. By maintaining high per-sample quality with fewer tokens, it enables better scaling performance via majority voting with more samples under fixed computational budgets, an area where other methods falter. (2) DIET enhances the natural positive correlation between response length and problem difficulty, ensuring verbosity is appropriately allocated, unlike many existing compression methods that disrupt this relationship. Our analyses provide a principled and effective framework for developing more efficient, practical, and high-performing LLMs.
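Summary: Recent large language models (LLMs) show impressive reasoning but often over-think, generating excessively long responses that hinder efficiency. We introduce DIET (DIfficulty-AwarE Training), a framework that systematically cuts these "token calories" by integrating on-the-fly problem difficulty into the reinforcement learning (RL) process. DIET dynamically adapts token compression strategies by modulating token penalty strength and conditioning target lengths on estimated task difficulty, in order to optimize the performance-efficiency trade-off. We also theoretically analyze the pitfalls of naive reward weighting in group-normalized RL algorithms such as GRPO, and propose the Advantage Weighting technique, which enables stable and effective implementation of these difficulty-aware objectives. Experiments show that DIET significantly reduces token counts while also improving reasoning performance. Beyond raw token reduction, we show two crucial benefits largely overlooked by prior work: (1) DIET leads to superior inference scaling. By maintaining high per-sample quality with fewer tokens, it enables better scaling via majority voting with more samples under a fixed computational budget, an area where other methods falter. (2) DIET strengthens the natural positive correlation between response length and problem difficulty, ensuring that verbosity is allocated appropriately, unlike many existing compression methods that disrupt this relationship. Our analyses provide a principled and effective framework for developing more efficient, practical, and high-performing LLMs.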
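
The inference-scaling benefit claimed in point (1) rests on a simple budget calculation: shorter responses leave room for more samples, and more samples give majority voting more to work with. The sketch below illustrates only that arithmetic. The helper names, the token budget, and the per-sample lengths are all hypothetical and are not figures from the paper.

```python
from collections import Counter
from typing import Sequence


def samples_within_budget(token_budget: int, avg_tokens_per_sample: int) -> int:
    """Number of full responses that fit in a fixed token budget."""
    return token_budget // avg_tokens_per_sample


def majority_vote(answers: Sequence[str]) -> str:
    """Most frequent final answer among the sampled responses."""
    return Counter(answers).most_common(1)[0][0]


if __name__ == "__main__":
    budget = 16_000  # hypothetical fixed budget in tokens
    print("verbose model:", samples_within_budget(budget, 4_000), "samples")
    print("compressed model:", samples_within_budget(budget, 2_500), "samples")
    print("vote:", majority_vote(["42", "41", "42"]))
```

With these invented numbers the compressed model affords six votes where the verbose one affords four; whether the extra votes help in practice depends on per-sample accuracy staying high, which is the condition the abstract emphasizes.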

Title: Evaluating Text Creativity across Diverse Domains: A Dataset and Large Language Model Evaluator

Authors: Qian Cao, Xiting Wang, Yuzhuo Yuan, Yahui Liu, Fang Luo, Ruihua Song
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.19236
Pdf URL: https://arxiv.org/pdf/2505.19236
Copy Paste: [[2505.19236]] Evaluating Text Creativity across Diverse Domains: A Dataset and Large Language Model Evaluator(https://arxiv.org/abs/2505.19236)
Keywords: language model, llm, prompt
Abstract: Creativity evaluation remains a challenging frontier for large language models (LLMs). Current evaluations heavily rely on inefficient and costly human judgments, hindering progress in enhancing machine creativity. While automated methods exist, ranging from psychological testing to heuristic- or prompting-based approaches, they often lack generalizability or alignment with human judgment. To address these issues, in this paper, we propose a novel pairwise-comparison framework for assessing textual creativity, leveraging shared contextual instructions to improve evaluation consistency. We introduce CreataSet, a large-scale dataset with 100K+ human-level and 1M+ synthetic creative instruction-response pairs spanning diverse open-domain tasks. Through training on CreataSet, we develop an LLM-based evaluator named CrEval. CrEval demonstrates remarkable superiority over existing methods in alignment with human judgments. Experimental results underscore the indispensable significance of integrating both human-generated and synthetic data in training highly robust evaluators, and showcase the practical utility of CrEval in boosting the creativity of LLMs. We will release all data, code, and models publicly soon to support further research.
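Summary: Creativity evaluation remains a challenging frontier for large language models (LLMs). Current evaluations rely heavily on inefficient and costly human judgments, which hinders progress in enhancing machine creativity. Automated methods exist, ranging from psychological testing to heuristic- or prompting-based approaches, but they often lack generalizability or alignment with human judgment. To address these issues, we propose a novel pairwise-comparison framework for assessing textual creativity that uses shared contextual instructions to improve evaluation consistency. We introduce CreataSet, a large-scale dataset with 100K+ human-level and 1M+ synthetic creative instruction-response pairs spanning diverse open-domain tasks. By training on CreataSet, we develop an LLM-based evaluator named CrEval. CrEval shows remarkable superiority over existing methods in alignment with human judgments. Experimental results underscore the importance of integrating both human-generated and synthetic data when training highly robust evaluators, and showcase the practical utility of CrEval in boosting the creativity of LLMs. We will soon release all data, code, and models publicly to support further research.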

Title: LLLMs: A Data-Driven Survey of Evolving Research on Limitations of Large Language Models

Authors: Aida Kostikova, Zhipin Wang, Deidamea Bajri, Ole Pütz, Benjamin Paaßen, Steffen Eger
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.19240
Pdf URL: https://arxiv.org/pdf/2505.19240
Copy Paste: [[2505.19240]] LLLMs: A Data-Driven Survey of Evolving Research on Limitations of Large Language Models(https://arxiv.org/abs/2505.19240)
Keywords: language model, llm, hallucination
Abstract: Large language model (LLM) research has grown rapidly, along with increasing concern about their limitations such as failures in reasoning, hallucinations, and limited multilingual capability. In this survey, we conduct a data-driven, semi-automated review of research on limitations of LLM (LLLMs) from 2022 to 2024 using a bottom-up approach. From a corpus of 250,000 ACL and arXiv papers, we identify 14,648 relevant papers using keyword filtering, LLM-based classification, validated against expert labels, and topic clustering (via two approaches, HDBSCAN+BERTopic and LlooM). We find that LLM-related research increases over fivefold in ACL and fourfold in arXiv. Since 2022, LLLMs research grows even faster, reaching over 30% of LLM papers by late 2024. Reasoning remains the most studied limitation, followed by generalization, hallucination, bias, and security. The distribution of topics in the ACL dataset stays relatively stable over time, while arXiv shifts toward safety and controllability (with topics like security risks, alignment, hallucinations, knowledge editing), and multimodality between 2022 and 2024. We release a dataset of annotated abstracts and a validated methodology, and offer a quantitative view of trends in LLM limitations research.
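Summary: Large language model (LLM) research has grown rapidly, along with increasing concern about the models' limitations, such as failures in reasoning, hallucinations, and limited multilingual capability. In this survey we conduct a data-driven, semi-automated review of research on limitations of LLMs (LLLMs) from 2022 to 2024 using a bottom-up approach. From a corpus of 250,000 ACL and arXiv papers, we identify 14,648 relevant papers using keyword filtering, LLM-based classification validated against expert labels, and topic clustering (via two approaches, HDBSCAN+BERTopic and LlooM). We find that LLM-related research increases more than fivefold in ACL and fourfold in arXiv. Since 2022, LLLMs research has grown even faster, reaching over 30% of LLM papers by late 2024. Reasoning remains the most studied limitation, followed by generalization, hallucination, bias, and security. The distribution of topics in the ACL dataset stays relatively stable over time, while arXiv shifts toward safety and controllability (with topics such as security risks, alignment, hallucinations, and knowledge editing) and multimodality between 2022 and 2024. We release a dataset of annotated abstracts and a validated methodology, and offer a quantitative view of trends in LLM limitations research.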
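
The first stage of the pipeline named in this abstract, keyword filtering, can be sketched in a few lines. The abstract does not list the survey's actual keywords, so the stems below are hypothetical placeholders, and the later stages (LLM-based classification and topic clustering) are not shown.

```python
import re
from typing import Iterable, List

# Hypothetical keyword stems; the survey's real lists are not given in the abstract.
LLM_STEMS = ("large language model", "llm")
LIMITATION_STEMS = ("limitation", "fail", "hallucinat", "bias", "shortcoming")


def mentions_any(text: str, stems: Iterable[str]) -> bool:
    """True if any stem occurs at the start of a word in `text` (case-insensitive)."""
    lowered = text.lower()
    return any(re.search(r"\b" + re.escape(stem), lowered) for stem in stems)


def is_candidate(abstract: str) -> bool:
    """Keep abstracts that mention both LLMs and some limitation-related term."""
    return mentions_any(abstract, LLM_STEMS) and mentions_any(abstract, LIMITATION_STEMS)


def filter_candidates(abstracts: Iterable[str]) -> List[str]:
    """Keyword-filter a corpus down to candidate papers for later classification."""
    return [a for a in abstracts if is_candidate(a)]
```

A filter like this is deliberately high-recall and noisy, which is presumably why the survey follows it with a classifier validated against expert labels.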

Title: PATS: Process-Level Adaptive Thinking Mode Switching

Authors: Yi Wang, Junxiao Liu, Shimao Zhang, Jiajun Chen, Shujian Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.19250
Pdf URL: https://arxiv.org/pdf/2505.19250
Copy Paste: [[2505.19250]] PATS: Process-Level Adaptive Thinking Mode Switching(https://arxiv.org/abs/2505.19250)
Keywords: language model, llm
Abstract: Current large-language models (LLMs) typically adopt a fixed reasoning strategy, either simple or complex, for all questions, regardless of their difficulty. This neglect of variation in task and reasoning process complexity leads to an imbalance between performance and efficiency. Existing methods attempt to implement training-free fast-slow thinking system switching to handle problems of varying difficulty, but are limited by coarse-grained solution-level strategy adjustments. To address this issue, we propose a novel reasoning paradigm: Process-Level Adaptive Thinking Mode Switching (PATS), which enables LLMs to dynamically adjust their reasoning strategy based on the difficulty of each step, optimizing the balance between accuracy and computational efficiency. Our approach integrates Process Reward Models (PRMs) with Beam Search, incorporating progressive mode switching and bad-step penalty mechanisms. Experiments on diverse mathematical benchmarks demonstrate that our methodology achieves high accuracy while maintaining moderate token usage. This study emphasizes the significance of process-level, difficulty-aware reasoning strategy adaptation, offering valuable insights into efficient inference for LLMs.
摘要：当前的大语模型（LLMS）通常在所有问题上都采用固定的推理策略，无论其难度如何。忽视任务和推理过程复杂性的变化导致绩效与效率之间的不平衡。现有的方法试图实施无训练的快速慢性思维系统切换以处理各种难度的问题，但受粗粒解决方案级策略调整的限制。为了解决这个问题，我们提出了一个新颖的推理范式：过程级别的自适应思维模式切换（PATS），它使LLM可以根据每个步骤的难度动态调整其推理策略，从而优化准确性和计算效率之间的平衡。我们的方法将流程奖励模型（PRM）与光束搜索集成在一起，并结合了渐进模式切换和不良步骤的惩罚机制。关于多种数学基准测试的实验表明，我们的方法学在保持中度的代币使用时达到了高精度。这项研究强调了过程级别，困难的推理策略适应的重要性，从而为LLM的有效推断提供了宝贵的见解。
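
Illustrative sketch (not from the paper): the abstract names progressive mode switching and a bad-step penalty but does not give the switching rule. The Python below shows one way such a step-level policy could look, assuming a PRM score in [0, 1] for each step. The mode names, beam widths, thresholds, and the exact form of the penalty are all hypothetical.

```python
# Hypothetical sketch of step-level thinking-mode switching.
# Mode names, beam widths, and thresholds are illustrative assumptions,
# not values taken from the PATS paper.
MODES = ["simple", "medium", "complex"]  # ordered by reasoning effort
BEAM_WIDTH = {"simple": 2, "medium": 4, "complex": 8}


def next_mode(current, prm_score, easy=0.8, hard=0.6, bad=0.4):
    """Pick the mode for the next step from the PRM score of the last step."""
    i = MODES.index(current)
    if prm_score < bad:
        # Bad-step penalty (assumed form): jump straight to the heaviest mode.
        return MODES[-1]
    if prm_score < hard:
        # Progressive switching: escalate by one level only.
        return MODES[min(i + 1, len(MODES) - 1)]
    if prm_score >= easy:
        # Confident step: relax by one level to save tokens.
        return MODES[max(i - 1, 0)]
    return current


def schedule(prm_scores, start="medium"):
    """Return the (mode, beam width) used at each step for a list of PRM scores."""
    mode, plan = start, []
    for score in prm_scores:
        plan.append((mode, BEAM_WIDTH[mode]))
        mode = next_mode(mode, score)
    return plan
```

For example, `schedule([0.9, 0.5, 0.3, 0.7])` starts in the medium mode, relaxes after the confident first step, escalates one level after the weak second step, and jumps to the complex mode after the bad third step.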

Title: Unveiling Dual Quality in Product Reviews: An NLP-Based Approach

Authors: Rafał Poświata, Marcin Michał Mirończuk, Sławomir Dadas, Małgorzata Grębowiec, Michał Perełkiewicz
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.19254
Pdf URL: https://arxiv.org/pdf/2505.19254
Copy Paste: [[2505.19254]] Unveiling Dual Quality in Product Reviews: An NLP-Based Approach(https://arxiv.org/abs/2505.19254)
Keywords: llm
Abstract: Consumers often face inconsistent product quality, particularly when identical products vary between markets, a situation known as the dual quality problem. To identify and address this issue, automated techniques are needed. This paper explores how natural language processing (NLP) can aid in detecting such discrepancies and presents the full process of developing a solution. First, we describe in detail the creation of a new Polish-language dataset with 1,957 reviews, 540 highlighting dual quality issues. We then discuss experiments with various approaches like SetFit with sentence-transformers, transformer-based encoders, and LLMs, including error analysis and robustness verification. Additionally, we evaluate multilingual transfer using a subset of opinions in English, French, and German. The paper concludes with insights on deployment and practical applications.
摘要：消费者通常会面临不一致的产品质量，尤其是当市场之间相同产品变化时，这种情况称为双重质量问题。为了识别和解决此问题，需要自动化技术。本文探讨了自然语言处理（NLP）如何有助于检测这种差异并介绍开发解决方案的完整过程。首先，我们详细描述了一个新的波兰语言数据集的创建，其中有1,957个评论，540个突出了双重质量问题。然后，我们使用各种方法（例如SETFIT，句子转换器，基于变压器的编码器和LLM）讨论实验，包括错误分析和鲁棒性验证。此外，我们使用英语，法语和德语的一部分观点评估了多语言转移。本文以有关部署和实际应用的见解结束。

Title: A Graph Perspective to Probe Structural Patterns of Knowledge in Large Language Models

Authors: Utkarsh Sahu, Zhisheng Qi, Yongjia Lei, Ryan A. Rossi, Franck Dernoncourt, Nesreen K. Ahmed, Mahantesh M Halappanavar, Yao Ma, Yu Wang
Subjects: cs.CL, cs.LG, cs.SI
Abstract URL: https://arxiv.org/abs/2505.19286
Pdf URL: https://arxiv.org/pdf/2505.19286
Copy Paste: [[2505.19286]] A Graph Perspective to Probe Structural Patterns of Knowledge in Large Language Models(https://arxiv.org/abs/2505.19286)
Keywords: language model, llm
Abstract: Large language models have been extensively studied as neural knowledge bases for their knowledge access, editability, reasoning, and explainability. However, few works focus on the structural patterns of their knowledge. Motivated by this gap, we investigate these structural patterns from a graph perspective. We quantify the knowledge of LLMs at both the triplet and entity levels, and analyze how it relates to graph structural properties such as node degree. Furthermore, we uncover the knowledge homophily, where topologically close entities exhibit similar levels of knowledgeability, which further motivates us to develop graph machine learning models to estimate entity knowledge based on its local neighbors. This model further enables valuable knowledge checking by selecting triplets less known to LLMs. Empirical results show that using selected triplets for fine-tuning leads to superior performance.
摘要：大型语言模型已被广泛研究为知识访问，编辑性，推理和解释性的神经知识基础。但是，很少有作品集中在其知识的结构模式上。在这一差距的激励下，我们从图表的角度研究了这些结构模式。我们量化了三胞胎和实体级别的LLM的知识，并分析了它与图形结构特性（例如节点度）的关系。此外，我们在同质上发现知识，拓扑结尾的实体表现出相似的知识趋势，这进一步促使我们开发图形机器学习模型，以根据其本地邻居估算实体知识。该模型通过选择LLM鲜为人知的三胞胎来进一步实现宝贵的知识检查。经验结果表明，使用选定的三元组进行微调会导致出色的性能。

Title: 100-LongBench: Are de facto Long-Context Benchmarks Literally Evaluating Long-Context Ability?

Authors: Wang Yang, Hongye Jin, Shaochen Zhong, Song Jiang, Qifan Wang, Vipin Chaudhary, Xiaotian Han
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.19293
Pdf URL: https://arxiv.org/pdf/2505.19293
Copy Paste: [[2505.19293]] 100-LongBench: Are de facto Long-Context Benchmarks Literally Evaluating Long-Context Ability?(https://arxiv.org/abs/2505.19293)
Keywords: llm, long context
Abstract: Long-context capability is considered one of the most important abilities of LLMs, as a truly long context-capable LLM enables users to effortlessly process many originally exhausting tasks -- e.g., digesting a long-form document to find answers vs. directly asking an LLM about it. However, existing real-task-based long-context evaluation benchmarks have two major shortcomings. First, benchmarks like LongBench often do not provide proper metrics to separate long-context performance from the model's baseline ability, making cross-model comparison unclear. Second, such benchmarks are usually constructed with fixed input lengths, which limits their applicability across different models and fails to reveal when a model begins to break down. To address these issues, we introduce a length-controllable long-context benchmark and a novel metric that disentangles baseline knowledge from true long-context capabilities. Experiments demonstrate the superiority of our approach in effectively evaluating LLMs.
摘要：长篇小说能力被认为是LLMS最重要的能力之一，因为真正具有上下文能力的LLM使用户能够轻松地处理许多最初耗尽的任务 - 例如，消化长期文档以找到答案与直接询问LLM。但是，现有的基于实际任务的长篇文章评估基准有两个主要的缺点。首先，像Longbench之类的基准通常无法提供适当的指标，可以将长篇小说性能与模型的基线能力分开，从而使跨模型比较不清楚。其次，这种基准通常使用固定的输入长度构建，这限制了它们在不同模型上的适用性，并且何时开始分解模型时无法揭示。为了解决这些问题，我们介绍了可控制的长篇小写基准和一个新颖的指标，该指标将基线知识与真实的长篇小说能力相关。实验证明了我们在有效评估LLM的方法方面的优越性。

Title: A Necessary Step toward Faithfulness: Measuring and Improving Consistency in Free-Text Explanations

Authors: Lingjun Zhao, Hal Daumé III
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.19299
Pdf URL: https://arxiv.org/pdf/2505.19299
Copy Paste: [[2505.19299]] A Necessary Step toward Faithfulness: Measuring and Improving Consistency in Free-Text Explanations(https://arxiv.org/abs/2505.19299)
Keywords: language model
Abstract: Faithful free-text explanations are important to ensure transparency in high-stakes AI decision-making contexts, but they are challenging for language models to generate and for humans to assess. In this paper, we present a measure for Prediction-EXplanation (PEX) consistency, by extending the concept of weight of evidence. This measure quantifies how much a free-text explanation supports or opposes a prediction, serving as an important aspect of explanation faithfulness. Our analysis reveals that more than 62% of explanations generated by large language models lack this consistency. We show that applying direct preference optimization improves the consistency of generated explanations across three model families, with improvement ranging from 43.1% to 292.3%. Furthermore, we demonstrate that optimizing this consistency measure can improve explanation faithfulness by up to 9.7%.
摘要：忠实的自由文本解释对于确保高风险AI决策环境中的透明度很重要，但是它们挑战是通过语言模型产生并由人类评估。在本文中，我们通过扩展证据的概念来提出预测解释（PEX）一致性的衡量标准。该措施量化了自由文本解释支持或反对预测的多少，这是解释忠实的重要方面。我们的分析表明，大语模型产生的62％以上的解释缺乏这种一致性。我们表明，应用直接偏好优化提高了三个模型家族的生成解释的一致性，改善范围为43.1％至292.3％。此外，我们证明，优化这种一致性措施可以提高解释的忠诚度高达9.7％。
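
Illustrative sketch (not from the paper): the abstract says the PEX measure extends weight of evidence without stating the formula. The classical weight of evidence for a hypothesis h given evidence e is log[P(e|h) / P(e|not h)], which equals the posterior log-odds of h minus its prior log-odds. The Python below computes that classical quantity, assuming (hypothetically) that we can read off the model's probability for its predicted label with and without the explanation in context.

```python
import math


def log_odds(p):
    """Log-odds of a probability strictly between 0 and 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("probability must be strictly between 0 and 1")
    return math.log(p / (1.0 - p))


def weight_of_evidence(p_with_expl, p_without_expl):
    """Posterior log-odds minus prior log-odds for the predicted label.

    p_with_expl:    P(prediction | input, explanation)  (hypothetical input)
    p_without_expl: P(prediction | input)               (hypothetical input)
    Positive values mean the explanation supports the prediction,
    negative values mean it opposes it.
    """
    return log_odds(p_with_expl) - log_odds(p_without_expl)
```

Under this reading, an explanation that raises the predicted label's probability from 0.5 to 0.8 carries a weight of evidence of log 4, while one that lowers it yields a negative value.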

Title: SituatedThinker: Grounding LLM Reasoning with Real-World through Situated Thinking

Authors: Junnan Liu, Linhao Luo, Thuy-Trang Vu, Gholamreza Haffari
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.19300
Pdf URL: https://arxiv.org/pdf/2505.19300
Copy Paste: [[2505.19300]] SituatedThinker: Grounding LLM Reasoning with Real-World through Situated Thinking(https://arxiv.org/abs/2505.19300)
Keywords: language model, llm
Abstract: Recent advances in large language models (LLMs) demonstrate their impressive reasoning capabilities. However, the reasoning confined to internal parametric space limits LLMs' access to real-time information and understanding of the physical world. To overcome this constraint, we introduce SituatedThinker, a novel framework that enables LLMs to ground their reasoning in real-world contexts through situated thinking, which adaptively combines both internal knowledge and external information with predefined interfaces. By utilizing reinforcement learning, SituatedThinker incentivizes deliberate reasoning with the real world to acquire information and feedback, allowing LLMs to surpass their knowledge boundaries and enhance reasoning. Experimental results demonstrate significant performance improvements on multi-hop question-answering and mathematical reasoning benchmarks. Furthermore, SituatedThinker demonstrates strong performance on unseen tasks, such as KBQA, TableQA, and text-based games, showcasing the generalizable real-world grounded reasoning capability. Our codes are available at this https URL.
摘要：大型语言模型（LLM）的最新进展表明了他们令人印象深刻的推理能力。但是，限于内部参数空间的推理限制了LLMS对实时信息的访问和对物理世界的理解。为了克服这一约束，我们介绍了位置思路，这是一个新颖的框架，使LLMS能够通过位置思维在现实环境中进行推理，从而将内部知识和外部信息与预定义的接口一起自适应地结合在一起。通过利用强化学习，位置动力激励与现实世界的故意推理获取信息和反馈，从而使LLMS可以超越其知识界限并增强推理。实验结果表明，多跳的问题和数学推理基准的绩效改善。此外，位置思维器在看不见的任务（例如KBQA，TableQA和基于文本的游戏）上表现出很强的表现，展示了可概括的现实世界中的基础推理能力。我们的代码可在此HTTPS URL上找到。

Title: PatentScore: Multi-dimensional Evaluation of LLM-Generated Patent Claims

Authors: Yongmin Yoo, Qiongkai Xu, Longbing Cao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.19345
Pdf URL: https://arxiv.org/pdf/2505.19345
Copy Paste: [[2505.19345]] PatentScore: Multi-dimensional Evaluation of LLM-Generated Patent Claims(https://arxiv.org/abs/2505.19345)
Keywords: language model, gpt, llm
Abstract: Natural language generation (NLG) metrics play a central role in evaluating generated texts, but are not well suited for the structural and legal characteristics of patent documents. Large language models (LLMs) offer strong potential in automating patent generation, yet research on evaluating LLM-generated patents remains limited, especially in evaluating the generation quality of patent claims, which are central to defining the scope of protection. Effective claim evaluation requires addressing legal validity, technical accuracy, and structural compliance. To address this gap, we introduce PatentScore, a multi-dimensional evaluation framework for assessing LLM-generated patent claims. PatentScore incorporates: (1) hierarchical decomposition for claim analysis; (2) domain-specific validation patterns based on legal and technical standards; and (3) scoring across structural, semantic, and legal dimensions. Unlike general-purpose NLG metrics, PatentScore reflects patent-specific constraints and document structures, enabling evaluation beyond surface similarity. We evaluate 400 GPT-4o-mini generated Claim 1s and report a Pearson correlation of $r = 0.819$ with expert annotations, outperforming existing NLG metrics. Furthermore, we conduct additional evaluations using open models such as Claude-3.5-Haiku and Gemini-1.5-flash, all of which show strong correlations with expert judgments, confirming the robustness and generalizability of our framework.
摘要：自然语言产生（NLG）指标在评估生成的文本中起着核心作用，但不适合专利文档的结构和法律特征。大型语言模型（LLM）在自动化专利生成方面具有强大的潜力，但是评估LLM生成的专利的研究仍然有限，尤其是评估专利主张的产生质量，这对于确定保护范围至关重要。有效的索赔评估需要解决法律有效性，技术准确性和结构合规性。为了解决这一差距，我们介绍了PatentsCore，这是一个多维评估框架，用于评估LLM生成的专利索赔。 PatentsCore合并：（1）索赔分析的层次分解；（2）基于法律和技术标准的特定领域验证模式；（3）在结构，语义和法律方面进行评分。与通用NLG指标不同，专利库反映了专利特定的约束和文档结构，从而使评估能够超出表面相似性。我们评估了400个GPT-4O-MINI产生的权利要求1，并报告了$ r = 0.819 $与专家注释的Pearson相关性，表现优于现有的NLG指标。此外，我们使用开放模型（例如Claude-3.5-Haiku和Gemini-1.5-flash）进行了其他评估，所有这些都与专家判断显示了密切的相关性，证实了我们框架的鲁棒性和概括性。
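
Illustrative sketch (not from the paper): the reported figure is a Pearson correlation between metric scores and expert annotations. The Python below shows how such a correlation is computed from two equal-length score lists; the function name and any data passed to it are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations and the two sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)
```

A value near 1 means the automatic metric ranks and spaces claims much like the experts do; a value near 0 means the metric carries little linear information about expert judgment.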

Title: GC-KBVQA: A New Four-Stage Framework for Enhancing Knowledge Based Visual Question Answering Performance

Authors: Mohammad Mahdi Moradi, Sudhir Mudur
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2505.19354
Pdf URL: https://arxiv.org/pdf/2505.19354
Copy Paste: [[2505.19354]] GC-KBVQA: A New Four-Stage Framework for Enhancing Knowledge Based Visual Question Answering Performance(https://arxiv.org/abs/2505.19354)
Keywords: language model, llm, prompt
Abstract: Knowledge-Based Visual Question Answering (KB-VQA) methods focus on tasks that demand reasoning with information extending beyond the explicit content depicted in the image. Early methods relied on explicit knowledge bases to provide this auxiliary information. Recent approaches leverage Large Language Models (LLMs) as implicit knowledge sources. While KB-VQA methods have demonstrated promising results, their potential remains constrained as the auxiliary text provided may not be relevant to the question context, and may also include irrelevant information that could misguide the answer predictor. We introduce a novel four-stage framework called Grounding Caption-Guided Knowledge-Based Visual Question Answering (GC-KBVQA), which enables LLMs to effectively perform zero-shot VQA tasks without the need for end-to-end multimodal training. Innovations include grounding question-aware caption generation to move beyond generic descriptions and have compact, yet detailed and context-rich information. This is combined with knowledge from external sources to create highly informative prompts for the LLM. GC-KBVQA can address a variety of VQA tasks, and does not require task-specific fine-tuning, thus reducing both costs and deployment complexity by leveraging general-purpose, pre-trained LLMs. Comparison with competing KB-VQA methods shows significantly improved performance. Our code will be made public.
摘要：基于知识的视觉问题回答（KB-VQA）方法集中在要求推理的任务上，信息延伸的信息超出了图像中描述的明确内容。早期方法依靠明确的知识库来提供此辅助信息。最近的方法利用大型语言模型（LLM）作为隐性知识来源。尽管KB-VQA方法已经证明了有希望的结果，但由于提供的辅助文本可能与问题上下文无关，因此它们的潜力仍然受到限制，并且还可能包含可能误导答案预测变量的无关信息。我们介绍了一个新颖的四阶段框架，称为接地字幕引导的基于知识的视觉问题答案（GC-KBVQA），该框架使LLMS能够有效执行零击的VQA任务，而无需进行端到端的多模态培训。创新包括接地问题的标题生成，以超越通用描述，并具有紧凑而详细且详细的上下文信息。这与来自外部资源的知识相结合，为LLM创建高度信息的提示。 GC-KBVQA可以解决各种VQA任务，并且不需要特定于任务的微调，从而通过利用通用，预先培训的LLM来降低成本和部署复杂性。与竞争KB-VQA方法的比较显示出明显提高的性能。我们的代码将公开。

Title: ChartLens: Fine-grained Visual Attribution in Charts

Authors: Manan Suri, Puneet Mathur, Nedim Lipka, Franck Dernoncourt, Ryan A. Rossi, Dinesh Manocha
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.19360
Pdf URL: https://arxiv.org/pdf/2505.19360
Copy Paste: [[2505.19360]] ChartLens: Fine-grained Visual Attribution in Charts(https://arxiv.org/abs/2505.19360)
Keywords: language model, llm, hallucination, prompt
Abstract: The growing capabilities of multimodal large language models (MLLMs) have advanced tasks like chart understanding. However, these models often suffer from hallucinations, where generated text sequences conflict with the provided visual data. To address this, we introduce Post-Hoc Visual Attribution for Charts, which identifies fine-grained chart elements that validate a given chart-associated response. We propose ChartLens, a novel chart attribution algorithm that uses segmentation-based techniques to identify chart objects and employs set-of-marks prompting with MLLMs for fine-grained visual attribution. Additionally, we present ChartVA-Eval, a benchmark with synthetic and real-world charts from diverse domains like finance, policy, and economics, featuring fine-grained attribution annotations. Our evaluations show that ChartLens improves fine-grained attributions by 26-66%.
摘要：多模式大语言模型（MLLM）的不断增长的功能具有高级任务，例如图表理解。但是，这些模型通常会遭受幻觉的困扰，其中产生的文本序列与所提供的视觉数据冲突。为了解决这个问题，我们介绍了图表的事后视觉归因，该图表标识了验证给定图表相关响应的细粒图元素。我们提出了ChartLens，这是一种新颖的图表归因算法，该算法使用基于细分的技术识别图表对象，并采用了用MLLM提示的标志物，以获得细粒度的视觉归因。此外，我们提出了Chartva-eval，这是一种基准，该基准具有来自金融，政策和经济学等不同领域的合成和现实世界图表，具有细粒度的归因注释。我们的评估表明，ChartLens将细粒度的归因提高了26-66％。

Title: Belief Attribution as Mental Explanation: The Role of Accuracy, Informativity, and Causality

Authors: Lance Ying, Almog Hillel, Ryan Truong, Vikash K. Mansinghka, Joshua B. Tenenbaum, Tan Zhi-Xuan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.19376
Pdf URL: https://arxiv.org/pdf/2505.19376
Copy Paste: [[2505.19376]] Belief Attribution as Mental Explanation: The Role of Accuracy, Informativity, and Causality(https://arxiv.org/abs/2505.19376)
Keywords: agent
Abstract: A key feature of human theory-of-mind is the ability to attribute beliefs to other agents as mentalistic explanations for their behavior. But given the wide variety of beliefs that agents may hold about the world and the rich language we can use to express them, which specific beliefs are people inclined to attribute to others? In this paper, we investigate the hypothesis that people prefer to attribute beliefs that are good explanations for the behavior they observe. We develop a computational model that quantifies the explanatory strength of a (natural language) statement about an agent's beliefs via three factors: accuracy, informativity, and causal relevance to actions, each of which can be computed from a probabilistic generative model of belief-driven behavior. Using this model, we study the role of each factor in how people selectively attribute beliefs to other agents. We investigate this via an experiment where participants watch an agent collect keys hidden in boxes in order to reach a goal, then rank a set of statements describing the agent's beliefs about the boxes' contents. We find that accuracy and informativity perform reasonably well at predicting these rankings when combined, but that causal relevance is the single factor that best explains participants' responses.
摘要：人类心理理论的一个关键特征是将信念归因于其他代理的能力，作为对其行为的心理解释。但是，鉴于代理商可能对世界和我们可以用来表达它们的丰富语言有多种信念，人们倾向于将哪些特定的信念归因于他人？在本文中，我们调查了人们更喜欢将信念归因于他们观察到的行为的良好解释的假设。我们开发了一个计算模型，该模型通过三个因素量化了（自然语言）对代理人信念的解释性强度：准确性，信息性和因果关系与行为相关性，每种动作都可以从信仰驱动行为的概率生成模型中计算出来。使用此模型，我们研究了每个因素在人们如何选择将信念归因于其他代理商中的作用。我们通过一个实验调查了这一点，参与者观看代理收集隐藏在框中以达到目标的钥匙，然后对一组描述代理商对盒子内容的信念进行排名。我们发现，合并后的准确性和信息性在预测这些排名方面表现出色，但是因果关系是最能解释参与者反应的单一因素。

Title: Simple and Effective Baselines for Code Summarisation Evaluation

Authors: Jade Robinson, Jonathan K. Kummerfeld
Subjects: cs.CL, cs.AI, cs.SE
Abstract URL: https://arxiv.org/abs/2505.19392
Pdf URL: https://arxiv.org/pdf/2505.19392
Copy Paste: [[2505.19392]] Simple and Effective Baselines for Code Summarisation Evaluation(https://arxiv.org/abs/2505.19392)
Keywords: llm
Abstract: Code documentation is useful, but writing it is time-consuming. Different techniques for generating code summaries have emerged, but comparing them is difficult because human evaluation is expensive and automatic metrics are unreliable. In this paper, we introduce a simple new baseline in which we ask an LLM to give an overall score to a summary. Unlike n-gram and embedding-based baselines, our approach is able to consider the code when giving a score. This allows us to also make a variant that does not consider the reference summary at all, which could be used for other tasks, e.g., to evaluate the quality of documentation in code bases. We find that our method is as good as or better than prior metrics, though we recommend using it in conjunction with embedding-based methods to avoid the risk of LLM-specific bias.
摘要：代码文档很有用，但是编写它很耗时。出现了用于生成代码摘要的不同技术，但是比较它们很困难，因为人类评估很昂贵，自动指标不可靠。在本文中，我们介绍了一个简单的新基线，我们要求LLM给出摘要的总体分数。与N-Gram和基于嵌入的基线不同，我们的方法在给出分数时能够考虑代码。这使我们还可以制作一个根本不考虑参考摘要的变体，该变体可以用于其他任务，例如评估代码库中文档的质量。我们发现我们的方法比以前的指标一样好或更好，尽管我们建议将其与基于嵌入的方法结合使用以避免LLM特异性偏见的风险。

Title: CoTGuard: Using Chain-of-Thought Triggering for Copyright Protection in Multi-Agent LLM Systems

Authors: Yan Wen, Junfeng Guo, Heng Huang
Subjects: cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2505.19405
Pdf URL: https://arxiv.org/pdf/2505.19405
Copy Paste: [[2505.19405]] CoTGuard: Using Chain-of-Thought Triggering for Copyright Protection in Multi-Agent LLM Systems(https://arxiv.org/abs/2505.19405)
Keywords: language model, llm, prompt, chain-of-thought, agent
Abstract: As large language models (LLMs) evolve into autonomous agents capable of collaborative reasoning and task execution, multi-agent LLM systems have emerged as a powerful paradigm for solving complex problems. However, these systems pose new challenges for copyright protection, particularly when sensitive or copyrighted content is inadvertently recalled through inter-agent communication and reasoning. Existing protection techniques primarily focus on detecting content in final outputs, overlooking the richer, more revealing reasoning processes within the agents themselves. In this paper, we introduce CoTGuard, a novel framework for copyright protection that leverages trigger-based detection within Chain-of-Thought (CoT) reasoning. Specifically, we can activate specific CoT segments and monitor intermediate reasoning steps for unauthorized content reproduction by embedding specific trigger queries into agent prompts. This approach enables fine-grained, interpretable detection of copyright violations in collaborative agent scenarios. We evaluate CoTGuard on various benchmarks in extensive experiments and show that it effectively uncovers content leakage with minimal interference to task performance. Our findings suggest that reasoning-level monitoring offers a promising direction for safeguarding intellectual property in LLM-based agent systems.
摘要：随着大型语言模型（LLMS）演变为能够协作推理和任务执行的自主代理，多代理LLM系统已成为解决复杂问题的强大范式。但是，这些系统对版权保护构成了新的挑战，尤其是当敏感或受版权保护的内容被无意间通过试验间的沟通和推理召回时。现有的保护技术主要集中于检测最终产出中的内容，忽略代理本身内部更丰富，更揭示的推理过程。在本文中，我们介绍了Cotguard，这是一个新型版权保护框架，该框架利用了基于触发的检测（COT）推理（COT）推理。具体而言，我们可以激活特定的COT段，并通过将特定的触发查询嵌入到代理提示中，以监视未经授权的内容复制的中间推理步骤。这种方法可以在协作代理方案中对侵犯版权的细化，可解释的检测。我们在广泛的实验中评估了各种基准测试的Cotguard，并表明它有效地发现了内容泄漏，对任务性能的干扰最少。我们的发现表明，推理级监测为保护基于LLM的代理系统中的知识产权提供了有希望的方向。

Title: Self-Reflective Planning with Knowledge Graphs: Enhancing LLM Reasoning Reliability for Question Answering

Authors: Jiajun Zhu, Ye Liu, Meikai Bao, Kai Zhang, Yanghai Zhang, Qi Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.19410
Pdf URL: https://arxiv.org/pdf/2505.19410
Copy Paste: [[2505.19410]] Self-Reflective Planning with Knowledge Graphs: Enhancing LLM Reasoning Reliability for Question Answering(https://arxiv.org/abs/2505.19410)
Keywords: language model, llm, hallucination
Abstract: Recently, large language models (LLMs) have demonstrated remarkable capabilities in natural language processing tasks, yet they remain prone to hallucinations when reasoning with insufficient internal knowledge. While integrating LLMs with knowledge graphs (KGs) provides access to structured, verifiable information, existing approaches often generate incomplete or factually inconsistent reasoning paths. To this end, we propose Self-Reflective Planning (SRP), a framework that synergizes LLMs with KGs through iterative, reference-guided reasoning. Specifically, given a question and topic entities, SRP first searches for references to guide planning and reflection. In the planning process, it checks initial relations and generates a reasoning path. After retrieving knowledge from KGs through a reasoning path, it implements iterative reflection by judging the retrieval result and editing the reasoning path until the answer is correctly retrieved. Extensive experiments on three public datasets demonstrate that SRP surpasses various strong baselines and further underscore its reliable reasoning ability.
摘要：最近，大型语言模型（LLMS）在自然语言处理任务中表现出了显着的功能，但是在内部知识不足时，它们仍然容易出现幻觉。尽管将LLM与知识图（kgs）集成在一起，可访问结构化的，可验证的信息，但现有方法通常会产生不完整或实际上不一致的推理路径。为此，我们提出了自我反思计划（SRP），该框架通过迭代，参考引导的推理协同LLM与KG协同。具体而言，在一个问题和主题实体的情况下，SRP首先搜索参考文献以指导计划和反思。在计划过程中，它检查初始关系并生成推理路径。在通过推理路径从KG检索知识后，它通过判断检索结果并编辑推理路径来实现迭代反射，直到正确检索答案为止。在三个公共数据集上进行的广泛实验表明，SRP超过了各种强大的基准，并进一步强调了其可靠的推理能力。

Title: The Role of Diversity in In-Context Learning for Large Language Models

Authors: Wenyang Xiao, Haoyu Zhao, Lingxiao Huang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.19426
Pdf URL: https://arxiv.org/pdf/2505.19426
Copy Paste: [[2505.19426]] The Role of Diversity in In-Context Learning for Large Language Models(https://arxiv.org/abs/2505.19426)
Keywords: language model, llm
Abstract: In-context learning (ICL) is a crucial capability of current large language models (LLMs), where the selection of examples plays a key role in performance. While most existing approaches focus on selecting the most similar examples to the query, the impact of diversity in example selection remains underexplored. We systematically investigate the role of diversity in in-context example selection through experiments across a range of tasks, from sentiment classification to more challenging math and code problems. Experiments on Llama-3.1, Gemma-2, and Mistral-v0.3 families of models show that diversity-aware selection methods improve performance, particularly on complex tasks like math and code, and enhance robustness to out-of-distribution queries. To support these findings, we introduce a theoretical framework that explains the benefits of incorporating diversity in in-context example selection.
摘要：内在学习（ICL）是当前大语言模型（LLMS）的关键能力，其中示例的选择在性能中起着关键作用。尽管大多数现有的方法都集中在选择与查询最相似的示例，但示例选择中多样性的影响仍然没有得到充满反感。我们通过跨多个任务的实验，从情感分类到更具挑战性的数学和代码问题，系统地研究了多样性在文本中的示例选择中的作用。 Llama-3.1，Gemma-2和Mismtral-V0.3模型家族的实验表明，多样性感知的选择方法可以改善性能，尤其是在数学和代码等复杂任务上，并增强对分布范围的查询的鲁棒性。为了支持这些发现，我们介绍了一个理论框架，该框架解释了将多样性纳入在文章中的示例选择中的好处。
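
Illustrative sketch (not from the paper): the abstract does not say which diversity-aware selection methods were tested. The Python below uses Maximal Marginal Relevance (MMR), a standard greedy heuristic that trades similarity to the query against redundancy with examples already chosen, as a stand-in. Embeddings are assumed to be plain lists of floats, and the `lam` trade-off weight is a hypothetical parameter.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mmr_select(query, candidates, k, lam=0.5):
    """Greedily pick indices of k in-context examples by MMR.

    lam = 1.0 reduces to pure similarity ranking; smaller values
    penalise candidates that resemble examples already selected.
    """
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(query, candidates[i])
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance - (1.0 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With two near-duplicate candidates and one dissimilar candidate, pure similarity ranking returns both duplicates, whereas `lam=0.5` keeps one of them and adds the dissimilar example.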

Title: Frictional Agent Alignment Framework: Slow Down and Don't Break Things

Authors: Abhijnan Nath, Carine Graff, Andrei Bachinin, Nikhil Krishnaswamy
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.19428
Pdf URL: https://arxiv.org/pdf/2505.19428
Copy Paste: [[2505.19428]] Frictional Agent Alignment Framework: Slow Down and Don't Break Things(https://arxiv.org/abs/2505.19428)
Keywords: llm, prompt, agent
Abstract: AI support of collaborative interactions entails mediating potential misalignment between interlocutor beliefs. Common preference alignment methods like DPO excel in static settings, but struggle in dynamic collaborative tasks where the explicit signals of interlocutor beliefs are sparse and skewed. We propose the Frictional Agent Alignment Framework (FAAF), to generate precise, context-aware "friction" that prompts for deliberation and re-examination of existing evidence. FAAF's two-player objective decouples from data skew: a frictive-state policy identifies belief misalignments, while an intervention policy crafts collaborator-preferred responses. We derive an analytical solution to this objective, enabling training a single policy via a simple supervised loss. Experiments on three benchmarks show FAAF outperforms competitors in producing concise, interpretable friction and in OOD generalization. By aligning LLMs to act as adaptive "thought partners" -- not passive responders -- FAAF advances scalable, dynamic human-AI collaboration. Our code and data can be found at this https URL.
摘要：人工智能对协作互动的支持需要调解对话者信念之间的潜在未对准。诸如DPO Excel在静态环境中的常见偏好一致性方法，但是在动态协作任务中挣扎，使对话者信念的明确信号稀疏且偏斜。我们建议摩擦代理一致性框架（FAAF），以产生确切的，上下文感知的“摩擦”，以促使对现有证据进行审议和重新检查。 FAAF的两人目标脱离数据偏斜：摩擦状态政策确定了信念的错位，而干预政策则策划了合作者的质量偏爱的响应。我们为这个目标提供了一个分析解决方案，从而通过简单的监督损失训练单个政策。三个基准测试的实验表明，FAAF在产生简洁，可解释的摩擦和OOD概括方面优于竞争对手。通过使LLM充当自适应的“思想伙伴”（不是被动反应者），FAAF可以扩展，动态的人类合作。我们的代码和数据可以在此HTTPS URL上找到。

Title: Rhapsody: A Dataset for Highlight Detection in Podcasts

Authors: Younghan Park, Anuj Diwan, David Harwath, Eunsol Choi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.19429
Pdf URL: https://arxiv.org/pdf/2505.19429
Copy Paste: [[2505.19429]] Rhapsody: A Dataset for Highlight Detection in Podcasts(https://arxiv.org/abs/2505.19429)
Keywords: language model, gpt, prompt
Abstract: Podcasts have become daily companions for half a billion users. Given the enormous amount of podcast content available, highlights provide a valuable signal that helps viewers get the gist of an episode and decide if they want to invest in listening to it in its entirety. However, identifying highlights automatically is challenging due to the unstructured and long-form nature of the content. We introduce Rhapsody, a dataset of 13K podcast episodes paired with segment-level highlight scores derived from YouTube's 'most replayed' feature. We frame the podcast highlight detection as a segment-level binary classification task. We explore various baseline approaches, including zero-shot prompting of language models and lightweight finetuned language models using segment-level classification heads. Our experimental results indicate that even state-of-the-art language models like GPT-4o and Gemini struggle with this task, while models finetuned with in-domain data significantly outperform their zero-shot performance. The finetuned model benefits from leveraging both speech signal features and transcripts. These findings highlight the challenges for fine-grained information access in long-form spoken media.
摘要：播客已成为五十亿用户的每日伴侣。鉴于可用的播客内容大量可用，重点提供了一个有价值的信号，可以帮助观众获得情节的要点，并决定是否要全面投资倾听。但是，由于内容的非结构化和长形式，因此自动识别突出显示是具有挑战性的。我们介绍了Rhapsody，这是一个13K播客剧集的数据集，并与YouTube的“最重播”功能得出的细分级别的突出显示分数配对。我们将播客突出显示作为细分级二进制分类任务。我们探索各种基线方法，包括使用细分级分类头的语言模型的零射击提示和轻巧的登录语言模型。我们的实验结果表明，即使是GPT-4O和Gemini等最先进的语言模型，也可以在此任务上挣扎，而使用域内数据进行的模型也明显优于其零弹性的性能。填充模型受益于利用语音信号特征和成绩单。这些发现突出了长期口语媒体中细粒度信息访问的挑战。

Title: Deriving Strategic Market Insights with Large Language Models: A Benchmark for Forward Counterfactual Generation

Authors: Keane Ong, Rui Mao, Deeksha Varshney, Paul Pu Liang, Erik Cambria, Gianmarco Mengaldo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.19430
Pdf URL: https://arxiv.org/pdf/2505.19430
Copy Paste: [[2505.19430]] Deriving Strategic Market Insights with Large Language Models: A Benchmark for Forward Counterfactual Generation(https://arxiv.org/abs/2505.19430)
Keywords: language model, llm
Abstract: Counterfactual reasoning typically involves considering alternatives to actual events. While often applied to understand past events, a distinct form, forward counterfactual reasoning, focuses on anticipating plausible future developments. This type of reasoning is invaluable in dynamic financial markets, where anticipating market developments can powerfully unveil potential risks and opportunities for stakeholders, guiding their decision-making. However, performing this at scale is challenging due to the cognitive demands involved, underscoring the need for automated solutions. Large Language Models (LLMs) offer promise, but remain unexplored for this application. To address this gap, we introduce a novel benchmark, Fin-Force (FINancial FORward Counterfactual Evaluation). By curating financial news headlines and providing structured evaluation, Fin-Force supports LLM-based forward counterfactual generation. This paves the way for scalable and automated solutions for exploring and anticipating future market developments, thereby providing structured insights for decision-making. Through experiments on Fin-Force, we evaluate state-of-the-art LLMs and counterfactual generation methods, analyzing their limitations and proposing insights for future research.
摘要：反事实推理通常涉及考虑实际事件的替代方案。虽然经常应用于了解过去的事件，但在预期合理的未来发展方面具有独特的形式反事实推理。这种推理在动态的金融市场中是无价的，在这种市场中，预期的市场发展可以为利益相关者提供潜在的风险和机会，从而指导他们的决策。但是，由于所涉及的认知需求，对自动解决方案的需求进行了大规模进行挑战。大型语言模型（LLMS）提供了希望，但仍未针对此应用程序进行探索。为了解决这一差距，我们介绍了一种新颖的基准，鳍金融 - 金融远期反事实评估。通过策划金融新闻头条并提供结构化评估，Fin-Force支持基于LLM的前瞻性反事实发电。这为探索和预期未来市场发展的可扩展和自动化解决方案铺平了道路，从而为决策提供了结构化的见解。通过对Fin-Force的实验，我们评估了最先进的LLM和反事实生成方法，分析其局限性并提出了未来研究的见解。

Title: Route to Reason: Adaptive Routing for LLM and Reasoning Strategy Selection

Authors: Zhihong Pan, Kai Zhang, Yuze Zhao, Yupeng Han
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.19435
Pdf URL: https://arxiv.org/pdf/2505.19435
Copy Paste: [[2505.19435]] Route to Reason: Adaptive Routing for LLM and Reasoning Strategy Selection(https://arxiv.org/abs/2505.19435)
Keywords: language model, llm
Abstract: The inherent capabilities of a language model (LM) and the reasoning strategies it employs jointly determine its performance in reasoning tasks. While test-time scaling is regarded as an effective approach to tackling complex reasoning tasks, it incurs substantial computational costs and often leads to "overthinking", where models become trapped in "thought pitfalls". To address this challenge, we propose Route-To-Reason (RTR), a novel unified routing framework that dynamically allocates both LMs and reasoning strategies according to task difficulty under budget constraints. RTR learns compressed representations of both expert models and reasoning strategies, enabling their joint and adaptive selection at inference time. This method is low-cost, highly flexible, and can be seamlessly extended to arbitrary black-box or white-box models and strategies, achieving true plug-and-play functionality. Extensive experiments across seven open source models and four reasoning strategies demonstrate that RTR achieves an optimal trade-off between accuracy and computational efficiency among all baselines, achieving higher accuracy than the best single model while reducing token usage by over 60%.
摘要：语言模型（LM）的固有功能及其采用的推理策略共同确定其在推理任务中的绩效。虽然测试时间缩放被认为是解决复杂推理任务的有效方法，但它会造成大量的计算成本，并且经常导致“过度思考”，其中模型被陷入“思想陷阱”中。为了应对这一挑战，我们提出了一个新型的统一路由框架，这是一个新颖的统一路由框架，该框架在预算限制下根据任务难度动态分配LMS和推理策略。 RTR学习了专家模型和推理策略的压缩表示形式，在推理时可以进行联合和自适应选择。该方法是低成本，高度灵活的，可以无缝扩展到任意的黑框或白色框模型和策略，从而实现了真正的插件功能。在七个开源模型和四种推理策略上进行的广泛实验表明，RTR在所有基础线之间的准确性和计算效率之间取得了最佳的权衡，比最佳单个模型获得了更高的准确性，而将令牌用法降低了60％以上。
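
The routing idea in this abstract, choosing a (model, reasoning strategy) pair per task under a budget, can be illustrated with a minimal sketch. This is not the RTR method itself: the abstract does not say how accuracy and cost are predicted or combined, so the predicted-accuracy and predicted-token tables, the `cost_weight` trade-off, and the `max_tokens` budget below are all hypothetical.

```python
from typing import Dict, Optional, Tuple

Pair = Tuple[str, str]  # (model name, reasoning strategy); names are illustrative


def route(
    pred_accuracy: Dict[Pair, float],
    pred_tokens: Dict[Pair, float],
    cost_weight: float = 1e-4,
    max_tokens: Optional[float] = None,
) -> Pair:
    """Pick the pair with the best accuracy-minus-cost score within the budget.

    Assumption: some upstream predictor (not shown) has already estimated,
    for the current query, each pair's accuracy and token usage.
    """
    candidates = list(pred_accuracy)
    if max_tokens is not None:
        within = [p for p in candidates if pred_tokens[p] <= max_tokens]
        # If nothing fits the budget, fall back to the cheapest pair.
        candidates = within or [min(candidates, key=lambda p: pred_tokens[p])]
    return max(candidates, key=lambda p: pred_accuracy[p] - cost_weight * pred_tokens[p])


if __name__ == "__main__":
    acc = {("small", "direct"): 0.70, ("small", "cot"): 0.78, ("large", "cot"): 0.80}
    tok = {("small", "direct"): 200.0, ("small", "cot"): 900.0, ("large", "cot"): 3000.0}
    print(route(acc, tok))                  # trade-off between accuracy and tokens
    print(route(acc, tok, max_tokens=500))  # hard budget
```

Setting `cost_weight` to zero reduces this to picking the most accurate pair, which is the "best single model" style of baseline the abstract compares against.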

Title: Surrogate Signals from Format and Length: Reinforcement Learning for Solving Mathematical Problems without Ground Truth Answers

Authors: Rihui Xin, Han Liu, Zecheng Wang, Yupeng Zhang, Dianbo Sui, Xiaolin Hu, Bingning Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.19439
Pdf URL: https://arxiv.org/pdf/2505.19439
Copy Paste: [[2505.19439]] Surrogate Signals from Format and Length: Reinforcement Learning for Solving Mathematical Problems without Ground Truth Answers(https://arxiv.org/abs/2505.19439)
Keywords: language model, llm
Abstract: Large Language Models have achieved remarkable success in natural language processing tasks, with Reinforcement Learning playing a key role in adapting them to specific applications. However, obtaining ground truth answers for training LLMs in mathematical problem-solving is often challenging, costly, and sometimes infeasible. This research investigates the use of format and length as surrogate signals to train LLMs for mathematical problem-solving, bypassing the need for traditional ground truth answers. Our study shows that a reward function centered on format correctness alone can yield performance improvements comparable to the standard GRPO algorithm in early phases. Recognizing the limitations of format-only rewards in the later phases, we incorporate length-based rewards. The resulting GRPO approach, leveraging format-length surrogate signals, not only matches but in certain scenarios surpasses the performance of the standard GRPO algorithm relying on ground truth answers, achieving 40.0% accuracy on AIME2024 with a 7B base model. Through systematic exploration and experimentation, this research not only offers a practical solution for training LLMs to solve mathematical problems while reducing the dependence on extensive ground truth data collection, but also reveals why our label-free approach succeeds: the base model is like an excellent student who has already mastered mathematical and logical reasoning skills but performs poorly on the test paper; it simply needs to develop good answering habits to achieve outstanding results in exams, in other words, to unlock the capabilities it already possesses.
摘要：大型语言模型在自然语言处理任务中取得了巨大的成功，强化学习在使其适应特定应用程序方面发挥了关键作用。但是，在数学解决问题问题中获得培训LLM的地面真理答案通常是具有挑战性的，昂贵的，有时是不可行的。这项研究深入研究了格式和长度作为替代信号来训练LLM的数学问题解决问题，绕过对传统基础真理的需求这项HTTP URL研究表明，单独以格式正确性为中心的奖励函数可以产生与早期阶段标准GRPO算法相当的性能提高。认识到以后阶段中仅格式奖励的局限性，我们结合了基于长度的奖励。由此产生的GRPO方法，利用格式长度的替代信号，不仅匹配，而且超过了标准的GRPO算法的性能，这些算法在某些情况下依靠地面真相答案，在某些情况下以7B基础模型实现了AIME2024的40.0 \％精度。通过系统的探索和实验，这项研究不仅为培训LLM提供了一种实用的解决方案，以解决数学问题并减少对广泛的真理数据收集的依赖，而且还揭示了我们的无标签方法成功的本质：基础模型就像一个出色的学生一样，他已经掌握了数学和逻辑上的良好的回答，可以很好地回答良好的回答，从而在测试纸上表现出了良好的回答，它在测试纸上效果良好，在测试纸上，在兴起的范围内，在兴起的范围内，却在良好的方法中表现出了良好的回答。单词，解锁它已经拥有的功能。
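
A label-free reward built from format and length, as this abstract describes, might look roughly like the sketch below. The abstract gives no formulas, so every concrete choice here is an assumption made for illustration: the "exactly one `\boxed{...}` answer" format check, the triangular length score, and the `target_len`, `max_len`, and `length_weight` parameters.

```python
import re

# Assumed format rule: the response contains exactly one \boxed{...} answer.
BOXED = re.compile(r"\\boxed\{[^{}]+\}")


def format_reward(response: str) -> float:
    """1.0 if the response matches the assumed answer format, else 0.0."""
    return 1.0 if len(BOXED.findall(response)) == 1 else 0.0


def length_reward(response: str, target_len: int = 200, max_len: int = 800) -> float:
    """Score in [0, 1] that peaks at target_len words and is 0 at max_len or beyond."""
    n = len(response.split())
    if n >= max_len:
        return 0.0
    if n <= target_len:
        return n / target_len
    return (max_len - n) / (max_len - target_len)


def surrogate_reward(response: str, length_weight: float = 0.5) -> float:
    """Format-gated reward: length only counts when the format is correct."""
    fmt = format_reward(response)
    if fmt == 0.0:
        return 0.0
    return fmt + length_weight * length_reward(response)


if __name__ == "__main__":
    print(surrogate_reward("The answer is 42."))          # no boxed answer
    print(surrogate_reward("We compute ... \\boxed{42}"))  # formatted, short
```

No ground truth answer is consulted anywhere in the reward, which is the point of the surrogate-signal setup; whether the score correlates with correctness is an empirical question the paper addresses.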

Title: The Birth of Knowledge: Emergent Features across Time, Space, and Scale in Large Language Models

Authors: Shashata Sawmya, Micah Adler, Nir Shavit
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.19440
Pdf URL: https://arxiv.org/pdf/2505.19440
Copy Paste: [[2505.19440]] The Birth of Knowledge: Emergent Features across Time, Space, and Scale in Large Language Models(https://arxiv.org/abs/2505.19440)
Keywords: language model, llm
Abstract: This paper studies the emergence of interpretable categorical features within large language models (LLMs), analyzing their behavior across training checkpoints (time), transformer layers (space), and varying model sizes (scale). Using sparse autoencoders for mechanistic interpretability, we identify when and where specific semantic concepts emerge within neural activations. Results indicate clear temporal and scale-specific thresholds for feature emergence across multiple domains. Notably, spatial analysis reveals unexpected semantic reactivation, with early-layer features re-emerging at later layers, challenging standard assumptions about representational dynamics in transformer models.
摘要：本文研究了大语言模型（LLMS）中可解释的分类特征的出现，分析其在训练检查点（时间），变压器层（空间）和不同模型尺寸（比例）之间的行为。使用稀疏的自动编码器来进行机械解释性，我们确定了神经激活中特定的语义概念的何时何地出现。结果表明在多个域中出现特征出现的时间和规模特异性阈值。值得注意的是，空间分析揭示了意外的语义重新激活，并在后期重新出现早期特征，这挑战了有关变压器模型中代表动力学的标准假设。

Title: Balancing Computation Load and Representation Expressivity in Parallel Hybrid Neural Networks

Authors: Mohammad Mahdi Moradi, Walid Ahmed, Shuangyue Wen, Sudhir Mudur, Weiwei Zhang, Yang Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.19472
Pdf URL: https://arxiv.org/pdf/2505.19472
Copy Paste: [[2505.19472]] Balancing Computation Load and Representation Expressivity in Parallel Hybrid Neural Networks(https://arxiv.org/abs/2505.19472)
Keywords: language model
Abstract: Attention and State-Space Models (SSMs), when combined in a hybrid network in sequence or in parallel, provide complementary strengths. In a hybrid sequential pipeline, they alternate between applying a transformer to the input and then feeding its output into an SSM. This results in idle periods in the individual components, increasing end-to-end latency and lowering throughput caps. In the parallel hybrid architecture, the transformer operates independently in parallel with the SSM, and these pairs are cascaded, with output from one pair forming the input to the next. Two issues are (i) creating an expressive knowledge representation with the inherently divergent outputs from these separate branches, and (ii) load balancing the computation between these parallel branches, while maintaining representation fidelity. In this work we present FlowHN, a novel parallel hybrid network architecture that accommodates various strategies for load balancing, achieved through appropriate distribution of input tokens between the two branches. Two innovative differentiating factors in FlowHN are, first, a FLOP-aware dynamic token split between the attention and SSM branches, yielding efficient balance in compute load, and second, a method to fuse the highly divergent outputs from individual branches for enhancing representation expressivity. Together they enable much better token processing speeds, avoid bottlenecks, and at the same time yield significantly improved accuracy as compared to other competing works. We conduct comprehensive experiments on autoregressive language modeling for models with 135M, 350M, and 1B parameters. FlowHN outperforms sequential hybrid models and its parallel counterpart, achieving up to 4x higher Tokens per Second (TPS) and 2x better Model FLOPs Utilization (MFU).
摘要：关注和状态空间模型（SSM）在序列或并行的混合网络中组合时会提供互补的强度。在混合顺序管道中，它们在将变压器应用于输入然后将其输出输入到SSM之间进行交替。这会导致单个组件的空闲周期，从而增加了端到端延迟并降低吞吐量盖。在平行的混合体系结构中，变压器与SSM并行独立运行，并且这些对级联，输出一对构成输入到下一个的输入。两个问题是（i）创建一种表现力的知识表示形式，其固有不同的输出与这些单独的分支机构固有不同，以及（ii）负载平衡这些并行分支之间的计算，同时保持表示表示。在这项工作中，我们介绍了一种新型的平行混合网络体系结构Flowhn，可容纳各种负载平衡的策略，通过在两个分支之间适当分配输入令牌来实现。 FlowHN中的两个创新的区分因素包括插槽意识到的动态令牌在注意力和SSM分支之间划分，在计算负载中产生有效的平衡，其次，一种将来自单个分支的高度不同输出融合以增强表示表示表现力的方法。与其他竞争作品相比，它们共同实现了更好的令牌处理速度，避免瓶颈，同时产生的准确性显着提高。我们针对具有135m，350m和1b参数的模型进行自回归语言建模的全面实验。 FlowHN的表现优于顺序杂交模型及其平行对应物，每秒可实现高达4*高令牌（TPS）和2*更好的模型Flops利用率（MFU）。

Title: Continuous Self-Improvement of Large Language Models by Test-time Training with Verifier-Driven Sample Selection

Authors: Mohammad Mahdi Moradi, Hossam Amer, Sudhir Mudur, Weiwei Zhang, Yang Liu, Walid Ahmed
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.19475
Pdf URL: https://arxiv.org/pdf/2505.19475
Copy Paste: [[2505.19475]] Continuous Self-Improvement of Large Language Models by Test-time Training with Verifier-Driven Sample Selection(https://arxiv.org/abs/2505.19475)
Keywords: language model, llm
Abstract: Learning to adapt pretrained language models to unlabeled, out-of-distribution data is a critical challenge, as models often falter on structurally novel reasoning tasks even while excelling within their training distribution. We introduce a new framework called VDS-TTT (Verifier-Driven Sample Selection for Test-Time Training) to efficiently address this. We use a learned verifier to score a pool of generated responses and select only from high-ranking pseudo-labeled examples for fine-tuned adaptation. Specifically, for each input query our LLM generates N candidate answers; the verifier assigns a reliability score to each, and the response with the highest confidence and above a fixed threshold is paired with its query for test-time training. We fine-tune only low-rank LoRA adapter parameters, ensuring adaptation efficiency and fast convergence. Our proposed self-supervised framework is the first to synthesize verifier-driven test-time training data for continuous self-improvement of the model. Experiments across three diverse benchmarks and three state-of-the-art LLMs demonstrate that VDS-TTT yields up to a 32.29% relative improvement over the base model and a 6.66% gain compared to verifier-based methods without test-time training, highlighting its effectiveness and efficiency for on-the-fly large language model adaptation.
摘要：学会适应预验证的语言模型未标记，分发数据是一个至关重要的挑战，因为即使在其培训分布中表现出色的同时，模型也经常在结构上新颖的推理任务上步履蹒跚。我们介绍了一个名为VDS-TTT的新框架 - 验证者驱动的样本选择，以进行测试时间培训，以有效解决此问题。我们使用学习的验证者来评分生成的响应池，并仅从高级伪标记的示例中选择进行微调适应。具体而言，对于每个输入查询，我们的llm生成n个候选答案；验证者为每个验证者分配一个可靠性得分，并且具有最高置信度的响应，并且固定阈值高于其测试时间培训的查询。我们仅微调低级洛拉适配器参数，以确保适应效率和快速收敛。我们提出的自我监督框架是第一个合成验证者驱动的测试时间训练数据以持续自我改善的框架。在三个不同的基准和三个最先进的LLMS上进行的实验表明，与基于验证时间训练相比，VDS-TTT的相对改善的相对改善高达32.29％，增益6.66％，没有测试时间培训，强调了其对fly-Fliple大型语言模型适应的有效性和效率。
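
The sample-selection step described in this abstract (generate N candidates, score each with a verifier, keep the top one only if it clears a fixed threshold) can be sketched as follows. The `generate` and `verify` callables are hypothetical stand-ins for the LLM and the learned verifier, and the default values of `n_candidates` and `threshold` are illustrative; the LoRA fine-tuning step itself is not shown.

```python
from typing import Callable, List, Optional, Tuple

Generate = Callable[[str, int], List[str]]  # hypothetical: (query, N) -> N candidate answers
Verify = Callable[[str, str], float]        # hypothetical: (query, answer) -> reliability score


def select_pseudo_label(
    query: str,
    generate: Generate,
    verify: Verify,
    n_candidates: int = 4,
    threshold: float = 0.8,
) -> Optional[Tuple[str, str]]:
    """Return (query, best answer) if the best score is above threshold, else None."""
    candidates = generate(query, n_candidates)
    if not candidates:
        return None
    scored = [(verify(query, c), c) for c in candidates]
    best_score, best_answer = max(scored, key=lambda sc: sc[0])
    if best_score <= threshold:
        return None  # no candidate is trusted enough; skip this query
    return query, best_answer


def build_ttt_set(
    queries: List[str],
    generate: Generate,
    verify: Verify,
    n_candidates: int = 4,
    threshold: float = 0.8,
) -> List[Tuple[str, str]]:
    """Collect pseudo-labeled pairs that would then feed test-time training."""
    pairs = []
    for q in queries:
        pair = select_pseudo_label(q, generate, verify, n_candidates, threshold)
        if pair is not None:
            pairs.append(pair)
    return pairs
```

Queries whose best candidate falls at or below the threshold contribute nothing, so a stricter threshold trades training-set size for pseudo-label reliability.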

Title: CulFiT: A Fine-grained Cultural-aware LLM Training Paradigm via Multilingual Critique Data Synthesis

Authors: Ruixiang Feng, Shen Gao, Xiuying Chen, Lisi Chen, Shuo Shang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.19484
Pdf URL: https://arxiv.org/pdf/2505.19484
Copy Paste: [[2505.19484]] CulFiT: A Fine-grained Cultural-aware LLM Training Paradigm via Multilingual Critique Data Synthesis(https://arxiv.org/abs/2505.19484)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across various tasks, yet they often exhibit specific cultural biases, neglecting the values and linguistic diversity of low-resource regions. This cultural bias not only undermines universal equality, but also risks reinforcing stereotypes and perpetuating discrimination. To address this, we propose CulFiT, a novel culturally-aware training paradigm that leverages multilingual data and fine-grained reward modeling to enhance cultural sensitivity and inclusivity. Our approach synthesizes diverse culture-related questions, constructs critique data in culturally relevant languages, and employs fine-grained rewards to decompose cultural texts into verifiable knowledge units for interpretable evaluation. We also introduce GlobalCultureQA, a multilingual open-ended question-answering dataset designed to evaluate culturally-aware responses in a global context. Extensive experiments on three existing benchmarks and our GlobalCultureQA demonstrate that CulFiT achieves state-of-the-art open-source model performance in cultural alignment and general reasoning.
摘要：大型语言模型（LLM）在各种任务中都表现出了非凡的能力，但是它们经常表现出特定的文化偏见，忽略了低资源地区的价值和语言多样性。这种文化偏见不仅破坏了普遍的平等，而且有可能加强刻板印象和永久歧视。为了解决这个问题，我们提出了Culfit，这是一种新型的文化意识训练范式，利用多语言数据和细分奖励建模来增强文化敏感性和包容性。我们的方法综合了与文化相关的各种问题，在文化上相关的语言中构建了批评数据，并采用细粒度的奖励将文化文本分解为可验证的知识单位，以进行可解释的评估。我们还介绍了GlobalUltureqa，这是一种多语言开放式的提问数据集，旨在评估在全球环境中的文化意识回答。对三个现有基准和我们的Globalultureqa进行的广泛实验表明，Culfit在文化一致性和一般推理中实现了最先进的开源模型。

Title: Anveshana: A New Benchmark Dataset for Cross-Lingual Information Retrieval On English Queries and Sanskrit Documents

Authors: Manoj Balaji Jagadeeshan, Prince Raj, Pawan Goyal
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2505.19494
Pdf URL: https://arxiv.org/pdf/2505.19494
Copy Paste: [[2505.19494]] Anveshana: A New Benchmark Dataset for Cross-Lingual Information Retrieval On English Queries and Sanskrit Documents(https://arxiv.org/abs/2505.19494)
Keywords: gpt
Abstract: The study presents a comprehensive benchmark for retrieving Sanskrit documents using English queries, focusing on the chapters of the Srimadbhagavatam. It employs a tripartite approach: Direct Retrieval (DR), Translation-based Retrieval (DT), and Query Translation (QT), utilizing shared embedding spaces and advanced translation methods to enhance retrieval systems in a RAG framework. The study fine-tunes state-of-the-art models for Sanskrit's linguistic nuances, evaluating models such as BM25, REPLUG, mDPR, ColBERT, Contriever, and GPT-2. It adapts summarization techniques for Sanskrit documents to improve QA processing. Evaluation shows DT methods outperform DR and QT in handling the cross-lingual challenges of ancient texts, improving accessibility and understanding. A dataset of 3,400 English-Sanskrit query-document pairs underpins the study, aiming to preserve Sanskrit scriptures and share their philosophical importance widely. Our dataset is publicly available at this https URL
摘要：该研究提出了一个全面的基准，用于使用英文查询来检索梵语文件，重点介绍了Srimadbhagavatam的章节。它采用三方方法：直接检索（DR），基于翻译的检索（DT）和查询翻译（QT），利用共享的嵌入空间和高级翻译方法来增强RAG框架中的检索系统。该研究的梵语细微差别微调模型，评估BM25，Replug，MDPR，Colbert，Contriever和GPT-2等模型。它适应了梵语文档的摘要技术，以改善质量检查处理。评估显示，DT方法在处理古代文本的跨语性挑战方面的表现优于DR和QT，从而提高了可访问性和理解。这项研究的基础，旨在保留梵文经文并广泛地分享其哲学重要性。我们的数据集可在此HTTPS URL上公开获取

Title: LLM Meets Scene Graph: Can Large Language Models Understand and Generate Scene Graphs? A Benchmark and Empirical Study

Authors: Dongil Yang, Minjin Kim, Sunghwan Kim, Beong-woo Kwak, Minjun Park, Jinseok Hong, Woontack Woo, Jinyoung Yeo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.19510
Pdf URL: https://arxiv.org/pdf/2505.19510
Copy Paste: [[2505.19510]] LLM Meets Scene Graph: Can Large Language Models Understand and Generate Scene Graphs? A Benchmark and Empirical Study(https://arxiv.org/abs/2505.19510)
Keywords: language model, llm
Abstract: The remarkable reasoning and generalization capabilities of Large Language Models (LLMs) have paved the way for their expanding applications in embodied AI, robotics, and other real-world tasks. To effectively support these applications, grounding in spatial and temporal understanding in multimodal environments is essential. To this end, recent works have leveraged scene graphs, a structured representation that encodes entities, attributes, and their relationships in a scene. However, a comprehensive evaluation of LLMs' ability to utilize scene graphs remains limited. In this work, we introduce Text-Scene Graph (TSG) Bench, a benchmark designed to systematically assess LLMs' ability to (1) understand scene graphs and (2) generate them from textual narratives. With TSG Bench we evaluate 11 LLMs and reveal that, while models perform well on scene graph understanding, they struggle with scene graph generation, particularly for complex narratives. Our analysis indicates that these models fail to effectively decompose discrete scenes from a complex narrative, leading to a bottleneck when generating scene graphs. These findings underscore the need for improved methodologies in scene graph generation and provide valuable insights for future research. The demonstration of our benchmark is available at this https URL. Additionally, our code and evaluation data are publicly available at this https URL.
摘要：大语言模型（LLMS）的显着推理和概括能力为其在体现的AI，机器人技术和其他现实世界中扩展应用程序铺平了道路。为了有效地支持这些应用，在多模式环境中的空间和时间理解中基础是必不可少的。为此，最近的作品具有杠杆场景图，这是一种结构化表示形式，它在场景中编码实体，属性及其关系。但是，对LLMS使用场景图的能力的全面评估仍然有限。在这项工作中，我们介绍了文本场景图（TSG）基准，这是一种基准测试，旨在系统地评估LLMS（1）了解场景图的能力，（2）从文本叙述中生成它们。在TSG台面上，我们评估了11个LLM，并揭示了模型在场景图中的表现良好，但它们与场景图的生成斗争，尤其是对于复杂的叙述而言。我们的分析表明，这些模型无法有效地从复杂的叙述中分解离散场景，从而在生成场景图时会导致瓶颈。这些发现强调了对场景图生成中改进方法的需求，并为未来的研究提供了宝贵的见解。我们的基准测试的演示可在此HTTPS URL上获得。此外，我们的代码和评估数据在此HTTPS URL上公开可用。

Title: Causal Distillation: Transferring Structured Explanations from Large to Compact Language Models

Authors: Aggrey Muhebwa, Khalid K. Osman
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.19511
Pdf URL: https://arxiv.org/pdf/2505.19511
Copy Paste: [[2505.19511]] Causal Distillation: Transferring Structured Explanations from Large to Compact Language Models(https://arxiv.org/abs/2505.19511)
Keywords: language model
Abstract: Large proprietary language models exhibit strong causal reasoning abilities that smaller open-source models struggle to replicate. We introduce a novel framework for distilling causal explanations that transfers causal reasoning skills from a powerful teacher model to a compact open-source model. The key idea is to train the smaller model to develop causal reasoning abilities by generating structured cause-and-effect explanations consistent with those of the teacher model. To evaluate the quality of the student-generated explanations, we introduce a new metric called Causal Explanation Coherence (CEC) to assess the structural and logical consistency of causal reasoning. This metric uses sentence-level semantic alignment to measure how well each part of the generated explanation corresponds to the teacher's reference, capturing both faithfulness and coverage of the underlying causal chain. Our framework and the CEC metric provide a principled foundation for training smaller models to perform robust causal reasoning and for systematically assessing the coherence of explanations in language model outputs.
摘要：大型专有语言模型表现出强大的因果推理能力，而较小的开源模型则难以复制。我们介绍了一个新颖的框架，用于提炼因果解释，该解释将因果推理技能从强大的教师模型转移到紧凑的开源模型。关键的想法是训练较小的模型，通过产生与教师模型的结构性因果解释来发展因果推理能力。为了评估学生生成的解释的质量，我们引入了一个名为Causal解释一致性（CEC）的新指标，以评估因果推理的结构和逻辑一致性。该指标使用句子级的语义对准来衡量生成的解释的每个部分与教师的参考相对应，从而捕捉了基本因果关系的忠诚和覆盖范围。我们的框架和CEC指标为训练较小的模型提供了强大的因果推理，并系统地评估语言模型输出中解释的连贯性。

Title: SIPDO: Closed-Loop Prompt Optimization via Synthetic Data Feedback

Authors: Yaoning Yu, Ye Yu, Kai Wei, Haojing Luo, Haohan Wang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.19514
Pdf URL: https://arxiv.org/pdf/2505.19514
Copy Paste: [[2505.19514]] SIPDO: Closed-Loop Prompt Optimization via Synthetic Data Feedback(https://arxiv.org/abs/2505.19514)
Keywords: language model, llm, prompt
Abstract: Prompt quality plays a critical role in the performance of large language models (LLMs), motivating a growing body of work on prompt optimization. Most existing methods optimize prompts over a fixed dataset, assuming static input distributions and offering limited support for iterative improvement. We introduce SIPDO (Self-Improving Prompts through Data-Augmented Optimization), a closed-loop framework for prompt learning that integrates synthetic data generation into the optimization process. SIPDO couples a synthetic data generator with a prompt optimizer, where the generator produces new examples that reveal current prompt weaknesses and the optimizer incrementally refines the prompt in response. This feedback-driven loop enables systematic improvement of prompt performance without assuming access to external supervision or new tasks. Experiments across question answering and reasoning benchmarks show that SIPDO outperforms standard prompt tuning methods, highlighting the value of integrating data synthesis into prompt learning workflows.
摘要：及时的质量在大语模型（LLM）的性能中起着至关重要的作用，激发了迅速优化的越来越多的工作。假设静态输入分布并为迭代改进提供有限的支持，大多数现有方法都可以在固定数据集上优化提示。我们介绍了SIPDO（通过数据调制优化的提示进行自我改进的提示），这是一个闭环框架，用于迅速学习，将合成数据生成整合到优化过程中。 Sipdo将合成数据生成器与提示优化器融合在一起，在此生成器会产生新的示例，以揭示当前的提示弱点，而优化器会逐步完善提示以响应。此反馈驱动的循环可以系统地改进及时性能，而无需假设访问外部监督或新任务。跨问答案和推理基准的实验表明，Sipdo的表现优于标准提示方法，突出了将数据合成到及时学习工作流程中的价值。

Title: Bias in Political Dialogue: Tagging U.S. Presidential Debates with an Extended DAMSL Framework

Authors: Lavanya Prahallad, Radhika Mamidi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.19515
Pdf URL: https://arxiv.org/pdf/2505.19515
Copy Paste: [[2505.19515]] Bias in Political Dialogue: Tagging U.S. Presidential Debates with an Extended DAMSL Framework(https://arxiv.org/abs/2505.19515)
Keywords: gpt, chat
Abstract: We present a critical discourse analysis of the 2024 U.S. presidential debates, examining Donald Trump's rhetorical strategies in his interactions with Joe Biden and Kamala Harris. We introduce a novel annotation framework, BEADS (Bias Enriched Annotation for Dialogue Structure), which systematically extends the DAMSL framework to capture bias-driven and adversarial discourse features in political communication. BEADS includes a domain- and language-agnostic set of tags that model ideological framing, emotional appeals, and confrontational tactics. Our methodology compares detailed human annotation with zero-shot ChatGPT-assisted tagging on verified transcripts from the Trump and Biden (19,219 words) and Trump and Harris (18,123 words) debates. Our analysis shows that Trump consistently dominated in key categories: Challenge and Adversarial Exchanges, Selective Emphasis, Appeal to Fear, Political Bias, and Perceived Dismissiveness. These findings underscore his use of emotionally charged and adversarial rhetoric to control the narrative and influence audience perception. In this work, we establish BEADS as a scalable and reproducible framework for critical discourse analysis across languages, domains, and political contexts.
摘要：我们对2024年美国总统辩论进行了批判性的论述分析，并在与乔·拜登（Joe Biden）和卡玛拉·哈里斯（Kamala Harris）的互动中研究了唐纳德·特朗普（Donald Trump）的言辞策略。我们介绍了一个新颖的注释框架，珠子（对话结构的偏见丰富注释），该框架系统地扩展了DAMSL框架以捕捉偏见驱动的偏见和政治交流中的对抗性话语特征。珠子包括模拟意识形态框架，情感吸引力和对抗性策略的域和语言不可知论标签。我们的方法将详细的人类注释与零镜头chatgpt进行了比较，辅助标记了特朗普和拜登（19,219个单词）的经过验证的成绩单，特朗普和哈里斯（18,123个单词）辩论。我们的分析表明，特朗普始终统治着关键类别：挑战和对抗性交流，选择性强调，呼吁恐惧，政治偏见和被认为的轻蔑。这些发现强调了他对以情感充电和对抗性的言论来控制叙事并影响观众的感知。在这项工作中，我们将珠子建立为跨语言，领域和政治背景的批判性话语分析的可扩展且可重复的框架。

Title: Small Language Models: Architectures, Techniques, Evaluation, Problems and Future Adaptation

Authors: Tanjil Hasan Sakib, Md. Tanzib Hosain, Md. Kishor Morol
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.19529
Pdf URL: https://arxiv.org/pdf/2505.19529
Copy Paste: [[2505.19529]] Small Language Models: Architectures, Techniques, Evaluation, Problems and Future Adaptation(https://arxiv.org/abs/2505.19529)
Keywords: language model
Abstract: Small Language Models (SLMs) have gained substantial attention due to their ability to execute diverse language tasks successfully while using fewer computing resources. These models are particularly well suited to deployment in constrained environments, such as mobile devices, on-device processing, and edge systems. In this study, we present a complete assessment of SLMs, focusing on their design frameworks, training approaches, and techniques for lowering model size and complexity. We offer a novel classification system to organize the optimization approaches applied to SLMs, encompassing strategies like pruning, quantization, and model compression. Furthermore, we assemble an evaluation suite for SLMs from existing datasets, establishing a rigorous platform for measuring SLM capabilities. Alongside this, we discuss the important difficulties that remain unresolved in this sector, including trade-offs between efficiency and performance, and we suggest directions for future study. We expect this study to serve as a useful guide for researchers and practitioners who aim to construct compact, efficient, and high-performing language models.
摘要：小型语言模型（SLM）由于能够在使用较少的计算机资源而成功执行各种语言任务的能力，因此受到了很大的关注。这些模型特别适合在有限的环境中部署，例如移动设备，设备处理和边缘系统。在这项研究中，我们介绍了SLM的完整评估，重点关注其设计框架，培训方法以及降低模型大小和复杂性的技术。我们提供了一种新型的分类系统，以组织用于SLM的优化方法，包括修剪，量化和模型压缩等策略。此外，我们将SLM的评估套件研究与一些现有数据集组装，建立了一个严格的测量SLM功能的平台。除此之外，我们讨论了在该领域尚未解决的重要困难，包括效率和绩效之间的权衡，我们建议将来研究的方向。我们预计这项研究将成为旨在构建紧凑，高效和高性能语言模型的研究人员和从业者的有益指南。

Title: DoctorRAG: Medical RAG Fusing Knowledge with Patient Analogy through Textual Gradients

Authors: Yuxing Lu, Gecheng Fu, Wei Wu, Xukai Zhao, Sin Yee Goi, Jinzhuo Wang
Subjects: cs.CL, cs.AI, cs.CE, cs.IR, cs.MA
Abstract URL: https://arxiv.org/abs/2505.19538
Pdf URL: https://arxiv.org/pdf/2505.19538
Copy Paste: [[2505.19538]] DoctorRAG: Medical RAG Fusing Knowledge with Patient Analogy through Textual Gradients(https://arxiv.org/abs/2505.19538)
Keywords: agent
Abstract: Existing medical RAG systems mainly leverage knowledge from medical knowledge bases, neglecting the crucial role of experiential knowledge derived from similar patient cases -- a key component of human clinical reasoning. To bridge this gap, we propose DoctorRAG, a RAG framework that emulates doctor-like reasoning by integrating both explicit clinical knowledge and implicit case-based experience. DoctorRAG enhances retrieval precision by first allocating conceptual tags to queries and knowledge sources, together with a hybrid retrieval mechanism that draws on both relevant knowledge and similar patient cases. In addition, a Med-TextGrad module using multi-agent textual gradients is integrated to ensure that the final output adheres to the retrieved knowledge and patient query. Comprehensive experiments on multilingual, multitask datasets demonstrate that DoctorRAG significantly outperforms strong baseline RAG models and gains improvements from iterative refinements. Our approach generates more accurate, relevant, and comprehensive responses, taking a step towards more doctor-like medical reasoning systems.
摘要：现有的医学抹布系统主要利用医学知识基础的知识，忽略了来自类似患者病例的经验知识的关键作用，这是人类临床推理的关键组成部分。为了弥合这一差距，我们提出了Doctorrag，这是一个抹布框架，通过整合明确的临床知识和基于隐性的病例经验来模仿医生样的推理。 Doctorrag通过首先为查询和知识来源分配概念标签，并从相关知识和患者中分配混合检索机制，从而提高了检索精度。此外，还集成了使用多代理文本梯度的Med-Textgrad模块，以确保最终输出遵守检索到的知识和患者查询。关于多语言多任务数据集的全面实验表明，Doctorrag的表现明显优于强大的基线抹布模型，并从迭代改进中获得了改进。我们的方法产生了更准确，相关和全面的反应，迈出了更类似医生的医学推理系统。

Title: How Syntax Specialization Emerges in Language Models

Authors: Xufeng Duan, Zhaoqian Yao, Yunhao Zhang, Shaonan Wang, Zhenguang G. Cai
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.19548
Pdf URL: https://arxiv.org/pdf/2505.19548
Copy Paste: [[2505.19548]] How Syntax Specialization Emerges in Language Models(https://arxiv.org/abs/2505.19548)
Keywords: language model, llm
Abstract: Large language models (LLMs) have been found to develop surprising internal specializations: individual neurons, attention heads, and circuits become selectively sensitive to syntactic structure, reflecting patterns observed in the human brain. While this specialization is well-documented, how it emerges during training and what influences its development remain largely unknown. In this work, we tap into the black box of specialization by tracking its formation over time. By quantifying internal syntactic consistency across minimal pairs from various syntactic phenomena, we identify a clear developmental trajectory: syntactic sensitivity emerges gradually, concentrates in specific layers, and exhibits a 'critical period' of rapid internal specialization. This process is consistent across architectures and initialization parameters (e.g., random seeds), and is influenced by model scale and training data. We therefore reveal not only where syntax arises in LLMs but also how some models internalize it during training. To support future research, we will release the code, models, and training checkpoints upon acceptance.
摘要：已经发现大型语言模型（LLMS）会发展出令人惊讶的内部专业知识：单个神经元，注意力头和电路对句法结构有选择敏感，反映了在人脑中观察到的模式。尽管有充分记录的专业化，但在培训期间的出现以及影响其发展的原因仍然很大程度上是未知的。在这项工作中，我们通过随着时间的推移跟踪其编队来利用黑框。通过量化各种句法现象的最小对之间的内部句法一致性，我们确定了一个清晰的发育轨迹：句法灵敏度逐渐出现，集中在特定的层中，并表现出快速内部专业化的“关键时期”。此过程在架构和初始化参数（例如随机种子）之间是一致的，并且受模型量表和训练数据的影响。因此，我们不仅揭示了LLM中语法出现的何处，还可以揭示某些模型在培训期间如何内在的。为了支持未来的研究，我们将在接受后发布代码，模型和培训检查点。

Title: Towards Multi-Granularity Memory Association and Selection for Long-Term Conversational Agents

Authors: Derong Xu, Yi Wen, Pengyue Jia, Yingyi Zhang, wenlin zhang, Yichao Wang, Huifeng Guo, Ruiming Tang, Xiangyu Zhao, Enhong Chen, Tong Xu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.19549
Pdf URL: https://arxiv.org/pdf/2505.19549
Copy Paste: [[2505.19549]] Towards Multi-Granularity Memory Association and Selection for Long-Term Conversational Agents(https://arxiv.org/abs/2505.19549)
Keywords: language model, llm, agent
Abstract: Large Language Models (LLMs) have recently been widely adopted in conversational agents. However, the increasingly long interactions between users and agents accumulate extensive dialogue records, making it difficult for LLMs with limited context windows to maintain a coherent long-term dialogue memory and deliver personalized responses. While retrieval-augmented memory systems have emerged to address this issue, existing methods often depend on single-granularity memory segmentation and retrieval. This approach falls short in capturing deep memory connections, leading to partial retrieval of useful information or substantial noise, and hence to suboptimal performance. To tackle these limits, we propose MemGAS, a framework that enhances memory consolidation by constructing multi-granularity association, adaptive selection, and retrieval. MemGAS is based on multi-granularity memory units and employs Gaussian Mixture Models to cluster and associate new memories with historical ones. An entropy-based router adaptively selects the optimal granularity by evaluating query relevance distributions and balancing information completeness and noise. Retrieved memories are further refined via LLM-based filtering. Experiments on four long-term memory benchmarks demonstrate that MemGAS outperforms state-of-the-art methods on both question answering and retrieval tasks, achieving superior performance across different query types and top-K settings.
摘要：大型语言模型（LLM）最近在对话代理中被广泛采用。但是，用户和代理商之间越来越长的互动累积了广泛的对话记录，这使得具有有限上下文窗口的LLM难以维持连贯的长期对话记忆并提供个性化的响应。尽管已经出现了检索声明的内存系统来解决此问题，但现有方法通常取决于单个粒度内存分割和检索。这种方法在捕获深层内存连接方面缺乏，导致部分检索有用的信息或大量噪声，从而导致次优性能。为了应对这些限制，我们提出了一个备忘录，该框架是通过构建多粒性关联，自适应选择和检索来增强内存巩固的框架。 Memgas基于多晶型内存单元，并采用高斯混合模型将新记忆与历史记忆相关联。基于熵的路由器通过评估查询相关性分布以及平衡信息的完整性和噪声来适应最佳粒度。通过基于LLM的过滤进一步完善了检索的记忆。在四个长期内存基准上进行的实验表明，Memgas在问题答案和检索任务上都优于最先进的方法，从而在不同的查询类型和TOP-K设置上实现了卓越的性能。
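
The entropy-based routing mentioned in this abstract can be illustrated with a small sketch. It assumes, without confirmation from the abstract, that each granularity level has a list of query-to-memory relevance scores, that these are turned into a distribution with a softmax, and that the router hard-selects the level whose distribution is most peaked (lowest normalized entropy). The granularity names and the hard selection rule are hypothetical.

```python
import math
from typing import Dict, List


def softmax(scores: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax over relevance scores."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def normalized_entropy(probs: List[float]) -> float:
    """Shannon entropy divided by log(n), so levels with different sizes are comparable."""
    if len(probs) < 2:
        return 0.0
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(probs))


def route_granularity(relevance: Dict[str, List[float]], temperature: float = 1.0) -> str:
    """Pick the granularity whose relevance distribution is most concentrated."""
    entropies = {
        level: normalized_entropy(softmax(scores, temperature))
        for level, scores in relevance.items()
    }
    return min(entropies, key=entropies.get)


if __name__ == "__main__":
    scores = {
        "turn": [5.0, 0.1, 0.1, 0.1],     # one memory unit clearly matches the query
        "session": [1.0, 1.0, 1.0, 1.0],  # nothing stands out at this level
    }
    print(route_granularity(scores))
```

A low-entropy distribution means the query singles out a few memory units at that level; a near-uniform one suggests retrieval at that level would mostly return noise.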

Title: DocMEdit: Towards Document-Level Model Editing

Authors: Li Zeng, Zeming Liu, Chong Feng, Heyan Huang, Yuhang Guo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.19572
Pdf URL: https://arxiv.org/pdf/2505.19572
Copy Paste: [[2505.19572]] DocMEdit: Towards Document-Level Model Editing(https://arxiv.org/abs/2505.19572)
Keywords: language model, llm
Abstract: Model editing aims to correct errors and outdated knowledge in large language models (LLMs) at minimal cost. Prior research has proposed a variety of datasets to assess the effectiveness of these model editing methods. However, most existing datasets only require models to output short phrases or sentences, overlooking the widespread existence of document-level tasks in the real world and raising doubts about their practical usability. To address this limitation and promote the application of model editing in real-world scenarios, we propose the task of document-level model editing. To tackle such challenges and enhance model capabilities in practical settings, we introduce DocMEdit, a dataset focused on document-level model editing, characterized by document-level inputs and outputs, extrapolation, and multiple facts within a single edit. We propose a series of evaluation metrics and experiments. The results show that the difficulties in document-level model editing pose challenges for existing model editing methods.
摘要：模型编辑旨在以最低的成本纠正大语言模型（LLMS）中的错误和过时的知识。先前的研究提出了各种数据集来评估这些模型编辑方法的有效性。但是，大多数现有数据集仅需要模型来输出简短的短语或句子，忽略了现实世界中文档级任务的广泛存在，从而引起了对其实际可用性的疑问。旨在解决此限制并促进在现实情况下编辑的应用，我们提出了文档级模型编辑的任务。为了应对实际设置中的这些挑战并增强模型功能，我们介绍了\ Benchmarkname，这是一个专注于文档级模型编辑的数据集，其特征是文档级输入和输出，外推性和单个编辑中的多个事实。我们提出了一系列评估指标和实验。结果表明，文档级模型编辑的困难对现有模型编辑方法构成挑战。

Title: TailorKV: A Hybrid Framework for Long-Context Inference via Tailored KV Cache Optimization

Authors: Dingyu Yao, Bowen Shen, Zheng Lin, Wei Liu, Jian Luan, Bin Wang, Weiping Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.19586
Pdf URL: https://arxiv.org/pdf/2505.19586
Copy Paste: [[2505.19586]] TailorKV: A Hybrid Framework for Long-Context Inference via Tailored KV Cache Optimization(https://arxiv.org/abs/2505.19586)
Keywords: language model, llm
Abstract: The Key-Value (KV) cache in generative large language models (LLMs) introduces substantial memory overhead. Existing works mitigate this burden by offloading or compressing the KV cache. However, loading the entire cache incurs significant latency due to PCIe bandwidth bottlenecks in CPU-GPU communication, while aggressive compression causes notable performance degradation. We identify that certain layers in the LLM need to maintain global information and are unsuitable for selective loading. In contrast, other layers primarily focus on a few tokens with dominant activations that potentially incur substantial quantization error. This observation leads to a key insight that loading dominant tokens and quantizing all tokens can complement each other. Building on this insight, we propose a hybrid compression method, TailorKV, which seamlessly integrates quantization and offloading. TailorKV develops an inference framework along with a hardware-friendly implementation that leverages these complementary characteristics. Extensive long-context evaluations show that TailorKV achieves nearly lossless performance under aggressive compression settings, outperforming the state-of-the-art. In particular, Llama-3.1-8B with 128k context can be served on a single RTX 3090 GPU, reaching 82 ms per token during decoding.
摘要：生成大语言模型（LLMS）中的键值（KV）缓存引入了大量内存开销。现有作品通过卸载或压缩KV缓存来减轻这种负担。但是，加载整个缓存会因CPU-GPU通信中的PCIE带宽瓶颈而导致明显的潜伏期，而侵略性压缩会导致显着的性能降低。我们确定LLM中的某些层需要维护全局信息，并且不适合选择性加载。相比之下，其他层主要集中于一些具有显着激活的代币，这些代币可能会导致实质性的量化误差。该观察结果导致一个关键的见解，即加载主导令牌和量化所有令牌可以相互补充。在此洞察力的基础上，我们提出了一种混合压缩方法Tailorkv，该方法无缝地整合了量化和卸载。 Tailorkv开发了一个推理框架，以及对这些互补特征的硬件友好实现。广泛的长篇小说评估表现出，在积极的压缩环境下，尾巴近来实现了几乎无损的性能，表现优于最先进的表现。特别是，可以在单个RTX 3090 GPU中提供具有128K上下文的Llama-3.1-8b，在解码过程中达到每个令牌82毫秒。
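
Note: the memory pressure described in the TailorKV abstract can be illustrated with a back-of-the-envelope calculation. The sketch below is not taken from the paper. It assumes the publicly documented Llama-3.1-8B configuration (32 layers, 8 key-value heads, head dimension 128) and ignores quantization metadata such as scales and zero-points; the 4-bit setting is a hypothetical example, since the abstract does not state TailorKV's bit-widths.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2.0):
    """Bytes needed to cache keys and values for one sequence."""
    # The factor 2 accounts for storing both a key and a value vector per head.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value

# Assumed Llama-3.1-8B configuration (from the public model config, not the abstract).
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128
CONTEXT = 128 * 1024  # 128k tokens

# FP16 stores 2 bytes per value; a hypothetical 4-bit cache stores 0.5 bytes per value.
fp16_gib = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, CONTEXT) / 2**30
int4_gib = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, CONTEXT, bytes_per_value=0.5) / 2**30

print(f"FP16 KV cache at 128k tokens:  {fp16_gib:.1f} GiB")
print(f"4-bit KV cache at 128k tokens: {int4_gib:.1f} GiB")
```

Under these assumptions the FP16 cache alone is about 16 GiB, on top of roughly 16 GB of FP16 weights, which does not fit in the 24 GB of an RTX 3090. That gap is why some combination of compression and offloading is needed at this context length.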

Title: Multi-Agent Collaboration via Evolving Orchestration

Authors: Yufan Dang, Chen Qian, Xueheng Luo, Jingru Fan, Zihao Xie, Ruijie Shi, Weize Chen, Cheng Yang, Xiaoyin Che, Ye Tian, Xuantang Xiong, Lei Han, Zhiyuan Liu, Maosong Sun
Subjects: cs.CL, cs.AI, cs.MA
Abstract URL: https://arxiv.org/abs/2505.19591
Pdf URL: https://arxiv.org/pdf/2505.19591
Copy Paste: [[2505.19591]] Multi-Agent Collaboration via Evolving Orchestration(https://arxiv.org/abs/2505.19591)
Keywords: language model, llm, agent
Abstract: Large language models (LLMs) have achieved remarkable results across diverse downstream tasks, but their monolithic nature restricts scalability and efficiency in complex problem-solving. While recent research explores multi-agent collaboration among LLMs, most approaches rely on static organizational structures that struggle to adapt as task complexity and agent numbers grow, resulting in coordination overhead and inefficiencies. To this end, we propose a puppeteer-style paradigm for LLM-based multi-agent collaboration, where a centralized orchestrator ("puppeteer") dynamically directs agents ("puppets") in response to evolving task states. This orchestrator is trained via reinforcement learning to adaptively sequence and prioritize agents, enabling flexible and evolvable collective reasoning. Experiments on closed- and open-domain scenarios show that this method achieves superior performance with reduced computational costs. Analyses further reveal that the key improvements consistently stem from the emergence of more compact, cyclic reasoning structures under the orchestrator's evolution.
摘要：大型语言模型（LLM）在各种下游任务中取得了出色的结果，但是它们的整体性质限制了复杂问题解决的可扩展性和效率。尽管最近的研究探讨了LLM之间的多代理协作，但大多数方法都依赖于静态的组织结构，这些组织结构很难适应任务复杂性和代理数量的增长，从而导致协调开销和效率低下。为此，我们为基于LLM的多代理协作提出了一个木偶式式范式，其中集中式编目（“ Puppeteer”）动态地指导代理（“ Puppets”）以响应不断发展的任务状态。该编排者通过加强学习来适应序列和优先级的代理，从而实现灵活且可演化的集体推理。对闭合和开放域情景的实验表明，这种方法通过降低的计算成本实现了卓越的性能。分析进一步表明，关键的改进始终源于在编排者进化下更紧凑，循环推理结构的出现。

Title: Evaluating Robustness of Large Audio Language Models to Audio Injection: An Empirical Study

Authors: Guanyu Hou, Jiaming He, Yinhang Zhou, Ji Guo, Yitong Qiao, Rui Zhang, Wenbo Jiang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.19598
Pdf URL: https://arxiv.org/pdf/2505.19598
Copy Paste: [[2505.19598]] Evaluating Robustness of Large Audio Language Models to Audio Injection: An Empirical Study(https://arxiv.org/abs/2505.19598)
Keywords: language model, prompt
Abstract: Large Audio-Language Models (LALMs) are increasingly deployed in real-world applications, yet their robustness against malicious audio injection attacks remains underexplored. This study systematically evaluates five leading LALMs across four attack scenarios: Audio Interference Attack, Instruction Following Attack, Context Injection Attack, and Judgment Hijacking Attack. We quantitatively assess their vulnerabilities and resilience using metrics such as Defense Success Rate, Context Robustness Score, and Judgment Robustness Index. Experimental results reveal significant performance disparities among models; no single model consistently outperforms others across all attack types. The position of malicious content critically influences attack effectiveness, particularly when placed at the beginning of sequences. A negative correlation between instruction-following capability and robustness suggests models adhering strictly to instructions may be more susceptible, contrasting with greater resistance by safety-aligned models. Additionally, system prompts show mixed effectiveness, indicating the need for tailored strategies. This work introduces a benchmark framework and highlights the importance of integrating robustness into training pipelines. Findings emphasize developing multi-modal defenses and architectural designs that decouple capability from susceptibility for secure LALM deployment.
摘要：大型音频语言模型（LALMS）越来越多地在现实世界中部署，但是它们针对恶意音频注射攻击的稳健性仍然没有得到充实。这项研究系统地评估了四种攻击场景中的五个领先的LALM：音频干扰攻击，攻击后的指示，上下文注射攻击和判断劫持攻击。使用诸如国防成功率，上下文鲁棒性得分和判断鲁棒性指数之类的指标，对其脆弱性和韧性进行了定量评估。实验结果显示模型之间的绩效差异很大。在所有攻击类型中，没有任何单个模型始终优于其他模型。恶意内容的位置严重影响攻击效果，尤其是在序列开始时。指导跟随能力和鲁棒性之间的负相关性表明，严格遵守指令的模型可能更容易受到敏感，与安全一致模型的抵抗力更大。此外，系统提示表现出不同的效力，表明需要量身定制的策略。这项工作引入了基准框架，并突出了将鲁棒性整合到训练管道中的重要性。调查结果强调开发多模式防御和建筑设计，使能力与安全LALMS部署的易感性相抗衡。

Title: Inconsistent Tokenizations Cause Language Models to be Perplexed by Japanese Grammar

Authors: Andrew Gambardella, Takeshi Kojima, Yusuke Iwasawa, Yutaka Matsuo
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.19599
Pdf URL: https://arxiv.org/pdf/2505.19599
Copy Paste: [[2505.19599]] Inconsistent Tokenizations Cause Language Models to be Perplexed by Japanese Grammar(https://arxiv.org/abs/2505.19599)
Keywords: language model
Abstract: Typical methods for evaluating the performance of language models evaluate their ability to answer questions accurately. These evaluation metrics are acceptable for determining the extent to which language models can understand and reason about text in a general sense, but fail to capture nuanced capabilities, such as the ability of language models to recognize and obey rare grammar points, particularly in languages other than English. We measure the perplexity of language models when confronted with the "first person psych predicate restriction" grammar point in Japanese. Weblab is the only tested open source model in the 7-10B parameter range which consistently assigns higher perplexity to ungrammatical psych predicate sentences than grammatical ones. We give evidence that Weblab's uniformly bad tokenization is a possible root cause for its good performance, and show that Llama 3's perplexity on grammatical psych predicate sentences can be reduced by orders of magnitude (28x difference) by restricting test sentences to those with uniformly well-behaved tokenizations. We show in further experiments on machine translation tasks that language models will use alternative grammar patterns in order to produce grammatical sentences when tokenization issues prevent the most natural sentence from being output.
摘要：评估语言模型性能的典型方法评估了他们准确回答问题的能力。这些评估指标是可以接受的，可以从一般意义上确定语言模型可以理解和推理文本的程度，但无法捕获细微的能力，例如语言模型识别和遵守稀有语法点的能力，尤其是英语以外的语言。当面对日语中的“第一人称心理谓词限制”语法点时，我们衡量语言模型的困惑。 Weblab是7-10b参数范围中唯一经过测试的开源模型，该模型始终比语法句子分配给非语法心理谓语句子。我们提供的证据表明，韦布拉布（Weblab）统一的不良令牌化是其良好表现的可能根本原因，并表明，通过将测试句子限制为统一的均匀行为良好的示威者，可以通过将测试句子限制为数量级（28x差异）来降低语法心理谓语句子的困惑。我们在有关机器翻译任务的进一步实验中表明，语言模型将使用替代语法模式，以便在标记问题阻止最自然的句子输出时产生语法句子。
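
Note: the abstract above compares the perplexity a model assigns to grammatical and ungrammatical sentences. As a minimal sketch of that measurement (not the authors' code), perplexity is the exponential of the mean negative log-likelihood per token. The log-probabilities below are hypothetical placeholders; in practice they would come from a language model's output for each token.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the mean negative log-likelihood per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)

# Hypothetical natural-log token probabilities for two three-token sentences.
grammatical = [math.log(0.5), math.log(0.25), math.log(0.5)]
ungrammatical = [math.log(0.5), math.log(0.01), math.log(0.5)]

# A model "passes" this minimal pair if the ungrammatical sentence is more perplexing.
prefers_grammatical = perplexity(grammatical) < perplexity(ungrammatical)
print(perplexity(grammatical), perplexity(ungrammatical), prefers_grammatical)
```

Because the mean is taken per token, the value depends on how a sentence is split into tokens, which is one reason tokenization consistency can matter for comparisons of this kind.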

Title: Languages in Multilingual Speech Foundation Models Align Both Phonetically and Semantically

Authors: Ryan Soh-Eun Shim, Domenico De Cristofaro, Chengzhi Martin Hu, Alessandro Vietti, Barbara Plank
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.19606
Pdf URL: https://arxiv.org/pdf/2505.19606
Copy Paste: [[2505.19606]] Languages in Multilingual Speech Foundation Models Align Both Phonetically and Semantically(https://arxiv.org/abs/2505.19606)
Keywords: language model
Abstract: Cross-lingual alignment in pretrained language models (LMs) has enabled efficient transfer in text-based LMs. Such an alignment has also been observed in speech foundation models. However, it remains an open question whether findings and methods from text-based cross-lingual alignment apply to speech. Building on prior work on spoken translation retrieval, we perform pronunciation-controlled experiments to observe if cross-lingual alignment can indeed occur in such models on a semantic basis, instead of relying on phonetic similarities. Our findings indicate that even in the absence of phonetic cues, spoken translation retrieval accuracy remains relatively stable. We follow up with a controlled experiment on a word-level dataset of cross-lingual synonyms and near-homophones, confirming the existence of both phonetic and semantic knowledge in the encoder. Finally, we qualitatively examine the transcriptions produced by early exiting the encoder, where we observe that speech translation produces semantic errors that are characterized by phonetic similarities to corresponding words in the source language. We apply this insight from early exiting to speech recognition in seven low-resource languages unsupported by the Whisper model, and achieve improved accuracy in all languages examined, particularly for languages with transparent orthographies.
摘要：预读语言模型（LMS）中的跨语性对齐能够在基于文本的LMS中有效传输。在语音基础模型中也观察到了这样的一致性。但是，这仍然是一个悬而未决的问题，是否适用于语音的基于文本的跨语性对齐方式。在先前关于口头翻译检索的工作的基础上，我们执行了发音控制的实验，以观察是否确实在此类模型中确实可以在语义上进行跨语言对准，而不是依靠语音相似性。我们的发现表明，即使没有语音提示，口语翻译的检索精度仍然相对稳定。我们对跨语性同义词和近载体的单词级数据集进行了对照实验，证实了编码器中语音和语义知识的存在。最后，我们定性地检查了早期退出编码器所产生的转录，在那里我们观察到语音翻译会产生语义错误，这些误差的特征是与源语言中的相应单词相似。我们将这种洞察力从早期退出到语音识别中，以七种不受耳语模型支持的低资源语言，并在所检查的所有语言中提高准确性，尤其是对于具有透明拼字法的语言。

Title: HomeBench: Evaluating LLMs in Smart Homes with Valid and Invalid Instructions Across Single and Multiple Devices

Authors: Silin Li, Yuhang Guo, Jiashu Yao, Zeming Liu, Haifeng Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.19628
Pdf URL: https://arxiv.org/pdf/2505.19628
Copy Paste: [[2505.19628]] HomeBench: Evaluating LLMs in Smart Homes with Valid and Invalid Instructions Across Single and Multiple Devices(https://arxiv.org/abs/2505.19628)
Keywords: language model, gpt, llm, retrieval-augmented generation
Abstract: Large language models (LLMs) have the potential to revolutionize smart home assistants by enhancing their ability to accurately understand user needs and respond appropriately, which is extremely beneficial for building a smarter home environment. While recent studies have explored integrating LLMs into smart home systems, they primarily focus on handling straightforward, valid single-device operation instructions. However, real-world scenarios are far more complex and often involve users issuing invalid instructions or controlling multiple devices simultaneously. These scenarios pose two main challenges: LLMs must accurately identify and rectify errors in user instructions, and they must execute multiple user instructions perfectly. To address these challenges and advance the development of LLM-based smart home assistants, we introduce HomeBench, the first smart home dataset with valid and invalid instructions across single and multiple devices. We report experimental results on 13 distinct LLMs; e.g., GPT-4o achieves only a 0.0% success rate in the scenario of invalid multi-device instructions, revealing that existing state-of-the-art LLMs still cannot perform well in this situation even with the help of in-context learning, retrieval-augmented generation, and fine-tuning. Our code and dataset are publicly available at this https URL.
摘要：大型语言模型（LLMS）有可能通过增强其准确了解用户需求和适当响应的能力来彻底改变智能家庭助理，这对于建立更智能的家庭环境非常有益。尽管最近的研究探索了将LLM集成到智能家庭系统中，但它们主要集中于处理直接，有效的单个设备操作指令。但是，实际情况更为复杂，并且通常涉及用户同时发出无效的说明或控制多个设备。这些面临两个主要的挑战：LLM必须准确地识别和纠正用户指令中的错误，并完美地执行多个用户指令。为了应对这些挑战并推进基于LLM的智能家居助理的开发，我们介绍了HomeBench，这是本文中单个和多个设备的第一个智能主页数据集，该数据集具有有效和无效的说明。我们对13个不同的LLM有实验结果；例如，在无效的多设备说明的情况下，GPT-4O仅达到0.0％的成功率，表明即使在这种情况下，在这种情况下，现有的最先进的LLMS仍然无法表现良好，即使借助在文章中的学习，检验式学习的一代，并进行了微调。我们的代码和数据集可在此HTTPS URL上公开获得。

Title: DoctorAgent-RL: A Multi-Agent Collaborative Reinforcement Learning System for Multi-Turn Clinical Dialogue

Authors: Yichun Feng, Jiawei Wang, Lu Zhou, Yixue Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.19630
Pdf URL: https://arxiv.org/pdf/2505.19630
Copy Paste: [[2505.19630]] DoctorAgent-RL: A Multi-Agent Collaborative Reinforcement Learning System for Multi-Turn Clinical Dialogue(https://arxiv.org/abs/2505.19630)
Keywords: language model, llm, agent
Abstract: Large language models (LLMs) have demonstrated excellent capabilities in the field of biomedical question answering, but their application in real-world clinical consultations still faces core challenges. Existing systems rely on a one-way information transmission mode where patients must fully describe their symptoms in a single round, leading to nonspecific diagnostic recommendations when complaints are vague. Traditional multi-turn dialogue methods based on supervised learning are constrained by static data-driven paradigms, lacking generalizability and struggling to intelligently extract key clinical information. To address these limitations, we propose DoctorAgent-RL, a reinforcement learning (RL)-based multi-agent collaborative framework that models medical consultations as a dynamic decision-making process under uncertainty. The doctor agent continuously optimizes its questioning strategy within the RL framework through multi-turn interactions with the patient agent, dynamically adjusting its information-gathering path based on comprehensive rewards from the Consultation Evaluator. This RL fine-tuning mechanism enables LLMs to autonomously develop interaction strategies aligned with clinical reasoning logic, rather than superficially imitating patterns in existing dialogue data. Notably, we constructed MTMedDialog, the first English multi-turn medical consultation dataset capable of simulating patient interactions. Experiments demonstrate that DoctorAgent-RL outperforms existing models in both multi-turn reasoning capability and final diagnostic performance, demonstrating practical value in assisting clinical consultations. this https URL
摘要：大型语言模型（LLMS）在生物医学问题答案领域表现出了出色的功能，但是它们在现实世界中的临床咨询中的应用仍然面临着核心挑战。现有系统依赖于单向信息传播模式，患者必须在一轮中充分描述其症状，从而在投诉模糊时会导致非特异性诊断建议。基于监督学习的传统多转向对话方法受到静态数据驱动范式的限制，缺乏普遍性和努力智能提取关键的临床信息。为了解决这些局限性，我们提出了Doctoragent-RL，这是一种基于强化学习（RL）的多代理协作框架，该框架将医疗咨询建模为不确定性下的动态决策过程。医生经纪人通过与患者代理的多转交互在RL框架中不断优化其质疑策略，从而根据咨询评估者的全面奖励动态调整其信息收集路径。这种RL微调机制使LLMS能够自主制定与临床推理逻辑一致的交互策略，而不是在现有对话数据中表面模仿模式。值得注意的是，我们构建了MTMedDialog，这是第一个能够模拟患者互动的英语多转移医疗咨询数据集。实验表明，Doctoragent-RL在多转弯推理能力和最终诊断性能中都胜过现有模型，这表明在协助临床咨询方面的实用价值。此HTTPS URL

Title: Segment First or Comprehend First? Explore the Limit of Unsupervised Word Segmentation with Large Language Models

Authors: Zihong Zhang, Liqi He, Zuchao Li, Lefei Zhang, Hai Zhao, Bo Du
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.19631
Pdf URL: https://arxiv.org/pdf/2505.19631
Copy Paste: [[2505.19631]] Segment First or Comprehend First? Explore the Limit of Unsupervised Word Segmentation with Large Language Models(https://arxiv.org/abs/2505.19631)
Keywords: language model, llm, prompt
Abstract: Word segmentation stands as a cornerstone of Natural Language Processing (NLP). Based on the concept of "comprehend first, segment later", we propose a new framework to explore the limit of unsupervised word segmentation with Large Language Models (LLMs) and evaluate the semantic understanding capabilities of LLMs based on word segmentation. We employ current mainstream LLMs to perform word segmentation across multiple languages to assess LLMs' "comprehension". Our findings reveal that LLMs are capable of following simple prompts to segment raw text into words. There is a trend suggesting that models with more parameters tend to perform better on multiple languages. Additionally, we introduce a novel unsupervised method, termed LLACA (Large Language Model-Inspired Aho-Corasick Automaton). Leveraging the advanced pattern recognition capabilities of Aho-Corasick automata, LLACA innovatively combines these with the deep insights of well-pretrained LLMs. This approach not only enables the construction of a dynamic n-gram model that adjusts based on contextual information but also integrates the nuanced understanding of LLMs, offering significant improvements over traditional methods. Our source code is available at this https URL
摘要：单词分割是自然语言处理（NLP）的基石。基于“先理解，后分词”的概念，我们提出了一个新框架，以使用大语言模型（LLM）探索无监督的单词分割的极限，并根据单词分割评估LLMS的语义理解能力。我们采用当前的主流LLM来跨多种语言进行单词分割，以评估LLMS的“理解”。我们的发现表明，LLMS能够遵循简单的提示，将原始文本细分为单词。有一种趋势表明，具有更多参数的模型倾向于在多种语言上表现更好。此外，我们介绍了一种新颖的无监督方法，称为LLACA（Large Language Model-Inspired Aho-Corasick Automaton）。利用Aho-Corasick自动机的高级模式识别能力，LLACA创新地将这些能力与精心预训练的LLM的深刻见解相结合。这种方法不仅可以构建动态n-gram模型，该模型根据上下文信息进行了调整，而且还整合了对LLM的细微理解，从而对传统方法进行了重大改进。我们的源代码可在此HTTPS URL上找到
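
Note: LLACA builds on Aho-Corasick automata, which find all occurrences of a set of dictionary words in a text in a single pass. The sketch below is the textbook Aho-Corasick construction only, not the LLACA method itself; the function names and the toy vocabulary are illustrative assumptions.

```python
from collections import deque

def build_automaton(patterns):
    """Build trie transitions, failure links, and output lists for the patterns."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # failure link per state
    out = [[]]    # patterns recognized on reaching each state
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pat)
    # Breadth-first pass: a state's failure link points to the longest proper
    # suffix of its string that is also a prefix of some pattern.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit matches that end at the suffix state.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out

def find_all(text, automaton):
    """Return (start_index, pattern) for every pattern occurrence in text."""
    goto, fail, out = automaton
    state, matches = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat in out[state]:
            matches.append((i - len(pat) + 1, pat))
    return matches

# Toy vocabulary; a segmenter would use a much larger word list.
matches = find_all("ushers", build_automaton(["he", "she", "his", "hers"]))
print(matches)
```

Overlapping candidates such as "she", "he", and "hers" are all reported, so a segmentation method still needs a separate scoring step to choose among them.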

Title: Faster and Better LLMs via Latency-Aware Test-Time Scaling

Authors: Zili Wang, Tianyu Zhang, Haoli Bai, Lu Hou, Xianzhi Yu, Wulong Liu, Shiming Xiang, Lei Zhu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.19634
Pdf URL: https://arxiv.org/pdf/2505.19634
Copy Paste: [[2505.19634]] Faster and Better LLMs via Latency-Aware Test-Time Scaling(https://arxiv.org/abs/2505.19634)
Keywords: language model, llm
Abstract: Test-Time Scaling (TTS) has proven effective in improving the performance of Large Language Models (LLMs) during inference. However, existing research has overlooked the efficiency of TTS from a latency-sensitive perspective. Through a latency-aware evaluation of representative TTS methods, we demonstrate that a compute-optimal TTS does not always result in the lowest latency in scenarios where latency is critical. To address this gap and achieve latency-optimal TTS, we propose two key approaches by optimizing the concurrency configurations: (1) branch-wise parallelism, which leverages multiple concurrent inference branches, and (2) sequence-wise parallelism, enabled by speculative decoding. By integrating these two approaches and allocating computational resources properly to each, our latency-optimal TTS enables a 32B model to reach 82.3% accuracy on MATH-500 within 1 minute and a smaller 3B model to achieve 72.4% within 10 seconds. Our work emphasizes the importance of latency-aware TTS and demonstrates its ability to deliver both speed and accuracy in latency-sensitive scenarios.
摘要：事实证明，测试时间缩放（TTS）可有效提高推理过程中大语言模型（LLM）的性能。但是，现有的研究从对潜伏期敏感的角度忽略了TT的效率。通过对代表性TTS方法的延迟感知评估，我们证明了在延迟至关重要的情况下，计算最佳的TTS并不总是会导致最低的延迟。为了解决这一差距并实现延迟 - 最佳的TT，我们通过优化并发配置提出了两种关键方法：（1）分支 - 智慧并行性，该分支平行性利用了多个并发推断分支，（2）序列方面的平行性，通过投机解码来启用。通过将这两种方法集成并正确分配计算资源，我们的延迟最佳TTS使32B模型在1分钟内可以在Math-500上达到82.3％的精度，而较小的3B模型可以在10秒内达到72.4％。我们的工作强调了延迟感知TT的重要性，并展示了其在潜伏期敏感的情况下同时提供速度和准确性的能力。

Title: Interleaved Reasoning for Large Language Models via Reinforcement Learning

Authors: Roy Xie, David Qiu, Deepak Gopinath, Dong Lin, Yanchao Sun, Chong Wang, Saloni Potdar, Bhuwan Dhingra
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.19640
Pdf URL: https://arxiv.org/pdf/2505.19640
Copy Paste: [[2505.19640]] Interleaved Reasoning for Large Language Models via Reinforcement Learning(https://arxiv.org/abs/2505.19640)
Keywords: language model, llm, chain-of-thought
Abstract: Long chain-of-thought (CoT) significantly enhances the reasoning capabilities of large language models (LLMs). However, the extensive reasoning traces lead to inefficiencies and an increased time-to-first-token (TTFT). We propose a novel training paradigm that uses reinforcement learning (RL) to guide reasoning LLMs to interleave thinking and answering for multi-hop questions. We observe that models inherently possess the ability to perform interleaved reasoning, which can be further enhanced through RL. We introduce a simple yet effective rule-based reward to incentivize correct intermediate steps, which guides the policy model toward correct reasoning paths by leveraging intermediate signals generated during interleaved reasoning. Extensive experiments conducted across five diverse datasets and three RL algorithms (PPO, GRPO, and REINFORCE++) demonstrate consistent improvements over traditional think-answer reasoning, without requiring external tools. Specifically, our approach reduces TTFT by over 80% on average and improves Pass@1 accuracy by up to 19.3%. Furthermore, our method, trained solely on question answering and logical reasoning datasets, exhibits strong generalization ability to complex reasoning datasets such as MATH, GPQA, and MMLU. Additionally, we conduct in-depth analysis to reveal several valuable insights into conditional reward modeling.
摘要：长期的思考链（COT）显着增强了大型语言模型（LLM）的推理能力。但是，广泛的推理痕迹导致效率低下和增加时间（TTFT）的增加。我们提出了一种新颖的培训范式，该训练范式使用加固学习（RL）指导推理LLMS以交织和回答多跳的问题。我们观察到，模型本质上具有执行交织的推理的能力，可以通过RL进一步增强。我们引入了一个简单而有效的基于规则的奖励，以激励正确的中间步骤，该步骤通过利用在交错推理期间产生的中间信号来指导政策模型来正确推理路径。在五个不同的数据集和三种RL算法（PPO，GRPO和增强++）上进行的广泛实验表明，在不需要外部工具的情况下，对传统的思想解答推理进行了一致的改进。具体而言，我们的方法平均将TTFT降低了80％以上，并在PASS@1准确度中提高了19.3％。此外，我们的方法仅根据问题答案和逻辑推理数据集进行了培训，具有强大的概括能力，可以复杂推理数据集，例如数学，GPQA和MMLU。此外，我们进行了深入的分析，以揭示对条件奖励建模的一些有价值的见解。

Title: Select, Read, and Write: A Multi-Agent Framework of Full-Text-based Related Work Generation

Authors: Xiaochuan Liu, Ruihua Song, Xiting Wang, Xu Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.19647
Pdf URL: https://arxiv.org/pdf/2505.19647
Copy Paste: [[2505.19647]] Select, Read, and Write: A Multi-Agent Framework of Full-Text-based Related Work Generation(https://arxiv.org/abs/2505.19647)
Keywords: agent
Abstract: Automatic related work generation (RWG) can save people time and effort when drafting a related work section (RWS) for further revision. However, existing RWG methods suffer from shallow comprehension, because they take only limited portions of the reference papers as input, and from isolated explanations of each reference, because they fail to capture the relationships among them. To address these issues, we focus on the full-text-based RWG task and propose a novel multi-agent framework. Our framework consists of three agents: a selector that decides which section of the papers to read next, a reader that digests the selected section and updates a shared working memory, and a writer that generates the RWS based on the final curated memory. To better capture the relationships among references, we also propose two graph-aware strategies for the selector, which optimize the reading order under the constraints of the graph structure. Extensive experiments demonstrate that our framework consistently improves performance across three base models and various input configurations. The graph-aware selectors outperform alternative selectors, achieving state-of-the-art results. The code and data are available at this https URL.
摘要：自动相关的工作生成（RWG）可以在撰写相关工作部分（RWS）的草稿以进行进一步修订时节省人们的时间和精力。但是，RWG的现有方法始终遭受浅层理解的影响，因为将参考文件的有限部分作为输入和隔离解释，因为捕获了它们之间的关系，因此每次参考文献的孤立解释。为了解决这些问题，我们专注于基于全文的RWG任务，并提出一个新颖的多代理框架。我们的框架由三个代理组成：一个选择器的选择器，该选择器下一步将要读取哪些部分，读者会消化所选部分并更新共享的工作内存，以及基于最终策划内存的RWS生成RWS的作者。为了更好地捕获参考文献之间的关系，我们还为选择器提出了两个图形感知的策略，从而使图形结构的约束优化阅读顺序。广泛的实验表明，我们的框架始终提高三种基本模型和各种输入配置的性能。图形感知的选择器的表现优于替代选择器，可以实现最新的结果。该代码和数据可在此HTTPS URL上找到。

Title: GenKI: Enhancing Open-Domain Question Answering with Knowledge Integration and Controllable Generation in Large Language Models

Authors: Tingjia Shen, Hao Wang, Chuan Qin, Ruijun Sun, Yang Song, Defu Lian, Hengshu Zhu, Enhong Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.19660
Pdf URL: https://arxiv.org/pdf/2505.19660
Copy Paste: [[2505.19660]] GenKI: Enhancing Open-Domain Question Answering with Knowledge Integration and Controllable Generation in Large Language Models(https://arxiv.org/abs/2505.19660)
Keywords: language model, llm
Abstract: Open-domain question answering (OpenQA) represents a cornerstone in natural language processing (NLP), primarily focused on extracting answers from unstructured textual data. With the rapid advancements in Large Language Models (LLMs), LLM-based OpenQA methods have reaped the benefits of emergent understanding and answering capabilities enabled by massive parameters compared to traditional methods. However, most of these methods encounter two critical challenges: how to integrate knowledge into LLMs effectively and how to adaptively generate results with specific answer formats for various task situations. To address these challenges, we propose a novel framework named GenKI, which aims to improve OpenQA performance by exploring Knowledge Integration and controllable Generation on LLMs simultaneously. Specifically, we first train a dense passage retrieval model to retrieve associated knowledge from a given knowledge base. Subsequently, we introduce a novel knowledge integration model that incorporates the retrieved knowledge into instructions during fine-tuning to strengthen the model. Furthermore, to enable controllable generation in LLMs, we leverage a fine-tuned LLM and an ensemble based on text consistency that incorporates coherence, fluency, and answer format assurance. Finally, extensive experiments conducted on the TriviaQA, MSMARCO, and CMRC2018 datasets, featuring diverse answer formats, have demonstrated the effectiveness of GenKI in comparison with state-of-the-art baselines. Moreover, ablation studies have disclosed a linear relationship between the frequency of retrieved knowledge and the model's ability to recall knowledge accurately against the ground truth. Our code of GenKI is available at this https URL
摘要：开放域问答（OPENQA）代表自然语言处理（NLP）的基石，主要侧重于从非结构化的文本数据中提取答案。随着大语言模型（LLMS）的快速发展，基于LLM的OpenQA方法与传统方法相比，通过大量参数实现了新兴理解和回答能力的好处。但是，这些方法中的大多数遇到了两个关键挑战：如何有效地将知识整合到LLMS中，以及如何适应各种任务情况的特定答案格式。为了应对这些挑战，我们提出了一个名为Genki的新颖框架，该框架旨在通过同时探索知识集成和可控生成来改善OpenQA性能。具体来说，我们首先训练一个密集的通道检索模型，以从给定的知识库中检索相关的知识。随后，我们引入了一种新颖的知识整合模型，该模型将检索知识纳入微调过程中以加强模型。此外，为了在LLM中启用可控的生成，我们根据文本一致性充分利用了一定的调整LLM和合奏，其中包含了所有连贯性，流利性和答案格式的保证。最后，在Triviaqa，MSMARCO和CMRC2018数据集上进行的广泛实验，具有多种答案格式，证明了Genki与先进基线的比较的有效性。此外，消融研究已经揭示了检索知识的频率与模型准确回忆知识与地面真理的能力之间的线性关系。我们的Genki守则可在此HTTPS URL上找到

Title: LeCoDe: A Benchmark Dataset for Interactive Legal Consultation Dialogue Evaluation

Authors: Weikang Yuan, Kaisong Song, Zhuoren Jiang, Junjie Cao, Yujie Zhang, Jun Lin, Kun Kuang, Ji Zhang, Xiaozhong Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.19667
Pdf URL: https://arxiv.org/pdf/2505.19667
Copy Paste: [[2505.19667]] LeCoDe: A Benchmark Dataset for Interactive Legal Consultation Dialogue Evaluation(https://arxiv.org/abs/2505.19667)
Keywords: language model, gpt, llm
Abstract: Legal consultation is essential for safeguarding individual rights and ensuring access to justice, yet remains costly and inaccessible to many individuals due to the shortage of professionals. While recent advances in Large Language Models (LLMs) offer a promising path toward scalable, low-cost legal assistance, current systems fall short in handling the interactive and knowledge-intensive nature of real-world consultations. To address these challenges, we introduce LeCoDe, a real-world multi-turn benchmark dataset comprising 3,696 legal consultation dialogues with 110,008 dialogue turns, designed to evaluate and improve LLMs' legal consultation capability. With LeCoDe, we innovatively collect live-streamed consultations from short-video platforms, providing authentic multi-turn legal consultation dialogues. The rigorous annotation by legal experts further enhances the dataset with professional insights and expertise. Furthermore, we propose a comprehensive evaluation framework that assesses LLMs' consultation capabilities in terms of (1) clarification capability and (2) professional advice quality. This unified framework incorporates 12 metrics across two dimensions. Through extensive experiments on various general and domain-specific LLMs, our results reveal significant challenges in this task, with even state-of-the-art models like GPT-4 achieving only 39.8% recall for clarification and 59% overall score for advice quality, highlighting the complexity of professional consultation scenarios. Based on these findings, we further explore several strategies to enhance LLMs' legal consultation abilities. Our benchmark contributes to advancing research in legal domain dialogue systems, particularly in simulating more real-world user-expert interactions.
摘要：法律咨询对于维护个人权利和确保司法权的机会至关重要，但由于专业人员的短缺，许多人仍然昂贵且无法接近。尽管大型语言模型（LLMS）的最新进展为通往可扩展，低成本法律援助的有前途的途径，但当前系统在处理现实世界咨询的互动性和知识密集性性质方面缺乏。为了应对这些挑战，我们介绍了Lecode，这是一个现实世界中的多转化基准数据集，其中包含3,696个法律咨询对话，其中有110,008个对话转弯，旨在评估和提高LLMS的法律咨询能力。借助Lecode，我们从短视频平台中创新地收集了实时咨询，提供了真实的多转弯法律咨询对话。法律专家的严格注释进一步通过专业见解和专业知识增强了数据集。此外，我们提出了一个全面的评估框架，以（1）澄清能力和（2）专业建议质量来评估LLMS的咨询功能。这个统一的框架在两个维度上包含了12个指标。通过对各种一般和域特异性LLM的广泛实验，我们的结果揭示了这项任务的重大挑战，甚至包括GPT-4（例如GPT-4）的最先进模型仅实现39.8％的澄清召回率，并提出了59％的咨询质量分数，强调了专业咨询场景的复杂性。基于这些发现，我们进一步探讨了增强LLMS法律咨询能力的几种策略。我们的基准有助于推进法律域对话系统中的研究，尤其是在模拟更真实的用户专家交互时。

Title: Reshaping Representation Space to Balance the Safety and Over-rejection in Large Audio Language Models

Authors: Hao Yang, Lizhen Qu, Ehsan Shareghi, Gholamreza Haffari
Subjects: cs.CL, cs.MM, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2505.19670
Pdf URL: https://arxiv.org/pdf/2505.19670
Copy Paste: [[2505.19670]] Reshaping Representation Space to Balance the Safety and Over-rejection in Large Audio Language Models(https://arxiv.org/abs/2505.19670)
Keywords: language model, llm
Abstract: Large Audio Language Models (LALMs) have extended the capabilities of Large Language Models (LLMs) by enabling audio-based human interactions. However, recent research has revealed that LALMs remain vulnerable to harmful queries due to insufficient safety alignment. Despite advances in defence measures for text and vision LLMs, effective safety-alignment strategies and audio-safety datasets specifically targeting LALMs are notably absent. Meanwhile, defence measures based on Supervised Fine-tuning (SFT) struggle to improve safety while avoiding over-rejection, which significantly compromises helpfulness. In this work, we propose an unsupervised safety fine-tuning strategy as a remedy that reshapes the model's representation space to enhance the safety alignment of existing LALMs while balancing the risk of over-rejection. Our experiments, conducted across three generations of Qwen LALMs, demonstrate that our approach significantly improves LALM safety under three modality input conditions (audio-text, text-only, and audio-only) while increasing the over-rejection rate by only 0.88% on average. Warning: this paper contains harmful examples.
摘要：大型音频语言模型（LALMS）通过启用基于音频的人类互动来扩展大语言模型（LLMS）的功能。但是，最近的研究表明，由于安全 - 安全不足，LALM仍然容易受到有害查询的影响。尽管有针对文本和视觉LLM的国防措施的进步，但明显没有针对LALM的有效的安全统一策略和音频安全数据集。同时，基于受监督的微调（SFT）的国防措施努力解决安全改善，同时避免过度拒绝问题，从而极大地损害了帮助。在这项工作中，我们提出了一种无监督的安全 - 调整策略，作为补救措施，它重塑了模型的表示空间，以增强现有的LALMS安全 - 同时平衡过度拒绝的风险。我们在三代QWEN LALMS进行的实验表明，我们的方法在三种模式输入条件（Audio-Text，仅文本和仅一本音频）下显着提高了LALMS的安全性，而过度拒绝率仅平均增加了0.88％。警告：本文包含有害的例子。

Title: Comparing Moral Values in Western English-speaking societies and LLMs with Word Associations

Authors: Chaoyi Xiang, Chunhua Liu, Simon De Deyne, Lea Frermann
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.19674
Pdf URL: https://arxiv.org/pdf/2505.19674
Copy Paste: [[2505.19674]] Comparing Moral Values in Western English-speaking societies and LLMs with Word Associations(https://arxiv.org/abs/2505.19674)
Keywords: language model, llm, prompt
Abstract: As the impact of large language models increases, understanding the moral values they reflect becomes ever more important. Assessing the nature of moral values as understood by these models via direct prompting is challenging due to potential leakage of human norms into model training data, and their sensitivity to prompt formulation. Instead, we propose to use word associations, which have been shown to reflect moral reasoning in humans, as low-level underlying representations to obtain a more robust picture of LLMs' moral reasoning. We study moral differences in associations from western English-speaking communities and LLMs trained predominantly on English data. First, we create a large dataset of LLM-generated word associations, resembling an existing data set of human word associations. Next, we propose a novel method to propagate moral values based on seed words derived from Moral Foundation Theory through the human and LLM-generated association graphs. Finally, we compare the resulting moral conceptualizations, highlighting detailed but systematic differences between moral values emerging from English speakers and LLM associations.
摘要：随着大语模型的影响增加，理解它们所反映的道德价值观变得越来越重要。通过直接提示来评估这些模型所理解的道德价值的性质，由于人类规范可能泄漏到模型训练数据中，以及它们对迅速制定的敏感性。取而代之的是，我们建议使用单词关联，这些单词关联已被证明反映了人类的道德推理，因为低级的基础表示，以获取更强大的LLMS道德推理图片。我们研究了以英语数据为主导的英语社区和LLM的协会中的道德差异。首先，我们创建了一个大型LLM生成的单词关联数据集，类似于人类单词关联的现有数据集。接下来，我们提出了一种新的方法来基于通过人类和LLM生成的关联图从道德基础理论得出的种子单词传播道德价值。最后，我们比较了由此产生的道德概念化，并强调了英语说话者和LLM协会的道德价值观之间的详细但系统的差异。

Title: Calibrating Pre-trained Language Classifiers on LLM-generated Noisy Labels via Iterative Refinement

Authors: Liqin Ye, Agam Shah, Chao Zhang, Sudheer Chava
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.19675
Pdf URL: https://arxiv.org/pdf/2505.19675
Copy Paste: [[2505.19675]] Calibrating Pre-trained Language Classifiers on LLM-generated Noisy Labels via Iterative Refinement(https://arxiv.org/abs/2505.19675)
Keywords: language model, llm
Abstract: The traditional process of creating labeled datasets is labor-intensive and expensive. Recent breakthroughs in open-source large language models (LLMs) have opened up a new avenue for automatically generating labeled datasets for various natural language processing (NLP) tasks, providing an alternative to such an expensive annotation process. However, the reliability of such auto-generated labels remains a significant concern due to inherent inaccuracies. When learning from noisy labels, the model's generalization is likely to be harmed because it is prone to overfitting the label noise. While previous studies on learning from noisy labels mainly focus on synthetic noise and real-world noise, LLM-generated label noise has received less attention. In this paper, we propose SiDyP (Simplex Label Diffusion with Dynamic Prior) to calibrate the classifier's prediction, thus enhancing its robustness towards LLM-generated noisy labels. SiDyP retrieves potential true label candidates by neighborhood label distribution in text embedding space and iteratively refines noisy candidates using a simplex diffusion model. Our framework can increase the performance of the BERT classifier fine-tuned on both zero-shot and few-shot LLM-generated noisy label datasets by an average of 7.21% and 7.30% respectively. We demonstrate the effectiveness of SiDyP by conducting extensive benchmarking for different LLMs over a variety of NLP tasks. Our code is available on GitHub.
摘要：创建标记数据集的传统过程是劳动密集型且昂贵的。最新的开源大语模型（LLM）的突破已经开辟了一种新的途径，以自动生成标记的数据集以自动进行各种自然语言处理（NLP）任务，从而为这种昂贵的注释过程提供了替代方案。但是，由于固有的不准确性，这种自动生成标签的可靠性仍然是一个重大问题。当从嘈杂的标签中学习时，模型的概括可能会受到损害，因为它容易过度贴上那些标签的噪音。虽然先前从嘈杂标签学习的研究主要集中于合成噪声和现实世界噪声，但LLM生成的标签噪声受到了较少的关注。在本文中，我们提出了SIDYP：单纯标记在校准分类器的预测之前用动态扩散，从而增强了其对LLM生成的嘈杂标签的稳健性。 Sidyp通过文本嵌入空间中的邻居标签分布来检索潜在的True标签候选者，并使用单纯形扩散模型进行了迭代的噪音候选。我们的框架可以将BERT分类器的性能提高在零拍摄和少数LLM生成的嘈杂标签数据集上平均分别为7.21％和7.30％。我们通过在各种NLP任务上为不同的LLM进行广泛的基准测试来证明SIDYP的有效性。我们的代码可在GitHub上找到。
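
The candidate-retrieval step described in the SiDyP abstract (looking at how labels are distributed among a sample's neighbors in embedding space) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function name, the use of Euclidean distance, and the choice to exclude the sample's own label are not taken from the paper, and the diffusion-based refinement is not shown.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def neighbor_label_distribution(
    embeddings: Sequence[Sequence[float]],
    noisy_labels: Sequence[str],
    index: int,
    k: int,
) -> Dict[str, float]:
    """Return the label distribution among the k nearest neighbors of one sample.

    Hypothetical helper: labels with non-zero mass can be read as candidate
    "true" labels for the sample at `index`.
    """
    target = embeddings[index]
    others: List[int] = [i for i in range(len(embeddings)) if i != index]
    # Rank the other samples by Euclidean distance to the target embedding.
    nearest = sorted(others, key=lambda i: math.dist(target, embeddings[i]))[:k]
    counts = Counter(noisy_labels[i] for i in nearest)
    return {label: count / len(nearest) for label, count in counts.items()}
```

In this toy form, a sample whose own noisy label disagrees with most of its neighbors would surface the neighbors' majority label as a candidate.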

Title: Grounding Language with Vision: A Conditional Mutual Information Calibrated Decoding Strategy for Reducing Hallucinations in LVLMs

Authors: Hao Fang, Changle Zhou, Jiawei Kong, Kuofeng Gao, Bin Chen, Tao Liang, Guojun Ma, Shu-Tao Xia
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2505.19678
Pdf URL: https://arxiv.org/pdf/2505.19678
Copy Paste: [[2505.19678]] Grounding Language with Vision: A Conditional Mutual Information Calibrated Decoding Strategy for Reducing Hallucinations in LVLMs(https://arxiv.org/abs/2505.19678)
Keywords: language model, hallucination
Abstract: Large Vision-Language Models (LVLMs) are susceptible to hallucinations, where generated responses seem semantically plausible yet exhibit little or no relevance to the input image. Previous studies reveal that this issue primarily stems from LVLMs' over-reliance on language priors while disregarding the visual information during decoding. To alleviate this issue, we introduce a novel Conditional Pointwise Mutual Information (C-PMI) calibrated decoding strategy, which adaptively strengthens the mutual dependency between generated texts and input images to mitigate hallucinations. Unlike existing methods solely focusing on text token sampling, we propose to jointly model the contributions of visual and textual tokens to C-PMI, formulating hallucination mitigation as a bi-level optimization problem aimed at maximizing mutual information. To solve it, we design a token purification mechanism that dynamically regulates the decoding process by sampling text tokens remaining maximally relevant to the given image, while simultaneously refining image tokens most pertinent to the generated response. Extensive experiments across various benchmarks reveal that the proposed method significantly reduces hallucinations in LVLMs while preserving decoding efficiency.
摘要：大型视觉模型（LVLM）容易受到幻觉的影响，在语义上产生的响应似乎是可见的，但与输入图像几乎没有或没有相关。先前的研究表明，这个问题主要源于LVLMS对语言先验的过度依赖，同时忽略了解码过程中的视觉信息。为了减轻这个问题，我们引入了一种新颖的条件性相互信息（C-PMI）校准的解码策略，该策略可自适应地增强生成的文本和输入图像之间的相互依赖性，以减轻幻觉。与仅关注文本令牌采样的现有方法不同，我们建议共同对视觉和文本代币对C-PMI的贡献进行共同建模，从而将减轻幻觉缓解作为旨在最大化共同信息的双层优化问题。为了解决它，我们设计了一个令牌纯化机制，该机制通过对文本令牌进行采样与给定图像最大相关的文本令牌，同时精炼与生成的响应最相关的图像令牌，从而动态调节解码过程。跨各种基准的广泛实验表明，所提出的方法显着降低了LVLMS的幻觉，同时保持解码效率。
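
For readers unfamiliar with the quantity named in the abstract above, conditional pointwise mutual information has a standard textbook form. The notation below is an assumption for illustration ($y$ for generated text, $v$ for the image tokens, $x$ for the textual context); the paper's exact C-PMI objective and its bi-level formulation may differ.

```latex
\mathrm{PMI}(y; v \mid x)
  = \log \frac{p(y, v \mid x)}{p(y \mid x)\, p(v \mid x)}
  = \log \frac{p(y \mid v, x)}{p(y \mid x)}
```

Under this reading, a large value means the image makes the generated text more likely than the language prior alone does, which is the dependency the decoding strategy is described as strengthening.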

Title: Leveraging Importance Sampling to Detach Alignment Modules from Large Language Models

Authors: Yi Liu, Dianqing Liu, Mingye Zhu, Junbo Guo, Yongdong Zhang, Zhendong Mao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.19700
Pdf URL: https://arxiv.org/pdf/2505.19700
Copy Paste: [[2505.19700]] Leveraging Importance Sampling to Detach Alignment Modules from Large Language Models(https://arxiv.org/abs/2505.19700)
Keywords: language model, llm
Abstract: The widespread adoption of large language models (LLMs) across industries has increased the demand for high-quality and customizable outputs. However, traditional alignment methods often require retraining large pretrained models, making it difficult to quickly adapt and optimize LLMs for diverse applications. To address this limitation, we propose a novel Residual Alignment Model (RAM) that formalizes the alignment process as a type of importance sampling. In this framework, the unaligned upstream model serves as the proposal distribution, while the alignment process is framed as secondary sampling based on an autoregressive alignment module that acts as an estimator of the importance weights. This design enables a natural detachment of the alignment module from the target aligned model, improving flexibility and scalability. Based on this model, we derive an efficient sequence-level training strategy for the alignment module, which operates independently of the proposal module. Additionally, we develop a resampling algorithm with iterative token-level decoding to address the common first-token latency issue in comparable methods. Experimental evaluations on two leading open-source LLMs across diverse tasks, including instruction following, domain adaptation, and preference optimization, demonstrate that our approach consistently outperforms baseline models.
摘要：整个行业中大型语言模型（LLM）的广泛采用增加了对高质量和可定制产出的需求。但是，传统的对准方法通常需要重新审议的大型模型，因此很难快速适应和优化LLM用于不同的应用程序。为了解决此限制，我们提出了一种新颖的\ textit {残留对齐模型}（\ textit {ram}），将对齐过程形式化为一种重要性采样类型。在此框架中，未对准的上游模型是提议分布，而对齐过程则基于自回归对齐模块的辅助采样，该模块是重要性权重的估计器。该设计使对齐模块从目标对齐模型中自然脱离，从而提高了灵活性和可扩展性。基于此模型，我们为对齐模块提供了有效的序列级训练策略，该模块独立于提案模块运行。此外，我们开发了一种具有迭代令牌级解码的重采样算法，以在可比方法中解决常见的第一句延迟问题。对两种领先的开源LLM的实验评估跨不同任务，包括以下指导，域的适应和偏好优化，表明我们的方法始终超过基线模型。
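
The importance-sampling view in the abstract above (candidates drawn from a proposal distribution, then re-drawn according to estimated importance weights) corresponds to generic sampling-importance-resampling. The sketch below shows only that generic step under stated assumptions: `weight_fn` is a hypothetical stand-in for the alignment module's weight estimate, the candidates are assumed to have already been sampled from the upstream model, and the paper's token-level decoding algorithm is not reproduced.

```python
import random
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def normalize(weights: Sequence[float]) -> List[float]:
    """Scale non-negative weights so that they sum to one."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return [w / total for w in weights]


def importance_resample(
    candidates: Sequence[T],
    weight_fn: Callable[[T], float],
    n: int,
    rng: random.Random,
) -> List[T]:
    """Re-draw n items from proposal samples, proportionally to their weights.

    `candidates` are assumed to come from the proposal (unaligned) model;
    `weight_fn` is a hypothetical importance-weight estimator.
    """
    weights = normalize([weight_fn(c) for c in candidates])
    return rng.choices(list(candidates), weights=weights, k=n)
```

Passing an explicit `random.Random` instance keeps the sketch reproducible; candidates with zero estimated weight are never re-drawn.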

Title: Error Typing for Smarter Rewards: Improving Process Reward Models with Error-Aware Hierarchical Supervision

Authors: Tej Deep Pala, Panshul Sharma, Amir Zadeh, Chuan Li, Soujanya Poria
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.19706
Pdf URL: https://arxiv.org/pdf/2505.19706
Copy Paste: [[2505.19706]] Error Typing for Smarter Rewards: Improving Process Reward Models with Error-Aware Hierarchical Supervision(https://arxiv.org/abs/2505.19706)
Keywords: language model, llm, hallucination
Abstract: Large Language Models (LLMs) are prone to hallucination, especially during multi-hop and reasoning-intensive tasks such as mathematical problem solving. While Outcome Reward Models verify only final answers, Process Reward Models (PRMs) score each intermediate step to steer generation toward coherent solutions. We introduce PathFinder-PRM, a novel hierarchical, error-aware discriminative PRM that first classifies math and consistency errors at each step, then combines these fine-grained signals to estimate step correctness. To train PathFinder-PRM, we construct a 400K-sample dataset by enriching the human-annotated PRM800K corpus and RLHFlow Mistral traces with three-dimensional step-level labels. On PRMBench, PathFinder-PRM achieves a new state-of-the-art PRMScore of 67.7, outperforming the prior best (65.5) while using 3 times less data. When applied to reward guided greedy search, our model yields prm@8 48.3, a +1.5 point gain over the strongest baseline. These results demonstrate that decoupled error detection and reward estimation not only boost fine-grained error detection but also substantially improve end-to-end, reward-guided mathematical reasoning with greater data efficiency.
摘要：大型语言模型（LLM）容易幻觉，尤其是在多跳和推理密集型任务（例如数学问题解决）的过程中。虽然结果奖励模型仅验证最终答案，但过程奖励模型（PRMS）为每个中间步骤划分，以将生成转向连贯的解决方案。我们介绍了Pathfinder-prm，这是一种新型的层次结构，错误意识到的歧视性PRM，该PRM首先分类了每个步骤的数学和一致性错误，然后将这些细粒信号结合在一起以估计步骤正确性。为了训练Pathfinder-PRM，我们通过使用三维阶梯标签的人类注销的PRM800K语料库和RLHFLOW MISTRAL TRACE来构建一个400K样本的数据集。在PRMBENCH上，Pathfinder-Prm获得了67.7的最新最先进的PRMSCORE，在使用少3倍的数据时表现优于先前的最佳（65.5）。当应用于奖励指导的贪婪搜索时，我们的模型会产生PRM@8 48.3，比最强基线的+1.5点增益。这些结果表明，脱钩的误差检测和奖励估计不仅可以提高细粒度的误差检测，而且还可以实质上提高端到端，奖励指导的数学推理，并提高数据效率。

Title: MT$^{3}$: Scaling MLLM-based Text Image Machine Translation via Multi-Task Reinforcement Learning

Authors: Zhaopeng Feng, Yupu Liang, Shaosheng Cao, Jiayuan Su, Jiahan Ren, Zhe Xu, Yao Hu, Wenxuan Huang, Jian Wu, Zuozhu Liu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.19714
Pdf URL: https://arxiv.org/pdf/2505.19714
Copy Paste: [[2505.19714]] MT$^{3}$: Scaling MLLM-based Text Image Machine Translation via Multi-Task Reinforcement Learning(https://arxiv.org/abs/2505.19714)
Keywords: language model, llm
Abstract: Text Image Machine Translation (TIMT), the task of translating textual content embedded in images, is critical for applications in accessibility, cross-lingual information access, and real-world document understanding. However, TIMT remains a complex challenge due to the need for accurate optical character recognition (OCR), robust visual-text reasoning, and high-quality translation, often requiring cascading multi-stage pipelines. Recent advances in large-scale Reinforcement Learning (RL) have improved reasoning in Large Language Models (LLMs) and Multimodal LLMs (MLLMs), but their application to end-to-end TIMT is still underexplored. To bridge this gap, we introduce MT$^{3}$, the first framework to apply Multi-Task RL to MLLMs for end-to-end TIMT. MT$^{3}$ adopts a multi-task optimization paradigm targeting three key sub-skills: text recognition, context-aware reasoning, and translation. It is trained using a novel multi-mixed reward mechanism that adapts rule-based RL strategies to TIMT's intricacies, offering fine-grained, non-binary feedback across tasks. Furthermore, to facilitate the evaluation of TIMT in authentic cross-cultural and real-world social media contexts, we introduce XHSPost, the first social media TIMT benchmark. Our MT$^{3}$-7B-Zero achieves state-of-the-art results on the latest in-domain MIT-10M benchmark, outperforming strong baselines such as Qwen2.5-VL-72B and InternVL2.5-78B by notable margins across multiple metrics. Additionally, the model shows strong generalization to out-of-distribution language pairs and datasets. In-depth analyses reveal how multi-task synergy, reinforcement learning initialization, curriculum design, and reward formulation contribute to advancing MLLM-driven TIMT.
摘要：文本图像计算机翻译（TIMT） - 翻译嵌入在图像中的文本内容的任务 - 对于可访问性，跨语性信息访问和现实世界中的文档的了解至关重要。但是，由于需要准确的光学特征识别（OCR），可靠的视觉文本推理和高质量的翻译，TIMT仍然是一个复杂的挑战，通常需要级联的多阶段管道。大规模增强学习（RL）的最新进展已改善了大语模型（LLMS）和多模式LLMS（MLLMS）的推理，但是它们在端到端TIMT的应用仍未得到充分展望。为了弥合此差距，我们介绍了MT $^{3} $，这是将多任务RL应用于MLLM的第一个框架。 MT $^{3} $采用针对三个关键子技能的多任务优化范式：文本识别，上下文感知推理和翻译。它是使用一种新型的多混合奖励机制训练的，该机制将基于规则的RL策略调整为Timt的复杂性，从而在跨任务中提供细粒度的非二元反馈。此外，为了促进在真实的跨文化和现实社交媒体环境中对Timt的评估，我们介绍了XHSpost，这是第一个社交媒体Timt Benchmark。我们的MT $^{3} $ -7B-Zero在最新的MIT-10M基准测试中取得了最新的结果，超过了QWEN2.5-VL-72B和InternVL2.5-78B等强大的基线，多个跨多个计量的利润率。此外，该模型对分布式语言对和数据集的强烈概括。深入的分析揭示了多任务协同，增强学习初始化，课程设计和奖励配方如何有助于推进MLLM驱动的TIMT。

Title: Graceful Forgetting in Generative Language Models

Authors: Chunyang Jiang, Chi-min Chan, Yiyang Cai, Yulong Liu, Wei Xue, Yike Guo
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.19715
Pdf URL: https://arxiv.org/pdf/2505.19715
Copy Paste: [[2505.19715]] Graceful Forgetting in Generative Language Models(https://arxiv.org/abs/2505.19715)
Keywords: language model
Abstract: Recently, the pretrain-finetune paradigm has become a cornerstone in various deep learning areas. While in general the pre-trained model would promote both effectiveness and efficiency of downstream tasks fine-tuning, studies have shown that not all knowledge acquired during pre-training is beneficial. Some of the knowledge may actually bring detrimental effects to the fine-tuning tasks, which is also known as negative transfer. To address this problem, graceful forgetting has emerged as a promising approach. The core principle of graceful forgetting is to enhance the learning plasticity of the target task by selectively discarding irrelevant knowledge. However, this approach remains underexplored in the context of generative language models, and it is often challenging to migrate existing forgetting algorithms to these models due to architecture incompatibility. To bridge this gap, in this paper we propose a novel framework, Learning With Forgetting (LWF), to achieve graceful forgetting in generative language models. With Fisher Information Matrix weighting the intended parameter updates, LWF computes forgetting confidence to evaluate self-generated knowledge regarding the forgetting task, and consequently, knowledge with high confidence is periodically unlearned during fine-tuning. Our experiments demonstrate that, although thoroughly uncovering the mechanisms of knowledge interaction remains challenging in pre-trained language models, applying graceful forgetting can contribute to enhanced fine-tuning performance.
摘要：最近，预处理范式已成为各个深度学习领域的基石。通常，预先训练的模型将促进下游任务的有效性和效率，但研究表明，并非在培训期间获得的所有知识都是有益的。某些知识实际上可能会给微调任务带来不利影响，这也称为负转移。为了解决这个问题，优雅的遗忘已经成为一种有前途的方法。优雅遗忘的核心原则是通过选择性地丢弃无关的知识来增强目标任务的学习可塑性。但是，在生成语言模型的背景下，这种方法仍然没有得到充实的态度，由于架构不兼容，将现有遗忘算法迁移到这些模型的算法通常是具有挑战性的。为了弥合这一差距，在本文中，我们提出了一个新颖的框架，学习遗忘（LWF），以实现生成语言模型的优雅遗忘。通过Fisher信息矩阵加权预期的参数更新，LWF计算忘记信心，以评估有关遗忘任务的自我生成的知识，因此，在微调期间，定期毫无疑问地没有置信度知识。我们的实验表明，尽管在预训练的语言模型中彻底揭示了知识互动的机制仍然具有挑战性，但应用优雅的遗忘可能会导致增强的微调表现。

Title: Distilling Closed-Source LLM's Knowledge for Locally Stable and Economic Biomedical Entity Linking

Authors: Yihao Ai, Zhiyuan Ning, Weiwei Dai, Pengfei Wang, Yi Du, Wenjuan Cui, Kunpeng Liu, Yuanchun Zhou
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.19722
Pdf URL: https://arxiv.org/pdf/2505.19722
Copy Paste: [[2505.19722]] Distilling Closed-Source LLM's Knowledge for Locally Stable and Economic Biomedical Entity Linking(https://arxiv.org/abs/2505.19722)
Keywords: language model, llm, prompt
Abstract: Biomedical entity linking aims to map nonstandard entities to standard entities in a knowledge base. Traditional supervised methods perform well but require extensive annotated data to transfer, limiting their usage in low-resource scenarios. Large language models (LLMs), especially closed-source LLMs, can address this limitation but bring stability risks and high economic costs: access to these models is restricted by commercial companies, and processing large amounts of data is expensive. To address this, we propose RPDR, a framework combining closed-source LLMs and open-source LLMs for re-ranking candidates retrieved by a retriever fine-tuned with a small amount of data. By prompting a closed-source LLM to generate training data from unannotated data and fine-tuning an open-source LLM for re-ranking, we effectively distill the knowledge into an open-source LLM that can be deployed locally, thus avoiding the stability issues and the problem of high economic costs. We evaluate RPDR on two datasets, including one real-world dataset and one publicly available dataset involving two languages: Chinese and English. RPDR achieves Acc@1 improvements of 0.019 and 0.036 on the Aier dataset and the Ask A Patient dataset, respectively, when the amount of training data is insufficient. The results demonstrate the superiority and generalizability of the proposed framework.
摘要：链接的生物医学实体旨在将非标准实体映射到知识库中的标准实体。传统监督方法的运行良好，但需要大量的注释数据才能传输，从而在低资源场景中限制了它们的使用情况。大型语言模型（LLMS），尤其是封闭式LLM，可以解决这些问题，但风险稳定问题和高昂的经济成本：使用这些模型受商业公司的限制，并在处理大量数据时带来了巨大的经济成本。为了解决这个问题，我们提出了``RPDR''，该框架结合了封闭源LLMS和开源LLMS，用于将候选者重新排列的候选者，该候选者通过少量数据进行了微调。通过提示封闭源LLM从未注释的数据中生成培训数据并微调开源LLM进行重新排行，我们将知识有效地提炼为可以在本地部署的开源LLM，从而避免稳定性问题和高经济成本的问题。我们在两个数据集上评估了RPDR，包括一个现实世界数据集和一个涉及两种语言的公开可用数据集：中文和英语。 RPDR可在AIER数据集上实现0.019 ACC@1改进和0.036 ACC@1改进，并且在培训数据的数量不足时，请询问患者数据集。结果证明了所提出的框架的优势和普遍性。

Title: Token-level Accept or Reject: A Micro Alignment Approach for Large Language Models

Authors: Yang Zhang, Yu Yu, Bo Tang, Yu Zhu, Chuxiong Sun, Wenqiang Wei, Jie Hu, Zipeng Xie, Zhiyu Li, Feiyu Xiong, Edward Chung
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.19743
Pdf URL: https://arxiv.org/pdf/2505.19743
Copy Paste: [[2505.19743]] Token-level Accept or Reject: A Micro Alignment Approach for Large Language Models(https://arxiv.org/abs/2505.19743)
Keywords: language model, llm
Abstract: With the rapid development of Large Language Models (LLMs), aligning these models with human preferences and values is critical to ensuring ethical and safe applications. However, existing alignment techniques such as RLHF or DPO often require direct fine-tuning on LLMs with billions of parameters, resulting in substantial computational costs and inefficiencies. To address this, we propose the Micro token-level Accept-Reject Aligning (MARA) approach, which is designed to operate independently of the language models. MARA simplifies the alignment process by decomposing sentence-level preference learning into token-level binary classification, where a compact three-layer fully-connected network determines whether candidate tokens are "Accepted" or "Rejected" as part of the response. Extensive experiments across seven different LLMs and three open-source datasets show that MARA achieves significant improvements in alignment performance while reducing computational costs.
摘要：随着大语言模型（LLM）的快速发展，将这些模型与人类的偏好和价值观保持一致，对于确保道德和安全应用至关重要。但是，现有的对齐技术（例如RLHF或DPO）通常需要对数十亿个参数的LLM进行直接调查，从而导致大量的计算成本和效率低下。为了解决这个问题，我们提出了旨在独立于语言模型操作的微令牌级接受对准（MARA）方法。 MARA通过将句子级的偏好学习分解为令牌级别的二进制分类来简化对齐过程，在该分类中，紧凑的三层完全连接的网络确定候选代币是“接受的”还是“被拒绝”作为响应的一部分。在七个不同的LLM和三个开源数据集中进行的广泛实验表明，Mara在降低计算成本的同时取得了重大改善。
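
The token-level accept/reject idea in the MARA abstract can be pictured with a small sketch: a three-layer fully-connected network maps a candidate token's features to an acceptance probability, and decoding keeps the first candidate that is accepted. The feature representation, the ReLU and sigmoid activations, the 0.5 threshold, and the fallback to the top candidate are all assumptions made for illustration; the abstract does not specify them.

```python
import math
from typing import List, Sequence, Tuple

Layer = Tuple[List[List[float]], List[float]]  # (weight rows, biases)


def linear(x: Sequence[float], layer: Layer) -> List[float]:
    """Apply one fully-connected layer: one output per weight row."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def accept_probability(features: Sequence[float], layers: Sequence[Layer]) -> float:
    """Forward pass: ReLU on hidden layers, sigmoid on the single output logit."""
    hidden = list(features)
    for layer in layers[:-1]:
        hidden = [max(0.0, v) for v in linear(hidden, layer)]
    logit = linear(hidden, layers[-1])[0]
    return 1.0 / (1.0 + math.exp(-logit))


def pick_token(
    candidates: Sequence[Tuple[str, Sequence[float]]],
    layers: Sequence[Layer],
    threshold: float = 0.5,
) -> str:
    """Return the first accepted candidate, assumed ordered by LM preference.

    Falls back to the top candidate if every candidate is rejected
    (a hypothetical policy, not taken from the paper).
    """
    for token, features in candidates:
        if accept_probability(features, layers) >= threshold:
            return token
    return candidates[0][0]
```

Because the classifier only sees per-token features, the base language model's weights stay untouched in this picture, which is the sense in which the abstract describes the approach as operating independently of the language model.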

Title: NeuSym-RAG: Hybrid Neural Symbolic Retrieval with Multiview Structuring for PDF Question Answering

Authors: Ruisheng Cao, Hanchong Zhang, Tiancheng Huang, Zhangyi Kang, Yuxin Zhang, Liangtai Sun, Hanqi Li, Yuxun Miao, Shuai Fan, Lu Chen, Kai Yu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.19754
Pdf URL: https://arxiv.org/pdf/2505.19754
Copy Paste: [[2505.19754]] NeuSym-RAG: Hybrid Neural Symbolic Retrieval with Multiview Structuring for PDF Question Answering(https://arxiv.org/abs/2505.19754)
Keywords: language model, llm, retrieval augmented generation, agent
Abstract: The increasing number of academic papers poses significant challenges for researchers to efficiently acquire key details. While retrieval augmented generation (RAG) shows great promise in large language model (LLM) based automated question answering, previous works often isolate neural and symbolic retrieval despite their complementary strengths. Moreover, conventional single-view chunking neglects the rich structure and layout of PDFs, e.g., sections and tables. In this work, we propose NeuSym-RAG, a hybrid neural symbolic retrieval framework which combines both paradigms in an interactive process. By leveraging multi-view chunking and schema-based parsing, NeuSym-RAG organizes semi-structured PDF content into both a relational database and a vectorstore, enabling LLM agents to iteratively gather context until it is sufficient to generate answers. Experiments on three full PDF-based QA datasets, including the self-annotated AIRQA-REAL, show that NeuSym-RAG consistently outperforms both the vector-based RAG and various structured baselines, highlighting its capacity to unify both retrieval schemes and utilize multiple views. Code and data are publicly available at this https URL.
摘要：越来越多的学术论文对研究人员提出了重大挑战，以有效获取关键细节。尽管基于大语言模型（LLM）自动化问题的回收增强发电（RAG）在大型语言模型（LLM）中表现出巨大的希望，但以前的作品尽管具有互补的优势，但仍经常隔离神经和象征性的检索。此外，常规的单视块忽略了PDF的丰富结构和布局，例如部分和表格。在这项工作中，我们提出了Neusym-rag，这是一种混合神经符号检索框架，将两个范式结合在交互过程中。通过利用多视图块和基于模式的解析，Neusym-rag将半结构化的PDF内容组织到关系数据库和矢量店中，使LLM代理能够迭代地收集上下文，直到足够产生答案。在三个完整的基于PDF的QA数据集上进行的实验，包括一个自称的AIRQA真实，表明Neusym-rag稳定地击败了基于矢量的抹布和各种结构化基线，突出了其统一检索方案和利用多个视图的能力。代码和数据在此HTTPS URL上公开可用。

Title: Efficient Reasoning via Chain of Unconscious Thought

Authors: Ruihan Gong, Yue Liu, Wenjie Qu, Mingzhe Du, Yufei He, Yingwei Ma, Yulin Chen, Xiang Liu, Yi Wen, Xinfeng Li, Ruidong Wang, Xinzhong Zhu, Bryan Hooi, Jiaheng Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.19756
Pdf URL: https://arxiv.org/pdf/2505.19756
Copy Paste: [[2505.19756]] Efficient Reasoning via Chain of Unconscious Thought(https://arxiv.org/abs/2505.19756)
Keywords: prompt
Abstract: Large Reasoning Models (LRMs) achieve promising performance but compromise token efficiency due to verbose reasoning processes. Unconscious Thought Theory (UTT) posits that complex problems can be solved more efficiently through internalized cognitive processes. Inspired by UTT, we propose a new reasoning paradigm, termed Chain of Unconscious Thought (CoUT), to improve the token efficiency of LRMs by guiding them to mimic human unconscious thought and internalize reasoning processes. Concretely, we first prompt the model to internalize the reasoning by thinking in the hidden layer. Then, we design a bag of token-efficient strategies to further help models reduce unnecessary tokens yet preserve the performance. Our work reveals that models may possess beneficial unconscious thought, enabling improved efficiency without sacrificing performance. Extensive experiments demonstrate the effectiveness of CoUT. Remarkably, it surpasses CoT by reducing token usage by 47.62% while maintaining comparable accuracy, as shown in Figure 1. The code of CoUT is available at this link: this https URL
摘要：大型推理模型（LRMS）实现了令人鼓舞的性能，但由于冗长的推理过程而妥协令牌效率。无意识的思想理论（UTT）认为，可以通过内在的认知过程更有效地解决复杂的问题。受UTT的启发，我们提出了一种新的推理范式，称为无意识思想链（COUT），以通过指导他们模仿人类无意识的思想和内部化推理过程来提高LRM的令牌效率。具体而言，我们首先提示模型通过在隐藏层中思考来内部化推理。然后，我们设计了一袋代币高效的策略，以进一步帮助模型减少不必要的令牌，但可以保留性能。我们的工作表明，模型可能具有有益的无意识思想，从而提高了效率而无需牺牲绩效。广泛的实验证明了代表的有效性。值得注意的是，它通过将令牌用法降低47.62％，同时保持可比较精度，从而超过了COT，如图1所示。cout代码可在此链接上可用：此https url

Title: SGM: A Framework for Building Specification-Guided Moderation Filters

Authors: Masoomali Fatehkia, Enes Altinisik, Husrev Taha Sencar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.19766
Pdf URL: https://arxiv.org/pdf/2505.19766
Copy Paste: [[2505.19766]] SGM: A Framework for Building Specification-Guided Moderation Filters(https://arxiv.org/abs/2505.19766)
Keywords: language model, llm
Abstract: Aligning large language models (LLMs) with deployment-specific requirements is critical but inherently imperfect. Despite extensive training, models remain susceptible to misalignment and adversarial inputs such as jailbreaks. Content moderation filters are commonly used as external safeguards, though they typically focus narrowly on safety. We introduce SGM (Specification-Guided Moderation), a flexible framework for training moderation filters grounded in user-defined specifications that go beyond standard safety concerns. SGM automates training data generation without relying on human-written examples, enabling scalable support for diverse, application-specific alignment goals. SGM-trained filters perform on par with state-of-the-art safety filters built on curated datasets, while supporting fine-grained and user-defined alignment control.
摘要：将大型语言模型（LLMS）与特定于部署的要求保持一致是至关重要的，但本质上是不完美的。尽管进行了广泛的培训，但模型仍然容易受到未对准和对抗性意见（例如越狱）的影响。内容适中过滤器通常用作外部保障措施，尽管它们通常关注安全性。我们介绍了SGM（规范引导的审核），这是一个灵活的框架，用于以用户定义的规范为基础的训练适度过滤器，超出了标准安全问题。 SGM可以在不依赖人写的示例的情况下自动化培训数据生成，从而为各种特定于应用程序的一致性目标提供了可扩展的支持。经过SGM训练的过滤器与构建的策划数据集建立的最先进的安全过滤器相同，同时支持细粒度和用户定义的对齐控制。

Title: T^2Agent A Tool-augmented Multimodal Misinformation Detection Agent with Monte Carlo Tree Search

Authors: Xing Cui, Yueying Zou, Zekun Li, Peipei Li, Xinyuan Xu, Xuannan Liu, Huaibo Huang, Ran He
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.19768
Pdf URL: https://arxiv.org/pdf/2505.19768
Copy Paste: [[2505.19768]] T^2Agent A Tool-augmented Multimodal Misinformation Detection Agent with Monte Carlo Tree Search(https://arxiv.org/abs/2505.19768)
Keywords: agent
Abstract: Real-world multimodal misinformation often arises from mixed forgery sources, requiring dynamic reasoning and adaptive verification. However, existing methods mainly rely on static pipelines and limited tool usage, limiting their ability to handle such complexity and diversity. To address this challenge, we propose T2Agent, a novel misinformation detection agent that incorporates an extensible toolkit with Monte Carlo Tree Search (MCTS). The toolkit consists of modular tools such as web search, forgery detection, and consistency analysis. Each tool is described using standardized templates, enabling seamless integration and future expansion. To avoid inefficiency from using all tools simultaneously, a Bayesian optimization-based selector is proposed to identify a task-relevant subset. This subset then serves as the action space for MCTS to dynamically collect evidence and perform multi-source verification. To better align MCTS with the multi-source nature of misinformation detection, T2Agent extends traditional MCTS with multi-source verification, which decomposes the task into coordinated subtasks targeting different forgery sources. A dual reward mechanism containing a reasoning trajectory score and a confidence score is further proposed to encourage a balance between exploration across mixed forgery sources and exploitation for more reliable evidence. We conduct ablation studies to confirm the effectiveness of the tree search mechanism and tool usage. Extensive experiments further show that T2Agent consistently outperforms existing baselines on challenging mixed-source multimodal misinformation benchmarks, demonstrating its strong potential as a training-free approach for enhancing detection accuracy. The code will be released.
摘要：现实世界中的多模式错误信息通常是由混合伪造来源引起的，需要动态推理和适应性验证。但是，现有方法主要依赖静态管道和有限的工具使用，从而限制了它们处理这种复杂性和多样性的能力。为了应对这一挑战，我们提出了T2Agent，这是一种新颖的错误信息检测剂，将可扩展的工具包与蒙特卡洛树搜索（MCTS）结合在一起。该工具包包括模块化工具，例如Web搜索，伪造检测和一致性分析。使用标准化模板来描述每个工具，从而实现无缝集成和未来的扩展。为了避免同时使用所有工具的效率低下，提出了基于贝叶斯优化的选择器来识别与任务相关的子集。然后，该子集是MCT动态收集证据并执行多源验证的动作空间。为了使MCT与错误信息检测的多源性性质保持一致，T2agent通过多源验证扩展了传统的MCT，将任务分解为针对不同伪造来源的协调子任务。进一步提出了包含推理轨迹得分和置信度评分的双重奖励机制，以鼓励在混合伪造来源的探索与剥削之间取得平衡，以获得更可靠的证据。我们进行消融研究以确认树木搜索机制和工具使用的有效性。广泛的实验进一步表明，T2agent在挑战混合源多模式错误信息基准基准方面始终优于现有基准，这表明其作为提高检测准确性的无训练方法的强大潜力。代码将发布。

Title: What Really Matters in Many-Shot Attacks? An Empirical Study of Long-Context Vulnerabilities in LLMs

Authors: Sangyeop Kim, Yohan Lee, Yongwoo Song, Kimin Lee
Subjects: cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2505.19773
Pdf URL: https://arxiv.org/pdf/2505.19773
Copy Paste: [[2505.19773]] What Really Matters in Many-Shot Attacks? An Empirical Study of Long-Context Vulnerabilities in LLMs(https://arxiv.org/abs/2505.19773)
Keywords: language model, llm
Abstract: We investigate long-context vulnerabilities in Large Language Models (LLMs) through Many-Shot Jailbreaking (MSJ). Our experiments utilize context length of up to 128K tokens. Through comprehensive analysis with various many-shot attack settings with different instruction styles, shot density, topic, and format, we reveal that context length is the primary factor determining attack effectiveness. Critically, we find that successful attacks do not require carefully crafted harmful content. Even repetitive shots or random dummy text can circumvent model safety measures, suggesting fundamental limitations in long-context processing capabilities of LLMs. The safety behavior of well-aligned models becomes increasingly inconsistent with longer contexts. These findings highlight significant safety gaps in context expansion capabilities of LLMs, emphasizing the need for new safety mechanisms.
摘要：我们通过许多越狱（MSJ）调查了大语言模型（LLM）中的长篇文章漏洞。我们的实验利用上下文长度长达128K令牌。通过具有不同指令样式，射击密度，主题和格式的各种射击攻击设置的全面分析，我们揭示了上下文长度是决定攻击效果的主要因素。至关重要的是，我们发现成功的攻击不需要精心制作的有害内容。即使是重复的镜头或随机虚拟文本也可以规避模型安全措施，这表明LLMS长篇文化处理能力的基本限制。良好的模型的安全行为与更长的环境变得越来越不一致。这些发现突出了LLM的上下文扩展功能中的显着安全差距，强调了对新安全机制的需求。

Title: Analyzing Political Bias in LLMs via Target-Oriented Sentiment Classification

Authors: Akram Elbouanani, Evan Dufraisse, Adrian Popescu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.19776
Pdf URL: https://arxiv.org/pdf/2505.19776
Copy Paste: [[2505.19776]] Analyzing Political Bias in LLMs via Target-Oriented Sentiment Classification(https://arxiv.org/abs/2505.19776)
Keywords: llm
Abstract: Political biases encoded by LLMs might have detrimental effects on downstream applications. Existing bias analysis methods rely on small-size intermediate tasks (questionnaire answering or political content generation) and rely on the LLMs themselves for analysis, thus propagating bias. We propose a new approach leveraging the observation that LLM sentiment predictions vary with the target entity in the same sentence. We define an entropy-based inconsistency metric to encode this prediction variability. We insert 1319 demographically and politically diverse politician names in 450 political sentences and predict target-oriented sentiment using seven models in six widely spoken languages. We observe inconsistencies in all tested combinations and aggregate them in a statistically robust analysis at different granularity levels. We observe positive and negative bias toward left and far-right politicians and positive correlations between politicians with similar alignment. Bias intensity is higher for Western languages than for others. Larger models exhibit stronger and more consistent biases and reduce discrepancies between similar languages. We partially mitigate LLM unreliability in target-oriented sentiment classification (TSC) by replacing politician names with fictional but plausible counterparts.
摘要：LLM编码的政治偏见可能会对下游应用产生不利影响。现有的偏见分析方法依赖于小型中间任务（问卷回答或政治内容生成），并依靠LLM本身进行分析，从而传播偏见。我们提出了一种新的方法，利用了以下观察，即LLM情绪预测在同一句子中与目标实体有所不同。我们定义一个基于熵的不一致度量，以编码此预测变异性。我们在450个政治句子中插入1319年的人口统计学和政治多样性的政治家名字，并使用六种广泛使用的语言中的七个模型来预测目标的情绪。我们观察到所有测试组合中的不一致，并在不同粒度水平的统计稳定分析中汇总它们。我们观察到对左派和极右翼政客的积极和负面偏见，以及相似一致性的政客之间的正相关。西方语言的偏见强度高于其他语言。较大的模型表现出更强，更一致的偏见，并减少了类似语言之间的差异。我们通过用虚构但合理的对应物替换政治家名称来部分缓解面向目标的情感分类（TSC）的LLM不可靠性。
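
Illustrative sketch (not from the paper): the abstract mentions an entropy-based inconsistency metric but does not give its formula. One plausible reading, assumed here, is the normalized Shannon entropy of the sentiment labels a model predicts for the same sentence as the target name is swapped. The function name `inconsistency` and the three-class label set are assumptions made for illustration only.

```python
import math
from collections import Counter

def inconsistency(labels, num_classes=3):
    """Normalized Shannon entropy of predicted labels for one sentence.

    `labels` holds one predicted sentiment per inserted name.
    Returns 0.0 when every name gets the same label and 1.0 when the
    labels are spread uniformly over `num_classes` (assumed: 3 classes).
    """
    counts = Counter(labels)
    n = len(labels)
    if n == 0 or len(counts) <= 1:
        return 0.0
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return entropy / math.log2(num_classes)

# Same sentence, four different names, predictions split two ways.
print(inconsistency(["positive", "positive", "negative", "negative"]))
```

A score of 0.0 would mean the prediction does not depend on the name at all; higher values indicate that swapping the name changes the predicted sentiment more often.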

Title: The Avengers: A Simple Recipe for Uniting Smaller Language Models to Challenge Proprietary Giants

Authors: Yiqun Zhang, Hao Li, Chenxu Wang, Linyao Chen, Qiaosheng Zhang, Peng Ye, Shi Feng, Daling Wang, Zhen Wang, Xinrun Wang, Jia Xu, Lei Bai, Wanli Ouyang, Shuyue Hu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.19797
Pdf URL: https://arxiv.org/pdf/2505.19797
Copy Paste: [[2505.19797]] The Avengers: A Simple Recipe for Uniting Smaller Language Models to Challenge Proprietary Giants(https://arxiv.org/abs/2505.19797)
Keywords: language model, gpt
Abstract: As proprietary giants increasingly dominate the race for ever-larger language models, a pressing question arises for the open-source community: can smaller models remain competitive across a broad range of tasks? In this paper, we present the Avengers--a simple recipe that effectively leverages the collective intelligence of open-source, smaller language models. Our framework is built upon four lightweight operations: (i) embedding: encode queries using a text embedding model; (ii) clustering: group queries based on their semantic similarity; (iii) scoring: score each model's performance within each cluster; and (iv) voting: improve outputs via repeated sampling and voting. At inference time, each query is embedded and assigned to its nearest cluster. The top-performing model(s) within that cluster are selected to generate the response using Self-Consistency or its multi-model variant. Remarkably, with 10 open-source models (~7B parameters each), the Avengers collectively outperforms GPT-4.1 on 10 out of 15 datasets (spanning mathematics, code, logic, knowledge, and affective tasks). In particular, it surpasses GPT-4.1 on mathematics tasks by 18.21% and on code tasks by 7.46%. Furthermore, the Avengers delivers superior out-of-distribution generalization, and remains robust across various embedding models, clustering algorithms, ensemble strategies, and values of its sole parameter--the number of clusters. We have open-sourced the code on GitHub: this https URL
摘要：随着专有巨头越来越多地以越来越多的语言模型统治种族，开源社区出现了一个紧迫的问题：在广泛的任务中，较小的模型能否保持竞争力？在本文中，我们介绍了复仇者联盟 - 一种简单的食谱，有效地利用了开源，较小语言模型的集体智能。我们的框架建立在四个轻巧操作的基础上：（i）使用文本嵌入模型进行嵌入：编码查询；（ii）聚类：基于其语义相似性的组查询；（iii）评分：在每个集群中分数每个模型的性能；（iv）投票：通过反复的抽样和投票提高产出。在推理时，每个查询都嵌入并分配到其最近的群集中。选择该群集中的表现最佳模型以使用自谐度或其多模型变体生成响应。值得注意的是，通过10个开源模型（每个参数〜7b参数），《复仇者联盟》在15个数据集中的10个（跨越数学，代码，逻辑，知识和情感任务）中统称超过GPT-4.1。特别是，它在数学任务上超过了GPT-4.1，而代码任务则超过了7.46％。此外，复仇者联盟提供了卓越的分布概括，并且在各种嵌入模型，聚类算法，集合策略和其唯一参数的值（群集的数量）中保持稳健。我们已经在GitHub上开源了代码：此HTTPS URL
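
Illustrative sketch (not the authors' code): the inference-time steps described in the abstract (assign the query to its nearest cluster, pick the best-scoring model(s) there, then vote over sampled answers) can be approximated as below. The embedding vectors, centroids, per-cluster scores, and model names are all hypothetical placeholders; a real system would obtain them from an embedding model and a clustering run.

```python
import math
from collections import Counter

def nearest_cluster(vec, centroids):
    """Index of the centroid closest to `vec` in Euclidean distance."""
    return min(range(len(centroids)), key=lambda i: math.dist(vec, centroids[i]))

def select_models(vec, centroids, scores, top_k=1):
    """Pick the top_k models for the query's cluster.

    scores[c][name] is a (hypothetical) validation accuracy of model
    `name` on the queries that fell into cluster c.
    """
    cluster = nearest_cluster(vec, centroids)
    ranked = sorted(scores[cluster], key=scores[cluster].get, reverse=True)
    return ranked[:top_k]

def vote(answers):
    """Majority vote over repeatedly sampled answers."""
    return Counter(answers).most_common(1)[0][0]

centroids = [(0.0, 0.0), (1.0, 1.0)]
scores = [{"model_a": 0.8, "model_b": 0.6}, {"model_a": 0.5, "model_b": 0.9}]
print(select_models((0.9, 0.8), centroids, scores))  # cluster 1, so model_b
print(vote(["42", "41", "42"]))
```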

Title: MOLE: Metadata Extraction and Validation in Scientific Papers Using LLMs

Authors: Zaid Alyafeai, Maged S. Al-Shaibani, Bernard Ghanem
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.19800
Pdf URL: https://arxiv.org/pdf/2505.19800
Copy Paste: [[2505.19800]] MOLE: Metadata Extraction and Validation in Scientific Papers Using LLMs(https://arxiv.org/abs/2505.19800)
Keywords: language model, llm
Abstract: Metadata extraction is essential for cataloging and preserving datasets, enabling effective research discovery and reproducibility, especially given the current exponential growth in scientific research. While Masader (Alyafeai et al., 2021) laid the groundwork for extracting a wide range of metadata attributes from Arabic NLP datasets' scholarly articles, it relies heavily on manual annotation. In this paper, we present MOLE, a framework that leverages Large Language Models (LLMs) to automatically extract metadata attributes from scientific papers covering datasets of languages other than Arabic. Our schema-driven methodology processes entire documents across multiple input formats and incorporates robust validation mechanisms for consistent output. Additionally, we introduce a new benchmark to evaluate the research progress on this task. Through systematic analysis of context length, few-shot learning, and web browsing integration, we demonstrate that modern LLMs show promising results in automating this task, while highlighting the need for further improvements in future work to ensure consistent and reliable performance. We release the code: this https URL and dataset: this https URL for the research community.
摘要：元数据提取对于分类和保存数据集至关重要，实现有效的研究发现和可重复性，尤其是考虑到科学研究中当前的指数增长。 Masader（Alyafeai等，2021）为从阿拉伯NLP数据集的学术文章中提取广泛的元数据属性奠定了基础，但它在很大程度上依赖手动注释。在本文中，我们介绍了mole，该框架利用大型语言模型（LLMS）自动从涵盖阿拉伯语以外的其他语言数据集的科学论文中自动提取元数据属性。我们的模式驱动的方法论对多个输入格式进行了整个文档，并结合了可靠的验证机制以保持一致的输出。此外，我们引入了一个新的基准测试，以评估这项任务的研究进度。通过对上下文长度，很少的学习和Web浏览集成的系统分析，我们证明了现代LLM在自动执行此任务方面显示出令人鼓舞的结果，突出了对未来的未来工作的需求，以确保一致和可靠的绩效。我们发布代码：此HTTPS URL和数据集：研究社区的HTTPS URL。

Title: Compliance-to-Code: Enhancing Financial Compliance Checking via Code Generation

Authors: Siyuan Li, Jian Chen, Rui Yao, Xuming Hu, Peilin Zhou, Weihua Qiu, Simin Zhang, Chucheng Dong, Zhiyao Li, Qipeng Xie, Zixuan Yuan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.19804
Pdf URL: https://arxiv.org/pdf/2505.19804
Copy Paste: [[2505.19804]] Compliance-to-Code: Enhancing Financial Compliance Checking via Code Generation(https://arxiv.org/abs/2505.19804)
Keywords: language model, llm
Abstract: Nowadays, regulatory compliance has become a cornerstone of corporate governance, ensuring adherence to systematic legal frameworks. At their core, financial regulations often comprise highly intricate provisions, layered logical structures, and numerous exceptions, which inevitably result in labor-intensive or comprehension challenges. To mitigate this, recent Regulatory Technology (RegTech) and Large Language Models (LLMs) have gained significant attention in automating the conversion of regulatory text into executable compliance logic. However, their performance remains suboptimal, particularly when applied to Chinese-language financial regulations, due to three key limitations: (1) incomplete domain-specific knowledge representation, (2) insufficient hierarchical reasoning capabilities, and (3) failure to maintain temporal and logical coherence. One promising solution is to develop domain-specific, code-oriented datasets for model training. Existing datasets such as LexGLUE, LegalBench, and CODE-ACCORD are often English-focused, domain-mismatched, or lack fine-grained granularity for compliance code generation. To fill these gaps, we present Compliance-to-Code, the first large-scale Chinese dataset dedicated to financial regulatory compliance. It covers 1,159 annotated clauses from 361 regulations across ten categories; each clause is modularly structured with four logical elements (subject, condition, constraint, and contextual information) along with regulation relations. We provide deterministic Python code mappings, detailed code reasoning, and code explanations to facilitate automated auditing. To demonstrate utility, we present FinCheck: a pipeline for regulation structuring, code generation, and report generation.
摘要：如今，法规合规性已成为公司治理的基石，确保遵守系统的法律框架。从本质上讲，金融法规通常包括高度复杂的规定，分层的逻辑结构以及许多例外，这不可避免地会带来劳动密集型或理解的挑战。为了减轻这种情况，最近的监管技术（RegTech）和大型语言模型（LLMS）在自动化监管文本转换为可执行合规逻辑方面引起了极大的关注。但是，由于三个关键局限性，特别是应用于中文财务法规时，它们的表现仍然是最佳的：（1）特定于域的知识表示形式不完整，（2）层次结构推理能力不足，以及（3）未能保持时间和逻辑相干性。一个有希望的解决方案是开发特定于域的面向代码的数据集来进行模型培训。诸如Lexglue，LegalBench和Code-Accord之类的现有数据集通常是以英语为中心的，不匹配的域名，或者缺乏合规代码生成的细粒度粒度。为了填补这些空白，我们提出了遵守代码，这是第一个专门用于财务监管合规性的中国数据集。每个子句涵盖361条法规的1,159条带有的条款，每个子句均具有四个逻辑元素 - 主题，条件，约束和上下文信息的模块化结构，以及与调节关系。我们提供确定性的Python代码映射，详细的代码推理和代码说明，以促进自动审核。为了展示实用程序，我们提出了Fincheck：调节结构，代码生成和报告生成的管道。

Title: Exploring Consciousness in LLMs: A Systematic Survey of Theories, Implementations, and Frontier Risks

Authors: Sirui Chen, Shuqin Ma, Shu Yu, Hanwang Zhang, Shengjie Zhao, Chaochao Lu
Subjects: cs.CL, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2505.19806
Pdf URL: https://arxiv.org/pdf/2505.19806
Copy Paste: [[2505.19806]] Exploring Consciousness in LLMs: A Systematic Survey of Theories, Implementations, and Frontier Risks(https://arxiv.org/abs/2505.19806)
Keywords: language model, llm
Abstract: Consciousness stands as one of the most profound and distinguishing features of the human mind, fundamentally shaping our understanding of existence and agency. As large language models (LLMs) develop at an unprecedented pace, questions concerning intelligence and consciousness have become increasingly significant. However, discourse on LLM consciousness remains largely unexplored territory. In this paper, we first clarify frequently conflated terminologies (e.g., LLM consciousness and LLM awareness). Then, we systematically organize and synthesize existing research on LLM consciousness from both theoretical and empirical perspectives. Furthermore, we highlight potential frontier risks that conscious LLMs might introduce. Finally, we discuss current challenges and outline future directions in this emerging field. The references discussed in this paper are organized at this https URL.
摘要：意识是人类思想最深刻和最杰出的特征之一，从根本上塑造了我们对存在和代理的理解。随着大型语言模型（LLM）以前所未有的速度发展，有关智能和意识的问题变得越来越重要。但是，关于LLM意识的论述在很大程度上仍未开发。在本文中，我们首先澄清了经常混合的术语（例如LLM意识和LLM意识）。然后，我们从理论和经验的角度系统地组织和综合了对LLM意识的现有研究。此外，我们强调了有意识的LLM可能引入的潜在边界风险。最后，我们讨论当前的挑战，并概述这个新兴领域的未来方向。本文讨论的参考文献是在此HTTPS URL上组织的。

Title: Deciphering Trajectory-Aided LLM Reasoning: An Optimization Perspective

Authors: Junnan Liu, Hongwei Liu, Linchen Xiao, Shudong Liu, Taolin Zhang, Zihan Ma, Songyang Zhang, Kai Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.19815
Pdf URL: https://arxiv.org/pdf/2505.19815
Copy Paste: [[2505.19815]] Deciphering Trajectory-Aided LLM Reasoning: An Optimization Perspective(https://arxiv.org/abs/2505.19815)
Keywords: language model, llm
Abstract: We propose a novel framework for comprehending the reasoning capabilities of large language models (LLMs) through the perspective of meta-learning. By conceptualizing reasoning trajectories as pseudo-gradient descent updates to the LLM's parameters, we identify parallels between LLM reasoning and various meta-learning paradigms. We formalize the training process for reasoning tasks as a meta-learning setup, with each question treated as an individual task, and reasoning trajectories serving as the inner loop optimization for adapting model parameters. Once trained on a diverse set of questions, the LLM develops fundamental reasoning capabilities that can generalize to previously unseen questions. Extensive empirical evaluations substantiate the strong connection between LLM reasoning and meta-learning, exploring several issues of significant interest from a meta-learning standpoint. Our work not only enhances the understanding of LLM reasoning but also provides practical insights for improving these models through established meta-learning techniques.
摘要：我们提出了一个新颖的框架，以通过元学习的角度理解大语言模型（LLM）的推理能力。通过将推理轨迹概念化为LLM参数的伪梯度下降更新，我们可以确定LLM推理与各种元学习范式之间的相似之处。我们将推理任务的培训过程正式为元学习设置，每个问题都视为单个任务，而推理轨迹则是用于适应模型参数的内部循环优化。一旦接受了各种问题的培训，LLM就会开发出基本的推理能力，可以推广到以前看不见的问题。广泛的经验评估证实了LLM推理与元学习之间的牢固联系，从元学习的角度探讨了几个引起关注的问题。我们的工作不仅增强了对LLM推理的理解，而且还提供了通过既定的元学习技术来改善这些模型的实用见解。

Title: FoodTaxo: Generating Food Taxonomies with Large Language Models

Authors: Pascal Wullschleger, Majid Zarharan, Donnacha Daly, Marc Pouly, Jennifer Foster
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.19838
Pdf URL: https://arxiv.org/pdf/2505.19838
Copy Paste: [[2505.19838]] FoodTaxo: Generating Food Taxonomies with Large Language Models(https://arxiv.org/abs/2505.19838)
Keywords: language model, llm, prompt
Abstract: We investigate the utility of Large Language Models for automated taxonomy generation and completion specifically applied to taxonomies from the food technology industry. We explore the extent to which taxonomies can be completed from a seed taxonomy or generated without a seed from a set of known concepts, in an iterative fashion using recent prompting techniques. Experiments on five taxonomies using an open-source LLM (Llama-3), while promising, point to the difficulty of correctly placing inner nodes.
摘要：我们调查了大型语言模型的自动分类生成和完成的实用性，专门用于食品技术行业的分类法。我们探讨了从种子分类法可以完成分类法或使用最新提示技术以迭代方式从一组已知概念中产生的分类法完成的程度。使用开源LLM（LLAMA-3）对五个分类法进行实验，但有望表明难以正确放置内部节点。

Title: Improving Multilingual Math Reasoning for African Languages

Authors: Odunayo Ogundepo, Akintunde Oladipo, Kelechi Ogueji, Esther Adenuga, David Ifeoluwa Adelani, Jimmy Lin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.19848
Pdf URL: https://arxiv.org/pdf/2505.19848
Copy Paste: [[2505.19848]] Improving Multilingual Math Reasoning for African Languages(https://arxiv.org/abs/2505.19848)
Keywords: language model, llm
Abstract: Researchers working on low-resource languages face persistent challenges due to limited data availability and restricted access to computational resources. Although most large language models (LLMs) are predominantly trained in high-resource languages, adapting them to low-resource contexts, particularly African languages, requires specialized techniques. Several strategies have emerged for adapting models to low-resource languages in today's LLM landscape, defined by multi-stage pre-training and post-training paradigms. However, the most effective approaches remain uncertain. This work systematically investigates which adaptation strategies yield the best performance when extending existing LLMs to African languages. We conduct extensive experiments and ablation studies to evaluate different combinations of data types (translated versus synthetically generated), training stages (pre-training versus post-training), and other model adaptation configurations. Our experiments focus on mathematical reasoning tasks, using the Llama 3.1 model family as our base model.
摘要：由于数据可用性有限和限制对计算资源的访问，从事低资源语言的研究人员面临持续的挑战。尽管大多数大型语言模型（LLMS）主要接受了高农产品语言的培训，使它们适应低资源的环境，尤其是非洲语言，需要专门的技术。在当今LLM景观中，已经出现了几种策略，以使模型适应低资源语言，这是由多阶段训练前和训练后范式定义的。但是，最有效的方法仍然不确定。这项工作系统地研究了哪些适应策略在将现有LLM扩展到非洲语言时会产生最佳性能。我们进行了广泛的实验和消融研究，以评估数据类型的不同组合（翻译与合成生成），训练阶段（训练前与训练后训练）以及其他模型适应配置。我们的实验侧重于数学推理任务，使用Llama 3.1模型家族作为我们的基本模型。

Title: Beyond Specialization: Benchmarking LLMs for Transliteration of Indian Languages

Authors: Gulfarogh Azam, Mohd Sadique, Saif Ali, Mohammad Nadeem, Erik Cambria, Shahab Saquib Sohail, Mohammad Sultan Alam
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.19851
Pdf URL: https://arxiv.org/pdf/2505.19851
Copy Paste: [[2505.19851]] Beyond Specialization: Benchmarking LLMs for Transliteration of Indian Languages(https://arxiv.org/abs/2505.19851)
Keywords: language model, gpt, llm
Abstract: Transliteration, the process of mapping text from one script to another, plays a crucial role in multilingual natural language processing, especially within linguistically diverse contexts such as India. Despite significant advancements through specialized models like IndicXlit, recent developments in large language models suggest a potential for general-purpose models to excel at this task without explicit task-specific training. The current work systematically evaluates the performance of prominent LLMs, including GPT-4o, GPT-4.5, GPT-4.1, Gemma-3-27B-it, and Mistral-Large against IndicXlit, a state-of-the-art transliteration model, across ten major Indian languages. Experiments utilized standard benchmarks, including Dakshina and Aksharantar datasets, with performance assessed via Top-1 Accuracy and Character Error Rate. Our findings reveal that GPT family models generally outperform other LLMs and IndicXlit in most instances. Additionally, fine-tuning GPT-4o notably improves performance on specific languages. An extensive error analysis and robustness testing under noisy conditions further elucidate strengths of LLMs compared to specialized models, highlighting the efficacy of foundational models for a wide spectrum of specialized applications with minimal overhead.
摘要：音译是将文本从一个脚本映射到另一个脚本的过程，在多语言自然语言处理中起着至关重要的作用，尤其是在语言上不同的环境中，例如印度。尽管通过IndionXlit等专业模型取得了重大进步，但大型语言模型的最新发展表明，通用模型在没有明确特定于任务的培训的情况下可以在此任务中脱颖而出。当前的工作系统地评估了突出的LLM的性能，包括GPT-4O，GPT-4.5，GPT-4.1，GEMMA-3-27B-IT和MISTRAL-RARGE，以及跨十种主要印度语言的最先进的音译模型IndiNxlit，这是一种最先进的音译模型。实验利用了包括Dakshina和Aksharantar数据集在内的标准基准测试，并通过TOP-1准确性和字符错误率评估了性能。我们的发现表明，尽管GPT家庭模型在大多数情况下通常都优于其他LLM和IndiNxlit。此外，微调GPT-4O尤其提高了特定语言的性能。与专业模型相比，在嘈杂条件下进行了广泛的误差分析和鲁棒性测试，进一步阐明了LLM的强度，这突出了基础模型对广泛的专业应用的功效，其开销最少。
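
Illustrative sketch (not from the paper): Top-1 Accuracy and Character Error Rate, the two metrics named in the abstract, are commonly computed as exact-match rate and as Levenshtein edit distance divided by reference length. The version below assumes corpus-level CER and counts Unicode code points; for Indic scripts an evaluation might instead count grapheme clusters, which this sketch does not do.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution or match
        prev = curr
    return prev[-1]

def character_error_rate(references, hypotheses):
    """Total edits divided by total reference characters (corpus-level CER)."""
    edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    chars = sum(len(r) for r in references)
    return edits / chars

def top1_accuracy(references, hypotheses):
    """Fraction of hypotheses that match their reference exactly."""
    return sum(r == h for r, h in zip(references, hypotheses)) / len(references)

refs = ["namaste", "dilli"]
hyps = ["namaste", "delhi"]
print(top1_accuracy(refs, hyps), character_error_rate(refs, hyps))
```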

Title: APE: A Data-Centric Benchmark for Efficient LLM Adaptation in Text Summarization

Authors: Javier Marín
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.19912
Pdf URL: https://arxiv.org/pdf/2505.19912
Copy Paste: [[2505.19912]] APE: A Data-Centric Benchmark for Efficient LLM Adaptation in Text Summarization(https://arxiv.org/abs/2505.19912)
Keywords: language model, llm
Abstract: We present Adjacent Possible Exploration (APE), a simple yet effective method for adapting large language models to specific tasks using minimal computational resources. Unlike traditional fine-tuning that requires extensive compute, APE iteratively fine-tunes models on small, carefully selected data batches (200 examples), retaining only improvements. On news summarization, APE achieves 40 percent BLEU improvement using just a T4 GPU in 60 minutes, matching or exceeding more complex methods like LoRA while remaining conceptually simple. Our approach is particularly valuable for researchers and practitioners with limited computational resources. We provide open-source code and demonstrate APE's effectiveness through both automatic metrics and human evaluation. While inspired by evolutionary theory's "adjacent possible", APE's core insight has a very practical application: small, iterative data perturbations can efficiently guide LLMs toward task-specific performance without expensive retraining.
摘要：我们提出了相邻的可能探索（APE），这是一种简单而有效的方法，用于使用最小的计算资源调整大型语言模型。与需要大量计算的传统微调不同，小型，精心选择的数据批次（200个示例）上的猿迭代微调模型，仅保留改进。在新闻摘要中，APE仅在60分钟内仅使用T4 GPU实现40％的BLEU改进，匹配或超过更复杂的方法，例如LORA，同时在概念上保持简单。我们的方法对于具有有限的计算资源的研究人员和从业人员特别有价值。我们提供开源代码，并通过自动指标和人类评估来展示猿的有效性。尽管受到进化论的“邻近可能”的启发，APE的核心洞察力具有非常实际的应用：较小的，迭代的数据扰动可以有效地指导LLM在特定于任务的性能的情况下而无需昂贵的再培训。
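
Illustrative sketch (not the authors' code): the abstract describes iteratively fine-tuning on small batches and retaining only improvements, which amounts to a greedy accept/reject loop. In the sketch below, `fine_tune` and `evaluate` are hypothetical callables supplied by the caller, and the toy "model" is just a number so the control flow can be run end to end.

```python
def accept_if_better_loop(model, batches, fine_tune, evaluate):
    """Greedy loop: try one small batch at a time, keep the update only if
    the evaluation score improves; otherwise discard it."""
    best_score = evaluate(model)
    for batch in batches:
        candidate = fine_tune(model, batch)
        score = evaluate(candidate)
        if score > best_score:
            model, best_score = candidate, score
    return model, best_score

# Toy stand-ins: the "model" is a float, a "batch" nudges it, and the score
# is higher the closer the model is to a target value of 10.
toy_fine_tune = lambda model, batch: model + batch
toy_evaluate = lambda model: -abs(model - 10)

final_model, final_score = accept_if_better_loop(
    0.0, [3, -2, 4, 5, -1, 3], toy_fine_tune, toy_evaluate
)
print(final_model, final_score)
```

In the toy run, the batches -2 and the final 3 are rejected because they move the score in the wrong direction; all other batches are kept.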

Title: Enigmata: Scaling Logical Reasoning in Large Language Models with Synthetic Verifiable Puzzles

Authors: Jiangjie Chen, Qianyu He, Siyu Yuan, Aili Chen, Zhicheng Cai, Weinan Dai, Hongli Yu, Qiying Yu, Xuefeng Li, Jiaze Chen, Hao Zhou, Mingxuan Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.19914
Pdf URL: https://arxiv.org/pdf/2505.19914
Copy Paste: [[2505.19914]] Enigmata: Scaling Logical Reasoning in Large Language Models with Synthetic Verifiable Puzzles(https://arxiv.org/abs/2505.19914)
Keywords: language model, llm
Abstract: Large Language Models (LLMs), such as OpenAI's o1 and DeepSeek's R1, excel at advanced reasoning tasks like math and coding via Reinforcement Learning with Verifiable Rewards (RLVR), but still struggle with puzzles solvable by humans without domain knowledge. We introduce Enigmata, the first comprehensive suite tailored for improving LLMs with puzzle reasoning skills. It includes 36 tasks across seven categories, each with 1) a generator that produces unlimited examples with controllable difficulty and 2) a rule-based verifier for automatic evaluation. This generator-verifier design supports scalable, multi-task RL training, fine-grained analysis, and seamless RLVR integration. We further propose Enigmata-Eval, a rigorous benchmark, and develop optimized multi-task RLVR strategies. Our trained model, Qwen2.5-32B-Enigmata, consistently surpasses o3-mini-high and o1 on the puzzle reasoning benchmarks like Enigmata-Eval, ARC-AGI (32.8%), and ARC-AGI 2 (0.6%). It also generalizes well to out-of-domain puzzle benchmarks and mathematical reasoning, with little multi-tasking trade-off. When trained on larger models like Seed1.5-Thinking (20B activated parameters and 200B total parameters), puzzle data from Enigmata further boosts SoTA performance on advanced math and STEM reasoning tasks such as AIME (2024-2025), BeyondAIME and GPQA (Diamond), showing nice generalization benefits of Enigmata. This work offers a unified, controllable framework for advancing logical reasoning in LLMs. Resources of this work can be found at this https URL.
摘要：大型语言模型（LLM），例如OpenAI的O1和DeepSeek的R1，擅长于具有可验证的奖励（RLVR）的数学和编码等先进的推理任务（RLVR），但仍然在没有领域知识的人类可以解决的难题中挣扎。我们介绍了Enigmata，这是第一个量身定制的全面套件，该套件旨在通过拼图推理技能提高LLM。它包括七个类别的36个任务，每个任务都有1）一个生成具有可控难度的无限示例的发电机，以及2）基于规则的验证器用于自动评估。该发电机设计设计支持可扩展的多任务RL训练，细粒度分析和无缝RLVR集成。我们进一步提出了Enigmata-eval，这是一种严格的基准，并制定了优化的多任务RLVR策略。我们训练有素的模型QWEN2.5-32B-ENIGMATA在拼图推理基准等基准（如Enigmata-eval，Arc-Agi（32.8％）和Arc-Agi 2（0.6％）等拼图推理基准上，始终超过O3-Mini-High和O1。它还可以很好地推广到跨域拼图基准和数学推理，而多任务取舍。当对诸如SEED1.5思考（20B激活参数和200B总参数）之类的较大模型进行培训时，来自Enigmata的拼图数据进一步提高了高级数学和STEM推理任务的SOTA性能，例如Aime（2024-2025），例如Aneraime和GPQA（Diamond）（钻石（Diamond）），显示了Enigmata的好处。这项工作提供了一个统一，可控的框架，用于推进LLM中的逻辑推理。这项工作的资源可以在此HTTPS URL上找到。
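
Illustrative sketch (not one of the 36 Enigmata tasks): the generator-verifier design in the abstract pairs a generator with controllable difficulty with a rule-based verifier. The toy puzzle below, "find two positions whose numbers sum to the target", is invented purely to show that pairing; `generate` and `verify` are hypothetical names.

```python
import random

def generate(difficulty, seed):
    """Produce a puzzle; a higher difficulty gives a longer list of numbers."""
    rng = random.Random(seed)
    numbers = rng.sample(range(1, 10 * difficulty + 10), difficulty + 2)
    i, j = rng.sample(range(len(numbers)), 2)
    return {"numbers": numbers, "target": numbers[i] + numbers[j]}

def verify(puzzle, answer):
    """Rule-based check: `answer` must be two distinct positions whose
    numbers sum to the target. Malformed answers are simply wrong."""
    try:
        i, j = answer
        nums = puzzle["numbers"]
        return i != j and nums[i] + nums[j] == puzzle["target"]
    except (TypeError, ValueError, IndexError):
        return False

puzzle = generate(difficulty=3, seed=0)
# The boolean result can serve directly as a 0/1 reward signal.
reward = int(verify(puzzle, (0, 1)))
print(puzzle, reward)
```

Because the verifier is deterministic and cheap, the same pair of functions can produce unlimited training examples and score model outputs automatically.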

Title: ALAS: Measuring Latent Speech-Text Alignment For Spoken Language Understanding In Multimodal LLMs

Authors: Pooneh Mousavi, Yingzhi Wang, Mirco Ravanelli, Cem Subakan
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2505.19937
Pdf URL: https://arxiv.org/pdf/2505.19937
Copy Paste: [[2505.19937]] ALAS: Measuring Latent Speech-Text Alignment For Spoken Language Understanding In Multimodal LLMs(https://arxiv.org/abs/2505.19937)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) are widely used in Spoken Language Understanding (SLU). Recent SLU models process audio directly by adapting speech input into LLMs for better multimodal learning. A key consideration for these models is the cross-modal alignment between text and audio modalities, which is a telltale sign as to whether an LLM is able to associate semantic meaning with audio segments. While various methods exist for fusing these modalities, there is no standard metric to evaluate alignment quality in LLMs. In this work, we propose a new metric, ALAS (Automatic Latent Alignment Score). Our study examines the correlation between audio and text representations across transformer layers, for two different tasks (Spoken Question Answering and Emotion Recognition). We show that our metric behaves as expected across different layers and different tasks.
摘要：大型语言模型（LLM）被广泛用于口语理解（SLU）。最近的SLU模型直接通过将语音输入调整为LLM来处理音频，以获得更好的多模式学习。这些模型的关键考虑因素是文本和音频方式之间的跨模式对齐，这是LLM是否能够将语义含义与音频段相关联。尽管存在融合这些方式的各种方法，但尚无标准指标来评估LLM中的一致性质量。在这项工作中，我们提出了一个新的指标（自动潜在对齐评分）。我们的研究研究了两个不同的任务（口语答案和情感识别），研究了音频与文本表示之间的相关性。我们展示了我们的公制行为在不同层和不同的任务上的预期。

Title: MiniLongBench: The Low-cost Long Context Understanding Benchmark for Large Language Models

Authors: Zhongzhan Huang, Guoming Ling, Shanshan Zhong, Hefeng Wu, Liang Lin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.19959
Pdf URL: https://arxiv.org/pdf/2505.19959
Copy Paste: [[2505.19959]] MiniLongBench: The Low-cost Long Context Understanding Benchmark for Large Language Models(https://arxiv.org/abs/2505.19959)
Keywords: language model, llm, long context
Abstract: Long Context Understanding (LCU) is a critical area for exploration in current large language models (LLMs). However, due to the inherently lengthy nature of long-text data, existing LCU benchmarks for LLMs often result in prohibitively high evaluation costs, such as testing time and inference expenses. Through extensive experimentation, we discover that existing LCU benchmarks exhibit significant redundancy, which makes evaluation inefficient. In this paper, we propose a concise data compression method tailored for long-text data with sparse information characteristics. By pruning the well-known LCU benchmark LongBench, we create MiniLongBench. This benchmark includes only 237 test samples across six major task categories and 21 distinct tasks. Through empirical analysis of over 60 LLMs, MiniLongBench achieves an average evaluation cost reduced to only 4.5% of the original while maintaining an average rank correlation coefficient of 0.97 with LongBench results. Therefore, our MiniLongBench, as a low-cost benchmark, holds great potential to substantially drive future research into the LCU capabilities of LLMs. See this https URL for our code, data and tutorial.
摘要：长篇小说理解（LCU）是当前大语言模型（LLMS）中探索的关键领域。但是，由于长文化数据的固有性质固有的性质，LLMS的现有LCU基准通常会导致高度高的评估成本，例如测试时间和推理费用。通过广泛的实验，我们发现现有的LCU基准表现出显着的冗余，这意味着评估的效率低下。在本文中，我们提出了一种针对具有稀疏信息特征的长文本数据量身定制的简洁数据压缩方法。通过修剪众所周知的LCU基准Longbench，我们创建了Minilongbench。该基准仅包括六个主要任务类别和21个不同任务的237个测试样本。通过对超过60个LLM的经验分析，Minilongbench的平均评估成本仅降至原始的4.5％，同时保持平均等级相关系数为0.97，并获得Longbench的结果。因此，作为低成本基准，我们的Minilongbench具有巨大的潜力，可以实质上推动未来对LLMS LCU功能的研究。有关我们的代码，数据和教程，请参见此HTTPS URL。
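
Illustrative sketch (not from the paper): the abstract reports a rank correlation coefficient between MiniLongBench and LongBench results without naming the coefficient. Assuming Spearman's rho, agreement between two model rankings can be computed as the Pearson correlation of the rank vectors. The model scores below are made-up numbers.

```python
def ranks(values):
    """1-based ranks, with ties sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on ranks."""
    return pearson(ranks(x), ranks(y))

full_scores = [41.2, 55.0, 38.7, 60.3]  # hypothetical scores on the full benchmark
mini_scores = [40.0, 59.1, 39.5, 58.0]  # hypothetical scores on the pruned benchmark
print(spearman(full_scores, mini_scores))
```

A value close to 1 would mean the pruned benchmark orders the models almost exactly as the full one does, which is the property the abstract claims for MiniLongBench.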

Title: CP-Router: An Uncertainty-Aware Router Between LLM and LRM

Authors: Jiayuan Su, Fulin Lin, Zhaopeng Feng, Han Zheng, Teng Wang, Zhenyu Xiao, Xinlong Zhao, Zuozhu Liu, Lu Cheng, Hongwei Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.19970
Pdf URL: https://arxiv.org/pdf/2505.19970
Copy Paste: [[2505.19970]] CP-Router: An Uncertainty-Aware Router Between LLM and LRM(https://arxiv.org/abs/2505.19970)
Keywords: language model, llm, prompt
Abstract: Recent advances in Large Reasoning Models (LRMs) have significantly improved long-chain reasoning capabilities over Large Language Models (LLMs). However, LRMs often produce unnecessarily lengthy outputs even for simple queries, leading to inefficiencies or even accuracy degradation compared to LLMs. To overcome this, we propose CP-Router, a training-free and model-agnostic routing framework that dynamically selects between an LLM and an LRM, demonstrated with multiple-choice question answering (MCQA) prompts. The routing decision is guided by the prediction uncertainty estimates derived via Conformal Prediction (CP), which provides rigorous coverage guarantees. To further refine the uncertainty differentiation across inputs, we introduce Full and Binary Entropy (FBE), a novel entropy-based criterion that adaptively selects the appropriate CP threshold. Experiments across diverse MCQA benchmarks, including mathematics, logical reasoning, and Chinese chemistry, demonstrate that CP-Router efficiently reduces token usage while maintaining or even improving accuracy compared to using LRM alone. We also extend CP-Router to diverse model pairings and open-ended QA, where it continues to demonstrate strong performance, validating its generality and robustness.
摘要：大型推理模型（LRMS）的最新进展已大大提高了大语言模型（LLM）的长链推理能力。但是，与LLM相比，LRMS通常甚至在简单的查询中产生冗长的输出，从而导致效率低下甚至准确性降解。为了克服这一点，我们提出了CP-ROUTER，这是一个无训练和模型的路由框架，在LLM和LRM之间动态选择，并用多项选择的问题回答（MCQA）提示进行了演示。路由决策以通过共形预测（CP）得出的预测不确定性估计为指导，该预测提供了严格的保证。为了进一步完善跨输入之间的不确定性差异，我们引入了完整的二元熵（FBE），这是一种基于熵的新标准，可自适应选择适当的CP阈值。跨多种MCQA基准（包括数学，逻辑推理和中国化学）的实验表明，与单独使用LRM相比，CP-ROUTER可以有效地降低令牌使用情况，同时维持甚至提高准确性。我们还将CP-ROUTER扩展到不同的模型配对和开放式质量质量QA，在那里它继续证明了强劲的性能，从而验证了其一般性和稳健性。
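
Illustrative sketch (not the authors' code): a routing rule driven by conformal prediction can be approximated with standard split conformal prediction on multiple-choice probabilities. The sketch assumes the nonconformity score 1 - p(true option) and assumes that a single-option prediction set means "confident, keep the LLM" while any other set size escalates to the LRM. The FBE threshold selection mentioned in the abstract is not reproduced here, and all names and numbers are hypothetical.

```python
import math

def conformal_threshold(cal_probs, cal_labels, alpha):
    """Split conformal quantile of the scores 1 - p(true label)."""
    scores = sorted(1.0 - p[y] for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return 1.0  # too little calibration data: every option is included
    return scores[k - 1]

def prediction_set(probs, qhat):
    """Options whose nonconformity score does not exceed the threshold."""
    return [label for label, p in probs.items() if 1.0 - p <= qhat]

def route_query(probs, qhat):
    """Assumed rule: a singleton set stays with the LLM, otherwise use the LRM."""
    return "LLM" if len(prediction_set(probs, qhat)) == 1 else "LRM"

cal_probs = [{"A": 0.9, "B": 0.1}, {"A": 0.3, "B": 0.7},
             {"A": 0.4, "B": 0.6}, {"A": 0.3, "B": 0.7}]
cal_labels = ["A", "B", "A", "A"]
qhat = conformal_threshold(cal_probs, cal_labels, alpha=0.5)

print(route_query({"A": 0.9, "B": 0.06, "C": 0.04}, qhat))  # one option survives
print(route_query({"A": 0.5, "B": 0.45, "C": 0.05}, qhat))  # two options survive
```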

Title: Conversational Lexicography: Querying Lexicographic Data on Knowledge Graphs with SPARQL through Natural Language

Authors: Kilian Sennrich, Sina Ahmadi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.19971
Pdf URL: https://arxiv.org/pdf/2505.19971
Copy Paste: [[2505.19971]] Conversational Lexicography: Querying Lexicographic Data on Knowledge Graphs with SPARQL through Natural Language(https://arxiv.org/abs/2505.19971)
Keywords: gpt
Abstract: Knowledge graphs offer an excellent solution for representing the lexical-semantic structures of lexicographic data. However, working with the SPARQL query language represents a considerable hurdle for many non-expert users who could benefit from the advantages of this technology. This paper addresses the challenge of creating natural language interfaces for lexicographic data retrieval on knowledge graphs such as Wikidata. We develop a multidimensional taxonomy capturing the complexity of Wikidata's lexicographic data ontology module through four dimensions and create a template-based dataset with over 1.2 million mappings from natural language utterances to SPARQL queries. Our experiments with GPT-2 (124M), Phi-1.5 (1.3B), and GPT-3.5-Turbo reveal significant differences in model capabilities. While all models perform well on familiar patterns, only GPT-3.5-Turbo demonstrates meaningful generalization capabilities, suggesting that model size and diverse pre-training are crucial for adaptability in this domain. However, significant challenges remain in achieving robust generalization, handling diverse linguistic data, and developing scalable solutions that can accommodate the full complexity of lexicographic knowledge representation.
摘要：知识图为代表词法数据的词汇语义结构提供了出色的解决方案。但是，使用SPARQL查询语言，对于许多可以从该技术的优势中受益的非专家用户来说，这是一个巨大的障碍。本文应对在Wikidata等知识图上创建自然语言界面的挑战。我们开发了一个多维分类法，通过四个维度捕获了Wikidata词典数据本体模块的复杂性，并创建一个基于模板的数据集，其中包含超过120万个从自然语言话语到SPARQL查询的映射。我们对GPT-2（124M），PHI-1.5（1.3B）和GPT-3.5-Turbo进行的实验揭示了模型能力的显着差异。尽管所有模型在熟悉的模式上都表现良好，但GPT-3.5-Turbo表现出有意义的概括能力，这表明模型大小和多样化的预训练对于该域中的适应性至关重要。但是，在实现强大的概括，处理多种语言数据以及开发可扩展解决方案的情况下仍然存在重大挑战，以适应词典知识表示的全部复杂性。

Title: DeepDialogue: A Multi-Turn Emotionally-Rich Spoken Dialogue Dataset

Authors: Alkis Koudounas, Moreno La Quatra, Elena Baralis
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2505.19978
Pdf URL: https://arxiv.org/pdf/2505.19978
Copy Paste: [[2505.19978]] DeepDialogue: A Multi-Turn Emotionally-Rich Spoken Dialogue Dataset(https://arxiv.org/abs/2505.19978)
Keywords: language model, llm
Abstract: Recent advances in conversational AI have demonstrated impressive capabilities in single-turn responses, yet multi-turn dialogues remain challenging for even the most sophisticated language models. Current dialogue datasets are limited in emotional range, domain diversity, and turn depth, and are predominantly text-only, hindering progress in developing more human-like conversational systems across modalities. To address these limitations, we present DeepDialogue, a large-scale multimodal dataset containing 40,150 high-quality multi-turn dialogues spanning 41 domains and incorporating 20 distinct emotions with coherent emotional progressions. Our approach pairs 9 different language models (4B-72B parameters) to generate 65,600 initial conversations, which we then evaluate through a combination of human annotation and LLM-based quality filtering. The resulting dataset reveals fundamental insights: smaller models fail to maintain coherence beyond 6 dialogue turns; concrete domains (e.g., "cars," "travel") yield more meaningful conversations than abstract ones (e.g., "philosophy"); and cross-model interactions produce more coherent dialogues than same-model conversations. A key contribution of DeepDialogue is its speech component, where we synthesize emotion-consistent voices for all 40,150 dialogues, creating the first large-scale open-source multimodal dialogue dataset that faithfully preserves emotional context across multi-turn conversations.
摘要：对话AI的最新进展在单转响应中表现出了令人印象深刻的功能，但是对于最复杂的语言模型来说，多转向对话仍然具有挑战性。当前的对话数据集在其情感范围内受到限制，领域的多样性，深度，并且主要仅是文本，这阻碍了跨模式跨越更类似人类的对话系统的进展。为了解决这些局限性，我们提出了DeepDialogue，这是一个大规模的多模式数据集，其中包含40,150个高质量的多转向对话，涉及41个域，并将20种不同的情绪与连贯的情感发展结合在一起。我们的方法对9种不同的语言模型（4B-72B参数）进行了65,600个初始对话，然后我们通过人类注释和基于LLM的质量过滤的组合进行评估。最终的数据集揭示了基本见解：较小的模型无法保持连贯性超过6个对话转弯；混凝土域（例如，“汽车”，“旅行”）比抽象的对话（例如“哲学”）产生更多的有意义的对话；与同一模型对话相比，跨模型的相互作用产生的对话更多。 DeepDialogue的一个关键贡献是其语音组成部分，我们在所有40,150个对话中综合了情感一致的声音，创建了第一个大型开放源代码多模式对话数据集，该数据集忠实地保留了在多转向对话中的情感上下文。

Title: How Well Do Large Reasoning Models Translate? A Comprehensive Evaluation for Multi-Domain Machine Translation

Authors: Yongshi Ye, Biao Fu, Chongxuan Huang, Yidong Chen, Xiaodong Shi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.19987
Pdf URL: https://arxiv.org/pdf/2505.19987
Copy Paste: [[2505.19987]] How Well Do Large Reasoning Models Translate? A Comprehensive Evaluation for Multi-Domain Machine Translation(https://arxiv.org/abs/2505.19987)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) have demonstrated strong performance in general-purpose machine translation, but their effectiveness in complex, domain-sensitive translation tasks remains underexplored. Recent advancements in Large Reasoning Models (LRMs) raise the question of whether structured reasoning can enhance translation quality across diverse domains. In this work, we compare the performance of LRMs with traditional LLMs across 15 representative domains and four translation directions. Our evaluation considers various factors, including task difficulty, input length, and terminology density. We use a combination of automatic metrics and an enhanced MQM-based evaluation hierarchy to assess translation quality. Our findings show that LRMs consistently outperform traditional LLMs in semantically complex domains, especially in long-text and high-difficulty translation scenarios. Moreover, domain-adaptive prompting strategies further improve performance by better leveraging the reasoning capabilities of LRMs. These results highlight the potential of structured reasoning in multi-domain machine translation (MDMT) tasks and provide valuable insights for optimizing translation systems in domain-sensitive contexts.

Title: WebCoT: Enhancing Web Agent Reasoning by Reconstructing Chain-of-Thought in Reflection, Branching, and Rollback

Authors: Minda Hu, Tianqing Fang, Jianshu Zhang, Junyu Ma, Zhisong Zhang, Jingyan Zhou, Hongming Zhang, Haitao Mi, Dong Yu, Irwin King
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.20013
Pdf URL: https://arxiv.org/pdf/2505.20013
Copy Paste: [[2505.20013]] WebCoT: Enhancing Web Agent Reasoning by Reconstructing Chain-of-Thought in Reflection, Branching, and Rollback(https://arxiv.org/abs/2505.20013)
Keywords: language model, llm, chain-of-thought, agent
Abstract: Web agents powered by Large Language Models (LLMs) show promise for next-generation AI, but their limited reasoning in uncertain, dynamic web environments hinders robust deployment. In this paper, we identify key reasoning skills essential for effective web agents, i.e., reflection & lookahead, branching, and rollback, and curate trajectory data that exemplifies these abilities by reconstructing the agent's (inference-time) reasoning algorithms into chain-of-thought rationales. We conduct experiments in the agent self-improving benchmark, OpenWebVoyager, and demonstrate that distilling salient reasoning patterns into the backbone LLM via simple fine-tuning can substantially enhance its performance. Our approach yields significant improvements across multiple benchmarks, including WebVoyager, Mind2web-live, and SimpleQA (web search), highlighting the potential of targeted reasoning skill enhancement for web agents.

Title: Does Rationale Quality Matter? Enhancing Mental Disorder Detection via Selective Reasoning Distillation

Authors: Hoyun Song, Huije Lee, Jisu Shin, Sukmin Cho, Changgeon Ko, Jong C. Park
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.20014
Pdf URL: https://arxiv.org/pdf/2505.20014
Copy Paste: [[2505.20014]] Does Rationale Quality Matter? Enhancing Mental Disorder Detection via Selective Reasoning Distillation(https://arxiv.org/abs/2505.20014)
Keywords: language model, llm
Abstract: The detection of mental health problems from social media and the interpretation of these results have been extensively explored. Research has shown that incorporating clinical symptom information into a model enhances domain expertise, improving its detection and interpretation performance. While large language models (LLMs) are shown to be effective for generating explanatory rationales in mental health detection, their substantially large parameter size and high computational cost limit their practicality. Reasoning distillation transfers this ability to smaller language models (SLMs), but inconsistencies in the relevance and domain alignment of LLM-generated rationales pose a challenge. This paper investigates how rationale quality impacts SLM performance in mental health detection and explanation generation. We hypothesize that ensuring high-quality and domain-relevant rationales enhances the distillation. To this end, we propose a framework that selects rationales based on their alignment with expert clinical reasoning. Experiments show that our quality-focused approach significantly enhances SLM performance in both mental disorder detection and rationale generation. This work highlights the importance of rationale quality and offers an insightful framework for knowledge transfer in mental health applications.

Title: TTPA: Token-level Tool-use Preference Alignment Training Framework with Fine-grained Evaluation

Authors: Chengrui Huang, Shen Gao, Zhengliang Shi, Dongsheng Wang, Shuo Shang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.20016
Pdf URL: https://arxiv.org/pdf/2505.20016
Copy Paste: [[2505.20016]] TTPA: Token-level Tool-use Preference Alignment Training Framework with Fine-grained Evaluation(https://arxiv.org/abs/2505.20016)
Keywords: llm
Abstract: Existing tool-learning methods usually rely on supervised fine-tuning and often overlook fine-grained optimization of internal tool call details, leading to limitations in preference alignment and error discrimination. To overcome these challenges, we propose the Token-level Tool-use Preference Alignment Training Framework (TTPA), a training paradigm for constructing token-level tool-use preference datasets that align LLMs with fine-grained preferences using a novel error-oriented scoring mechanism. TTPA first introduces reversed dataset construction, a method for creating high-quality, multi-turn tool-use datasets by reversing the generation flow. Additionally, we propose Token-level Preference Sampling (TPS) to capture fine-grained preferences by modeling token-level differences during generation. To address biases in scoring, we introduce the Error-oriented Scoring Mechanism (ESM), which quantifies tool-call errors and can be used as a training signal. Extensive experiments on three diverse benchmark datasets demonstrate that TTPA significantly improves tool-using performance while showing strong generalization ability across models and datasets.

Title: Training LLM-Based Agents with Synthetic Self-Reflected Trajectories and Partial Masking

Authors: Yihan Chen, Benfeng Xu, Xiaorui Wang, Yongdong Zhang, Zhendong Mao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.20023
Pdf URL: https://arxiv.org/pdf/2505.20023
Copy Paste: [[2505.20023]] Training LLM-Based Agents with Synthetic Self-Reflected Trajectories and Partial Masking(https://arxiv.org/abs/2505.20023)
Keywords: language model, gpt, llm, prompt, chat, agent
Abstract: Autonomous agents, which perceive environments and take actions to achieve goals, have become increasingly feasible with the advancements in large language models (LLMs). However, current powerful agents often depend on sophisticated prompt engineering combined with closed-source LLMs like GPT-4. Although training open-source LLMs using expert trajectories from teacher models has yielded some improvements in agent capabilities, this approach still faces limitations such as performance plateauing and error propagation. To mitigate these challenges, we propose STeP, a novel method for improving LLM-based agent training. We synthesize self-reflected trajectories that include reflections and corrections of error steps, which enhance the effectiveness of LLM agents in learning from teacher models, enabling them to become agents capable of self-reflecting and correcting. We also introduce a partial masking strategy that prevents the LLM from internalizing incorrect or suboptimal steps. Experiments demonstrate that our method improves agent performance across three representative tasks: ALFWorld, WebShop, and SciWorld. For the open-source model LLaMA2-7B-Chat, when trained using self-reflected trajectories constructed with Qwen1.5-110B-Chat as the teacher model, it achieves comprehensive improvements with less training data compared to agents trained exclusively on expert trajectories.
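The partial masking idea in the abstract above can be illustrated with a minimal sketch. The abstract does not specify the loss, so this assumes one plausible reading: per-token training losses are averaged only over positions that are not marked as incorrect or suboptimal. The function name `masked_mean_loss`, the mask layout, and the numbers are hypothetical.

```python
from typing import Sequence


def masked_mean_loss(token_losses: Sequence[float], keep_mask: Sequence[bool]) -> float:
    """Average per-token losses over the positions where keep_mask is True.

    Assumption: positions belonging to erroneous steps carry keep_mask=False,
    so they contribute nothing to the training signal.
    """
    if len(token_losses) != len(keep_mask):
        raise ValueError("token_losses and keep_mask must have the same length")
    kept = [loss for loss, keep in zip(token_losses, keep_mask) if keep]
    if not kept:
        return 0.0  # nothing to learn from if every position is masked
    return sum(kept) / len(kept)


# Hypothetical trajectory: the second position belongs to an erroneous step.
token_losses = [0.2, 1.5, 0.4, 0.3]
keep_mask = [True, False, True, True]
loss = masked_mean_loss(token_losses, keep_mask)
```

In this toy case the erroneous position stays in the context, but its loss is excluded from the average.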

Title: Uncertainty-Aware Attention Heads: Efficient Unsupervised Uncertainty Quantification for LLMs

Authors: Artem Vazhentsev, Lyudmila Rvanova, Gleb Kuzmin, Ekaterina Fadeeva, Ivan Lazichny, Alexander Panchenko, Maxim Panov, Timothy Baldwin, Mrinmaya Sachan, Preslav Nakov, Artem Shelmanov
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.20045
Pdf URL: https://arxiv.org/pdf/2505.20045
Copy Paste: [[2505.20045]] Uncertainty-Aware Attention Heads: Efficient Unsupervised Uncertainty Quantification for LLMs(https://arxiv.org/abs/2505.20045)
Keywords: language model, llm, hallucination
Abstract: Large language models (LLMs) exhibit impressive fluency, but often produce critical errors known as "hallucinations". Uncertainty quantification (UQ) methods are a promising tool for coping with this fundamental shortcoming. Yet, existing UQ methods face challenges such as high computational overhead or reliance on supervised learning. Here, we aim to bridge this gap. In particular, we propose RAUQ (Recurrent Attention-based Uncertainty Quantification), an unsupervised approach that leverages intrinsic attention patterns in transformers to detect hallucinations efficiently. By analyzing attention weights, we identified a peculiar pattern: drops in attention to preceding tokens are systematically observed during incorrect generations for certain "uncertainty-aware" heads. RAUQ automatically selects such heads, recurrently aggregates their attention weights and token-level confidences, and computes sequence-level uncertainty scores in a single forward pass. Experiments across 4 LLMs and 12 question answering, summarization, and translation tasks demonstrate that RAUQ yields excellent results, outperforming state-of-the-art UQ methods using minimal computational overhead (<1% latency). Moreover, it requires no task-specific labels and no careful hyperparameter tuning, offering plug-and-play real-time hallucination detection in white-box LLMs.

Title: Grammars of Formal Uncertainty: When to Trust LLMs in Automated Reasoning Tasks

Authors: Debargha Ganguly, Vikash Singh, Sreehari Sankar, Biyao Zhang, Xuecen Zhang, Srinivasan Iyengar, Xiaotian Han, Amit Sharma, Shivkumar Kalyanaraman, Vipin Chaudhary
Subjects: cs.CL, cs.AI, cs.LO, cs.SE
Abstract URL: https://arxiv.org/abs/2505.20047
Pdf URL: https://arxiv.org/pdf/2505.20047
Copy Paste: [[2505.20047]] Grammars of Formal Uncertainty: When to Trust LLMs in Automated Reasoning Tasks(https://arxiv.org/abs/2505.20047)
Keywords: language model, llm
Abstract: Large language models (LLMs) show remarkable promise for democratizing automated reasoning by generating formal specifications. However, a fundamental tension exists: LLMs are probabilistic, while formal verification demands deterministic guarantees. This paper addresses this epistemological gap by comprehensively investigating failure modes and uncertainty quantification (UQ) in LLM-generated formal artifacts. Our systematic evaluation of five frontier LLMs reveals that the impact of Satisfiability Modulo Theories (SMT) based autoformalization on accuracy is domain-specific (from +34.8% on logical tasks to -44.5% on factual ones), and that known UQ techniques such as the entropy of token probabilities fail to identify these errors. We introduce a probabilistic context-free grammar (PCFG) framework to model LLM outputs, yielding a refined uncertainty taxonomy. We find uncertainty signals are task-dependent (e.g., grammar entropy for logic, AUROC>0.93). Finally, a lightweight fusion of these signals enables selective verification, drastically reducing errors (14-100%) with minimal abstention, transforming LLM-driven formalization into a reliable engineering discipline.
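For readers unfamiliar with the baseline mentioned above, the entropy of token probabilities can be sketched as follows. This is a generic illustration of Shannon entropy over next-token distributions, not the paper's PCFG-based signals; the function names and the idea of averaging over generation steps are assumptions made for the example.

```python
import math
from typing import Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_sequence_entropy(step_probs: Sequence[Sequence[float]]) -> float:
    """Average the per-step entropies over a generated sequence."""
    entropies = [token_entropy(p) for p in step_probs]
    return sum(entropies) / len(entropies) if entropies else 0.0


# A confident step (one token has all the mass) versus a maximally
# uncertain step over two tokens.
confident = token_entropy([1.0, 0.0])
uncertain = token_entropy([0.5, 0.5])
```

Higher values indicate a flatter distribution, which is the sense in which entropy is used as an uncertainty signal.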

Title: Incentivizing Reasoning from Weak Supervision

Authors: Yige Yuan, Teng Xiao, Shuchang Tao, Xue Wang, Jinyang Gao, Bolin Ding, Bingbing Xu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.20072
Pdf URL: https://arxiv.org/pdf/2505.20072
Copy Paste: [[2505.20072]] Incentivizing Reasoning from Weak Supervision(https://arxiv.org/abs/2505.20072)
Keywords: language model, llm, chain-of-thought
Abstract: Large language models (LLMs) have demonstrated impressive performance on reasoning-intensive tasks, but enhancing their reasoning abilities typically relies on either reinforcement learning (RL) with verifiable signals or supervised fine-tuning (SFT) with high-quality long chain-of-thought (CoT) demonstrations, both of which are expensive. In this paper, we study a novel problem of incentivizing the reasoning capacity of LLMs without expensive high-quality demonstrations and reinforcement learning. We investigate whether the reasoning capabilities of LLMs can be effectively incentivized via supervision from significantly weaker models. We further analyze when and why such weak supervision succeeds in eliciting reasoning abilities in stronger models. Our findings show that supervision from significantly weaker reasoners can substantially improve student reasoning performance, recovering close to 94% of the gains of expensive RL at a fraction of the cost. Experiments across diverse benchmarks and model architectures demonstrate that weak reasoners can effectively incentivize reasoning in stronger student models, consistently improving performance across a wide range of reasoning tasks. Our results suggest that this simple weak-to-strong paradigm is a promising and generalizable alternative to costly methods for incentivizing strong reasoning capabilities at inference-time in LLMs. The code is publicly available at this https URL.

Title: Inference-time Alignment in Continuous Space

Authors: Yige Yuan, Teng Xiao, Li Yunfan, Bingbing Xu, Shuchang Tao, Yunqi Qiu, Huawei Shen, Xueqi Cheng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.20081
Pdf URL: https://arxiv.org/pdf/2505.20081
Copy Paste: [[2505.20081]] Inference-time Alignment in Continuous Space(https://arxiv.org/abs/2505.20081)
Keywords: language model
Abstract: Aligning large language models with human feedback at inference time has received increasing attention due to its flexibility. Existing methods rely on generating multiple responses from the base policy for search using a reward model, which can be considered as searching in a discrete response space. However, these methods struggle to explore informative candidates when the base policy is weak or the candidate set is small, resulting in limited effectiveness. In this paper, to address this problem, we propose Simple Energy Adaptation (SEA), a simple yet effective algorithm for inference-time alignment. In contrast to expensive search over the discrete space, SEA directly adapts original responses from the base policy toward the optimal one via gradient-based sampling in continuous latent space. Specifically, SEA formulates inference as an iterative optimization procedure on an energy function over actions in the continuous space defined by the optimal policy, enabling simple and effective alignment. For instance, despite its simplicity, SEA outperforms the second-best baseline with a relative improvement of up to 77.51% on AdvBench and 16.36% on MATH. Our code is publicly available at this https URL.
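The phrase "gradient-based sampling" on an energy function can be made concrete with a toy sketch. The abstract does not name the sampler, so this assumes a Langevin-style update on a scalar variable, x ← x − η·∇E(x) + noise; the function `langevin_steps`, the quadratic energy, and all parameter values are hypothetical and are not the authors' implementation.

```python
import math
import random
from typing import Callable


def langevin_steps(
    grad_energy: Callable[[float], float],
    x0: float,
    step_size: float,
    n_steps: int,
    noise_scale: float = 1.0,
    seed: int = 0,
) -> float:
    """Run Langevin-style updates on a scalar and return the final value.

    Each step moves against the energy gradient and adds Gaussian noise
    scaled by sqrt(2 * step_size). noise_scale=0 gives plain gradient descent.
    """
    rng = random.Random(seed)
    x = x0
    for _ in range(n_steps):
        noise = rng.gauss(0.0, 1.0)
        x = x - step_size * grad_energy(x) + noise_scale * math.sqrt(2.0 * step_size) * noise
    return x


# Toy energy E(x) = (x - 3)^2 / 2, whose gradient is x - 3 and whose minimum is x = 3.
x_final = langevin_steps(lambda x: x - 3.0, x0=0.0, step_size=0.1, n_steps=200, noise_scale=0.0)
```

With the noise switched off, the iterate converges to the energy minimum; with noise, the iterates fluctuate around it instead of settling on a single point.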

Title: Multi-Domain Explainability of Preferences

Authors: Nitay Calderon, Liat Ein-Dor, Roi Reichart
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.20088
Pdf URL: https://arxiv.org/pdf/2505.20088
Copy Paste: [[2505.20088]] Multi-Domain Explainability of Preferences(https://arxiv.org/abs/2505.20088)
Keywords: language model, llm, prompt
Abstract: Preference mechanisms, such as human preference, LLM-as-a-Judge (LaaJ), and reward models, are central to aligning and evaluating large language models (LLMs). Yet, the underlying concepts that drive these preferences remain poorly understood. In this work, we propose a fully automated end-to-end method for generating local and global concept-based explanations of preferences across multiple domains. Our method employs an LLM to discover concepts that differentiate between chosen and rejected responses and represent them with concept-based vectors. To model the relationships between concepts and preferences, we propose a white-box Hierarchical Multi-Domain Regression model that captures both domain-general and domain-specific effects. To evaluate our method, we curate a dataset spanning eight challenging and diverse domains and explain twelve mechanisms. Our method achieves strong preference prediction performance, outperforming baselines while also being explainable. Additionally, we assess explanations in two novel application-driven settings. First, guiding LLM outputs with concepts from LaaJ explanations yields responses that those judges consistently prefer. Second, prompting LaaJs with concepts explaining humans improves their preference predictions. Together, our work provides a new paradigm for explainability in the era of LLMs.

Title: MA-RAG: Multi-Agent Retrieval-Augmented Generation via Collaborative Chain-of-Thought Reasoning

Authors: Thang Nguyen, Peter Chin, Yu-Wing Tai
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.20096
Pdf URL: https://arxiv.org/pdf/2505.20096
Copy Paste: [[2505.20096]] MA-RAG: Multi-Agent Retrieval-Augmented Generation via Collaborative Chain-of-Thought Reasoning(https://arxiv.org/abs/2505.20096)
Keywords: prompt, retrieval-augmented generation, chain-of-thought, agent
Abstract: We present MA-RAG, a Multi-Agent framework for Retrieval-Augmented Generation (RAG) that addresses the inherent ambiguities and reasoning challenges in complex information-seeking tasks. Unlike conventional RAG methods that rely on either end-to-end fine-tuning or isolated component enhancements, MA-RAG orchestrates a collaborative set of specialized AI agents: Planner, Step Definer, Extractor, and QA Agents, to tackle each stage of the RAG pipeline with task-aware reasoning. Ambiguities may arise from underspecified queries, sparse or indirect evidence in retrieved documents, or the need to integrate information scattered across multiple sources. MA-RAG mitigates these challenges by decomposing the problem into subtasks, such as query disambiguation, evidence extraction, and answer synthesis, and dispatching them to dedicated agents equipped with chain-of-thought prompting. These agents communicate intermediate reasoning and progressively refine the retrieval and synthesis process. Our design allows fine-grained control over information flow without any model fine-tuning. Crucially, agents are invoked on demand, enabling a dynamic and efficient workflow that avoids unnecessary computation. This modular and reasoning-driven architecture enables MA-RAG to deliver robust, interpretable results. Experiments on multi-hop and ambiguous QA benchmarks demonstrate that MA-RAG outperforms state-of-the-art training-free baselines and rivals fine-tuned systems, validating the effectiveness of collaborative agent-based reasoning in RAG.

Title: S2LPP: Small-to-Large Prompt Prediction across LLMs

Authors: Liang Cheng, Tianyi LI, Zhaowei Wang, Mark Steedman
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.20097
Pdf URL: https://arxiv.org/pdf/2505.20097
Copy Paste: [[2505.20097]] S2LPP: Small-to-Large Prompt Prediction across LLMs(https://arxiv.org/abs/2505.20097)
Keywords: language model, llm, prompt
Abstract: The performance of pre-trained Large Language Models (LLMs) is often sensitive to nuances in prompt templates, requiring careful prompt engineering and adding costs in terms of computing and human effort. In this study, we present experiments encompassing multiple LLM variants of varying sizes aimed at probing their preference with different prompts. Through experiments on Question Answering, we show prompt preference consistency across LLMs of different sizes. We also show that this consistency extends to other tasks, such as Natural Language Inference. Utilizing this consistency, we propose a method to use a smaller model to select effective prompt templates for a larger model. We show that our method substantially reduces the cost of prompt engineering while consistently matching performance with optimal prompts among candidates. More importantly, our experiment shows the efficacy of our strategy across fourteen LLMs and its applicability to a broad range of NLP tasks, highlighting its robustness.
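The selection step described above can be sketched in a few lines. This is a minimal illustration that assumes a scoring function (for example, accuracy of the small model on a held-out set) is available for each candidate template; `select_template`, the template strings, and the scores are hypothetical placeholders, not results from the paper.

```python
from typing import Callable, Sequence


def select_template(
    templates: Sequence[str],
    score_with_small_model: Callable[[str], float],
) -> str:
    """Return the template that the (cheaper) small model scores highest.

    The chosen template would then be reused unchanged with the larger model.
    """
    if not templates:
        raise ValueError("at least one candidate template is required")
    return max(templates, key=score_with_small_model)


# Made-up scores standing in for small-model evaluation results.
toy_scores = {
    "Q: {question}\nA:": 0.61,
    "Answer the question: {question}": 0.72,
    "{question}": 0.48,
}
best = select_template(list(toy_scores), lambda t: toy_scores[t])
```

The saving comes from running the candidate sweep only on the small model, so the larger model is queried with a single template.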

Title: Large Language Models Meet Knowledge Graphs for Question Answering: Synthesis and Opportunities

Authors: Chuangtao Ma, Yongrui Chen, Tianxing Wu, Arijit Khan, Haofen Wang
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2505.20099
Pdf URL: https://arxiv.org/pdf/2505.20099
Copy Paste: [[2505.20099]] Large Language Models Meet Knowledge Graphs for Question Answering: Synthesis and Opportunities(https://arxiv.org/abs/2505.20099)
Keywords: language model, llm, hallucination
Abstract: Large language models (LLMs) have demonstrated remarkable performance on question-answering (QA) tasks because of their superior capabilities in natural language understanding and generation. However, LLM-based QA struggles with complex QA tasks due to poor reasoning capacity, outdated knowledge, and hallucinations. Several recent works synthesize LLMs and knowledge graphs (KGs) for QA to address the above challenges. In this survey, we propose a new structured taxonomy that categorizes the methodology of synthesizing LLMs and KGs for QA according to the categories of QA and the KG's role when integrating with LLMs. We systematically survey state-of-the-art advances in synthesizing LLMs and KGs for QA and compare and analyze these approaches in terms of strength, limitations, and KG requirements. We then align the approaches with QA and discuss how these approaches address the main challenges of different complex QA. Finally, we summarize the advancements, evaluation metrics, and benchmark datasets and highlight open challenges and opportunities.

Title: Adaptive Deep Reasoning: Triggering Deep Thinking When Needed

Authors: Yunhao Wang, Yuhao Zhang, Tinghao Yu, Can Xu, Feng Zhang, Fengzong Lian
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.20101
Pdf URL: https://arxiv.org/pdf/2505.20101
Copy Paste: [[2505.20101]] Adaptive Deep Reasoning: Triggering Deep Thinking When Needed(https://arxiv.org/abs/2505.20101)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Large language models (LLMs) have shown impressive capabilities in handling complex tasks through long-chain reasoning. However, the extensive reasoning steps involved can significantly increase computational costs, posing challenges for real-world deployment. Recent efforts have focused on optimizing reasoning efficiency by shortening the Chain-of-Thought (CoT) reasoning processes through various approaches, such as length-aware prompt engineering, supervised fine-tuning on CoT data with variable lengths, and reinforcement learning with length penalties. Although these methods effectively reduce reasoning length, they still necessitate an initial reasoning phase. More recent approaches have attempted to integrate long-chain and short-chain reasoning abilities into a single model, yet they still rely on manual control to toggle between short and long reasoning. In this work, we propose a novel approach that autonomously switches between short and long reasoning chains based on problem complexity. Our method begins with supervised fine-tuning of the base model to equip it with both long-chain and short-chain reasoning abilities. We then employ reinforcement learning to further balance short and long CoT generation while maintaining accuracy through two key strategies: first, integrating reinforcement learning with a long-short adaptive group-wise reward strategy to assess prompt complexity and provide corresponding rewards; second, implementing a logit-based reasoning mode switching loss to optimize the model's initial token choice, thereby guiding the selection of the reasoning mode. Experiments on mathematical datasets demonstrate that our model can dynamically switch between long-chain and short-chain reasoning modes without substantially sacrificing performance. This advancement enhances the practicality of reasoning in large language models for real-world applications.

Title: Language-Agnostic Suicidal Risk Detection Using Large Language Models

Authors: June-Woo Kim, Wonkyo Oh, Haram Yoon, Sung-Hoon Yoon, Dae-Jin Kim, Dong-Ho Lee, Sang-Yeol Lee, Chan-Mo Yang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.20109
Pdf URL: https://arxiv.org/pdf/2505.20109
Copy Paste: [[2505.20109]] Language-Agnostic Suicidal Risk Detection Using Large Language Models(https://arxiv.org/abs/2505.20109)
Keywords: language model, llm, prompt
Abstract: Suicidal risk detection in adolescents is a critical challenge, yet existing methods rely on language-specific models, limiting scalability and generalization. This study introduces a novel language-agnostic framework for suicidal risk assessment with large language models (LLMs). We generate Chinese transcripts from speech using an ASR model and then employ LLMs with prompt-based queries to extract suicidal risk-related features from these transcripts. The extracted features are retained in both Chinese and English to enable cross-linguistic analysis and then used to fine-tune corresponding pretrained language models independently. Experimental results show that our method achieves performance comparable to direct fine-tuning with ASR results or to models trained solely on Chinese suicidal risk-related features, demonstrating its potential to overcome language constraints and improve the robustness of suicidal risk assessment.

Title: ResSVD: Residual Compensated SVD for Large Language Model Compression

Authors: Haolei Bai, Siyong Jian, Tuo Liang, Yu Yin, Huan Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.20112
Pdf URL: https://arxiv.org/pdf/2505.20112
Copy Paste: [[2505.20112]] ResSVD: Residual Compensated SVD for Large Language Model Compression(https://arxiv.org/abs/2505.20112)
Keywords: language model, llm
Abstract: Large language models (LLMs) have demonstrated impressive capabilities in a wide range of downstream natural language processing tasks. Nevertheless, their considerable sizes and memory demands hinder practical deployment, underscoring the importance of developing efficient compression strategies. Singular value decomposition (SVD) decomposes a matrix into orthogonal components, enabling efficient low-rank approximation. This is particularly suitable for LLM compression, where weight matrices often exhibit significant redundancy. However, current SVD-based methods neglect the residual matrix from truncation, resulting in significant truncation loss. Additionally, compressing all layers of the model results in severe performance degradation. To overcome these limitations, we propose ResSVD, a new post-training SVD-based LLM compression method. Specifically, we leverage the residual matrix generated during the truncation process to reduce truncation loss. Moreover, under a fixed overall compression ratio, we selectively compress the last few layers of the model, which mitigates error propagation and significantly improves the performance of compressed models. Evaluations of ResSVD on diverse LLM families and multiple benchmark datasets indicate that ResSVD consistently achieves superior performance over existing counterpart methods, demonstrating its practical effectiveness.
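
The residual idea described in the abstract above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the random stand-in weight matrix, and the ranks 8 and 4 are all hypothetical. The weight matrix is truncated to a low rank, the residual left by that truncation is itself approximated at a small rank, and the two pieces are added.

```python
import numpy as np


def truncated_svd(matrix, rank):
    """Best rank-`rank` approximation of `matrix` in Frobenius norm."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :rank] @ np.diag(s[:rank]) @ vt[:rank, :]


def residual_compensated(weight, rank_main, rank_residual):
    """Truncate, then approximate the truncation residual and add it back."""
    main = truncated_svd(weight, rank_main)
    residual = weight - main  # what plain truncation throws away
    correction = truncated_svd(residual, rank_residual)
    return main + correction


# Hypothetical stand-in for one weight matrix of a model.
rng = np.random.default_rng(0)
weight = rng.standard_normal((64, 48))

plain = truncated_svd(weight, 8)
compensated = residual_compensated(weight, 8, 4)

err_plain = np.linalg.norm(weight - plain)
err_comp = np.linalg.norm(weight - compensated)
print(f"plain truncation error:      {err_plain:.3f}")
print(f"with residual compensation:  {err_comp:.3f}")
```

With plain SVD on the raw weights, as here, the two-stage result coincides with a direct truncation at the combined rank (8 + 4 = 12), so the sketch shows the mechanism only. The abstract does not say in which setting the reported gain arises.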

Title: Named Entity Recognition in Historical Italian: The Case of Giacomo Leopardi's Zibaldone

Authors: Cristian Santini, Laura Melosi, Emanuele Frontoni
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.20113
Pdf URL: https://arxiv.org/pdf/2505.20113
Copy Paste: [[2505.20113]] Named Entity Recognition in Historical Italian: The Case of Giacomo Leopardi's Zibaldone(https://arxiv.org/abs/2505.20113)
Keywords: language model, llm
Abstract: The increased digitization of the world's textual heritage poses significant challenges for both computer science and literary studies. Overall, there is an urgent need for computational techniques able to adapt to the challenges of historical texts, such as orthographic and spelling variations, fragmentary structure and digitization errors. The rise of large language models (LLMs) has revolutionized natural language processing, suggesting promising applications for Named Entity Recognition (NER) on historical documents. In spite of this, no thorough evaluation has been proposed for Italian texts. This research tries to fill the gap by proposing a new challenging dataset for entity extraction based on a corpus of 19th century scholarly notes, i.e. Giacomo Leopardi's Zibaldone (1898), containing 2,899 references to people, locations and literary works. This dataset was used to carry out reproducible experiments with both domain-specific BERT-based models and state-of-the-art LLMs such as LLaMa3.1. Results show that instruction-tuned models encounter multiple difficulties handling historical humanistic texts, while fine-tuned NER models offer more robust performance even with challenging entity types such as bibliographic references.

Title: TrojanStego: Your Language Model Can Secretly Be A Steganographic Privacy Leaking Agent

Authors: Dominik Meier, Jan Philip Wahle, Paul Röttger, Terry Ruas, Bela Gipp
Subjects: cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2505.20118
Pdf URL: https://arxiv.org/pdf/2505.20118
Copy Paste: [[2505.20118]] TrojanStego: Your Language Model Can Secretly Be A Steganographic Privacy Leaking Agent(https://arxiv.org/abs/2505.20118)
Keywords: language model, llm, prompt, agent
Abstract: As large language models (LLMs) become integrated into sensitive workflows, concerns grow over their potential to leak confidential information. We propose TrojanStego, a novel threat model in which an adversary fine-tunes an LLM to embed sensitive context information into natural-looking outputs via linguistic steganography, without requiring explicit control over inference inputs. We introduce a taxonomy outlining risk factors for compromised LLMs, and use it to evaluate the risk profile of the threat. To implement TrojanStego, we propose a practical encoding scheme based on vocabulary partitioning learnable by LLMs via fine-tuning. Experimental results show that compromised models reliably transmit 32-bit secrets with 87% accuracy on held-out prompts, reaching over 97% accuracy using majority voting across three generations. Further, they maintain high utility, can evade human detection, and preserve coherence. These results highlight a new class of LLM data exfiltration attacks that are passive, covert, practical, and dangerous.

Title: Iterative Self-Incentivization Empowers Large Language Models as Agentic Searchers

Authors: Zhengliang Shi, Lingyong Yan, Dawei Yin, Suzan Verberne, Maarten de Rijke, Zhaochun Ren
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.20128
Pdf URL: https://arxiv.org/pdf/2505.20128
Copy Paste: [[2505.20128]] Iterative Self-Incentivization Empowers Large Language Models as Agentic Searchers(https://arxiv.org/abs/2505.20128)
Keywords: language model, llm, agent
Abstract: Large language models (LLMs) have been widely integrated into information retrieval to advance traditional techniques. However, effectively enabling LLMs to seek accurate knowledge in complex tasks remains a challenge due to the complexity of multi-hop queries and the presence of irrelevant retrieved content. To address these limitations, we propose EXSEARCH, an agentic search framework, where the LLM learns to retrieve useful information as the reasoning unfolds through a self-incentivized process. At each step, the LLM decides what to retrieve (thinking), triggers an external retriever (search), and extracts fine-grained evidence (recording) to support next-step reasoning. To equip the LLM with this capability, EXSEARCH adopts a Generalized Expectation-Maximization algorithm. In the E-step, the LLM generates multiple search trajectories and assigns an importance weight to each; the M-step trains the LLM on them with a re-weighted loss function. This creates a self-incentivized loop, where the LLM iteratively learns from its own generated data, progressively improving itself for search. We further theoretically analyze this training process, establishing convergence guarantees. Extensive experiments on four knowledge-intensive benchmarks show that EXSEARCH substantially outperforms baselines, e.g., +7.8% improvement on exact match score. Motivated by these promising results, we introduce EXSEARCH-Zoo, which extends our method to broader scenarios, to facilitate future work.
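
The E-step/M-step loop in the abstract above can be illustrated with a toy calculation. The abstract does not say how the importance weights are computed, so the softmax over per-trajectory scores below is an assumption, as are the function names and the example numbers. The sketch only shows how sampled trajectories could be weighted and how those weights enter a re-weighted negative log-likelihood.

```python
import numpy as np


def importance_weights(scores):
    """E-step stand-in: normalize trajectory scores into weights (softmax).

    The softmax is an assumption; the abstract only states that each
    trajectory receives an importance weight.
    """
    scores = np.asarray(scores, dtype=float)
    shifted = np.exp(scores - scores.max())  # subtract max for stability
    return shifted / shifted.sum()


def reweighted_loss(log_likelihoods, weights):
    """M-step objective: importance-weighted negative log-likelihood."""
    log_likelihoods = np.asarray(log_likelihoods, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return float(-(weights * log_likelihoods).sum())


# Hypothetical scores and log-likelihoods for three sampled trajectories.
trajectory_scores = [1.0, 3.0, 2.0]
trajectory_log_likelihoods = [-4.0, -2.5, -3.0]

weights = importance_weights(trajectory_scores)
loss = reweighted_loss(trajectory_log_likelihoods, weights)
print("weights:", np.round(weights, 3))
print("re-weighted loss:", round(loss, 3))
```

In a real training loop the log-likelihoods would come from the model being trained and the loss would be minimized by gradient descent; here they are fixed numbers so the weighting itself is easy to inspect.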

Title: AweDist: Attention-aware Embedding Distillation for New Input Token Embeddings

Authors: Konstantin Dobler, Desmond Elliott, Gerard de Melo
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.20133
Pdf URL: https://arxiv.org/pdf/2505.20133
Copy Paste: [[2505.20133]] AweDist: Attention-aware Embedding Distillation for New Input Token Embeddings(https://arxiv.org/abs/2505.20133)
Keywords: language model
Abstract: Current language models rely on static vocabularies determined at pretraining time, which can lead to decreased performance and increased computational cost for domains underrepresented in the original vocabulary. New tokens can be added to solve this problem, when coupled with a good initialization for their new embeddings. However, existing embedding initialization methods either require expensive further training or pretraining of additional modules. In this paper, we propose AweDist and show that by distilling representations obtained using the original tokenization, we can quickly learn high-quality input embeddings for new tokens. Experimental results with a wide range of open-weight models show that AweDist is able to outperform even strong baselines.

Title: SeMe: Training-Free Language Model Merging via Semantic Alignment

Authors: Jian Gu, Aldeida Aleti, Chunyang Chen, Hongyu Zhang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.20144
Pdf URL: https://arxiv.org/pdf/2505.20144
Copy Paste: [[2505.20144]] SeMe: Training-Free Language Model Merging via Semantic Alignment(https://arxiv.org/abs/2505.20144)
Keywords: language model
Abstract: Despite the remarkable capabilities of Language Models (LMs) across diverse tasks, no single model consistently outperforms others, necessitating efficient methods to combine their strengths without expensive retraining. Existing model merging techniques, such as parameter averaging and task-guided fusion, often rely on data-dependent computations or fail to preserve internal knowledge, limiting their robustness and scalability. We introduce SeMe (Semantic-based Merging), a novel, data-free, and training-free approach that leverages latent semantic alignment to merge LMs at a fine-grained, layer-wise level. Unlike prior work, SeMe not only preserves model behaviors but also explicitly stabilizes internal knowledge, addressing a critical gap in LM fusion. Through extensive experiments across diverse architectures and tasks, we demonstrate that SeMe outperforms existing methods in both performance and efficiency while eliminating reliance on external data. Our work establishes a new paradigm for knowledge-aware model merging and provides insights into the semantic structure of LMs, paving the way for more scalable and interpretable model composition.

Title: UORA: Uniform Orthogonal Reinitialization Adaptation in Parameter-Efficient Fine-Tuning of Large Models

Authors: Xueyan Zhang, Jinman Zhao, Zhifei Yang, Yibo Zhong, Shuhao Guan, Linbo Cao, Yining Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.20154
Pdf URL: https://arxiv.org/pdf/2505.20154
Copy Paste: [[2505.20154]] UORA: Uniform Orthogonal Reinitialization Adaptation in Parameter-Efficient Fine-Tuning of Large Models(https://arxiv.org/abs/2505.20154)
Keywords: language model, llm
Abstract: This paper introduces Uniform Orthogonal Reinitialization Adaptation (UORA), a novel parameter-efficient fine-tuning (PEFT) approach for Large Language Models (LLMs). UORA achieves state-of-the-art performance and parameter efficiency by leveraging a low-rank approximation method to reduce the number of trainable parameters. Unlike existing methods such as LoRA and VeRA, UORA employs an interpolation-based reparametrization mechanism that selectively reinitializes rows and columns in frozen projection matrices, guided by the vector magnitude heuristic. This results in substantially fewer trainable parameters compared to LoRA and outperforms VeRA in computation and storage efficiency. Comprehensive experiments across various benchmarks demonstrate UORA's superiority in achieving competitive fine-tuning performance with negligible computational overhead. We demonstrate its performance on GLUE and E2E benchmarks and its effectiveness in instruction-tuning large language models and image classification models. Our contributions establish a new paradigm for scalable and resource-efficient fine-tuning of LLMs.

Title: Pangu Light: Weight Re-Initialization for Pruning and Accelerating LLMs

Authors: Hanting Chen, Jiarui Qin, Jialong Guo, Tao Yuan, Yichun Yin, Huiling Zhen, Yasheng Wang, Jinpeng Li, Xiaojun Meng, Meng Zhang, Rongju Ruan, Zheyuan Bai, Yehui Tang, Can Chen, Xinghao Chen, Fisher Yu, Ruiming Tang, Yunhe Wang (and Other Contributors)
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.20155
Pdf URL: https://arxiv.org/pdf/2505.20155
Copy Paste: [[2505.20155]] Pangu Light: Weight Re-Initialization for Pruning and Accelerating LLMs(https://arxiv.org/abs/2505.20155)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) deliver state-of-the-art capabilities across numerous tasks, but their immense size and inference costs pose significant computational challenges for practical deployment. While structured pruning offers a promising avenue for model compression, existing methods often struggle with the detrimental effects of aggressive, simultaneous width and depth reductions, leading to substantial performance degradation. This paper argues that a critical, often overlooked, aspect in making such aggressive joint pruning viable is the strategic re-initialization and adjustment of remaining weights to improve the model's post-pruning training accuracy. We introduce Pangu Light, a framework for LLM acceleration centered around structured pruning coupled with novel weight re-initialization techniques designed to address this "missing piece". Our framework systematically targets multiple axes, including model width, depth, attention heads, and RMSNorm, with its effectiveness rooted in novel re-initialization methods like Cross-Layer Attention Pruning (CLAP) and Stabilized LayerNorm Pruning (SLNP) that mitigate performance drops by providing the network a better training starting point. Further enhancing efficiency, Pangu Light incorporates specialized optimizations such as absorbing Post-RMSNorm computations and tailors its strategies to Ascend NPU characteristics. The Pangu Light models consistently exhibit a superior accuracy-efficiency trade-off, outperforming prominent baseline pruning methods like Nemotron and established LLMs like the Qwen3 series. For instance, on Ascend NPUs, Pangu Light-32B's 81.6 average score and 2585 tokens/s throughput exceed Qwen3-32B's 80.9 average score and 2225 tokens/s.

Title: Exploring Generative Error Correction for Dysarthric Speech Recognition

Authors: Moreno La Quatra, Alkis Koudounas, Valerio Mario Salerno, Sabato Marco Siniscalchi
Subjects: cs.CL, eess.AS
Abstract URL: https://arxiv.org/abs/2505.20163
Pdf URL: https://arxiv.org/pdf/2505.20163
Copy Paste: [[2505.20163]] Exploring Generative Error Correction for Dysarthric Speech Recognition(https://arxiv.org/abs/2505.20163)
Keywords: llm
Abstract: Despite the remarkable progress in end-to-end Automatic Speech Recognition (ASR) engines, accurately transcribing dysarthric speech remains a major challenge. In this work, we propose a two-stage framework for the Speech Accessibility Project Challenge at INTERSPEECH 2025, which combines cutting-edge speech recognition models with LLM-based generative error correction (GER). We assess different configurations of model scales and training strategies, incorporating specific hypothesis selection to improve transcription accuracy. Experiments on the Speech Accessibility Project dataset demonstrate the strength of our approach on structured and spontaneous speech, while highlighting challenges in single-word recognition. Through comprehensive analysis, we provide insights into the complementary roles of acoustic and linguistic modeling in dysarthric speech recognition.

Title: Visual Abstract Thinking Empowers Multimodal Reasoning

Authors: Dairu Liu, Ziyue Wang, Minyuan Ruan, Fuwen Luo, Chi Chen, Peng Li, Yang Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.20164
Pdf URL: https://arxiv.org/pdf/2505.20164
Copy Paste: [[2505.20164]] Visual Abstract Thinking Empowers Multimodal Reasoning(https://arxiv.org/abs/2505.20164)
Keywords: language model, gpt, llm, prompt, chain-of-thought
Abstract: Images usually convey richer detail than text, but often include redundant information which potentially downgrades multimodal reasoning performance. When faced with lengthy or complex messages, humans tend to employ abstract thinking to convert them into simple and concise abstracts. Inspired by this cognitive strategy, we introduce Visual Abstract Thinking (VAT), a novel thinking paradigm that prompts Multimodal Large Language Models (MLLMs) with visual abstracts instead of explicit verbal thoughts or elaborate guidance, permitting a more concentrated visual reasoning mechanism. Explicit thinking, such as Chain-of-thought (CoT) or tool-augmented approaches, increases the complexity of the reasoning process by inserting verbose intermediate steps, external knowledge or visual information. In contrast, VAT reduces redundant visual information and encourages models to focus their reasoning on more essential visual elements. Experimental results show that VAT consistently empowers different models, and achieves an average gain of 17% over the GPT-4o baseline by employing diverse types of visual abstracts, demonstrating that VAT can enhance visual reasoning abilities for MLLMs regarding conceptual, structural and relational reasoning tasks. VAT is also compatible with CoT in knowledge-intensive multimodal reasoning tasks. These findings highlight the effectiveness of visual reasoning via abstract thinking and encourage further exploration of more diverse reasoning paradigms from the perspective of human cognition.

Title: THiNK: Can Large Language Models Think-aloud?

Authors: Yongan Yu, Mengqian Wu, Yiran Lin, Nikki G. Lobczowski
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.20184
Pdf URL: https://arxiv.org/pdf/2505.20184
Copy Paste: [[2505.20184]] THiNK: Can Large Language Models Think-aloud?(https://arxiv.org/abs/2505.20184)
Keywords: language model, llm, agent
Abstract: Assessing higher-order thinking skills in large language models (LLMs) remains a fundamental challenge, especially in tasks that go beyond surface-level accuracy. In this work, we propose THiNK (Testing Higher-order Notion of Knowledge), a multi-agent, feedback-driven evaluation framework grounded in Bloom's Taxonomy. THiNK frames reasoning assessment as an iterative task of problem generation, critique, and revision, encouraging LLMs to think-aloud through step-by-step reflection and refinement. This enables a systematic evaluation of both lower-order (e.g., remember, understand) and higher-order (e.g., evaluate, create) thinking skills. We apply THiNK to seven state-of-the-art LLMs and perform a detailed cognitive analysis of their outputs. Results reveal that while models reliably perform well in lower-order categories, they struggle to apply knowledge in realistic contexts and exhibit limited abstraction. Structured feedback loops significantly improve reasoning performance, particularly in higher-order thinking. Qualitative evaluations further confirm that THiNK-guided outputs better align with domain logic and problem structure. Our framework provides a scalable methodology for probing and enhancing LLM reasoning, offering new directions for evaluation grounded in learning science; its code is available at our GitHub repository.

Title: Monocle: Hybrid Local-Global In-Context Evaluation for Long-Text Generation with Uncertainty-Based Active Learning

Authors: Xiaorong Wang, Ting Yang, Zhu Zhang, Shuo Wang, Zihan Zhou, Liner Yang, Zhiyuan Liu, Maosong Sun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.20195
Pdf URL: https://arxiv.org/pdf/2505.20195
Copy Paste: [[2505.20195]] Monocle: Hybrid Local-Global In-Context Evaluation for Long-Text Generation with Uncertainty-Based Active Learning(https://arxiv.org/abs/2505.20195)
Keywords: llm
Abstract: Assessing the quality of long-form, model-generated text is challenging, even with advanced LLM-as-a-Judge methods, due to performance degradation as input length increases. To address this issue, we propose a divide-and-conquer approach, which breaks down the comprehensive evaluation task into a series of localized scoring tasks, followed by a final global assessment. This strategy allows for more granular and manageable evaluations, ensuring that each segment of the text is assessed in isolation for both coherence and quality, while also accounting for the overall structure and consistency of the entire piece. Moreover, we introduce a hybrid in-context learning approach that leverages human annotations to enhance the performance of both local and global evaluations. By incorporating human-generated feedback directly into the evaluation process, this method allows the model to better align with human judgment. Finally, we develop an uncertainty-based active learning algorithm that efficiently selects data samples for human annotation, thereby reducing annotation costs in practical scenarios. Experimental results show that the proposed evaluation framework outperforms several representative baselines, highlighting the effectiveness of our approach.

Title: Adaptive Classifier-Free Guidance via Dynamic Low-Confidence Masking

Authors: Pengxiang Li, Shilin Yan, Joey Tsai, Renrui Zhang, Ruichuan An, Ziyu Guo, Xiaowei Gao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.20199
Pdf URL: https://arxiv.org/pdf/2505.20199
Copy Paste: [[2505.20199]] Adaptive Classifier-Free Guidance via Dynamic Low-Confidence Masking(https://arxiv.org/abs/2505.20199)
Keywords: language model
Abstract: Classifier-Free Guidance (CFG) significantly enhances controllability in generative models by interpolating conditional and unconditional predictions. However, standard CFG often employs a static unconditional input, which can be suboptimal for iterative generation processes where model uncertainty varies dynamically. We introduce Adaptive Classifier-Free Guidance (A-CFG), a novel method that tailors the unconditional input by leveraging the model's instantaneous predictive confidence. At each step of an iterative (masked) diffusion language model, A-CFG identifies tokens in the currently generated sequence for which the model exhibits low confidence. These tokens are temporarily re-masked to create a dynamic, localized unconditional input. This focuses CFG's corrective influence precisely on areas of ambiguity, leading to more effective guidance. We integrate A-CFG into a state-of-the-art masked diffusion language model and demonstrate its efficacy. Experiments on diverse language generation benchmarks show that A-CFG yields substantial improvements over standard CFG, achieving, for instance, a 3.9 point gain on GPQA. Our work highlights the benefit of dynamically adapting guidance mechanisms to model uncertainty in iterative generation.
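
The two ingredients named in the abstract above, re-masking low-confidence tokens and combining conditional with unconditional predictions, can be sketched as follows. This is a minimal illustration under assumptions, not the paper's code: `MASK_ID`, the re-masking fraction, and the guidance formula `uncond + scale * (cond - uncond)` (one common way to write classifier-free guidance) are all choices made here for the example.

```python
import numpy as np

MASK_ID = -1  # hypothetical id of the mask token


def remask_low_confidence(tokens, confidences, fraction):
    """Replace the least-confident `fraction` of tokens with MASK_ID.

    The result plays the role of the dynamic "unconditional" input.
    """
    tokens = np.asarray(tokens).copy()
    num_to_mask = int(round(fraction * len(tokens)))
    if num_to_mask > 0:
        lowest = np.argsort(confidences, kind="stable")[:num_to_mask]
        tokens[lowest] = MASK_ID
    return tokens


def guided_logits(cond, uncond, scale):
    """Common CFG combination: move from uncond toward (and past) cond."""
    cond = np.asarray(cond, dtype=float)
    uncond = np.asarray(uncond, dtype=float)
    return uncond + scale * (cond - uncond)


# Toy sequence: tokens 6 and 8 have the lowest confidence.
tokens = [5, 6, 7, 8]
confidences = [0.9, 0.2, 0.8, 0.1]
remasked = remask_low_confidence(tokens, confidences, fraction=0.5)
print("re-masked input:", remasked)

# Toy logits over a two-word vocabulary.
print("guided logits:", guided_logits([2.0, 0.0], [1.0, 1.0], scale=2.0))
```

In an actual masked diffusion model, `uncond` would be the model's logits on the re-masked sequence and `cond` its logits on the current sequence; the toy arrays only show the arithmetic.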

Title: Reasoning Is Not All You Need: Examining LLMs for Multi-Turn Mental Health Conversations

Authors: Mohit Chandra, Siddharth Sriraman, Harneet Singh Khanuja, Yiqiao Jin, Munmun De Choudhury
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.20201
Pdf URL: https://arxiv.org/pdf/2505.20201
Copy Paste: [[2505.20201]] Reasoning Is Not All You Need: Examining LLMs for Multi-Turn Mental Health Conversations(https://arxiv.org/abs/2505.20201)
Keywords: language model, llm, agent
Abstract: Limited access to mental healthcare, extended wait times, and increasing capabilities of Large Language Models (LLMs) have led individuals to turn to LLMs for fulfilling their mental health needs. However, the multi-turn mental health conversation capabilities of LLMs remain under-explored. Existing evaluation frameworks typically focus on diagnostic accuracy and win-rates and often overlook alignment with patient-specific goals, values, and personalities required for meaningful conversations. To address this, we introduce MedAgent, a novel framework for synthetically generating realistic, multi-turn mental health sensemaking conversations and use it to create the Mental Health Sensemaking Dialogue (MHSD) dataset, comprising over 2,200 patient-LLM conversations. Additionally, we present MultiSenseEval, a holistic framework to evaluate the multi-turn conversation abilities of LLMs in healthcare settings using human-centric criteria. Our findings reveal that frontier reasoning models yield below-par performance for patient-centric communication and struggle with advanced diagnostic capabilities, with an average score of 31%. Additionally, we observed variation in model performance based on the patient's persona, and a performance drop with increasing turns in the conversation. Our work provides a comprehensive synthetic data generation framework, a dataset and evaluation framework for assessing LLMs in multi-turn mental health conversations.

Title: How to Improve the Robustness of Closed-Source Models on NLI

Authors: Joe Stacey, Lisa Alazraki, Aran Ubhi, Beyza Ermis, Aaron Mueller, Marek Rei
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.20209
Pdf URL: https://arxiv.org/pdf/2505.20209
Copy Paste: [[2505.20209]] How to Improve the Robustness of Closed-Source Models on NLI(https://arxiv.org/abs/2505.20209)
Keywords: language model, llm
Abstract: Closed-source Large Language Models (LLMs) have become increasingly popular, with impressive performance across a wide range of natural language tasks. These models can be fine-tuned to further improve performance, but this often results in the models learning from dataset-specific heuristics that reduce their robustness on out-of-distribution (OOD) data. Existing methods to improve robustness either perform poorly, or are non-applicable to closed-source models because they assume access to model internals, or the ability to change the model's training procedure. In this work, we investigate strategies to improve the robustness of closed-source LLMs through data-centric methods that do not require access to model internals. We find that the optimal strategy depends on the complexity of the OOD data. For highly complex OOD datasets, upsampling more challenging training examples can improve robustness by up to 1.5%. For less complex OOD datasets, replacing a portion of the training set with LLM-generated examples can improve robustness by 3.7%. More broadly, we find that large-scale closed-source autoregressive LLMs are substantially more robust than commonly used encoder models, and are a more appropriate choice of baseline going forward.

Title: FLAME-MoE: A Transparent End-to-End Research Platform for Mixture-of-Experts Language Models

Authors: Hao Kang, Zichun Yu, Chenyan Xiong
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.20225
Pdf URL: https://arxiv.org/pdf/2505.20225
Copy Paste: [[2505.20225]] FLAME-MoE: A Transparent End-to-End Research Platform for Mixture-of-Experts Language Models(https://arxiv.org/abs/2505.20225)
Keywords: language model, llm
Abstract: Recent large language models such as Gemini-1.5, DeepSeek-V3, and Llama-4 increasingly adopt Mixture-of-Experts (MoE) architectures, which offer strong efficiency-performance trade-offs by activating only a fraction of the model per token. Yet academic researchers still lack a fully open, end-to-end MoE platform for investigating scaling, routing, and expert behavior. We release FLAME-MoE, a completely open-source research suite composed of seven decoder-only models, ranging from 38M to 1.7B active parameters, whose architecture--64 experts with top-8 gating and 2 shared experts--closely reflects modern production LLMs. All training data pipelines, scripts, logs, and checkpoints are publicly available to enable reproducible experimentation. Across six evaluation tasks, FLAME-MoE improves average accuracy by up to 3.4 points over dense baselines trained with identical FLOPs. Leveraging full training trace transparency, we present initial analyses showing that (i) experts increasingly specialize on distinct token subsets, (ii) co-activation matrices remain sparse, reflecting diverse expert usage, and (iii) routing behavior stabilizes early in training. All code, training logs, and model checkpoints are available at this https URL.
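
The routing configuration quoted in the abstract above (64 experts, top-8 gating, 2 shared experts) can be made concrete with a small sketch of top-k gating for a single token. The function name, the random router logits, and the choice to apply the softmax over the selected experts only are assumptions for illustration; the abstract does not specify FLAME-MoE's exact gating function.

```python
import numpy as np

NUM_EXPERTS = 64  # routed experts, from the abstract
TOP_K = 8         # experts activated per token, from the abstract
NUM_SHARED = 2    # shared experts, from the abstract


def top_k_gate(router_logits, k):
    """Pick the k highest-scoring experts and normalize their weights.

    Softmax is taken over the selected experts only (an assumption;
    other variants apply the softmax before selection).
    """
    logits = np.asarray(router_logits, dtype=float)
    top = np.argsort(logits)[-k:][::-1]  # indices, best first
    shifted = np.exp(logits[top] - logits[top].max())
    return top, shifted / shifted.sum()


# Hypothetical router output for one token.
rng = np.random.default_rng(0)
router_logits = rng.standard_normal(NUM_EXPERTS)

expert_ids, gate_weights = top_k_gate(router_logits, TOP_K)
print("selected experts:", expert_ids)
print("gate weights:", np.round(gate_weights, 3))
print("experts run per token:", TOP_K + NUM_SHARED, "of", NUM_EXPERTS + NUM_SHARED)
```

Only the selected experts' feed-forward blocks run for this token, which is why active parameters (38M to 1.7B in the abstract) are a fraction of the full model.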

Title: Efficient Speech Translation through Model Compression and Knowledge Distillation

Authors: Yasmin Moslem
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2505.20237
Pdf URL: https://arxiv.org/pdf/2505.20237
Copy Paste: [[2505.20237]] Efficient Speech Translation through Model Compression and Knowledge Distillation(https://arxiv.org/abs/2505.20237)
Keywords: language model
Abstract: Efficient deployment of large audio-language models for speech translation remains challenging due to their significant computational requirements. In this paper, we address this challenge through our system submissions to the "Model Compression" track at the International Conference on Spoken Language Translation (IWSLT 2025). We experiment with a combination of approaches including iterative layer pruning based on layer importance evaluation, low-rank adaptation with 4-bit quantization (QLoRA), and knowledge distillation. In our experiments, we use Qwen2-Audio-7B-Instruct for speech translation into German and Chinese. Our pruned (student) models achieve up to a 50% reduction in both model parameters and storage footprint, while retaining 97-100% of the translation quality of the in-domain (teacher) models.
摘要：大型音频语言模型的有效部署由于其巨大的计算要求，因此语音翻译的有效部署仍然具有挑战性。在本文中，我们通过系统提交的“模型压缩”曲目在国际口语翻译会议（IWSLT 2025）上解决了这一挑战。我们尝试了一种方法组合，包括基于层的重要性评估，低级别适应性（Qlora）（Qlora）和知识蒸馏，包括迭代层修剪。在我们的实验中，我们使用qwen2-audio-7b教学将语音翻译成德语和中文。我们的修剪（学生）模型的模型参数和存储足迹的降低了50％，同时保留了内域（教师）型号的翻译质量的97-100％。
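
Illustration (not from the paper): the abstract mentions iterative layer pruning based on layer importance evaluation. A minimal sketch of that loop is shown below, assuming a caller-supplied importance function that is re-evaluated after every removal. The function names and the toy importance scores are hypothetical; the paper's actual importance measure is not given in the abstract.

```python
def iterative_layer_pruning(layers, importance_fn, target_fraction=0.5):
    """Repeatedly drop the least important layer until the target size is reached.

    importance_fn(kept_layers) must return a dict {layer: score}; it is called
    again after every removal so scores can reflect the already-pruned model.
    """
    kept = list(layers)
    target = max(1, int(len(layers) * target_fraction))
    while len(kept) > target:
        scores = importance_fn(kept)
        least_important = min(kept, key=lambda layer: scores[layer])
        kept.remove(least_important)
    return kept


# Hypothetical, fixed importance scores for an 8-layer toy model.
TOY_SCORES = {0: 0.9, 1: 0.2, 2: 0.8, 3: 0.1, 4: 0.7, 5: 0.3, 6: 0.6, 7: 0.95}


def toy_importance(kept_layers):
    """Stand-in for a real evaluation (e.g. quality drop when a layer is removed)."""
    return {layer: TOY_SCORES[layer] for layer in kept_layers}


kept_layers = iterative_layer_pruning(range(8), toy_importance)
```

With these toy scores, half of the layers are removed, mirroring the scale of the parameter reduction reported in the abstract.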

Title: It's High Time: A Survey of Temporal Information Retrieval and Question Answering

Authors: Bhawna Piryani, Abdelrahman Abdullah, Jamshid Mozafari, Avishek Anand, Adam Jatowt
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2505.20243
Pdf URL: https://arxiv.org/pdf/2505.20243
Copy Paste: [[2505.20243]] It's High Time: A Survey of Temporal Information Retrieval and Question Answering(https://arxiv.org/abs/2505.20243)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Time plays a critical role in how information is generated, retrieved, and interpreted. In this survey, we provide a comprehensive overview of Temporal Information Retrieval and Temporal Question Answering, two research areas aimed at handling and understanding time-sensitive information. As the amount of time-stamped content from sources like news articles, web archives, and knowledge bases increases, systems must address challenges such as detecting temporal intent, normalizing time expressions, ordering events, and reasoning over evolving or ambiguous facts. These challenges are critical across many dynamic and time-sensitive domains, from news and encyclopedias to science, history, and social media. We review both traditional approaches and modern neural methods, including those that use transformer models and Large Language Models (LLMs). We also review recent advances in temporal language modeling, multi-hop reasoning, and retrieval-augmented generation (RAG), alongside benchmark datasets and evaluation strategies that test temporal robustness, recency awareness, and generalization.
摘要：时间在信息的产生，检索和解释方式中起着至关重要的作用。在这项调查中，我们提供了有关时间信息检索和时间问题回答的全面概述，这两个研究领域旨在处理和了解时间敏感信息。随着来自新闻文章，网络档案和知识库等来源的时间stamp量的数量，系统必须应对诸如检测时间意图，正常时间表达式，订购事件的正常值以及对发展或模棱两可的事实的推理等挑战。这些挑战在许多动态和时间敏感的领域中至关重要，从新闻和百科全书到科学，历史和社交媒体。我们回顾了传统方法和现代神经方法，包括使用变压器模型和大型语言模型（LLM）的方法。我们还回顾了时间语言建模，多跳上推理和检索型发电（RAG）的最新进展，以及测试时间鲁棒性，恢复意识和概括的基准数据集和评估策略。

Title: KnowTrace: Bootstrapping Iterative Retrieval-Augmented Generation with Structured Knowledge Tracing

Authors: Rui Li, Quanyu Dai, Zeyu Zhang, Xu Chen, Zhenhua Dong, Ji-Rong Wen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.20245
Pdf URL: https://arxiv.org/pdf/2505.20245
Copy Paste: [[2505.20245]] KnowTrace: Bootstrapping Iterative Retrieval-Augmented Generation with Structured Knowledge Tracing(https://arxiv.org/abs/2505.20245)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Recent advances in retrieval-augmented generation (RAG) furnish large language models (LLMs) with iterative retrievals of relevant information to handle complex multi-hop questions. These methods typically alternate between LLM reasoning and retrieval to accumulate external information into the LLM's context. However, the ever-growing context inherently imposes an increasing burden on the LLM to perceive connections among critical information pieces, with futile reasoning steps further exacerbating this overload issue. In this paper, we present KnowTrace, an elegant RAG framework to (1) mitigate the context overload and (2) bootstrap higher-quality multi-step reasoning. Instead of simply piling the retrieved contents, KnowTrace autonomously traces out desired knowledge triplets to organize a specific knowledge graph relevant to the input question. Such a structured workflow not only empowers the LLM with an intelligible context for inference, but also naturally inspires a reflective mechanism of knowledge backtracing to identify contributive LLM generations as process supervision data for self-bootstrapping. Extensive experiments show that KnowTrace consistently surpasses existing methods across three multi-hop question answering benchmarks, and the bootstrapped version further amplifies the gains.
摘要：最新的检索演奏生成（RAG）的进步提供了大型语言模型（LLMS），并迭代地检索相关信息以处理复杂的多跳问题。这些方法通常在LLM推理和检索之间交替，以将外部信息累积到LLM的上下文中。但是，不断增长的上下文本质上会对LLM施加越来越多的负担，以感知关键信息部分之间的联系，徒劳的推理步骤进一步加剧了这个过载问题。在本文中，我们提出了知识轨迹，这是一个优雅的抹布框架，可（1）减轻上下文过载和（2）引导更高质量的多步推理。知识轨迹不简单地堆积所检索的内容，而是自动搜索所需的知识三胞胎，以组织与输入问题相关的特定知识图。这样的结构化工作流不仅使LLM具有可理解的推论背景，而且自然会激发知识回溯的反思机制，以确定贡献的LLM代作为自启动的过程监督数据。广泛的实验表明，知识轨道始终超过三个多跳的问题回答基准测试的现有方法，而自举版本进一步扩大了增长。

Title: WXImpactBench: A Disruptive Weather Impact Understanding Benchmark for Evaluating Large Language Models

Authors: Yongan Yu, Qingchen Hu, Xianda Du, Jiayin Wang, Fengran Mo, Renee Sieber
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.20249
Pdf URL: https://arxiv.org/pdf/2505.20249
Copy Paste: [[2505.20249]] WXImpactBench: A Disruptive Weather Impact Understanding Benchmark for Evaluating Large Language Models(https://arxiv.org/abs/2505.20249)
Keywords: language model, llm
Abstract: Climate change adaptation requires the understanding of disruptive weather impacts on society, where large language models (LLMs) might be applicable. However, their effectiveness is under-explored due to the difficulty of high-quality corpus collection and the lack of available benchmarks. The climate-related events stored in regional newspapers record how communities adapted and recovered from disasters. However, the processing of the original corpus is non-trivial. In this study, we first develop a disruptive weather impact dataset with a four-stage well-crafted construction pipeline. Then, we propose WXImpactBench, the first benchmark for evaluating the capacity of LLMs on disruptive weather impacts. The benchmark involves two evaluation tasks, multi-label classification and ranking-based question answering. Extensive experiments on evaluating a set of LLMs provide first-hand analysis of the challenges in developing disruptive weather impact understanding and climate change adaptation systems. The constructed dataset and the code for the evaluation framework are available to help society protect against vulnerabilities from disasters.
摘要：气候变化的适应需要了解对社会的破坏性天气影响，在这种情况下，大型语言模型（LLM）可能适用。但是，由于高质量集合的难度和缺乏可用的基准，它们的有效性不足以探索。与气候相关的事件存储在区域报纸上，记录了社区如何适应和从灾难中恢复。但是，原始语料库的处理是非平凡的。在这项研究中，我们首先使用四阶段精心制作的施工管道开发出破坏性的天气影响数据集。然后，我们提出了WXIMPACTBENCH，这是评估LLMS对破坏性天气影响的能力的第一个基准。基准涉及两个评估任务，多标签分类和基于排名的问题答案。评估一组LLM的广泛实验提供了有关发展破坏性天气影响理解和气候变化适应系统的挑战的第一手分析。可以使用构建的数据集和评估框架的代码，以帮助社会防止灾难中的漏洞。

Title: Does quantization affect models' performance on long-context tasks?

Authors: Anmol Mekala, Anirudh Atmakuru, Yixiao Song, Marzena Karpinska, Mohit Iyyer
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.20276
Pdf URL: https://arxiv.org/pdf/2505.20276
Copy Paste: [[2505.20276]] Does quantization affect models' performance on long-context tasks?(https://arxiv.org/abs/2505.20276)
Keywords: language model, gpt, llm, long context
Abstract: Large language models (LLMs) now support context windows exceeding 128K tokens, but this comes with significant memory requirements and high inference latency. Quantization can mitigate these costs, but may degrade performance. In this work, we present the first systematic evaluation of quantized LLMs on tasks with long inputs (>64K tokens) and long-form outputs. Our evaluation spans 9.7K test examples, five quantization methods (FP8, GPTQ-int8, AWQ-int4, GPTQ-int4, BNB-nf4), and five models (Llama-3.1 8B and 70B; Qwen-2.5 7B, 32B, and 72B). We find that, on average, 8-bit quantization preserves accuracy (~0.8% drop), whereas 4-bit methods lead to substantial losses, especially for tasks involving long-context inputs (drops of up to 59%). This degradation tends to worsen when the input is in a language other than English. Crucially, the effects of quantization depend heavily on the quantization method, model, and task. For instance, while Qwen-2.5 72B remains robust under BNB-nf4, Llama-3.1 70B experiences a 32% performance drop on the same task. These findings highlight the importance of a careful, task-specific evaluation before deploying quantized LLMs, particularly in long-context scenarios and with languages other than English.
摘要：现在，大型语言模型（LLMS）支持上下文窗口超过128K令牌，但这带有大量的内存需求和高推理延迟。量化可以减轻这些成本，但可能会降低性能。在这项工作中，我们在具有长输入（> 64K令牌）和长形输出的任务上介绍了对量化LLM的第一个系统评估。我们的评估跨度为9.7K测试示例，五种量化方法（FP8，GPTQ-INT8，AWQ-INT4，GPTQ-INT4，BNB-NF4）和五个模型（Llama-3.1 8B和70B；Qwen-2.5 7B，32B和72B）。我们发现，平均而言，8位量化可以保留准确性（下降〜0.8％），而4位方法会导致巨大损失，尤其是对于涉及长上下文输入的任务（最高为59％）。当输入使用英语以外的其他语言时，这种降解往往会恶化。至关重要的是，量化的效果在很大程度上取决于量化方法，模型和任务。例如，尽管BNB-NF4下的QWEN-2.5 72B保持强大，但Llama-3.1 70B在同一任务上的性能下降了32％。这些发现突出了在部署量化的LLM之前，特别是在长篇文章方案和英语以外的其他语言之前进行仔细的，特定于任务的评估的重要性。
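
Illustration (not from the paper): to give a feel for why 4-bit formats can lose more information than 8-bit ones, the sketch below applies plain symmetric round-to-nearest quantization with a single absmax scale to random values. This is a deliberately simple stand-in and is not FP8, GPTQ, AWQ, or NF4; the function names and the random "weights" are assumptions made for this example.

```python
import random


def quantize_dequantize(values, bits):
    """Symmetric absmax round-to-nearest quantization, then dequantization."""
    qmax = 2 ** (bits - 1) - 1          # 127 for 8 bits, 7 for 4 bits
    scale = max(abs(v) for v in values) / qmax
    if scale == 0.0:
        return list(values)             # all zeros: nothing to quantize
    return [round(v / scale) * scale for v in values]


def mean_abs_error(original, restored):
    """Average absolute difference between two equally long lists."""
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


random.seed(0)
# Hypothetical weight values; real weight distributions differ per model.
weights = [random.gauss(0.0, 1.0) for _ in range(1000)]

err_int8 = mean_abs_error(weights, quantize_dequantize(weights, 8))
err_int4 = mean_abs_error(weights, quantize_dequantize(weights, 4))
```

The 4-bit grid has far fewer levels than the 8-bit grid, so its reconstruction error is larger here. How that error translates into task accuracy depends on the method, model, and task, as the abstract stresses.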

Title: OmniCharacter: Towards Immersive Role-Playing Agents with Seamless Speech-Language Personality Interaction

Authors: Haonan Zhang, Run Luo, Xiong Liu, Yuchuan Wu, Ting-En Lin, Pengpeng Zeng, Qiang Qu, Feiteng Fang, Min Yang, Lianli Gao, Jingkuan Song, Fei Huang, Yongbin Li
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2505.20277
Pdf URL: https://arxiv.org/pdf/2505.20277
Copy Paste: [[2505.20277]] OmniCharacter: Towards Immersive Role-Playing Agents with Seamless Speech-Language Personality Interaction(https://arxiv.org/abs/2505.20277)
Keywords: language model, agent
Abstract: Role-Playing Agents (RPAs), which benefit from large language models, are emerging interactive AI systems that simulate roles or characters with diverse personalities. However, existing methods primarily focus on mimicking dialogues among roles in textual form, neglecting the role's voice traits (e.g., voice style and emotions), which play a crucial part in interaction and tend to make experiences more immersive in realistic scenarios. Towards this goal, we propose OmniCharacter, a first seamless speech-language personality interaction model to achieve immersive RPAs with low latency. Specifically, OmniCharacter enables agents to consistently exhibit role-specific personality traits and vocal traits throughout the interaction, enabling a mixture of speech and language responses. To align the model with speech-language scenarios, we construct a dataset named OmniCharacter-10K, which involves more distinctive characters (20), richly contextualized multi-round dialogue (10K), and dynamic speech response (135K). Experimental results showcase that our method yields better responses in terms of both content and style compared to existing RPAs and mainstream speech-language models, with a response latency as low as 289ms. Code and dataset are available at this https URL.
摘要：角色扮演代理（RPA）受益于大型语言模型，是一种新兴的互动AI系统，可模拟具有不同个性的角色或角色。但是，现有方法主要集中于模仿文本形式角色之间的对话，忽略了角色的语音特征（例如语音风格和情感）在互动中起着至关重要的影响，这在现实情况下倾向于更加沉浸式体验。为了实现这一目标，我们提出了Omnicharacter，这是第一个无缝的语音语言人格互动模型，以实现低潜伏期的沉浸式RPA。具体而言，Omnicharacter使代理商能够在整个互动过程中始终如一地表现出特定于角色的人格特征和声音特征，从而使语音和语言响应的混合在一起。为了使模型与语音语言场景保持一致，我们构建了一个名为OmnichActer-10k的数据集，该数据集涉及更独特的字符（20），丰富的上下文化的多轮对话（10K）和动态语音响应（135k）。实验结果表明，与现有的RPA和主流语音语言模型相比，我们的方法在内容和样式方面产生更好的响应，其响应延迟低至289ms。代码和数据集可在此HTTPS URL上找到。

Title: One-shot Entropy Minimization

Authors: Zitian Gao, Lynx Chen, Joey Zhou, Bryan Dai
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.20282
Pdf URL: https://arxiv.org/pdf/2505.20282
Copy Paste: [[2505.20282]] One-shot Entropy Minimization(https://arxiv.org/abs/2505.20282)
Keywords: language model, prompt
Abstract: We trained 13,440 large language models and found that entropy minimization requires only a single unlabeled example and 10 optimization steps to achieve performance improvements comparable to or even greater than those obtained using thousands of examples and carefully designed rewards in rule-based reinforcement learning. This striking result may prompt a rethinking of post-training paradigms for large language models. Our code is available at this https URL.
摘要：我们培训了13,440个大语言模型，发现熵最小化仅需要一个未标记的数据和10个步骤优化，以实现与使用数千个数据和基于规则的增强学习中精心设计的奖励相当甚至更大的性能改进。这个惊人的结果可能会促使大型语言模型重新思考训练后范例。我们的代码在此HTTPS URL上可用。
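
Illustration (not from the paper): entropy minimization pushes a model's output distribution to become more peaked. The toy sketch below runs 10 gradient steps directly on a single vector of logits, which is a strong simplification: the paper optimizes model parameters, and the learning rate, starting logits, and function names here are assumptions chosen for this example.

```python
import math


def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_gradient(logits):
    """Gradient of H(softmax(z)) w.r.t. z: dH/dz_i = -p_i * (log p_i + H)."""
    probs = softmax(logits)
    h = entropy(probs)
    return [-p * (math.log(p) + h) for p in probs]


def minimize_entropy(logits, steps=10, lr=0.5):
    """Plain gradient descent on the entropy of the softmax distribution."""
    z = list(logits)
    for _ in range(steps):
        grad = entropy_gradient(z)
        z = [zi - lr * gi for zi, gi in zip(z, grad)]
    return z


start_logits = [2.0, 1.0, 0.5, 0.0]   # hypothetical next-token logits
end_logits = minimize_entropy(start_logits)
h_before = entropy(softmax(start_logits))
h_after = entropy(softmax(end_logits))
```

After the updates the distribution is more concentrated on the already most likely token, so `h_after` is lower than `h_before`. No labels or rewards are involved, which is the property the abstract highlights.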

Title: MASKSEARCH: A Universal Pre-Training Framework to Enhance Agentic Search Capability

Authors: Weiqi Wu, Xin Guan, Shen Huang, Yong Jiang, Pengjun Xie, Fei Huang, Jiuxin Cao, Hai Zhao, Jingren Zhou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.20285
Pdf URL: https://arxiv.org/pdf/2505.20285
Copy Paste: [[2505.20285]] MASKSEARCH: A Universal Pre-Training Framework to Enhance Agentic Search Capability(https://arxiv.org/abs/2505.20285)
Keywords: language model, llm, agent
Abstract: Retrieval-Augmented Language Models (RALMs) represent a classic paradigm where models enhance generative capabilities using external knowledge retrieved via a specialized module. Recent advancements in Agent techniques enable Large Language Models (LLMs) to autonomously utilize tools for retrieval, planning, and reasoning. While existing training-based methods show promise, their agentic abilities are limited by inherent characteristics of the task-specific data used during training. To further enhance the universal search capability of agents, we propose a novel pre-training framework, MASKSEARCH. In the pre-training stage, we introduce the Retrieval Augmented Mask Prediction (RAMP) task, where the model learns to leverage search tools to fill masked spans on a large amount of pre-training data, thus acquiring universal retrieval and reasoning capabilities for LLMs. After that, the model is trained on downstream tasks to achieve further improvement. We apply both Supervised Fine-tuning (SFT) and Reinforcement Learning (RL) for training. For SFT, we combine agent-based and distillation-based methods to generate training data, starting with a multi-agent system consisting of a planner, a rewriter, and an observer, followed by a self-evolving teacher model. For RL, we employ DAPO as the training framework and adopt a hybrid reward system consisting of answer rewards and format rewards. Additionally, we introduce a curriculum learning approach that allows the model to learn progressively from easier to more challenging instances based on the number of masked spans. We evaluate the effectiveness of our framework in the scenario of open-domain multi-hop question answering. Through extensive experiments, we demonstrate that MASKSEARCH significantly enhances the performance of LLM-based search agents on both in-domain and out-of-domain downstream tasks.
摘要：检索声明的语言模型（RALMS）代表了经典的范式，其中模型使用通过专用模块检索的外部知识增强生成能力。代理技术的最新进步使大型语言模型（LLMS）自主利用工具来检索，计划和推理。尽管现有的基于培训的方法显示出希望，但它们的代理能力受到培训过程中使用的任务特定数据的固有特征的限制。为了进一步增强代理的通用搜索能力，我们提出了一个新颖的训练框架MaskSearch。在训练前阶段，我们介绍了检索增强蒙版预测（RAMP）任务，该任务学会在其中学习利用搜索工具来填充大量预训练数据，从而获得LLMS的通用检索和推理能力。之后，对模型进行了下游任务的培训，以实现进一步的改进。我们将监督的微调（SFT）和加强学习（RL）应用于培训。对于SFT，我们将基于代理和基于蒸馏的方法结合起来，以生成培训数据，从由计划者，重写者，观察者组成的多代理系统开始，然后是自我发展的教师模型。对于RL，我们使用DAPO作为培训框架，并采用由答案奖励和格式奖励组成的混合奖励系统。此外，我们引入了一种课程学习方法，该方法使模型可以根据蒙版跨度的数量逐渐从易于挑战到更具挑战性的实例学习。我们在开放域多跳问题的情况下评估框架的有效性。通过广泛的实验，我们证明了MaskSearch可以显着提高基于LLM的搜索剂对内域和下游任务的性能。
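
Illustration (not from the paper): the abstract describes a curriculum that orders training instances from easier to harder by the number of masked spans. A minimal sketch of such an ordering follows; the `[mask]` token, the function names, and the example sentences are all hypothetical.

```python
MASK_TOKEN = "[mask]"  # hypothetical placeholder for a masked span


def count_masked_spans(text):
    """Number of masked spans in one training instance."""
    return text.count(MASK_TOKEN)


def curriculum_order(instances):
    """Sort instances from fewest to most masked spans (stable for ties)."""
    return sorted(instances, key=count_masked_spans)


# Hypothetical instances with 3, 1, and 2 masked spans respectively.
examples = [
    "[mask] was founded in [mask] by [mask].",
    "The capital of France is [mask].",
    "[mask] wrote the novel in [mask].",
]
ordered = curriculum_order(examples)
span_counts = [count_masked_spans(text) for text in ordered]
```

Instances with a single masked span come first, and those needing more retrieval and reasoning steps come later.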

Title: Enhancing the Comprehensibility of Text Explanations via Unsupervised Concept Discovery

Authors: Yifan Sun, Danding Wang, Qiang Sheng, Juan Cao, Jintao Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.20293
Pdf URL: https://arxiv.org/pdf/2505.20293
Copy Paste: [[2505.20293]] Enhancing the Comprehensibility of Text Explanations via Unsupervised Concept Discovery(https://arxiv.org/abs/2505.20293)
Keywords: language model
Abstract: Concept-based explainable approaches have emerged as a promising method in explainable AI because they can interpret models in a way that aligns with human reasoning. However, their adaption in the text domain remains limited. Most existing methods rely on predefined concept annotations and cannot discover unseen concepts, while other methods that extract concepts without supervision often produce explanations that are not intuitively comprehensible to humans, potentially diminishing user trust. These methods fall short of discovering comprehensible concepts automatically. To address this issue, we propose ECO-Concept, an intrinsically interpretable framework to discover comprehensible concepts with no concept annotations. ECO-Concept first utilizes an object-centric architecture to extract semantic concepts automatically. Then the comprehensibility of the extracted concepts is evaluated by large language models. Finally, the evaluation result guides the subsequent model fine-tuning to obtain more understandable explanations. Experiments show that our method achieves superior performance across diverse tasks. Further concept evaluations validate that the concepts learned by ECO-Concept surpass current counterparts in comprehensibility.
摘要：基于概念的可解释方法已成为可解释的AI中有前途的方法，因为它们可以以与人类推理保持一致的方式来解释模型。但是，它们在文本领域中的适应性仍然有限。大多数现有方法依赖于预定义的概念注释，无法发现看不见的概念，而没有监督的其他方法提取概念的其他方法通常会产生不直观地理解人类的解释，从而潜在地减少用户信任。这些方法无法自动发现可理解的概念。为了解决这个问题，我们提出ECO-Concept，这是一个可以自然解释的框架，可以发现无概念注释的可理解概念。生态概念首先利用以对象为中心的体系结构自动提取语义概念。然后，通过大语言模型评估提取概念的可理解性。最后，评估结果指导后续模型微调以获得更易于理解的解释。实验表明，我们的方法在各种任务中实现了卓越的性能。进一步的概念评估可以验证生态概念学到的概念在可理解方面超过了当前的同行。

Title: Self-reflective Uncertainties: Do LLMs Know Their Internal Answer Distribution?

Authors: Michael Kirchhof, Luca Füger, Adam Goliński, Eeshan Gunesh Dhekane, Arno Blaas, Sinead Williamson
Subjects: cs.CL, cs.AI, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2505.20295
Pdf URL: https://arxiv.org/pdf/2505.20295
Copy Paste: [[2505.20295]] Self-reflective Uncertainties: Do LLMs Know Their Internal Answer Distribution?(https://arxiv.org/abs/2505.20295)
Keywords: language model, llm
Abstract: To reveal when a large language model (LLM) is uncertain about a response, uncertainty quantification commonly produces percentage numbers along with the output. But is this all we can do? We argue that in the output space of LLMs, the space of strings, there exist strings expressive enough to summarize the distribution over output strings the LLM deems possible. We lay a foundation for this new avenue of uncertainty explication and present SelfReflect, a theoretically-motivated metric to assess how faithfully a string summarizes an LLM's internal answer distribution. We show that SelfReflect is able to discriminate even subtle differences of candidate summary strings and that it aligns with human judgement, outperforming alternative metrics such as LLM judges and embedding comparisons. With SelfReflect, we investigate a number of self-summarization methods and find that even state-of-the-art reasoning models struggle to explicate their internal uncertainty. But we find that faithful summarizations can be generated by sampling and summarizing. Our metric enables future work towards this universal form of LLM uncertainties.
摘要：为了揭示大型语言模型（LLM）何时不确定响应，不确定性量化通常会产生百分比和输出。但这是我们所能做的吗？我们认为，在LLM的输出空间中，字符串的空间存在足以表达的字符串，以总结输出字符串的分布，而LLM认为可能。我们为这一新的不确定性阐明途径奠定了基础，并呈现自我反射，这是一种理论上动机的指标，以评估字符串如何概述LLM的内部答案分布。我们表明，自我反射能够区分候选摘要字符串的细微差异，并且与人类的判断相符，表现优于LLM法官等替代指标和嵌入比较。通过自我反射，我们调查了许多自及化方法，发现即使是最先进的推理模型也很难阐明其内部不确定性。但是我们发现，可以通过抽样和总结来产生忠实的总结。我们的指标使未来的LLM不确定性形式的未来工作。

Title: Reasoning LLMs are Wandering Solution Explorers

Authors: Jiahao Lu, Ziwei Xu, Mohan Kankanhalli
Subjects: cs.CL, cs.AI, cs.LG, cs.MM
Abstract URL: https://arxiv.org/abs/2505.20296
Pdf URL: https://arxiv.org/pdf/2505.20296
Copy Paste: [[2505.20296]] Reasoning LLMs are Wandering Solution Explorers(https://arxiv.org/abs/2505.20296)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Large Language Models (LLMs) have demonstrated impressive reasoning abilities through test-time computation (TTC) techniques such as chain-of-thought prompting and tree-based reasoning. However, we argue that current reasoning LLMs (RLLMs) lack the ability to systematically explore the solution space. This paper formalizes what constitutes systematic problem solving and identifies common failure modes that reveal reasoning LLMs to be wanderers rather than systematic explorers. Through qualitative and quantitative analysis across multiple state-of-the-art LLMs, we uncover persistent issues: invalid reasoning steps, redundant explorations, hallucinated or unfaithful conclusions, and so on. Our findings suggest that current models' performance can appear to be competent on simple tasks yet degrade sharply as complexity increases. Based on the findings, we advocate for new metrics and tools that evaluate not just final outputs but the structure of the reasoning process itself.
摘要：大型语言模型（LLMS）通过测试时间计算（TTC）技术（例如，经过思考的提示链和基于树的推理）表现出了令人印象深刻的推理能力。但是，我们认为当前的推理LLMS（RLLM）缺乏系统地探索解决方案空间的能力。本文正式构成了系统的问题解决，并确定了揭示推理LLM为流浪者而不是系统的探险家的常见故障模式。通过对多个最先进的LLM的定性和定量分析，我们发现了持续存在的问题：无效的推理步骤，冗余探索，幻觉或不忠的结论等等。我们的发现表明，当前模型的性能似乎可以在简单的任务上胜任，但随着复杂性的提高而急剧降解。根据调查结果，我们主张不仅评估最终输出，而且评估推理过程本身的结构。

Title: MangaVQA and MangaLMM: A Benchmark and Specialized Model for Multimodal Manga Understanding

Authors: Jeonghun Baek, Kazuki Egashira, Shota Onohara, Atsuyuki Miyai, Yuki Imajuku, Hikaru Ikuta, Kiyoharu Aizawa
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2505.20298
Pdf URL: https://arxiv.org/pdf/2505.20298
Copy Paste: [[2505.20298]] MangaVQA and MangaLMM: A Benchmark and Specialized Model for Multimodal Manga Understanding(https://arxiv.org/abs/2505.20298)
Keywords: gpt
Abstract: Manga, or Japanese comics, is a richly multimodal narrative form that blends images and text in complex ways. Teaching large multimodal models (LMMs) to understand such narratives at a human-like level could help manga creators reflect on and refine their stories. To this end, we introduce two benchmarks for multimodal manga understanding: MangaOCR, which targets in-page text recognition, and MangaVQA, a novel benchmark designed to evaluate contextual understanding through visual question answering. MangaVQA consists of 526 high-quality, manually constructed question-answer pairs, enabling reliable evaluation across diverse narrative and visual scenarios. Building on these benchmarks, we develop MangaLMM, a manga-specialized model finetuned from the open-source LMM Qwen2.5-VL to jointly handle both tasks. Through extensive experiments, including comparisons with proprietary models such as GPT-4o and Gemini 2.5, we assess how well LMMs understand manga. Our benchmark and model provide a comprehensive foundation for evaluating and advancing LMMs in the richly narrative domain of manga.
摘要：漫画或日本漫画是一种丰富的多模式叙事形式，以复杂的方式将图像和文本融合在一起。教导大型多模型模型（LMM）以在类似人类的水平上理解这些叙事可以帮助漫画创建者反思并完善他们的故事。为此，我们介绍了两个基准，以实现多模式漫画的理解：针对页面内部文本识别的Mangaocr和Mangavqa，这是一种新颖的基准测试，旨在通过视觉问题答案评估上下文理解。 Mangavqa由526个高质量的，手动构建的问题解答对组成，可在各种叙事和视觉场景中进行可靠的评估。在这些基准测试的基础上，我们开发了Mangalmm，这是一种从开源LMM QWEN2.5-VL进行的漫画规范化模型，以共同处理这两个任务。通过广泛的实验，包括与GPT-4O和GEMINI 2.5等专有模型的比较，我们评估了LMM的理解漫画。我们的基准和模型为评估和推进漫画叙事领域中的LMM提供了全面的基础。