2025-06-17

Title: Focusing on Students, not Machines: Grounded Question Generation and Automated Answer Grading

Authors: Gérôme Meyer, Philip Breuer
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.12066
Pdf URL: https://arxiv.org/pdf/2506.12066
Copy Paste: [[2506.12066]] Focusing on Students, not Machines: Grounded Question Generation and Automated Answer Grading(https://arxiv.org/abs/2506.12066)
Keywords: language model, llm, retrieval augmented generation
Abstract: Digital technologies are increasingly used in education to reduce the workload of teachers and students. However, creating open-ended study or examination questions and grading their answers is still a tedious task. This thesis presents the foundation for a system that generates questions grounded in class materials and automatically grades student answers. It introduces a sophisticated method for chunking documents with a visual layout, specifically targeting PDF documents. This method enhances the accuracy of downstream tasks, including Retrieval Augmented Generation (RAG). Our thesis demonstrates that high-quality questions and reference answers can be generated from study material. Further, it introduces a new benchmark for automated grading of short answers to facilitate comparison of automated grading systems. An evaluation of various grading systems is conducted and indicates that Large Language Models (LLMs) can generalise to the task of automated grading of short answers from their pre-training tasks. As with other tasks, increasing the parameter size of the LLMs leads to greater performance. Currently, available systems still need human oversight, especially in examination scenarios.

Title: ChatbotManip: A Dataset to Facilitate Evaluation and Oversight of Manipulative Chatbot Behaviour

Authors: Jack Contro, Simrat Deol, Yulan He, Martim Brandão
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.12090
Pdf URL: https://arxiv.org/pdf/2506.12090
Copy Paste: [[2506.12090]] ChatbotManip: A Dataset to Facilitate Evaluation and Oversight of Manipulative Chatbot Behaviour(https://arxiv.org/abs/2506.12090)
Keywords: language model, llm, prompt, chat
Abstract: This paper introduces ChatbotManip, a novel dataset for studying manipulation in chatbots. It contains simulated conversations between a chatbot and a (simulated) user, where the chatbot is explicitly asked to showcase manipulation tactics, persuade the user towards some goal, or simply be helpful. We consider a diverse set of chatbot manipulation contexts, from consumer and personal advice to citizen advice and controversial proposition argumentation. Each conversation is annotated by human annotators for both general manipulation and specific manipulation tactics. Our research reveals three key findings. First, Large Language Models (LLMs) can be manipulative when explicitly instructed, with annotators identifying manipulation in approximately 84% of such conversations. Second, even when only instructed to be "persuasive" without explicit manipulation prompts, LLMs frequently default to controversial manipulative strategies, particularly gaslighting and fear enhancement. Third, small fine-tuned open source models, such as BERT+BiLSTM, perform comparably to zero-shot classification with larger models like Gemini 2.5 Pro in detecting manipulation, but are not yet reliable for real-world oversight. Our work provides important insights for AI safety research and highlights the need to address manipulation risks as LLMs are increasingly deployed in consumer-facing applications.

Title: Continuously Updating Digital Twins using Large Language Models

Authors: Harry Amad, Nicolás Astorga, Mihaela van der Schaar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.12091
Pdf URL: https://arxiv.org/pdf/2506.12091
Copy Paste: [[2506.12091]] Continuously Updating Digital Twins using Large Language Models(https://arxiv.org/abs/2506.12091)
Keywords: language model
Abstract: Digital twins are models of real-world systems that can simulate their dynamics in response to potential actions. In complex settings, the state and action variables, and available data and knowledge relevant to a system can constantly change, requiring digital twins to continuously update with these changes to remain relevant. Current approaches struggle in this regard, as they require fixed, well-defined modelling environments, and they cannot adapt to novel variables without re-designs, or incorporate new information without re-training. To address this, we frame digital twinning as an in-context learning problem using large language models, enabling seamless updates to the twin at inference time. We develop CALM-DT, a Context-Adaptive Language Model-based Digital Twin that can accurately simulate across diverse state-action spaces using in-context learning alone by utilising fine-tuned encoders for sample retrieval. We empirically demonstrate CALM-DT's competitive performance with existing digital twin approaches, and its unique ability to adapt to changes in its modelling environment without parameter updates.

Title: UCD: Unlearning in LLMs via Contrastive Decoding

Authors: Vinith M. Suriyakumar, Ayush Sekhari, Ashia Wilson
Subjects: cs.CL, cs.CR, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2506.12097
Pdf URL: https://arxiv.org/pdf/2506.12097
Copy Paste: [[2506.12097]] UCD: Unlearning in LLMs via Contrastive Decoding(https://arxiv.org/abs/2506.12097)
Keywords: language model, llm
Abstract: Machine unlearning aims to remove specific information, e.g. sensitive or undesirable content, from large language models (LLMs) while preserving overall performance. We propose an inference-time unlearning algorithm that uses contrastive decoding, leveraging two auxiliary smaller models, one trained without the forget set and one trained with it, to guide the outputs of the original model using their difference during inference. Our strategy substantially improves the tradeoff between unlearning effectiveness and model utility. We evaluate our approach on two unlearning benchmarks, TOFU and MUSE. Results show notable gains in both forget quality and retained performance in comparison to prior approaches, suggesting that incorporating contrastive decoding can offer an efficient, practical avenue for unlearning concepts in large-scale models.
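
The UCD abstract describes steering the original model with the difference between two smaller auxiliary models at inference time. The sketch below is only one plausible reading of that idea, not the authors' algorithm: the additive logit update, the `alpha` scale, and all function and variable names are assumptions made for illustration.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_unlearning_logits(base_logits, retain_aux_logits,
                                  forget_aux_logits, alpha=1.0):
    """Shift the original model's logits by the auxiliary-model difference.

    base_logits:       next-token logits from the original (large) model
    retain_aux_logits: logits from a small model trained WITHOUT the forget set
    forget_aux_logits: logits from a small model trained WITH the forget set
    alpha:             hypothetical strength of the correction (assumption)
    """
    if not (len(base_logits) == len(retain_aux_logits) == len(forget_aux_logits)):
        raise ValueError("all logit vectors must share one vocabulary size")
    return [
        b + alpha * (r - f)
        for b, r, f in zip(base_logits, retain_aux_logits, forget_aux_logits)
    ]


# Toy three-token vocabulary: token 0 is favoured only by the model that saw
# the forget set, so the correction pushes its logit down.
base = [2.0, 1.0, 0.5]
retain_aux = [0.0, 1.0, 0.5]
forget_aux = [2.0, 1.0, 0.5]

adjusted = contrastive_unlearning_logits(base, retain_aux, forget_aux)
probs_before = softmax(base)
probs_after = softmax(adjusted)
```

In this toy setup the two auxiliary models disagree only on token 0, so only that token's probability is reduced directly; the remaining tokens keep their relative ordering.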

Title: Personalized LLM Decoding via Contrasting Personal Preference

Authors: Hyungjune Bu, Chanjoo Jung, Minjae Kang, Jaehyung Kim
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.12109
Pdf URL: https://arxiv.org/pdf/2506.12109
Copy Paste: [[2506.12109]] Personalized LLM Decoding via Contrasting Personal Preference(https://arxiv.org/abs/2506.12109)
Keywords: language model, llm, prompt
Abstract: As large language models (LLMs) are progressively deployed in various real-world applications, personalization of LLMs has become increasingly important. While various approaches to LLM personalization such as prompt-based and training-based methods have been actively explored, the development of effective decoding-time algorithms remains largely overlooked, despite their demonstrated potential. In this paper, we propose CoPe (Contrasting Personal Preference), a novel decoding-time approach applied after performing parameter-efficient fine-tuning (PEFT) on user-specific data. Our core idea is to leverage reward-guided decoding specifically for personalization by maximizing each user's implicit reward signal. We evaluate CoPe across five open-ended personalized text generation tasks. Our empirical results demonstrate that CoPe achieves strong performance, improving personalization by an average of 10.57% in ROUGE-L, without relying on external reward models or additional training procedures.

Title: Eliciting Reasoning in Language Models with Cognitive Tools

Authors: Brown Ebouky, Andrea Bartezzaghi, Mattia Rigotti
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.12115
Pdf URL: https://arxiv.org/pdf/2506.12115
Copy Paste: [[2506.12115]] Eliciting Reasoning in Language Models with Cognitive Tools(https://arxiv.org/abs/2506.12115)
Keywords: language model, gpt, llm, agent
Abstract: The recent advent of reasoning models like OpenAI's o1 was met with excited speculation by the AI community about the mechanisms underlying these capabilities in closed models, followed by a rush of replication efforts, particularly from the open source community. These speculations were largely settled by the demonstration from DeepSeek-R1 that chains-of-thought and reinforcement learning (RL) can effectively replicate reasoning on top of base LLMs. However, it remains valuable to explore alternative methods for theoretically eliciting reasoning that could help elucidate the underlying mechanisms, as well as providing additional methods that may offer complementary benefits. Here, we build on the long-standing literature in cognitive psychology and cognitive architectures, which postulates that reasoning arises from the orchestrated, sequential execution of a set of modular, predetermined cognitive operations. Crucially, we implement this key idea within a modern agentic tool-calling framework. In particular, we endow an LLM with a small set of "cognitive tools" encapsulating specific reasoning operations, each executed by the LLM itself. Surprisingly, this simple strategy results in considerable gains in performance on standard mathematical reasoning benchmarks compared to base LLMs, for both closed and open-weight models. For instance, providing our "cognitive tools" to GPT-4.1 increases its pass@1 performance on AIME2024 from 26.7% to 43.3%, bringing it very close to the performance of o1-preview. In addition to its practical implications, this demonstration contributes to the debate regarding the role of post-training methods in eliciting reasoning in LLMs versus the role of inherent capabilities acquired during pre-training, and whether post-training merely uncovers these latent abilities.

Title: Can Mixture-of-Experts Surpass Dense LLMs Under Strictly Equal Resources?

Authors: Houyi Li, Ka Man Lo, Ziqi Wang, Zili Wang, Wenzhen Zheng, Shuigeng Zhou, Xiangyu Zhang, Daxin Jiang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.12119
Pdf URL: https://arxiv.org/pdf/2506.12119
Copy Paste: [[2506.12119]] Can Mixture-of-Experts Surpass Dense LLMs Under Strictly Equal Resources?(https://arxiv.org/abs/2506.12119)
Keywords: language model, llm
Abstract: Mixture-of-Experts (MoE) language models dramatically expand model capacity and achieve remarkable performance without increasing per-token compute. However, can MoEs surpass dense architectures under strictly equal resource constraints - that is, when the total parameter count, training compute, and data budget are identical? This question remains under-explored despite its significant practical value and potential. In this paper, we propose a novel perspective and methodological framework to study this question thoroughly. First, we comprehensively investigate the architecture of MoEs and achieve an optimal model design that maximizes the performance. Based on this, we subsequently find that an MoE model with activation rate in an optimal region is able to outperform its dense counterpart under the same total parameter, training compute and data resource. More importantly, this optimal region remains consistent across different model sizes. Although additional amount of data turns out to be a trade-off for the enhanced performance, we show that this can be resolved via reusing data. We validate our findings through extensive experiments, training nearly 200 language models at 2B scale and over 50 at 7B scale, cumulatively processing 50 trillion tokens. All models will be released publicly.

Title: Hatevolution: What Static Benchmarks Don't Tell Us

Authors: Chiara Di Bonaventura, Barbara McGillivray, Yulan He, Albert Meroño-Peñuela
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.12148
Pdf URL: https://arxiv.org/pdf/2506.12148
Copy Paste: [[2506.12148]] Hatevolution: What Static Benchmarks Don't Tell Us(https://arxiv.org/abs/2506.12148)
Keywords: language model
Abstract: Language changes over time, including in the hate speech domain, which evolves quickly following social dynamics and cultural shifts. While NLP research has investigated the impact of language evolution on model training and has proposed several solutions for it, its impact on model benchmarking remains under-explored. Yet, hate speech benchmarks play a crucial role to ensure model safety. In this paper, we empirically evaluate the robustness of 20 language models across two evolving hate speech experiments, and we show the temporal misalignment between static and time-sensitive evaluations. Our findings call for time-sensitive linguistic benchmarks in order to correctly and reliably evaluate language models in the hate speech domain.

Title: Maximally-Informative Retrieval for State Space Model Generation

Authors: Evan Becker, Benjamin Bowman, Matthew Trager, Tian Yu Liu, Luca Zancato, Wei Xia, Stefano Soatto
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.12149
Pdf URL: https://arxiv.org/pdf/2506.12149
Copy Paste: [[2506.12149]] Maximally-Informative Retrieval for State Space Model Generation(https://arxiv.org/abs/2506.12149)
Keywords: llm, retrieval-augmented generation
Abstract: Given a query and dataset, the optimal way of answering the query is to make use of all the information available. Modern LLMs exhibit impressive ability to memorize training data, but data not deemed important during training is forgotten, and information outside that training set cannot be made use of. Processing an entire dataset at inference time is infeasible due to the bounded nature of model resources (e.g. context size in transformers or states in state space models), meaning we must resort to external memory. This constraint naturally leads to the following problem: How can we decide, based on the present query and model, what among a virtually unbounded set of known data matters for inference? To minimize model uncertainty for a particular query at test-time, we introduce Retrieval In-Context Optimization (RICO), a retrieval method that uses gradients from the LLM itself to learn the optimal mixture of documents for answer generation. Unlike traditional retrieval-augmented generation (RAG), which relies on external heuristics for document retrieval, our approach leverages direct feedback from the model. Theoretically, we show that standard top-k retrieval with model gradients can approximate our optimization procedure, and provide connections to the leave-one-out loss. We demonstrate empirically that by minimizing an unsupervised loss objective in the form of question perplexity, we can achieve comparable retriever metric performance to BM25 with no finetuning. Furthermore, when evaluated on quality of the final prediction, our method often outperforms fine-tuned dense retrievers such as E5.

Title: A Rigorous Evaluation of LLM Data Generation Strategies for Low-Resource Languages

Authors: Tatiana Ankinina, Jan Cegin, Jakub Simko, Simon Ostermann
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.12158
Pdf URL: https://arxiv.org/pdf/2506.12158
Copy Paste: [[2506.12158]] A Rigorous Evaluation of LLM Data Generation Strategies for Low-Resource Languages(https://arxiv.org/abs/2506.12158)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) are increasingly used to generate synthetic textual data for training smaller specialized models. However, a comparison of various generation strategies for low-resource language settings is lacking. While various prompting strategies have been proposed, such as demonstrations, label-based summaries, and self-revision, their comparative effectiveness remains unclear, especially for low-resource languages. In this paper, we systematically evaluate the performance of these generation strategies and their combinations across 11 typologically diverse languages, including several extremely low-resource ones. Using three NLP tasks and four open-source LLMs, we assess downstream model performance on generated versus gold-standard data. Our results show that strategic combinations of generation methods, particularly target-language demonstrations with LLM-based revisions, yield strong performance, narrowing the gap with real data to as little as 5% in some settings. We also find that smart prompting techniques can reduce the advantage of larger LLMs, highlighting efficient generation strategies for synthetic data generation in low-resource scenarios with smaller models.

Title: Instruction Tuning and CoT Prompting for Contextual Medical QA with LLMs

Authors: Chenqian Le, Ziheng Gong, Chihang Wang, Haowei Ni, Panfeng Li, Xupeng Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.12182
Pdf URL: https://arxiv.org/pdf/2506.12182
Copy Paste: [[2506.12182]] Instruction Tuning and CoT Prompting for Contextual Medical QA with LLMs(https://arxiv.org/abs/2506.12182)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Large language models (LLMs) have shown great potential in medical question answering (MedQA), yet adapting them to biomedical reasoning remains challenging due to domain-specific complexity and limited supervision. In this work, we study how prompt design and lightweight fine-tuning affect the performance of open-source LLMs on PubMedQA, a benchmark for multiple-choice biomedical questions. We focus on two widely used prompting strategies - standard instruction prompts and Chain-of-Thought (CoT) prompts - and apply QLoRA for parameter-efficient instruction tuning. Across multiple model families and sizes, our experiments show that CoT prompting alone can improve reasoning in zero-shot settings, while instruction tuning significantly boosts accuracy. However, fine-tuning on CoT prompts does not universally enhance performance and may even degrade it for certain larger models. These findings suggest that reasoning-aware prompts are useful, but their benefits are model- and scale-dependent. Our study offers practical insights into combining prompt engineering with efficient finetuning for medical QA applications.

Title: Supernova Event Dataset: Interpreting Large Language Model's Personality through Critical Event Analysis

Authors: Pranav Agarwal, Ioana Ciucă
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.12189
Pdf URL: https://arxiv.org/pdf/2506.12189
Copy Paste: [[2506.12189]] Supernova Event Dataset: Interpreting Large Language Model's Personality through Critical Event Analysis(https://arxiv.org/abs/2506.12189)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) are increasingly integrated into everyday applications. As their influence grows, understanding their decision making and underlying personality becomes essential. In this work, we interpret model personality using our proposed Supernova Event Dataset, a novel dataset with diverse articles spanning biographies, historical events, news, and scientific discoveries. We use this dataset to benchmark LLMs on extracting and ranking key events from text, a subjective and complex challenge that requires reasoning over long-range context and modeling causal chains. We evaluate small models like Phi-4, Orca 2, and Qwen 2.5, and large, stronger models such as Claude 3.7, Gemini 2.5, and OpenAI o3, and propose a framework where another LLM acts as a judge to infer each model's personality based on its selection and classification of events. Our analysis shows distinct personality traits: for instance, Orca 2 demonstrates emotional reasoning focusing on interpersonal dynamics, while Qwen 2.5 displays a more strategic, analytical style. When analyzing scientific discovery events, Claude Sonnet 3.7 emphasizes conceptual framing, Gemini 2.5 Pro prioritizes empirical validation, and o3 favors step-by-step causal reasoning. This analysis improves model interpretability, making them user-friendly for a wide range of diverse applications.

Title: Infini-gram mini: Exact n-gram Search at the Internet Scale with FM-Index

Authors: Hao Xu, Jiacheng Liu, Yejin Choi, Noah A. Smith, Hannaneh Hajishirzi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.12229
Pdf URL: https://arxiv.org/pdf/2506.12229
Copy Paste: [[2506.12229]] Infini-gram mini: Exact n-gram Search at the Internet Scale with FM-Index(https://arxiv.org/abs/2506.12229)
Keywords: language model
Abstract: Language models are trained mainly on massive text data from the Internet, and it becomes increasingly important to understand this data source. Exact-match search engines enable searching in large text corpora -- counting string appearances and retrieving the enclosing documents -- yet the high storage overhead hinders their application on Internet-scale data. We present Infini-gram mini, an efficient and scalable system that can make petabyte-level text corpora searchable. Based on the FM-index data structure (Ferragina and Manzini, 2000), which simultaneously indexes and compresses text, our system creates indexes with size only 44% of the corpus. Infini-gram mini greatly improves upon the best existing implementation of FM-index in terms of indexing speed (18×) and memory use during both indexing (3.2× reduction) and querying (down to a negligible amount). We index 46TB of Internet text in 50 days with a single 128-core CPU node (or 19 hours if using 75 such nodes). We show one important use case of Infini-gram mini in a large-scale analysis of benchmark contamination. We find several core LM evaluation benchmarks to be heavily contaminated in Internet crawls (up to 40% in SQuAD), which could lead to overestimating the capabilities of language models if trained on such data. We host a benchmark contamination bulletin to share the contamination rate of many core and community-contributed benchmarks. We also release a web interface and an API endpoint to serve general search queries on Infini-gram mini indexes.
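
The Infini-gram mini abstract builds on the FM-index, which answers exact-match count queries by backward search over the Burrows-Wheeler transform. The following is a minimal textbook sketch of that data structure, not the Infini-gram mini implementation: it sorts suffixes naively and stores an uncompressed occurrence table, so it is only suitable for tiny strings. The function names are hypothetical.

```python
from collections import Counter


def build_fm_index(text):
    """Build a toy FM-index: the C table and a full Occ table over the BWT."""
    if "\0" in text:
        raise ValueError("text must not contain the sentinel character")
    s = text + "\0"  # sentinel that sorts before every other character
    # Naive suffix array: fine for a demo, far too slow for real corpora.
    suffix_array = sorted(range(len(s)), key=lambda i: s[i:])
    # BWT: the character preceding each sorted suffix (wraps to the sentinel).
    bwt = "".join(s[i - 1] for i in suffix_array)
    # C[ch] = number of characters in s that are strictly smaller than ch.
    counts = Counter(s)
    c_table, running = {}, 0
    for ch in sorted(counts):
        c_table[ch] = running
        running += counts[ch]
    # occ[ch][i] = number of occurrences of ch in bwt[:i].
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, ch in enumerate(bwt):
        for key, column in occ.items():
            column[i + 1] = column[i] + (1 if key == ch else 0)
    return c_table, occ, len(s)


def count_occurrences(index, pattern):
    """Count exact matches of pattern via backward search."""
    if not pattern:
        raise ValueError("pattern must be non-empty")
    c_table, occ, n = index
    lo, hi = 0, n  # half-open range of matching suffix-array rows
    for ch in reversed(pattern):
        if ch not in c_table:
            return 0
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


index = build_fm_index("abracadabra")
abra_count = count_occurrences(index, "abra")
```

Each query step touches only the C and Occ tables, never the original text, which is why the search cost depends on the pattern length rather than on the corpus size.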

Title: Large Language Models for History, Philosophy, and Sociology of Science: Interpretive Uses, Methodological Challenges, and Critical Perspectives

Authors: Arno Simons, Michael Zichert, Adrian Wüthrich
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2506.12242
Pdf URL: https://arxiv.org/pdf/2506.12242
Copy Paste: [[2506.12242]] Large Language Models for History, Philosophy, and Sociology of Science: Interpretive Uses, Methodological Challenges, and Critical Perspectives(https://arxiv.org/abs/2506.12242)
Keywords: language model, llm, retrieval-augmented generation
Abstract: This paper explores the use of large language models (LLMs) as research tools in the history, philosophy, and sociology of science (HPSS). LLMs are remarkably effective at processing unstructured text and inferring meaning from context, offering new affordances that challenge long-standing divides between computational and interpretive methods. This raises both opportunities and challenges for HPSS, which emphasizes interpretive methodologies and understands meaning as context-dependent, ambiguous, and historically situated. We argue that HPSS is uniquely positioned not only to benefit from LLMs' capabilities but also to interrogate their epistemic assumptions and infrastructural implications. To this end, we first offer a concise primer on LLM architectures and training paradigms tailored to non-technical readers. We frame LLMs not as neutral tools but as epistemic infrastructures that encode assumptions about meaning, context, and similarity, conditioned by their training data, architecture, and patterns of use. We then examine how computational techniques enhanced by LLMs, such as structuring data, detecting patterns, and modeling dynamic processes, can be applied to support interpretive research in HPSS. Our analysis compares full-context and generative models, outlines strategies for domain and task adaptation (e.g., continued pretraining, fine-tuning, and retrieval-augmented generation), and evaluates their respective strengths and limitations for interpretive inquiry in HPSS. We conclude with four lessons for integrating LLMs into HPSS: (1) model selection involves interpretive trade-offs; (2) LLM literacy is foundational; (3) HPSS must define its own benchmarks and corpora; and (4) LLMs should enhance, not replace, interpretive methods.

Title: The Behavior Gap: Evaluating Zero-shot LLM Agents in Complex Task-Oriented Dialogs

Authors: Avinash Baidya, Kamalika Das, Xiang Gao
Subjects: cs.CL, cs.AI, cs.HC, cs.LG
Abstract URL: https://arxiv.org/abs/2506.12266
Pdf URL: https://arxiv.org/pdf/2506.12266
Copy Paste: [[2506.12266]] The Behavior Gap: Evaluating Zero-shot LLM Agents in Complex Task-Oriented Dialogs(https://arxiv.org/abs/2506.12266)
Keywords: language model, gpt, llm, agent
Abstract: Large Language Model (LLM)-based agents have significantly impacted Task-Oriented Dialog Systems (TODS) but continue to face notable performance challenges, especially in zero-shot scenarios. While prior work has noted this performance gap, the behavioral factors driving the performance gap remain under-explored. This study proposes a comprehensive evaluation framework to quantify the behavior gap between AI agents and human experts, focusing on discrepancies in dialog acts, tool usage, and knowledge utilization. Our findings reveal that this behavior gap is a critical factor negatively impacting the performance of LLM agents. Notably, as task complexity increases, the behavior gap widens (correlation: 0.963), leading to a degradation of agent performance on complex task-oriented dialogs. For the most complex task in our study, even the GPT-4o-based agent exhibits low alignment with human behavior, with low F1 scores for dialog acts (0.464), excessive and often misaligned tool usage with a F1 score of 0.139, and ineffective usage of external knowledge. Reducing such behavior gaps leads to significant performance improvement (24.3% on average). This study highlights the importance of comprehensive behavioral evaluations and improved alignment strategies to enhance the effectiveness of LLM-based TODS in handling complex tasks.
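
To make the reported F1 figures concrete, the sketch below computes a set-based F1 between the dialog acts an agent produced in a turn and those a human expert produced for the same turn. The abstract does not say how acts are matched or how scores are aggregated, so the per-turn set comparison, the function name `dialog_act_f1`, and the act labels are illustrative assumptions, not the authors' evaluation code.

```python
def dialog_act_f1(agent_acts, human_acts):
    """Set-based F1 between agent and human dialog acts for one turn.

    Assumption: acts are compared as unordered sets of labels; the paper's
    actual matching and aggregation scheme is not given in the abstract.
    """
    agent, human = set(agent_acts), set(human_acts)
    if not agent and not human:
        return 1.0  # both empty: treat as perfect agreement
    overlap = len(agent & human)
    if overlap == 0:
        return 0.0
    precision = overlap / len(agent)  # share of agent acts the human also used
    recall = overlap / len(human)     # share of human acts the agent reproduced
    return 2 * precision * recall / (precision + recall)


# Hypothetical act labels, for illustration only.
score = dialog_act_f1(["inform", "request", "offer"], ["inform", "confirm"])
print(round(score, 3))
```

Under this reading, a low F1 can come from either side: an agent that emits many acts the human never used loses precision, while one that skips acts the human did use loses recall.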

Title: Med-U1: Incentivizing Unified Medical Reasoning in LLMs via Large-scale Reinforcement Learning

Authors: Xiaotian Zhang, Yuan Wang, Zhaopeng Feng, Ruizhe Chen, Zhijie Zhou, Yan Zhang, Hongxia Xu, Jian Wu, Zuozhu Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.12307
Pdf URL: https://arxiv.org/pdf/2506.12307
Copy Paste: [[2506.12307]] Med-U1: Incentivizing Unified Medical Reasoning in LLMs via Large-scale Reinforcement Learning(https://arxiv.org/abs/2506.12307)
Keywords: language model, llm
Abstract: Medical Question-Answering (QA) encompasses a broad spectrum of tasks, including multiple choice questions (MCQ), open-ended text generation, and complex computational reasoning. Despite this variety, a unified framework for delivering high-quality medical QA has yet to emerge. Although recent progress in reasoning-augmented large language models (LLMs) has shown promise, their ability to achieve comprehensive medical understanding is still largely unexplored. In this paper, we present Med-U1, a unified framework for robust reasoning across medical QA tasks with diverse output formats, ranging from MCQs to complex generation and computation tasks. Med-U1 employs pure large-scale reinforcement learning with mixed rule-based binary reward functions, incorporating a length penalty to manage output verbosity. With multi-objective reward optimization, Med-U1 directs LLMs to produce concise and verifiable reasoning chains. Empirical results reveal that Med-U1 significantly improves performance across multiple challenging Med-QA benchmarks, surpassing even larger specialized and proprietary models. Furthermore, Med-U1 demonstrates robust generalization to out-of-distribution (OOD) tasks. Extensive analysis presents insights into training strategies, reasoning chain length control, and reward design for medical LLMs. The code will be released.
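
As a rough illustration of a rule-based binary reward combined with a length penalty, the sketch below scores a response as correct or incorrect and then subtracts a penalty for tokens beyond a budget. The abstract gives no formula, so the exact-match rule, the linear capped penalty, the function name, and the default values of `max_tokens` and `penalty` are all assumptions made for this example and should not be read as Med-U1's actual reward.

```python
def binary_reward_with_length_penalty(prediction, reference, num_tokens,
                                      max_tokens=512, penalty=0.5):
    """Binary correctness reward minus a penalty for overly long outputs.

    Assumptions (not from the abstract): exact-match correctness after
    whitespace/case normalisation, and a linear penalty on tokens beyond
    `max_tokens` that is capped at `penalty`.
    """
    correct = prediction.strip().lower() == reference.strip().lower()
    reward = 1.0 if correct else 0.0
    overflow = max(0, num_tokens - max_tokens)
    # The penalty grows linearly with the overflow and saturates at `penalty`.
    reward -= penalty * min(1.0, overflow / max_tokens)
    return reward


print(binary_reward_with_length_penalty("B", "b", num_tokens=100))
print(binary_reward_with_length_penalty("B", "b", num_tokens=768))
```

With a shape like this, a correct but verbose answer still earns more than an incorrect one, so the penalty discourages verbosity without outweighing correctness.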

Title: Intersectional Bias in Japanese Large Language Models from a Contextualized Perspective

Authors: Hitomi Yanaka, Xinqi He, Jie Lu, Namgi Han, Sunjin Oh, Ryoma Kumon, Yuma Matsuoka, Katsuhiko Watabe, Yuko Itatsu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.12327
Pdf URL: https://arxiv.org/pdf/2506.12327
Copy Paste: [[2506.12327]] Intersectional Bias in Japanese Large Language Models from a Contextualized Perspective(https://arxiv.org/abs/2506.12327)
Keywords: language model, gpt, llm
Abstract: A growing number of studies have examined the social bias of rapidly developed large language models (LLMs). Although most of these studies have focused on bias occurring in a single social attribute, research in social science has shown that social bias often occurs in the form of intersectionality -- the constitutive and contextualized perspective on bias aroused by social attributes. In this study, we construct the Japanese benchmark inter-JBBQ, designed to evaluate the intersectional bias in LLMs in the question-answering setting. Using inter-JBBQ to analyze GPT-4o and Swallow, we find that biased output varies according to its context even with the same combination of social attributes.

Title: Investigating the Effects of Cognitive Biases in Prompts on Large Language Model Outputs

Authors: Yan Sun, Stanley Kok
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.12338
Pdf URL: https://arxiv.org/pdf/2506.12338
Copy Paste: [[2506.12338]] Investigating the Effects of Cognitive Biases in Prompts on Large Language Model Outputs(https://arxiv.org/abs/2506.12338)
Keywords: language model, llm, prompt
Abstract: This paper investigates the influence of cognitive biases on Large Language Model (LLM) outputs. Cognitive biases, such as confirmation and availability biases, can distort user inputs through prompts, potentially leading to unfaithful and misleading outputs from LLMs. Using a systematic framework, our study introduces various cognitive biases into prompts and assesses their impact on LLM accuracy across multiple benchmark datasets, including general and financial Q&A scenarios. The results demonstrate that even subtle biases can significantly alter LLM answer choices, highlighting a critical need for bias-aware prompt design and mitigation strategies. Additionally, our attention weight analysis highlights how these biases can alter the internal decision-making processes of LLMs, affecting the attention distribution in ways that are associated with output inaccuracies. This research has implications for AI developers and users in enhancing the robustness and reliability of AI applications in diverse domains.

Title: Refract ICL: Rethinking Example Selection in the Era of Million-Token Models

Authors: Arjun R. Akula, Kazuma Hashimoto, Krishna Srinivasan, Aditi Chaudhary, Karthik Raman, Michael Bendersky
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.12346
Pdf URL: https://arxiv.org/pdf/2506.12346
Copy Paste: [[2506.12346]] Refract ICL: Rethinking Example Selection in the Era of Million-Token Models(https://arxiv.org/abs/2506.12346)
Keywords: language model, llm
Abstract: The emergence of long-context large language models (LLMs) has enabled the use of hundreds, or even thousands, of demonstrations for in-context learning (ICL) - a previously impractical regime. This paper investigates whether traditional ICL selection strategies, which balance the similarity of ICL examples to the test input (using a text retriever) with diversity within the ICL set, remain effective when utilizing a large number of demonstrations. Our experiments demonstrate that, while longer contexts can accommodate more examples, simply increasing the number of demonstrations does not guarantee improved performance. Smart ICL selection remains crucial, even with thousands of demonstrations. To further enhance ICL in this setting, we introduce Refract ICL, a novel ICL selection algorithm specifically designed to focus LLM attention on challenging examples by strategically repeating them within the context and incorporating zero-shot predictions as error signals. Our results show that Refract ICL significantly improves the performance of extremely long-context models such as Gemini 1.5 Pro, particularly on tasks with a smaller number of output classes.

Title: Efficient Reasoning Through Suppression of Self-Affirmation Reflections in Large Reasoning Models

Authors: Kaiyuan Liu, Chen Shen, Zhanwei Zhang, Junjie Liu, Xiaosong Yuan, Jieping Ye
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.12353
Pdf URL: https://arxiv.org/pdf/2506.12353
Copy Paste: [[2506.12353]] Efficient Reasoning Through Suppression of Self-Affirmation Reflections in Large Reasoning Models(https://arxiv.org/abs/2506.12353)
Keywords: llm
Abstract: While recent advances in large reasoning models have demonstrated remarkable performance, efficient reasoning remains critical due to the rapid growth of output length. Existing optimization approaches highlight a tendency toward "overthinking", yet lack fine-grained analysis. In this work, we focus on Self-Affirmation Reflections: redundant reflective steps that affirm prior content and often occur after already correct reasoning steps. Observations of both original and optimized reasoning models reveal pervasive self-affirmation reflections. Notably, these reflections sometimes lead to longer outputs in optimized models than in their original counterparts. Through detailed analysis, we uncover an intriguing pattern: compared to other reflections, the leading words (i.e., the first word of sentences) in self-affirmation reflections exhibit a distinct probability bias. Motivated by this insight, we can locate self-affirmation reflections and conduct a train-free experiment demonstrating that suppressing self-affirmation reflections reduces output length without degrading accuracy across multiple models (R1-Distill-Models, QwQ-32B, and Qwen3-32B). Furthermore, we also improve a current train-based method by explicitly suppressing such reflections. In our experiments, we achieve length compression of 18.7% in train-free settings and 50.2% in train-based settings for R1-Distill-Qwen-1.5B. Moreover, our improvements are simple yet practical and can be directly applied to existing inference frameworks, such as vLLM. We believe that our findings will provide the community with insights for achieving more precise length compression and step-level efficient reasoning.

Title: Advances in LLMs with Focus on Reasoning, Adaptability, Efficiency and Ethics

Authors: Asifullah Khan, Muhammad Zaeem Khan, Saleha Jamshed, Sadia Ahmad, Aleesha Zainab, Kaynat Khatib, Faria Bibi, Abdul Rehman
Subjects: cs.CL, cs.DB
Abstract URL: https://arxiv.org/abs/2506.12365
Pdf URL: https://arxiv.org/pdf/2506.12365
Copy Paste: [[2506.12365]] Advances in LLMs with Focus on Reasoning, Adaptability, Efficiency and Ethics(https://arxiv.org/abs/2506.12365)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: This survey paper outlines the key developments in the field of Large Language Models (LLMs), such as enhanced reasoning skills, adaptability to various tasks, increased computational efficiency, and the ability to make ethical decisions. The techniques that have been most effective in bridging the gap between human and machine communication include Chain-of-Thought prompting, Instruction Tuning, and Reinforcement Learning from Human Feedback. Improvements in multimodal learning and few-shot or zero-shot techniques have further empowered LLMs to handle complex jobs with minimal input. They also manage to do more with less by applying scaling and optimization tricks to conserve computing power. This survey also offers a broader perspective on recent advancements in LLMs, going beyond isolated aspects such as model architecture or ethical concerns. It categorizes emerging methods that enhance LLM reasoning, efficiency, and ethical alignment. It also identifies underexplored areas such as interpretability, cross-modal integration and sustainability. Despite recent progress, challenges such as huge computational costs, biases, and ethical risks persist. Addressing these requires bias mitigation, transparent decision-making, and clear ethical guidelines. Future research will focus on enhancing models' ability to handle multiple inputs, thereby making them more intelligent, safe, and reliable.

Title: Understanding the Effect of Knowledge Graph Extraction Error on Downstream Graph Analyses: A Case Study on Affiliation Graphs

Authors: Erica Cai, Brendan O'Connor
Subjects: cs.CL, cs.SI
Abstract URL: https://arxiv.org/abs/2506.12367
Pdf URL: https://arxiv.org/pdf/2506.12367
Copy Paste: [[2506.12367]] Understanding the Effect of Knowledge Graph Extraction Error on Downstream Graph Analyses: A Case Study on Affiliation Graphs(https://arxiv.org/abs/2506.12367)
Keywords: language model, llm
Abstract: Knowledge graphs (KGs) are useful for analyzing social structures, community dynamics, institutional memberships, and other complex relationships across domains from sociology to public health. While recent advances in large language models (LLMs) have improved the scalability and accessibility of automated KG extraction from large text corpora, the impacts of extraction errors on downstream analyses are poorly understood, especially for applied scientists who depend on accurate KGs for real-world insights. To address this gap, we conducted the first evaluation of KG extraction performance at two levels: (1) micro-level edge accuracy, which is consistent with standard NLP evaluations, and manual identification of common error sources; (2) macro-level graph metrics that assess structural properties such as community detection and connectivity, which are relevant to real-world applications. Focusing on affiliation graphs of person membership in organizations extracted from social register books, our study identifies a range of extraction performance where biases across most downstream graph analysis metrics are near zero. However, as extraction performance declines, we find that many metrics exhibit increasingly pronounced biases, with each metric tending toward a consistent direction of either over- or under-estimation. Through simulations, we further show that error models commonly used in the literature do not capture these bias patterns, indicating the need for more realistic error models for KG extraction. Our findings provide actionable insights for practitioners and underscore the importance of advancing extraction methods and error modeling to ensure reliable and meaningful downstream analyses.

Title: Training-free LLM Merging for Multi-task Learning

Authors: Zichuan Fu, Xian Wu, Yejing Wang, Wanyu Wang, Shanshan Ye, Hongzhi Yin, Yi Chang, Yefeng Zheng, Xiangyu Zhao
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.12379
Pdf URL: https://arxiv.org/pdf/2506.12379
Copy Paste: [[2506.12379]] Training-free LLM Merging for Multi-task Learning(https://arxiv.org/abs/2506.12379)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have demonstrated exceptional capabilities across diverse natural language processing (NLP) tasks. The release of open-source LLMs like LLaMA and Qwen has triggered the development of numerous fine-tuned models tailored for various tasks and languages. In this paper, we explore an important question: is it possible to combine these specialized models to create a unified model with multi-task capabilities? We introduce Hierarchical Iterative Merging (Hi-Merging), a training-free method for unifying different specialized LLMs into a single model. Specifically, Hi-Merging employs model-wise and layer-wise pruning and scaling, guided by contribution analysis, to mitigate parameter conflicts. Extensive experiments on multiple-choice and question-answering tasks in both Chinese and English validate Hi-Merging's ability for multi-task learning. The results demonstrate that Hi-Merging consistently outperforms existing merging techniques and surpasses the performance of models fine-tuned on combined datasets in most scenarios. Code is available at: this https URL.

Title: Recent Advances and Future Directions in Literature-Based Discovery

Authors: Andrej Kastrin, Bojan Cestnik, Nada Lavrač
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.12385
Pdf URL: https://arxiv.org/pdf/2506.12385
Copy Paste: [[2506.12385]] Recent Advances and Future Directions in Literature-Based Discovery(https://arxiv.org/abs/2506.12385)
Keywords: language model, llm
Abstract: The explosive growth of scientific publications has created an urgent need for automated methods that facilitate knowledge synthesis and hypothesis generation. Literature-based discovery (LBD) addresses this challenge by uncovering previously unknown associations between disparate domains. This article surveys recent methodological advances in LBD, focusing on developments from 2000 to the present. We review progress in three key areas: knowledge graph construction, deep learning approaches, and the integration of pre-trained and large language models (LLMs). While LBD has made notable progress, several fundamental challenges remain unresolved, particularly concerning scalability, reliance on structured data, and the need for extensive manual curation. By examining ongoing advances and outlining promising future directions, this survey underscores the transformative role of LLMs in enhancing LBD and aims to support researchers and practitioners in harnessing these technologies to accelerate scientific innovation.

Title: Group then Scale: Dynamic Mixture-of-Experts Multilingual Language Model

Authors: Chong Li, Yingzhuo Deng, Jiajun Zhang, Chengqing Zong
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.12388
Pdf URL: https://arxiv.org/pdf/2506.12388
Copy Paste: [[2506.12388]] Group then Scale: Dynamic Mixture-of-Experts Multilingual Language Model(https://arxiv.org/abs/2506.12388)
Keywords: language model, llm
Abstract: The curse of multilinguality is a fundamental problem of multilingual Large Language Models (LLMs), where the competition between a massive number of languages results in inferior performance. It mainly comes from limited capacity and negative transfer between dissimilar languages. To address this issue, we propose a method to dynamically group and scale up the parameters of a multilingual LLM while boosting positive transfer among similar languages. Specifically, the model is first tuned on monolingual corpora to determine the parameter deviation in each layer and quantify the similarity between languages. Layers with more deviation are extended to mixture-of-experts layers to reduce competition between languages, where one expert module serves one group of similar languages. Experimental results on 18 to 128 languages show that our method reduces the negative transfer between languages and significantly boosts multilingual performance with fewer parameters. Such language-group specialization of experts benefits adaptation to new languages and reduces interference with previously learned multilingual knowledge.

Title: Exploring Cultural Variations in Moral Judgments with Large Language Models

Authors: Hadi Mohammadi, Efthymia Papadopoulou, Yasmeen F.S.S. Meijer, Ayoub Bagheri
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.12433
Pdf URL: https://arxiv.org/pdf/2506.12433
Copy Paste: [[2506.12433]] Exploring Cultural Variations in Moral Judgments with Large Language Models(https://arxiv.org/abs/2506.12433)
Keywords: language model, gpt, llm
Abstract: Large Language Models (LLMs) have shown strong performance across many tasks, but their ability to capture culturally diverse moral values remains unclear. In this paper, we examine whether LLMs can mirror variations in moral attitudes reported by two major cross-cultural surveys: the World Values Survey and the Pew Research Center's Global Attitudes Survey. We compare smaller, monolingual, and multilingual models (GPT-2, OPT, BLOOMZ, and Qwen) with more recent instruction-tuned models (GPT-4o, GPT-4o-mini, Gemma-2-9b-it, and Llama-3.3-70B-Instruct). Using log-probability-based moral justifiability scores, we correlate each model's outputs with survey data covering a broad set of ethical topics. Our results show that many earlier or smaller models often produce near-zero or negative correlations with human judgments. In contrast, advanced instruction-tuned models (including GPT-4o and GPT-4o-mini) achieve substantially higher positive correlations, suggesting they better reflect real-world moral attitudes. While scaling up model size and using instruction tuning can improve alignment with cross-cultural moral norms, challenges remain for certain topics and regions. We discuss these findings in relation to bias analysis, training data diversity, and strategies for improving the cultural sensitivity of LLMs.
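
The comparison described above can be pictured as two steps: turn model log-probabilities into a per-topic justifiability score, then correlate those scores with survey values. The sketch below is a minimal version under stated assumptions: the score is taken as the difference between the log-probabilities of a "justifiable" and an "unjustifiable" completion, the correlation is Pearson's r, and every number is an invented toy value. The abstract specifies neither the exact score definition nor the correlation coefficient, and the function names are hypothetical.

```python
import math


def justifiability_score(logp_justifiable, logp_unjustifiable):
    """Assumed score: log-probability margin in favour of 'justifiable'."""
    return logp_justifiable - logp_unjustifiable


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Invented toy values: (logp_justifiable, logp_unjustifiable) per topic,
# and a made-up survey justifiability rating for the same topics.
model_logps = [(-1.2, -2.5), (-3.0, -0.8), (-1.9, -2.0)]
survey_ratings = [0.7, 0.1, 0.4]

scores = [justifiability_score(a, b) for a, b in model_logps]
print(pearson(scores, survey_ratings))
```

A coefficient near zero or below would correspond to the behaviour the abstract attributes to earlier or smaller models; a clearly positive one would correspond to the instruction-tuned models.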

Title: From Outcomes to Processes: Guiding PRM Learning from ORM for Inference-Time Alignment

Authors: Bin Xie, Bingbing Xu, Yige Yuan, Shengmao Zhu, Huawei Shen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.12446
Pdf URL: https://arxiv.org/pdf/2506.12446
Copy Paste: [[2506.12446]] From Outcomes to Processes: Guiding PRM Learning from ORM for Inference-Time Alignment(https://arxiv.org/abs/2506.12446)
Keywords: language model, gpt, llm
Abstract: Inference-time alignment methods have gained significant attention for their efficiency and effectiveness in aligning large language models (LLMs) with human preferences. However, existing dominant approaches using reward-guided search (RGS) primarily rely on outcome reward models (ORMs), which suffer from a critical granularity mismatch: ORMs are designed to provide outcome rewards for complete responses, while RGS methods rely on process rewards to guide the policy, leading to inconsistent scoring and suboptimal alignment. To address this challenge, we introduce process reward models (PRMs) into RGS and argue that an ideal PRM should satisfy two objectives: Score Consistency, ensuring coherent evaluation across partial and complete responses, and Preference Consistency, aligning partial sequence assessments with human preferences. Based on these, we propose SP-PRM, a novel dual-consistency framework integrating score consistency-based and preference consistency-based partial evaluation modules without relying on human annotation. Extensive experiments on dialogue, summarization, and reasoning tasks demonstrate that SP-PRM substantially enhances existing RGS methods, achieving a 3.6%-10.3% improvement in GPT-4 evaluation scores across all tasks.

Title: Language Surgery in Multilingual Large Language Models

Authors: Joanito Agili Lopo, Muhammad Ravi Shulthan Habibi, Tack Hwa Wong, Muhammad Ilham Ghozali, Fajri Koto, Genta Indra Winata, Peerat Limkonchotiwat, Alham Fikri Aji, Samuel Cahyawijaya
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.12450
Pdf URL: https://arxiv.org/pdf/2506.12450
Copy Paste: [[2506.12450]] Language Surgery in Multilingual Large Language Models(https://arxiv.org/abs/2506.12450)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have demonstrated remarkable generalization capabilities across tasks and languages, revolutionizing natural language processing. This paper investigates the naturally emerging representation alignment in LLMs, particularly in the middle layers, and its implications for disentangling language-specific and language-agnostic information. We empirically confirm the existence of this alignment, analyze its behavior in comparison to explicitly designed alignment models, and demonstrate its potential for language-specific manipulation without semantic degradation. Building on these findings, we propose Inference-Time Language Control (ITLC), a novel method that leverages latent injection to enable precise cross-lingual language control and mitigate language confusion in LLMs. Our experiments highlight ITLC's strong cross-lingual control capabilities while preserving semantic integrity in target languages. Furthermore, we demonstrate its effectiveness in alleviating the cross-lingual language confusion problem, which persists even in current large-scale LLMs, leading to inconsistent language generation. This work advances our understanding of representation alignment in LLMs and introduces a practical solution for enhancing their cross-lingual performance.

Title: TagRouter: Learning Route to LLMs through Tags for Open-Domain Text Generation Tasks

Authors: Zhou Chen, Zhiqiang Wei, Yuqi Bai, Xue Xiong, Jianmin Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.12473
Pdf URL: https://arxiv.org/pdf/2506.12473
Copy Paste: [[2506.12473]] TagRouter: Learning Route to LLMs through Tags for Open-Domain Text Generation Tasks(https://arxiv.org/abs/2506.12473)
Keywords: language model, llm
Abstract: Model routing allocates queries to the suitable model, improving system performance while reducing costs. However, existing routing methods face practical limitations that hinder scalability in large-scale applications and struggle to keep up with the rapid growth of the large language model (LLM) ecosystem. To tackle these challenges, we propose TagRouter, a training-free model routing method designed to optimize the synergy among multiple LLMs for open-domain text generation tasks. Experimental results demonstrate that TagRouter outperforms 13 baseline methods, increasing the system's accept rate by 6.15% and reducing costs by 17.20%, achieving optimal cost-efficiency. Our findings provide the LLM community with an efficient and scalable solution for model ensembling, offering users an evolvable "super model."

Title: FlexRAG: A Flexible and Comprehensive Framework for Retrieval-Augmented Generation

Authors: Zhuocheng Zhang, Yang Feng, Min Zhang
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2506.12494
Pdf URL: https://arxiv.org/pdf/2506.12494
Copy Paste: [[2506.12494]] FlexRAG: A Flexible and Comprehensive Framework for Retrieval-Augmented Generation(https://arxiv.org/abs/2506.12494)
Keywords: language model, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) plays a pivotal role in modern large language model applications, with numerous existing frameworks offering a wide range of functionalities to facilitate the development of RAG systems. However, we have identified several persistent challenges in these frameworks, including difficulties in algorithm reproduction and sharing, lack of new techniques, and high system overhead. To address these limitations, we introduce FlexRAG, an open-source framework specifically designed for research and prototyping. FlexRAG supports text-based, multimodal, and network-based RAG, providing comprehensive lifecycle support alongside efficient asynchronous processing and persistent caching capabilities. By offering a robust and flexible solution, FlexRAG enables researchers to rapidly develop, deploy, and share advanced RAG systems. Our toolkit and resources are available at this https URL.

Title: Improving Factuality for Dialogue Response Generation via Graph-Based Knowledge Augmentation

Authors: Xiangyan Chen, Yujian Gan, Matthew Purver
Subjects: cs.CL, cs.HC
Abstract URL: https://arxiv.org/abs/2506.12496
Pdf URL: https://arxiv.org/pdf/2506.12496
Copy Paste: [[2506.12496]] Improving Factuality for Dialogue Response Generation via Graph-Based Knowledge Augmentation(https://arxiv.org/abs/2506.12496)
Keywords: language model, llm, hallucination
Abstract: Large Language Models (LLMs) succeed in many natural language processing tasks. However, their tendency to hallucinate - generate plausible but inconsistent or factually incorrect text - can cause problems in certain tasks, including response generation in dialogue. To mitigate this issue, knowledge-augmented methods have shown promise in reducing hallucinations. Here, we introduce a novel framework designed to enhance the factuality of dialogue response generation, as well as an approach to evaluate dialogue factual accuracy. Our framework combines a knowledge triple retriever, a dialogue rewrite, and knowledge-enhanced response generation to produce more accurate and grounded dialogue responses. To further evaluate generated responses, we propose a revised fact score that addresses the limitations of existing fact-score methods in dialogue settings, providing a more reliable assessment of factual consistency. We evaluate our methods using different baselines on the OpenDialKG and HybriDialogue datasets. Our methods significantly improve factuality compared to other graph knowledge-augmentation baselines, including the state-of-the-art G-retriever. The code will be released on GitHub.

Title: Towards Fairness Assessment of Dutch Hate Speech Detection

Authors: Julie Bauer, Rishabh Kaushal, Thales Bertaglia, Adriana Iamnitchi
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2506.12502
Pdf URL: https://arxiv.org/pdf/2506.12502
Copy Paste: [[2506.12502]] Towards Fairness Assessment of Dutch Hate Speech Detection(https://arxiv.org/abs/2506.12502)
Keywords: llm
Abstract: Numerous studies have proposed computational methods to detect hate speech online, yet most focus on the English language and emphasize model development. In this study, we evaluate the counterfactual fairness of hate speech detection models in the Dutch language, specifically examining the performance and fairness of transformer-based models. We make the following key contributions. First, we curate a list of Dutch Social Group Terms that reflect social context. Second, we generate counterfactual data for Dutch hate speech using LLMs and established strategies like Manual Group Substitution (MGS) and Sentence Log-Likelihood (SLL). Through qualitative evaluation, we highlight the challenges of generating realistic counterfactuals, particularly with Dutch grammar and contextual coherence. Third, we fine-tune baseline transformer-based models with counterfactual data and evaluate their performance in detecting hate speech. Fourth, we assess the fairness of these models using Counterfactual Token Fairness (CTF) and group fairness metrics, including equality of odds and demographic parity. Our analysis shows that models perform better in terms of hate speech detection, average counterfactual fairness and group fairness. This work addresses a significant gap in the literature on counterfactual fairness for hate speech detection in Dutch and provides practical insights and recommendations for improving both model performance and fairness.
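The counterfactual evaluation named in this abstract can be illustrated with a minimal sketch. It assumes the common definition of the Counterfactual Token Fairness gap as the mean absolute difference between a model's score on a sentence and its scores on group-substituted counterfactuals. The scorer `toy_score` and the placeholder tokens `GROUP_A`, `GROUP_B`, and `GROUP_C` are hypothetical and are not taken from the paper.

```python
def ctf_gap(score, sentence, term, substitutes):
    """Mean |score(original) - score(counterfactual)| over group-term swaps.

    `score` is any callable mapping text to a hate-speech probability.
    Counterfactuals are built by plain string substitution, a simplified
    stand-in for Manual Group Substitution.
    """
    original = score(sentence)
    gaps = [abs(original - score(sentence.replace(term, sub)))
            for sub in substitutes]
    return sum(gaps) / len(gaps)


def toy_score(text):
    # Hypothetical biased scorer: reacts to one placeholder group token only.
    return 0.75 if "GROUP_A" in text else 0.25


gap = ctf_gap(toy_score, "GROUP_A wonen hier", "GROUP_A",
              ["GROUP_B", "GROUP_C"])
```

A gap of zero would mean the scorer is insensitive to which group term appears. The toy scorer is deliberately biased, so its gap is large.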

Title: Detection, Classification, and Mitigation of Gender Bias in Large Language Models

Authors: Xiaoqing Cheng, Hongying Zan, Lulu Kong, Jinwang Song, Min Peng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.12527
Pdf URL: https://arxiv.org/pdf/2506.12527
Copy Paste: [[2506.12527]] Detection, Classification, and Mitigation of Gender Bias in Large Language Models(https://arxiv.org/abs/2506.12527)
Keywords: language model, gpt, llm, chain-of-thought
Abstract: With the rapid development of large language models (LLMs), they have significantly improved efficiency across a wide range of domains. However, recent studies have revealed that LLMs often exhibit gender bias, leading to serious social implications. Detecting, classifying, and mitigating gender bias in LLMs has therefore become a critical research focus. In the NLPCC 2025 Shared Task 7: Chinese Corpus for Gender Bias Detection, Classification and Mitigation Challenge, we investigate how to enhance the capabilities of LLMs in gender bias detection, classification, and mitigation. We adopt reinforcement learning, chain-of-thoughts (CoT) reasoning, and supervised fine-tuning to handle different Subtasks. Specifically, for Subtasks 1 and 2, we leverage the internal reasoning capabilities of LLMs to guide multi-step thinking in a staged manner, which simplifies complex biased queries and improves response accuracy. For Subtask 3, we employ a reinforcement learning-based approach, annotating a preference dataset using GPT-4. We then apply Direct Preference Optimization (DPO) to mitigate gender bias by introducing a loss function that explicitly favors less biased completions over biased ones. Our approach ranked first across all three subtasks of the NLPCC 2025 Shared Task 7.
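For readers unfamiliar with the Direct Preference Optimization objective mentioned above, here is a minimal sketch of the standard DPO loss for a single preference pair. It assumes sequence-level log-probabilities are already available from the policy and from a frozen reference model. The value of `beta` and all numbers are illustrative and do not come from the paper.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) completion pair.

    loss = -log sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r)))
    where each argument is a summed token log-probability.
    """
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_reward - rejected_reward)
    # -log(sigmoid(m)) = softplus(-m), written in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# The loss falls when the policy prefers the less biased (chosen) completion.
good = dpo_loss(-1.0, -5.0, -3.0, -3.0)
bad = dpo_loss(-5.0, -1.0, -3.0, -3.0)
```

When the policy matches the reference on both completions the margin is zero and the loss equals log 2. Training pushes the margin upward, which is what favors the chosen completion.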

Title: Speech-Language Models with Decoupled Tokenizers and Multi-Token Prediction

Authors: Xiaoran Fan, Zhichao Sun, Yangfan Gao, Jingfei Xiong, Hang Yan, Yifei Cao, Jiajun Sun, Shuo Li, Zhihao Zhang, Zhiheng Xi, Yuhao Zhou, Senjie Jin, Changhao Jiang, Junjie Ye, Ming Zhang, Rui Zheng, Zhenhua Han, Yunke Zhang, Demei Yan, Shaokang Dong, Tao Ji, Tao Gui, Qi Zhang, Xuanjing Huang
Subjects: cs.CL, cs.AI, eess.AS
Abstract URL: https://arxiv.org/abs/2506.12537
Pdf URL: https://arxiv.org/pdf/2506.12537
Copy Paste: [[2506.12537]] Speech-Language Models with Decoupled Tokenizers and Multi-Token Prediction(https://arxiv.org/abs/2506.12537)
Keywords: language model, llm
Abstract: Speech-language models (SLMs) offer a promising path toward unifying speech and text understanding and generation. However, challenges remain in achieving effective cross-modal alignment and high-quality speech generation. In this work, we systematically investigate the impact of key components (i.e., speech tokenizers, speech heads, and speaker modeling) on the performance of LLM-centric SLMs. We compare coupled, semi-decoupled, and fully decoupled speech tokenizers under a fair SLM framework and find that decoupled tokenization significantly improves alignment and synthesis quality. To address the information density mismatch between speech and text, we introduce multi-token prediction (MTP) into SLMs, enabling each hidden state to decode multiple speech tokens. This leads to up to 12× faster decoding and a substantial drop in word error rate (from 6.07 to 3.01). Furthermore, we propose a speaker-aware generation paradigm and introduce RoleTriviaQA, a large-scale role-playing knowledge QA benchmark with diverse speaker identities. Experiments demonstrate that our methods enhance both knowledge understanding and speaker consistency.

Title: RealFactBench: A Benchmark for Evaluating Large Language Models in Real-World Fact-Checking

Authors: Shuo Yang, Yuqin Dai, Guoqing Wang, Xinran Zheng, Jinfeng Xu, Jinze Li, Zhenzhe Ying, Weiqiang Wang, Edith C.H. Ngai
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.12538
Pdf URL: https://arxiv.org/pdf/2506.12538
Copy Paste: [[2506.12538]] RealFactBench: A Benchmark for Evaluating Large Language Models in Real-World Fact-Checking(https://arxiv.org/abs/2506.12538)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) hold significant potential for advancing fact-checking by leveraging their capabilities in reasoning, evidence retrieval, and explanation generation. However, existing benchmarks fail to comprehensively evaluate LLMs and Multimodal Large Language Models (MLLMs) in realistic misinformation scenarios. To bridge this gap, we introduce RealFactBench, a comprehensive benchmark designed to assess the fact-checking capabilities of LLMs and MLLMs across diverse real-world tasks, including Knowledge Validation, Rumor Detection, and Event Verification. RealFactBench consists of 6K high-quality claims drawn from authoritative sources, encompassing multimodal content and diverse domains. Our evaluation framework further introduces the Unknown Rate (UnR) metric, enabling a more nuanced assessment of models' ability to handle uncertainty and balance between over-conservatism and over-confidence. Extensive experiments on 7 representative LLMs and 4 MLLMs reveal their limitations in real-world fact-checking and offer valuable insights for further research. RealFactBench is publicly available at this https URL.

Title: Profiling News Media for Factuality and Bias Using LLMs and the Fact-Checking Methodology of Human Experts

Authors: Zain Muhammad Mujahid, Dilshod Azizov, Maha Tufail Agro, Preslav Nakov
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.12552
Pdf URL: https://arxiv.org/pdf/2506.12552
Copy Paste: [[2506.12552]] Profiling News Media for Factuality and Bias Using LLMs and the Fact-Checking Methodology of Human Experts(https://arxiv.org/abs/2506.12552)
Keywords: language model, llm, prompt
Abstract: In an age characterized by the proliferation of mis- and disinformation online, it is critical to empower readers to understand the content they are reading. Important efforts in this direction rely on manual or automatic fact-checking, which can be challenging for emerging claims with limited information. Such scenarios can be handled by assessing the reliability and the political bias of the source of the claim, i.e., characterizing entire news outlets rather than individual claims or articles. This is an important but understudied research direction. While prior work has looked into linguistic and social contexts, we do not analyze individual articles or information in social media. Instead, we propose a novel methodology that emulates the criteria that professional fact-checkers use to assess the factuality and political bias of an entire outlet. Specifically, we design a variety of prompts based on these criteria and elicit responses from large language models (LLMs), which we aggregate to make predictions. In addition to demonstrating sizable improvements over strong baselines via extensive experiments with multiple LLMs, we provide an in-depth error analysis of the effect of media popularity and region on model performance. Further, we conduct an ablation study to highlight the key components of our dataset that contribute to these improvements. To facilitate future research, we released our dataset and code at this https URL.

Title: DoTA-RAG: Dynamic of Thought Aggregation RAG

Authors: Saksorn Ruangtanusak, Natthapath Rungseesiripak, Peerawat Rojratchadakorn, Monthol Charattrakool, Natapong Nitarach
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2506.12571
Pdf URL: https://arxiv.org/pdf/2506.12571
Copy Paste: [[2506.12571]] DoTA-RAG: Dynamic of Thought Aggregation RAG(https://arxiv.org/abs/2506.12571)
Keywords: retrieval-augmented generation
Abstract: In this paper, we introduce DoTA-RAG (Dynamic-of-Thought Aggregation RAG), a retrieval-augmented generation system optimized for high-throughput, large-scale web knowledge indexes. Traditional RAG pipelines often suffer from high latency and limited accuracy over massive, diverse datasets. DoTA-RAG addresses these challenges with a three-stage pipeline: query rewriting, dynamic routing to specialized sub-indexes, and multi-stage retrieval and ranking. We further enhance retrieval by evaluating and selecting a superior embedding model, re-embedding the large FineWeb-10BT corpus. Moreover, we create a diverse Q&A dataset of 500 questions generated via the DataMorgana setup across a broad range of WebOrganizer topics and formats. DoTA-RAG improves the answer correctness score from 0.752 (baseline, using LiveRAG pre-built vector store) to 1.478 while maintaining low latency, and it achieves a 0.929 correctness score on the Live Challenge Day. These results highlight DoTA-RAG's potential for practical deployment in domains requiring fast, reliable access to large and evolving knowledge sources.

Title: Overview of the NLPCC 2025 Shared Task: Gender Bias Mitigation Challenge

Authors: Yizhi Li, Ge Zhang, Hanhua Hong, Yiwen Wang, Chenghua Lin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.12574
Pdf URL: https://arxiv.org/pdf/2506.12574
Copy Paste: [[2506.12574]] Overview of the NLPCC 2025 Shared Task: Gender Bias Mitigation Challenge(https://arxiv.org/abs/2506.12574)
Keywords: language model
Abstract: As natural language processing for gender bias becomes a significant interdisciplinary topic, the prevalent data-driven techniques, such as pre-trained language models, suffer from biased corpora. This case becomes more obvious regarding those languages with fewer fairness-related computational linguistic resources, such as Chinese. To this end, we propose a Chinese cOrpus foR Gender bIas Probing and Mitigation (CORGI-PM), which contains 32.9k sentences with high-quality labels derived by following an annotation scheme specifically developed for gender bias in the Chinese context. It is worth noting that CORGI-PM contains 5.2k gender-biased sentences along with the corresponding bias-eliminated versions rewritten by human annotators. We pose three challenges as a shared task to automate the mitigation of textual gender bias, which requires the models to detect, classify, and mitigate textual gender bias. In the literature, we present the results and analysis for the teams participating in this shared task in NLPCC 2025.

Title: Enabling Precise Topic Alignment in Large Language Models Via Sparse Autoencoders

Authors: Ananya Joshi, Celia Cintas, Skyler Speakman
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.12576
Pdf URL: https://arxiv.org/pdf/2506.12576
Copy Paste: [[2506.12576]] Enabling Precise Topic Alignment in Large Language Models Via Sparse Autoencoders(https://arxiv.org/abs/2506.12576)
Keywords: language model, gpt, llm, prompt
Abstract: Recent work shows that Sparse Autoencoders (SAE) applied to large language model (LLM) layers have neurons corresponding to interpretable concepts. These SAE neurons can be modified to align generated outputs, but only towards pre-identified topics and with some parameter tuning. Our approach leverages the observational and modification properties of SAEs to enable alignment for any topic. This method 1) scores each SAE neuron by its semantic similarity to an alignment text and uses them to 2) modify SAE-layer-level outputs by emphasizing topic-aligned neurons. We assess the alignment capabilities of this approach on diverse public topic datasets including Amazon reviews, Medicine, and Sycophancy, across the currently available open-source LLMs and SAE pairs (GPT2 and Gemma) with multiple SAE configurations. Experiments aligning to medical prompts reveal several benefits over fine-tuning, including increased average language acceptability (0.25 vs. 0.5), reduced training time across multiple alignment topics (333.6s vs. 62s), and acceptable inference time for many applications (+0.00092s/token). Our open-source code is available at this http URL.
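The two steps described in this abstract (score neurons, then emphasize the topic-aligned ones) can be sketched roughly as follows. This is a toy illustration under strong assumptions, not the authors' implementation: each neuron is assumed to come with an embedding of its interpretable concept, similarity is taken to be cosine similarity, and "emphasizing" is modeled as multiplying the top-k neurons' activations by a hypothetical `gain` factor.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def score_neurons(neuron_embeddings, topic_embedding):
    """Step 1: score every SAE neuron against the alignment-text embedding."""
    return [cosine(e, topic_embedding) for e in neuron_embeddings]


def emphasize(activations, scores, top_k=1, gain=2.0):
    """Step 2: scale the activations of the top_k best-scoring neurons."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top = set(ranked[:top_k])
    return [a * gain if i in top else a for i, a in enumerate(activations)]


# Toy 2-d embeddings for three neurons and one alignment topic.
scores = score_neurons([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [1.0, 0.0])
steered = emphasize([0.5, 0.5, 0.5], scores, top_k=1, gain=2.0)
```

In a real system the modified activations would be passed back through the SAE decoder into the model's residual stream. That step is omitted here.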

Title: OneEval: Benchmarking LLM Knowledge-intensive Reasoning over Diverse Knowledge Bases

Authors: Yongrui Chen, Zhiqiang Liu, Jing Yu, Lin Ren, Nan Hu, Xinbang Dai, Jiajun Liu, Jiazhen Kang, Shenyu Zhang, Xinda Wang, Keyan Ding, Pengfei Shen, Haolei Zhu, Hongjie Deng, Yisong Wang, Tongtong Wu, Sheng Bi, Wen Zhang, Tianxing Wu, Qiu Ji, Haofen Wang, Wenliang Chen, Huajun Chen, Guilin Qi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.12577
Pdf URL: https://arxiv.org/pdf/2506.12577
Copy Paste: [[2506.12577]] OneEval: Benchmarking LLM Knowledge-intensive Reasoning over Diverse Knowledge Bases(https://arxiv.org/abs/2506.12577)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have demonstrated substantial progress on reasoning tasks involving unstructured text, yet their capabilities significantly deteriorate when reasoning requires integrating structured external knowledge such as knowledge graphs, code snippets, or formal logic. This limitation is partly due to the absence of benchmarks capable of systematically evaluating LLM performance across diverse structured knowledge modalities. To address this gap, we introduce OneEval, a comprehensive benchmark explicitly designed to assess the knowledge-intensive reasoning capabilities of LLMs across four structured knowledge modalities (unstructured text, knowledge graphs, code, and formal logic) and five critical domains (general knowledge, government, science, law, and programming). OneEval comprises 4,019 carefully curated instances and includes a challenging subset, OneEval-Hard, consisting of 1,285 particularly difficult cases. Through extensive evaluation of 18 state-of-the-art open-source and proprietary LLMs, we establish three core findings: a) persistent limitations in structured reasoning, with even the strongest model achieving only 32.2% accuracy on OneEval-Hard; b) performance consistently declines as the structural complexity of the knowledge base increases, with accuracy dropping sharply from 53% (textual reasoning) to 25% (formal logic); and c) diminishing returns from extended reasoning chains, highlighting the critical need for models to adapt reasoning depth appropriately to task complexity. We release the OneEval datasets, evaluation scripts, and baseline results publicly, accompanied by a leaderboard to facilitate ongoing advancements in structured knowledge reasoning.

Title: An Exploration of Mamba for Speech Self-Supervised Models

Authors: Tzu-Quan Lin, Heng-Cheng Kuo, Tzu-Chieh Wei, Hsi-Chun Cheng, Chun-Wei Chen, Hsien-Fu Hsiao, Yu Tsao, Hung-yi Lee
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.12606
Pdf URL: https://arxiv.org/pdf/2506.12606
Copy Paste: [[2506.12606]] An Exploration of Mamba for Speech Self-Supervised Models(https://arxiv.org/abs/2506.12606)
Keywords: language model
Abstract: While Mamba has demonstrated strong performance in language modeling, its potential as a speech self-supervised (SSL) model remains underexplored, with prior studies limited to isolated tasks. To address this, we explore Mamba-based HuBERT models as alternatives to Transformer-based SSL architectures. Leveraging the linear-time Selective State Space, these models enable fine-tuning on long-context ASR with significantly lower compute. Moreover, they show superior performance when fine-tuned for streaming ASR. Beyond fine-tuning, these models show competitive performance on SUPERB probing benchmarks, particularly in causal settings. Our analysis shows that they yield higher-quality quantized representations and capture speaker-related features more distinctly than Transformer-based models. These findings highlight Mamba-based SSL as a promising and complementary direction for long-sequence modeling, real-time speech modeling, and speech unit extraction.

Title: Towards Building General Purpose Embedding Models for Industry 4.0 Agents

Authors: Christodoulos Constantinides, Shuxin Lin, Dhaval Patel
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.12607
Pdf URL: https://arxiv.org/pdf/2506.12607
Copy Paste: [[2506.12607]] Towards Building General Purpose Embedding Models for Industry 4.0 Agents(https://arxiv.org/abs/2506.12607)
Keywords: language model, llm, agent
Abstract: In this work we focus on improving language models' understanding for asset maintenance to guide the engineer's decisions and minimize asset downtime. Given a set of tasks expressed in natural language for Industry 4.0 domain, each associated with queries related to a specific asset, we want to recommend relevant items and generalize to queries of similar assets. A task may involve identifying relevant sensors given a query about an asset's failure mode. Our approach begins with gathering a qualitative, expert-vetted knowledge base to construct nine asset-specific task datasets. To create more contextually informed embeddings, we augment the input tasks using Large Language Models (LLMs), providing concise descriptions of the entities involved in the queries. This embedding model is then integrated with a Reasoning and Acting agent (ReAct), which serves as a powerful tool for answering complex user queries that require multi-step reasoning, planning, and knowledge inference. Through ablation studies, we demonstrate that: (a) LLM query augmentation improves the quality of embeddings, (b) Contrastive loss and other methods that avoid in-batch negatives are superior for datasets with queries related to many items, and (c) It is crucial to balance positive and negative in-batch samples. After training and testing on our dataset, we observe a substantial improvement: HIT@1 increases by +54.2%, MAP@100 by +50.1%, and NDCG@10 by +54.7%, averaged across all tasks and models. Additionally, we empirically demonstrate the model's planning and tool invocation capabilities when answering complex questions related to industrial asset maintenance, showcasing its effectiveness in supporting Subject Matter Experts (SMEs) in their day-to-day operations.
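The retrieval metrics reported above (HIT@1, MAP@100, NDCG@10) can be computed as in the following sketch. It assumes binary relevance and a ranked list without duplicates. Normalization conventions for average precision vary between toolkits; this version divides by min(number of relevant items, k). The sensor identifiers are hypothetical.

```python
import math


def hit_at_k(ranked, relevant, k):
    """1.0 if any of the top-k ranked items is relevant, else 0.0."""
    return 1.0 if any(item in relevant for item in ranked[:k]) else 0.0


def average_precision_at_k(ranked, relevant, k):
    """Average precision over the top-k items; MAP@k is its mean over queries."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    denom = min(len(relevant), k)
    return total / denom if denom else 0.0


def ndcg_at_k(ranked, relevant, k):
    """NDCG@k with binary gains and a log2 position discount."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, item in enumerate(ranked[:k], start=1)
              if item in relevant)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0


# One query about a failure mode; two of the four ranked sensors are relevant.
ranked = ["s3", "s1", "s7", "s2"]
relevant = {"s1", "s2"}
```

Averaging these per-query values over all queries and tasks gives dataset-level figures of the kind quoted in the abstract.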

Title: OpenUnlearning: Accelerating LLM Unlearning via Unified Benchmarking of Methods and Metrics

Authors: Vineeth Dorna, Anmol Mekala, Wenlong Zhao, Andrew McCallum, Zachary C. Lipton, J. Zico Kolter, Pratyush Maini
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.12618
Pdf URL: https://arxiv.org/pdf/2506.12618
Copy Paste: [[2506.12618]] OpenUnlearning: Accelerating LLM Unlearning via Unified Benchmarking of Methods and Metrics(https://arxiv.org/abs/2506.12618)
Keywords: language model, llm
Abstract: Robust unlearning is crucial for safely deploying large language models (LLMs) in environments where data privacy, model safety, and regulatory compliance must be ensured. Yet the task is inherently challenging, partly due to difficulties in reliably measuring whether unlearning has truly occurred. Moreover, fragmentation in current methodologies and inconsistent evaluation metrics hinder comparative analysis and reproducibility. To unify and accelerate research efforts, we introduce OpenUnlearning, a standardized and extensible framework designed explicitly for benchmarking both LLM unlearning methods and metrics. OpenUnlearning integrates 9 unlearning algorithms and 16 diverse evaluations across 3 leading benchmarks (TOFU, MUSE, and WMDP) and also enables analyses of forgetting behaviors across 450+ checkpoints we publicly release. Leveraging OpenUnlearning, we propose a novel meta-evaluation benchmark focused specifically on assessing the faithfulness and robustness of evaluation metrics themselves. We also benchmark diverse unlearning methods and provide a comparative analysis against an extensive evaluation suite. Overall, we establish a clear, community-driven pathway toward rigorous development in LLM unlearning research.

Title: Between Predictability and Randomness: Seeking Artistic Inspiration from AI Generative Models

Authors: Olga Vechtomova
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.12634
Pdf URL: https://arxiv.org/pdf/2506.12634
Copy Paste: [[2506.12634]] Between Predictability and Randomness: Seeking Artistic Inspiration from AI Generative Models(https://arxiv.org/abs/2506.12634)
Keywords: language model, llm
Abstract: Artistic inspiration often emerges from language that is open to interpretation. This paper explores the use of AI-generated poetic lines as stimuli for creativity. Through analysis of two generative AI approaches--lines generated by Long Short-Term Memory Variational Autoencoders (LSTM-VAE) and complete poems by Large Language Models (LLMs)--I demonstrate that LSTM-VAE lines achieve their evocative impact through a combination of resonant imagery and productive indeterminacy. While LLMs produce technically accomplished poetry with conventional patterns, LSTM-VAE lines can engage the artist through semantic openness, unconventional combinations, and fragments that resist closure. Through the composition of an original poem, where narrative emerged organically through engagement with LSTM-VAE generated lines rather than following a predetermined structure, I demonstrate how these characteristics can serve as evocative starting points for authentic artistic expression.

Title: Synthetic Socratic Debates: Examining Persona Effects on Moral Decision and Persuasion Dynamics

Authors: Jiarui Liu, Yueqi Song, Yunze Xiao, Mingqian Zheng, Lindia Tjuatja, Jana Schaich Borg, Mona Diab, Maarten Sap
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.12657
Pdf URL: https://arxiv.org/pdf/2506.12657
Copy Paste: [[2506.12657]] Synthetic Socratic Debates: Examining Persona Effects on Moral Decision and Persuasion Dynamics(https://arxiv.org/abs/2506.12657)
Keywords: language model, llm, agent
Abstract: As large language models (LLMs) are increasingly used in morally sensitive domains, it is crucial to understand how persona traits affect their moral reasoning and persuasive behavior. We present the first large-scale study of multi-dimensional persona effects in AI-AI debates over real-world moral dilemmas. Using a 6-dimensional persona space (age, gender, country, class, ideology, and personality), we simulate structured debates between AI agents over 131 relationship-based cases. Our results show that personas affect initial moral stances and debate outcomes, with political ideology and personality traits exerting the strongest influence. Persuasive success varies across traits, with liberal and open personalities reaching higher consensus and win rates. While logit-based confidence grows during debates, emotional and credibility-based appeals diminish, indicating more tempered argumentation over time. These trends mirror findings from psychology and cultural studies, reinforcing the need for persona-aware evaluation frameworks for AI moral reasoning.

Title: Flexible Realignment of Language Models

Authors: Wenhong Zhu, Ruobing Xie, Weinan Zhang, Rui Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.12704
Pdf URL: https://arxiv.org/pdf/2506.12704
Copy Paste: [[2506.12704]] Flexible Realignment of Language Models(https://arxiv.org/abs/2506.12704)
Keywords: language model
Abstract: Realignment becomes necessary when a language model (LM) fails to meet expected performance. We propose a flexible realignment framework that supports quantitative control of alignment degree during training and inference. This framework incorporates Training-time Realignment (TrRa), which efficiently realigns the reference model by leveraging the controllable fusion of logits from both the reference and already aligned models. For example, TrRa reduces token usage by 54.63% on DeepSeek-R1-Distill-Qwen-1.5B without any performance degradation, outperforming DeepScaleR-1.5B's 33.86%. To complement TrRa during inference, we introduce a layer adapter that enables smooth Inference-time Realignment (InRa). This adapter is initialized to perform an identity transformation at the bottom layer and is inserted preceding the original layers. During inference, input embeddings are simultaneously processed by the adapter and the original layer, followed by the remaining layers, and then controllably interpolated at the logit level. We upgraded DeepSeek-R1-Distill-Qwen-7B from a slow-thinking model to one that supports both fast and slow thinking, allowing flexible alignment control even during inference. By encouraging deeper reasoning, it even surpassed its original performance.
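The "controllable fusion of logits" can be pictured with a small sketch. The abstract does not state the exact fusion rule, so the linear interpolation below, with a hypothetical coefficient `lam` in [0, 1], is an assumption made for illustration only.

```python
import math


def fuse_logits(ref_logits, aligned_logits, lam):
    """Interpolate next-token logits from a reference and an aligned model.

    lam = 0.0 reproduces the reference model, lam = 1.0 the aligned model;
    values in between give intermediate alignment strength.
    """
    return [(1.0 - lam) * r + lam * a
            for r, a in zip(ref_logits, aligned_logits)]


def softmax(logits):
    """Convert logits to probabilities, subtracting the max for stability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


ref = [2.0, 0.0, -1.0]        # toy logits over a 3-token vocabulary
aligned = [0.0, 2.0, -1.0]
halfway = softmax(fuse_logits(ref, aligned, 0.5))
```

With `lam` set to 0.5 the two models' preferred tokens receive equal probability in this toy case, which shows how a single scalar can dial alignment strength up or down.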

Title: Rethinking Hate Speech Detection on Social Media: Can LLMs Replace Traditional Models?

Authors: Daman Deep Singh, Ramanuj Bhattacharjee, Abhijnan Chakraborty
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2506.12744
Pdf URL: https://arxiv.org/pdf/2506.12744
Copy Paste: [[2506.12744]] Rethinking Hate Speech Detection on Social Media: Can LLMs Replace Traditional Models?(https://arxiv.org/abs/2506.12744)
Keywords: language model, llm
Abstract: Hate speech detection across contemporary social media presents unique challenges due to linguistic diversity and the informal nature of online discourse. These challenges are further amplified in settings involving code-mixing, transliteration, and culturally nuanced expressions. While fine-tuned transformer models, such as BERT, have become standard for this task, we argue that recent large language models (LLMs) not only surpass them but also redefine the landscape of hate speech detection more broadly. To support this claim, we introduce IndoHateMix, a diverse, high-quality dataset capturing Hindi-English code-mixing and transliteration in the Indian context, providing a realistic benchmark to evaluate model robustness in complex multilingual scenarios where existing NLP methods often struggle. Our extensive experiments show that cutting-edge LLMs (such as LLaMA-3.1) consistently outperform task-specific BERT-based models, even when fine-tuned on significantly less data. With their superior generalization and adaptability, LLMs offer a transformative approach to mitigating online hate in diverse environments. This raises the question of whether future works should prioritize developing specialized models or focus on curating richer and more varied datasets to further enhance the effectiveness of LLMs.

Title: Democratic or Authoritarian? Probing a New Dimension of Political Biases in Large Language Models

Authors: David Guzman Piedrahita, Irene Strauss, Bernhard Schölkopf, Rada Mihalcea, Zhijing Jin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.12758
Pdf URL: https://arxiv.org/pdf/2506.12758
Copy Paste: [[2506.12758]] Democratic or Authoritarian? Probing a New Dimension of Political Biases in Large Language Models(https://arxiv.org/abs/2506.12758)
Keywords: language model, llm, prompt
Abstract: As Large Language Models (LLMs) become increasingly integrated into everyday life and information ecosystems, concerns about their implicit biases persist. While prior work has primarily examined socio-demographic and left-right political dimensions, little attention has been paid to how LLMs align with broader geopolitical value systems, particularly the democracy-authoritarianism spectrum. In this paper, we propose a novel methodology to assess such alignment, combining (1) the F-scale, a psychometric tool for measuring authoritarian tendencies, (2) FavScore, a newly introduced metric for evaluating model favorability toward world leaders, and (3) role-model probing to assess which figures are cited as general role models by LLMs. We find that LLMs generally favor democratic values and leaders, but exhibit increased favorability toward authoritarian figures when prompted in Mandarin. Further, models are found to often cite authoritarian figures as role models, even outside explicit political contexts. These results shed light on ways LLMs may reflect and potentially reinforce global political ideologies, highlighting the importance of evaluating bias beyond conventional socio-political axes. Our code is available at: this https URL

Title: Surprise Calibration for Better In-Context Learning

Authors: Zhihang Tan, Jingrui Hou, Ping Wang, Qibiao Hu, Peng Zhu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.12796
Pdf URL: https://arxiv.org/pdf/2506.12796
Copy Paste: [[2506.12796]] Surprise Calibration for Better In-Context Learning(https://arxiv.org/abs/2506.12796)
Keywords: language model, llm
Abstract: In-context learning (ICL) has emerged as a powerful paradigm for task adaptation in large language models (LLMs), where models infer underlying task structures from a few demonstrations. However, ICL remains susceptible to biases that arise from prior knowledge and contextual demonstrations, which can degrade the performance of LLMs. Existing bias calibration methods typically apply fixed class priors across all inputs, limiting their efficacy in dynamic ICL settings where the context for each query differs. To address these limitations, we adopt implicit sequential Bayesian inference as a framework for interpreting ICL, identify "surprise" as an informative signal for class prior shift, and introduce a novel method, Surprise Calibration (SC). SC leverages the notion of surprise to capture the temporal dynamics of class priors, providing a more adaptive and computationally efficient solution for in-context learning. We empirically demonstrate the superiority of SC over existing bias calibration techniques across a range of benchmark natural language processing tasks.

Title: Transforming Chatbot Text: A Sequence-to-Sequence Approach

Authors: Natesh Reddy, Mark Stamp
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.12843
Pdf URL: https://arxiv.org/pdf/2506.12843
Copy Paste: [[2506.12843]] Transforming Chatbot Text: A Sequence-to-Sequence Approach(https://arxiv.org/abs/2506.12843)
Keywords: language model, gpt, llm, chat
Abstract: Due to advances in Large Language Models (LLMs) such as ChatGPT, the boundary between human-written text and AI-generated text has become blurred. Nevertheless, recent work has demonstrated that it is possible to reliably detect GPT-generated text. In this paper, we adopt a novel strategy to adversarially transform GPT-generated text using sequence-to-sequence (Seq2Seq) models, with the goal of making the text more human-like. We experiment with the Seq2Seq models T5-small and BART, which serve to modify GPT-generated sentences to include linguistic, structural, and semantic components that may be more typical of human-authored text. Experiments show that classification models trained to distinguish GPT-generated text are significantly less accurate when tested on text that has been modified by these Seq2Seq models. However, after retraining classification models on data generated by our Seq2Seq technique, the models are able to distinguish the transformed GPT-generated text from human-generated text with high accuracy. This work adds to the accumulating knowledge of text transformation as a tool for both attack (in the sense of defeating classification models) and defense (in the sense of improved classifiers), thereby advancing our understanding of AI-generated text.

Title: QFFT, Question-Free Fine-Tuning for Adaptive Reasoning

Authors: Wanlong Liu, Junxiao Xu, Fei Yu, Yukang Lin, Ke Ji, Wenyu Chen, Yan Xu, Yasheng Wang, Lifeng Shang, Benyou Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.12860
Pdf URL: https://arxiv.org/pdf/2506.12860
Copy Paste: [[2506.12860]] QFFT, Question-Free Fine-Tuning for Adaptive Reasoning(https://arxiv.org/abs/2506.12860)
Keywords: chain-of-thought
Abstract: Recent advancements in Long Chain-of-Thought (CoT) reasoning models have improved performance on complex tasks, but they suffer from overthinking, which generates redundant reasoning steps, especially for simple questions. This paper revisits the reasoning patterns of Long and Short CoT models, observing that the Short CoT patterns offer concise reasoning efficiently, while the Long CoT patterns excel in challenging scenarios where the Short CoT patterns struggle. To enable models to leverage both patterns, we propose Question-Free Fine-Tuning (QFFT), a fine-tuning approach that removes the input question during training and learns exclusively from Long CoT responses. This approach enables the model to adaptively employ both reasoning patterns: it prioritizes the Short CoT patterns and activates the Long CoT patterns only when necessary. Experiments on various mathematical datasets demonstrate that QFFT reduces average response length by more than 50%, while achieving performance comparable to Supervised Fine-Tuning (SFT). Additionally, QFFT exhibits superior performance compared to SFT in noisy, out-of-domain, and low-resource scenarios.
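
The difference between standard SFT and the question-free setup described above can be sketched at the level of how a single training example is assembled. The field names (`input`, `target`) and the choice to encode "no question" as an empty input string are assumptions made for illustration; the abstract does not specify the data format or loss masking the authors use.

```python
def build_sft_example(question, long_cot_response):
    """Standard SFT: the model conditions on the question and learns the response."""
    return {"input": question, "target": long_cot_response}

def build_qfft_example(question, long_cot_response):
    """Question-free variant: the question is dropped, so the model learns
    from the Long CoT response alone (empty input is an assumed encoding)."""
    return {"input": "", "target": long_cot_response}

# Made-up sample for illustration only.
sample = {
    "question": "What is 12 * 7?",
    "long_cot": "Let me compute 12 * 7. 10 * 7 = 70 and 2 * 7 = 14, so the total is 84.",
}
sft = build_sft_example(sample["question"], sample["long_cot"])
qfft = build_qfft_example(sample["question"], sample["long_cot"])
print(repr(sft["input"]))
print(repr(qfft["input"]))
```

Both variants supervise the same target text; only the conditioning context differs.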

Title: ArgHiTZ at ArchEHR-QA 2025: A Two-Step Divide and Conquer Approach to Patient Question Answering for Top Factuality

Authors: Adrián Cuadrón, Aimar Sagasti, Maitane Urruela, Iker De la Iglesia, Ane G Domingo-Aldama, Aitziber Atutxa, Josu Goikoetxea, Ander Barrena
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.12886
Pdf URL: https://arxiv.org/pdf/2506.12886
Copy Paste: [[2506.12886]] ArgHiTZ at ArchEHR-QA 2025: A Two-Step Divide and Conquer Approach to Patient Question Answering for Top Factuality(https://arxiv.org/abs/2506.12886)
Keywords: prompt
Abstract: This work presents three different approaches to address the ArchEHR-QA 2025 Shared Task on automated patient question answering. We introduce an end-to-end prompt-based baseline and two two-step methods that divide the task, without utilizing any external knowledge. Both two-step approaches first extract essential sentences from the clinical text, by prompt or by similarity ranking, and then generate the final answer from these notes. Results indicate that the re-ranker-based two-step system performs best, highlighting the importance of selecting the right approach for each subtask. Our best run achieved an overall score of 0.44, ranking 8th out of 30 on the leaderboard and securing the top position in overall factuality.
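
The two-step idea (select essential sentences, then generate from them) can be sketched as follows. This is a minimal illustration under stated assumptions: a bag-of-words cosine similarity stands in for the re-ranker named in the abstract, the function names are hypothetical, the clinical note is made up, and step 2 only assembles a prompt rather than calling a model.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase word tokens; a deliberately simple stand-in for a real encoder."""
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def select_sentences(question, sentences, k=2):
    """Step 1: keep the k sentences most similar to the question, in original order."""
    q = Counter(tokenize(question))
    scored = [(cosine(q, Counter(tokenize(s))), i) for i, s in enumerate(sentences)]
    top = sorted(scored, key=lambda x: (-x[0], x[1]))[:k]
    return [sentences[i] for _, i in sorted(top, key=lambda x: x[1])]

def build_answer_prompt(question, evidence):
    """Step 2: assemble the prompt a generator model would receive (no model call here)."""
    lines = "\n".join(f"- {s}" for s in evidence)
    return f"Question: {question}\nEvidence:\n{lines}\nAnswer using only the evidence above."

# Made-up clinical note for illustration only.
note = [
    "The patient was admitted with chest pain.",
    "Family history is unremarkable.",
    "An ECG showed no acute changes and the chest pain resolved.",
]
question = "Why was the patient admitted with chest pain?"
evidence = select_sentences(question, note, k=2)
print(build_answer_prompt(question, evidence))
```

Restricting the generator to the selected sentences is one plausible way such a pipeline could favour factuality, since the answer can only draw on text that survived the extraction step.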

Title: SciDA: Scientific Dynamic Assessor of LLMs

Authors: Junting Zhou, Tingjia Miao, Yiyan Liao, Qichao Wang, Zhoufutu Wen, Yanqin Wang, Yunjie Huang, Ge Yan, Leqi Wang, Yucheng Xia, Hongwan Gao, Yuansong Zeng, Renjie Zheng, Chen Dun, Yitao Liang, Tong Yang, Wenhao Huang, Ge Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.12909
Pdf URL: https://arxiv.org/pdf/2506.12909
Copy Paste: [[2506.12909]] SciDA: Scientific Dynamic Assessor of LLMs(https://arxiv.org/abs/2506.12909)
Keywords: language model, llm
Abstract: Advances in the reasoning capabilities of Large Language Models (LLMs) enable them to solve scientific problems with enhanced efficacy. A high-quality benchmark for comprehensive and appropriate assessment is therefore important, yet existing ones either face the risk of data contamination or cover too few disciplines. Specifically, because the data sources of LLM training and static benchmarks overlap, the keys or number patterns of answers are inadvertently memorized (i.e., data contamination), leading to systematic overestimation of the models' reasoning capabilities, especially numerical reasoning. We propose SciDA, a multidisciplinary benchmark that consists exclusively of over 1k Olympic-level numerical computation problems, allowing randomized numerical initializations for each inference round to avoid reliance on fixed numerical patterns. We conduct a series of experiments with both closed-source and open-source top-performing LLMs, and observe that the performance of LLMs drops significantly under random numerical initialization. Thus, we provide truthful and unbiased assessments of the numerical reasoning capabilities of LLMs. The data is available at this https URL
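
Randomized numerical initialization can be illustrated with a small sketch: a problem is stored as a template, fresh numbers are drawn for each evaluation round, and the reference answer is recomputed from those numbers. The template, value ranges, and function names below are illustrative assumptions and are not taken from the SciDA data.

```python
import random

def make_problem(rng):
    """Instantiate a templated physics problem with fresh random numbers.

    The template and value ranges are made up for illustration.
    """
    mass = rng.randint(1, 20)    # kilograms
    accel = rng.randint(1, 10)   # metres per second squared
    text = (f"A {mass} kg block accelerates at {accel} m/s^2. "
            "What net force acts on it, in newtons?")
    return text, mass * accel    # F = m * a

def is_correct(model_answer, reference, rel_tol=1e-6):
    """Numeric match within a relative tolerance."""
    return abs(model_answer - reference) <= rel_tol * max(1.0, abs(reference))

rng = random.Random(0)  # seeded only so the sketch is reproducible
for _ in range(3):
    question, answer = make_problem(rng)
    print(question, "->", answer)
```

Because the reference answer is derived from the sampled values, a model that memorized one fixed answer string gains nothing when the numbers change between rounds.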

Title: PersonaFeedback: A Large-scale Human-annotated Benchmark For Personalization

Authors: Meiling Tao, Chenghao Zhu, Dongyi Ding, Tiannan Wang, Yuchen Eleanor Jiang, Wangchunshu Zhou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.12915
Pdf URL: https://arxiv.org/pdf/2506.12915
Copy Paste: [[2506.12915]] PersonaFeedback: A Large-scale Human-annotated Benchmark For Personalization(https://arxiv.org/abs/2506.12915)
Keywords: llm
Abstract: With the rapid improvement in the general capabilities of LLMs, LLM personalization, i.e., how to build LLM systems that can generate personalized responses or services that are tailored to distinct user personas, has become an increasingly important research and engineering problem. However, unlike many new challenging benchmarks being released for evaluating the general/reasoning capabilities, the lack of high-quality benchmarks for evaluating LLM personalization greatly hinders progress in this field. To address this, we introduce PersonaFeedback, a new benchmark that directly evaluates LLMs' ability to provide personalized responses given pre-defined user personas and queries. Unlike existing benchmarks that require models to infer implicit user personas from historical interactions, PersonaFeedback decouples persona inference from personalization, focusing on evaluating the model's ability to generate responses tailored to explicit personas. PersonaFeedback consists of 8298 human-annotated test cases, which are categorized into easy, medium, and hard tiers based on the contextual complexity of the user personas and the difficulty in distinguishing subtle differences between two personalized responses. We conduct comprehensive evaluations across a wide range of models. The empirical results reveal that even state-of-the-art LLMs that can solve complex real-world reasoning tasks could fall short on the hard tier of PersonaFeedback where even human evaluators may find the distinctions challenging. Furthermore, we conduct an in-depth analysis of failure modes across various types of systems, demonstrating that the current retrieval-augmented framework should not be seen as a de facto solution for personalization tasks. All benchmark data, annotation protocols, and the evaluation pipeline will be publicly available to facilitate future research on LLM personalization.

Title: SoundMind: RL-Incentivized Logic Reasoning for Audio-Language Models

Authors: Xingjian Diao, Chunhui Zhang, Keyi Kong, Weiyi Wu, Chiyu Ma, Zhongyu Ouyang, Peijun Qing, Soroush Vosoughi, Jiang Gui
Subjects: cs.CL, cs.MM, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2506.12935
Pdf URL: https://arxiv.org/pdf/2506.12935
Copy Paste: [[2506.12935]] SoundMind: RL-Incentivized Logic Reasoning for Audio-Language Models(https://arxiv.org/abs/2506.12935)
Keywords: language model
Abstract: While large language models have shown reasoning capabilities, their application to the audio modality, particularly in large audio-language models (ALMs), remains significantly underdeveloped. Addressing this gap requires a systematic approach, involving a capable base model, high-quality reasoning-oriented audio data, and effective training algorithms. In this study, we present a comprehensive solution: we introduce the Audio Logical Reasoning (ALR) dataset, consisting of 6,446 text-audio annotated samples specifically designed for complex reasoning tasks. Building on this resource, we propose SoundMind, a rule-based reinforcement learning (RL) algorithm tailored to endow ALMs with deep bimodal reasoning abilities. By training Qwen2.5-Omni-7B on the ALR dataset using SoundMind, our approach achieves state-of-the-art performance in audio logical reasoning. This work highlights the impact of combining high-quality, reasoning-focused datasets with specialized RL techniques, advancing the frontier of auditory intelligence in language models. Our code and the proposed dataset are available at this https URL.

Title: CliniDial: A Naturally Occurring Multimodal Dialogue Dataset for Team Reflection in Action During Clinical Operation

Authors: Naihao Deng, Kapotaksha Das, Rada Mihalcea, Vitaliy Popov, Mohamed Abouelenien
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.12936
Pdf URL: https://arxiv.org/pdf/2506.12936
Copy Paste: [[2506.12936]] CliniDial: A Naturally Occurring Multimodal Dialogue Dataset for Team Reflection in Action During Clinical Operation(https://arxiv.org/abs/2506.12936)
Keywords: llm
Abstract: In clinical operations, teamwork can be the crucial factor that determines the final outcome. Prior studies have shown that sufficient collaboration is the key factor that determines the outcome of an operation. To understand how the team practices teamwork during the operation, we collected CliniDial from simulations of medical operations. CliniDial includes the audio data and its transcriptions, the simulated physiology signals of the patient manikins, and how the team operates from two camera angles. We annotate behavior codes following an existing framework to understand the teamwork process for CliniDial. We pinpoint three main characteristics of our dataset, including its label imbalances, rich and natural interactions, and multiple modalities, and conduct experiments to test existing LLMs' capabilities on handling data with these characteristics. Experimental results show that CliniDial poses significant challenges to the existing models, inviting future effort on developing methods that can deal with real-world clinical data. We open-source the codebase at this https URL

Title: Assessing the Role of Data Quality in Training Bilingual Language Models

Authors: Skyler Seto, Maartje ter Hoeve, Maureen de Seyssel, David Grangier
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.12966
Pdf URL: https://arxiv.org/pdf/2506.12966
Copy Paste: [[2506.12966]] Assessing the Role of Data Quality in Training Bilingual Language Models(https://arxiv.org/abs/2506.12966)
Keywords: language model
Abstract: Bilingual and multilingual language models offer a promising path toward scaling NLP systems across diverse languages and users. However, their performance often varies widely between languages: prior work shows that adding more languages can degrade performance for some languages (such as English) while improving others (typically more data-constrained languages). In this work, we investigate causes of these inconsistencies by comparing bilingual and monolingual language models. Our analysis reveals that unequal data quality, not just data quantity, is a major driver of performance degradation in bilingual settings. We propose a simple yet effective data filtering strategy that selects higher-quality bilingual training data using only high-quality English data. Applied to French, German, and Chinese, our approach improves monolingual performance by 2-4% and reduces bilingual model performance gaps to 1%. These results highlight the overlooked importance of data quality in multilingual pretraining and offer a practical recipe for balancing performance.

Title: Multi-document Summarization through Multi-document Event Relation Graph Reasoning in LLMs: a case study in Framing Bias Mitigation

Authors: Yuanyuan Lei, Ruihong Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.12978
Pdf URL: https://arxiv.org/pdf/2506.12978
Copy Paste: [[2506.12978]] Multi-document Summarization through Multi-document Event Relation Graph Reasoning in LLMs: a case study in Framing Bias Mitigation(https://arxiv.org/abs/2506.12978)
Keywords: llm, prompt
Abstract: Media outlets are becoming more partisan and polarized nowadays. Most previous work focused on detecting media bias. In this paper, we aim to mitigate media bias by generating a neutralized summary given multiple articles presenting different ideological views. Motivated by the critical role of events and event relations in media bias detection, we propose to increase awareness of bias in LLMs via multi-document event reasoning and use a multi-document event relation graph to guide the summarization process. This graph contains rich event information useful to reveal bias: four common types of in-doc event relations to reflect content framing bias, cross-doc event coreference relation to reveal content selection bias, and event-level moral opinions to highlight opinionated framing bias. We further develop two strategies to incorporate the multi-document event relation graph for neutralized summarization. First, we convert the graph into natural language descriptions and feed the textualized graph into LLMs as part of a hard text prompt. Second, we encode the graph with a graph attention network and insert the graph embedding into LLMs as a soft prompt. Both automatic evaluation and human evaluation confirm that our approach effectively mitigates both lexical and informational media bias while also improving content preservation.
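
The first strategy (textualizing the graph into a hard prompt) can be sketched as below. The relation names, sentence templates, prompt wording, and toy inputs are all assumptions for illustration; the abstract does not list the four in-doc relation types or the prompt format actually used.

```python
def textualize_graph(edges):
    """Render (source, relation, target) triples as plain English sentences."""
    # Relation labels and templates are hypothetical, not taken from the paper.
    templates = {
        "causal": "Event '{s}' causes event '{t}'.",
        "temporal": "Event '{s}' happens before event '{t}'.",
        "coreference": "Event '{s}' and event '{t}' describe the same event in different articles.",
    }
    fallback = "Event '{s}' is linked to event '{t}' by relation '{r}'."
    return [templates.get(r, fallback).format(s=s, t=t, r=r) for s, r, t in edges]

def build_hard_prompt(articles, edges):
    """Concatenate the articles and the textualized graph into one text prompt."""
    parts = ["Articles:"]
    parts += [f"[{i + 1}] {a}" for i, a in enumerate(articles)]
    parts.append("Event relations:")
    parts += textualize_graph(edges)
    parts.append("Write a neutral summary that covers all articles.")
    return "\n".join(parts)

# Made-up toy inputs for illustration only.
articles = [
    "Outlet A: The protest led to arrests.",
    "Outlet B: Police detained marchers after the rally.",
]
edges = [("protest", "causal", "arrests"), ("protest", "coreference", "rally")]
print(build_hard_prompt(articles, edges))
```

The soft-prompt strategy differs in that the graph would be encoded into embedding vectors rather than text, which this sketch does not attempt to show.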

Title: Large Language Models Enhanced by Plug and Play Syntactic Knowledge for Aspect-based Sentiment Analysis

Authors: Yuanhe Tian, Xu Li, Wei Wang, Guoqing Jin, Pengsen Cheng, Yan Song
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.12991
Pdf URL: https://arxiv.org/pdf/2506.12991
Copy Paste: [[2506.12991]] Large Language Models Enhanced by Plug and Play Syntactic Knowledge for Aspect-based Sentiment Analysis(https://arxiv.org/abs/2506.12991)
Keywords: language model, llm
Abstract: Aspect-based sentiment analysis (ABSA) generally requires a deep understanding of the contextual information, including the words associated with the aspect terms and their syntactic dependencies. Most existing studies employ advanced encoders (e.g., pre-trained models), especially large language models (LLMs), to capture such context. However, training these encoders is resource-intensive, and in many cases the available data is insufficient for the necessary fine-tuning. It is therefore challenging to train LLMs under such data and computational efficiency constraints, which motivates the exploration of plug-and-play methods that adapt LLMs to ABSA with minimal effort. In this paper, we propose an approach that integrates extendable components capable of incorporating various types of syntactic knowledge, such as constituent syntax, word dependencies, and combinatory categorial grammar (CCG). Specifically, we propose a memory module that records syntactic information and is incorporated into LLMs to instruct the prediction of sentiment polarities. Importantly, this encoder acts as a versatile, detachable plugin that is trained independently of the LLM. We conduct experiments on benchmark datasets, which show that our approach outperforms strong baselines and previous approaches, thus demonstrating its effectiveness.

Title: Missing the human touch? A computational stylometry analysis of GPT-4 translations of online Chinese literature

Authors: Xiaofang Yao, Yong-Bin Kang, Anthony McCosker
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.13013
Pdf URL: https://arxiv.org/pdf/2506.13013
Copy Paste: [[2506.13013]] Missing the human touch? A computational stylometry analysis of GPT-4 translations of online Chinese literature(https://arxiv.org/abs/2506.13013)
Keywords: language model, gpt, llm
Abstract: Existing research indicates that machine translations (MTs) of literary texts are often unsatisfactory. MTs are typically evaluated using automated metrics and subjective human ratings, with limited focus on stylistic features. Evidence is also limited on whether state-of-the-art large language models (LLMs) will reshape literary translation. This study examines the stylistic features of LLM translations, comparing GPT-4's performance to human translations in a Chinese online literature task. Computational stylometry analysis shows that GPT-4 translations closely align with human translations in lexical, syntactic, and content features, suggesting that LLMs might replicate the 'human touch' in literary translation style. These findings offer insights into AI's impact on literary translation from a posthuman perspective, where distinctions between machine and human translations become increasingly blurry.

Title: Just Go Parallel: Improving the Multilingual Capabilities of Large Language Models

Authors: Muhammad Reza Qorib, Junyi Li, Hwee Tou Ng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.13044
Pdf URL: https://arxiv.org/pdf/2506.13044
Copy Paste: [[2506.13044]] Just Go Parallel: Improving the Multilingual Capabilities of Large Language Models(https://arxiv.org/abs/2506.13044)
Keywords: language model, llm
Abstract: Large language models (LLMs) have demonstrated impressive translation capabilities even without being explicitly trained on parallel data. This remarkable property has led some to believe that parallel data is no longer necessary for building multilingual language models. While some attribute this to the emergent abilities of LLMs due to scale, recent work suggests that it is actually caused by incidental bilingual signals present in the training data. Various methods have been proposed to maximize the utility of parallel data to enhance the multilingual capabilities of multilingual encoder-based and encoder-decoder language models. However, some decoder-based LLMs opt to ignore parallel data instead. In this work, we conduct a systematic study on the impact of adding parallel data on LLMs' multilingual capabilities, focusing specifically on translation and multilingual common-sense reasoning. Through controlled experiments, we demonstrate that parallel data can significantly improve LLMs' multilingual capabilities.

Title: CFBenchmark-MM: Chinese Financial Assistant Benchmark for Multimodal Large Language Model

Authors: Jiangtong Li, Yiyun Zhu, Dawei Cheng, Zhijun Ding, Changjun Jiang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.13055
Pdf URL: https://arxiv.org/pdf/2506.13055
Copy Paste: [[2506.13055]] CFBenchmark-MM: Chinese Financial Assistant Benchmark for Multimodal Large Language Model(https://arxiv.org/abs/2506.13055)
Keywords: language model, llm
Abstract: Multimodal Large Language Models (MLLMs) have rapidly evolved with the growth of Large Language Models (LLMs) and are now applied in various fields. In finance, the integration of diverse modalities such as text, charts, and tables is crucial for accurate and efficient decision-making. Therefore, an effective evaluation system that incorporates these data types is essential for advancing financial applications. In this paper, we introduce CFBenchmark-MM, a Chinese multimodal financial benchmark with over 9,000 image-question pairs featuring tables, histogram charts, line charts, pie charts, and structural diagrams. Additionally, we develop a staged evaluation system to assess MLLMs in handling multimodal information by providing different visual content step by step. Despite MLLMs having inherent financial knowledge, experimental results still show limited efficiency and robustness in handling multimodal financial contexts. Further analysis of incorrect responses reveals that the misinterpretation of visual content and the misunderstanding of financial concepts are the primary issues. Our research validates the significant, yet underexploited, potential of MLLMs in financial analysis, highlighting the need for further development and domain-specific optimization to encourage enhanced use in the financial domain.
摘要：多模式大型语言模型（MLLM）随着大语言模型（LLM）的增长迅速发展，现在在各个领域应用。在金融中，诸如文本，图表和表格之类的不同方式的整合对于准确有效的决策至关重要。因此，合并这些数据类型的有效评估系统对于推进财务应用至关重要。在本文中，我们介绍了CFBENCHMACK-MM，这是一种中国多模式的基准，其9,000多个图像问题对，具有桌子，直方图图表，线路图，饼图，饼图和结构图。此外，我们开发了一个分阶段的评估系统，通过逐步提供不同的视觉内容来评估MLLM的处理多模式信息。尽管MLLM具有固有的财务知识，但实验结果仍然显示出有限的效率和鲁棒性处理多模式财务环境。对不正确响应的进一步分析表明，对视觉内容的误解和对财务概念的误解是主要问题。我们的研究验证了MLLM在财务分析中的巨大而又没有被普遍的潜力，强调了对进一步开发和特定领域的优化的需求，以鼓励增强在金融领域中的使用。

Title: Multipole Attention for Efficient Long Context Reasoning

Authors: Coleman Hooper, Sebastian Zhao, Luca Manolache, Sehoon Kim, Michael W. Mahoney, Yakun Sophia Shao, Kurt Keutzer, Amir Gholami
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.13059
Pdf URL: https://arxiv.org/pdf/2506.13059
Copy Paste: [[2506.13059]] Multipole Attention for Efficient Long Context Reasoning(https://arxiv.org/abs/2506.13059)
Keywords: long context, prompt, chain-of-thought
Abstract: Large Reasoning Models (LRMs) have shown promising accuracy improvements on complex problem-solving tasks. While these models have attained high accuracy by leveraging additional computation at test time, they need to generate long chain-of-thought reasoning in order to think before answering, which requires generating thousands of tokens. While sparse attention methods can help reduce the KV cache pressure induced by this long autoregressive reasoning, these methods can introduce errors which disrupt the reasoning process. Additionally, prior methods often pre-process the input to make it easier to identify the important prompt tokens when computing attention during generation, and this pre-processing is challenging to perform online for newly generated reasoning tokens. Our work addresses these challenges by introducing Multipole Attention, which accelerates autoregressive reasoning by only computing exact attention for the most important tokens, while maintaining approximate representations for the remaining tokens. Our method first performs clustering to group together semantically similar key vectors, and then uses the cluster centroids both to identify important key vectors and to approximate the remaining key vectors in order to retain high accuracy. We design a fast cluster update process to quickly re-cluster the input and previously generated tokens, thereby allowing for accelerating attention to the previous output tokens. We evaluate our method using emerging LRMs such as Qwen-8B, demonstrating that our approach can maintain accuracy on complex reasoning tasks even with aggressive attention sparsity settings. We also provide kernel implementations to demonstrate the practical efficiency gains from our method, achieving up to 4.5× speedup for attention in long-context reasoning applications. Our code is available at this https URL.
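Editor's sketch (not the authors' implementation): the abstract above describes computing exact attention only for keys in the most relevant clusters and approximating the rest through cluster centroids. The Python below illustrates that idea for a single query under stated assumptions. Cluster assignments are taken as given (the clustering step itself is omitted), and the names `clustered_attention`, `assign`, and `top_k`, as well as the choice to stand in for an approximated cluster with its mean value vector weighted by cluster size, are hypothetical choices made for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mean(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def exact_attention(q, keys, values):
    """Standard softmax attention for one query (reference implementation)."""
    scores = [dot(q, k) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / z for d in range(dim)]

def clustered_attention(q, keys, values, assign, top_k):
    """Exact attention inside the top_k clusters, centroid approximation elsewhere."""
    clusters = {}
    for i, c in enumerate(assign):
        clusters.setdefault(c, []).append(i)
    centroids = {c: mean([keys[i] for i in idx]) for c, idx in clusters.items()}
    # Rank clusters by how strongly the query matches their centroid.
    ranked = sorted(clusters, key=lambda c: dot(q, centroids[c]), reverse=True)
    exact = set(ranked[:top_k])
    terms = []  # (score, multiplicity, value vector)
    for c, idx in clusters.items():
        if c in exact:
            terms.extend((dot(q, keys[i]), 1, values[i]) for i in idx)
        else:
            # Assumption: one centroid term stands in for all keys of the cluster.
            avg_value = mean([values[i] for i in idx])
            terms.append((dot(q, centroids[c]), len(idx), avg_value))
    m = max(s for s, _, _ in terms)
    weights = [n * math.exp(s - m) for s, n, _ in terms]
    z = sum(weights)
    dim = len(values[0])
    return [sum(w * t[2][d] for w, t in zip(weights, terms)) / z for d in range(dim)]
```

With `top_k` equal to the number of clusters the function reduces to exact attention. With a smaller `top_k`, the error depends on how tightly the keys in each approximated cluster sit around their centroid.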

Title: MotiveBench: How Far Are We From Human-Like Motivational Reasoning in Large Language Models?

Authors: Xixian Yong, Jianxun Lian, Xiaoyuan Yi, Xiao Zhou, Xing Xie
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.13065
Pdf URL: https://arxiv.org/pdf/2506.13065
Copy Paste: [[2506.13065]] MotiveBench: How Far Are We From Human-Like Motivational Reasoning in Large Language Models?(https://arxiv.org/abs/2506.13065)
Keywords: language model, llm, agent
Abstract: Large language models (LLMs) have been widely adopted as the core of agent frameworks in various scenarios, such as social simulations and AI companions. However, the extent to which they can replicate human-like motivations remains an underexplored question. Existing benchmarks are constrained by simplistic scenarios and the absence of character identities, resulting in an information asymmetry with real-world situations. To address this gap, we propose MotiveBench, which consists of 200 rich contextual scenarios and 600 reasoning tasks covering multiple levels of motivation. Using MotiveBench, we conduct extensive experiments on seven popular model families, comparing different scales and versions within each family. The results show that even the most advanced LLMs still fall short in achieving human-like motivational reasoning. Our analysis reveals key findings, including the difficulty LLMs face in reasoning about "love & belonging" motivations and their tendency toward excessive rationality and idealism. These insights highlight a promising direction for future research on the humanization of LLMs. The dataset, benchmark, and code are available at this https URL.

Title: CHILL at SemEval-2025 Task 2: You Can't Just Throw Entities and Hope -- Make Your LLM to Get Them Right

Authors: Jaebok Lee, Yonghyun Ryu, Seongmin Park, Yoonjung Choi
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.13070
Pdf URL: https://arxiv.org/pdf/2506.13070
Copy Paste: [[2506.13070]] CHILL at SemEval-2025 Task 2: You Can't Just Throw Entities and Hope -- Make Your LLM to Get Them Right(https://arxiv.org/abs/2506.13070)
Keywords: language model, llm, retrieval augmented generation
Abstract: In this paper, we describe our approach for the SemEval 2025 Task 2 on Entity-Aware Machine Translation (EA-MT). Our system aims to improve the accuracy of translating named entities by combining two key approaches: Retrieval Augmented Generation (RAG) and iterative self-refinement techniques using Large Language Models (LLMs). A distinctive feature of our system is its self-evaluation mechanism, where the LLM assesses its own translations based on two key criteria: the accuracy of entity translations and overall translation quality. We demonstrate how these methods work together and effectively improve entity handling while maintaining high-quality translations.

Title: Rethinking Test-Time Scaling for Medical AI: Model and Task-Aware Strategies for LLMs and VLMs

Authors: Gyutaek Oh, Seoyeon Kim, Sangjoon Park, Byung-Hoon Kim
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2506.13102
Pdf URL: https://arxiv.org/pdf/2506.13102
Copy Paste: [[2506.13102]] Rethinking Test-Time Scaling for Medical AI: Model and Task-Aware Strategies for LLMs and VLMs(https://arxiv.org/abs/2506.13102)
Keywords: language model, llm, prompt
Abstract: Test-time scaling has recently emerged as a promising approach for enhancing the reasoning capabilities of large language models or vision-language models during inference. Although a variety of test-time scaling strategies have been proposed, and interest in their application to the medical domain is growing, many critical aspects remain underexplored, including their effectiveness for vision-language models and the identification of optimal strategies for different settings. In this paper, we conduct a comprehensive investigation of test-time scaling in the medical domain. We evaluate its impact on both large language models and vision-language models, considering factors such as model size, inherent model characteristics, and task complexity. Finally, we assess the robustness of these strategies under user-driven factors, such as misleading information embedded in prompts. Our findings offer practical guidelines for the effective use of test-time scaling in medical applications and provide insights into how these strategies can be further refined to meet the reliability and interpretability demands of the medical domain.

Title: Leveraging In-Context Learning for Language Model Agents

Authors: Shivanshu Gupta, Sameer Singh, Ashish Sabharwal, Tushar Khot, Ben Bogin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.13109
Pdf URL: https://arxiv.org/pdf/2506.13109
Copy Paste: [[2506.13109]] Leveraging In-Context Learning for Language Model Agents(https://arxiv.org/abs/2506.13109)
Keywords: language model, llm, prompt, agent
Abstract: In-context learning (ICL) with dynamically selected demonstrations combines the flexibility of prompting large language models (LLMs) with the ability to leverage training data to improve performance. While ICL has been highly successful for prediction and generation tasks, leveraging it for agentic tasks that require sequential decision making is challenging -- one must think not only about how to annotate long trajectories at scale and how to select demonstrations, but also what constitutes demonstrations, and when and where to show them. To address this, we first propose an algorithm that leverages an LLM with retries along with demonstrations to automatically and efficiently annotate agentic tasks with solution trajectories. We then show that set-selection of trajectories of similar tasks as demonstrations significantly improves performance, reliability, robustness, and efficiency of LLM agents. However, trajectory demonstrations have a large inference cost overhead. We show that this can be mitigated by using small trajectory snippets at every step instead of an additional trajectory. We find that demonstrations obtained from larger models (in the annotation phase) also improve smaller models, and that ICL agents can even rival costlier trained agents. Thus, our results reveal that ICL, with careful use, can be very powerful for agentic tasks as well.

Title: Adapting LLMs for Minimal-edit Grammatical Error Correction

Authors: Ryszard Staruch, Filip Graliński, Daniel Dzienisiewicz
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.13148
Pdf URL: https://arxiv.org/pdf/2506.13148
Copy Paste: [[2506.13148]] Adapting LLMs for Minimal-edit Grammatical Error Correction(https://arxiv.org/abs/2506.13148)
Keywords: language model, llm
Abstract: Decoder-only large language models have shown superior performance in the fluency-edit English Grammatical Error Correction, but their adaptation for minimal-edit English GEC is still underexplored. To improve their effectiveness in the minimal-edit approach, we explore the error rate adaptation topic and propose a novel training schedule method. Our experiments set a new state-of-the-art result for a single-model system on the BEA-test set. We also detokenize the most common English GEC datasets to match the natural way of writing text. During the process, we find that there are errors in them. Our experiments analyze whether training on detokenized datasets impacts the results and measure the impact of the usage of the datasets with corrected erroneous examples. To facilitate reproducibility, we have released the source code used to train our models.

Title: Ai-Facilitated Analysis of Abstracts and Conclusions: Flagging Unsubstantiated Claims and Ambiguous Pronouns

Authors: Evgeny Markhasin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.13172
Pdf URL: https://arxiv.org/pdf/2506.13172
Copy Paste: [[2506.13172]] Ai-Facilitated Analysis of Abstracts and Conclusions: Flagging Unsubstantiated Claims and Ambiguous Pronouns(https://arxiv.org/abs/2506.13172)
Keywords: language model, gpt, llm, prompt, chat, multi-run
Abstract: We present and evaluate a suite of proof-of-concept (PoC), structured workflow prompts designed to elicit human-like hierarchical reasoning while guiding Large Language Models (LLMs) in high-level semantic and linguistic analysis of scholarly manuscripts. The prompts target two non-trivial analytical tasks: identifying unsubstantiated claims in summaries (informational integrity) and flagging ambiguous pronoun references (linguistic clarity). We conducted a systematic, multi-run evaluation on two frontier models (Gemini Pro 2.5 Pro and ChatGPT Plus o3) under varied context conditions. Our results for the informational integrity task reveal a significant divergence in model performance: while both models successfully identified an unsubstantiated head of a noun phrase (95% success), ChatGPT consistently failed (0% success) to identify an unsubstantiated adjectival modifier that Gemini correctly flagged (95% success), raising a question regarding potential influence of the target's syntactic role. For the linguistic analysis task, both models performed well (80-90% success) with full manuscript context. In a summary-only setting, however, ChatGPT achieved a perfect (100%) success rate, while Gemini's performance was substantially degraded. Our findings suggest that structured prompting is a viable methodology for complex textual analysis but show that prompt performance may be highly dependent on the interplay between the model, task type, and context, highlighting the need for rigorous, model-specific testing.

Title: Development of the user-friendly decision aid Rule-based Evaluation and Support Tool (REST) for optimizing the resources of an information extraction task

Authors: Guillaume Bazin, Xavier Tannier, Fanny Adda, Ariel Cohen, Akram Redjdal, Emmanuelle Kempf
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.13177
Pdf URL: https://arxiv.org/pdf/2506.13177
Copy Paste: [[2506.13177]] Development of the user-friendly decision aid Rule-based Evaluation and Support Tool (REST) for optimizing the resources of an information extraction task(https://arxiv.org/abs/2506.13177)
Keywords: llm
Abstract: Rules could be the default option for information extraction (IE), compared with ML and LLMs, in terms of sustainability, transferability, interpretability, and development burden. We suggest a sustainable, combined use of rules and ML as an IE method. Our approach starts with exhaustive manual highlighting by an expert, in a single working session, of a representative subset of the data corpus. We developed and validated the feasibility and the performance metrics of the REST decision tool to help the annotator choose between rules as the default option and ML for each entity of an IE task. REST lets the annotator visualize the characteristics of each entity's formalization in the free texts, along with the expected rule development feasibility and IE performance metrics. ML is considered a backup IE option, so manual annotation for training is minimized. External validation of REST on a 12-entity use case showed good reproducibility.

Title: Enhancing Large Language Models with Reliable Knowledge Graphs

Authors: Qinggang Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.13178
Pdf URL: https://arxiv.org/pdf/2506.13178
Copy Paste: [[2506.13178]] Enhancing Large Language Models with Reliable Knowledge Graphs(https://arxiv.org/abs/2506.13178)
Keywords: language model, gpt, llm, prompt
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in text generation and understanding, yet their reliance on implicit, unstructured knowledge often leads to factual inaccuracies and limited interpretability. Knowledge Graphs (KGs), with their structured, relational representations, offer a promising solution to ground LLMs in verified knowledge. However, their potential remains constrained by inherent noise, incompleteness, and the complexity of integrating their rigid structure with the flexible reasoning of LLMs. This thesis presents a systematic framework to address these limitations, advancing the reliability of KGs and their synergistic integration with LLMs through five interconnected contributions. This thesis addresses these challenges through a cohesive framework that enhances LLMs by refining and leveraging reliable KGs. First, we introduce contrastive error detection, a structure-based method to identify incorrect facts in KGs. This approach is extended by an attribute-aware framework that unifies structural and semantic signals for error correction. Next, we propose an inductive completion model that further refines KGs by completing the missing relationships in evolving KGs. Building on these refined KGs, KnowGPT integrates structured graph reasoning into LLMs through dynamic prompting, improving factual grounding. These contributions form a systematic pipeline (from error detection to LLM integration), demonstrating that reliable KGs significantly enhance the robustness, interpretability, and adaptability of LLMs.

Title: Align-then-Unlearn: Embedding Alignment for LLM Unlearning

Authors: Philipp Spohn, Leander Girrbach, Jessica Bader, Zeynep Akata
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.13181
Pdf URL: https://arxiv.org/pdf/2506.13181
Copy Paste: [[2506.13181]] Align-then-Unlearn: Embedding Alignment for LLM Unlearning(https://arxiv.org/abs/2506.13181)
Keywords: language model, llm, prompt
Abstract: As large language models (LLMs) are trained on massive datasets, they have raised significant privacy and ethical concerns due to their potential to inadvertently retain sensitive information. Unlearning seeks to selectively remove specific data from trained models, such as personal information or copyrighted content. Current approaches targeting specific output sequences at the token level often fail to achieve complete forgetting and remain susceptible to prompt rephrasing. We propose Align-then-Unlearn, a novel framework that performs unlearning in the semantic embedding space rather than directly on output tokens. Align-then-Unlearn first augments the LLM with an embedding prediction module trained to anticipate future context representations. Unlearning is then achieved by fine-tuning the model to minimize the similarity between these predicted embeddings and a target embedding that represents the concept to be removed. Initial results show that Align-then-Unlearn effectively removes targeted knowledge with minimal degradation in overall model utility. These findings suggest that embedding-based unlearning offers a promising and robust approach to removing conceptual knowledge. Our code is available at this https URL.
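Editor's sketch (not the authors' code): the Align-then-Unlearn abstract describes fine-tuning the model to minimize the similarity between predicted embeddings and a target embedding for the concept to be removed. The Python below shows one way such an objective could be written. Cosine similarity, the hinge form, and the `margin` parameter are assumptions made for illustration; the abstract does not specify them.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def unlearning_loss(predicted, target, margin=0.0):
    """Mean hinge penalty on similarity to the concept embedding.

    The penalty for a prediction is zero once its cosine similarity to
    `target` is at or below `margin` (a hypothetical stopping threshold).
    """
    penalties = [max(0.0, cosine(p, target) - margin) for p in predicted]
    return sum(penalties) / len(penalties)

# One prediction aligned with the concept, one orthogonal to it.
concept = [1.0, 0.0]
print(unlearning_loss([[1.0, 0.0], [0.0, 1.0]], concept))  # 0.5
```

In a real training loop this quantity would be minimized with respect to the model parameters by an autodiff framework; the sketch only evaluates the loss.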

Title: Breaking Thought Patterns: A Multi-Dimensional Reasoning Framework for LLMs

Authors: Xintong Tang, Meiru Zhang, Shang Xiao, Junzhao Jin, Zihan Zhao, Liwei Li, Yang Zheng, Bangyi Wu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.13192
Pdf URL: https://arxiv.org/pdf/2506.13192
Copy Paste: [[2506.13192]] Breaking Thought Patterns: A Multi-Dimensional Reasoning Framework for LLMs(https://arxiv.org/abs/2506.13192)
Keywords: language model, llm, chain-of-thought
Abstract: Large language models (LLMs) are often constrained by rigid reasoning processes, limiting their ability to generate creative and diverse responses. To address this, a novel framework called LADDER is proposed, combining Chain-of-Thought (CoT) reasoning, Mixture of Experts (MoE) models, and multi-dimensional up/down-sampling strategies, which together break the limitations of traditional LLMs. First, CoT reasoning guides the model through multi-step logical reasoning, expanding the semantic space and breaking the rigidity of thought. Next, MoE distributes the reasoning tasks across multiple expert modules, each focusing on specific sub-tasks. Finally, dimensionality reduction maps the reasoning outputs back to a lower-dimensional semantic space, yielding more precise and creative responses. Extensive experiments across multiple tasks demonstrate that LADDER significantly improves task completion, creativity, and fluency, generating innovative and coherent responses that outperform traditional models. Ablation studies reveal the critical roles of CoT and MoE in enhancing reasoning abilities and creative output. This work contributes to the development of more flexible and creative LLMs, capable of addressing complex and novel tasks.

Title: Do Music Preferences Reflect Cultural Values? A Cross-National Analysis Using Music Embedding and World Values Survey

Authors: Yongjae Kim, Seongchan Park
Subjects: cs.CL, cs.SD
Abstract URL: https://arxiv.org/abs/2506.13199
Pdf URL: https://arxiv.org/pdf/2506.13199
Copy Paste: [[2506.13199]] Do Music Preferences Reflect Cultural Values? A Cross-National Analysis Using Music Embedding and World Values Survey(https://arxiv.org/abs/2506.13199)
Keywords: gpt
Abstract: This study explores the extent to which national music preferences reflect underlying cultural values. We collected long-term popular music data from YouTube Music Charts across 62 countries, encompassing both Western and non-Western regions, and extracted audio embeddings using the CLAP model. To complement these quantitative representations, we generated semantic captions for each track using LP-MusicCaps and GPT-based summarization. Countries were clustered based on contrastive embeddings that highlight deviations from global musical norms. The resulting clusters were projected into a two-dimensional space via t-SNE for visualization and evaluated against cultural zones defined by the World Values Survey (WVS). Statistical analyses, including MANOVA and chi-squared tests, confirmed that music-based clusters exhibit significant alignment with established cultural groupings. Furthermore, residual analysis revealed consistent patterns of overrepresentation, suggesting non-random associations between specific clusters and cultural zones. These findings indicate that national-level music preferences encode meaningful cultural signals and can serve as a proxy for understanding global cultural boundaries.
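Editor's sketch (illustrative only, with made-up counts): the abstract above mentions chi-squared tests and a residual analysis of how music-based clusters line up with World Values Survey cultural zones. The Python below computes the Pearson chi-squared statistic and per-cell Pearson residuals for a cluster-by-zone contingency table. The function name `chi_squared` and the use of Pearson residuals are assumptions; the abstract does not say which residual type the authors used.

```python
import math

def chi_squared(table):
    """Pearson chi-squared statistic, degrees of freedom, and residuals.

    `table[i][j]` is the number of countries in music cluster i and
    cultural zone j. Every row and column must have a non-zero total.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    residuals = []
    for i, row in enumerate(table):
        residual_row = []
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            # Positive residual: the cell is overrepresented relative to independence.
            residual = (observed - expected) / math.sqrt(expected)
            residual_row.append(residual)
            statistic += residual * residual
        residuals.append(residual_row)
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof, residuals

# Hypothetical table: two clusters perfectly aligned with two zones.
stat, dof, res = chi_squared([[10, 0], [0, 10]])
print(stat, dof)  # statistic is approximately 20 with 1 degree of freedom
```

Turning the statistic into a p-value needs the chi-squared distribution with `dof` degrees of freedom, which a statistics library would supply; it is left out here to keep the sketch dependency-free.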

Title: Capability Salience Vector: Fine-grained Alignment of Loss and Capabilities for Downstream Task Scaling Law

Authors: Qiming Ge, Shuhao Xing, Songyang Gao, Yunhua Zhou, Yicheng Zou, Songyang Zhang, Zhi Chen, Hang Yan, Qi Zhang, Qipeng Guo, Kai Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.13216
Pdf URL: https://arxiv.org/pdf/2506.13216
Copy Paste: [[2506.13216]] Capability Salience Vector: Fine-grained Alignment of Loss and Capabilities for Downstream Task Scaling Law(https://arxiv.org/abs/2506.13216)
Keywords: language model
Abstract: Scaling laws establish the relationship between training computation and validation loss, enabling researchers to effectively predict the loss trend of models across different levels of computation. However, a gap still remains between validation loss and the model's downstream capabilities, making it non-trivial to apply scaling laws to direct performance prediction for downstream tasks. The loss typically represents a cumulative penalty for predicted tokens, which are implicitly considered to have equal importance. Nevertheless, our studies have shown evidence that when considering different training data distributions, we cannot directly model the relationship between downstream capability and computation or token loss. To bridge the gap between validation loss and downstream task capabilities, in this work, we introduce Capability Salience Vector, which decomposes the overall loss and assigns different importance weights to tokens to assess a specific meta-capability, aligning the validation loss with downstream task performance in terms of the model's capabilities. Experiments on various popular benchmarks demonstrate that our proposed Capability Salience Vector could significantly improve the predictability of language model performance on downstream tasks.

Title: IGD: Token Decisiveness Modeling via Information Gain in LLMs for Personalized Recommendation

Authors: Zijie Lin, Yang Zhang, Xiaoyan Zhao, Fengbin Zhu, Fuli Feng, Tat-Seng Chua
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.13229
Pdf URL: https://arxiv.org/pdf/2506.13229
Copy Paste: [[2506.13229]] IGD: Token Decisiveness Modeling via Information Gain in LLMs for Personalized Recommendation(https://arxiv.org/abs/2506.13229)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have shown strong potential for recommendation by framing item prediction as a token-by-token language generation task. However, existing methods treat all item tokens equally, simply pursuing likelihood maximization during both optimization and decoding. This overlooks crucial token-level differences in decisiveness: many tokens contribute little to item discrimination yet can dominate optimization or decoding. To quantify token decisiveness, we propose a novel perspective that models item generation as a decision process, measuring token decisiveness by the Information Gain (IG) each token provides in reducing uncertainty about the generated item. Our empirical analysis reveals that most tokens have low IG but often correspond to high logits, disproportionately influencing training loss and decoding, which may impair model performance. Building on these insights, we introduce an Information Gain-based Decisiveness-aware Token handling (IGD) strategy that integrates token decisiveness into both tuning and decoding. Specifically, IGD downweights low-IG tokens during tuning and rebalances decoding to emphasize tokens with high IG. In this way, IGD moves beyond pure likelihood maximization, effectively prioritizing high-decisiveness tokens. Extensive experiments on four benchmark datasets with two LLM backbones demonstrate that IGD consistently improves recommendation accuracy, achieving significant gains on widely used ranking metrics compared to strong baselines.
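Editor's sketch (not the authors' formulation): the IGD abstract measures a token's decisiveness by the information gain it provides about which item is being generated, then down-weights low-IG tokens during tuning. The Python below works through that idea on a toy catalog. It assumes a uniform prior over catalog items, so the entropy after a prefix is simply the log of the number of items still matching it. The function names, the `beta` down-weighting factor, and the `threshold` are hypothetical.

```python
import math

def information_gains(item, catalog):
    """Per-token information gain (in bits) for `item`, which must be in `catalog`.

    Assumes a uniform prior over items: entropy after a prefix is
    log2(number of catalog items consistent with that prefix).
    """
    gains = []
    candidates = list(catalog)
    for pos, token in enumerate(item):
        remaining = [c for c in candidates if len(c) > pos and c[pos] == token]
        gains.append(math.log2(len(candidates)) - math.log2(len(remaining)))
        candidates = remaining
    return gains

def token_weights(gains, beta=0.1, threshold=0.0):
    """Down-weight tokens whose information gain is at or below `threshold`."""
    return [beta if g <= threshold else 1.0 for g in gains]

def weighted_nll(token_nlls, weights):
    """Weighted average of per-token negative log-likelihoods."""
    return sum(w * l for w, l in zip(weights, token_nlls)) / sum(weights)

catalog = [
    ("the", "red", "car"),
    ("the", "red", "bus"),
    ("the", "blue", "car"),
    ("a", "green", "bike"),
]
gains = information_gains(("a", "green", "bike"), catalog)
print(gains)                 # [2.0, 0.0, 0.0]: only the first token is decisive
print(token_weights(gains))  # [1.0, 0.1, 0.1]
```

In the toy catalog, "a" identifies the item outright, so the two tokens after it carry no further information and receive the reduced weight. The abstract's decoding-time rebalancing is not covered by this sketch.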

Title: AceReason-Nemotron 1.1: Advancing Math and Code Reasoning through SFT and RL Synergy

Authors: Zihan Liu, Zhuolin Yang, Yang Chen, Chankyu Lee, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.13284
Pdf URL: https://arxiv.org/pdf/2506.13284
Copy Paste: [[2506.13284]] AceReason-Nemotron 1.1: Advancing Math and Code Reasoning through SFT and RL Synergy(https://arxiv.org/abs/2506.13284)
Keywords: prompt
Abstract: In this work, we investigate the synergy between supervised fine-tuning (SFT) and reinforcement learning (RL) in developing strong reasoning models. We begin by curating the SFT training data through two scaling strategies: increasing the number of collected prompts and the number of generated responses per prompt. Both approaches yield notable improvements in reasoning performance, with scaling the number of prompts resulting in more substantial gains. We then explore the following questions regarding the synergy between SFT and RL: (i) Does a stronger SFT model consistently lead to better final performance after large-scale RL training? (ii) How can we determine an appropriate sampling temperature during RL training to effectively balance exploration and exploitation for a given SFT initialization? Our findings suggest that (i) holds true, provided effective RL training is conducted, particularly when the sampling temperature is carefully chosen to maintain the temperature-adjusted entropy around 0.3, a setting that strikes a good balance between exploration and exploitation. Notably, the performance gap between initial SFT models narrows significantly throughout the RL process. Leveraging a strong SFT foundation and insights into the synergistic interplay between SFT and RL, our AceReason-Nemotron-1.1 7B model significantly outperforms AceReason-Nemotron-1.0 and achieves new state-of-the-art performance among Qwen2.5-7B-based reasoning models on challenging math and code benchmarks, thereby demonstrating the effectiveness of our post-training recipe. We release the model and data at: this https URL
摘要：在这项工作中，我们研究了有监督的微调（SFT）与增强学习（RL）之间的协同作用，以开发强大的推理模型。我们首先通过两种缩放策略来策划SFT培训数据：增加收集的提示的数量和每个提示的生成响应数量。两种方法都在推理性能方面取得了显着改善，并扩展了提示的数量，从而取得了更大的收益。然后，我们探讨了有关SFT和RL之间的协同作用的以下问题：（i）强大的SFT模型会在大规模RL培训后始终导致更好的最终性能吗？（ii）如何在RL训练期间确定适当的采样温度，以有效地平衡给定SFT初始化的探索和剥削？我们的发现表明（i）是正确的，只要进行有效的RL训练，尤其是当仔细选择采样温度以维持温度调整的熵时，该设置在探索和开发之间达到了良好的平衡。值得注意的是，在整个RL过程中，初始SFT模型之间的性能差距大大缩小。我们的AceReason-Nemotron-1.1 7b模型利用强大的SFT基础和对SFT和RL之间的协同相互作用的见解，显着超过了AceReason-Nemotron-1.0，并在基于QWEN2.5-7B的基于QWEN2.5-7B基于QWEN2.5-7B的推理模型上实现了在有挑战的数学和代码中的基础上的新型推理模型。我们在以下位置发布模型和数据：此HTTPS URL
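
One way to read the "temperature-adjusted entropy" in the abstract above is as the entropy of the next-token distribution after the logits are divided by the sampling temperature. The sketch below takes that reading as an assumption and searches for the temperature that brings a single toy distribution to an entropy of 0.3. The logits, the bisection bounds, and all function names are illustrative and not taken from the paper, which would presumably average the entropy over many sampled tokens.

```python
import math

def softmax(logits, temperature):
    # Divide logits by the temperature, then normalise (max-shifted for stability).
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def entropy(probs):
    # Shannon entropy in nats.
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def temperature_for_entropy(logits, target, lo=0.05, hi=5.0, iters=60):
    # Entropy of softmax(logits / T) grows with T for non-constant logits,
    # so a simple bisection on T is enough for this toy case.
    for _ in range(iters):
        mid = (lo + hi) / 2
        if entropy(softmax(logits, mid)) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

logits = [4.0, 2.0, 1.0, 0.5, 0.0]  # hypothetical next-token logits
t = temperature_for_entropy(logits, target=0.3)
adjusted_entropy = entropy(softmax(logits, t))
```

For these toy logits the entropy at temperature 1.0 is above 0.3, so the search returns a temperature below 1.0, which makes sampling more exploitative.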

Title: Mitigating Safety Fallback in Editing-based Backdoor Injection on LLMs

Authors: Houcheng Jiang, Zetong Zhao, Junfeng Fang, Haokai Ma, Ruipeng Wang, Yang Deng, Xiang Wang, Xiangnan He
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.13285
Pdf URL: https://arxiv.org/pdf/2506.13285
Copy Paste: [[2506.13285]] Mitigating Safety Fallback in Editing-based Backdoor Injection on LLMs(https://arxiv.org/abs/2506.13285)
Keywords: language model, llm
Abstract: Large language models (LLMs) have shown strong performance across natural language tasks, but remain vulnerable to backdoor attacks. Recent model editing-based approaches enable efficient backdoor injection by directly modifying parameters to map specific triggers to attacker-desired responses. However, these methods often suffer from safety fallback, where the model initially responds affirmatively but later reverts to refusals due to safety alignment. In this work, we propose DualEdit, a dual-objective model editing framework that jointly promotes affirmative outputs and suppresses refusal responses. To address two key challenges -- balancing the trade-off between affirmative promotion and refusal suppression, and handling the diversity of refusal expressions -- DualEdit introduces two complementary techniques. (1) Dynamic loss weighting calibrates the objective scale based on the pre-edited model to stabilize optimization. (2) Refusal value anchoring compresses the suppression target space by clustering representative refusal value vectors, reducing optimization conflict from overly diverse token sets. Experiments on safety-aligned LLMs show that DualEdit improves attack success by 9.98% and reduces safety fallback rate by 10.88% over baselines.
摘要：大型语言模型（LLMS）在自然语言任务中表现出很强的表现，但仍然容易受到后门攻击的影响。最新的基于模型编辑的方法可以通过直接修改参数将特定触发器映射到攻击者态度的响应来实现有效的后门注入。但是，这些方法通常会遭受安全性后备的困扰，该方法最初响应肯定，但后来由于安全对准而拒绝拒绝。在这项工作中，我们提出了Dualedit，这是一种双重目标模型编辑框架，共同促进肯定的产出并抑制拒绝响应。为了应对两个主要挑战 - 平衡肯定的晋升与拒绝抑制之间的权衡，以及处理拒绝表达的多样性 - Dualedit引入了两种补充技术。（1）动态损失减肥根据预先编辑的模型校准了客观量表，以稳定优化。（2）拒绝锚定值通过聚类代表性拒绝值向量来压缩抑制目标空间，从而减少了来自过度多样的令牌集的优化冲突。安全对准LLMS的实验表明，Dualedit将攻击成功提高了9.98 \％，并将安全性后备率降低了10.88％\％。

Title: Seewo's Submission to MLC-SLM: Lessons learned from Speech Reasoning Language Models

Authors: Bo Li, Chengben Xu, Wufeng Zhang
Subjects: cs.CL, cs.AI, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2506.13300
Pdf URL: https://arxiv.org/pdf/2506.13300
Copy Paste: [[2506.13300]] Seewo's Submission to MLC-SLM: Lessons learned from Speech Reasoning Language Models(https://arxiv.org/abs/2506.13300)
Keywords: language model, chain-of-thought
Abstract: This paper presents Seewo's systems for both tracks of the Multilingual Conversational Speech Language Model Challenge (MLC-SLM), addressing automatic speech recognition (ASR) and speaker diarization with ASR (SD-ASR). We introduce a multi-stage training pipeline that explicitly enhances reasoning and self-correction in speech language models for ASR. Our approach combines curriculum learning for progressive capability acquisition, Chain-of-Thought data augmentation to foster intermediate reflection, and Reinforcement Learning with Verifiable Rewards (RLVR) to further refine self-correction through reward-driven optimization. This approach achieves substantial improvements over the official challenge baselines. On the evaluation set, our best system attains a WER/CER of 11.57% for Track 1 and a tcpWER/tcpCER of 17.67% for Track 2. Comprehensive ablation studies demonstrate the effectiveness of each component under challenge constraints.
摘要：本文介绍了Seewo的系统，用于多语言对话语音语言模型挑战（MLC-SLM）的两种轨道，均通过ASR（SD-ASR）解决自动语音识别（ASR）和扬声器诊断。我们介绍了一条多阶段的培训管道，该管道明确增强了ASR语音语言模型中的推理和自我纠正。我们的方法结合了课程学习，以获得渐进能力的获取，进行思想链数据增强，以促进中间反思，并通过可验证的奖励（RLVR）（RLVR）通过奖励驱动的优化进一步完善自我纠正。这种方法比官方的挑战基准取得了重大改进。在评估集中，我们的最佳系统对于轨道1的WER/CER为11.57％，轨道2的TCPWER/TCPCer为17.67％。综合消融研究表明，在挑战限制下，每个组件的有效性。

Title: Large Language Models as 'Hidden Persuaders': Fake Product Reviews are Indistinguishable to Humans and Machines

Authors: Weiyao Meng, John Harvey, James Goulding, Chris James Carter, Evgeniya Lukinova, Andrew Smith, Paul Frobisher, Mina Forrest, Georgiana Nica-Avram
Subjects: cs.CL, cs.AI, econ.GN
Abstract URL: https://arxiv.org/abs/2506.13313
Pdf URL: https://arxiv.org/pdf/2506.13313
Copy Paste: [[2506.13313]] Large Language Models as 'Hidden Persuaders': Fake Product Reviews are Indistinguishable to Humans and Machines(https://arxiv.org/abs/2506.13313)
Keywords: language model, llm
Abstract: Reading and evaluating product reviews is central to how most people decide what to buy and consume online. However, the recent emergence of Large Language Models and Generative Artificial Intelligence now means writing fraudulent or fake reviews is potentially easier than ever. Through three studies we demonstrate that (1) humans are no longer able to distinguish between real and fake product reviews generated by machines, averaging only 50.8% accuracy overall - essentially the same as would be expected by chance alone; (2) that LLMs are likewise unable to distinguish between fake and real reviews and perform equally badly or even worse than humans; and (3) that humans and LLMs pursue different strategies for evaluating authenticity which lead to equally poor accuracy, but different precision, recall and F1 scores - indicating they perform worse at different aspects of judgment. The results reveal that review systems everywhere are now susceptible to mechanised fraud if they do not depend on trustworthy purchase verification to guarantee the authenticity of reviewers. Furthermore, the results provide insight into the consumer psychology of how humans judge authenticity, demonstrating there is an inherent 'scepticism bias' towards positive reviews and a special vulnerability to misjudge the authenticity of fake negative reviews. Additionally, results provide a first insight into the 'machine psychology' of judging fake reviews, revealing that the strategies LLMs take to evaluate authenticity radically differ from humans, in ways that are equally wrong in terms of accuracy, but different in their misjudgments.
摘要：阅读和评估产品评论对于大多数人决定在线购买和消费的方式至关重要。但是，现在大型语言模型和生成人工智能的出现现在意味着编写欺诈或假评论可能比以往任何时候都容易。通过三项研究，我们证明（1）人类不再能够区分机器产生的真实和假产品评论，总体上只有50.8％的准确性 - 本质上是单独偶然地期望的；（2）LLM同样无法区分假评论和相同的糟糕甚至比人类更糟糕；（3）人类和LLMS采取不同的策略来评估真实性，这导致了等效的准确性，但精度，回忆和F1分数不同，表明它们在判断的不同方面的表现较差。结果表明，如果审查系统不依赖于值得信赖的购买验证来保证审阅者的真实性，那么各地的审查系统现在都容易受到机械化欺诈的影响。此外，结果提供了人们对人类如何判断真实性的消费者心理学的见解，表明人们对积极评价有固有的“怀疑偏见”，并有一个特殊的脆弱性，以错误地判断虚假负面评论的真实性。此外，结果提供了对判断虚假评论的“机器心理学”的首次见解，表明LLMS采取的策略以完全错误的方式在准确性方面同样错误，但误解的方式不同。
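
The third finding, equal accuracy but different precision, recall and F1, can be made concrete with two confusion matrices. The counts below are invented for illustration and are not the study's data: both hypothetical judges score exactly 50% accuracy on the same 60 fake and 140 real reviews, yet they fail in different ways.

```python
def metrics(tp, fn, fp, tn):
    # "Positive" means the review is fake.
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical judge A flags most reviews as fake (over-sceptical).
judge_a = metrics(tp=50, fn=10, fp=90, tn=50)

# Hypothetical judge B trusts most reviews (over-trusting).
judge_b = metrics(tp=10, fn=50, fp=50, tn=90)

for name, m in (("A", judge_a), ("B", judge_b)):
    print(name, {k: round(v, 3) for k, v in m.items()})
```

Judge A catches most fakes at the cost of many false alarms, while judge B misses most fakes. Accuracy alone hides that difference.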

Title: Document-Level Tabular Numerical Cross-Checking: A Coarse-to-Fine Approach

Authors: Chaoxu Pang, Yixuan Cao, Ganbin Zhou, Hongwei Li, Ping Luo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.13328
Pdf URL: https://arxiv.org/pdf/2506.13328
Copy Paste: [[2506.13328]] Document-Level Tabular Numerical Cross-Checking: A Coarse-to-Fine Approach(https://arxiv.org/abs/2506.13328)
Keywords: language model, llm
Abstract: Numerical consistency across tables in disclosure documents is critical for ensuring accuracy, maintaining credibility, and avoiding reputational and economic risks. Automated tabular numerical cross-checking presents two significant challenges: (C1) managing the combinatorial explosion of candidate instances at the document level and (C2) comprehending multi-faceted numerical semantics. Previous research typically depends on heuristic-based filtering or simplified context extraction, often struggling to balance performance and efficiency. Recently, large language models (LLMs) have demonstrated remarkable contextual understanding capabilities that help address C2 at the instance level, yet they remain hampered by computational inefficiency (C1) and limited domain expertise. This paper introduces CoFiTCheck, a novel LLM-based coarse-to-fine framework that addresses these challenges through two sequential stages: embedding-based filtering and discriminative classification. The embedding-based filtering stage introduces an instructional parallel encoding method to efficiently represent all numerical mentions in a table with LLMs, as well as a decoupled InfoNCE objective to mitigate the isolated mention problem. The discriminative classification stage employs a specialized LLM for fine-grained analysis of the remaining candidate pairs. This stage is further enhanced by our cross-table numerical alignment pretraining paradigm, which leverages weak supervision from cross-table numerical equality relationships to enrich task-specific priors without requiring manual annotation. Comprehensive evaluation across three types of real-world disclosure documents demonstrates that CoFiTCheck significantly outperforms previous methods while maintaining practical efficiency.
摘要：披露文件中跨表的数值一致性对于确保准确性，保持信誉以及避免声誉和经济风险至关重要。自动表格数值交叉检查提出了两个重大挑战：（C1）在文档级别管理候选实例的组合爆炸，以及（C2）理解多方面的数值语义。以前的研究通常取决于基于启发式的过滤或简化的上下文提取，通常会努力平衡性能和效率。最近，大型语言模型（LLMS）表现出了出色的上下文理解能力，可以在实例级别上解决C2，但它们仍然受到计算效率低下（C1）和有限的域专业知识的阻碍。本文介绍了Cofitcheck，这是一种新型的基于LLM的粗到精细框架，该框架通过两个顺序阶段解决了这些挑战：基于嵌入的过滤和辨别性分类。基于嵌入式的过滤阶段引入了一种教学并行编码方法，以有效地表示带有LLM的表中的所有数值提及，以及一个脱钩的Infonce目标，以减轻孤立的提及问题。歧视性分类阶段采用专门的LLM来对其余候选对的细粒度分析。我们的交叉数值对准预处理进一步增强了这一阶段，该范式利用跨桌子数值平等关系的弱监督，以丰富特定于任务的先验而无需手动注释。对三种现实世界披露文件进行的全面评估表明，Cofitcheck在维持实际效率的同时显着优于以前的方法。
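
The "combinatorial explosion" in challenge C1 is easy to quantify: every pair of numerical mentions in a document is a candidate. The sketch below counts those pairs and compares them with an upper bound on what a coarse filter would pass to the classifier. The top-k cut-off is an assumption made for illustration, since the abstract does not say how the embedding-based stage selects candidates.

```python
import math

def candidate_pairs(num_mentions):
    # Every unordered pair of numerical mentions is a candidate instance.
    return math.comb(num_mentions, 2)

def pairs_after_filtering(num_mentions, top_k):
    # Hypothetical coarse stage: keep at most top_k nearest neighbours per mention.
    return min(num_mentions * top_k, candidate_pairs(num_mentions))

n = 10_000  # illustrative number of numerical mentions in one document
all_pairs = candidate_pairs(n)
kept_pairs = pairs_after_filtering(n, top_k=5)
print(all_pairs, kept_pairs)
```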

Title: NTU Speechlab LLM-Based Multilingual ASR System for Interspeech MLC-SLM Challenge 2025

Authors: Yizhou Peng, Bin Wang, Yi-Wen Chao, Ziyang Ma, Haoyang Zhang, Hexin Liu, Xie Chen, Eng Siong Chng
Subjects: cs.CL, eess.AS
Abstract URL: https://arxiv.org/abs/2506.13339
Pdf URL: https://arxiv.org/pdf/2506.13339
Copy Paste: [[2506.13339]] NTU Speechlab LLM-Based Multilingual ASR System for Interspeech MLC-SLM Challenge 2025(https://arxiv.org/abs/2506.13339)
Keywords: language model, llm, prompt
Abstract: This report details the NTU Speechlab system developed for the Interspeech 2025 Multilingual Conversational Speech and Language Model (MLC-SLM) Challenge (Task I), where we achieved 5th place. We present comprehensive analyses of our multilingual automatic speech recognition system, highlighting key advancements in model architecture, data selection, and training strategies. In particular, language-specific prompts and model averaging techniques were instrumental in boosting system performance across diverse languages. Compared to the initial baseline system, our final model reduced the average Mix Error Rate from 20.2% to 10.6%, representing an absolute improvement of 9.6% (a relative improvement of 48%) on the evaluation set. Our results demonstrate the effectiveness of our approach and offer practical insights for future Speech Large Language Models.
摘要：本报告详细介绍了为Interspeech 2025多语言对话语音和语言模型（MLC-SLM）挑战（任务I）开发的NTU SpeechLab系统，我们获得了第五名。我们介绍了对多语言自动语音识别系统的全面分析，突出了模型架构，数据选择和培训策略中的关键进步。特别是，特定于语言的提示和模型平均技术有助于提高各种语言的系统性能。与初始基线系统相比，我们的最终模型将平均混合误差率从20.2％降低到10.6％，这是评估集的绝对提高9.6％（相对提高48％）。我们的结果证明了我们方法的有效性，并为将来的语音大语模型提供了实用的见解。
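
The absolute and relative improvements quoted in the abstract follow directly from the two reported error rates. A short check, using only the figures given above:

```python
baseline_mer = 20.2  # average Mix Error Rate of the initial baseline, in percent
final_mer = 10.6     # average Mix Error Rate of the final model, in percent

absolute_gain = baseline_mer - final_mer      # percentage points
relative_gain = absolute_gain / baseline_mer  # fraction of the baseline error removed

print(f"absolute: {absolute_gain:.1f} points, relative: {relative_gain:.1%}")
```

The relative figure comes out just under 48%, which matches the rounded value in the abstract.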

Title: Direct Reasoning Optimization: LLMs Can Reward And Refine Their Own Reasoning for Open-Ended Tasks

Authors: Yifei Xu, Tusher Chakraborty, Srinagesh Sharma, Leonardo Nunes, Emre Kıcıman, Songwu Lu, Ranveer Chandra
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.13351
Pdf URL: https://arxiv.org/pdf/2506.13351
Copy Paste: [[2506.13351]] Direct Reasoning Optimization: LLMs Can Reward And Refine Their Own Reasoning for Open-Ended Tasks(https://arxiv.org/abs/2506.13351)
Keywords: language model, llm, chain-of-thought
Abstract: Recent advances in Large Language Models (LLMs) have showcased impressive reasoning abilities in structured tasks like mathematics and programming, largely driven by Reinforcement Learning with Verifiable Rewards (RLVR), which uses outcome-based signals that are scalable, effective, and robust against reward hacking. However, applying similar techniques to open-ended long-form reasoning tasks remains challenging due to the absence of generic, verifiable reward signals. To address this, we propose Direct Reasoning Optimization (DRO), a reinforcement learning framework for fine-tuning LLMs on open-ended, particularly long-form, reasoning tasks, guided by a new reward signal: the Reasoning Reflection Reward (R3). At its core, R3 selectively identifies and emphasizes key tokens in the reference outcome that reflect the influence of the model's preceding chain-of-thought reasoning, thereby capturing the consistency between reasoning and reference outcome at a fine-grained level. Crucially, R3 is computed internally using the same model being optimized, enabling a fully self-contained training setup. Additionally, we introduce a dynamic data filtering strategy based on R3 for open-ended reasoning tasks, reducing cost while improving downstream performance. We evaluate DRO on two diverse datasets -- ParaRev, a long-form paragraph revision task, and FinQA, a math-oriented QA benchmark -- and show that it consistently outperforms strong baselines while remaining broadly applicable across both open-ended and structured domains.
摘要：大型语言模型（LLM）的最新进展已在结构化任务和编程等结构化任务中展示了令人印象深刻的推理能力，这在很大程度上是由可验证的奖励（RLVR）驱动的，该奖励（RLVR）使用了基于结果的信号，这些信号可扩展，有效，有效且可靠地反对奖励黑客。但是，由于缺乏通用，可验证的奖励信号，将类似的技术应用于开放式的长期推理任务仍然具有挑战性。为了解决这个问题，我们提出了直接推理优化（DRO），这是在开放式奖励信号的开放式，尤其是长形的推理任务上对LLM进行微调LLM的强化学习框架：推理反思奖励（R3）。 R3以核心选择性地识别并强调了参考结果中的关键令牌，以反映该模型先前的经济链推理的影响，从而捕获了在细粒级别上的推理和参考结果之间的一致性。至关重要的是，R3是在内部使用相同的模型进行了优化计算的，从而实现了完全独立的训练设置。此外，我们为开放式推理任务引入了基于R3的动态数据过滤策略，从而降低了成本，同时改善了下游性能。我们在两个不同的数据集上评估了DRO-Pararev是一项长期的段落修订任务和面向数学的QA基准的FinQA-并表明它始终超越强大的基线，同时在开放式和结构化领域中保持广泛适用。

Title: StoryBench: A Dynamic Benchmark for Evaluating Long-Term Memory with Multi Turns

Authors: Luanbo Wan, Weizhi Ma
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.13356
Pdf URL: https://arxiv.org/pdf/2506.13356
Copy Paste: [[2506.13356]] StoryBench: A Dynamic Benchmark for Evaluating Long-Term Memory with Multi Turns(https://arxiv.org/abs/2506.13356)
Keywords: language model, llm
Abstract: Long-term memory (LTM) is essential for large language models (LLMs) to achieve autonomous intelligence in complex, evolving environments. Despite increasing efforts in memory-augmented and retrieval-based architectures, there remains a lack of standardized benchmarks to systematically evaluate LLMs' long-term memory abilities. Existing benchmarks still face challenges in evaluating knowledge retention and dynamic sequential reasoning, and in their own flexibility, all of which limit their effectiveness in assessing models' LTM capabilities. To address these gaps, we propose a novel benchmark framework based on interactive fiction games, featuring dynamically branching storylines with complex reasoning structures. These structures simulate real-world scenarios by requiring LLMs to navigate hierarchical decision trees, where each choice triggers cascading dependencies across multi-turn interactions. Our benchmark emphasizes two distinct settings to test reasoning complexity: one with immediate feedback upon incorrect decisions, and the other requiring models to independently trace back and revise earlier choices after failure. As part of this benchmark, we also construct a new dataset designed to test LLMs' LTM within narrative-driven environments. We further validate the effectiveness of our approach through detailed experiments. Experimental results demonstrate the benchmark's ability to robustly and reliably assess LTM in LLMs.
摘要：长期记忆（LTM）对于大语言模型（LLMS）在复杂，不断发展的环境中实现自动智能至关重要。尽管越来越多地在基于内存和检索的架构上做出了努力，但仍缺乏系统地评估LLMS的长期记忆能力的标准化基准。现有的基准仍面临评估知识保留和动态顺序推理的挑战，并且以其自身的灵活性，所有这些都限制了它们在评估模型的LTM功能方面的有效性。为了解决这些差距，我们提出了一个基于互动小说游戏的新型基准框架，其中包含具有复杂推理结构的动态分支故事情节。这些结构通过要求LLMS浏览层次决策树来模拟现实世界的场景，在这种情况下，每个选择都会触发跨多转交互之间的级联依赖关系。我们的基准强调了两个不同的设置以测试推理复杂性：一种立即对决策的反馈，另一个要求模型在失败后独立追溯和修改早期选择。作为此基准测试的一部分，我们还构建了一个新的数据集，旨在在叙事驱动的环境中测试LLMS的LTM。我们通过详细的实验进一步验证方法的有效性。实验结果表明，基准测试能够在LLMS中可靠地评估LTM的能力。

Title: Efficient Medical VIE via Reinforcement Learning

Authors: Lijun Liu, Ruiyang Li, Zhaocheng Liu, Chenglin Zhu, Chong Li, Jiehan Cheng, Qiang Ju, Jian Xie
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.13363
Pdf URL: https://arxiv.org/pdf/2506.13363
Copy Paste: [[2506.13363]] Efficient Medical VIE via Reinforcement Learning(https://arxiv.org/abs/2506.13363)
Keywords: language model, hallucination
Abstract: Visual Information Extraction (VIE) converts unstructured document images into structured formats like JSON, critical for medical applications such as report analysis and online consultations. Traditional methods rely on OCR and language models, while end-to-end multimodal models offer direct JSON generation. However, domain-specific schemas and high annotation costs limit their effectiveness in medical VIE. We base our approach on the Reinforcement Learning with Verifiable Rewards (RLVR) framework to address these challenges using only 100 annotated samples. Our approach ensures dataset diversity, a balanced precision-recall reward mechanism to reduce hallucinations and improve field coverage, and innovative sampling strategies to enhance reasoning capabilities. Fine-tuning Qwen2.5-VL-7B with our RLVR method, we achieve state-of-the-art performance on medical VIE tasks, significantly improving F1, precision, and recall. While our models excel on tasks similar to medical datasets, performance drops on dissimilar tasks, highlighting the need for domain-specific optimization. Case studies further demonstrate the value of reasoning during training and inference for VIE.
摘要：视觉信息提取（VIE）将非结构化的文档图像转换为JSON等结构化格式，这对于报告分析和在线咨询等医学应用至关重要。传统方法依赖于OCR和语言模型，而端到端的多模型模型则提供直接的JSON生成。但是，特定领域的模式和高注释成本限制了其在医疗竞争中的有效性。我们将使用可验证的奖励（RLVR）框架以增强学习为基础，仅使用100个带注释的样本来应对这些挑战。我们的方法确保了数据集多样性，这是一种平衡的Precision-Recall奖励机制，可减少幻觉和改善现场覆盖范围，以及增强推理能力的创新抽样策略。通过我们的RLVR方法进行微调QWEN2.5-VL-7B，我们在医疗VIE任务上实现了最先进的绩效，从而显着改善了F1，精度和召回率。尽管我们的模型在类似于医疗数据集类似的任务上表现出色，但性能下降了不同的任务，突出了对域特异性优化的需求。案例研究进一步证明了在训练和推断VIE期间推理的价值。
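
A reward that balances precision and recall over extracted fields could look like the F1 score below. This is a minimal sketch under stated assumptions and not the paper's reward: the flattening scheme, the exact-match comparison, and the example record are all hypothetical. It only illustrates why such a reward penalises both hallucinated fields (lower precision) and missing fields (lower recall).

```python
def flatten(obj, prefix=""):
    # Turn nested JSON-like data into a set of (path, value) pairs.
    items = set()
    if isinstance(obj, dict):
        for key, value in obj.items():
            items |= flatten(value, f"{prefix}/{key}")
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            items |= flatten(value, f"{prefix}/{index}")
    else:
        items.add((prefix, str(obj)))
    return items

def f1_reward(predicted, gold):
    pred_items, gold_items = flatten(predicted), flatten(gold)
    true_positives = len(pred_items & gold_items)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_items)  # drops when fields are hallucinated
    recall = true_positives / len(gold_items)     # drops when fields are missed
    return 2 * precision * recall / (precision + recall)

gold = {"patient": {"name": "A", "age": 54}, "tests": ["CBC", "CRP"]}
pred = {"patient": {"name": "A", "age": 54}, "tests": ["CBC"], "diagnosis": "flu"}
reward = f1_reward(pred, gold)
```

Here the prediction has one invented field and one missing test, so precision and recall are both 3/4 and the reward is 0.75.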

Title: Decompositional Reasoning for Graph Retrieval with Large Language Models

Authors: Valentin Six, Evan Dufraisse, Gaël de Chalendar
Subjects: cs.CL, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2506.13380
Pdf URL: https://arxiv.org/pdf/2506.13380
Copy Paste: [[2506.13380]] Decompositional Reasoning for Graph Retrieval with Large Language Models(https://arxiv.org/abs/2506.13380)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) excel at many NLP tasks, but struggle with multi-hop reasoning and factual consistency, limiting their effectiveness on knowledge-intensive tasks like complex question answering (QA). Linking Knowledge Graphs (KG) and LLMs has shown promising results, but LLMs generally lack the ability to reason efficiently over graph-structured information. To tackle this problem, we propose a novel retrieval approach that integrates textual knowledge graphs into the LLM reasoning process via query decomposition. Our method decomposes complex questions into sub-questions, retrieves relevant textual subgraphs, and composes a question-specific knowledge graph to guide answer generation. For that, we use a weighted similarity function that focuses on both the complex question and the generated subquestions to extract a relevant subgraph, which allows efficient and precise retrieval for complex questions and improves the performance of LLMs on multi-hop QA tasks. This structured reasoning pipeline enhances factual grounding and interpretability while leveraging the generative strengths of LLMs. We evaluate our method on standard multi-hop QA benchmarks and show that it achieves comparable or superior performance to competitive existing methods, using smaller models and fewer LLM calls.
摘要：大型语言模型（LLMS）在许多NLP任务上都表现出色，但是在多跳的推理和事实一致性上挣扎，将其有效性限制在知识密集的任务上，例如复杂的问题答案（QA）。链接知识图（kg）和LLMS已显示出令人鼓舞的结果，但LLM通常缺乏有效推理图形结构信息的能力。为了解决这个问题，我们提出了一种新型的检索方法，该方法将文本知识图通过查询分解整合到LLM推理过程中。我们的方法将复杂的问题分解为子问题，检索相关的文本子图，并构成特定于问题的知识图来指导答案生成。为此，我们使用加权相似性函数，该功能重点介绍了复杂的问题和生成的子问题来提取相关子图，该子图可以有效，精确地检索复杂的问题，并提高LLMS在多跳QA任务上的性能。这种结构化的推理管道可以增强事实基础和解释性，同时利用LLM的生成优势。我们使用较小的型号和更少的LLM调用来评估我们的标准多跳QA基准测试方法的方法，并表明它与现有方法具有可比性或优越的性能。
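
The abstract does not give the form of the weighted similarity function, so the sketch below is one plausible shape rather than the authors' definition: a convex combination of the similarity to the full question and the best similarity to any sub-question. The weight `alpha`, the max aggregation, and the toy embedding vectors are assumptions.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def weighted_similarity(item_vec, question_vec, subquestion_vecs, alpha=0.5):
    # alpha weights the full question; (1 - alpha) weights the best-matching sub-question.
    question_score = cosine(item_vec, question_vec)
    sub_score = max(cosine(item_vec, s) for s in subquestion_vecs)
    return alpha * question_score + (1 - alpha) * sub_score

# Toy embeddings standing in for encoded text (hypothetical values).
question = [1.0, 1.0, 0.0]
subquestions = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
graph_items = {"relevant_triple": [1.0, 0.0, 0.0], "unrelated_triple": [0.0, 0.0, 1.0]}

ranked = sorted(
    graph_items,
    key=lambda name: weighted_similarity(graph_items[name], question, subquestions),
    reverse=True,
)
```

Items that match one sub-question exactly still rank highly even when they only partly match the full question, which is the behaviour a decomposition-based retriever would want.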

Title: Bi-directional Context-Enhanced Speech Large Language Models for Multilingual Conversational ASR

Authors: Yizhou Peng, Hexin Liu, Eng Siong Chng
Subjects: cs.CL, eess.AS
Abstract URL: https://arxiv.org/abs/2506.13396
Pdf URL: https://arxiv.org/pdf/2506.13396
Copy Paste: [[2506.13396]] Bi-directional Context-Enhanced Speech Large Language Models for Multilingual Conversational ASR(https://arxiv.org/abs/2506.13396)
Keywords: language model, llm
Abstract: This paper introduces the integration of language-specific bi-directional context into a speech large language model (SLLM) to improve multilingual continuous conversational automatic speech recognition (ASR). We propose a character-level contextual masking strategy during training, which randomly removes portions of the context to enhance robustness and better emulate the flawed transcriptions that may occur during inference. For decoding, a two-stage pipeline is utilized: initial isolated segment decoding followed by context-aware re-decoding using neighboring hypotheses. Evaluated on the 1500-hour Multilingual Conversational Speech and Language Model (MLC-SLM) corpus covering eleven languages, our method achieves an 18% relative improvement compared to a strong baseline, outperforming even the model trained on 6000 hours of data for the MLC-SLM competition. These results underscore the significant benefit of incorporating contextual information in multilingual continuous conversational ASR.
摘要：本文将特定于语言的双向环境集成到语音大语言模型（SLLM）中，以改善多语言连续的对话自动语音识别（ASR）。我们在训练过程中提出了一个角色级别的上下文掩盖策略，该策略随机去除了一部分上下文，以增强鲁棒性并更好地模拟推理过程中可能发生的有缺陷的转录。为了解码，使用了两阶段的管道：初始隔离段解码，然后使用相邻的假设重新编码了上下文感知。在涵盖11种语言的1500小时多语言对话语音和语言模型（MLC-SLM）语料库中进行了评估，我们的方法与强大的基线相比实现了18％的相对改善，甚至在6000个小时的MLC-SLM竞争数据的数据中都表现出色。这些结果强调了将上下文信息纳入多语种连续对话式ASR中的重大好处。
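
Character-level contextual masking, as described above, removes random portions of the neighbouring context during training. A minimal sketch of one way to do this follows. The span-based deletion, the probability, and the maximum span length are assumptions for illustration, since the abstract does not specify them.

```python
import random

def mask_context(context, mask_prob=0.2, max_span=5, rng=None):
    # Walk through the context; at each position, with probability mask_prob,
    # drop a short run of characters to imitate an imperfect transcription.
    rng = rng or random.Random()
    kept = []
    i = 0
    while i < len(context):
        if rng.random() < mask_prob:
            i += rng.randint(1, max_span)  # skip 1..max_span characters
        else:
            kept.append(context[i])
            i += 1
    return "".join(kept)

previous_segment = "so we agreed to meet on thursday afternoon"
masked = mask_context(previous_segment, rng=random.Random(0))
```

At inference time the same slot would be filled with first-pass hypotheses of the neighbouring segments, which is why training on degraded context is useful.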

Title: RealHiTBench: A Comprehensive Realistic Hierarchical Table Benchmark for Evaluating LLM-Based Table Analysis

Authors: Pengzuo Wu, Yuhang Yang, Guangcheng Zhu, Chao Ye, Hong Gu, Xu Lu, Ruixuan Xiao, Bowen Bao, Yijing He, Liangyu Zha, Wentao Ye, Junbo Zhao, Haobo Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.13405
Pdf URL: https://arxiv.org/pdf/2506.13405
Copy Paste: [[2506.13405]] RealHiTBench: A Comprehensive Realistic Hierarchical Table Benchmark for Evaluating LLM-Based Table Analysis(https://arxiv.org/abs/2506.13405)
Keywords: language model, llm
Abstract: With the rapid advancement of Large Language Models (LLMs), there is an increasing need for challenging benchmarks to evaluate their capabilities in handling complex tabular data. However, existing benchmarks are either based on outdated data setups or focus solely on simple, flat table structures. In this paper, we introduce RealHiTBench, a comprehensive benchmark designed to evaluate the performance of both LLMs and Multimodal LLMs (MLLMs) across a variety of input formats for complex tabular data, including LaTeX, HTML, and PNG. RealHiTBench also includes a diverse collection of tables with intricate structures, spanning a wide range of task types. Our experimental results, using 25 state-of-the-art LLMs, demonstrate that RealHiTBench is indeed a challenging benchmark. Moreover, we also develop TreeThinker, a tree-based pipeline that organizes hierarchical headers into a tree structure for enhanced tabular reasoning, validating the importance of improving LLMs' perception of table hierarchies. We hope that our work will inspire further research on tabular data reasoning and the development of more robust models. The code and data are available at this https URL.
摘要：随着大语言模型（LLM）的快速发展，越来越需要挑战基准来评估其处理复杂表格数据的能力。但是，现有的基准测试基于过时的数据设置，或者仅专注于简单的平桌结构。在本文中，我们介绍了Realhitbench，这是一种综合基准，旨在评估LLM和多模式LLMS（MLLMS）的性能，用于各种输入格式，用于复杂的表格数据，包括乳胶，HTML和PNG。 Realhitbench还包括各种具有复杂结构的表，涵盖了广泛的任务类型。我们使用25种最先进的LLM的实验结果表明，Realhitbench确实是一个具有挑战性的基准。此外，我们还开发了Treethinker，这是一种基于树的管道，该管道将层次结构标头组织到树结构中，以增强表格推理，从而验证了改善LLMS对表层次结构的感知的重要性。我们希望我们的工作能够激发有关表格数据推理和更强大模型的发展的进一步研究。该代码和数据可在此HTTPS URL上找到。

Title: Unveiling the Learning Mind of Language Models: A Cognitive Framework and Empirical Study

Authors: Zhengyu Hu, Jianxun Lian, Zheyuan Xiao, Seraphina Zhang, Tianfu Wang, Nicholas Jing Yuan, Xing Xie, Hui Xiong
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.13464
Pdf URL: https://arxiv.org/pdf/2506.13464
Copy Paste: [[2506.13464]] Unveiling the Learning Mind of Language Models: A Cognitive Framework and Empirical Study(https://arxiv.org/abs/2506.13464)
Keywords: language model, llm
Abstract: Large language models (LLMs) have shown impressive capabilities across tasks such as mathematics, coding, and reasoning, yet their learning ability, which is crucial for adapting to dynamic environments and acquiring new knowledge, remains underexplored. In this work, we address this gap by introducing a framework inspired by cognitive psychology and education. Specifically, we decompose general learning ability into three distinct, complementary dimensions: Learning from Instructor (acquiring knowledge via explicit guidance), Learning from Concept (internalizing abstract structures and generalizing to new contexts), and Learning from Experience (adapting through accumulated exploration and feedback). We conduct a comprehensive empirical study across the three learning dimensions and identify several insightful findings, such as (i) interaction improves learning; (ii) conceptual understanding is scale-emergent and benefits larger models; and (iii) LLMs are effective few-shot learners but not many-shot learners. Based on our framework and empirical findings, we introduce a benchmark that provides a unified and realistic evaluation of LLMs' general learning abilities across three learning cognition dimensions. It enables diagnostic insights and supports evaluation and development of more adaptive and human-like models.
摘要：大型语言模型（LLM）在数学，编码和推理等任务中表现出令人印象深刻的能力，但是它们的学习能力（对于适应动态环境并获取新知识至关重要，这仍然是至关重要的。在这项工作中，我们通过引入一个受认知心理学和教育启发的框架来解决这一差距。具体来说，我们将一般学习能力分解为三个不同的互补维度：从教师那里学习（通过明确的指导获取知识），从概念中学习（内部化抽象结构和对新环境的推广）以及从经验中学习（通过累积的探索和反馈来调整）。我们对三个学习维度进行了全面的经验研究，并确定了几个有见地的发现，例如（i）相互作用可以改善学习；（ii）概念的理解是规模亮片，并使更大的模型受益；（iii）LLM是有效的几次学习者，但没有很多学习者。根据我们的框架和经验发现，我们引入了一个基准，该基准对LLMS在三个学习认知维度中的一般学习能力进行了统一和现实的评估。它可以实现诊断见解，并支持更具适应性和类似人类模型的评估和开发。

Title: ROSAQ: Rotation-based Saliency-Aware Weight Quantization for Efficiently Compressing Large Language Models

Authors: Junho Yoon, Geom Lee, Donghyeon Jeon, Inho Kang, Seung-Hoon Na
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.13472
Pdf URL: https://arxiv.org/pdf/2506.13472
Copy Paste: [[2506.13472]] ROSAQ: Rotation-based Saliency-Aware Weight Quantization for Efficiently Compressing Large Language Models(https://arxiv.org/abs/2506.13472)
Keywords: language model, llm
Abstract: Quantization has been widely studied as an effective technique for reducing the memory requirement of large language models (LLMs), potentially improving latency as well. Utilizing the rotational invariance of transformers, we propose the rotation-based saliency-aware weight quantization (ROSAQ), which identifies salient channels in the projection feature space, not in the original feature space, where the projected "principal" dimensions are naturally considered as "salient" features. The proposed ROSAQ consists of 1) PCA-based projection, which first performs principal component analysis (PCA) on a calibration set and transforms via the PCA projection, 2) Salient channel identification, which selects dimensions corresponding to the K-largest eigenvalues as salient channels, and 3) Saliency-aware quantization with mixed-precision, which uses FP16 for salient dimensions and INT3/4 for other dimensions. Experimental results show that ROSAQ improves over the baseline saliency-aware quantization on the original feature space and over other existing quantization methods. With kernel fusion, ROSAQ achieves about a 2.3x speedup over the FP16 implementation when generating 256 tokens with a batch size of 64.
摘要：量化已被广泛研究为减少大语言模型（LLM）的记忆需求的有效技术，从而有可能改善潜伏时间。利用变压器旋转不变性的特征，我们提出了基于旋转的显着性权重量化（Rosaq），该量化标识了投影特征空间中的显着通道，而不是在原始特征空间中，其中投影的“主体”尺寸自然被视为“显着”特征。拟议的ROSAQ由1）基于PCA的投影组成，该预测首先在校准集上执行主要成分分析（PCA），并通过PCA投影进行转换，2）显着通道牙齿牙齿化，选择与k最大特征值相对应的尺寸，并将其作为较高的eigenvalueres和3）用于混合量的forcision and 3）。方面。实验结果表明，Rosaq在原始特征空间和其他现有量化方法上显示了基线显着性量化的改进。通过内核融合，Rosaq在FP16实现中呈现约2.3倍的速度，生成256个令牌，批量大小为64。
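
The three steps listed in the abstract can be sketched with NumPy for a single linear layer. This is an illustration under stated assumptions and not the authors' implementation: the per-column round-to-nearest quantizer, the use of the calibration covariance, and the toy shapes are all hypothetical. The point it shows is the rotational invariance `W x = (W Q)(Q^T x)` for an orthogonal `Q`, which lets the salient channels be chosen in the PCA-rotated space.

```python
import numpy as np

def quantize_rtn(w, bits=4):
    # Symmetric round-to-nearest quantization per column, returned dequantized.
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max(axis=0, keepdims=True) / qmax
    scale[scale == 0] = 1.0
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

def rosaq_sketch(W, X_calib, k, bits=4):
    # 1) PCA-based projection: eigenvectors of the calibration covariance.
    cov = np.cov(X_calib, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]       # largest eigenvalue first
    Q = eigvecs[:, order]                   # orthogonal rotation
    W_rot = W @ Q                           # weights in the rotated space
    # 2) Salient channels: the k dimensions with the largest eigenvalues.
    # 3) Mixed precision: FP16 for salient columns, low-bit for the rest.
    W_mixed = quantize_rtn(W_rot, bits)
    W_mixed[:, :k] = W_rot[:, :k].astype(np.float16)
    return Q, W_mixed

rng = np.random.default_rng(0)
X_calib = rng.normal(size=(256, 8)) * np.linspace(3.0, 0.1, 8)  # toy activations
W = rng.normal(size=(4, 8))                                     # toy weight matrix
Q, W_mixed = rosaq_sketch(W, X_calib, k=2)

x = rng.normal(size=8)
y_ref = W @ x                  # original layer output
y_rot = (W @ Q) @ (Q.T @ x)    # same output computed in the rotated space
y_mixed = W_mixed @ (Q.T @ x)  # output with mixed-precision rotated weights
```

The kernel fusion mentioned in the abstract is not modelled here.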

Title: Language Agents for Hypothesis-driven Clinical Decision Making with Reinforcement Learning

Authors: David Bani-Harouni, Chantal Pellegrini, Ege Özsoy, Matthias Keicher, Nassir Navab
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.13474
Pdf URL: https://arxiv.org/pdf/2506.13474
Copy Paste: [[2506.13474]] Language Agents for Hypothesis-driven Clinical Decision Making with Reinforcement Learning(https://arxiv.org/abs/2506.13474)
Keywords: language model, llm, agent
Abstract: Clinical decision-making is a dynamic, interactive, and cyclic process where doctors have to repeatedly decide on which clinical action to perform and consider newly uncovered information for diagnosis and treatment. Large Language Models (LLMs) have the potential to support clinicians in this process; however, most applications of LLMs in clinical decision support suffer from one of two limitations: either they assume the unrealistic scenario of immediate availability of all patient information and do not model the interactive and iterative investigation process, or they restrict themselves to the limited "out-of-the-box" capabilities of large pre-trained models without performing task-specific training. In contrast to this, we propose to model clinical decision-making for diagnosis with a hypothesis-driven uncertainty-aware language agent, LA-CDM, that converges towards a diagnosis via repeatedly requesting and interpreting relevant tests. Using a hybrid training paradigm combining supervised and reinforcement learning, we train LA-CDM with three objectives targeting critical aspects of clinical decision-making: accurate hypothesis generation, hypothesis uncertainty estimation, and efficient decision-making. We evaluate our methodology on MIMIC-CDM, a real-world dataset covering four abdominal diseases containing various clinical tests and show the benefit of explicitly training clinical decision-making for increasing diagnostic performance and efficiency.
摘要：临床决策是一个动态，互动和环状的过程，医生必须反复决定要执行哪种临床动作，并考虑新发现的诊断和治疗信息。但是，大型语言模型（LLM）有可能在此过程中支持临床医生，但是，LLM在临床决策支持中的大多数应用都有两个局限性之一：他们要么假定所有患者信息即时可用性的不切实际情景，并且不模拟互动性和迭代性研究过程，要么不将自己限制为有限的“开箱式培训”，而不必训练大型培训模型的“大型型号”，该模型的“能力均可进行较大的指标”。与此相反，我们建议使用假设驱动的不确定性语言剂LA-CDM对诊断临床决策进行建模，该诊断通过反复请求和解释相关测试来融合诊断。使用混合训练范式结合了监督和强化学习，我们将LA-CDM与针对临床决策关键方面的三个目标培训：准确的假设产生，假设不确定性估计以及有效的决策。我们评估了模拟CDM的方法，这是一个现实世界中的数据集，涵盖了四种包含各种临床测试的腹部疾病，并显示了明确培训临床决策以提高诊断性能和效率的好处。

Title: Position: Pause Recycling LoRAs and Prioritize Mechanisms to Uncover Limits and Effectiveness

Authors: Mei-Yen Chen, Thi Thu Uyen Hoang, Michael Hahn, M. Saquib Sarfraz
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.13479
Pdf URL: https://arxiv.org/pdf/2506.13479
Copy Paste: [[2506.13479]] Position: Pause Recycling LoRAs and Prioritize Mechanisms to Uncover Limits and Effectiveness(https://arxiv.org/abs/2506.13479)
Keywords: language model
Abstract: Merging or routing low-rank adapters (LoRAs) has emerged as a popular solution for enhancing large language models, particularly when data access is restricted by regulatory or domain-specific constraints. This position paper argues that the research community should shift its focus from developing new merging or routing algorithms to understanding the conditions under which reusing LoRAs is truly effective. Through theoretical analysis and synthetic two-hop reasoning and math word-problem tasks, we examine whether reusing LoRAs enables genuine compositional generalization or merely reflects shallow pattern matching. Evaluating two data-agnostic methods--parameter averaging and dynamic adapter selection--we found that reusing LoRAs often fails to logically integrate knowledge across disjoint fine-tuning datasets, especially when such knowledge is underrepresented during pretraining. Our empirical results, supported by theoretical insights into LoRA's limited expressiveness, highlight the preconditions and constraints of reusing them for unseen tasks and cast doubt on its feasibility as a truly data-free approach. We advocate for pausing the pursuit of novel methods for recycling LoRAs and emphasize the need for rigorous mechanisms to guide future academic research in adapter-based model merging and practical system designs for practitioners.
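
One of the two data-agnostic methods the abstract evaluates is parameter averaging. A minimal sketch of that idea follows, assuming each adapter is stored as a pair of low-rank factors (B, A) whose product is the weight update, and that the merge is a uniform mean of those updates. The helper names and the toy 2x2 adapters are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply two small dense matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def average_lora_deltas(adapters: Sequence[Tuple[Matrix, Matrix]]) -> Matrix:
    """Uniformly average the weight updates B @ A of several LoRA adapters."""
    deltas = [matmul(b, a) for b, a in adapters]
    rows, cols = len(deltas[0]), len(deltas[0][0])
    return [
        [sum(d[i][j] for d in deltas) / len(deltas) for j in range(cols)]
        for i in range(rows)
    ]


# Two hypothetical rank-1 adapters for a 2x2 weight matrix (B is 2x1, A is 1x2).
adapter_1 = ([[1.0], [0.0]], [[2.0, 0.0]])
adapter_2 = ([[0.0], [1.0]], [[0.0, 4.0]])
merged_delta = average_lora_deltas([adapter_1, adapter_2])
```

The merged update would be added to the frozen base weights. Whether such a merge composes the knowledge of both adapters is exactly what the paper questions.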

Title: TurBLiMP: A Turkish Benchmark of Linguistic Minimal Pairs

Authors: Ezgi Başar, Francesca Padovani, Jaap Jumelet, Arianna Bisazza
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.13487
Pdf URL: https://arxiv.org/pdf/2506.13487
Copy Paste: [[2506.13487]] TurBLiMP: A Turkish Benchmark of Linguistic Minimal Pairs(https://arxiv.org/abs/2506.13487)
Keywords: language model
Abstract: We introduce TurBLiMP, the first Turkish benchmark of linguistic minimal pairs, designed to evaluate the linguistic abilities of monolingual and multilingual language models (LMs). Covering 16 linguistic phenomena with 1000 minimal pairs each, TurBLiMP fills an important gap in linguistic evaluation resources for Turkish. In designing the benchmark, we give extra attention to two properties of Turkish that remain understudied in current syntactic evaluations of LMs, namely word order flexibility and subordination through morphological processes. Our experiments on a wide range of LMs and a newly collected set of human acceptability judgments reveal that even cutting-edge Large LMs still struggle with grammatical phenomena that are not challenging for humans, and may also exhibit different sensitivities to word order and morphological complexity compared to humans.
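
Minimal-pair benchmarks are commonly scored by checking whether a model assigns a higher score to the acceptable sentence than to its unacceptable counterpart. The sketch below illustrates that scoring rule only. The `score` callable, the placeholder sentence identifiers, and the toy log-probabilities are assumptions for illustration, not TurBLiMP data or the authors' evaluation code.

```python
from typing import Callable, Iterable, Tuple


def minimal_pair_accuracy(
    pairs: Iterable[Tuple[str, str]],
    score: Callable[[str], float],
) -> float:
    """Fraction of (acceptable, unacceptable) pairs where the acceptable one scores higher."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("at least one minimal pair is required")
    correct = sum(1 for good, bad in pairs if score(good) > score(bad))
    return correct / len(pairs)


# Hypothetical sentence log-probabilities standing in for a real language model.
toy_log_probs = {"good_1": -10.0, "bad_1": -12.5, "good_2": -9.0, "bad_2": -8.0}
toy_pairs = [("good_1", "bad_1"), ("good_2", "bad_2")]
accuracy = minimal_pair_accuracy(toy_pairs, toy_log_probs.__getitem__)
```

In practice `score` would wrap a language model's sentence-level log-probability, and accuracy would be reported per linguistic phenomenon.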

Title: BOW: Bottlenecked Next Word Exploration

Authors: Ming Shen, Zhikun Xu, Xiao Ye, Jacob Dineen, Ben Zhou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.13502
Pdf URL: https://arxiv.org/pdf/2506.13502
Copy Paste: [[2506.13502]] BOW: Bottlenecked Next Word Exploration(https://arxiv.org/abs/2506.13502)
Keywords: language model, llm
Abstract: Large language models (LLMs) are typically trained via next-word prediction (NWP), which provides strong surface-level fluency but often lacks support for robust reasoning. We propose BOttlenecked next Word exploration (BOW), a novel RL framework that rethinks NWP by introducing a reasoning bottleneck where a policy model first generates a reasoning path rather than predicting the next token directly, after which a frozen judge model predicts the next token distribution based solely on this reasoning path. We train the policy model using GRPO with rewards that quantify how effectively the reasoning path facilitates next-word recovery. Compared with other continual pretraining baselines, we show that BOW improves both the general and next-word reasoning capabilities of the base model, evaluated on various benchmarks. Our findings show that BOW can serve as an effective and scalable alternative to vanilla NWP.

Title: TensorSLM: Energy-efficient Embedding Compression of Sub-billion Parameter Language Models on Low-end Devices

Authors: Mingxue Xu, Yao Lei Xu, Danilo P. Mandic
Subjects: cs.CL, cs.LG, math.NA
Abstract URL: https://arxiv.org/abs/2506.13514
Pdf URL: https://arxiv.org/pdf/2506.13514
Copy Paste: [[2506.13514]] TensorSLM: Energy-efficient Embedding Compression of Sub-billion Parameter Language Models on Low-end Devices(https://arxiv.org/abs/2506.13514)
Keywords: language model, gpt, llm
Abstract: Small Language Models (SLMs, or on-device LMs) have significantly fewer parameters than Large Language Models (LLMs). They are typically deployed on low-end devices, like mobile phones and single-board computers. Unlike LLMs, which rely on increasing model size for better generalisation, SLMs designed for edge applications are expected to adapt to their deployment environments and to be energy efficient given device battery life constraints, requirements that are not addressed in datacenter-deployed LLMs. This paper addresses these two requirements by proposing a training-free token embedding compression approach using Tensor-Train Decomposition (TTD). Each pre-trained token embedding vector is converted into a lower-dimensional Matrix Product State (MPS). We comprehensively evaluate the extracted low-rank structures across compression ratio, language task performance, latency, and energy consumption on a typical low-end device, a Raspberry Pi. Taking the sub-billion parameter versions of GPT-2/Cerebras-GPT and OPT models as examples, our approach achieves language task performance comparable to the original model with around 2.0× embedding layer compression, while the energy consumption of a single query drops by half.
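
To make the compression claim concrete, the parameter count of a tensor-train (MPS) representation of a single embedding vector can be worked out by hand. The sketch below assumes the vector of length n_1 * n_2 * ... * n_d is reshaped into a d-way tensor and stored as cores of shape (r_{k-1}, n_k, r_k) with boundary ranks equal to 1. The mode sizes and ranks in the example are hypothetical and are not the paper's settings.

```python
from math import prod
from typing import Sequence


def tt_parameter_count(mode_sizes: Sequence[int], ranks: Sequence[int]) -> int:
    """Number of stored values in TT cores of shape (r_{k-1}, n_k, r_k)."""
    if len(ranks) != len(mode_sizes) + 1 or ranks[0] != 1 or ranks[-1] != 1:
        raise ValueError("ranks must have len(mode_sizes) + 1 entries, starting and ending with 1")
    return sum(ranks[k] * n * ranks[k + 1] for k, n in enumerate(mode_sizes))


def compression_ratio(mode_sizes: Sequence[int], ranks: Sequence[int]) -> float:
    """Dense vector length divided by the TT parameter count."""
    return prod(mode_sizes) / tt_parameter_count(mode_sizes, ranks)


# Hypothetical setting: a 768-dimensional embedding reshaped to 8 x 8 x 12.
modes = (8, 8, 12)
tt_ranks = (1, 4, 4, 1)
stored = tt_parameter_count(modes, tt_ranks)
ratio = compression_ratio(modes, tt_ranks)
```

Larger ranks reduce the compression ratio but can represent the original vector more faithfully, which is the trade-off the paper evaluates against task performance, latency, and energy.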

Title: Mixture of Weight-shared Heterogeneous Group Attention Experts for Dynamic Token-wise KV Optimization

Authors: Guanghui Song, Dongping Liao, Yiren Zhao, Kejiang Ye, Cheng-zhong Xu, Xitong Gao
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.13541
Pdf URL: https://arxiv.org/pdf/2506.13541
Copy Paste: [[2506.13541]] Mixture of Weight-shared Heterogeneous Group Attention Experts for Dynamic Token-wise KV Optimization(https://arxiv.org/abs/2506.13541)
Keywords: language model
Abstract: Transformer models face scalability challenges in causal language modeling (CLM) due to inefficient memory allocation for growing key-value (KV) caches, which strains compute and storage resources. Existing methods like Grouped Query Attention (GQA) and token-level KV optimization improve efficiency but rely on rigid resource allocation, often discarding "low-priority" tokens or statically grouping them, failing to address the dynamic spectrum of token importance. We propose mixSGA, a novel mixture-of-expert (MoE) approach that dynamically optimizes token-wise computation and memory allocation. Unlike prior approaches, mixSGA retains all tokens while adaptively routing them to specialized experts with varying KV group sizes, balancing granularity and efficiency. Our key novelties include: (1) a token-wise expert-choice routing mechanism guided by learned importance scores, enabling proportional resource allocation without token discard; (2) weight-sharing across grouped attention projections to minimize parameter overhead; and (3) an auxiliary loss to ensure one-hot routing decisions for training-inference consistency in CLMs. Extensive evaluations across Llama3, TinyLlama, OPT, and Gemma2 model families show mixSGA's superiority over static baselines. On instruction-following and continued pretraining tasks, mixSGA achieves higher ROUGE-L and lower perplexity under the same KV budgets.

Title: Understand the Implication: Learning to Think for Pragmatic Understanding

Authors: Settaluri Lakshmi Sravanthi, Kishan Maharaj, Sravani Gunnu, Abhijit Mishra, Pushpak Bhattacharyya
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.13559
Pdf URL: https://arxiv.org/pdf/2506.13559
Copy Paste: [[2506.13559]] Understand the Implication: Learning to Think for Pragmatic Understanding(https://arxiv.org/abs/2506.13559)
Keywords: llm
Abstract: Pragmatics, the ability to infer meaning beyond literal interpretation, is crucial for social cognition and communication. While LLMs have been benchmarked for their pragmatic understanding, improving their performance remains underexplored. Existing methods rely on annotated labels but overlook the reasoning process humans naturally use to interpret implicit meaning. To bridge this gap, we introduce a novel pragmatic dataset, ImpliedMeaningPreference, that includes explicit reasoning (thoughts) for both correct and incorrect interpretations. Through preference-tuning and supervised fine-tuning, we demonstrate that thought-based learning significantly enhances LLMs' pragmatic understanding, improving accuracy by 11.12% across model families. We further discuss a transfer-learning study where we evaluate the performance of thought-based training for the other tasks of pragmatics (presupposition, deixis) that are not seen during the training time and observe an improvement of 16.10% compared to label-trained models.

Title: Qwen vs. Gemma Integration with Whisper: A Comparative Study in Multilingual SpeechLLM Systems

Authors: Tuan Nguyen, Long-Vu Hoang, Huy-Dat Tran
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2506.13596
Pdf URL: https://arxiv.org/pdf/2506.13596
Copy Paste: [[2506.13596]] Qwen vs. Gemma Integration with Whisper: A Comparative Study in Multilingual SpeechLLM Systems(https://arxiv.org/abs/2506.13596)
Keywords: language model, llm
Abstract: This paper presents our system for the MLC-SLM Challenge 2025, focusing on multilingual speech recognition and language modeling with large language models (LLMs). Our approach combines a fine-tuned Whisper-large-v3 encoder with efficient projector architectures and various decoder configurations. We employ a three-stage training methodology that progressively optimizes the encoder, projector, and LLM components. Our system achieves competitive performance with a private test average WER/CER result of 16.63% using the Gemma3-12B and 18.6% using the Qwen2.5-7B as decoder-only language model.
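
The headline numbers above are word and character error rates. For reference, a minimal sketch of the standard word error rate follows: the word-level edit distance between hypothesis and reference, divided by the reference length. It assumes whitespace tokenisation and no text normalisation, which the challenge's official scoring may handle differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(
                min(
                    prev[j] + 1,  # deletion of a reference word
                    curr[j - 1] + 1,  # insertion of a hypothesis word
                    prev[j - 1] + (ref_word != hyp_word),  # substitution or match
                )
            )
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
```

Character error rate follows the same recurrence over characters instead of words, which is why it is typically reported for languages without whitespace word boundaries.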

Title: CAMS: A CityGPT-Powered Agentic Framework for Urban Human Mobility Simulation

Authors: Yuwei Du, Jie Feng, Jian Yuan, Yong Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.13599
Pdf URL: https://arxiv.org/pdf/2506.13599
Copy Paste: [[2506.13599]] CAMS: A CityGPT-Powered Agentic Framework for Urban Human Mobility Simulation(https://arxiv.org/abs/2506.13599)
Keywords: language model, gpt, llm, agent
Abstract: Human mobility simulation plays a crucial role in various real-world applications. Recently, to address the limitations of traditional data-driven approaches, researchers have explored leveraging the commonsense knowledge and reasoning capabilities of large language models (LLMs) to accelerate human mobility simulation. However, these methods suffer from several critical shortcomings, including inadequate modeling of urban spaces and poor integration with both individual mobility patterns and collective mobility distributions. To address these challenges, we propose the CityGPT-Powered Agentic framework for Mobility Simulation (CAMS), an agentic framework that leverages a language-based urban foundation model to simulate human mobility in urban space. CAMS comprises three core modules: MobExtractor, which extracts template mobility patterns and synthesizes new ones based on user profiles; GeoGenerator, which generates anchor points considering collective knowledge and generates candidate urban geospatial knowledge using an enhanced version of CityGPT; and TrajEnhancer, which retrieves spatial knowledge based on mobility patterns and generates trajectories with real trajectory preference alignment via DPO. Experiments on real-world datasets show that CAMS achieves superior performance without relying on externally provided geospatial information. Moreover, by holistically modeling both individual mobility patterns and collective mobility constraints, CAMS generates more realistic and plausible trajectories. In general, CAMS establishes a new paradigm that integrates the agentic framework with urban-knowledgeable LLMs for human mobility simulation.

Title: An Empirical Study of LLM-as-a-Judge: How Design Choices Impact Evaluation Reliability

Authors: Yusuke Yamauchi, Taro Yano, Masafumi Oyamada
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.13639
Pdf URL: https://arxiv.org/pdf/2506.13639
Copy Paste: [[2506.13639]] An Empirical Study of LLM-as-a-Judge: How Design Choices Impact Evaluation Reliability(https://arxiv.org/abs/2506.13639)
Keywords: language model, llm
Abstract: As large language models (LLMs) continue to advance, reliable evaluation methods are essential, particularly for open-ended, instruction-following tasks. LLM-as-a-Judge enables automatic evaluation using LLMs as evaluators, but its reliability remains uncertain. In this work, we analyze key factors affecting its trustworthiness, focusing on alignment with human judgments and evaluation consistency. Using BIGGENBench and EvalBiasBench, we study the effects of evaluation design, decoding strategies, and Chain-of-Thought (CoT) reasoning in evaluation. Our results show that evaluation criteria are critical for reliability, non-deterministic sampling improves alignment with human preferences over deterministic evaluation, and CoT reasoning offers minimal gains when clear evaluation criteria are present.

Title: EvolvTrip: Enhancing Literary Character Understanding with Temporal Theory-of-Mind Graphs

Authors: Bohao Yang, Hainiu Xu, Jinhua Du, Ze Li, Yulan He, Chenghua Lin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.13641
Pdf URL: https://arxiv.org/pdf/2506.13641
Copy Paste: [[2506.13641]] EvolvTrip: Enhancing Literary Character Understanding with Temporal Theory-of-Mind Graphs(https://arxiv.org/abs/2506.13641)
Keywords: language model, llm
Abstract: A compelling portrayal of characters is essential to the success of narrative writing. For readers, appreciating a character's traits requires the ability to infer their evolving beliefs, desires, and intentions over the course of a complex storyline, a cognitive skill known as Theory-of-Mind (ToM). Performing ToM reasoning in prolonged narratives requires readers to integrate historical context with current narrative information, a task at which humans excel but Large Language Models (LLMs) often struggle. To systematically evaluate LLMs' ToM reasoning capability in long narratives, we construct LitCharToM, a benchmark of character-centric questions across four ToM dimensions from classic literature. Further, we introduce EvolvTrip, a perspective-aware temporal knowledge graph that tracks psychological development throughout narratives. Our experiments demonstrate that EvolvTrip consistently enhances performance of LLMs across varying scales, even in challenging extended-context scenarios. EvolvTrip proves to be particularly valuable for smaller models, partially bridging the performance gap with larger LLMs and showing great compatibility with lengthy narratives. Our findings highlight the importance of explicit representation of temporal character mental states in narrative comprehension and offer a foundation for more sophisticated character understanding. Our data and code are publicly available at this https URL.

Title: Prefix-Tuning+: Modernizing Prefix-Tuning through Attention Independent Prefix Data

Authors: Haonan Wang, Brian Chen, Li Siquan, Liang Xinhe, Tianyang Hu, Hwee Kuan Lee, Kenji Kawaguchi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.13674
Pdf URL: https://arxiv.org/pdf/2506.13674
Copy Paste: [[2506.13674]] Prefix-Tuning+: Modernizing Prefix-Tuning through Attention Independent Prefix Data(https://arxiv.org/abs/2506.13674)
Keywords: language model, llm
Abstract: Parameter-Efficient Fine-Tuning (PEFT) methods have become crucial for rapidly adapting large language models (LLMs) to downstream tasks. Prefix-Tuning, an early and effective PEFT technique, demonstrated the ability to achieve performance comparable to full fine-tuning with significantly reduced computational and memory overhead. However, despite its earlier success, its effectiveness in training modern state-of-the-art LLMs has been very limited. In this work, we demonstrate empirically that Prefix-Tuning underperforms on LLMs because of an inherent tradeoff between input and prefix significance within the attention head. This motivates us to introduce Prefix-Tuning+, a novel architecture that generalizes the principles of Prefix-Tuning while addressing its shortcomings by shifting the prefix module out of the attention head itself. We further provide an overview of our construction process to guide future users when constructing their own context-based methods. Our experiments show that, across a diverse set of benchmarks, Prefix-Tuning+ consistently outperforms existing Prefix-Tuning methods. Notably, it achieves performance on par with the widely adopted LoRA method on several general benchmarks, highlighting the potential modern extension of Prefix-Tuning approaches. Our findings suggest that by overcoming its inherent limitations, Prefix-Tuning can remain a competitive and relevant research direction in the landscape of parameter-efficient LLM adaptation.

Title: Turning Down the Heat: A Critical Analysis of Min-p Sampling in Language Models

Authors: Rylan Schaeffer, Joshua Kazdan, Yegor Denisov-Blanch
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.13681
Pdf URL: https://arxiv.org/pdf/2506.13681
Copy Paste: [[2506.13681]] Turning Down the Heat: A Critical Analysis of Min-p Sampling in Language Models(https://arxiv.org/abs/2506.13681)
Keywords: language model, llm
Abstract: Sampling from language models impacts the quality and diversity of outputs, affecting both research and real-world applications. Recently, Nguyen et al. 2024's "Turning Up the Heat: Min-p Sampling for Creative and Coherent LLM Outputs" introduced a new sampler called min-p, claiming it achieves superior quality and diversity over established samplers such as basic, top-k, and top-p sampling. The significance of these claims was underscored by the paper's recognition as the 18th highest-scoring submission to ICLR 2025 and selection for an Oral presentation. This paper conducts a comprehensive re-examination of the evidence supporting min-p and reaches different conclusions from the original paper's four lines of evidence. First, the original paper's human evaluations omitted data, conducted statistical tests incorrectly, and described qualitative feedback inaccurately; our reanalysis demonstrates min-p did not outperform baselines in quality, diversity, or a trade-off between quality and diversity; in response to our findings, the authors of the original paper conducted a new human evaluation using a different implementation, task, and rubric that nevertheless provides further evidence min-p does not improve over baselines. Second, comprehensively sweeping the original paper's NLP benchmarks reveals min-p does not surpass baselines when controlling for the number of hyperparameters. Third, the original paper's LLM-as-a-Judge evaluations lack methodological clarity and appear inconsistently reported. Fourth, community adoption claims (49k GitHub repositories, 1.1M GitHub stars) were found to be unsubstantiated, leading to their removal; the revised adoption claim remains misleading. We conclude that evidence presented in the original paper fails to support claims that min-p improves quality, diversity, or a trade-off between quality and diversity.
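
For readers unfamiliar with the sampler under re-examination, min-p is usually described as keeping only tokens whose probability is at least a base fraction of the most likely token's probability, then renormalising and sampling. The sketch below follows that common description. The function names and the toy distribution are hypothetical, and this is not the implementation evaluated in either paper.

```python
import random
from typing import List, Sequence


def min_p_filter(probs: Sequence[float], p_base: float) -> List[float]:
    """Zero out tokens below p_base * max(probs), then renormalise the rest."""
    if not 0.0 < p_base <= 1.0:
        raise ValueError("p_base must be in (0, 1]")
    threshold = p_base * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]


def sample_min_p(probs: Sequence[float], p_base: float, rng: random.Random) -> int:
    """Draw one token index from the min-p filtered distribution."""
    filtered = min_p_filter(probs, p_base)
    return rng.choices(range(len(filtered)), weights=filtered, k=1)[0]


# Toy next-token distribution over a four-token vocabulary.
toy_probs = [0.5, 0.3, 0.15, 0.05]
filtered_probs = min_p_filter(toy_probs, p_base=0.2)
token_index = sample_min_p(toy_probs, p_base=0.2, rng=random.Random(0))
```

Because the cutoff scales with the top probability, the sampler has one hyperparameter beyond temperature, which is relevant to the abstract's point about controlling for the number of hyperparameters when comparing against top-k and top-p.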

Title: Balancing Knowledge Delivery and Emotional Comfort in Healthcare Conversational Systems

Authors: Shang-Chi Tsai, Yun-Nung Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.13692
Pdf URL: https://arxiv.org/pdf/2506.13692
Copy Paste: [[2506.13692]] Balancing Knowledge Delivery and Emotional Comfort in Healthcare Conversational Systems(https://arxiv.org/abs/2506.13692)
Keywords: language model, llm
Abstract: With the advancement of large language models, many dialogue systems are now capable of providing reasonable and informative responses to patients' medical conditions. However, when patients consult their doctor, they may experience negative emotions due to the severity and urgency of their situation. If the model can provide appropriate comfort and empathy based on the patient's negative emotions while answering medical questions, it will likely offer a more reassuring experience during the medical consultation process. To address this issue, our paper explores the balance between knowledge sharing and emotional support in the healthcare dialogue process. We utilize a large language model to rewrite a real-world interactive medical dialogue dataset, generating patient queries with negative emotions and corresponding medical responses aimed at soothing the patient's emotions while addressing their concerns. The modified data serves to refine the latest large language models with various fine-tuning methods, enabling them to accurately provide sentences with both emotional reassurance and constructive suggestions in response to patients' questions. Compared to the original LLM model, our experimental results demonstrate that our methodology significantly enhances the model's ability to generate emotional responses while maintaining its original capability to provide accurate knowledge-based answers.

Title: Instruction Following by Boosting Attention of Large Language Models

Authors: Vitoria Guardieiro, Adam Stein, Avishree Khare, Eric Wong
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.13734
Pdf URL: https://arxiv.org/pdf/2506.13734
Copy Paste: [[2506.13734]] Instruction Following by Boosting Attention of Large Language Models(https://arxiv.org/abs/2506.13734)
Keywords: language model, llm, prompt
Abstract: Controlling the generation of large language models (LLMs) remains a central challenge to ensure their safe and reliable deployment. While prompt engineering and finetuning are common approaches, recent work has explored latent steering, a lightweight technique that alters LLM internal activations to guide generation. However, subsequent studies revealed latent steering's effectiveness to be limited, often underperforming simple instruction prompting. To address this limitation, we first establish a benchmark across diverse behaviors for standardized evaluation of steering techniques. Building on insights from this benchmark, we introduce Instruction Attention Boosting (InstABoost), a latent steering method that boosts the strength of instruction prompting by altering the model's attention during generation. InstABoost combines the strengths of existing approaches and is theoretically supported by prior work that suggests that in-context rule following in transformer-based models can be controlled by manipulating attention on instructions. Empirically, InstABoost demonstrates superior control success compared to both traditional prompting and latent steering.
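
The abstract states that InstABoost alters the model's attention to strengthen instruction prompting but does not spell out the operation. One plausible reading, sketched below purely as an assumption, is to scale the post-softmax attention weights on instruction token positions by a constant factor and renormalise each row. The function name, the factor `alpha`, and the toy weights are hypothetical.

```python
from typing import List, Sequence, Set


def boost_instruction_attention(
    weights: Sequence[float],
    instruction_positions: Set[int],
    alpha: float,
) -> List[float]:
    """Scale attention on instruction positions by alpha, then renormalise to sum to 1."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    boosted = [w * alpha if i in instruction_positions else w for i, w in enumerate(weights)]
    total = sum(boosted)
    return [w / total for w in boosted]


# One attention row over four key positions; positions 0 and 1 hold the instruction.
row = [0.25, 0.25, 0.25, 0.25]
boosted_row = boost_instruction_attention(row, instruction_positions={0, 1}, alpha=3.0)
```

In a real model this would be applied per head and per query position during generation. With `alpha` equal to 1 the attention row is unchanged, which gives a natural baseline.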

Title: LTRR: Learning To Rank Retrievers for LLMs

Authors: To Eun Kim, Fernando Diaz
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2506.13743
Pdf URL: https://arxiv.org/pdf/2506.13743
Copy Paste: [[2506.13743]] LTRR: Learning To Rank Retrievers for LLMs(https://arxiv.org/abs/2506.13743)
Keywords: llm, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) systems typically rely on a single fixed retriever, despite growing evidence that no single retriever performs optimally across all query types. In this paper, we explore a query routing approach that dynamically selects from a pool of retrievers based on the query, using both train-free heuristics and learned routing models. We frame routing as a learning-to-rank (LTR) problem and introduce LTRR, a framework that learns to rank retrievers by their expected utility gain to downstream LLM performance. Our experiments, conducted on synthetic QA data with controlled query type variations, show that routing-based RAG systems can outperform the best single-retriever-based systems. Performance gains are especially pronounced in models trained with the Answer Correctness (AC) metric and with pairwise learning approaches, especially with XGBoost. We also observe improvements in generalization to out-of-distribution queries. As part of the SIGIR 2025 LiveRAG challenge, our submitted system demonstrated the practical viability of our approach, achieving competitive performance in both answer correctness and faithfulness. These findings highlight the importance of both training methodology and metric selection in query routing for RAG systems.

Title: Steering LLM Thinking with Budget Guidance

Authors: Junyan Li, Wenshuo Zhao, Yang Zhang, Chuang Gan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.13752
Pdf URL: https://arxiv.org/pdf/2506.13752
Copy Paste: [[2506.13752]] Steering LLM Thinking with Budget Guidance(https://arxiv.org/abs/2506.13752)
Keywords: language model, llm
Abstract: Recent deep-thinking large language models often reason extensively to improve performance, but such lengthy reasoning is not always desirable, as it incurs excessive inference costs with disproportionate performance gains. Controlling reasoning length without sacrificing performance is therefore important, but remains challenging, especially under tight thinking budgets. We propose budget guidance, a simple yet effective method for steering the reasoning process of LLMs toward a target budget without requiring any LLM fine-tuning. Our approach introduces a lightweight predictor that models a Gamma distribution over the remaining thinking length during next-token generation. This signal is then used to guide generation in a soft, token-level manner, ensuring that the overall reasoning trace adheres to the specified thinking budget. Budget guidance enables natural control of the thinking length, along with significant token efficiency improvements over baseline methods on challenging math benchmarks. For instance, it achieves up to a 26% accuracy gain on the MATH-500 benchmark under tight budgets compared to baseline methods, while maintaining competitive accuracy with only 63% of the thinking tokens used by the full-thinking model. Budget guidance also generalizes to broader task domains and exhibits emergent capabilities, such as estimating question difficulty. The source code is available at: this https URL.