2025-06-05

Title: Evaluating Large Language Models for Zero-Shot Disease Labeling in CT Radiology Reports Across Organ Systems

Authors: Michael E. Garcia-Alcoser, Mobina GhojoghNejad, Fakrul Islam Tushar, David Kim, Kyle J. Lafata, Geoffrey D. Rubin, Joseph Y. Lo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.03259
Pdf URL: https://arxiv.org/pdf/2506.03259
Copy Paste: [[2506.03259]] Evaluating Large Language Models for Zero-Shot Disease Labeling in CT Radiology Reports Across Organ Systems(https://arxiv.org/abs/2506.03259)
Keywords: language model, llm, prompt
Abstract: Purpose: This study aims to evaluate the effectiveness of large language models (LLMs) in automating disease annotation of CT radiology reports. We compare a rule-based algorithm (RBA), RadBERT, and three lightweight open-weight LLMs for multi-disease labeling of chest, abdomen, and pelvis (CAP) CT reports. Materials and Methods: This retrospective study analyzed 40,833 CT reports from 29,540 patients, with 1,789 CAP reports manually annotated across three organ systems. External validation was conducted using the CT-RATE dataset. Three open-weight LLMs were tested with zero-shot prompting. Performance was evaluated using Cohen's Kappa and micro/macro-averaged F1 scores. Results: In 12,197 Duke CAP reports from 8,854 patients, Llama-3.1 8B and Gemma-3 27B showed the highest agreement ($\kappa$ median: 0.87). On the manually annotated set, Gemma-3 27B achieved the top macro-F1 (0.82), followed by Llama-3.1 8B (0.79), while the RBA scored lowest (0.64). On the CT-RATE dataset (lungs/pleura only), Llama-3.1 8B performed best (0.91), with Gemma-3 27B close behind (0.89). Performance differences were mainly due to differing labeling practices, especially for lung atelectasis. Conclusion: Lightweight LLMs outperform rule-based methods for CT report annotation and generalize across organ systems with zero-shot prompting. However, binary labels alone cannot capture the full nuance of report language. LLMs can provide a flexible, efficient solution aligned with clinical judgment and user needs.
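
The two evaluation metrics named in the abstract, Cohen's kappa and micro/macro-averaged F1, can be illustrated with a minimal sketch. This is not the authors' evaluation code: the label names, the toy data, and the convention of returning 0.0 for an undefined F1 are assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def cohen_kappa(rater_a: List[int], rater_b: List[int]) -> float:
    """Cohen's kappa for two equal-length sequences of binary labels (0 or 1)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    p_a = sum(rater_a) / n  # fraction of positives from rater A
    p_b = sum(rater_b) / n  # fraction of positives from rater B
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)  # agreement expected by chance
    if expected == 1.0:
        return 1.0  # both raters gave the same constant label everywhere
    return (observed - expected) / (1 - expected)


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 from confusion counts; returns 0.0 when undefined (an assumed convention)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(
    gold: Dict[str, List[int]], pred: Dict[str, List[int]]
) -> Tuple[float, float]:
    """Return (micro_f1, macro_f1) over per-disease binary label lists."""
    total_tp = total_fp = total_fn = 0
    per_label_f1 = []
    for label, gold_labels in gold.items():
        pred_labels = pred[label]
        tp = sum(1 for g, p in zip(gold_labels, pred_labels) if g == 1 and p == 1)
        fp = sum(1 for g, p in zip(gold_labels, pred_labels) if g == 0 and p == 1)
        fn = sum(1 for g, p in zip(gold_labels, pred_labels) if g == 1 and p == 0)
        per_label_f1.append(f1_from_counts(tp, fp, fn))
        total_tp, total_fp, total_fn = total_tp + tp, total_fp + fp, total_fn + fn
    micro = f1_from_counts(total_tp, total_fp, total_fn)  # pool counts, then score
    macro = sum(per_label_f1) / len(per_label_f1)  # score per label, then average
    return micro, macro


# Hypothetical toy data: four reports, two disease labels.
gold = {"atelectasis": [1, 0, 1, 0], "nodule": [1, 1, 0, 0]}
pred = {"atelectasis": [1, 0, 0, 0], "nodule": [1, 1, 1, 0]}
micro_f1, macro_f1 = micro_macro_f1(gold, pred)
kappa = cohen_kappa(gold["atelectasis"], pred["atelectasis"])
```

Micro-averaging pools counts across labels and so favors frequent findings, while macro-averaging weights every disease equally, which is why the abstract reports both.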

Title: A conclusive remark on linguistic theorizing and language modeling

Authors: Cristiano Chesi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.03268
Pdf URL: https://arxiv.org/pdf/2506.03268
Copy Paste: [[2506.03268]] A conclusive remark on linguistic theorizing and language modeling(https://arxiv.org/abs/2506.03268)
Keywords: language model
Abstract: This is the final remark on the replies received to my target paper in the Italian Journal of Linguistics.

Title: FailureSensorIQ: A Multi-Choice QA Dataset for Understanding Sensor Relationships and Failure Modes

Authors: Christodoulos Constantinides, Dhaval Patel, Shuxin Lin, Claudio Guerrero, Sunil Dagajirao Patil, Jayant Kalagnanam
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.03278
Pdf URL: https://arxiv.org/pdf/2506.03278
Copy Paste: [[2506.03278]] FailureSensorIQ: A Multi-Choice QA Dataset for Understanding Sensor Relationships and Failure Modes(https://arxiv.org/abs/2506.03278)
Keywords: language model, gpt, llm, agent
Abstract: We introduce FailureSensorIQ, a novel Multi-Choice Question-Answering (MCQA) benchmarking system designed to assess the ability of Large Language Models (LLMs) to reason and understand complex, domain-specific scenarios in Industry 4.0. Unlike traditional QA benchmarks, our system focuses on multiple aspects of reasoning through failure modes, sensor data, and the relationships between them across various industrial assets. Through this work, we envision a paradigm shift where modeling decisions are not only data-driven using statistical tools like correlation analysis and significance tests, but also domain-driven by specialized LLMs which can reason about the key contributors and useful patterns that can be captured with feature engineering. We evaluate the industrial knowledge of over a dozen LLMs, including GPT-4, Llama, and Mistral, on FailureSensorIQ through several lenses: Perturbation-Uncertainty-Complexity analysis, an Expert Evaluation study, Asset-Specific Knowledge Gap analysis, and a ReAct agent using external knowledge bases. Even though closed-source models with strong reasoning capabilities approach expert-level performance, the comprehensive benchmark reveals a significant drop in performance and fragility to perturbations, distractions, and inherent knowledge gaps in the models. We also provide a real-world case study of how LLMs can drive the modeling decisions on 3 different failure prediction datasets related to various assets. We release: (a) expert-curated MCQA for various industrial assets, (b) FailureSensorIQ benchmark and Hugging Face leaderboard based on MCQA built from non-textual data found in ISO documents, and (c) LLMFeatureSelector, an LLM-based feature selection scikit-learn pipeline. The software is available at this https URL.

Title: HyperSteer: Activation Steering at Scale with Hypernetworks

Authors: Jiuding Sun, Sidharth Baskaran, Zhengxuan Wu, Michael Sklar, Christopher Potts, Atticus Geiger
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.03292
Pdf URL: https://arxiv.org/pdf/2506.03292
Copy Paste: [[2506.03292]] HyperSteer: Activation Steering at Scale with Hypernetworks(https://arxiv.org/abs/2506.03292)
Keywords: language model, prompt
Abstract: Steering language models (LMs) by modifying internal activations is a popular approach for controlling text generation. Unsupervised dictionary learning methods, e.g., sparse autoencoders, can be scaled to produce many steering vectors, but lack guarantees on the individual efficacy of each vector and control over the coverage of relevant steering tasks. In contrast, supervised methods for constructing steering vectors are targeted and effective, but require more data collection and training for each additional steering vector produced. In this work, we introduce HyperSteer, a family of hypernetwork-based architectures which are trained end-to-end to generate steering vectors conditioned on the natural language steering prompts and the internals of the steered LM. In our evaluations, we show that scaling HyperSteer with thousands of steering prompts exceeds the performance of state-of-the-art activation steering methods, even on steering prompts never seen during training. Moreover, HyperSteer performs on par with steering-via-prompting.
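
The general idea of activation steering with a generated vector can be sketched as follows. This is a toy illustration and not the HyperSteer architecture: the single linear map standing in for the hypernetwork, the `strength` parameter, and all numbers are assumptions, and the abstract does not specify how the steering vector is applied.

```python
from typing import List

Vector = List[float]


def linear_hypernetwork(weights: List[Vector], prompt_embedding: Vector) -> Vector:
    """Toy stand-in for a hypernetwork: map a steering-prompt embedding to a
    steering vector with one matrix multiplication (an assumed simplification)."""
    return [sum(w * x for w, x in zip(row, prompt_embedding)) for row in weights]


def steer(hidden_state: Vector, steering_vector: Vector, strength: float = 1.0) -> Vector:
    """Additive steering: shift a hidden activation along the steering vector."""
    return [h + strength * s for h, s in zip(hidden_state, steering_vector)]


# Hypothetical sizes: 2-dimensional prompt embedding, 3-dimensional hidden state.
weights = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
prompt_embedding = [0.5, -1.0]
steering_vector = linear_hypernetwork(weights, prompt_embedding)
steered = steer([1.0, 1.0, 1.0], steering_vector, strength=2.0)
```

A supervised method would train one vector per steering task; the point of conditioning on the prompt is that a single trained network can emit a vector for a prompt it has never seen.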

Title: Unleashing the Reasoning Potential of Pre-trained LLMs by Critique Fine-Tuning on One Problem

Authors: Yubo Wang, Ping Nie, Kai Zou, Lijun Wu, Wenhu Chen
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.03295
Pdf URL: https://arxiv.org/pdf/2506.03295
Copy Paste: [[2506.03295]] Unleashing the Reasoning Potential of Pre-trained LLMs by Critique Fine-Tuning on One Problem(https://arxiv.org/abs/2506.03295)
Keywords: llm, prompt
Abstract: We have witnessed that strong LLMs like Qwen-Math, MiMo, and Phi-4 possess immense reasoning potential inherited from the pre-training stage. With reinforcement learning (RL), these models can improve dramatically on reasoning tasks. Recent studies have shown that even RL on a single problem can unleash these models' reasoning capabilities. However, RL is not only expensive but also unstable. Even one-shot RL requires hundreds of GPU hours. This raises a critical question: Is there a more efficient way to unleash the reasoning potential of these powerful base LLMs? In this work, we demonstrate that Critique Fine-Tuning (CFT) on only one problem can effectively unleash the reasoning potential of LLMs. Our method constructs critique data by collecting diverse model-generated solutions to a single problem and using teacher LLMs to provide detailed critiques. We fine-tune Qwen and Llama family models, ranging from 1.5B to 14B parameters, on the CFT data and observe significant performance gains across diverse reasoning tasks. For example, with just 5 GPU hours of training, Qwen-Math-7B-CFT shows an average improvement of 15% on six math benchmarks and 16% on three logic reasoning benchmarks. These results are comparable to, or even surpass, those from RL while using 20x less compute. Ablation studies reveal the robustness of one-shot CFT across different prompt problems. These results highlight one-shot CFT as a simple, general, and compute-efficient approach to unleashing the reasoning capabilities of modern LLMs.

Title: From Instructions to ODRL Usage Policies: An Ontology Guided Approach

Authors: Daham M. Mustafa, Abhishek Nadgeri, Diego Collarana, Benedikt T. Arnold, Christoph Quix, Christoph Lange, Stefan Decker
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.03301
Pdf URL: https://arxiv.org/pdf/2506.03301
Copy Paste: [[2506.03301]] From Instructions to ODRL Usage Policies: An Ontology Guided Approach(https://arxiv.org/abs/2506.03301)
Keywords: language model, gpt, prompt
Abstract: This study presents an approach that uses large language models such as GPT-4 to automatically generate usage policies in the W3C Open Digital Rights Language (ODRL) from natural language instructions. Our approach uses the ODRL ontology and its documentation as a central part of the prompt. Our research hypothesis is that a curated version of existing ontology documentation will better guide policy generation. We present various heuristics for adapting the ODRL ontology and its documentation to guide an end-to-end KG construction process. We evaluate our approach in the context of dataspaces, i.e., distributed infrastructures for trustworthy data exchange between multiple participating organizations for the cultural domain. We created a benchmark consisting of 12 use cases of varying complexity. Our evaluation shows excellent results with up to 91.95% accuracy in the resulting knowledge graph.

Title: Hopscotch: Discovering and Skipping Redundancies in Language Models

Authors: Mustafa Eyceoz, Nikhil Shivakumar Nayak, Hao Wang, Ligong Han, Akash Srivastava
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.03303
Pdf URL: https://arxiv.org/pdf/2506.03303
Copy Paste: [[2506.03303]] Hopscotch: Discovering and Skipping Redundancies in Language Models(https://arxiv.org/abs/2506.03303)
Keywords: language model
Abstract: Modern causal language models stack many attention blocks to improve performance, but not all blocks are necessary for every task. We propose Hopscotch, a simple yet effective method that identifies and skips attention blocks with least contributions to a task and adapts to preserve output quality. Hopscotch jointly optimizes which blocks to skip and how to scale the outputs of the remaining layers. By introducing lightweight, trainable scaling parameters to attention and MLP blocks, it mitigates distribution shifts in hidden states caused by removing attention blocks. Hopscotch does not modify model weights or require access to pretraining or instruction-tuning data, and is compatible with existing model compression techniques. When applied to $\texttt{Llama-3.1-8B}$ and $\texttt{Qwen2.5-7B}$, Hopscotch achieves less than a 2% drop in performance even after skipping four attention blocks.
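
The mechanism described above, skipping selected attention blocks and rescaling the outputs of the remaining blocks, can be sketched on a toy residual stack. This is not the Hopscotch implementation: the `Layer` class, the stand-in attention and MLP functions, and the scale values are assumptions, and the search for which blocks to skip and the training of the scales are left out.

```python
from dataclasses import dataclass
from typing import Callable, List

Vector = List[float]


@dataclass
class Layer:
    attention: Callable[[Vector], Vector]  # stand-in for an attention block
    mlp: Callable[[Vector], Vector]        # stand-in for an MLP block
    attn_scale: float = 1.0                # scaling parameter (trainable in principle)
    mlp_scale: float = 1.0                 # scaling parameter (trainable in principle)
    skip_attention: bool = False           # True means this attention block is skipped


def forward(layers: List[Layer], x: Vector) -> Vector:
    """Residual forward pass that omits skipped attention blocks and scales
    the outputs of the blocks that remain."""
    for layer in layers:
        if not layer.skip_attention:
            attn_out = layer.attention(x)
            x = [xi + layer.attn_scale * ai for xi, ai in zip(x, attn_out)]
        mlp_out = layer.mlp(x)
        x = [xi + layer.mlp_scale * mi for xi, mi in zip(x, mlp_out)]
    return x


def toy_attention(v: Vector) -> Vector:
    """Hypothetical mixing step: every position receives the mean of all positions."""
    mean = sum(v) / len(v)
    return [mean] * len(v)


def toy_mlp(v: Vector) -> Vector:
    """Hypothetical position-wise transform."""
    return [0.5 * vi for vi in v]


layers = [
    Layer(toy_attention, toy_mlp),
    # Attention skipped here; a larger MLP scale stands in for the learned
    # compensation of the resulting shift in hidden states.
    Layer(toy_attention, toy_mlp, mlp_scale=2.0, skip_attention=True),
]
output = forward(layers, [1.0, 3.0])
```

Because only the scales and the skip mask change, the original weights stay untouched, which matches the abstract's claim that the method does not modify model weights.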

Title: Ask a Local: Detecting Hallucinations With Specialized Model Divergence

Authors: Aldan Creo, Héctor Cerezo-Costas, Pedro Alonso-Doval, Maximiliano Hormazábal-Lagos
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.03357
Pdf URL: https://arxiv.org/pdf/2506.03357
Copy Paste: [[2506.03357]] Ask a Local: Detecting Hallucinations With Specialized Model Divergence(https://arxiv.org/abs/2506.03357)
Keywords: language model, llm, hallucination
Abstract: Hallucinations in large language models (LLMs) - instances where models generate plausible but factually incorrect information - present a significant challenge for AI. We introduce "Ask a Local", a novel hallucination detection method exploiting the intuition that specialized models exhibit greater surprise when encountering domain-specific inaccuracies. Our approach computes divergence between perplexity distributions of language-specialized models to identify potentially hallucinated spans. Our method is particularly well-suited for a multilingual context, as it naturally scales to multiple languages without the need for adaptation, relying on external data sources, or performing training. Moreover, we select computationally efficient models, providing a scalable solution that can be applied to a wide range of languages and domains. Our results on a human-annotated question-answer dataset spanning 14 languages demonstrate consistent performance across languages, with Intersection-over-Union (IoU) scores around 0.3 and comparable Spearman correlation values. Our model shows particularly strong performance on Italian and Catalan, with IoU scores of 0.42 and 0.38, respectively, while maintaining cross-lingual effectiveness without language-specific adaptations. We release our code and architecture to facilitate further research in multilingual hallucination detection.
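
The intuition of comparing how surprised two models are by each token, and then scoring the flagged tokens with Intersection-over-Union, can be sketched as follows. The abstract does not give the exact divergence, so the plain difference of per-token surprisals, the threshold, and the log-probability values here are assumptions rather than the authors' method.

```python
from typing import List, Set


def surprisal_gap(logp_specialized: List[float], logp_reference: List[float]) -> List[float]:
    """Per-token surprisal (negative log-probability) under the specialized model
    minus surprisal under the reference model. Large positive values mean the
    specialized model finds the token far more surprising."""
    return [(-a) - (-b) for a, b in zip(logp_specialized, logp_reference)]


def flag_tokens(gaps: List[float], threshold: float) -> Set[int]:
    """Indices of tokens whose surprisal gap exceeds the threshold."""
    return {i for i, gap in enumerate(gaps) if gap > threshold}


def intersection_over_union(predicted: Set[int], gold: Set[int]) -> float:
    """IoU between predicted and annotated token indices (1.0 if both are empty)."""
    union = predicted | gold
    return len(predicted & gold) / len(union) if union else 1.0


# Hypothetical per-token log-probabilities for a four-token answer.
logp_specialized = [-0.1, -3.0, -2.5, -0.2]
logp_reference = [-0.2, -0.5, -0.4, -0.3]
flagged = flag_tokens(surprisal_gap(logp_specialized, logp_reference), threshold=1.0)
score = intersection_over_union(flagged, gold={2, 3})
```

The sketch needs only token log-probabilities from existing models, which is consistent with the abstract's point that no training or external data source is required.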

Title: A Multimodal, Multilingual, and Multidimensional Pipeline for Fine-grained Crowdsourcing Earthquake Damage Evaluation

Authors: Zihui Ma, Lingyao Li, Juan Li, Wenyue Hua, Jingxiao Liu, Qingyuan Feng, Yuki Miura
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2506.03360
Pdf URL: https://arxiv.org/pdf/2506.03360
Copy Paste: [[2506.03360]] A Multimodal, Multilingual, and Multidimensional Pipeline for Fine-grained Crowdsourcing Earthquake Damage Evaluation(https://arxiv.org/abs/2506.03360)
Keywords: language model, llm
Abstract: Rapid, fine-grained disaster damage assessment is essential for effective emergency response, yet remains challenging due to limited ground sensors and delays in official reporting. Social media provides a rich, real-time source of human-centric observations, but its multimodal and unstructured nature presents challenges for traditional analytical methods. In this study, we propose a structured Multimodal, Multilingual, and Multidimensional (3M) pipeline that leverages multimodal large language models (MLLMs) to assess disaster impacts. We evaluate three foundation models across two major earthquake events using both macro- and micro-level analyses. Results show that MLLMs effectively integrate image-text signals and demonstrate a strong correlation with ground-truth seismic data. However, performance varies with language, epicentral distance, and input modality. This work highlights the potential of MLLMs for disaster assessment and provides a foundation for future research in applying MLLMs to real-time crisis contexts. The code and data are released at: this https URL

Title: Trajectory Prediction Meets Large Language Models: A Survey

Authors: Yi Xu, Ruining Yang, Yitian Zhang, Yizhou Wang, Jianglin Lu, Mingyuan Zhang, Lili Su, Yun Fu
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2506.03408
Pdf URL: https://arxiv.org/pdf/2506.03408
Copy Paste: [[2506.03408]] Trajectory Prediction Meets Large Language Models: A Survey(https://arxiv.org/abs/2506.03408)
Keywords: language model, llm
Abstract: Recent advances in large language models (LLMs) have sparked growing interest in integrating language-driven techniques into trajectory prediction. By leveraging their semantic and reasoning capabilities, LLMs are reshaping how autonomous systems perceive, model, and predict trajectories. This survey provides a comprehensive overview of this emerging field, categorizing recent work into five directions: (1) Trajectory prediction via language modeling paradigms, (2) Direct trajectory prediction with pretrained language models, (3) Language-guided scene understanding for trajectory prediction, (4) Language-driven data generation for trajectory prediction, (5) Language-based reasoning and interpretability for trajectory prediction. For each, we analyze representative methods, highlight core design choices, and identify open challenges. This survey bridges natural language processing and trajectory prediction, offering a unified perspective on how language can enrich trajectory prediction.

Title: DistRAG: Towards Distance-Based Spatial Reasoning in LLMs

Authors: Nicole R Schneider, Nandini Ramachandran, Kent O'Sullivan, Hanan Samet
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2506.03424
Pdf URL: https://arxiv.org/pdf/2506.03424
Copy Paste: [[2506.03424]] DistRAG: Towards Distance-Based Spatial Reasoning in LLMs(https://arxiv.org/abs/2506.03424)
Keywords: language model, llm
Abstract: Many real-world tasks where Large Language Models (LLMs) can be used require spatial reasoning, like Point of Interest (POI) recommendation and itinerary planning. However, on their own, LLMs lack reliable spatial reasoning capabilities, especially about distances. To address this problem, we develop a novel approach, DistRAG, that enables an LLM to retrieve relevant spatial information not explicitly learned during training. Our method encodes the geodesic distances between cities and towns in a graph and retrieves a context subgraph relevant to the question. Using this technique, our method enables an LLM to answer distance-based reasoning questions that it otherwise cannot answer. Given the vast array of possible places an LLM could be asked about, DistRAG offers a flexible first step towards providing a rudimentary 'world model' to complement the linguistic knowledge held in LLMs.
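
The two steps named in the abstract, encoding distances between places in a graph and retrieving the subgraph relevant to a question, can be sketched as follows. This is not the DistRAG implementation: the haversine formula is a spherical approximation of geodesic distance, the coordinates are approximate, and matching place names by substring is an assumed stand-in for whatever retrieval the authors use.

```python
import math
from itertools import combinations
from typing import Dict, FrozenSet, List, Tuple

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; the sphere itself is an approximation
Coord = Tuple[float, float]  # (latitude, longitude) in degrees


def haversine_km(a: Coord, b: Coord) -> float:
    """Great-circle distance between two points on a sphere, in kilometres."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def build_distance_graph(places: Dict[str, Coord]) -> Dict[FrozenSet[str], float]:
    """Complete graph: one undirected edge per pair of places, weighted by distance."""
    return {
        frozenset((p, q)): haversine_km(places[p], places[q])
        for p, q in combinations(places, 2)
    }


def retrieve_context(question: str, graph: Dict[FrozenSet[str], float]) -> List[str]:
    """Return text lines for edges whose two endpoints are both named in the question."""
    lowered = question.lower()
    mentioned = {name for edge in graph for name in edge if name.lower() in lowered}
    lines = []
    for edge, km in graph.items():
        if edge <= mentioned:
            first, second = sorted(edge)
            lines.append(f"{first} - {second}: {km:.0f} km")
    return sorted(lines)


# Approximate coordinates, for illustration only.
places = {"Paris": (48.8566, 2.3522), "London": (51.5074, -0.1278), "Berlin": (52.52, 13.405)}
graph = build_distance_graph(places)
context = retrieve_context("How far is Paris from London?", graph)
```

The returned lines would be prepended to the LLM prompt as context, so the model reads the distance instead of having to recall it.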

Title: Time Course MechInterp: Analyzing the Evolution of Components and Knowledge in Large Language Models

Authors: Ahmad Dawar Hakimi, Ali Modarressi, Philipp Wicke, Hinrich Schütze
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.03434
Pdf URL: https://arxiv.org/pdf/2506.03434
Copy Paste: [[2506.03434]] Time Course MechInterp: Analyzing the Evolution of Components and Knowledge in Large Language Models(https://arxiv.org/abs/2506.03434)
Keywords: language model, llm
Abstract: Understanding how large language models (LLMs) acquire and store factual knowledge is crucial for enhancing their interpretability and reliability. In this work, we analyze the evolution of factual knowledge representation in the OLMo-7B model by tracking the roles of its attention heads and feed forward networks (FFNs) over the course of pre-training. We classify these components into four roles: general, entity, relation-answer, and fact-answer specific, and examine their stability and transitions. Our results show that LLMs initially depend on broad, general-purpose components, which later specialize as training progresses. Once the model reliably predicts answers, some components are repurposed, suggesting an adaptive learning process. Notably, attention heads display the highest turnover. We also present evidence that FFNs remain more stable throughout training. Furthermore, our probing experiments reveal that location-based relations converge to high accuracy earlier in training than name-based relations, highlighting how task complexity shapes acquisition dynamics. These insights offer a mechanistic view of knowledge formation in LLMs.

Title: Delta-KNN: Improving Demonstration Selection in In-Context Learning for Alzheimer's Disease Detection

Authors: Chuyuan Li, Raymond Li, Thalia S. Field, Giuseppe Carenini
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.03476
Pdf URL: https://arxiv.org/pdf/2506.03476
Copy Paste: [[2506.03476]] Delta-KNN: Improving Demonstration Selection in In-Context Learning for Alzheimer's Disease Detection(https://arxiv.org/abs/2506.03476)
Keywords: language model, llm
Abstract: Alzheimer's Disease (AD) is a progressive neurodegenerative disorder that leads to dementia, and early intervention can greatly benefit from analyzing linguistic abnormalities. In this work, we explore the potential of Large Language Models (LLMs) as health assistants for AD diagnosis from patient-generated text using in-context learning (ICL), where tasks are defined through a few input-output examples. Empirical results reveal that conventional ICL methods, such as similarity-based selection, perform poorly for AD diagnosis, likely due to the inherent complexity of this task. To address this, we introduce Delta-KNN, a novel demonstration selection strategy that enhances ICL performance. Our method leverages a delta score to assess the relative gains of each training example, coupled with a KNN-based retriever that dynamically selects optimal "representatives" for a given input. Experiments on two AD detection datasets across three open-source LLMs demonstrate that Delta-KNN consistently outperforms existing ICL baselines. Notably, when using the Llama-3.1 model, our approach achieves new state-of-the-art results, surpassing even supervised classifiers.
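
One possible reading of the selection strategy, a delta score per training example combined with a KNN retriever, can be sketched as follows. The abstract does not define the delta score, so this sketch assumes `delta[i][j]` is the change in the probability of the correct label for training target `j` when example `i` is used as a demonstration, relative to zero-shot prompting. The embeddings, the averaging step, and all values are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def select_demonstrations(
    query_embedding: Vector,
    train_embeddings: List[Vector],
    delta: List[List[float]],
    k: int,
    n_demos: int,
) -> List[int]:
    """Pick the demonstrations with the highest mean delta score over the
    k training targets most similar to the query."""
    neighbours = sorted(
        range(len(train_embeddings)),
        key=lambda j: cosine(query_embedding, train_embeddings[j]),
        reverse=True,
    )[:k]
    mean_gain = [sum(row[j] for j in neighbours) / len(neighbours) for row in delta]
    ranked = sorted(range(len(delta)), key=lambda i: mean_gain[i], reverse=True)
    return ranked[:n_demos]


# Hypothetical data: three training examples that serve as both targets and candidates.
train_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
delta = [
    [0.1, 0.2, -0.3],   # gains from using example 0 as the demonstration
    [-0.1, 0.0, 0.9],   # gains from using example 1
    [0.3, 0.1, 0.0],    # gains from using example 2
]
chosen = select_demonstrations([1.0, 0.05], train_embeddings, delta, k=2, n_demos=2)
```

Unlike similarity-based selection, which would return the nearest examples themselves, this picks the examples that helped most on inputs resembling the query.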

Title: APT: Improving Specialist LLM Performance with Weakness Case Acquisition and Iterative Preference Training

Authors: Jun Rao, Zepeng Lin, Xuebo Liu, Xiaopeng Ke, Lian Lian, Dong Jin, Shengjun Cheng, Jun Yu, Min Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.03483
Pdf URL: https://arxiv.org/pdf/2506.03483
Copy Paste: [[2506.03483]] APT: Improving Specialist LLM Performance with Weakness Case Acquisition and Iterative Preference Training(https://arxiv.org/abs/2506.03483)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) often require domain-specific fine-tuning to address targeted tasks, which risks degrading their general capabilities. Maintaining a balance between domain-specific enhancements and general model utility is a key challenge. This paper proposes a novel approach named APT (Weakness Case Acquisition and Iterative Preference Training) to enhance domain-specific performance with self-generated dis-preferred weakness data (bad cases and similar cases). APT uniquely focuses on training the model using only those samples where errors occur, alongside a small, similar set of samples retrieved for this purpose. This targeted training minimizes interference with the model's existing knowledge base, effectively retaining generic capabilities. Experimental results on the Llama-2 and Mistral-V0.3 models across various benchmarks demonstrate that APT ensures no reduction in generic capacity and achieves superior performance on downstream tasks compared to various existing methods. This validates our method as an effective strategy for enhancing domain-specific capabilities without sacrificing the model's broader applicability.

Title: EpiCoDe: Boosting Model Performance Beyond Training with Extrapolation and Contrastive Decoding

Authors: Mingxu Tao, Jie Hu, Mingchuan Yang, Yunhuai Liu, Dongyan Zhao, Yansong Feng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.03489
Pdf URL: https://arxiv.org/pdf/2506.03489
Copy Paste: [[2506.03489]] EpiCoDe: Boosting Model Performance Beyond Training with Extrapolation and Contrastive Decoding(https://arxiv.org/abs/2506.03489)
Keywords: language model, llm
Abstract: The remarkable performance of large language models (LLMs) relies heavily on the availability of abundant high-quality training data. However, the high cost of acquiring annotated data often prevents models from obtaining capabilities to tackle downstream tasks. In this paper, we introduce a novel method, EpiCoDe, that boosts model performance in data-scarcity scenarios without extra training. We first employ model extrapolation to enhance a finetuned model with its inferior version, and then adopt contrastive decoding to further reduce predicted errors, by comparing the logit scores given by the extrapolated and the vanilla finetuned model. Experiments across three tasks over four different LLMs show that EpiCoDe consistently outperforms existing methods with significant and robust improvement. We also propose a new theoretical framework to reveal the mechanism behind contrastive decoding in data-scarcity scenarios, which further helps us better understand the effectiveness of EpiCoDe.
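
The two stages named in the abstract, model extrapolation followed by contrastive decoding, can be sketched on toy numbers. The abstract gives no formulas, so both update rules below are common forms assumed for illustration: linear extrapolation of weights away from the inferior checkpoint, and a logit combination that amplifies the difference between the extrapolated and the finetuned model. The names `alpha` and `beta` are hypothetical.

```python
from typing import List

Vector = List[float]


def extrapolate_weights(finetuned: Vector, inferior: Vector, alpha: float) -> Vector:
    """Assumed rule: move past the finetuned weights along the direction that
    leads from the inferior checkpoint to the finetuned one."""
    return [f + alpha * (f - w) for f, w in zip(finetuned, inferior)]


def contrastive_logits(extrapolated: Vector, finetuned: Vector, beta: float) -> Vector:
    """Assumed rule: boost tokens that the extrapolated model scores higher
    than the vanilla finetuned model."""
    return [(1 + beta) * e - beta * f for e, f in zip(extrapolated, finetuned)]


def argmax(values: Vector) -> int:
    """Index of the largest value (first one on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


# Hypothetical two-parameter "model" and three-token vocabulary.
theta_extrapolated = extrapolate_weights([1.0, 2.0], [0.5, 2.5], alpha=0.5)
scores = contrastive_logits([2.0, 1.0, 0.0], [1.0, 1.5, 0.0], beta=1.0)
next_token = argmax(scores)
```

Neither step involves a gradient update, which is how the method can improve a model without extra training.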

Title: Beyond Memorization: A Rigorous Evaluation Framework for Medical Knowledge Editing

Authors: Shigeng Chen, Linhao Luo, Zhangchi Qiu, Yanan Cao, Carl Yang, Shirui Pan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.03490
Pdf URL: https://arxiv.org/pdf/2506.03490
Copy Paste: [[2506.03490]] Beyond Memorization: A Rigorous Evaluation Framework for Medical Knowledge Editing(https://arxiv.org/abs/2506.03490)
Keywords: language model, llm
Abstract: Recently, knowledge editing (KE) has emerged as a promising approach to update specific facts in Large Language Models (LLMs) without the need for full retraining. Despite the effectiveness of KE methods on general-domain benchmarks, their applicability to the complex medical domain remains largely unexplored. Medical knowledge editing is particularly challenging, as it requires LLMs to internalize the knowledge and generalize to unseen scenarios for effective and interpretable decision-making. In this work, we propose a novel framework called MedEditBench to rigorously evaluate the effectiveness of existing KE methods in the medical domain. In MedEditBench, we introduce a new medical knowledge editing benchmark as well as three different knowledge editing paradigms, which are designed to assess the impact of different knowledge sources for editing. Our findings indicate that current KE methods result in only superficial memorization of the injected information, failing to generalize to new scenarios. To overcome this limitation, we present Self-Generated Rationale Editing (SGR-Edit), which utilizes model-derived rationales as the target knowledge for editing, thereby uncovering the underlying reasoning process and demonstrating significant improvements over existing KE approaches. Additionally, we offer deeper insights into medical knowledge editing, including the localization of medical knowledge in LLMs and the impact of sequential editing on evolving knowledge. This could provide practical guidance for implementing KE methods in real-world medical applications.

Title: Measuring Human Involvement in AI-Generated Text: A Case Study on Academic Writing

Authors: Yuchen Guo, Zhicheng Dou, Huy H. Nguyen, Ching-Chun Chang, Saku Sugawara, Isao Echizen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.03501
Pdf URL: https://arxiv.org/pdf/2506.03501
Copy Paste: [[2506.03501]] Measuring Human Involvement in AI-Generated Text: A Case Study on Academic Writing(https://arxiv.org/abs/2506.03501)
Keywords: language model, gpt, chat
Abstract: Content creation has dramatically progressed with the rapid advancement of large language models like ChatGPT and Claude. While this progress has greatly enhanced various aspects of life and work, it has also negatively affected certain areas of society. A recent survey revealed that nearly 30% of college students use generative AI to help write academic papers and reports. Most countermeasures treat the detection of AI-generated text as a binary classification task and thus lack robustness. This approach overlooks human involvement in the generation of content even though human-machine collaboration is becoming mainstream. Besides generating entire texts, people may use machines to complete or revise texts. Such human involvement varies case by case, which makes binary classification a less than satisfactory approach. We refer to this situation as participation detection obfuscation. We propose using BERTScore as a metric to measure human involvement in the generation process and a multi-task RoBERTa-based regressor trained on a token classification task to address this problem. To evaluate the effectiveness of this approach, we simulated academic-based scenarios and created a continuous dataset reflecting various levels of human involvement. All of the existing detectors we examined failed to detect the level of human involvement on this dataset. Our method, however, succeeded (F1 score of 0.9423 and a regressor mean squared error of 0.004). Moreover, it demonstrated some generalizability across generative models. Our code is available at this https URL

Title: Accurate Sublayer Pruning for Large Language Models by Exploiting Latency and Tunability Information

Authors: Seungcheol Park, Sojin Lee, Jongjin Kim, Jinsik Lee, Hyunjik Jo, U Kang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.03510
Pdf URL: https://arxiv.org/pdf/2506.03510
Copy Paste: [[2506.03510]] Accurate Sublayer Pruning for Large Language Models by Exploiting Latency and Tunability Information(https://arxiv.org/abs/2506.03510)
Keywords: language model, llm
Abstract: How can we accelerate large language models (LLMs) without sacrificing accuracy? The slow inference speed of LLMs prevents us from benefiting from their remarkable performance in diverse applications. This is mainly because numerous sublayers are stacked together in LLMs. Sublayer pruning compresses and expedites LLMs by removing unnecessary sublayers. However, existing sublayer pruning algorithms are limited in accuracy since they naively select sublayers to prune, overlooking the different characteristics of each sublayer. In this paper, we propose SPRINT (Sublayer PRuning wIth LateNcy and Tunability Information), an accurate sublayer pruning method for LLMs. SPRINT accurately selects a target sublayer to prune by considering 1) the amount of latency reduction after pruning and 2) the tunability of sublayers. SPRINT iteratively prunes redundant sublayers and swiftly tunes the parameters of remaining sublayers. Experiments show that SPRINT achieves the best accuracy-speedup trade-off, exhibiting up to 23.88 percentage points higher accuracy on zero-shot commonsense reasoning benchmarks compared to existing pruning algorithms.

Title: TokAlign: Efficient Vocabulary Adaptation via Token Alignment

Authors: Chong Li, Jiajun Zhang, Chengqing Zong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.03523
Pdf URL: https://arxiv.org/pdf/2506.03523
Copy Paste: [[2506.03523]] TokAlign: Efficient Vocabulary Adaptation via Token Alignment(https://arxiv.org/abs/2506.03523)
Keywords: language model, llm
Abstract: Tokenization serves as a foundational step for Large Language Models (LLMs) to process text. In new domains or languages, the inefficiency of the tokenizer will slow down the training and generation of LLMs. The mismatch in vocabulary also hinders deep knowledge transfer between LLMs, such as token-level distillation. To mitigate this gap, we propose an efficient method named TokAlign to replace the vocabulary of an LLM from a token co-occurrence view, and further transfer the token-level knowledge between models. It first aligns the source vocabulary to the target one by learning a one-to-one mapping matrix for token IDs. Model parameters, including embeddings, are rearranged and progressively fine-tuned for the new vocabulary. Our method significantly improves multilingual text compression rates and vocabulary initialization for LLMs, decreasing the perplexity from $3.4 \times 10^{2}$ for strong baseline methods to $1.2 \times 10^{2}$ after initialization. Experimental results on models across multiple parameter scales demonstrate the effectiveness and generalization of TokAlign, which costs as few as 5k steps to restore the performance of the vanilla model. After unifying vocabularies between LLMs, token-level distillation can remarkably boost the base model (+4.4% over sentence-level distillation), costing only 235M tokens.
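The rearrangement step described in the TokAlign abstract can be illustrated with a minimal sketch. It assumes a one-to-one mapping from target-vocabulary token IDs to source-vocabulary token IDs has already been learned. The function name `rearrange_embeddings`, the dictionary format of the mapping, and the mean-row fallback for unmapped tokens are all hypothetical choices made for illustration; this is not the authors' implementation.

```python
from typing import Dict, List


def rearrange_embeddings(
    source_embeddings: List[List[float]],
    target_to_source: Dict[int, int],
    target_vocab_size: int,
) -> List[List[float]]:
    """Build an embedding table for a new vocabulary by copying source rows.

    target_to_source maps a target token ID to the source token ID it was
    aligned with (hypothetical format, assumed for this sketch).
    """
    dim = len(source_embeddings[0])
    # Assumption: unmapped target tokens start from the mean source embedding.
    mean_row = [
        sum(row[d] for row in source_embeddings) / len(source_embeddings)
        for d in range(dim)
    ]
    new_table = []
    for target_id in range(target_vocab_size):
        source_id = target_to_source.get(target_id)
        row = source_embeddings[source_id] if source_id is not None else mean_row
        new_table.append(list(row))  # copy so later fine-tuning cannot alias rows
    return new_table


# Toy example: 3 source tokens with 2-dimensional embeddings.
source = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
mapping = {0: 2, 1: 0}  # target token 2 has no aligned source token
target_table = rearrange_embeddings(source, mapping, target_vocab_size=3)
print(target_table)
```

After such an initialization, the abstract states that the parameters are progressively fine-tuned for the new vocabulary; that stage is not sketched here.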

Title: Seed-Coder: Let the Code Model Curate Data for Itself

Authors: Yuyu Zhang, Jing Su, Yifan Sun, Chenguang Xi, Xia Xiao, Shen Zheng, Anxiang Zhang, Kaibo Liu, Daoguang Zan, Tao Sun, Jinhua Zhu, Shulin Xin, Dong Huang, Yetao Bai, Lixin Dong, Chao Li, Jianchong Chen, Hanzhi Zhou, Yifan Huang, Guanghan Ning, Xierui Song, Jiaze Chen, Siyao Liu, Kai Shen, Liang Xiang, Yonghui Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.03524
Pdf URL: https://arxiv.org/pdf/2506.03524
Copy Paste: [[2506.03524]] Seed-Coder: Let the Code Model Curate Data for Itself(https://arxiv.org/abs/2506.03524)
Keywords: language model, llm, chain-of-thought
Abstract: Code data in large language model (LLM) pretraining is recognized as crucial not only for code-related tasks but also for enhancing the general intelligence of LLMs. Current open-source LLMs often heavily rely on human effort to produce their code pretraining data, such as employing hand-crafted filtering rules tailored to individual programming languages, or using human-annotated data to train quality filters. However, these approaches are inherently limited in scalability, prone to subjective biases, and costly to extend and maintain across diverse programming languages. To address these challenges, we introduce Seed-Coder, a series of open-source LLMs comprising base, instruct and reasoning models of 8B size, minimizing human involvement in data construction. Our code pretraining data is produced by a model-centric data pipeline, which predominantly leverages LLMs for scoring and filtering code data. The instruct model is further trained via supervised fine-tuning and preference optimization, and the reasoning model leverages Long-Chain-of-Thought (LongCoT) reinforcement learning to improve multi-step code reasoning. Seed-Coder achieves state-of-the-art results among open-source models of similar size and even surpasses some much larger models, demonstrating superior performance in code generation, code completion, code editing, code reasoning, and software engineering tasks.

Title: Go-Browse: Training Web Agents with Structured Exploration

Authors: Apurva Gandhi, Graham Neubig
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.03533
Pdf URL: https://arxiv.org/pdf/2506.03533
Copy Paste: [[2506.03533]] Go-Browse: Training Web Agents with Structured Exploration(https://arxiv.org/abs/2506.03533)
Keywords: language model, gpt, agent
Abstract: One of the fundamental problems in digital agents is their lack of understanding of their environment. For instance, a web browsing agent may get lost in unfamiliar websites, uncertain what pages must be visited to achieve its goals. To address this, we propose Go-Browse, a method for automatically collecting diverse and realistic web agent data at scale through structured exploration of web environments. Go-Browse achieves efficient exploration by framing data collection as a graph search, enabling reuse of information across exploration episodes. We instantiate our method on the WebArena benchmark, collecting a dataset of 10K successful task-solving trajectories and 40K interaction steps across 100 URLs. Fine-tuning a 7B parameter language model on this dataset achieves a success rate of 21.7% on the WebArena benchmark, beating GPT-4o mini by 2.4% and exceeding current state-of-the-art results for sub-10B parameter models by 2.9%.
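The idea of framing exploration as a graph search can be illustrated with a generic breadth-first traversal over pages. This is only a sketch of the general technique: the abstract does not specify Go-Browse's actual search procedure, and the `links` dictionary, the URL strings, and the function name `explore` are hypothetical.

```python
from collections import deque
from typing import Dict, List


def explore(start_url: str, links: Dict[str, List[str]]) -> List[str]:
    """Visit every reachable page once, in breadth-first order.

    The shared `visited` set is what lets later steps reuse what earlier
    steps already discovered instead of re-exploring the same pages.
    """
    visited = {start_url}
    frontier = deque([start_url])
    order = []
    while frontier:
        url = frontier.popleft()
        order.append(url)
        for next_url in links.get(url, []):
            if next_url not in visited:
                visited.add(next_url)
                frontier.append(next_url)
    return order


# Toy site graph (hypothetical pages).
site = {
    "/": ["/shop", "/forum"],
    "/shop": ["/cart", "/"],
    "/forum": ["/shop"],
}
visit_order = explore("/", site)
print(visit_order)
```

In a real web environment the outgoing links of a page would be discovered by an agent interacting with it rather than read from a fixed dictionary.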

Title: Debate, Reflect, and Distill: Multi-Agent Feedback with Tree-Structured Preference Optimization for Efficient Language Model Enhancement

Authors: Xiaofeng Zhou, Heyan Huang, Lizi Liao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.03541
Pdf URL: https://arxiv.org/pdf/2506.03541
Copy Paste: [[2506.03541]] Debate, Reflect, and Distill: Multi-Agent Feedback with Tree-Structured Preference Optimization for Efficient Language Model Enhancement(https://arxiv.org/abs/2506.03541)
Keywords: language model, llm, agent
Abstract: Large Language Models (LLMs) continue to set new standards in knowledge-intensive and complex reasoning tasks, yet their high computational demands limit widespread adoption. While distilling large models into smaller ones offers a sustainable solution, current techniques--such as static knowledge distillation, resource-intensive reinforcement learning from human feedback, or limited self-reflection--struggle to yield substantial and lasting performance gains. In this paper, we present a novel Debate and Reflect (D&R) framework that orchestrates multi-turn debates between smaller models and stronger teacher models, eliciting actionable feedback (e.g., error analysis, corrective strategies) to guide student models. Further, we introduce Tree-structured Direct Preference Optimization (T-DPO) to efficiently leverage these debate logs, organizing interactions into a hierarchical format for effective training. Empirical evaluations across diverse NLP benchmarks demonstrate that our approach significantly improves smaller-model accuracy, robustness, and generalization, outperforming conventional baselines by a large margin.

Title: BPO: Revisiting Preference Modeling in Direct Preference Optimization

Authors: Lin Sun, Chuang Liu, Peng Liu, Bingyang Li, Weijia Lu, Ning Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.03557
Pdf URL: https://arxiv.org/pdf/2506.03557
Copy Paste: [[2506.03557]] BPO: Revisiting Preference Modeling in Direct Preference Optimization(https://arxiv.org/abs/2506.03557)
Keywords: language model, llm
Abstract: Direct Preference Optimization (DPO) has emerged as a popular method for aligning Large Language Models (LLMs) with human preferences. While DPO effectively preserves the relative ordering between chosen and rejected responses through pairwise ranking losses, it often neglects absolute reward magnitudes. This oversight can decrease the likelihood of chosen responses and increase the risk of generating out-of-distribution responses, leading to poor performance. We term this issue Degraded Chosen Responses (DCR). To address this issue, we propose Balanced Preference Optimization (BPO), a novel framework that dynamically balances the optimization of chosen and rejected responses through two key components: balanced reward margin and gap adaptor. Unlike previous methods, BPO can fundamentally resolve DPO's DCR issue, without introducing additional constraints to the loss function. Experimental results on multiple mathematical reasoning tasks show that BPO significantly outperforms DPO, improving accuracy by +10.1% with Llama-3.1-8B-Instruct (18.8% to 28.9%) and +11.7% with Qwen2.5-Math-7B (35.0% to 46.7%). It also surpasses DPO variants by +3.6% over IPO (43.1%), +5.0% over SLiC (41.7%), and +3.1% over Cal-DPO (43.6%) on the same model. Remarkably, our algorithm requires only a single line of code modification, making it simple to implement and fully compatible with existing DPO-based frameworks.
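The claim that a pairwise loss can ignore absolute reward magnitudes can be checked on the standard DPO objective. The sketch below implements the standard per-example DPO loss, not BPO, whose balanced reward margin and gap adaptor are not specified in the abstract. The function name, the log-probability values, and `beta=0.1` are illustrative assumptions.

```python
import math


def dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for one preference pair: -log(sigmoid(margin))."""
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(m)) equals log(1 + exp(-m)); log1p keeps it accurate.
    return math.log1p(math.exp(-margin))


# Case A: the policy raises the chosen response above the reference.
loss_a = dpo_loss(-10.0, -20.0, ref_logp_chosen=-12.0, ref_logp_rejected=-12.0)
# Case B: the chosen response becomes LESS likely than under the reference,
# but the rejected response drops by the same amount, so the margin is equal.
loss_b = dpo_loss(-15.0, -25.0, ref_logp_chosen=-12.0, ref_logp_rejected=-12.0)
print(loss_a, loss_b)
```

Both cases have the same margin and therefore essentially the same loss, even though the chosen response is less likely in case B than under the reference model. This is the behavior the abstract refers to as Degraded Chosen Responses.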

Title: ConsistentChat: Building Skeleton-Guided Consistent Dialogues for Large Language Models from Scratch

Authors: Jiawei Chen, Xinyan Guan, Qianhao Yuan, Guozhao Mo, Weixiang Zhou, Yaojie Lu, Hongyu Lin, Ben He, Le Sun, Xianpei Han
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.03558
Pdf URL: https://arxiv.org/pdf/2506.03558
Copy Paste: [[2506.03558]] ConsistentChat: Building Skeleton-Guided Consistent Dialogues for Large Language Models from Scratch(https://arxiv.org/abs/2506.03558)
Keywords: language model, chat
Abstract: Current instruction data synthesis methods primarily focus on single-turn instructions and often neglect cross-turn coherence, resulting in context drift and reduced task completion rates in extended conversations. To address this limitation, we propose Skeleton-Guided Multi-Turn Dialogue Generation, a framework that constrains multi-turn instruction synthesis by explicitly modeling human conversational intent. It operates in two stages: (1) Intent Modeling, which captures the global structure of human dialogues by assigning each conversation to one of nine well-defined intent trajectories, ensuring a coherent and goal-oriented information flow; and (2) Skeleton Generation, which constructs a structurally grounded sequence of user queries aligned with the modeled intent, thereby serving as a scaffold that constrains and guides the downstream instruction synthesis process. Based on this process, we construct ConsistentChat, a multi-turn instruction dataset with approximately 15,000 multi-turn conversations and 224,392 utterances. Experiments on the Light, Topdial, and MT-Eval benchmarks show that models fine-tuned on ConsistentChat achieve a 20-30% improvement in chat consistency and up to a 15% increase in task success rate, significantly outperforming models trained on existing single-turn and multi-turn instruction datasets.

Title: POSS: Position Specialist Generates Better Draft for Speculative Decoding

Authors: Langlin Huang, Chengsong Huang, Jixuan Leng, Di Huang, Jiaxin Huang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.03566
Pdf URL: https://arxiv.org/pdf/2506.03566
Copy Paste: [[2506.03566]] POSS: Position Specialist Generates Better Draft for Speculative Decoding(https://arxiv.org/abs/2506.03566)
Keywords: language model, llm, chat
Abstract: Speculative decoding accelerates Large Language Model (LLM) inference by using a small draft model to predict multiple tokens, and a large target model to verify these tokens in parallel. Recent studies leverage the hidden state of the target model to enhance draft model prediction accuracy. However, existing methods suffer from the degrading quality of draft token predictions at later positions, due to error accumulation in draft model generated features. In this paper, we propose Position Specialists (PosS), which consist of multiple position-specialized draft layers to generate tokens at assigned position(s). Position specialists greatly improve token acceptance rate at later positions per drafting round, as each specialist only needs to focus on handling a certain level of draft model feature deviation. Experiment results on Llama-3-8B-Instruct and Llama-2-13B-chat across six datasets demonstrate that PosS effectively improves over baselines on average acceptance length and speed-up ratio. Our codebase is available at this https URL.
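The draft-then-verify loop and the acceptance-length metric can be made concrete with a small sketch. It assumes greedy verification, where a drafted token is accepted only if it equals the token the target model would have produced at that position; sampling-based verification is more involved and is not shown. Conventions for acceptance length also differ (some count an extra token supplied by the target model), so the numbers here count matching draft tokens only. All names and token IDs are hypothetical.

```python
from typing import List, Tuple


def accepted_prefix_length(draft_tokens: List[int], target_tokens: List[int]) -> int:
    """Count drafted tokens accepted before the first mismatch."""
    accepted = 0
    for drafted, verified in zip(draft_tokens, target_tokens):
        if drafted != verified:
            break  # everything after the first mismatch is discarded
        accepted += 1
    return accepted


def mean_acceptance_length(rounds: List[Tuple[List[int], List[int]]]) -> float:
    """Average accepted prefix length over several drafting rounds."""
    return sum(accepted_prefix_length(d, t) for d, t in rounds) / len(rounds)


# Three drafting rounds of four tokens each (toy token IDs).
rounds = [
    ([5, 7, 9, 2], [5, 7, 9, 2]),  # all four drafted tokens accepted
    ([5, 7, 1, 2], [5, 7, 9, 2]),  # mismatch at the third position
    ([3, 7, 9, 2], [5, 7, 9, 2]),  # mismatch at the first position
]
lengths = [accepted_prefix_length(d, t) for d, t in rounds]
average = mean_acceptance_length(rounds)
print(lengths, average)
```

Because one early mismatch discards all later drafted tokens, prediction quality at later positions directly limits the acceptance length, which is the problem the position specialists are meant to address.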

Title: MiMo-VL Technical Report

Authors: Xiaomi LLM-Core Team: Zihao Yue, Zhenru Lin, Yifan Song, Weikun Wang, Shuhuai Ren, Shuhao Gu, Shicheng Li, Peidian Li, Liang Zhao, Lei Li, Kainan Bao, Hao Tian, Hailin Zhang, Gang Wang, Dawei Zhu, Cici, Chenhong He, Bowen Ye, Bowen Shen, Zihan Zhang, Zihan Jiang, Zhixian Zheng, Zhichao Song, Zhenbo Luo, Yue Yu, Yudong Wang, Yuanyuan Tian, Yu Tu, Yihan Yan, Yi Huang, Xu Wang, Xinzhe Xu, Xingchen Song, Xing Zhang, Xing Yong, Xin Zhang, Xiangwei Deng, Wenyu Yang, Wenhan Ma, Weiwei Lv, Weiji Zhuang, Wei Liu, Sirui Deng, Shuo Liu, Shimao Chen, Shihua Yu, Shaohui Liu, Shande Wang, Rui Ma, Qiantong Wang, Peng Wang, Nuo Chen, Menghang Zhu, Kangyang Zhou, Kang Zhou, Kai Fang, Jun Shi, Jinhao Dong, Jiebao Xiao, Jiaming Xu, Huaqiu Liu, Hongshen Xu, Heng Qu, Haochen Zhao, Hanglong Lv, Guoan Wang, Duo Zhang, Dong Zhang, Di Zhang, Chong Ma, Chang Liu, Can Cai, Bingquan Xia
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.03569
Pdf URL: https://arxiv.org/pdf/2506.03569
Copy Paste: [[2506.03569]] MiMo-VL Technical Report(https://arxiv.org/abs/2506.03569)
Keywords: language model, chain-of-thought
Abstract: We open-source MiMo-VL-7B-SFT and MiMo-VL-7B-RL, two powerful vision-language models delivering state-of-the-art performance in both general visual understanding and multimodal reasoning. MiMo-VL-7B-RL outperforms Qwen2.5-VL-7B on 35 out of 40 evaluated tasks, and scores 59.4 on OlympiadBench, surpassing models with up to 78B parameters. For GUI grounding applications, it sets a new standard with 56.1 on OSWorld-G, even outperforming specialized models such as UI-TARS. Our training combines four-stage pre-training (2.4 trillion tokens) with Mixed On-policy Reinforcement Learning (MORL) integrating diverse reward signals. We identify the importance of incorporating high-quality reasoning data with long Chain-of-Thought into pre-training stages, and the benefits of mixed RL despite challenges in simultaneous multi-domain optimization. We also contribute a comprehensive evaluation suite covering 50+ tasks to promote reproducibility and advance the field. The model checkpoints and full evaluation suite are available at this https URL.

Title: FreePRM: Training Process Reward Models Without Ground Truth Process Labels

Authors: Lin Sun, Chuang Liu, Xiaofeng Ma, Tao Yang, Weijia Lu, Ning Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.03570
Pdf URL: https://arxiv.org/pdf/2506.03570
Copy Paste: [[2506.03570]] FreePRM: Training Process Reward Models Without Ground Truth Process Labels(https://arxiv.org/abs/2506.03570)
Keywords: language model, llm
Abstract: Recent advancements in Large Language Models (LLMs) have demonstrated that Process Reward Models (PRMs) play a crucial role in enhancing model performance. However, training PRMs typically requires step-level labels, either manually annotated or automatically generated, which can be costly and difficult to obtain at scale. To address this challenge, we introduce FreePRM, a weakly supervised framework for training PRMs without access to ground-truth step-level labels. FreePRM first generates pseudo step-level labels based on the correctness of the final outcome, and then employs Buffer Probability to eliminate the impact of noise inherent in pseudo labeling. Experimental results show that FreePRM achieves an average F1 score of 53.0% on ProcessBench, outperforming a fully supervised PRM trained on Math-Shepherd by +24.1%. Compared to other open-source PRMs, FreePRM outperforms RLHFlow-PRM-Mistral-8B (28.4%) by +24.6%, EurusPRM (31.3%) by +21.7%, and Skywork-PRM-7B (42.1%) by +10.9%. This work introduces a new paradigm in PRM training, significantly reducing reliance on costly step-level annotations while maintaining strong performance.
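The pseudo-labeling step can be sketched as follows. The abstract says only that step-level labels are derived from the correctness of the final outcome; assigning every step of a solution the same outcome label is an assumption made here for illustration. The Buffer Probability mechanism is not described in the abstract and is not sketched. All names are hypothetical.

```python
from typing import List, Tuple


def pseudo_step_labels(
    steps: List[str], final_answer: str, reference_answer: str
) -> List[Tuple[str, int]]:
    """Label each reasoning step with the correctness of the final answer.

    Assumption: every step inherits the outcome label (1 = correct, 0 = not).
    Such labels are noisy, since a wrong solution may contain correct steps.
    """
    outcome = 1 if final_answer.strip() == reference_answer.strip() else 0
    return [(step, outcome) for step in steps]


solution_steps = ["Let x be the unknown.", "Then 2x = 10.", "So x = 6."]
labeled = pseudo_step_labels(solution_steps, final_answer="6", reference_answer="5")
print(labeled)
```

In this toy case the first two steps are actually correct yet receive label 0, which shows the kind of label noise a weakly supervised scheme has to handle.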

Title: Exchange of Perspective Prompting Enhances Reasoning in Large Language Models

Authors: Lin Sun, Can Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.03573
Pdf URL: https://arxiv.org/pdf/2506.03573
Copy Paste: [[2506.03573]] Exchange of Perspective Prompting Enhances Reasoning in Large Language Models(https://arxiv.org/abs/2506.03573)
Keywords: language model, gpt, llm, prompt
Abstract: Large language models (LLMs) have made significant advancements in addressing diverse natural language processing (NLP) tasks. However, their performance is often limited by their inherent comprehension of problems. To address this limitation, we propose Exchange-of-Perspective (EoP), a novel framework designed to exchange perspectives across different definitions of a problem, so that it can break the fixed mindset imposed by any particular formulation of the question. We conducted extensive and comprehensive experiments on 8 benchmarks. The results show that EoP can significantly improve performance. For instance, compared to the non-commutative baseline PHP, EoP with GPT-3.5-Turbo yields a 3.6% improvement on AQuA (60.6% to 64.2%), GPT-4-powered EoP demonstrates a 7.7% overall accuracy enhancement on Math (53.9% to 61.6%), and EoP with Qwen-2.5-72b yields a 3.5% improvement on OlympiadBench Maths (43.5% to 47.0%).

Title: KG-BiLM: Knowledge Graph Embedding via Bidirectional Language Models

Authors: Zirui Chen, Xin Wang, Zhao Li, Wenbin Guo, Dongxiao He
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.03576
Pdf URL: https://arxiv.org/pdf/2506.03576
Copy Paste: [[2506.03576]] KG-BiLM: Knowledge Graph Embedding via Bidirectional Language Models(https://arxiv.org/abs/2506.03576)
Keywords: language model
Abstract: Recent advances in knowledge representation learning (KRL) highlight the urgent necessity to unify symbolic knowledge graphs (KGs) with language models (LMs) for richer semantic understanding. However, existing approaches typically prioritize either graph structure or textual semantics, leaving a gap: a unified framework that simultaneously captures global KG connectivity, nuanced linguistic context, and discriminative reasoning semantics. To bridge this gap, we introduce KG-BiLM, a bidirectional LM framework that fuses structural cues from KGs with the semantic expressiveness of generative transformers. KG-BiLM incorporates three key components: (i) Bidirectional Knowledge Attention, which removes the causal mask to enable full interaction among all tokens and entities; (ii) Knowledge-Masked Prediction, which encourages the model to leverage both local semantic contexts and global graph connectivity; and (iii) Contrastive Graph Semantic Aggregation, which preserves KG structure via contrastive alignment of sampled sub-graph representations. Extensive experiments on standard benchmarks demonstrate that KG-BiLM outperforms strong baselines in link prediction, especially on large-scale graphs with complex multi-hop relations - validating its effectiveness in unifying structural information and textual semantics.

Title: Automatically Suggesting Diverse Example Sentences for L2 Japanese Learners Using Pre-Trained Language Models

Authors: Enrico Benedetti, Akiko Aizawa, Florian Boudin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.03580
Pdf URL: https://arxiv.org/pdf/2506.03580
Copy Paste: [[2506.03580]] Automatically Suggesting Diverse Example Sentences for L2 Japanese Learners Using Pre-Trained Language Models(https://arxiv.org/abs/2506.03580)
Keywords: language model, gpt
Abstract: Providing example sentences that are diverse and aligned with learners' proficiency levels is essential for fostering effective language acquisition. This study examines the use of Pre-trained Language Models (PLMs) to produce example sentences targeting L2 Japanese learners. We utilize PLMs in two ways: as quality scoring components in a retrieval system that draws from a newly curated corpus of Japanese sentences, and as direct sentence generators using zero-shot learning. We evaluate the quality of sentences by considering multiple aspects such as difficulty, diversity, and naturalness, with a panel of raters consisting of learners of Japanese, native speakers -- and GPT-4. Our findings suggest that there is inherent disagreement among participants on the ratings of sentence qualities, except for difficulty. Despite that, the retrieval approach was preferred by all evaluators, especially for beginner and advanced target proficiency, while the generative approaches received lower scores on average. Even so, our experiments highlight the potential for using PLMs to enhance the adaptability of sentence suggestion systems and therefore improve the language learning journey.

Title: From Understanding to Generation: An Efficient Shortcut for Evaluating Language Models

Authors: Viktor Hangya, Fabian Küch, Darina Gold
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.03592
Pdf URL: https://arxiv.org/pdf/2506.03592
Copy Paste: [[2506.03592]] From Understanding to Generation: An Efficient Shortcut for Evaluating Language Models(https://arxiv.org/abs/2506.03592)
Keywords: language model, llm
Abstract: Iterative evaluation of LLMs during training is essential to ensure expected capability development, but can be time- and compute-intensive. While NLU tasks, where the model selects from fixed answer choices, are cheap to evaluate, essential capabilities like reasoning and code generation rely on the more time-consuming NLG (token-by-token generation) format. In this work, our aim is to decrease the computational burden of NLG benchmarks in order to enable monitoring crucial LLM capabilities during model training. We reformulate generative tasks into computationally cheaper NLU alternatives. We test the performance correlation between the original and reformulated tasks using 8 LMs of various sizes and 4 capabilities: mathematical reasoning, code generation, factual knowledge and reading comprehension. Our results show a strong correlation between task formats, supporting capability assessment via cheaper alternatives and achieving over 35x average reduction in evaluation time. We plan to publish our benchmark adaptions.

Title: Auto prompt sql: a resource-efficient architecture for text-to-sql translation in constrained environments

Authors: Zetong Tang, Qian Ma, Di Wu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.03598
Pdf URL: https://arxiv.org/pdf/2506.03598
Copy Paste: [[2506.03598]] Auto prompt sql: a resource-efficient architecture for text-to-sql translation in constrained environments(https://arxiv.org/abs/2506.03598)
Keywords: language model, prompt, chain-of-thought
Abstract: Using the best Text-to-SQL methods in resource-constrained environments is challenging due to their reliance on resource-intensive open-source models. This paper introduces Auto Prompt SQL (AP-SQL), a novel architecture designed to bridge the gap between resource-efficient small open-source models and the powerful capabilities of large closed-source models for Text-to-SQL translation. Our method decomposes the task into schema filtering, retrieval-augmented text-to-SQL generation based on in-context examples, and prompt-driven schema linking and SQL generation. To improve schema selection accuracy, we fine-tune large language models. Crucially, we also explore the impact of prompt engineering throughout the process, leveraging Chain-of-Thought (CoT) and Graph-of-Thought (GoT) templates to significantly enhance the model's reasoning for accurate SQL generation. Comprehensive evaluations on the Spider benchmarks demonstrate the effectiveness of AP-SQL.
摘要：在资源约束环境中使用最佳的文本到SQL方法，由于它们依赖资源密集型开源模型，因此具有挑战性。本文介绍了自动提示SQL（AP-SQL），这是一种新颖的体系结构，旨在弥合资源有效的小型开源型号和用于文本到SQL翻译的大型闭合源模型的强大功能之间的差距。我们的方法将任务分解为架构过滤，基于内部上下文示例的检索文本到SQL生成，以及及时驱动的模式链接和SQL生成。为了提高模式选择的准确性，我们调整了大型语言模型。至关重要的是，我们还探讨了整个过程中迅速工程的影响，利用了经过三通链（COT）和经过思考图（GOT）模板，以显着增强该模型的准确SQL生成的推理。对蜘蛛基准的全面评估证明了AP-SQL的有效性。

Title: Learning to Insert [PAUSE] Tokens for Better Reasoning

Authors: Eunki Kim, Sangryul Kim, James Thorne
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.03616
Pdf URL: https://arxiv.org/pdf/2506.03616
Copy Paste: [[2506.03616]] Learning to Insert [PAUSE] Tokens for Better Reasoning(https://arxiv.org/abs/2506.03616)
Keywords: language model, llm
Abstract: To enhance reasoning capabilities, previous works have explored incorporating special-purpose tokens into the training process. These strategies strengthen the learning mechanism of transformer-based large language models (LLMs). Building on prior research showing that inserting dummy tokens consecutively just before reasoning steps can enhance effectiveness, we introduce a novel approach termed Dynamic Inserting Tokens Training (DIT). Our method identifies positions within sequences where model confidence is lowest according to token log-likelihood. Strategically inserting [PAUSE] tokens at these positions bolsters the model's predictive capabilities for subsequent tokens. Experimental results across diverse datasets and models, from the 2.7B model to the 8B model, demonstrate that DIT consistently outperforms traditional fine-tuning and previous token insertion methods. With this simple yet effective method, we achieve accuracy gains of up to 4.7%p on GSM8K, 3.23%p on AQUA-RAT, and pass@1 improvements of up to 3.4%p on MBPP datasets. Our work shows a model-based, dynamic approach rather than a heuristic one, thereby broadening the scope of research in reasoning.
摘要：为了增强推理能力，以前的作品探索了将特殊用品代币纳入培训过程的。这些策略增强了基于变压器的大语言模型（LLM）的学习机制。在先前的研究的基础上，在推理步骤可以提高效力之前，插入虚拟令牌的插入，我们引入了一种新颖的方法，称为动态插入令牌培训（DIT）。我们的方法确定了序列中模型置信度最低的位置。从策略上插入[暂停]令牌上的这些位置，以增强该模型的后续令牌的预测能力。从2.7b模型到8B模型，各种数据集和模型之间的实验结果表明，DIT始终优于传统的微调和以前的令牌插入方法。通过这种简单而有效的方法，我们在GSM8K上获得了高达4.7％P的准确性，在Aqua-Rat上获得了3.23％的P，并通过MBPP数据集的1个提高了3.4％P的@1提高。我们的工作显示了一种基于模型的动态方法，而不是一种启发式方法，从而扩大了推理研究的范围。
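
The position-selection idea in this abstract can be sketched as follows. This is only an illustration under stated assumptions: the per-token log-likelihoods are given as a plain list, the `[PAUSE]` token is placed immediately before each selected low-confidence token, and the function name `insert_pause_tokens` and parameter `k` are hypothetical. The abstract does not specify these details.

```python
from typing import List

PAUSE = "[PAUSE]"


def insert_pause_tokens(tokens: List[str], logprobs: List[float],
                        k: int = 1) -> List[str]:
    """Insert a [PAUSE] token before each of the k tokens with the lowest
    log-likelihood (placement before the token is an assumption)."""
    if len(tokens) != len(logprobs):
        raise ValueError("tokens and logprobs must have the same length")
    # Indices of the k least confident positions.
    lowest = set(sorted(range(len(tokens)), key=lambda i: logprobs[i])[:k])
    augmented: List[str] = []
    for i, token in enumerate(tokens):
        if i in lowest:
            augmented.append(PAUSE)
        augmented.append(token)
    return augmented


# Hypothetical sequence in which the model is least confident about "7".
tokens = ["3", "+", "4", "=", "7"]
logprobs = [-0.1, -0.2, -0.1, -0.3, -2.5]
augmented = insert_pause_tokens(tokens, logprobs, k=1)
```

In an actual training setup the augmented sequences would then be used for fine-tuning; that step is outside the scope of this sketch.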

Title: Do Large Language Models Know Folktales? A Case Study of Yokai in Japanese Folktales

Authors: Ayuto Tsutsumi, Yuu Jinnai
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.03619
Pdf URL: https://arxiv.org/pdf/2506.03619
Copy Paste: [[2506.03619]] Do Large Language Models Know Folktales? A Case Study of Yokai in Japanese Folktales(https://arxiv.org/abs/2506.03619)
Keywords: language model, llm
Abstract: Although Large Language Models (LLMs) have demonstrated strong language understanding and generation abilities across various languages, their cultural knowledge is often limited to English-speaking communities, which can marginalize the cultures of non-English communities. To address the problem, evaluation of the cultural awareness of the LLMs and the methods to develop culturally aware LLMs have been investigated. In this study, we focus on evaluating knowledge of folktales, a key medium for conveying and circulating culture. In particular, we focus on Japanese folktales, specifically on knowledge of Yokai. Yokai are supernatural creatures originating from Japanese folktales that continue to be popular motifs in art and entertainment today. Yokai have long served as a medium for cultural expression, making them an ideal subject for assessing the cultural awareness of LLMs. We introduce YokaiEval, a benchmark dataset consisting of 809 multiple-choice questions (each with four options) designed to probe knowledge about yokai. We evaluate the performance of 31 Japanese and multilingual LLMs on this dataset. The results show that models trained with Japanese language resources achieve higher accuracy than English-centric models, with those that underwent continued pretraining in Japanese, particularly those based on Llama-3, performing especially well. The code and dataset are available at this https URL ILab/YokaiEval.
摘要：尽管大型语言模型（LLM）表现出了各种语言的强烈语言理解和发电能力，但它们的文化知识通常仅限于说英语的社区，这可以使非英语社区的文化边缘化。为了解决这个问题，已经研究了对LLM的文化意识的评估以及开发具有文化意识的LLM的方法。在这项研究中，我们专注于评估民间故事的知识，这是传达和循环文化的关键媒介。特别是，我们专注于日本民间故事，尤其是Yokai的知识。 Yokai是源自日本民间故事的超自然生物，如今仍是艺术和娱乐中的流行主题。长期以来，Yokai一直是文化表达的一种媒介，使其成为评估LLM的文化意识的理想主题。我们介绍了Yokaival，这是一个由809个多项选择问题（每个选项）组成的基准数据集，旨在探究有关Yokai的知识。我们评估了此数据集上31个日语和多语言LLM的性能。结果表明，接受日语资源培训的模型比以英语为中心的模型获得了更高的精度，而以英语为中心的模型进行了日语进行预处理的模型，尤其是基于Llama-3的模型，表现尤其很好。该代码和数据集可在此HTTPS URL ILAB/Yokaieval中找到。

Title: Robustness of Prompting: Enhancing Robustness of Large Language Models Against Prompting Attacks

Authors: Lin Mu, Guowei Chu, Li Ni, Lei Sang, Zhize Wu, Peiquan Jin, Yiwen Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.03627
Pdf URL: https://arxiv.org/pdf/2506.03627
Copy Paste: [[2506.03627]] Robustness of Prompting: Enhancing Robustness of Large Language Models Against Prompting Attacks(https://arxiv.org/abs/2506.03627)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) have demonstrated remarkable performance across various tasks by effectively utilizing a prompting strategy. However, they are highly sensitive to input perturbations, such as typographical errors or slight character order errors, which can substantially degrade their performance. Despite advances in prompting techniques, developing a prompting strategy that explicitly mitigates the negative impact of such perturbations remains an open challenge. To bridge this gap, we propose Robustness of Prompting (RoP), a novel prompting strategy specifically designed to enhance the robustness of LLMs. RoP consists of two stages: Error Correction and Guidance. In the Error Correction stage, RoP applies diverse perturbation methods to generate adversarial examples, which are then used to construct prompts that automatically correct input errors. In the Guidance stage, RoP generates an optimal guidance prompt based on the corrected input, steering the model toward more robust and accurate inferences. Through comprehensive experiments spanning arithmetic, commonsense, and logical reasoning tasks, we demonstrate that RoP significantly improves LLMs' robustness against adversarial perturbations. Notably, it maintains model accuracy with only minimal degradation compared to clean input scenarios, thereby establishing RoP as a practical and effective approach for enhancing LLM robustness in real-world applications.
摘要：大型语言模型（LLMS）通过有效利用提示策略来表现在各种任务中的表现出色。但是，它们对输入扰动高度敏感，例如印刷错误或轻微的字符顺序错误，这可能会大大降低其性能。尽管进步了提示技术，但制定了一种明确减轻这种扰动的负面影响的提示策略仍然是一个开放的挑战。为了弥合这一差距，我们提出了提示的鲁棒性（ROP），这是一种新颖的提示策略，专门旨在增强LLM的鲁棒性。 ROP包括两个阶段：错误校正和指导。在误差校正阶段，ROP应用了各种扰动方法来生成对抗示例，然后将其用于构造自动纠正输入错误的提示。在指导阶段，ROP根据校正后的输入生成了最佳的指导提示，将模型转向更健壮和准确的推论。通过跨越算术，常识和逻辑推理任务的全面实验，我们证明了ROP可以显着提高LLMS对对抗性扰动的鲁棒性。值得注意的是，与清洁的输入方案相比，它仅具有最小的降解，以保持模型精度，从而确立了ROP作为增强现实世界应用中LLM鲁棒性的一种实用有效方法。

Title: RewardAnything: Generalizable Principle-Following Reward Models

Authors: Zhuohao Yu, Jiali Zeng, Weizheng Gu, Yidong Wang, Jindong Wang, Fandong Meng, Jie Zhou, Yue Zhang, Shikun Zhang, Wei Ye
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.03637
Pdf URL: https://arxiv.org/pdf/2506.03637
Copy Paste: [[2506.03637]] RewardAnything: Generalizable Principle-Following Reward Models(https://arxiv.org/abs/2506.03637)
Keywords: language model, llm
Abstract: Reward Models, essential for guiding Large Language Model optimization, are typically trained on fixed preference datasets, resulting in rigid alignment to single, implicit preference distributions. This prevents adaptation to diverse real-world needs, from conciseness in one task to detailed explanations in another. The standard practice of collecting task-specific preference data and retraining reward models is resource-intensive, often produces biased rewards, and limits practical application. We introduce generalizable, principle-following reward models. We propose that RMs should understand and adhere to dynamically provided natural language specifications of reward principles, similar to instruction-following in LLMs. To measure this capability, we develop RABench, a comprehensive benchmark for RMs focusing on generalization across diverse principles. Evaluations on RABench reveal poor generalization of current RMs. As a solution, we present RewardAnything, a novel RM designed and trained to explicitly follow natural language principles. We achieve SotA performance with RewardAnything on traditional RM benchmarks simply by specifying a well-defined principle, and results on RABench show we excel in adapting to novel principles without retraining. Furthermore, RewardAnything integrates seamlessly with existing RLHF methods, and we show through a case study how to automatically and efficiently align LLMs with only natural language principles.
摘要：奖励模型，对于指导大型语言模型优化必不可少的，通常是在固定偏好数据集上训练的，从而导致与单个隐式偏好分布的严格对齐。这防止了一项任务中清心的各种现实世界需求的适应，以详细说明另一个任务。收集特定任务的偏好数据和再培训奖励模型的标准实践是资源密集的，通常会产生有偏见的奖励，并限制了实际应用。我们介绍了可概括的，原则上的奖励模型。我们建议RMS应该理解并遵守动态提供奖励原则的自然语言规范，类似于LLMS中的指导跟踪。为了衡量这种能力，我们开发了Rabench，这是RMS的全面基准，重点是跨不同原则的概括。对Rabench的评估显示当前RMS的概括不佳。作为解决方案，我们提出了奖励，这是一种新颖的RM设计和训练，以明确遵循自然语言原则。我们仅通过指定一个定义明确的原则就可以在传统的RM基准中获得SOTA性能，而Rabench上的结果表明，我们在适应新的原理而不进行重新培训时表现出色。此外，Rewardything与现有的RLHF方法无缝集成，我们通过案例研究表明，如何自动有效地将LLM与自然语言原则相提并论。

Title: Trustworthy Medical Question Answering: An Evaluation-Centric Survey

Authors: Yinuo Wang, Robert E. Mercer, Frank Rudzicz, Sudipta Singha Roy, Pengjie Ren, Zhumin Chen, Xindi Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.03659
Pdf URL: https://arxiv.org/pdf/2506.03659
Copy Paste: [[2506.03659]] Trustworthy Medical Question Answering: An Evaluation-Centric Survey(https://arxiv.org/abs/2506.03659)
Keywords: language model, llm
Abstract: Trustworthiness in healthcare question-answering (QA) systems is important for ensuring patient safety, clinical effectiveness, and user confidence. As large language models (LLMs) become increasingly integrated into medical settings, the reliability of their responses directly influences clinical decision-making and patient outcomes. However, achieving comprehensive trustworthiness in medical QA poses significant challenges due to the inherent complexity of healthcare data, the critical nature of clinical scenarios, and the multifaceted dimensions of trustworthy AI. In this survey, we systematically examine six key dimensions of trustworthiness in medical QA, i.e., Factuality, Robustness, Fairness, Safety, Explainability, and Calibration. We review how each dimension is evaluated in existing LLM-based medical QA systems. We compile and compare major benchmarks designed to assess these dimensions and analyze evaluation-guided techniques that drive model improvements, such as retrieval-augmented grounding, adversarial fine-tuning, and safety alignment. Finally, we identify open challenges (such as scalable expert evaluation, integrated multi-dimensional metrics, and real-world deployment studies) and propose future research directions to advance the safe, reliable, and transparent deployment of LLM-powered medical QA.
摘要：医疗保健提问（QA）系统中的可信赖性对于确保患者的安全性，临床有效性和用户信心很重要。随着大型语言模型（LLMS）越来越多地整合到医疗环境中，其反应的可靠性直接影响临床决策和患者结果。但是，由于医疗保健数据的固有复杂性，临床场景的批判性质以及可信赖的AI的多方面维度，因此在医疗质量检查中实现全面的可信度构成了重大挑战。在这项调查中，我们系统地检查了医学质量检查中可信度的六个关键维度，即事实，鲁棒性，公平性，安全性，解释性和校准。我们回顾了如何在现有的基于LLM的医疗质量检查系统中评估每个维度的方法。我们编译和比较了旨在评估这些维度的主要基准测试，并分析评估指导的技术，以推动改进模型的改进，例如检索仪的接地，对抗性微调和安全一致性。最后，我们确定开放的挑战，例如可扩展的专家评估，综合的多维指标以及现实世界的部署研究，并提出了未来的研究方向，以推动LLM驱动的医疗质量质量质量质量药的安全，可靠和透明的部署。

Title: Robust Preference Optimization via Dynamic Target Margins

Authors: Jie Sun, Junkang Wu, Jiancan Wu, Zhibo Zhu, Xingyu Lu, Jun Zhou, Lintao Ma, Xiang Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.03690
Pdf URL: https://arxiv.org/pdf/2506.03690
Copy Paste: [[2506.03690]] Robust Preference Optimization via Dynamic Target Margins(https://arxiv.org/abs/2506.03690)
Keywords: language model, llm
Abstract: The alignment of Large Language Models (LLMs) is crucial for ensuring their safety and reliability in practical applications. Direct Preference Optimization (DPO) has emerged as an efficient method that directly optimizes models using preference pairs, significantly reducing resource demands. However, the effectiveness of DPO heavily depends on the data quality, which is frequently compromised by noise. In this work, we propose $\gamma$-PO, a dynamic target margin preference optimization algorithm that adjusts reward margins at the pairwise level. By introducing instance-specific margin calibration, $\gamma$-PO strategically prioritizes high-confidence pairs (those demonstrating higher reward margins) while suppressing potential noise from ambiguous pairs. Moreover, $\gamma$-PO is a plug-and-play method, compatible with variants of DPO that rely on the reward margin between preference pairs. Across benchmarks such as AlpacaEval2 and Arena-Hard, $\gamma$-PO achieves an average 4.4% improvement over other baselines, setting new benchmarks for state-of-the-art performance. Additionally, $\gamma$-PO requires minimal code changes and has a negligible impact on training efficiency, making it a robust solution for enhancing LLM alignment. Our codes are available at this https URL.
摘要：大语言模型（LLM）的一致性对于确保其在实际应用中的安全性和可靠性至关重要。直接偏好优化（DPO）已成为一种有效的方法，它使用偏好对直接优化模型，从而大大减少了资源需求。但是，DPO的有效性在很大程度上取决于数据质量，而数据质量通常会因噪声而损害。在这项工作中，我们提出了$ \ gamma $ -po，这是一种动态目标边距优先优化算法，该算法在成对级别调整奖励保证金。通过引入特定实例的边距校准，$ \ gamma $ -po从策略上优先考虑高信心对（证明更高奖励利润率的对），同时抑制了模棱两可对的潜在噪声。此外，$ \ gamma $ -po是一种插件方法，与DPO的变体兼容，这些变体依赖于优先对之间的奖励保证金。在诸如Alpacaeval2和Arena-Hard之类的基准中，$ \ gamma $ -po的平均提高了4.4％\％\％的改善，为最先进的性能树立了新的基准。此外，$ \ gamma $ -po需要最小的代码更改，并且对培训效率的影响微不足道，这使其成为增强LLMS对齐的强大解决方案。我们的代码可在\ href {此https url} {this HTTPS url}上获得。

Title: AdaDecode: Accelerating LLM Decoding with Adaptive Layer Parallelism

Authors: Zhepei Wei, Wei-Lin Chen, Xinyu Zhu, Yu Meng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.03700
Pdf URL: https://arxiv.org/pdf/2506.03700
Copy Paste: [[2506.03700]] AdaDecode: Accelerating LLM Decoding with Adaptive Layer Parallelism(https://arxiv.org/abs/2506.03700)
Keywords: language model, llm, chain-of-thought
Abstract: Large language models (LLMs) are increasingly used for long-content generation (e.g., long Chain-of-Thought reasoning) where decoding efficiency becomes a critical bottleneck: Autoregressive decoding is inherently limited by its sequential token generation process, where each token must be generated before the next can be processed. This sequential dependency restricts the ability to fully leverage modern hardware's parallel processing capabilities. Existing methods like speculative decoding and layer skipping offer potential speedups but have notable drawbacks: speculative decoding relies on an auxiliary "drafter" model, which can be challenging to acquire and increases memory overhead, while layer skipping may introduce discrepancies in the outputs due to the missing key-value cache at skipped layers. In this work, we propose AdaDecode, which accelerates LLM decoding without requiring auxiliary models or changes to the original model parameters, while ensuring output consistency. AdaDecode leverages the insight that many tokens can accurately be generated at intermediate layers, as further layers often do not significantly alter predictions once the model reaches a certain confidence. By adaptively generating tokens at intermediate layers when confidence is high, AdaDecode enables the next token's computation to begin immediately. The remaining layer computations for early-predicted tokens are deferred and executed in parallel with subsequent tokens when needed, maximizing hardware utilization and reducing decoding latency. A final verification step ensures that early predictions match the results of standard autoregressive decoding, preserving output parity. Experiments across diverse generation tasks show that AdaDecode consistently achieves superior decoding throughput with up to 1.73x speedup, while guaranteeing output parity with standard autoregressive decoding.
摘要：大型语言模型（LLM）越来越多地用于长期生成（例如，长期经过经过经过经过经过经过经过思考的推理），其中解码效率成为关键的瓶颈：自动回答的解码本质上受其顺序代币生成过程的限制，在该过程中，每个令牌都必须在下一个代币之前生成下一个令牌。这种顺序依赖性限制了充分利用现代硬件并行处理功能的能力。诸如投机解码和层跳过之类的现有方法提供了潜在的加速，但具有显着的缺点：投机解码依赖于辅助的“起草者”模型，这可能会挑战并增加内存开销，而图层跳过可能会导致由于缺失的键入钥匙值加速而引起的差异。在这项工作中，我们提出了Adadecode，它可以加速LLM解码，而无需辅助模型或更改原始模型参数，同时确保输出一致性。 Adadecode利用了可以在中间层上准确生成许多令牌的见解，因为一旦模型达到一定的置信度，进一步的层通常不会显着改变预测。通过在置信度高时在中间层上自适应生成令牌，AdadeCode可以立即开始下一代币的计算。在需要时与随后的令牌并行执行的早期预测令牌的剩余图层计算，最大化硬件利用率并减少解码延迟。最终验证步骤可确保早期预测与标准自回归解码的结果相匹配，从而保留输出奇偶校验。跨不同一代任务的实验表明，Adadecode始终以高达1.73倍的速度实现了优越的解码吞吐量，同时保证了通过标准自动回应解码的输出奇偶校验。

Title: ScoreRAG: A Retrieval-Augmented Generation Framework with Consistency-Relevance Scoring and Structured Summarization for News Generation

Authors: Pei-Yun Lin, Yen-lung Tsai
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.03704
Pdf URL: https://arxiv.org/pdf/2506.03704
Copy Paste: [[2506.03704]] ScoreRAG: A Retrieval-Augmented Generation Framework with Consistency-Relevance Scoring and Structured Summarization for News Generation(https://arxiv.org/abs/2506.03704)
Keywords: language model, hallucination, retrieval-augmented generation
Abstract: This research introduces ScoreRAG, an approach to enhance the quality of automated news generation. Despite advancements in Natural Language Processing and large language models, current news generation methods often struggle with hallucinations, factual inconsistencies, and lack of domain-specific expertise when producing news articles. ScoreRAG addresses these challenges through a multi-stage framework combining retrieval-augmented generation, consistency relevance evaluation, and structured summarization. The system first retrieves relevant news documents from a vector database, maps them to complete news items, and assigns consistency relevance scores based on large language model evaluations. These documents are then reranked according to relevance, with low-quality items filtered out. The framework proceeds to generate graded summaries based on relevance scores, which guide the large language model in producing complete news articles following professional journalistic standards. Through this methodical approach, ScoreRAG aims to significantly improve the accuracy, coherence, informativeness, and professionalism of generated news articles while maintaining stability and consistency throughout the generation process. The code and demo are available at: this https URL.
摘要：这项研究介绍了Scorerag，这是一种提高自动新闻质量的方法。尽管自然语言处理和大型语言模型取得了进步，但当前的新闻生成方法通常会在幻觉，事实上不一致以及在制作新闻文章时缺乏特定领域的专业知识。 Scorerag通过多阶段框架来解决这些挑战，从而结合了检索功能增强的生成，一致性相关性评估和结构化摘要。该系统首先从矢量数据库中检索相关的新闻文档，将它们映射以完成新闻项目，并根据大型语言模型评估分配一致性相关性分数。然后根据相关性将这些文档重新播放，并过滤低质量的项目。该框架继续根据相关得分生成分级摘要，该分数指导大型语言模型以专业新闻标准制作完整的新闻文章。通过这种有条理的方法，Scorerag旨在显着提高生成的新闻文章的准确性，相干性，信息性和专业精神，同时在整个生成过程中保持稳定性和一致性。代码和演示可在以下网址提供：此HTTPS URL。

Title: Verbalized Confidence Triggers Self-Verification: Emergent Behavior Without Explicit Reasoning Supervision

Authors: Chaeyun Jang, Moonseok Choi, Yegon Kim, Hyungi Lee, Juho Lee
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.03723
Pdf URL: https://arxiv.org/pdf/2506.03723
Copy Paste: [[2506.03723]] Verbalized Confidence Triggers Self-Verification: Emergent Behavior Without Explicit Reasoning Supervision(https://arxiv.org/abs/2506.03723)
Keywords: language model, llm, chain-of-thought
Abstract: Uncertainty calibration is essential for the safe deployment of large language models (LLMs), particularly when users rely on verbalized confidence estimates. While prior work has focused on classifiers or short-form generation, confidence calibration for chain-of-thought (CoT) reasoning remains largely unexplored. Surprisingly, we find that supervised fine-tuning with scalar confidence labels alone suffices to elicit self-verification behavior of language models, without any explicit reasoning supervision or reinforcement learning-based rewards. Despite being trained only to produce a verbalized confidence score without any self-verifying examples, the model learns to generate longer and self-checking responses for low-confidence queries while providing more concise answers for high-confidence ones. We further propose a simple rethinking method that boosts performance via test-time scaling based on calibrated uncertainty. Experiments on GSM8K and held-out reasoning tasks such as MATH-500 and ARC-Challenge show that our confidence-aware fine-tuning improves both calibration and accuracy, while also enhancing interpretability by aligning the model's reasoning path with its confidence.
摘要：不确定性校准对于安全部署大型语言模型（LLMS）至关重要，尤其是当用户依靠口头上的置信度估计时。虽然先前的工作集中在分类器或短形成生成上，但对经营链（COT）推理的置信度校准基本上仍未得到探索。令人惊讶的是，我们发现单独使用标量信心标签进行微调的微调足以引起语言模型的自我验证行为，而无需任何明确的推理监督或基于强化学习的奖励。尽管受过培训仅是为了产生口头上的置信度得分而没有任何自我验证的例子，但该模型学会了为低信心查询产生更长的自我检查响应，同时为高信心的响应提供了更简洁的答案。我们进一步提出了一种简单的重新思考方法，该方法可以根据校准的不确定性通过测试时间缩放来提高性能。在GSM8K和持有推理任务（例如Math-500和Arc-Challenge）上进行的实验表明，我们感知信心的微调可以提高校准和准确性，同时还通过将模型的推理路径与信心对齐来增强可解释性。

Title: Act-as-Pet: Benchmarking the Abilities of Large Language Models as E-Pets in Social Network Services

Authors: Hongcheng Guo, Zheyong Xie, Shaosheng Cao, Boyang Wang, Weiting Liu, Zheyu Ye, Zhoujun Li, Zuozhu Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.03761
Pdf URL: https://arxiv.org/pdf/2506.03761
Copy Paste: [[2506.03761]] Act-as-Pet: Benchmarking the Abilities of Large Language Models as E-Pets in Social Network Services(https://arxiv.org/abs/2506.03761)
Keywords: language model, llm
Abstract: As interest in using Large Language Models (LLMs) for interactive and emotionally rich experiences grows, virtual pet companionship emerges as a novel yet underexplored application. Existing approaches focus on basic pet role-playing interactions without systematically benchmarking LLMs for comprehensive companionship. In this paper, we introduce Pet-Bench, a dedicated benchmark that evaluates LLMs across both self-interaction and human-interaction dimensions. Unlike prior work, Pet-Bench emphasizes self-evolution and developmental behaviors alongside interactive engagement, offering a more realistic reflection of pet companionship. It features diverse tasks such as intelligent scheduling, memory-based dialogues, and psychological conversations, with over 7,500 interaction instances designed to simulate complex pet behaviors. Evaluation of 28 LLMs reveals significant performance variations linked to model size and inherent capabilities, underscoring the need for specialized optimization in this domain. Pet-Bench serves as a foundational resource for benchmarking pet-related LLM abilities and advancing emotionally immersive human-pet interactions.
摘要：随着对使用大型语言模型（LLM）进行互动和情感丰富的体验的兴趣，虚拟宠物同伴成为一种新颖而又毫无疑问的应用。现有的方法着重于基本的宠物角色扮演互动，而无需系统地基准llms以进行全面的陪伴。在本文中，我们介绍了Pet Bench，这是一种专门的基准测试，可评估自我交往和人类相互作用维度的LLM。与先前的工作不同，Pet Bench强调了自我进化和发展行为以及互动参与，从而更现实地反映了宠物陪伴。它具有各种任务，例如智能调度，基于内存的对话和心理对话，其中7,500多个互动实例旨在模拟复杂的宠物行为。对28个LLM的评估揭示了与模型大小和固有功能相关的显着性能变化，强调了该域中对专门优化的需求。宠物基础是基础基准与宠物相关的LLM能力并提高情绪沉浸式人类宠物的相互作用的基础资源。

Title: AhaKV: Adaptive Holistic Attention-Driven KV Cache Eviction for Efficient Inference of Large Language Models

Authors: Yifeng Gu, Zicong Jiang, Jianxiu Jin, Kailing Guo, Ziyang Zhang, Xiangmin Xu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.03762
Pdf URL: https://arxiv.org/pdf/2506.03762
Copy Paste: [[2506.03762]] AhaKV: Adaptive Holistic Attention-Driven KV Cache Eviction for Efficient Inference of Large Language Models(https://arxiv.org/abs/2506.03762)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have significantly advanced the field of Artificial Intelligence. However, their deployment is resource-intensive, not only because of the large number of model parameters but also because the key-value (KV) cache consumes a lot of memory during inference. While several works propose reducing the KV cache by evicting unnecessary tokens, these approaches rely on the accumulated attention score as an eviction score to quantify the importance of each token. We find that the accumulated attention score is biased: in mathematical expectation, it decreases with the position of the token. As a result, the retained tokens concentrate on the initial positions, limiting the model's access to global contextual information. To address this issue, we propose Adaptive holistic attention KV (AhaKV), which addresses the bias of the accumulated attention score by adaptively tuning the scale of softmax according to the expectation of the information entropy of attention scores. To make use of the holistic attention information in the self-attention mechanism, AhaKV utilizes the information in value vectors, which is overlooked in previous works, to refine the adaptive score. We show theoretically that our method is well suited for bias reduction. We deployed AhaKV on different models with a fixed cache budget. Experiments show that AhaKV successfully mitigates bias, retains crucial tokens across the global context, and achieves state-of-the-art results against other related work on several benchmark tasks.
摘要：大型语言模型（LLM）已大大推进了人工智能领域。但是，它们的部署是资源密集的，不仅是由于模型参数数量大量，而且还因为（键值）KV缓存在推理过程中会消耗大量内存。虽然几项作品建议通过驱逐不必要的令牌来减少KV缓存，但这些方法依靠累积的注意力评分作为驱逐评分来量化令牌的重要性。我们确定累积的注意力评分是有偏见的，并且随着令牌的位置而降低了数学期望。结果，保留的令牌集中在初始位置上，从而限制了模型对全球上下文信息的访问。为了解决这个问题，我们提出了自适应的整体注意力KV（AHAKV），它通过根据注意力评分的信息熵的期望来适应软调整软性评分来解决累积注意力评分的偏见。为了利用自我注意机制中的整体注意力信息，Ahakv利用了先前工作中忽略的价值向量的信息来完善适应性评分。从理论上讲，我们表明我们的方法非常适合减少偏差。我们用固定的高速缓存预算部署了AHAKV。实验表明，AHAKV成功地减轻了偏见并保留了跨全球环境的关键令牌，并在几项基准任务上与其他相关工作实现最新结果。
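
The positional bias that this abstract attributes to accumulated attention scores can be illustrated with a toy calculation. The sketch below assumes uniform causal attention (each query attends equally to itself and all earlier tokens), which is a simplifying assumption of this example and not the paper's setting. The function names are hypothetical, and the code shows only the baseline eviction score, not AhaKV's correction.

```python
from typing import List


def accumulated_scores(n: int) -> List[float]:
    """Accumulated attention received by each of n tokens under uniform
    causal attention: query i spreads weight 1/(i+1) over keys 0..i."""
    scores = [0.0] * n
    for i in range(n):
        for j in range(i + 1):
            scores[j] += 1.0 / (i + 1)
    return scores


def keep_top_k(scores: List[float], k: int) -> List[int]:
    """Return the sorted indices of the k tokens with the highest score,
    i.e. the tokens that an accumulated-score eviction policy would keep."""
    ranked = sorted(range(len(scores)), key=lambda j: -scores[j])
    return sorted(ranked[:k])


scores = accumulated_scores(8)
kept = keep_top_k(scores, 3)
```

Even though no token is more important than another in this toy setting, earlier tokens accumulate strictly more attention simply because more queries can see them, so the retained set collapses onto the initial positions.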

Title: ClozeMath: Improving Mathematical Reasoning in Language Models by Learning to Fill Equations

Authors: Quang Hieu Pham, Thuy Duong Nguyen, Tung Pham, Anh Tuan Luu, Dat Quoc Nguyen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.03763
Pdf URL: https://arxiv.org/pdf/2506.03763
Copy Paste: [[2506.03763]] ClozeMath: Improving Mathematical Reasoning in Language Models by Learning to Fill Equations(https://arxiv.org/abs/2506.03763)
Keywords: language model, llm, chain-of-thought
Abstract: The capabilities of large language models (LLMs) have been enhanced by training on data that reflects human thought processes, such as the Chain-of-Thought format. However, evidence suggests that the conventional scheme of next-word prediction may not fully capture how humans learn to think. Inspired by how humans generalize mathematical reasoning, we propose a new approach named ClozeMath to fine-tune LLMs for mathematical reasoning. Our ClozeMath involves a text-infilling task that predicts masked equations from a given solution, analogous to cloze exercises used in human learning. Experiments on GSM8K, MATH, and GSM-Symbolic show that ClozeMath surpasses the strong baseline Masked Thought in performance and robustness, with two test-time scaling decoding algorithms, Beam Search and Chain-of-Thought decoding. Additionally, we conduct an ablation study to analyze the effects of various architectural and implementation choices on our approach.
摘要：大语模型（LLM）的功能通过反映人类思维过程的数据（例如思想链格式）来增强。但是，有证据表明，下一字预测的常规计划可能无法完全捕捉人类学会思考的方式。受到人类如何概括数学推理的启发，我们提出了一种名为Clozemath的新方法，以微调LLM的数学推理。我们的clozemath涉及一项文本注入的任务，该任务可以预测给定解决方案的掩盖方程，类似于人类学习中使用的紧密练习。 GSM8K，Math和GSM符号符号的实验表明，Clozemath的性能和鲁棒性超过了强大的基线掩盖思想，具有两个测试时间缩放分解算法，光束搜索和经过思考的解码。此外，我们进行了一项消融研究，以分析各种架构和实施选择对我们方法的影响。

Title: Unifying Uniform and Binary-coding Quantization for Accurate Compression of Large Language Models

Authors: Seungcheol Park, Jeongin Bae, Beomseok Kwon, Minjun Kim, Byeongwook Kim, Se Jung Kwon, U Kang, Dongsoo Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.03781
Pdf URL: https://arxiv.org/pdf/2506.03781
Copy Paste: [[2506.03781]] Unifying Uniform and Binary-coding Quantization for Accurate Compression of Large Language Models(https://arxiv.org/abs/2506.03781)
Keywords: language model, llm
Abstract: How can we quantize large language models while preserving accuracy? Quantization is essential for deploying large language models (LLMs) efficiently. Binary-coding quantization (BCQ) and uniform quantization (UQ) are promising quantization schemes that have strong expressiveness and optimizability, respectively. However, neither scheme leverages both advantages. In this paper, we propose UniQuanF (Unified Quantization with Flexible Mapping), an accurate quantization method for LLMs. UniQuanF harnesses both strong expressiveness and optimizability by unifying the flexible mapping technique in UQ and non-uniform quantization levels of BCQ. We propose unified initialization, and local and periodic mapping techniques to optimize the parameters in UniQuanF precisely. After optimization, our unification theorem removes computational and memory overhead, allowing us to utilize the superior accuracy of UniQuanF without extra deployment costs induced by the unification. Experimental results demonstrate that UniQuanF outperforms existing UQ and BCQ methods, achieving up to 4.60% higher accuracy on GSM8K benchmark.
摘要：在保持准确性的同时，我们如何量化大型语言模型？量化对于有效部署大型语言模型（LLM）至关重要。二进制编码量化（BCQ）和均匀量化（UQ）是有希望的量化方案，分别具有强烈的表现力和优化性。但是，这两种方案都没有利用这两个优势。在本文中，我们提出了独立性（使用灵活映射的统一量化），这是LLMS的准确量化方法。 Uniquanf通过在BCQ的UQ和非均匀量化水平中统一柔性映射技术来实现强的表现力和优化性。我们提出了统一的初始化以及局部和周期性的映射技术，以精确地优化Uniquanf中的参数。优化后，我们的统一定理删除了计算和内存开销，使我们能够在统一引起的额外部署成本的情况下利用Uniquanf的出色精度。实验结果表明，Uniquanf的表现优于现有的UQ和BCQ方法，在GSM8K基准测试上的精度高达4.60％。

Title: Knockout LLM Assessment: Using Large Language Models for Evaluations through Iterative Pairwise Comparisons

Authors: Isik Baran Sandan, Tu Anh Dinh, Jan Niehues
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.03785
Pdf URL: https://arxiv.org/pdf/2506.03785
Copy Paste: [[2506.03785]] Knockout LLM Assessment: Using Large Language Models for Evaluations through Iterative Pairwise Comparisons(https://arxiv.org/abs/2506.03785)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have been shown to be effective evaluators across various domains such as machine translation or the scientific domain. Current LLM-as-a-Judge approaches rely mostly on individual assessments or a single round of pairwise assessments, preventing the judge LLM from developing a global ranking perspective. To address this, we present Knockout Assessment, an LLM-as-a-Judge method using a knockout tournament system with iterative pairwise comparisons. Experiments across three LLMs on two datasets show that knockout assessment improves scoring accuracy, increasing Pearson correlation with expert evaluations by 0.07 on average for university-level exam scoring and machine translation evaluations, aligning LLM assessments more closely with human scoring.
摘要：大型语言模型（LLMS）已证明是各个领域的有效评估者，例如机器翻译或科学领域。当前的LLM-AS-A-A-Gudge方法主要依赖于个人评估或一轮成对评估，从而阻止了法官LLM发展全球排名的角度。为了解决这个问题，我们介绍了淘汰赛评估，这是一种使用淘汰赛系统进行淘汰赛系统的LLM-ASA法官方法，并进行了迭代成对比较。在两个数据集上进行的三个LLM的实验表明，敲除评估提高了评分的准确性，在大学级考试评分和机器翻译评估中，Pearson与专家评估的相关性平均增加了0.07，将LLM评估与人类评分更加紧密。
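
The knockout tournament structure named in this abstract can be sketched generically. The code below is an illustration only: the `judge` callable stands in for an LLM pairwise comparison, the bye rule for odd-sized rounds is an assumption, and the sketch returns just the overall winner, whereas the abstract's method produces scores by a procedure it does not detail.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def knockout(candidates: List[T], judge: Callable[[T, T], T]) -> T:
    """Run a single-elimination tournament and return the overall winner.
    `judge(a, b)` must return whichever of the two it prefers."""
    if not candidates:
        raise ValueError("need at least one candidate")
    current = list(candidates)
    while len(current) > 1:
        next_round: List[T] = []
        # Pair up neighbours; each pair costs one pairwise comparison.
        for i in range(0, len(current) - 1, 2):
            next_round.append(judge(current[i], current[i + 1]))
        # With an odd number of entrants, the last one advances unopposed.
        if len(current) % 2 == 1:
            next_round.append(current[-1])
        current = next_round
    return current[0]


# Toy stand-in judge (hypothetical): prefer the longer answer.
def longer(a: str, b: str) -> str:
    return a if len(a) >= len(b) else b


answers = ["a", "abcd", "abc", "ab", "abcde"]
winner = knockout(answers, longer)
```

A tournament over n candidates needs n - 1 pairwise comparisons, and each survivor is compared against progressively stronger opponents across rounds.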

Title: Mark My Words: A Robust Multilingual Model for Punctuation in Text and Speech Transcripts

Authors: Sidharth Pulipaka, Sparsh Jain, Ashwin Sankar, Raj Dabre
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.03793
Pdf URL: https://arxiv.org/pdf/2506.03793
Copy Paste: [[2506.03793]] Mark My Words: A Robust Multilingual Model for Punctuation in Text and Speech Transcripts(https://arxiv.org/abs/2506.03793)
Keywords: language model
Abstract: Punctuation plays a vital role in structuring meaning, yet current models often struggle to restore it accurately in transcripts of spontaneous speech, especially in the presence of disfluencies such as false starts and backtracking. These limitations hinder the performance of downstream tasks like translation, text to speech, summarization, etc. where sentence boundaries are critical for preserving quality. In this work, we introduce Cadence, a generalist punctuation restoration model adapted from a pretrained large language model. Cadence is designed to handle both clean written text and highly spontaneous spoken transcripts. It surpasses the previous state of the art in performance while expanding support from 14 to all 22 Indian languages and English. We conduct a comprehensive analysis of model behavior across punctuation types and language families, identifying persistent challenges under domain shift and with rare punctuation marks. Our findings demonstrate the efficacy of utilizing pretrained language models for multilingual punctuation restoration and highlight Cadence's practical value for low resource NLP pipelines at scale.
摘要：标点符号在构造意义中起着至关重要的作用，但是当前的模型通常难以在自发语音的转录中准确地恢复它，尤其是在存在诸如虚假开始和回溯之类的爆发的情况下。这些限制阻碍了下游任务的执行，例如翻译，文本到语音，摘要等。句子边界对于保留质量至关重要。在这项工作中，我们介绍了Cadence，这是一种通才标点符号修复模型，该模型是根据预算的大语言模型进行的。 Cadence旨在处理干净的书面文字和高度自发的口语笔录。它超过了先前的表演状态，同时将支持从14种印度语言和英语扩大。我们对标点符号类型和语言家族的模型行为进行全面分析，确定域转移和罕见标点符号下的持续挑战。我们的发现证明了利用审慎的语言模型用于多语言标点符号恢复，并突出了低资源NLP管道的Cadence实用价值。

Title: PulseReddit: A Novel Reddit Dataset for Benchmarking MAS in High-Frequency Cryptocurrency Trading

Authors: Qiuhan Han, Qian Wang, Atsushi Yoshikawa, Masayuki Yamamura
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.03861
Pdf URL: https://arxiv.org/pdf/2506.03861
Copy Paste: [[2506.03861]] PulseReddit: A Novel Reddit Dataset for Benchmarking MAS in High-Frequency Cryptocurrency Trading(https://arxiv.org/abs/2506.03861)
Keywords: language model, llm, agent
Abstract: High-Frequency Trading (HFT) is pivotal in cryptocurrency markets, demanding rapid decision-making. Social media platforms like Reddit offer valuable, yet underexplored, information for such high-frequency, short-term trading. This paper introduces PulseReddit, a novel dataset that is the first to align large-scale Reddit discussion data with high-frequency cryptocurrency market statistics for short-term trading analysis. We conduct an extensive empirical study using Large Language Model (LLM)-based Multi-Agent Systems (MAS) to investigate the impact of social sentiment from PulseReddit on trading performance. Our experiments conclude that MAS augmented with PulseReddit data achieve superior trading outcomes compared to traditional baselines, particularly in bull markets, and demonstrate robust adaptability across different market regimes. Furthermore, our research provides conclusive insights into the performance-efficiency trade-offs of different LLMs, detailing significant considerations for practical model selection in HFT applications. PulseReddit and our findings establish a foundation for advanced MAS research in HFT, demonstrating the tangible benefits of integrating social media.
摘要：高频交易（HFT）在加密货币市场中至关重要，要求快速决策。像Reddit这样的社交媒体平台为高频，短期交易提供了有价值但尚未充满信心的信息。本文介绍了\ textbf {pulsereddit}，这是一个新颖的数据集，它是第一个将大规模REDDIT讨论数据与高频加密货币市场统计数据保持一致的数据集，用于短期交易分析。我们使用基于大语言模型（LLM）的多机构系统（MAS）进行广泛的实证研究，以调查PulseredDit对交易绩效的社交情绪的影响。我们的实验得出的结论是，与传统基线相比，使用脉冲数据的MAS增强了MAS，尤其是在牛市中，可以实现出色的交易成果，并证明了不同市场制度的强大适应性。此外，我们的研究提供了对不同LLM的性能效率折衷的最终见解，详细介绍了HFT应用中实用模型选择的重要考虑因素。 PulseredDit和我们的发现为HFT的高级MAS研究奠定了基础，证明了整合社交媒体的切实好处。

Title: EuroGEST: Investigating gender stereotypes in multilingual language models

Authors: Jacqueline Rowe, Mateusz Klimaszewski, Liane Guillou, Shannon Vallor, Alexandra Birch
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.03867
Pdf URL: https://arxiv.org/pdf/2506.03867
Copy Paste: [[2506.03867]] EuroGEST: Investigating gender stereotypes in multilingual language models(https://arxiv.org/abs/2506.03867)
Keywords: language model, llm
Abstract: Large language models increasingly support multiple languages, yet most benchmarks for gender bias remain English-centric. We introduce EuroGEST, a dataset designed to measure gender-stereotypical reasoning in LLMs across English and 29 European languages. EuroGEST builds on an existing expert-informed benchmark covering 16 gender stereotypes, expanded in this work using translation tools, quality estimation metrics, and morphological heuristics. Human evaluations confirm that our data generation method results in high accuracy of both translations and gender labels across languages. We use EuroGEST to evaluate 24 multilingual language models from six model families, demonstrating that the strongest stereotypes in all models across all languages are that women are "beautiful", "empathetic" and "neat" and men are "leaders", "strong, tough" and "professional". We also show that larger models encode gendered stereotypes more strongly and that instruction finetuning does not consistently reduce gendered stereotypes. Our work highlights the need for more multilingual studies of fairness in LLMs and offers scalable methods and resources to audit gender bias across languages.
摘要：大型语言模型越来越多地支持多种语言，但大多数性别偏见基准仍以英语为中心。我们介绍了 EuroGEST，这是一个旨在衡量 LLM 在英语和 29 种欧洲语言中的性别刻板印象推理的数据集。EuroGEST 建立在一个现有的、由专家参与构建的基准之上，该基准涵盖 16 种性别刻板印象，我们在这项工作中使用翻译工具、质量估计指标和形态学启发式方法对其进行了扩展。人工评估证实，我们的数据生成方法在各语言的翻译和性别标签上都具有很高的准确性。我们使用 EuroGEST 评估了来自六个模型系列的 24 个多语言模型，结果表明，在所有语言的所有模型中，最强的刻板印象是：女性是“美丽的”、“有同理心的”和“整洁的”，男性是“领导者”、“强壮、坚韧的”和“专业的”。我们还表明，较大的模型对性别刻板印象的编码更强，并且指令微调并不能始终如一地减少性别刻板印象。我们的工作强调了对 LLM 公平性进行更多多语言研究的必要性，并提供了可扩展的方法和资源来审核跨语言的性别偏见。

Title: RadialRouter: Structured Representation for Efficient and Robust Large Language Models Routing

Authors: Ruihan Jin, Pengpeng Shao, Zhengqi Wen, Jinyang Wu, Mingkuan Feng, Shuai Zhang, Jianhua Tao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.03880
Pdf URL: https://arxiv.org/pdf/2506.03880
Copy Paste: [[2506.03880]] RadialRouter: Structured Representation for Efficient and Robust Large Language Models Routing(https://arxiv.org/abs/2506.03880)
Keywords: language model, llm
Abstract: The rapid advancements in large language models (LLMs) have led to the emergence of routing techniques, which aim to efficiently select the optimal LLM from diverse candidates to tackle specific tasks, optimizing performance while reducing costs. Current LLM routing methods are limited in effectiveness due to insufficient exploration of the intrinsic connection between user queries and the characteristics of LLMs. To address this issue, in this paper, we present RadialRouter, a novel framework for LLM routing which employs a lightweight Transformer-based backbone with a radial structure named RadialFormer to articulate the query-LLMs relationship. The optimal LLM selection is performed based on the final states of RadialFormer. The pipeline is further refined by an objective function that combines Kullback-Leibler divergence with the query-query contrastive loss to enhance robustness. Experimental results on RouterBench show that RadialRouter significantly outperforms existing routing methods by 9.2% and 5.8% in the Balance and Cost First scenarios, respectively. Additionally, its adaptability toward different performance-cost trade-offs and the dynamic LLM pool demonstrates practical application potential.
摘要：大语言模型（LLM）的快速进步导致了路由技术的出现，该技术旨在有效地从不同的候选人中选择最佳的LLM，以解决特定的任务，在降低成本的同时优化性能。当前的LLM路由方法由于对用户查询和LLMS特征之间的内在连接的探索不足而受到限制。为了解决这个问题，在本文中，我们介绍了RadialRouter，这是一个新型LLM路由的框架，该框架采用了一个基于轻巧的变压器的主链，其径向结构名为Radialformer，以阐明查询符号的关系。最佳LLM选择是根据径向形式的最终状态执行的。该管道通过目标函数进一步完善，该目标函数将kullback-leibler差异与查询质量对比度损失相结合以增强鲁棒性。 Routerbench上的实验结果表明，在余额和成本第一场景中，RadialRouter显着优于9.2 \％和5.8％的现有路由方法。此外，它对不同性能成本权衡的适应性和动态LLM池表现出实用的应用潜力。

Title: Pre$^3$: Enabling Deterministic Pushdown Automata for Faster Structured LLM Generation

Authors: Junyi Chen, Shihao Bai, Zaijun Wang, Siyu Wu, Chuheng Du, Hailong Yang, Ruihao Gong, Shengzhong Liu, Fan Wu, Guihai Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.03887
Pdf URL: https://arxiv.org/pdf/2506.03887
Copy Paste: [[2506.03887]] Pre$^3$: Enabling Deterministic Pushdown Automata for Faster Structured LLM Generation(https://arxiv.org/abs/2506.03887)
Keywords: llm
Abstract: Extensive LLM applications demand efficient structured generations, particularly for LR(1) grammars, to produce outputs in specified formats (e.g., JSON). Existing methods primarily parse LR(1) grammars into a pushdown automaton (PDA), leading to runtime execution overhead for context-dependent token processing, especially inefficient under large inference batches. To address these issues, we propose Pre$^3$ that exploits deterministic pushdown automata (DPDA) to optimize the constrained LLM decoding efficiency. First, by precomputing prefix-conditioned edges during the preprocessing, Pre$^3$ enables ahead-of-time edge analysis and thus makes parallel transition processing possible. Second, by leveraging the prefix-conditioned edges, Pre$^3$ introduces a novel approach that transforms LR(1) transition graphs into DPDA, eliminating the need for runtime path exploration and achieving edge transitions with minimal overhead. Pre$^3$ can be seamlessly integrated into standard LLM inference frameworks, reducing time per output token (TPOT) by up to 40% and increasing throughput by up to 36% in our experiments. Our code is available at this https URL.
摘要：广泛的LLM应用要求有效的结构化世代，特别是对于LR（1）语法，以指定格式的输出（例如JSON）产生输出。现有方法主要将LR（1）语法解析为下降自动机（PDA），从而导致运行时执行开销，以实现上下文依赖性令牌处理，尤其是在大型推理批次下效率低下。为了解决这些问题，我们提出了利用确定性下降自动机（DPDA）来优化受约束的LLM解码效率的Pre $^3 $。首先，通过在预处理过程中对前缀条件的边缘进行了预先计算，$^3 $可以提前启用整个时间边缘分析，从而使并行过渡处理成为可能。其次，通过利用前缀条件的边缘，Pre $^3 $引入了一种新颖的方法，将LR（1）过渡图转换为DPDA，从而消除了对运行时路径探索的需求，并以最小的开销而实现了边缘过渡。可以将Pre $^3 $无缝集成到标准LLM推理框架中，在我们的实验中，每个输出令牌（TPOT）的时间最多减少了40％，最多将吞吐量增加了36％。我们的代码可在此HTTPS URL上找到。

Title: Magic Mushroom: A Customizable Benchmark for Fine-grained Analysis of Retrieval Noise Erosion in RAG Systems

Authors: Yuxin Zhang, Yan Wang, Yongrui Chen, Shenyu Zhang, Xinbang Dai, Sheng Bi, Guilin Qi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.03901
Pdf URL: https://arxiv.org/pdf/2506.03901
Copy Paste: [[2506.03901]] Magic Mushroom: A Customizable Benchmark for Fine-grained Analysis of Retrieval Noise Erosion in RAG Systems(https://arxiv.org/abs/2506.03901)
Keywords: language model, llm, hallucination, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) systems enhance Large Language Models (LLMs) by incorporating external retrieved information, mitigating issues such as hallucination and outdated knowledge. However, RAG systems are highly sensitive to retrieval noise prevalent in real-world scenarios. Existing benchmarks fail to emulate the complex and heterogeneous noise distributions encountered in real-world retrieval environments, undermining reliable robustness assessment. In this paper, we define four categories of retrieval noise based on linguistic properties and noise characteristics, aiming to reflect the heterogeneity of noise in real-world scenarios. Building on this, we introduce Magic Mushroom, a benchmark for replicating "magic mushroom" noise: contexts that appear relevant on the surface but covertly mislead RAG systems. Magic Mushroom comprises 7,468 single-hop and 3,925 multi-hop question-answer pairs. More importantly, Magic Mushroom enables researchers to flexibly configure combinations of retrieval noise according to specific research objectives or application scenarios, allowing for highly controlled evaluation setups. We evaluate LLM generators of varying parameter scales and classic RAG denoising strategies under diverse noise distributions to investigate their performance dynamics during progressive noise encroachment. Our analysis reveals that both generators and denoising strategies have significant room for improvement and exhibit extreme sensitivity to noise distributions. Magic Mushroom emerges as a promising tool for evaluating and advancing noise-robust RAG systems, accelerating their widespread deployment in real-world applications. The Magic Mushroom benchmark is available at this https URL.
摘要：通过合并外部检索信息，减轻幻觉和过时的知识等问题，检索增强的生成（RAG）系统可以增强大语言模型（LLM）。但是，在实际情况下，抹布系统对检索噪声非常敏感。现有的基准无法效仿在现实检索环境中遇到的复杂且异质的噪声分布，从而破坏了可靠的鲁棒性评估。在本文中，我们根据语言特性和噪声特征定义了四类检索噪声，旨在反映现实世界中噪声的异质性。在此基础上，我们引入了魔术蘑菇，这是复制“魔术蘑菇”噪声的基准：在表面上显得相关但掩盖了误导的抹布系统的上下文。魔术蘑菇包括7,468个单跳和3,925个多跳问答对。更重要的是，魔术蘑菇使研究人员能够根据特定的研究目标或应用程序场景灵活配置检索噪声的组合，从而可以进行高度控制的评估设置。我们评估了不同参数量表的LLM发生器和在不同噪声分布下的经典抹布降级策略，以研究其在渐进噪声侵占过程中的性能动态。我们的分析表明，发电机和剥夺策略都有很大的改进空间，并且对噪声分布具有极大的敏感性。魔术蘑菇是一种有前途的工具，用于评估和推动噪声射击抹布系统，从而加速其在现实世界应用中的广泛部署。魔术蘑菇基准可在此HTTPS URL上获得。

Title: HSSBench: Benchmarking Humanities and Social Sciences Ability for Multimodal Large Language Models

Authors: Zhaolu Kang, Junhao Gong, Jiaxu Yan, Wanke Xia, Yian Wang, Ziwen Wang, Huaxuan Ding, Zhuo Cheng, Wenhao Cao, Zhiyuan Feng, Siqi He, Shannan Yan, Junzhe Chen, Xiaomin He, Chaoya Jiang, Wei Ye, Kaidong Yu, Xuelong Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.03922
Pdf URL: https://arxiv.org/pdf/2506.03922
Copy Paste: [[2506.03922]] HSSBench: Benchmarking Humanities and Social Sciences Ability for Multimodal Large Language Models(https://arxiv.org/abs/2506.03922)
Keywords: language model, llm, agent
Abstract: Multimodal Large Language Models (MLLMs) have demonstrated significant potential to advance a broad range of domains. However, current benchmarks for evaluating MLLMs primarily emphasize general knowledge and vertical step-by-step reasoning typical of STEM disciplines, while overlooking the distinct needs and potential of the Humanities and Social Sciences (HSS). Tasks in the HSS domain require more horizontal, interdisciplinary thinking and a deep integration of knowledge across related fields, which presents unique challenges for MLLMs, particularly in linking abstract concepts with corresponding visual representations. Addressing this gap, we present HSSBench, a dedicated benchmark designed to assess the capabilities of MLLMs on HSS tasks in multiple languages, including the six official languages of the United Nations. We also introduce a novel data generation pipeline tailored for HSS scenarios, in which multiple domain experts and automated agents collaborate to generate and iteratively refine each sample. HSSBench contains over 13,000 meticulously designed samples, covering six key categories. We benchmark more than 20 mainstream MLLMs on HSSBench and demonstrate that it poses significant challenges even for state-of-the-art models. We hope that this benchmark will inspire further research into enhancing the cross-disciplinary reasoning abilities of MLLMs, especially their capacity to internalize and connect knowledge across fields.
摘要：多模式大语言模型（MLLM）表现出了巨大的潜力，可以推进广泛的领域。但是，当前评估MLLM的基准主要强调了典型的STEM学科的常识和垂直逐步推理，同时忽略了人文和社会科学的独特需求和潜力（HSS）。 HSS域中的任务需要更多的水平，跨学科思维，并深入跨相关领域的知识整合，这给MLLM带来了独特的挑战，尤其是在将抽象概念与相应的视觉表示联系起来。在解决这一差距时，我们提出了HSSBench，这是一个专门的基准测试，旨在评估MLLM在HSS任务上的功能，包括多种语言，包括联合国的六种官方语言。我们还介绍了针对HSS场景量身定制的新型数据生成管道，其中多个领域专家和自动化代理协作以生成并迭代地完善每个样本。 HSSBench包含13,000多个精心设计的样品，涵盖了六个关键类别。我们在HSSBench上基准了20多个主流MLLM，并证明即使对于最先进的模型，它也会构成重大挑战。我们希望这种基准将激发进一步的研究，以增强MLLM的跨学科推理能力，尤其是它们在整个领域内化和联系知识的能力。

Title: More or Less Wrong: A Benchmark for Directional Bias in LLM Comparative Reasoning

Authors: Mohammadamin Shafiei, Hamidreza Saffari, Nafise Sadat Moosavi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.03923
Pdf URL: https://arxiv.org/pdf/2506.03923
Copy Paste: [[2506.03923]] More or Less Wrong: A Benchmark for Directional Bias in LLM Comparative Reasoning(https://arxiv.org/abs/2506.03923)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Large language models (LLMs) are known to be sensitive to input phrasing, but the mechanisms by which semantic cues shape reasoning remain poorly understood. We investigate this phenomenon in the context of comparative math problems with objective ground truth, revealing a consistent and directional framing bias: logically equivalent questions containing the words "more", "less", or "equal" systematically steer predictions in the direction of the framing term. To study this effect, we introduce MathComp, a controlled benchmark of 300 comparison scenarios, each evaluated under 14 prompt variants across three LLM families. We find that model errors frequently reflect linguistic steering, that is, systematic shifts toward the comparative term present in the prompt. Chain-of-thought prompting reduces these biases, but its effectiveness varies: free-form reasoning is more robust, while structured formats may preserve or reintroduce directional drift. Finally, we show that including demographic identity terms (e.g., "a woman", "a Black person") in input scenarios amplifies directional drift, despite identical underlying quantities, highlighting the interplay between semantic framing and social referents. These findings expose critical blind spots in standard evaluation and motivate framing-aware benchmarks for diagnosing reasoning robustness and fairness in LLMs.
摘要：已知大型语言模型（LLM）对输入措辞很敏感，但语义线索影响推理的机制仍然缺乏了解。我们在具有客观标准答案的比较类数学问题中研究了这一现象，揭示了一种一致且有方向性的框架偏差：逻辑上等价的问题如果包含“更多”、“更少”或“相等”等词，会系统性地将预测引向该框架词所指的方向。为了研究这种效应，我们引入了 MathComp，这是一个包含 300 个比较场景的受控基准，每个场景在三个 LLM 家族上以 14 种提示变体进行评估。我们发现，模型错误经常反映出语言引导，即系统性地偏向提示中出现的比较词。思维链提示可以减少这些偏差，但其效果各不相同：自由形式的推理更稳健，而结构化格式可能保留或重新引入方向性漂移。最后，我们表明，在输入场景中加入人口身份词（例如“一位女性”、“一位黑人”）会放大方向性漂移，尽管底层数量完全相同，这凸显了语义框架与社会指称之间的相互作用。这些发现暴露了标准评估中的关键盲点，并推动建立具备框架意识的基准，用于诊断 LLM 推理的稳健性和公平性。

Title: TableEval: A Real-World Benchmark for Complex, Multilingual, and Multi-Structured Table Question Answering

Authors: Junnan Zhu, Jingyi Wang, Bohan Yu, Xiaoyu Wu, Junbo Li, Lei Wang, Nan Xu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.03949
Pdf URL: https://arxiv.org/pdf/2506.03949
Copy Paste: [[2506.03949]] TableEval: A Real-World Benchmark for Complex, Multilingual, and Multi-Structured Table Question Answering(https://arxiv.org/abs/2506.03949)
Keywords: llm
Abstract: LLMs have shown impressive progress in natural language processing. However, they still face significant challenges in TableQA, where real-world complexities such as diverse table structures, multilingual data, and domain-specific reasoning are crucial. Existing TableQA benchmarks are often limited by their focus on simple flat tables and suffer from data leakage. Furthermore, most benchmarks are monolingual and fail to capture the cross-lingual and cross-domain variability in practical applications. To address these limitations, we introduce TableEval, a new benchmark designed to evaluate LLMs on realistic TableQA tasks. Specifically, TableEval includes tables with various structures (such as concise, hierarchical, and nested tables) collected from four domains (including government, finance, academia, and industry reports). Besides, TableEval features cross-lingual scenarios with tables in Simplified Chinese, Traditional Chinese, and English. To minimize the risk of data leakage, we collect all data from recent real-world documents. Considering that existing TableQA metrics fail to capture semantic accuracy, we further propose SEAT, a new evaluation framework that assesses the alignment between model responses and reference answers at the sub-question level. Experimental results have shown that SEAT achieves high agreement with human judgment. Extensive experiments on TableEval reveal critical gaps in the ability of state-of-the-art LLMs to handle these complex, real-world TableQA tasks, offering insights for future improvements. We make our dataset available here: this https URL.
摘要：LLM在自然语言处理方面表现出了令人印象深刻的进步。但是，它们在TableQA中仍然面临重大挑战，在TableQA中，现实世界中的复杂性，例如各种表结构，多语言数据和特定于领域的推理至关重要。现有的TableQA基准通常受到关注简单平坦桌子的关注并遭受数据泄漏的限制。此外，大多数基准是单语的，并且无法捕获实际应用中的跨语性和跨域变异性。为了解决这些限制，我们介绍了TableVal，这是一种新的基准测试，旨在评估现实的TableQA任务上的LLMS。具体而言，TableVal包括从四个领域（包括政府，金融，学术界和行业报告）收集的各种结构（例如简洁，等级和嵌套表）的表。此外，TableeVal具有跨语言的场景，并在简化的中文，中文和英语中具有桌子。为了最大程度地减少数据泄漏的风险，我们从最近的现实世界文档中收集了所有数据。考虑到现有的TableQA指标无法捕获语义精度，我们进一步提出了席位，这是一个新的评估框架，可以评估模型响应和参考答案在子问题级别上的一致性。实验结果表明，座位与人类判断力达成很高的一致性。对TableEval的广泛实验揭示了最先进的LLM处理这些复杂的，现实世界的TableQA任务的能力的关键差距，从而提供了以后改进的见解。我们在这里提供数据集：此HTTPS URL。

Title: From Real to Synthetic: Synthesizing Millions of Diversified and Complicated User Instructions with Attributed Grounding

Authors: Chiwei Zhu, Benfeng Xu, Xiaorui Wang, Zhendong Mao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.03968
Pdf URL: https://arxiv.org/pdf/2506.03968
Copy Paste: [[2506.03968]] From Real to Synthetic: Synthesizing Millions of Diversified and Complicated User Instructions with Attributed Grounding(https://arxiv.org/abs/2506.03968)
Keywords: language model, llm
Abstract: The pursuit of diverse, complex, and large-scale instruction data is crucial for automatically aligning large language models (LLMs). While there are methods capable of generating synthetic instructions at scale, they either suffer from limited grounding sources, leading to a narrow distribution, or rely on trivial extensions that fail to produce meaningful trajectories in terms of complexity. In contrast, instructions that benefit efficient alignment are typically crafted with cognitive insights and grounded in real-world use cases. In this paper, we synthesize such instructions using attributed grounding, which involves 1) a top-down attribution process that grounds a selective set of real instructions to situated users, and 2) a bottom-up synthesis process that leverages web documents to first generate a situation, then a meaningful instruction. This framework allows us to harvest diverse and complex instructions at scale, utilizing the vast range of web documents. Specifically, we construct a dataset of 1 million instructions, called SynthQuestions, and demonstrate that models trained on it achieve leading performance on several common benchmarks, with improvements that continually scale with more web corpora. Data, models and codes will be available at this https URL.
摘要：追求多样化，复杂和大规模的指导数据对于自动使大语言模型（LLMS）保持一致至关重要。尽管有一些能够大规模生成合成指令的方法，但它们要么遭受有限的接地来源，导致分布狭窄，要么依赖于无法在复杂性方面产生有意义的轨迹的琐碎扩展。相比之下，有利于有效一致性的说明通常是由认知见解制成的，并基于现实世界中的用例。在本文中，我们使用属性接地综合了此类说明，其中涉及1）自上而下的归因过程，该过程将选择性的真实指令集接地到定位用户，以及2）一个自下而上的合成过程，该过程利用Web文档首先生成情况，然后是有意义的指令。该框架使我们能够利用广泛的Web文档来大规模收集多样的复杂说明。具体来说，我们构建了一个名为SynthQuestions的100万个说明的数据集，并证明了对其进行培训的模型在几个常见的基准测试中实现了领先的性能，并通过不断扩展的Web Corpora进行了改进。数据，模型和代码将在此HTTPS URL上可用。

Title: Structured Pruning for Diverse Best-of-N Reasoning Optimization

Authors: Hieu Trung Nguyen, Bao Nguyen, Viet Anh Nguyen
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.03978
Pdf URL: https://arxiv.org/pdf/2506.03978
Copy Paste: [[2506.03978]] Structured Pruning for Diverse Best-of-N Reasoning Optimization(https://arxiv.org/abs/2506.03978)
Keywords: language model
Abstract: Model pruning in transformer-based language models, traditionally viewed as a means of achieving computational savings, can enhance the model's reasoning capabilities. In this work, we uncover a surprising phenomenon: the selective pruning of certain attention heads leads to improvements in reasoning performance, particularly on challenging tasks. Motivated by this observation, we propose SPRINT, a novel contrastive learning framework that dynamically selects the optimal head and layer to prune during inference. By aligning question embeddings with head embeddings, SPRINT identifies those pruned-head configurations that result in more accurate reasoning. Extensive experiments demonstrate that our method significantly outperforms traditional best-of-$N$ and random head selection strategies on the MATH500 and GSM8K datasets.
摘要：传统上将基于变压器的语言模型修剪为实现计算节省的手段，可以增强模型的推理能力。在这项工作中，我们发现了一个令人惊讶的现象：某些关注头的选择性修剪会导致推理性能的改善，尤其是在具有挑战性的任务上。在这一观察过程中，我们提出了Sprint，这是一个新颖的对比学习框架，该框架在推断过程中动态选择最佳的头部和层以修剪。通过将问题嵌入与头部嵌入的嵌入，Sprint可以确定那些简化的头构型，从而导致更准确的推理。广泛的实验表明，我们的方法在MATH500和GSM8K数据集上大大优于传统的最佳$ N $和随机的头部选择策略。

Title: Around the World in 24 Hours: Probing LLM Knowledge of Time and Place

Authors: Carolin Holtermann, Paul Röttger, Anne Lauscher
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.03984
Pdf URL: https://arxiv.org/pdf/2506.03984
Copy Paste: [[2506.03984]] Around the World in 24 Hours: Probing LLM Knowledge of Time and Place(https://arxiv.org/abs/2506.03984)
Keywords: language model, llm, prompt, chat, chain-of-thought
Abstract: Reasoning over time and space is essential for understanding our world. However, the abilities of language models in this area are largely unexplored as previous work has tested their abilities for logical reasoning in terms of time and space in isolation or only in simple or artificial environments. In this paper, we present the first evaluation of the ability of language models to jointly reason over time and space. To enable our analysis, we create GeoTemp, a dataset of 320k prompts covering 289 cities in 217 countries and 37 time zones. Using GeoTemp, we evaluate eight open chat models of three different model families for different combinations of temporal and geographic knowledge. We find that most models perform well on reasoning tasks involving only temporal knowledge and that overall performance improves with scale. However, performance remains constrained in tasks that require connecting temporal and geographical information. We do not find clear correlations of performance with specific geographic regions. Instead, we find a significant performance increase for location names with low model perplexity, suggesting their repeated occurrence during model training. We further demonstrate that their performance is heavily influenced by prompt formulation - a direct injection of geographical knowledge leads to performance gains, whereas, surprisingly, techniques like chain-of-thought prompting decrease performance on simpler tasks.
摘要：随着时间和空间的推理对于理解我们的世界至关重要。但是，由于以前的工作已经在孤立的时间和空间方面或仅在简单或人造的环境中测试了它们的逻辑推理能力，因此在这一领域的语言模型的能力在很大程度上没有探索。在本文中，我们介绍了语言模型在时间和空间中共同推理的能力的首次评估。为了实现我们的分析，我们创建了Geotemp，这是一个涵盖217个国家和37个时区的320k提示的数据集。使用Geotemp，我们评估了三种不同模型家族的八种开放式聊天模型，用于时间和地理知识的不同组合。我们发现，大多数模型在仅涉及时间知识的推理任务上表现良好，并且总体绩效随规模而改善。但是，绩效仍在需要连接时间和地理信息的任务中受到限制。我们没有发现与特定地理区域的性能明显相关。取而代之的是，我们发现对模型较低的位置名称的性能大幅提高，这表明它们在模型训练中反复发生。我们进一步证明，它们的性能受到迅速配方的严重影响 - 直接注入地理知识会导致性能提高，而令人惊讶的是，诸如经过想象的链条促使促使更简单任务的性能下降。

Title: Stronger Baselines for Retrieval-Augmented Generation with Long-Context Language Models

Authors: Alex Laitenberger, Christopher D. Manning, Nelson F. Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.03989
Pdf URL: https://arxiv.org/pdf/2506.03989
Copy Paste: [[2506.03989]] Stronger Baselines for Retrieval-Augmented Generation with Long-Context Language Models(https://arxiv.org/abs/2506.03989)
Keywords: language model, retrieval-augmented generation, agent
Abstract: With the rise of long-context language models (LMs) capable of processing tens of thousands of tokens in a single pass, do multi-stage retrieval-augmented generation (RAG) pipelines still offer measurable benefits over simpler, single-stage approaches? To assess this question, we conduct a controlled evaluation for QA tasks under systematically scaled token budgets, comparing two recent multi-stage pipelines, ReadAgent and RAPTOR, against three baselines, including DOS RAG (Document's Original Structure RAG), a simple retrieve-then-read method that preserves original passage order. Despite its straightforward design, DOS RAG consistently matches or outperforms more intricate methods on multiple long-context QA benchmarks. We recommend establishing DOS RAG as a simple yet strong baseline for future RAG evaluations, pairing it with emerging embedding and language models to assess trade-offs between complexity and effectiveness as model capabilities evolve.
摘要：随着能够在单个通行证中处理数万个令牌的长篇文章模型（LMS）的兴起，多阶段检索生成一代（RAG）管道是否仍然为简单，单阶段的方法提供可衡量的好处吗？为了评估这个问题，我们对系统缩放的令牌预算进行了对质量检查任务的受控评估，比较了两个最近的多阶段管道，Readagent和Raptor，与三个基线，包括Dos Rag（文档的原始结构RAG），一种简单的回传方法，是一种保留原始通道订单。尽管设计直接设计，但DOS RAG仍始终在多个长篇下说QA基准上匹配或优于更复杂的方法。我们建议将DOS抹布建立为未来破布评估的简单但强大的基准，将其与新兴的嵌入和语言模型配对，以评估随着模型能力的发展，复杂性和有效性之间的权衡。
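The retrieve-then-read idea behind DOS RAG, as summarized in this abstract, is to select relevant passages under a token budget and then present them in their original document order. The sketch below illustrates only that ordering step. The word-overlap scorer, the word-count budget, and the names `overlap_score` and `dos_rag_context` are hypothetical stand-ins chosen for illustration; the paper's actual retriever and tokenization are not described here.

```python
from typing import List


def overlap_score(query: str, passage: str) -> int:
    """Toy relevance score: number of distinct query words in the passage."""
    return len(set(query.lower().split()) & set(passage.lower().split()))


def dos_rag_context(query: str, passages: List[str],
                    budget_words: int) -> List[str]:
    """Pick the highest-scoring passages that fit the budget, then return
    them in their original document order (not in relevance order)."""
    # Rank passage indices by relevance; sorted() is stable, so ties keep
    # their original relative order.
    ranked = sorted(range(len(passages)),
                    key=lambda i: overlap_score(query, passages[i]),
                    reverse=True)
    chosen, used = [], 0
    for i in ranked:
        n_words = len(passages[i].split())
        if used + n_words <= budget_words:
            chosen.append(i)
            used += n_words
    # Restoring document order is the step that distinguishes this sketch
    # from a plain top-k retriever.
    return [passages[i] for i in sorted(chosen)]


docs = ["the cat sat", "dogs bark loudly", "a cat and a dog", "weather is fine"]
context = dos_rag_context("cat dog", docs, budget_words=8)
print(context)
```

The selected passages would then be concatenated and passed to the reader model together with the question.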

Title: DynTok: Dynamic Compression of Visual Tokens for Efficient and Effective Video Understanding

Authors: Hongzhi Zhang, Jingyuan Zhang, Xingguang Ji, Qi Wang, Fuzheng Zhang
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2506.03990
Pdf URL: https://arxiv.org/pdf/2506.03990
Copy Paste: [[2506.03990]] DynTok: Dynamic Compression of Visual Tokens for Efficient and Effective Video Understanding(https://arxiv.org/abs/2506.03990)
Keywords: llm
Abstract: Typical video modeling methods, such as LLaVA, represent videos as sequences of visual tokens, which are then processed by the LLM backbone for effective video understanding. However, this approach leads to a massive number of visual tokens, especially for long videos. A practical solution is to first extract relevant visual information from the large visual context before feeding it into the LLM backbone, thereby reducing computational overhead. In this work, we introduce DynTok, a novel Dynamic video Token compression strategy. DynTok adaptively splits visual tokens into groups and merges them within each group, achieving high compression in regions with low information density while preserving essential content. Our method reduces the number of tokens to 44.4% of the original size while maintaining comparable performance. It further benefits from increasing the number of video frames and achieves 65.3% on Video-MME and 72.5% on MLVU. By applying this simple yet effective compression method, we expose the redundancy in video token representations and offer insights for designing more efficient video modeling techniques.
摘要：典型的视频建模方法（例如LLAVA）表示视频为视觉令牌的序列，然后由LLM骨架处理以有效的视频理解。但是，这种方法会导致大量的视觉令牌，尤其是对于长视频。一个实用的解决方案是将相关的视觉信息从大型视觉上下文中提取，然后将其馈入LLM主链，从而减少计算开销。在这项工作中，我们介绍了Dyntok，这是一种新颖的\ textbf {dyn} amic Video \ textbf {tok} en Compression策略。 Dyntok将视觉令牌自适应地分成组并在每个组中合并，在信息密度低的区域中达到高压，同时保留基本内容。我们的方法将令牌数量减少到原始大小的44.4％，同时保持可比的性能。它从增加视频帧的数量并在视频中获得65.3％的人数和MLVU的72.5％，从而进一步受益。通过应用这种简单而有效的压缩方法，我们揭示了视频令牌表示中的冗余，并提供了设计更有效的视频建模技术的见解。
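The group-then-merge compression described in this abstract can be illustrated with a small sketch. The abstract does not specify how groups are formed or how tokens are merged, so the rule used here (start a new group when cosine similarity to the group's first token drops below a threshold, then average each group) is an assumption, as are the names `cosine` and `compress_tokens` and the default threshold.

```python
import math
from typing import List

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def compress_tokens(tokens: List[Vector], threshold: float = 0.9) -> List[Vector]:
    """Group consecutive similar tokens and replace each group by its mean.

    Runs of near-duplicate tokens (low information density) collapse into
    one token, while dissimilar tokens stay separate.
    """
    groups: List[List[Vector]] = []
    for tok in tokens:
        # Compare against the first token of the current group (an assumed rule).
        if groups and cosine(groups[-1][0], tok) >= threshold:
            groups[-1].append(tok)
        else:
            groups.append([tok])
    # Merge each group by element-wise averaging.
    return [[sum(col) / len(g) for col in zip(*g)] for g in groups]


frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
merged = compress_tokens(frames)
print(len(frames), "->", len(merged))
```

Under this rule the compression ratio adapts to the content: a static scene yields long groups and few output tokens, while rapidly changing content keeps most tokens.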

Title: Seeing What Tastes Good: Revisiting Multimodal Distributional Semantics in the Billion Parameter Era

Authors: Dan Oneata, Desmond Elliott, Stella Frank
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2506.03994
Pdf URL: https://arxiv.org/pdf/2506.03994
Copy Paste: [[2506.03994]] Seeing What Tastes Good: Revisiting Multimodal Distributional Semantics in the Billion Parameter Era(https://arxiv.org/abs/2506.03994)
Keywords: language model
Abstract: Human learning and conceptual representation is grounded in sensorimotor experience, in contrast to state-of-the-art foundation models. In this paper, we investigate how well such large-scale models, trained on vast quantities of data, represent the semantic feature norms of concrete object concepts, e.g. a ROSE is red, smells sweet, and is a flower. More specifically, we use probing tasks to test which properties of objects these models are aware of. We evaluate image encoders trained on image data alone, as well as multimodally-trained image encoders and language-only models, on predicting an extended denser version of the classic McRae norms and the newer Binder dataset of attribute ratings. We find that multimodal image encoders slightly outperform language-only approaches, and that image-only encoders perform comparably to the language models, even on non-visual attributes that are classified as "encyclopedic" or "function". These results offer new insights into what can be learned from pure unimodal learning, and the complementarity of the modalities.

Title: QQSUM: A Novel Task and Model of Quantitative Query-Focused Summarization for Review-based Product Question Answering

Authors: An Quang Tang, Xiuzhen Zhang, Minh Ngoc Dinh, Zhuang Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.04020
Pdf URL: https://arxiv.org/pdf/2506.04020
Copy Paste: [[2506.04020]] QQSUM: A Novel Task and Model of Quantitative Query-Focused Summarization for Review-based Product Question Answering(https://arxiv.org/abs/2506.04020)
Keywords: retrieval-augmented generation
Abstract: Review-based Product Question Answering (PQA) allows e-commerce platforms to automatically address customer queries by leveraging insights from user reviews. However, existing PQA systems generate answers with only a single perspective, failing to capture the diversity of customer opinions. In this paper we introduce a novel task Quantitative Query-Focused Summarization (QQSUM), which aims to summarize diverse customer opinions into representative Key Points (KPs) and quantify their prevalence to effectively answer user queries. While Retrieval-Augmented Generation (RAG) shows promise for PQA, its generated answers still fall short of capturing the full diversity of viewpoints. To tackle this challenge, our model QQSUM-RAG, which extends RAG, employs few-shot learning to jointly train a KP-oriented retriever and a KP summary generator, enabling KP-based summaries that capture diverse and representative opinions. Experimental results demonstrate that QQSUM-RAG achieves superior performance compared to state-of-the-art RAG baselines in both textual quality and quantification accuracy of opinions. Our source code is available at: this https URL

Title: AI Agents for Conversational Patient Triage: Preliminary Simulation-Based Evaluation with Real-World EHR Data

Authors: Sina Rashidian, Nan Li, Jonathan Amar, Jong Ha Lee, Sam Pugh, Eric Yang, Geoff Masterson, Myoung Cha, Yugang Jia, Akhil Vaid
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.04032
Pdf URL: https://arxiv.org/pdf/2506.04032
Copy Paste: [[2506.04032]] AI Agents for Conversational Patient Triage: Preliminary Simulation-Based Evaluation with Real-World EHR Data(https://arxiv.org/abs/2506.04032)
Keywords: llm, agent
Abstract: Background: We present a Patient Simulator that leverages real-world patient encounters, which cover a broad range of conditions and symptoms, to provide synthetic test subjects for development and testing of healthcare agentic models. The simulator provides a realistic approach to patient presentation and multi-turn conversation with a symptom-checking agent. Objectives: (1) To construct and instantiate a Patient Simulator to train and test an AI health agent, based on patient vignettes derived from real EHR data. (2) To test the validity and alignment of the simulated encounters provided by the Patient Simulator to expert human clinical providers. (3) To illustrate the evaluation framework of such an LLM system on the generated realistic, data-driven simulations -- yielding a preliminary assessment of our proposed system. Methods: We first constructed realistic clinical scenarios by deriving patient vignettes from real-world EHR encounters. These vignettes cover a variety of presenting symptoms and underlying conditions. We then evaluate the performance of the Patient Simulator as a simulacrum of a real patient encounter across over 500 different patient vignettes. We leveraged a separate AI agent to provide multi-turn questions to obtain a history of present illness. The resulting multi-turn conversations were evaluated by two expert clinicians. Results: Clinicians scored the Patient Simulator as consistent with the patient vignettes in 97.7% of cases. The extracted case summary based on the conversation history was 99% relevant. Conclusions: We developed a methodology to incorporate vignettes derived from real healthcare patient data to build a simulation of patient responses to symptom checking agents. The performance and alignment of this Patient Simulator could be used to train and test a multi-turn conversational AI agent at scale.

Title: LexTime: A Benchmark for Temporal Ordering of Legal Events

Authors: Claire Barale, Leslie Barrett, Vikram Sunil Bajaj, Michael Rovatsos
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.04041
Pdf URL: https://arxiv.org/pdf/2506.04041
Copy Paste: [[2506.04041]] LexTime: A Benchmark for Temporal Ordering of Legal Events(https://arxiv.org/abs/2506.04041)
Keywords: llm
Abstract: Temporal reasoning in legal texts is important for applications like case law analysis and compliance monitoring. However, existing datasets lack expert language evaluation, leaving a gap in understanding how LLMs manage event ordering in legal contexts. We introduce LexTime, the first dataset designed to evaluate LLMs' event ordering capabilities in legal language, consisting of 512 instances from U.S. Federal Complaints with annotated event pairs and their temporal relations. Our findings show that (1) LLMs are more accurate on legal event ordering than on narrative (up to +10.5%); (2) longer input contexts and implicit events boost accuracy, reaching 80.8% for implicit-explicit event pairs; (3) legal linguistic complexities and nested clauses remain a challenge. We investigate how context length, explicit vs implicit event pairs, and legal language features affect model performance, demonstrating the need for specific modeling strategies to enhance temporal event reasoning.

Title: Unveiling and Eliminating the Shortcut Learning for Locate-Then-Edit Knowledge Editing via Both Subject and Relation Awareness

Authors: Xiyu Liu, Zhengxiao Liu, Naibin Gu, Zheng Lin, Ji Xiang, Weiping Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.04042
Pdf URL: https://arxiv.org/pdf/2506.04042
Copy Paste: [[2506.04042]] Unveiling and Eliminating the Shortcut Learning for Locate-Then-Edit Knowledge Editing via Both Subject and Relation Awareness(https://arxiv.org/abs/2506.04042)
Keywords: language model
Abstract: Knowledge editing aims to alter the target knowledge predicted by large language models while ensuring the least side effects on unrelated knowledge. An effective way to achieve knowledge editing is to identify pivotal parameters for predicting factual associations and modify them with an optimization process to update the predictions. However, these locate-then-edit methods are uncontrollable since they tend to modify most unrelated relations connected to the subject of target editing. We reveal that this failure of controllable editing is due to a shortcut learning issue during the optimization process. Specifically, we discover two crucial features that models learn during optimization, the subject feature and the relation feature, but the current optimization process tends to over-learn the subject feature while neglecting the relation feature. To eliminate this shortcut learning of the subject feature, we propose a novel two-stage optimization process that balances the learning of the subject feature and the relation feature. Experimental results demonstrate that our approach successfully prevents knowledge editing from shortcut learning and achieves the optimal overall performance, contributing to controllable knowledge editing.

Title: Think Like a Person Before Responding: A Multi-Faceted Evaluation of Persona-Guided LLMs for Countering Hate

Authors: Mikel K. Ngueajio, Flor Miriam Plaza-del-Arco, Yi-Ling Chung, Danda B. Rawat, Amanda Cercas Curry
Subjects: cs.CL, cs.AI, cs.CY, cs.HC, cs.LG
Abstract URL: https://arxiv.org/abs/2506.04043
Pdf URL: https://arxiv.org/pdf/2506.04043
Copy Paste: [[2506.04043]] Think Like a Person Before Responding: A Multi-Faceted Evaluation of Persona-Guided LLMs for Countering Hate(https://arxiv.org/abs/2506.04043)
Keywords: language model, gpt, llm, prompt
Abstract: Automated counter-narratives (CN) offer a promising strategy for mitigating online hate speech, yet concerns about their affective tone, accessibility, and ethical risks remain. We propose a framework for evaluating Large Language Model (LLM)-generated CNs across four dimensions: persona framing, verbosity and readability, affective tone, and ethical robustness. Using GPT-4o-Mini, Cohere's CommandR-7B, and Meta's LLaMA 3.1-70B, we assess three prompting strategies on the MT-Conan and HatEval datasets. Our findings reveal that LLM-generated CNs are often verbose and adapted for people with college-level literacy, limiting their accessibility. While emotionally guided prompts yield more empathetic and readable responses, there remain concerns surrounding safety and effectiveness.

Title: Lacuna Inc. at SemEval-2025 Task 4: LoRA-Enhanced Influence-Based Unlearning for LLMs

Authors: Aleksey Kudelya, Alexander Shirnin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.04044
Pdf URL: https://arxiv.org/pdf/2506.04044
Copy Paste: [[2506.04044]] Lacuna Inc. at SemEval-2025 Task 4: LoRA-Enhanced Influence-Based Unlearning for LLMs(https://arxiv.org/abs/2506.04044)
Keywords: language model, llm
Abstract: This paper describes LIBU (LoRA-enhanced influence-based unlearning), an algorithm for the task of unlearning: removing specific knowledge from a large language model without retraining from scratch or compromising its overall utility (SemEval-2025 Task 4: Unlearning sensitive content from Large Language Models). The algorithm combines classical influence functions, to remove the influence of the data from the model, with second-order optimization, to stabilize the overall utility. Our experiments show that this lightweight approach is well suited to unlearning in LLMs across different kinds of tasks.

Title: On Support Samples of Next Word Prediction

Authors: Yuqian Li, Yupei Du, Yufang Liu, Feifei Feng, Mou Xiao Feng, Yuanbin Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.04047
Pdf URL: https://arxiv.org/pdf/2506.04047
Copy Paste: [[2506.04047]] On Support Samples of Next Word Prediction(https://arxiv.org/abs/2506.04047)
Keywords: language model
Abstract: Language models excel in various tasks by making complex decisions, yet understanding the rationale behind these decisions remains a challenge. This paper investigates data-centric interpretability in language models, focusing on the next-word prediction task. Using the representer theorem, we identify two types of support samples: those that either promote or deter specific predictions. Our findings reveal that being a support sample is an intrinsic property, predictable even before training begins. Additionally, while non-support samples are less influential in direct predictions, they play a critical role in preventing overfitting and shaping generalization and representation learning. Notably, the importance of non-support samples increases in deeper layers, suggesting their significant role in intermediate representation this http URL insights shed light on the interplay between data and model decisions, offering a new dimension to understanding language model behavior and interpretability.

Title: Explainability-Based Token Replacement on LLM-Generated Text

Authors: Hadi Mohammadi, Anastasia Giachanou, Daniel L. Oberski, Ayoub Bagheri
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.04050
Pdf URL: https://arxiv.org/pdf/2506.04050
Copy Paste: [[2506.04050]] Explainability-Based Token Replacement on LLM-Generated Text(https://arxiv.org/abs/2506.04050)
Keywords: language model, llm
Abstract: Generative models, especially large language models (LLMs), have shown remarkable progress in producing text that appears human-like. However, they often exhibit patterns that make their output easier to detect than text written by humans. In this paper, we investigate how explainable AI (XAI) methods can be used to reduce the detectability of AI-generated text (AIGT) while also introducing a robust ensemble-based detection approach. We begin by training an ensemble classifier to distinguish AIGT from human-written text, then apply SHAP and LIME to identify tokens that most strongly influence its predictions. We propose four explainability-based token replacement strategies to modify these influential tokens. Our findings show that these token replacement approaches can significantly diminish a single classifier's ability to detect AIGT. However, our ensemble classifier maintains strong performance across multiple languages and domains, showing that a multi-model approach can mitigate the impact of token-level manipulations. These results show that XAI methods can make AIGT harder to detect by focusing on the most influential tokens. At the same time, they highlight the need for robust, ensemble-based detection strategies that can adapt to evolving approaches for hiding AIGT.

Title: High Accuracy, Less Talk (HALT): Reliable LLMs through Capability-Aligned Finetuning

Authors: Tim Franzmeyer, Archie Sravankumar, Lijuan Liu, Yuning Mao, Rui Hou, Sinong Wang, Jakob N. Foerster, Luke Zettlemoyer, Madian Khabsa
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.04051
Pdf URL: https://arxiv.org/pdf/2506.04051
Copy Paste: [[2506.04051]] High Accuracy, Less Talk (HALT): Reliable LLMs through Capability-Aligned Finetuning(https://arxiv.org/abs/2506.04051)
Keywords: language model, llm, hallucination, prompt
Abstract: Large Language Models (LLMs) currently respond to every prompt. However, they can produce incorrect answers when they lack knowledge or capability -- a problem known as hallucination. We instead propose post-training an LLM to generate content only when confident in its correctness and to otherwise (partially) abstain. Specifically, our method, HALT, produces capability-aligned post-training data that encodes what the model can and cannot reliably generate. We generate this data by splitting responses of the pretrained LLM into factual fragments (atomic statements or reasoning steps), and use ground truth information to identify incorrect fragments. We achieve capability-aligned finetuning responses by either removing incorrect fragments or replacing them with "Unsure from Here" -- according to a tunable threshold that allows practitioners to trade off response completeness and mean correctness of the response's fragments. We finetune four open-source models for biography writing, mathematics, coding, and medicine with HALT for three different trade-off thresholds. HALT effectively trades off response completeness for correctness, increasing the mean correctness of response fragments by 15% on average, while resulting in a 4% improvement in the F1 score (mean of completeness and correctness of the response) compared to the relevant baselines. By tuning HALT for highest correctness, we train a single reliable Llama3-70B model with correctness increased from 51% to 87% across all four domains while maintaining 53% of the response completeness achieved with standard finetuning.
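The two ways of building capability-aligned responses described above (removing incorrect fragments, or replacing them with "Unsure from Here") can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' pipeline: fragments are assumed to arrive already labeled as correct or incorrect, the marker variant is assumed to cut the response at the first incorrect fragment, completeness is assumed to be the share of original fragments kept, and the F1 score is assumed to be the harmonic mean of completeness and correctness (the abstract only says "mean"). All function names are hypothetical.

```python
UNSURE = "Unsure from Here"


def remove_incorrect(fragments):
    """Keep only the fragments labeled correct. fragments: list of (text, is_correct)."""
    return [text for text, is_correct in fragments if is_correct]


def truncate_at_first_error(fragments):
    """Keep fragments up to the first incorrect one, then append the marker.

    Assumption: everything after the first error is dropped.
    """
    kept = []
    for text, is_correct in fragments:
        if not is_correct:
            kept.append(UNSURE)
            break
        kept.append(text)
    return kept


def f1(completeness, correctness):
    """Harmonic mean of completeness and correctness (an assumed definition)."""
    total = completeness + correctness
    return 2 * completeness * correctness / total if total else 0.0


fragments = [("Born in 1950.", True), ("Won a Nobel Prize.", False), ("Studied physics.", True)]
kept = remove_incorrect(fragments)
completeness = len(kept) / len(fragments)  # 2 of 3 fragments survive
correctness = 1.0                          # every surviving fragment is correct
print(kept, truncate_at_first_error(fragments), round(f1(completeness, correctness), 3))
```

The example makes the trade-off visible: removing the incorrect fragment raises correctness to 1.0 at the cost of completeness dropping to 2/3.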

Title: Progressive Mastery: Customized Curriculum Learning with Guided Prompting for Mathematical Reasoning

Authors: Muling Wu, Qi Qian, Wenhao Liu, Xiaohua Wang, Zisu Huang, Di Liang, LI Miao, Shihan Dou, Changze Lv, Zhenghua Wang, Zhibo Xu, Lina Chen, Tianlong Li, Xiaoqing Zheng, Xuanjing Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.04065
Pdf URL: https://arxiv.org/pdf/2506.04065
Copy Paste: [[2506.04065]] Progressive Mastery: Customized Curriculum Learning with Guided Prompting for Mathematical Reasoning(https://arxiv.org/abs/2506.04065)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) have achieved remarkable performance across various reasoning tasks, yet post-training is constrained by inefficient sample utilization and inflexible handling of difficult samples. To address these limitations, we propose Customized Curriculum Learning (CCL), a novel framework with two key innovations. First, we introduce a model-adaptive difficulty definition that customizes curriculum datasets based on each model's individual capabilities rather than using predefined difficulty metrics. Second, we develop "Guided Prompting," which dynamically reduces sample difficulty through strategic hints, enabling effective utilization of challenging samples that would otherwise degrade performance. Comprehensive experiments on supervised fine-tuning and reinforcement learning demonstrate that CCL significantly outperforms uniform training approaches across five mathematical reasoning benchmarks, confirming its effectiveness across both paradigms in enhancing sample utilization and model performance.

Title: LaF-GRPO: In-Situ Navigation Instruction Generation for the Visually Impaired via GRPO with LLM-as-Follower Reward

Authors: Yi Zhao, Siqi Wang, Jing Li
Subjects: cs.CL, cs.MM
Abstract URL: https://arxiv.org/abs/2506.04070
Pdf URL: https://arxiv.org/pdf/2506.04070
Copy Paste: [[2506.04070]] LaF-GRPO: In-Situ Navigation Instruction Generation for the Visually Impaired via GRPO with LLM-as-Follower Reward(https://arxiv.org/abs/2506.04070)
Keywords: language model, gpt, llm
Abstract: Navigation instruction generation for visually impaired (VI) individuals (NIG-VI) is critical yet relatively underexplored. This study, hence, focuses on producing precise, in-situ, step-by-step navigation instructions that are practically usable by VI users. Concretely, we propose LaF-GRPO (LLM-as-Follower GRPO), where an LLM simulates VI user responses to generate rewards guiding the Vision-Language Model (VLM) post-training. This enhances instruction usability while reducing costly real-world data needs. To facilitate training and testing, we introduce NIG4VI, a 27k-sample open-sourced benchmark. It provides diverse navigation scenarios with accurate spatial coordinates, supporting detailed, open-ended in-situ instruction generation. Experiments on NIG4VI show the effectiveness of LaF-GRPO by quantitative metrics (e.g., Zero-(LaF-GRPO) boosts BLEU +14%; SFT+(LaF-GRPO) METEOR 0.542 vs. GPT-4o's 0.323) and yields more intuitive, safer instructions. Code and benchmark are available at this https URL.

Title: Controlling Difficulty of Generated Text for AI-Assisted Language Learning

Authors: Meiqing Jin, Liam Dugan, Chris Callison-Burch
Subjects: cs.CL, cs.HC
Abstract URL: https://arxiv.org/abs/2506.04072
Pdf URL: https://arxiv.org/pdf/2506.04072
Copy Paste: [[2506.04072]] Controlling Difficulty of Generated Text for AI-Assisted Language Learning(https://arxiv.org/abs/2506.04072)
Keywords: language model, llm, prompt
Abstract: Practicing conversations with large language models (LLMs) presents a promising alternative to traditional in-person language learning. However, most LLMs generate text at a near-native level of complexity, making them ill-suited for beginner learners (CEFR: A1-A2). In this paper, we investigate whether controllable generation techniques -- specifically modular methods that do not require model fine-tuning -- can adapt LLM outputs to better support absolute beginners. We evaluate these methods through both automatic metrics and a user study with university-level learners of Japanese. Our findings show that while prompting alone fails to control output difficulty, the use of future discriminators (Yang and Klein, 2021) significantly improves output comprehensibility (from 40.4% to 84.3%). We further introduce a novel token-level evaluation metric, Token Miss Rate (TMR), that quantifies the proportion of incomprehensible tokens per utterance and correlates strongly with human judgments. To support future research in AI-assisted language learning, we release our code, models, annotation tools, and dataset.
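As a rough illustration of the Token Miss Rate idea (the proportion of incomprehensible tokens per utterance), the sketch below treats a token as incomprehensible when it is absent from a learner's known vocabulary. That membership test, the pre-tokenized input, and the name `token_miss_rate` are assumptions made for this example; the abstract does not specify how comprehensibility is determined.

```python
def token_miss_rate(tokens, known_vocab):
    """Fraction of tokens in one utterance that the learner is assumed not to know.

    tokens: list of already-tokenized strings for a single utterance.
    known_vocab: set of tokens treated as comprehensible (an assumed proxy).
    """
    if not tokens:
        return 0.0
    missed = sum(1 for token in tokens if token not in known_vocab)
    return missed / len(tokens)


utterance = ["私", "は", "学生", "です"]
beginner_vocab = {"私", "は", "です"}
print(token_miss_rate(utterance, beginner_vocab))  # one of four tokens is unknown
```

A lower value means more of the utterance falls within the learner's vocabulary.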

Title: A Novel Data Augmentation Approach for Automatic Speaking Assessment on Opinion Expressions

Authors: Chung-Chun Wang, Jhen-Ke Lin, Hao-Chien Lu, Hong-Yun Lin, Berlin Chen
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2506.04077
Pdf URL: https://arxiv.org/pdf/2506.04077
Copy Paste: [[2506.04077]] A Novel Data Augmentation Approach for Automatic Speaking Assessment on Opinion Expressions(https://arxiv.org/abs/2506.04077)
Keywords: language model, llm, prompt
Abstract: Automated speaking assessment (ASA) on opinion expressions is often hampered by the scarcity of labeled recordings, which restricts prompt diversity and undermines scoring reliability. To address this challenge, we propose a novel training paradigm that leverages a large language model (LLM) to generate diverse responses of a given proficiency level, converts responses into synthesized speech via speaker-aware text-to-speech synthesis, and employs a dynamic importance loss to adaptively reweight training instances based on feature distribution differences between synthesized and real speech. Subsequently, a multimodal large language model integrates aligned textual features with speech signals to predict proficiency scores directly. Experiments conducted on the LTTC dataset show that our approach outperforms methods relying on real data or conventional augmentation, effectively mitigating low-resource constraints and enabling ASA on opinion expressions with cross-modal information.

Title: LLMEval-Med: A Real-world Clinical Benchmark for Medical LLMs with Physician Validation

Authors: Ming Zhang, Yujiong Shen, Zelin Li, Huayu Sha, Binze Hu, Yuhui Wang, Chenhao Huang, Shichun Liu, Jingqi Tong, Changhao Jiang, Mingxu Chai, Zhiheng Xi, Shihan Dou, Tao Gui, Qi Zhang, Xuanjing Huang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.04078
Pdf URL: https://arxiv.org/pdf/2506.04078
Copy Paste: [[2506.04078]] LLMEval-Med: A Real-world Clinical Benchmark for Medical LLMs with Physician Validation(https://arxiv.org/abs/2506.04078)
Keywords: language model, llm, prompt
Abstract: Evaluating large language models (LLMs) in medicine is crucial because medical applications require high accuracy with little room for error. Current medical benchmarks have three main types: medical exam-based, comprehensive medical, and specialized assessments. However, these benchmarks have limitations in question design (mostly multiple-choice), data sources (often not derived from real clinical scenarios), and evaluation methods (poor assessment of complex reasoning). To address these issues, we present LLMEval-Med, a new benchmark covering five core medical areas, including 2,996 questions created from real-world electronic health records and expert-designed clinical scenarios. We also design an automated evaluation pipeline, incorporating expert-developed checklists into our LLM-as-Judge framework. Furthermore, our methodology validates machine scoring through human-machine agreement analysis, dynamically refining checklists and prompts based on expert feedback to ensure reliability. We evaluate 13 LLMs across three categories (specialized medical models, open-source models, and closed-source models) on LLMEval-Med, providing valuable insights for the safe and effective deployment of LLMs in medical domains. The dataset is released in this https URL.

Title: EuroLLM-9B: Technical Report

Authors: Pedro Henrique Martins, João Alves, Patrick Fernandes, Nuno M. Guerreiro, Ricardo Rei, Amin Farajian, Mateusz Klimaszewski, Duarte M. Alves, José Pombal, Manuel Faysse, Pierre Colombo, François Yvon, Barry Haddow, José G. C. de Souza, Alexandra Birch, André F. T. Martins
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.04079
Pdf URL: https://arxiv.org/pdf/2506.04079
Copy Paste: [[2506.04079]] EuroLLM-9B: Technical Report(https://arxiv.org/abs/2506.04079)
Keywords: language model, llm
Abstract: This report presents EuroLLM-9B, a large language model trained from scratch to support the needs of European citizens by covering all 24 official European Union languages and 11 additional languages. EuroLLM addresses the issue of European languages being underrepresented and underserved in existing open large language models. We provide a comprehensive overview of EuroLLM-9B's development, including tokenizer design, architectural specifications, data filtering, and training procedures. We describe the pre-training data collection and filtering pipeline, including the creation of EuroFilter, an AI-based multilingual filter, as well as the design of EuroBlocks-Synthetic, a novel synthetic dataset for post-training that enhances language coverage for European languages. Evaluation results demonstrate EuroLLM-9B's competitive performance on multilingual benchmarks and machine translation tasks, establishing it as the leading open European-made LLM of its size. To support open research and adoption, we release all major components of this work, including the base and instruction-tuned models, the EuroFilter classifier, and the synthetic post-training dataset.

Title: TextAtari: 100K Frames Game Playing with Language Agents

Authors: Wenhao Li, Wenwu Li, Chuyun Shen, Junjie Sheng, Zixiao Huang, Di Wu, Yun Hua, Wei Yin, Xiangfeng Wang, Hongyuan Zha, Bo Jin
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.04098
Pdf URL: https://arxiv.org/pdf/2506.04098
Copy Paste: [[2506.04098]] TextAtari: 100K Frames Game Playing with Language Agents(https://arxiv.org/abs/2506.04098)
Keywords: language model, chain-of-thought, agent
Abstract: We present TextAtari, a benchmark for evaluating language agents on very long-horizon decision-making tasks spanning up to 100,000 steps. By translating the visual state representations of classic Atari games into rich textual descriptions, TextAtari creates a challenging test bed that bridges sequential decision-making with natural language processing. The benchmark includes nearly 100 distinct tasks with varying complexity, action spaces, and planning horizons, all rendered as text through an unsupervised representation learning framework (AtariARI). We evaluate three open-source large language models (Qwen2.5-7B, Gemma-7B, and Llama3.1-8B) across three agent frameworks (zero-shot, few-shot chain-of-thought, and reflection reasoning) to assess how different forms of prior knowledge affect performance on these long-horizon challenges. Four scenarios (Basic, Obscured, Manual Augmentation, and Reference-based) investigate the impact of semantic understanding, instruction comprehension, and expert demonstrations on agent decision-making. Our results reveal significant performance gaps between language agents and human players in extensive planning tasks, highlighting challenges in sequential reasoning, state tracking, and strategic planning across tens of thousands of steps. TextAtari provides standardized evaluation protocols, baseline implementations, and a framework for advancing research at the intersection of language models and planning.
摘要：我们提出Textatari，这是一种评估语言代理的基准，该基准的长期决策任务最高可达100,000个步骤。通过将经典Atari游戏的视觉状态表示为丰富的文本描述，Textatari创建了一个具有挑战性的测试床，以自然语言处理将顺序决策桥梁桥接。该基准包括近100个具有不同复杂性，动作空间和规划视野不同的不同任务，这些任务都是通过无监督的表示学习框架（Atariari）呈现为文本的。我们在三个跨三个代理框架（零射击，很少的思想链和反思推理）中评估了三种开源大语模型（QWEN2.5-7B，GEMMA-7B和LLAMA3.1-8B），以评估不同形式的先验知识对这些长期以来的知识的影响如何在这些长期以来的挑战上影响。四种情况基础，模糊，手动增强和基于参考的研究对语义理解，教学理解和专家演示对代理决策的影响。我们的结果揭示了在广泛的计划任务中，语言代理和人类参与者之间的巨大绩效差距，强调了跨越成千上万个步骤的顺序推理，国家跟踪和战略规划的挑战。 Textatari提供了标准化的评估协议，基线实现以及在语言模型与计划的交集中进行研究的框架。

Title: Rectified Sparse Attention

Authors: Yutao Sun, Tianzhu Ye, Li Dong, Yuqing Xia, Jian Chen, Yizhao Gao, Shijie Cao, Jianyong Wang, Furu Wei
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.04108
Pdf URL: https://arxiv.org/pdf/2506.04108
Copy Paste: [[2506.04108]] Rectified Sparse Attention(https://arxiv.org/abs/2506.04108)
Keywords: language model
Abstract: Efficient long-sequence generation is a critical challenge for Large Language Models. While recent sparse decoding methods improve efficiency, they suffer from KV cache misalignment, where approximation errors accumulate and degrade generation quality. In this work, we propose Rectified Sparse Attention (ReSA), a simple yet effective method that combines block-sparse attention with periodic dense rectification. By refreshing the KV cache at fixed intervals using a dense forward pass, ReSA bounds error accumulation and preserves alignment with the pretraining distribution. Experiments across math reasoning, language modeling, and retrieval tasks demonstrate that ReSA achieves near-lossless generation quality with significantly improved efficiency. Notably, ReSA delivers up to 2.42$\times$ end-to-end speedup under decoding at 256K sequence length, making it a practical solution for scalable long-context inference. Code is available at this https URL.
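Summary: Efficient long-sequence generation is a critical challenge for large language models. Although recent sparse decoding methods improve efficiency, they suffer from KV cache misalignment, in which approximation errors accumulate and degrade generation quality. In this work, we propose Rectified Sparse Attention (ReSA), a simple yet effective method that combines block-sparse attention with periodic dense rectification. By refreshing the KV cache at fixed intervals with a dense forward pass, ReSA bounds error accumulation and preserves alignment with the pretraining distribution. Experiments on math reasoning, language modeling, and retrieval tasks show that ReSA achieves near-lossless generation quality with significantly improved efficiency. Notably, ReSA delivers up to a 2.42$\times$ end-to-end speedup when decoding at a sequence length of 256K, making it a practical solution for scalable long-context inference. Code is available at this https URL.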
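The interplay between sparse decoding steps and periodic dense rectification described in this abstract can be illustrated with a toy simulation. This is only a sketch under strong assumptions that do not come from the paper: each sparse step is assumed to add a fixed amount of "drift" to the KV cache, and a dense pass is assumed to reset that drift to zero. The function name `simulate_decoding` and its parameters are hypothetical.

```python
def simulate_decoding(num_tokens, rectify_every, per_step_error=1.0):
    """Toy model of sparse decoding with periodic dense rectification.

    Assumption (not from the paper): every sparse decoding step adds
    `per_step_error` of drift to the KV cache, and a dense forward pass
    resets the drift to zero. `rectify_every=0` disables rectification.
    """
    drift = 0.0
    trace = []          # drift observed at each step, before any reset
    dense_passes = 0
    for step in range(1, num_tokens + 1):
        drift += per_step_error          # sparse step writes an approximate KV entry
        trace.append(drift)
        if rectify_every and step % rectify_every == 0:
            drift = 0.0                  # dense pass refreshes the cache
            dense_passes += 1
    return trace, dense_passes


if __name__ == "__main__":
    rectified, passes = simulate_decoding(num_tokens=10, rectify_every=4)
    unrectified, _ = simulate_decoding(num_tokens=10, rectify_every=0)
    print("max drift with rectification:", max(rectified), "dense passes:", passes)
    print("max drift without rectification:", max(unrectified))
```

In this toy model the worst-case drift is capped at `rectify_every * per_step_error`, whereas without rectification it grows with the sequence length. That is the qualitative behavior the abstract refers to as bounding error accumulation.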

Title: CLAIM: An Intent-Driven Multi-Agent Framework for Analyzing Manipulation in Courtroom Dialogues

Authors: Disha Sheshanarayana, Tanishka Magar, Ayushi Mittal, Neelam Chaplot
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.04131
Pdf URL: https://arxiv.org/pdf/2506.04131
Copy Paste: [[2506.04131]] CLAIM: An Intent-Driven Multi-Agent Framework for Analyzing Manipulation in Courtroom Dialogues(https://arxiv.org/abs/2506.04131)
Keywords: agent
Abstract: Courtrooms are places where lives are determined and fates are sealed, yet they are not impervious to manipulation. Strategic use of manipulation in legal jargon can sway the opinions of judges and affect the decisions. Despite the growing advancements in NLP, its application in detecting and analyzing manipulation within the legal domain remains largely unexplored. Our work addresses this gap by introducing LegalCon, a dataset of 1,063 annotated courtroom conversations labeled for manipulation detection, identification of primary manipulators, and classification of manipulative techniques, with a focus on long conversations. Furthermore, we propose CLAIM, a two-stage, Intent-driven Multi-agent framework designed to enhance manipulation analysis by enabling context-aware and informed decision-making. Our results highlight the potential of incorporating agentic frameworks to improve fairness and transparency in judicial processes. We hope that this contributes to the broader application of NLP in legal discourse analysis and the development of robust tools to support fairness in legal decision-making. Our code and data are available at this https URL.
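Summary: Courtrooms are places where lives are determined and fates are sealed, yet they are not immune to manipulation. The strategic use of manipulation in legal language can sway judges' opinions and affect decisions. Despite growing advances in NLP, its application to detecting and analyzing manipulation in the legal domain remains largely unexplored. Our work addresses this gap by introducing LegalCon, a dataset of 1,063 annotated courtroom conversations labeled for manipulation detection, identification of primary manipulators, and classification of manipulative techniques, with a focus on long conversations. We also propose CLAIM, a two-stage, intent-driven multi-agent framework designed to enhance manipulation analysis by enabling context-aware and informed decision-making. Our results highlight the potential of agentic frameworks to improve fairness and transparency in judicial processes. We hope this contributes to the broader application of NLP in legal discourse analysis and to the development of robust tools that support fairness in legal decision-making. Our code and data are available at this https URL.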

Title: Are Lexicon-Based Tools Still the Gold Standard for Valence Analysis in Low-Resource Flemish?

Authors: Ratna Kandala, Katie Hoemann
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.04139
Pdf URL: https://arxiv.org/pdf/2506.04139
Copy Paste: [[2506.04139]] Are Lexicon-Based Tools Still the Gold Standard for Valence Analysis in Low-Resource Flemish?(https://arxiv.org/abs/2506.04139)
Keywords: llm, prompt
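Abstract: Understanding the nuances in everyday language is pivotal for advancements in computational linguistics and emotions research. Traditional lexicon-based tools such as LIWC and Pattern have long served as foundational instruments in this domain. LIWC is the most extensively validated word-count-based text analysis tool in the social sciences, and Pattern is an open-source Python library offering functionalities for NLP. However, everyday language is inherently spontaneous, richly expressive, and deeply context-dependent. To explore the capabilities of LLMs in capturing the valences of daily narratives in Flemish, we first conducted a study involving approximately 25,000 textual responses from 102 Dutch-speaking participants. Each participant provided narratives prompted by the question, "What is happening right now and how do you feel about it?", accompanied by self-assessed valence ratings on a continuous scale from -50 to +50. We then assessed the performance of three Dutch-specific LLMs in predicting these valence scores, and compared their outputs to those generated by LIWC and Pattern. Our findings indicate that, despite advancements in LLM architectures, these Dutch-tuned models currently fall short in accurately capturing the emotional valence present in spontaneous, real-world narratives. This study underscores the imperative for developing culturally and linguistically tailored models/tools that can adeptly handle the complexities of natural language use. Enhancing automated valence analysis is not only pivotal for advancing computational methodologies but also holds significant promise for psychological research with ecologically valid insights into human daily experiences. We advocate for increased efforts in creating comprehensive datasets and fine-tuning LLMs for low-resource languages like Flemish, aiming to bridge the gap between computational linguistics and emotion research.
Summary: Understanding the nuances of everyday language is key to progress in computational linguistics and emotion research. Traditional lexicon-based tools such as LIWC and Pattern have long been foundational instruments in this field. LIWC is the most extensively validated word-count-based text analysis tool in the social sciences, and Pattern is an open-source Python library that provides NLP functionality. Everyday language, however, is inherently spontaneous, richly expressive, and deeply context-dependent. To explore how well LLMs capture the valence of daily narratives in Flemish, we first conducted a study involving approximately 25,000 textual responses from 102 Dutch-speaking participants. Each participant provided narratives in response to the question "What is happening right now and how do you feel about it?", together with self-assessed valence ratings on a continuous scale from -50 to +50. We then assessed how well three Dutch-specific LLMs predict these valence scores and compared their outputs with those generated by LIWC and Pattern. Our findings indicate that, despite advances in LLM architectures, these Dutch-tuned models currently fall short of accurately capturing the emotional valence present in spontaneous, real-world narratives. The study underscores the need to develop culturally and linguistically tailored models and tools that can handle the complexities of natural language use. Improving automated valence analysis is pivotal for advancing computational methods and also holds significant promise for psychological research that seeks ecologically valid insights into daily human experience. We advocate greater effort in creating comprehensive datasets and fine-tuning LLMs for low-resource languages such as Flemish, with the aim of bridging the gap between computational linguistics and emotion research.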

Title: Establishing Trustworthy LLM Evaluation via Shortcut Neuron Analysis

Authors: Kejian Zhu, Shangqing Tu, Zhuoran Jin, Lei Hou, Juanzi Li, Jun Zhao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.04142
Pdf URL: https://arxiv.org/pdf/2506.04142
Copy Paste: [[2506.04142]] Establishing Trustworthy LLM Evaluation via Shortcut Neuron Analysis(https://arxiv.org/abs/2506.04142)
Keywords: language model, llm
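Abstract: The development of large language models (LLMs) depends on trustworthy evaluation. However, most current evaluations rely on public benchmarks, which are prone to data contamination issues that significantly compromise fairness. Previous research has focused on constructing dynamic benchmarks to address contamination. However, continuously building new benchmarks is costly and cyclical. In this work, we aim to tackle contamination by analyzing the mechanisms of contaminated models themselves. Through our experiments, we discover that the overestimation of contaminated models is likely due to parameters acquiring shortcut solutions in training. We further propose a novel method for identifying shortcut neurons through comparative and causal analysis. Building on this, we introduce an evaluation method called shortcut neuron patching to suppress shortcut neurons. Experiments validate the effectiveness of our approach in mitigating contamination. Additionally, our evaluation results exhibit a strong linear correlation with MixEval, a recently released trustworthy benchmark, achieving a Spearman coefficient ($\rho$) exceeding 0.95. This high correlation indicates that our method closely reveals true capabilities of the models and is trustworthy. We conduct further experiments to demonstrate the generalizability of our method across various benchmarks and hyperparameter settings. Code: this https URL
Summary: The development of large language models (LLMs) depends on trustworthy evaluation. However, most current evaluations rely on public benchmarks, which are prone to data contamination that significantly compromises fairness. Previous research has focused on constructing dynamic benchmarks to address contamination, but continuously building new benchmarks is costly and cyclical. In this work, we aim to tackle contamination by analyzing the mechanisms of contaminated models themselves. Through our experiments, we find that the overestimation of contaminated models is likely due to parameters acquiring shortcut solutions during training. We further propose a novel method for identifying shortcut neurons through comparative and causal analysis. Building on this, we introduce an evaluation method called shortcut neuron patching, which suppresses shortcut neurons. Experiments validate the effectiveness of our approach in mitigating contamination. In addition, our evaluation results show a strong linear correlation with MixEval, a recently released trustworthy benchmark, with a Spearman coefficient ($\rho$) exceeding 0.95. This high correlation indicates that our method closely reveals the true capabilities of the models and is trustworthy. We conduct further experiments to demonstrate that the method generalizes across various benchmarks and hyperparameter settings. Code: this https URL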
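The Spearman coefficient mentioned in this abstract is the Pearson correlation computed on ranks, so it measures how well two sets of scores agree in their ordering. The following is a minimal standard-library sketch of that computation. The score lists in the usage section are made-up placeholders, not results from the paper, and the function names are illustrative.

```python
def average_ranks(values):
    """Return 1-based ranks, assigning tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Hypothetical per-model scores under two evaluation schemes.
    patched_scores = [0.61, 0.72, 0.55, 0.80]
    reference_scores = [48.0, 60.5, 41.2, 71.3]
    print(spearman_rho(patched_scores, reference_scores))
```

Because only the ordering matters, a value near 1 indicates that the two evaluations rank the models almost identically, regardless of the scales on which the raw scores are reported.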

Title: A Dataset for Addressing Patient's Information Needs related to Clinical Course of Hospitalization

Authors: Sarvesh Soni, Dina Demner-Fushman
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.04156
Pdf URL: https://arxiv.org/pdf/2506.04156
Copy Paste: [[2506.04156]] A Dataset for Addressing Patient's Information Needs related to Clinical Course of Hospitalization(https://arxiv.org/abs/2506.04156)
Keywords: language model, llm, prompt
Abstract: Patients have distinct information needs about their hospitalization that can be addressed using clinical evidence from electronic health records (EHRs). While artificial intelligence (AI) systems show promise in meeting these needs, robust datasets are needed to evaluate the factual accuracy and relevance of AI-generated responses. To our knowledge, no existing dataset captures patient information needs in the context of their EHRs. We introduce ArchEHR-QA, an expert-annotated dataset based on real-world patient cases from intensive care unit and emergency department settings. The cases comprise questions posed by patients to public health forums, clinician-interpreted counterparts, relevant clinical note excerpts with sentence-level relevance annotations, and clinician-authored answers. To establish benchmarks for grounded EHR question answering (QA), we evaluated three open-weight large language models (LLMs)--Llama 4, Llama 3, and Mixtral--across three prompting strategies: generating (1) answers with citations to clinical note sentences, (2) answers before citations, and (3) answers from filtered citations. We assessed performance on two dimensions: Factuality (overlap between cited note sentences and ground truth) and Relevance (textual and semantic similarity between system and reference answers). The final dataset contains 134 patient cases. The answer-first prompting approach consistently performed best, with Llama 4 achieving the highest scores. Manual error analysis supported these findings and revealed common issues such as omitted key clinical evidence and contradictory or hallucinated content. Overall, ArchEHR-QA provides a strong benchmark for developing and evaluating patient-centered EHR QA systems, underscoring the need for further progress toward generating factual and relevant responses in clinical contexts.
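Summary: Patients have distinct information needs about their hospitalization that can be addressed using clinical evidence from electronic health records (EHRs). Although artificial intelligence (AI) systems show promise in meeting these needs, robust datasets are required to evaluate the factual accuracy and relevance of AI-generated responses. To our knowledge, no existing dataset captures patient information needs in the context of their EHRs. We introduce ArchEHR-QA, an expert-annotated dataset based on real-world patient cases from intensive care unit and emergency department settings. The cases comprise questions posed by patients to public health forums, clinician-interpreted counterparts, relevant clinical note excerpts with sentence-level relevance annotations, and clinician-authored answers. To establish benchmarks for grounded EHR question answering (QA), we evaluated three open-weight large language models (LLMs), namely Llama 4, Llama 3, and Mixtral, across three prompting strategies: generating (1) answers with citations to clinical note sentences, (2) answers before citations, and (3) answers from filtered citations. We assessed performance on two dimensions: Factuality (overlap between cited note sentences and ground truth) and Relevance (textual and semantic similarity between system and reference answers). The final dataset contains 134 patient cases. The answer-first prompting approach consistently performed best, with Llama 4 achieving the highest scores. Manual error analysis supported these findings and revealed common issues such as omitted key clinical evidence and contradictory or hallucinated content. Overall, ArchEHR-QA provides a strong benchmark for developing and evaluating patient-centered EHR QA systems and underscores the need for further progress toward generating factual and relevant responses in clinical contexts.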
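The Factuality dimension is described only as the overlap between cited note sentences and the ground truth. One plausible way to quantify such overlap is set-based precision, recall, and F1 over sentence identifiers. The sketch below assumes that formulation; it is not confirmed by the abstract, and the function name and the sentence IDs are hypothetical.

```python
def citation_overlap(cited_ids, relevant_ids):
    """Precision, recall, and F1 of cited sentence IDs against annotated ones.

    Assumption: overlap is scored as set-based P/R/F1; the abstract does not
    specify the exact formula.
    """
    cited, relevant = set(cited_ids), set(relevant_ids)
    true_positives = len(cited & relevant)
    precision = true_positives / len(cited) if cited else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical case: the answer cites sentences 1, 3, 5;
    # annotators marked sentences 1, 2, 3, 4 as relevant.
    p, r, f1 = citation_overlap([1, 3, 5], [1, 2, 3, 4])
    print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Under this formulation, an answer that omits key evidence loses recall, while one that cites irrelevant sentences loses precision.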

Title: SkipGPT: Dynamic Layer Pruning Reinvented with Token Awareness and Module Decoupling

Authors: Anhao Zhao, Fanghua Ye, Yingqi Fan, Junlong Tong, Zhiwei Fei, Hui Su, Xiaoyu Shen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.04179
Pdf URL: https://arxiv.org/pdf/2506.04179
Copy Paste: [[2506.04179]] SkipGPT: Dynamic Layer Pruning Reinvented with Token Awareness and Module Decoupling(https://arxiv.org/abs/2506.04179)
Keywords: language model, gpt, llm
Abstract: Large language models (LLMs) achieve remarkable performance across tasks but incur substantial computational costs due to their deep, multi-layered architectures. Layer pruning has emerged as a strategy to alleviate these inefficiencies, but conventional static pruning methods overlook two critical dynamics inherent to LLM inference: (1) horizontal dynamics, where token-level heterogeneity demands context-aware pruning decisions, and (2) vertical dynamics, where the distinct functional roles of MLP and self-attention layers necessitate component-specific pruning policies. We introduce SkipGPT, a dynamic layer pruning framework designed to optimize computational resource allocation through two core innovations: (1) global token-aware routing to prioritize critical tokens, and (2) decoupled pruning policies for MLP and self-attention components. To mitigate training instability, we propose a two-stage optimization paradigm: first, a disentangled training phase that learns routing strategies via soft parameterization to avoid premature pruning decisions, followed by parameter-efficient LoRA fine-tuning to restore performance impacted by layer removal. Extensive experiments demonstrate that SkipGPT reduces over 40% of model parameters while matching or exceeding the performance of the original dense model across benchmarks. By harmonizing dynamic efficiency with preserved expressivity, SkipGPT advances the practical deployment of scalable, resource-aware LLMs. Our code is publicly available at: this https URL.
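Summary: Large language models (LLMs) achieve remarkable performance across tasks but incur substantial computational costs because of their deep, multi-layered architectures. Layer pruning has emerged as a strategy to alleviate these inefficiencies, but conventional static pruning methods overlook two critical dynamics inherent to LLM inference: (1) horizontal dynamics, where token-level heterogeneity demands context-aware pruning decisions, and (2) vertical dynamics, where the distinct functional roles of MLP and self-attention layers call for component-specific pruning policies. We introduce SkipGPT, a dynamic layer pruning framework designed to optimize computational resource allocation through two core innovations: (1) global token-aware routing that prioritizes critical tokens, and (2) decoupled pruning policies for MLP and self-attention components. To mitigate training instability, we propose a two-stage optimization paradigm: first, a disentangled training phase that learns routing strategies via soft parameterization to avoid premature pruning decisions, followed by parameter-efficient LoRA fine-tuning to restore performance affected by layer removal. Extensive experiments show that SkipGPT reduces model parameters by more than 40% while matching or exceeding the performance of the original dense model across benchmarks. By harmonizing dynamic efficiency with preserved expressivity, SkipGPT advances the practical deployment of scalable, resource-aware LLMs. Our code is publicly available at: this https URL.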

Title: SuperWriter: Reflection-Driven Long-Form Generation with Large Language Models

Authors: Yuhao Wu, Yushi Bai, Zhiqiang Hu, Juanzi Li, Roy Ka-Wei Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.04180
Pdf URL: https://arxiv.org/pdf/2506.04180
Copy Paste: [[2506.04180]] SuperWriter: Reflection-Driven Long-Form Generation with Large Language Models(https://arxiv.org/abs/2506.04180)
Keywords: language model, llm, agent
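Abstract: Long-form text generation remains a significant challenge for large language models (LLMs), particularly in maintaining coherence, ensuring logical consistency, and preserving text quality as sequence length increases. To address these limitations, we propose SuperWriter-Agent, an agent-based framework designed to enhance the quality and consistency of long-form text generation. SuperWriter-Agent introduces explicit structured thinking, through planning and refinement stages, into the generation pipeline, guiding the model to follow a more deliberate and cognitively grounded process akin to that of a professional writer. Based on this framework, we construct a supervised fine-tuning dataset to train a 7B SuperWriter-LM. We further develop a hierarchical Direct Preference Optimization (DPO) procedure that uses Monte Carlo Tree Search (MCTS) to propagate final quality assessments and optimize each generation step accordingly. Empirical results across diverse benchmarks demonstrate that SuperWriter-LM achieves state-of-the-art performance, surpassing even larger-scale baseline models in both automatic evaluation and human evaluation. Furthermore, comprehensive ablation studies demonstrate the effectiveness of hierarchical DPO and underscore the value of incorporating structured thinking steps to improve the quality of long-form text generation.
Summary: Long-form text generation remains a significant challenge for large language models (LLMs), particularly in maintaining coherence, ensuring logical consistency, and preserving text quality as sequence length increases. To address these limitations, we propose SuperWriter-Agent, an agent-based framework designed to improve the quality and consistency of long-form text generation. SuperWriter-Agent introduces explicit structured thinking, in the form of planning and refinement stages, into the generation pipeline, guiding the model to follow a more deliberate and cognitively grounded process akin to that of a professional writer. Based on this framework, we construct a supervised fine-tuning dataset to train a 7B SuperWriter-LM. We further develop a hierarchical Direct Preference Optimization (DPO) procedure that uses Monte Carlo Tree Search (MCTS) to propagate final quality assessments and optimize each generation step accordingly. Empirical results across diverse benchmarks show that SuperWriter-LM achieves state-of-the-art performance, surpassing even larger-scale baseline models in both automatic and human evaluation. Comprehensive ablation studies further demonstrate the effectiveness of hierarchical DPO and underscore the value of incorporating structured thinking steps to improve the quality of long-form text generation.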

Title: Long or short CoT? Investigating Instance-level Switch of Large Reasoning Models

Authors: Ruiqi Zhang, Changyi Xiao, Yixin Cao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.04182
Pdf URL: https://arxiv.org/pdf/2506.04182
Copy Paste: [[2506.04182]] Long or short CoT? Investigating Instance-level Switch of Large Reasoning Models(https://arxiv.org/abs/2506.04182)
Keywords: prompt, chain-of-thought
Abstract: With the rapid advancement of large reasoning models, long Chain-of-Thought (CoT) prompting has demonstrated strong performance on complex tasks. However, this often comes with a significant increase in token usage. In this paper, we conduct a comprehensive empirical analysis comparing long and short CoT strategies. Our findings reveal that while long CoT can lead to performance improvements, its benefits are often marginal relative to its significantly higher token consumption. Specifically, long CoT tends to outperform when ample generation budgets are available, whereas short CoT is more effective under tighter budget constraints. These insights underscore the need for a dynamic approach that selects the proper CoT strategy based on task context and resource availability. To address this, we propose SwitchCoT, an automatic framework that adaptively chooses between long and short CoT strategies to balance reasoning accuracy and computational efficiency. Moreover, SwitchCoT is designed to be budget-aware, making it broadly applicable across scenarios with varying resource constraints. Experimental results demonstrate that SwitchCoT can reduce inference costs by up to 50% while maintaining high accuracy. Notably, under limited token budgets, it achieves performance comparable to, or even exceeding, that of using either long or short CoT alone.
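Summary: With the rapid advancement of large reasoning models, long Chain-of-Thought (CoT) prompting has shown strong performance on complex tasks. However, this often comes with a significant increase in token usage. In this paper, we conduct a comprehensive empirical analysis comparing long and short CoT strategies. Our findings show that although long CoT can improve performance, its benefits are often marginal relative to its significantly higher token consumption. Specifically, long CoT tends to perform better when ample generation budgets are available, whereas short CoT is more effective under tighter budget constraints. These insights underscore the need for a dynamic approach that selects the appropriate CoT strategy based on task context and resource availability. To address this, we propose SwitchCoT, an automatic framework that adaptively chooses between long and short CoT strategies to balance reasoning accuracy and computational efficiency. SwitchCoT is also designed to be budget-aware, making it broadly applicable across scenarios with varying resource constraints. Experimental results show that SwitchCoT can reduce inference costs by up to 50% while maintaining high accuracy. Notably, under limited token budgets, it achieves performance comparable to, or even exceeding, that of using either long or short CoT alone.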
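The budget-aware choice between long and short CoT can be illustrated with a simple decision rule. This is a hypothetical sketch, not the SwitchCoT selection mechanism, which the abstract does not specify. The `StrategyEstimate` class, the `choose_strategy` function, the `min_gain` threshold, and all numbers are assumptions introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class StrategyEstimate:
    """Hypothetical per-instance estimate for one prompting strategy."""
    name: str
    expected_tokens: int
    expected_accuracy: float


def choose_strategy(short, long, token_budget, min_gain=0.02):
    """Pick long CoT only if it fits the budget and is expected to help enough.

    Assumed rule: fall back to short CoT when the long strategy would exceed
    the token budget or its expected accuracy gain is below `min_gain`.
    """
    if long.expected_tokens > token_budget:
        return short.name
    if long.expected_accuracy - short.expected_accuracy < min_gain:
        return short.name
    return long.name


if __name__ == "__main__":
    short = StrategyEstimate("short", expected_tokens=200, expected_accuracy=0.70)
    long = StrategyEstimate("long", expected_tokens=2000, expected_accuracy=0.78)
    print(choose_strategy(short, long, token_budget=500))    # tight budget
    print(choose_strategy(short, long, token_budget=4000))   # ample budget
```

The rule mirrors the qualitative finding in the abstract: short CoT under tight budgets, long CoT when the budget is ample and the expected gain justifies the extra tokens.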

Title: R-Search: Empowering LLM Reasoning with Search via Multi-Reward Reinforcement Learning

Authors: Qingfei Zhao, Ruobing Wang, Dingling Xu, Daren Zha, Limin Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.04185
Pdf URL: https://arxiv.org/pdf/2506.04185
Copy Paste: [[2506.04185]] R-Search: Empowering LLM Reasoning with Search via Multi-Reward Reinforcement Learning(https://arxiv.org/abs/2506.04185)
Keywords: language model, llm
Abstract: Large language models (LLMs) have notably progressed in multi-step and long-chain reasoning. However, extending their reasoning capabilities to encompass deep interactions with search remains a non-trivial challenge, as models often fail to identify optimal reasoning-search interaction trajectories, resulting in suboptimal responses. We propose R-Search, a novel reinforcement learning framework for Reasoning-Search integration, designed to enable LLMs to autonomously execute multi-step reasoning with deep search interaction, and learn optimal reasoning search interaction trajectories via multi-reward signals, improving response quality in complex logic- and knowledge-intensive tasks. R-Search guides the LLM to dynamically decide when to retrieve or reason, while globally integrating key evidence to enhance deep knowledge interaction between reasoning and search. During RL training, R-Search provides multi-stage, multi-type rewards to jointly optimize the reasoning-search trajectory. Experiments on seven datasets show that R-Search outperforms advanced RAG baselines by up to 32.2% (in-domain) and 25.1% (out-of-domain). The code and data are available at this https URL.
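Summary: Large language models (LLMs) have made notable progress in multi-step and long-chain reasoning. However, extending their reasoning capabilities to include deep interaction with search remains a non-trivial challenge, because models often fail to identify optimal reasoning-search interaction trajectories, which leads to suboptimal responses. We propose R-Search, a novel reinforcement learning framework for reasoning-search integration. It is designed to enable LLMs to autonomously execute multi-step reasoning with deep search interaction and to learn optimal reasoning-search interaction trajectories via multi-reward signals, improving response quality on complex logic- and knowledge-intensive tasks. R-Search guides the LLM to decide dynamically when to retrieve and when to reason, while globally integrating key evidence to deepen the knowledge interaction between reasoning and search. During RL training, R-Search provides multi-stage, multi-type rewards to jointly optimize the reasoning-search trajectory. Experiments on seven datasets show that R-Search outperforms advanced RAG baselines by up to 32.2% (in-domain) and 25.1% (out-of-domain). The code and data are available at this https URL.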

Title: Efficient Knowledge Editing via Minimal Precomputation

Authors: Akshat Gupta, Maochuan Lu, Thomas Hartvigsen, Gopala Anumanchipalli
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.04226
Pdf URL: https://arxiv.org/pdf/2506.04226
Copy Paste: [[2506.04226]] Efficient Knowledge Editing via Minimal Precomputation(https://arxiv.org/abs/2506.04226)
Keywords: gpt
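Abstract: Knowledge editing methods like MEMIT can make data- and compute-efficient updates of factual knowledge by using a single sentence to update facts and their consequences. However, what is often overlooked is a "precomputation step", which requires a one-time but significant computational cost. The authors of MEMIT originally precompute approximately 44 million hidden vectors per edited layer, which requires a forward pass over 44 million tokens. For GPT-J (6B), this precomputation step takes 36 hours on a single GPU, while it takes approximately 40 hours for Llama2-7B. Additionally, this precomputation time grows with model size. In this paper, we show that this excessive computational cost is unnecessary. Knowledge editing using MEMIT and related methods, such as ROME and EMMET, can be performed by pre-computing a very small portion of the 44 million hidden vectors. We first present the theoretical minimum number of hidden vectors that must be precomputed for solutions of these editing methods to exist. We then empirically show that knowledge editing using these methods can be done by pre-computing significantly fewer hidden vectors. Specifically, we show that the precomputation step can be done with less than 0.3% of the originally stipulated number of hidden vectors. This saves a significant amount of precomputation time and allows users to begin editing new models within a few minutes.
Summary: Knowledge editing methods such as MEMIT can make data- and compute-efficient updates of factual knowledge by using a single sentence to update facts and their consequences. What is often overlooked, however, is a "precomputation step" that incurs a one-time but significant computational cost. The authors of MEMIT originally precompute approximately 44 million hidden vectors per edited layer, which requires a forward pass over 44 million tokens. For GPT-J (6B), this precomputation step takes 36 hours on a single GPU, and it takes approximately 40 hours for Llama2-7B. This precomputation time also grows with model size. In this paper, we show that this excessive computational cost is unnecessary. Knowledge editing with MEMIT and related methods, such as ROME and EMMET, can be performed by precomputing only a very small portion of the 44 million hidden vectors. We first present the theoretical minimum number of hidden vectors that must be precomputed for solutions of these editing methods to exist. We then show empirically that knowledge editing with these methods can be done by precomputing significantly fewer hidden vectors. Specifically, we show that the precomputation step can be done with less than 0.3% of the originally stipulated number of hidden vectors. This saves a significant amount of precomputation time and allows users to begin editing new models within a few minutes.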
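The figures quoted in this abstract can be checked with a short back-of-the-envelope calculation. The sketch below uses only the numbers stated above (44 million vectors, 0.3%, 36 and 40 hours). The assumption that precomputation time scales linearly with the number of hidden vectors is ours, not a claim from the paper, and the function names are illustrative.

```python
TOTAL_VECTORS = 44_000_000   # hidden vectors per edited layer, from the abstract
FRACTION_PER_MILLE = 3       # 0.3% expressed in thousandths to keep integer math


def reduced_vector_count(total=TOTAL_VECTORS, per_mille=FRACTION_PER_MILLE):
    """Upper bound on vectors needed if only 0.3% of the original set is used."""
    return total * per_mille // 1000


def scaled_minutes(full_hours, per_mille=FRACTION_PER_MILLE):
    """Estimated precomputation time in minutes.

    Assumption: time is proportional to the number of precomputed vectors.
    """
    return full_hours * 60 * per_mille / 1000


if __name__ == "__main__":
    print("vectors at 0.3%:", reduced_vector_count())
    print("GPT-J (6B) estimate, minutes:", scaled_minutes(36))
    print("Llama2-7B estimate, minutes:", scaled_minutes(40))
```

Under the linear-scaling assumption, 0.3% of 44 million is 132,000 vectors, and the 36-hour and 40-hour runs shrink to roughly 6.5 and 7.2 minutes, which is consistent with the abstract's statement that editing can begin within a few minutes.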