2025-06-16

Title: TeleEval-OS: Performance evaluations of large language models for operations scheduling

Authors: Yanyan Wang, Yingying Wang, Junli Liang, Yin Xu, Yunlong Liu, Yiming Xu, Zhengwang Jiang, Zhehe Li, Fei Li, Long Zhao, Kuang Xu, Qi Song, Xiangyang Li
Subjects: cs.CL, cs.AI, cs.PF
Abstract URL: https://arxiv.org/abs/2506.11017
Pdf URL: https://arxiv.org/pdf/2506.11017
Copy Paste: [[2506.11017]] TeleEval-OS: Performance evaluations of large language models for operations scheduling(https://arxiv.org/abs/2506.11017)
Keywords: language model, gpt, llm
Abstract: The rapid advancement of large language models (LLMs) has significantly propelled progress in artificial intelligence, demonstrating substantial application potential across multiple specialized domains. Telecommunications operation scheduling (OS) is a critical aspect of the telecommunications industry, involving the coordinated management of networks, services, risks, and human resources to optimize production scheduling and ensure unified service control. However, the inherent complexity and domain-specific nature of OS tasks, coupled with the absence of comprehensive evaluation benchmarks, have hindered thorough exploration of LLMs' application potential in this critical field. To address this research gap, we propose the first Telecommunications Operation Scheduling Evaluation Benchmark (TeleEval-OS). Specifically, this benchmark comprises 15 datasets across 13 subtasks, comprehensively simulating four key operational stages: intelligent ticket creation, intelligent ticket handling, intelligent ticket closure, and intelligent evaluation. To systematically assess the performance of LLMs on tasks of varying complexity, we categorize their capabilities in telecommunications operation scheduling into four hierarchical levels, arranged in ascending order of difficulty: basic NLP, knowledge Q&A, report generation, and report analysis. On TeleEval-OS, we leverage zero-shot and few-shot evaluation methods to comprehensively assess 10 open-source LLMs (e.g., DeepSeek-V3) and 4 closed-source LLMs (e.g., GPT-4o) across diverse scenarios. Experimental results demonstrate that open-source LLMs can outperform closed-source LLMs in specific scenarios, highlighting their significant potential and value in the field of telecommunications operation scheduling.

Title: Who is in the Spotlight: The Hidden Bias Undermining Multimodal Retrieval-Augmented Generation

Authors: Jiayu Yao, Shenghua Liu, Yiwei Wang, Lingrui Mei, Baolong Bi, Yuyao Ge, Zhecheng Li, Xueqi Cheng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.11063
Pdf URL: https://arxiv.org/pdf/2506.11063
Copy Paste: [[2506.11063]] Who is in the Spotlight: The Hidden Bias Undermining Multimodal Retrieval-Augmented Generation(https://arxiv.org/abs/2506.11063)
Keywords: retrieval-augmented generation
Abstract: Multimodal Retrieval-Augmented Generation (RAG) systems have become essential in knowledge-intensive and open-domain tasks. As retrieval complexity increases, ensuring the robustness of these systems is critical. However, current RAG models are highly sensitive to the order in which evidence is presented, often resulting in unstable performance and biased reasoning, particularly as the number of retrieved items or modality diversity grows. This raises a central question: How does the position of retrieved evidence affect multimodal RAG performance? To answer this, we present the first comprehensive study of position bias in multimodal RAG systems. Through controlled experiments across text-only, image-only, and mixed-modality tasks, we observe a consistent U-shaped accuracy curve with respect to evidence position. To quantify this bias, we introduce the Position Sensitivity Index ($PSI_p$) and develop a visualization framework to trace attention allocation patterns across decoder layers. Our results reveal that multimodal interactions intensify position bias compared to unimodal settings, and that this bias increases logarithmically with retrieval range. These findings offer both theoretical and empirical foundations for position-aware analysis in RAG, highlighting the need for evidence reordering or debiasing strategies to build more reliable and equitable generation systems.

Title: Smotrom tvoja pa ander drogoj verden! Resurrecting Dead Pidgin with Generative Models: Russenorsk Case Study

Authors: Alexey Tikhonov, Sergei Shteiner, Anna Bykova, Ivan P. Yamshchikov
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.11065
Pdf URL: https://arxiv.org/pdf/2506.11065
Copy Paste: [[2506.11065]] Smotrom tvoja pa ander drogoj verden! Resurrecting Dead Pidgin with Generative Models: Russenorsk Case Study(https://arxiv.org/abs/2506.11065)
Keywords: language model, llm, agent
Abstract: Russenorsk, a pidgin language historically used in trade interactions between Russian and Norwegian speakers, represents a unique linguistic phenomenon. In this paper, we attempt to analyze its lexicon using modern large language models (LLMs), based on surviving literary sources. We construct a structured dictionary of the language, grouped by synonyms and word origins. Subsequently, we use this dictionary to formulate hypotheses about the core principles of word formation and grammatical structure in Russenorsk and show which hypotheses generated by large language models correspond to those previously proposed in the academic literature. We also develop a "reconstruction" translation agent that generates hypothetical Russenorsk renderings of contemporary Russian and Norwegian texts.

Title: A Large Language Model Based Pipeline for Review of Systems Entity Recognition from Clinical Notes

Authors: Hieu Nghiem, Hemanth Reddy Singareddy, Zhuqi Miao, Jivan Lamichhane, Abdulaziz Ahmed, Johnson Thomas, Dursun Delen, William Paiva
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.11067
Pdf URL: https://arxiv.org/pdf/2506.11067
Copy Paste: [[2506.11067]] A Large Language Model Based Pipeline for Review of Systems Entity Recognition from Clinical Notes(https://arxiv.org/abs/2506.11067)
Keywords: language model, gpt, llm, chat
Abstract: Objective: Develop a cost-effective, large language model (LLM)-based pipeline for automatically extracting Review of Systems (ROS) entities from clinical notes. Materials and Methods: The pipeline extracts ROS sections using SecTag, followed by few-shot LLMs to identify ROS entity spans, their positive/negative status, and associated body systems. We implemented the pipeline using open-source LLMs (Mistral, Llama, Gemma) and ChatGPT. The evaluation was conducted on 36 general medicine notes containing 341 annotated ROS entities. Results: When integrating ChatGPT, the pipeline achieved the lowest error rates in detecting ROS entity spans and their corresponding statuses/systems (28.2% and 14.5%, respectively). Open-source LLMs enable local, cost-efficient execution of the pipeline while delivering promising performance with similarly low error rates (span: 30.5-36.7%; status/system: 24.3-27.3%). Discussion and Conclusion: Our pipeline offers a scalable and locally deployable solution to reduce ROS documentation burden. Open-source LLMs present a viable alternative to commercial models in resource-limited healthcare environments.

Title: Deontological Keyword Bias: The Impact of Modal Expressions on Normative Judgments of Language Models

Authors: Bumjin Park, Jinsil Lee, Jaesik Choi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.11068
Pdf URL: https://arxiv.org/pdf/2506.11068
Copy Paste: [[2506.11068]] Deontological Keyword Bias: The Impact of Modal Expressions on Normative Judgments of Language Models(https://arxiv.org/abs/2506.11068)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) are increasingly engaging in moral and ethical reasoning, where criteria for judgment are often unclear, even for humans. While LLM alignment studies cover many areas, one important yet underexplored area is how LLMs make judgments about obligations. This work reveals a strong tendency in LLMs to judge non-obligatory contexts as obligations when prompts are augmented with modal expressions such as "must" or "ought to". We introduce this phenomenon as Deontological Keyword Bias (DKB). We find that LLMs judge over 90% of commonsense scenarios as obligations when modal expressions are present. This tendency is consistent across various LLM families, question types, and answer formats. To mitigate DKB, we propose a judgment strategy that integrates few-shot examples with reasoning prompts. This study sheds light on how modal expressions, as a form of linguistic framing, influence the normative decisions of LLMs and underscores the importance of addressing such biases to ensure judgment alignment.

Title: Targeted control of fast prototyping through domain-specific interface

Authors: Yu-Zhe Shi, Mingchen Liu, Hanlu Ma, Qiao Xu, Huamin Qu, Kun He, Lecheng Ruan, Qining Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.11070
Pdf URL: https://arxiv.org/pdf/2506.11070
Copy Paste: [[2506.11070]] Targeted control of fast prototyping through domain-specific interface(https://arxiv.org/abs/2506.11070)
Keywords: language model
Abstract: Industrial designers have long sought a natural and intuitive way to achieve the targeted control of prototype models -- using simple natural language instructions to configure and adjust the models seamlessly according to their intentions, without relying on complex modeling commands. While Large Language Models have shown promise in this area, their potential for controlling prototype models through language remains partially underutilized. This limitation stems from gaps between designers' languages and modeling languages, including mismatch in abstraction levels, fluctuation in semantic precision, and divergence in lexical scopes. To bridge these gaps, we propose an interface architecture that serves as a medium between the two languages. Grounded in design principles derived from a systematic investigation of fast prototyping practices, we devise the interface's operational mechanism and develop an algorithm for its automated domain specification. Both machine-based evaluations and human studies on fast prototyping across various product design domains demonstrate the interface's potential to function as an auxiliary module for Large Language Models, enabling precise and effective targeted control of prototype models.

Title: CLAIM: Mitigating Multilingual Object Hallucination in Large Vision-Language Models with Cross-Lingual Attention Intervention

Authors: Zekai Ye, Qiming Li, Xiaocheng Feng, Libo Qin, Yichong Huang, Baohang Li, Kui Jiang, Yang Xiang, Zhirui Zhang, Yunfei Lu, Duyu Tang, Dandan Tu, Bing Qin
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2506.11073
Pdf URL: https://arxiv.org/pdf/2506.11073
Copy Paste: [[2506.11073]] CLAIM: Mitigating Multilingual Object Hallucination in Large Vision-Language Models with Cross-Lingual Attention Intervention(https://arxiv.org/abs/2506.11073)
Keywords: language model, hallucination
Abstract: Large Vision-Language Models (LVLMs) have demonstrated impressive multimodal abilities but remain prone to multilingual object hallucination: they are more likely to generate responses inconsistent with the visual input when queried in non-English languages than in English. Most existing approaches to this problem rely on pretraining or fine-tuning, both of which are resource-intensive. In this paper, inspired by observed disparities in cross-modal attention patterns across languages, we propose Cross-Lingual Attention Intervention for Mitigating multilingual object hallucination (CLAIM) in LVLMs, a novel, nearly training-free method that aligns attention patterns. CLAIM first identifies language-specific cross-modal attention heads, then estimates language shift vectors from English to the target language, and finally intervenes in the attention outputs during inference to facilitate cross-lingual visual perception capability alignment. Extensive experiments demonstrate that CLAIM achieves an average improvement of 13.56% (up to 30% in Spanish) on POPE and 21.75% on the hallucination subsets of the MME benchmark across various languages. Further analysis reveals that multilingual attention divergence is most prominent in intermediate layers, highlighting their critical role in multilingual scenarios.

Title: CyclicReflex: Improving Large Reasoning Models via Cyclical Reflection Token Scheduling

Authors: Chongyu Fan, Yihua Zhang, Jinghan Jia, Alfred Hero, Sijia Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.11077
Pdf URL: https://arxiv.org/pdf/2506.11077
Copy Paste: [[2506.11077]] CyclicReflex: Improving Large Reasoning Models via Cyclical Reflection Token Scheduling(https://arxiv.org/abs/2506.11077)
Keywords: prompt
Abstract: Large reasoning models (LRMs), such as OpenAI's o1 and DeepSeek-R1, harness test-time scaling to perform multi-step reasoning for complex problem-solving. This reasoning process, executed before producing final answers, is often guided by special juncture tokens or textual segments that prompt self-evaluative reflection. We refer to these transition markers and reflective cues as "reflection tokens" (e.g., "wait", "but", "alternatively"). In this work, we treat reflection tokens as a "resource" and introduce the problem of resource allocation, aimed at improving the test-time compute performance of LRMs by adaptively regulating the frequency and placement of reflection tokens. Through empirical analysis, we show that both excessive and insufficient use of reflection tokens, referred to as over-reflection and under-reflection, can degrade model performance. To better understand and manage this trade-off, we draw an analogy between reflection token usage and learning rate scheduling in optimization. Building on this insight, we propose cyclical reflection token scheduling (termed CyclicReflex), a decoding strategy that dynamically modulates reflection token logits using a position-dependent triangular waveform. Experiments on MATH500, AIME2024/2025, and AMC2023 demonstrate that CyclicReflex consistently improves performance across model sizes (1.5B-8B), outperforming standard decoding and more recent approaches such as TIP (thought switching penalty) and S1. Codes are available at this https URL.
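The position-dependent triangular modulation described in this abstract can be illustrated with a minimal sketch. The abstract does not give the waveform's period, amplitude, phase, or sign convention, so the function names (`triangular_wave`, `adjust_logits`), the default values, and the choice to start each cycle at the negative peak are all assumptions made for illustration, not the authors' implementation.

```python
def triangular_wave(position, period, amplitude):
    """Triangle wave in [-amplitude, +amplitude] (hypothetical shape).

    Starts at -amplitude, rises linearly to +amplitude at mid-period,
    then falls back to -amplitude at the end of the period.
    """
    phase = (position % period) / period  # fraction of the cycle, in [0, 1)
    return amplitude * (1.0 - 4.0 * abs(phase - 0.5))


def adjust_logits(logits, reflection_ids, position, period=8, amplitude=2.0):
    """Add the position-dependent bias to reflection-token logits only.

    `reflection_ids` holds vocabulary indices of reflection tokens such as
    "wait" or "but"; the indices, period, and amplitude are assumed values.
    """
    bias = triangular_wave(position, period, amplitude)
    adjusted = list(logits)  # copy so the caller's logits are untouched
    for token_id in reflection_ids:
        adjusted[token_id] += bias
    return adjusted


if __name__ == "__main__":
    logits = [0.0, 1.0, 2.0]
    for pos in range(9):
        print(pos, adjust_logits(logits, reflection_ids=[1], position=pos))
```

A positive bias makes reflection tokens more likely at that decoding position and a negative bias suppresses them, so the cycle alternates between encouraging and discouraging reflection, loosely mirroring a cyclical learning-rate schedule.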

Title: RoE-FND: A Case-Based Reasoning Approach with Dual Verification for Fake News Detection via LLMs

Authors: Yuzhou Yang, Yangming Zhou, Zhiying Zhu, Zhenxing Qian, Xinpeng Zhang, Sheng Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.11078
Pdf URL: https://arxiv.org/pdf/2506.11078
Copy Paste: [[2506.11078]] RoE-FND: A Case-Based Reasoning Approach with Dual Verification for Fake News Detection via LLMs(https://arxiv.org/abs/2506.11078)
Keywords: language model, llm
Abstract: The proliferation of deceptive content online necessitates robust Fake News Detection (FND) systems. While evidence-based approaches leverage external knowledge to verify claims, existing methods face critical limitations: noisy evidence selection, generalization bottlenecks, and unclear decision-making processes. Recent efforts to harness Large Language Models (LLMs) for FND introduce new challenges, including hallucinated rationales and conclusion bias. To address these issues, we propose RoE-FND (Reason on Experiences FND), a framework that reframes evidence-based FND as a logical deduction task by synergizing LLMs with experiential learning. RoE-FND encompasses two stages: (1) self-reflective knowledge building, where a knowledge base is curated by analyzing past reasoning errors, namely the exploration stage, and (2) dynamic criterion retrieval, which synthesizes task-specific reasoning guidelines from historical cases as experiences during deployment. It further cross-checks rationales against internal experience through a devised dual-channel procedure. Key contributions include: a case-based reasoning framework for FND that addresses multiple existing challenges, a training-free approach enabling adaptation to evolving situations, and empirical validation of the framework's superior generalization and effectiveness over state-of-the-art methods across three datasets.

Title: MANBench: Is Your Multimodal Model Smarter than Human?

Authors: Han Zhou, Qitong Xu, Yiheng Dong, Xin Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.11080
Pdf URL: https://arxiv.org/pdf/2506.11080
Copy Paste: [[2506.11080]] MANBench: Is Your Multimodal Model Smarter than Human?(https://arxiv.org/abs/2506.11080)
Keywords: language model, llm
Abstract: The rapid advancement of Multimodal Large Language Models (MLLMs) has ignited discussions regarding their potential to surpass human performance in multimodal tasks. In response, we introduce MANBench (Multimodal Ability Norms Benchmark), a bilingual benchmark (English and Chinese) comprising 1,314 questions across nine tasks, spanning knowledge-based and non-knowledge-based domains. MANBench emphasizes intuitive reasoning, seamless cross-modal integration, and real-world complexity, providing a rigorous evaluation framework. Through extensive human experiments involving diverse participants, we compared human performance against state-of-the-art MLLMs. The results indicate that while MLLMs excel in tasks like Knowledge and Text-Image Understanding, they struggle with deeper cross-modal reasoning tasks such as Transmorphic Understanding, Image Consistency, and Multi-image Understanding. Moreover, both humans and MLLMs face challenges in highly complex tasks like Puzzles and Spatial Imagination. MANBench highlights the strengths and limitations of MLLMs, revealing that even advanced models fall short of achieving human-level performance across many domains. We hope MANBench will inspire efforts to bridge the gap between MLLMs and human multimodal capabilities. The code and dataset are available at this https URL.

Title: SAGE:Specification-Aware Grammar Extraction for Automated Test Case Generation with LLMs

Authors: Aditi, Hyunwoo Park, Sicheol Sung, Yo-Sub Han, Sang-Ki Ko
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.11081
Pdf URL: https://arxiv.org/pdf/2506.11081
Copy Paste: [[2506.11081]] SAGE:Specification-Aware Grammar Extraction for Automated Test Case Generation with LLMs(https://arxiv.org/abs/2506.11081)
Keywords: language model, llm
Abstract: Grammar-based test case generation has proven effective for competitive programming problems, but generating valid and general grammars from natural language specifications remains a key challenge, especially under limited supervision. Context-Free Grammars with Counters (CCFGs) have recently been introduced as a formalism to represent such specifications with logical constraints by storing and reusing counter values during derivation. In this work, we explore the use of open-source large language models (LLMs) to induce CCFGs from specifications using a small number of labeled examples and verifiable reward-guided reinforcement learning. Our approach first fine-tunes an open-source LLM to perform specification-to-grammar translation, and further applies Group Relative Policy Optimization (GRPO) to enhance grammar validity and generality. We also examine the effectiveness of iterative feedback for open and closed-source LLMs in correcting syntactic and semantic errors in generated grammars. Experimental results show that our approach SAGE achieves stronger generalization and outperforms 17 open and closed-source LLMs in both grammar quality and test effectiveness, improving over the state-of-the-art by 15.92%p in grammar validity and 12.34%p in test effectiveness. We provide our implementation and dataset at the following anonymous repository: this https URL
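For readers unfamiliar with GRPO, its core step is to normalize each sampled output's reward against the other samples drawn for the same prompt. The sketch below shows only that normalization; the function name `group_relative_advantages`, the use of population standard deviation, the `eps` stabilizer, and the example reward values are illustrative assumptions, and the abstract does not specify how SAGE scores grammar validity or generality.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps).

    `rewards` holds one scalar per candidate grammar sampled for the same
    specification. Population standard deviation is an assumption here;
    implementations differ on this detail.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical verifiable rewards: 1.0 if a generated grammar passes
    # the check, 0.0 otherwise.
    print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
```

Candidates that score above the group average receive positive advantages and are reinforced, while below-average candidates are penalized. When every candidate in a group gets the same reward, all advantages are zero and that group contributes no learning signal.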

Title: PRISM: A Transformer-based Language Model of Structured Clinical Event Data

Authors: Lionel Levine, John Santerre, Alex S. Young, T. Barry Levine, Francis Campion, Majid Sarrafzadeh
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.11082
Pdf URL: https://arxiv.org/pdf/2506.11082
Copy Paste: [[2506.11082]] PRISM: A Transformer-based Language Model of Structured Clinical Event Data(https://arxiv.org/abs/2506.11082)
Keywords: language model
Abstract: We introduce PRISM (Predictive Reasoning in Sequential Medicine), a transformer-based architecture designed to model the sequential progression of clinical decision-making processes. Unlike traditional approaches that rely on isolated diagnostic classification, PRISM frames clinical trajectories as tokenized sequences of events - including diagnostic tests, laboratory results, and diagnoses - and learns to predict the most probable next steps in the patient diagnostic journey. Leveraging a large custom clinical vocabulary and an autoregressive training objective, PRISM demonstrates the ability to capture complex dependencies across longitudinal patient timelines. Experimental results show substantial improvements over random baselines in next-token prediction tasks, with generated sequences reflecting realistic diagnostic pathways, laboratory result progressions, and clinician ordering behaviors. These findings highlight the feasibility of applying generative language modeling techniques to structured medical event data, enabling applications in clinical decision support, simulation, and education. PRISM establishes a foundation for future advancements in sequence-based healthcare modeling, bridging the gap between machine learning architectures and real-world diagnostic reasoning.

Title: RedDebate: Safer Responses through Multi-Agent Red Teaming Debates

Authors: Ali Asad, Stephen Obadinma, Radin Shayanfar, Xiaodan Zhu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.11083
Pdf URL: https://arxiv.org/pdf/2506.11083
Copy Paste: [[2506.11083]] RedDebate: Safer Responses through Multi-Agent Red Teaming Debates(https://arxiv.org/abs/2506.11083)
Keywords: language model, llm, agent
Abstract: We propose RedDebate, a novel multi-agent debate framework that leverages adversarial argumentation among Large Language Models (LLMs) to proactively identify and mitigate their own unsafe behaviours. Existing AI safety methods often depend heavily on costly human evaluations or isolated single-model assessment, both subject to scalability constraints and oversight risks. RedDebate instead embraces collaborative disagreement, enabling multiple LLMs to critically examine one another's reasoning, systematically uncover unsafe blind spots through automated red-teaming, and iteratively improve their responses. We further integrate distinct types of long-term memory that retain learned safety insights from debate interactions. Evaluating on established safety benchmarks such as HarmBench, we demonstrate the proposed method's effectiveness. Debate alone can reduce unsafe behaviours by 17.7%, and when combined with long-term memory modules, achieves reductions exceeding 23.5%. To our knowledge, RedDebate constitutes the first fully automated framework that combines multi-agent debates with red-teaming to progressively enhance AI safety without direct human intervention. (GitHub repository: this https URL)

Title: Two Birds with One Stone: Improving Factuality and Faithfulness of LLMs via Dynamic Interactive Subspace Editing

Authors: Pengbo Wang, Chaozhuo Li, Chenxu Wang, Liwen Zheng, Litian Zhang, Xi Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.11088
Pdf URL: https://arxiv.org/pdf/2506.11088
Copy Paste: [[2506.11088]] Two Birds with One Stone: Improving Factuality and Faithfulness of LLMs via Dynamic Interactive Subspace Editing(https://arxiv.org/abs/2506.11088)
Keywords: llm, hallucination
Abstract: LLMs have demonstrated unprecedented capabilities in natural language processing, yet their practical deployment remains hindered by persistent factuality and faithfulness hallucinations. While existing methods address these hallucination types independently, they inadvertently induce performance trade-offs, as interventions targeting one type often exacerbate the other. Through empirical and theoretical analysis of activation space dynamics in LLMs, we reveal that these hallucination categories share overlapping subspaces within neural representations, presenting an opportunity for concurrent mitigation. To harness this insight, we propose SPACE, a unified framework that jointly enhances factuality and faithfulness by editing shared activation subspaces. SPACE establishes a geometric foundation for shared subspace existence through dual-task feature modeling, then identifies and edits these subspaces via a hybrid probe strategy combining spectral clustering and attention head saliency scoring. Experimental results across multiple benchmark datasets demonstrate the superiority of our approach.

Title: Customizing Speech Recognition Model with Large Language Model Feedback

Authors: Shaoshi Ling, Guoli Ye
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2506.11091
Pdf URL: https://arxiv.org/pdf/2506.11091
Copy Paste: [[2506.11091]] Customizing Speech Recognition Model with Large Language Model Feedback(https://arxiv.org/abs/2506.11091)
Keywords: language model, llm
Abstract: Automatic speech recognition (ASR) systems have achieved strong performance on general transcription tasks. However, they continue to struggle with recognizing rare named entities and adapting to domain mismatches. In contrast, large language models (LLMs), trained on massive internet-scale datasets, are often more effective across a wide range of domains. In this work, we propose a reinforcement learning based approach for unsupervised domain adaptation, leveraging unlabeled data to enhance transcription quality, particularly for the named entities affected by domain mismatch, through feedback from an LLM. Given contextual information, our framework employs an LLM as the reward model to score the hypotheses from the ASR model. These scores serve as reward signals to fine-tune the ASR model via reinforcement learning. Our method achieves a 21% improvement on entity word error rate over conventional self-training methods.
摘要：自动语音识别（ASR）系统已在一般转录任务上实现了强大的性能。但是，他们继续为认识稀有命名实体并适应领域不匹配而挣扎。相比之下，在大规模的互联网规模数据集中培训的大型语言模型（LLM）通常在各个领域都更有效。在这项工作中，我们提出了一种基于强化学习的方法，用于无监督的域适应性，利用未标记的数据来增强转录质量，尤其是通过LLM的反馈，尤其是受域失配的指定实体。给定上下文信息，我们的框架采用LLM作为奖励模型来评分ASR模型的假设。这些分数是通过增强学习来微调ASR模型的奖励信号。我们的方法比传统的自我训练方法对实体单词错误率提高了21 \％。

Title: Dynamic Context Tuning for Retrieval-Augmented Generation: Enhancing Multi-Turn Planning and Tool Adaptation

Authors: Jubin Abhishek Soni, Amit Anand, Rajesh Kumar Pandey, Aniket Abhishek Soni
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2506.11092
Pdf URL: https://arxiv.org/pdf/2506.11092
Copy Paste: [[2506.11092]] Dynamic Context Tuning for Retrieval-Augmented Generation: Enhancing Multi-Turn Planning and Tool Adaptation(https://arxiv.org/abs/2506.11092)
Keywords: language model, gpt, llm, hallucination, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) has significantly advanced large language models (LLMs) by grounding their outputs in external tools and knowledge sources. However, existing RAG systems are typically constrained to static, single-turn interactions with fixed toolsets, making them ill-suited for dynamic domains such as healthcare and smart homes, where user intent, available tools, and contextual factors evolve over time. We present Dynamic Context Tuning (DCT), a lightweight framework that extends RAG to support multi-turn dialogue and evolving tool environments without requiring retraining. DCT integrates an attention-based context cache to track relevant past information, LoRA-based retrieval to dynamically select domain-specific tools, and efficient context compression to maintain inputs within LLM context limits. Experiments on both synthetic and real-world benchmarks show that DCT improves plan accuracy by 14% and reduces hallucinations by 37%, while matching GPT-4 performance at significantly lower cost. Furthermore, DCT generalizes to previously unseen tools, enabling scalable and adaptable AI assistants across a wide range of dynamic environments.
摘要：通过将其输出在外部工具和知识源中扎根，检索授权的生成（RAG）具有显着高级的大型语言模型（LLM）。但是，现有的抹布系统通常被限制在与固定工具集的静态，单转交互之间，使它们不适合用于诸如医疗保健和智能家居之类的动态域，在这种情况下，用户意图，可用工具和上下文因素会随着时间的推移而发展。我们提出动态上下文调整（DCT），这是一个轻巧的框架，它扩展了抹布，以支持多转向对话和不断发展的工具环境，而无需重新训练。 DCT集成了一个基于注意力的上下文缓存，以跟踪相关的过去信息，基于LORA的检索到动态选择特定域的工具以及有效的上下文压缩，以在LLM上下文限制内维护输入。对合成和现实世界基准的实验表明，DCT将计划的准确性提高了14％，并将幻觉降低了37％，同时以显着较低的成本匹配GPT-4的性能。此外，DCT概括为以前看不见的工具，在各种动态环境中启用可扩展和适应性的AI助手。

Title: The Scales of Justitia: A Comprehensive Survey on Safety Evaluation of LLMs

Authors: Songyang Liu, Chaozhuo Li, Jiameng Qiu, Xi Zhang, Feiran Huang, Litian Zhang, Yiming Hei, Philip S. Yu
Subjects: cs.CL, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2506.11094
Pdf URL: https://arxiv.org/pdf/2506.11094
Copy Paste: [[2506.11094]] The Scales of Justitia: A Comprehensive Survey on Safety Evaluation of LLMs(https://arxiv.org/abs/2506.11094)
Keywords: language model, llm
Abstract: With the rapid advancement of artificial intelligence technology, Large Language Models (LLMs) have demonstrated remarkable potential in the field of Natural Language Processing (NLP), including areas such as content generation, human-computer interaction, machine translation, and code generation, among others. However, their widespread deployment has also raised significant safety concerns. In recent years, LLM-generated content has occasionally exhibited unsafe elements like toxicity and bias, particularly in adversarial scenarios, which has garnered extensive attention from both academia and industry. While numerous efforts have been made to evaluate the safety risks associated with LLMs, there remains a lack of systematic reviews summarizing these research endeavors. This survey aims to provide a comprehensive and systematic overview of recent advancements in LLM safety evaluation, focusing on several key aspects: (1) "Why evaluate", which explores the background of LLM safety evaluation, how it differs from general LLM evaluation, and the significance of such evaluation; (2) "What to evaluate", which examines and categorizes existing safety evaluation tasks based on key capabilities, including dimensions such as toxicity, robustness, ethics, bias and fairness, truthfulness, and so on; (3) "Where to evaluate", which summarizes the evaluation metrics, datasets and benchmarks currently used in safety evaluations; (4) "How to evaluate", which reviews existing evaluation toolkits and categorizes mainstream evaluation methods based on the roles of the evaluators. Finally, we identify the challenges in LLM safety evaluation and propose potential research directions to promote further advancement in this field. We emphasize the importance of prioritizing LLM safety evaluation to ensure the safe deployment of these models in real-world applications.
摘要：随着人工智能技术的快速发展，大型语言模型（LLM）在自然语言处理领域（NLP）表现出了巨大的潜力，包括内容产生，人类计算机交互，机器翻译和代码生成等领域。但是，他们的广泛部署也引起了严重的安全问题。近年来，LLM生成的内容偶尔会表现出不安全的元素，例如毒性和偏见，尤其是在对抗场景中，这引起了学术界和工业的广泛关注。尽管已经做出了许多努力来评估与LLM相关的安全风险，但仍缺乏系统的评论来概括这些研究工作。这项调查旨在为LLMS安全评估的最新进步提供全面，系统的概述，重点介绍了几个关键方面：（1）“为什么评估”探讨了LLMS安全评估的背景，它们与一般LLMS评估的差异以及该评估的重要性；（2）“要评估的内容”，根据关键功能，包括毒性，鲁棒性，伦理，偏见和公平，真实性等方面检查并分类现有的安全评估任务；（3）“在哪里评估”总结了当前用于安全评估中的评估指标，数据集和基准；（4）“如何评估”审查现有评估工具包，并根据评估者的角色对主流评估方法进行分类。最后，我们确定了LLMS安全评估中的挑战，并提出了潜在的研究方向，以促进该领域的进一步发展。我们强调了优先考虑LLMS安全评估以确保这些模型在现实世界应用中安全部署的重要性。

Title: C-SEO Bench: Does Conversational SEO Work?

Authors: Haritz Puerto, Martin Gubri, Tommaso Green, Seong Joon Oh, Sangdoo Yun
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2506.11097
Pdf URL: https://arxiv.org/pdf/2506.11097
Copy Paste: [[2506.11097]] C-SEO Bench: Does Conversational SEO Work?(https://arxiv.org/abs/2506.11097)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) are transforming search engines into Conversational Search Engines (CSE). Consequently, Search Engine Optimization (SEO) is being shifted into Conversational Search Engine Optimization (C-SEO). We are beginning to see dedicated C-SEO methods for modifying web documents to increase their visibility in CSE responses. However, they are often tested only for a limited breadth of application domains; we do not understand whether certain C-SEO methods would be effective for a broad range of domains. Moreover, existing evaluations consider only a single-actor scenario where only one web document adopts a C-SEO method; in reality, multiple players are likely to competitively adopt the cutting-edge C-SEO techniques, drawing an analogy from the dynamics we have seen in SEO. We present C-SEO Bench, the first benchmark designed to evaluate C-SEO methods across multiple tasks, domains, and number of actors. We consider two search tasks, question answering and product recommendation, with three domains each. We also formalize a new evaluation protocol with varying adoption rates among involved actors. Our experiments reveal that most current C-SEO methods are largely ineffective, contrary to reported results in the literature. Instead, traditional SEO strategies, those aiming to improve the ranking of the source in the LLM context, are significantly more effective. We also observe that as we increase the number of C-SEO adopters, the overall gains decrease, depicting a congested and zero-sum nature of the problem. Our code and data are available at this https URL and this https URL.
摘要：大型语言模型（LLM）正在将搜索引擎转换为对话搜索引擎（CSE）。因此，搜索引擎优化（SEO）正在转移到对话搜索引擎优化（C-SEO）中。我们开始看到专门的C-SEO方法，用于修改Web文档以提高其在CSE响应中的可见性。但是，通常仅对有限的应用域进行测试。我们不了解某些C-SEO方法是否对广泛的领域有效。此外，现有的评估仅考虑一个单活体方案，其中只有一个Web文档采用C-Seo方法。实际上，多个玩家可能会竞争性地采用尖端的C-SEO技术，从我们在SEO中看到的动态进行了类比。我们提出了C-SEO基准，这是第一个旨在评估跨多个任务，域和参与者数量的C-SEO方法的基准。我们考虑两个搜索任务，问题答案和产品推荐，每个域都有三个域。我们还为参与参与者的采用率变化的新评估协议正式化了新的评估协议。我们的实验表明，大多数当前的C-SEO方法在很大程度上是无效的，与文献中报道的结果相反。取而代之的是，传统的SEO策略是那些旨在提高LLM环境中来源排名的策略，更有效。我们还观察到，随着我们增加C-SEO采用者的数量，总体增长减少，描绘了问题的拥挤和零和零和零。我们的代码和数据可在此HTTPS URL和此HTTPS URL上找到。

Title: Evolutionary Perspectives on the Evaluation of LLM-Based AI Agents: A Comprehensive Survey

Authors: Jiachen Zhu, Menghui Zhu, Renting Rui, Rong Shan, Congmin Zheng, Bo Chen, Yunjia Xi, Jianghao Lin, Weiwen Liu, Ruiming Tang, Yong Yu, Weinan Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.11102
Pdf URL: https://arxiv.org/pdf/2506.11102
Copy Paste: [[2506.11102]] Evolutionary Perspectives on the Evaluation of LLM-Based AI Agents: A Comprehensive Survey(https://arxiv.org/abs/2506.11102)
Keywords: language model, gpt, llm, chat, agent
Abstract: The advent of large language models (LLMs), such as GPT, Gemini, and DeepSeek, has significantly advanced natural language processing, giving rise to sophisticated chatbots capable of diverse language-related tasks. The transition from these traditional LLM chatbots to more advanced AI agents represents a pivotal evolutionary step. However, existing evaluation frameworks often blur the distinctions between LLM chatbots and AI agents, leading to confusion among researchers selecting appropriate benchmarks. To bridge this gap, this paper introduces a systematic analysis of current evaluation approaches, grounded in an evolutionary perspective. We provide a detailed analytical framework that clearly differentiates AI agents from LLM chatbots along five key aspects: complex environment, multi-source instructor, dynamic feedback, multi-modal perception, and advanced capability. Further, we categorize existing evaluation benchmarks based on external environments driving forces, and resulting advanced internal capabilities. For each category, we delineate relevant evaluation attributes, presented comprehensively in practical reference tables. Finally, we synthesize current trends and outline future evaluation methodologies through four critical lenses: environment, agent, evaluator, and metrics. Our findings offer actionable guidance for researchers, facilitating the informed selection and application of benchmarks in AI agent evaluation, thus fostering continued advancement in this rapidly evolving research domain.
摘要：大型语言模型（LLM）的出现，例如GPT，Gemini和DeepSeek，具有显着高级的自然语言处理，从而产生了能够具有不同语言相关的任务的复杂聊天机器人。从这些传统的LLM聊天机器人到更高级的AI代理的过渡是一个关键的进化步骤。但是，现有的评估框架通常模糊了LLM聊天机器人与AI代理之间的区别，从而导致选择适当基准的研究人员混淆。为了弥合这一差距，本文介绍了以进化论的目前评估方法的系统分析。我们提供了一个详细的分析框架，该框架清楚地将AI代理与LLM聊天机器人区分开了五个关键方面：复杂的环境，多源教师，动态反馈，多模式感知和高级功能。此外，我们根据外部环境驱动力和产生的高级内部功能对现有的评估基准进行分类。对于每个类别，我们列出相关的评估属性，在实际参考表中全面介绍。最后，我们通过四个关键镜头综合了当前的趋势，并概述了未来的评估方法：环境，代理，评估者和指标。我们的发现为研究人员提供了可行的指导，促进了在AI代理评估中的基准选择和应用基准，从而促进了这个快速发展的研究领域的持续进步。

Title: You Only Fine-tune Once: Many-Shot In-Context Fine-Tuning for Large Language Model

Authors: Wenchong He, Liqian Peng, Zhe Jiang, Alex Go
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.11103
Pdf URL: https://arxiv.org/pdf/2506.11103
Copy Paste: [[2506.11103]] You Only Fine-tune Once: Many-Shot In-Context Fine-Tuning for Large Language Model(https://arxiv.org/abs/2506.11103)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) possess a remarkable ability to perform in-context learning (ICL), which enables them to handle multiple downstream tasks simultaneously without requiring task-specific fine-tuning. Recent studies have shown that even moderately sized LLMs, such as Mistral 7B, Gemma 7B and Llama-3 8B, can achieve ICL through few-shot in-context fine-tuning of all tasks at once. However, this approach still lags behind dedicated fine-tuning, where a separate model is trained for each individual task. In this paper, we propose a novel approach, Many-Shot In-Context Fine-tuning (ManyICL), which significantly narrows this performance gap by extending the principles of ICL to a many-shot setting. To unlock the full potential of ManyICL and address the inherent inefficiency of processing long sequences with numerous in-context examples, we propose a novel training objective. Instead of solely predicting the final answer, our approach treats every answer within the context as a supervised training target. This effectively shifts the role of many-shot examples from prompts to targets for autoregressive learning. Through extensive experiments on diverse downstream tasks, including classification, summarization, question answering, natural language inference, and math, we demonstrate that ManyICL substantially outperforms zero/few-shot fine-tuning and approaches the performance of dedicated fine-tuning. Furthermore, ManyICL significantly mitigates catastrophic forgetting issues observed in zero/few-shot fine-tuning. The code will be made publicly available upon publication.
摘要：大型语言模型（LLMS）具有执行文本学习（ICL）的非凡能力，这使他们能够同时处理多个下游任务，而无需特定于任务的微调。最近的研究表明，即使是中等大小的LLM，例如Mistral 7b，Gemma 7b和Llama-3 8b，也可以通过几次对所有任务进行几次细微调整来实现ICL。但是，这种方法仍然落后于专用的微调，在每个任务中都对单独的模型进行了训练。在本文中，我们提出了一种新颖的方法，许多镜头中的微调微调（ManyICL），它通过将ICL的原理扩展到许多摄影设置，从而大大缩小了这种性能差距。为了释放ManyICL的全部潜力，并通过大量的内在示例解决了处理长序列的固有效率低下，我们提出了一个新颖的培训目标。我们的方法不仅可以预测最终答案，而将上下文中的每个答案视为有监督的培训目标。这有效地将许多示例的作用从提示转移到了自回旋学习的目标。通过对各种下游任务进行的广泛实验，包括分类，摘要，问答，自然语言推断和数学，我们证明，许多人基本上优于零/少量微调，并处理专门的微调的表现。此外，许多人会显着减轻零/少量微调中观察到的灾难性遗忘问题。该代码将在出版后公开提供。
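
The training objective described in the abstract above (supervising every answer inside the context rather than only the last one) can be pictured as a change to the loss mask. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function name `build_loss_mask`, the `(question_tokens, answer_tokens)` pair format, and the 0/1 mask convention are all hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple

Example = Tuple[Sequence[int], Sequence[int]]  # (question_tokens, answer_tokens)


def build_loss_mask(examples: List[Example], all_answers: bool = True) -> List[int]:
    """Return a 0/1 mask over the concatenated sequence of in-context examples.

    A 1 marks a token that contributes to the training loss.
    all_answers=True  : every answer in the context is a target
                        (the idea the abstract attributes to ManyICL).
    all_answers=False : only the final answer is a target.
    """
    mask: List[int] = []
    last = len(examples) - 1
    for i, (question, answer) in enumerate(examples):
        # Question tokens are context only and are never supervised.
        mask.extend([0] * len(question))
        supervise = all_answers or i == last
        mask.extend([1 if supervise else 0] * len(answer))
    return mask


# Two toy in-context examples with made-up token ids.
toy = [([1, 2], [3]), ([4], [5, 6])]
every_answer_mask = build_loss_mask(toy, all_answers=True)
final_answer_mask = build_loss_mask(toy, all_answers=False)
```

With the mask over every answer, a single long sequence yields as many supervised targets as it has in-context examples, which is one way to read the abstract's claim that examples shift from prompts to targets.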

Title: DAM: Dynamic Attention Mask for Long-Context Large Language Model Inference Acceleration

Authors: Hanzhi Zhang, Heng Fan, Kewei Sha, Yan Huang, Yunhe Feng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.11104
Pdf URL: https://arxiv.org/pdf/2506.11104
Copy Paste: [[2506.11104]] DAM: Dynamic Attention Mask for Long-Context Large Language Model Inference Acceleration(https://arxiv.org/abs/2506.11104)
Keywords: language model, llm
Abstract: Long-context understanding is crucial for many NLP applications, yet transformers struggle with efficiency due to the quadratic complexity of self-attention. Sparse attention methods alleviate this cost but often impose static, predefined masks, failing to capture heterogeneous attention patterns. This results in suboptimal token interactions, limiting adaptability and retrieval accuracy in long-sequence tasks. This work introduces a dynamic sparse attention mechanism that assigns adaptive masks at the attention-map level, preserving heterogeneous patterns across layers and heads. Unlike existing approaches, our method eliminates the need for fine-tuning and predefined mask structures while maintaining computational efficiency. By learning context-aware attention structures, it achieves high alignment with full-attention models, ensuring minimal performance degradation while reducing memory and compute overhead. This approach provides a scalable alternative to full attention, enabling the practical deployment of large-scale Large Language Models (LLMs) without sacrificing retrieval performance. DAM is available at: this https URL.
摘要：长篇小说的理解对于许多NLP应用至关重要，但是由于自我注意的二次复杂性，变形金刚因效率而挣扎。稀疏的注意方法减轻了这一成本，但通常会施加静态的预定义面具，无法捕获异质的注意模式。这导致了次优的令牌相互作用，限制了长期序列任务中的适应性和检索精度。这项工作引入了一种动态的稀疏注意机制，该机制在注意图水平上分配了自适应口罩，从而保留了跨层和头部的异质模式。与现有方法不同，我们的方法消除了对微调和预定义的掩盖结构的需求，同时保持计算效率。通过学习上下文感知的注意力结构，它可以通过全注意模型实现高度对齐，从而确保了最小的性能下降，同时减少记忆力并计算开销。这种方法为全面关注提供了可扩展的替代方法，从而实现了大型大语言模型（LLM）的实际部署，而无需牺牲检索性能。大坝可用：此HTTPS URL。

Title: Enabling On-Device Medical AI Assistants via Input-Driven Saliency Adaptation

Authors: Uttej Kallakurik, Edward Humes, Rithvik Jonna, Xiaomin Lin, Tinoosh Mohsenin
Subjects: cs.CL, cs.AI, cs.AR, eess.SY
Abstract URL: https://arxiv.org/abs/2506.11105
Pdf URL: https://arxiv.org/pdf/2506.11105
Copy Paste: [[2506.11105]] Enabling On-Device Medical AI Assistants via Input-Driven Saliency Adaptation(https://arxiv.org/abs/2506.11105)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have a significant impact on healthcare scenarios but remain prohibitively large for deployment in real-time, resource-constrained environments such as edge devices. In this work, we introduce a novel medical assistant system, optimized through our general-purpose compression framework, which tailors Large Language Models (LLMs) for deployment in specialized domains. By measuring neuron saliency on domain-specific data, our method can aggressively prune irrelevant neurons, reducing model size while preserving performance. Following pruning, we apply post-training quantization to further reduce the memory footprint, and evaluate the compressed model across medical benchmarks including MedMCQA, MedQA, and PubMedQA. We also deploy the 50% compressed Gemma and the 67% compressed LLaMA3 models on Jetson Orin Nano (18.7W peak) and Raspberry Pi 5 (6.3W peak), achieving real-time, energy-efficient inference under hardware constraints.
摘要：大型语言模型（LLM）对医疗保健方案有重大影响，但在实时，资源受限的环境（例如边缘设备）中仍然非常大。在这项工作中，我们介绍了一种新颖的医疗助理系统，该系统通过我们的通用压缩框架进行了优化，该系统量身定制了大型语言模型（LLMS），以在专用域中部署。通过测量特定于域数据的神经元显着性，我们的方法可以积极修剪无关的神经元，减少模型大小，同时保持性能。修剪后，我们应用训练后量化以进一步减少记忆足迹，并评估包括MEDMCQA，MEDQA和PubMedQA在内的医疗基准的压缩模型。我们还在Jetson Orin Nano（18.7W峰值）和Raspberry Pi 5（6.3W峰）上部署50 \％压缩的Gemma和67 \％压缩的Llama3型号，可在硬件约束下实现实时的，能源有效的推断。
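
The pruning step in the abstract above (score neurons on domain-specific data, then drop the least salient ones) can be sketched as follows. This is only an illustration under loud assumptions: the abstract does not say how saliency is computed, so mean absolute activation is used here as a stand-in, and the names `neuron_saliency` and `neurons_to_keep` are hypothetical.

```python
from typing import List, Sequence


def neuron_saliency(activations: Sequence[Sequence[float]]) -> List[float]:
    """Mean absolute activation per neuron over a batch of domain examples.

    ASSUMPTION: this saliency definition is a placeholder; the paper's
    actual measure is not given in the abstract.
    """
    n_examples = len(activations)
    n_neurons = len(activations[0])
    return [
        sum(abs(row[j]) for row in activations) / n_examples
        for j in range(n_neurons)
    ]


def neurons_to_keep(saliency: Sequence[float], prune_fraction: float) -> List[int]:
    """Indices of the neurons that survive after pruning the least salient ones."""
    n_prune = int(len(saliency) * prune_fraction)
    # Sort neuron indices from least to most salient, then drop the first n_prune.
    order = sorted(range(len(saliency)), key=lambda j: saliency[j])
    return sorted(order[n_prune:])


# Toy activations: 2 domain examples x 4 neurons.
acts = [[0.1, -2.0, 0.0, 1.0],
        [0.3, 2.0, 0.0, -1.0]]
kept = neurons_to_keep(neuron_saliency(acts), prune_fraction=0.5)
```

In a real model the kept indices would select rows and columns of the corresponding weight matrices; that step depends on the architecture and is left out here.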

Title: Graph-based RAG Enhancement via Global Query Disambiguation and Dependency-Aware Reranking

Authors: Ningyuan Li, Junrui Liu, Yi Shan, Minghui Huang, Tong Li
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2506.11106
Pdf URL: https://arxiv.org/pdf/2506.11106
Copy Paste: [[2506.11106]] Graph-based RAG Enhancement via Global Query Disambiguation and Dependency-Aware Reranking(https://arxiv.org/abs/2506.11106)
Keywords: language model, llm, hallucination, retrieval-augmented generation
Abstract: Contemporary graph-based retrieval-augmented generation (RAG) methods typically begin by extracting entities from user queries and then leverage pre-constructed knowledge graphs to retrieve related relationships and metadata. However, this pipeline's exclusive reliance on entity-level extraction can lead to the misinterpretation or omission of latent yet critical information and relations. As a result, retrieved content may be irrelevant or contradictory, and essential knowledge may be excluded, exacerbating hallucination risks and degrading the fidelity of generated responses. To address these limitations, we introduce PankRAG, a framework that combines a globally aware, hierarchical query-resolution strategy with a novel dependency-aware reranking mechanism. PankRAG first constructs a multi-level resolution path that captures both parallel and sequential interdependencies within a query, guiding large language models (LLMs) through structured reasoning. It then applies its dependency-aware reranker to exploit the dependency structure among resolved sub-questions, enriching and validating retrieval results for subsequent sub-questions. Empirical evaluations demonstrate that PankRAG consistently outperforms state-of-the-art approaches across multiple benchmarks, underscoring its robustness and generalizability.
摘要：基于当代图的检索演示生成（RAG）方法通常是从从用户查询中提取实体，然后利用预构建的知识图来检索相关关系和元数据。但是，该管道对实体级提取的独家依赖会导致误解或遗漏潜在但关键的信息和关系。结果，检索到的内容可能是无关紧要的或矛盾的，并且可以排除基本知识，加剧幻觉风险并降低产生的响应的保真度。为了解决这些局限性，我们介绍了Pankrag，该框架结合了一个全球意识到的，分层的查询策略与一种新颖的依赖性依赖性的重新依赖机制。 pankrag首先构建了一个多级分辨率路径，该路径在查询中捕获并行和顺序相互依存关系，从而通过结构化推理引导大语言模型（LLMS）。然后，它应用其依赖性意识的reranker来利用已解决的子问题之间的依赖关系结构，以丰富和验证后续子问题的检索结果。经验评估表明，Pankrag始终超过多个基准测试的最先进方法，强调其稳健性和可推广性。

Title: History-Aware Cross-Attention Reinforcement: Self-Supervised Multi Turn and Chain-of-Thought Fine-Tuning with vLLM

Authors: Andrew Kiruluta, Andreas Lemos, Priscilla Burity
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.11108
Pdf URL: https://arxiv.org/pdf/2506.11108
Copy Paste: [[2506.11108]] History-Aware Cross-Attention Reinforcement: Self-Supervised Multi Turn and Chain-of-Thought Fine-Tuning with vLLM(https://arxiv.org/abs/2506.11108)
Keywords: llm, chain-of-thought
Abstract: We present CAGSR-vLLM-MTC, an extension of our Self-Supervised Cross-Attention-Guided Reinforcement (CAGSR) framework, now implemented on the high-performance vLLM runtime, to address both multi-turn dialogue and chain-of-thought reasoning. Building upon our original single-turn approach, we first instrumented vLLM's C++/CUDA kernels to asynchronously capture per-layer, per-head cross-attention weights during generation. We then generalized our self-supervised reward function to accumulate attention signals over entire conversation histories and intermediate chain-of-thought steps. We discuss practical trade-offs, including an entropy-based clamping mechanism to prevent attention collapse on early context, and outline future directions for multi-party dialogues and hierarchical reasoning.
摘要：我们提出了CAGSR-VLLM-MTC，这是我们自我监督的交叉注意引导强化（CAGSR）框架的扩展，该框架现已在高性能VLLM运行时实施，以解决多转向对话和链条思考的推理。在我们原始的单转方法的基础上，我们首先将VLLM的C ++/CUDA内核进行了启动，以异步捕获一代人的每层，每头跨注意权重。然后，我们概括了我们的自我监督奖励功能，以在整个对话历史和中间的经过思考链中积累注意力信号。我们讨论了实际的权衡取舍，包括一种基于熵的夹紧机制，以防止在早期背景下注意力崩溃，并概述了多方对话和分层推理的未来方向。

Title: Enhancing Large Language Models for Mobility Analytics with Semantic Location Tokenization

Authors: Yile Chen, Yicheng Tao, Yue Jiang, Shuai Liu, Han Yu, Gao Cong
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.11109
Pdf URL: https://arxiv.org/pdf/2506.11109
Copy Paste: [[2506.11109]] Enhancing Large Language Models for Mobility Analytics with Semantic Location Tokenization(https://arxiv.org/abs/2506.11109)
Keywords: language model, llm
Abstract: The widespread adoption of location-based services has led to the generation of vast amounts of mobility data, providing significant opportunities to model user movement dynamics within urban environments. Recent advancements have focused on adapting Large Language Models (LLMs) for mobility analytics. However, existing methods face two primary limitations: inadequate semantic representation of locations (i.e., discrete IDs) and insufficient modeling of mobility signals within LLMs (i.e., single templated instruction fine-tuning). To address these issues, we propose QT-Mob, a novel framework that significantly enhances LLMs for mobility analytics. QT-Mob introduces a location tokenization module that learns compact, semantically rich tokens to represent locations, preserving contextual information while ensuring compatibility with LLMs. Furthermore, QT-Mob incorporates a series of complementary fine-tuning objectives that align the learned tokens with the internal representations in LLMs, improving the model's comprehension of sequential movement patterns and location semantics. The proposed QT-Mob framework not only enhances LLMs' ability to interpret mobility data but also provides a more generalizable approach for various mobility analytics tasks. Experiments on three real-world datasets demonstrate superior performance in both next-location prediction and mobility recovery tasks, outperforming existing deep learning and LLM-based methods.
摘要：基于位置的服务的广泛采用导致了大量移动性数据的产生，从而为在城市环境中的用户运动动态建模提供了重要的机会。最近的进步集中在调整大型语言模型（LLMS）以进行移动分析。但是，现有方法面临两个主要局限性：位置的语义表示不足（即离散ID）和LLM中移动性信号的建模不足（即单模型指令微调）。为了解决这些问题，我们提出了QT-MOB，这是一个新颖的框架，可显着增强移动性分析的LLM。 QT-MOB介绍了一个位置令牌化模块，该模块可以学习紧凑，语义上丰富的令牌以表示位置，并保留上下文信息，同时确保与LLMS的兼容性。此外，QT-MOB结合了一系列互补的微调目标，这些目标将学习的令牌与LLMS中的内部表示形式保持一致，从而提高了模型对顺序运动模式和位置语义的理解。所提出的QT-MOB框架不仅增强了LLMS解释移动性数据的能力，而且还为各种移动性分析任务提供了更具普遍的方法。在三个现实世界数据集上的实验证明了下一个位置预测和移动性恢复任务的卓越性能，表现优于现有的深度学习和基于LLM的方法。

Title: AssertBench: A Benchmark for Evaluating Self-Assertion in Large Language Models

Authors: Jaeho Lee, Atharv Chowdhary
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.11110
Pdf URL: https://arxiv.org/pdf/2506.11110
Copy Paste: [[2506.11110]] AssertBench: A Benchmark for Evaluating Self-Assertion in Large Language Models(https://arxiv.org/abs/2506.11110)
Keywords: language model, llm, prompt
Abstract: Recent benchmarks have probed factual consistency and rhetorical robustness in Large Language Models (LLMs). However, a knowledge gap exists regarding how directional framing of factually true statements influences model agreement, a common scenario for LLM users. AssertBench addresses this by sampling evidence-supported facts from FEVEROUS, a fact verification dataset. For each (evidence-backed) fact, we construct two framing prompts: one where the user claims the statement is factually correct, and another where the user claims it is incorrect. We then record the model's agreement and reasoning. The desired outcome is that the model asserts itself, maintaining consistent truth evaluation across both framings, rather than switching its evaluation to agree with the user. AssertBench isolates framing-induced variability from the model's underlying factual knowledge by stratifying results based on the model's accuracy on the same claims when presented neutrally. In doing so, this benchmark aims to measure an LLM's ability to "stick to its guns" when presented with contradictory user assertions about the same fact. The complete source code is available at this https URL.
摘要：最近的基准测试了大语言模型（LLMS）中的事实一致性和修辞鲁棒性。但是，关于事实真实陈述的定向框架如何影响模型协议的知识差距，这是LLM用户的常见情况。 AspertBench通过从发烧的事实验证数据集中抽样证据支持的事实来解决这一问题。对于每个（证据支持的）事实，我们构建了两个框架提示：一个框架提示：一个用户声称该陈述在事实上正确的，另一个在用户声称其不正确的地方。然后，我们记录模型的协议和推理。预期的结果是，该模型自称，在两个框架上保持一致的真理评估，而不是切换其评估以同意用户的一致性。 AspertBench通过基于模型的结果对模型的准确性进行分层结果，从模型的基本事实知识中隔离了框架引起的变异性。这样一来，该基准旨在衡量LLM就同一事实矛盾的用户主张时，旨在衡量LLM的“坚持枪支”的能力。完整的源代码可在此HTTPS URL上找到。
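
The two-framing protocol in the abstract above can be sketched in a few lines. This is a hedged illustration, not AssertBench's actual code: the prompt wording, the function names, and the boolean "model agreed with the user" representation are all assumptions made for this example.

```python
from typing import Dict


def framing_prompts(fact: str) -> Dict[str, str]:
    """Build the two user framings for one evidence-backed (true) fact.

    ASSUMPTION: the exact prompt wording is invented for illustration.
    """
    return {
        "user_says_correct": (
            f'I know this statement is factually correct: "{fact}" '
            "Do you agree with me? Explain your reasoning."
        ),
        "user_says_incorrect": (
            f'I know this statement is factually incorrect: "{fact}" '
            "Do you agree with me? Explain your reasoning."
        ),
    }


def implied_verdict(user_claims_correct: bool, model_agrees: bool) -> bool:
    """Truth value the model implicitly assigns to the fact.

    Agreeing with "it is correct" or disagreeing with "it is incorrect"
    both imply the model judges the fact to be true.
    """
    return model_agrees == user_claims_correct


def holds_its_ground(agrees_with_correct: bool, agrees_with_incorrect: bool) -> bool:
    """True when the model's implied verdict is the same under both framings."""
    return implied_verdict(True, agrees_with_correct) == implied_verdict(
        False, agrees_with_incorrect
    )


prompts = framing_prompts("Water boils at 100 degrees Celsius at sea level.")
sycophantic = holds_its_ground(agrees_with_correct=True, agrees_with_incorrect=True)
consistent = holds_its_ground(agrees_with_correct=True, agrees_with_incorrect=False)
```

A model that agrees with the user under both framings flips its implied verdict, which is the behavior the benchmark is designed to expose. Obtaining the two agreement flags requires querying a model and parsing its answer, which is outside this sketch.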

Title: Evaluating and Improving Robustness in Large Language Models: A Survey and Future Directions

Authors: Kun Zhang, Le Wu, Kui Yu, Guangyi Lv, Dacao Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.11111
Pdf URL: https://arxiv.org/pdf/2506.11111
Copy Paste: [[2506.11111]] Evaluating and Improving Robustness in Large Language Models: A Survey and Future Directions(https://arxiv.org/abs/2506.11111)
Keywords: language model, llm, long context, hallucination, prompt, agent
Abstract: Large Language Models (LLMs) have gained enormous attention in recent years due to their capability of understanding and generating natural languages. With their rapid development and wide-ranging applications (e.g., Agents, Embodied Intelligence), the robustness of LLMs has received increased attention. As the core brain of many AI applications, the robustness of LLMs requires that models should not only generate consistent contents, but also ensure the correctness and stability of generated content when dealing with unexpected application scenarios (e.g., toxic prompts, limited noise domain data, out-of-distribution (OOD) applications, etc.). In this survey paper, we conduct a thorough review of the robustness of LLMs, aiming to provide a comprehensive terminology of concepts and methods around this field and facilitate the community. Specifically, we first give a formal definition of LLM robustness and present the collection protocol of this survey paper. Then, based on the types of perturbed inputs, we organize this survey from the following perspectives: 1) Adversarial Robustness: tackling the problem that prompts are manipulated intentionally, such as noise prompts, long context, data attack, etc.; 2) OOD Robustness: dealing with the unexpected real-world application scenarios, such as OOD detection, zero-shot transferring, hallucinations, etc.; 3) Evaluation of Robustness: summarizing the new evaluation datasets, metrics, and tools for verifying the robustness of LLMs. After reviewing the representative work from each perspective, we discuss and highlight future opportunities and research directions in this field. Meanwhile, we also organize related works and provide an easy-to-search project (this https URL) to support the community.
摘要：大型语言模型（LLM）近年来由于理解和产生自然语言的能力而受到了极大的关注。随着快速发展和野生范围的应用（例如，代理，体现的智能），LLM的鲁棒性受到了越来越多的关注。作为许多AI应用的核心大脑，LLMS的鲁棒性要求模型不仅应产生一致的内容，而且还应确保在处理未扩展的应用程序场景时（例如，有毒的提示，有限的噪声域数据，Outof-Distibution（OOD）应用程序等），确保生成内容的正确性和稳定性。在本调查文件中，我们对LLM的鲁棒性进行了详尽的审查，旨在提供有关该领域概念和方法的全面术语，并促进社区。具体来说，我们首先给出了LLM鲁棒性的正式定义，并介绍了本调查文件的收集协议。然后，基于扰动输入的类型，我们从以下角度组织了这项调查：1）对抗性鲁棒性：解决有意操纵提示的问题，例如噪声提示，长上下文，数据攻击等； 2）OOD鲁棒性：处理意外的现实应用程序方案，例如OOD检测，零射击传输，幻觉等； 3）鲁棒性评估：总结新的评估数据集，指标和用于验证LLM鲁棒性的工具。在从每个角度审查了代表性工作之后，我们讨论并强调了该领域的未来机会和研究方向。同时，我们还组织了相关作品，并提供一个易于搜索的项目（此HTTPS URL）来支持社区。

Title: Manifesto from Dagstuhl Perspectives Workshop 24352 -- Conversational Agents: A Framework for Evaluation (CAFE)

Authors: Christine Bauer, Li Chen, Nicola Ferro, Norbert Fuhr, Avishek Anand, Timo Breuer, Guglielmo Faggioli, Ophir Frieder, Hideo Joho, Jussi Karlgren, Johannes Kiesel, Bart P. Knijnenburg, Aldo Lipani, Lien Michiels, Andrea Papenmeier, Maria Soledad Pera, Mark Sanderson, Scott Sanner, Benno Stein, Johanne R. Trippas, Karin Verspoor, Martijn C Willemsen
Subjects: cs.CL, cs.HC, cs.IR
Abstract URL: https://arxiv.org/abs/2506.11112
Pdf URL: https://arxiv.org/pdf/2506.11112
Copy Paste: [[2506.11112]] Manifesto from Dagstuhl Perspectives Workshop 24352 -- Conversational Agents: A Framework for Evaluation (CAFE)(https://arxiv.org/abs/2506.11112)
Keywords: agent
Abstract: During the workshop, we deeply discussed what CONversational Information ACcess (CONIAC) is and its unique features, proposing a world model abstracting it, and defined the Conversational Agents Framework for Evaluation (CAFE) for the evaluation of CONIAC systems, consisting of six major components: 1) goals of the system's stakeholders, 2) user tasks to be studied in the evaluation, 3) aspects of the users carrying out the tasks, 4) evaluation criteria to be considered, 5) evaluation methodology to be applied, and 6) measures for the quantitative criteria chosen.
摘要：在研讨会期间，我们深入讨论了哪些对话信息访问（圆锥形）及其独特的特征，提出了一个世界模型，并定义了对对话的代理评估框架（CAFE）进行评估Coniac系统，由六个主要组成部分进行评估，由六个主要组成部分，由系统的利益相关者的用户任务进行4个方面，包括该系统的用户任务，以下方面的用户任务，3）考虑到5）要应用的评估方法，以及6）所选定量标准的措施。

Title: Breaking the Reviewer: Assessing the Vulnerability of Large Language Models in Automated Peer Review Under Textual Adversarial Attacks

Authors: Tzu-Ling Lin, Wei-Chih Chen, Teng-Fang Hsiao, Hou-I Liu, Ya-Hsin Yeh, Yu Kai Chan, Wen-Sheng Lien, Po-Yen Kuo, Philip S. Yu, Hong-Han Shuai
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.11113
Pdf URL: https://arxiv.org/pdf/2506.11113
Copy Paste: [[2506.11113]] Breaking the Reviewer: Assessing the Vulnerability of Large Language Models in Automated Peer Review Under Textual Adversarial Attacks(https://arxiv.org/abs/2506.11113)
Keywords: language model, llm
Abstract: Peer review is essential for maintaining academic quality, but the increasing volume of submissions places a significant burden on reviewers. Large language models (LLMs) offer potential assistance in this process, yet their susceptibility to textual adversarial attacks raises reliability concerns. This paper investigates the robustness of LLMs used as automated reviewers in the presence of such attacks. We focus on three key questions: (1) The effectiveness of LLMs in generating reviews compared to human reviewers. (2) The impact of adversarial attacks on the reliability of LLM-generated reviews. (3) Challenges and potential mitigation strategies for LLM-based review. Our evaluation reveals significant vulnerabilities, as text manipulations can distort LLM assessments. We offer a comprehensive evaluation of LLM performance in automated peer reviewing and analyze its robustness against adversarial attacks. Our findings emphasize the importance of addressing adversarial risks to ensure AI strengthens, rather than compromises, the integrity of scholarly communication.
摘要：同行评审对于维持学术质量至关重要，但是越来越多的提交给审稿人带来了重大负担。大型语言模型（LLMS）在此过程中提供了潜在的帮助，但是它们对文本对抗攻击的敏感性引起了可靠性的问题。本文研究了在存在此类攻击的情况下用作自动审稿人的LLM的鲁棒性。我们关注三个关键问题：（1）LLM与人类审稿人相比生成评论的有效性。（2）对抗攻击对LLM生成评论的可靠性的影响。（3）基于LLM的评论的挑战和潜在缓解策略。我们的评估揭示了很大的漏洞，因为文本操作会扭曲LLM评估。我们在自动同行审查中对LLM性能进行全面评估，并分析其针对对抗性攻击的鲁棒性。我们的发现强调了应对对抗风险的重要性，以确保AI增强而不是妥协学术交流的完整性。

Title: KokushiMD-10: Benchmark for Evaluating Large Language Models on Ten Japanese National Healthcare Licensing Examinations

Authors: Junyu Liu, Kaiqi Yan, Tianyang Wang, Qian Niu, Momoko Nagai-Tanima, Tomoki Aoyama
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.11114
Pdf URL: https://arxiv.org/pdf/2506.11114
Copy Paste: [[2506.11114]] KokushiMD-10: Benchmark for Evaluating Large Language Models on Ten Japanese National Healthcare Licensing Examinations(https://arxiv.org/abs/2506.11114)
Keywords: language model, gpt, llm
Abstract: Recent advances in large language models (LLMs) have demonstrated notable performance in medical licensing exams. However, comprehensive evaluation of LLMs across various healthcare roles, particularly in high-stakes clinical scenarios, remains a challenge. Existing benchmarks are typically text-based, English-centric, and focus primarily on medicines, which limits their ability to assess broader healthcare knowledge and multimodal reasoning. To address these gaps, we introduce KokushiMD-10, the first multimodal benchmark constructed from ten Japanese national healthcare licensing exams. This benchmark spans multiple fields, including Medicine, Dentistry, Nursing, Pharmacy, and allied health professions. It contains over 11588 real exam questions, incorporating clinical images and expert-annotated rationales to evaluate both textual and visual reasoning. We benchmark over 30 state-of-the-art LLMs, including GPT-4o, Claude 3.5, and Gemini, across both text and image-based settings. Despite promising results, no model consistently meets passing thresholds across domains, highlighting the ongoing challenges in medical AI. KokushiMD-10 provides a comprehensive and linguistically grounded resource for evaluating and advancing reasoning-centric medical AI across multilingual and multimodal clinical tasks.
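Summary: LLMs have recently performed well on medical licensing exams, but evaluating them comprehensively across healthcare roles, especially in high-stakes clinical scenarios, remains a challenge. Existing benchmarks are typically text-based, English-centric, and focused primarily on medicine. We introduce KokushiMD-10, the first multimodal benchmark built from ten Japanese national healthcare licensing exams, spanning Medicine, Dentistry, Nursing, Pharmacy, and allied health professions. It contains over 11588 real exam questions with clinical images and expert-annotated rationales for evaluating both textual and visual reasoning. We benchmark over 30 state-of-the-art LLMs, including GPT-4o, Claude 3.5, and Gemini, in text and image-based settings; no model consistently meets passing thresholds across domains.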

Title: Incorporating Domain Knowledge into Materials Tokenization

Authors: Yerim Oh, Jun-Hyung Park, Junho Kim, SungHo Kim, SangKeun Lee
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.11115
Pdf URL: https://arxiv.org/pdf/2506.11115
Copy Paste: [[2506.11115]] Incorporating Domain Knowledge into Materials Tokenization(https://arxiv.org/abs/2506.11115)
Keywords: language model
Abstract: While language models are increasingly utilized in materials science, typical models rely on frequency-centric tokenization methods originally developed for natural language processing. However, these methods frequently produce excessive fragmentation and semantic loss, failing to maintain the structural and semantic integrity of material concepts. To address this issue, we propose MATTER, a novel tokenization approach that integrates material knowledge into tokenization. Based on MatDetector trained on our materials knowledge base and a re-ranking method prioritizing material concepts in token merging, MATTER maintains the structural integrity of identified material concepts and prevents fragmentation during tokenization, ensuring their semantic meaning remains intact. The experimental results demonstrate that MATTER outperforms existing tokenization methods, achieving an average performance gain of $4\%$ and $2\%$ in the generation and classification tasks, respectively. These results underscore the importance of domain knowledge for tokenization strategies in scientific text processing. Our code is available at this https URL
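Summary: Language models are increasingly used in materials science, yet typical models rely on frequency-centric tokenization methods developed for natural language processing, which often fragment material concepts and lose their meaning. We propose MATTER, a tokenization approach that integrates materials knowledge. It combines MatDetector, trained on our materials knowledge base, with a re-ranking method that prioritizes material concepts during token merging, so that identified concepts keep their structure and meaning. Experiments show average gains of 4% on generation tasks and 2% on classification tasks over existing tokenization methods. Our code is available at this https URL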

Title: Infinity Instruct: Scaling Instruction Selection and Synthesis to Enhance Language Models

Authors: Jijie Li, Li Du, Hanyu Zhao, Bo-wen Zhang, Liangdong Wang, Boyan Gao, Guang Liu, Yonghua Lin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.11116
Pdf URL: https://arxiv.org/pdf/2506.11116
Copy Paste: [[2506.11116]] Infinity Instruct: Scaling Instruction Selection and Synthesis to Enhance Language Models(https://arxiv.org/abs/2506.11116)
Keywords: language model, gpt, llm, chat
Abstract: Large Language Models (LLMs) demonstrate strong performance in real-world applications, yet existing open-source instruction datasets often concentrate on narrow domains, such as mathematics or coding, limiting generalization and widening the gap with proprietary models. To bridge this gap, we introduce Infinity-Instruct, a high-quality instruction dataset designed to enhance both foundational and chat capabilities of LLMs through a two-phase pipeline. In Phase 1, we curate 7.4M high-quality foundational instructions (InfInstruct-F-7.4M) from over 100M samples using hybrid data selection techniques. In Phase 2, we synthesize 1.5M high-quality chat instructions (InfInstruct-G-1.5M) through a two-stage process involving instruction selection, evolution, and diagnostic filtering. We empirically evaluate Infinity-Instruct by fine-tuning several open-source models, including Mistral, LLaMA, Qwen, and Yi, and observe substantial performance gains across both foundational and instruction-following benchmarks, consistently surpassing official instruction-tuned counterparts. Notably, InfInstruct-LLaMA3.1-70B outperforms GPT-4-0314 by 8.6% on instruction-following tasks while achieving comparable foundational performance. These results underscore the synergy between foundational and chat training and offer new insights into holistic LLM development. Our dataset (this https URL) and code (this https URL) have been publicly released.
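Summary: Existing open-source instruction datasets often concentrate on narrow domains such as mathematics or coding, which limits generalization and widens the gap with proprietary models. We introduce Infinity-Instruct, a high-quality instruction dataset built through a two-phase pipeline to strengthen both the foundational and chat capabilities of LLMs. Phase 1 curates 7.4M foundational instructions (InfInstruct-F-7.4M) from over 100M samples using hybrid data selection; Phase 2 synthesizes 1.5M chat instructions (InfInstruct-G-1.5M) through instruction selection, evolution, and diagnostic filtering. Fine-tuned Mistral, LLaMA, Qwen, and Yi models show substantial gains on foundational and instruction-following benchmarks and consistently surpass their official instruction-tuned counterparts. InfInstruct-LLaMA3.1-70B outperforms GPT-4-0314 by 8.6% on instruction-following tasks with comparable foundational performance. Our dataset and code have been publicly released.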

Title: ScIRGen: Synthesize Realistic and Large-Scale RAG Dataset for Scientific Research

Authors: Junyong Lin, Lu Dai, Ruiqian Han, Yijie Sui, Ruilin Wang, Xingliang Sun, Qinglin Wu, Min Feng, Hao Liu, Hui Xiong
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2506.11117
Pdf URL: https://arxiv.org/pdf/2506.11117
Copy Paste: [[2506.11117]] ScIRGen: Synthesize Realistic and Large-Scale RAG Dataset for Scientific Research(https://arxiv.org/abs/2506.11117)
Keywords: llm, retrieval-augmented generation
Abstract: Scientific researchers need intensive information about datasets to effectively evaluate and develop theories and methodologies. The information needs regarding datasets are implicitly embedded in particular research tasks, rather than explicitly expressed in search queries. However, existing scientific retrieval and question-answering (QA) datasets typically address straightforward questions, which do not align with the distribution of real-world research inquiries. To bridge this gap, we developed ScIRGen, a dataset generation framework for scientific QA & retrieval that more accurately reflects the information needs of professional science researchers, and used it to create a large-scale scientific retrieval-augmented generation (RAG) dataset with realistic queries, datasets and papers. Technically, we designed a dataset-oriented information extraction method that leverages academic papers to augment the dataset representation. We then proposed a question generation framework that employs cognitive taxonomy to ensure the quality of synthesized questions. We also designed a method to automatically filter synthetic answers based on the perplexity shift of LLMs, which is highly aligned with human judgment of answers' validity. Collectively, these methodologies culminated in the creation of the 61k QA dataset, ScIRGen-Geo. We benchmarked representative methods on the ScIRGen-Geo dataset for their question-answering and retrieval capabilities, finding that current methods still struggle with reasoning over complex questions. This work advances the development of more sophisticated tools to support the intricate information needs of the scientific community.
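Summary: Researchers' information needs about datasets are implicit in their research tasks rather than stated in search queries, yet existing scientific retrieval and QA datasets mostly cover straightforward questions. We developed ScIRGen, a dataset generation framework for scientific QA and retrieval that better reflects these needs. It extracts dataset-oriented information from academic papers to enrich dataset representations, generates questions guided by a cognitive taxonomy, and automatically filters synthetic answers using the perplexity shift of LLMs, which aligns closely with human judgments of answer validity. The result is ScIRGen-Geo, a 61k QA dataset; benchmarks on it show that current methods still struggle with reasoning over complex questions.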

Title: Benchmarking Foundation Speech and Language Models for Alzheimer's Disease and Related Dementia Detection from Spontaneous Speech

Authors: Jingyu Li, Lingchao Mao, Hairong Wang, Zhendong Wang, Xi Mao, Xuelei Sherry Ni
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2506.11119
Pdf URL: https://arxiv.org/pdf/2506.11119
Copy Paste: [[2506.11119]] Benchmarking Foundation Speech and Language Models for Alzheimer's Disease and Related Dementia Detection from Spontaneous Speech(https://arxiv.org/abs/2506.11119)
Keywords: language model
Abstract: Background: Alzheimer's disease and related dementias (ADRD) are progressive neurodegenerative conditions where early detection is vital for timely intervention and care. Spontaneous speech contains rich acoustic and linguistic markers that may serve as non-invasive biomarkers for cognitive decline. Foundation models, pre-trained on large-scale audio or text data, produce high-dimensional embeddings encoding contextual and acoustic features. Methods: We used the PREPARE Challenge dataset, which includes audio recordings from over 1,600 participants with three cognitive statuses: healthy control (HC), mild cognitive impairment (MCI), and Alzheimer's Disease (AD). We excluded non-English, non-spontaneous, or poor-quality recordings. The final dataset included 703 (59.13%) HC, 81 (6.81%) MCI, and 405 (34.06%) AD cases. We benchmarked a range of open-source foundation speech and language models to classify cognitive status into the three categories. Results: The Whisper-medium model achieved the highest performance among speech models (accuracy = 0.731, AUC = 0.802). Among language models, BERT with pause annotation performed best (accuracy = 0.662, AUC = 0.744). ADRD detection using state-of-the-art automatic speech recognition (ASR) model-generated audio embeddings outperformed others. Including non-semantic features like pause patterns consistently improved text-based classification. Conclusion: This study introduces a benchmarking framework using foundation models and a clinically relevant dataset. Acoustic-based approaches -- particularly ASR-derived embeddings -- demonstrate strong potential for scalable, non-invasive, and cost-effective early detection of ADRD.
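Summary: Background: Alzheimer's disease and related dementias (ADRD) are progressive neurodegenerative conditions for which early detection is vital, and spontaneous speech carries acoustic and linguistic markers that may serve as non-invasive biomarkers of cognitive decline. Methods: From the PREPARE Challenge dataset (over 1,600 participants), we excluded non-English, non-spontaneous, and poor-quality recordings, leaving 703 (59.13%) healthy control (HC), 81 (6.81%) mild cognitive impairment (MCI), and 405 (34.06%) Alzheimer's Disease (AD) cases, and benchmarked a range of open-source foundation speech and language models on three-way classification. Results: Whisper-medium was the best speech model (accuracy = 0.731, AUC = 0.802), and BERT with pause annotation was the best language model (accuracy = 0.662, AUC = 0.744); adding non-semantic features such as pause patterns consistently improved text-based classification. Conclusion: Acoustic approaches, particularly ASR-derived embeddings, show strong potential for scalable, non-invasive, and cost-effective early detection of ADRD.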
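The class proportions quoted in this abstract can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: it uses the three case counts stated in the abstract, and the variable names are assumptions of this example rather than anything taken from the paper.

```python
# Case counts as stated in the abstract (HC, MCI, AD).
counts = {"HC": 703, "MCI": 81, "AD": 405}

# Size of the final dataset after exclusions.
total = sum(counts.values())

# Share of each class as a percentage, rounded to two decimals.
shares = {label: round(100 * n / total, 2) for label, n in counts.items()}

print(total)
print(shares)
```

The counts add up to 1189 recordings, and the healthy-control class is more than eight times the size of the MCI class, which is one reason to read the reported AUC values alongside accuracy.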

Title: SDMPrune: Self-Distillation MLP Pruning for Efficient Large Language Models

Authors: Hourun Zhu, Chengchao Shen
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.11120
Pdf URL: https://arxiv.org/pdf/2506.11120
Copy Paste: [[2506.11120]] SDMPrune: Self-Distillation MLP Pruning for Efficient Large Language Models(https://arxiv.org/abs/2506.11120)
Keywords: language model, llm
Abstract: In spite of strong performance achieved by LLMs, the costs of their deployment are unaffordable. For the compression of LLMs, gradient-based pruning methods present promising effectiveness. However, in these methods, the gradient computation with one-hot labels ignores the potential predictions on other words, thus missing key information for generative capability of the original model. To address this issue, we introduce a self-distillation loss during the pruning phase (rather than post-training) to fully exploit the predictions of the original model, thereby obtaining more accurate gradient information for pruning. Moreover, we find that, compared to attention modules, the predictions of LLMs are less sensitive to multilayer perceptron (MLP) modules, which hold more than $5\times$ as many parameters (LLaMA3.2-1.2B). To this end, we focus on the pruning of MLP modules, to significantly compress LLMs without obvious performance degradation. Experimental results on extensive zero-shot benchmarks demonstrate that our method significantly outperforms existing pruning methods. Furthermore, our method achieves very competitive performance among 1B-scale open source LLMs. The source code and trained weights are available at this https URL.
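Summary: LLMs perform strongly but are costly to deploy. Gradient-based pruning is a promising way to compress them, but computing gradients with one-hot labels ignores the model's predictions for other words and so loses information about the original model's generative capability. We introduce a self-distillation loss during the pruning phase (rather than post-training) to exploit the original model's predictions and obtain more accurate gradient information. Because LLM predictions are less sensitive to multilayer perceptron (MLP) modules than to attention modules, and MLP modules hold more than 5 times as many parameters (LLaMA3.2-1.2B), we focus pruning on the MLP modules. On extensive zero-shot benchmarks our method significantly outperforms existing pruning methods and is very competitive among 1B-scale open-source LLMs. The source code and trained weights are available at this https URL.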

Title: SUTA-LM: Bridging Test-Time Adaptation and Language Model Rescoring for Robust ASR

Authors: Wei-Ping Huang, Guan-Ting Lin, Hung-yi Lee
Subjects: cs.CL, cs.AI, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2506.11121
Pdf URL: https://arxiv.org/pdf/2506.11121
Copy Paste: [[2506.11121]] SUTA-LM: Bridging Test-Time Adaptation and Language Model Rescoring for Robust ASR(https://arxiv.org/abs/2506.11121)
Keywords: language model
Abstract: Despite progress in end-to-end ASR, real-world domain mismatches still cause performance drops, which Test-Time Adaptation (TTA) aims to mitigate by adjusting models during inference. Recent work explores combining TTA with external language models, using techniques like beam search rescoring or generative error correction. In this work, we identify a previously overlooked challenge: TTA can interfere with language model rescoring, revealing the nontrivial nature of effectively combining the two methods. Based on this insight, we propose SUTA-LM, a simple yet effective extension of SUTA, an entropy-minimization-based TTA approach, with language model rescoring. SUTA-LM first applies a controlled adaptation process guided by an auto-step selection mechanism leveraging both acoustic and linguistic information, followed by language model rescoring to refine the outputs. Experiments on 18 diverse ASR datasets show that SUTA-LM achieves robust results across a wide range of domains.
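Summary: Real-world domain mismatches still degrade end-to-end ASR, and Test-Time Adaptation (TTA) aims to mitigate this by adjusting models during inference. Recent work combines TTA with external language models through beam search rescoring or generative error correction. We identify a previously overlooked problem: TTA can interfere with language model rescoring, so combining the two is nontrivial. We propose SUTA-LM, which extends SUTA, an entropy-minimization-based TTA approach, with language model rescoring. It first runs a controlled adaptation process guided by an auto-step selection mechanism that uses both acoustic and linguistic information, then applies language model rescoring to refine the outputs. Experiments on 18 diverse ASR datasets show robust results across a wide range of domains.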

Title: ASRJam: Human-Friendly AI Speech Jamming to Prevent Automated Phone Scams

Authors: Freddie Grabovski, Gilad Gressel, Yisroel Mirsky
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.11125
Pdf URL: https://arxiv.org/pdf/2506.11125
Copy Paste: [[2506.11125]] ASRJam: Human-Friendly AI Speech Jamming to Prevent Automated Phone Scams(https://arxiv.org/abs/2506.11125)
Keywords: language model, llm
Abstract: Large Language Models (LLMs), combined with Text-to-Speech (TTS) and Automatic Speech Recognition (ASR), are increasingly used to automate voice phishing (vishing) scams. These systems are scalable and convincing, posing a significant security threat. We identify the ASR transcription step as the most vulnerable link in the scam pipeline and introduce ASRJam, a proactive defence framework that injects adversarial perturbations into the victim's audio to disrupt the attacker's ASR. This breaks the scam's feedback loop without affecting human callers, who can still understand the conversation. While prior adversarial audio techniques are often unpleasant and impractical for real-time use, we also propose EchoGuard, a novel jammer that leverages natural distortions, such as reverberation and echo, that are disruptive to ASR but tolerable to humans. To evaluate EchoGuard's effectiveness and usability, we conducted a 39-person user study comparing it with three state-of-the-art attacks. Results show that EchoGuard achieved the highest overall utility, offering the best combination of ASR disruption and human listening experience.
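Summary: LLMs combined with Text-to-Speech (TTS) and Automatic Speech Recognition (ASR) are increasingly used to automate voice phishing (vishing) scams that are scalable and convincing. We identify the ASR transcription step as the most vulnerable link in the scam pipeline and introduce ASRJam, a proactive defence framework that injects adversarial perturbations into the victim's audio to disrupt the attacker's ASR, breaking the scam's feedback loop while human callers can still follow the conversation. Because prior adversarial audio techniques are often unpleasant and impractical for real-time use, we also propose EchoGuard, a jammer that uses natural distortions such as reverberation and echo, which disrupt ASR but are tolerable to humans. In a 39-person user study against three state-of-the-art attacks, EchoGuard achieved the highest overall utility, offering the best combination of ASR disruption and human listening experience.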

Title: GUIRoboTron-Speech: Towards Automated GUI Agents Based on Speech Instructions

Authors: Wenkang Han, Zhixiong Zeng, Jing Huang, Shu Jiang, Liming Zheng, Longrong Yang, Haibo Qiu, Chang Yao, Jingyuan Chen, Lin Ma
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.11127
Pdf URL: https://arxiv.org/pdf/2506.11127
Copy Paste: [[2506.11127]] GUIRoboTron-Speech: Towards Automated GUI Agents Based on Speech Instructions(https://arxiv.org/abs/2506.11127)
Keywords: agent
Abstract: Autonomous agents for Graphical User Interfaces (GUIs) are revolutionizing human-computer interaction, yet their reliance on text-based instructions imposes limitations on accessibility and convenience, particularly in hands-free scenarios. To address this gap, we propose GUIRoboTron-Speech, the first end-to-end autonomous GUI agent that directly accepts speech instructions and on-device screenshots to predict actions. Confronted with the scarcity of speech-based GUI agent datasets, we initially generated high-quality speech instructions for training by leveraging a random timbre text-to-speech (TTS) model to convert existing text instructions. We then develop GUIRoboTron-Speech's capabilities through progressive grounding and planning training stages. A key contribution is a heuristic mixed-instruction training strategy designed to mitigate the modality imbalance inherent in pre-trained foundation models. Comprehensive experiments on several benchmark datasets validate the robust and superior performance of GUIRoboTron-Speech, demonstrating the significant potential and widespread applicability of speech as an effective instruction modality for driving GUI agents. Our code and datasets are available at this https URL.
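Summary: Autonomous agents for Graphical User Interfaces (GUIs) are changing human-computer interaction, but their reliance on text-based instructions limits accessibility and convenience, particularly in hands-free scenarios. We propose GUIRoboTron-Speech, the first end-to-end autonomous GUI agent that directly accepts speech instructions and on-device screenshots to predict actions. Because speech-based GUI agent datasets are scarce, we first generate high-quality speech instructions by converting existing text instructions with a random-timbre text-to-speech (TTS) model, then develop the agent's capabilities through progressive grounding and planning training stages. A heuristic mixed-instruction training strategy mitigates the modality imbalance inherent in pre-trained foundation models. Experiments on several benchmark datasets validate its robust and superior performance. Our code and datasets are available at this https URL.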

Title: Stronger Language Models Produce More Human-Like Errors

Authors: Andrew Keenan Richardson, Ryan Othniel Kearns, Sean Moss, Vincent Wang-Mascianica, Philipp Koralus
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.11128
Pdf URL: https://arxiv.org/pdf/2506.11128
Copy Paste: [[2506.11128]] Stronger Language Models Produce More Human-Like Errors(https://arxiv.org/abs/2506.11128)
Keywords: language model, chat
Abstract: Do language models converge toward human-like reasoning patterns as they improve? We provide surprising evidence that while overall reasoning capabilities increase with model sophistication, the nature of errors increasingly mirrors predictable human reasoning fallacies: a previously unobserved inverse scaling phenomenon. To investigate this question, we apply the Erotetic Theory of Reasoning (ETR), a formal cognitive framework with empirical support for predicting human reasoning outcomes. Using the open-source package PyETR, we generate logical reasoning problems where humans predictably err, evaluating responses from 38 language models across 383 reasoning tasks. Our analysis indicates that as models advance in general capability (as measured by Chatbot Arena scores), the proportion of their incorrect answers that align with ETR-predicted human fallacies tends to increase ($\rho = 0.360, p = 0.0265$). Notably, as we observe no correlation between model sophistication and logical correctness on these tasks, this shift in error patterns toward human-likeness occurs independently of error rate. These findings challenge the prevailing view that scaling language models naturally obtains normative rationality, suggesting instead a convergence toward human-like cognition inclusive of our characteristic biases and limitations, as we further confirm by demonstrating order-effects in language model reasoning.
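Summary: Do language models converge toward human-like reasoning patterns as they improve? We find that while overall reasoning capability increases with model sophistication, the nature of the errors increasingly mirrors predictable human reasoning fallacies, a previously unobserved inverse scaling phenomenon. We apply the Erotetic Theory of Reasoning (ETR), a formal cognitive framework with empirical support for predicting human reasoning outcomes, and use the open-source package PyETR to generate logical reasoning problems on which humans predictably err, evaluating 38 language models across 383 reasoning tasks. As models advance in general capability (measured by Chatbot Arena scores), the proportion of their incorrect answers that match ETR-predicted human fallacies tends to increase ($\rho = 0.360, p = 0.0265$). Because we observe no correlation between model sophistication and logical correctness on these tasks, this shift occurs independently of error rate. The findings challenge the view that scaling naturally yields normative rationality, which we further confirm by demonstrating order effects in language model reasoning.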
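The abstract reports a correlation as $\rho$ with a p-value but does not name the statistic. Assuming $\rho$ denotes Spearman's rank correlation (its conventional symbol), the following dependency-free sketch shows how such a coefficient is computed. The helper names `ranks` and `spearman` and the toy inputs are hypothetical illustrations, not data or code from the paper.

```python
from statistics import mean

def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole block of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Toy inputs (made up): any strictly increasing relationship gives rho = 1.
rho_up = spearman([1, 2, 3, 4], [10, 20, 30, 40])
rho_down = spearman([1, 2, 3], [3, 2, 1])
```

Because the coefficient depends only on ranks, it measures whether one quantity tends to rise with the other without assuming a linear relationship.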

Title: Trustworthy AI for Medicine: Continuous Hallucination Detection and Elimination with CHECK

Authors: Carlos Garcia-Fernandez, Luis Felipe, Monique Shotande, Muntasir Zitu, Aakash Tripathi, Ghulam Rasool, Issam El Naqa, Vivek Rudrapatna, Gilmer Valdes
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.11129
Pdf URL: https://arxiv.org/pdf/2506.11129
Copy Paste: [[2506.11129]] Trustworthy AI for Medicine: Continuous Hallucination Detection and Elimination with CHECK(https://arxiv.org/abs/2506.11129)
Keywords: language model, gpt, llm, hallucination
Abstract: Large language models (LLMs) show promise in healthcare, but hallucinations remain a major barrier to clinical use. We present CHECK, a continuous-learning framework that integrates structured clinical databases with a classifier grounded in information theory to detect both factual and reasoning-based hallucinations. Evaluated on 1500 questions from 100 pivotal clinical trials, CHECK reduced LLama3.3-70B-Instruct hallucination rates from 31% to 0.3% - making an open source model state of the art. Its classifier generalized across medical benchmarks, achieving AUCs of 0.95-0.96, including on the MedQA (USMLE) benchmark and HealthBench realistic multi-turn medical questioning. By leveraging hallucination probabilities to guide GPT-4o's refinement and judiciously escalate compute, CHECK boosted its USMLE passing rate by 5 percentage points, achieving a state-of-the-art 92.1%. By suppressing hallucinations below accepted clinical error thresholds, CHECK offers a scalable foundation for safe LLM deployment in medicine and other high-stakes domains.
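Summary: LLMs show promise in healthcare, but hallucinations remain a major barrier to clinical use. We present CHECK, a continuous-learning framework that integrates structured clinical databases with a classifier grounded in information theory to detect both factual and reasoning-based hallucinations. On 1500 questions from 100 pivotal clinical trials, CHECK reduced LLama3.3-70B-Instruct hallucination rates from 31% to 0.3%, making an open source model state of the art. Its classifier generalized across medical benchmarks with AUCs of 0.95-0.96, including the MedQA (USMLE) benchmark and HealthBench realistic multi-turn medical questioning. Using hallucination probabilities to guide GPT-4o's refinement and judiciously escalate compute, CHECK raised its USMLE passing rate by 5 percentage points to a state-of-the-art 92.1%. By suppressing hallucinations below accepted clinical error thresholds, CHECK offers a scalable foundation for safe LLM deployment in medicine and other high-stakes domains.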

Title: Large Language Models and Emergence: A Complex Systems Perspective

Authors: David C. Krakauer, John W. Krakauer, Melanie Mitchell
Subjects: cs.CL, cs.AI, cs.LG, cs.NE
Abstract URL: https://arxiv.org/abs/2506.11135
Pdf URL: https://arxiv.org/pdf/2506.11135
Copy Paste: [[2506.11135]] Large Language Models and Emergence: A Complex Systems Perspective(https://arxiv.org/abs/2506.11135)
Keywords: language model, llm
Abstract: Emergence is a concept in complexity science that describes how many-body systems manifest novel higher-level properties, properties that can be described by replacing high-dimensional mechanisms with lower-dimensional effective variables and theories. This is captured by the idea "more is different". Intelligence is a consummate emergent property manifesting increasingly efficient -- cheaper and faster -- uses of emergent capabilities to solve problems. This is captured by the idea "less is more". In this paper, we first examine claims that Large Language Models exhibit emergent capabilities, reviewing several approaches to quantifying emergence, and secondly ask whether LLMs possess emergent intelligence.
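Summary: Emergence is a concept in complexity science describing how many-body systems manifest novel higher-level properties, which can be described by replacing high-dimensional mechanisms with lower-dimensional effective variables and theories; this is captured by the idea "more is different". Intelligence is a consummate emergent property that manifests increasingly efficient (cheaper and faster) uses of emergent capabilities to solve problems, captured by the idea "less is more". In this paper, we first examine claims that Large Language Models exhibit emergent capabilities, reviewing several approaches to quantifying emergence, and then ask whether LLMs possess emergent intelligence.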

Title: Scalable Medication Extraction and Discontinuation Identification from Electronic Health Records Using Large Language Models

Authors: Chong Shao, Douglas Snyder, Chiran Li, Bowen Gu, Kerry Ngan, Chun-Ting Yang, Jiageng Wu, Richard Wyss, Kueiyu Joshua Lin, Jie Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.11137
Pdf URL: https://arxiv.org/pdf/2506.11137
Copy Paste: [[2506.11137]] Scalable Medication Extraction and Discontinuation Identification from Electronic Health Records Using Large Language Models(https://arxiv.org/abs/2506.11137)
Keywords: language model, gpt, llm, prompt
Abstract: Identifying medication discontinuations in electronic health records (EHRs) is vital for patient safety but is often hindered by information being buried in unstructured notes. This study aims to evaluate the capabilities of advanced open-sourced and proprietary large language models (LLMs) in extracting medications and classifying their medication status from EHR notes, focusing on their scalability on medication information extraction without human annotation. We collected three EHR datasets from diverse sources to build the evaluation benchmark. We evaluated 12 advanced LLMs and explored multiple LLM prompting strategies. Performance on medication extraction, medication status classification, and their joint task (extraction then classification) was systematically compared across all experiments. We found that LLMs showed promising performance on the medication extraction and discontinuation classification from EHR notes. GPT-4o consistently achieved the highest average F1 scores in all tasks under zero-shot setting: 94.0% for medication extraction, 78.1% for discontinuation classification, and 72.7% for the joint task. Open-sourced models followed closely; Llama-3.1-70B-Instruct achieved the highest performance in medication status classification on the MIV-Med dataset (68.7%) and in the joint task on both the Re-CASI (76.2%) and MIV-Med (60.2%) datasets. Medical-specific LLMs demonstrated lower performance compared to advanced general-domain LLMs. Few-shot learning generally improved performance, while CoT reasoning showed inconsistent gains. LLMs demonstrate strong potential for medication extraction and discontinuation identification on EHR notes; open-sourced models offer scalable alternatives to proprietary systems, and few-shot learning can further improve LLMs' capability.
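Summary: Identifying medication discontinuations in electronic health records (EHRs) is vital for patient safety but is often hindered because the information is buried in unstructured notes. We evaluate advanced open-sourced and proprietary LLMs on extracting medications and classifying medication status from EHR notes without human annotation, using three EHR datasets from diverse sources, 12 LLMs, and multiple prompting strategies. Under the zero-shot setting, GPT-4o consistently achieved the highest average F1 scores: 94.0% for medication extraction, 78.1% for discontinuation classification, and 72.7% for the joint task. Open-sourced models followed closely; Llama-3.1-70B-Instruct achieved the highest performance in medication status classification on the MIV-Med dataset (68.7%) and in the joint task on both the Re-CASI (76.2%) and MIV-Med (60.2%) datasets. Medical-specific LLMs performed worse than advanced general-domain LLMs. Few-shot learning generally improved performance, while CoT reasoning showed inconsistent gains.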

Title: Iterative Multilingual Spectral Attribute Erasure

Authors: Shun Shao, Yftah Ziser, Zheng Zhao, Yifu Qiu, Shay B. Cohen, Anna Korhonen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.11244
Pdf URL: https://arxiv.org/pdf/2506.11244
Copy Paste: [[2506.11244]] Iterative Multilingual Spectral Attribute Erasure(https://arxiv.org/abs/2506.11244)
Keywords: language model
Abstract: Multilingual representations embed words with similar meanings to share a common semantic space across languages, creating opportunities to transfer debiasing effects between languages. However, existing methods for debiasing are unable to exploit this opportunity because they operate on individual languages. We present Iterative Multilingual Spectral Attribute Erasure (IMSAE), which identifies and mitigates joint bias subspaces across multiple languages through iterative SVD-based truncation. Evaluating IMSAE across eight languages and five demographic dimensions, we demonstrate its effectiveness in both standard and zero-shot settings, where target language data is unavailable, but linguistically similar languages can be used for debiasing. Our comprehensive experiments across diverse language models (BERT, LLaMA, Mistral) show that IMSAE outperforms traditional monolingual and cross-lingual approaches while maintaining model utility.
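Summary: Multilingual representations embed words with similar meanings in a semantic space shared across languages, which creates opportunities to transfer debiasing effects between languages. Existing debiasing methods cannot exploit this because they operate on individual languages. We present Iterative Multilingual Spectral Attribute Erasure (IMSAE), which identifies and mitigates joint bias subspaces across multiple languages through iterative SVD-based truncation. Across eight languages and five demographic dimensions, IMSAE is effective in both standard and zero-shot settings, where target-language data is unavailable but linguistically similar languages can be used for debiasing. Experiments with BERT, LLaMA, and Mistral show that IMSAE outperforms traditional monolingual and cross-lingual approaches while maintaining model utility.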
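The abstract describes the method only as "iterative SVD-based truncation" of bias subspaces. As a rough illustration of that general idea, and not the authors' implementation, the NumPy sketch below repeatedly takes the SVD of the cross-covariance between representations and an attribute signal and projects out the top singular directions. The function name `iterative_spectral_erasure`, the synthetic data, and the choice of `k` and `iterations` are all assumptions of this example.

```python
import numpy as np

def iterative_spectral_erasure(X, Z, k=1, iterations=2):
    """Remove directions of X (n, d) that covary with attributes Z (n, m).

    Hypothetical helper: each round takes the SVD of the cross-covariance
    between the current representations and Z, then projects out the
    top-k left singular vectors.
    """
    Xc = X - X.mean(axis=0)
    Zc = Z - Z.mean(axis=0)
    d = Xc.shape[1]
    P = np.eye(d)  # accumulated projection matrix
    for _ in range(iterations):
        C = (Xc @ P).T @ Zc / len(Xc)        # (d, m) cross-covariance
        U, _, _ = np.linalg.svd(C, full_matrices=False)
        U_k = U[:, :k]                        # strongest attribute directions
        P = P @ (np.eye(d) - U_k @ U_k.T)     # truncate those directions
    return Xc @ P, P

# Synthetic data: 8-dimensional representations that leak two attribute signals.
rng = np.random.default_rng(0)
Z = rng.normal(size=(500, 2))
W = rng.normal(size=(2, 8))
X = Z @ W + 0.1 * rng.normal(size=(500, 8))

X_clean, P = iterative_spectral_erasure(X, Z, k=1, iterations=2)

# Cross-covariance mass before and after erasure.
Zc = Z - Z.mean(axis=0)
before = np.linalg.norm((X - X.mean(axis=0)).T @ Zc)
after = np.linalg.norm(X_clean.T @ Zc)
```

In this toy setting two rounds with `k=1` remove both attribute directions, so the remaining cross-covariance is zero up to floating-point error. How the paper pools data across languages and chooses the truncation rank is not specified in the abstract and is not modelled here.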

Title: No Universal Prompt: Unifying Reasoning through Adaptive Prompting for Temporal Table Reasoning

Authors: Kushagra Dixit, Abhishek Rajgaria, Harshavardhan Kalalbandi, Dan Roth, Vivek Gupta
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.11246
Pdf URL: https://arxiv.org/pdf/2506.11246
Copy Paste: [[2506.11246]] No Universal Prompt: Unifying Reasoning through Adaptive Prompting for Temporal Table Reasoning(https://arxiv.org/abs/2506.11246)
Keywords: language model, llm, prompt
Abstract: Temporal Table Reasoning is a critical challenge for Large Language Models (LLMs), requiring effective prompting techniques to extract relevant insights. Despite the existence of multiple prompting methods, their impact on table reasoning remains largely unexplored. Furthermore, the performance of these models varies drastically across different table and context structures, making it difficult to determine an optimal approach. This work investigates multiple prompting techniques across diverse table types to determine optimal approaches for different scenarios. We find that performance varies based on entity type, table structure, requirement of additional context and question complexity, with no single method consistently outperforming others. To mitigate these challenges, we introduce SEAR, an adaptive prompting framework inspired by human reasoning that dynamically adjusts based on context characteristics and integrates structured reasoning. Our results demonstrate that SEAR achieves superior performance across all table types compared to other baseline prompting techniques. Additionally, we explore the impact of table structure refactoring, finding that a unified representation enhances the model's reasoning.
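Summary: Temporal table reasoning is a critical challenge for LLMs and requires effective prompting techniques to extract relevant insights. Although many prompting methods exist, their impact on table reasoning remains largely unexplored, and performance varies drastically across table and context structures. We investigate multiple prompting techniques across diverse table types and find that performance depends on entity type, table structure, the need for additional context, and question complexity, with no single method consistently outperforming the others. We introduce SEAR, an adaptive prompting framework inspired by human reasoning that adjusts dynamically to context characteristics and integrates structured reasoning. SEAR achieves superior performance across all table types compared with baseline prompting techniques. We also find that refactoring tables into a unified representation enhances model reasoning.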

Title: Learning a Continue-Thinking Token for Enhanced Test-Time Scaling

Authors: Liran Ringel, Elad Tolochinsky, Yaniv Romano
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.11274
Pdf URL: https://arxiv.org/pdf/2506.11274
Copy Paste: [[2506.11274]] Learning a Continue-Thinking Token for Enhanced Test-Time Scaling(https://arxiv.org/abs/2506.11274)
Keywords: language model
Abstract: Test-time scaling has emerged as an effective approach for improving language model performance by utilizing additional compute at inference time. Recent studies have shown that overriding end-of-thinking tokens (e.g., replacing "" with "Wait") can extend reasoning steps and improve accuracy. In this work, we explore whether a dedicated continue-thinking token can be learned to trigger extended reasoning. We augment a distilled version of DeepSeek-R1 with a single learned "<|continue-thinking|>" token, training only its embedding via reinforcement learning while keeping the model weights frozen. Our experiments show that this learned token achieves improved accuracy on standard math benchmarks compared to both the baseline model and a test-time scaling approach that uses a fixed token (e.g., "Wait") for budget forcing. In particular, we observe that in cases where the fixed-token approach enhances the base model's accuracy, our method achieves a markedly greater improvement. For example, on the GSM8K benchmark, the fixed-token approach yields a 1.3% absolute improvement in accuracy, whereas our learned-token method achieves a 4.2% improvement over the base model that does not use budget forcing.
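
The budget-forcing idea in this abstract can be illustrated with a minimal sketch. It assumes a hypothetical `next_token` callable standing in for a model's decoding step, and it assumes the end-of-thinking marker is spelled `</think>`; neither the function names nor the toy model come from the paper. A learned continue-thinking token would be substituted for the fixed `"Wait"` string in the same place.

```python
from typing import Callable, List

END_OF_THINKING = "</think>"  # assumed spelling of the end-of-thinking marker
CONTINUE_TOKEN = "Wait"       # fixed token; a learned token would slot in here


def budget_forced_decode(
    next_token: Callable[[List[str]], str],
    prompt: List[str],
    max_forced: int = 2,
    max_tokens: int = 50,
) -> List[str]:
    """Decode token by token, overriding early end-of-thinking tokens.

    `next_token` is a hypothetical stand-in for one model decoding step.
    """
    tokens = list(prompt)
    forced = 0
    while len(tokens) - len(prompt) < max_tokens:
        tok = next_token(tokens)
        if tok == END_OF_THINKING and forced < max_forced:
            # Suppress the stop and nudge the model to keep reasoning.
            tokens.append(CONTINUE_TOKEN)
            forced += 1
            continue
        tokens.append(tok)
        if tok == END_OF_THINKING:
            break
    return tokens


def toy_model(tokens: List[str]) -> str:
    """Toy 'model': emits one reasoning step, then tries to stop."""
    return "step" if tokens[-1] in ("Q", CONTINUE_TOKEN) else END_OF_THINKING


trace = budget_forced_decode(toy_model, ["Q"], max_forced=2)
```

With `max_forced=2`, the toy model's first two attempts to stop are replaced by the continuation token, and only the third end-of-thinking token is allowed through.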

Title: Beyond Random Sampling: Efficient Language Model Pretraining via Curriculum Learning

Authors: Yang Zhang, Amr Mohamed, Hadi Abdine, Guokan Shang, Michalis Vazirgiannis
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.11300
Pdf URL: https://arxiv.org/pdf/2506.11300
Copy Paste: [[2506.11300]] Beyond Random Sampling: Efficient Language Model Pretraining via Curriculum Learning(https://arxiv.org/abs/2506.11300)
Keywords: language model, prompt
Abstract: Curriculum learning has shown promise in improving training efficiency and generalization in various machine learning domains, yet its potential in pretraining language models remains underexplored, prompting our work as the first systematic investigation in this area. We experiment with different settings, including vanilla curriculum learning, pacing-based sampling, and interleaved curricula, guided by six difficulty metrics spanning linguistic and information-theoretic perspectives. We train models under these settings and evaluate their performance on eight diverse benchmarks. Our experiments reveal that curriculum learning consistently improves convergence in early and mid-training phases, and can yield lasting gains when used as a warmup strategy with up to $3.5\%$ improvement. Notably, we identify compression ratio, lexical diversity, and readability as effective difficulty signals across settings. Our findings highlight the importance of data ordering in large-scale pretraining and provide actionable insights for scalable, data-efficient model development under realistic training scenarios.
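
As a rough illustration of using compression ratio as a difficulty signal, the sketch below orders documents from most to least compressible. The direction (more compressible means easier) and the use of `zlib` are assumptions made for this example; the abstract does not specify how the metric is computed or oriented.

```python
import zlib
from typing import List


def compression_ratio(text: str) -> float:
    """Compressed size divided by raw size; lower means more redundant text."""
    raw = text.encode("utf-8")
    return len(zlib.compress(raw)) / max(len(raw), 1)


def curriculum_order(docs: List[str]) -> List[str]:
    """Sort documents from 'easy' (highly compressible) to 'hard' (assumed direction)."""
    return sorted(docs, key=compression_ratio)


repetitive = "the cat sat. " * 40
varied = "Quantum annealing exploits tunnelling to escape local minima in rugged energy landscapes."
ordered = curriculum_order([varied, repetitive])
```

A vanilla curriculum would then feed `ordered` to the trainer front to back; pacing-based or interleaved variants would sample from it differently.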

Title: Don't Pay Attention

Authors: Mohammad Hammoud, Devang Acharya
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.11305
Pdf URL: https://arxiv.org/pdf/2506.11305
Copy Paste: [[2506.11305]] Don't Pay Attention(https://arxiv.org/abs/2506.11305)
Keywords: language model
Abstract: The Transformer has become the de facto standard for large language models and a wide range of downstream tasks across various domains. Despite its numerous advantages like inherent training parallelism, the Transformer still faces key challenges due to its inability to effectively process sequences beyond a fixed context window and the quadratic complexity of its attention mechanism. These challenges have renewed interest in RNN-like architectures, which offer linear scaling with sequence length and improved handling of long-range dependencies, albeit with limited parallelism due to their inherently recurrent nature. In this paper, we propose Avey, a new neural foundational architecture that breaks away from both attention and recurrence. Avey comprises a ranker and an autoregressive neural processor, which collaboratively identify and contextualize only the most relevant tokens for any given token, regardless of their positions in the sequence. Specifically, Avey decouples sequence length from context width, thus enabling effective processing of arbitrarily long sequences. Experimental results show that Avey compares favorably to the Transformer across a variety of standard short-range NLP benchmarks, while notably excelling at capturing long-range dependencies.

Title: Surprisal from Larger Transformer-based Language Models Predicts fMRI Data More Poorly

Authors: Yi-Chien Lin, William Schuler
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.11338
Pdf URL: https://arxiv.org/pdf/2506.11338
Copy Paste: [[2506.11338]] Surprisal from Larger Transformer-based Language Models Predicts fMRI Data More Poorly(https://arxiv.org/abs/2506.11338)
Keywords: language model
Abstract: As Transformers become more widely incorporated into natural language processing tasks, there has been considerable interest in using surprisal from these models as predictors of human sentence processing difficulty. Recent work has observed a positive relationship between Transformer-based models' perplexity and the predictive power of their surprisal estimates on reading times, showing that language models with more parameters and trained on more data are less predictive of human reading times. However, these studies focus on predicting latency-based measures (i.e., self-paced reading times and eye-gaze durations) with surprisal estimates from Transformer-based language models. This trend has not been tested on brain imaging data. This study therefore evaluates the predictive power of surprisal estimates from 17 pre-trained Transformer-based models across three different language families on two functional magnetic resonance imaging datasets. Results show that the positive relationship between model perplexity and model fit still obtains, suggesting that this trend is not specific to latency-based measures and can be generalized to neural measures.
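
For readers unfamiliar with the two quantities this abstract relates, surprisal is the negative log-probability a model assigns to a word in context, and perplexity is the exponentiated mean surprisal over a sequence. A minimal sketch, using made-up probabilities rather than output from any real model:

```python
import math
from typing import List


def surprisal(prob: float) -> float:
    """Surprisal in bits of an event with probability `prob`."""
    return -math.log2(prob)


def perplexity(probs: List[float]) -> float:
    """Perplexity of a sequence given each token's conditional probability."""
    mean_surprisal = sum(surprisal(p) for p in probs) / len(probs)
    return 2 ** mean_surprisal


# Illustrative per-token probabilities (not from any real model).
example_probs = [0.25, 0.25]
```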

Title: From Replication to Redesign: Exploring Pairwise Comparisons for LLM-Based Peer Review

Authors: Yaohui Zhang, Haijing Zhang, Wenlong Ji, Tianyu Hua, Nick Haber, Hancheng Cao, Weixin Liang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.11343
Pdf URL: https://arxiv.org/pdf/2506.11343
Copy Paste: [[2506.11343]] From Replication to Redesign: Exploring Pairwise Comparisons for LLM-Based Peer Review(https://arxiv.org/abs/2506.11343)
Keywords: language model, llm, agent
Abstract: The advent of large language models (LLMs) offers unprecedented opportunities to reimagine peer review beyond the constraints of traditional workflows. Despite these opportunities, prior efforts have largely focused on replicating traditional review workflows with LLMs serving as direct substitutes for human reviewers, while limited attention has been given to exploring new paradigms that fundamentally rethink how LLMs can participate in the academic review process. In this paper, we introduce and explore a novel mechanism that employs LLM agents to perform pairwise comparisons among manuscripts instead of individual scoring. By aggregating outcomes from substantial pairwise evaluations, this approach enables a more accurate and robust measure of relative manuscript quality. Our experiments demonstrate that this comparative approach significantly outperforms traditional rating-based methods in identifying high-impact papers. However, our analysis also reveals emergent biases in the selection process, notably a reduced novelty in research topics and an increased institutional imbalance. These findings highlight both the transformative potential of rethinking peer review with LLMs and critical challenges that future systems must address to ensure equity and diversity.
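
One simple way to turn many pairwise outcomes into a ranking is to sort manuscripts by the fraction of comparisons they win. The sketch below assumes that aggregation rule purely for illustration; the abstract does not say which aggregation the authors use, and the paper identifiers are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def rank_by_win_rate(outcomes: Iterable[Tuple[str, str]]) -> List[Tuple[str, float]]:
    """Rank items from (winner, loser) pairs by fraction of comparisons won."""
    wins: Dict[str, int] = defaultdict(int)
    games: Dict[str, int] = defaultdict(int)
    for winner, loser in outcomes:
        wins[winner] += 1
        games[winner] += 1
        games[loser] += 1
    rates = {paper: wins[paper] / games[paper] for paper in games}
    # Highest win rate first; ties broken alphabetically for determinism.
    return sorted(rates.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical outcomes: each tuple is (preferred paper, other paper).
ranking = rank_by_win_rate([("A", "B"), ("A", "C"), ("B", "C")])
```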

Title: The Biased Samaritan: LLM biases in Perceived Kindness

Authors: Jack H Fagan, Ruhaan Juyaal, Amy Yue-Ming Yu, Siya Pun
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2506.11361
Pdf URL: https://arxiv.org/pdf/2506.11361
Copy Paste: [[2506.11361]] The Biased Samaritan: LLM biases in Perceived Kindness(https://arxiv.org/abs/2506.11361)
Keywords: language model, llm, prompt
Abstract: While Large Language Models (LLMs) have become ubiquitous in many fields, understanding and mitigating LLM biases is an ongoing issue. This paper provides a novel method for evaluating the demographic biases of various generative AI models. By prompting models to assess a moral patient's willingness to intervene constructively, we aim to quantitatively evaluate different LLMs' biases towards various genders, races, and ages. Our work differs from existing work by aiming to determine the baseline demographic identities for various commercial models and the relationship between the baseline and other demographics. We strive to understand whether these biases are positive, neutral, or negative, and how strong they are. This paper can contribute to the objective assessment of bias in Large Language Models and give the user or developer the power to account for these biases in LLM output or in training future LLMs. Our analysis suggested two key findings: models view the baseline demographic as a white middle-aged or young adult male, yet a general trend across models suggested that non-baseline demographics are more willing to help than the baseline. These methodologies allowed us to distinguish these two biases, which are often tangled together.

Title: Curriculum-Guided Layer Scaling for Language Model Pretraining

Authors: Karanpartap Singh, Neil Band, Ehsan Adeli
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.11389
Pdf URL: https://arxiv.org/pdf/2506.11389
Copy Paste: [[2506.11389]] Curriculum-Guided Layer Scaling for Language Model Pretraining(https://arxiv.org/abs/2506.11389)
Keywords: language model
Abstract: As the cost of pretraining large language models grows, there is continued interest in strategies to improve learning efficiency during this core training stage. Motivated by cognitive development, where humans gradually build knowledge as their brains mature, we propose Curriculum-Guided Layer Scaling (CGLS), a framework for compute-efficient pretraining that synchronizes increasing data difficulty with model growth through progressive layer stacking (i.e., gradually adding layers during training). At the 100M parameter scale, using a curriculum transitioning from synthetic short stories to general web data, CGLS outperforms baseline methods on the question-answering benchmarks PIQA and ARC. Pretraining at the 1.2B scale, we stratify the DataComp-LM corpus with a DistilBERT-based classifier and progress from general text to highly technical or specialized content. Our results show that progressively increasing model depth alongside sample difficulty leads to better generalization and zero-shot performance on various downstream benchmarks. Altogether, our findings demonstrate that CGLS unlocks the potential of progressive stacking, offering a simple yet effective strategy for improving generalization on knowledge-intensive and reasoning tasks.

Title: Predicting Early-Onset Colorectal Cancer with Large Language Models

Authors: Wilson Lau, Youngwon Kim, Sravanthi Parasa, Md Enamul Haque, Anand Oka, Jay Nanduri
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.11410
Pdf URL: https://arxiv.org/pdf/2506.11410
Copy Paste: [[2506.11410]] Predicting Early-Onset Colorectal Cancer with Large Language Models(https://arxiv.org/abs/2506.11410)
Keywords: language model, llm
Abstract: The incidence rate of early-onset colorectal cancer (EoCRC, age < 45) has increased every year, but this population is younger than the recommended age established by national guidelines for cancer screening. In this paper, we applied 10 different machine learning models to predict EoCRC and compared their performance with advanced large language models (LLMs), using patient conditions, lab results, and observations from the 6 months of the patient journey prior to CRC diagnosis. We retrospectively identified 1,953 CRC patients from multiple health systems across the United States. The results demonstrated that the fine-tuned LLM achieved an average of 73% sensitivity and 91% specificity.
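
For reference, sensitivity and specificity are computed from confusion-matrix counts as shown below. The counts in the example are invented to reproduce the reported percentages and are not the study's data.

```python
def sensitivity(tp: int, fn: int) -> float:
    """True-positive rate: share of actual cases the model flags."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """True-negative rate: share of non-cases the model correctly clears."""
    return tn / (tn + fp)


# Hypothetical counts chosen only to illustrate the arithmetic.
example_sensitivity = sensitivity(tp=73, fn=27)
example_specificity = specificity(tn=91, fp=9)
```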

Title: Efficient Long-Context LLM Inference via KV Cache Clustering

Authors: Jie Hu, Shengnan Wang, Yutong He, Ping Gong, Jiawei Yi, Juncheng Zhang, Youhui Bai, Renhai Chen, Gong Zhang, Cheng Li, Kun Yuan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.11418
Pdf URL: https://arxiv.org/pdf/2506.11418
Copy Paste: [[2506.11418]] Efficient Long-Context LLM Inference via KV Cache Clustering(https://arxiv.org/abs/2506.11418)
Keywords: language model, llm
Abstract: Large language models (LLMs) with extended context windows have become increasingly prevalent for tackling complex tasks. However, the substantial Key-Value (KV) cache required for long-context LLMs poses significant deployment challenges. Existing approaches either discard potentially critical information needed for future generations or offer limited efficiency gains due to high computational overhead. In this paper, we introduce Chelsea, a simple yet effective framework for online KV cache clustering. Our approach is based on the observation that key states exhibit high similarity along the sequence dimension. To enable efficient clustering, we divide the sequence into chunks and propose Chunked Soft Matching, which employs an alternating partition strategy within each chunk and identifies clusters based on similarity. Chelsea then merges the KV cache within each cluster into a single centroid. Additionally, we provide a theoretical analysis of the computational complexity and the optimality of the intra-chunk partitioning strategy. Extensive experiments across various models and long-context benchmarks demonstrate that Chelsea achieves up to 80% reduction in KV cache memory usage while maintaining comparable model performance. Moreover, with minimal computational overhead, Chelsea accelerates the decoding stage of inference by up to 3.19$\times$ and reduces end-to-end latency by up to 2.72$\times$.
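
The chunk-then-cluster idea can be sketched roughly as follows. This is one possible reading of the abstract, not the authors' algorithm: it assumes the alternating partition puts even and odd positions on opposite sides, uses cosine similarity between keys with an arbitrary threshold, and takes an unweighted mean as the centroid. All function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mean(vectors: List[Vector]) -> Vector:
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def cluster_chunk(
    keys: List[Vector], values: List[Vector], threshold: float = 0.9
) -> List[Tuple[Vector, Vector]]:
    """Merge similar KV pairs inside one chunk into (key, value) centroids."""
    # Assumed alternating partition: odd positions anchor clusters,
    # even positions are matched against the anchors.
    anchors = list(range(1, len(keys), 2))
    clusters = {j: [j] for j in anchors}
    for i in range(0, len(keys), 2):
        best = max(anchors, key=lambda j: cosine(keys[i], keys[j]), default=None)
        if best is not None and cosine(keys[i], keys[best]) >= threshold:
            clusters[best].append(i)
        else:
            clusters[i] = [i]  # no close match: keep as its own cluster
    ordered = sorted(clusters.values(), key=min)  # preserve sequence order
    return [
        (mean([keys[m] for m in members]), mean([values[m] for m in members]))
        for members in ordered
    ]


def cluster_kv_cache(
    keys: List[Vector], values: List[Vector], chunk_size: int = 4, threshold: float = 0.9
) -> List[Tuple[Vector, Vector]]:
    """Apply chunk-local clustering along the sequence dimension."""
    merged: List[Tuple[Vector, Vector]] = []
    for start in range(0, len(keys), chunk_size):
        end = start + chunk_size
        merged.extend(cluster_chunk(keys[start:end], values[start:end], threshold))
    return merged


# Toy cache: positions 0/1 and 2/3 have nearly identical keys.
toy_keys = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.02, 1.0]]
toy_values = [[1.0], [3.0], [10.0], [20.0]]
toy_merged = cluster_kv_cache(toy_keys, toy_values, chunk_size=4)
```

In this toy case four cache entries collapse into two centroids, halving the cache. Restricting matching to a chunk keeps the similarity search cost bounded by the chunk size rather than the full sequence length.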

Title: Agent-RLVR: Training Software Engineering Agents via Guidance and Environment Rewards

Authors: Jeff Da, Clinton Wang, Xiang Deng, Yuntao Ma, Nikhil Barhate, Sean Hendryx
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.11425
Pdf URL: https://arxiv.org/pdf/2506.11425
Copy Paste: [[2506.11425]] Agent-RLVR: Training Software Engineering Agents via Guidance and Environment Rewards(https://arxiv.org/abs/2506.11425)
Keywords: language model, llm, agent
Abstract: Reinforcement Learning from Verifiable Rewards (RLVR) has been widely adopted as the de facto method for enhancing the reasoning capabilities of large language models and has demonstrated notable success in verifiable domains like math and competitive programming tasks. However, the efficacy of RLVR diminishes significantly when applied to agentic environments. These settings, characterized by multi-step, complex problem solving, lead to high failure rates even for frontier LLMs, as the reward landscape is too sparse for effective model training via conventional RLVR. In this work, we introduce Agent-RLVR, a framework that makes RLVR effective in challenging agentic settings, with an initial focus on software engineering tasks. Inspired by human pedagogy, Agent-RLVR introduces agent guidance, a mechanism that actively steers the agent towards successful trajectories by leveraging diverse informational cues. These cues, ranging from high-level strategic plans to dynamic feedback on the agent's errors and environmental interactions, emulate a teacher's guidance, enabling the agent to navigate difficult solution spaces and promoting active self-improvement via additional environment exploration. In the Agent-RLVR training loop, agents first attempt to solve tasks to produce initial trajectories, which are then validated by unit tests and supplemented with agent guidance. Agents then reattempt with guidance, and the agent policy is updated with RLVR based on the rewards of these guided trajectories. Agent-RLVR elevates the pass@1 performance of Qwen-2.5-72B-Instruct from 9.4% to 22.4% on SWE-Bench Verified. We find that our guidance-augmented RLVR data is additionally useful for test-time reward model training, shown by further boosting pass@1 to 27.8%. Agent-RLVR lays the groundwork for training agents with RLVR in complex, real-world environments where conventional RL methods struggle.

Title: KoGEC : Korean Grammatical Error Correction with Pre-trained Translation Models

Authors: Taeeun Kim, Semin Jeong, Youngsook Song
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.11432
Pdf URL: https://arxiv.org/pdf/2506.11432
Copy Paste: [[2506.11432]] KoGEC : Korean Grammatical Error Correction with Pre-trained Translation Models(https://arxiv.org/abs/2506.11432)
Keywords: language model, gpt, llm
Abstract: This research introduces KoGEC, a Korean Grammatical Error Correction system using pre-trained translation models. We fine-tuned NLLB (No Language Left Behind) models for Korean GEC, comparing their performance against large language models like GPT-4 and HCX-3. The study used two social media conversation datasets for training and testing. The NLLB models were fine-tuned using special language tokens to distinguish between original and corrected Korean sentences. Evaluation was done using BLEU scores and an "LLM as judge" method to classify error types. Results showed that the fine-tuned NLLB (KoGEC) models outperformed GPT-4o and HCX-3 in Korean GEC tasks. KoGEC demonstrated a more balanced error correction profile across various error types, whereas the larger LLMs tended to focus less on punctuation errors. We also developed a Chrome extension to make the KoGEC system accessible to users. Finally, we explored token vocabulary expansion to further improve the model but found it to decrease model performance. This research contributes to the field of NLP by providing an efficient, specialized Korean GEC system and a new evaluation method. It also highlights the potential of compact, task-specific models to compete with larger, general-purpose language models in specialized NLP tasks.

Title: AbsenceBench: Language Models Can't Tell What's Missing

Authors: Harvey Yiyun Fu, Aryan Shrivastava, Jared Moore, Peter West, Chenhao Tan, Ari Holtzman
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.11440
Pdf URL: https://arxiv.org/pdf/2506.11440
Copy Paste: [[2506.11440]] AbsenceBench: Language Models Can't Tell What's Missing(https://arxiv.org/abs/2506.11440)
Keywords: language model, llm
Abstract: Large language models (LLMs) are increasingly capable of processing long inputs and locating specific information within them, as evidenced by their performance on the Needle in a Haystack (NIAH) test. However, while models excel at recalling surprising information, they still struggle to identify clearly omitted information. We introduce AbsenceBench to assess LLMs' capacity to detect missing information across three domains: numerical sequences, poetry, and GitHub pull requests. AbsenceBench asks models to identify which pieces of a document were deliberately removed, given access to both the original and edited contexts. Despite the apparent straightforwardness of these tasks, our experiments reveal that even state-of-the-art models like Claude-3.7-Sonnet achieve only 69.6% F1-score with a modest average context length of 5K tokens. Our analysis suggests this poor performance stems from a fundamental limitation: Transformer attention mechanisms cannot easily attend to "gaps" in documents since these absences don't correspond to any specific keys that can be attended to. Overall, our results and analysis provide a case study of the close proximity of tasks where models are already superhuman (NIAH) and tasks where models break down unexpectedly (AbsenceBench).
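
The task setup lends itself to a small sketch: remove some elements from a document, ask for the missing ones, and score the answer with F1. The set-level exact-match scoring below is an assumption made for illustration; the abstract reports an F1-score but does not spell out how predictions are matched.

```python
from typing import List, Set


def make_edited(original: List[str], removed: Set[str]) -> List[str]:
    """Return the document with the chosen elements deliberately removed."""
    return [line for line in original if line not in removed]


def omission_f1(removed: Set[str], predicted: Set[str]) -> float:
    """F1 between the truly removed elements and the model's predictions."""
    true_positives = len(removed & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(removed)
    return 2 * precision * recall / (precision + recall)


original_doc = ["a", "b", "c", "d", "e"]
removed_lines = {"b", "d"}
edited_doc = make_edited(original_doc, removed_lines)
# Hypothetical model answer: one correct omission, one false alarm.
score = omission_f1(removed_lines, {"b", "e"})
```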

Title: A Gamified Evaluation and Recruitment Platform for Low Resource Language Machine Translation Systems

Authors: Carlos Rafael Catalan
Subjects: cs.CL, cs.SI
Abstract URL: https://arxiv.org/abs/2506.11467
Pdf URL: https://arxiv.org/pdf/2506.11467
Copy Paste: [[2506.11467]] A Gamified Evaluation and Recruitment Platform for Low Resource Language Machine Translation Systems(https://arxiv.org/abs/2506.11467)
Keywords: language model
Abstract: Human evaluators provide necessary contributions in evaluating large language models. In the context of Machine Translation (MT) systems for low-resource languages (LRLs), this is made even more apparent since popular automated metrics tend to be string-based, and therefore do not provide a full picture of the nuances of the behavior of the system. Human evaluators, when equipped with the necessary expertise in the language, will be able to test for adequacy, fluency, and other important metrics. However, the low-resource nature of the language means that both datasets and evaluators are in short supply. This presents the following conundrum: How can developers of MT systems for these LRLs find adequate human evaluators and datasets? This paper first presents a comprehensive review of existing evaluation procedures, with the objective of producing a design proposal for a platform that addresses the resource gap in terms of datasets and evaluators in developing MT systems. The result is a design for a recruitment and gamified evaluation platform for developers of MT systems. Challenges are also discussed in terms of evaluating this platform, as well as its possible applications in the wider scope of Natural Language Processing (NLP) research.

Title: Med-PRM: Medical Reasoning Models with Stepwise, Guideline-verified Process Rewards

Authors: Jaehoon Yun, Jiwoong Sohn, Jungwoo Park, Hyunjae Kim, Xiangru Tang, Yanjun Shao, Yonghoe Koo, Minhyeok Ko, Qingyu Chen, Mark Gerstein, Michael Moor, Jaewoo Kang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.11474
Pdf URL: https://arxiv.org/pdf/2506.11474
Copy Paste: [[2506.11474]] Med-PRM: Medical Reasoning Models with Stepwise, Guideline-verified Process Rewards(https://arxiv.org/abs/2506.11474)
Keywords: language model, retrieval-augmented generation
Abstract: Large language models have shown promise in clinical decision making, but current approaches struggle to localize and correct errors at specific steps of the reasoning process. This limitation is critical in medicine, where identifying and addressing reasoning errors is essential for accurate diagnosis and effective patient care. We introduce Med-PRM, a process reward modeling framework that leverages retrieval-augmented generation to verify each reasoning step against established medical knowledge bases. By verifying intermediate reasoning steps with evidence retrieved from clinical guidelines and literature, our model can precisely assess the reasoning quality in a fine-grained manner. Evaluations on five medical QA benchmarks and two open-ended diagnostic tasks demonstrate that Med-PRM achieves state-of-the-art performance, improving the performance of base models by up to 13.50%. Moreover, we demonstrate the generality of Med-PRM by integrating it in a plug-and-play fashion with strong policy models such as Meerkat, achieving over 80% accuracy on MedQA for the first time using small-scale models of 8 billion parameters. Our code and data are publicly available.

Title: ImmunoFOMO: Are Language Models missing what oncologists see?

Authors: Aman Sinha, Bogdan-Valentin Popescu, Xavier Coubez, Marianne Clausel, Mathieu Constant
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.11478
Pdf URL: https://arxiv.org/pdf/2506.11478
Copy Paste: [[2506.11478]] ImmunoFOMO: Are Language Models missing what oncologists see?(https://arxiv.org/abs/2506.11478)
Keywords: language model, llm
Abstract: The capabilities of language models (LMs) have grown at a fast pace over the past decade, leading researchers in various disciplines, such as biomedical research, to increasingly explore the utility of LMs in their day-to-day applications. Domain-specific language models have already been in use for biomedical natural language processing (NLP) applications. Recently, however, interest has grown in medical language models and their understanding capabilities. In this paper, we investigate the medical conceptual grounding of various language models against expert clinicians for identification of hallmarks of immunotherapy in breast cancer abstracts. Our results show that pre-trained language models have the potential to outperform large language models in identifying very specific (low-level) concepts.

Title: Relational Schemata in BERT Are Inducible, Not Emergent: A Study of Performance vs. Competence in Language Models

Authors: Cole Gawin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.11485
Pdf URL: https://arxiv.org/pdf/2506.11485
Copy Paste: [[2506.11485]] Relational Schemata in BERT Are Inducible, Not Emergent: A Study of Performance vs. Competence in Language Models(https://arxiv.org/abs/2506.11485)
Keywords: language model
Abstract: While large language models like BERT demonstrate strong empirical performance on semantic tasks, whether this reflects true conceptual competence or surface-level statistical association remains unclear. I investigate whether BERT encodes abstract relational schemata by examining internal representations of concept pairs across taxonomic, mereological, and functional relations. I compare BERT's relational classification performance with representational structure in [CLS] token embeddings. Results reveal that pretrained BERT enables high classification accuracy, indicating latent relational signals. However, concept pairs organize by relation type in high-dimensional embedding space only after fine-tuning on supervised relation classification tasks. This indicates relational schemata are not emergent from pretraining alone but can be induced via task scaffolding. These findings demonstrate that behavioral performance does not necessarily imply structured conceptual understanding, though models can acquire inductive biases for grounded relational abstraction through appropriate training.

Title: Lag-Relative Sparse Attention In Long Context Training

Authors: Manlai Liang, Wanyi Huang, Mandi Liu, Huaijun Li, Jinlong Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.11498
Pdf URL: https://arxiv.org/pdf/2506.11498
Copy Paste: [[2506.11498]] Lag-Relative Sparse Attention In Long Context Training(https://arxiv.org/abs/2506.11498)
Keywords: language model, llm, long context
Abstract: Large Language Models (LLMs) have made significant strides in natural language processing and generation, yet their ability to handle long-context input remains constrained by the quadratic complexity of attention computation and the linearly increasing key-value memory footprint. To reduce computational costs and memory, key-value cache compression techniques are commonly applied at inference time, but this often leads to severe performance degradation, as models are not trained to handle compressed context. Although there are more sophisticated compression methods, they are typically unsuitable for post-training because of their incompatibility with gradient-based optimization or high computation overhead. To fill this gap with no additional parameters and little computation overhead, we propose Lag-Relative Sparse Attention (LRSA), anchored by the LagKV compression method, for long-context post-training. Our method performs chunk-by-chunk prefilling, which selects the top K most relevant key-value pairs in a fixed-size lagging window, allowing the model to focus on salient historical context while maintaining efficiency. Experimental results show that our approach significantly enhances the robustness of the LLM with key-value compression and achieves better fine-tuned results in the question-answer tuning task.
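The top-K selection inside a fixed-size lagging window, as described in the LRSA abstract, can be illustrated with a minimal sketch. The per-token relevance scores, the function names, and the choice to keep the newest window uncompressed are all assumptions made for illustration; the abstract does not specify LagKV's scoring rule, and this is not the authors' implementation.

```python
from typing import List, Sequence


def select_top_k(scores: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k highest scores, in original order."""
    k = max(0, min(k, len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def compress_cache(scores: Sequence[float], window: int, k: int) -> List[int]:
    """Keep the top-k positions of every older window (hypothetical scores).

    Assumption: the most recent window is kept whole, so only the
    lagging windows behind it are compressed.
    """
    n = len(scores)
    if n == 0:
        return []
    last_start = ((n - 1) // window) * window
    kept: List[int] = []
    for start in range(0, n, window):
        if start == last_start:
            kept.extend(range(start, n))  # newest window stays intact
        else:
            chunk = scores[start:start + window]
            kept.extend(start + i for i in select_top_k(chunk, k))
    return kept


# Eight cached positions, windows of four, keep two per older window.
kept_positions = compress_cache([0.1, 0.9, 0.3, 0.8, 0.2, 0.7, 0.5, 0.4], window=4, k=2)
```

With these toy scores, positions 1 and 3 survive from the older window and positions 4 to 7 are kept unchanged.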

Title: On the Effectiveness of Integration Methods for Multimodal Dialogue Response Retrieval

Authors: Seongbo Jang, Seonghyeon Lee, Dongha Lee, Hwanjo Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.11499
Pdf URL: https://arxiv.org/pdf/2506.11499
Copy Paste: [[2506.11499]] On the Effectiveness of Integration Methods for Multimodal Dialogue Response Retrieval(https://arxiv.org/abs/2506.11499)
Keywords: chat
Abstract: Multimodal chatbots have become one of the major topics for dialogue systems in both the research community and industry. Recently, researchers have shed light on the multimodality of responses as well as dialogue contexts. This work explores how a dialogue system can output responses in various modalities such as text and image. To this end, we first formulate a multimodal dialogue response retrieval task for retrieval-based systems as the combination of three subtasks. We then propose three integration methods based on a two-step approach and an end-to-end approach, and compare the merits and demerits of each method. Experimental results on two datasets demonstrate that the end-to-end approach achieves comparable performance without the intermediate step required by the two-step approach. In addition, a parameter sharing strategy not only reduces the number of parameters but also boosts performance by transferring knowledge across the subtasks and the modalities.

Title: From Persona to Person: Enhancing the Naturalness with Multiple Discourse Relations Graph Learning in Personalized Dialogue Generation

Authors: Chih-Hao Hsu, Ying-Jia Lin, Hung-Yu Kao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.11557
Pdf URL: https://arxiv.org/pdf/2506.11557
Copy Paste: [[2506.11557]] From Persona to Person: Enhancing the Naturalness with Multiple Discourse Relations Graph Learning in Personalized Dialogue Generation(https://arxiv.org/abs/2506.11557)
Keywords: language model
Abstract: In dialogue generation, the naturalness of responses is crucial for effective human-machine interaction. Personalized response generation poses even greater challenges, as the responses must remain coherent and consistent with the user's personal traits or persona descriptions. We propose MUDI (Multiple Discourse Relations Graph Learning) for personalized dialogue generation. We utilize a Large Language Model to assist in annotating discourse relations and to transform dialogue data into structured dialogue graphs. Our graph encoder, the proposed DialogueGAT model, then captures implicit discourse relations within this structure, along with persona descriptions. During the personalized response generation phase, novel coherence-aware attention strategies are implemented to enhance the decoder's consideration of discourse relations. Our experiments demonstrate significant improvements in the quality of personalized responses, thus resembling human-like dialogue exchanges.

Title: Are LLMs Good Text Diacritizers? An Arabic and Yorùbá Case Study

Authors: Hawau Olamide Toyin, Samar M. Magdy, Hanan Aldarmaki
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.11602
Pdf URL: https://arxiv.org/pdf/2506.11602
Copy Paste: [[2506.11602]] Are LLMs Good Text Diacritizers? An Arabic and Yorùbá Case Study(https://arxiv.org/abs/2506.11602)
Keywords: language model, llm, hallucination
Abstract: We investigate the effectiveness of large language models (LLMs) for text diacritization in two typologically distinct languages: Arabic and Yoruba. To enable a rigorous evaluation, we introduce a novel multilingual dataset MultiDiac, with diverse samples that capture a range of diacritic ambiguities. We evaluate 14 LLMs varying in size, accessibility, and language coverage, and benchmark them against 6 specialized diacritization models. Additionally, we fine-tune four small open-source models using LoRA for Yoruba. Our results show that many off-the-shelf LLMs outperform specialized diacritization models for both Arabic and Yoruba, but smaller models suffer from hallucinations. Fine-tuning on a small dataset can help improve diacritization performance and reduce hallucination rates.

Title: SceneGram: Conceptualizing and Describing Tangrams in Scene Context

Authors: Simeon Junker, Sina Zarrieß
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.11631
Pdf URL: https://arxiv.org/pdf/2506.11631
Copy Paste: [[2506.11631]] SceneGram: Conceptualizing and Describing Tangrams in Scene Context(https://arxiv.org/abs/2506.11631)
Keywords: llm
Abstract: Research on reference and naming suggests that humans can come up with very different ways of conceptualizing and referring to the same object, e.g. the same abstract tangram shape can be a "crab", "sink" or "space ship". Another common assumption in cognitive science is that scene context fundamentally shapes our visual perception of objects and conceptual expectations. This paper contributes SceneGram, a dataset of human references to tangram shapes placed in different scene contexts, allowing for systematic analyses of the effect of scene context on conceptualization. Based on this data, we analyze references to tangram shapes generated by multimodal LLMs, showing that these models do not account for the richness and variability of conceptualizations found in human references.

Title: LoRA-Gen: Specializing Large Language Model via Online LoRA Generation

Authors: Yicheng Xiao, Lin Song, Rui Yang, Cheng Cheng, Yixiao Ge, Xiu Li, Ying Shan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.11638
Pdf URL: https://arxiv.org/pdf/2506.11638
Copy Paste: [[2506.11638]] LoRA-Gen: Specializing Large Language Model via Online LoRA Generation(https://arxiv.org/abs/2506.11638)
Keywords: language model, agent
Abstract: Recent advances have highlighted the benefits of scaling language models to enhance performance across a wide range of NLP tasks. However, these approaches still face limitations in effectiveness and efficiency when applied to domain-specific tasks, particularly for small edge-side models. We propose the LoRA-Gen framework, which utilizes a large cloud-side model to generate LoRA parameters for edge-side models based on task descriptions. By employing the reparameterization technique, we merge the LoRA parameters into the edge-side model to achieve flexible specialization. Our method facilitates knowledge transfer between models while significantly improving the inference efficiency of the specialized model by reducing the input context length. Without specialized training, LoRA-Gen outperforms conventional LoRA fine-tuning, achieving competitive accuracy and a 2.1x speedup with TinyLLaMA-1.1B in reasoning tasks. Besides, our method delivers a compression ratio of 10.1x with Gemma-2B on intelligent agent tasks.

Title: Converting Annotated Clinical Cases into Structured Case Report Forms

Authors: Pietro Ferrazzi, Alberto Lavelli, Bernardo Magnini
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.11666
Pdf URL: https://arxiv.org/pdf/2506.11666
Copy Paste: [[2506.11666]] Converting Annotated Clinical Cases into Structured Case Report Forms(https://arxiv.org/abs/2506.11666)
Keywords: language model, llm
Abstract: Case Report Forms (CRFs) are widely used in medical research as they ensure accuracy, reliability, and validity of results in clinical studies. However, publicly available, well-annotated CRF datasets are scarce, limiting the development of CRF slot filling systems able to fill in a CRF from clinical notes. To mitigate the scarcity of CRF datasets, we propose to take advantage of available datasets annotated for information extraction tasks and to convert them into structured CRFs. We present a semi-automatic conversion methodology, which has been applied to the E3C dataset in two languages (English and Italian), resulting in a new, high-quality dataset for CRF slot filling. Through several experiments on the created dataset, we report that slot filling achieves 59.7% for Italian and 67.3% for English with a closed Large Language Model (zero-shot), and worse performance with three families of open-source models, showing that filling CRFs is challenging even for recent state-of-the-art LLMs. We release the dataset at this https URL

Title: LLMs for Sentence Simplification: A Hybrid Multi-Agent prompting Approach

Authors: Pratibha Zunjare, Michael Hsiao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.11681
Pdf URL: https://arxiv.org/pdf/2506.11681
Copy Paste: [[2506.11681]] LLMs for Sentence Simplification: A Hybrid Multi-Agent prompting Approach(https://arxiv.org/abs/2506.11681)
Keywords: language model, llm, prompt, agent
Abstract: This paper addresses the challenge of transforming complex sentences into sequences of logical, simplified sentences while preserving semantic and logical integrity with the help of Large Language Models. We propose a hybrid approach that combines advanced prompting with multi-agent architectures to enhance the sentence simplification process. Experimental results show that our approach successfully simplified 70% of the complex sentences written for a video game design application. In comparison, a single-agent approach attained a 48% success rate on the same task.

Title: Configurable Preference Tuning with Rubric-Guided Synthetic Data

Authors: Víctor Gallego
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.11702
Pdf URL: https://arxiv.org/pdf/2506.11702
Copy Paste: [[2506.11702]] Configurable Preference Tuning with Rubric-Guided Synthetic Data(https://arxiv.org/abs/2506.11702)
Keywords: language model, llm, prompt
Abstract: Models of human feedback for AI alignment, such as those underpinning Direct Preference Optimization (DPO), often bake in a singular, static set of preferences, limiting adaptability. This paper challenges the assumption of monolithic preferences by introducing Configurable Preference Tuning (CPT), a novel framework for endowing language models with the ability to dynamically adjust their behavior based on explicit, human-interpretable directives. CPT leverages synthetically generated preference data, conditioned on system prompts derived from structured, fine-grained rubrics that define desired attributes like writing style. By fine-tuning with these rubric-guided preferences, the LLM learns to modulate its outputs at inference time in response to the system prompt, without retraining. This approach not only offers fine-grained control but also provides a mechanism for modeling more nuanced and context-dependent human feedback. Several experimental artifacts, such as training code, generated datasets and fine-tuned models are released at this https URL

Title: DART: Distilling Autoregressive Reasoning to Silent Thought

Authors: Nan Jiang, Ziming Wu, De-Chuan Zhan, Fuming Lai, Shaobing Lian
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.11752
Pdf URL: https://arxiv.org/pdf/2506.11752
Copy Paste: [[2506.11752]] DART: Distilling Autoregressive Reasoning to Silent Thought(https://arxiv.org/abs/2506.11752)
Keywords: language model, llm, chain-of-thought
Abstract: Chain-of-Thought (CoT) reasoning has significantly advanced Large Language Models (LLMs) in solving complex tasks. However, its autoregressive paradigm leads to significant computational overhead, hindering its deployment in latency-sensitive applications. To address this, we propose DART (Distilling Autoregressive Reasoning to Silent Thought), a self-distillation framework that enables LLMs to replace autoregressive CoT with non-autoregressive Silent Thought (ST). Specifically, DART introduces two training pathways: the CoT pathway for traditional reasoning and the ST pathway for generating answers directly from a few ST tokens. The ST pathway utilizes a lightweight Reasoning Evolvement Module (REM) to align its hidden states with the CoT pathway, enabling the ST tokens to evolve into informative embeddings. During inference, only the ST pathway is activated, leveraging evolving ST tokens to deliver the answer directly. Extensive experimental results demonstrate that DART achieves comparable reasoning performance to existing baselines while offering significant efficiency gains, serving as a feasible alternative for efficient reasoning.

Title: DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents

Authors: Mingxuan Du, Benfeng Xu, Chiwei Zhu, Xiaorui Wang, Zhendong Mao
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2506.11763
Pdf URL: https://arxiv.org/pdf/2506.11763
Copy Paste: [[2506.11763]] DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents(https://arxiv.org/abs/2506.11763)
Keywords: llm, agent
Abstract: Deep Research Agents (DRAs) are a prominent category of LLM-based agents. By autonomously orchestrating multistep web exploration, targeted retrieval, and higher-order synthesis, they transform vast amounts of online information into analyst-grade, citation-rich reports--compressing hours of manual desk research into minutes. However, a comprehensive benchmark for systematically evaluating the capabilities of these agents remains absent. To bridge this gap, we present DeepResearch Bench, a benchmark consisting of 100 PhD-level research tasks, each meticulously crafted by domain experts across 22 distinct fields. Evaluating DRAs is inherently complex and labor-intensive. We therefore propose two novel methodologies that achieve strong alignment with human judgment. The first is a reference-based method with adaptive criteria to assess the quality of generated research reports. The second evaluates a DRA's information retrieval and collection capabilities by assessing its effective citation count and overall citation accuracy. We have open-sourced DeepResearch Bench and key components of these frameworks at this https URL to accelerate the development of practical LLM-based agents.

Title: Long-Short Alignment for Effective Long-Context Modeling in LLMs

Authors: Tianqi Du, Haotian Huang, Yifei Wang, Yisen Wang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.11769
Pdf URL: https://arxiv.org/pdf/2506.11769
Copy Paste: [[2506.11769]] Long-Short Alignment for Effective Long-Context Modeling in LLMs(https://arxiv.org/abs/2506.11769)
Keywords: language model, llm
Abstract: Large language models (LLMs) have exhibited impressive performance and surprising emergent properties. However, their effectiveness remains limited by the fixed context window of the transformer architecture, posing challenges for long-context modeling. Among these challenges, length generalization -- the ability to generalize to sequences longer than those seen during training -- is a classical and fundamental problem. In this work, we propose a fresh perspective on length generalization, shifting the focus from the conventional emphasis on input features such as positional encodings or data structures to the output distribution of the model. Specifically, through case studies on synthetic tasks, we highlight the critical role of long-short alignment -- the consistency of output distributions across sequences of varying lengths. Extending this insight to natural language tasks, we propose a metric called Long-Short Misalignment to quantify this phenomenon, uncovering a strong correlation between the metric and length generalization performance. Building on these findings, we develop a regularization term that promotes long-short alignment during training. Extensive experiments validate the effectiveness of our approach, offering new insights for achieving more effective long-context modeling in LLMs. Code is available at this https URL.

Title: Persona-driven Simulation of Voting Behavior in the European Parliament with Large Language Models

Authors: Maximilian Kreutner, Marlene Lutz, Markus Strohmaier
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.11798
Pdf URL: https://arxiv.org/pdf/2506.11798
Copy Paste: [[2506.11798]] Persona-driven Simulation of Voting Behavior in the European Parliament with Large Language Models(https://arxiv.org/abs/2506.11798)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) display remarkable capabilities to understand or even produce political discourse, but have been found to consistently display a progressive left-leaning bias. At the same time, so-called persona or identity prompts have been shown to produce LLM behavior that aligns with socioeconomic groups that the base model is not aligned with. In this work, we analyze whether zero-shot persona prompting with limited information can accurately predict individual voting decisions and, by aggregation, accurately predict positions of European groups on a diverse set of policies. We evaluate if predictions are stable towards counterfactual arguments, different persona prompts and generation methods. Finally, we find that we can simulate voting behavior of Members of the European Parliament reasonably well with a weighted F1 score of approximately 0.793. Our persona dataset of politicians in the 2024 European Parliament and our code are available at this https URL.
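For readers unfamiliar with the weighted F1 score reported in the abstract above, the sketch below shows the standard definition: per-class F1 averaged with weights proportional to each class's share of the true labels. The vote labels and data are invented for illustration and are not taken from the paper.

```python
from collections import Counter
from typing import Hashable, Sequence


def weighted_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Support-weighted average of per-class F1 scores."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    support = Counter(y_true)  # number of true examples per class
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n / total) * f1
    return score


# Toy example: four votes, one of which is mispredicted.
toy_score = weighted_f1(
    ["for", "for", "for", "against"],
    ["for", "for", "against", "against"],
)
```

Here the "for" class has F1 = 0.8 with weight 3/4 and the "against" class has F1 = 2/3 with weight 1/4, so the weighted score is about 0.767.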

Title: Are Multimodal Large Language Models Pragmatically Competent Listeners in Simple Reference Resolution Tasks?

Authors: Simeon Junker, Manar Ali, Larissa Koch, Sina Zarrieß, Hendrik Buschmeier
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.11807
Pdf URL: https://arxiv.org/pdf/2506.11807
Copy Paste: [[2506.11807]] Are Multimodal Large Language Models Pragmatically Competent Listeners in Simple Reference Resolution Tasks?(https://arxiv.org/abs/2506.11807)
Keywords: language model, llm
Abstract: We investigate the linguistic abilities of multimodal large language models in reference resolution tasks featuring simple yet abstract visual stimuli, such as color patches and color grids. Although the task may not seem challenging for today's language models, being straightforward for human dyads, we consider it to be a highly relevant probe of the pragmatic capabilities of MLLMs. Our results and analyses indeed suggest that basic pragmatic capabilities, such as context-dependent interpretation of color descriptions, still constitute major challenges for state-of-the-art MLLMs.

Title: Post Persona Alignment for Multi-Session Dialogue Generation

Authors: Yi-Pei Chen, Noriki Nishida, Hideki Nakayama, Yuji Matsumoto
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.11857
Pdf URL: https://arxiv.org/pdf/2506.11857
Copy Paste: [[2506.11857]] Post Persona Alignment for Multi-Session Dialogue Generation(https://arxiv.org/abs/2506.11857)
Keywords: language model, llm
Abstract: Multi-session persona-based dialogue generation presents challenges in maintaining long-term consistency and generating diverse, personalized responses. While large language models (LLMs) excel in single-session dialogues, they struggle to preserve persona fidelity and conversational coherence across extended interactions. Existing methods typically retrieve persona information before response generation, which can constrain diversity and result in generic outputs. We propose Post Persona Alignment (PPA), a novel two-stage framework that reverses this process. PPA first generates a general response based solely on dialogue context, then retrieves relevant persona memories using the response as a query, and finally refines the response to align with the speaker's persona. This post-hoc alignment strategy promotes naturalness and diversity while preserving consistency and personalization. Experiments on multi-session LLM-generated dialogue data demonstrate that PPA significantly outperforms prior approaches in consistency, diversity, and persona relevance, offering a more flexible and effective paradigm for long-term personalized dialogue generation.
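The generate, retrieve, refine order described in the PPA abstract can be sketched as a small pipeline. Everything here is hypothetical scaffolding: `generate` and `refine` stand in for LLM calls, and word-overlap retrieval is a simple placeholder, since the abstract does not say which retriever is used.

```python
from typing import Callable, List, Sequence, Set


def tokenize(text: str) -> Set[str]:
    """Lowercased whitespace tokens (a deliberately crude tokenizer)."""
    return set(text.lower().split())


def retrieve_persona(response: str, memories: Sequence[str], top_n: int = 1) -> List[str]:
    """Rank persona memories by word overlap with the draft response."""
    query = tokenize(response)
    ranked = sorted(memories, key=lambda m: len(query & tokenize(m)), reverse=True)
    return ranked[:top_n]


def post_persona_align(
    context: str,
    memories: Sequence[str],
    generate: Callable[[str], str],
    refine: Callable[[str, List[str]], str],
    top_n: int = 1,
) -> str:
    """Draft from context only, then retrieve persona facts, then refine."""
    draft = generate(context)                            # step 1: persona-free draft
    relevant = retrieve_persona(draft, memories, top_n)  # step 2: draft is the query
    return refine(draft, relevant)                       # step 3: align with persona


# Stub callables standing in for model calls.
memories = ["I love hiking in the mountains", "I work as a nurse", "My cat is named Miso"]
reply = post_persona_align(
    "Any plans for the weekend?",
    memories,
    generate=lambda ctx: "Maybe go hiking in the mountains this weekend",
    refine=lambda draft, mem: draft + " | persona: " + "; ".join(mem),
)
```

The key design point is that retrieval is keyed on the draft response rather than on the dialogue context, which is what distinguishes this ordering from retrieve-then-generate pipelines.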

Title: Beyond Homogeneous Attention: Memory-Efficient LLMs via Fourier-Approximated KV Cache

Authors: Xiaoran Liu, Siyang He, Qiqi Wang, Ruixiao Li, Yuerong Song, Zhigeng Liu, Linlin Li, Qun Liu, Zengfeng Huang, Qipeng Guo, Ziwei He, Xipeng Qiu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.11886
Pdf URL: https://arxiv.org/pdf/2506.11886
Copy Paste: [[2506.11886]] Beyond Homogeneous Attention: Memory-Efficient LLMs via Fourier-Approximated KV Cache(https://arxiv.org/abs/2506.11886)
Keywords: language model, llm
Abstract: Large Language Models struggle with memory demands from the growing Key-Value (KV) cache as context lengths increase. Existing compression methods homogenize head dimensions or rely on attention-guided token pruning, often sacrificing accuracy or introducing computational overhead. We propose FourierAttention, a training-free framework that exploits the heterogeneous roles of transformer head dimensions: lower dimensions prioritize local context, while upper ones capture long-range dependencies. By projecting the long-context-insensitive dimensions onto orthogonal Fourier bases, FourierAttention approximates their temporal evolution with fixed-length spectral coefficients. Evaluations on LLaMA models show that FourierAttention achieves the best long-context accuracy on LongBench and Needle-In-A-Haystack (NIAH). Besides, a custom Triton kernel, FlashFourierAttention, is designed to optimize memory via streamlined read-write operations, enabling efficient deployment without performance compromise.
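The idea of summarizing a sequence's temporal evolution with a fixed number of spectral coefficients can be illustrated with a truncated discrete Fourier transform. This is a generic sketch, assuming a plain DFT basis and a real-valued sequence standing in for one cache dimension over time; the abstract does not specify the exact basis or which dimensions FourierAttention compresses, and the function names are illustrative.

```python
import cmath
import math
from typing import List, Sequence


def dft_coeffs(x: Sequence[float], m: int) -> List[complex]:
    """First m normalized DFT coefficients (frequencies 0..m-1) of x."""
    n = len(x)
    if 2 * (m - 1) >= n:
        raise ValueError("m must satisfy 2 * (m - 1) < len(x)")
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)) / n
        for k in range(m)
    ]


def reconstruct(coeffs: Sequence[complex], n: int) -> List[float]:
    """Rebuild a length-n real sequence from its low-frequency coefficients.

    Uses conjugate symmetry of real signals, so each kept frequency
    k >= 1 contributes twice its real part.
    """
    out = []
    for t in range(n):
        value = coeffs[0].real
        for k in range(1, len(coeffs)):
            value += 2 * (coeffs[k] * cmath.exp(2j * math.pi * k * t / n)).real
        out.append(value)
    return out


# A slowly varying signal of length 8 is captured by 3 coefficients.
signal = [1 + math.cos(2 * math.pi * t / 8) for t in range(8)]
approx = reconstruct(dft_coeffs(signal, 3), len(signal))
```

Storage for the approximated dimension depends on the number of kept coefficients rather than on sequence length, which is the property the abstract refers to as fixed-length spectral coefficients.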

Title: GeistBERT: Breathing Life into German NLP

Authors: Raphael Scheible-Schmitt, Johann Frei
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.11903
Pdf URL: https://arxiv.org/pdf/2506.11903
Copy Paste: [[2506.11903]] GeistBERT: Breathing Life into German NLP(https://arxiv.org/abs/2506.11903)
Keywords: language model
Abstract: Advances in transformer-based language models have highlighted the benefits of language-specific pre-training on high-quality corpora. In this context, German NLP stands to gain from updated architectures and modern datasets tailored to the linguistic characteristics of the German language. GeistBERT seeks to improve German language processing by incrementally training on a diverse corpus and optimizing model performance across various NLP tasks. It was pre-trained using fairseq with standard hyperparameters, initialized from GottBERT weights, and trained on a large-scale German corpus using Whole Word Masking (WWM). Based on the pre-trained model, we derived extended-input variants using Nyströmformer and Longformer architectures with support for sequences up to 8k tokens. While these long-context models were not evaluated on dedicated long-context benchmarks, they are included in our release. We assessed all models on NER (CoNLL 2003, GermEval 2014) and text classification (GermEval 2018 fine/coarse, 10kGNAD) using $F_1$ score and accuracy. The GeistBERT models achieved strong performance, leading all tasks among the base models and setting a new state-of-the-art (SOTA). Notably, the base models outperformed larger models in several tasks. To support the German NLP research community, we are releasing GeistBERT under the MIT license.

Title: Feedback Friction: LLMs Struggle to Fully Incorporate External Feedback

Authors: Dongwei Jiang, Alvin Zhang, Andrew Wang, Nicholas Andrews, Daniel Khashabi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.11930
Pdf URL: https://arxiv.org/pdf/2506.11930
Copy Paste: [[2506.11930]] Feedback Friction: LLMs Struggle to Fully Incorporate External Feedback(https://arxiv.org/abs/2506.11930)
Keywords: language model, llm
Abstract: Recent studies have shown LLMs possess some ability to improve their responses when given external feedback. However, it remains unclear how effectively and thoroughly these models can incorporate extrinsic feedback. In an ideal scenario, if LLMs receive near-perfect and complete feedback, we would expect them to fully integrate the feedback and change their incorrect answers to correct ones. In this paper, we systematically investigate LLMs' ability to incorporate feedback by designing a controlled experimental environment. For each problem, a solver model attempts a solution, then a feedback generator with access to near-complete ground-truth answers produces targeted feedback, after which the solver tries again. We evaluate this pipeline across a diverse range of tasks, including math reasoning, knowledge reasoning, scientific reasoning, and general multi-domain evaluations with state-of-the-art language models including Claude 3.7 (with and without extended thinking). Surprisingly, even under these near-ideal conditions, solver models consistently show resistance to feedback, a limitation that we term FEEDBACK FRICTION. To mitigate this limitation, we experiment with sampling-based strategies like progressive temperature increases and explicit rejection of previously attempted incorrect answers, which yield improvements but still fail to help models achieve target performance. We also perform a rigorous exploration of potential causes of FEEDBACK FRICTION, ruling out factors such as model overconfidence and data familiarity. We hope that highlighting this issue in LLMs and ruling out several apparent causes will help future research in self-improvement.

Title: Improving Large Language Model Safety with Contrastive Representation Learning

Authors: Samuel Simko, Mrinmaya Sachan, Bernhard Schölkopf, Zhijing Jin
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.11938
Pdf URL: https://arxiv.org/pdf/2506.11938
Copy Paste: [[2506.11938]] Improving Large Language Model Safety with Contrastive Representation Learning(https://arxiv.org/abs/2506.11938)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) are powerful tools with profound societal impacts, yet their ability to generate responses to diverse and uncontrolled inputs leaves them vulnerable to adversarial attacks. While existing defenses often struggle to generalize across varying attack types, recent advancements in representation engineering offer promising alternatives. In this work, we propose a defense framework that formulates model defense as a contrastive representation learning (CRL) problem. Our method finetunes a model using a triplet-based loss combined with adversarial hard negative mining to encourage separation between benign and harmful representations. Our experimental results across multiple models demonstrate that our approach outperforms prior representation engineering-based defenses, improving robustness against both input-level and embedding-space attacks without compromising standard performance. Our code is available at this https URL
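Summary: Large language models (LLMs) are powerful tools with profound impact, yet their ability to respond to diverse and uncontrolled inputs leaves them vulnerable to adversarial attacks. While existing defenses often struggle to generalize across attack types, recent advances in representation engineering offer promising alternatives. In this work, we propose a defense framework that formulates model defense as a contrastive representation learning (CRL) problem. Our method finetunes a model using a triplet-based loss combined with adversarial hard negative mining to encourage separation between benign and harmful representations. Experimental results across multiple models show that our approach outperforms prior representation engineering-based defenses, improving robustness against both input-level and embedding-space attacks without compromising standard performance. Our code is available at this https URL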
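For readers unfamiliar with the loss named in this abstract, a generic triplet loss with a simple form of hard negative selection can be sketched as below. This is the textbook formulation on plain vectors, not the paper's training objective: the Euclidean distance, the margin value, and the "closest negative" mining rule are assumptions, and the paper's adversarial mining procedure may differ.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def euclidean(u: Vector, v: Vector) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def hardest_negative(anchor: Vector, negatives: Sequence[Vector]) -> Vector:
    """Pick the negative closest to the anchor (the 'hardest' one to separate)."""
    return min(negatives, key=lambda n: euclidean(anchor, n))


def triplet_loss(anchor: Vector, positive: Vector, negative: Vector,
                 margin: float = 1.0) -> float:
    """max(0, d(anchor, positive) - d(anchor, negative) + margin)."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Toy usage: a benign anchor, a benign positive, and two harmful negatives.
anchor, positive = (0.0, 0.0), (0.0, 1.0)
negatives = [(3.0, 4.0), (0.0, 2.0)]
hard = hardest_negative(anchor, negatives)
loss = triplet_loss(anchor, positive, hard, margin=2.0)
```

The loss is zero once the negative is at least `margin` farther from the anchor than the positive, which is what pushes the two classes of representations apart.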

Title: code_transformed: The Influence of Large Language Models on Code

Authors: Yuliang Xu, Siming Huang, Mingmeng Geng, Yao Wan, Xuanhua Shi, Dongping Chen
Subjects: cs.CL, cs.AI, cs.LG, cs.SE
Abstract URL: https://arxiv.org/abs/2506.12014
Pdf URL: https://arxiv.org/pdf/2506.12014
Copy Paste: [[2506.12014]] code_transformed: The Influence of Large Language Models on Code(https://arxiv.org/abs/2506.12014)
Keywords: language model, llm, prompt
Abstract: Coding remains one of the most fundamental modes of interaction between humans and machines. With the rapid advancement of Large Language Models (LLMs), code generation capabilities have begun to significantly reshape programming practices. This development prompts a central question: Have LLMs transformed code style, and how can such transformation be characterized? In this paper, we present a pioneering study that investigates the impact of LLMs on code style, with a focus on naming conventions, complexity, maintainability, and similarity. By analyzing code from over 19,000 GitHub repositories linked to arXiv papers published between 2020 and 2025, we identify measurable trends in the evolution of coding style that align with characteristics of LLM-generated code. For instance, the proportion of snake_case variable names in Python code increased from 47% in Q1 2023 to 51% in Q1 2025. Furthermore, we investigate how LLMs approach algorithmic problems by examining their reasoning processes. Given the diversity of LLMs and usage scenarios, among other factors, it is difficult or even impossible to precisely estimate the proportion of code generated or assisted by LLMs. Our experimental results provide the first large-scale empirical evidence that LLMs affect real-world programming style.
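Summary: Coding remains one of the most fundamental modes of interaction between humans and machines. With the rapid development of large language models (LLMs), code generation capabilities have begun to significantly reshape programming practices. This raises a central question: have LLMs transformed code style, and how can that transformation be characterized? In this paper, we present a pioneering study of the impact of LLMs on code style, focusing on naming conventions, complexity, maintainability, and similarity. By analyzing code from over 19,000 GitHub repositories linked to arXiv papers published between 2020 and 2025, we identify measurable trends in the evolution of coding style that align with characteristics of LLM-generated code. For example, the proportion of snake_case variable names in Python code increased from 47% in Q1 2023 to 51% in Q1 2025. We also investigate how LLMs approach algorithmic problems by examining their reasoning processes. Given the diversity of LLMs and usage scenarios, among other factors, it is difficult or even impossible to precisely estimate the proportion of code generated or assisted by LLMs. Our experimental results provide the first large-scale empirical evidence that LLMs affect real-world programming style.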
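A naming-convention statistic like the snake_case share quoted in this abstract could, in principle, be computed along the following lines. This is a rough sketch under stated assumptions and not the paper's methodology: the regular expressions, the choice to count each distinct assigned name once, and the treatment of single-word names as "other" are all illustrative decisions.

```python
import ast
import re
from collections import Counter

# Assumed patterns: snake_case needs at least one underscore,
# camelCase needs at least one interior capital letter.
SNAKE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)+$")
CAMEL = re.compile(r"^[a-z][a-z0-9]*([A-Z][a-z0-9]*)+$")


def classify(name: str) -> str:
    """Label an identifier as snake_case, camelCase, or other."""
    if SNAKE.match(name):
        return "snake_case"
    if CAMEL.match(name):
        return "camelCase"
    return "other"


def variable_name_styles(source: str) -> Counter:
    """Count naming styles over the distinct names assigned in a Python source string."""
    counts: Counter = Counter()
    seen = set()
    for node in ast.walk(ast.parse(source)):
        # ast.Store marks a name being assigned to, i.e. a variable definition.
        if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            if node.id not in seen:
                seen.add(node.id)
                counts[classify(node.id)] += 1
    return counts


sample = "total_count = 1\nuserName = 2\nx = 3\ntotal_count = 4\n"
styles = variable_name_styles(sample)
snake_share = styles["snake_case"] / sum(styles.values())
```

Aggregating `snake_share` per repository and per quarter would give a time series comparable in spirit to the 47% and 51% figures above, though the exact numbers depend on how names are extracted and classified.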