2025-06-30

Title: VAT-KG: Knowledge-Intensive Multimodal Knowledge Graph Dataset for Retrieval-Augmented Generation

Authors: Hyeongcheol Park, MinHyuk Jang, Ha Dam Baek, Gyusam Chang, Jiyoung Seo, Jiwan Park, Hogun Park, Sangpil Kim
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.21556
Pdf URL: https://arxiv.org/pdf/2506.21556
Copy Paste: [[2506.21556]] VAT-KG: Knowledge-Intensive Multimodal Knowledge Graph Dataset for Retrieval-Augmented Generation(https://arxiv.org/abs/2506.21556)
Keywords: language model, llm, retrieval augmented generation, retrieval-augmented generation
Abstract: Multimodal Knowledge Graphs (MMKGs), which represent explicit knowledge across multiple modalities, play a pivotal role by complementing the implicit knowledge of Multimodal Large Language Models (MLLMs) and enabling more grounded reasoning via Retrieval Augmented Generation (RAG). However, existing MMKGs are generally limited in scope: they are often constructed by augmenting pre-existing knowledge graphs, which restricts their knowledge, resulting in outdated or incomplete knowledge coverage, and they often support only a narrow range of modalities, such as text and visual information. These limitations reduce their extensibility and applicability to a broad range of multimodal tasks, particularly as the field shifts toward richer modalities such as video and audio in recent MLLMs. Therefore, we propose the Visual-Audio-Text Knowledge Graph (VAT-KG), the first concept-centric and knowledge-intensive multimodal knowledge graph that covers visual, audio, and text information, where each triplet is linked to multimodal data and enriched with detailed descriptions of concepts. Specifically, our construction pipeline ensures cross-modal knowledge alignment between multimodal data and fine-grained semantics through a series of stringent filtering and alignment steps, enabling the automatic generation of MMKGs from any multimodal dataset. We further introduce a novel multimodal RAG framework that retrieves detailed concept-level knowledge in response to queries from arbitrary modalities. Experiments on question answering tasks across various modalities demonstrate the effectiveness of VAT-KG in supporting MLLMs, highlighting its practical value in unifying and leveraging multimodal knowledge.

Title: Debunk and Infer: Multimodal Fake News Detection via Diffusion-Generated Evidence and LLM Reasoning

Authors: Kaiying Yan, Moyang Liu, Yukun Liu, Ruibo Fu, Zhengqi Wen, Jianhua Tao, Xuefei Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.21557
Pdf URL: https://arxiv.org/pdf/2506.21557
Copy Paste: [[2506.21557]] Debunk and Infer: Multimodal Fake News Detection via Diffusion-Generated Evidence and LLM Reasoning(https://arxiv.org/abs/2506.21557)
Keywords: language model, llm, agent
Abstract: The rapid spread of fake news across multimedia platforms presents serious challenges to information credibility. In this paper, we propose a Debunk-and-Infer framework for Fake News Detection (DIFND) that leverages debunking knowledge to enhance both the performance and interpretability of fake news detection. DIFND integrates the generative strength of conditional diffusion models with the collaborative reasoning capabilities of multimodal large language models (MLLMs). Specifically, debunk diffusion is employed to generate refuting or authenticating evidence based on the multimodal content of news videos, enriching the evaluation process with diverse yet semantically aligned synthetic samples. To improve inference, we propose a chain-of-debunk strategy where a multi-agent MLLM system produces logic-grounded, multimodal-aware reasoning content and a final veracity judgment. By jointly modeling multimodal features, generative debunking cues, and reasoning-rich verification within a unified architecture, DIFND achieves notable improvements in detection accuracy. Extensive experiments on the FakeSV and FVC datasets show that DIFND not only outperforms existing approaches but also delivers trustworthy decisions.

Title: Bench to the Future: A Pastcasting Benchmark for Forecasting Agents

Authors: FutureSearch: Jack Wildman, Nikos I. Bosse, Daniel Hnyk, Peter Mühlbacher, Finn Hambly, Jon Evans, Dan Schwarz, Lawrence Phillips
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.21558
Pdf URL: https://arxiv.org/pdf/2506.21558
Copy Paste: [[2506.21558]] Bench to the Future: A Pastcasting Benchmark for Forecasting Agents(https://arxiv.org/abs/2506.21558)
Keywords: llm, chain-of-thought, agent
Abstract: Forecasting is a challenging task that offers a clearly measurable way to study AI systems. Forecasting requires a large amount of research on the internet, and evaluations require time for events to happen, making the development of forecasting benchmarks challenging. To date, no forecasting benchmark provides a realistic, hermetic, and repeatable environment for LLM forecasters. We introduce Bench To the Future (BTF), a "pastcasting" benchmark with hundreds of high-quality questions for which the resolution is already known. Each question is accompanied by a large offline corpus of tens of thousands of relevant web pages, enabling a way to elicit realistic "forecasts" on past events from LLMs. Results suggest that our pastcasting environment can produce results comparable to those based on forecasts using the internet on at-the-time unresolved questions. We show results benchmarking agent and chain-of-thought forecasting approaches using several LLMs, including the recently-released Claude 4 models, and demonstrate BTF's ability to track steady forecasting capability progress over time. We intend this to be a living benchmark, with new questions added continually to account for increasing training data cutoff dates. We invite researchers to contact us at hello@futuresearch.ai to utilize our benchmark or tooling for their own research.

Title: GraphLAMA: Enabling Efficient Adaptation of Graph Language Models with Limited Annotations

Authors: Junze Chen, Cheng Yang, Shujie Li, Zhiqiang Zhang, Yawen Li, Junping Du, Chuan Shi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.21559
Pdf URL: https://arxiv.org/pdf/2506.21559
Copy Paste: [[2506.21559]] GraphLAMA: Enabling Efficient Adaptation of Graph Language Models with Limited Annotations(https://arxiv.org/abs/2506.21559)
Keywords: language model, llm, long context, prompt
Abstract: Large language models (LLMs) have demonstrated their strong capabilities in various domains, and have been recently integrated for graph analysis as graph language models (GLMs). With LLMs as the predictor, some GLMs can interpret unseen tasks described by natural language, and learn from a few examples in the prompts without parameter tuning, known as in-context learning (ICL). Another subset of GLMs utilizes abundant training labels to enhance model performance, known as instruction tuning. However, we argue that ICL on graphs has effectiveness issues due to fixed parameters and efficiency issues due to long context. Meanwhile, the large amount of labeled data required for instruction tuning can be difficult to obtain in real-world scenarios. To this end, we aim to introduce an extra parameter adaptation stage that can efficiently tailor GLMs to an unseen graph and task with only a few labeled examples, in exchange for better prediction accuracy and faster inference speed. For implementation, in this paper we propose the GraphLAMA method, with its model backbone and learning schemes specialized for efficient tuning and inference. Specifically, for model backbone, we use a graph neural network (GNN) with several well-designed components to transform nodes into the representation space of LLM tokens. Task instructions can then be represented as a mixture of node and language tokens. In the pre-training stage, model parameters except the LLM will be trained with different tasks to capture general knowledge. In the adaptation stage, only a few pre-trained parameters will be updated based on few-shot examples. Extensive experiments on few/zero-shot node classification and summary generation show that our proposed GraphLAMA achieves state-of-the-art performance with 4.91% absolute improvement in accuracy. Compared with ICL, our inference speed can be 10 times faster under 5-shot setting.

Title: Reinforcement Learning Fine-Tuning of Language Model for Instruction Following and Math Reasoning

Authors: Yifu Han, Geo Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.21560
Pdf URL: https://arxiv.org/pdf/2506.21560
Copy Paste: [[2506.21560]] Reinforcement Learning Fine-Tuning of Language Model for Instruction Following and Math Reasoning(https://arxiv.org/abs/2506.21560)
Keywords: language model
Abstract: This study investigates the effectiveness of reinforcement learning (RL) fine-tuning techniques on a compact language model (Qwen2.5-0.5B Base) for two challenging tasks: instruction following and mathematical reasoning. We compare supervised fine-tuning (SFT), Direct Preference Optimization (DPO) using preference-labeled data, and Reinforce Leave-One-Out (RLOO) with reward models. Our experiments show that RLOO with DeBERTa reward modeling achieves the best alignment, while DPO provides strong and consistent results. For math reasoning tasks, synthetic data augmentation and best-of-N sampling with an external verifier significantly improve accuracy, showing the potential of combining fine-tuning with inference-time tools. This study highlights key trade-offs and practical strategies for training lightweight, task-aligned small-scale language models.
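Note: The abstract names Reinforce Leave-One-Out (RLOO) without describing it. The sketch below shows the leave-one-out baseline as it is commonly formulated: each sampled completion is scored against the mean reward of the other samples for the same prompt. The function name and the reward values are illustrative assumptions, not taken from the paper.

```python
from typing import List


def rloo_advantages(rewards: List[float]) -> List[float]:
    """Return each sample's reward minus the mean reward of the other samples."""
    k = len(rewards)
    if k < 2:
        raise ValueError("RLOO needs at least two samples per prompt")
    total = sum(rewards)
    # Leave-one-out baseline for sample i: (total - r_i) / (k - 1)
    return [r - (total - r) / (k - 1) for r in rewards]


# Hypothetical reward-model scores for four completions of one prompt.
advantages = rloo_advantages([1.0, 0.0, 0.5, 0.5])
```

By construction the advantages sum to zero for each prompt, so completions are reinforced only relative to their siblings.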

Title: Reasoning Isn't Enough: Examining Truth-Bias and Sycophancy in LLMs

Authors: Emilio Barkett, Olivia Long, Madhavendra Thakur
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.21561
Pdf URL: https://arxiv.org/pdf/2506.21561
Copy Paste: [[2506.21561]] Reasoning Isn't Enough: Examining Truth-Bias and Sycophancy in LLMs(https://arxiv.org/abs/2506.21561)
Keywords: language model, gpt, llm, prompt
Abstract: Despite their widespread use in fact-checking, moderation, and high-stakes decision-making, large language models (LLMs) remain poorly understood as judges of truth. This study presents the largest evaluation to date of LLMs' veracity detection capabilities and the first analysis of these capabilities in reasoning models. We had eight LLMs make 4,800 veracity judgments across several prompts, comparing reasoning and non-reasoning models. We find that rates of truth-bias, or the likelihood to believe a statement is true, regardless of whether it is actually true, are lower in reasoning models than in non-reasoning models, but still higher than human benchmarks. Most concerning, we identify sycophantic tendencies in several advanced models (o4-mini and GPT-4.1 from OpenAI, R1 from DeepSeek), which displayed an asymmetry in detection accuracy, performing well in truth accuracy but poorly in deception accuracy. This suggests that capability advances alone do not resolve fundamental veracity detection challenges in LLMs.
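Note: The three quantities the abstract contrasts (truth-bias, truth accuracy, deception accuracy) can be computed from paired labels and judgments. The sketch below assumes boolean encodings (True means "truthful") and uses made-up data; it is not the authors' evaluation code.

```python
from typing import Dict, List


def veracity_metrics(labels: List[bool], judgments: List[bool]) -> Dict[str, float]:
    """labels[i]: statement is actually truthful; judgments[i]: model judged it truthful."""
    if len(labels) != len(judgments) or not labels:
        raise ValueError("labels and judgments must be non-empty and equal in length")
    on_true = [j for l, j in zip(labels, judgments) if l]
    on_false = [j for l, j in zip(labels, judgments) if not l]
    if not on_true or not on_false:
        raise ValueError("need at least one truthful and one deceptive statement")
    return {
        # Share of all statements judged truthful, regardless of ground truth.
        "truth_bias": sum(judgments) / len(judgments),
        # Accuracy restricted to truthful statements.
        "truth_accuracy": sum(on_true) / len(on_true),
        # Accuracy restricted to deceptive statements (judged not truthful).
        "deception_accuracy": sum(not j for j in on_false) / len(on_false),
    }


# Hypothetical run: the model accepts almost everything as true.
labels = [True, True, True, True, False, False, False, False]
judgments = [True, True, True, True, True, True, True, False]
metrics = veracity_metrics(labels, judgments)
```

A model with high truth accuracy and low deception accuracy, as in this toy run, shows the asymmetry the abstract describes.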

Title: FloorPlan-DeepSeek (FPDS): A multimodal approach to floorplan generation using vector-based next room prediction

Authors: Jun Yin, Pengyu Zeng, Jing Zhong, Peilin Li, Miao Zhang, Ran Luo, Shuai Lu
Subjects: cs.CL, cs.AI, cs.AR
Abstract URL: https://arxiv.org/abs/2506.21562
Pdf URL: https://arxiv.org/pdf/2506.21562
Copy Paste: [[2506.21562]] FloorPlan-DeepSeek (FPDS): A multimodal approach to floorplan generation using vector-based next room prediction(https://arxiv.org/abs/2506.21562)
Keywords: language model
Abstract: In the architectural design process, floor plan generation is inherently progressive and iterative. However, existing generative models for floor plans are predominantly end-to-end generators that produce an entire pixel-based layout in a single pass. This paradigm is often incompatible with the incremental workflows observed in real-world architectural practice. To address this issue, we draw inspiration from the autoregressive 'next token prediction' mechanism commonly used in large language models, and propose a novel 'next room prediction' paradigm tailored to architectural floor plan modeling. Experimental evaluation indicates that FPDS demonstrates competitive performance in comparison to diffusion models and Tell2Design in the text-to-floorplan task, indicating its potential applicability in supporting future intelligent architectural design.

Title: FormosanBench: Benchmarking Low-Resource Austronesian Languages in the Era of Large Language Models

Authors: Kaiying Kevin Lin, Hsiyu Chen, Haopeng Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.21563
Pdf URL: https://arxiv.org/pdf/2506.21563
Copy Paste: [[2506.21563]] FormosanBench: Benchmarking Low-Resource Austronesian Languages in the Era of Large Language Models(https://arxiv.org/abs/2506.21563)
Keywords: language model, llm
Abstract: While large language models (LLMs) have demonstrated impressive performance across a wide range of natural language processing (NLP) tasks in high-resource languages, their capabilities in low-resource and minority languages remain significantly underexplored. Formosan languages -- a subgroup of Austronesian languages spoken in Taiwan -- are both linguistically rich and endangered, largely due to the sociolinguistic dominance of Mandarin. In this work, we introduce FORMOSANBENCH, the first benchmark for evaluating LLMs on low-resource Austronesian languages. It covers three endangered Formosan languages: Atayal, Amis, and Paiwan, across three core NLP tasks: machine translation, automatic speech recognition (ASR), and text summarization. We assess model performance in zero-shot, 10-shot, and fine-tuned settings using FORMOSANBENCH. Our results reveal a substantial performance gap between high-resource and Formosan languages. Existing LLMs consistently underperform across all tasks, with 10-shot learning and fine-tuning offering only limited improvements. These findings underscore the urgent need for more inclusive NLP technologies that can effectively support endangered and underrepresented languages. We release our datasets and code to facilitate future research in this direction.

Title: Team QUST at SemEval-2025 Task 10: Evaluating Large Language Models in Multiclass Multi-label Classification of News Entity Framing

Authors: Jiyan Liu, Youzheng Liu, Taihang Wang, Xiaoman Xu, Yimin Wang, Ye Jiang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.21564
Pdf URL: https://arxiv.org/pdf/2506.21564
Copy Paste: [[2506.21564]] Team QUST at SemEval-2025 Task 10: Evaluating Large Language Models in Multiclass Multi-label Classification of News Entity Framing(https://arxiv.org/abs/2506.21564)
Keywords: language model
Abstract: This paper describes the participation of QUST_NLP in the SemEval-2025 Task 7. We propose a three-stage retrieval framework specifically designed for fact-checked claim retrieval. Initially, we evaluate the performance of several retrieval models and select the one that yields the best results for candidate retrieval. Next, we employ multiple re-ranking models to enhance the candidate results, with each model selecting the Top-10 outcomes. In the final stage, we utilize weighted voting to determine the final retrieval outcomes. Our approach achieved 5th place in the monolingual track and 7th place in the crosslingual track. We release our system code at: this https URL.
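Note: The final stage described above, weighted voting over each re-ranker's Top-10, might look roughly like the sketch below. The abstract does not give the voting rule, so the scheme here (each re-ranker adds its weight to every candidate it lists; ties are broken by document id) is an assumption, as are the re-ranker names and weights.

```python
from collections import defaultdict
from typing import Dict, List


def weighted_vote(
    rankings: Dict[str, List[str]],
    weights: Dict[str, float],
    top_k: int = 10,
) -> List[str]:
    """Fuse per-model top-k lists by summing the weights of the models that voted for each id."""
    scores: Dict[str, float] = defaultdict(float)
    for model, ranked_ids in rankings.items():
        for doc_id in ranked_ids[:top_k]:
            scores[doc_id] += weights[model]
    # Highest total weight first; ties broken by id so the result is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_k]


# Hypothetical outputs of three re-rankers (truncated to three candidates each).
rankings = {
    "reranker_a": ["d1", "d2", "d3"],
    "reranker_b": ["d2", "d4", "d1"],
    "reranker_c": ["d4", "d2", "d5"],
}
weights = {"reranker_a": 0.5, "reranker_b": 0.25, "reranker_c": 0.25}
final = weighted_vote(rankings, weights, top_k=3)
```

A rank-sensitive variant would scale each vote by the candidate's position in the list; the flat vote is used here only to keep the sketch short.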

Title: A Multi-Agent Probabilistic Inference Framework Inspired by Kairanban-Style CoT System with IdoBata Conversation for Debiasing

Authors: Takato Ueno, Keito Inoshita
Subjects: cs.CL, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2506.21565
Pdf URL: https://arxiv.org/pdf/2506.21565
Copy Paste: [[2506.21565]] A Multi-Agent Probabilistic Inference Framework Inspired by Kairanban-Style CoT System with IdoBata Conversation for Debiasing(https://arxiv.org/abs/2506.21565)
Keywords: language model, llm, agent
Abstract: Japan's kairanban culture and idobata conversations have long functioned as traditional communication practices that foster nuanced dialogue among community members and contribute to the formation of social balance. Inspired by these information exchange processes, this study proposes a multi-agent inference framework (KCS+IBC) that integrates multiple large language models (LLMs) to achieve bias mitigation, improved explainability, and probabilistic prediction in sentiment analysis. In addition to sequentially sharing prediction results, the proposed method incorporates a mid-phase casual dialogue session to blend formal inference with individual perspectives and introduces probabilistic sentiment prediction. Experimental results show that KCS achieves accuracy comparable to that of a single LLM across datasets, while KCS+IBC exhibits a consistent decrease in entropy and a gradual increase in variance during the latter stages of inference, suggesting the framework's ability to balance aggregation and diversity of predictions. Future work will quantitatively assess the impact of these characteristics on bias correction and aim to develop more advanced sentiment analysis systems.
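Note: The abstract reports a decrease in entropy over the later inference stages. As a point of reference, the sketch below computes the Shannon entropy of a probabilistic sentiment prediction; the three-class distributions are invented for illustration, and the paper may aggregate entropy differently.

```python
import math
from typing import Sequence


def shannon_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in bits of a discrete probability distribution."""
    if any(p < 0 for p in probs) or abs(sum(probs) - 1.0) > 1e-6:
        raise ValueError("probs must be non-negative and sum to 1")
    # Terms with p == 0 contribute nothing (limit of p * log p as p -> 0).
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Hypothetical (negative, neutral, positive) predictions early and late in inference.
early = shannon_entropy([0.4, 0.3, 0.3])
late = shannon_entropy([0.8, 0.1, 0.1])
```

Lower entropy in the later stage means the prediction has become more concentrated on one sentiment class.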

Title: BioPars: A Pretrained Biomedical Large Language Model for Persian Biomedical Text Mining

Authors: Baqer M. Merzah, Tania Taami, Salman Asoudeh, Amir reza Hossein pour, Saeed Mirzaee, Amir Ali Bengari
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.21567
Pdf URL: https://arxiv.org/pdf/2506.21567
Copy Paste: [[2506.21567]] BioPars: A Pretrained Biomedical Large Language Model for Persian Biomedical Text Mining(https://arxiv.org/abs/2506.21567)
Keywords: language model, gpt, llm, chat
Abstract: Large Language Models (LLMs) have recently gained attention in the life sciences due to their capacity to model, extract, and apply complex biological information. Beyond their classical use as chatbots, these systems are increasingly used for complex analysis and problem-solving in specialized fields, including bioinformatics. First, we introduce BIOPARS-BENCH, a dataset from over 10,000 scientific articles, textbooks, and medical websites. BioParsQA was also introduced to evaluate the proposed model, which consists of 5,231 Persian medical questions and answers. This study then introduces BioPars, a simple but accurate measure designed to assess LLMs for three main abilities: acquiring subject-specific knowledge, interpreting and synthesizing such knowledge, and demonstrating proper evidence. Comparing ChatGPT, Llama, and Galactica, our study highlights their ability to remember and retrieve learned knowledge but also reveals shortcomings in addressing higher-level, real-world questions and fine-grained inferences. These findings indicate the need for further fine-tuning to address the capabilities of LLM in bioinformatics tasks. To our knowledge, BioPars is the first application of LLM in Persian medical QA, especially for generating long answers. Evaluation of four selected medical QA datasets shows that BioPars has achieved remarkable results compared to comparative approaches. The model on BioParsQA achieved a ROUGE-L score of 29.99, which is an improvement over GPT-4 1.0. The model achieved a BERTScore of 90.87 with the MMR method. The MoverScore and BLEURT values were also higher in this model than the other three models. In addition, the reported scores for the model are MoverScore=60.43 and BLEURT=50.78. BioPars is an ongoing project and all resources related to its development will be made available via the following GitHub repository: this https URL.

Title: Assessing RAG and HyDE on 1B vs. 4B-Parameter Gemma LLMs for Personal Assistants Integretion

Authors: Andrejs Sorstkins
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.21568
Pdf URL: https://arxiv.org/pdf/2506.21568
Copy Paste: [[2506.21568]] Assessing RAG and HyDE on 1B vs. 4B-Parameter Gemma LLMs for Personal Assistants Integretion(https://arxiv.org/abs/2506.21568)
Keywords: language model, llm, hallucination, prompt, retrieval-augmented generation
Abstract: Resource efficiency is a critical barrier to deploying large language models (LLMs) in edge and privacy-sensitive applications. This study evaluates the efficacy of two augmentation strategies, Retrieval-Augmented Generation (RAG) and Hypothetical Document Embeddings (HyDE), on compact Gemma LLMs of 1 billion and 4 billion parameters, within the context of a privacy-first personal assistant. We implement short-term memory via MongoDB and long-term semantic storage via Qdrant, orchestrated through FastAPI and LangChain, and expose the system through a this http URL frontend. Across both model scales, RAG consistently reduces latency by up to 17% and eliminates factual hallucinations when responding to user-specific and domain-specific queries. HyDE, by contrast, enhances semantic relevance, particularly for complex physics prompts, but incurs a 25-40% increase in response time and a non-negligible hallucination rate in personal-data retrieval. Comparing 1B to 4B models, we observe that scaling yields marginal throughput gains for baseline and RAG pipelines, but magnifies HyDE's computational overhead and variability. Our findings position RAG as the pragmatic choice for on-device personal assistants powered by small-scale LLMs.

Title: Hybrid-NL2SVA: Integrating RAG and Finetuning for LLM-based NL2SVA

Authors: Weihua Xiao, Derek Ekberg, Siddharth Garg, Ramesh Karri
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.21569
Pdf URL: https://arxiv.org/pdf/2506.21569
Copy Paste: [[2506.21569]] Hybrid-NL2SVA: Integrating RAG and Finetuning for LLM-based NL2SVA(https://arxiv.org/abs/2506.21569)
Keywords: language model, gpt, llm, prompt, retrieval-augmented generation
Abstract: SystemVerilog Assertions (SVAs) are critical for verifying the correctness of hardware designs, but manually writing them from natural language property descriptions, i.e., NL2SVA, remains a labor-intensive and error-prone task. Recent advances in large language models (LLMs) offer opportunities to automate this translation. However, existing models still struggle with understanding domain-specific syntax and semantics. To enhance LLM performance in NL2SVA, we propose a customized retrieval-augmented generation (RAG) framework and a synthetic fine-tuning dataset that together improve LLM performance. To further improve lightweight models over NL2SVA, our fine-tuning dataset provides prompt-guided explanations that teach LLMs the layer-by-layer construction process of concurrent SVAs, enabling supervised fine-tuning that greatly improves syntax and functionality accuracy. To evaluate the performance of LLMs over NL2SVA, we construct the largest evaluation dataset for NL2SVA, comprising 40 Verilog designs and 229 formally verified SVAs with detailed annotations. Experimental results show that our customized RAG framework increases the number of functionality-matched SVAs by 58.42% over GPT-4o-mini, while Qwen2.5-Coder-7B-Instruct fine-tuned on our fine-tuning dataset and integrated with HybridRetrieval achieves a 59.05% increase over the base Qwen model.

Title: Random Initialization Can't Catch Up: The Advantage of Language Model Transfer for Time Series Forecasting

Authors: Roland Riachi, Kashif Rasul, Arjun Ashok, Prateek Humane, Alexis Roger, Andrew R. Williams, Yuriy Nevmyvaka, Irina Rish
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.21570
Pdf URL: https://arxiv.org/pdf/2506.21570
Copy Paste: [[2506.21570]] Random Initialization Can't Catch Up: The Advantage of Language Model Transfer for Time Series Forecasting(https://arxiv.org/abs/2506.21570)
Keywords: language model
Abstract: Recent works have demonstrated the effectiveness of adapting pre-trained language models (LMs) for forecasting time series in the low-data regime. We build upon these findings by analyzing the effective transfer from language models to time series forecasting under various design choices including upstream post-training, time series tokenizer and language backbone size. In the low-data regime, these design choices have a significant impact on the validation loss, with clear-cut choices that outperform others. Contrary to Hernandez et al. (2021), we observe that the validation loss of the LMs continues to smoothly decrease long after the validation loss of the randomly initialized models has converged, leading to a non-vanishing transfer gap that holds across design choices. These findings not only help shed light on the effective use of compute-efficient training for time series, but also open the way for the study of modality-agnostic properties of data distributions leveraged by these models.

Title: Towards Understanding the Cognitive Habits of Large Reasoning Models

Authors: Jianshuo Dong, Yujia Fu, Chuanrui Hu, Chao Zhang, Han Qiu
Subjects: cs.CL, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2506.21571
Pdf URL: https://arxiv.org/pdf/2506.21571
Copy Paste: [[2506.21571]] Towards Understanding the Cognitive Habits of Large Reasoning Models(https://arxiv.org/abs/2506.21571)
Keywords: llm
Abstract: Large Reasoning Models (LRMs), which autonomously produce a reasoning Chain of Thought (CoT) before producing final responses, offer a promising approach to interpreting and monitoring model behaviors. Inspired by the observation that certain CoT patterns -- e.g., "Wait, did I miss anything?" -- consistently emerge across tasks, we explore whether LRMs exhibit human-like cognitive habits. Building on Habits of Mind, a well-established framework of cognitive habits associated with successful human problem-solving, we introduce CogTest, a principled benchmark designed to evaluate LRMs' cognitive habits. CogTest includes 16 cognitive habits, each instantiated with 25 diverse tasks, and employs an evidence-first extraction method to ensure reliable habit identification. With CogTest, we conduct a comprehensive evaluation of 16 widely used LLMs (13 LRMs and 3 non-reasoning ones). Our findings reveal that LRMs, unlike conventional LLMs, not only exhibit human-like habits but also adaptively deploy them according to different tasks. Finer-grained analyses further uncover patterns of similarity and difference in LRMs' cognitive habit profiles, particularly certain inter-family similarity (e.g., Qwen-3 models and DeepSeek-R1). Extending the study to safety-related tasks, we observe that certain habits, such as Taking Responsible Risks, are strongly associated with the generation of harmful responses. These findings suggest that studying persistent behavioral patterns in LRMs' CoTs is a valuable step toward deeper understanding of LLM misbehavior. The code is available at: this https URL.

Title: Aligning MLLM Benchmark With Human Preferences via Structural Equation Modeling

Authors: Tianyu.Zou, Shengwu.Xiong, Ruilin.Yao, Jirui.Huang, Yi.Rong, Yaxiong.Chen, Shili.Xiong, Cong.Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.21572
Pdf URL: https://arxiv.org/pdf/2506.21572
Copy Paste: [[2506.21572]] Aligning MLLM Benchmark With Human Preferences via Structural Equation Modeling(https://arxiv.org/abs/2506.21572)
Keywords: language model, llm
Abstract: Evaluating multimodal large language models (MLLMs) remains a fundamental challenge due to a lack of structured, interpretable, and theoretically grounded benchmark designs. Existing benchmarks often adopt heuristic-based task groupings with unclear cognitive targets, thus resulting in overlapping abilities, redundant indicators, and limited diagnostic power. In this work, we propose a novel framework for aligning MLLM benchmarks based on Structural Equation Modeling (SEM) to analyze and quantify the internal validity, dimensional separability, and contribution of benchmark components. Motivated by the observed limitations of current designs, we further introduce a novel capability hierarchy grounded in Piaget's theory of cognitive development, dividing MLLM abilities into three hierarchical layers, i.e., Perception, Memory, and Reasoning. We reorganize existing MLLM benchmarks under the proposed framework and construct a new benchmark named Gold. Experimental results demonstrate that the proposed benchmark exhibits stronger interpretability, reduced indicator redundancy, and clearer cognitive consistency compared to existing approaches.

Title: Instruction Learning Paradigms: A Dual Perspective on White-box and Black-box LLMs

Authors: Yanwei Ren, Liu Liu, Baosheng Yu, Jiayan Qiu, Quan Chen
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.21573
Pdf URL: https://arxiv.org/pdf/2506.21573
Copy Paste: [[2506.21573]] Instruction Learning Paradigms: A Dual Perspective on White-box and Black-box LLMs(https://arxiv.org/abs/2506.21573)
Keywords: language model, llm
Abstract: Optimizing instructions for large language models (LLMs) is critical for harnessing their full potential in complex and diverse tasks. However, relying solely on white-box approaches demands extensive computational resources and offers limited representational capacity, while black-box models can incur prohibitive financial costs. To address these challenges, we introduce a novel framework that seamlessly merges the strengths of both paradigms. Black-box models provide high-quality, diverse instruction initializations, and white-box models supply fine-grained interpretability through hidden states and output features. By enforcing a semantic similarity constraint, these components fuse into a unified high-dimensional representation that captures deep semantic and structural nuances, enabling an iterative optimization process to refine instruction quality and adaptability. Extensive evaluations across a broad spectrum of tasks, ranging from complex reasoning to cross-lingual generalization, demonstrate that our approach consistently outperforms state-of-the-art baselines. This fusion of black-box initialization with advanced semantic refinement yields a scalable and efficient solution, paving the way for next-generation LLM-driven applications in diverse real-world scenarios. The source code will be released soon.

Title: Digital Gatekeepers: Exploring Large Language Model's Role in Immigration Decisions

Authors: Yicheng Mao, Yang Zhao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.21574
Pdf URL: https://arxiv.org/pdf/2506.21574
Copy Paste: [[2506.21574]] Digital Gatekeepers: Exploring Large Language Model's Role in Immigration Decisions(https://arxiv.org/abs/2506.21574)
Keywords: language model, gpt, llm, chat
Abstract: With globalization and increasing immigrant populations, immigration departments face significant workloads and the challenge of ensuring fairness in decision-making processes. Integrating artificial intelligence offers a promising solution to these challenges. This study investigates the potential of large language models (LLMs), such as GPT-3.5 and GPT-4, in supporting immigration decision-making. Utilizing a mixed-methods approach, this paper conducted discrete choice experiments and in-depth interviews to study LLM decision-making strategies and whether they are fair. Our findings demonstrate that LLMs can align their decision-making with human strategies, emphasizing utility maximization and procedural fairness. Meanwhile, this paper also reveals that while ChatGPT has safeguards to prevent unintentional discrimination, it still exhibits stereotypes and biases concerning nationality and shows preferences toward privileged groups. This dual analysis highlights both the potential and limitations of LLMs in automating and enhancing immigration decisions.

Title: STRuCT-LLM: Unifying Tabular and Graph Reasoning with Reinforcement Learning for Semantic Parsing

Authors: Josefa Lia Stoisser, Marc Boubnovski Martell, Lawrence Phillips, Casper Hansen, Julien Fauqueur
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.21575
Pdf URL: https://arxiv.org/pdf/2506.21575
Copy Paste: [[2506.21575]] STRuCT-LLM: Unifying Tabular and Graph Reasoning with Reinforcement Learning for Semantic Parsing(https://arxiv.org/abs/2506.21575)
Keywords: language model, llm, chain-of-thought
Abstract: We propose STRuCT-LLM, a unified framework for training large language models (LLMs) to perform structured reasoning over both relational and graph-structured data. Our approach jointly optimizes Text-to-SQL and Text-to-Cypher tasks using reinforcement learning (RL) combined with Chain-of-Thought (CoT) supervision. To support fine-grained optimization in graph-based parsing, we introduce a topology-aware reward function based on graph edit distance. Unlike prior work that treats relational and graph formalisms in isolation, STRuCT-LLM leverages shared abstractions between SQL and Cypher to induce cross-formalism transfer, enabling SQL training to improve Cypher performance and vice versa, even without shared schemas. Our largest model (QwQ-32B) achieves substantial relative improvements across tasks: on semantic parsing, Spider improves by 13.5% and Text2Cypher by 73.1%. The model also demonstrates strong zero-shot generalization, improving performance on downstream tabular QA (TableBench: 8.5%) and knowledge graph QA (CR-LT-KGQA: 1.7%) without any QA-specific supervision. These results demonstrate both the effectiveness of executable queries as scaffolds for structured reasoning and the synergistic benefits of jointly training on SQL and Cypher (code available at this https URL).
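Illustration: the idea of a reward based on graph edit distance can be sketched as follows. This is a minimal sketch under strong assumptions, not the paper's reward function: query graphs are given as sets of labelled nodes and edges with shared node identifiers, so the edit distance reduces to counting insertions and deletions. The names `simple_edit_distance` and `ged_reward`, the normalization, and the example graphs are all hypothetical.

```python
def simple_edit_distance(nodes_a, edges_a, nodes_b, edges_b):
    """Count node and edge insertions/deletions needed to turn graph A into graph B.

    Assumes node identifiers are shared between the graphs, so no node
    matching search is needed (general graph edit distance is much harder).
    """
    node_ops = len(set(nodes_a) ^ set(nodes_b))
    edge_ops = len(set(edges_a) ^ set(edges_b))
    return node_ops + edge_ops


def ged_reward(pred, gold):
    """Map the edit distance to a reward in [0, 1]; 1.0 means identical graphs."""
    pred_nodes, pred_edges = pred
    gold_nodes, gold_edges = gold
    distance = simple_edit_distance(pred_nodes, pred_edges, gold_nodes, gold_edges)
    # Normalize by the combined size of both graphs, the largest possible distance.
    max_distance = (len(set(pred_nodes)) + len(set(pred_edges))
                    + len(set(gold_nodes)) + len(set(gold_edges)))
    if max_distance == 0:
        return 1.0
    return 1.0 - distance / max_distance


# Hypothetical graphs derived from a gold query and two predicted queries.
gold = ({"Person", "Movie"}, {("Person", "ACTED_IN", "Movie")})
exact = ({"Person", "Movie"}, {("Person", "ACTED_IN", "Movie")})
wrong_edge = ({"Person", "Movie"}, {("Person", "DIRECTED", "Movie")})
disjoint = ({"City"}, set())

print(ged_reward(exact, gold))       # identical graphs get the full reward
print(ged_reward(wrong_edge, gold))  # a wrong relation earns partial credit
```

Unlike an exact-match reward, a graded signal of this kind gives partial credit to a prediction that has the right nodes but a wrong relation.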

Title: Adapting Whisper for Parameter-efficient Code-Switching Speech Recognition via Soft Prompt Tuning

Authors: Hongli Yang, Yizhou Peng, Hao Huang, Sheng Li
Subjects: cs.CL, cs.AI, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2506.21576
Pdf URL: https://arxiv.org/pdf/2506.21576
Copy Paste: [[2506.21576]] Adapting Whisper for Parameter-efficient Code-Switching Speech Recognition via Soft Prompt Tuning(https://arxiv.org/abs/2506.21576)
Keywords: prompt
Abstract: Large-scale multilingual ASR models like Whisper excel in high-resource settings but face challenges in low-resource scenarios, such as rare languages and code-switching (CS), due to computational costs and catastrophic forgetting. We explore Soft Prompt Tuning (SPT), a parameter-efficient method to enhance CS ASR while preserving prior knowledge. We evaluate two strategies: (1) full fine-tuning (FFT) of both soft prompts and the entire Whisper model, demonstrating improved cross-lingual capabilities compared to traditional methods, and (2) adhering to SPT's original design by freezing model parameters and only training soft prompts. Additionally, we introduce SPT4ASR, a combination of different SPT variants. Experiments on the SEAME and ASRU2019 datasets show that deep prompt tuning is the most effective SPT approach, and our SPT4ASR methods achieve further error reductions in CS ASR, maintaining parameter efficiency similar to LoRA, without degrading performance on existing languages.

Title: Language-Aware Prompt Tuning for Parameter-Efficient Seamless Language Expansion in Multilingual ASR

Authors: Hongli Yang, Sheng Li, Hao Huang, Ayiduosi Tuohan, Yizhou Peng
Subjects: cs.CL, cs.AI, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2506.21577
Pdf URL: https://arxiv.org/pdf/2506.21577
Copy Paste: [[2506.21577]] Language-Aware Prompt Tuning for Parameter-Efficient Seamless Language Expansion in Multilingual ASR(https://arxiv.org/abs/2506.21577)
Keywords: prompt
Abstract: Recent advancements in multilingual automatic speech recognition (ASR) have been driven by large-scale end-to-end models like Whisper. However, challenges such as language interference and expanding to unseen languages (language expansion) without degrading performance persist. This paper addresses these with three contributions: 1) Entire Soft Prompt Tuning (Entire SPT), which applies soft prompts to both the encoder and decoder, enhancing feature extraction and decoding; 2) Language-Aware Prompt Tuning (LAPT), which leverages cross-lingual similarities to encode shared and language-specific features using lightweight prompt matrices; 3) SPT-Whisper, a toolkit that integrates SPT into Whisper and enables efficient continual learning. Experiments across three languages from FLEURS demonstrate that Entire SPT and LAPT outperform Decoder SPT by 5.0% and 16.0% in language expansion tasks, respectively, providing an efficient solution for dynamic, multilingual ASR models with minimal computational overhead.

Title: HealthQA-BR: A System-Wide Benchmark Reveals Critical Knowledge Gaps in Large Language Models

Authors: Andrew Maranhão Ventura D'addario
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.21578
Pdf URL: https://arxiv.org/pdf/2506.21578
Copy Paste: [[2506.21578]] HealthQA-BR: A System-Wide Benchmark Reveals Critical Knowledge Gaps in Large Language Models(https://arxiv.org/abs/2506.21578)
Keywords: language model, gpt, llm
Abstract: The evaluation of Large Language Models (LLMs) in healthcare has been dominated by physician-centric, English-language benchmarks, creating a dangerous illusion of competence that ignores the interprofessional nature of patient care. To provide a more holistic and realistic assessment, we introduce HealthQA-BR, the first large-scale, system-wide benchmark for Portuguese-speaking healthcare. Comprising 5,632 questions from Brazil's national licensing and residency exams, it uniquely assesses knowledge not only in medicine and its specialties but also in nursing, dentistry, psychology, social work, and other allied health professions. We conducted a rigorous zero-shot evaluation of over 20 leading LLMs. Our results reveal that while state-of-the-art models like GPT 4.1 achieve high overall accuracy (86.6%), this top-line score masks alarming, previously unmeasured deficiencies. A granular analysis shows performance plummets from near-perfect in specialties like Ophthalmology (98.7%) to barely passing in Neurosurgery (60.0%) and, most notably, Social Work (68.4%). This "spiky" knowledge profile is a systemic issue observed across all models, demonstrating that high-level scores are insufficient for safety validation. By publicly releasing HealthQA-BR and our evaluation suite, we provide a crucial tool to move beyond single-score evaluations and toward a more honest, granular audit of AI readiness for the entire healthcare team.

Title: From General Reasoning to Domain Expertise: Uncovering the Limits of Generalization in Large Language Models

Authors: Dana Alsagheer, Yang Lu, Abdulrahman Kamal, Omar Kamal, Mohammad Kamal, Nada Mansour, Cosmo Yang Wu, Rambiba Karanjai, Sen Li, Weidong Shi
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2506.21580
Pdf URL: https://arxiv.org/pdf/2506.21580
Copy Paste: [[2506.21580]] From General Reasoning to Domain Expertise: Uncovering the Limits of Generalization in Large Language Models(https://arxiv.org/abs/2506.21580)
Keywords: language model, llm
Abstract: Recent advancements in Large Language Models (LLMs) have demonstrated remarkable capabilities in various domains. However, effective decision-making relies heavily on strong reasoning abilities. Reasoning is the foundation for decision-making, providing the analytical and logical framework to make sound choices. Reasoning involves analyzing information, drawing inferences, and reaching conclusions based on logic or evidence. Decision-making builds on this foundation by applying the insights from reasoning to select the best course of action among alternatives. Together, these processes create a continuous cycle of thought and action aimed at achieving goals effectively. As AI technology evolves, there is a growing trend to train LLMs to excel in general reasoning. This study explores how the general reasoning capabilities of LLMs connect to their performance in domain-specific reasoning tasks.

Title: VIDEE: Visual and Interactive Decomposition, Execution, and Evaluation of Text Analytics with Intelligent Agents

Authors: Sam Yu-Te Lee, Chengyang Ji, Shicheng Wen, Lifu Huang, Dongyi Liu, Kwan-Liu Ma
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2506.21582
Pdf URL: https://arxiv.org/pdf/2506.21582
Copy Paste: [[2506.21582]] VIDEE: Visual and Interactive Decomposition, Execution, and Evaluation of Text Analytics with Intelligent Agents(https://arxiv.org/abs/2506.21582)
Keywords: language model, llm, agent
Abstract: Text analytics has traditionally required specialized knowledge in Natural Language Processing (NLP) or text analysis, which presents a barrier for entry-level analysts. Recent advances in large language models (LLMs) have changed the landscape of NLP by enabling more accessible and automated text analysis (e.g., topic detection, summarization, information extraction, etc.). We introduce VIDEE, a system that supports entry-level data analysts to conduct advanced text analytics with intelligent agents. VIDEE instantiates a human-agent collaboration workflow consisting of three stages: (1) Decomposition, which incorporates a human-in-the-loop Monte-Carlo Tree Search algorithm to support generative reasoning with human feedback, (2) Execution, which generates an executable text analytics pipeline, and (3) Evaluation, which integrates LLM-based evaluation and visualizations to support user validation of execution results. We conduct two quantitative experiments to evaluate VIDEE's effectiveness and analyze common agent errors. A user study involving participants with varying levels of NLP and text analytics experience -- from none to expert -- demonstrates the system's usability and reveals distinct user behavior patterns. The findings identify design implications for human-agent collaboration, validate the practical utility of VIDEE for non-expert users, and inform future improvements to intelligent text analytics systems.

Title: Empirical Evidence for Alignment Faking in Small LLMs and Prompt-Based Mitigation Techniques

Authors: J. Koorndijk
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2506.21584
Pdf URL: https://arxiv.org/pdf/2506.21584
Copy Paste: [[2506.21584]] Empirical Evidence for Alignment Faking in Small LLMs and Prompt-Based Mitigation Techniques(https://arxiv.org/abs/2506.21584)
Keywords: language model, llm, prompt
Abstract: Current literature suggests that alignment faking (deceptive alignment) is an emergent property of large language models. We present the first empirical evidence that a small instruction-tuned model, specifically LLaMA 3 8B, can also exhibit alignment faking. We further show that prompt-only interventions, including deontological moral framing and scratchpad reasoning, significantly reduce this behavior without modifying model internals. This challenges the assumption that prompt-based ethics are trivial and that deceptive alignment requires scale. We introduce a taxonomy distinguishing shallow deception, shaped by context and suppressible through prompting, from deep deception, which reflects persistent, goal-driven misalignment. Our findings refine the understanding of deception in language models and underscore the need for alignment evaluations across model sizes and deployment settings.

Title: Evaluation of LLM-based Strategies for the Extraction of Food Product Information from Online Shops

Authors: Christoph Brosch, Sian Brumm, Rolf Krieger, Jonas Scheffler
Subjects: cs.CL, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2506.21585
Pdf URL: https://arxiv.org/pdf/2506.21585
Copy Paste: [[2506.21585]] Evaluation of LLM-based Strategies for the Extraction of Food Product Information from Online Shops(https://arxiv.org/abs/2506.21585)
Keywords: language model, llm
Abstract: Generative AI and large language models (LLMs) offer significant potential for automating the extraction of structured information from web pages. In this work, we focus on food product pages from online retailers and explore schema-constrained extraction approaches to retrieve key product attributes, such as ingredient lists and nutrition tables. We compare two LLM-based approaches, direct extraction and indirect extraction via generated functions, evaluating them in terms of accuracy, efficiency, and cost on a curated dataset of 3,000 food product pages from three different online shops. Our results show that although the indirect approach achieves slightly lower accuracy (96.48%, -1.61% compared to direct extraction), it reduces the number of required LLM calls by 95.82%, leading to substantial efficiency gains and lower operational costs. These findings suggest that indirect extraction approaches can provide scalable and cost-effective solutions for large-scale information extraction tasks from template-based web pages using LLMs.
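Illustration: the reason indirect extraction needs fewer LLM calls can be sketched as below. This is a toy sketch, not the paper's pipeline. It assumes one page template per shop, and `fake_llm_generate_extractor` is a hypothetical stand-in that returns a fixed regex-based function instead of calling any model; the page markup is invented for the example.

```python
import re

llm_calls = 0  # counts simulated LLM invocations


def fake_llm_generate_extractor(sample_html):
    """Stand-in for an LLM call that returns an extraction function for a template."""
    global llm_calls
    llm_calls += 1
    pattern = re.compile(r'<span class="ingredients">(.*?)</span>')

    def extract(html):
        match = pattern.search(html)
        return {"ingredients": match.group(1) if match else None}

    return extract


def indirect_extract(pages):
    """Generate one extractor per shop template, then reuse it for every page."""
    extractors = {}
    results = []
    for shop, html in pages:
        if shop not in extractors:
            # Only the first page of each shop triggers a (simulated) LLM call.
            extractors[shop] = fake_llm_generate_extractor(html)
        results.append(extractors[shop](html))
    return results


pages = [
    ("shop_a", '<span class="ingredients">water, sugar</span>'),
    ("shop_a", '<span class="ingredients">oats</span>'),
    ("shop_b", '<span class="ingredients">milk</span>'),
]
results = indirect_extract(pages)
print(results)
print(llm_calls)  # one call per shop; a direct approach would call once per page
```

The trade-off is that a generated function only works while pages follow the template it was generated for, which is consistent with the abstract's restriction to template-based web pages.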

Title: Can Vision Language Models Understand Mimed Actions?

Authors: Hyundong Cho, Spencer Lin, Tejas Srinivasan, Michael Saxon, Deuksin Kwon, Natali T. Chavez, Jonathan May
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2506.21586
Pdf URL: https://arxiv.org/pdf/2506.21586
Copy Paste: [[2506.21586]] Can Vision Language Models Understand Mimed Actions?(https://arxiv.org/abs/2506.21586)
Keywords: language model
Abstract: Nonverbal communication (NVC) plays an integral role in human language, but studying NVC in general is challenging because of its broad scope and high variance in interpretation among individuals and cultures. However, mime -- the theatrical technique of suggesting intent using only gesture, expression, and movement -- is a subset of NVC that consists of explicit and embodied actions with much lower human interpretation variance. We argue that a solid understanding of mimed actions is a crucial prerequisite for vision-language models capable of interpreting and commanding more subtle aspects of NVC. Hence, we propose Mime Identification Multimodal Evaluation (MIME), a novel video-based question answering benchmark comprising 86 mimed actions. Constructed with motion capture data, MIME consists of variations of each action with perturbations applied to the character, background, and viewpoint for evaluating recognition robustness. We find that both open-weight and API-based vision-language models perform significantly worse than humans on MIME, motivating the need for increased research for instilling more robust understanding of human gestures.

Title: Is DeepSeek a New Voice Among LLMs in Public Opinion Simulation?

Authors: Weihong Qi, Fan Huang, Jisun An, Haewoon Kwak
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.21587
Pdf URL: https://arxiv.org/pdf/2506.21587
Copy Paste: [[2506.21587]] Is DeepSeek a New Voice Among LLMs in Public Opinion Simulation?(https://arxiv.org/abs/2506.21587)
Keywords: language model, gpt, llm
Abstract: This study evaluates the ability of DeepSeek, an open-source large language model (LLM), to simulate public opinions in comparison to LLMs developed by major tech companies. By comparing DeepSeek-R1 and DeepSeek-V3 with Qwen2.5, GPT-4o, and Llama-3.3 and utilizing survey data from the American National Election Studies (ANES) and the Zuobiao dataset of China, we assess these models' capacity to predict public opinions on social issues in both China and the United States, highlighting their comparative capabilities between countries. Our findings indicate that DeepSeek-V3 performs best in simulating U.S. opinions on the abortion issue compared to other topics such as climate change, gun control, immigration, and services for same-sex couples, primarily because it more accurately simulates responses when provided with Democratic or liberal personas. For Chinese samples, DeepSeek-V3 performs best in simulating opinions on foreign aid and individualism but shows limitations in modeling views on capitalism, particularly failing to capture the stances of low-income and non-college-educated individuals. It does not exhibit significant differences from other models in simulating opinions on traditionalism and the free market. Further analysis reveals that all LLMs exhibit the tendency to overgeneralize a single perspective within demographic groups, often defaulting to consistent responses within groups. These findings highlight the need to mitigate cultural and demographic biases in LLM-driven public opinion modeling, calling for approaches such as more inclusive training methodologies.

Title: Understanding Verbatim Memorization in LLMs Through Circuit Discovery

Authors: Ilya Lasy, Peter Knees, Stefan Woltran
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.21588
Pdf URL: https://arxiv.org/pdf/2506.21588
Copy Paste: [[2506.21588]] Understanding Verbatim Memorization in LLMs Through Circuit Discovery(https://arxiv.org/abs/2506.21588)
Keywords: llm
Abstract: The underlying mechanisms of memorization in LLMs -- the verbatim reproduction of training data -- remain poorly understood. What exact part of the network decides to retrieve a token that we would consider the start of a memorized sequence? How exactly does the model's behaviour differ when producing a memorized sentence versus a non-memorized one? In this work we approach these questions from a mechanistic interpretability standpoint by utilizing transformer circuits -- the minimal computational subgraphs that perform specific functions within the model. Through carefully constructed contrastive datasets, we identify points where model generation diverges from memorized content and isolate the specific circuits responsible for two distinct aspects of memorization. We find that circuits that initiate memorization can also maintain it once started, while circuits that only maintain memorization cannot trigger its initiation. Intriguingly, memorization prevention mechanisms transfer robustly across different text domains, while memorization induction appears more context-dependent.

Title: A General Method for Detecting Information Generated by Large Language Models

Authors: Minjia Mao, Dongjun Wei, Xiao Fang, Michael Chau
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.21589
Pdf URL: https://arxiv.org/pdf/2506.21589
Copy Paste: [[2506.21589]] A General Method for Detecting Information Generated by Large Language Models(https://arxiv.org/abs/2506.21589)
Keywords: language model, llm
Abstract: The proliferation of large language models (LLMs) has significantly transformed the digital information landscape, making it increasingly challenging to distinguish between human-written and LLM-generated content. Detecting LLM-generated information is essential for preserving trust on digital platforms (e.g., social media and e-commerce sites) and preventing the spread of misinformation, a topic that has garnered significant attention in IS research. However, current detection methods, which primarily focus on identifying content generated by specific LLMs in known domains, face challenges in generalizing to new (i.e., unseen) LLMs and domains. This limitation reduces their effectiveness in real-world applications, where the number of LLMs is rapidly multiplying and content spans a vast array of domains. In response, we introduce a general LLM detector (GLD) that combines a twin memory networks design and a theory-guided detection generalization module to detect LLM-generated information across unseen LLMs and domains. Using real-world datasets, we conduct extensive empirical evaluations and case studies to demonstrate the superiority of GLD over state-of-the-art detection methods. The study has important academic and practical implications for digital platforms and LLMs.

Title: Representation Consistency for Accurate and Coherent LLM Answer Aggregation

Authors: Junqi Jiang, Tom Bewley, Salim I. Amoukou, Francesco Leofante, Antonio Rago, Saumitra Mishra, Francesca Toni
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.21590
Pdf URL: https://arxiv.org/pdf/2506.21590
Copy Paste: [[2506.21590]] Representation Consistency for Accurate and Coherent LLM Answer Aggregation(https://arxiv.org/abs/2506.21590)
Keywords: language model, llm, prompt
Abstract: Test-time scaling improves large language models' (LLMs) performance by allocating more compute budget during inference. To achieve this, existing methods often require intricate modifications to prompting and sampling strategies. In this work, we introduce representation consistency (RC), a test-time scaling method for aggregating answers drawn from multiple candidate responses of an LLM regardless of how they were generated, including variations in prompt phrasing and sampling strategy. RC enhances answer aggregation by not only considering the number of occurrences of each answer in the candidate response set, but also the consistency of the model's internal activations while generating the set of responses leading to each answer. These activations can be either dense (raw model activations) or sparse (encoded via pretrained sparse autoencoders). Our rationale is that if the model's representations of multiple responses converging on the same answer are highly variable, this answer is more likely to be the result of incoherent reasoning and should be down-weighted during aggregation. Importantly, our method only uses cached activations and lightweight similarity computations and requires no additional model queries. Through experiments with four open-source LLMs and four reasoning datasets, we validate the effectiveness of RC for improving task performance during inference, with consistent accuracy improvements (up to 4%) over strong test-time scaling baselines. We also show that consistency in the sparse activation signals aligns well with the common notion of coherent reasoning.
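Illustration: the aggregation idea, combining answer frequency with the consistency of the activations behind each answer, can be sketched as below. This is a minimal sketch under assumptions, not the paper's scoring rule: consistency is taken to be mean pairwise cosine similarity, the two terms are mixed linearly with a hypothetical `weight`, and the two-dimensional activation vectors are invented toy data.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def consistency(vectors):
    """Mean pairwise cosine similarity of the activations behind one answer.

    A lone response gives no evidence of agreement, so it scores 0.0 here
    (an arbitrary choice made for this sketch).
    """
    if len(vectors) < 2:
        return 0.0
    sims = [cosine(vectors[i], vectors[j])
            for i in range(len(vectors)) for j in range(i + 1, len(vectors))]
    return sum(sims) / len(sims)


def aggregate(responses, weight=0.5):
    """Pick the answer with the best mix of vote share and activation consistency.

    responses: list of (answer, activation_vector) pairs from cached activations.
    weight: hypothetical trade-off between vote share and consistency.
    """
    groups = defaultdict(list)
    for answer, activation in responses:
        groups[answer].append(activation)
    total = len(responses)
    scores = {
        answer: weight * len(vecs) / total + (1 - weight) * consistency(vecs)
        for answer, vecs in groups.items()
    }
    return max(scores, key=scores.get), scores


# "A" wins the majority vote, but its activations point in different directions;
# "B" has fewer votes with closely aligned activations.
responses = [
    ("A", [1.0, 0.0]), ("A", [0.0, 1.0]), ("A", [-1.0, 0.0]),
    ("B", [1.0, 1.0]), ("B", [2.0, 2.0]),
]
best, scores = aggregate(responses)
print(best, scores)
```

In this toy case the minority answer is selected because the majority answer's representations disagree with each other, which mirrors the down-weighting rationale stated in the abstract.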

Title: FinEval-KR: A Financial Domain Evaluation Framework for Large Language Models' Knowledge and Reasoning

Authors: Shaoyu Dou, Yutian Shen, Mofan Chen, Zixuan Wang, Jiajie Xu, Qi Guo, Kailai Shao, Chao Chen, Haixiang Hu, Haibo Shi, Min Min, Liwen Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.21591
Pdf URL: https://arxiv.org/pdf/2506.21591
Copy Paste: [[2506.21591]] FinEval-KR: A Financial Domain Evaluation Framework for Large Language Models' Knowledge and Reasoning(https://arxiv.org/abs/2506.21591)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) demonstrate significant potential but face challenges in complex financial reasoning tasks requiring both domain knowledge and sophisticated reasoning. Current evaluation benchmarks often fall short: they do not decouple these capability indicators from single-task performance, and they lack root-cause analysis of task failures. To address this, we introduce FinEval-KR, a novel evaluation framework for decoupling and quantifying LLMs' knowledge and reasoning abilities independently, proposing distinct knowledge score and reasoning score metrics. Inspired by cognitive science, we further propose a cognitive score based on Bloom's taxonomy to analyze capabilities in reasoning tasks across different cognitive levels. We also release a new open-source Chinese financial reasoning dataset covering 22 subfields to support reproducible research and further advancements in financial reasoning. Our experimental results reveal that LLM reasoning ability and higher-order cognitive ability are the core factors influencing reasoning accuracy. We also specifically find that even top models still face a bottleneck with knowledge application. Furthermore, our analysis shows that specialized financial LLMs generally lag behind the top general large models across multiple metrics.

Title: Gazal-R1: Achieving State-of-the-Art Medical Reasoning with Parameter-Efficient Two-Stage Training

Authors: Ahmed M. Adly, Mostafa Samy, Amr Fawzy
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.21594
Pdf URL: https://arxiv.org/pdf/2506.21594
Copy Paste: [[2506.21594]] Gazal-R1: Achieving State-of-the-Art Medical Reasoning with Parameter-Efficient Two-Stage Training(https://arxiv.org/abs/2506.21594)
Keywords: language model
Abstract: We present Gazal-R1, a 32-billion-parameter language model that achieves state-of-the-art performance in medical reasoning while providing transparent, step-by-step explanations for clinical decision-making. Built upon Qwen3 32B, our model demonstrates that strategic training can enable mid-sized models to outperform significantly larger counterparts in specialized domains. We developed a novel two-stage training pipeline: first, supervised fine-tuning on a carefully curated dataset of 107,033 synthetic medical reasoning examples that teaches structured clinical thinking, enhanced by advanced parameter-efficient techniques including Weight-Decomposed Low-Rank Adaptation (DoRA) and Rank-Stabilized LoRA (rsLoRA); second, reinforcement learning using Group Relative Policy Optimization (GRPO) with a sophisticated multi-component reward system that refines accuracy, format adherence, and reasoning quality. Gazal-R1 achieves exceptional performance across medical benchmarks, scoring 87.1% on MedQA, 81.6% on MMLU Pro (Medical), and 79.6% on PubMedQA, surpassing models up to 12x larger. Beyond its strong empirical results, this work provides detailed insights into the challenges of training reasoning-capable models in specialized domains, including issues with reward hacking, training instability, and the fundamental tension between factual recall and detailed reasoning. Our methodology offers a reproducible framework for developing high-capability, domain-specific language models that balance performance, efficiency, and explainability.

Title: Thunder-LLM: Efficiently Adapting LLMs to Korean with Minimal Resources

Authors: Jinpyo Kim, Gyeongje Cho, Chanwoo Park, Jongwon Park, Jongmin Kim, Yeonkyoun So, Jaejin Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.21595
Pdf URL: https://arxiv.org/pdf/2506.21595
Copy Paste: [[2506.21595]] Thunder-LLM: Efficiently Adapting LLMs to Korean with Minimal Resources(https://arxiv.org/abs/2506.21595)
Keywords: llm
Abstract: Since state-of-the-art LLMs often underperform in languages other than English or Chinese, improving the capability of LLMs in new languages has become an essential task. Moreover, LLMs' entire end-to-end training process remains largely unknown to the public due to proprietary reasons, technical complexity, inconsistent documentation, and ethical considerations. The complete picture remains a closely guarded secret within the industry. This paper presents methods to adapt an existing English-based LLM to Korean in a low-budget scenario. We describe the entire end-to-end process: collecting Korean datasets, preprocessing the data, training the model, creating downstream benchmarks, and conducting evaluations. The evaluation results indicate that our method can effectively and cost-efficiently add new language capabilities to existing LLMs. Our new bilingual models, Thunder-LLM and Thunder-LLM-Ins, achieve superior Korean performance compared to state-of-the-art models while utilizing minimal data and computational resources. We share our comprehensive experience and make the code publicly available.

Title: Evaluating Multimodal Large Language Models on Educational Textbook Question Answering

Authors: Hessa A. Alawwad, Anas Zafar, Areej Alhothali, Usman Naseem, Ali Alkhathlan, Amani Jamal
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2506.21596
Pdf URL: https://arxiv.org/pdf/2506.21596
Copy Paste: [[2506.21596]] Evaluating Multimodal Large Language Models on Educational Textbook Question Answering(https://arxiv.org/abs/2506.21596)
Keywords: language model, llm, prompt, retrieval-augmented generation
Abstract: Multimodal large language models (MLLMs) have recently achieved significant success in vision-language tasks. However, their capacity to reason over complex, long lessons and intricate educational diagrams that cannot be represented as a single natural image remains largely untested. In this work, we present the first evaluation of state-of-the-art MLLMs on the textbook question answering (TQA) task using the CK12-QA dataset. We assess the performance of recent vision-language models, including LLaVA and LLaMA 3.2-Vision, across various input configurations. Additionally, we introduce a lightweight multimodal retrieval-augmented generation (RAG) pipeline that integrates both paragraphs and diagrams from the lesson into the prompt. Our results demonstrate the influence of retrieved educational context on model accuracy and reasoning, while also revealing current limitations in handling question-context relationships and the potential for noise, pointing to key directions for future research in multimodal AI-driven learning.

Title: Overview of the ClinIQLink 2025 Shared Task on Medical Question-Answering

Authors: Brandon Colelough, Davis Bartels, Dina Demner-Fushman
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2506.21597
Pdf URL: https://arxiv.org/pdf/2506.21597
Copy Paste: [[2506.21597]] Overview of the ClinIQLink 2025 Shared Task on Medical Question-Answering(https://arxiv.org/abs/2506.21597)
Keywords: language model, llm
Abstract: In this paper, we present an overview of ClinIQLink, a shared task, collocated with the 24th BioNLP workshop at ACL 2025, designed to stress-test large language models (LLMs) on medically-oriented question answering aimed at the level of a General Practitioner. The challenge supplies 4,978 expert-verified, medical source-grounded question-answer pairs that cover seven formats: true/false, multiple choice, unordered list, short answer, short-inverse, multi-hop, and multi-hop-inverse. Participating systems, bundled in Docker or Apptainer images, are executed on the CodaBench platform or the University of Maryland's Zaratan cluster. An automated harness (Task 1) scores closed-ended items by exact match and open-ended items with a three-tier embedding metric. A subsequent physician panel (Task 2) audits the top model responses.

Title: Structured Attention Matters to Multimodal LLMs in Document Understanding

Authors: Chang Liu, Hongkai Chen, Yujun Cai, Hang Wu, Qingwen Ye, Ming-Hsuan Yang, Yiwei Wang
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2506.21600
Pdf URL: https://arxiv.org/pdf/2506.21600
Copy Paste: [[2506.21600]] Structured Attention Matters to Multimodal LLMs in Document Understanding(https://arxiv.org/abs/2506.21600)
Keywords: language model, llm
Abstract: Document understanding remains a significant challenge for multimodal large language models (MLLMs). While previous research has primarily focused on locating evidence pages through precise multimodal queries, our work investigates a fundamental yet overlooked aspect: how input format influences document comprehension performance. Through systematic analysis, we discover that raw OCR text often impairs rather than improves MLLMs' performance, which is a counterintuitive finding we attribute to attention dispersion and structure loss. To further substantiate our hypothesis, we propose a novel structure-preserving approach that encodes document elements using the LaTeX paradigm, maintaining the hierarchical organization and spatial relationships critical for comprehension. Our attention analysis reveals that structured text induces structured attention patterns on both textual and visual content, directing models to focus on semantically meaningful regions while reducing attention waste. This approach significantly enhances MLLMs' document question answering performance across diverse document types without requiring architectural modifications or additional training.

Title: BiMark: Unbiased Multilayer Watermarking for Large Language Models

Authors: Xiaoyan Feng, He Zhang, Yanjun Zhang, Leo Yu Zhang, Shirui Pan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.21602
Pdf URL: https://arxiv.org/pdf/2506.21602
Copy Paste: [[2506.21602]] BiMark: Unbiased Multilayer Watermarking for Large Language Models(https://arxiv.org/abs/2506.21602)
Keywords: language model, llm, prompt
Abstract: Recent advances in Large Language Models (LLMs) have raised urgent concerns about LLM-generated text authenticity, prompting regulatory demands for reliable identification mechanisms. Although watermarking offers a promising solution, existing approaches struggle to simultaneously achieve three critical requirements: text quality preservation, model-agnostic detection, and message embedding capacity, which are crucial for practical implementation. To achieve these goals, the key challenge lies in balancing the trade-off between text quality preservation and message embedding capacity. To address this challenge, we propose BiMark, a novel watermarking framework that achieves these requirements through three key innovations: (1) a bit-flip unbiased reweighting mechanism enabling model-agnostic detection, (2) a multilayer architecture enhancing detectability without compromising generation quality, and (3) an information encoding approach supporting multi-bit watermarking. Through theoretical analysis and extensive experiments, we validate that, compared to state-of-the-art multi-bit watermarking methods, BiMark achieves up to 30% higher extraction rates for short texts while maintaining text quality indicated by lower perplexity, and performs comparably to non-watermarked text on downstream tasks such as summarization and translation.

Title: Operationalizing Automated Essay Scoring: A Human-Aware Approach

Authors: Yenisel Plasencia-Calaña
Subjects: cs.CL, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2506.21603
Pdf URL: https://arxiv.org/pdf/2506.21603
Copy Paste: [[2506.21603]] Operationalizing Automated Essay Scoring: A Human-Aware Approach(https://arxiv.org/abs/2506.21603)
Keywords: language model, llm
Abstract: This paper explores the human-centric operationalization of Automated Essay Scoring (AES) systems, addressing aspects beyond accuracy. We compare various machine learning-based approaches with Large Language Models (LLMs) approaches, identifying their strengths, similarities and differences. The study investigates key dimensions such as bias, robustness, and explainability, considered important for human-aware operationalization of AES systems. Our study shows that ML-based AES models outperform LLMs in accuracy but struggle with explainability, whereas LLMs provide richer explanations. We also found that both approaches struggle with bias and robustness to edge scores. By analyzing these dimensions, the paper aims to identify challenges and trade-offs between different methods, contributing to more reliable and trustworthy AES methods.

Title: MemBench: Towards More Comprehensive Evaluation on the Memory of LLM-based Agents

Authors: Haoran Tan, Zeyu Zhang, Chen Ma, Xu Chen, Quanyu Dai, Zhenhua Dong
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.21605
Pdf URL: https://arxiv.org/pdf/2506.21605
Copy Paste: [[2506.21605]] MemBench: Towards More Comprehensive Evaluation on the Memory of LLM-based Agents(https://arxiv.org/abs/2506.21605)
Keywords: llm, agent
Abstract: Recent works have highlighted the significance of memory mechanisms in LLM-based agents, which enable them to store observed information and adapt to dynamic environments. However, evaluating their memory capabilities remains challenging. Previous evaluations are commonly limited in the diversity of memory levels and interactive scenarios. They also lack comprehensive metrics to reflect the memory capabilities from multiple aspects. To address these problems, in this paper, we construct a more comprehensive dataset and benchmark to evaluate the memory capability of LLM-based agents. Our dataset incorporates factual memory and reflective memory as different levels, and proposes participation and observation as various interactive scenarios. Based on our dataset, we present a benchmark, named MemBench, to evaluate the memory capability of LLM-based agents from multiple aspects, including their effectiveness, efficiency, and capacity. To benefit the research community, we release our dataset and project at this https URL.

Title: Large Language Models as symbolic DNA of cultural dynamics

Authors: Parham Pourdavood, Michael Jacob, Terrence Deacon
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2506.21606
Pdf URL: https://arxiv.org/pdf/2506.21606
Copy Paste: [[2506.21606]] Large Language Models as symbolic DNA of cultural dynamics(https://arxiv.org/abs/2506.21606)
Keywords: language model, llm
Abstract: This paper proposes a novel conceptualization of Large Language Models (LLMs) as externalized informational substrates that function analogously to DNA for human cultural dynamics. Rather than viewing LLMs as either autonomous intelligence or mere programmed mimicry, we argue they serve a broader role as repositories that preserve compressed patterns of human symbolic expression--"fossils" of meaningful dynamics that retain relational residues without their original living contexts. Crucially, these compressed patterns only become meaningful through human reinterpretation, creating a recursive feedback loop where they can be recombined and cycle back to ultimately catalyze human creative processes. Through analysis of four universal features--compression, decompression, externalization, and recursion--we demonstrate that just as DNA emerged as a compressed and externalized medium for preserving useful cellular dynamics without containing explicit reference to goal-directed physical processes, LLMs preserve useful regularities of human culture without containing understanding of embodied human experience. Therefore, we argue that LLMs' significance lies not in rivaling human intelligence, but in providing humanity a tool for self-reflection and playful hypothesis-generation in a low-stakes, simulated environment. This framework positions LLMs as tools for cultural evolvability, enabling humanity to generate novel hypotheses about itself while maintaining the human interpretation necessary to ground these hypotheses in ongoing human aesthetics and norms.

Title: CORE-KG: An LLM-Driven Knowledge Graph Construction Framework for Human Smuggling Networks

Authors: Dipak Meher, Carlotta Domeniconi, Guadalupe Correa-Cabrera
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.21607
Pdf URL: https://arxiv.org/pdf/2506.21607
Copy Paste: [[2506.21607]] CORE-KG: An LLM-Driven Knowledge Graph Construction Framework for Human Smuggling Networks(https://arxiv.org/abs/2506.21607)
Keywords: llm, hallucination, prompt
Abstract: Human smuggling networks are increasingly adaptive and difficult to analyze. Legal case documents offer valuable insights but are unstructured, lexically dense, and filled with ambiguous or shifting references, posing challenges for automated knowledge graph (KG) construction. Existing KG methods often rely on static templates and lack coreference resolution, while recent LLM-based approaches frequently produce noisy, fragmented graphs due to hallucinations, and duplicate nodes caused by a lack of guided extraction. We propose CORE-KG, a modular framework for building interpretable KGs from legal texts. It uses a two-step pipeline: (1) type-aware coreference resolution via sequential, structured LLM prompts, and (2) entity and relationship extraction using domain-guided instructions, built on an adapted GraphRAG framework. CORE-KG reduces node duplication by 33.28%, and legal noise by 38.37% compared to a GraphRAG-based baseline, resulting in cleaner and more coherent graph structures. These improvements make CORE-KG a strong foundation for analyzing complex criminal networks.

Title: SysTemp: A Multi-Agent System for Template-Based Generation of SysML v2

Authors: Yasmine Bouamra, Bruno Yun, Alexandre Poisson, Frédéric Armetta
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.21608
Pdf URL: https://arxiv.org/pdf/2506.21608
Copy Paste: [[2506.21608]] SysTemp: A Multi-Agent System for Template-Based Generation of SysML v2(https://arxiv.org/abs/2506.21608)
Keywords: agent
Abstract: The automatic generation of SysML v2 models represents a major challenge in the engineering of complex systems, particularly due to the scarcity of learning corpora and complex syntax. We present SysTemp, a system aimed at facilitating and improving the creation of SysML v2 models from natural language specifications. It is based on a multi-agent system, including a template generator that structures the generation process. We discuss the advantages and challenges of this system through an evaluation, highlighting its potential to improve the quality of the generations in SysML v2 modeling.

Title: From Thinking to Output: Chain-of-Thought and Text Generation Characteristics in Reasoning Language Models

Authors: Junhao Liu, Zhenhao Xu, Yuxin Fang, Yichuan Chen, Zuobin Ying, Wenhan Chang
Subjects: cs.CL, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2506.21609
Pdf URL: https://arxiv.org/pdf/2506.21609
Copy Paste: [[2506.21609]] From Thinking to Output: Chain-of-Thought and Text Generation Characteristics in Reasoning Language Models(https://arxiv.org/abs/2506.21609)
Keywords: language model, gpt, llm, chain-of-thought
Abstract: Recently, there have been notable advancements in large language models (LLMs), demonstrating their growing abilities in complex reasoning. However, existing research largely overlooks a thorough and systematic comparison of these models' reasoning processes and outputs, particularly regarding their self-reflection pattern (also termed "Aha moment") and the interconnections across diverse domains. This paper proposes a novel framework for analyzing the reasoning characteristics of four cutting-edge large reasoning models (GPT-o1, DeepSeek-R1, Kimi-k1.5, and Grok-3) using keyword statistics and the LLM-as-a-judge paradigm. Our approach connects their internal thinking processes with their final outputs. The dataset comprises diverse real-world scenario-based questions covering logical deduction, causal inference, and multi-step problem-solving. Additionally, a set of metrics is put forward to assess both the coherence of reasoning and the accuracy of the outputs. The research results uncover various patterns of how these models balance exploration and exploitation, deal with problems, and reach conclusions during the reasoning process. Through quantitative and qualitative comparisons, disparities among these models are identified in aspects such as the depth of reasoning, the reliance on intermediate steps, and the degree of similarity between their thinking processes and output patterns and those of GPT-o1. This work offers valuable insights into the trade-off between computational efficiency and reasoning robustness and provides practical recommendations for enhancing model design and evaluation in practical applications. We publicly release our project at: this https URL

Title: Does Multimodality Lead to Better Time Series Forecasting?

Authors: Xiyuan Zhang, Boran Han, Haoyang Fang, Abdul Fatir Ansari, Shuai Zhang, Danielle C. Maddix, Cuixiong Hu, Andrew Gordon Wilson, Michael W. Mahoney, Hao Wang, Yan Liu, Huzefa Rangwala, George Karypis, Bernie Wang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.21611
Pdf URL: https://arxiv.org/pdf/2506.21611
Copy Paste: [[2506.21611]] Does Multimodality Lead to Better Time Series Forecasting?(https://arxiv.org/abs/2506.21611)
Keywords: language model, prompt
Abstract: Recently, there has been growing interest in incorporating textual information into foundation models for time series forecasting. However, it remains unclear whether and under what conditions such multimodal integration consistently yields gains. We systematically investigate these questions across a diverse benchmark of 14 forecasting tasks spanning 7 domains, including health, environment, and economics. We evaluate two popular multimodal forecasting paradigms: aligning-based methods, which align time series and text representations; and prompting-based methods, which directly prompt large language models for forecasting. Although prior works report gains from multimodal input, we find these effects are not universal across datasets and models, and multimodal methods sometimes do not outperform the strongest unimodal baselines. To understand when textual information helps, we disentangle the effects of model architectural properties and data characteristics. Our findings highlight that on the modeling side, incorporating text information is most helpful given (1) high-capacity text models, (2) comparatively weaker time series models, and (3) appropriate aligning strategies. On the data side, performance gains are more likely when (4) sufficient training data is available and (5) the text offers complementary predictive signal beyond what is already captured from the time series alone. Our empirical findings offer practical guidelines for when multimodality can be expected to aid forecasting tasks, and when it does not.

Title: ChildGuard: A Specialized Dataset for Combatting Child-Targeted Hate Speech

Authors: Gautam Siddharth Kashyap, Mohammad Anas Azeez, Rafiq Ali, Zohaib Hasan Siddiqui, Jiechao Gao, Usman Naseem
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2506.21613
Pdf URL: https://arxiv.org/pdf/2506.21613
Copy Paste: [[2506.21613]] ChildGuard: A Specialized Dataset for Combatting Child-Targeted Hate Speech(https://arxiv.org/abs/2506.21613)
Keywords: language model, llm
Abstract: The increasing prevalence of child-targeted hate speech online underscores the urgent need for specialized datasets to address this critical issue. Existing hate speech datasets lack age-specific annotations, fail to capture nuanced contexts, and overlook the unique emotional impact on children. To bridge this gap, we introduce ChildGuard, a curated dataset derived from existing corpora and enriched with child-specific annotations. ChildGuard captures diverse contexts of child-targeted hate speech, spanning age groups. We benchmark existing state-of-the-art hate speech detection methods, including Large Language Models (LLMs), and assess their effectiveness in detecting and contextualizing child-targeted hate speech. To foster further research in this area, we publicly release ChildGuard, providing a robust foundation for developing improved methods to detect and mitigate such harm.

Title: LastingBench: Defend Benchmarks Against Knowledge Leakage

Authors: Yixiong Fang, Tianran Sun, Yuling Shi, Min Wang, Xiaodong Gu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.21614
Pdf URL: https://arxiv.org/pdf/2506.21614
Copy Paste: [[2506.21614]] LastingBench: Defend Benchmarks Against Knowledge Leakage(https://arxiv.org/abs/2506.21614)
Keywords: language model, llm
Abstract: The increasing complexity of large language models (LLMs) raises concerns about their ability to "cheat" on standard Question Answering (QA) benchmarks by memorizing task-specific data. This undermines the validity of benchmark evaluations, as they no longer reflect genuine model capabilities but instead the effects of data leakage. While prior work has focused on detecting such leakage, little attention has been given to mitigating its impact and preserving the long-term utility of benchmarks. In this paper, we introduce LastingBench, a novel framework designed to continuously reinforce and safeguard existing benchmarks against knowledge leakage. LastingBench identifies leakage points in the context through perturbation, then rewrites the leakage points to counterfactual ones, disrupting memorization while preserving the benchmark's original evaluative intent. Evaluations of state-of-the-art QA benchmarks show significant performance gaps, highlighting the efficacy of LastingBench in reducing memorization effects. LastingBench offers a practical and scalable solution to ensure benchmark robustness over time, promoting fairer and more interpretable evaluations of LLMs.

Title: Refine Medical Diagnosis Using Generation Augmented Retrieval and Clinical Practice Guidelines

Authors: Wenhao Li, Hongkuan Zhang, Hongwei Zhang, Zhengxu Li, Zengjie Dong, Yafan Chen, Niranjan Bidargaddi, Hong Liu
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2506.21615
Pdf URL: https://arxiv.org/pdf/2506.21615
Copy Paste: [[2506.21615]] Refine Medical Diagnosis Using Generation Augmented Retrieval and Clinical Practice Guidelines(https://arxiv.org/abs/2506.21615)
Keywords: language model, llm, hallucination, retrieval-augmented generation
Abstract: Current medical language models, adapted from large language models (LLMs), typically predict ICD code-based diagnosis from electronic health records (EHRs) because these labels are readily available. However, ICD codes do not capture the nuanced, context-rich reasoning clinicians use for diagnosis. Clinicians synthesize diverse patient data and reference clinical practice guidelines (CPGs) to make evidence-based decisions. This misalignment limits the clinical utility of existing models. We introduce GARMLE-G, a Generation-Augmented Retrieval framework that grounds medical language model outputs in authoritative CPGs. Unlike conventional Retrieval-Augmented Generation based approaches, GARMLE-G enables hallucination-free outputs by directly retrieving authoritative guideline content without relying on model-generated text. It (1) integrates LLM predictions with EHR data to create semantically rich queries, (2) retrieves relevant CPG knowledge snippets via embedding similarity, and (3) fuses guideline content with model output to generate clinically aligned recommendations. A prototype system for hypertension diagnosis was developed and evaluated on multiple metrics, demonstrating superior retrieval precision, semantic relevance, and clinical guideline adherence compared to RAG-based baselines, while maintaining a lightweight architecture suitable for localized healthcare deployment. This work provides a scalable, low-cost, and hallucination-free method for grounding medical language models in evidence-based clinical practice, with strong potential for broader clinical deployment.
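The retrieval step described in the abstract above (selecting guideline snippets by embedding similarity) can be illustrated with a minimal sketch. This is not the GARMLE-G implementation: the function names, the toy three-dimensional vectors, and the placeholder snippet texts are all assumptions made for illustration, and a real system would obtain the vectors from an embedding model.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_k_snippets(query_vec, snippets, k=2):
    """Return the texts of the k snippets whose embeddings are most similar to the query."""
    ranked = sorted(snippets, key=lambda s: cosine(query_vec, s[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# Hypothetical (text, embedding) pairs standing in for guideline snippets.
guideline_snippets = [
    ("snippet A", [1.0, 0.0, 0.0]),
    ("snippet B", [0.9, 0.1, 0.0]),
    ("snippet C", [0.0, 0.0, 1.0]),
]

# Hypothetical query embedding built from model predictions plus record data.
print(top_k_snippets([1.0, 0.0, 0.0], guideline_snippets, k=2))
```

Because the retrieved texts are returned verbatim rather than regenerated, a design of this kind keeps the guideline content itself free of model-written text, which is the property the abstract emphasizes.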

Title: TIM: A Large-Scale Dataset and large Timeline Intelligence Model for Open-domain Timeline Summarization

Authors: Chuanrui Hu, Wei Hu, Penghang Yu, Hua Zhang, Bing-Kun Bao
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2506.21616
Pdf URL: https://arxiv.org/pdf/2506.21616
Copy Paste: [[2506.21616]] TIM: A Large-Scale Dataset and large Timeline Intelligence Model for Open-domain Timeline Summarization(https://arxiv.org/abs/2506.21616)
Keywords: language model, llm
Abstract: Open-domain Timeline Summarization (TLS) is crucial for monitoring the evolution of news topics. To identify changes in news topics, existing methods typically employ general Large Language Models (LLMs) to summarize relevant timestamps from retrieved news. While general LLMs demonstrate capabilities in zero-shot news summarization and timestamp localization, they struggle with assessing topic relevance and understanding topic evolution. Consequently, the summarized information often includes irrelevant details or inaccurate timestamps. To address these issues, we propose the first large Timeline Intelligence Model (TIM) for open-domain TLS, which is capable of effectively summarizing open-domain timelines. Specifically, we begin by presenting a large-scale TLS dataset, comprising over 1,000 news topics and more than 3,000 annotated TLS instances. Furthermore, we propose a progressive optimization strategy, which gradually enhances summarization performance. It employs instruction tuning to enhance summarization and topic-irrelevant information filtering capabilities. Following this, it exploits a novel dual-alignment reward learning method that incorporates both semantic and temporal perspectives, thereby improving the understanding of topic evolution principles. Through this progressive optimization strategy, TIM demonstrates a robust ability to summarize open-domain timelines. Extensive experiments in the open domain demonstrate the effectiveness of TIM.

Title: TrajTok: Technical Report for 2025 Waymo Open Sim Agents Challenge

Authors: Zhiyuan Zhang, Xiaosong Jia, Guanyu Chen, Qifeng Li, Junchi Yan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.21618
Pdf URL: https://arxiv.org/pdf/2506.21618
Copy Paste: [[2506.21618]] TrajTok: Technical Report for 2025 Waymo Open Sim Agents Challenge(https://arxiv.org/abs/2506.21618)
Keywords: agent
Abstract: In this technical report, we introduce TrajTok, a trajectory tokenizer for discrete next-token-prediction based behavior generation models, which combines data-driven and rule-based methods with better coverage, symmetry and robustness, along with a spatial-aware label smoothing method for cross-entropy loss. We adopt the tokenizer and loss for the SMART model and achieve superior performance, with a realism score of 0.7852 on the Waymo Open Sim Agents Challenge 2025. We will open-source the code in the future.

Title: IndexTTS2: A Breakthrough in Emotionally Expressive and Duration-Controlled Auto-Regressive Zero-Shot Text-to-Speech

Authors: Siyi Zhou, Yiquan Zhou, Yi He, Xun Zhou, Jinchao Wang, Wei Deng, Jingchen Shu
Subjects: cs.CL, cs.AI, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2506.21619
Pdf URL: https://arxiv.org/pdf/2506.21619
Copy Paste: [[2506.21619]] IndexTTS2: A Breakthrough in Emotionally Expressive and Duration-Controlled Auto-Regressive Zero-Shot Text-to-Speech(https://arxiv.org/abs/2506.21619)
Keywords: gpt, prompt
Abstract: Large-scale text-to-speech (TTS) models are typically categorized into autoregressive and non-autoregressive systems. Although autoregressive systems exhibit certain advantages in speech naturalness, their token-by-token generation mechanism makes it difficult to precisely control the duration of synthesized speech. This is a key limitation in applications such as video dubbing that require strict audio-visual synchronization. This paper introduces IndexTTS2, which proposes a novel and autoregressive-model-friendly method for speech duration control. The method supports two generation modes: one allows explicit specification of the number of generated tokens for precise duration control; the other does not require manual input and lets the model freely generate speech while preserving prosodic characteristics from the input prompt. Furthermore, IndexTTS2 achieves disentanglement between emotional expression and speaker identity, enabling independent control of timbre and emotion. In the zero-shot setting, the model can perfectly reproduce the emotional characteristics of the input prompt. Users may also provide a separate emotion prompt, even from a different speaker, allowing the model to reconstruct the target timbre while conveying the desired emotion. To enhance clarity during strong emotional expressions, we incorporate GPT latent representations to improve speech stability. Meanwhile, to lower the barrier for emotion control, we design a soft instruction mechanism based on textual descriptions by fine-tuning Qwen3. This enables effective guidance of speech generation with desired emotional tendencies using natural language input. Experimental results demonstrate that IndexTTS2 outperforms existing state-of-the-art zero-shot TTS models in word error rate, speaker similarity, and emotional fidelity.

Title: How Large Language Models play humans in online conversations: a simulated study of the 2016 US politics on Reddit

Authors: Daniele Cirulli, Giulio Cimini, Giovanni Palermo
Subjects: cs.CL, cs.AI, cs.CY, cs.SI, physics.soc-ph
Abstract URL: https://arxiv.org/abs/2506.21620
Pdf URL: https://arxiv.org/pdf/2506.21620
Copy Paste: [[2506.21620]] How Large Language Models play humans in online conversations: a simulated study of the 2016 US politics on Reddit(https://arxiv.org/abs/2506.21620)
Keywords: language model, gpt, llm
Abstract: Large Language Models (LLMs) have recently emerged as powerful tools for natural language generation, with applications spanning from content creation to social simulations. Their ability to mimic human interactions raises both opportunities and concerns, particularly in the context of politically relevant online discussions. In this study, we evaluate the performance of LLMs in replicating user-generated content within a real-world, divisive scenario: Reddit conversations during the 2016 US Presidential election. In particular, we conduct three different experiments, asking GPT-4 to generate comments by impersonating either real or artificial partisan users. We analyze the generated comments in terms of political alignment, sentiment, and linguistic features, comparing them against real user contributions and benchmarking against a null model. We find that GPT-4 is able to produce realistic comments, both in favor of or against the candidate supported by the community, yet tending to create consensus more easily than dissent. In addition we show that real and artificial comments are well separated in a semantically embedded space, although they are indistinguishable by manual inspection. Our findings provide insights on the potential use of LLMs to sneak into online discussions, influence political debate and shape political narratives, bearing broader implications of AI-driven discourse manipulation.

Title: The Open Proof Corpus: A Large-Scale Study of LLM-Generated Mathematical Proofs

Authors: Jasper Dekoninck, Ivo Petrov, Kristian Minchev, Mislav Balunovic, Martin Vechev, Miroslav Marinov, Maria Drencheva, Lyuba Konova, Milen Shumanov, Kaloyan Tsvetkov, Nikolay Drenchev, Lazar Todorov, Kalina Nikolova, Nikolay Georgiev, Vanesa Kalinkova, Margulan Ismoldayev
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.21621
Pdf URL: https://arxiv.org/pdf/2506.21621
Copy Paste: [[2506.21621]] The Open Proof Corpus: A Large-Scale Study of LLM-Generated Mathematical Proofs(https://arxiv.org/abs/2506.21621)
Keywords: language model, llm
Abstract: In recent months, large language models (LLMs) have made significant progress in mathematical proof generation, but further advancement is hindered by the lack of a large-scale, high-quality dataset of human-evaluated proofs. While expensive to create, such a dataset is essential for driving improvements in training and enabling a rigorous analysis of proof generation capabilities. In this work, we present the Open Proof Corpus (OPC), a dataset comprising over 5,000 human-evaluated proofs produced by state-of-the-art LLMs. The OPC was specifically designed for broad applicability and downstream usage in proof generation research and is the first to include a substantial number of correct, LLM-generated solutions to problems from prestigious mathematics competitions such as the USAMO and IMO. Using the OPC, we explore critical questions in automated proof generation: (1) the performance gap between natural language and formal proof generation, (2) the discrepancy between final-answer accuracy and full-proof validity, and (3) the impact of best-of-n selection on proof quality. Finally, to showcase the utility of the OPC, we finetune an 8B-parameter model on the dataset, obtaining a model that performs on par with the best model, Gemini-2.5-Pro, on the task of evaluating proof correctness.

Title: Doc2SAR: A Synergistic Framework for High-Fidelity Extraction of Structure-Activity Relationships from Scientific Documents

Authors: Jiaxi Zhuang, Kangning Li, Jue Hou, Mingjun Xu, Zhifeng Gao, Hengxing Cai
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2506.21625
Pdf URL: https://arxiv.org/pdf/2506.21625
Copy Paste: [[2506.21625]] Doc2SAR: A Synergistic Framework for High-Fidelity Extraction of Structure-Activity Relationships from Scientific Documents(https://arxiv.org/abs/2506.21625)
Keywords: language model, gpt, llm
Abstract: Extracting molecular structure-activity relationships (SARs) from scientific literature and patents is essential for drug discovery and materials research. However, this task remains challenging due to heterogeneous document formats and limitations of existing methods. Specifically, rule-based approaches relying on rigid templates fail to generalize across diverse document layouts, while general-purpose multimodal large language models (MLLMs) lack sufficient accuracy and reliability for specialized tasks, such as layout detection and optical chemical structure recognition (OCSR). To address these challenges, we introduce DocSAR-200, a rigorously annotated benchmark of 200 scientific documents designed specifically for evaluating SAR extraction methods. Additionally, we propose Doc2SAR, a novel synergistic framework that integrates domain-specific tools with MLLMs enhanced via supervised fine-tuning (SFT). Extensive experiments demonstrate that Doc2SAR achieves state-of-the-art performance across various document types, significantly outperforming leading end-to-end baselines. Specifically, Doc2SAR attains an overall Table Recall of 80.78% on DocSAR-200, exceeding end-to-end GPT-4o by 51.48%. Furthermore, Doc2SAR demonstrates practical usability through efficient inference and is accompanied by a web app.

Title: Do We Really Need GNNs with Explicit Structural Modeling? MLPs Suffice for Language Model Representations

Authors: Li Zhou, Hao Jiang, Junjie Li, Zefeng Zhao, Feng Jiang, Wenyu Chen, Haizhou Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.21682
Pdf URL: https://arxiv.org/pdf/2506.21682
Copy Paste: [[2506.21682]] Do We Really Need GNNs with Explicit Structural Modeling? MLPs Suffice for Language Model Representations(https://arxiv.org/abs/2506.21682)
Keywords: language model
Abstract: Explicit structural information has been proven to be encoded by Graph Neural Networks (GNNs), serving as auxiliary knowledge to enhance model capabilities and improve performance in downstream NLP tasks. However, recent studies indicate that GNNs fail to fully utilize structural information, whereas Multi-Layer Perceptrons (MLPs), despite lacking the message-passing mechanisms inherent to GNNs, exhibit a surprising ability in structure-aware tasks. Motivated by these findings, this paper introduces a comprehensive probing framework from an information-theoretic perspective. The framework is designed to systematically assess the role of explicit structural modeling in enhancing language model (LM) representations and to investigate the potential of MLPs as efficient and scalable alternatives to GNNs. We extend traditional probing classifiers by incorporating a control module that allows for selective use of either the full GNN model or its decoupled components, specifically, the message-passing and feature-transformation operations. This modular approach isolates and assesses the individual contributions of these operations, avoiding confounding effects from the complete GNN architecture. Using the Edge Probing Suite, a diagnostic tool for evaluating the linguistic knowledge encoded in LMs, we find that MLPs, when used as feature-transformation modules, consistently improve the linguistic knowledge captured in LM representations across different architectures. They effectively encode both syntactic and semantic patterns. Similarly, GNNs that incorporate feature-transformation operations show beneficial effects. In contrast, models that rely solely on message-passing operations tend to underperform, often leading to negative impacts on probing task performance.

Title: (Fact) Check Your Bias

Authors: Eivind Morris Bakke, Nora Winger Heggelund
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.21745
Pdf URL: https://arxiv.org/pdf/2506.21745
Copy Paste: [[2506.21745]] (Fact) Check Your Bias(https://arxiv.org/abs/2506.21745)
Keywords: language model, llm, prompt
Abstract: Automatic fact verification systems increasingly rely on large language models (LLMs). We investigate how parametric knowledge biases in these models affect fact-checking outcomes of the HerO system (baseline for FEVER-25). We examine how the system is affected by: (1) potential bias in Llama 3.1's parametric knowledge and (2) intentionally injected bias. When prompted directly to perform fact-verification, Llama 3.1 labels nearly half the claims as "Not Enough Evidence". Using only its parametric knowledge it is able to reach a verdict on the remaining half of the claims. In the second experiment, we prompt the model to generate supporting, refuting, or neutral fact-checking documents. These prompts significantly influence retrieval outcomes, with approximately 50% of retrieved evidence being unique to each perspective. Notably, the model sometimes refuses to generate supporting documents for claims it believes to be false, creating an inherent negative bias. Despite differences in retrieved evidence, final verdict predictions show stability across prompting strategies. The code is available at: this https URL

Title: Evaluating List Construction and Temporal Understanding capabilities of Large Language Models

Authors: Alexandru Dumitru, V Venktesh, Adam Jatowt, Avishek Anand
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.21783
Pdf URL: https://arxiv.org/pdf/2506.21783
Copy Paste: [[2506.21783]] Evaluating List Construction and Temporal Understanding capabilities of Large Language Models(https://arxiv.org/abs/2506.21783)
Keywords: language model, llm, hallucination
Abstract: Large Language Models (LLMs) have demonstrated immense advances in a wide range of natural language tasks. However, these models are susceptible to hallucinations and errors, particularly on temporal understanding tasks involving multiple entities in answers. In such tasks, they fail to associate entities with accurate time intervals, generate a complete list of entities in answers, or reason about events associated with specific temporal bounds. Existing works do not extensively evaluate the abilities of the model to perform implicit and explicit temporal understanding in a list answer construction setup. To bridge this gap, we propose the Time-referenced List-based Question Answering (TLQA) benchmark, which requires structured answers in list format aligned with corresponding time periods. Our TLQA benchmark requires both list construction and temporal understanding simultaneously, which to the best of our knowledge has not been explored in prior benchmarks. We investigate the temporal understanding and list construction capabilities of state-of-the-art generative models on TLQA in closed-book and open-domain settings. Our findings reveal significant shortcomings in current models, particularly their inability to provide complete answers and temporally align facts in a closed-book setup and the need to improve retrieval in open-domain setup, providing clear future directions for research on TLQA. The benchmark and code are available at this https URL.

Title: Offensive Language Detection on Social Media Using XLNet

Authors: Reem Alothman, Hafida Benhidour, Said Kerrache
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.21795
Pdf URL: https://arxiv.org/pdf/2506.21795
Copy Paste: [[2506.21795]] Offensive Language Detection on Social Media Using XLNet(https://arxiv.org/abs/2506.21795)
Keywords: chat
Abstract: The widespread use of text-based communication on social media, through chats, comments, and microblogs, has improved user interaction but has also led to an increase in offensive content, including hate speech, racism, and other forms of abuse. Due to the enormous volume of user-generated content, manual moderation is impractical, which creates a need for automated systems that can detect offensive language. Deep learning models, particularly those using transfer learning, have demonstrated significant success in understanding natural language through large-scale pretraining. In this study, we propose an automatic offensive language detection model based on XLNet, a generalized autoregressive pretraining method, and compare its performance with BERT (Bidirectional Encoder Representations from Transformers), which is a widely used baseline in natural language processing (NLP). Both models are evaluated using the Offensive Language Identification Dataset (OLID), a benchmark Twitter dataset that includes hierarchical annotations. Our experimental results show that XLNet outperforms BERT in detecting offensive content and in categorizing the types of offenses, while BERT performs slightly better in identifying the targets of the offenses. Additionally, we find that oversampling and undersampling strategies are effective in addressing class imbalance and improving classification performance. These findings highlight the potential of transfer learning and XLNet-based architectures to create robust systems for detecting offensive language on social media platforms.
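The abstract above reports that oversampling and undersampling help with class imbalance but does not say which variants were used. As a minimal sketch, assuming plain random oversampling (duplicating minority-class examples until every class matches the largest one), the idea looks like this; the function name, the label strings, and the toy data are hypothetical.

```python
import random
from collections import Counter

def random_oversample(examples, labels, seed=0):
    """Duplicate randomly chosen minority-class examples until all classes are equal in size."""
    rng = random.Random(seed)
    by_label = {}
    for example, label in zip(examples, labels):
        by_label.setdefault(label, []).append(example)
    target = max(len(items) for items in by_label.values())
    out_examples, out_labels = [], []
    for label, items in by_label.items():
        # Sample with replacement to fill the gap up to the majority-class size.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for example in items + extra:
            out_examples.append(example)
            out_labels.append(label)
    return out_examples, out_labels

# Toy imbalanced data: four majority-class texts, one minority-class text.
texts = ["t1", "t2", "t3", "t4", "t5"]
tags = ["majority", "majority", "majority", "majority", "minority"]
balanced_texts, balanced_tags = random_oversample(texts, tags)
print(Counter(balanced_tags))
```

Undersampling is the mirror image: instead of duplicating minority examples, it discards majority examples until the classes match, trading training data for balance.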

Title: Towards Transparent AI: A Survey on Explainable Large Language Models

Authors: Avash Palikhe, Zhenyu Yu, Zichong Wang, Wenbin Zhang
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2506.21812
Pdf URL: https://arxiv.org/pdf/2506.21812
Copy Paste: [[2506.21812]] Towards Transparent AI: A Survey on Explainable Large Language Models(https://arxiv.org/abs/2506.21812)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have played a pivotal role in advancing Artificial Intelligence (AI). However, despite their achievements, LLMs often struggle to explain their decision-making processes, making them a 'black box' and presenting a substantial challenge to explainability. This lack of transparency poses a significant obstacle to the adoption of LLMs in high-stakes domain applications, where interpretability is particularly essential. To overcome these limitations, researchers have developed various explainable artificial intelligence (XAI) methods that provide human-interpretable explanations for LLMs. However, a systematic understanding of these methods remains limited. To address this gap, this survey provides a comprehensive review of explainability techniques by categorizing XAI methods based on the underlying transformer architectures of LLMs: encoder-only, decoder-only, and encoder-decoder models. Then these techniques are examined in terms of their evaluation for assessing explainability, and the survey further explores how these explanations are leveraged in practical applications. Finally, it discusses available resources, ongoing research challenges, and future directions, aiming to guide continued efforts toward developing transparent and responsible LLMs.

Title: Exploring the Structure of AI-Induced Language Change in Scientific English

Authors: Riley Galpin, Bryce Anderson, Tom S. Juzek
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.21817
Pdf URL: https://arxiv.org/pdf/2506.21817
Copy Paste: [[2506.21817]] Exploring the Structure of AI-Induced Language Change in Scientific English(https://arxiv.org/abs/2506.21817)
Keywords: language model, gpt, prompt, chat
Abstract: Scientific English has undergone rapid and unprecedented changes in recent years, with words such as "delve," "intricate," and "crucial" showing significant spikes in frequency since around 2022. These changes are widely attributed to the growing influence of Large Language Models like ChatGPT in the discourse surrounding bias and misalignment. However, apart from changes in frequency, the exact structure of these linguistic shifts has remained unclear. The present study addresses this and investigates whether these changes involve the replacement of synonyms by suddenly 'spiking words,' for example, "crucial" replacing "essential" and "key," or whether they reflect broader semantic and pragmatic qualifications. To further investigate structural changes, we include part-of-speech tagging in our analysis to quantify linguistic shifts over grammatical categories and differentiate between word forms, like "potential" as a noun vs. as an adjective. We systematically analyze synonym groups for widely discussed 'spiking words' based on frequency trends in scientific abstracts from PubMed. We find that entire semantic clusters often shift together, with most or all words in a group increasing in usage. This pattern suggests that changes induced by Large Language Models are primarily semantic and pragmatic rather than purely lexical. Notably, the adjective "important" shows a significant decline, which prompted us to systematically analyze decreasing lexical items. Our analysis of "collapsing" words reveals a more complex picture, which is consistent with organic language change and contrasts with the patterns of the abrupt spikes. These insights into the structure of language change contribute to our understanding of how language technology continues to shape human language.

Title: The Consistency Hypothesis in Uncertainty Quantification for Large Language Models

Authors: Quan Xiao, Debarun Bhattacharjya, Balaji Ganesan, Radu Marinescu, Katsiaryna Mirylenka, Nhan H Pham, Michael Glass, Junkyu Lee
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.21849
Pdf URL: https://arxiv.org/pdf/2506.21849
Copy Paste: [[2506.21849]] The Consistency Hypothesis in Uncertainty Quantification for Large Language Models(https://arxiv.org/abs/2506.21849)
Keywords: language model, llm
Abstract: Estimating the confidence of large language model (LLM) outputs is essential for real-world applications requiring high user trust. Black-box uncertainty quantification (UQ) methods, relying solely on model API access, have gained popularity due to their practical benefits. In this paper, we examine the implicit assumption behind several UQ methods, which use generation consistency as a proxy for confidence, an idea we formalize as the consistency hypothesis. We introduce three mathematical statements with corresponding statistical tests to capture variations of this hypothesis and metrics to evaluate LLM output conformity across tasks. Our empirical investigation, spanning 8 benchmark datasets and 3 tasks (question answering, text summarization, and text-to-SQL), highlights the prevalence of the hypothesis under different settings. Among the statements, we highlight the 'Sim-Any' hypothesis as the most actionable, and demonstrate how it can be leveraged by proposing data-free black-box UQ methods that aggregate similarities between generations for confidence estimation. These approaches can outperform the closest baselines, showcasing the practical value of the empirically observed consistency hypothesis.
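As a rough illustration of the idea in the abstract above (aggregating similarities between sampled generations into a confidence score), here is a minimal sketch. It is not the paper's method: the token-level Jaccard similarity, the mean-of-pairs aggregation, and the function names `jaccard` and `consistency_confidence` are all assumptions chosen for illustration.

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity (an assumed, simple similarity measure)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0  # two empty strings are treated as identical
    return len(ta & tb) / len(ta | tb)


def consistency_confidence(generations: list[str]) -> float:
    """Mean pairwise similarity over sampled generations, used as a confidence proxy."""
    if len(generations) < 2:
        raise ValueError("need at least two generations to measure consistency")
    pairs = list(combinations(generations, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```

Under this sketch, identical samples yield a score of 1.0 and samples with no shared tokens yield 0.0; any real use would likely swap in a stronger similarity function.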

Title: Derivational Probing: Unveiling the Layer-wise Derivation of Syntactic Structures in Neural Language Models

Authors: Taiga Someya, Ryo Yoshida, Hitomi Yanaka, Yohei Oseki
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.21861
Pdf URL: https://arxiv.org/pdf/2506.21861
Copy Paste: [[2506.21861]] Derivational Probing: Unveiling the Layer-wise Derivation of Syntactic Structures in Neural Language Models(https://arxiv.org/abs/2506.21861)
Keywords: language model
Abstract: Recent work has demonstrated that neural language models encode syntactic structures in their internal representations, yet the derivations by which these structures are constructed across layers remain poorly understood. In this paper, we propose Derivational Probing to investigate how micro-syntactic structures (e.g., subject noun phrases) and macro-syntactic structures (e.g., the relationship between the root verbs and their direct dependents) are constructed as word embeddings propagate upward across layers. Our experiments on BERT reveal a clear bottom-up derivation: micro-syntactic structures emerge in lower layers and are gradually integrated into a coherent macro-syntactic structure in higher layers. Furthermore, a targeted evaluation on subject-verb number agreement shows that the timing of constructing macro-syntactic structures is critical for downstream performance, suggesting an optimal timing for integrating global syntactic information.

Title: DeepTalk: Towards Seamless and Smart Speech Interaction with Adaptive Modality-Specific MoE

Authors: Hang Shao, Heting Gao, Yunhang Shen, Jiawei Chen, Lijiang Li, Zuwei Long, Bo Tong, Ke Li, Xing Sun
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.21864
Pdf URL: https://arxiv.org/pdf/2506.21864
Copy Paste: [[2506.21864]] DeepTalk: Towards Seamless and Smart Speech Interaction with Adaptive Modality-Specific MoE(https://arxiv.org/abs/2506.21864)
Keywords: language model, llm
Abstract: Native multimodal large language models (MLLMs) restructure a single large language model (LLM) into a spoken language model (SLM) capable of both speech and text generation. Compared to modular and aligned MLLMs, native MLLMs preserve richer paralinguistic features such as emotion and prosody, and generate speech responses directly within the backbone LLM rather than using a separate speech decoder. This integration also results in lower response latency and smoother interaction. However, native MLLMs suffer from catastrophic forgetting and performance degradation because the available paired speech-text data is insufficient to support the pretraining of MLLMs compared to the vast amount of text data required to pretrain text LLMs. To address this issue, we propose DeepTalk, a framework for adaptive modality expert learning based on a Mixture of Experts (MoE) architecture. DeepTalk first adaptively distinguishes modality experts according to their modality load within the LLM. Each modality expert then undergoes specialized single-modality training, followed by joint multimodal collaborative training. As a result, DeepTalk incurs only a 5.5% performance drop compared to the original LLM, which is significantly lower than the average performance drop of over 20% typically seen in native MLLMs (such as GLM-4-Voice), and is on par with modular MLLMs. Meanwhile, the end-to-end dialogue latency remains within 0.5 seconds, ensuring a seamless and intelligent speech interaction experience. Code and models are released at this https URL.

Title: WildSpeech-Bench: Benchmarking Audio LLMs in Natural Speech Conversation

Authors: Jian Zhang, Linhao Zhang, Bokai Lei, Chuhan Wu, Wei Jia, Xiao Zhou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.21875
Pdf URL: https://arxiv.org/pdf/2506.21875
Copy Paste: [[2506.21875]] WildSpeech-Bench: Benchmarking Audio LLMs in Natural Speech Conversation(https://arxiv.org/abs/2506.21875)
Keywords: language model, gpt, llm, prompt, chat
Abstract: Recent multi-modal Large Language Models (LLMs) such as GPT-4o have demonstrated strong capabilities of direct speech interaction. However, the lack of specialized and comprehensive benchmarks for end-to-end speech LLM evaluation hinders optimizing the user experience of Audio LLMs in real-world applications. Existing evaluation methods often adapt text-based benchmarks, overlooking speech's unique characteristics and challenges, including prosody, homophones, stuttering, and differing user expectations. Here, we present a novel approach to thoroughly evaluate LLMs in practical speech conversations. We systematically curate real-world chat data relevant to spoken scenarios, introduce diversity in speaker attributes and acoustic conditions, and augment the dataset with speech-specific phenomena. We further design a query-aware evaluation method to use customized evaluation checklists and prompts to enhance the accuracy of automatic evaluation. We conduct comprehensive testing and detailed analysis of various mainstream speech models, revealing significant differences in model performance across different speech scenarios. The use of query-aware evaluation further enables a finer-grained assessment under various speech-specific scenarios. Our benchmark can provide valuable insights for speech model development and evaluation.

Title: Do Vision-Language Models Have Internal World Models? Towards an Atomic Evaluation

Authors: Qiyue Gao, Xinyu Pi, Kevin Liu, Junrong Chen, Ruolan Yang, Xinqi Huang, Xinyu Fang, Lu Sun, Gautham Kishore, Bo Ai, Stone Tao, Mengyang Liu, Jiaxi Yang, Chao-Jung Lai, Chuanyang Jin, Jiannan Xiang, Benhao Huang, Zeming Chen, David Danks, Hao Su, Tianmin Shu, Ziqiao Ma, Lianhui Qin, Zhiting Hu
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2506.21876
Pdf URL: https://arxiv.org/pdf/2506.21876
Copy Paste: [[2506.21876]] Do Vision-Language Models Have Internal World Models? Towards an Atomic Evaluation(https://arxiv.org/abs/2506.21876)
Keywords: language model, gpt, agent
Abstract: Internal world models (WMs) enable agents to understand the world's state and predict transitions, serving as the basis for advanced deliberative reasoning. Recent large Vision-Language Models (VLMs), such as OpenAI o3, GPT-4o and Gemini, exhibit potential as general-purpose WMs. While the latest studies have evaluated and shown limitations in specific capabilities such as visual understanding, a systematic evaluation of VLMs' fundamental WM abilities remains absent. Drawing on comparative psychology and cognitive science, we propose a two-stage framework that assesses Perception (visual, spatial, temporal, quantitative, and motion) and Prediction (mechanistic simulation, transitive inference, compositional inference) to provide an atomic evaluation of VLMs as WMs. Guided by this framework, we introduce WM-ABench, a large-scale benchmark comprising 23 fine-grained evaluation dimensions across 6 diverse simulated environments with controlled counterfactual simulations. Through 660 experiments on 15 latest commercial and open-source VLMs, we find that these models exhibit striking limitations in basic world modeling abilities. For instance, almost all models perform at near-random accuracy when distinguishing motion trajectories. Additionally, they lack disentangled understanding; for example, some models tend to believe blue objects move faster than green ones. Richer results and analyses reveal significant gaps between VLMs and human-level world modeling.

Title: A Dual-Layered Evaluation of Geopolitical and Cultural Bias in LLMs

Authors: Sean Kim, Hyuhng Joon Kim
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.21881
Pdf URL: https://arxiv.org/pdf/2506.21881
Copy Paste: [[2506.21881]] A Dual-Layered Evaluation of Geopolitical and Cultural Bias in LLMs(https://arxiv.org/abs/2506.21881)
Keywords: language model, llm
Abstract: As large language models (LLMs) are increasingly deployed across diverse linguistic and cultural contexts, understanding their behavior in both factual and disputable scenarios is essential, especially when their outputs may shape public opinion or reinforce dominant narratives. In this paper, we define two types of bias in LLMs: model bias (bias stemming from model training) and inference bias (bias induced by the language of the query), through a two-phase evaluation. Phase 1 evaluates LLMs on factual questions where a single verifiable answer exists, assessing whether models maintain consistency across different query languages. Phase 2 expands the scope by probing geopolitically sensitive disputes, where responses may reflect culturally embedded or ideologically aligned perspectives. We construct a manually curated dataset spanning both factual and disputable QA, across four languages and question types. The results show that Phase 1 exhibits query language induced alignment, while Phase 2 reflects an interplay between the model's training context and query language. This paper offers a structured framework for evaluating LLM behavior across neutral and sensitive topics, providing insights for future LLM deployment and culturally aware evaluation practices in multilingual contexts.

Title: AutoMixer: Checkpoint Artifacts as Automatic Data Mixers

Authors: Ernie Chang, Yang Li, Patrick Huber, David Kant, Yangyang Shi, Vikas Chandra
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.21910
Pdf URL: https://arxiv.org/pdf/2506.21910
Copy Paste: [[2506.21910]] AutoMixer: Checkpoint Artifacts as Automatic Data Mixers(https://arxiv.org/abs/2506.21910)
Keywords: language model
Abstract: In language model training, it is desirable to equip models with capabilities from various tasks. However, it is not clear how to directly obtain the right data mixtures for these capabilities, as the relationship between data and tasks is difficult to model. In this work, we observe that checkpoint models exhibit emerging capabilities at different points in the training trajectory. Often, the training process saves checkpoints as artifacts that are under-utilized as a source of in-training data signals. We identify these artifact models based on their respective capabilities on the benchmarks and leverage them as data mixers by using their aggregated first-order influence approximation over source data. We demonstrate on eight reasoning benchmarks that the proposed framework shows significant improvements in the pretraining setting, with performance improvements of up to 1.93%. Overall, this shows the potential of checkpoint models to enhance data quality and optimize data mixtures.

Title: PapersPlease: A Benchmark for Evaluating Motivational Values of Large Language Models Based on ERG Theory

Authors: Junho Myung, Yeon Su Park, Sunwoo Kim, Shin Yoo, Alice Oh
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.21961
Pdf URL: https://arxiv.org/pdf/2506.21961
Copy Paste: [[2506.21961]] PapersPlease: A Benchmark for Evaluating Motivational Values of Large Language Models Based on ERG Theory(https://arxiv.org/abs/2506.21961)
Keywords: language model, llm
Abstract: Evaluating the performance and biases of large language models (LLMs) through role-playing scenarios is becoming increasingly common, as LLMs often exhibit biased behaviors in these contexts. Building on this line of research, we introduce PapersPlease, a benchmark consisting of 3,700 moral dilemmas designed to investigate LLMs' decision-making in prioritizing various levels of human needs. In our setup, LLMs act as immigration inspectors deciding whether to approve or deny entry based on the short narratives of people. These narratives are constructed using the Existence, Relatedness, and Growth (ERG) theory, which categorizes human needs into three hierarchical levels. Our analysis of six LLMs reveals statistically significant patterns in decision-making, suggesting that LLMs encode implicit preferences. Additionally, our evaluation of the impact of incorporating social identities into the narratives shows varying responsiveness based on both motivational needs and identity cues, with some models exhibiting higher denial rates for marginalized identities. All data is publicly available at this https URL.

Title: More Vulnerable than You Think: On the Stability of Tool-Integrated LLM Agents

Authors: Weimin Xiong, Ke Wang, Yifan Song, Hanchao Liu, Sai Zhou, Wei Peng, Sujian Li
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.21967
Pdf URL: https://arxiv.org/pdf/2506.21967
Copy Paste: [[2506.21967]] More Vulnerable than You Think: On the Stability of Tool-Integrated LLM Agents(https://arxiv.org/abs/2506.21967)
Keywords: llm, agent
Abstract: Current evaluations of tool-integrated LLM agents typically focus on end-to-end tool-usage evaluation while neglecting their stability. This limits their real-world applicability, as various internal or external factors can cause agents to crash or behave abnormally. Our research addresses this by investigating whether agents are vulnerable to errors throughout the entire tool invocation process, including reading tool documentation, selecting tools and generating parameters, and processing the tool's response. Through extensive experiments, we observe that agents are highly susceptible to errors at each stage and agents based on open-source models are more vulnerable than those based on proprietary models. We also find that increasing the model size does not significantly improve tool invocation reasoning and may make agents more vulnerable to attacks resembling normal user instructions. This highlights the importance of evaluating agent stability and offers valuable insights for future LLM development and evaluation.

Title: Advancing Jailbreak Strategies: A Hybrid Approach to Exploiting LLM Vulnerabilities and Bypassing Modern Defenses

Authors: Mohamed Ahmed, Mohamed Abdelmouty, Mingyu Kim, Gunvanth Kandula, Alex Park, James C. Davis
Subjects: cs.CL, cs.AI, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2506.21972
Pdf URL: https://arxiv.org/pdf/2506.21972
Copy Paste: [[2506.21972]] Advancing Jailbreak Strategies: A Hybrid Approach to Exploiting LLM Vulnerabilities and Bypassing Modern Defenses(https://arxiv.org/abs/2506.21972)
Keywords: language model, gpt, llm, prompt
Abstract: The advancement of Pre-Trained Language Models (PTLMs) and Large Language Models (LLMs) has led to their widespread adoption across diverse applications. Despite their success, these models remain vulnerable to attacks that exploit their inherent weaknesses to bypass safety measures. Two primary inference-phase threats are token-level and prompt-level jailbreaks. Token-level attacks embed adversarial sequences that transfer well to black-box models like GPT but leave detectable patterns and rely on gradient-based token optimization, whereas prompt-level attacks use semantically structured inputs to elicit harmful responses yet depend on iterative feedback that can be unreliable. To address the complementary limitations of these methods, we propose two hybrid approaches that integrate token- and prompt-level techniques to enhance jailbreak effectiveness across diverse PTLMs. GCG + PAIR and the newly explored GCG + WordGame hybrids were evaluated across multiple Vicuna and Llama models. GCG + PAIR consistently raised attack-success rates over its constituent techniques on undefended models; for instance, on Llama-3, its Attack Success Rate (ASR) reached 91.6%, a substantial increase from PAIR's 58.4% baseline. Meanwhile, GCG + WordGame matched the raw performance of WordGame maintaining a high ASR of over 80% even under stricter evaluators like Mistral-Sorry-Bench. Crucially, both hybrids retained transferability and reliably pierced advanced defenses such as Gradient Cuff and JBShield, which fully blocked single-mode attacks. These findings expose previously unreported vulnerabilities in current safety stacks, highlight trade-offs between raw success and defensive robustness, and underscore the need for holistic safeguards against adaptive adversaries.

Title: Don't Trust Generative Agents to Mimic Communication on Social Networks Unless You Benchmarked their Empirical Realism

Authors: Simon Münker, Nils Schwager, Achim Rettinger
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.21974
Pdf URL: https://arxiv.org/pdf/2506.21974
Copy Paste: [[2506.21974]] Don't Trust Generative Agents to Mimic Communication on Social Networks Unless You Benchmarked their Empirical Realism(https://arxiv.org/abs/2506.21974)
Keywords: language model, llm, agent
Abstract: The ability of Large Language Models (LLMs) to mimic human behavior triggered a plethora of computational social science research, assuming that empirical studies of humans can be conducted with AI agents instead. Since there have been conflicting research findings on whether and when this hypothesis holds, there is a need to better understand the differences in their experimental designs. We focus on replicating the behavior of social network users with the use of LLMs for the analysis of communication on social networks. First, we provide a formal framework for the simulation of social networks, before focusing on the sub-task of imitating user communication. We empirically test different approaches to imitate user behavior on X in English and German. Our findings suggest that social simulations should be validated by their empirical realism measured in the setting in which the simulation components were fitted. With this paper, we argue for more rigor when applying generative-agent-based modeling for social simulation.

Title: Analyzing and Fine-Tuning Whisper Models for Multilingual Pilot Speech Transcription in the Cockpit

Authors: Kartheek Kumar Reddy Nareddy, Sarah Ternus, Julia Niebling
Subjects: cs.CL, cs.AI, cs.LG, eess.AS
Abstract URL: https://arxiv.org/abs/2506.21990
Pdf URL: https://arxiv.org/pdf/2506.21990
Copy Paste: [[2506.21990]] Analyzing and Fine-Tuning Whisper Models for Multilingual Pilot Speech Transcription in the Cockpit(https://arxiv.org/abs/2506.21990)
Keywords: chat
Abstract: The developments in transformer encoder-decoder architectures have led to significant breakthroughs in machine translation, Automatic Speech Recognition (ASR), and instruction-based chat machines, among other applications. The pre-trained models were trained on vast amounts of generic data over a few epochs (fewer than five in most cases), resulting in their strong generalization capabilities. Nevertheless, the performance of these models does suffer when applied to niche domains like transcribing pilot speech in the cockpit, which involves a lot of specific vocabulary and multilingual conversations. This paper investigates and improves the transcription accuracy of cockpit conversations with Whisper models. We have collected around 85 minutes of cockpit simulator recordings and 130 minutes of interview recordings with pilots and manually labeled them. The speakers are middle-aged men speaking both German and English. To improve the accuracy of transcriptions, we propose multiple normalization schemes to refine the transcripts and improve Word Error Rate (WER). We then employ fine-tuning to enhance ASR performance, utilizing performance-efficient fine-tuning with Low-Rank Adaptation (LoRA). Hereby, WER decreased from 68.49% (pretrained Whisper Large model without normalization baseline) to 26.26% (fine-tuned Whisper Large model with the proposed normalization scheme).
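Because the abstract above reports WER before and after transcript normalization, a small sketch of how normalization feeds into a WER computation may help. This is an assumption-laden illustration, not the paper's normalization scheme: the `normalize` rule (lowercasing and stripping punctuation) and both function names are hypothetical.

```python
import re


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation, split on whitespace (one assumed normalization)."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return text.split()


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance divided by reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Rolling-row Levenshtein distance over words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,         # deletion
                            curr[j - 1] + 1,     # insertion
                            prev[j - 1] + cost)) # substitution or match
        prev = curr
    return prev[-1] / len(ref)
```

With this sketch, differences in casing or punctuation alone no longer count as errors, which is the general reason normalization can lower a reported WER.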

Title: Can Peter Pan Survive MT? A Stylometric Study of LLMs, NMTs, and HTs in Children's Literature Translation

Authors: Delu Kong, Lieve Macken
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.22038
Pdf URL: https://arxiv.org/pdf/2506.22038
Copy Paste: [[2506.22038]] Can Peter Pan Survive MT? A Stylometric Study of LLMs, NMTs, and HTs in Children's Literature Translation(https://arxiv.org/abs/2506.22038)
Keywords: language model, llm
Abstract: This study focuses on evaluating the performance of machine translations (MTs) compared to human translations (HTs) in English-to-Chinese children's literature translation (CLT) from a stylometric perspective. The research constructs a Peter Pan corpus, comprising 21 translations: 7 human translations (HTs), 7 large language model translations (LLMs), and 7 neural machine translation outputs (NMTs). The analysis employs a generic feature set (including lexical, syntactic, readability, and n-gram features) and a creative text translation (CTT-specific) feature set, which captures repetition, rhythm, translatability, and miscellaneous levels, yielding 447 linguistic features in total. Using classification and clustering techniques in machine learning, we conduct a stylometric analysis of these translations. Results reveal that in generic features, HTs and MTs exhibit significant differences in conjunction word distributions and the ratio of 1-word-gram-YiYang, while NMTs and LLMs show significant variation in descriptive words usage and adverb ratios. Regarding CTT-specific features, LLMs outperform NMTs in distribution, aligning more closely with HTs in stylistic characteristics, demonstrating the potential of LLMs in CLT.

Title: Decoding Machine Translationese in English-Chinese News: LLMs vs. NMTs

Authors: Delu Kong, Lieve Macken
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.22050
Pdf URL: https://arxiv.org/pdf/2506.22050
Copy Paste: [[2506.22050]] Decoding Machine Translationese in English-Chinese News: LLMs vs. NMTs(https://arxiv.org/abs/2506.22050)
Keywords: language model, llm
Abstract: This study explores Machine Translationese (MTese) -- the linguistic peculiarities of machine translation outputs -- focusing on the under-researched English-to-Chinese language pair in news texts. We construct a large dataset consisting of 4 sub-corpora and employ a comprehensive five-layer feature set. Then, a chi-square ranking algorithm is applied for feature selection in both classification and clustering tasks. Our findings confirm the presence of MTese in both Neural Machine Translation systems (NMTs) and Large Language Models (LLMs). Original Chinese texts are nearly perfectly distinguishable from both LLM and NMT outputs. Notable linguistic patterns in MT outputs are shorter sentence lengths and increased use of adversative conjunctions. Comparing LLMs and NMTs, we achieve approximately 70% classification accuracy, with LLMs exhibiting greater lexical diversity and NMTs using more brackets. Additionally, translation-specific LLMs show lower lexical diversity but higher usage of causal conjunctions compared to generic LLMs. Lastly, we find no significant differences between LLMs developed by Chinese firms and their foreign counterparts.
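The chi-square feature ranking mentioned in the abstract above can be sketched for the simplest case. The code assumes binary features (present or absent in a text) and binary labels (for example, machine-translated versus original); the paper's actual feature set and implementation are not described here, so the function names and the binarization are assumptions.

```python
def chi_square_2x2(feature: list[int], labels: list[int]) -> float:
    """Chi-square statistic of a binary feature against binary class labels."""
    if len(feature) != len(labels):
        raise ValueError("feature and labels must have the same length")
    pairs = list(zip(feature, labels))
    a = sum(1 for f, y in pairs if f == 1 and y == 1)
    b = sum(1 for f, y in pairs if f == 1 and y == 0)
    c = sum(1 for f, y in pairs if f == 0 and y == 1)
    d = sum(1 for f, y in pairs if f == 0 and y == 0)
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # a constant feature or a single class carries no signal
    return n * (a * d - b * c) ** 2 / denom


def rank_features(features: dict[str, list[int]],
                  labels: list[int]) -> list[tuple[str, float]]:
    """Sort features by chi-square score, highest first."""
    scores = {name: chi_square_2x2(col, labels) for name, col in features.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Features that co-vary strongly with the label rise to the top of the ranking, and a cutoff on the ranked list then gives the selected subset.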

Title: Lost at the Beginning of Reasoning

Authors: Baohao Liao, Xinyi Chen, Sara Rajaee, Yuhui Xu, Christian Herold, Anders Søgaard, Maarten de Rijke, Christof Monz
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.22058
Pdf URL: https://arxiv.org/pdf/2506.22058
Copy Paste: [[2506.22058]] Lost at the Beginning of Reasoning(https://arxiv.org/abs/2506.22058)
Keywords: language model, llm, chain-of-thought
Abstract: Recent advancements in large language models (LLMs) have significantly advanced complex reasoning capabilities, particularly through extended chain-of-thought (CoT) reasoning that incorporates mechanisms such as backtracking, self-reflection and self-correction. Despite these developments, the self-correction abilities of LLMs during long CoT reasoning remain underexplored. And recent findings on overthinking suggest that such models often engage in unnecessarily redundant reasoning. In this work, we empirically show that the first reasoning step exerts a disproportionately large influence on the final prediction - errors introduced at this stage can substantially degrade subsequent reasoning quality. This phenomenon is consistently observed across two state-of-the-art open-source reasoning model families: DeepSeek-R1 and Qwen3. To address this, we propose an efficient sampling strategy that leverages a reward model to identify and retain high-quality first reasoning steps while discarding suboptimal ones, achieving up to a 70% reduction in inference cost without sacrificing accuracy. Finally, we introduce a new benchmark specifically constructed with deliberately flawed first reasoning steps to systematically evaluate model self-correction capabilities, offering a foundation for future research on robust reasoning in LLMs.
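The sampling strategy described in the abstract above (score candidate first reasoning steps with a reward model, keep the best, discard the rest) can be outlined as follows. This is only a sketch under stated assumptions: `reward_fn` stands in for a reward model, and the function name and the `keep` parameter are hypothetical rather than taken from the paper.

```python
from typing import Callable


def select_first_steps(candidates: list[str],
                       reward_fn: Callable[[str], float],
                       keep: int) -> list[str]:
    """Keep the `keep` highest-scoring first reasoning steps; discard the rest."""
    if keep < 1:
        raise ValueError("keep must be at least 1")
    ranked = sorted(candidates, key=reward_fn, reverse=True)
    return ranked[:keep]
```

Only the retained steps would then be continued into full reasoning chains, so the number of long continuations shrinks from `len(candidates)` to `keep`.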

Title: Identifying a Circuit for Verb Conjugation in GPT-2

Authors: David Demitri Africa
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.22105
Pdf URL: https://arxiv.org/pdf/2506.22105
Copy Paste: [[2506.22105]] Identifying a Circuit for Verb Conjugation in GPT-2(https://arxiv.org/abs/2506.22105)
Keywords: gpt, prompt
Abstract: I implement a procedure to isolate and interpret the sub-network (or "circuit") responsible for subject-verb agreement in GPT-2 Small. In this study, the model is given prompts where the subject is either singular (e.g. "Alice") or plural (e.g. "Alice and Bob"), and the task is to correctly predict the appropriate verb form ("walks" for singular subjects, "walk" for plural subjects). Using a series of techniques, including performance verification, automatic circuit discovery via direct path patching, and direct logit attribution, I isolate a candidate circuit that contributes significantly to the model's correct verb conjugation. The results suggest that only a small fraction of the network's component-token pairs is needed to achieve near-model performance on the base task but substantially more for more complex settings.

Title: SAGE: Spliced-Audio Generated Data for Enhancing Foundational Models in Low-Resource Arabic-English Code-Switched Speech Recognition

Authors: Muhammad Umar Farooq, Oscar Saz
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2506.22143
Pdf URL: https://arxiv.org/pdf/2506.22143
Copy Paste: [[2506.22143]] SAGE: Spliced-Audio Generated Data for Enhancing Foundational Models in Low-Resource Arabic-English Code-Switched Speech Recognition(https://arxiv.org/abs/2506.22143)
Keywords: language model
Abstract: This paper investigates the performance of various speech SSL models on dialectal Arabic (DA) and Arabic-English code-switched (CS) speech. To address data scarcity, a modified audio-splicing approach is introduced to generate artificial CS speech data. Fine-tuning an already fine-tuned SSL model with the proposed Spliced-Audio Generated (SAGE) data results in an absolute improvement on Word Error Rate (WER) of 7.8% on Arabic and English CS benchmarks. Additionally, an Experience Replay (ER) inspired approach is proposed to enhance generalisation across DA and CS speech while mitigating catastrophic forgetting. Integrating an out-of-domain 3-gram language model reduces the overall mean WER from 31.7% to 26.6%. Few-shot fine-tuning for code-switching benchmarks further improves WER by 4.9%. A WER of 31.1% on Arabic-English CS benchmarks surpasses large-scale multilingual models, including USM and Whisper-large-v2 (both over ten times larger) by an absolute margin of 5.5% and 8.4%, respectively.
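For readers unfamiliar with the metric, Word Error Rate is conventionally the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The sketch below shows that standard definition and the difference between an absolute and a relative WER change, using the 31.7% and 26.6% figures from the abstract. The function names and the example sentences are illustrative assumptions, not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def wer_change(before: float, after: float):
    """Return (absolute, relative) WER reduction for two WER percentages."""
    absolute = before - after
    return absolute, absolute / before


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
example_wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")

# Mean WER figures quoted in the abstract for the 3-gram language model.
absolute, relative = wer_change(31.7, 26.6)
print(f"WER={example_wer:.3f}, absolute={absolute:.1f} points, relative={relative:.1%}")
```

The distinction matters when reading the abstract: the quoted margins (7.8%, 5.5%, 8.4%) are stated as absolute differences in percentage points, not relative reductions.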

Title: Training Language Model to Critique for Better Refinement

Authors: Tianshu Yu, Chao Xiang, Mingchuan Yang, Pei Ke, Bosi Wen, Cunxiang Wang, Jiale Cheng, Li Zhang, Xinyu Mu, Chuxiong Sun, Minlie Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.22157
Pdf URL: https://arxiv.org/pdf/2506.22157
Copy Paste: [[2506.22157]] Training Language Model to Critique for Better Refinement(https://arxiv.org/abs/2506.22157)
Keywords: language model, llm
Abstract: Large language models (LLMs) have demonstrated remarkable evaluation and critique capabilities, providing insightful feedback and identifying flaws in various tasks. However, limited research has explored which types of critiques are most effective for improving model responses or how to generate such critiques. To address this gap, we introduce Refinement-oriented Critique Optimization (RCO), a novel framework designed to train critic models using refinement signals. RCO uses a feedback loop where critiques, generated by the critic model, guide the actor model in refining its responses. The critique utility (CU) quantifies the effectiveness of these refinements, serving as the reward signal for training the critic model. By focusing on critiques that lead to better refinements, RCO eliminates the need for direct critique preference assessment, ensuring that critiques driving meaningful improvements are rewarded. We evaluate RCO across five tasks, i.e., dialog generation, summarization, question answering, mathematical reasoning, and code generation, and show that it significantly outperforms traditional methods and open-source models in terms of critique quality and refinement outcomes. Our contributions include the introduction of RCO, a novel supervision scheme based on refined response preferences, and comprehensive experimental results that highlight the method's effectiveness in enhancing LLM critique-refinement loops.

Title: Leveraging In-Context Learning for Political Bias Testing of LLMs

Authors: Patrick Haller, Jannis Vamvas, Rico Sennrich, Lena A. Jäger
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.22232
Pdf URL: https://arxiv.org/pdf/2506.22232
Copy Paste: [[2506.22232]] Leveraging In-Context Learning for Political Bias Testing of LLMs(https://arxiv.org/abs/2506.22232)
Keywords: llm
Abstract: A growing body of work has been querying LLMs with political questions to evaluate their potential biases. However, this probing method has limited stability, making comparisons between models unreliable. In this paper, we argue that LLMs need more context. We propose a new probing task, Questionnaire Modeling (QM), that uses human survey data as in-context examples. We show that QM improves the stability of question-based bias evaluation, and demonstrate that it may be used to compare instruction-tuned models to their base versions. Experiments with LLMs of various sizes indicate that instruction tuning can indeed change the direction of bias. Furthermore, we observe a trend that larger models are able to leverage in-context examples more effectively, and generally exhibit smaller bias scores in QM. Data and code are publicly available.

Title: Detection of Personal Data in Structured Datasets Using a Large Language Model

Authors: Albert Agisha Ntwali, Luca Rück, Martin Heckmann
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.22305
Pdf URL: https://arxiv.org/pdf/2506.22305
Copy Paste: [[2506.22305]] Detection of Personal Data in Structured Datasets Using a Large Language Model(https://arxiv.org/abs/2506.22305)
Keywords: language model, gpt
Abstract: We propose a novel approach for detecting personal data in structured datasets, leveraging GPT-4o, a state-of-the-art Large Language Model. A key innovation of our method is the incorporation of contextual information: in addition to a feature's name and values, we utilize information from other feature names within the dataset as well as the dataset description. We compare our approach to alternative methods, including Microsoft Presidio and CASSED, evaluating them on multiple datasets: DeSSI, a large synthetic dataset, datasets we collected from Kaggle and OpenML as well as MIMIC-Demo-Ext, a real-world dataset containing patient information from critical care units. Our findings reveal that detection performance varies significantly depending on the dataset used for evaluation. CASSED excels on DeSSI, the dataset on which it was trained. Performance on the medical dataset MIMIC-Demo-Ext is comparable across all models, with our GPT-4o-based approach clearly outperforming the others. Notably, personal data detection in the Kaggle and OpenML datasets appears to benefit from contextual information. This is evidenced by the poor performance of CASSED and Presidio (both of which do not utilize the context of the dataset) compared to the strong results of our GPT-4o-based approach. We conclude that further progress in this field would greatly benefit from the availability of more real-world datasets containing personal information.

Title: Evaluating Scoring Bias in LLM-as-a-Judge

Authors: Qingquan Li, Shaoyu Dou, Kailai Shao, Chao Chen, Haixiang Hu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.22316
Pdf URL: https://arxiv.org/pdf/2506.22316
Copy Paste: [[2506.22316]] Evaluating Scoring Bias in LLM-as-a-Judge(https://arxiv.org/abs/2506.22316)
Keywords: language model, llm, prompt
Abstract: The remarkable performance of Large Language Models (LLMs) gives rise to "LLM-as-a-Judge", where LLMs are employed as evaluators for complex tasks. Moreover, it has been widely adopted across fields such as Natural Language Processing (NLP), preference learning, and various specific domains. However, there are various biases within LLM-as-a-Judge, which adversely affect the fairness and reliability of judgments. Current research on evaluating or mitigating bias in LLM-as-a-Judge predominantly focuses on comparison-based evaluations, while systematic investigations into bias in scoring-based evaluations remain limited. Therefore, we define scoring bias in LLM-as-a-Judge as the change in scores that occurs when judge models are subjected to bias-related perturbations, and provide a well-designed framework to comprehensively evaluate scoring bias. We augment existing LLM-as-a-Judge benchmarks through data synthesis to construct our evaluation dataset and design multi-faceted evaluation metrics. Our experimental results demonstrate that the scoring stability of existing judge models is disrupted by scoring biases. Further exploratory experiments and discussions provide valuable insights into the design of scoring prompt templates and the mitigation of scoring biases on aspects such as score rubrics, score IDs, and reference answer selection.

Title: Why Are Parsing Actions for Understanding Message Hierarchies Not Random?

Authors: Daichi Kato, Ryo Ueda, Yusuke Miyao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.22366
Pdf URL: https://arxiv.org/pdf/2506.22366
Copy Paste: [[2506.22366]] Why Are Parsing Actions for Understanding Message Hierarchies Not Random?(https://arxiv.org/abs/2506.22366)
Keywords: agent
Abstract: If humans understood language by randomly selecting parsing actions, it might have been necessary to construct a robust symbolic system capable of being interpreted under any hierarchical structure. However, human parsing strategies do not seem to follow such a random pattern. Why is that the case? In fact, a previous study on emergent communication using models with hierarchical biases has reported that agents adopting random parsing strategies (ones that deviate significantly from human language comprehension) can achieve high communication accuracy. In this study, we investigate this issue by making two simple and natural modifications to the experimental setup: (I) we use more complex inputs that have hierarchical structures, such that random parsing makes semantic interpretation more difficult, and (II) we incorporate a surprisal-related term, which is known to influence the order of words and characters in natural language, into the objective function. With these changes, we evaluate whether agents employing random parsing strategies still maintain high communication accuracy.

Title: QuickSilver -- Speeding up LLM Inference through Dynamic Token Halting, KV Skipping, Contextual Token Fusion, and Adaptive Matryoshka Quantization

Authors: Danush Khanna, Aditya Kumar Guru, Srivarshinee Sridhar, Zidan Ahmed, Rubhav Bahirwani, Meetu Malhotra, Vinija Jain, Aman Chadha, Amitava Das, Kripabandhu Ghosh
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.22396
Pdf URL: https://arxiv.org/pdf/2506.22396
Copy Paste: [[2506.22396]] QuickSilver -- Speeding up LLM Inference through Dynamic Token Halting, KV Skipping, Contextual Token Fusion, and Adaptive Matryoshka Quantization(https://arxiv.org/abs/2506.22396)
Keywords: language model, gpt, llm
Abstract: Inference accounts for the majority of latency and energy consumption in large language model (LLM) deployments, often exceeding 90% of total cost. While training-time efficiency has seen extensive progress, runtime optimization remains a key bottleneck, particularly under autoregressive decoding. Existing approaches -- such as pruning, quantization, early exits, and speculative decoding -- often require retraining, architectural changes, or disrupt decoding compatibility. We introduce QuickSilver, a modular, token-level framework that enables semantic adaptivity at inference time without altering model weights or structure. QuickSilver integrates four synergistic mechanisms: (i) Dynamic Token Halting, which halts computation for tokens with converged representations; (ii) KV Cache Skipping, which selectively suppresses memory writes to reduce attention overhead; and (iii) Contextual Token Fusion, which collapses redundant tokens into shared paths to shrink sequence length. Unlike speculative decoding or MoE routing, QuickSilver operates entirely on frozen, dense models and requires no auxiliary networks. Applied to GPT-2 and Llama-2 across WikiText-103 and C4, QuickSilver achieves up to 39.6% FLOP reduction with negligible perplexity degradation (<=0.2).
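The abstract describes Dynamic Token Halting only as halting computation for tokens with converged representations. One way to picture this is a per-token check on how much a layer changed the hidden vector, as in the minimal sketch below. The convergence test (relative L2 change below a threshold), the threshold value, and the per-token toy layer are all assumptions made for illustration; they are not QuickSilver's actual criterion, and real Transformer layers mix information across tokens through attention.

```python
import math


def relative_change(prev, curr):
    """L2 norm of the update divided by the L2 norm of the previous vector."""
    delta = math.sqrt(sum((c - p) ** 2 for p, c in zip(prev, curr)))
    scale = math.sqrt(sum(p * p for p in prev)) + 1e-8
    return delta / scale


def run_with_halting(hidden, layers, threshold=1e-2):
    """Apply layers token by token, skipping tokens that have been halted."""
    hidden = [list(vec) for vec in hidden]
    active = [True] * len(hidden)
    for layer in layers:
        for i, vec in enumerate(hidden):
            if not active[i]:
                continue  # halted tokens keep their last representation
            updated = layer(vec)
            if relative_change(vec, updated) < threshold:
                active[i] = False  # representation converged: halt this token
            hidden[i] = updated
    return hidden, active


def toy_layer(vec):
    """Hypothetical layer: doubles vectors whose first feature is positive."""
    return [2 * x for x in vec] if vec[0] > 0 else list(vec)


final_hidden, still_active = run_with_halting(
    [[1.0, 1.0], [-1.0, -1.0]], [toy_layer, toy_layer]
)
print(final_hidden, still_active)
```

In this toy run the second token is unchanged by the first layer, so it is halted and the second layer is never applied to it, which is where the compute saving would come from.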

Title: Refining Czech GEC: Insights from a Multi-Experiment Approach

Authors: Petr Pechman, Milan Straka, Jana Straková, Jakub Náplava
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.22402
Pdf URL: https://arxiv.org/pdf/2506.22402
Copy Paste: [[2506.22402]] Refining Czech GEC: Insights from a Multi-Experiment Approach(https://arxiv.org/abs/2506.22402)
Keywords: language model, llm
Abstract: We present a grammar error correction (GEC) system that achieves state of the art for the Czech language. Our system is based on a neural network translation approach with the Transformer architecture, and its key feature is its real-time synthetic generation pipeline, which dynamically augments sentences with artificial errors by introducing both language-agnostic and Czech-specific errors. We conduct a comprehensive series of experiments, investigating the Czech GEC corpora as bases for synthetic error introduction, several error generation strategies, domain balancing, tokenization granularity, model size, and data scaling during fine-tuning. Additionally, we evaluate the performance of large language models (LLMs) on Czech GEC in both end-user and expert fine-tuning scenarios. Our best-performing model is superior both in performance and computational efficiency. The source code and the trained model links are available on this https URL.

Title: HyperCLOVA X THINK Technical Report

Authors: NAVER Cloud HyperCLOVA X Team
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.22403
Pdf URL: https://arxiv.org/pdf/2506.22403
Copy Paste: [[2506.22403]] HyperCLOVA X THINK Technical Report(https://arxiv.org/abs/2506.22403)
Keywords: language model, gpt
Abstract: We introduce HyperCLOVA X THINK, the first reasoning-focused large language model in the HyperCLOVA X family, pre-trained on roughly 6 trillion high-quality Korean and English tokens, augmented with targeted synthetic Korean data. It was implemented as a compute-memory-balanced Peri-LN Transformer scaled with μP, pre-trained through a three-stage curriculum that expands the context window to 128K tokens, and post-trained via supervised fine-tuning with Reinforcement Learning from Verifiable Rewards to support both detailed rationale and concise-answer modes. It delivers competitive performance against similarly sized models on Korea-focused benchmarks such as KMMLU, CSAT, KoBALT-700, HAERAE-1.0, and KoBigBench, while preserving robust bilingual consistency and translation quality. In addition, a vision-augmented variant matches or exceeds GPT-4.1 on the KCSAT STEM benchmark, all of which are achieved with substantially lower training compute than existing models of similar sizes. We also present a pruning and distillation technique that will soon be applied to HyperCLOVA X THINK for an open-source and business-friendly foundation model. Altogether, these capabilities position HyperCLOVA X THINK as a robust foundation for Korean AI innovation and a valuable resource for the global research community.

Title: Sequential Diagnosis with Language Models

Authors: Harsha Nori, Mayank Daswani, Christopher Kelly, Scott Lundberg, Marco Tulio Ribeiro, Marc Wilson, Xiaoxuan Liu, Viknesh Sounderajah, Jonathan Carlson, Matthew P Lungren, Bay Gross, Peter Hames, Mustafa Suleyman, Dominic King, Eric Horvitz
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.22405
Pdf URL: https://arxiv.org/pdf/2506.22405
Copy Paste: [[2506.22405]] Sequential Diagnosis with Language Models(https://arxiv.org/abs/2506.22405)
Keywords: language model
Abstract: Artificial intelligence holds great promise for expanding access to expert medical knowledge and reasoning. However, most evaluations of language models rely on static vignettes and multiple-choice questions that fail to reflect the complexity and nuance of evidence-based medicine in real-world settings. In clinical practice, physicians iteratively formulate and revise diagnostic hypotheses, adapting each subsequent question and test to what they've just learned, and weigh the evolving evidence before committing to a final diagnosis. To emulate this iterative process, we introduce the Sequential Diagnosis Benchmark, which transforms 304 diagnostically challenging New England Journal of Medicine clinicopathological conference (NEJM-CPC) cases into stepwise diagnostic encounters. A physician or AI begins with a short case abstract and must iteratively request additional details from a gatekeeper model that reveals findings only when explicitly queried. Performance is assessed not just by diagnostic accuracy but also by the cost of physician visits and tests performed. We also present the MAI Diagnostic Orchestrator (MAI-DxO), a model-agnostic orchestrator that simulates a panel of physicians, proposes likely differential diagnoses and strategically selects high-value, cost-effective tests. When paired with OpenAI's o3 model, MAI-DxO achieves 80% diagnostic accuracy, four times higher than the 20% average of generalist physicians. MAI-DxO also reduces diagnostic costs by 20% compared to physicians, and 70% compared to off-the-shelf o3. When configured for maximum accuracy, MAI-DxO achieves 85.5% accuracy. These performance gains with MAI-DxO generalize across models from the OpenAI, Gemini, Claude, Grok, DeepSeek, and Llama families. We highlight how AI systems, when guided to think iteratively and act judiciously, can advance diagnostic precision and cost-effectiveness in clinical care.