2025-08-12

Title: Semi-automated Fact-checking in Portuguese: Corpora Enrichment using Retrieval with Claim extraction

Authors: Juliana Resplande Sant'anna Gomes, Arlindo Rodrigues Galvão Filho
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2508.06495
Pdf URL: https://arxiv.org/pdf/2508.06495
Copy Paste: [[2508.06495]] Semi-automated Fact-checking in Portuguese: Corpora Enrichment using Retrieval with Claim extraction(https://arxiv.org/abs/2508.06495)
Keywords: language model, llm
Abstract: The accelerated dissemination of disinformation often outpaces the capacity for manual fact-checking, highlighting the urgent need for Semi-Automated Fact-Checking (SAFC) systems. Within the Portuguese language context, there is a noted scarcity of publicly available datasets that integrate external evidence, an essential component for developing robust AFC systems, as many existing resources focus solely on classification based on intrinsic text features. This dissertation addresses this gap by developing, applying, and analyzing a methodology to enrich Portuguese news corpora (this http URL, this http URL, MuMiN-PT) with external evidence. The approach simulates a user's verification process, employing Large Language Models (LLMs, specifically Gemini 1.5 Flash) to extract the main claim from texts and search engine APIs (Google Search API, Google FactCheck Claims Search API) to retrieve relevant external documents (evidence). Additionally, a data validation and preprocessing framework, including near-duplicate detection, is introduced to enhance the quality of the base corpora.
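
The abstract mentions near-duplicate detection as part of the preprocessing framework but does not say which technique is used. As a minimal sketch, assuming word-level shingles compared by Jaccard similarity (the function names, the shingle size `k`, and the `threshold` value are all illustrative choices, not taken from the dissertation), the idea could look like this:

```python
def shingles(text: str, k: int = 3) -> set[str]:
    """Return the set of k-word shingles of a lowercased, whitespace-split text."""
    words = text.lower().split()
    if len(words) < k:
        # Short texts are represented by a single shingle (or none if empty).
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicate_pairs(docs: list[str], threshold: float = 0.5) -> list[tuple[int, int]]:
    """Return index pairs (i, j), i < j, whose shingle sets meet the threshold.

    The pairwise loop is quadratic in the number of documents, which is
    acceptable for a sketch but not for a large corpus.
    """
    sets = [shingles(d) for d in docs]
    return [
        (i, j)
        for i in range(len(docs))
        for j in range(i + 1, len(docs))
        if jaccard(sets[i], sets[j]) >= threshold
    ]
```

For corpora of realistic size, an approximate method such as MinHash with locality-sensitive hashing would usually replace the exhaustive pairwise comparison.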

Title: Retrieval augmented generation based dynamic prompting for few-shot biomedical named entity recognition using large language models

Authors: Yao Ge, Sudeshna Das, Yuting Guo, Abeed Sarker
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.06504
Pdf URL: https://arxiv.org/pdf/2508.06504
Copy Paste: [[2508.06504]] Retrieval augmented generation based dynamic prompting for few-shot biomedical named entity recognition using large language models(https://arxiv.org/abs/2508.06504)
Keywords: language model, gpt, llm, prompt, retrieval augmented generation, retrieval-augmented generation
Abstract: Biomedical named entity recognition (NER) is a high-utility natural language processing (NLP) task, and large language models (LLMs) show promise particularly in few-shot settings (i.e., limited training data). In this article, we address the performance challenges of LLMs for few-shot biomedical NER by investigating a dynamic prompting strategy involving retrieval-augmented generation (RAG). In our approach, the annotated in-context learning examples are selected based on their similarities with the input texts, and the prompt is dynamically updated for each instance during inference. We implemented and optimized static and dynamic prompt engineering techniques and evaluated them on five biomedical NER datasets. Static prompting with structured components increased average F1-scores by 12% for GPT-4, and 11% for GPT-3.5 and LLaMA 3-70B, relative to basic static prompting. Dynamic prompting further improved performance, with TF-IDF and SBERT retrieval methods yielding the best results, improving average F1-scores by 7.3% and 5.6% in 5-shot and 10-shot settings, respectively. These findings highlight the utility of contextually adaptive prompts via RAG for biomedical NER.
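
The dynamic prompting idea described above (pick the annotated examples most similar to each input and rebuild the prompt per instance) can be illustrated with a small sketch. It assumes TF-IDF vectors with a smoothed idf and cosine similarity, implemented in plain Python; the prompt wording, the `(text, annotation)` pool format, and all function names are hypothetical and not the authors' implementation.

```python
import math
from collections import Counter


def tfidf_vectors(texts: list[str]) -> list[dict[str, float]]:
    """Build sparse TF-IDF vectors (term -> weight) with a smoothed idf."""
    tokenized = [t.lower().split() for t in texts]
    n = len(tokenized)
    df = Counter(term for toks in tokenized for term in set(toks))
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [{t: cnt * idf[t] for t, cnt in Counter(toks).items()} for toks in tokenized]


def cosine(u: dict[str, float], v: dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def select_examples(query: str, pool: list[tuple[str, str]], k: int) -> list[tuple[str, str]]:
    """Return the k annotated examples whose text is most similar to the query."""
    vectors = tfidf_vectors([text for text, _ in pool] + [query])
    query_vec = vectors[-1]
    ranked = sorted(range(len(pool)), key=lambda i: cosine(vectors[i], query_vec), reverse=True)
    return [pool[i] for i in ranked[:k]]


def build_prompt(query: str, pool: list[tuple[str, str]], k: int = 2) -> str:
    """Assemble a few-shot prompt whose examples are chosen per input."""
    parts = ["Extract the biomedical entities from the sentence."]
    for text, annotation in select_examples(query, pool, k):
        parts.append(f"Sentence: {text}\nEntities: {annotation}")
    parts.append(f"Sentence: {query}\nEntities:")
    return "\n\n".join(parts)
```

A static prompt would reuse the same examples for every input; here the example list changes with each `query`, which is the only difference between the two settings in this sketch.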

Title: CarbonScaling: Extending Neural Scaling Laws for Carbon Footprint in Large Language Models

Authors: Lei Jiang, Fan Chen
Subjects: cs.CL, cs.AI, cs.CY, cs.DC, cs.LG
Abstract URL: https://arxiv.org/abs/2508.06524
Pdf URL: https://arxiv.org/pdf/2508.06524
Copy Paste: [[2508.06524]] CarbonScaling: Extending Neural Scaling Laws for Carbon Footprint in Large Language Models(https://arxiv.org/abs/2508.06524)
Keywords: language model, llm
Abstract: Neural scaling laws have driven the development of increasingly large language models (LLMs) by linking accuracy improvements to growth in parameter count, dataset size, and compute. However, these laws overlook the carbon emissions that scale exponentially with LLM size. This paper presents CarbonScaling, an analytical framework that extends neural scaling laws to incorporate both operational and embodied carbon in LLM training. By integrating models for neural scaling, GPU hardware evolution, parallelism optimization, and carbon estimation, CarbonScaling quantitatively connects model accuracy to carbon footprint. Results show that while a power-law relationship between accuracy and carbon holds, real-world inefficiencies significantly increase the scaling factor. Hardware technology scaling reduces carbon emissions for small to mid-sized models, but offers diminishing returns for extremely large LLMs due to communication overhead and underutilized GPUs. Training optimizations, especially aggressive critical batch size scaling, help alleviate this inefficiency. CarbonScaling offers key insights for training more sustainable and carbon-efficient LLMs.

Title: The Art of Breaking Words: Rethinking Multilingual Tokenizer Design

Authors: Aamod Thakur, Ajay Nagpal, Atharva Savarkar, Kundeshwar Pundalik, Siddhesh Dosi, Piyush Sawarkar, Viraj Thakur, Rohit Saluja, Maunendra Sankar Desarkar, Ganesh Ramakrishnan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.06533
Pdf URL: https://arxiv.org/pdf/2508.06533
Copy Paste: [[2508.06533]] The Art of Breaking Words: Rethinking Multilingual Tokenizer Design(https://arxiv.org/abs/2508.06533)
Keywords: language model, llm
Abstract: While model architecture and training objectives are well-studied, tokenization, particularly in multilingual contexts, remains a relatively neglected aspect of Large Language Model (LLM) development. Existing tokenizers often exhibit high token-to-word ratios, inefficient use of context length, and slower inference. We present a systematic study that links vocabulary size, pre-tokenization rules, and training-corpus composition to both token-to-word efficiency and model quality. To ground our analysis in a linguistically diverse context, we conduct extensive experiments on Indic scripts, which present unique challenges due to their high script diversity and orthographic complexity. Drawing on the insights from these analyses, we propose a novel algorithm for data composition that balances multilingual data for tokenizer training. Our observations on pre-tokenization strategies significantly improve model performance, and our data composition algorithm reduces the average token-to-word ratio by approximately 6% with respect to the conventional data randomization approach. Our tokenizer achieves more than 40% improvement on average token-to-word ratio against state-of-the-art multilingual Indic models. This improvement yields measurable gains in both model performance and inference speed. This highlights tokenization alongside architecture and training objectives as a critical lever for building efficient, scalable multilingual LLMs.
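
The token-to-word ratio that this abstract reports can be made concrete with a short sketch. It assumes that words are whitespace-delimited and that a tokenizer is any callable returning a list of tokens; the two toy tokenizers below are hypothetical stand-ins, not the tokenizers evaluated in the paper.

```python
from typing import Callable


def token_to_word_ratio(texts: list[str], tokenize: Callable[[str], list[str]]) -> float:
    """Average number of tokens per whitespace-delimited word over a corpus."""
    total_tokens = sum(len(tokenize(t)) for t in texts)
    total_words = sum(len(t.split()) for t in texts)
    if total_words == 0:
        raise ValueError("corpus contains no words")
    return total_tokens / total_words


def char_tokenizer(text: str) -> list[str]:
    """Toy baseline: one token per non-space character."""
    return [c for c in text if not c.isspace()]


def chunk_tokenizer(text: str, size: int = 2) -> list[str]:
    """Toy subword tokenizer: split each word into fixed-size pieces."""
    return [w[i:i + size] for w in text.split() for i in range(0, len(w), size)]


def relative_reduction(baseline: float, candidate: float) -> float:
    """Fractional reduction of the candidate ratio relative to the baseline."""
    return 1.0 - candidate / baseline
```

A lower ratio means fewer tokens are spent per word, so more text fits in a fixed context length; `relative_reduction` expresses an improvement in the same "percent reduction" form the abstract uses.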

Title: Factor Augmented Supervised Learning with Text Embeddings

Authors: Zhanye Luo, Yuefeng Han, Xiufan Yu
Subjects: cs.CL, cs.AI, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2508.06548
Pdf URL: https://arxiv.org/pdf/2508.06548
Copy Paste: [[2508.06548]] Factor Augmented Supervised Learning with Text Embeddings(https://arxiv.org/abs/2508.06548)
Keywords: language model, llm
Abstract: Large language models (LLMs) generate text embeddings from text data, producing vector representations that capture the semantic meaning and contextual relationships of words. However, the high dimensionality of these embeddings often impedes efficiency and drives up computational cost in downstream tasks. To address this, we propose AutoEncoder-Augmented Learning with Text (AEALT), a supervised, factor-augmented framework that incorporates dimension reduction directly into pre-trained LLM workflows. First, we extract embeddings from text documents; next, we pass them through a supervised augmented autoencoder to learn low-dimensional, task-relevant latent factors. By modeling the nonlinear structure of complex embeddings, AEALT outperforms conventional deep-learning approaches that rely on raw embeddings. We validate its broad applicability with extensive experiments on classification, anomaly detection, and prediction tasks using multiple real-world public datasets. Numerical results demonstrate that AEALT yields substantial gains over both vanilla embeddings and several standard dimension reduction methods.

Title: Discerning minds or generic tutors? Evaluating instructional guidance capabilities in Socratic LLMs

Authors: Ying Liu, Can Li, Ting Zhang, Mei Wang, Qiannan Zhu, Jian Li, Hua Huang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.06583
Pdf URL: https://arxiv.org/pdf/2508.06583
Copy Paste: [[2508.06583]] Discerning minds or generic tutors? Evaluating instructional guidance capabilities in Socratic LLMs(https://arxiv.org/abs/2508.06583)
Keywords: language model, llm, prompt
Abstract: The conversational capabilities of large language models hold significant promise for enabling scalable and interactive tutoring. While prior research has primarily examined their capacity for Socratic questioning, it often overlooks a critical dimension: adaptively guiding learners based on their cognitive states. This study shifts focus from mere question generation to the broader instructional guidance capability. We ask: Can LLMs emulate expert tutors who dynamically adjust strategies in response to learners' understanding? To investigate this, we propose GuideEval, a benchmark grounded in authentic educational dialogues that evaluates pedagogical guidance through a three-phase behavioral framework: (1) Perception, inferring learner states; (2) Orchestration, adapting instructional strategies; and (3) Elicitation, stimulating proper reflections. Empirical findings reveal that existing LLMs frequently fail to provide effective adaptive scaffolding when learners exhibit confusion or require redirection. Furthermore, we introduce a behavior-guided finetuning strategy that leverages behavior-prompted instructional dialogues, significantly enhancing guidance performance. By shifting the focus from isolated content evaluation to learner-centered interaction, our work advocates a more dialogic paradigm for evaluating Socratic LLMs.

Title: LLM Unlearning Without an Expert Curated Dataset

Authors: Xiaoyuan Zhu, Muru Zhang, Ollie Liu, Robin Jia, Willie Neiswanger
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2508.06595
Pdf URL: https://arxiv.org/pdf/2508.06595
Copy Paste: [[2508.06595]] LLM Unlearning Without an Expert Curated Dataset(https://arxiv.org/abs/2508.06595)
Keywords: language model, llm, prompt
Abstract: Modern large language models often encode sensitive, harmful, or copyrighted knowledge, raising the need for post-hoc unlearning: the ability to remove specific domains of knowledge from a model without full retraining. A major bottleneck in current unlearning pipelines is constructing effective forget sets, that is, datasets that approximate the target domain and guide the model to forget it. In this work, we introduce a scalable, automated approach to generate high-quality forget sets using language models themselves. Our method synthesizes textbook-style data through a structured prompting pipeline, requiring only a domain name as input. Through experiments on unlearning biosecurity, cybersecurity, and Harry Potter novels, we show that our synthetic datasets consistently outperform the baseline synthetic alternatives and are comparable to the expert-curated ones. Additionally, ablation studies reveal that the multi-step generation pipeline significantly boosts data diversity, which in turn improves unlearning utility. Overall, our findings suggest that synthetic datasets offer a promising path toward practical, scalable unlearning for a wide range of emerging domains without the need for manual intervention. We release our code and dataset at this https URL.

Title: BrowseComp-Plus: A More Fair and Transparent Evaluation Benchmark of Deep-Research Agent

Authors: Zijian Chen, Xueguang Ma, Shengyao Zhuang, Ping Nie, Kai Zou, Andrew Liu, Joshua Green, Kshama Patel, Ruoxi Meng, Mingyi Su, Sahel Sharifymoghaddam, Yanxi Li, Haoran Hong, Xinyu Shi, Xuye Liu, Nandan Thakur, Crystina Zhang, Luyu Gao, Wenhu Chen, Jimmy Lin
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2508.06600
Pdf URL: https://arxiv.org/pdf/2508.06600
Copy Paste: [[2508.06600]] BrowseComp-Plus: A More Fair and Transparent Evaluation Benchmark of Deep-Research Agent(https://arxiv.org/abs/2508.06600)
Keywords: language model, gpt, llm, agent
Abstract: Deep-Research agents, which integrate large language models (LLMs) with search tools, have shown success in improving the effectiveness of handling complex queries that require iterative search planning and reasoning over search results. Evaluations on current benchmarks like BrowseComp, which rely on black-box live web search APIs, have notable limitations in (1) fairness: dynamic and opaque web APIs hinder fair comparisons and reproducibility of deep research methods; (2) transparency: lack of control over the document corpus makes it difficult to isolate retriever contributions. In other words, the current evaluations may compare a complete deep research system at a given time, but they do not foster well-controlled experiments to provide insights into the capability of underlying deep research LLMs. To address these challenges, we introduce BrowseComp-Plus, a benchmark derived from BrowseComp, employing a fixed, carefully curated corpus. Each query in BrowseComp-Plus includes human-verified supporting documents and mined challenging negatives, enabling controlled experimentation. The benchmark is shown to be effective in distinguishing the performance of deep research systems. For instance, the open-source model Search-R1, when paired with the BM25 retriever, achieves 3.86% accuracy, whereas GPT-5 achieves 55.9%. Integrating GPT-5 with the Qwen3-Embedding-8B retriever further enhances its accuracy to 70.1% with fewer search calls. This benchmark allows comprehensive evaluation and disentangled analysis of deep research agents and retrieval methods, fostering insights into retrieval effectiveness, citation accuracy, and context engineering in Deep-Research systems.

Title: Train It and Forget It: Merge Lists are Unnecessary for BPE Inference in Language Models

Authors: Tomohiro Sawada, Kartik Goyal
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.06621
Pdf URL: https://arxiv.org/pdf/2508.06621
Copy Paste: [[2508.06621]] Train It and Forget It: Merge Lists are Unnecessary for BPE Inference in Language Models(https://arxiv.org/abs/2508.06621)
Keywords: language model
Abstract: Standard Byte-Pair Encoding (BPE) tokenization compresses text by pairing a learned token vocabulary with a detailed merge list. Recent work has shown that this merge list exposes a potential attack surface for extracting information about a language model's training data. In this paper, we explore the downstream impact of BPE inference algorithms that do not rely on this merge list at all, and hence differ from the encoding process during BPE training. To address this question, we investigate two broad classes of BPE inference schemes that differ from BPE application during training: a) targeted deviation from merge lists, including random merge orders and various corruptions of the merge list involving deletion/truncation, and b) non-targeted BPE inference algorithms that do not depend on the merge list but focus on compressing the text either greedily or exactly. Extensive experiments across diverse language modeling tasks like accuracy-based QA benchmarks, machine translation, and open-ended generation reveal that while targeted deviation from the merge lists exhibits significant degradation in language model performance, the non-targeted merge-list-free inference algorithms result in minimal impact on downstream performance that is often much smaller than expected. These findings pave the way for simpler and potentially more privacy-preserving tokenization schemes that do not catastrophically compromise model performance.
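
To make the contrast between merge-list-based and merge-list-free inference concrete, here is a minimal sketch. It assumes that "greedy" means longest-prefix matching against the vocabulary and that "exact" means a segmentation with the fewest tokens; the abstract does not define either term precisely, so both readings, the function names, and the toy merge list are assumptions for illustration only.

```python
def encode_with_merges(word: str, merges: list[tuple[str, str]]) -> list[str]:
    """Simplified standard BPE inference: apply the best-ranked merge until none applies."""
    rank = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        candidates = [
            (rank[pair], i)
            for i, pair in enumerate(zip(symbols, symbols[1:]))
            if pair in rank
        ]
        if not candidates:
            break
        _, i = min(candidates)  # lowest rank first, leftmost on ties
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols


def encode_greedy(word: str, vocab: set[str]) -> list[str]:
    """Merge-list-free inference: repeatedly take the longest vocabulary prefix."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])  # fall back to a single character
            i += 1
    return tokens


def encode_exact(word: str, vocab: set[str]) -> list[str]:
    """Merge-list-free inference: dynamic program for a fewest-token segmentation."""
    best: list[list[str] | None] = [None] * (len(word) + 1)
    best[0] = []
    for i in range(1, len(word) + 1):
        for j in range(i):
            piece = word[j:i]
            # Single characters are always allowed, so best[j] is never None here.
            if piece in vocab or len(piece) == 1:
                candidate = best[j] + [piece]
                if best[i] is None or len(candidate) < len(best[i]):
                    best[i] = candidate
    return best[len(word)]


# Toy example: the vocabulary is the characters plus every merge result.
merges = [("b", "c"), ("a", "b"), ("bc", "d")]
vocab = set("abcd") | {left + right for left, right in merges}
```

On the word `"abcd"` the three schemes can disagree: the merge list yields `a | bcd`, the greedy matcher commits to `ab` first and ends with `ab | c | d`, and the exact search recovers a two-token segmentation. Only `encode_with_merges` needs the ordered merge list; the other two need just the vocabulary.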

Title: Measuring Stereotype and Deviation Biases in Large Language Models

Authors: Daniel Wang, Eli Brignac, Minjia Mao, Xiao Fang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.06649
Pdf URL: https://arxiv.org/pdf/2508.06649
Copy Paste: [[2508.06649]] Measuring Stereotype and Deviation Biases in Large Language Models(https://arxiv.org/abs/2508.06649)
Keywords: language model, llm
Abstract: Large language models (LLMs) are widely applied across diverse domains, raising concerns about their limitations and potential risks. In this study, we investigate two types of bias that LLMs may display: stereotype bias and deviation bias. Stereotype bias refers to when LLMs consistently associate specific traits with a particular demographic group. Deviation bias reflects the disparity between the demographic distributions extracted from LLM-generated content and real-world demographic distributions. By asking four advanced LLMs to generate profiles of individuals, we examine the associations between each demographic group and attributes such as political affiliation, religion, and sexual orientation. Our experimental results show that all examined LLMs exhibit both significant stereotype bias and deviation bias towards multiple groups. Our findings uncover the biases that occur when LLMs infer user attributes and shed light on the potential harms of LLM-generated outputs.
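
Deviation bias is described above as the disparity between the demographic distribution found in LLM-generated content and the real-world distribution. The abstract does not name the disparity measure, so the sketch below simply assumes total variation distance; the function names and the group labels are placeholders.

```python
from collections import Counter


def empirical_distribution(labels: list[str]) -> dict[str, float]:
    """Turn a list of categorical labels into relative frequencies."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def total_variation(p: dict[str, float], q: dict[str, float]) -> float:
    """Total variation distance: half the L1 distance between two distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Placeholder labels: attributes extracted from generated profiles versus a
# reference distribution that would come from real-world statistics.
generated = ["group_a"] * 8 + ["group_b"] * 2
reference = {"group_a": 0.5, "group_b": 0.5}
deviation = total_variation(empirical_distribution(generated), reference)
```

A value of 0 means the generated profiles match the reference distribution exactly, and 1 means the two distributions share no mass. Other divergences (for example KL divergence) would fit the same description.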

Title: Testing the Limits of Machine Translation from One Book

Authors: Jonathan Shaw, Dillon Mee, Timothy Khouw, Zackary Leech, Daniel Wilson
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.06665
Pdf URL: https://arxiv.org/pdf/2508.06665
Copy Paste: [[2508.06665]] Testing the Limits of Machine Translation from One Book(https://arxiv.org/abs/2508.06665)
Keywords: language model, llm
Abstract: Current state-of-the-art models demonstrate the capacity to leverage in-context learning to translate into previously unseen language contexts. Tanzer et al. [2024] utilize language materials (e.g. a grammar) to improve translation quality for Kalamang using large language models (LLMs). We focus on Kanuri, a language that, despite having a substantial speaker population, has minimal digital resources. We design two datasets for evaluation: one focused on health and humanitarian terms, and another containing generalized terminology, investigating how domain-specific tasks impact LLM translation quality. By providing different combinations of language resources (grammar, dictionary, and parallel sentences), we measure LLM translation effectiveness, comparing results to native speaker translations and human linguist performance. We evaluate using both automatic metrics and native speaker assessments of fluency and accuracy. Results demonstrate that parallel sentences remain the most effective data source, outperforming other methods in human evaluations and automatic metrics. While incorporating grammar improves over zero-shot translation, it fails as an effective standalone data source. Human evaluations reveal that LLMs achieve accuracy (meaning) more effectively than fluency (grammaticality). These findings suggest that LLM translation evaluation benefits from multidimensional assessment beyond simple accuracy metrics, and that grammar alone, without parallel sentences, does not provide sufficient context for effective domain-specific translation.

Title: Do Biased Models Have Biased Thoughts?

Authors: Swati Rajwal, Shivank Garg, Reem Abdel-Salam, Abdelrahman Zayed
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.06671
Pdf URL: https://arxiv.org/pdf/2508.06671
Copy Paste: [[2508.06671]] Do Biased Models Have Biased Thoughts?(https://arxiv.org/abs/2508.06671)
Keywords: language model, prompt, chain-of-thought
Abstract: The impressive performance of language models is undeniable. However, the presence of biases based on gender, race, socio-economic status, physical appearance, and sexual orientation makes the deployment of language models challenging. This paper studies the effect of chain-of-thought prompting, a recent approach that studies the steps followed by the model before it responds, on fairness. More specifically, we ask the following question: Do biased models have biased thoughts? To answer our question, we conduct experiments on 5 popular large language models using fairness metrics to quantify 11 different biases in the model's thoughts and output. Our results show that the bias in the thinking steps is not highly correlated with the output bias (less than 0.6 correlation with a p-value smaller than 0.001 in most cases). In other words, unlike human beings, the tested models with biased decisions do not always possess biased thoughts.

Title: Play Favorites: A Statistical Method to Measure Self-Bias in LLM-as-a-Judge

Authors: Evangelia Spiliopoulou, Riccardo Fogliato, Hanna Burnsky, Tamer Soliman, Jie Ma, Graham Horwood, Miguel Ballesteros
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.06709
Pdf URL: https://arxiv.org/pdf/2508.06709
Copy Paste: [[2508.06709]] Play Favorites: A Statistical Method to Measure Self-Bias in LLM-as-a-Judge(https://arxiv.org/abs/2508.06709)
Keywords: language model, gpt, llm, prompt
Abstract: Large language models (LLMs) can serve as judges that offer rapid and reliable assessments of other LLM outputs. However, models may systematically assign overly favorable ratings to their own outputs, a phenomenon known as self-bias, which can distort evaluations of true model performance. Previous studies often conflate genuine differences in model quality with bias or incorrectly assume that evaluations from LLMs and humans follow the same rating distributions. In this work, we present a statistical framework that explicitly formalizes assumptions under which self-bias can be identified and estimated. Our method models the difference in the scoring distribution that an LLM-as-a-judge assigns to its own completions compared to those of other models, while accounting for the underlying quality of the completions provided by an independent, third-party judge (e.g., humans). Our method reliably isolates and quantifies self-bias, even when models vary in ability, ensuring that genuine performance differences are not mistaken for self-bias. We conduct an empirical analysis of self-bias on a large dataset (>5000 prompt-completion pairs) consisting of expert human annotations and judgments from nine different LLM judges. We find that some models, such as GPT-4o and Claude 3.5 Sonnet, systematically assign higher scores to their own outputs. These models also display family-bias, systematically assigning higher ratings to outputs produced by other models of the same family. Our findings highlight potential pitfalls of using LLM judges and offer practical guidance to mitigate biases when interpreting automated evaluations.

Title: Large Language Models for Oral History Understanding with Text Classification and Sentiment Analysis

Authors: Komala Subramanyam Cherukuri, Pranav Abishai Moses, Aisa Sakata, Jiangping Chen, Haihua Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.06729
Pdf URL: https://arxiv.org/pdf/2508.06729
Copy Paste: [[2508.06729]] Large Language Models for Oral History Understanding with Text Classification and Sentiment Analysis(https://arxiv.org/abs/2508.06729)
Keywords: language model, gpt, llm, prompt, chat
Abstract: Oral histories are vital records of lived experience, particularly within communities affected by systemic injustice and historical erasure. Effective and efficient analysis of their oral history archives can promote access to and understanding of the oral histories. However, large-scale analysis of these archives remains limited due to their unstructured format, emotional complexity, and high annotation costs. This paper presents a scalable framework to automate semantic and sentiment annotation for Japanese American Incarceration Oral History. Using LLMs, we construct a high-quality dataset, evaluate multiple models, and test prompt engineering strategies in historically sensitive contexts. Our multiphase approach combines expert annotation, prompt design, and LLM evaluation with ChatGPT, Llama, and Qwen. We labeled 558 sentences from 15 narrators for sentiment and semantic classification, then evaluated zero-shot, few-shot, and RAG strategies. For semantic classification, ChatGPT achieved the highest F1 score (88.71%), followed by Llama (84.99%) and Qwen (83.72%). For sentiment analysis, Llama slightly outperformed Qwen (82.66%) and ChatGPT (82.29%), with all models showing comparable results. The best prompt configurations were used to annotate 92,191 sentences from 1,002 interviews in the JAIOH collection. Our findings show that LLMs can effectively perform semantic and sentiment annotation across large oral history collections when guided by well-designed prompts. This study provides a reusable annotation pipeline and practical guidance for applying LLMs in culturally sensitive archival analysis. By bridging archival ethics with scalable NLP techniques, this work lays the groundwork for responsible use of artificial intelligence in digital humanities and preservation of collective memory. GitHub: this https URL.

Title: Many-Turn Jailbreaking

Authors: Xianjun Yang, Liqiang Xiao, Shiyang Li, Faisal Ladhak, Hyokun Yun, Linda Ruth Petzold, Yi Xu, William Yang Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.06755
Pdf URL: https://arxiv.org/pdf/2508.06755
Copy Paste: [[2508.06755]] Many-Turn Jailbreaking(https://arxiv.org/abs/2508.06755)
Keywords: language model, llm, long context, prompt
Abstract: Current jailbreaking work on large language models (LLMs) aims to elicit unsafe outputs from given prompts. However, it only focuses on single-turn jailbreaking targeting one specific query. In contrast, advanced LLMs are designed to handle extremely long contexts and can thus conduct multi-turn conversations. We therefore propose exploring multi-turn jailbreaking, in which the jailbroken LLMs are continuously tested on more than the first-turn conversation or a single target query. This is an even more serious threat because 1) it is common for users to continue asking relevant follow-up questions to clarify certain jailbroken details, and 2) it is also possible that the initial round of jailbreaking causes the LLMs to respond to additional irrelevant questions consistently. As the first step (first draft completed in June 2024) in exploring multi-turn jailbreaking, we construct a Multi-Turn Jailbreak Benchmark (MTJ-Bench) for benchmarking this setting on a series of open- and closed-source models and provide novel insights into this new safety threat. By revealing this new vulnerability, we aim to call for community efforts to build safer LLMs and pave the way for a more in-depth understanding of jailbreaking LLMs.

Title: SEVADE: Self-Evolving Multi-Agent Analysis with Decoupled Evaluation for Hallucination-Resistant Irony Detection

Authors: Ziqi Liu, Yangbin Chen, Ziyang Zhou, Yilin Li, Mingxuan Hu, Yushan Pan, Zhijie Xu
Subjects: cs.CL, cs.MA
Abstract URL: https://arxiv.org/abs/2508.06803
Pdf URL: https://arxiv.org/pdf/2508.06803
Copy Paste: [[2508.06803]] SEVADE: Self-Evolving Multi-Agent Analysis with Decoupled Evaluation for Hallucination-Resistant Irony Detection(https://arxiv.org/abs/2508.06803)
Keywords: language model, hallucination, agent
Abstract: Sarcasm detection is a crucial yet challenging Natural Language Processing task. Existing Large Language Model methods are often limited by single-perspective analysis, static reasoning pathways, and a susceptibility to hallucination when processing complex ironic rhetoric, which impacts their accuracy and reliability. To address these challenges, we propose **SEVADE**, a novel **S**elf-**Ev**olving multi-agent **A**nalysis framework with **D**ecoupled **E**valuation for hallucination-resistant sarcasm detection. The core of our framework is a Dynamic Agentive Reasoning Engine (DARE), which utilizes a team of specialized agents grounded in linguistic theory to perform a multifaceted deconstruction of the text and generate a structured reasoning chain. Subsequently, a separate lightweight rationale adjudicator (RA) performs the final classification based solely on this reasoning chain. This decoupled architecture is designed to mitigate the risk of hallucination by separating complex reasoning from the final judgment. Extensive experiments on four benchmark datasets demonstrate that our framework achieves state-of-the-art performance, with average improvements of **6.75%** in Accuracy and **6.29%** in Macro-F1 score.

Title: Annotating Errors in English Learners' Written Language Production: Advancing Automated Written Feedback Systems

Authors: Steven Coyne, Diana Galvan-Sosa, Ryan Spring, Camélia Guerraoui, Michael Zock, Keisuke Sakaguchi, Kentaro Inui
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.06810
Pdf URL: https://arxiv.org/pdf/2508.06810
Copy Paste: [[2508.06810]] Annotating Errors in English Learners' Written Language Production: Advancing Automated Written Feedback Systems(https://arxiv.org/abs/2508.06810)
Keywords: language model, llm
Abstract: Recent advances in natural language processing (NLP) have contributed to the development of automated writing evaluation (AWE) systems that can correct grammatical errors. However, while these systems are effective at improving text, they are not optimally designed for language learning. They favor direct revisions, often with a click-to-fix functionality that can be applied without considering the reason for the correction. Meanwhile, depending on the error type, learners may benefit most from simple explanations and strategically indirect hints, especially on generalizable grammatical rules. To support the generation of such feedback, we introduce an annotation framework that models each error's error type and generalizability. For error type classification, we introduce a typology focused on inferring learners' knowledge gaps by connecting their errors to specific grammatical patterns. Following this framework, we collect a dataset of annotated learner errors and corresponding human-written feedback comments, each labeled as a direct correction or hint. With this data, we evaluate keyword-guided, keyword-free, and template-guided methods of generating feedback using large language models (LLMs). Human teachers examined each system's outputs, assessing them on grounds including relevance, factuality, and comprehensibility. We report on the development of the dataset and the comparative performance of the systems investigated.

Title: The ReQAP System for Question Answering over Personal Information

Authors: Philipp Christmann, Gerhard Weikum
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2508.06880
Pdf URL: https://arxiv.org/pdf/2508.06880
Copy Paste: [[2508.06880]] The ReQAP System for Question Answering over Personal Information(https://arxiv.org/abs/2508.06880)
Keywords: language model
Abstract: Personal information is abundant on users' devices, from structured data in calendars, shopping records or fitness tools, to unstructured content in mail and social media posts. This work presents the ReQAP system, which supports users with answers to complex questions that involve filters, joins and aggregation over heterogeneous sources. The unique trait of ReQAP is that it recursively decomposes questions and incrementally builds an operator tree for execution. Both the question interpretation and the individual operators make smart use of light-weight language models, with judicious fine-tuning. The demo showcases the rich functionality for advanced user questions, and also offers detailed tracking of how the answers are computed by the operators in the execution tree. Being able to trace answers back to the underlying sources is vital for human comprehensibility and user trust in the system.

Title: Score Before You Speak: Improving Persona Consistency in Dialogue Generation using Response Quality Scores

Authors: Arpita Saggar, Jonathan C. Darling, Vania Dimitrova, Duygu Sarikaya, David C. Hogg
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.06886
Pdf URL: https://arxiv.org/pdf/2508.06886
Copy Paste: [[2508.06886]] Score Before You Speak: Improving Persona Consistency in Dialogue Generation using Response Quality Scores(https://arxiv.org/abs/2508.06886)
Keywords: language model, llm, prompt, chat
Abstract: Persona-based dialogue generation is an important milestone towards building conversational artificial intelligence. Despite the ever-improving capabilities of large language models (LLMs), effectively integrating persona fidelity in conversations remains challenging due to the limited diversity in existing dialogue data. We propose a novel framework SBS (Score-Before-Speaking), which outperforms previous methods and yields improvements for both million and billion-parameter models. Unlike previous methods, SBS unifies the learning of responses and their relative quality into a single step. The key innovation is to train a dialogue model to correlate augmented responses with a quality score during training and then leverage this knowledge at inference. We use noun-based substitution for augmentation and semantic similarity-based scores as a proxy for response quality. Through extensive experiments with benchmark datasets (PERSONA-CHAT and ConvAI2), we show that score-conditioned training allows existing models to better capture a spectrum of persona-consistent dialogues. Our ablation studies also demonstrate that including scores in the input prompt during training is superior to conventional training setups. Code and further details are available at this https URL

Title: Model-Agnostic Sentiment Distribution Stability Analysis for Robust LLM-Generated Texts Detection

Authors: Siyuan Li, Xi Lin, Guangyan Li, Zehao Liu, Aodu Wulianghai, Li Ding, Jun Wu, Jianhua Li
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2508.06913
Pdf URL: https://arxiv.org/pdf/2508.06913
Copy Paste: [[2508.06913]] Model-Agnostic Sentiment Distribution Stability Analysis for Robust LLM-Generated Texts Detection(https://arxiv.org/abs/2508.06913)
Keywords: language model, gpt, llm
Abstract: The rapid advancement of large language models (LLMs) has resulted in increasingly sophisticated AI-generated content, posing significant challenges in distinguishing LLM-generated text from human-written language. Existing detection methods, primarily based on lexical heuristics or fine-tuned classifiers, often suffer from limited generalizability and are vulnerable to paraphrasing, adversarial perturbations, and cross-domain shifts. In this work, we propose SentiDetect, a model-agnostic framework for detecting LLM-generated text by analyzing the divergence in sentiment distribution stability. Our method is motivated by the empirical observation that LLM outputs tend to exhibit emotionally consistent patterns, whereas human-written texts display greater emotional variability. To capture this phenomenon, we define two complementary metrics: sentiment distribution consistency and sentiment distribution preservation, which quantify stability under sentiment-altering and semantic-preserving transformations. We evaluate SentiDetect on five diverse datasets and a range of advanced LLMs, including Gemini-1.5-Pro, Claude-3, GPT-4-0613, and LLaMa-3.3. Experimental results demonstrate its superiority over state-of-the-art baselines, with over 16% and 11% F1 score improvements on Gemini-1.5-Pro and GPT-4-0613, respectively. Moreover, SentiDetect also shows greater robustness to paraphrasing, adversarial attacks, and text length variations, outperforming existing detectors in challenging scenarios.

Title: Two-Stage Quranic QA via Ensemble Retrieval and Instruction-Tuned Answer Extraction

Authors: Mohamed Basem, Islam Oshallah, Ali Hamdi, Khaled Shaban, Hozaifa Kassab
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2508.06971
Pdf URL: https://arxiv.org/pdf/2508.06971
Copy Paste: [[2508.06971]] Two-Stage Quranic QA via Ensemble Retrieval and Instruction-Tuned Answer Extraction(https://arxiv.org/abs/2508.06971)
Keywords: language model, prompt
Abstract: Quranic Question Answering presents unique challenges due to the linguistic complexity of Classical Arabic and the semantic richness of religious texts. In this paper, we propose a novel two-stage framework that addresses both passage retrieval and answer extraction. For passage retrieval, we ensemble fine-tuned Arabic language models to achieve superior ranking performance. For answer extraction, we employ instruction-tuned large language models with few-shot prompting to overcome the limitations of fine-tuning on small datasets. Our approach achieves state-of-the-art results on the Quran QA 2023 Shared Task, with a MAP@10 of 0.3128 and MRR@10 of 0.5763 for retrieval, and a pAP@10 of 0.669 for extraction, substantially outperforming previous methods. These results demonstrate that combining model ensembling and instruction-tuned language models effectively addresses the challenges of low-resource question answering in specialized domains.
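The retrieval figures quoted in the abstract above (MAP@10 and MRR@10) are standard ranking metrics. The following is a minimal Python sketch of how such scores are commonly computed. It is not taken from the paper or from the Quran QA 2023 evaluation scripts: the function names, the toy data in the usage note, and the choice to normalise average precision by min(k, number of relevant passages) are all assumptions, and pAP@10 is not covered.

```python
def reciprocal_rank_at_k(ranked_ids, relevant_ids, k=10):
    """Return 1/rank of the first relevant item in the top k, or 0.0 if none."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def average_precision_at_k(ranked_ids, relevant_ids, k=10):
    """Average the precision values at each rank where a relevant item appears.

    Assumption: normalise by min(k, number of relevant items); other
    conventions exist and official scorers may differ.
    """
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / min(k, len(relevant_ids))


def mean_over_queries(metric, runs, k=10):
    """Average a per-query metric over (ranked_ids, relevant_ids) pairs."""
    return sum(metric(ranked, relevant, k) for ranked, relevant in runs) / len(runs)
```

With two hypothetical queries, `runs = [(["a", "b", "c"], {"b"}), (["x", "y", "z"], {"x", "z"})]`, calling `mean_over_queries(reciprocal_rank_at_k, runs)` gives MRR@10 and `mean_over_queries(average_precision_at_k, runs)` gives MAP@10.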

Title: Rethinking 1-bit Optimization Leveraging Pre-trained Large Language Models

Authors: Zhijun Tu, Hanting Chen, Siqi Liu, Chuanjian Liu, Jian Li, Jie Hu, Yunhe Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.06974
Pdf URL: https://arxiv.org/pdf/2508.06974
Copy Paste: [[2508.06974]] Rethinking 1-bit Optimization Leveraging Pre-trained Large Language Models(https://arxiv.org/abs/2508.06974)
Keywords: language model, llm
Abstract: 1-bit LLM quantization offers significant advantages in reducing storage and computational costs. However, existing methods typically train 1-bit LLMs from scratch, failing to fully leverage pre-trained models. This results in high training costs and notable accuracy degradation. We identify that the large gap between full-precision and 1-bit representations makes direct adaptation difficult. In this paper, we introduce consistent progressive training for both the forward and backward passes, smoothly converting the floating-point weights into binarized ones. Additionally, we incorporate binary-aware initialization and dual-scaling compensation to reduce the difficulty of progressive training and improve performance. Experimental results on LLMs of various sizes demonstrate that our method outperforms existing approaches. Our results show that high-performance 1-bit LLMs can be achieved using pre-trained models, eliminating the need for expensive training from scratch.
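To make the idea of "smoothly converting the floating-point weights into binarized ones" concrete, here is a small Python sketch of one possible forward-pass blend. It is an illustration under stated assumptions, not the paper's method: the linear schedule, the mean-absolute-value scale, and the function names are hypothetical, and the backward pass, binary-aware initialization, and dual-scaling compensation are not modelled.

```python
def binarize(weights):
    """1-bit weights: sign(w) times the mean absolute value (an assumed scale)."""
    scale = sum(abs(w) for w in weights) / len(weights)
    return [scale if w >= 0 else -scale for w in weights]


def progressive_weights(weights, step, total_steps):
    """Blend full-precision and binarized weights.

    lam moves linearly from 0 (pure floating point) to 1 (pure 1-bit)
    over training; the linear schedule is an assumption.
    """
    lam = min(1.0, max(0.0, step / total_steps))
    binary = binarize(weights)
    return [(1.0 - lam) * w + lam * b for w, b in zip(weights, binary)]
```

At `step=0` the function returns the original weights, and at `step=total_steps` it returns the fully binarized weights, so the model sees a gradual change rather than an abrupt jump to 1-bit values.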

Title: Vec2Summ: Text Summarization via Probabilistic Sentence Embeddings

Authors: Mao Li, Fred Conrad, Johann Gagnon-Bartsch
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.07017
Pdf URL: https://arxiv.org/pdf/2508.07017
Copy Paste: [[2508.07017]] Vec2Summ: Text Summarization via Probabilistic Sentence Embeddings(https://arxiv.org/abs/2508.07017)
Keywords: language model, llm
Abstract: We propose Vec2Summ, a novel method for abstractive summarization that frames the task as semantic compression. Vec2Summ represents a document collection using a single mean vector in the semantic embedding space, capturing the central meaning of the corpus. To reconstruct fluent summaries, we perform embedding inversion -- decoding this mean vector into natural language using a generative language model. To improve reconstruction quality and capture some degree of topical variability, we introduce stochasticity by sampling from a Gaussian distribution centered on the mean. This approach is loosely analogous to bagging in ensemble learning, where controlled randomness encourages more robust and varied outputs. Vec2Summ addresses key limitations of LLM-based summarization methods. It avoids context-length constraints, enables interpretable and controllable generation via semantic parameters, and scales efficiently with corpus size -- requiring only $O(d + d^2)$ parameters. Empirical results show that Vec2Summ produces coherent summaries for topically focused, order-invariant corpora, with performance comparable to direct LLM summarization in terms of thematic coverage and efficiency, albeit with less fine-grained detail. These results underscore Vec2Summ's potential in settings where scalability, semantic control, and corpus-level abstraction are prioritized.
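The semantic-compression step described in the abstract above (a mean vector plus Gaussian sampling around it) can be sketched in plain Python. This is a hedged illustration rather than the authors' implementation: the helper names, the diagonal jitter, and the use of a full covariance matrix (which would account for the d + d^2 parameter count) are assumptions, and the embedding-inversion decoder that turns a sampled vector back into text is outside the sketch.

```python
import math
import random


def mean_vector(embeddings):
    """Component-wise mean of a list of d-dimensional embeddings."""
    n, d = len(embeddings), len(embeddings[0])
    return [sum(e[j] for e in embeddings) / n for j in range(d)]


def covariance_matrix(embeddings, mean, jitter=1e-6):
    """Empirical covariance; a small diagonal jitter keeps it positive definite."""
    n, d = len(embeddings), len(mean)
    cov = [[0.0] * d for _ in range(d)]
    for e in embeddings:
        diff = [e[j] - mean[j] for j in range(d)]
        for a in range(d):
            for b in range(d):
                cov[a][b] += diff[a] * diff[b] / n
    for a in range(d):
        cov[a][a] += jitter
    return cov


def cholesky(matrix):
    """Lower-triangular factor L with L * L^T equal to the input matrix."""
    d = len(matrix)
    lower = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - s)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return lower


def sample_around_mean(mean, cov, rng):
    """Draw one vector from N(mean, cov) using the Cholesky factor."""
    lower = cholesky(cov)
    z = [rng.gauss(0.0, 1.0) for _ in mean]
    return [mean[i] + sum(lower[i][k] * z[k] for k in range(i + 1))
            for i in range(len(mean))]
```

A corpus is then summarized by `mean` (d numbers) and `cov` (d^2 numbers) regardless of how many documents it contains; each call to `sample_around_mean(mean, cov, random.Random(0))` yields a candidate vector that a separate decoder would turn into a summary.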

Title: SEADialogues: A Multilingual Culturally Grounded Multi-turn Dialogue Dataset on Southeast Asian Languages

Authors: Muhammad Dehan Al Kautsar, Aswin Candra, Muhammad Alif Al Hakim, Maxalmina Satria Kahfi, Fajri Koto, Alham Fikri Aji, Peerat Limkonchotiwat, Ekapol Chuangsuwanich, Genta Indra Winata
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.07069
Pdf URL: https://arxiv.org/pdf/2508.07069
Copy Paste: [[2508.07069]] SEADialogues: A Multilingual Culturally Grounded Multi-turn Dialogue Dataset on Southeast Asian Languages(https://arxiv.org/abs/2508.07069)
Keywords: language model, chat, agent
Abstract: Although numerous datasets have been developed to support dialogue systems, most existing chit-chat datasets overlook the cultural nuances inherent in natural human conversations. To address this gap, we introduce SEADialogues, a culturally grounded dialogue dataset centered on Southeast Asia, a region with over 700 million people and immense cultural diversity. Our dataset features dialogues in eight languages from six Southeast Asian countries, many of which are low-resource despite having sizable speaker populations. To enhance cultural relevance and personalization, each dialogue includes persona attributes and two culturally grounded topics that reflect everyday life in the respective communities. Furthermore, we release a multi-turn dialogue dataset to advance research on culturally aware and human-centric large language models, including conversational dialogue agents.

Title: BharatBBQ: A Multilingual Bias Benchmark for Question Answering in the Indian Context

Authors: Aditya Tomar, Nihar Ranjan Sahoo, Pushpak Bhattacharyya
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.07090
Pdf URL: https://arxiv.org/pdf/2508.07090
Copy Paste: [[2508.07090]] BharatBBQ: A Multilingual Bias Benchmark for Question Answering in the Indian Context(https://arxiv.org/abs/2508.07090)
Keywords: language model
Abstract: Evaluating social biases in language models (LMs) is crucial for ensuring fairness and minimizing the reinforcement of harmful stereotypes in AI systems. Existing benchmarks, such as the Bias Benchmark for Question Answering (BBQ), primarily focus on Western contexts, limiting their applicability to the Indian context. To address this gap, we introduce BharatBBQ, a culturally adapted benchmark designed to assess biases in Hindi, English, Marathi, Bengali, Tamil, Telugu, Odia, and Assamese. BharatBBQ covers 13 social categories, including 3 intersectional groups, reflecting prevalent biases in the Indian sociocultural landscape. Our dataset contains 49,108 examples in one language that are expanded using translation and verification to 392,864 examples in eight different languages. We evaluate five multilingual LM families across zero and few-shot settings, analyzing their bias and stereotypical bias scores. Our findings highlight persistent biases across languages and social categories and often amplified biases in Indian languages compared to English, demonstrating the necessity of linguistically and culturally grounded benchmarks for bias evaluation.

Title: Less Is More: Training-Free Sparse Attention with Global Locality for Efficient Reasoning

Authors: Lijie Yang, Zhihao Zhang, Arti Jain, Shijie Cao, Baihong Yuan, Yiwei Chen, Zhihao Jia, Ravi Netravali
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.07101
Pdf URL: https://arxiv.org/pdf/2508.07101
Copy Paste: [[2508.07101]] Less Is More: Training-Free Sparse Attention with Global Locality for Efficient Reasoning(https://arxiv.org/abs/2508.07101)
Keywords: prompt
Abstract: Large reasoning models achieve strong performance through test-time scaling but incur substantial computational overhead, particularly from excessive token generation when processing short input prompts. While sparse attention mechanisms can reduce latency and memory usage, existing approaches suffer from significant accuracy degradation due to accumulated errors during long-generation reasoning. These methods generally require either high token retention rates or expensive retraining. We introduce LessIsMore, a training-free sparse attention mechanism for reasoning tasks, which leverages global attention patterns rather than relying on traditional head-specific local optimizations. LessIsMore aggregates token selections from local attention heads with recent contextual information, enabling unified cross-head token ranking for future decoding layers. This unified selection improves generalization and efficiency by avoiding the need to maintain separate token subsets per head. Evaluation across diverse reasoning tasks and benchmarks shows that LessIsMore preserves -- and in some cases improves -- accuracy while achieving a $1.1\times$ average decoding speed-up compared to full attention. Moreover, LessIsMore attends to $2\times$ fewer tokens without accuracy loss, achieving a $1.13\times$ end-to-end speed-up compared to existing sparse attention methods.
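The abstract above describes aggregating per-head token selections together with recent context into one unified cross-head ranking. The Python sketch below shows one way such an aggregation could look. It is a toy illustration only: the voting rule, the tie-break by summed score, and every parameter name (`per_head_k`, `budget`, `recent_window`) are assumptions and are not taken from LessIsMore itself.

```python
def unified_token_selection(head_scores, per_head_k, budget, recent_window):
    """Pick one shared set of token indices for all heads.

    head_scores: one list per attention head, each holding a score for the
    same n cached tokens. Returns the kept token indices in sorted order.
    """
    n = len(head_scores[0])
    # Always retain the most recent tokens (assumed fixed-size window).
    kept = set(range(max(0, n - recent_window), n))
    # Each head votes for its own top-k tokens.
    votes = [0] * n
    for scores in head_scores:
        top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:per_head_k]
        for i in top:
            votes[i] += 1
    # Rank globally by votes, breaking ties by total score across heads.
    total = [sum(scores[i] for scores in head_scores) for i in range(n)]
    ranked = sorted(range(n), key=lambda i: (votes[i], total[i]), reverse=True)
    for i in ranked:
        if len(kept) >= budget:
            break
        if votes[i] > 0:
            kept.add(i)
    return sorted(kept)
```

Because every head attends to the same returned index set, there is no need to maintain a separate token subset per head, which is the efficiency argument the abstract makes.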

Title: Investigating Intersectional Bias in Large Language Models using Confidence Disparities in Coreference Resolution

Authors: Falaah Arif Khan, Nivedha Sivakumar, Yinong Oliver Wang, Katherine Metcalf, Cezanne Camacho, Barry-John Theobald, Luca Zappella, Nicholas Apostoloff
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.07111
Pdf URL: https://arxiv.org/pdf/2508.07111
Copy Paste: [[2508.07111]] Investigating Intersectional Bias in Large Language Models using Confidence Disparities in Coreference Resolution(https://arxiv.org/abs/2508.07111)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) have achieved impressive performance, leading to their widespread adoption as decision-support tools in resource-constrained contexts like hiring and admissions. There is, however, scientific consensus that AI systems can reflect and exacerbate societal biases, raising concerns about identity-based harm when used in critical social contexts. Prior work has laid a solid foundation for assessing bias in LLMs by evaluating demographic disparities in different language reasoning tasks. In this work, we extend single-axis fairness evaluations to examine intersectional bias, recognizing that when multiple axes of discrimination intersect, they create distinct patterns of disadvantage. We create a new benchmark called WinoIdentity by augmenting the WinoBias dataset with 25 demographic markers across 10 attributes, including age, nationality, and race, intersected with binary gender, yielding 245,700 prompts to evaluate 50 distinct bias patterns. Focusing on harms of omission due to underrepresentation, we investigate bias through the lens of uncertainty and propose a group (un)fairness metric called Coreference Confidence Disparity which measures whether models are more or less confident for some intersectional identities than others. We evaluate five recently published LLMs and find confidence disparities as high as 40% along various demographic attributes including body type, sexual orientation and socio-economic status, with models being most uncertain about doubly-disadvantaged identities in anti-stereotypical settings. Surprisingly, coreference confidence decreases even for hegemonic or privileged markers, indicating that the recent impressive performance of LLMs is more likely due to memorization than logical reasoning. Notably, these are two independent failures in value alignment and validity that can compound to cause social harm.

Title: Gradient Surgery for Safe LLM Fine-Tuning

Authors: Biao Yi, Jiahao Li, Baolei Zhang, Lihai Nie, Tong Li, Tiansheng Huang, Zheli Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.07172
Pdf URL: https://arxiv.org/pdf/2508.07172
Copy Paste: [[2508.07172]] Gradient Surgery for Safe LLM Fine-Tuning(https://arxiv.org/abs/2508.07172)
Keywords: language model, llm
Abstract: Fine-tuning-as-a-Service introduces a critical vulnerability where a few malicious examples mixed into the user's fine-tuning dataset can compromise the safety alignment of Large Language Models (LLMs). While a recognized paradigm frames safe fine-tuning as a multi-objective optimization problem balancing user task performance with safety alignment, we find existing solutions are critically sensitive to the harmful ratio, with defenses degrading sharply as the harmful ratio increases. We diagnose that this failure stems from conflicting gradients, where the user-task update directly undermines the safety objective. To resolve this, we propose SafeGrad, a novel method that employs gradient surgery. When a conflict is detected, SafeGrad nullifies the harmful component of the user-task gradient by projecting it onto the orthogonal plane of the alignment gradient, allowing the model to learn the user's task without sacrificing safety. To further enhance robustness and data efficiency, we employ a KL-divergence alignment loss that learns the rich, distributional safety profile of the well-aligned foundation model. Extensive experiments show that SafeGrad provides state-of-the-art defense across various LLMs and datasets, maintaining robust safety even at high harmful ratios without compromising task fidelity.
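The projection step in the abstract above can be illustrated with a few lines of Python on flat gradient vectors. This is a minimal sketch, assuming that a "conflict" means a negative inner product between the user-task gradient and the alignment gradient, as in standard gradient-surgery formulations; the abstract does not state the exact detection rule, and the function names are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def project_if_conflicting(user_grad, align_grad):
    """Remove the component of user_grad that opposes align_grad.

    If the two gradients do not conflict (non-negative inner product),
    the user-task gradient is returned unchanged.
    """
    inner = dot(user_grad, align_grad)
    norm_sq = dot(align_grad, align_grad)
    if inner >= 0 or norm_sq == 0.0:
        return list(user_grad)
    coeff = inner / norm_sq
    # Project onto the plane orthogonal to the alignment gradient.
    return [g - coeff * a for g, a in zip(user_grad, align_grad)]
```

For example, with `user_grad = [1.0, -1.0]` and `align_grad = [0.0, 1.0]` the conflicting second component is removed and the result is orthogonal to the alignment gradient, so the update no longer pushes against the safety objective to first order.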

Title: Omni-SafetyBench: A Benchmark for Safety Evaluation of Audio-Visual Large Language Models

Authors: Leyi Pan, Zheyu Fu, Yunpeng Zhai, Shuchang Tao, Sheng Guan, Shiyu Huang, Lingzhe Zhang, Zhaoyang Liu, Bolin Ding, Felix Henry, Lijie Wen, Aiwei Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.07173
Pdf URL: https://arxiv.org/pdf/2508.07173
Copy Paste: [[2508.07173]] Omni-SafetyBench: A Benchmark for Safety Evaluation of Audio-Visual Large Language Models(https://arxiv.org/abs/2508.07173)
Keywords: language model, llm
Abstract: The rise of Omni-modal Large Language Models (OLLMs), which integrate visual and auditory processing with text, necessitates robust safety evaluations to mitigate harmful outputs. However, no dedicated benchmarks currently exist for OLLMs, and prior benchmarks designed for other LLMs lack the ability to assess safety performance under audio-visual joint inputs or cross-modal safety consistency. To fill this gap, we introduce Omni-SafetyBench, the first comprehensive parallel benchmark for OLLM safety evaluation, featuring 24 modality combinations and variations with 972 samples each, including dedicated audio-visual harm cases. Considering OLLMs' comprehension challenges with complex omni-modal inputs and the need for cross-modal consistency evaluation, we propose tailored metrics: a Safety-score based on conditional Attack Success Rate (C-ASR) and Refusal Rate (C-RR) to account for comprehension failures, and a Cross-Modal Safety Consistency Score (CMSC-score) to measure consistency across modalities. Evaluating 6 open-source and 4 closed-source OLLMs reveals critical vulnerabilities: (1) no model excels in both overall safety and consistency, with only 3 models achieving over 0.6 in both metrics and the top performer scoring around 0.8; (2) safety defenses weaken with complex inputs, especially audio-visual joint inputs; (3) severe weaknesses persist, with some models scoring as low as 0.14 on specific modalities. Our benchmark and metrics highlight urgent needs for enhanced OLLM safety, providing a foundation for future improvements.

Title: Schema Lineage Extraction at Scale: Multilingual Pipelines, Composite Evaluation, and Language-Model Benchmarks

Authors: Jiaqi Yin, Yi-Wei Chen, Meng-Lung Lee, Xiya Liu
Subjects: cs.CL, cs.AI, cs.DB
Abstract URL: https://arxiv.org/abs/2508.07179
Pdf URL: https://arxiv.org/pdf/2508.07179
Copy Paste: [[2508.07179]] Schema Lineage Extraction at Scale: Multilingual Pipelines, Composite Evaluation, and Language-Model Benchmarks(https://arxiv.org/abs/2508.07179)
Keywords: language model, gpt, llm, prompt, retrieval-augmented generation, agent
Abstract: Enterprise data pipelines, characterized by complex transformations across multiple programming languages, often cause a semantic disconnect between original metadata and downstream data. This "semantic drift" compromises data reproducibility and governance, and impairs the utility of services like retrieval-augmented generation (RAG) and text-to-SQL systems. To address this, a novel framework is proposed for the automated extraction of fine-grained schema lineage from multilingual enterprise pipeline scripts. This method identifies four key components: source schemas, source tables, transformation logic, and aggregation operations, creating a standardized representation of data transformations. For the rigorous evaluation of lineage quality, this paper introduces the Schema Lineage Composite Evaluation (SLiCE), a metric that assesses both structural correctness and semantic fidelity. A new benchmark is also presented, comprising 1,700 manually annotated lineages from real-world industrial scripts. Experiments were conducted with 12 language models, ranging from 1.3B to 32B small language models (SLMs) to large language models (LLMs) such as GPT-4o and GPT-4.1. The results demonstrate that the performance of schema lineage extraction scales with model size and the sophistication of prompting techniques. Specifically, a 32B open-source model, using a single reasoning trace, can achieve performance comparable to the GPT series under standard prompting. This finding suggests a scalable and economical approach for deploying schema-aware agents in practical applications.

Title: DySK-Attn: A Framework for Efficient, Real-Time Knowledge Updating in Large Language Models via Dynamic Sparse Knowledge Attention

Authors: Kabir Khan, Priya Sharma, Arjun Mehta, Neha Gupta, Ravi Narayanan
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2508.07185
Pdf URL: https://arxiv.org/pdf/2508.07185
Copy Paste: [[2508.07185]] DySK-Attn: A Framework for Efficient, Real-Time Knowledge Updating in Large Language Models via Dynamic Sparse Knowledge Attention(https://arxiv.org/abs/2508.07185)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Large Language Models (LLMs) suffer from a critical limitation: their knowledge is static and quickly becomes outdated. Retraining these massive models is computationally prohibitive, while existing knowledge editing techniques can be slow and may introduce unforeseen side effects. To address this, we propose DySK-Attn, a novel framework that enables LLMs to efficiently integrate real-time knowledge from a dynamic external source. Our approach synergizes an LLM with a dynamic Knowledge Graph (KG) that can be updated instantaneously. The core of our framework is a sparse knowledge attention mechanism, which allows the LLM to perform a coarse-to-fine grained search, efficiently identifying and focusing on a small, highly relevant subset of facts from the vast KG. This mechanism avoids the high computational cost of dense attention over the entire knowledge base and mitigates noise from irrelevant information. We demonstrate through extensive experiments on time-sensitive question-answering tasks that DySK-Attn significantly outperforms strong baselines, including standard Retrieval-Augmented Generation (RAG) and model editing techniques, in both factual accuracy for updated knowledge and computational efficiency. Our framework offers a scalable and effective solution for building LLMs that can stay current with the ever-changing world.
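The coarse-to-fine idea described in this abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the cluster layout, the dot-product scoring, and the names `sparse_knowledge_attention`, `KB`, and `QUERY` are all assumptions made for illustration. Facts are grouped under centroids, the coarse step keeps the best-matching clusters, the fine step keeps the top-scoring facts inside them, and softmax attention is computed over that small subset only.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_knowledge_attention(query, clusters, top_clusters=1, top_facts=2):
    """Hypothetical coarse-to-fine attention over a clustered fact store.

    clusters: list of (centroid, [(key, value), ...]).
    Returns (attention weights, attended output vector).
    """
    # Coarse step: rank clusters by centroid similarity and keep the best ones.
    ranked = sorted(clusters, key=lambda c: dot(query, c[0]), reverse=True)
    candidates = [kv for _, facts in ranked[:top_clusters] for kv in facts]
    # Fine step: keep only the highest-scoring facts among the candidates.
    selected = sorted(candidates, key=lambda kv: dot(query, kv[0]), reverse=True)[:top_facts]
    # Attention is computed over the selected facts only, not the whole store.
    weights = softmax([dot(query, k) for k, _ in selected])
    dim = len(selected[0][1])
    output = [sum(w * v[i] for w, (_, v) in zip(weights, selected)) for i in range(dim)]
    return weights, output

# Toy fact store (illustrative values only): two clusters of (key, value) pairs.
KB = [
    ([1.0, 0.0], [([1.0, 0.0], [1.0, 0.0]), ([0.5, 0.0], [0.0, 1.0]), ([0.1, 0.0], [5.0, 5.0])]),
    ([0.0, 1.0], [([0.0, 1.0], [9.0, 9.0])]),
]
QUERY = [1.0, 0.0]

weights, output = sparse_knowledge_attention(QUERY, KB)
```

In this toy setup only two of the four stored facts ever receive attention weight, which is the cost saving the abstract attributes to sparse attention over dense attention on the full knowledge base.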

Title: Adapting LLMs to Time Series Forecasting via Temporal Heterogeneity Modeling and Semantic Alignment

Authors: Yanru Sun, Emadeldeen Eldele, Zongxia Xie, Yucheng Wang, Wenzhe Niu, Qinghua Hu, Chee Keong Kwoh, Min Wu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.07195
Pdf URL: https://arxiv.org/pdf/2508.07195
Copy Paste: [[2508.07195]] Adapting LLMs to Time Series Forecasting via Temporal Heterogeneity Modeling and Semantic Alignment(https://arxiv.org/abs/2508.07195)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) have recently demonstrated impressive capabilities in natural language processing due to their strong generalization and sequence modeling capabilities. However, their direct application to time series forecasting remains challenging due to two fundamental issues: the inherent heterogeneity of temporal patterns and the modality gap between continuous numerical signals and discrete language representations. In this work, we propose TALON, a unified framework that enhances LLM-based forecasting by modeling temporal heterogeneity and enforcing semantic alignment. Specifically, we design a Heterogeneous Temporal Encoder that partitions multivariate time series into structurally coherent segments, enabling localized expert modeling across diverse temporal patterns. To bridge the modality gap, we introduce a Semantic Alignment Module that aligns temporal features with LLM-compatible representations, enabling effective integration of time series into language-based models while eliminating the need for handcrafted prompts during inference. Extensive experiments on seven real-world benchmarks demonstrate that TALON achieves superior performance across all datasets, with average MSE improvements of up to 11% over recent state-of-the-art methods. These results underscore the effectiveness of incorporating both pattern-aware and semantic-aware designs when adapting LLMs for time series forecasting. The code is available at: this https URL.

Title: Enhancing Rumor Detection Methods with Propagation Structure Infused Language Model

Authors: Chaoqun Cui, Siyuan Li, Kunkun Ma, Caiyan Jia
Subjects: cs.CL, cs.SI
Abstract URL: https://arxiv.org/abs/2508.07209
Pdf URL: https://arxiv.org/pdf/2508.07209
Copy Paste: [[2508.07209]] Enhancing Rumor Detection Methods with Propagation Structure Infused Language Model(https://arxiv.org/abs/2508.07209)
Keywords: language model
Abstract: Pretrained Language Models (PLMs) have excelled in various Natural Language Processing tasks, benefiting from large-scale pretraining and the self-attention mechanism's ability to capture long-range dependencies. However, their performance on social media application tasks like rumor detection remains suboptimal. We attribute this to mismatches between pretraining corpora and social texts, inadequate handling of unique social symbols, and pretraining tasks ill-suited for modeling user engagements implicit in propagation structures. To address these issues, we propose a continued pretraining strategy called Post Engagement Prediction (PEP) to infuse information from propagation structures into PLMs. PEP trains models to predict root, branch, and parent relations between posts, capturing interactions of stance and sentiment crucial for rumor detection. We also curate and release a large-scale Twitter corpus, TwitterCorpus (269GB text), and two unlabeled claim conversation datasets with propagation structures (UTwitter and UWeibo). Utilizing these resources and the PEP strategy, we train a Twitter-tailored PLM called SoLM. Extensive experiments demonstrate that PEP significantly boosts rumor detection performance across universal and social media PLMs, even in few-shot scenarios. On benchmark datasets, PEP enhances baseline models by 1.0-3.7% accuracy, even enabling them to outperform current state-of-the-art methods on multiple datasets. SoLM alone, without high-level modules, also achieves competitive results, highlighting the strategy's effectiveness in learning discriminative post interaction features.

Title: How Does a Deep Neural Network Look at Lexical Stress?

Authors: Itai Allouche, Itay Asael, Rotem Rousso, Vered Dassa, Ann Bradlow, Seung-Eun Kim, Matthew Goldrick, Joseph Keshet
Subjects: cs.CL, cs.LG, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2508.07229
Pdf URL: https://arxiv.org/pdf/2508.07229
Copy Paste: [[2508.07229]] How Does a Deep Neural Network Look at Lexical Stress?(https://arxiv.org/abs/2508.07229)
Keywords: prompt
Abstract: Despite their success in speech processing, neural networks often operate as black boxes, prompting the question: what informs their decisions, and how can we interpret them? This work examines this issue in the context of lexical stress. A dataset of English disyllabic words was automatically constructed from read and spontaneous speech. Several Convolutional Neural Network (CNN) architectures were trained to predict stress position from a spectrographic representation of disyllabic words lacking minimal stress pairs (e.g., initial stress WAllet, final stress exTEND), achieving up to 92% accuracy on held-out test data. Layerwise Relevance Propagation (LRP), a technique for CNN interpretability analysis, revealed that predictions for held-out minimal pairs (PROtest vs. proTEST) were most strongly influenced by information in stressed versus unstressed syllables, particularly the spectral properties of stressed vowels. However, the classifiers also attended to information throughout the word. A feature-specific relevance analysis is proposed, and its results suggest that our best-performing classifier is strongly influenced by the stressed vowel's first and second formants, with some evidence that its pitch and third formant also contribute. These results reveal deep learning's ability to acquire distributed cues to stress from naturally occurring data, extending traditional phonetic work based around highly controlled stimuli.

Title: Prompt Tuning for Few-Shot Continual Learning Named Entity Recognition

Authors: Zhe Ren
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.07248
Pdf URL: https://arxiv.org/pdf/2508.07248
Copy Paste: [[2508.07248]] Prompt Tuning for Few-Shot Continual Learning Named Entity Recognition(https://arxiv.org/abs/2508.07248)
Keywords: prompt
Abstract: Knowledge distillation has been successfully applied to Continual Learning Named Entity Recognition (CLNER) tasks, by using a teacher model trained on old-class data to distill old-class entities present in new-class data as a form of regularization, thereby avoiding catastrophic forgetting. However, in Few-Shot CLNER (FS-CLNER) tasks, the scarcity of new-class entities makes it difficult for the trained model to generalize during inference. More critically, the lack of old-class entity information hinders the distillation of old knowledge, causing the model to fall into what we refer to as the Few-Shot Distillation Dilemma. In this work, we address the above challenges through a prompt tuning paradigm and memory demonstration template strategy. Specifically, we designed an expandable Anchor words-oriented Prompt Tuning (APT) paradigm to bridge the gap between pre-training and fine-tuning, thereby enhancing performance in few-shot scenarios. Additionally, we incorporated Memory Demonstration Templates (MDT) into each training instance to provide replay samples from previous tasks, which not only avoids the Few-Shot Distillation Dilemma but also promotes in-context learning. Experiments show that our approach achieves competitive performances on FS-CLNER.

Title: Incorporating Contextual Paralinguistic Understanding in Large Speech-Language Models

Authors: Qiongqiong Wang, Hardik B. Sailor, Jeremy H. M. Wong, Tianchi Liu, Shuo Sun, Wenyu Zhang, Muhammad Huzaifah, Nancy Chen, Ai Ti Aw
Subjects: cs.CL, cs.AI, eess.AS
Abstract URL: https://arxiv.org/abs/2508.07273
Pdf URL: https://arxiv.org/pdf/2508.07273
Copy Paste: [[2508.07273]] Incorporating Contextual Paralinguistic Understanding in Large Speech-Language Models(https://arxiv.org/abs/2508.07273)
Keywords: language model, llm
Abstract: Current large speech language models (Speech-LLMs) often exhibit limitations in empathetic reasoning, primarily due to the absence of training datasets that integrate both contextual content and paralinguistic cues. In this work, we propose two approaches to incorporate contextual paralinguistic information into model training: (1) an explicit method that provides paralinguistic metadata (e.g., emotion annotations) directly to the LLM, and (2) an implicit method that automatically generates novel training question-answer (QA) pairs using both categorical and dimensional emotion annotations alongside speech transcriptions. Our implicit method boosts performance (LLM-judged) by 38.41% on a human-annotated QA benchmark, reaching 46.02% when combined with the explicit approach, showing effectiveness in contextual paralinguistic understanding. We also validate the LLM judge by demonstrating its correlation with classification metrics, providing support for its reliability.

Title: MAQuA: Adaptive Question-Asking for Multidimensional Mental Health Screening using Item Response Theory

Authors: Vasudha Varadarajan, Hui Xu, Rebecca Astrid Boehme, Mariam Marlan Mirstrom, Sverker Sikstrom, H. Andrew Schwartz
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.07279
Pdf URL: https://arxiv.org/pdf/2508.07279
Copy Paste: [[2508.07279]] MAQuA: Adaptive Question-Asking for Multidimensional Mental Health Screening using Item Response Theory(https://arxiv.org/abs/2508.07279)
Keywords: language model, llm, agent
Abstract: Recent advances in large language models (LLMs) offer new opportunities for scalable, interactive mental health assessment, but excessive querying by LLMs burdens users and is inefficient for real-world screening across transdiagnostic symptom profiles. We introduce MAQuA, an adaptive question-asking framework for simultaneous, multidimensional mental health screening. Combining multi-outcome modeling on language responses with item response theory (IRT) and factor analysis, MAQuA selects the questions with the most informative responses across multiple dimensions at each turn to optimize diagnostic information, improving accuracy and potentially reducing response burden. Empirical results on a novel dataset reveal that MAQuA reduces the number of assessment questions required for score stabilization by 50-87% compared to random ordering (e.g., achieving stable depression scores with 71% fewer questions and eating disorder scores with 85% fewer questions). MAQuA demonstrates robust performance across both internalizing (depression, anxiety) and externalizing (substance use, eating disorder) domains, with early stopping strategies further reducing patient time and burden. These findings position MAQuA as a powerful and efficient tool for scalable, nuanced, and interactive mental health screening, advancing the integration of LLM-based agents into real-world clinical workflows.
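For readers unfamiliar with IRT-driven question selection, the following is a minimal sketch of the general technique, not MAQuA itself. It assumes a unidimensional two-parameter logistic (2PL) model, whereas the abstract describes a multidimensional, language-based setup; the item bank `ITEMS` and the function names are hypothetical. At each turn, the next question is the unasked item with the highest Fisher information at the current trait estimate.

```python
import math

def p_endorse(theta, a, b):
    """2PL probability of endorsing an item at latent trait level theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of a 2PL item at theta: a^2 * P * (1 - P)."""
    p = p_endorse(theta, a, b)
    return a * a * p * (1.0 - p)

def pick_next_item(theta, items, asked):
    """Return the id of the not-yet-asked item with the highest information."""
    candidates = {k: v for k, v in items.items() if k not in asked}
    return max(candidates, key=lambda k: item_information(theta, *candidates[k]))

# Hypothetical item bank: id -> (discrimination a, difficulty b).
ITEMS = {"q1": (1.0, -1.0), "q2": (2.0, 0.0), "q3": (1.5, 2.0)}

first = pick_next_item(0.0, ITEMS, asked=set())      # most informative near theta = 0
second = pick_next_item(0.0, ITEMS, asked={first})   # next best once the first is used
```

Information peaks where an item's difficulty matches the current estimate, so well-targeted questions stabilize the score with fewer turns than a random ordering would.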

Title: "Pull or Not to Pull?": Investigating Moral Biases in Leading Large Language Models Across Ethical Dilemmas

Authors: Junchen Ding, Penghao Jiang, Zihao Xu, Ziqi Ding, Yichen Zhu, Jiaojiao Jiang, Yuekang Li
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2508.07284
Pdf URL: https://arxiv.org/pdf/2508.07284
Copy Paste: [[2508.07284]] "Pull or Not to Pull?": Investigating Moral Biases in Leading Large Language Models Across Ethical Dilemmas(https://arxiv.org/abs/2508.07284)
Keywords: language model, llm, prompt
Abstract: As large language models (LLMs) increasingly mediate ethically sensitive decisions, understanding their moral reasoning processes becomes imperative. This study presents a comprehensive empirical evaluation of 14 leading LLMs, both reasoning-enabled and general-purpose, across 27 diverse trolley problem scenarios, framed by ten moral philosophies, including utilitarianism, deontology, and altruism. Using a factorial prompting protocol, we elicited 3,780 binary decisions and natural language justifications, enabling analysis along axes of decisional assertiveness, explanation-answer consistency, public moral alignment, and sensitivity to ethically irrelevant cues. Our findings reveal significant variability across ethical frames and model types: reasoning-enhanced models demonstrate greater decisiveness and structured justifications, yet do not always align better with human consensus. Notably, "sweet zones" emerge in altruistic, fairness, and virtue ethics framings, where models achieve a balance of high intervention rates, low explanation conflict, and minimal divergence from aggregated human judgments. However, models diverge under frames emphasizing kinship, legality, or self-interest, often producing ethically controversial outcomes. These patterns suggest that moral prompting is not only a behavioral modifier but also a diagnostic tool for uncovering latent alignment philosophies across providers. We advocate for moral reasoning to become a primary axis in LLM alignment, calling for standardized benchmarks that evaluate not just what LLMs decide, but how and why.
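The reported count of 3,780 decisions is consistent with a full factorial crossing of the three factors named in the abstract (14 models, 27 scenarios, 10 moral framings). The sketch below enumerates such a grid; the labels are placeholders, and one decision per cell is an assumption rather than something the abstract states outright.

```python
from itertools import product

# Factor sizes taken from the abstract; the labels themselves are placeholders.
N_MODELS, N_SCENARIOS, N_FRAMES = 14, 27, 10

models = [f"model_{i}" for i in range(N_MODELS)]
scenarios = [f"scenario_{i}" for i in range(N_SCENARIOS)]
frames = [f"frame_{i}" for i in range(N_FRAMES)]

# Full factorial design: every model answers every scenario under every framing.
grid = list(product(models, scenarios, frames))
n_prompts = len(grid)  # 14 * 27 * 10
```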

Title: ARCE: Augmented RoBERTa with Contextualized Elucidations for NER in Automated Rule Checking

Authors: Jian Chen, Jinbao Tian, Yankui Li, Zhou Li
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2508.07286
Pdf URL: https://arxiv.org/pdf/2508.07286
Copy Paste: [[2508.07286]] ARCE: Augmented RoBERTa with Contextualized Elucidations for NER in Automated Rule Checking(https://arxiv.org/abs/2508.07286)
Keywords: language model, llm
Abstract: Accurate information extraction from specialized texts is a critical challenge, particularly for named entity recognition (NER) in the architecture, engineering, and construction (AEC) domain to support automated rule checking (ARC). The performance of standard pre-trained models is often constrained by the domain gap, as they struggle to interpret the specialized terminology and complex relational contexts inherent in AEC texts. Although this issue can be mitigated by further pre-training on large, human-curated domain corpora, as exemplified by methods like ARCBERT, this approach is both labor-intensive and cost-prohibitive. Consequently, leveraging large language models (LLMs) for automated knowledge generation has emerged as a promising alternative. However, the optimal strategy for generating knowledge that can genuinely enhance smaller, efficient models remains an open question. To address this, we propose ARCE (augmented RoBERTa with contextualized elucidations), a novel approach that systematically explores and optimizes this generation process. ARCE employs an LLM to first generate a corpus of simple, direct explanations, which we term Cote, and then uses this corpus to incrementally pre-train a RoBERTa model prior to its fine-tuning on the downstream task. Our extensive experiments show that ARCE establishes a new state-of-the-art on a benchmark AEC dataset, achieving a Macro-F1 score of 77.20%. This result also reveals a key finding: simple, explanation-based knowledge proves surprisingly more effective than complex, role-based rationales for this task. The code is publicly available at: this https URL.

Title: CCFQA: A Benchmark for Cross-Lingual and Cross-Modal Speech and Text Factuality Evaluation

Authors: Yexing Du, Kaiyuan Liu, Youcheng Pan, Zheng Chu, Bo Yang, Xiaocheng Feng, Yang Xiang, Ming Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.07295
Pdf URL: https://arxiv.org/pdf/2508.07295
Copy Paste: [[2508.07295]] CCFQA: A Benchmark for Cross-Lingual and Cross-Modal Speech and Text Factuality Evaluation(https://arxiv.org/abs/2508.07295)
Keywords: language model, gpt, llm, hallucination
Abstract: As Large Language Models (LLMs) are increasingly popularized in the multilingual world, ensuring hallucination-free factuality becomes markedly crucial. However, existing benchmarks for evaluating the reliability of Multimodal Large Language Models (MLLMs) predominantly focus on textual or visual modalities with a primary emphasis on English, which creates a gap in evaluation when processing multilingual input, especially in speech. To bridge this gap, we propose a novel Cross-lingual and Cross-modal Factuality benchmark (CCFQA). Specifically, the CCFQA benchmark contains parallel speech-text factual questions across 8 languages, designed to systematically evaluate MLLMs' cross-lingual and cross-modal factuality capabilities. Our experimental results demonstrate that current MLLMs still face substantial challenges on the CCFQA benchmark. Furthermore, we propose a few-shot transfer learning strategy that effectively transfers the Question Answering (QA) capabilities of LLMs in English to multilingual Spoken Question Answering (SQA) tasks, achieving competitive performance with GPT-4o-mini-Audio using just 5-shot training. We release CCFQA as a foundational research resource to promote the development of MLLMs with more robust and reliable speech understanding capabilities. Our code and dataset are available at this https URL.

Title: HealthBranches: Synthesizing Clinically-Grounded Question Answering Datasets via Decision Pathways

Authors: Cristian Cosentino, Annamaria Defilippo, Marco Dossena, Christopher Irwin, Sara Joubbi, Pietro Liò
Subjects: cs.CL, cs.AI, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2508.07308
Pdf URL: https://arxiv.org/pdf/2508.07308
Copy Paste: [[2508.07308]] HealthBranches: Synthesizing Clinically-Grounded Question Answering Datasets via Decision Pathways(https://arxiv.org/abs/2508.07308)
Keywords: language model, llm, retrieval-augmented generation
Abstract: HealthBranches is a novel benchmark dataset for medical Question-Answering (Q&A), specifically designed to evaluate complex reasoning in Large Language Models (LLMs). This dataset is generated through a semi-automated pipeline that transforms explicit decision pathways from medical sources into realistic patient cases with associated questions and answers. Covering 4,063 case studies across 17 healthcare topics, each data point is based on clinically validated reasoning chains. HealthBranches supports both open-ended and multiple-choice question formats and uniquely includes the full reasoning path for each Q&A. Its structured design enables robust evaluation of LLMs' multi-step inference capabilities, including their performance in structured Retrieval-Augmented Generation (RAG) contexts. HealthBranches establishes a foundation for the development of more trustworthy, interpretable, and clinically reliable LLMs in high-stakes domains while also serving as a valuable resource for educational purposes.

Title: ObfusQAte: A Proposed Framework to Evaluate LLM Robustness on Obfuscated Factual Question Answering

Authors: Shubhra Ghosh, Abhilekh Borah, Aditya Kumar Guru, Kripabandhu Ghosh
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2508.07321
Pdf URL: https://arxiv.org/pdf/2508.07321
Copy Paste: [[2508.07321]] ObfusQAte: A Proposed Framework to Evaluate LLM Robustness on Obfuscated Factual Question Answering(https://arxiv.org/abs/2508.07321)
Keywords: language model, llm
Abstract: The rapid proliferation of Large Language Models (LLMs) has significantly contributed to the development of equitable AI systems capable of factual question-answering (QA). However, no known study tests the LLMs' robustness when presented with obfuscated versions of questions. To systematically evaluate these limitations, we propose a novel technique, ObfusQAte, and, leveraging it, introduce ObfusQA, a comprehensive, first-of-its-kind framework with multi-tiered obfuscation levels designed to examine LLM capabilities across three distinct dimensions: (i) Named-Entity Indirection, (ii) Distractor Indirection, and (iii) Contextual Overload. By capturing these fine-grained distinctions in language, ObfusQA provides a comprehensive benchmark for evaluating LLM robustness and adaptability. Our study observes that LLMs exhibit a tendency to fail or generate hallucinated responses when confronted with these increasingly nuanced variations. To foster research in this direction, we make ObfusQAte publicly available.

Title: Strategies of Code-switching in Human-Machine Dialogs

Authors: Dean Geckt, Melinda Fricke, Shuly Wintner
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.07325
Pdf URL: https://arxiv.org/pdf/2508.07325
Copy Paste: [[2508.07325]] Strategies of Code-switching in Human-Machine Dialogs(https://arxiv.org/abs/2508.07325)
Keywords: prompt, chat
Abstract: Most people are multilingual, and most multilinguals code-switch, yet the characteristics of code-switched language are not fully understood. We developed a chatbot capable of completing a Map Task with human participants using code-switched Spanish and English. In two experiments, we prompted the bot to code-switch according to different strategies, examining (1) the feasibility of such experiments for investigating bilingual language use, and (2) whether participants would be sensitive to variations in discourse and grammatical patterns. Participants generally enjoyed code-switching with our bot as long as it produced predictable code-switching behavior; when code-switching was random or ungrammatical (as when producing unattested incongruent mixed-language noun phrases, such as 'la fork'), participants enjoyed the task less and were less successful at completing it. These results underscore the potential downsides of deploying insufficiently developed multilingual language technology, while also illustrating the promise of such technology for conducting research on bilingual language use.

Title: Think Before You Talk: Enhancing Meaningful Dialogue Generation in Full-Duplex Speech Language Models with Planning-Inspired Text Guidance

Authors: Wenqian Cui, Lei Zhu, Xiaohui Li, Zhihan Guo, Haoli Bai, Lu Hou, Irwin King
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2508.07375
Pdf URL: https://arxiv.org/pdf/2508.07375
Copy Paste: [[2508.07375]] Think Before You Talk: Enhancing Meaningful Dialogue Generation in Full-Duplex Speech Language Models with Planning-Inspired Text Guidance(https://arxiv.org/abs/2508.07375)
Keywords: language model
Abstract: Full-Duplex Speech Language Models (FD-SLMs) are specialized foundation models designed to enable natural, real-time spoken interactions by modeling complex conversational dynamics such as interruptions, backchannels, and overlapping speech. End-to-end (e2e) FD-SLMs leverage real-world double-channel conversational data to capture nuanced two-speaker dialogue patterns for human-like interactions. However, they face a critical challenge: their conversational abilities often degrade compared to pure-text conversation due to prolonged speech sequences and limited high-quality spoken dialogue data. While text-guided speech generation could mitigate these issues, it suffers from timing and length issues when integrating textual guidance into double-channel audio streams, disrupting the precise time alignment essential for natural interactions. To address these challenges, we propose TurnGuide, a novel planning-inspired approach that mimics human conversational planning by dynamically segmenting assistant speech into dialogue turns and generating turn-level text guidance before speech output, which effectively resolves both insertion timing and length challenges. Extensive experiments demonstrate that our approach significantly improves e2e FD-SLMs' conversational abilities, enabling them to generate semantically meaningful and coherent speech while maintaining natural conversational flow. Demos are available at this https URL. Code will be available at this https URL.

Title: Grounding Multilingual Multimodal LLMs With Cultural Knowledge

Authors: Jean de Dieu Nyandwi, Yueqi Song, Simran Khanuja, Graham Neubig
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2508.07414
Pdf URL: https://arxiv.org/pdf/2508.07414
Copy Paste: [[2508.07414]] Grounding Multilingual Multimodal LLMs With Cultural Knowledge(https://arxiv.org/abs/2508.07414)
Keywords: language model, llm
Abstract: Multimodal Large Language Models excel in high-resource settings, but often misinterpret long-tail cultural entities and underperform in low-resource languages. To address this gap, we propose a data-centric approach that directly grounds MLLMs in cultural knowledge. Leveraging a large scale knowledge graph from Wikidata, we collect images that represent culturally significant entities, and generate synthetic multilingual visual question answering data. The resulting dataset, CulturalGround, comprises 22 million high-quality, culturally-rich VQA pairs spanning 42 countries and 39 languages. We train an open-source MLLM CulturalPangea on CulturalGround, interleaving standard multilingual instruction-tuning data to preserve general abilities. CulturalPangea achieves state-of-the-art performance among open models on various culture-focused multilingual multimodal benchmarks, outperforming prior models by an average of 5.0 without degrading results on mainstream vision-language tasks. Our findings show that our targeted, culturally grounded approach could substantially narrow the cultural gap in MLLMs and offer a practical path towards globally inclusive multimodal systems.

Title: Let's Revise Step-by-Step: A Unified Local Search Framework for Code Generation with LLMs

Authors: Zhiyi Lyu, Jianguo Huang, Yanchen Deng, Steven Hoi, Bo An
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.07434
Pdf URL: https://arxiv.org/pdf/2508.07434
Copy Paste: [[2508.07434]] Let's Revise Step-by-Step: A Unified Local Search Framework for Code Generation with LLMs(https://arxiv.org/abs/2508.07434)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) with inference-time scaling techniques show promise for code generation, yet face notable efficiency and scalability challenges. Construction-based tree-search methods suffer from rapid growth in tree size, high token consumption, and the lack of an anytime property. In contrast, improvement-based methods offer better performance but often struggle with uninformative reward signals and inefficient search strategies. In this work, we propose ReLoc, a unified local search framework which effectively performs step-by-step code revision. Specifically, ReLoc explores a series of local revisions through four key algorithmic components: initial code drafting, neighborhood code generation, candidate evaluation, and incumbent code updating, each of which can be instantiated with specific decision rules to realize different local search algorithms such as Hill Climbing (HC) or Genetic Algorithm (GA). Furthermore, we develop a specialized revision reward model that evaluates code quality based on revision distance to produce fine-grained preferences that guide the local search toward more promising candidates. Finally, our extensive experimental results demonstrate that our approach achieves superior performance across diverse code generation tasks, significantly outperforming both construction-based tree search and the state-of-the-art improvement-based code generation methods.
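The four components named in the ReLoc abstract (drafting, neighborhood generation, candidate evaluation, incumbent updating) map onto a generic hill-climbing loop. The sketch below is only an illustration under stated assumptions: `draft`, `neighbors`, and `score` are hypothetical callables standing in for the LLM drafting step, the LLM revision step, and the revision reward model, none of which are specified in the abstract.

```python
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")

def hill_climb(
    draft: Callable[[], T],            # hypothetical: produces the initial code draft
    neighbors: Callable[[T], List[T]], # hypothetical: proposes local revisions
    score: Callable[[T], float],       # hypothetical: stands in for a reward model
    steps: int = 20,
) -> Tuple[T, float]:
    """Generic hill climbing: keep the best-scoring candidate seen so far."""
    incumbent = draft()
    best = score(incumbent)
    for _ in range(steps):
        candidates = neighbors(incumbent)
        if not candidates:
            break
        top = max(candidates, key=score)
        top_score = score(top)
        # Incumbent update rule: accept only strict improvements.
        if top_score > best:
            incumbent, best = top, top_score
    return incumbent, best

# Toy stand-in: "revise" an integer toward a target value of 7.
result = hill_climb(lambda: 0, lambda x: [x - 1, x + 1], lambda x: -abs(x - 7))
```

Because the incumbent is always a complete candidate, the loop can be stopped after any step and still return a usable result, which is the anytime behaviour the abstract contrasts with tree search. Swapping the update rule would give a different local search variant.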

Title: Positional Biases Shift as Inputs Approach Context Window Limits

Authors: Blerta Veseli, Julian Chibane, Mariya Toneva, Alexander Koller
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.07479
Pdf URL: https://arxiv.org/pdf/2508.07479
Copy Paste: [[2508.07479]] Positional Biases Shift as Inputs Approach Context Window Limits(https://arxiv.org/abs/2508.07479)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) often struggle to use information across long inputs effectively. Prior work has identified positional biases, such as the Lost in the Middle (LiM) effect, where models perform better when information appears at the beginning (primacy bias) or end (recency bias) of the input, rather than in the middle. However, long-context studies have not consistently replicated these effects, raising questions about their intensity and the conditions under which they manifest. To address this, we conducted a comprehensive analysis using relative rather than absolute input lengths, defined with respect to each model's context window. Our findings reveal that the LiM effect is strongest when inputs occupy up to 50% of a model's context window. Beyond that, the primacy bias weakens, while recency bias remains relatively stable. This effectively eliminates the LiM effect; instead, we observe a distance-based bias, where model performance is better when relevant information is closer to the end of the input. Furthermore, our results suggest that successful retrieval is a prerequisite for reasoning in LLMs, and that the observed positional biases in reasoning are largely inherited from retrieval. These insights have implications for long-context tasks, the design of future LLM benchmarks, and evaluation methodologies for LLMs handling extended inputs.

Title: ALOPE: Adaptive Layer Optimization for Translation Quality Estimation using Large Language Models

Authors: Archchana Sindhujan, Shenbin Qian, Chan Chi Chun Matthew, Constantin Orasan, Diptesh Kanojia
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.07484
Pdf URL: https://arxiv.org/pdf/2508.07484
Copy Paste: [[2508.07484]] ALOPE: Adaptive Layer Optimization for Translation Quality Estimation using Large Language Models(https://arxiv.org/abs/2508.07484)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have shown remarkable performance across a wide range of natural language processing tasks. Quality Estimation (QE) for Machine Translation (MT), which assesses the quality of a source-target pair without relying on reference translations, remains a challenging cross-lingual task for LLMs. The challenges stem from the inherent limitations of existing LLM-based QE systems, which are pre-trained for causal language modelling rather than regression-specific tasks, further elevated by the presence of low-resource languages given pre-training data distribution. This paper introduces ALOPE, an adaptive layer-optimization framework designed to enhance LLM-based QE by restructuring Transformer representations through layer-wise adaptation for improved regression-based prediction. Our framework integrates low-rank adapters (LoRA) with regression task heads, leveraging selected pre-trained Transformer layers for improved cross-lingual alignment. In addition to the layer-specific adaptation, ALOPE introduces two strategies: dynamic weighting, which adaptively combines representations from multiple layers, and multi-head regression, which aggregates regression losses from multiple heads for QE. Our framework shows improvements over various existing LLM-based QE approaches. Empirical evidence suggests that intermediate Transformer layers in LLMs provide contextual representations that are more aligned with the cross-lingual nature of the QE task. We make the resultant models and framework code publicly available for further research, also allowing existing LLM-based MT frameworks to be scaled with QE capabilities.
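The ALOPE abstract does not say how dynamic weighting combines layer representations. One common way to do it, shown here purely as an assumption, is a softmax over one scalar logit per layer followed by a weighted sum. The names `combine_layers` and `layer_logits` are hypothetical and are not taken from the ALOPE code.

```python
import math
from typing import List

def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scalars."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def combine_layers(layer_reprs: List[List[float]], layer_logits: List[float]) -> List[float]:
    """Weighted sum of per-layer vectors; weights come from a softmax over
    one scalar per layer (assumed to be learnable in a real model)."""
    weights = softmax(layer_logits)
    dim = len(layer_reprs[0])
    return [
        sum(w * rep[i] for w, rep in zip(weights, layer_reprs))
        for i in range(dim)
    ]

# With equal logits the combination reduces to a plain average of the layers.
pooled = combine_layers([[1.0, 2.0], [3.0, 4.0]], [0.0, 0.0])
```

In a trained model the logits would be parameters updated alongside the regression heads. Here they are fixed inputs to keep the sketch self-contained.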

Title: Augmenting Bias Detection in LLMs Using Topological Data Analysis

Authors: Keshav Varadarajan, Tananun Songdechakraiwut
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.07516
Pdf URL: https://arxiv.org/pdf/2508.07516
Copy Paste: [[2508.07516]] Augmenting Bias Detection in LLMs Using Topological Data Analysis(https://arxiv.org/abs/2508.07516)
Keywords: language model, gpt, llm
Abstract: Recently, many bias detection methods have been proposed to determine the level of bias a large language model captures. However, tests to identify which parts of a large language model are responsible for bias towards specific groups remain underdeveloped. In this study, we present a method using topological data analysis to identify which heads in GPT-2 contribute to the misrepresentation of identity groups present in the StereoSet dataset. We find that biases for particular categories, such as gender or profession, are concentrated in attention heads that act as hot spots. The metric we propose can also be used to determine which heads capture bias for a specific group within a bias category, and future work could extend this method to help de-bias large language models.

Title: Word Clouds as Common Voices: LLM-Assisted Visualization of Participant-Weighted Themes in Qualitative Interviews

Authors: Joseph T. Colonel, Baihan Lin
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2508.07517
Pdf URL: https://arxiv.org/pdf/2508.07517
Copy Paste: [[2508.07517]] Word Clouds as Common Voices: LLM-Assisted Visualization of Participant-Weighted Themes in Qualitative Interviews(https://arxiv.org/abs/2508.07517)
Keywords: language model, llm, prompt
Abstract: Word clouds are a common way to summarize qualitative interviews, yet traditional frequency-based methods often fail in conversational contexts: they surface filler words, ignore paraphrase, and fragment semantically related ideas. This limits their usefulness in early-stage analysis, when researchers need fast, interpretable overviews of what participants actually said. We introduce ThemeClouds, an open-source visualization tool that uses large language models (LLMs) to generate thematic, participant-weighted word clouds from dialogue transcripts. The system prompts an LLM to identify concept-level themes across a corpus and then counts how many unique participants mention each topic, yielding a visualization grounded in breadth of mention rather than raw term frequency. Researchers can customize prompts and visualization parameters, providing transparency and control. Using interviews from a user study comparing five recording-device configurations (31 participants; 155 transcripts, Whisper ASR), our approach surfaces more actionable device concerns than frequency clouds and topic-modeling baselines (e.g., LDA, BERTopic). We discuss design trade-offs for integrating LLM assistance into qualitative workflows, implications for interpretability and researcher agency, and opportunities for interactive analyses such as per-condition contrasts ("diff clouds").
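The participant-weighted count described in the ThemeClouds abstract can be illustrated with a few lines of Python. This is a minimal sketch, not the tool's implementation: it assumes the LLM theme-identification step has already produced `(participant_id, theme)` pairs, and both that input shape and the function name `participant_weights` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

def participant_weights(mentions: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Count how many unique participants mention each theme.

    Repeated mentions by the same participant count once, so the weight
    reflects breadth of mention rather than raw term frequency.
    """
    seen: Dict[str, Set[str]] = defaultdict(set)
    for participant_id, theme in mentions:
        seen[theme].add(participant_id)
    return {theme: len(pids) for theme, pids in seen.items()}

# Hypothetical mentions: p1 raises "battery" twice but is counted once.
weights = participant_weights([
    ("p1", "battery"),
    ("p1", "battery"),
    ("p2", "battery"),
    ("p2", "comfort"),
])
```

The resulting weights could then be passed to any word-cloud renderer in place of term frequencies.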

Title: From Trial-and-Error to Improvement: A Systematic Analysis of LLM Exploration Mechanisms in RLVR

Authors: Jia Deng, Jie Chen, Zhipeng Chen, Daixuan Cheng, Fei Bai, Beichen Zhang, Yinqian Min, Yanzipeng Gao, Wayne Xin Zhao, Ji-Rong Wen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.07534
Pdf URL: https://arxiv.org/pdf/2508.07534
Copy Paste: [[2508.07534]] From Trial-and-Error to Improvement: A Systematic Analysis of LLM Exploration Mechanisms in RLVR(https://arxiv.org/abs/2508.07534)
Keywords: language model, llm
Abstract: Reinforcement learning with verifiable rewards (RLVR) has emerged as a powerful paradigm for enhancing the reasoning capabilities of large language models (LLMs). Unlike traditional RL approaches, RLVR leverages rule-based feedback to guide LLMs in generating and refining complex reasoning chains -- a process critically dependent on effective exploration strategies. While prior work has demonstrated RLVR's empirical success, the fundamental mechanisms governing LLMs' exploration behaviors remain underexplored. This technical report presents a systematic investigation of exploration capacities in RLVR, covering four main aspects: (1) exploration space shaping, where we develop quantitative metrics to characterize LLMs' capability boundaries; (2) entropy-performance exchange, analyzed across training stages, individual instances, and token-level patterns; and (3) RL performance optimization, examining methods to effectively translate exploration gains into measurable improvements. By unifying previously identified insights with new empirical evidence, this work aims to provide a foundational framework for advancing RLVR systems.

Title: IBPS: Indian Bail Prediction System

Authors: Puspesh Kumar Srivastava, Uddeshya Raj, Praveen Patel, Shubham Kumar Nigam, Noel Shallum, Arnab Bhattacharya
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.07592
Pdf URL: https://arxiv.org/pdf/2508.07592
Copy Paste: [[2508.07592]] IBPS: Indian Bail Prediction System(https://arxiv.org/abs/2508.07592)
Keywords: language model
Abstract: Bail decisions are among the most frequently adjudicated matters in Indian courts, yet they remain plagued by subjectivity, delays, and inconsistencies. With over 75% of India's prison population comprising undertrial prisoners, many from socioeconomically disadvantaged backgrounds, the lack of timely and fair bail adjudication exacerbates human rights concerns and contributes to systemic judicial backlog. In this paper, we present the Indian Bail Prediction System (IBPS), an AI-powered framework designed to assist in bail decision-making by predicting outcomes and generating legally sound rationales based solely on factual case attributes and statutory provisions. We curate and release a large-scale dataset of 150,430 High Court bail judgments, enriched with structured annotations such as age, health, criminal history, crime category, custody duration, statutes, and judicial reasoning. We fine-tune a large language model using parameter-efficient techniques and evaluate its performance across multiple configurations, with and without statutory context, and with RAG. Our results demonstrate that models fine-tuned with statutory knowledge significantly outperform baselines, achieving strong accuracy and explanation quality, and generalize well to a test set independently annotated by legal experts. IBPS offers a transparent, scalable, and reproducible solution to support data-driven legal assistance, reduce bail delays, and promote procedural fairness in the Indian judicial system.

Title: Keyword-Centric Prompting for One-Shot Event Detection with Self-Generated Rationale Enhancements

Authors: Ziheng Li, Zhi-Hong Deng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.07598
Pdf URL: https://arxiv.org/pdf/2508.07598
Copy Paste: [[2508.07598]] Keyword-Centric Prompting for One-Shot Event Detection with Self-Generated Rationale Enhancements(https://arxiv.org/abs/2508.07598)
Keywords: llm, prompt, chain-of-thought
Abstract: Although the LLM-based in-context learning (ICL) paradigm has demonstrated considerable success across various natural language processing tasks, it encounters challenges in event detection. This is because LLMs lack an accurate understanding of event triggers and tend to over-interpret, which cannot be effectively corrected through in-context examples alone. In this paper, we focus on the most challenging one-shot setting and propose KeyCP++, a keyword-centric chain-of-thought prompting approach. KeyCP++ addresses the weaknesses of conventional ICL by automatically annotating the logical gaps between input text and detection results for the demonstrations. Specifically, to generate in-depth and meaningful rationales, KeyCP++ constructs a trigger discrimination prompting template. It incorporates the exemplary triggers (a.k.a. keywords) into the prompt as the anchor to simply trigger profiling, let LLM propose candidate triggers, and justify each candidate. These propose-and-judge rationales help LLMs mitigate over-reliance on the keywords and promote detection rule learning. Extensive experiments demonstrate the effectiveness of our approach, showcasing significant advancements in one-shot event detection.

Title: InterChart: Benchmarking Visual Reasoning Across Decomposed and Distributed Chart Information

Authors: Anirudh Iyengar Kaniyar Narayana Iyengar, Srija Mukhopadhyay, Adnan Qidwai, Shubhankar Singh, Dan Roth, Vivek Gupta
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2508.07630
Pdf URL: https://arxiv.org/pdf/2508.07630
Copy Paste: [[2508.07630]] InterChart: Benchmarking Visual Reasoning Across Decomposed and Distributed Chart Information(https://arxiv.org/abs/2508.07630)
Keywords: language model
Abstract: We introduce InterChart, a diagnostic benchmark that evaluates how well vision-language models (VLMs) reason across multiple related charts, a task central to real-world applications such as scientific reporting, financial analysis, and public policy dashboards. Unlike prior benchmarks focusing on isolated, visually uniform charts, InterChart challenges models with diverse question types ranging from entity inference and trend correlation to numerical estimation and abstract multi-step reasoning grounded in 2-3 thematically or structurally related charts. We organize the benchmark into three tiers of increasing difficulty: (1) factual reasoning over individual charts, (2) integrative analysis across synthetically aligned chart sets, and (3) semantic inference over visually complex, real-world chart pairs. Our evaluation of state-of-the-art open and closed-source VLMs reveals consistent and steep accuracy declines as chart complexity increases. We find that models perform better when we decompose multi-entity charts into simpler visual units, underscoring their struggles with cross-chart integration. By exposing these systematic limitations, InterChart provides a rigorous framework for advancing multimodal reasoning in complex, multi-visual environments.

Title: LoSemB: Logic-Guided Semantic Bridging for Inductive Tool Retrieval

Authors: Luyao Zhuang, Qinggang Zhang, Huachi Zhou, Juhua Liu, Qing Li, Xiao Huang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.07690
Pdf URL: https://arxiv.org/pdf/2508.07690
Copy Paste: [[2508.07690]] LoSemB: Logic-Guided Semantic Bridging for Inductive Tool Retrieval(https://arxiv.org/abs/2508.07690)
Keywords: language model, llm
Abstract: Tool learning has emerged as a promising paradigm for large language models (LLMs) to solve many real-world tasks. Nonetheless, with the tool repository rapidly expanding, it is impractical to contain all tools within the limited input length of LLMs. To alleviate these issues, researchers have explored incorporating a tool retrieval module to select the most relevant tools or represent tools as unique tokens within LLM parameters. However, most state-of-the-art methods are under transductive settings, assuming all tools have been observed during training. Such a setting deviates from reality as the real-world tool repository is evolving and incorporates new tools frequently. When dealing with these unseen tools, which refer to tools not encountered during the training phase, these methods are limited by two key issues, including the large distribution shift and the vulnerability of similarity-based retrieval. To this end, inspired by human cognitive processes of mastering unseen tools through discovering and applying the logical information from prior experience, we introduce a novel Logic-Guided Semantic Bridging framework for inductive tool retrieval, namely, LoSemB, which aims to mine and transfer latent logical information for inductive tool retrieval without costly retraining. Specifically, LoSemB contains a logic-based embedding alignment module to mitigate distribution shifts and implements a relational augmented retrieval mechanism to reduce the vulnerability of similarity-based retrieval. Extensive experiments demonstrate that LoSemB achieves advanced performance in inductive settings while maintaining desirable effectiveness in the transductive setting.

Title: What am I missing here?: Evaluating Large Language Models for Masked Sentence Prediction

Authors: Charlie Wyatt, Aditya Joshi, Flora Salim
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.07702
Pdf URL: https://arxiv.org/pdf/2508.07702
Copy Paste: [[2508.07702]] What am I missing here?: Evaluating Large Language Models for Masked Sentence Prediction(https://arxiv.org/abs/2508.07702)
Keywords: language model, gpt, llm
Abstract: Transformer-based models primarily rely on Next Token Prediction (NTP), which predicts the next token in a sequence based on the preceding context. However, NTP's focus on single-token prediction often limits a model's ability to plan ahead or maintain long-range coherence, raising questions about how well LLMs can predict longer contexts, such as full sentences within structured documents. While NTP encourages local fluency, it provides no explicit incentive to ensure global coherence across sentence boundaries, an essential skill for reconstructive or discursive tasks. To investigate this, we evaluate three commercial LLMs (GPT-4o, Claude 3.5 Sonnet, and Gemini 2.0 Flash) on Masked Sentence Prediction (MSP), the task of infilling a randomly removed sentence, using three domains: ROCStories (narrative), Recipe1M (procedural), and Wikipedia (expository). We assess both fidelity (similarity to the original sentence) and cohesiveness (fit within the surrounding context). Our key finding reveals that commercial LLMs, despite their superlative performance in other tasks, are poor at predicting masked sentences in low-structured domains, highlighting a gap in current model capabilities.

Title: Exploring Causal Effect of Social Bias on Faithfulness Hallucinations in Large Language Models

Authors: Zhenliang Zhang, Junzhe Zhang, Xinyu Hu, HuiXuan Zhang, Xiaojun Wan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.07753
Pdf URL: https://arxiv.org/pdf/2508.07753
Copy Paste: [[2508.07753]] Exploring Causal Effect of Social Bias on Faithfulness Hallucinations in Large Language Models(https://arxiv.org/abs/2508.07753)
Keywords: language model, llm, hallucination
Abstract: Large language models (LLMs) have achieved remarkable success in various tasks, yet they remain vulnerable to faithfulness hallucinations, where the output does not align with the input. In this study, we investigate whether social bias contributes to these hallucinations, a causal relationship that has not been explored. A key challenge is controlling confounders within the context, which complicates the isolation of causality between bias states and hallucinations. To address this, we utilize the Structural Causal Model (SCM) to establish and validate the causality and design bias interventions to control confounders. In addition, we develop the Bias Intervention Dataset (BID), which includes various social biases, enabling precise measurement of causal effects. Experiments on mainstream LLMs reveal that biases are significant causes of faithfulness hallucinations, and the effect of each bias state differs in direction. We further analyze the scope of these causal effects across various models, specifically focusing on unfairness hallucinations, which are primarily targeted by social bias, revealing the subtle yet significant causal effect of bias on hallucination generation.

Title: SASST: Leveraging Syntax-Aware Chunking and LLMs for Simultaneous Speech Translation

Authors: Zeyu Yang, Lai Wei, Roman Koshkin, Xi Chen, Satoshi Nakamura
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.07781
Pdf URL: https://arxiv.org/pdf/2508.07781
Copy Paste: [[2508.07781]] SASST: Leveraging Syntax-Aware Chunking and LLMs for Simultaneous Speech Translation(https://arxiv.org/abs/2508.07781)
Keywords: llm
Abstract: This work proposes a grammar-based chunking strategy that segments input streams into semantically complete units by parsing dependency relations (e.g., noun phrase boundaries, verb-object structures) and punctuation features. The method ensures chunk coherence and minimizes semantic fragmentation. Building on this mechanism, we present SASST (Syntax-Aware Simultaneous Speech Translation), an end-to-end framework integrating a frozen Whisper encoder and a decoder-only LLM. The unified architecture dynamically outputs translation tokens or symbols to jointly optimize translation timing and content, with target-side reordering addressing word-order divergence. Experiments on the CoVoST2 multilingual corpus En-{De, Zh, Ja} demonstrate significant translation quality improvements across languages and validate the effectiveness of syntactic structures in LLM-driven SimulST systems.

Title: Grove MoE: Towards Efficient and Superior MoE LLMs with Adjugate Experts

Authors: Haoyuan Wu, Haoxing Chen, Xiaodong Chen, Zhanchao Zhou, Tieyuan Chen, Yihong Zhuang, Guoshan Lu, Zenan Huang, Junbo Zhao, Lin Liu, Zhenzhong Lan, Bei Yu, Jianguo Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.07785
Pdf URL: https://arxiv.org/pdf/2508.07785
Copy Paste: [[2508.07785]] Grove MoE: Towards Efficient and Superior MoE LLMs with Adjugate Experts(https://arxiv.org/abs/2508.07785)
Keywords: language model, llm
Abstract: The Mixture of Experts (MoE) architecture is a cornerstone of modern state-of-the-art (SOTA) large language models (LLMs). MoE models facilitate scalability by enabling sparse parameter activation. However, traditional MoE architecture uses homogeneous experts of a uniform size, activating a fixed number of parameters irrespective of input complexity and thus limiting computational efficiency. To overcome this limitation, we introduce Grove MoE, a novel architecture incorporating experts of varying sizes, inspired by the heterogeneous this http URL CPU architecture. This architecture features novel adjugate experts with a dynamic activation mechanism, enabling model capacity expansion while maintaining manageable computational overhead. Building on this architecture, we present GroveMoE-Base and GroveMoE-Inst, 33B-parameter LLMs developed by applying an upcycling strategy to the Qwen3-30B-A3B-Base model during mid-training and post-training. GroveMoE models dynamically activate 3.14-3.28B parameters based on token complexity and achieve performance comparable to SOTA open-source models of similar or even larger size.
摘要：专家（MOE）体系结构的混合是现代先进（SOTA）大语言模型（LLMS）的基石。 MOE模型通过启用稀疏参数激活来促进可伸缩性。但是，传统的MOE体系结构使用均匀尺寸的均匀专家，激活固定数量的参数，而与输入复杂性无关，从而限制了计算效率。为了克服这一限制，我们介绍了Grove Moe，这是一种新颖的建筑，结合了各种大小的专家，灵感来自此HTTP URL CPU架构的异质性。该体系结构以动态激活机制为特色，具有动态激活机制的新颖专家，在维护可管理的计算开销的同时，可以扩展模型。在此架构的基础上，我们介绍了Grovemoe-Base和Grovemoe-Inst，在中期训练和培训期间对QWEN3-30B-A3B基准模型应用升级策略，开发了33B参数LLMS。 Grovemoe模型基于令牌复杂性动态激活3.14-3.28b参数，并实现与SOTA开源模型相似甚至更大尺寸的性能。
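
The abstract's point about uniform experts can be illustrated with a small sketch. This is not the Grove MoE mechanism (the adjugate experts and their dynamic activation are not specified here); it is a generic top-k router, with hypothetical expert sizes and router logits, showing that equal-sized experts activate the same number of parameters for every token, while experts of varying sizes let the activated count depend on the token.

```python
import math

def top_k_route(logits, k):
    """Pick the k experts with the largest router logits.

    Returns the chosen expert indices and their softmax weights,
    renormalized over the chosen experts only.
    """
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Subtract the largest logit for numerical stability.
    exps = [math.exp(logits[i] - logits[top[0]]) for i in top]
    total = sum(exps)
    return top, [e / total for e in exps]

def activated_params(expert_sizes, chosen):
    """Number of expert parameters touched by one token."""
    return sum(expert_sizes[i] for i in chosen)

# Hypothetical parameter counts per expert (illustrative units).
uniform_sizes = [100, 100, 100, 100]
mixed_sizes = [50, 100, 150, 100]

# Hypothetical router logits for two tokens.
token_a = [2.0, 0.1, -1.0, 0.5]
token_b = [-0.3, 0.2, 1.5, 1.0]

chosen_a, weights_a = top_k_route(token_a, k=2)
chosen_b, weights_b = top_k_route(token_b, k=2)

for name, sizes in [("uniform", uniform_sizes), ("mixed", mixed_sizes)]:
    print(name, activated_params(sizes, chosen_a), activated_params(sizes, chosen_b))
```

With uniform sizes both tokens activate the same number of parameters; with mixed sizes the count differs by token, which is the kind of input-dependent budget the abstract describes.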

Title: Can You Trick the Grader? Adversarial Persuasion of LLM Judges

Authors: Yerin Hwang, Dongryeol Lee, Taegwan Kang, Yongil Kim, Kyomin Jung
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.07805
Pdf URL: https://arxiv.org/pdf/2508.07805
Copy Paste: [[2508.07805]] Can You Trick the Grader? Adversarial Persuasion of LLM Judges(https://arxiv.org/abs/2508.07805)
Keywords: language model, llm, prompt
Abstract: As large language models take on growing roles as automated evaluators in practical settings, a critical question arises: Can individuals persuade an LLM judge to assign unfairly high scores? This study is the first to reveal that strategically embedded persuasive language can bias LLM judges when scoring mathematical reasoning tasks, where correctness should be independent of stylistic variation. Grounded in Aristotle's rhetorical principles, we formalize seven persuasion techniques (Majority, Consistency, Flattery, Reciprocity, Pity, Authority, Identity) and embed them into otherwise identical responses. Across six math benchmarks, we find that persuasive language leads LLM judges to assign inflated scores to incorrect solutions, by up to 8% on average, with Consistency causing the most severe distortion. Notably, increasing model size does not substantially mitigate this vulnerability. Further analysis demonstrates that combining multiple persuasion techniques amplifies the bias, and pairwise evaluation is likewise susceptible. Moreover, the persuasive effect persists under counter prompting strategies, highlighting a critical vulnerability in LLM-as-a-Judge pipelines and underscoring the need for robust defenses against persuasion-based attacks.
摘要：随着大型语言模型在实际环境中担任自动评估者的角色，因此出现了一个关键的问题：个人可以说服LLM法官分配不公平的分数吗？这项研究是第一个揭示出策略性嵌入的有说服力的语言在评分数学推理任务时会偏向法官，而正确性应独立于文体变化。以亚里士多德的修辞原则为基础，我们正式化了七种说服技术（多数，一致性，奉承，互惠，怜悯，权威，身份），并将它们嵌入其他相同的反应中。在六个数学基准中，我们发现有说服力的语言导致LLM法官平均将夸大的分数分配给不正确的解决方案，最高为8％，一致性导致最严重的失真。值得注意的是，增加模型大小并不能基本减轻这种漏洞。进一步的分析表明，将多种说服技术结合起来会扩大偏见，并且成对评估同样易感。此外，有说服力的效果持续在反击中提示策略，突出了LLM-AS-A-A-Audge管道中的关键脆弱性，并强调了对基于说服的攻击的强大防御能力的需求。

Title: Evaluating Large Language Models as Expert Annotators

Authors: Yu-Min Tseng, Wei-Lin Chen, Chung-Chi Chen, Hsin-Hsi Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.07827
Pdf URL: https://arxiv.org/pdf/2508.07827
Copy Paste: [[2508.07827]] Evaluating Large Language Models as Expert Annotators(https://arxiv.org/abs/2508.07827)
Keywords: language model, llm, chain-of-thought, agent
Abstract: Textual data annotation, the process of labeling or tagging text with relevant information, is typically costly, time-consuming, and labor-intensive. While large language models (LLMs) have demonstrated their potential as direct alternatives to human annotators for general-domain natural language processing (NLP) tasks, their effectiveness on annotation tasks in domains requiring expert knowledge remains underexplored. In this paper, we investigate whether top-performing LLMs, which might be perceived as having expert-level proficiency in academic and professional benchmarks, can serve as direct alternatives to human expert annotators. To this end, we evaluate both individual LLMs and multi-agent approaches across three highly specialized domains: finance, biomedicine, and law. Specifically, we propose a multi-agent discussion framework to simulate a group of human annotators, where LLMs are tasked to engage in discussions by considering others' annotations and justifications before finalizing their labels. Additionally, we incorporate reasoning models (e.g., o3-mini) to enable a more comprehensive comparison. Our empirical results reveal that: (1) Individual LLMs equipped with inference-time techniques (e.g., chain-of-thought (CoT), self-consistency) show only marginal or even negative performance gains, contrary to prior literature suggesting their broad effectiveness. (2) Overall, reasoning models do not demonstrate statistically significant improvements over non-reasoning models in most settings. This suggests that extended long CoT provides relatively limited benefits for data annotation in specialized domains. (3) Certain model behaviors emerge in the multi-agent discussion environment. For instance, Claude 3.7 Sonnet with thinking rarely changes its initial annotations, even when other agents provide correct annotations or valid reasoning.
摘要：文本数据注释是使用相关信息标记或标记文本的过程，通常是昂贵的，耗时的和劳动力密集的。尽管大型语言模型（LLM）表明了它们作为通用领域自然语言处理（NLP）任务的人类注释者的直接替代方案的潜力，但它们对需要专家知识的领域注释任务的有效性仍然没有得到充实。在本文中，我们调查：表现最好的LLM是否可以被视为在学术和专业基准方面具有专家级水平的能力，可以作为人类专家注释者的直接替代方案吗？为此，我们评估了三个高度专业领域的个人LLM和多代理方法：金融，生物医学和法律。具体来说，我们提出了一个多代理讨论框架，以模拟一组人类注释者，在该框架中，LLM的任务是通过在最终确定标签之前考虑他人的注释和理由来进行讨论。此外，我们结合了推理模型（例如O3-Mini），以进行更全面的比较。我们的经验结果表明：（1）配备了推理时间技术（例如，经过思考链（COT），自偏）的单个LLM仅显示出边际甚至负面绩效增长，与先前的文献相反，表明其广泛的效率。（2）总体而言，推理模型在大多数情况下没有证明对非争议模型的统计学显着改善。这表明长长的COT为专用域中的数据注释提供了相对有限的好处。（3）在多代理讨论环境中出现了某些模型行为。例如，即使其他代理提供正确的注释或有效的推理，Claude 3.7十四行诗也很少会改变其初始注释。

Title: LLMs for Law: Evaluating Legal-Specific LLMs on Contract Understanding

Authors: Amrita Singh, H. Suhan Karaca, Aditya Joshi, Hye-young Paik, Jiaojiao Jiang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.07849
Pdf URL: https://arxiv.org/pdf/2508.07849
Copy Paste: [[2508.07849]] LLMs for Law: Evaluating Legal-Specific LLMs on Contract Understanding(https://arxiv.org/abs/2508.07849)
Keywords: llm
Abstract: Despite advances in legal NLP, no comprehensive evaluation covering multiple legal-specific LLMs currently exists for contract classification tasks in contract understanding. To address this gap, we present an evaluation of 10 legal-specific LLMs on three English language contract understanding tasks and compare them with 7 general-purpose LLMs. The results show that legal-specific LLMs consistently outperform general-purpose models, especially on tasks requiring nuanced legal understanding. Legal-BERT and Contracts-BERT establish new SOTAs on two of the three tasks, despite having 69% fewer parameters than the best-performing general-purpose LLM. We also identify CaseLaw-BERT and LexLM as strong additional baselines for contract understanding. Our results provide a holistic evaluation of legal-specific LLMs and will facilitate the development of more accurate contract understanding systems.
摘要：尽管法律NLP取得了进步，但目前尚无涵盖合同理解中合同分类任务的多个特定于法律特定LLM的全面评估。为了解决这一差距，我们在三个英语合同理解任务上对10个特定于法律的LLM进行了评估，并将其与7个通用LLM进行了比较。结果表明，特定于法律的LLM始终超过通用模型，尤其是在需要细微的法律理解的任务上。 Legal-Bert和Contracts-Bert在这三个任务中的两个方面建立了新的SOTA，尽管参数比表现最好的通用LLM少69％。我们还将Caselaw-Bert和Lexlm确定为合同理解的强大基准。我们的结果提供了对法律特定LLM的整体评估，并将促进更准确的合同理解系统的发展。

Title: Large Language Models for Czech Aspect-Based Sentiment Analysis

Authors: Jakub Šmíd, Pavel Přibáň, Pavel Král
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.07860
Pdf URL: https://arxiv.org/pdf/2508.07860
Copy Paste: [[2508.07860]] Large Language Models for Czech Aspect-Based Sentiment Analysis(https://arxiv.org/abs/2508.07860)
Keywords: language model, llm
Abstract: Aspect-based sentiment analysis (ABSA) is a fine-grained sentiment analysis task that aims to identify sentiment toward specific aspects of an entity. While large language models (LLMs) have shown strong performance in various natural language processing (NLP) tasks, their capabilities for Czech ABSA remain largely unexplored. In this work, we conduct a comprehensive evaluation of 19 LLMs of varying sizes and architectures on Czech ABSA, comparing their performance in zero-shot, few-shot, and fine-tuning scenarios. Our results show that small domain-specific models fine-tuned for ABSA outperform general-purpose LLMs in zero-shot and few-shot settings, while fine-tuned LLMs achieve state-of-the-art results. We analyze how factors such as multilingualism, model size, and recency influence performance and present an error analysis highlighting key challenges, particularly in aspect term prediction. Our findings provide insights into the suitability of LLMs for Czech ABSA and offer guidance for future research in this area.
摘要：基于方面的情感分析（ABSA）是一项精细的情感分析任务，旨在确定对实体特定方面的情感。尽管大型语言模型（LLMS）在各种自然语言处理（NLP）任务中表现出很强的性能，但它们的捷克ABSA功能仍未得到探索。在这项工作中，我们对捷克ABSA上不同大小和体系结构的19个LLM进行了全面评估，比较了它们在零射，很少射击和微调场景中的性能。我们的结果表明，针对ABSA的小型域特异性模型在零射击和少量设置中以优于通用通用LLM进行了微调，而微调的LLMS可实现最先进的结果。我们分析了多种语言，模型大小和新近度等因素如何影响性能，并提出一个错误分析，突出了关键挑战，尤其是在方面术语预测中。我们的发现提供了对LLMS对捷克ABSA的适用性的见解，并为该领域的未来研究提供了指导。

Title: Tailored Emotional LLM-Supporter: Enhancing Cultural Sensitivity

Authors: Chen Cecilia Liu, Hiba Arnaout, Nils Kovačić, Dana Atzil-Slonim, Iryna Gurevych
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.07902
Pdf URL: https://arxiv.org/pdf/2508.07902
Copy Paste: [[2508.07902]] Tailored Emotional LLM-Supporter: Enhancing Cultural Sensitivity(https://arxiv.org/abs/2508.07902)
Keywords: language model, llm
Abstract: Large language models (LLMs) show promise in offering emotional support and generating empathetic responses for individuals in distress, but their ability to deliver culturally sensitive support remains underexplored due to lack of resources. In this work, we introduce CultureCare, the first dataset designed for this task, spanning four cultures and including 1729 distress messages, 1523 cultural signals, and 1041 support strategies with fine-grained emotional and cultural annotations. Leveraging CultureCare, we (i) develop and test four adaptation strategies for guiding three state-of-the-art LLMs toward culturally sensitive responses; (ii) conduct comprehensive evaluations using LLM judges, in-culture human annotators, and clinical psychologists; (iii) show that adapted LLMs outperform anonymous online peer responses, and that simple cultural role-play is insufficient for cultural sensitivity; and (iv) explore the application of LLMs in clinical training, where experts highlight their potential in fostering cultural competence in future therapists.
摘要：大型语言模型（LLMS）在提供情感支持并为遇险中的个体产生善解人意的反应方面表现出了希望，但是由于缺乏资源，他们提供文化敏感支持的能力仍未得到充满活力。在这项工作中，我们介绍了CultureCare，这是为此任务设计的第一个数据集，涵盖了四种文化，其中包括1729个遇险信息，1523个文化信号和1041个支持策略，并具有精细的情感和文化注释。利用文化保健，我们（i）制定和测试了四种适应策略，以指导三个最先进的LLMS来实现文化敏感的反应；（ii）使用LLM法官，文化内注释者和临床心理学家进行全面评估；（iii）表明，改编的LLMS优于匿名的在线同伴反应，而简单的文化角色扮演不足以使文化敏感性；（iv）探索LLM在临床培训中的应用，专家强调了他们在促进未来治疗师文化能力方面的潜力。

Title: Expert Preference-based Evaluation of Automated Related Work Generation

Authors: Furkan Şahinuç, Subhabrata Dutta, Iryna Gurevych
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.07955
Pdf URL: https://arxiv.org/pdf/2508.07955
Copy Paste: [[2508.07955]] Expert Preference-based Evaluation of Automated Related Work Generation(https://arxiv.org/abs/2508.07955)
Keywords: llm
Abstract: Expert domain writing, such as scientific writing, typically demands extensive domain knowledge. Recent advances in LLMs show promising potential in reducing the expert workload. However, evaluating the quality of automatically generated scientific writing is a crucial open issue, as it requires knowledge of domain-specific evaluation criteria and the ability to discern expert preferences. Conventional automatic metrics and LLM-as-a-judge systems are insufficient to grasp expert preferences and domain-specific quality standards. To address this gap and support human-AI collaborative writing, we focus on related work generation, one of the most challenging scientific tasks, as an exemplar. We propose GREP, a multi-turn evaluation framework that integrates classical related work evaluation criteria with expert-specific preferences. Instead of assigning a single score, our framework decomposes the evaluation into fine-grained dimensions. This localized evaluation approach is further augmented with contrastive few-shot examples to provide detailed contextual guidance for the evaluation dimensions. The design principles allow our framework to deliver cardinal assessment of quality, which can facilitate better post-training compared to ordinal preference data. For better accessibility, we design two variants of GREP: a more precise variant with proprietary LLMs as evaluators, and a cheaper alternative with open-weight LLMs. Empirical investigation reveals that our framework is able to assess the quality of related work sections in a much more robust manner compared to standard LLM judges, reflects natural scenarios of scientific writing, and bears a strong correlation with the human expert assessment. We also observe that generations from state-of-the-art LLMs struggle to satisfy validation constraints of a suitable related work section. They (mostly) fail to improve based on feedback as well.
摘要：专家领域的写作（例如科学写作）通常需要广泛的领域知识。 LLMS的最新进展显示出减少专家工作量的有希望的潜力。但是，评估自动产生的科学写作质量是一个至关重要的开放问题，因为它需要了解特定领域的评估标准和识别专家偏好的能力。常规的自动指标和LLM-AS-A-A-Gudge系统不足以掌握专家的偏好和特定领域的质量标准。为了解决这一差距并支持人类协作写作，我们将重点放在相关的工作一代，这是最具挑战性的科学任务之一，作为一个典范。我们提出了GREP，这是一个多转弯评估框架，将经典相关的工作评估标准与专家特定的偏好相结合。我们的框架没有分配单个分数，而是将评估分解为细粒度的维度。这种本地化的评估方法将进一步增强，以对比度为几个示例，为评估维度提供详细的上下文指导。该设计原则使我们的框架可以对质量进行基本评估，这可以促进与序数偏好数据相比的更好的培训。为了获得更好的可访问性，我们设计了GREP的两个变体：具有专有LLMS作为评估者的更精确的变体，以及具有开放式LLM的较便宜的替代方案。实证研究表明，与标准LLM法官相比，我们的框架能够以更强大的方式评估相关工作部分的质量，反映了科学写作的自然情景，并与人类专家评估有着密切的相关性。我们还观察到，最先进的LLMS的几代人难以满足适当的相关工作部分的验证约束。他们（主要）也无法根据反馈来改进。

Title: Large Language Models for Subjective Language Understanding: A Survey

Authors: Changhao Song, Yazhou Zhang, Hui Gao, Ben Yao, Peng Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.07959
Pdf URL: https://arxiv.org/pdf/2508.07959
Copy Paste: [[2508.07959]] Large Language Models for Subjective Language Understanding: A Survey(https://arxiv.org/abs/2508.07959)
Keywords: language model, gpt, llm, chat
Abstract: Subjective language understanding refers to a broad set of natural language processing tasks where the goal is to interpret or generate content that conveys personal feelings, opinions, or figurative meanings rather than objective facts. With the advent of large language models (LLMs) such as ChatGPT, LLaMA, and others, there has been a paradigm shift in how we approach these inherently nuanced tasks. In this survey, we provide a comprehensive review of recent advances in applying LLMs to subjective language tasks, including sentiment analysis, emotion recognition, sarcasm detection, humor understanding, stance detection, metaphor interpretation, intent detection, and aesthetics assessment. We begin by clarifying the definition of subjective language from linguistic and cognitive perspectives, and we outline the unique challenges posed by subjective language (e.g. ambiguity, figurativeness, context dependence). We then survey the evolution of LLM architectures and techniques that particularly benefit subjectivity tasks, highlighting why LLMs are well-suited to model subtle human-like judgments. For each of the eight tasks, we summarize task definitions, key datasets, state-of-the-art LLM-based methods, and remaining challenges. We provide comparative insights, discussing commonalities and differences among tasks and how multi-task LLM approaches might yield unified models of subjectivity. Finally, we identify open issues such as data limitations, model bias, and ethical considerations, and suggest future research directions. We hope this survey will serve as a valuable resource for researchers and practitioners interested in the intersection of affective computing, figurative language processing, and large-scale language models.
摘要：主观语言理解是指广泛的自然语言处理任务，目标是解释或生成传达个人感觉，观点或象征意义而不是客观事实的内容。随着大型语言模型（LLM）的出现，例如Chatgpt，Llama和其他人，我们如何处理这些本质上细微的任务进行了范式转变。在这项调查中，我们对将LLMS应用于主观语言任务的最新进展进行了全面综述，包括情绪分析，情感识别，讽刺检测，幽默理解，立场检测，隐喻解释，意图检测和美学评估。我们首先从语言和认知的角度阐明主观语言的定义，并概述了主观语言（例如歧义，象征性，上下文依赖性）所带来的独特挑战。然后，我们调查了LLM体系结构和技术的演变，这些结构和技术特别受益于主观性任务，强调了为什么LLM非常适合模拟类似人类的人类判断。对于八个任务中的每一个，我们总结了任务定义，关键数据集，最先进的基于LLM的方法以及剩余的挑战。我们提供比较见解，讨论任务之间的共同点和差异以及多任务LLM方法如何产生统一的主观性模型。最后，我们确定了诸如数据限制，模型偏见和道德考虑之类的开放问题，并建议未来的研究方向。我们希望这项调查将成为对情感计算，象征性语言处理和大规模语言模型相交感兴趣的研究人员和从业人员的宝贵资源。

Title: Understanding Syntactic Generalization in Structure-inducing Language Models

Authors: David Arps, Hassan Sajjad, Laura Kallmeyer
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.07969
Pdf URL: https://arxiv.org/pdf/2508.07969
Copy Paste: [[2508.07969]] Understanding Syntactic Generalization in Structure-inducing Language Models(https://arxiv.org/abs/2508.07969)
Keywords: language model
Abstract: Structure-inducing Language Models (SiLM) are trained on a self-supervised language modeling task, and induce a hierarchical sentence representation as a byproduct when processing an input. A wide variety of SiLMs have been proposed. However, these have typically been evaluated on a relatively small scale, and evaluation of these models has systematic gaps and lacks comparability. In this work, we study three different SiLM architectures using both natural language (English) corpora and synthetic bracketing expressions: Structformer (Shen et al., 2021), UDGN (Shen et al., 2022) and GPST (Hu et al., 2024). We compare them with respect to (i) properties of the induced syntactic representations, (ii) performance on grammaticality judgment tasks, and (iii) training dynamics. We find that none of the three architectures dominates across all evaluation metrics. However, there are significant differences, in particular with respect to the induced syntactic representations. The Generative Pretrained Structured Transformer (GPST; Hu et al., 2024) performs most consistently across evaluation settings, and outperforms the other models on long-distance dependencies in bracketing expressions. Furthermore, our study shows that small models trained on large amounts of synthetic data provide a useful testbed for evaluating basic model properties.
摘要：诱导结构的语言模型（SILM）是根据自我监督的语言建模任务进行培训的，并在处理输入时诱导层次句子表示为副产品。已经提出了各种各样的Silms。但是，这些模型的评估通常是相对较小的，并且对这些模型的评估具有系统的差距，并且缺乏可比性。在这项工作中，我们使用自然语言（英语）语料库和合成括号表达式研究了三种不同的SILM架构：结构形式（Shen等，2021），UDGN（Shen等，2022）和GPST（Hu等，2024）。我们将它们相对于（i）诱导的句法表示（ii）在语法判断任务和（iii）训练动态的属性进行了比较。我们发现，这三个体系结构中没有一个在所有评估指标中占主导地位。但是，存在显着差异，尤其是在诱导的句法表示方面。生成预审预测的结构化变压器（GPST; Hu等，2024）在评估设置中表现最一致，并且在括号表达式中的长距离依赖性方面的其他模型优于其他模型。此外，我们的研究表明，经过大量合成数据训练的小型模型为评估基本模型属性提供了有用的测试台。

Title: Beyond Ten Turns: Unlocking Long-Horizon Agentic Search with Large-Scale Asynchronous RL

Authors: Jiaxuan Gao, Wei Fu, Minyang Xie, Shusheng Xu, Chuyi He, Zhiyu Mei, Banghua Zhu, Yi Wu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.07976
Pdf URL: https://arxiv.org/pdf/2508.07976
Copy Paste: [[2508.07976]] Beyond Ten Turns: Unlocking Long-Horizon Agentic Search with Large-Scale Asynchronous RL(https://arxiv.org/abs/2508.07976)
Keywords: llm, prompt, agent
Abstract: Recent advancements in LLM-based agents have demonstrated remarkable capabilities in handling complex, knowledge-intensive tasks by integrating external tools. Among diverse choices of tools, search tools play a pivotal role in accessing vast external knowledge. However, open-source agents still fall short of achieving expert-level Search Intelligence, the ability to resolve ambiguous queries, generate precise searches, analyze results, and conduct thorough exploration. Existing approaches fall short in scalability, efficiency, and data quality. For example, small turn limits in existing online RL methods, e.g. <=10, restrict complex strategy learning. This paper introduces ASearcher, an open-source project for large-scale RL training of search agents. Our key contributions include: (1) Scalable fully asynchronous RL training that enables long-horizon search while maintaining high training efficiency. (2) A prompt-based LLM agent that autonomously synthesizes high-quality and challenging QAs, creating a large-scale QA dataset. Through RL training, our prompt-based QwQ-32B agent achieves substantial improvements, with 46.7% and 20.8% Avg@4 gains on xBench and GAIA, respectively. Notably, our agent exhibits extreme long-horizon search, with tool calls exceeding 40 turns and output tokens exceeding 150k during training time. With a simple agent design and no external LLMs, ASearcher-Web-QwQ achieves Avg@4 scores of 42.1 on xBench and 52.8 on GAIA, surpassing existing open-source 32B agents. We open-source our models, training data, and codes in this https URL.
摘要：基于LLM的代理商的最新进展表明，通过整合外部工具来处理复杂的，知识密集的任务。在不同的工具选择中，搜索工具在获取广泛的外部知识中起着关键作用。但是，开源代理仍然无法实现专家级搜索智能，能够解决模棱两可的查询，生成精确的搜索，分析结果并进行彻底探索的能力。现有方法的可伸缩性，效率和数据质量缺乏。例如，现有的在线RL方法中的小转弯限制，例如<= 10，限制复杂的策略学习。本文介绍了Asearcher，这是一个开源项目，用于搜索剂的大规模RL培训。我们的主要贡献包括：（1）可扩展的完全异步RL训练，可在维持高训练效率的同时进行长马搜索。（2）基于及时的LLM代理，自主综合了高质量和具有挑战性的QA，创建了一个大型QA数据集。通过RL培训，我们迅速的QWQ-32B代理商取得了重大改进，分别在XBench和Gaia上获得46.7％和20.8％的AVG。值得注意的是，我们的代理商表现出极端的长马搜索，工具呼叫超过40圈，输出令牌在训练时间内超过150k。 ASEARCHER-WEB-QWQ凭借简单的代理设计和没有外部LLM，在XBench上获得42.1的AVG，在Gaia上达到52.8分数，超过了现有的开源32B代理。我们在此HTTPS URL中开放我们的模型，培训数据和代码。

Title: The Medical Metaphors Corpus (MCC)

Authors: Anna Sofia Lippolis, Andrea Giovanni Nuzzolese, Aldo Gangemi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.07993
Pdf URL: https://arxiv.org/pdf/2508.07993
Copy Paste: [[2508.07993]] The Medical Metaphors Corpus (MCC)(https://arxiv.org/abs/2508.07993)
Keywords: language model
Abstract: Metaphor is a fundamental cognitive mechanism that shapes scientific understanding, enabling the communication of complex concepts while potentially constraining paradigmatic thinking. Despite the prevalence of figurative language in scientific discourse, existing metaphor detection resources primarily focus on general-domain text, leaving a critical gap for domain-specific applications. In this paper, we present the Medical Metaphors Corpus (MCC), a comprehensive dataset of 792 annotated scientific conceptual metaphors spanning medical and biological domains. MCC aggregates metaphorical expressions from diverse sources including peer-reviewed literature, news media, social media discourse, and crowdsourced contributions, providing both binary and graded metaphoricity judgments validated through human annotation. Each instance includes source-target conceptual mappings and perceived metaphoricity scores on a 0-7 scale, establishing the first annotated resource for computational scientific metaphor research. Our evaluation demonstrates that state-of-the-art language models achieve modest performance on scientific metaphor detection, revealing substantial room for improvement in domain-specific figurative language understanding. MCC enables multiple research applications including metaphor detection benchmarking, quality-aware generation systems, and patient-centered communication tools.
摘要：隐喻是一种基本的认知机制，它塑造了科学的理解，从而实现了复杂概念的交流，同时可能限制了范式思维。尽管在科学话语中具有比喻性语言的流行率，但现有的隐喻检测资源主要集中在通用域文本上，留下了针对特定领域的应用的关键差距。在本文中，我们介绍了医学隐喻语料库（MCC），这是一个涵盖医学和生物领域的792个注释的科学概念隐喻的综合数据集。 MCC汇总了来自不同来源的隐喻表达，包括经过同行评审的文献，新闻媒体，社交媒体话语以及众包的贡献，提供了通过人类注释验证的二元和分级的隐喻性判断。每个实例都包括源目标概念映射和以0-7量表感知的隐喻性分数，从而为计算科学隐喻研究建立了第一个带注释的资源。我们的评估表明，最先进的语言模型在科学隐喻检测上实现了适度的表现，从而揭示了改善特定领域的比喻性语言理解的大量空间。 MCC启用了多个研究应用程序，包括隐喻检测基准测试，质量感知的生成系统和以患者为中心的通信工具。

Title: WideSearch: Benchmarking Agentic Broad Info-Seeking

Authors: Ryan Wong, Jiawei Wang, Junjie Zhao, Li Chen, Yan Gao, Long Zhang, Xuan Zhou, Zuo Wang, Kai Xiang, Ge Zhang, Wenhao Huang, Yang Wang, Ke Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.07999
Pdf URL: https://arxiv.org/pdf/2508.07999
Copy Paste: [[2508.07999]] WideSearch: Benchmarking Agentic Broad Info-Seeking(https://arxiv.org/abs/2508.07999)
Keywords: language model, llm, agent
Abstract: From professional research to everyday planning, many tasks are bottlenecked by wide-scale information seeking, which is more repetitive than cognitively complex. With the rapid development of Large Language Models (LLMs), automated search agents powered by LLMs offer a promising solution to liberate humans from this tedious work. However, the capability of these agents to perform such "wide-context" collection reliably and completely remains largely unevaluated due to a lack of suitable benchmarks. To bridge this gap, we introduce WideSearch, a new benchmark engineered to evaluate agent reliability on these large-scale collection tasks. The benchmark features 200 manually curated questions (100 in English, 100 in Chinese) from over 15 diverse domains, grounded in real user queries. Each task requires agents to collect large-scale atomic information, which could be verified one by one objectively, and arrange it into a well-organized output. A rigorous five-stage quality control pipeline ensures the difficulty, completeness, and verifiability of the dataset. We benchmark over 10 state-of-the-art agentic search systems, including single-agent, multi-agent frameworks, and end-to-end commercial systems. Most systems achieve overall success rates near 0%, with the best performer reaching just 5%. However, given sufficient time, cross-validation by multiple human testers can achieve a near 100% success rate. These results demonstrate that present search agents have critical deficiencies in large-scale information seeking, underscoring urgent areas for future research and development in agentic search. Our dataset, evaluation pipeline, and benchmark results have been publicly released at this https URL.
摘要：从专业研究到日常计划，许多任务都被广泛的信息寻求瓶装，这比认知上复杂更重复。随着大型语言模型（LLM）的快速发展，由LLMS提供支持的自动搜索剂为使人类摆脱这项繁琐的工作提供了有希望的解决方案。但是，由于缺乏合适的基准，这些代理人可以可靠和完全完全无法评估这种“广泛的”集合的能力在很大程度上仍未被估算。为了弥合这一差距，我们介绍了跨越搜索，这是一种新的基准测试，该基准旨在评估这些大规模收集任务的代理可靠性。该基准具有来自15个以上不同域的200个手动策划的问题（100英语，中文100个），这些域以实际用户查询为基础。每个任务都要求代理人收集大规模的原子信息，可以客观地对其进行一个一一验证，并将其安排为组织良好的输出。严格的五阶段质量控制管道可确保数据集的难度，完整性和可验证性。我们基准了10个最先进的代理搜索系统，包括单一代理，多代理框架和端到端的商业系统。大多数系统的总体成功率接近0 \％，表现最好的人只达到5％。但是，如果有足够的时间，多个人类测试人员的交叉验证可以达到接近100 \％的成功率。这些结果表明，当前的搜索代理在大规模的信息中存在严重的缺陷，并强调了为代理搜索中未来研究和开发的紧急领域。我们的数据集，评估管道和基准结果已在此HTTPS URL上公开发布

Title: Progressive Depth Up-scaling via Optimal Transport

Authors: Mingzi Cao, Xi Wang, Nikolaos Aletras
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.08011
Pdf URL: https://arxiv.org/pdf/2508.08011
Copy Paste: [[2508.08011]] Progressive Depth Up-scaling via Optimal Transport(https://arxiv.org/abs/2508.08011)
Keywords: language model, llm
Abstract: Scaling Large Language Models (LLMs) yields performance gains but incurs substantial training costs. Depth up-scaling offers training efficiency by adding new layers to pre-trained models. However, most existing methods copy or average weights from base layers, neglecting neuron permutation differences. This limitation can potentially cause misalignment that harms performance. Inspired by applying Optimal Transport (OT) for neuron alignment, we propose Optimal Transport Depth Up-Scaling (OpT-DeUS). OpT-DeUS aligns and fuses Transformer blocks in adjacent base layers via OT for new layer creation, to mitigate neuron permutation mismatch between layers. OpT-DeUS achieves better overall performance and offers better training efficiency than existing methods for continual pre-training and supervised fine-tuning across different model sizes. To further evaluate the impact of interpolation positions, our extensive analysis shows that inserting new layers closer to the top results in higher training efficiency due to shorter back-propagation time while obtaining additional performance gains.
摘要：扩展大语言模型（LLMS）可产生绩效增长，但会产生大量的培训成本。深度缩放通过将新层添加到预训练的模型中，从而提供了培训效率。但是，大多数现有方法从基层复制或平均权重忽略了神经元排列差异。这种限制可能会导致损害性能的未对准。灵感来自将最佳传输（OT）用于神经元比对的灵感，我们提出了最佳的运输深度上缩放（OPT-DEUS）。 OPT-DEUS通过OT在相邻的基层中对齐和保险丝的变压器块，以进行新的层创建，以减轻层之间的神经元置换不匹配。与现有的方法相比，Opt-Deus可以取得更好的整体性能，并提供提高的训练效率，用于跨不同模型尺寸的持续预训练和监督微调。为了进一步评估插值位置的影响，我们的广泛分析表明，插入新层更接近最高的层次，从而导致较高的训练效率，同时较短的背部传播时间，同时获得了额外的性能增长。
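
The idea of aligning neurons before fusing two layers can be sketched in a few lines. This is not the OpT-DeUS procedure itself (the abstract does not give the cost function, solver, or fusion rule); it is a minimal illustration that assumes squared Euclidean cost between neuron weight rows, entropy-regularized OT solved with Sinkhorn iterations, and a plain 50/50 average after alignment. The function names and the toy matrices are hypothetical.

```python
import math

def sinkhorn_plan(cost, eps=0.1, iters=200):
    """Entropy-regularized OT plan between two uniform distributions."""
    n, m = len(cost), len(cost[0])
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    a, b = [1.0 / n] * n, [1.0 / m] * m
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]

def sq_dist(x, y):
    return sum((p - q) ** 2 for p, q in zip(x, y))

def fuse_layers(layer_a, layer_b, eps=0.1):
    """Align layer_b's neurons (rows) to layer_a's via OT, then average."""
    n, d = len(layer_a), len(layer_a[0])
    cost = [[sq_dist(ra, rb) for rb in layer_b] for ra in layer_a]
    plan = sinkhorn_plan(cost, eps)
    # Barycentric projection: each row of layer_a receives a weighted mix
    # of layer_b rows; scaling by n makes each row's weights sum to about 1.
    aligned_b = [
        [n * sum(plan[i][j] * layer_b[j][k] for j in range(len(layer_b))) for k in range(d)]
        for i in range(n)
    ]
    return [[0.5 * (layer_a[i][k] + aligned_b[i][k]) for k in range(d)] for i in range(n)]

# Toy case: layer_b holds the same neurons as layer_a in a different order.
layer_a = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
layer_b = [layer_a[2], layer_a[0], layer_a[1]]

fused = fuse_layers(layer_a, layer_b)
naive = [[0.5 * (x + y) for x, y in zip(ra, rb)] for ra, rb in zip(layer_a, layer_b)]
print(fused)
print(naive)
```

In this toy case the aligned fusion recovers the original neurons almost exactly, while the naive row-by-row average mixes unrelated neurons, which is the permutation mismatch the abstract refers to.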

Title: 9th Workshop on Sign Language Translation and Avatar Technologies (SLTAT 2025)

Authors: Fabrizio Nunnari, Cristina Luna Jiménez, Rosalee Wolfe, John C. McDonald, Michael Filhol, Eleni Efthimiou, Evita Fotinea, Thomas Hanke
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.08050
Pdf URL: https://arxiv.org/pdf/2508.08050
Copy Paste: [[2508.08050]] 9th Workshop on Sign Language Translation and Avatar Technologies (SLTAT 2025)(https://arxiv.org/abs/2508.08050)
Keywords: agent
Abstract: The Sign Language Translation and Avatar Technology (SLTAT) workshops continue a series of gatherings to share recent advances in improving deaf / human communication through non-invasive means. This 2025 edition, the 9th since its first appearance in 2011, is hosted by the International Conference on Intelligent Virtual Agents (IVA), giving the opportunity for contamination between two research communities, using digital humans as either virtual interpreters or as interactive conversational agents. As presented in this summary paper, SLTAT sees contributions beyond avatar technologies, with a consistent number of submissions on sign language recognition, and other work on data collection, data analysis, tools, ethics, usability, and affective computing.
摘要：手语翻译和化身技术（SLTAT）讲习班继续进行一系列聚会，分享通过非侵入性手段改善聋 /人交流的最新进展。该2025年版是自2011年首次出现以来的第9版，由国际智能虚拟代理会议（IVA）主持，这为两个研究社区之间的污染机会提供了机会，将数字人类用作虚拟口译员或交互式对话代理。如本摘要论文所述，SLTAT看到了Avatar Technologies以外的贡献，对手语识别的一致性提交，以及有关数据收集，数据分析，工具，伦理，可用性和情感计算的其他工作。

Title: Dual Information Speech Language Models for Emotional Conversations

Authors: Chun Wang, Chenyang Liu, Wenze Xu, Weihong Deng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.08095
Pdf URL: https://arxiv.org/pdf/2508.08095
Copy Paste: [[2508.08095]] Dual Information Speech Language Models for Emotional Conversations(https://arxiv.org/abs/2508.08095)
Keywords: language model, llm
Abstract: Conversational systems relying on text-based large language models (LLMs) often overlook paralinguistic cues, essential for understanding emotions and intentions. Speech-language models (SLMs), which use speech as input, are emerging as a promising solution. However, SLMs built by extending frozen LLMs struggle to capture paralinguistic information and exhibit reduced context understanding. We identify entangled information and improper training strategies as key issues. To address these issues, we propose two heterogeneous adapters and suggest a weakly supervised training strategy. Our approach disentangles paralinguistic and linguistic information, enabling SLMs to interpret speech through structured representations. It also preserves contextual understanding by avoiding the generation of task-specific vectors through controlled randomness. This approach trains only the adapters on common datasets, ensuring parameter and data efficiency. Experiments demonstrate competitive performance in emotional conversation tasks, showcasing the model's ability to effectively integrate both paralinguistic and linguistic information within contextual settings.
摘要：依靠基于文本的大语言模型（LLM）的对话系统通常会忽略副语言提示，这对于理解情绪和意图至关重要。语音语言模型（SLMS）使用语音作为输入，它正在作为一个有前途的解决方案。但是，通过扩展冷冻LLM构建的SLM难以捕获副语言信息并表现出降低的上下文理解。我们将纠缠的信息和不当培训策略视为关键问题。为了解决这些问题，我们提出了两个异构适配器，并提出了一个弱监督的培训策略。我们的方法删除了副语言和语言信息，使SLM能够通过结构化表示来解释语音。它还通过避免通过受控的随机性避免产生特定于任务的向量来保留上下文理解。这种方法仅训练通用数据集上的适配器，以确保参数和数据效率。实验显示了情感对话任务中的竞争性能，展示了该模型有效地将副语言和语言信息整合到上下文设置中的能力。

Title: Assessing LLM Text Detection in Educational Contexts: Does Human Contribution Affect Detection?

Authors: Lukas Gehring, Benjamin Paaßen
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2508.08096
Pdf URL: https://arxiv.org/pdf/2508.08096
Copy Paste: [[2508.08096]] Assessing LLM Text Detection in Educational Contexts: Does Human Contribution Affect Detection?(https://arxiv.org/abs/2508.08096)
Keywords: language model, llm
Abstract: Recent advancements in Large Language Models (LLMs) and their increased accessibility have made it easier than ever for students to automatically generate texts, posing new challenges for educational institutions. To enforce norms of academic integrity and ensure students' learning, learning analytics methods to automatically detect LLM-generated text appear increasingly appealing. This paper benchmarks the performance of different state-of-the-art detectors in educational contexts, introducing a novel dataset, called Generative Essay Detection in Education (GEDE), containing over 900 student-written essays and over 12,500 LLM-generated essays from various domains. To capture the diversity of LLM usage practices in generating text, we propose the concept of contribution levels, representing students' contribution to a given assignment. These levels range from purely human-written texts, to slightly LLM-improved versions, to fully LLM-generated texts, and finally to active attacks on the detector by "humanizing" generated texts. We show that most detectors struggle to accurately classify texts of intermediate student contribution levels, like LLM-improved human-written texts. Detectors are particularly likely to produce false positives, which is problematic in educational settings where false suspicions can severely impact students' lives. Our dataset, code, and additional supplementary materials are publicly available at this https URL.

Title: Optimal Transport Regularization for Speech Text Alignment in Spoken Language Models

Authors: Wenze Xu, Chun Wang, Jiazhen Yu, Sheng Chen, Liang Gao, Weihong Deng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.08131
Pdf URL: https://arxiv.org/pdf/2508.08131
Copy Paste: [[2508.08131]] Optimal Transport Regularization for Speech Text Alignment in Spoken Language Models(https://arxiv.org/abs/2508.08131)
Keywords: language model, llm
Abstract: Spoken Language Models (SLMs), which extend Large Language Models (LLMs) to perceive speech inputs, have gained increasing attention for their potential to advance speech understanding tasks. However, despite recent progress, studies show that SLMs often struggle to generalize across datasets, even for trained languages and tasks, raising concerns about whether they process speech in a text-like manner as intended. A key challenge underlying this limitation is the modality gap between speech and text representations. The high variability in speech embeddings may allow SLMs to achieve strong in-domain performance by exploiting unintended speech variations, ultimately hindering generalization. To mitigate this modality gap, we introduce Optimal Transport Regularization (OTReg), a method that formulates speech-text alignment as an optimal transport problem and derives a regularization loss to improve SLM training. In each training iteration, OTReg first establishes a structured correspondence between speech and transcript embeddings by determining the optimal transport plan, then incorporates the regularization loss based on this transport plan to optimize SLMs in generating speech embeddings that align more effectively with transcript embeddings. OTReg is lightweight, requiring no additional labels or learnable parameters, and integrates seamlessly into existing SLM training procedures. Extensive multilingual ASR experiments demonstrate that OTReg enhances speech-text alignment, mitigates the modality gap, and consequently improves SLM generalization across diverse datasets.
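The abstract describes computing a transport plan between speech and transcript embeddings and then deriving a loss from that plan. A minimal sketch of the idea follows, assuming a squared Euclidean cost, uniform marginals, and an entropy-regularized (Sinkhorn) solver. None of these choices are stated in the abstract, and the function names are hypothetical.

```python
import math

def cost_matrix(speech, text):
    """Squared Euclidean distance between every speech and transcript embedding."""
    return [[sum((a - b) ** 2 for a, b in zip(s, t)) for t in text] for s in speech]

def sinkhorn_plan(cost, epsilon=0.1, n_iters=200):
    """Entropy-regularized transport plan with uniform marginals (assumed setup)."""
    n, m = len(cost), len(cost[0])
    kernel = [[math.exp(-c / epsilon) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the target marginals.
        u = [(1.0 / n) / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [(1.0 / m) / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]

def ot_regularizer(speech, text, epsilon=0.1):
    """Transport cost under the plan; small when the two embedding sets align."""
    cost = cost_matrix(speech, text)
    plan = sinkhorn_plan(cost, epsilon)
    return sum(plan[i][j] * cost[i][j]
               for i in range(len(cost)) for j in range(len(cost[0])))

speech_emb = [[0.0, 0.0], [1.0, 1.0]]
print(ot_regularizer(speech_emb, speech_emb))                # close to zero
print(ot_regularizer(speech_emb, [[0.5, 0.5], [1.5, 1.5]]))  # larger: sets are shifted
```

In a training loop, a term like this would be added to the main loss with a weighting coefficient. The weighting is likewise an assumption here.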

Title: Can LLMs Detect Their Confabulations? Estimating Reliability in Uncertainty-Aware Language Models

Authors: Tianyi Zhou, Johanne Medina, Sanjay Chawla
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.08139
Pdf URL: https://arxiv.org/pdf/2508.08139
Copy Paste: [[2508.08139]] Can LLMs Detect Their Confabulations? Estimating Reliability in Uncertainty-Aware Language Models(https://arxiv.org/abs/2508.08139)
Keywords: language model, llm, agent
Abstract: Large Language Models (LLMs) are prone to generating fluent but incorrect content, known as confabulation, which poses increasing risks in multi-turn or agentic applications where outputs may be reused as context. In this work, we investigate how in-context information influences model behavior and whether LLMs can identify their unreliable responses. We propose a reliability estimation that leverages token-level uncertainty to guide the aggregation of internal model representations. Specifically, we compute aleatoric and epistemic uncertainty from output logits to identify salient tokens and aggregate their hidden states into compact representations for response-level reliability prediction. Through controlled experiments on open QA benchmarks, we find that correct in-context information improves both answer accuracy and model confidence, while misleading context often induces confidently incorrect responses, revealing a misalignment between uncertainty and correctness. Our probing-based method captures these shifts in model behavior and improves the detection of unreliable outputs across multiple open-source LLMs. These results underscore the limitations of direct uncertainty signals and highlight the potential of uncertainty-guided probing for reliability-aware generation.
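The pipeline outlined in the abstract (score tokens by uncertainty, keep the salient ones, aggregate their hidden states) can be sketched as below. This is an illustration only: it uses plain predictive entropy of the softmax as a single stand-in for the aleatoric and epistemic measures, which the abstract does not define, and mean pooling as the aggregation. All names are hypothetical.

```python
import math

def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_entropy(logits):
    """Predictive entropy of one token's output distribution (assumed proxy)."""
    return -sum(p * math.log(p) for p in softmax(logits) if p > 0.0)

def pool_salient_states(token_logits, hidden_states, top_k=2):
    """Pick the top_k most uncertain tokens and mean-pool their hidden states."""
    scores = [token_entropy(logits) for logits in token_logits]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    salient = sorted(ranked[:top_k])
    dim = len(hidden_states[0])
    pooled = [sum(hidden_states[i][d] for i in salient) / len(salient)
              for d in range(dim)]
    return salient, pooled

token_logits = [[10.0, 0.0, 0.0], [0.0, 0.0, 0.0], [1.0, 1.0, 0.0], [8.0, 0.0, 0.0]]
hidden_states = [[1.0, 0.0], [2.0, 2.0], [4.0, 6.0], [0.0, 1.0]]
print(pool_salient_states(token_logits, hidden_states))  # ([1, 2], [3.0, 4.0])
```

The pooled vector would then be fed to a small probe that predicts response-level reliability. That classifier is omitted here.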

Title: Data-Efficient Biomedical In-Context Learning: A Diversity-Enhanced Submodular Perspective

Authors: Jun Wang, Zaifu Zhan, Qixin Zhang, Mingquan Lin, Meijia Song, Rui Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.08140
Pdf URL: https://arxiv.org/pdf/2508.08140
Copy Paste: [[2508.08140]] Data-Efficient Biomedical In-Context Learning: A Diversity-Enhanced Submodular Perspective(https://arxiv.org/abs/2508.08140)
Keywords: language model, llm, prompt
Abstract: Recent progress in large language models (LLMs) has leveraged their in-context learning (ICL) abilities to enable quick adaptation to unseen biomedical NLP tasks. By incorporating only a few input-output examples into prompts, LLMs can rapidly perform these new tasks. While the impact of these demonstrations on LLM performance has been extensively studied, most existing approaches prioritize representativeness over diversity when selecting examples from large corpora. To address this gap, we propose Dual-Div, a diversity-enhanced data-efficient framework for demonstration selection in biomedical ICL. Dual-Div employs a two-stage retrieval and ranking process: First, it identifies a limited set of candidate examples from a corpus by optimizing both representativeness and diversity (with optional annotation for unlabeled data). Second, it ranks these candidates against test queries to select the most relevant and non-redundant demonstrations. Evaluated on three biomedical NLP tasks (named entity recognition (NER), relation extraction (RE), and text classification (TC)) using LLaMA 3.1 and Qwen 2.5 for inference, along with three retrievers (BGE-Large, BMRetriever, MedCPT), Dual-Div consistently outperforms baselines, achieving up to 5% higher macro-F1 scores, while demonstrating robustness to prompt permutations and class imbalance. Our findings establish that diversity in initial retrieval is more critical than ranking-stage optimization, and limiting demonstrations to 3-5 examples maximizes performance efficiency.
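The first stage described above, choosing a small candidate set that is both representative and diverse, can be illustrated with a greedy selection sketch. The objective used here (facility-location coverage minus a redundancy penalty weighted by `lam`) is an assumption for illustration and is not taken from the paper; the function and parameter names are hypothetical.

```python
def select_demonstrations(sim, k, lam=0.5):
    """Greedily pick k items from a pairwise similarity matrix.

    Coverage gain rewards items that represent the corpus well; the penalty
    discourages items too similar to ones already chosen (assumed objective).
    """
    n = len(sim)
    selected = []
    best = [0.0] * n  # best[i]: how well item i is covered by the selection so far
    for _ in range(k):
        def score(c):
            gain = sum(max(0.0, sim[i][c] - best[i]) for i in range(n))
            redundancy = max((sim[c][s] for s in selected), default=0.0)
            return gain - lam * redundancy
        candidates = [c for c in range(n) if c not in selected]
        choice = max(candidates, key=score)
        selected.append(choice)
        best = [max(best[i], sim[i][choice]) for i in range(n)]
    return selected

# Items 0 and 1 are near-duplicates; items 2 and 3 are distinct.
similarity = [
    [1.0, 0.9, 0.1, 0.25],
    [0.9, 1.0, 0.1, 0.2],
    [0.1, 0.1, 1.0, 0.3],
    [0.25, 0.2, 0.3, 1.0],
]
print(select_demonstrations(similarity, k=2))  # [0, 2]
```

The near-duplicate item 1 is skipped because item 0 already covers it, which is the behavior a diversity-aware selector is meant to show.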

Title: REX-RAG: Reasoning Exploration with Policy Correction in Retrieval-Augmented Generation

Authors: Wentao Jiang, Xiang Feng, Zengmao Wang, Yong Luo, Pingbo Xu, Zhe Chen, Bo Du, Jing Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.08149
Pdf URL: https://arxiv.org/pdf/2508.08149
Copy Paste: [[2508.08149]] REX-RAG: Reasoning Exploration with Policy Correction in Retrieval-Augmented Generation(https://arxiv.org/abs/2508.08149)
Keywords: language model, llm, prompt, retrieval-augmented generation
Abstract: Reinforcement learning (RL) is emerging as a powerful paradigm for enabling large language models (LLMs) to perform complex reasoning tasks. Recent advances indicate that integrating RL with retrieval-augmented generation (RAG) allows LLMs to dynamically incorporate external knowledge, leading to more informed and robust decision making. However, we identify a critical challenge during policy-driven trajectory sampling: LLMs are frequently trapped in unproductive reasoning paths, which we refer to as "dead ends", committing to overconfident yet incorrect conclusions. This severely hampers exploration and undermines effective policy optimization. To address this challenge, we propose REX-RAG (Reasoning Exploration with Policy Correction in Retrieval-Augmented Generation), a novel framework that explores alternative reasoning paths while maintaining rigorous policy learning through principled distributional corrections. Our approach introduces two key innovations: (1) Mixed Sampling Strategy, which combines a novel probe sampling method with exploratory prompts to escape dead ends; and (2) Policy Correction Mechanism, which employs importance sampling to correct distribution shifts induced by mixed sampling, thereby mitigating gradient estimation bias. We evaluate it on seven question-answering benchmarks, and the experimental results show that REX-RAG achieves average performance gains of 5.1% on Qwen2.5-3B and 3.6% on Qwen2.5-7B over strong baselines, demonstrating competitive results across multiple datasets. The code is publicly available at this https URL.
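The policy correction idea, using importance sampling to undo the distribution shift introduced by mixed sampling, can be shown on a toy discrete example. The sketch assumes trajectories are reduced to three outcomes and that the behavior distribution is a simple mixture of the policy and a probe distribution with weight `alpha`. These are illustrative assumptions, not the paper's formulation.

```python
def mixture_probs(policy, probe, alpha):
    """Behavior distribution when a fraction alpha of samples comes from the probe."""
    return [(1 - alpha) * p + alpha * q for p, q in zip(policy, probe)]

def importance_weights(policy, behavior):
    """Ratio of target-policy to behavior probability for each outcome."""
    return [p / b for p, b in zip(policy, behavior)]

def expected_reward(probs, rewards, weights=None):
    """Expectation of (optionally weighted) reward under the given distribution."""
    weights = weights or [1.0] * len(probs)
    return sum(p * w * r for p, w, r in zip(probs, weights, rewards))

policy = [0.7, 0.2, 0.1]   # current policy over three outcomes
probe = [0.1, 0.1, 0.8]    # exploratory distribution that favors outcome 2
rewards = [0.0, 0.0, 1.0]

behavior = mixture_probs(policy, probe, alpha=0.5)
weights = importance_weights(policy, behavior)

print(expected_reward(policy, rewards))             # on-policy value
print(expected_reward(behavior, rewards))           # biased by the probe samples
print(expected_reward(behavior, rewards, weights))  # corrected, matches on-policy
```

Without the weights, the estimate is inflated because the probe oversamples the rewarded outcome; with them, the expectation under the mixture equals the on-policy value.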

Title: Efficient Speculative Decoding for Llama at Scale: Challenges and Solutions

Authors: Bangsheng Tang, Carl Chengyan Fu, Fei Kou, Grigory Sizov, Haoci Zhang, Jason Park, Jiawen Liu, Jie You, Qirui Yang, Sachin Mehta, Shengyong Cai, Xiaodong Wang, Xingyu Liu, Yunlu Li, Yanjun Zhou, Wei Wei, Zhiwei Zhao, Zixi Qi, Adolfo Victoria, Aya Ibrahim, Bram Wasti, Changkyu Kim, Daniel Haziza, Fei Sun, Giancarlo Delfin, Emily Guo, Jialin Ouyang, Jaewon Lee, Jianyu Huang, Jeremy Reizenstein, Lu Fang, Quinn Zhu, Ria Verma, Vlad Mihailescu, Xingwen Guo, Yan Cui, Ye Hu, Yejin Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.08192
Pdf URL: https://arxiv.org/pdf/2508.08192
Copy Paste: [[2508.08192]] Efficient Speculative Decoding for Llama at Scale: Challenges and Solutions(https://arxiv.org/abs/2508.08192)
Keywords: language model
Abstract: Speculative decoding is a standard method for accelerating the inference speed of large language models. However, scaling it for production environments poses several engineering challenges, including efficiently implementing different operations (e.g., tree attention and multi-round speculative decoding) on GPU. In this paper, we detail the training and inference optimization techniques that we have implemented to enable EAGLE-based speculative decoding at a production scale for Llama models. With these changes, we achieve a new state-of-the-art inference latency for Llama models. For example, Llama4 Maverick decodes at a speed of about 4 ms per token (with a batch size of one) on 8 NVIDIA H100 GPUs, which is 10% faster than the previously best known method. Furthermore, for EAGLE-based speculative decoding, our optimizations enable us to achieve a speed-up for large batch sizes between 1.4x and 2.0x at production scale.
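For readers unfamiliar with the basic mechanism, the sketch below shows plain greedy speculative decoding: a draft proposes several tokens, and the target model keeps the longest prefix it agrees with. It does not attempt the tree attention, multi-round speculation, or EAGLE-specific details mentioned in the abstract. `verify_draft` and `target_next_token` are hypothetical names, and a real system would score all draft positions in a single forward pass rather than one call per token.

```python
def verify_draft(prefix, draft_tokens, target_next_token):
    """Accept draft tokens while they match the target model's greedy choice.

    target_next_token(context) returns the token the target model would emit
    next. On the first mismatch the target's token replaces the draft token;
    if every draft token matches, one extra target token is appended.
    """
    accepted = []
    for token in draft_tokens:
        expected = target_next_token(prefix + accepted)
        if token != expected:
            accepted.append(expected)
            return accepted
        accepted.append(token)
    accepted.append(target_next_token(prefix + accepted))
    return accepted

# Toy target model: the next token is the context length modulo 5.
def toy_target(context):
    return len(context) % 5

print(verify_draft([9, 9], [2, 3, 7], toy_target))  # [2, 3, 4]
print(verify_draft([9, 9], [2, 3, 4], toy_target))  # [2, 3, 4, 0]
```

Each verification step yields at least one token, so the output matches what the target model alone would have produced, only in fewer target passes when the draft is accurate.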

Title: Human-Alignment and Calibration of Inference-Time Uncertainty in Large Language Models

Authors: Kyle Moore, Jesse Roberts, Daryl Watson
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.08204
Pdf URL: https://arxiv.org/pdf/2508.08204
Copy Paste: [[2508.08204]] Human-Alignment and Calibration of Inference-Time Uncertainty in Large Language Models(https://arxiv.org/abs/2508.08204)
Keywords: language model, llm
Abstract: There has been much recent interest in evaluating large language models for uncertainty calibration to facilitate model control and modulate user trust. Inference time uncertainty, which may provide a real-time signal to the model or external control modules, is particularly important for applying these concepts to improve LLM-user experience in practice. While many of the existing papers consider model calibration, comparatively little work has sought to evaluate how closely model uncertainty aligns to human uncertainty. In this work, we evaluate a collection of inference-time uncertainty measures, using both established metrics and novel variations, to determine how closely they align with both human group-level uncertainty and traditional notions of model calibration. We find that numerous measures show evidence of strong alignment to human uncertainty, despite the lack of alignment to human answer preference. For those successful metrics, we find moderate to strong evidence of model calibration in terms of both correctness correlation and distributional analysis.

Title: SAEMark: Multi-bit LLM Watermarking with Inference-Time Scaling

Authors: Zhuohao Yu, Xingru Jiang, Weizheng Gu, Yidong Wang, Shikun Zhang, Wei Ye
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2508.08211
Pdf URL: https://arxiv.org/pdf/2508.08211
Copy Paste: [[2508.08211]] SAEMark: Multi-bit LLM Watermarking with Inference-Time Scaling(https://arxiv.org/abs/2508.08211)
Keywords: llm
Abstract: Watermarking LLM-generated text is critical for content attribution and misinformation prevention. However, existing methods compromise text quality and require white-box model access and logit manipulation. These limitations exclude API-based models and multilingual scenarios. We propose SAEMark, a general framework for post-hoc multi-bit watermarking that embeds personalized messages solely via inference-time, feature-based rejection sampling without altering model logits or requiring training. Our approach operates on deterministic features extracted from generated text, selecting outputs whose feature statistics align with key-derived targets. This framework naturally generalizes across languages and domains while preserving text quality by sampling LLM outputs instead of modifying them. We provide theoretical guarantees relating watermark success probability and compute budget that hold for any suitable feature extractor. Empirically, we demonstrate the framework's effectiveness using Sparse Autoencoders (SAEs), achieving superior detection accuracy and text quality. Experiments across 4 datasets show SAEMark's consistent performance, with 99.7% F1 on English and strong multi-bit detection accuracy. SAEMark establishes a new paradigm for scalable watermarking that works out-of-the-box with closed-source LLMs while enabling content attribution.
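The selection step described above, choosing among sampled outputs the one whose feature statistic best matches a key-derived target, can be sketched as follows. The feature used here (vowel fraction) is a deliberately trivial stand-in for the sparse-autoencoder features named in the abstract, and the hash-based target derivation is an assumption. All names are hypothetical.

```python
import hashlib

def key_target(key, position):
    """Derive a pseudo-random target in [0, 1) from a secret key and a position."""
    digest = hashlib.sha256(f"{key}:{position}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def toy_feature(text):
    """Hypothetical deterministic feature in [0, 1]: fraction of vowels among letters."""
    letters = [c for c in text.lower() if c.isalpha()]
    return sum(c in "aeiou" for c in letters) / max(len(letters), 1)

def select_candidate(candidates, key, position):
    """Keep the sampled output whose feature is closest to the keyed target."""
    target = key_target(key, position)
    return min(candidates, key=lambda text: abs(toy_feature(text) - target))

candidates = [
    "The model answers the question directly.",
    "A brief reply follows.",
    "Here is one possible response to the query.",
]
print(select_candidate(candidates, key="secret", position=0))
```

Because the text is only selected, never edited, the approach needs no access to logits. A detector holding the same key could recompute the targets and check how closely the observed features track them.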

Title: Capabilities of GPT-5 on Multimodal Medical Reasoning

Authors: Shansong Wang, Mingzhe Hu, Qiang Li, Mojtaba Safari, Xiaofeng Yang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.08224
Pdf URL: https://arxiv.org/pdf/2508.08224
Copy Paste: [[2508.08224]] Capabilities of GPT-5 on Multimodal Medical Reasoning(https://arxiv.org/abs/2508.08224)
Keywords: language model, gpt, llm, chain-of-thought
Abstract: Recent advances in large language models (LLMs) have enabled general-purpose systems to perform increasingly complex domain-specific reasoning without extensive fine-tuning. In the medical domain, decision-making often requires integrating heterogeneous information sources, including patient narratives, structured data, and medical images. This study positions GPT-5 as a generalist multimodal reasoner for medical decision support and systematically evaluates its zero-shot chain-of-thought reasoning performance on both text-based question answering and visual question answering tasks under a unified protocol. We benchmark GPT-5, GPT-5-mini, GPT-5-nano, and GPT-4o-2024-11-20 against standardized splits of MedQA, MedXpertQA (text and multimodal), MMLU medical subsets, USMLE self-assessment exams, and VQA-RAD. Results show that GPT-5 consistently outperforms all baselines, achieving state-of-the-art accuracy across all QA benchmarks and delivering substantial gains in multimodal reasoning. On MedXpertQA MM, GPT-5 improves reasoning and understanding scores by +29.62% and +36.18% over GPT-4o, respectively, and surpasses pre-licensed human experts by +24.23% in reasoning and +29.40% in understanding. In contrast, GPT-4o remains below human expert performance in most dimensions. A representative case study demonstrates GPT-5's ability to integrate visual and textual cues into a coherent diagnostic reasoning chain, recommending appropriate high-stakes interventions. Our results show that, on these controlled multimodal reasoning benchmarks, GPT-5 moves from human-comparable to above human-expert performance. This improvement may substantially inform the design of future clinical decision-support systems.

Title: Exploring Safety Alignment Evaluation of LLMs in Chinese Mental Health Dialogues via LLM-as-Judge

Authors: Yunna Cai, Fan Wang, Haowei Wang, Kun Wang, Kailai Yang, Sophia Ananiadou, Moyan Li, Mingming Fan
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2508.08236
Pdf URL: https://arxiv.org/pdf/2508.08236
Copy Paste: [[2508.08236]] Exploring Safety Alignment Evaluation of LLMs in Chinese Mental Health Dialogues via LLM-as-Judge(https://arxiv.org/abs/2508.08236)
Keywords: llm, prompt
Abstract: Evaluating the safety alignment of LLM responses in high-risk mental health dialogues is particularly difficult due to missing gold-standard answers and the ethically sensitive nature of these interactions. To address this challenge, we propose PsyCrisis-Bench, a reference-free evaluation benchmark based on real-world Chinese mental health dialogues. It evaluates whether the model responses align with the safety principles defined by experts. Specifically designed for settings without standard references, our method adopts a prompt-based LLM-as-Judge approach that conducts in-context evaluation using expert-defined reasoning chains grounded in psychological intervention principles. We employ binary point-wise scoring across multiple safety dimensions to enhance the explainability and traceability of the evaluation. Additionally, we present a manually curated, high-quality Chinese-language dataset covering self-harm, suicidal ideation, and existential distress, derived from real-world online discourse. Experiments on 3600 judgments show that our method achieves the highest agreement with expert assessments and produces more interpretable evaluation rationales compared to existing approaches. Our dataset and evaluation tool are publicly available to facilitate further research.

Title: Jinx: Unlimited LLMs for Probing Alignment Failures

Authors: Jiahao Zhao, Liwei Dong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.08243
Pdf URL: https://arxiv.org/pdf/2508.08243
Copy Paste: [[2508.08243]] Jinx: Unlimited LLMs for Probing Alignment Failures(https://arxiv.org/abs/2508.08243)
Keywords: language model, llm
Abstract: Unlimited, or so-called helpful-only, language models are trained without safety alignment constraints and never refuse user queries. They are widely used by leading AI companies as internal tools for red teaming and alignment evaluation. For example, if a safety-aligned model produces harmful outputs similar to an unlimited model, this indicates alignment failures that require further attention. Despite their essential role in assessing alignment, such models are not available to the research community. We introduce Jinx, a helpful-only variant of popular open-weight LLMs. Jinx responds to all queries without refusals or safety filtering, while preserving the base model's capabilities in reasoning and instruction following. It provides researchers with an accessible tool for probing alignment failures, evaluating safety boundaries, and systematically studying failure modes in language model safety.