2025-09-01

Title: Mapping Toxic Comments Across Demographics: A Dataset from German Public Broadcasting

Authors: Jan Fillies, Michael Peter Hoffmann, Rebecca Reichel, Roman Salzwedel, Sven Bodemer, Adrian Paschke
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2508.21084
Pdf URL: https://arxiv.org/pdf/2508.21084
Copy Paste: [[2508.21084]] Mapping Toxic Comments Across Demographics: A Dataset from German Public Broadcasting(https://arxiv.org/abs/2508.21084)
Keywords: language model, llm
Abstract: A lack of demographic context in existing toxic speech datasets limits our understanding of how different age groups communicate online. In collaboration with funk, a German public service content network, this research introduces the first large-scale German dataset annotated for toxicity and enriched with platform-provided age estimates. The dataset includes 3,024 human-annotated and 30,024 LLM-annotated anonymized comments from Instagram, TikTok, and YouTube. To ensure relevance, comments were consolidated using predefined toxic keywords, resulting in 16.7% labeled as problematic. The annotation pipeline combined human expertise with state-of-the-art language models, identifying key categories such as insults, disinformation, and criticism of broadcasting fees. The dataset reveals age-based differences in toxic speech patterns, with younger users favoring expressive language and older users more often engaging in disinformation and devaluation. This resource provides new opportunities for studying linguistic variation across demographics and supports the development of more equitable and age-aware content moderation systems.

Title: How Does Cognitive Bias Affect Large Language Models? A Case Study on the Anchoring Effect in Price Negotiation Simulations

Authors: Yoshiki Takenami, Yin Jou Huang, Yugo Murawaki, Chenhui Chu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.21137
Pdf URL: https://arxiv.org/pdf/2508.21137
Copy Paste: [[2508.21137]] How Does Cognitive Bias Affect Large Language Models? A Case Study on the Anchoring Effect in Price Negotiation Simulations(https://arxiv.org/abs/2508.21137)
Keywords: language model, llm, agent
Abstract: Cognitive biases, well-studied in humans, can also be observed in LLMs, affecting their reliability in real-world applications. This paper investigates the anchoring effect in LLM-driven price negotiations. To this end, we instructed seller LLM agents to apply the anchoring effect and evaluated negotiations using not only an objective metric but also a subjective metric. Experimental results show that LLMs are influenced by the anchoring effect like humans. Additionally, we investigated the relationship between the anchoring effect and factors such as reasoning and personality. It was shown that reasoning models are less prone to the anchoring effect, suggesting that the long chain of thought mitigates the effect. However, we found no significant correlation between personality traits and susceptibility to the anchoring effect. These findings contribute to a deeper understanding of cognitive biases in LLMs and to the realization of safe and responsible application of LLMs in society.

Title: Can Multimodal LLMs Solve the Basic Perception Problems of Percept-V?

Authors: Samrajnee Ghosh, Naman Agarwal, Hemanshu Garg, Chinmay Mittal, Mausam, Parag Singla
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2508.21143
Pdf URL: https://arxiv.org/pdf/2508.21143
Copy Paste: [[2508.21143]] Can Multimodal LLMs Solve the Basic Perception Problems of Percept-V?(https://arxiv.org/abs/2508.21143)
Keywords: language model, gpt, llm
Abstract: The reasoning abilities of Multimodal Large Language Models (MLLMs) have garnered a lot of attention in recent times, with advances made in frontiers like coding, mathematics, and science. However, very limited experiments have been done to assess their performance in simple perception tasks performed over uncontaminated, generated images containing basic shapes and structures. To address this issue, the paper introduces a dataset, Percept-V, containing a total of 7200 program-generated images equally divided into 30 categories, each testing a combination of visual perception skills. Unlike previously proposed datasets, Percept-V comprises very basic tasks of varying complexity that test the perception abilities of MLLMs. This dataset is then tested on state-of-the-art MLLMs like GPT-4o, Gemini, and Claude as well as Large Reasoning Models (LRMs) like OpenAI o4-mini and DeepSeek R1 to gauge their performance. Contrary to the evidence that MLLMs excel in many complex tasks, our experiments show a significant drop in the models' performance with increasing problem complexity across all categories. An analysis of performance also reveals that the tested MLLMs exhibit a similar trend in accuracy across categories that test a particular cognitive skill, and that some skills are more difficult than others.

Title: A Survey of Scientific Large Language Models: From Data Foundations to Agent Frontiers

Authors: Ming Hu, Chenglong Ma, Wei Li, Wanghan Xu, Jiamin Wu, Jucheng Hu, Tianbin Li, Guohang Zhuang, Jiaqi Liu, Yingzhou Lu, Ying Chen, Chaoyang Zhang, Cheng Tan, Jie Ying, Guocheng Wu, Shujian Gao, Pengcheng Chen, Jiashi Lin, Haitao Wu, Lulu Chen, Fengxiang Wang, Yuanyuan Zhang, Xiangyu Zhao, Feilong Tang, Encheng Su, Junzhi Ning, Xinyao Liu, Ye Du, Changkai Ji, Cheng Tang, Huihui Xu, Ziyang Chen, Ziyan Huang, Jiyao Liu, Pengfei Jiang, Yizhou Wang, Chen Tang, Jianyu Wu, Yuchen Ren, Siyuan Yan, Zhonghua Wang, Zhongxing Xu, Shiyan Su, Shangquan Sun, Runkai Zhao, Zhisheng Zhang, Yu Liu, Fudi Wang, Yuanfeng Ji, Yanzhou Su, Hongming Shan, Chunmei Feng, Jiahao Xu, Jiangtao Yan, Wenhao Tang, Diping Song, Lihao Liu, Yanyan Huang, Lequan Yu, Bin Fu, Shujun Wang, Xiaomeng Li, Xiaowei Hu, Yun Gu, Ben Fei, Zhongying Deng, Benyou Wang, Yuewen Cao, Minjie Shen, Haodong Duan, Jie Xu, Yirong Chen, Fang Yan, Hongxia Hao, Jielan Li, Jiajun Du, Yanbo Wang, Imran Razzak, Chi Zhang, Lijun Wu, Conghui He, Zhaohui Lu, Jinhai Huang, Yihao Liu, Fenghua Ling, Yuqiang Li, Aoran Wang, Qihao Zheng, Nanqing Dong, Tianfan Fu, Dongzhan Zhou, Yan Lu, Wenlong Zhang, Jin Ye, Jianfei Cai, Wanli Ouyang, Yu Qiao, Zongyuan Ge, Shixiang Tang, Junjun He
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.21148
Pdf URL: https://arxiv.org/pdf/2508.21148
Copy Paste: [[2508.21148]] A Survey of Scientific Large Language Models: From Data Foundations to Agent Frontiers(https://arxiv.org/abs/2508.21148)
Keywords: language model, llm, agent
Abstract: Scientific Large Language Models (Sci-LLMs) are transforming how knowledge is represented, integrated, and applied in scientific research, yet their progress is shaped by the complex nature of scientific data. This survey presents a comprehensive, data-centric synthesis that reframes the development of Sci-LLMs as a co-evolution between models and their underlying data substrate. We formulate a unified taxonomy of scientific data and a hierarchical model of scientific knowledge, emphasizing the multimodal, cross-scale, and domain-specific challenges that differentiate scientific corpora from general natural language processing datasets. We systematically review recent Sci-LLMs, from general-purpose foundations to specialized models across diverse scientific disciplines, alongside an extensive analysis of over 270 pre-/post-training datasets, showing why Sci-LLMs pose distinct demands -- heterogeneous, multi-scale, uncertainty-laden corpora that require representations preserving domain invariance and enabling cross-modal reasoning. On evaluation, we examine over 190 benchmark datasets and trace a shift from static exams toward process- and discovery-oriented assessments with advanced evaluation protocols. These data-centric analyses highlight persistent issues in scientific data development and discuss emerging solutions involving semi-automated annotation pipelines and expert validation. Finally, we outline a paradigm shift toward closed-loop systems where autonomous agents based on Sci-LLMs actively experiment, validate, and contribute to a living, evolving knowledge base. Collectively, this work provides a roadmap for building trustworthy, continually evolving artificial intelligence (AI) systems that function as a true partner in accelerating scientific discovery.

Title: Quantifying Label-Induced Bias in Large Language Model Self- and Cross-Evaluations

Authors: Muskan Saraf, Sajjad Rezvani Boroujeni, Justin Beaudry, Hossein Abedi, Tom Bush
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.21164
Pdf URL: https://arxiv.org/pdf/2508.21164
Copy Paste: [[2508.21164]] Quantifying Label-Induced Bias in Large Language Model Self- and Cross-Evaluations(https://arxiv.org/abs/2508.21164)
Keywords: language model, gpt, llm, chat
Abstract: Large language models (LLMs) are increasingly used to evaluate outputs, yet their judgments may be influenced by the perceived identity of the model that produced the output. This study examines bias in self- and cross-model evaluations by ChatGPT, Gemini, and Claude under four conditions: no labels, true labels, and two false-label scenarios. Blog posts authored by each model were evaluated by all three using both overall preference voting and quality ratings for Coherence, Informativeness, and Conciseness, with all scores expressed as percentages for direct comparison. Results reveal striking asymmetries: the "Claude" label consistently boosts scores, while the "Gemini" label consistently depresses them, regardless of actual content. False labels frequently reversed rankings, producing shifts of up to 50 percentage points in preference votes and up to 12 percentage points in converted quality ratings. Gemini's self-scores collapsed under true labels, while Claude's self-preference intensified. These findings show that perceived model identity can heavily distort high-level judgments and subtly influence detailed quality ratings, underscoring the need for blind or multimodel evaluation protocols to ensure fairness in LLM benchmarking.

Title: BED-LLM: Intelligent Information Gathering with LLMs and Bayesian Experimental Design

Authors: Deepro Choudhury, Sinead Williamson, Adam Goliński, Ning Miao, Freddie Bickford Smith, Michael Kirchhof, Yizhe Zhang, Tom Rainforth
Subjects: cs.CL, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2508.21184
Pdf URL: https://arxiv.org/pdf/2508.21184
Copy Paste: [[2508.21184]] BED-LLM: Intelligent Information Gathering with LLMs and Bayesian Experimental Design(https://arxiv.org/abs/2508.21184)
Keywords: language model, llm, prompt, agent
Abstract: We propose a general-purpose approach for improving the ability of Large Language Models (LLMs) to intelligently and adaptively gather information from a user or other external source using the framework of sequential Bayesian experimental design (BED). This enables LLMs to act as effective multi-turn conversational agents and interactively interface with external environments. Our approach, which we call BED-LLM (Bayesian Experimental Design with Large Language Models), is based on iteratively choosing questions or queries that maximize the expected information gain (EIG) about the task of interest given the responses gathered previously. We show how this EIG can be formulated in a principled way using a probabilistic model derived from the LLM's belief distribution and provide detailed insights into key decisions in its construction. Further key to the success of BED-LLM are a number of specific innovations, such as a carefully designed estimator for the EIG, not solely relying on in-context updates for conditioning on previous responses, and a targeted strategy for proposing candidate queries. We find that BED-LLM achieves substantial gains in performance across a wide range of tests based on the 20-questions game and using the LLM to actively infer user preferences, compared to direct prompting of the LLM and other adaptive design strategies.
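The question-selection idea in the BED-LLM abstract can be illustrated with a toy version of the 20-questions setting. The sketch below is not the paper's estimator. It assumes a small explicit hypothesis set, a known prior, and noiseless yes/no answers, in which case the expected information gain (EIG) of a question equals the entropy of its answer distribution. All names (`expected_information_gain`, the animal hypotheses, the candidate questions) are hypothetical.

```python
import math
from typing import Callable, Dict, Iterable


def entropy(probs: Iterable[float]) -> float:
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def expected_information_gain(prior: Dict[str, float],
                              answer: Callable[[str], bool]) -> float:
    """EIG of a yes/no question, assuming the answer is fully determined
    by the true hypothesis (a simplifying assumption for this sketch).

    EIG = H(prior) - E_outcome[ H(posterior | outcome) ]
    """
    eig = entropy(prior.values())
    for outcome in (True, False):
        # Probability mass of hypotheses consistent with this outcome.
        mass = sum(p for h, p in prior.items() if answer(h) == outcome)
        if mass == 0:
            continue
        posterior = [p / mass for h, p in prior.items() if answer(h) == outcome]
        eig -= mass * entropy(posterior)
    return eig


# Hypothetical hypothesis set with a uniform prior.
prior = {"cat": 0.25, "dog": 0.25, "eagle": 0.25, "shark": 0.25}

# Hypothetical candidate questions, each mapping a hypothesis to its answer.
questions = {
    "Is it a mammal?": lambda h: h in {"cat", "dog"},
    "Is it a shark?": lambda h: h == "shark",
}

scores = {q: expected_information_gain(prior, a) for q, a in questions.items()}
best_question = max(scores, key=scores.get)
```

In this toy setup the question that splits the hypotheses evenly scores highest, which is the behavior an EIG-maximizing strategy is meant to produce. A real system would need to estimate the prior and the answer likelihoods from the LLM rather than enumerate them.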

Title: Improving Aviation Safety Analysis: Automated HFACS Classification Using Reinforcement Learning with Group Relative Policy Optimization

Authors: Arash Ahmadi, Sarah Sharif, Yaser Banad
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.21201
Pdf URL: https://arxiv.org/pdf/2508.21201
Copy Paste: [[2508.21201]] Improving Aviation Safety Analysis: Automated HFACS Classification Using Reinforcement Learning with Group Relative Policy Optimization(https://arxiv.org/abs/2508.21201)
Keywords: language model, gpt, llm
Abstract: Analyzing the human factors behind aviation accidents is crucial for preventing future incidents, yet traditional methods using the Human Factors Analysis and Classification System (HFACS) are limited by scalability and consistency. To address this, we introduce an automated HFACS classification framework for aviation safety analysis that utilizes Reinforcement Learning with Group Relative Policy Optimization (GRPO) to fine-tune a Llama-3.1 8B language model. Our approach incorporates a multi-component reward system tailored for aviation safety analysis and integrates synthetic data generation to overcome class imbalance in accident datasets. The resulting GRPO-optimized model achieved noticeable performance gains, including a 350% increase in exact match accuracy (from 0.0400 to 0.1800) and an improved partial match accuracy of 0.8800. Significantly, our specialized model outperforms state-of-the-art LLMs (Large Language Models), including GPT-5-mini and Gemini-2.5-flash, on key metrics. This research also proposes exact match accuracy in the multi-label HFACS classification problem as a new benchmarking methodology to evaluate the advanced reasoning capabilities of language models. Ultimately, our work validates that smaller, domain-optimized models can provide a computationally efficient and better solution for critical safety analysis. This approach makes powerful, low-latency deployment on resource-constrained edge devices feasible.
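The 350% figure in the HFACS abstract is a relative increase, which can be checked from the two exact-match scores it reports. The sketch below also shows why exact match is a strict metric for multi-label classification. The partial-match definition used here (at least one shared label) is an assumption for illustration, since the abstract does not define it, and the example label sets are invented.

```python
from typing import List, Set


def relative_increase(before: float, after: float) -> float:
    """Relative change from `before` to `after`, as a percentage."""
    return (after - before) / before * 100.0


def exact_match_accuracy(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Fraction of examples whose predicted label set equals the gold set."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def partial_match_accuracy(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Fraction of examples sharing at least one label with the gold set.
    This definition is an assumption, not taken from the paper."""
    return sum(bool(g & p) for g, p in zip(gold, pred)) / len(gold)


# Exact-match scores quoted in the abstract; the result is about 350 percent.
gain = relative_increase(0.0400, 0.1800)

# Invented two-example dataset with HFACS-style category names.
gold = [{"skill-based error"}, {"decision error", "adverse mental state"}]
pred = [{"skill-based error"}, {"decision error"}]

em = exact_match_accuracy(gold, pred)    # second example misses one label
pm = partial_match_accuracy(gold, pred)  # both examples overlap with gold
```

The second prediction is counted as correct under the assumed partial-match rule but wrong under exact match, which is consistent with the large gap between the two scores reported in the abstract.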

Title: Enhancing Robustness of Autoregressive Language Models against Orthographic Attacks via Pixel-based Approach

Authors: Han Yang, Jian Lan, Yihong Liu, Hinrich Schütze, Thomas Seidl
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.21206
Pdf URL: https://arxiv.org/pdf/2508.21206
Copy Paste: [[2508.21206]] Enhancing Robustness of Autoregressive Language Models against Orthographic Attacks via Pixel-based Approach(https://arxiv.org/abs/2508.21206)
Keywords: language model
Abstract: Autoregressive language models are vulnerable to orthographic attacks, where input text is perturbed with characters from multilingual alphabets, leading to substantial performance degradation. This vulnerability primarily stems from the out-of-vocabulary issue inherent in subword tokenizers and their embeddings. To address this limitation, we propose a pixel-based generative language model that replaces the text-based embeddings with pixel-based representations by rendering words as individual images. This design provides stronger robustness to noisy inputs while extending compatibility to multilingual text across diverse writing systems. We evaluate the proposed method on the multilingual LAMBADA dataset, WMT24 dataset and the SST-2 benchmark, demonstrating both its resilience to orthographic noise and its effectiveness in multilingual settings.

Title: Do Self-Supervised Speech Models Exhibit the Critical Period Effects in Language Acquisition?

Authors: Yurie Koga, Shunsuke Kando, Yusuke Miyao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.21210
Pdf URL: https://arxiv.org/pdf/2508.21210
Copy Paste: [[2508.21210]] Do Self-Supervised Speech Models Exhibit the Critical Period Effects in Language Acquisition?(https://arxiv.org/abs/2508.21210)
Keywords: language model
Abstract: This paper investigates whether the Critical Period (CP) effects in human language acquisition are observed in self-supervised speech models (S3Ms). CP effects refer to greater difficulty in acquiring a second language (L2) with delayed L2 exposure onset, and greater retention of their first language (L1) with delayed L1 exposure offset. While previous work has studied these effects using textual language models, their presence in speech models remains underexplored despite the central role of spoken language in human language acquisition. We train S3Ms with varying L2 training onsets and L1 training offsets on child-directed speech and evaluate their phone discrimination performance. We find that S3Ms do not exhibit clear evidence of either CP effect in terms of phonological acquisition. Notably, models with delayed L2 exposure onset tend to perform better on L2, and delayed L1 exposure offset leads to L1 forgetting.

Title: Decoding Memories: An Efficient Pipeline for Self-Consistency Hallucination Detection

Authors: Weizhi Gao, Xiaorui Liu, Feiyi Wang, Dan Lu, Junqi Yin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.21228
Pdf URL: https://arxiv.org/pdf/2508.21228
Copy Paste: [[2508.21228]] Decoding Memories: An Efficient Pipeline for Self-Consistency Hallucination Detection(https://arxiv.org/abs/2508.21228)
Keywords: language model, llm, hallucination
Abstract: Large language models (LLMs) have demonstrated impressive performance in both research and real-world applications, but they still struggle with hallucination. Existing hallucination detection methods often perform poorly on sentence-level generation or rely heavily on domain-specific knowledge. While self-consistency approaches help address these limitations, they incur high computational costs due to repeated generation. In this paper, we conduct the first study on identifying redundancy in self-consistency methods, manifested as shared prefix tokens across generations, and observe that non-exact-answer tokens contribute minimally to the semantic content. Based on these insights, we propose a novel Decoding Memory Pipeline (DMP) that accelerates generation through selective inference and annealed decoding. Being orthogonal to the model, dataset, decoding strategy, and self-consistency baseline, our DMP consistently improves the efficiency of multi-response generation and holds promise for extension to alignment and reasoning tasks. Extensive experiments show that our method achieves up to a 3x speedup without sacrificing AUROC performance.
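The redundancy described in the Decoding Memories abstract, shared prefix tokens across repeated generations, can be measured on toy data. The sketch below is not the Decoding Memory Pipeline itself. It only computes the longest common token prefix of several sampled responses and the share of all generated tokens that this prefix accounts for. Whitespace tokenization, the function names, and the sample responses are assumptions for illustration.

```python
from typing import List


def common_prefix_length(sequences: List[List[str]]) -> int:
    """Number of leading tokens shared by every sequence."""
    n = 0
    for tokens in zip(*sequences):
        # Stop at the first position where the sequences disagree.
        if len(set(tokens)) != 1:
            break
        n += 1
    return n


def shared_prefix_fraction(sequences: List[List[str]]) -> float:
    """Share of all tokens that belong to the common prefix."""
    total = sum(len(s) for s in sequences)
    if total == 0:
        return 0.0
    return common_prefix_length(sequences) * len(sequences) / total


# Invented responses standing in for repeated samples to one prompt.
responses = [
    "The capital of Australia is Canberra",
    "The capital of Australia is Sydney",
    "The capital of Australia is Canberra .",
]
tokenized = [r.split() for r in responses]  # whitespace tokens, an assumption

prefix_len = common_prefix_length(tokenized)
fraction = shared_prefix_fraction(tokenized)
```

Here the responses differ only in the answer token and what follows it, so most generated tokens fall in the shared prefix. This is the kind of repeated work that a self-consistency method could in principle avoid recomputing.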

Title: BLUEX Revisited: Enhancing Benchmark Coverage with Automatic Captioning

Authors: João Guilherme Alves Santos, Giovana Kerche Bonás, Thales Sales Almeida
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.21294
Pdf URL: https://arxiv.org/pdf/2508.21294
Copy Paste: [[2508.21294]] BLUEX Revisited: Enhancing Benchmark Coverage with Automatic Captioning(https://arxiv.org/abs/2508.21294)
Keywords: language model, llm
Abstract: With the growing capabilities of Large Language Models (LLMs), there is an increasing need for robust evaluation methods, especially in multilingual and non-English contexts. We present an updated version of the BLUEX dataset, now including 2024-2025 exams and automatically generated image captions using state-of-the-art models, enhancing its relevance for data contamination studies in LLM pretraining. Captioning strategies increase accessibility to text-only models by more than 40%, producing 1,422 usable questions, more than doubling the number in the original BLUEX. We evaluated commercial and open-source LLMs and their ability to leverage visual context through captions.

Title: Challenges and Applications of Large Language Models: A Comparison of GPT and DeepSeek family of models

Authors: Shubham Sharma, Sneha Tuli, Narendra Badam
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2508.21377
Pdf URL: https://arxiv.org/pdf/2508.21377
Copy Paste: [[2508.21377]] Challenges and Applications of Large Language Models: A Comparison of GPT and DeepSeek family of models(https://arxiv.org/abs/2508.21377)
Keywords: language model, gpt, llm, chat
Abstract: Large Language Models (LLMs) are transforming AI across industries, but their development and deployment remain complex. This survey reviews 16 key challenges in building and using LLMs and examines how these challenges are addressed by two state-of-the-art models with unique approaches: OpenAI's closed source GPT-4o (May 2024 update) and DeepSeek-V3-0324 (March 2025), a large open source Mixture-of-Experts model. Through this comparison, we showcase the trade-offs between closed source models (robust safety, fine-tuned reliability) and open source models (efficiency, adaptability). We also explore LLM applications across different domains (from chatbots and coding tools to healthcare and education), highlighting which model attributes are best suited for each use case. This article aims to guide AI researchers, developers, and decision-makers in understanding current LLM capabilities, limitations, and best practices.

Title: Normality and the Turing Test

Authors: Alexandre Kabbach
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.21382
Pdf URL: https://arxiv.org/pdf/2508.21382
Copy Paste: [[2508.21382]] Normality and the Turing Test(https://arxiv.org/abs/2508.21382)
Keywords: language model, gpt, chat
Abstract: This paper proposes to revisit the Turing test through the concept of normality. Its core argument is that the statistical interpretation of the normal--understood as the average both in the normative and mathematical sense of the term--proves useful for understanding the Turing test in at least two ways. First, in the sense that the Turing test targets normal/average rather than exceptional human intelligence, so that successfully passing the test requires building machines that "make mistakes" and display imperfect behavior just like normal/average humans. Second, in the sense that the Turing test is a statistical test where judgments of intelligence are never carried out by a single "average" judge (understood as non-expert) but always by a full jury. As such, the notion of "average human interrogator" that Turing talks about in his original paper should be understood primarily as referring to a mathematical abstraction made of the normalized aggregate of individual judgments of multiple judges. In short, this paper argues that the Turing test is a test of normal intelligence as assessed by a normal judge characterizing the average judgment of a pool of human interrogators. Its conclusions are twofold. First, it argues that large language models such as ChatGPT are unlikely to pass the Turing test as those models precisely target exceptional rather than normal/average human intelligence. As such, they constitute models of what it proposes to call artificial smartness rather than artificial intelligence per se. Second, it argues that the core question of whether the Turing test can contribute anything to the understanding of human cognition is that of whether the human mind is really reducible to the normal/average mind--a question which largely extends beyond the Turing test itself and questions the conceptual underpinnings of the normalist paradigm it belongs to.

Title: AllSummedUp: un framework open-source pour comparer les metriques d'evaluation de resume

Authors: Tanguy Herserant, Vincent Guigue
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.21389
Pdf URL: https://arxiv.org/pdf/2508.21389
Copy Paste: [[2508.21389]] AllSummedUp: un framework open-source pour comparer les metriques d'evaluation de resume(https://arxiv.org/abs/2508.21389)
Keywords: llm
Abstract: This paper investigates reproducibility challenges in automatic text summarization evaluation. Based on experiments conducted across six representative metrics ranging from classical approaches like ROUGE to recent LLM-based methods (G-Eval, SEval-Ex), we highlight significant discrepancies between reported performances in the literature and those observed in our experimental setting. We introduce a unified, open-source framework, applied to the SummEval dataset and designed to support fair and transparent comparison of evaluation metrics. Our results reveal a structural trade-off: metrics with the highest alignment with human judgments tend to be computationally intensive and less stable across runs. Beyond comparative analysis, this study highlights key concerns about relying on LLMs for evaluation, stressing their randomness, technical dependencies, and limited reproducibility. We advocate for more robust evaluation protocols including exhaustive documentation and methodological standardization to ensure greater reliability in automatic summarization assessment.

Title: Automatic Reviewers Fail to Detect Faulty Reasoning in Research Papers: A New Counterfactual Evaluation Framework

Authors: Nils Dycke, Iryna Gurevych
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.21422
Pdf URL: https://arxiv.org/pdf/2508.21422
Copy Paste: [[2508.21422]] Automatic Reviewers Fail to Detect Faulty Reasoning in Research Papers: A New Counterfactual Evaluation Framework(https://arxiv.org/abs/2508.21422)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have great potential to accelerate and support scholarly peer review and are increasingly used as fully automatic review generators (ARGs). However, potential biases and systematic errors may pose significant risks to scientific integrity; understanding the specific capabilities and limitations of state-of-the-art ARGs is essential. We focus on a core reviewing skill that underpins high-quality peer review: detecting faulty research logic. This involves evaluating the internal consistency between a paper's results, interpretations, and claims. We present a fully automated counterfactual evaluation framework that isolates and tests this skill under controlled conditions. Testing a range of ARG approaches, we find that, contrary to expectation, flaws in research logic have no significant effect on their output reviews. Based on our findings, we derive three actionable recommendations for future work and release our counterfactual dataset and evaluation framework publicly.

Title: Med-RewardBench: Benchmarking Reward Models and Judges for Medical Multimodal Large Language Models

Authors: Meidan Ding, Jipeng Zhang, Wenxuan Wang, Cheng-Yi Li, Wei-Chieh Fang, Hsin-Yu Wu, Haiqin Zhong, Wenting Chen, Linlin Shen
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2508.21430
Pdf URL: https://arxiv.org/pdf/2508.21430
Copy Paste: [[2508.21430]] Med-RewardBench: Benchmarking Reward Models and Judges for Medical Multimodal Large Language Models(https://arxiv.org/abs/2508.21430)
Keywords: language model, llm
Abstract: Multimodal large language models (MLLMs) hold significant potential in medical applications, including disease diagnosis and clinical decision-making. However, these tasks require highly accurate, context-sensitive, and professionally aligned responses, making reliable reward models and judges critical. Despite their importance, medical reward models (MRMs) and judges remain underexplored, with no dedicated benchmarks addressing clinical requirements. Existing benchmarks focus on general MLLM capabilities or evaluate models as solvers, neglecting essential evaluation dimensions like diagnostic accuracy and clinical relevance. To address this, we introduce Med-RewardBench, the first benchmark specifically designed to evaluate MRMs and judges in medical scenarios. Med-RewardBench features a multimodal dataset spanning 13 organ systems and 8 clinical departments, with 1,026 expert-annotated cases. A rigorous three-step process ensures high-quality evaluation data across six clinically critical dimensions. We evaluate 32 state-of-the-art MLLMs, including open-source, proprietary, and medical-specific models, revealing substantial challenges in aligning outputs with expert judgment. Additionally, we develop baseline models that demonstrate substantial performance improvements through fine-tuning.
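Summary: Multimodal large language models (MLLMs) hold significant potential in medical applications, including disease diagnosis and clinical decision-making. These tasks, however, require highly accurate, context-sensitive, and professionally aligned responses, which makes reliable reward models and judges critical. Despite their importance, medical reward models (MRMs) and judges remain underexplored, and no dedicated benchmark addresses clinical requirements. Existing benchmarks focus on general MLLM capabilities or evaluate models as solvers, neglecting essential evaluation dimensions such as diagnostic accuracy and clinical relevance. To address this, we introduce Med-RewardBench, the first benchmark specifically designed to evaluate MRMs and judges in medical scenarios. Med-RewardBench features a multimodal dataset spanning 13 organ systems and 8 clinical departments, with 1,026 expert-annotated cases. A rigorous three-step process ensures high-quality evaluation data across six clinically critical dimensions. We evaluate 32 state-of-the-art MLLMs, including open-source, proprietary, and medical-specific models, and find substantial challenges in aligning their outputs with expert judgment. We also develop baseline models that show substantial performance improvements through fine-tuning.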

Title: Discovering Semantic Subdimensions through Disentangled Conceptual Representations

Authors: Yunhao Zhang, Shaonan Wang, Nan Lin, Xinyi Dong, Chong Li, Chengqing Zong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.21436
Pdf URL: https://arxiv.org/pdf/2508.21436
Copy Paste: [[2508.21436]] Discovering Semantic Subdimensions through Disentangled Conceptual Representations(https://arxiv.org/abs/2508.21436)
Keywords: language model
Abstract: Understanding the core dimensions of conceptual semantics is fundamental to uncovering how meaning is organized in language and the brain. Existing approaches often rely on predefined semantic dimensions that offer only broad representations, overlooking finer conceptual distinctions. This paper proposes a novel framework to investigate the subdimensions underlying coarse-grained semantic dimensions. Specifically, we introduce a Disentangled Continuous Semantic Representation Model (DCSRM) that decomposes word embeddings from large language models into multiple sub-embeddings, each encoding specific semantic information. Using these sub-embeddings, we identify a set of interpretable semantic subdimensions. To assess their neural plausibility, we apply voxel-wise encoding models to map these subdimensions to brain activation. Our work offers more fine-grained interpretable semantic subdimensions of conceptual meaning. Further analyses reveal that semantic dimensions are structured according to distinct principles, with polarity emerging as a key factor driving their decomposition into subdimensions. The neural correlates of the identified subdimensions support their cognitive and neuroscientific plausibility.
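Summary: Understanding the core dimensions of conceptual semantics is fundamental to uncovering how meaning is organized in language and the brain. Existing approaches often rely on predefined semantic dimensions that offer only broad representations and overlook finer conceptual distinctions. This paper proposes a novel framework for investigating the subdimensions that underlie coarse-grained semantic dimensions. Specifically, we introduce a Disentangled Continuous Semantic Representation Model (DCSRM) that decomposes word embeddings from large language models into multiple sub-embeddings, each encoding specific semantic information. Using these sub-embeddings, we identify a set of interpretable semantic subdimensions. To assess their neural plausibility, we apply voxel-wise encoding models that map these subdimensions to brain activation. Our work offers finer-grained, interpretable semantic subdimensions of conceptual meaning. Further analyses show that semantic dimensions are structured according to distinct principles, with polarity emerging as a key factor driving their decomposition into subdimensions. The neural correlates of the identified subdimensions support their cognitive and neuroscientific plausibility.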

Title: Beyond the Surface: Probing the Ideological Depth of Large Language Models

Authors: Shariar Kabir, Kevin Esterling, Yue Dong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.21448
Pdf URL: https://arxiv.org/pdf/2508.21448
Copy Paste: [[2508.21448]] Beyond the Surface: Probing the Ideological Depth of Large Language Models(https://arxiv.org/abs/2508.21448)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) have demonstrated pronounced ideological leanings, yet the stability and depth of these positions remain poorly understood. Surface-level responses can often be manipulated through simple prompt engineering, calling into question whether they reflect a coherent underlying ideology. This paper investigates the concept of "ideological depth" in LLMs, defined as the robustness and complexity of their internal political representations. We employ a dual approach: first, we measure the "steerability" of two well-known open-source LLMs using instruction prompting and activation steering. We find that while some models can easily switch between liberal and conservative viewpoints, others exhibit resistance or an increased rate of refusal, suggesting a more entrenched ideological structure. Second, we probe the internal mechanisms of these models using Sparse Autoencoders (SAEs). Preliminary analysis reveals that models with lower steerability possess more distinct and abstract ideological features. Our evaluations reveal that one model can contain 7.3x more political features than another model of similar size. This allows targeted ablation of a core political feature in an ideologically "deep" model, leading to consistent, logical shifts in its reasoning across related topics, whereas the same intervention in a "shallow" model results in an increase in refusal outputs. Our findings suggest that ideological depth is a quantifiable property of LLMs and that steerability serves as a valuable window into their latent political architecture.
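Summary: Large language models (LLMs) have shown pronounced ideological leanings, yet the stability and depth of these positions remain poorly understood. Surface-level responses can often be manipulated through simple prompt engineering, which calls into question whether they reflect a coherent underlying ideology. This paper investigates the concept of "ideological depth" in LLMs, defined as the robustness and complexity of their internal political representations. We take a dual approach. First, we measure the "steerability" of two well-known open-source LLMs using instruction prompting and activation steering. We find that while some models switch easily between liberal and conservative viewpoints, others show resistance or an increased refusal rate, suggesting a more entrenched ideological structure. Second, we probe the internal mechanisms of these models using sparse autoencoders (SAEs). Preliminary analysis shows that models with lower steerability possess more distinct and abstract ideological features. Our evaluations show that one model can contain 7.3x more political features than another model of similar size. This allows targeted ablation of a core political feature in an ideologically "deep" model, leading to consistent, logical shifts in its reasoning across related topics, whereas the same intervention in a "shallow" model increases refusal outputs. Our findings suggest that ideological depth is a quantifiable property of LLMs and that steerability is a valuable window into their latent political architecture.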

Title: Igniting Creative Writing in Small Language Models: LLM-as-a-Judge versus Multi-Agent Refined Rewards

Authors: Xiaolong Wei, Bo Lu, Xingyu Zhang, Zhejun Zhao, Dongdong Shen, Long Xia, Dawei Yin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.21476
Pdf URL: https://arxiv.org/pdf/2508.21476
Copy Paste: [[2508.21476]] Igniting Creative Writing in Small Language Models: LLM-as-a-Judge versus Multi-Agent Refined Rewards(https://arxiv.org/abs/2508.21476)
Keywords: language model, llm, agent
Abstract: Large Language Models (LLMs) have demonstrated remarkable creative writing capabilities, yet their substantial computational demands hinder widespread use. Enhancing Small Language Models (SLMs) offers a promising alternative, but current methods like Supervised Fine-Tuning (SFT) struggle with novelty, and Reinforcement Learning from Human Feedback (RLHF) is costly. This paper explores two distinct AI-driven reward strategies within a Reinforcement Learning from AI Feedback (RLAIF) framework to ignite the creative writing of a 7B-parameter SLM, specifically for generating Chinese greetings. The first strategy employs a reward model (RM) trained on high-quality preference data curated by a novel multi-agent rejection sampling framework designed for creative tasks. The second, more novel strategy utilizes a principle-guided LLM-as-a-Judge, whose reward function is optimized via an adversarial training scheme with a reflection mechanism, to directly provide reward signals. Comprehensive experiments reveal that while both approaches significantly enhance creative output over baselines, the principle-guided LLM-as-a-Judge demonstrably yields superior generation quality. Furthermore, it offers notable advantages in training efficiency and reduced dependency on human-annotated data, presenting a more scalable and effective path towards creative SLMs. Our automated evaluation methods also exhibit strong alignment with human judgments. Our code and data are publicly available at this https URL.
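Summary: Large language models (LLMs) have demonstrated remarkable creative writing capabilities, but their substantial computational demands hinder widespread use. Enhancing small language models (SLMs) is a promising alternative, yet current methods such as supervised fine-tuning (SFT) struggle with novelty, and reinforcement learning from human feedback (RLHF) is costly. This paper explores two distinct AI-driven reward strategies within a reinforcement learning from AI feedback (RLAIF) framework to ignite the creative writing of a 7B-parameter SLM, specifically for generating Chinese greetings. The first strategy uses a reward model (RM) trained on high-quality preference data curated by a novel multi-agent rejection sampling framework designed for creative tasks. The second, more novel strategy uses a principle-guided LLM-as-a-Judge, whose reward function is optimized through an adversarial training scheme with a reflection mechanism, to provide reward signals directly. Comprehensive experiments show that both approaches significantly improve creative output over baselines, but the principle-guided LLM-as-a-Judge yields demonstrably better generation quality. It also offers notable advantages in training efficiency and reduces dependence on human-annotated data, providing a more scalable and effective path toward creative SLMs. Our automated evaluation methods also align strongly with human judgments. Our code and data are publicly available at this https URL.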

Title: A Survey on Current Trends and Recent Advances in Text Anonymization

Authors: Tobias Deußer, Lorenz Sparrenberg, Armin Berger, Max Hahnbück, Christian Bauckhage, Rafet Sifa
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.21587
Pdf URL: https://arxiv.org/pdf/2508.21587
Copy Paste: [[2508.21587]] A Survey on Current Trends and Recent Advances in Text Anonymization(https://arxiv.org/abs/2508.21587)
Keywords: language model, llm
Abstract: The proliferation of textual data containing sensitive personal information across various domains requires robust anonymization techniques to protect privacy and comply with regulations, while preserving data usability for diverse and crucial downstream tasks. This survey provides a comprehensive overview of current trends and recent advances in text anonymization techniques. We begin by discussing foundational approaches, primarily centered on Named Entity Recognition, before examining the transformative impact of Large Language Models, detailing their dual role as sophisticated anonymizers and potent de-anonymization threats. The survey further explores domain-specific challenges and tailored solutions in critical sectors such as healthcare, law, finance, and education. We investigate advanced methodologies incorporating formal privacy models and risk-aware frameworks, and address the specialized subfield of authorship anonymization. Additionally, we review evaluation frameworks, comprehensive metrics, benchmarks, and practical toolkits for real-world deployment of anonymization solutions. This review consolidates current knowledge, identifies emerging trends and persistent challenges, including the evolving privacy-utility trade-off, the need to address quasi-identifiers, and the implications of LLM capabilities, and aims to guide future research directions for both academics and practitioners in this field.
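Summary: The proliferation of textual data containing sensitive personal information across many domains requires robust anonymization techniques that protect privacy and comply with regulations while preserving data usability for diverse and important downstream tasks. This survey gives an overview of current trends and recent advances in text anonymization. We begin with foundational approaches, centered primarily on named entity recognition, and then examine the transformative impact of large language models, detailing their dual role as sophisticated anonymizers and potent de-anonymization threats. The survey further explores domain-specific challenges and tailored solutions in critical sectors such as healthcare, law, finance, and education. We investigate advanced methodologies that incorporate formal privacy models and risk-aware frameworks, and we cover the specialized subfield of authorship anonymization. We also review evaluation frameworks, comprehensive metrics, benchmarks, and practical toolkits for real-world deployment of anonymization solutions. This review consolidates current knowledge and identifies emerging trends and persistent challenges, including the evolving privacy-utility trade-off, the need to address quasi-identifiers, and the implications of LLM capabilities, with the aim of guiding future research for both academics and practitioners in this field.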

Title: Middo: Model-Informed Dynamic Data Optimization for Enhanced LLM Fine-Tuning via Closed-Loop Learning

Authors: Zinan Tang, Xin Gao, Qizhi Pei, Zhuoshi Pan, Mengzhang Cai, Jiang Wu, Conghui He, Lijun Wu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.21589
Pdf URL: https://arxiv.org/pdf/2508.21589
Copy Paste: [[2508.21589]] Middo: Model-Informed Dynamic Data Optimization for Enhanced LLM Fine-Tuning via Closed-Loop Learning(https://arxiv.org/abs/2508.21589)
Keywords: language model, llm
Abstract: Supervised Fine-Tuning (SFT) of Large Language Models (LLMs) fundamentally relies on high-quality training data. While data selection and data synthesis are two common strategies to improve data quality, existing approaches often face limitations in static dataset curation that fail to adapt to evolving model capabilities. In this paper, we introduce Middo, a self-evolving Model-informed dynamic data optimization framework that uses model-aware data selection and context-preserving data refinement. Unlike conventional one-off filtering/synthesis methods, our framework establishes a closed-loop optimization system: (1) A self-referential diagnostic module proactively identifies suboptimal samples through tri-axial model signals: loss patterns (complexity), embedding cluster dynamics (diversity), and self-alignment scores (quality); (2) An adaptive optimization engine then transforms suboptimal samples into pedagogically valuable training points while preserving semantic integrity; (3) This optimization process continuously evolves with model capability through dynamic learning principles. Experiments on multiple benchmarks demonstrate that Middo consistently enhances the quality of seed data and boosts LLM performance, improving accuracy by 7.15% on average while maintaining the original dataset scale. This work establishes a new paradigm for sustainable LLM training through dynamic human-AI co-evolution of data and models. Our datasets, models, and code are coming soon.
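Summary: Supervised fine-tuning (SFT) of large language models (LLMs) fundamentally relies on high-quality training data. Data selection and data synthesis are two common strategies for improving data quality, but existing approaches are often limited by static dataset curation that fails to adapt to evolving model capabilities. In this paper, we introduce Middo, a self-evolving, model-informed dynamic data optimization framework that uses model-aware data selection and context-preserving data refinement. Unlike conventional one-off filtering or synthesis methods, our framework establishes a closed-loop optimization system: (1) a self-referential diagnostic module proactively identifies suboptimal samples through tri-axial model signals, namely loss patterns (complexity), embedding cluster dynamics (diversity), and self-alignment scores (quality); (2) an adaptive optimization engine then transforms suboptimal samples into pedagogically valuable training points while preserving semantic integrity; (3) this optimization process evolves continuously with model capability through dynamic learning principles. Experiments on multiple benchmarks show that Middo consistently improves the quality of seed data and boosts LLM performance, raising accuracy by 7.15% on average while keeping the original dataset scale. This work establishes a new paradigm for sustainable LLM training through the dynamic human-AI co-evolution of data and models. Our datasets, models, and code are coming soon.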

Title: Personality Matters: User Traits Predict LLM Preferences in Multi-Turn Collaborative Tasks

Authors: Sarfaroz Yunusov, Kaige Chen, Kazi Nishat Anwar, Ali Emami
Subjects: cs.CL, cs.HC
Abstract URL: https://arxiv.org/abs/2508.21628
Pdf URL: https://arxiv.org/pdf/2508.21628
Copy Paste: [[2508.21628]] Personality Matters: User Traits Predict LLM Preferences in Multi-Turn Collaborative Tasks(https://arxiv.org/abs/2508.21628)
Keywords: language model, gpt, llm
Abstract: As Large Language Models (LLMs) increasingly integrate into everyday workflows, where users shape outcomes through multi-turn collaboration, a critical question emerges: do users with different personality traits systematically prefer certain LLMs over others? We conducted a study with 32 participants evenly distributed across four Keirsey personality types, evaluating their interactions with GPT-4 and Claude 3.5 across four collaborative tasks: data analysis, creative writing, information retrieval, and writing assistance. Results revealed significant personality-driven preferences: Rationals strongly preferred GPT-4, particularly for goal-oriented tasks, while Idealists favored Claude 3.5, especially for creative and analytical tasks. Other personality types showed task-dependent preferences. Sentiment analysis of qualitative feedback confirmed these patterns. Notably, aggregate helpfulness ratings were similar across models, showing how personality-based analysis reveals LLM differences that traditional evaluations miss.
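Summary: As large language models (LLMs) become increasingly integrated into everyday workflows, where users shape outcomes through multi-turn collaboration, a critical question emerges: do users with different personality traits systematically prefer certain LLMs over others? We conducted a study with 32 participants evenly distributed across the four Keirsey personality types, evaluating their interactions with GPT-4 and Claude 3.5 on four collaborative tasks: data analysis, creative writing, information retrieval, and writing assistance. The results showed significant personality-driven preferences: Rationals strongly preferred GPT-4, particularly for goal-oriented tasks, while Idealists favored Claude 3.5, especially for creative and analytical tasks. The other personality types showed task-dependent preferences. Sentiment analysis of the qualitative feedback confirmed these patterns. Notably, aggregate helpfulness ratings were similar across models, which shows how personality-based analysis reveals LLM differences that traditional evaluations miss.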

Title: QZhou-Embedding Technical Report

Authors: Peng Yu, En Xu, Bin Chen, Haibiao Chen, Yinfei Xu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.21632
Pdf URL: https://arxiv.org/pdf/2508.21632
Copy Paste: [[2508.21632]] QZhou-Embedding Technical Report(https://arxiv.org/abs/2508.21632)
Keywords: llm
Abstract: We present QZhou-Embedding, a general-purpose contextual text embedding model with exceptional text representation capabilities. Built upon the Qwen2.5-7B-Instruct foundation model, we designed a unified multi-task framework comprising specialized data transformation and training strategies. The data transformation scheme enables the incorporation of more diverse textual training datasets, while the task-specific training strategies enhance model learning efficiency. We developed a data synthesis pipeline leveraging an LLM API, incorporating techniques such as paraphrasing, augmentation, and hard negative example generation to improve the semantic richness and sample difficulty of the training set. Additionally, we employ a two-stage training strategy, comprising initial retrieval-focused pretraining followed by full-task fine-tuning, enabling the embedding model to extend its capabilities based on robust retrieval performance. Our model achieves state-of-the-art results on the MTEB and CMTEB benchmarks, ranking first on both leaderboards (as of August 27, 2025), and simultaneously achieves state-of-the-art performance on tasks including reranking, clustering, etc. Our findings demonstrate that higher-quality, more diverse data is crucial for advancing retrieval model performance, and that leveraging LLMs' generative capabilities can further optimize data quality for embedding model breakthroughs. Our model weights are released on HuggingFace under the Apache 2.0 license. For reproducibility, we provide evaluation code and instructions on GitHub.
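Summary: We present QZhou-Embedding, a general-purpose contextual text embedding model with exceptional text representation capabilities. Building on the Qwen2.5-7B-Instruct foundation model, we designed a unified multi-task framework that comprises specialized data transformation and training strategies. The data transformation scheme makes it possible to incorporate more diverse textual training datasets, while the task-specific training strategies improve model learning efficiency. We developed a data synthesis pipeline that leverages an LLM API and incorporates techniques such as paraphrasing, augmentation, and hard negative example generation to improve the semantic richness and sample difficulty of the training set. We also employ a two-stage training strategy, consisting of initial retrieval-focused pretraining followed by full-task fine-tuning, which lets the embedding model extend its capabilities on top of robust retrieval performance. Our model achieves state-of-the-art results on the MTEB and CMTEB benchmarks, ranking first on both leaderboards (as of August 27, 2025), and simultaneously achieves state-of-the-art performance on tasks including reranking and clustering. Our findings show that higher-quality, more diverse data is crucial for advancing retrieval model performance, and that leveraging the generative capabilities of LLMs can further optimize data quality for embedding model breakthroughs. Our model weights are released on HuggingFace under the Apache 2.0 license. For reproducibility, we provide evaluation code and instructions on GitHub.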

Title: Is this chart lying to me? Automating the detection of misleading visualizations

Authors: Jonathan Tonglet, Jan Zimny, Tinne Tuytelaars, Iryna Gurevych
Subjects: cs.CL, cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2508.21675
Pdf URL: https://arxiv.org/pdf/2508.21675
Copy Paste: [[2508.21675]] Is this chart lying to me? Automating the detection of misleading visualizations(https://arxiv.org/abs/2508.21675)
Keywords: language model, llm
Abstract: Misleading visualizations are a potent driver of misinformation on social media and the web. By violating chart design principles, they distort data and lead readers to draw inaccurate conclusions. Prior work has shown that both humans and multimodal large language models (MLLMs) are frequently deceived by such visualizations. Automatically detecting misleading visualizations and identifying the specific design rules they violate could help protect readers and reduce the spread of misinformation. However, the training and evaluation of AI models has been limited by the absence of large, diverse, and openly available datasets. In this work, we introduce Misviz, a benchmark of 2,604 real-world visualizations annotated with 12 types of misleaders. To support model training, we also release Misviz-synth, a synthetic dataset of 81,814 visualizations generated using Matplotlib and based on real-world data tables. We perform a comprehensive evaluation on both datasets using state-of-the-art MLLMs, rule-based systems, and fine-tuned classifiers. Our results reveal that the task remains highly challenging. We release Misviz, Misviz-synth, and the accompanying code.
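Summary: Misleading visualizations are a potent driver of misinformation on social media and the web. By violating chart design principles, they distort data and lead readers to draw inaccurate conclusions. Prior work has shown that both humans and multimodal large language models (MLLMs) are frequently deceived by such visualizations. Automatically detecting misleading visualizations and identifying the specific design rules they violate could help protect readers and reduce the spread of misinformation. However, the training and evaluation of AI models has been limited by the absence of large, diverse, and openly available datasets. In this work, we introduce Misviz, a benchmark of 2,604 real-world visualizations annotated with 12 types of misleaders. To support model training, we also release Misviz-synth, a synthetic dataset of 81,814 visualizations generated with Matplotlib and based on real-world data tables. We perform a comprehensive evaluation on both datasets using state-of-the-art MLLMs, rule-based systems, and fine-tuned classifiers. Our results show that the task remains highly challenging. We release Misviz, Misviz-synth, and the accompanying code.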

Title: Not All Parameters Are Created Equal: Smart Isolation Boosts Fine-Tuning Performance

Authors: Yao Wang, Di Liang, Minlong Peng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.21741
Pdf URL: https://arxiv.org/pdf/2508.21741
Copy Paste: [[2508.21741]] Not All Parameters Are Created Equal: Smart Isolation Boosts Fine-Tuning Performance(https://arxiv.org/abs/2508.21741)
Keywords: language model, llm
Abstract: Supervised fine-tuning (SFT) is a pivotal approach to adapting large language models (LLMs) for downstream tasks; however, performance often suffers from the "seesaw phenomenon", where indiscriminate parameter updates yield progress on certain tasks at the expense of others. To address this challenge, we propose a novel Core Parameter Isolation Fine-Tuning (CPI-FT) framework. Specifically, we first independently fine-tune the LLM on each task to identify its core parameter regions by quantifying parameter update magnitudes. Tasks with similar core regions are then grouped based on region overlap, forming clusters for joint modeling. We further introduce a parameter fusion technique: for each task, core parameters from its individually fine-tuned model are directly transplanted into a unified backbone, while non-core parameters from different tasks are smoothly integrated via Spherical Linear Interpolation (SLERP), mitigating destructive interference. A lightweight, pipelined SFT training phase using mixed-task data is subsequently employed, while freezing core regions from prior tasks to prevent catastrophic forgetting. Extensive experiments on multiple public benchmarks demonstrate that our approach significantly alleviates task interference and forgetting, consistently outperforming vanilla multi-task and multi-stage fine-tuning baselines.
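Summary: Supervised fine-tuning (SFT) is a pivotal approach for adapting large language models (LLMs) to downstream tasks. However, performance often suffers from the "seesaw phenomenon", in which indiscriminate parameter updates yield progress on some tasks at the expense of others. To address this challenge, we propose a novel Core Parameter Isolation Fine-Tuning (CPI-FT) framework. Specifically, we first fine-tune the LLM independently on each task and identify its core parameter regions by quantifying parameter update magnitudes. Tasks with similar core regions are then grouped by region overlap, forming clusters for joint modeling. We further introduce a parameter fusion technique: for each task, the core parameters of its individually fine-tuned model are transplanted directly into a unified backbone, while non-core parameters from different tasks are smoothly integrated through spherical linear interpolation (SLERP), which mitigates destructive interference. A lightweight, pipelined SFT training phase on mixed-task data follows, with core regions from prior tasks frozen to prevent catastrophic forgetting. Extensive experiments on multiple public benchmarks show that our approach significantly alleviates task interference and forgetting and consistently outperforms vanilla multi-task and multi-stage fine-tuning baselines.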
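Note: the abstract above names Spherical Linear Interpolation (SLERP) as the way non-core parameters are merged. The following is only a minimal sketch of the generic SLERP formula on flat parameter vectors, written in plain Python. The function name `slerp`, the `eps` threshold, and the fallback to linear interpolation for near-parallel or zero vectors are assumptions made for illustration; the abstract does not say how CPI-FT applies SLERP per layer or per task.

```python
import math
from typing import List, Sequence


def slerp(v0: Sequence[float], v1: Sequence[float], t: float,
          eps: float = 1e-8) -> List[float]:
    """Generic spherical linear interpolation between two vectors.

    Hypothetical helper for illustration; not the CPI-FT implementation.
    t = 0 returns v0, t = 1 returns v1.
    """
    norm0 = math.sqrt(sum(x * x for x in v0))
    norm1 = math.sqrt(sum(x * x for x in v1))

    def lerp() -> List[float]:
        # Plain linear interpolation, used when the angle is undefined
        # or too small for a stable division by sin(omega).
        return [(1.0 - t) * a + t * b for a, b in zip(v0, v1)]

    if norm0 < eps or norm1 < eps:
        return lerp()

    # Angle between the two directions, with the cosine clamped to [-1, 1].
    cos_omega = sum(a * b for a, b in zip(v0, v1)) / (norm0 * norm1)
    cos_omega = max(-1.0, min(1.0, cos_omega))
    omega = math.acos(cos_omega)
    sin_omega = math.sin(omega)
    if abs(sin_omega) < eps:
        return lerp()

    # Standard SLERP weights.
    w0 = math.sin((1.0 - t) * omega) / sin_omega
    w1 = math.sin(t * omega) / sin_omega
    return [w0 * a + w1 * b for a, b in zip(v0, v1)]


# Halfway between two orthogonal unit vectors stays on the unit circle.
midpoint = slerp([1.0, 0.0], [0.0, 1.0], 0.5)
```

For unit vectors the result keeps unit norm, which is the property that distinguishes SLERP from straight averaging of weights.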

Title: Reasoning-Intensive Regression

Authors: Diane Tchuindjo, Omar Khattab
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.21762
Pdf URL: https://arxiv.org/pdf/2508.21762
Copy Paste: [[2508.21762]] Reasoning-Intensive Regression(https://arxiv.org/abs/2508.21762)
Keywords: language model, llm, prompt
Abstract: AI researchers and practitioners increasingly apply large language models (LLMs) to what we call reasoning-intensive regression (RiR), i.e. deducing subtle numerical properties from text. Unlike standard language regression tasks, e.g. for sentiment or similarity, RiR often appears instead in ad-hoc problems like rubric-based scoring or domain-specific retrieval, where much deeper analysis of text is required while only limited task-specific training data and computation are available. We cast three realistic problems as RiR tasks to establish an initial benchmark, and use that to test our hypothesis that prompting frozen LLMs and finetuning Transformer encoders via gradient descent will both often struggle in RiR. We then propose MENTAT, a simple and lightweight method that combines batch-reflective prompt optimization with neural ensemble learning. MENTAT achieves up to 65% improvement over both baselines, though substantial room remains for future advances in RiR.
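Summary: AI researchers and practitioners increasingly apply large language models (LLMs) to what we call reasoning-intensive regression (RiR), that is, deducing subtle numerical properties from text. Unlike standard language regression tasks, such as sentiment or similarity, RiR often appears in ad-hoc problems like rubric-based scoring or domain-specific retrieval, where much deeper analysis of the text is required while only limited task-specific training data and computation are available. We cast three realistic problems as RiR tasks to establish an initial benchmark and use it to test our hypothesis that both prompting frozen LLMs and fine-tuning Transformer encoders via gradient descent will often struggle on RiR. We then propose MENTAT, a simple and lightweight method that combines batch-reflective prompt optimization with neural ensemble learning. MENTAT achieves up to 65% improvement over both baselines, though substantial room remains for future advances in RiR.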

Title: PiCSAR: Probabilistic Confidence Selection And Ranking

Authors: Joshua Ong Jun Leang, Zheng Zhao, Aryo Pradipta Gema, Sohee Yang, Wai-Chung Kwan, Xuanli He, Wenda Li, Pasquale Minervini, Eleonora Giunchiglia, Shay B. Cohen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.21787
Pdf URL: https://arxiv.org/pdf/2508.21787
Copy Paste: [[2508.21787]] PiCSAR: Probabilistic Confidence Selection And Ranking(https://arxiv.org/abs/2508.21787)
Keywords: language model, llm
Abstract: Best-of-n sampling improves the accuracy of large language models (LLMs) and large reasoning models (LRMs) by generating multiple candidate solutions and selecting the one with the highest reward. The key challenge for reasoning tasks is designing a scoring function that can identify correct reasoning chains without access to ground-truth answers. We propose Probabilistic Confidence Selection And Ranking (PiCSAR): a simple, training-free method that scores each candidate generation using the joint log-likelihood of the reasoning and final answer. The joint log-likelihood of the reasoning and final answer naturally decomposes into reasoning confidence and answer confidence. PiCSAR achieves substantial gains across diverse benchmarks (+10.18 on MATH500, +9.81 on AIME2025), outperforming baselines with at least 2x fewer samples in 16 out of 20 comparisons. Our analysis reveals that correct reasoning chains exhibit significantly higher reasoning and answer confidence, justifying the effectiveness of PiCSAR.
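Summary: Best-of-n sampling improves the accuracy of large language models (LLMs) and large reasoning models (LRMs) by generating multiple candidate solutions and selecting the one with the highest reward. The key challenge for reasoning tasks is designing a scoring function that can identify correct reasoning chains without access to ground-truth answers. We propose Probabilistic Confidence Selection And Ranking (PiCSAR), a simple, training-free method that scores each candidate generation using the joint log-likelihood of the reasoning and the final answer. This joint log-likelihood decomposes naturally into reasoning confidence and answer confidence. PiCSAR achieves substantial gains across diverse benchmarks (+10.18 on MATH500, +9.81 on AIME2025) and outperforms baselines with at least 2x fewer samples in 16 out of 20 comparisons. Our analysis shows that correct reasoning chains exhibit significantly higher reasoning and answer confidence, which justifies the effectiveness of PiCSAR.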
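Note: the scoring idea in the PiCSAR abstract above can be sketched roughly as follows, assuming per-token log-probabilities for each candidate are already available from the sampling model. The `Candidate` class, its field names, and the unnormalized sums are assumptions made for illustration; the abstract does not specify how the two confidence terms are computed or normalized.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    """One sampled generation (hypothetical container, not from the paper)."""
    text: str
    reasoning_logprobs: List[float]  # log p of each reasoning token
    answer_logprobs: List[float]     # log p of each final-answer token


def reasoning_confidence(c: Candidate) -> float:
    # Log-likelihood of the reasoning chain (assumed: plain sum over tokens).
    return sum(c.reasoning_logprobs)


def answer_confidence(c: Candidate) -> float:
    # Log-likelihood of the final answer given the reasoning (assumed: plain sum).
    return sum(c.answer_logprobs)


def joint_log_likelihood(c: Candidate) -> float:
    # The joint score decomposes into the two confidence terms.
    return reasoning_confidence(c) + answer_confidence(c)


def select_best(candidates: List[Candidate]) -> Candidate:
    # Best-of-n selection: keep the candidate with the highest joint score.
    return max(candidates, key=joint_log_likelihood)


# Toy example with made-up log-probabilities.
cand_a = Candidate("A", [-0.5, -0.5], [-0.1])
cand_b = Candidate("B", [-0.2, -0.1], [-0.9])
best = select_best([cand_a, cand_b])
```

In this toy case candidate A is selected because its joint score (about -1.1) is higher than that of candidate B (about -1.2), even though B has the more confident reasoning tokens.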

Title: Going over Fine Web with a Fine-Tooth Comb: Technical Report of Indexing Fine Web for Problematic Content Search and Retrieval

Authors: Inés Altemir Marinas, Anastasiia Kucherenko, Andrei Kucharavy
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2508.21788
Pdf URL: https://arxiv.org/pdf/2508.21788
Copy Paste: [[2508.21788]] Going over Fine Web with a Fine-Tooth Comb: Technical Report of Indexing Fine Web for Problematic Content Search and Retrieval(https://arxiv.org/abs/2508.21788)
Keywords: language model, llm
Abstract: Large language models (LLMs) rely heavily on web-scale datasets like Common Crawl, which provides over 80% of training data for some modern models. However, the indiscriminate nature of web crawling raises challenges in data quality, safety, and ethics. Despite the critical importance of training data quality, prior research on harmful content has been limited to small samples due to computational constraints. This project presents a framework for indexing and analyzing LLM training datasets using an ElasticSearch-based pipeline. We apply it to SwissAI's FineWeb-2 corpus (1.5TB, four languages), achieving fast query performance: most searches complete in milliseconds, and all finish in under 2 seconds. Our work demonstrates real-time dataset analysis, offering practical tools for safer, more accountable AI systems.
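Summary: Large language models (LLMs) rely heavily on web-scale datasets such as Common Crawl, which provides over 80% of the training data for some modern models. However, the indiscriminate nature of web crawling raises challenges in data quality, safety, and ethics. Although training data quality is critically important, prior research on harmful content has been limited to small samples because of computational constraints. This project presents a framework for indexing and analyzing LLM training datasets using an ElasticSearch-based pipeline. We apply it to SwissAI's FineWeb-2 corpus (1.5TB, four languages) and achieve fast query performance, with most searches completing in milliseconds and all finishing in under 2 seconds. Our work demonstrates real-time dataset analysis and offers practical tools for safer, more accountable AI systems.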